CIIProCluster: Developing Read-Across Predictive Toxicity Models Using Big Data



Daniel P Russo¹, Wenyi Wang¹, Judy Strickland², Sunil Shende¹,³, Hao Zhu¹,⁴

¹ The Rutgers Center for Computational and Integrative Biology, Camden, New Jersey 08102
² ILS, Research Triangle Park, North Carolina 27709
³ Department of Computer Science, Rutgers University, Camden, New Jersey 08102
⁴ Department of Chemistry, Rutgers University, Camden, New Jersey 08102

Introduction

Developing accurate predictive models for complex toxicity endpoints (e.g., oral acute toxicity) is a goal of computational toxicology, but most studies have not succeeded. The difficulty of modeling these endpoints can be attributed to the complex mechanisms underlying the toxicity phenomena. Incorporating biological data (i.e., bioassays) into model development has been shown to improve predictivity and to provide insight into toxicants' mechanisms of action. However, in the current big data era, finding and characterizing the biological data relevant to the chemical toxicity of interest is a major challenge. The Chemical In-vitro, In-vivo Profiling (CIIPro) portal (ciipro.rutgers.edu) was created to use big data sources for predicting the toxicity of new compounds. As a major advancement of the CIIPro project, we present CIIProCluster, a new read-across approach for creating predictive toxicity models from the bioassay data available for chemicals of interest. In this workflow, the chemical substructures responsible for bioassay activation are used to aggregate bioassays (Figure 1). The resulting clusters of assays reflect toxicity-relevant mechanisms and can be used to predict complex toxicity endpoints (e.g., oral acute toxicity).

Figure 3. The reliability of biosimilarity calculations when using ‘big data’ needs to be considered.

Methods

First, all available PubChem biological data for a diverse set of 7,385 compounds with oral acute toxicity values (i.e., LD50 values) were extracted using the CIIPro portal (ciipro.rutgers.edu). This generated a profile of 3,468 compounds and 1,948 PubChem assays. Chemical fingerprints were created for each compound using a circular fingerprint algorithm from the cheminformatics software RDKit. Each fingerprint is a bit vector in which every bit denotes the presence or absence of a chemical substructure in the compound. For each bioassay-fingerprint pair, the CIIProCluster algorithm uses Fisher's exact test to determine whether the fingerprint is significantly associated with bioassay activity (p < 0.05). Using this information, the CIIProCluster algorithm creates a network of bioassays clustered by the chemical fingerprints deemed to contribute to bioassay activity, as seen in Figure 2.
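The fragment-relevance step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the CIIProCluster implementation: toy sets of "on" bit indices stand in for the RDKit circular fingerprints, the one-sided form of Fisher's exact test is an assumption (the poster does not state the alternative hypothesis), and every name here (`fisher_exact_greater`, `relevant_bits`, the compound IDs) is hypothetical.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Rows: fingerprint bit present / absent.
    Columns: compound active / inactive in the bioassay.
    Returns the probability of observing at least `a` compounds that are
    both bit-present and active, given fixed row and column totals.
    """
    present = a + b          # compounds that contain the substructure
    active = a + c           # compounds active in the assay
    total = a + b + c + d
    denominator = comb(total, active)
    p_value = 0.0
    for k in range(a, min(present, active) + 1):
        p_value += comb(present, k) * comb(total - present, active - k) / denominator
    return p_value


def relevant_bits(fingerprints, activity, alpha=0.05):
    """Return {bit: p_value} for bits enriched among active compounds."""
    all_bits = set().union(*fingerprints.values())
    significant = {}
    for bit in sorted(all_bits):
        a = b = c = d = 0
        for compound, bits in fingerprints.items():
            has_bit = bit in bits
            is_active = activity[compound]
            if has_bit and is_active:
                a += 1
            elif has_bit:
                b += 1
            elif is_active:
                c += 1
            else:
                d += 1
        p_value = fisher_exact_greater(a, b, c, d)
        if p_value < alpha:
            significant[bit] = p_value
    return significant


# Toy data (assumed): bit 7 occurs only in actives, bit 3 is evenly spread.
fingerprints = {
    "c0": {3, 7}, "c1": {3, 7}, "c2": {7}, "c3": {7}, "c4": {7},
    "c5": {3}, "c6": {3}, "c7": set(), "c8": set(), "c9": set(),
}
activity = {f"c{i}": i < 5 for i in range(10)}  # c0..c4 active

print(relevant_bits(fingerprints, activity))
```

In this toy set only bit 7 passes the p < 0.05 cutoff. A real run would repeat the test for every bioassay-fingerprint pair, which raises a multiple-testing question that the poster does not address.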

Figure 2. A network of 1,948 assays clustered by chemical fragments relevant to assay activity. Colors represent modularity class.

Whole cell assays • e.g., oxidative stress

Cell-free assays • e.g., protein-receptor assays

Whole cell assays • e.g., cytotoxicity

Figure 1. Chemical fragments relevant to bioassay activation could delineate mechanisms of toxicity phenomena.
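One way to picture how an assay network like the one in Figure 2 could be assembled is sketched below. It assumes that two assays are linked when they share at least one activity-relevant fragment, that edge weights are Jaccard overlaps, and that connected components stand in for the modularity classes. All three choices are illustrative assumptions, as are the function names and the assay IDs.

```python
from itertools import combinations


def assay_edges(assay_bits, min_shared=1):
    """Link assays whose sets of relevant fragment bits overlap.

    Returns {(assay_1, assay_2): jaccard_weight}.
    """
    edges = {}
    for first, second in combinations(sorted(assay_bits), 2):
        shared = assay_bits[first] & assay_bits[second]
        if len(shared) >= min_shared:
            union = assay_bits[first] | assay_bits[second]
            edges[(first, second)] = len(shared) / len(union)
    return edges


def connected_components(nodes, edges):
    """Group assays that are reachable from each other through edges."""
    adjacency = {node: set() for node in nodes}
    for first, second in edges:
        adjacency[first].add(second)
        adjacency[second].add(first)
    seen, groups = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            group.add(node)
            stack.extend(adjacency[node] - seen)
        groups.append(group)
    return groups


# Hypothetical assays mapped to their significant fragment bits.
assay_bits = {
    "AID_A": {1, 2, 3},
    "AID_B": {2, 3, 4},
    "AID_C": {9},
    "AID_D": {9, 10},
}
edges = assay_edges(assay_bits)
print(edges)
print(connected_components(assay_bits, edges))
```

Connected components are a much coarser grouping than modularity-based community detection; they are used here only to keep the sketch dependency-free.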

Funding Resources

• National Institutes of Health: 1R15ES023148
• Society of Toxicology: Colgate-Palmolive Grant for Alternative Research

References

Russo, D.P. et al. Bioinformatics DOI: 10.1093/bioinformatics/btw640 (2016).
Zhang, J., Hsieh, J.-H. & Zhu, H. PLoS ONE 9, e99863 (2014).
Zhu, H. et al. Chem. Res. Toxicol. 27, 1643–1651 (2014).
Kleinstreuer, N. C. et al. Nat Biotech 32, 583–591 (2014).

Methods – cont’d.

Presumably, compounds with similar responses in a cluster of assays should exhibit similar toxicological effects. However, the bias introduced by inactive and missing data must be considered when using biological data. To handle biased biological data with many missing values for target compounds, we used two previously described metrics (Equations 1 and 2; Figure 3). Here, $A_a$ and $B_a$ represent the sets of active responses for compounds A and B, respectively. Similarly, $A_i$ and $B_i$ represent the sets of inactive responses.

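$$
\mathrm{Biosimilarity}(A, B) = \frac{|A_a \cap B_a| + |A_i \cap B_i| \cdot w}{|A_a \cap B_a| + |A_i \cap B_i| \cdot w + |A_a \cap B_i| + |A_i \cap B_a|} \tag{1}
$$

$$
\mathrm{Confidence}(A, B) = |A_a \cap B_a| + |A_i \cap B_i| \cdot w + |A_a \cap B_i| + |A_i \cap B_a| \tag{2}
$$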
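Equations 1 and 2 translate fairly directly into code. The sketch below assumes each compound's profile is a dictionary mapping an assay ID to 1 (active) or 0 (inactive), with untested assays simply absent. The poster does not give a value for w, which weights shared inactive responses, so the default of 0.5 is an arbitrary assumption. Returning 0.0 when two compounds share no tested assays is also an assumption. Function names and assay IDs are hypothetical.

```python
def response_sets(profile):
    """Split a profile into its sets of active and inactive assay IDs."""
    active = {assay for assay, outcome in profile.items() if outcome == 1}
    inactive = {assay for assay, outcome in profile.items() if outcome == 0}
    return active, inactive


def confidence(profile_a, profile_b, w=0.5):
    """Equation 2: weighted count of assays tested for both compounds."""
    a_act, a_inact = response_sets(profile_a)
    b_act, b_inact = response_sets(profile_b)
    return (
        len(a_act & b_act)            # both active
        + len(a_inact & b_inact) * w  # both inactive, down-weighted by w
        + len(a_act & b_inact)        # A active, B inactive
        + len(a_inact & b_act)        # A inactive, B active
    )


def biosimilarity(profile_a, profile_b, w=0.5):
    """Equation 1: weighted agreement divided by the confidence score."""
    a_act, a_inact = response_sets(profile_a)
    b_act, b_inact = response_sets(profile_b)
    agreement = len(a_act & b_act) + len(a_inact & b_inact) * w
    total = confidence(profile_a, profile_b, w)
    return agreement / total if total else 0.0


# Toy profiles (assumed): assays a4 and a5 are each tested for one compound only.
compound_a = {"a1": 1, "a2": 1, "a3": 0, "a4": 0}
compound_b = {"a1": 1, "a2": 0, "a3": 0, "a5": 1}

print(confidence(compound_a, compound_b))
print(biosimilarity(compound_a, compound_b))
```

For these toy profiles there is one shared active (a1), one shared inactive (a3), and one disagreement (a2), so with w = 0.5 the confidence is 1 + 0.5 + 1 = 2.5 and the biosimilarity is 1.5 / 2.5 = 0.6.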

In this study, we chose to investigate the predictive power of one cluster of bioassays: the cluster with the least dispersion and with descriptions that seemed most relevant to toxicity (blue cluster in the top right corner of Figure 2). All bioassays in this group are tumor cell line growth inhibition assays (i.e., cytotoxicity assays). To evaluate the predictive power of these bioassays, we found the nearest neighbors in the training set for each test compound using Equation 1. A biosimilarity calculation can be treated as valid only if its confidence score (Equation 2) meets a confidence threshold. Iterating through these confidence thresholds shows a clear improvement in prediction, especially for toxic compounds (Figure 4).
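The evaluation described above can be sketched as a leave-one-out read-across loop. This is an illustration under assumptions rather than the authors' procedure: the prediction is taken as the mean toxicity value of the k most biosimilar neighbors (the poster does not say how neighbor values are combined), w = 0.5 is arbitrary, profiles hold only 0/1 outcomes, and all names and data are hypothetical.

```python
def similarity_and_confidence(profile_a, profile_b, w=0.5):
    """Return (biosimilarity, confidence) per Equations 1 and 2."""
    shared = profile_a.keys() & profile_b.keys()
    both_active = sum(1 for s in shared if profile_a[s] == 1 and profile_b[s] == 1)
    both_inactive = sum(1 for s in shared if profile_a[s] == 0 and profile_b[s] == 0)
    disagreements = len(shared) - both_active - both_inactive
    conf = both_active + both_inactive * w + disagreements
    sim = (both_active + both_inactive * w) / conf if conf else 0.0
    return sim, conf


def leave_one_out(profiles, toxicity, min_confidence, k=1, w=0.5):
    """Predict each compound from its k nearest neighbors among the rest.

    Neighbors whose confidence falls below `min_confidence` are ignored;
    compounds left with no valid neighbor receive no prediction.
    """
    predictions = {}
    for target in profiles:
        candidates = []
        for other in profiles:
            if other == target:
                continue
            sim, conf = similarity_and_confidence(profiles[target], profiles[other], w)
            if conf >= min_confidence:
                candidates.append((sim, other))
        if not candidates:
            continue
        # Highest biosimilarity first; ties broken by name for determinism.
        candidates.sort(key=lambda pair: (-pair[0], pair[1]))
        nearest = candidates[:k]
        predictions[target] = sum(toxicity[name] for _, name in nearest) / len(nearest)
    return predictions


# Toy data (assumed): assay outcomes and made-up toxicity values.
profiles = {
    "c1": {"x": 1, "y": 1, "z": 0},
    "c2": {"x": 1, "y": 1, "z": 0},
    "c3": {"x": 0, "y": 0, "z": 0},
    "c4": {"x": 0, "y": 0},
    "c5": {"q": 1},
}
toxicity = {"c1": 1.0, "c2": 1.2, "c3": 3.0, "c4": 3.2, "c5": 2.0}

for threshold in (1.0, 2.0, 3.0):
    print(threshold, leave_one_out(profiles, toxicity, min_confidence=threshold))
```

Raising the threshold trades coverage for reliability: fewer compounds receive a prediction, but each prediction rests on more shared assay evidence.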

Figure 4. The results of a leave-one-out cross-validation across a range of confidence thresholds (axes: predicted value vs. true value).

Results and Discussion

Predictions become meaningful at a confidence threshold of approximately 5 (Figure 4). This corresponds to a confidence value at which at least one active response is involved in the biosimilarity calculation. These results show that cytotoxicity bioassays can be used effectively as predictors of oral toxicity for a small set of compounds. A better predictive model for oral acute toxicity can be created by incorporating other bioassay clusters. Furthermore, this workflow can be extended to develop predictive models for other animal toxicity endpoints.