New Variation of background knowledge in an industrial application …shm/Papers/uic.pdf · 2011....

13
Variation of background knowledge in an industrial application of ILP Stephen H. Muggleton * , Jianzhong Chen * , Hiroaki Watanabe * , Stuart J. Dunbar , Charles Baxter , Richard Currie , Jos´ e Domingo Salazar , Jan Taubert + and Michael J.E. Sternberg * Imperial College London * Syngenta Ltd BBSRC Rothamsted Research + Abstract. In several recent papers ILP has been applied to Systems Biology problems, in which it has been used to fill gaps in the descrip- tions of biological networks. In the present paper we describe two new applications of this type in the area of plant biology. These applications are of particular interest to the agrochemical industry in which improve- ments in plant strains can have benefits for modelling crop development. The background knowledge in these applications is extensive and is de- rived from public databases in a Prolog format using a new system called Ondex (developers BBSRC Rothamsted). In this paper we explore the question of how much of this background knowledge it is beneficial to in- clude, taking into account accuracy increases versus increases in learning time. The results indicate that relatively shallow background knowledge is needed to achieve maximum accuracy. 1 Introduction Systems Biology is a rapidly evolving discipline that seeks to determine how complex biological systems function [6]. It works by integrating experimentally derived information with mathematical and computational models. Through an iterative process of experimentation and modelling, Systems Biology aims to understand how individual components interact to govern the functioning of the system as a whole. In several recent papers [13, 3], Inductive Logic Programing (ILP) has been applied to Systems Biology problems, in which it has been used to fill gaps in the descriptions of biological networks. Two new industrial applications of this type in the area of plant biology are being explored within the Syngenta University Innovation Centre (see Section 2). This centre of excellence in Systems Biology aims to address biological research questions related to the improvement of plant strains involved in crops. The centre uses mathematical and computational modelling techniques developed at Imperial College London [2]. The background knowledge in these applications is generated by a system called Ondex [7]. Ondex is unique in its ability to generate large amounts of Prolog background knowledge on cell biochemistry by parsing, filtering and combining various publicly-available databases.

Transcript of New Variation of background knowledge in an industrial application …shm/Papers/uic.pdf · 2011....

Page 1: New Variation of background knowledge in an industrial application …shm/Papers/uic.pdf · 2011. 3. 30. · model of tomato ripening and fruit quality. Through the coordinated analysis

Variation of background knowledge in an

industrial application of ILP

Stephen H. Muggleton∗, Jianzhong Chen∗, Hiroaki Watanabe∗, Stuart J.Dunbar†, Charles Baxter†, Richard Currie†, Jose Domingo Salazar†, Jan

Taubert+ and Michael J.E. Sternberg∗

Imperial College London∗

Syngenta Ltd†

BBSRC Rothamsted Research+

Abstract. In several recent papers ILP has been applied to SystemsBiology problems, in which it has been used to fill gaps in the descrip-tions of biological networks. In the present paper we describe two newapplications of this type in the area of plant biology. These applicationsare of particular interest to the agrochemical industry in which improve-ments in plant strains can have benefits for modelling crop development.The background knowledge in these applications is extensive and is de-rived from public databases in a Prolog format using a new system calledOndex (developers BBSRC Rothamsted). In this paper we explore thequestion of how much of this background knowledge it is beneficial to in-clude, taking into account accuracy increases versus increases in learningtime. The results indicate that relatively shallow background knowledgeis needed to achieve maximum accuracy.

1 Introduction

Systems Biology is a rapidly evolving discipline that seeks to determine howcomplex biological systems function [6]. It works by integrating experimentallyderived information with mathematical and computational models. Through aniterative process of experimentation and modelling, Systems Biology aims tounderstand how individual components interact to govern the functioning of thesystem as a whole. In several recent papers [13, 3], Inductive Logic Programing(ILP) has been applied to Systems Biology problems, in which it has been usedto fill gaps in the descriptions of biological networks.

Two new industrial applications of this type in the area of plant biology arebeing explored within the Syngenta University Innovation Centre (see Section 2).This centre of excellence in Systems Biology aims to address biological researchquestions related to the improvement of plant strains involved in crops. Thecentre uses mathematical and computational modelling techniques developed atImperial College London [2]. The background knowledge in these applications isgenerated by a system called Ondex [7]. Ondex is unique in its ability to generatelarge amounts of Prolog background knowledge on cell biochemistry by parsing,filtering and combining various publicly-available databases.

Page 2: New Variation of background knowledge in an industrial application …shm/Papers/uic.pdf · 2011. 3. 30. · model of tomato ripening and fruit quality. Through the coordinated analysis

In this paper we use Ondex to explore the performance effects of varyingthe amount of background knowledge available to an ILP learning engine. Thisis done by generating variations of background knowledge using the RelationNeighbours Filter in Ondex together with a tehnique for sampling Relations.The effects of varying the background knowledge are measured on both learn-ing time and predictive accuracy. The experimental results indicate that whilelearning time increases monotonically with the amount of background knowl-edge, relatively shallow degree of background knowledge is required to achievemaximum accuracy.

The paper is arranged as follows. In Section 2 we introduce the applicationson which the experiments were conducted. We then describe the Ondex systemfor generating the background knowledge in Section 3. The experiments are thendescribed in Section 4. Finally we conclude and describe further work in Section5.

2 Application descriptions

2.1 University Innovation Centre (UIC) overview

Agricultural research relies on understanding interactions of genes and chemicalsin a biological context. The search for a new biological trait to use in a conven-tional or genetic modification breeding programme is complex. It can take upto ten years and millions of dollars to bring such a development to market. Thesame is true for the search for new agrochemicals. Part of this development pro-cess is an assessment of the safety of the gene or chemical to the environmentand its potential toxicity to both mammals and beneficial organisms. SystemsBiology takes a new, integrated, approach to address these important challenges.Syngenta [1] is a leading Agrichemical company with a number one position inchemicals and is number three in high value seeds. Syngenta has establisheda “University Innovation Centre” (UIC) on Systems Biology at Imperial Col-lege London [2] to implement a “systems approach” to agricultural research.The centre has begun with two pioneer projects, tomato ripening and predictivetoxicology.

2.2 Tomato application

The characteristics of the tomato fruit that reaches the consumer are definedby the combination of its biochemical and textural properties. Metabolic com-ponents (volatiles, pigments, sugars and amino acids) define the appearanceand flavour whilst structural properties (cell adhesion, cell size, cuticle thick-ness, water content) define mouth-feel and texture perception. Together thesecomponents determine fruit quality and are crucial in influencing the success ofcommercial varieties.

At the genetic and biochemical level the regulation of fruit development andripening remains poorly understood. In this project we are applying ILP to

Page 3: New Variation of background knowledge in an industrial application …shm/Papers/uic.pdf · 2011. 3. 30. · model of tomato ripening and fruit quality. Through the coordinated analysis

deepen our understanding of the metabolic processes controlling tomato fruit de-velopment. By applying machine learning techniques to transcript and metabo-lite profiling data we are developing metabolic networks and building a predictivemodel of tomato ripening and fruit quality. Through the coordinated analysisof gene expression and metabolite changes across fruit development we aim toidentify new genetic targets that play a role in controlling the ripening process.Such knowledge will allow us to focus on these genetic control points in breedingnew tomato varieties, thus producing the most favourable combination of fruitquality characters in the ripe fruit.

2.3 Predictive Toxicology

An assessment of the potential to cause cancer is a key component of the riskassessment on a new Crop Protection Active Ingredient. The two year and 80week bioassays in rats and mice, respectively, provide the Hazard Informationto evaluate this risk. If tumours are observed in these trials, an assessment oftheir relevance for human risk may then be required. This typically makes useof the IPCS/HESI Human Relevance Framework, where the first step is thedevelopment of a mode-of-action case to describe the series of causal key eventsthat lead to rodent tumours. The second step is to examine the plausibility ofthese key events occurring in humans and so guide an assessment of the relevanceof the rodent findings.

This project aims to build a model that integrates the metabolic and geneexpression regulatory networks that underlie initial key events in liver tumourpromotion induced by model non-genotoxic carcinogens. It is envisaged thatcycles of hypothesis generation informed by model building and experimentaltesting will allow the identification of those regulatory components that arekey components in liver tumour promoting modes of action. Ultimately thiswill allow us to improve mechanistic understanding and so provide key data toexplain the basis of the thresholds in dose and species specificity in response,thereby allowing more informed human health risk assessments.

3 Ondex: a Biological Background Knowledge Generator

Data integration in the life sciences still remains a significant challenge for bioin-formatics [5]. Rather than developing a bespoke data integration solution forassembling the background knowledge for the machine learning task, the opensource data integration framework Ondex [7] was selected as a general solutionto bringing all the required data together. Ondex uses a graph-based approachwith a data warehouse for integrating biological data. The nodes in the graphrepresent biological concepts, e.g. enzymes and metabolites. Edges in the graphrepresent relations between biological concepts, e.g. a set of enzymes catalysesa biochemical reaction. Both nodes and edges in the graph can have additionalattributes, e.g. an enzyme name or an amino acid sequence. One of the reasons

Page 4: New Variation of background knowledge in an industrial application …shm/Papers/uic.pdf · 2011. 3. 30. · model of tomato ripening and fruit quality. Through the coordinated analysis

for choosing the Ondex system as a background knowledge generator is the nat-ural correspondence between its graph representation and the requirement ofgenerating background knowledge as Prolog clauses for ILP learning.

Data from key biological pathway and gene function information resourcesincluding KEGG [11], LycoCyc [4] were transformed into a semantically con-sistent graph representation using Ondex. In order to create a non-redundantand coherent knowledge base, mapping methods were used to identify equivalentand similar entities among the different data sources. Once the databases wereintegrated, the resulting knowledge base was available for further analysis andvisualization using the graph-based methods built in Ondex.

A key feature of the Ondex user client is that it allows the extraction of sub-graphs based on certain criteria. For example, it is simple to extract sub-graphsselected on by the class of biological concepts or relations, or where conceptsposses a particular attribute, or from a graph-neighbourhood around particularnodes of interest. Such criteria can be combined in a workflow to manipulate theinformation to be included into the background knowledge. In order to supportILP learning, a general background knowledge generating utility was built thatrespected an agreed Prolog syntax, with Ondex concept class names and rela-tion type names becoming predicate symbols and attributes becoming Prologterm structures. Every concept and relation was given a unique ID as the firstargument in each predicate, which was used by other predicates to associateattributes with concepts and relations. The translation of attributes of conceptsand relations were defined using the Ondex-generated Prolog code export utility.By following agreed conventions it was also possible to translate Prolog formatback into an Ondex graph, thus enabling the results of the machine learningprocess to be imported back into Ondex where they could be visualised in thecontext of the original knowledge base.

4 Experiments

Two independent experiments were conducted in the study to empirically inves-tigate the null hypothesis: variations of background knowledge (BK) do not leadto increased predictive accuracy.

4.1 Materials and methods

Tomato application An initial ground background knowledge base was de-rived from LycoCyc database [4, 9] and exported as Prolog format using theOndex system. The knowledge base depicts the relational structure of tomatobiological network (shown in Fig. 1), including the fundamental components,e.g. compounds, reactions, enzymes, and their relations, e.g. consumed by, pro-duced by, catalysed by, etc. Two types of raw data were provided by the do-main experts in the experiments using gene mutants to study altered tomatoripening - concentration changes of metabolites and gene transcripts for fourgenotypes during 13 time slices. The data were expressed in terms of binary

Page 5: New Variation of background knowledge in an industrial application …shm/Papers/uic.pdf · 2011. 3. 30. · model of tomato ripening and fruit quality. Through the coordinated analysis

(up/down-regulation) for the purposes of applying ILP. In order to deal withthe many-to-one relationships among gene transcripts, enzymes and reactions,a set of relevant transcriptomic data were further ‘compressed’ into one valueusing SUM aggregate for a reaction. Three datasets were chosen for modelingtomato aspartate and the connected subnetwork. Each contains the concentrationchanges of 10 metabolites as learning examples and 16 transcripts as observedfacts on a particular time point1.

Substrate1

Substratem

Product1

Productn

{Enzyme}/ECr

inhibition/activation?

···

···

· · ·

· · ·

· · ·

· · ·

Fig. 1. Illustration of the relational background knowledge, where a set of compounds{Substrate}m

1 are consumed by a reaction r, which produce a set of compounds{Product}n

1 ; r is catalysed by a set of enzymes with an EC number {Enzyme}/ECwhich is aggregated from a set of gene transcripts; the ILP learning is to abduce inhi-bition/activation occured in r.

C{S}1{S}m {P}1 {P}n

{E}r

{E} {E}r r· · ·

{E}r

{E}{E}rr · · ·

| || || distance = 1distance = 1| ||

distance <= kdistance <= k

{S}

. . .

r{E}

r{E}

{P}

... r{E}

r{E}

Fig. 2. Illustration of biological network structure and k-compound-neighbours (k-cn,k ≥ 1) of a centroid compound C, where {S}, {P}, {E}, r stand for a set of substratecompounds, product compounds, enzymes and a reaction, respectively

Variations of the background knowledge were generated using the Ondex re-lation neighbours filter (RNF), which extracts a subset from an original networkgiven a set of centroid nodes and some distance k. In our experiments, the subnet-work consists of k-compound-neighbours (k-cn, as illustrated in Fig. 2) , whichis defined based on a corresponding k-reaction-neighbours (k-rn) as follows.

0-cn = {centroid compounds}

k-rn =⋃

c∈(k−1)-cn

(r | consumed by(c, r)) +⋃

c∈(k−1)-cn

(r | produced by(c, r))

k-cn =⋃

r∈k-rn

(c | consumed by(c, r)) +⋃

r∈k-rn

(c | produced by(c, r))

Furthermore, a sampling relation neighbours filter (SRNF) is developed inorder to generate a series of evenly varied variations, containing sampled subsets

1 They respectively represent three time points - 15 days post anthesis, the breakerpoint when the fruit starts to change colour, and 7 days after the breaker stage.

Page 6: New Variation of background knowledge in an industrial application …shm/Papers/uic.pdf · 2011. 3. 30. · model of tomato ripening and fruit quality. Through the coordinated analysis

of k-rn in which the size of k-cn can be controlled by a given sampling rate. Thealgorithm of SRNF is shown in Table 1.

1.0-cn={learning examples};2.for each distance k = 1, . . . , K, where K is a distance threshold corresponding to

the original background knowledge base2.1.k-rn={}, for each c ∈ (k − 1)-cn

k-rn = k-rn ∪{r | consumed by(c, r) ∨ produced by(c, r)};2.2.ks-rn = sample(k-rn,sr), where sr is a manually set sampling rate;2.3.k-cn={}, for each r ∈ ks-rn

k-cn = k-cn ∪{c | consumed by(c, r) ∨ produced by(c, r)};2.4.output k-cn.

Table 1. Algorithm of sampling relation neighbours filter (SRNF)

The datasets and variations of ground background knowledge were given tothe abductive ILP [13] system Progol5.0 [10] together with a set of non-groundrules, which describe the underlying transitive behaviour of concentrations ofmetabolites and enzymes. Progol5.0 was then required to derive inhibition onreactions. Predictive performance was tested and evaluated against variations ofthe background knowledge (k-cn).

Predictive Toxicology application In the Predictive Toxicology application,the metabolomic and genomic data reported in [14] were integrated with theKEGG Rattus norvegicus (rat) database using Ondex. The data integration wasperformed in the following four steps. The first step was to extract relational in-formation from the KEGG database using the Ondex KEGG parser. After pars-ing an XML version of KEGG database, unnecessary information was discardedusing Ondex’s filtering functions. We constructed an initial model by keepingthe 6 meta concepts (Gene, Protein, Enzyme, Enzyme Classification, Reactionand Compound) and 6 relations as shown in Figure 3. Note that a meta conceptis a set of unique concepts. For example, the meta concept Compound is a setof compounds collected in KEGG COMPOUND database.

The second step was to define new relations at the meta-concept level inorder to build our own models. In this paper we design mappings from Gene toEnzyme Classification (EC) to map the effects of the gene expressions at the EClevel. The following shows a chain of meta concepts in Prolog.

project(Gene,EC) : − is encoded by(Protein,Gene), is a(Enzyme, Protein),

catalyzed by(Reaction,Enzyme),

part of catalyzing class(Reaction,EC)

In the above formula, the new predicate, project/2, can be used to define theprojection from Gene to EC. In the Predictive Toxicology application, this pro-

Page 7: New Variation of background knowledge in an industrial application …shm/Papers/uic.pdf · 2011. 3. 30. · model of tomato ripening and fruit quality. Through the coordinated analysis

Fig. 3. The 6 meta classes and with their associated relations

jection forms a many-to-one mapping. Part of the projections is included inAppendix 1.

The third step was to integrate the parsed Ondex model with the numericalgene expression data at the EC level. We applied the Ondex tab delimited fileparser and stored the parsed numerical data as attribute values in the associ-ated concepts. The gene expression data at day 14 at 1000 ppm of Phenobar-bital (PB) [14] were mapped onto EC using the above projections. The SUMaggregated expression data were used to calculate the fold changes in expressionrelative to time-matched controls. The numerical fold changes were transformedinto binary (up/down-regulation) for our ILP experiments.

The final step was to combine the metabolomic data with the above inte-grated Ondex model. We exported the constructed Ondex model into Prologusing the Ondex Prolog export utility. The metabolomic data were manuallyencoded in Prolog and combined with the Ondex Prolog code. A part of ourOndex Prolog model is shown in Appendix 2. The metabolite changes (increaseor decrease) at day 14 at 1000 ppm of PB [12] were combined with the exportedOndex Prolog.

In the Predictive Toxicology application, a new filter was designed for gen-erating variations of background knowledge. This is called a minimum spanningtree filter (MiST) in which the projected expression data at the EC level aretreated as weights. Intuitively the filter keeps informative reactions in the path-way and filters non-informative reactions out of the pathway. The notion ofthe informative reaction is based on the level of the projected gene expressions.If a member of EC, ec, is associated with a strongly expressed gene, we treatthe reaction catalysed by ec as informative regardless of up-regulation or down-regulation. We implement this idea by defining the weight as w = e|1/log(fc)| inwhich fc is a numerical fold change value. Figure 4 shows the MiST algorithmwhich is based on Kruskal’s algorithm [8].

We created 20 variations of background knowledge by applying the Ondexrelation neighbours filter (RNF) followed by the MiST filter. More precisely, afterparsing the KEGG rat database in step 1, we applied the RNF filter to the parsedOndex model for each neighbourhood distance k = 1, ..., 10. The resulting 10variations of background knowledge are referred to as the background knowledge

Page 8: New Variation of background knowledge in an industrial application …shm/Papers/uic.pdf · 2011. 3. 30. · model of tomato ripening and fruit quality. Through the coordinated analysis

Fig. 4. MiST algorithm in the minimum spanning tree filter

by RNF. In Appendix 3, we show a visualisation of the background knowledgegenerated using RNF (k = 1) in Figure 9 and its meta-level view in Figure 10.

For each of these variations, we performed step 2 and step 3 in order tocompute the weights at the EC level. We applied the MiST filter for each of thevariation and the resulting 10 variants are referred to as the MiST backgroundknowledge.

The variations of the background knowledge, the integrated examples, andnon-ground Prolog rules were given to the abductive ILP system Progol5.0.Note that the same non-ground Prolog rules were applied for the Tomato andPredictive Toxicology experiments. Progol5.0 executed abductive inferences inorder to generate hypotheses of inhibition on reactions for each variation of thebackground knowledge.

4.2 Results and discussion

Tomato application Leave-one-out cross validation was used to evaluate theexperiments, in which predictive accuracy and running time were computed forvariation of the background knowledge. Fig. 5(a) shows that the sizes of k-cngenerated by RNF increased sharply when k ≤ 2 but only have minor changeswhen k > 2; whereas the sizes of k-cn generated by SRNF increase graduallyand evenly with increasing values of k. Fig. 5(b) indicates that the running timeincreases linearly with the size of the background knowledge. Fig. 6 shows that(1) all the experiments get at least default predictive accuracy; (2) maximum

Page 9: New Variation of background knowledge in an industrial application …shm/Papers/uic.pdf · 2011. 3. 30. · model of tomato ripening and fruit quality. Through the coordinated analysis

accuracy could be achieved with relatively shallow bcakground knowledge (i.e.smaller k); (3) the sizes of background knowledge generated by SRNF at whichthe predictive accuracy reaches its maximum value s smaller than those gener-ated by RNF for two datasets, and are almost level in the third dataset.

In addition, we define a k-model to be the learning result (inhibition/activationabduced) using k-cn. A k-model will be referred to as stable if it is equivalentto a K-model (see step 2 of Table 1) with maximum accuracy. The vertical linesin Fig. 5 show that SRNF generates a smaller size background knowledge withless running time to reach the least k values of which k-model starts to be stablethan RNF for all the three datasets (k ≥ 3 for RNF and k ≥ 7 for SRNF). Insummary, it is possible to find more stable, shallow and fine (or smaller) back-ground knowledge that achieves maximum predictive accuracy with less runningtime by using SRNF rather than RNF. SRNF also enables us to investigate thefiner changes between variation of BK in a controllable way. The null hypothesisset for the experiments has been rejected by these results.

0

200

400

600

800

1000

1200

1400

1600

1800

2000

2200

1 2 3 4 5 6 7 8 9 10

Siz

e of

BK

(K

B)

Neighbours distance (k)

(a) Size of BK v.s. Variation of BK

k-cn

(R

NF

) st

arts

to b

e st

able

k-cn

(S

RN

F)

star

ts to

be

stab

le

BK produced by RNFBK produced by SRNF

0

1000

2000

3000

4000

5000

6000

1 2 3 4 5 6 7 8 9 10

Run

ning

tim

e (s

econ

ds)

Neighbours distance (k)

(b) Running time v.s. Variation of BK

k-cn

(R

NF

) st

arts

to b

e st

able

k-cn

(S

RN

F)

star

ts to

be

stab

le

Dataset 1, RNFDataset 2, RNFDataset 3, RNF

Dataset 1, SRNFDataset 2, SRNFDataset 3, SRNF

Fig. 5. (a) Size of variation of BK (b) Running time v.s. variation of BK (k-cn) pro-duced by RNF and SRNF for the three datasets

Predictive Toxicology application Leave-one-out cross validation was per-formed for the selected 9 metabolites. In Figure 7, the sizes of variation of BKsby RNF increased gradually whereas the MiST sizes remained constant. Fig-ure 8 shows that (1) the predictive accuracies of the MiST approach were alwayshigher than the results of RNF in which RNF only provided the default accuracyand (2) the best predictive accuracy by MiST was achieved at neighbourhooddistance 1.

In the Predictive Toxicology application, the null hypothesis for the experi-ments has been rejected by these empirical results.

Page 10: New Variation of background knowledge in an industrial application …shm/Papers/uic.pdf · 2011. 3. 30. · model of tomato ripening and fruit quality. Through the coordinated analysis

60

70

80

90

100

1 2 3 4 5 6 7 8 9 10

Pre

dict

ive

accu

racy

(%

)

Neighbours distance (k)

(a) Predictive accuracy vs. Variation of BK produced by RNF

Dataset 1 (default is 80%)Dataset 2 (default is 60%)Dataset 3 (default is 60%)

Default accuracy

60

70

80

90

100

1 2 3 4 5 6 7 8 9 10

Pre

dict

ive

accu

racy

(%

)

Neighbours distance (k)

(b) Predictive accuracy vs. Variation of BK produced by SRNF

Dataset 1 (default is 80%)Dataset 2 (default is 60%)Dataset 3 (default is 60%)

Default accuracy

Fig. 6. Predictive accuracy v.s. variation of BK (k-cn) produced by (a) RNF and (b)SRNF for the three datasets

Fig. 7. Size of variations of BKsFig. 8. Predictive accuracy by RNF andMiST

5 Conclusions and further work

This paper explores the application of ILP to Systems Biology. These applica-tions involve modelling interations between components of biological systemsusing abductive inductive logic programing. We also introduce a powerful newsystem called Ondex which can be used to generate Prolog background knowl-edge iby parsing and filtering public dataases on cell biochemistry. Two industrialapplications are described which are being studied in the Syngenta UniversityInnovation Centre. With the extensive background knowledge generated, we ex-plore the question of how variations of background knowledge affect learningtime and predictive accuracy of the same ILP learning system.

Through two independent experiments in tomato biology and predictive tox-icology, we conclude that relatively shallow background knowledge can be usedto achieve maximum accuracy. In addition the experiments indicate that use of

Page 11: New Variation of background knowledge in an industrial application …shm/Papers/uic.pdf · 2011. 3. 30. · model of tomato ripening and fruit quality. Through the coordinated analysis

neighbourhod further reduces the learning time required to achieve maximumaccuracy.

In further work we aim to improve the non-ground background knowledgerules used in the experiments. We also intend to extend the results using datasetsof all time points, and investigate the biological significance of the learned the-ories.

Acknowledgments

The authors would like to acknowledge the support of Syngenta in its fundig ofthe University innovations Centre at Imperial College. The first author wouldalso like to thank the Royal Academy of Engineering and Microsoft Research fortheir support of his research chair.

References

1. Syngenta Ltd. http://www.syngenta.com/en/index.html.2. Syngenta University Innovation Centre. http://www3.imperial.ac.uk/syngenta-

uic.3. J. Chen, S.H. Muggleton, and J. Santos. Learning probabilistic logic models from

probabilistic examples. Machine Learning, 73(1):55–85, 2008. 10.1007/s10994-008-5076-4.

4. L.A. Mueller et al. The SOL Genomics Network. A Comparative Resource forSolanaceae Biology and Beyond. Plant Physiology, 138(3):1310–1317, 2005.

5. C. Goble and R. Stevens. State of the nation in data integration for bioinformatics.Journal of Biomedical Informatics, 41(5):687–693, 2008.

6. H. Kitano. Computational systems biology. Nature, 420:206–210, 2002.7. J. Kohler, J. Baumbach, J. Taubert, M. Specht, A. Skusa, A. Ruegg, C. Rawlings,

P. Verrier, and S. Philippi. Graph-based analysis and visualization of experimentalresults with ONDEX. Bioinformatics, 22(11):1383–1390, 2006.

8. Joseph. B. Kruskal. On the shortest spanning subtree of a graph and the travelingsalesman problem. Proceedings of the American Mathematical Society, 7(1):48–50,1956.

9. LycoCyc. Solanum lycopersicum database. http://solcyc.solgenomics.net//LYCO/.10. S.H. Muggleton and C.H. Bryant. Theory completion using inverse entailment. In

Proc. of the 10th International Workshop on Inductive Logic Programming (ILP-00), pages 130–146, Berlin, 2000. Springer-Verlag.

11. H Ogata, S Goto, K Sato, W Fujibuchi, H Bono, and M Kanehisa. KEGG: KyotoEncyclopedia of Genes and Genomes. Nucl. Acids Res., 27(1):29–34, 1999.

12. Denis V. Rubtsov, Claire Waterman, Richard A. Currie, Catherine Waterfield,Jos Domingo Salazar, Jayne Wright, and Julian L. Griffin. Application of abayesian deconvolution approach for high-resolution 1h nmr spectra to assessingthe metabolic effects of acute phenobarbital exposure in liver tissue. AnalyticalChemistry, 82(11):4479–4485, 2010.

13. A. Tamaddoni-Nezhad, R. Chaleil, A. Kakas, and S.H. Muggleton. Applicationof abductive ILP to learning metabolic network inhibition from temporal data.Machine Learning, 64:209–230, 2006. DOI: 10.1007/s10994-006-8988-x.

Page 12: New Variation of background knowledge in an industrial application …shm/Papers/uic.pdf · 2011. 3. 30. · model of tomato ripening and fruit quality. Through the coordinated analysis

14. Claire Waterman, Richard Currie, Lisa Cottrell, Jacky Dow, Jayne Wright, Cather-ine Waterfield, and Julian Griffin. An integrated functional genomic study of acutephenobarbital exposure in the rat. BMC Genomics, 11(1):9, 2010.

Appendix 1: Sample of projection/2

%

%project(KEGG GeneID, EC_Number).

%

project(’RNO:24788_GE’,’1.1.1.14’).

% rno:24788 is mapped onto EC1.1.1.14

project(’RNO:24534_GE’,’1.1.1.27’).

% rno:24534 is mapped onto EC1.1.1.27

project(’RNO:24533_GE’,’1.1.1.27’).

% rno:24533 is mapped onto EC1.1.1.27

Appendix 2: Sample of Ondex Prolog in the Predictive Toxicologydomain

enzyme(’4400’,’derived enzyme’,’KEGG’,’IMPD’).

% ’4440’ is an Ondex ID

concept_name(’4400’,’Nrk1’).

% ’4400’ has the concept name ’Nrkl’

reaction(’14366’,’irreversible’,’KEGG’,’IMPD’).

% ’14366’ is an Ondex ID

concept_name(’14366’,’cpd:C05841 => cpd:C05841’).

% ’14366’ for the reaction ’cpd:C05841 => cpd:C05841’

part_of_catalyzing_class(’4400’,’4401’,’IMPD’).

% ’4401’ is classified into EC ’4401’

concept_name(’4401’,’2.7.1.-’).

% ’2.7.1.-’ is an EC

catalyzed_by(’14366’,’4400’,’IMPD’).

% ’14366’ is catalyzed by ’4400’

relation_day14_1500mg(is_related_to,’28501’,’22436’,’1.1385442’).

% ’28501’ is an Ondex ID

% ’22436’ is an Ondex ID

% ’1.1385442’ is an expression level

probe(’28501’,’UC’,’IMPD’).

% ’28501’ is for a probe

concept_name(’28501’,’1396933_s_at’).

% ’1396933_s_at’ is a probe name

gene(’22436’,’UC’,’IMPD’).

% ’22436’ is for a gene

concept_name(’22436’,’RNO:191574_GE’).

% ’RNO:191574_GE’ is a gene name

Appendix 3: Sample of Ondex Models

Page 13: New Variation of background knowledge in an industrial application …shm/Papers/uic.pdf · 2011. 3. 30. · model of tomato ripening and fruit quality. Through the coordinated analysis

Fig. 9. Visualised concepts in the Predictive Toxicology domain by Ondex (neighbourdistance k = 1). Each node represents a concept and a directed edge represents a binaryrelation between two concepts. Users can edit the constructed model by (1) clickingthe object in the view and (2) applying Ondex utilities such as filters and relationcollapsers.

Fig. 10. Meta View of the Ondex model in Figure 9. Users can edit models also at thismeta level.