A Visual Analytics Approach to Augmenting Formal Concepts with Relational Background Knowledge in a Biological Domain
7th December 2010
Elma Akand*, Mike Bain, Mark Temple
*CSE, UNSW / School of Biomedical and Health Sciences, UWS
The Sixth Australasian Ontology Workshop, Adelaide University of South Australia
Outline
Machine learning and data mining in bioinformatics
Domain Ontologies in biomedical applications
Formal Concept Analysis
MCW algorithm (Mining Closed itemsets for Web apps)
BioLattice – a web based browser
Experimental Application: systems biology
Part-1: Concept ranking by gene interaction
Part-2: Relational learning of multiple-stress rules
Machine learning & Data mining in Bioinformatics
Bioinformatics
“Bioinformatics is the study of information content and information flow in biological systems and processes” (Michael Liebman, 1995)
Machine Learning & Data mining
- Can offer automatic knowledge acquisition
- Process to discover knowledge by analyzing data from different perspectives; can contribute greatly to building a knowledge base
Our work: focus on knowledge-based machine learning
- Previous work: learning from ontologies
- Current work: ontology construction by learning
- Potential application areas: ontologies are central to eCommerce and eHealth
- Current application area: systems biology (predict gene function, data integration)
Ontology
In philosophy - concerned with nature and relations of being
In knowledge representation - study of categorization of things:
- Informal ontology: natural language
- Formal ontology: first-order logic or a variant
- Upper ontology: general
- Domain ontology: specific
Ontology
Ontology – “specification of a conceptualization” (Gruber, 1993)
Conceptualization – “formalization of knowledge in declarative form” (Genesereth and Nilsson, 1987)
Gene Ontology
Missing concepts and relations
One gene can be annotated with different GO terms, where one term is a specialization of another
Example: gene x; concepts a, b; relations: (i) x - a, (ii) x - b and (iii) b - a
Formal Concept Analysis (FCA)
Mathematical order theory (Rudolf Wille in the early 80s)
-Derives conceptual structures out of data
-Method for data analysis, knowledge representation and information management
Components
-Formal context, concept, concept lattice
Example formal context (x marks that the object has the attribute):
object    four-legged   hair-covered  intelligent   marine        thumbed
cats      x             x
dogs      x             x
dolphins                              x             x
gibbons                 x             x                           x
humans                                x                           x
whales                                x             x
Formal concepts in a concept lattice
({cats, gibbons, dogs, dolphins, humans, whales}, {-})
({gibbons, dolphins, humans, whales}, {intelligent})
({dolphins, whales}, {intelligent, marine})
({cats, gibbons, dogs}, {hair-covered})
({cats, dogs}, {hair-covered, four-legged})
({gibbons, humans}, {intelligent, thumbed})
({gibbons}, {intelligent, hair-covered, thumbed})
({-}, {intelligent, hair-covered, thumbed, marine, four-legged})
[Figure: lattice diagram of these concepts, with nodes labelled Top, Bottom and 1 to 6]
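Formal context: an n by m Boolean matrix, with n objects O as rows and m attributes A as columns
Formal concept: a pair <X, Y> under the Galois connection, where X is a subset of A and Y is a subset of O; X is exactly the set of attributes shared by all objects in Y, and Y is exactly the set of objects that have all attributes in X
Concept lattice loosely interpretable in ontology terms:
- concept definitions and sub-concept relations, cf. T-box
- concept membership by objects, cf. A-box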
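The definitions above can be tried out on the animal context from the earlier slide. The following is a minimal brute-force sketch in Python, not the method used in the talk; the names CONTEXT, extent, intent and formal_concepts are chosen here for illustration only. It closes every attribute subset and collects the resulting (extent, intent) pairs.

```python
from itertools import combinations

# Formal context from the slide: object -> set of attributes it has.
CONTEXT = {
    "cats": {"four-legged", "hair-covered"},
    "dogs": {"four-legged", "hair-covered"},
    "dolphins": {"intelligent", "marine"},
    "gibbons": {"hair-covered", "intelligent", "thumbed"},
    "humans": {"intelligent", "thumbed"},
    "whales": {"intelligent", "marine"},
}
ATTRIBUTES = frozenset().union(*CONTEXT.values())


def extent(attrs):
    """Objects that have every attribute in attrs."""
    return frozenset(o for o, a in CONTEXT.items() if attrs <= a)


def intent(objs):
    """Attributes shared by every object in objs (all attributes if objs is empty)."""
    shared = set(ATTRIBUTES)
    for o in objs:
        shared &= CONTEXT[o]
    return frozenset(shared)


def formal_concepts():
    """Enumerate all (extent, intent) pairs by closing every attribute subset.

    Exponential in the number of attributes; fine for a five-attribute toy context.
    """
    found = set()
    for k in range(len(ATTRIBUTES) + 1):
        for attrs in combinations(sorted(ATTRIBUTES), k):
            objs = extent(frozenset(attrs))
            found.add((objs, intent(objs)))
    return found


concepts = formal_concepts()
# Print from the largest extent (most general) to the smallest (most specific).
for objs, attrs in sorted(concepts, key=lambda c: -len(c[0])):
    print(sorted(objs), sorted(attrs))
```

On this context the sketch finds the same eight concepts that are listed on the slide.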
FCA in data mining
FCA can be seen as a clustering technique in machine learning
-Most of the work is in a propositional framework
In data mining, closed itemset mining is an efficient alternative to FCA
A frequent itemset X is closed if there exists no proper superset Y⊃X with support(Y)=support(X)
E.g., if X = {a,b,c,d} and Y ={a,b,c,d,e} and support(Y)=support(X), then X is not closed
Parameters to avoid building entire lattice
-Extent size must be greater than minsup
Existing closed itemset mining algorithms
-Data structures to speed up closed itemset mining
-But may not build lattice, or include extents
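The closedness definition can be checked directly from support counts. This is a minimal sketch under stated assumptions: TRANSACTIONS is the six-transaction database of the MCW example written in horizontal form, and the function names support and is_closed are illustrative, not part of any library. It is enough to test single-item extensions, because any proper superset with equal support implies a one-item extension with equal support.

```python
# Six transactions over the items A, C, D, T, W (horizontal format).
TRANSACTIONS = {
    1: set("ACTW"),
    2: set("CDW"),
    3: set("ACTW"),
    4: set("ACDW"),
    5: set("ACDTW"),
    6: set("CDT"),
}
ITEMS = set().union(*TRANSACTIONS.values())


def support(itemset):
    """Number of transactions that contain every item of itemset."""
    wanted = set(itemset)
    return sum(1 for items in TRANSACTIONS.values() if wanted <= items)


def is_closed(itemset):
    """True if no one-item extension of itemset has the same support."""
    base = support(itemset)
    current = set(itemset)
    return all(support(current | {extra}) < base for extra in ITEMS - current)


for candidate in ("AT", "ACTW", "D", "CD"):
    print(candidate, support(candidate), is_closed(candidate))
```

Here {A,T} is not closed because adding W (or C) leaves the support unchanged, while {A,C,T,W} is closed.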
MCW algorithm (Mining Closed itemsets for Web apps)
Vertical data format
IT-tree (itemset-tidset tree) search space
-each node holds an itemset-tidset pair X x t(X), and all its children have prefix X
Pruning
- 4 set difference closure operators
Subsumption check
- A look-up table to record all attributes and their occurrences in closed concepts
Lattice
- adding concepts following a general to specific order
Vertical data format (item: tidset)
D: 2 4 5 6
A: 1 3 4 5
C: 1 2 3 4 5 6
T: 1 3 5 6
W: 1 2 3 4 5
attribute Concept_id
D C1,C2
T C3,C4
A C4,C5
W C2,C4,C5,C6
C C1,C2,C3,C4,C5,C6,C7
Is {TA}{135} closed? No: i(135)={TAWC}, which is a proper superset of {TA}
Closure operators
{TA}{135}={TW}{135} ->{TAW}{135}
{D}{2456}⊂{C}{123456}->{DC}{2456}
{D}{2456} and {W}{12345}->{DW}{245}
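The closure check, the three tidset cases and the look-up table can be reproduced from the vertical data with a few set operations. The sketch below is brute force and is not the MCW or CHARM search itself; the function names and the threshold minsup=3 are assumptions made here (with "at least 3" the sketch yields seven closed itemsets, the same count as the concept ids C1 to C7, although the numbering it assigns need not match the slide).

```python
from itertools import combinations

# Vertical data format from the slide: item -> tidset.
VERTICAL = {
    "A": {1, 3, 4, 5},
    "C": {1, 2, 3, 4, 5, 6},
    "D": {2, 4, 5, 6},
    "T": {1, 3, 5, 6},
    "W": {1, 2, 3, 4, 5},
}
ALL_TIDS = set().union(*VERTICAL.values())


def tidset(itemset):
    """t(X): transactions that contain every item of X (tidset intersection)."""
    tids = set(ALL_TIDS)
    for item in itemset:
        tids &= VERTICAL[item]
    return frozenset(tids)


def itemset_of(tids):
    """i(T): items that occur in every transaction of T."""
    return frozenset(item for item, t in VERTICAL.items() if set(tids) <= t)


def closure(itemset):
    """i(t(X)); X is closed exactly when closure(X) equals X."""
    return itemset_of(tidset(itemset))


def closed_itemsets(minsup):
    """Closures of all itemsets with at least minsup transactions (assumed threshold)."""
    found = {}
    for k in range(1, len(VERTICAL) + 1):
        for combo in combinations(sorted(VERTICAL), k):
            tids = tidset(combo)
            if len(tids) >= minsup:
                found[closure(combo)] = tids
    # General-to-specific order: larger tidsets (extents) first.
    return sorted(found.items(), key=lambda kv: (-len(kv[1]), sorted(kv[0])))


def lookup_table(concepts):
    """Attribute -> ids of the closed concepts in which it occurs."""
    table = {item: [] for item in VERTICAL}
    for cid, (items, _) in enumerate(concepts, start=1):
        for item in items:
            table[item].append(cid)
    return table


print(sorted(closure("TA")))             # closure of {T,A}
print(tidset("TA") == tidset("TW"))      # equal tidsets: merge into {TAW}
print(tidset("D") < tidset("C"))         # proper subset: {D} extends to {DC}
print(sorted(tidset("DW")))              # neither contains the other: intersect

concepts = closed_itemsets(minsup=3)
table = lookup_table(concepts)
for item in "DTAWC":
    print(item, table[item])
```

The look-up table mirrors the subsumption check: a candidate itemset only needs to be compared against the concepts listed under its attributes.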
Based on CHARM (Zaki, 2005)
Visual analytics
-combination of information visualization with machine learning and data analysis (Keim et al., 2008)
Visualization of concept lattice
- provides overview of the structure of the domain
- means for further data analysis, e.g., classification, clustering, implication discovery, rule learning
Previous work
- lattice navigation since Godin et al. (1993)
-Browsable concept lattice, e.g., Kim & Compton (2004)
Our current work
- on augmenting concept lattice by integrating multiple sources of knowledge (Gene Ontology, protein interactions) for further analysis & machine learning
Concept lattice as a visual analytics approach
Case study: Yeast systems biology
[Figure: browsable concept lattice, with the "more general" direction marked]
Biological validation (1) : synthetic lethality
Synthetic lethal interaction: the cell is viable when either gene A or gene B is individually deleted, but cannot grow when both are deleted.
Our results show that 72 (119) concepts in the lattice are more likely than random chance at p < 0.01 (p < 0.05) to contain synthetic lethal pairs.
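The slide does not say which statistical test was used. One common way to ask whether a concept contains more synthetic lethal pairs than expected by chance is a hypergeometric tail probability, sketched below. This is an assumption for illustration, not the authors' procedure, and all numbers in the example are hypothetical.

```python
from math import comb


def hypergeom_tail(k, n, K, N):
    """P(X >= k) when n gene pairs are drawn without replacement from N pairs,
    of which K are synthetic lethal."""
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


# Hypothetical numbers, NOT taken from the talk:
genes_total = 50                      # genes in the whole context
N = comb(genes_total, 2)              # all gene pairs
K = 30                                # pairs known to be synthetic lethal
concept_genes = 6                     # genes in one concept's extent
n = comb(concept_genes, 2)            # gene pairs inside that concept
k = 3                                 # synthetic lethal pairs observed in it

p_value = hypergeom_tail(k, n, K, N)
print(p_value, p_value < 0.01)
```

Running such a test once per concept would also raise a multiple-testing question, which this sketch does not address.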
Biological validation (2) : ILP learning of concept definitions
Data sources provided to Inductive Logic Programming (ILP):
- Protein-protein interaction data
- Microarray gene-expression data
- Transcription factor binding data (ChIP-chip)
- Ontology data
- Biochemical pathway data
ILP output: first-order rules
Example rule:
concept(A) :- ppi(B,A,C), ppi(B,A,E), ppi(B,C,E), tfbinds(D,C), tfbinds(F,E).
Reading of the rule: RSM19 is required for H2O2 response; RSM19, RSM22 and MRPS17 are in the “mitochondrial ribosomal small subunit” stable complex; and RSM22 and MRPS17 are bound by transcription factors under amino acid starvation.
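To make the first-order rule on this slide concrete, its body can be evaluated over a small set of facts. The sketch below is Python rather than an ILP system, and it rests on assumptions: ppi(B,A,C) is read as "A and C interact within complex B", tfbinds(D,C) as "transcription factor D binds gene C", and the facts, the complex identifier and the transcription factor names TF_X and TF_Y are hypothetical placeholders rather than the authors' data.

```python
# Hypothetical toy facts that mirror the description on the slide.
PPI = {
    ("mito_ribosomal_small_subunit", "RSM19", "RSM22"),
    ("mito_ribosomal_small_subunit", "RSM19", "MRPS17"),
    ("mito_ribosomal_small_subunit", "RSM22", "MRPS17"),
}
TFBINDS = {
    ("TF_X", "RSM22"),    # TF_X is a placeholder name
    ("TF_Y", "MRPS17"),   # TF_Y is a placeholder name
}


def concept_members(ppi, tfbinds):
    """Genes A for which the rule body holds:
    ppi(B,A,C), ppi(B,A,E), ppi(B,C,E), tfbinds(D,C), tfbinds(F,E)."""
    bound = {gene for _, gene in tfbinds}   # genes bound by some transcription factor
    members = set()
    for b, a, c in ppi:                     # ppi(B,A,C)
        for b2, a2, e in ppi:               # ppi(B,A,E) with the same B and A
            if (b2, a2) != (b, a):
                continue
            if (b, c, e) in ppi and c in bound and e in bound:
                members.add(a)
    return members


print(concept_members(PPI, TFBINDS))
```

With these toy facts only RSM19 satisfies the body, with C = RSM22 and E = MRPS17.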
Conclusions
Many real-world domains are data-intensive
Machine learning and data mining applications required to generate predictive and useful outputs
We focus on knowledge-based learning for comprehensibility – use ontologies
Formal concept analysis as a framework for ontology structure
Use data mining techniques for efficient concept lattice generation
Visual analytics approach: browsable lattice, added background knowledge
Initial validation on a case study from yeast systems biology
Future work
Investigate pseudo-intents to simplify concept lattice
Investigate variants of concept lattice structures, e.g., concept lattice of inverse context
Add concept definitions to background knowledge in ILP