Bioinformatics, Data Integration and Machine Learning a Thesis Proposal
Apr 19, 2023 1
Bioinformatics, Data Integration and Machine Learning: a Thesis Proposal
Kaushik Sinha
Supervisors: Prof. Gagan Agrawal and Prof. Mikhail Belkin
Apr 19, 2023 2
Roadmap
Motivation
Our Approach
Current Work
  Learning Layouts of Flat-file Biological Datasets
  Exploratory Tools for Biological Data Analysis
Proposed Work
  Deep Web Mining for Biological Data
  Semi-supervised Ranking
  Multiple-instance Learning
Conclusion
Apr 19, 2023 3
Motivation
Integration is hard
  Data explosion
    Data size & number of data sources
  New analysis tools
  Autonomous resources
  Heterogeneous data representation & various interfaces
  Frequent updates
  New trend: web and grid services
Apr 19, 2023 4
Motivation contd…
In recent years, DNA microarrays and other gene and protein assays have become essential tools for biologists
The next step of biological enquiry is to find out:
  What is known about these genes?
  How are these genes related to each other, or to other genes identified in similar studies?
However, the major difficulties are:
  How do we extract key properties shared by a set of candidate genes?
  How do we generate reasonable hypotheses to explain them?
  How do we define and evaluate similarity between sets of genes?
Apr 19, 2023 5
Motivating Example
Suppose that after a microarray experiment a biologist suspects that a small set of genes is related to a disease
  This can be confirmed by searching the existing literature
  One would expect related genes to appear together in the literature
Due to the sheer volume of literature
  Searching is time consuming and error prone
  Some complications can arise as well
However, suppose genes A and C are related and both of them are weakly related to gene B
  In the literature, one would expect A and C to appear together, and/or A and B to appear together and B and C to appear together
How do we efficiently conclude that A and C are actually related?
Apr 19, 2023 6
Our Approach
Using data mining / machine learning techniques to extract useful information from biological data
Different forms of data
  Flat-file data
  Microarray data
  Online literature abstracts
Develop different forms of tools
  Layout extractor
  Hypergraph mining
  Similarity measure among sets of genes
Apr 19, 2023 8
Learning Layout of a Flat-File
In general: intractable
Try to learn the layout, then have a domain expert verify it
Key issue: which delimiters are being used?
Apr 19, 2023 9
Finding Delimiters
Some knowledge from a domain expert is required (semi-automatic)
Naïve approaches
  Frequency counting
    Counts frequently occurring single tokens (words separated by spaces)
  Sequence mining
    Counts frequently occurring sequences of tokens
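The two naïve approaches can be sketched in Python as follows. This is only an illustrative sketch: the function names `frequent_tokens` and `frequent_sequences`, the `top_k` parameter, and the sample records are assumptions made here, not part of the original work.

```python
from collections import Counter

def frequent_tokens(lines, top_k=5):
    """Frequency counting: count single whitespace-separated tokens."""
    counts = Counter(tok for line in lines for tok in line.split())
    return counts.most_common(top_k)

def frequent_sequences(lines, length=2, top_k=5):
    """Sequence mining (simplified): count contiguous token sequences of a fixed length."""
    counts = Counter()
    for line in lines:
        tokens = line.split()
        for start in range(len(tokens) - length + 1):
            counts[tuple(tokens[start:start + length])] += 1
    return counts.most_common(top_k)

# Made-up flat-file records; "ID" and "DE" play the role of delimiters.
records = [
    "ID P12345",
    "DE Some protein",
    "ID Q67890",
    "DE Another protein",
]

top_tokens = frequent_tokens(records, top_k=3)
top_pairs = frequent_sequences(records, length=2, top_k=10)
```

On these made-up records the content word `protein` is exactly as frequent as the field tags `ID` and `DE`, which hints at why counting alone tends to over-generate delimiter candidates.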
Apr 19, 2023 10
Assumptions
Biological datasets are written for humans to read
It is very unlikely that delimiters will be scattered all around, in different places in a line
The position of possible delimiters might provide useful information
A combination of positional and frequency information might be a better choice
Apr 19, 2023 11
Positional Weight
Let $P$ be the set of positions in a line where a token can appear
For each position $i \in P$, $tot\_seq_i^j$ is the total number of token sequences of length $j$ starting at position $i$
For each position $i \in P$, $tot\_unique\_seq_i^j$ is the total number of unique token sequences of length $j$ starting at position $i$
For any tuple $(i,j)$, $p\_ratio(i,j)$ is defined as
$$p\_ratio(i,j) = \frac{tot\_seq_i^j}{tot\_unique\_seq_i^j}$$
$p\_ratio(i,j)$ can be log-normalized to get the positional weight $p\_wt(i,j)$, with the property $p\_wt(i,j) \in (0,1)$
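A minimal Python sketch of the positional ratio follows. The slides do not give the exact log-normalization formula, so the one used in `positional_weights` is an assumption chosen only because it keeps the result strictly inside (0,1); the function names and sample lines are likewise illustrative.

```python
import math
from collections import defaultdict

def positional_ratios(lines, length=1):
    """Return {position i: tot_seq / tot_unique_seq} for token sequences of `length`."""
    total = defaultdict(int)   # tot_seq for each start position
    unique = defaultdict(set)  # distinct sequences seen at each start position
    for line in lines:
        tokens = line.split()
        for i in range(len(tokens) - length + 1):
            total[i] += 1
            unique[i].add(tuple(tokens[i:i + length]))
    return {i: total[i] / len(unique[i]) for i in total}

def positional_weights(ratios):
    """ASSUMPTION: log(1 + r) / log(2 + max r); the source does not specify the formula."""
    denom = math.log(2 + max(ratios.values()))
    return {i: math.log(1 + r) / denom for i, r in ratios.items()}

# Made-up records: position 0 always holds a field tag, later positions hold content.
lines = ["ID A1", "DE foo bar", "ID B2", "DE baz"]
ratios = positional_ratios(lines)
weights = positional_weights(ratios)
```

Position 0 sees only two distinct tokens across four lines, so its ratio (and weight) is higher than that of the content positions, where every token is different.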
Apr 19, 2023 12
Delimiter score (d_score)
The frequency weight of any token sequence $s_i^j$ of length $j$ starting at position $i$, $f\_wt(s_i^j)$, is obtained by log-normalizing its frequency $f(s_i^j)$
By construction, $f\_wt(s_i^j) \in (0,1)$
Positional and frequency weights can now be combined to get d_score as follows:
$$d\_score(s_i^j) = \alpha \cdot p\_wt(i,j) + (1-\alpha) \cdot f\_wt(s_i^j), \quad \alpha \in (0,1)$$
Thus d_score has the following two properties:
  $d\_score(s_i^j) \in (0,1)$
  $d\_score(s_i^j) > d\_score(s_k^j)$ implies that $s_i^j$ is more likely to be a delimiter than $s_k^j$
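The combination of the two weights can be sketched as below. The `log_normalize` helper is an assumed normalization (the slides only say "log normalizing"), and the candidate tokens with their ratios and frequencies are made-up values for illustration.

```python
import math

def log_normalize(value, max_value):
    """ASSUMPTION: log(1 + v) / log(2 + max v), which stays strictly inside (0, 1)."""
    return math.log(1 + value) / math.log(2 + max_value)

def d_score(p_wt, f_wt, alpha=0.5):
    """Convex combination of positional weight and frequency weight."""
    if not 0 < alpha < 1:
        raise ValueError("alpha must lie strictly between 0 and 1")
    return alpha * p_wt + (1 - alpha) * f_wt

# Hypothetical candidates: token -> (positional ratio, frequency)
candidates = {"ID": (2.0, 2), "protein": (1.0, 2), "foo": (1.0, 1)}
max_ratio = max(r for r, _ in candidates.values())
max_freq = max(f for _, f in candidates.values())

scores = {
    tok: d_score(log_normalize(r, max_ratio), log_normalize(f, max_freq), alpha=0.5)
    for tok, (r, f) in candidates.items()
}
ranked = sorted(scores, key=scores.get, reverse=True)
```

Here `protein` is as frequent as `ID`, but its lower positional weight pushes it below `ID` in the ranking, which is the effect the combined score is meant to achieve.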
Apr 19, 2023 13
Generating layout descriptor
Once the delimiters are identified, an NFA can be built by scanning the whole database, where the delimiters are the states of the NFA
This NFA can be used to generate a layout descriptor, since it naturally represents optional and repeating states
The following figure shows an NFA in which A, B, C, D, and E are delimiters, B is an optional delimiter, and C D is a repeating delimiter sequence
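As a rough illustration of this step, the sketch below records which delimiter can follow which, and flags optional and repeating delimiters. It is a simplified transition graph rather than a full NFA construction, and the function names and the two sample records are assumptions that mirror the A to E example.

```python
from collections import defaultdict

def build_transitions(records):
    """Map each delimiter to the set of delimiters observed immediately after it."""
    transitions = defaultdict(set)
    for seq in records:
        for current, following in zip(seq, seq[1:]):
            transitions[current].add(following)
    return dict(transitions)

def optional_delimiters(records):
    """Delimiters that are missing from at least one record."""
    all_delims = set().union(*map(set, records))
    return {d for d in all_delims if any(d not in seq for seq in records)}

def repeating_delimiters(records):
    """Delimiters that occur more than once within some record."""
    return {d for seq in records for d in seq if seq.count(d) > 1}

# Each record is reduced to the ordered list of delimiters found in it (made-up data).
records = [
    ["A", "B", "C", "D", "E"],
    ["A", "C", "D", "C", "D", "E"],
]

transitions = build_transitions(records)
optional = optional_delimiters(records)
repeating = repeating_delimiters(records)
```

The edge from D back to C is what encodes the repeating C D block, and the two outgoing edges of A (to B and to C) encode that B can be skipped.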
Apr 19, 2023 14
Results
By suitably varying α, a tight superset of the possible delimiters is found
A domain expert can then help identify the true delimiters
Results from 3 different flat-file datasets are as follows
Apr 19, 2023 15
Comparison with naïve approaches
The d_score-based approach does a better job than the naïve approaches
The following table shows the improvement
Apr 19, 2023 16
Realistic Situation
The task of identifying a complete list of correct delimiters is difficult
Most likely we will end up with an incomplete list of delimiters
The delimiters that do not appear in every data record (optional delimiters) are the ones most likely to be missed
Apr 19, 2023 17
Identifying Optional Delimiters
Given an incomplete list of delimiters, how can we identify optional delimiters, if any?
  Build an NFA based on the given incomplete information
  Perform clustering to identify possible crucial delimiters
  Perform contrast analysis
Apr 19, 2023 18
Crucial delimiter
A delimiter is considered crucial if missing delimiters can appear immediately after it
The goal is to create two clusters
  One containing the delimiters that are not crucial
  The other containing the crucial delimiters
Apr 19, 2023 19
Identifying crucial delimiters: A few definitions
Succ(X): set of delimiters that can immediately follow X
Dist_App: number of groups of occurrences of X, grouped by the number of text lines between X and the immediately following delimiter
Info_Tuple $(n_X^i, f_X^i, t_X^i)$: information for each Dist_App group (number of lines, frequency of the group, and the corresponding text)
Info_Tuple_List $L_X$: for any X, the list of all possible Info_Tuples
Apr 19, 2023 20
Metric for clustering
$$r_X^f = \frac{f_X^{max}}{f_X^{total}}$$
$r_X^f$ is likely to be low if an optional delimiter appears immediately after X, and high otherwise
Choose a suitable cut-off value $r_c$ and assign delimiters to groups as follows:
  If $r_X^f < r_c$, assign X to the group of possible crucial delimiters
  Else assign X to the group of non-crucial delimiters
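The clustering rule can be sketched as follows, assuming that the total frequency is the sum of the frequencies over all Info_Tuples of X (this matches the denominator used in the later worked example). The function names, the second delimiter `Y`, and the cut-off value are hypothetical.

```python
def frequency_ratio(info_tuples):
    """r = largest group frequency / total frequency, for tuples of the form (n, f, t)."""
    freqs = [f for _, f, _ in info_tuples]
    return max(freqs) / sum(freqs)

def split_crucial(info_lists, cutoff):
    """Assign each delimiter to the crucial or non-crucial group using the cut-off."""
    crucial, non_crucial = set(), set()
    for delim, tuples in info_lists.items():
        if frequency_ratio(tuples) < cutoff:
            crucial.add(delim)
        else:
            non_crucial.add(delim)
    return crucial, non_crucial

# X has two distinct gaps to the next delimiter; Y always has the same gap (made-up data).
info_lists = {
    "X": [(50, 20, "l1.txt"), (20, 12, "l2.txt")],
    "Y": [(3, 32, "y1.txt")],
}

crucial, non_crucial = split_crucial(info_lists, cutoff=0.9)
```

A delimiter that is always followed by the same number of lines has a ratio of 1 and is pruned, while a delimiter whose occurrences split into several groups has a lower ratio and is kept for contrast analysis.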
Apr 19, 2023 21
Observations and Facts
Missing optional delimiters can appear immediately after crucial delimiters ONLY
  Non-crucial delimiters can be pruned away
Consider two Info_Tuples $(n_X^1, f_X^1, t_X^1)$ and $(n_X^2, f_X^2, t_X^2)$ in $L_X$
If a missing delimiter appears immediately after the occurrences corresponding to the first tuple but not the second:
  $n_X^1 > n_X^2$
  The missing delimiter will appear in $t_X^1$ but not in $t_X^2$
Apr 19, 2023 22
A hypothetical example illustrating Contrast Analysis
Suppose X is a crucial delimiter with 2 Info_Tuples, L1 and L2, as follows:
  L1 = (50, 20, l1.txt)
  L2 = (20, 12, l2.txt)
Sequence mining on l1.txt and l2.txt yields two sets of frequently occurring sequences, S1 and S2, as follows:
  S1 = {f1, f5, f6, f8, f13, f21}
  S2 = {f1, f4, f6, f7, f8, f10, f13, f21}
Since $f_5 \in S_1$ but $f_5 \notin S_2$, f5 is a possible missing delimiter
f5 is a missing delimiter only if it has a high d_score or is verified by a domain expert as a valid delimiter
Apr 19, 2023 23
Contrast Analysis
For any i, j, if $n_X^i > n_X^j$, look for frequently occurring sequences in $t_X^i$ and $t_X^j$; call them $fs_X^i$ and $fs_X^j$ respectively
If there exists a frequent sequence fs such that $fs \in fs_X^i$ but $fs \notin fs_X^j$, then fs is quite likely to be a delimiter
If fs has a fairly high d_score, or is identified by a domain expert as a valid delimiter, add it to the incomplete list as a newly found delimiter
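At its core, contrast analysis is a set difference followed by a confirmation step. The sketch below replays the hypothetical S1/S2 example; the function names, the d_score value for f5, and the threshold are assumptions for illustration only.

```python
def contrast(frequent_longer, frequent_shorter):
    """Sequences frequent in the longer-gap text but absent from the shorter-gap text."""
    return set(frequent_longer) - set(frequent_shorter)

def confirm(candidates, d_scores, threshold, expert_approved=frozenset()):
    """Keep candidates with a high enough d_score or an explicit expert approval."""
    return {
        c for c in candidates
        if d_scores.get(c, 0.0) >= threshold or c in expert_approved
    }

S1 = {"f1", "f5", "f6", "f8", "f13", "f21"}
S2 = {"f1", "f4", "f6", "f7", "f8", "f10", "f13", "f21"}

candidates = contrast(S1, S2)
# Hypothetical d_score for f5; in practice it would come from the d_score computation.
accepted = confirm(candidates, d_scores={"f5": 0.8}, threshold=0.7)
rejected_case = confirm(candidates, d_scores={"f5": 0.8}, threshold=0.9)
approved_by_expert = confirm(candidates, {"f5": 0.8}, 0.9, expert_approved={"f5"})
```

Note that f4, f7 and f10 are ignored: they appear only in the shorter-gap text, so they cannot explain the extra lines after X.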
Apr 19, 2023 24
Generalized Contrast Analysis
In case of more than two Info_Tuples, identify the mean of all $n_X^i$ values:
$$n_X^{mean} = \frac{\sum_{i=1}^{l} n_X^i \, f_X^i}{f_X^{total}}$$
Form a group by appending text from all Info_Tuples where $n_X^i > n_X^{mean}$
Form another group by appending text from all Info_Tuples where $n_X^j < n_X^{mean}$
Perform contrast analysis among all such possible groups
Apr 19, 2023 25
Another example illustrating Generalized Contrast Analysis
Suppose X is a crucial delimiter with 3 Info_Tuples, L1, L2, L3, as follows:
  L1 = (50, 20, l1.txt)
  L2 = (20, 12, l2.txt)
  L3 = (15, 10, l3.txt)
Mean number of lines:
$$n_X^{mean} = \frac{(50 \times 20) + (20 \times 12) + (15 \times 10)}{20 + 12 + 10} \approx 33.09$$
Append l2.txt and l3.txt; call the result t2.txt
Sequence mining on l1.txt and t2.txt yields two sets of frequently occurring sequences, S1 and S2, as follows:
  S1 = {f1, f5, f6, f8, f13, f21}
  S2 = {f1, f4, f6, f7, f8, f10, f13, f21}
Since $f_5 \in S_1$ but $f_5 \notin S_2$, f5 is a possible missing delimiter
f5 is a missing delimiter only if it has a high d_score or is verified by a domain expert as a valid delimiter
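The grouping step of the worked example can be checked with a few lines of Python. The function names are illustrative, and sending a tuple whose line count equals the mean to the lower group is an assumption, since the example does not cover that case.

```python
def weighted_mean_lines(info_tuples):
    """Frequency-weighted mean of the line counts n over tuples (n, f, t)."""
    total_freq = sum(f for _, f, _ in info_tuples)
    return sum(n * f for n, f, _ in info_tuples) / total_freq

def split_by_mean(info_tuples):
    """Return (mean, texts above the mean, texts at or below the mean)."""
    mean = weighted_mean_lines(info_tuples)
    above = [t for n, _, t in info_tuples if n > mean]
    # ASSUMPTION: ties with the mean go to the lower group.
    below = [t for n, _, t in info_tuples if n <= mean]
    return mean, above, below

info_tuples = [(50, 20, "l1.txt"), (20, 12, "l2.txt"), (15, 10, "l3.txt")]
mean, above, below = split_by_mean(info_tuples)
```

With these values the mean is 1390 / 42, so l1.txt forms one group on its own and l2.txt and l3.txt are appended into the other, as in the example.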
Apr 19, 2023 26
Overall Algorithms
Apr 19, 2023 27
Results: Optional delimiters
% Pruning=
Apr 19, 2023 28
Results: Non-optional Missing delimiters
Even though it was designed for finding optional delimiters, our algorithm also works, in some cases, for missing non-optional delimiters
If a missing non-optional delimiter appears in exactly the same location in each record, our algorithm fails
If a non-optional delimiter has a backward edge coming from a delimiter that appears later in a topologically sorted NFA, our algorithm works
Apr 19, 2023 30
Hypergraph Mining
Basic Motivation
  To find useful “transitive relations” (hypergraphs) among genes
Example (Gene-Disease Relationship)
  Gene A is related to gene B
  Gene B is related to gene C
  Is gene A related to gene C?
Gene Source
  Microarray experiments
Information Source
  Online literature abstracts
Apr 19, 2023 31
Formal Problem Definition
Given
  A dictionary KT of keywords
  A dictionary KM of user-provided keywords (KT ⊃ KM)
  A collection of literature abstracts, each represented as a set of keywords
Task
  To find hyperedges exceeding a user-defined threshold, each of which involves a set of keywords from KM that are potentially connected by another set of linking words from KT − KM
Apr 19, 2023 32
Modeling
Purpose
  To use an approach similar to frequent itemset mining
Define
  total weight = support + cross support
  Support: the set of keywords appears together in one document
  Cross support: the set of keywords can be partitioned so that each partition appears in a different document
Issues
  Since the downclosure property does not hold for total weight, a modified downclosure property can be defined
Apr 19, 2023 33
Idea
Support satisfies the downclosure property
  Let X be a set and Ω its power set. A function f : Ω → R+ satisfies the downclosure property if for all A, B ∈ Ω with A ⊃ B, f(B) ≥ f(A)
Cross support can be designed to be restricted below a particular value, i.e., it is bounded
Form a function h as the sum of two functions, h = f + g
  f satisfies the downclosure property
  g is bounded
h satisfies a modified downclosure property
  For any θ ≥ 0, if $h(K_n) \ge \theta$ then $f(K_{n-1}) \ge \max\{0, \theta - \sup(g)\}$
This property can be used to devise an efficient algorithm
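A short derivation shows why the modified property holds. It assumes that $K_{n-1}$ denotes any $(n-1)$-element subset of $K_n$ and that $f$ is non-negative, which is how the slide's notation is read here.

```latex
\begin{align*}
h(K_n) \ge \theta
  &\;\Rightarrow\; f(K_n) = h(K_n) - g(K_n) \ge \theta - \sup(g) \\
  &\;\Rightarrow\; f(K_n) \ge \max\{0,\ \theta - \sup(g)\}
     && \text{since } f \ge 0 \\
  &\;\Rightarrow\; f(K_{n-1}) \ge f(K_n) \ge \max\{0,\ \theta - \sup(g)\}
     && \text{by downclosure of } f
\end{align*}
```

Read contrapositively, a candidate keyword set can be pruned as soon as one of its $(n-1)$-element subsets has support below $\max\{0, \theta - \sup(g)\}$, in the same spirit as pruning in frequent itemset mining.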
Apr 19, 2023 34
Results
Apr 19, 2023 35
Similarity Measure among sets of genes
Each file containing gene names can be considered as a Discrete Random Variable (DRV)
Each such DRV can take several values (gene names)
For two such files X,Y and for any pair (x,y), x∈X and y∈Y, p(x,y) can be computed from online abstracts based on co-occurrence
Now defining Z = g(X, Y), Z is a random variable
  The expectation of Z can be used as a similarity measure
  Different choices of g give rise to different similarity measures
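One possible reading of this construction is sketched below. It assumes that p(x, y) is the co-occurrence count of the pair normalized over all pairs, which the slides do not spell out; the function names, the toy abstracts, the placeholder gene names, and the particular g are all hypothetical.

```python
from itertools import product

def joint_probabilities(genes_x, genes_y, abstracts):
    """ASSUMPTION: p(x, y) = (# abstracts mentioning both x and y) / (sum over all pairs)."""
    counts = {
        (x, y): sum(1 for a in abstracts if x in a and y in a)
        for x, y in product(genes_x, genes_y)
    }
    total = sum(counts.values())
    if total == 0:
        return {pair: 0.0 for pair in counts}
    return {pair: c / total for pair, c in counts.items()}

def expected_similarity(genes_x, genes_y, abstracts, g):
    """E[Z] for Z = g(X, Y) under the joint distribution above."""
    p = joint_probabilities(genes_x, genes_y, abstracts)
    return sum(prob * g(x, y) for (x, y), prob in p.items())

# Toy data: each abstract is the set of gene names it mentions.
abstracts = [{"A", "B"}, {"A", "C"}, {"B", "D"}, {"A", "B", "D"}]
file_x = ["A", "B"]
file_y = ["B", "D"]

p = joint_probabilities(file_x, file_y, abstracts)
# Hypothetical g: count only pairs of distinct genes.
similarity = expected_similarity(file_x, file_y, abstracts, lambda x, y: float(x != y))
```

Swapping in a different `g` changes the measure without touching the probability estimate, which is the flexibility the slide points to.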
Apr 19, 2023 37
Query Planning for Deep Web Mining
A huge amount of online biological information is available in the form of the deep web
  An online query form needs to be filled out
The required information is obtained by filling out many such forms from different websites
There might be some dependencies among these forms
Requires redundancy elimination
Apr 19, 2023 39
Semi-supervised Ranking
Ranking
  Given a training set of examples with labels / pairwise relationships
  The task is to rank an unseen test set, i.e., to get a permutation such that relevant examples are ranked higher than irrelevant ones
  This corresponds to learning a ranking function
Semi-supervised Ranking
  Incorporating unlabeled examples to learn the ranking function
  Out-of-sample extension
Apr 19, 2023 40
Potential Application
Following a microarray experiment, it might be possible to guess whether gene A is more important than gene B among the genes involved in the experiment
However, determining all possible order relationships is time consuming and error prone
Thus, from a small set of order relationships, and using the other genes from the experiment as unlabeled data, a semi-supervised ranking function can be learned
Apr 19, 2023 42
Multiple Instance Learning
Instead of instance-label pairs (x, y), bag-label pairs (B, y) are provided as training data
  A bag contains multiple instances
  A bag label is negative if every instance in the bag has a negative label
  A bag label is positive if there exists at least one instance with a positive label
Given an unseen bag, the task is to predict its label
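The bag-labeling rule above can be written down directly. The second function is one common way to turn an instance-level scorer into a bag-level prediction (take the maximum instance score); it is shown as an assumption and not as the method proposed in the thesis, and all names and values are illustrative.

```python
def bag_label(instance_labels):
    """A bag is positive iff at least one of its instances is positive."""
    return any(instance_labels)

def predict_bag(bag, instance_scorer, threshold=0.5):
    """ASSUMPTION: score a bag by its highest-scoring instance, then threshold."""
    best = max((instance_scorer(x) for x in bag), default=float("-inf"))
    return best >= threshold

negative_bag = [False, False, False]
positive_bag = [False, True, False]

# Toy scorer: the instance itself is already a score in [0, 1].
identity_scorer = lambda x: x
pred_high = predict_bag([0.1, 0.9], identity_scorer)
pred_low = predict_bag([0.1, 0.2], identity_scorer)
```

Taking the maximum mirrors the labeling rule: a single strongly positive instance is enough to make the whole bag positive.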
Apr 19, 2023 43
Potential Application
Following a microarray experiment, it might be possible to form bags of genes with appropriate labels
From different biological labs doing similar experiments, many such bags can be obtained to use as training data
Before designing a new microarray experiment, a gene set can be selected based on multiple-instance learning
Apr 19, 2023 44
Summary
Use of data mining / machine learning techniques to extract information from biological data
Work done
  Learning layouts of flat-file biological datasets
  Hypergraph mining
  Similarity measure among sets of genes
Proposed work
  Study and application of machine learning techniques