Overview of Research at Bioinformatics Lab
Li Liao
Develop new algorithms and (statistical) learning methods that help solve biological problems
> Capable of incorporating domain knowledge
> Effective, Expressive, Interpretable
Li Liao, SIG NewGrad, 09/28/2009
Motivations
• Understanding correlations between genotype and phenotype
• Predicting genotype <=> phenotype
• Some Phenotype examples:
– Protein function
– Drug/therapy response
– Drug-drug interactions for expression
– Drug mechanism
– Interacting pathways of metabolism
Bioinformatics in a … cell
Credit: Kellis & Indyk
Projects
– Genome sequencing and assembly (funded by NSF)
– Homology detection, protein family classification (funded by a DuPont S&E award)
    Support Vector Machines
    Hidden Markov models
    Graph theoretic methods
– Probabilistic modeling for BioSequence (funded by NIH)
    HMMs, and beyond
    Motif finding
    Secondary structure
– Systems Bioinformatics
    Prediction of Protein-Protein Interactions
    Inference of Gene Regulatory Networks
    Prediction of other regulatory elements
    Pattern analysis for RNAi (funded by UDRF)
– Comparative Genomics
    Identify genome features for diagnostic and therapeutic purposes (funded by an Army grant)
People
Current members:
- Roger Craig (PhD student)
- Alvaro Gonzalez (PhD student)
- Kevin McCormick (PhD student)
- Colin Kern (PhD student)
Past members:
- Dr. Wen-Zhong Wang (Postdoc Fellow)
- Robel Kahsay (Ph.D. currently at DuPont Central Research & Development)
- Kishore Narra (M.S. currently at VistaPrint, Inc.)
- Arpita Gandhi (M.S. currently at Colgate-Palmolive Company)
- Gaurav Jain (M.S. currently at Institute of Genomics, Univ. of Maryland)
- Shivakundan Singh Tej (M.S.)
- Tapan Patel (B.S. currently in MD/PhD program at U Penn)
- Laura Shankman (B.S., currently in PhD program at U Virginia)
Hybrid Hierarchical Assembly
• Three types of reads: Sanger (~1000bp), 454 (~100bp), and SBS (~30bp).
• Assembly of individual types using the best-suited assemblers.
– Phrap, TIGR, etc. for Sanger reads
– Euler assembler and Newbler for 454 reads
– Euler short, Shorty for SBS reads
• Hybrid and hierarchical
– Use longer reads as scaffolding to resolve repeat regions that are difficult for shorter reads
– Use contigs from shorter reads (pyrosequencing) as pseudoreads to bridge gaps (nonclonable and hard stops) with Sanger reads.
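The two-stage idea can be illustrated with a toy sketch: short reads are first assembled into contigs, and those contigs are then fed, as pseudoreads, into a second assembly together with the longer reads. The greedy suffix-prefix merger below is only a stand-in for real assemblers such as Phrap or Newbler, and the function names, the `min_len` threshold, and the example sequences are all assumptions made for illustration.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the largest overlap."""
    seqs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(seqs):
            for j, b in enumerate(seqs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return seqs  # no remaining overlaps: these are the contigs
        merged = seqs[best_i] + seqs[best_j][best_k:]
        seqs = [s for idx, s in enumerate(seqs) if idx not in (best_i, best_j)]
        seqs.append(merged)


def hierarchical_assemble(short_reads, long_reads, min_len=3):
    """Stage 1: short reads -> contigs. Stage 2: contigs act as pseudoreads."""
    contigs = greedy_assemble(short_reads, min_len)
    return greedy_assemble(contigs + list(long_reads), min_len)


# Hypothetical example: the short reads leave a gap that one long read bridges.
genome = "ACGTTGCAAGGCTTAC"
short_reads = ["ACGTTG", "TTGCAA", "GCTTAC"]
long_reads = ["GCAAGGCT"]
print(greedy_assemble(short_reads))                    # two separate contigs
print(hierarchical_assemble(short_reads, long_reads))  # one contig
```

In this example the short reads alone yield two contigs, and adding the long read in the second stage joins them into the full sequence.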
Major Findings
• Hybrid hierarchical assembly proved to be an effective way to assemble short reads
• An incremental approach to selecting ABI reads is more effective than a random approach in generating high-coverage contigs
• Staged assembly using Phrap is an effective alternative to the proprietary Newbler assembler.
Publications: Gonzalez & Liao, BMC Bioinformatics 2008, 9:102.
Blue lines are contigs generated from hybrid assembly
Detect remote homologues
Attributes:
- Sequence similarity, aggregate statistics (e.g., protein families), patterns/motifs, and further attributes (e.g., presence in a phylogenetic tree).
How to incorporate domain specific knowledge into the model so a classifier can be more accurate?
Results:
- Quasi-consensus based comparison of profile HMM for protein sequences (Kahsay et al, Bioinformatics 2005)
- Using extended phylogenetic profiles and support vector machines for protein family classification (Narra & Liao, SNPD04; Craig & Liao, ICMLA’05; Craig & Liao, SAC’06; Craig & Liao, Int’l J. Bioinfo & DM 2007)
- Combining Pairwise Sequence Similarity and Support Vector Machines for Detecting Remote Protein Evolutionary and Structural Relationships (JCB 2003)
Non-linear mapping to a feature space
Φ: x_i ↦ Φ(x_i), x_j ↦ Φ(x_j)
$$L(\alpha) = \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j \, y_i y_j \, \Phi(x_i) \cdot \Phi(x_j)$$
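As a small illustration of the SVM dual objective on this slide, the sketch below evaluates L(α) for given multipliers, with the inner product Φ(x_i)·Φ(x_j) supplied by a kernel function. It only evaluates the objective and does not solve the optimization; the function names, the RBF `gamma` value, and the toy data are assumptions made for the example.

```python
import math


def linear_kernel(u, v):
    """Phi is the identity map, so the kernel is the plain dot product."""
    return sum(a * b for a, b in zip(u, v))


def rbf_kernel(u, v, gamma=0.5):
    """Gaussian kernel: an implicit non-linear feature map."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))


def dual_objective(alpha, y, X, kernel):
    """L(alpha) = sum_i alpha_i - 1/2 sum_ij alpha_i alpha_j y_i y_j K(x_i, x_j)."""
    n = len(X)
    quad = sum(
        alpha[i] * alpha[j] * y[i] * y[j] * kernel(X[i], X[j])
        for i in range(n)
        for j in range(n)
    )
    return sum(alpha) - 0.5 * quad


# Toy data (hypothetical): two orthogonal points with opposite labels.
X = [(1.0, 0.0), (0.0, 1.0)]
y = [1, -1]
alpha = [1.0, 1.0]
print(dual_objective(alpha, y, X, linear_kernel))
print(dual_objective(alpha, y, X, rbf_kernel))
```

Swapping the kernel changes the feature space without touching the objective, which is what lets domain-specific similarity measures be plugged into the classifier.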
[Figure: example phylogenetic profiles x, y, z compared under Hamming distance and tree-based distance. Rows as extracted: "1 1 1 1 0", "1 1 1 1 1 = 3 0.5", "0 1 1 1 1 = 3 0.1".]
Data: phylogenetic profiles
- How to account for correlations among profile components?
profile extension (Narra & Liao, SNPD 04)
Transductive learning (Craig & Liao, ICMLA’05, SAC’06, IJBDM, 2007)
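To make the question about correlations concrete, the sketch below computes the Hamming distance between binary phylogenetic profiles. The profiles are hypothetical, not taken from the slides. Hamming distance treats every genome position as independent, so two pairs of profiles can be equally distant even if their mismatches fall in very different parts of the species tree.

```python
def hamming(p, q):
    """Number of genomes in which exactly one of the two proteins is present."""
    if len(p) != len(q):
        raise ValueError("profiles must cover the same genomes")
    return sum(a != b for a, b in zip(p, q))


# Hypothetical profiles: 1 = homolog present in that genome, 0 = absent.
a = [1, 0, 1, 1, 0, 0]
b = [1, 1, 1, 0, 0, 0]
c = [0, 0, 1, 1, 0, 1]

print(hamming(a, b))  # mismatches at positions 1 and 3
print(hamming(a, c))  # mismatches at positions 0 and 5
```

Both pairs are at distance 2, so Hamming distance alone cannot tell whether the differing genomes are close relatives or distant ones.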
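[Figure: phylogenetic tree over the profile 1 1 0 1 0 0 0 1 1, with internal-node scores 1, 0.33, 0.67, 0.34, 0.5, 0.75, 0.55]
Profile extension collected by post-order traversal: 1 0.33 0.67 0.34 0.5 0.75 0.55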
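A minimal sketch of profile extension by post-order traversal follows. It assumes that each internal node is scored as the mean of its children's scores and that the tree is given as nested tuples whose leaves are indices into the profile. Both the scoring rule and the data layout are assumptions made for illustration, and the example tree is hypothetical rather than the one on the slide.

```python
def extend_profile(tree, profile):
    """Return the profile followed by internal-node scores in post-order."""
    internal_scores = []

    def score(node):
        if isinstance(node, int):  # leaf: index into the binary profile
            return float(profile[node])
        child_scores = [score(child) for child in node]
        s = sum(child_scores) / len(child_scores)  # assumed rule: mean of children
        internal_scores.append(s)  # recorded after the children: post-order
        return s

    score(tree)
    return list(profile) + internal_scores


# Hypothetical four-genome tree: ((g0, g1), (g2, g3))
tree = ((0, 1), (2, 3))
profile = [1, 1, 0, 1]
print(extend_profile(tree, profile))
```

The extended vector carries one extra component per internal node, so profiles whose mismatches sit within one clade differ less in the extended part than profiles whose mismatches are spread across the tree.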
Sequence Models (HMMs and beyond)
Motivations: What is responsible for the function?
– Patterns/motifs
– Secondary structure
To capture long range correlations of bio sequences
– Transporter proteins
– RNA secondary structure
Methods: generative versus discriminative
– Linear dependent processes
– Stochastic grammars
– Model equivalence
TMMOD: An improved hidden Markov model for predicting
transmembrane topology
(Kahsay, Gao & Liao. Bioinformatics 2005)
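For readers unfamiliar with HMM decoding, the sketch below runs the Viterbi algorithm on a generic two-state model (M = membrane, L = loop) over a two-letter alphabet (h = hydrophobic, p = polar). This is not the TMMOD architecture; the states, alphabet, and all probabilities are invented for illustration.

```python
import math


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most probable state path for `obs`, computed in log space."""
    scores = {s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}
    backpointers = []
    for symbol in obs[1:]:
        new_scores, pointers = {}, {}
        for s in states:
            # Best predecessor of state s at this position.
            prev = max(states, key=lambda p: scores[p] + math.log(trans_p[p][s]))
            new_scores[s] = (
                scores[prev] + math.log(trans_p[prev][s]) + math.log(emit_p[s][symbol])
            )
            pointers[s] = prev
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: scores[s])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return "".join(reversed(path))


# Hypothetical parameters: sticky states, hydrophobic residues favor the membrane.
states = ["M", "L"]
start_p = {"M": 0.5, "L": 0.5}
trans_p = {"M": {"M": 0.9, "L": 0.1}, "L": {"M": 0.1, "L": 0.9}}
emit_p = {"M": {"h": 0.8, "p": 0.2}, "L": {"h": 0.2, "p": 0.8}}

print(viterbi("pppphhhhhhpppp", states, start_p, trans_p, emit_p))
```

The decoded path labels the hydrophobic stretch as membrane and the flanking residues as loop, which is the kind of topology call the comparison on the following slide evaluates.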
Mod.     Reg.  Data set  Correct topology  Correct location  Sensitivity  Specificity
TMMOD 1  (a)   S-83      65 (78.3%)        67 (80.7%)        97.4%        97.4%
TMMOD 1  (b)   S-83      51 (61.4%)        52 (62.7%)        71.3%        71.3%
TMMOD 1  (c)   S-83      64 (77.1%)        65 (78.3%)        97.1%        97.1%
TMMOD 2  (a)   S-83      61 (73.5%)        65 (78.3%)        99.4%        97.4%
TMMOD 2  (b)   S-83      54 (65.1%)        61 (73.5%)        93.8%        71.3%
TMMOD 2  (c)   S-83      54 (65.1%)        66 (79.5%)        99.7%        97.1%
TMMOD 3  (a)   S-83      70 (84.3%)        71 (85.5%)        98.2%        97.4%
TMMOD 3  (b)   S-83      64 (77.1%)        65 (78.3%)        95.3%        71.3%
TMMOD 3  (c)   S-83      74 (89.2%)        74 (89.2%)        99.1%        97.1%
TMHMM    -     S-83      64 (77.1%)        69 (83.1%)        96.2%        96.2%
PHDtm    -     S-83      (85.5%)           (88.0%)           98.8%        95.2%
TMMOD 1  (a)   S-160     117 (73.1%)       128 (80.0%)       97.4%        97.0%
TMMOD 1  (b)   S-160     92 (57.5%)        103 (64.4%)       77.4%        80.8%
TMMOD 1  (c)   S-160     117 (73.1%)       126 (78.8%)       96.1%        96.7%
TMMOD 2  (a)   S-160     120 (75.0%)       132 (82.5%)       98.4%        97.2%
TMMOD 2  (b)   S-160     97 (60.6%)        121 (75.6%)       97.7%        95.6%
TMMOD 2  (c)   S-160     118 (73.8%)       135 (84.4%)       98.4%        97.2%
TMMOD 3  (a)   S-160     120 (75.0%)       133 (83.1%)       97.8%        97.6%
TMMOD 3  (b)   S-160     110 (68.8%)       124 (77.5%)       94.5%        98.1%
TMMOD 3  (c)   S-160     135 (84.4%)       143 (89.4%)       98.3%        98.1%
TMHMM    -     S-160     123 (76.9%)       134 (83.8%)       97.1%        97.7%
Inferring Regulatory Networks from Time Course Expression Data (Gandhi, Cogburn & Liao, 2008)
[Figure labels: expression profile clustering (K-means), binary heat map, Boolean network algorithm]
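The binarize-then-infer steps named in the figure can be sketched as follows. This is a deliberately simplified stand-in, not the algorithm from the cited work: each gene's time course is thresholded at its own mean, and the search only considers single-input rules in which a regulator's state at time t either matches (activates) or inverts (represses) the target's state at time t+1. The clustering step is omitted, and the gene names and expression values are hypothetical.

```python
def binarize(series):
    """Threshold a time course at its own mean: 1 = on, 0 = off."""
    mean = sum(series) / len(series)
    return [1 if v > mean else 0 for v in series]


def infer_single_input_rules(expr):
    """For each target gene, list regulators consistent with a one-step rule.

    expr maps a gene name to its expression values over consecutive time points.
    """
    states = {gene: binarize(values) for gene, values in expr.items()}
    rules = {}
    for target, t_states in states.items():
        steps = range(len(t_states) - 1)
        found = []
        for regulator, r_states in states.items():
            if all(r_states[t] == t_states[t + 1] for t in steps):
                found.append((regulator, "activates"))
            elif all(r_states[t] != t_states[t + 1] for t in steps):
                found.append((regulator, "represses"))
        rules[target] = found
    return rules


# Hypothetical time courses: B follows A with a one-step delay, C mirrors it.
expr = {
    "A": [0.1, 0.9, 0.8, 0.2, 0.1],
    "B": [0.2, 0.1, 1.0, 0.9, 0.3],
    "C": [0.9, 0.8, 0.1, 0.2, 0.9],
}
rules = infer_single_input_rules(expr)
print(rules["B"])
print(rules["C"])
```

With so few time points, several rules can be consistent with the data by chance, which is one reason real analyses cluster the profiles first and use longer time courses.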
GENOMIC COMPARISON OF BACTERIAL SPECIES BASED
ON METABOLIC CHARACTERISTICS (Jain, Wang, Boyd, Liao, ISIBM 2009)