
Overview of Research at Bioinformatics Lab

Li Liao

Develop new algorithms and (statistical) learning methods that help solve biological problems

> Capable of incorporating domain knowledge

> Effective, Expressive, Interpretable

Li Liao, SIG NewGrad, 09/28/2009


Motivations

• Understanding correlations between genotype and phenotype

• Predicting genotype <=> phenotype

• Some Phenotype examples:

– Protein function

– Drug/therapy response

– Drug-drug interactions for expression

– Drug mechanism

– Interacting pathways of metabolism


Bioinformatics in a … cell


Credit: Kellis & Indyk


Projects

– Genome sequencing and assembly (funded by NSF)

– Homology detection, protein family classification (funded by a DuPont S&E award)

    • Support Vector Machines

    • Hidden Markov models

    • Graph theoretic methods

– Probabilistic modeling for BioSequence (funded by NIH)

    • HMMs, and beyond

    • Motif finding

    • Secondary structure

– Systems Bioinformatics

    • Prediction of Protein-Protein Interactions

    • Inference of Gene Regulatory Networks

    • Prediction of other regulatory elements

    • Pattern analysis for RNAi (funded by UDRF)

– Comparative Genomics

    • Identify genome features for diagnostic and therapeutic purposes (funded by an Army grant)


People

Current members:

- Roger Craig (PhD student)

- Alvaro Gonzalez (PhD student)

- Kevin McCormick (PhD student)

- Colin Kern (PhD student)

Past members:

- Dr. Wen-Zhong Wang (Postdoc Fellow)

- Robel Kahsay (Ph.D., currently at DuPont Central Research & Development)

- Kishore Narra (M.S., currently at VistaPrint, Inc.)

- Arpita Gandhi (M.S., currently at Colgate-Palmolive Company)

- Gaurav Jain (M.S., currently at Institute of Genomics, Univ. of Maryland)

- Shivakundan Singh Tej (M.S.)

- Tapan Patel (B.S., currently in MD/PhD program at U Penn)

- Laura Shankman (B.S., currently in PhD program at U Virginia)


Hybrid Hierarchical Assembly

• Three types of reads: Sanger (~1000bp), 454 (~100bp), and SBS (~30bp).

• Assembly of individual types using the best suited assemblers.

– Phrap, TIGR, etc. for Sanger reads

– Euler assembler and Newbler for 454 reads

– Euler short, Shorty for SBS reads

• Hybrid and hierarchical

– Use longer reads as scaffolding to resolve repeat regions that are difficult for shorter reads

– Use contigs from shorter reads (pyrosequencing) as pseudoreads to bridge gaps (nonclonable and hard stops) with Sanger reads.
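
The pseudoread idea in the last bullet can be sketched in Python. This is an illustrative toy only, not the lab's pipeline: the length cut-offs in `bin_reads`, the helper names, and the greedy suffix-prefix merge are all assumptions made for the example, and real assemblers such as Phrap or Newbler handle errors, repeats, and quality values that are ignored here.

```python
def bin_reads(reads, short_max=50, mid_max=400):
    """Split reads into SBS-like, 454-like and Sanger-like bins by length.
    The cut-offs are hypothetical, chosen only to separate ~30, ~100 and ~1000 bp."""
    bins = {"sbs": [], "454": [], "sanger": []}
    for r in reads:
        if len(r) <= short_max:
            bins["sbs"].append(r)
        elif len(r) <= mid_max:
            bins["454"].append(r)
        else:
            bins["sanger"].append(r)
    return bins


def overlap(a, b, min_len):
    """Length of the longest suffix of a that is a prefix of b (at least min_len), else 0."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_merge(seqs, min_len=3):
    """Toy assembler: repeatedly merge the pair with the largest exact overlap."""
    seqs = list(dict.fromkeys(seqs))  # drop exact duplicates, keep order
    while True:
        best = (0, None, None)
        for i, a in enumerate(seqs):
            for j, b in enumerate(seqs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        if k == 0:
            return seqs  # no remaining overlaps: these are the contigs
        merged = seqs[i] + seqs[j][k:]
        seqs = [s for n, s in enumerate(seqs) if n not in (i, j)] + [merged]


def hybrid_assemble(short_reads, long_reads, min_len=3):
    """Stage 1: assemble short reads into contigs.
    Stage 2: feed those contigs in as pseudoreads alongside the long reads."""
    contigs = greedy_merge(short_reads, min_len)
    return greedy_merge(contigs + long_reads, min_len)


if __name__ == "__main__":
    print(hybrid_assemble(["ATGCGT", "CGTACC"], ["TACCTGAAG", "GAAGTC"]))
```

In this toy run the two short reads first collapse into one contig, which then acts as a single pseudoread when merged with the longer reads.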


Major Findings

• Hybrid hierarchical assembly proved to be an effective way to assemble short reads

• An incremental approach to selecting ABI reads is more effective than a random approach in generating high-coverage contigs

• Staged assembly using Phrap is an effective alternative to the proprietary Newbler assembler.

Publications: Gonzalez & Liao, BMC Bioinformatics 2008, 9:102.


Blue lines are contigs generated from hybrid assembly


Detect remote homologues

Attributes:

- Sequence similarity, aggregate statistics (e.g., protein families), patterns/motifs, and further attributes (e.g., presence across a phylogenetic tree).

How to incorporate domain specific knowledge into the model so a classifier can be more accurate?

Results:

- Quasi-consensus based comparison of profile HMM for protein sequences (Kahsay et al., Bioinformatics 2005)

- Using extended phylogenetic profiles and support vector machines for protein family classification (Narra & Liao, SNPD04; Craig & Liao, ICMLA’05; Craig & Liao, SAC’06; Craig & Liao, Int’l J. Bioinfo & DM 2007)

- Combining Pairwise Sequence Similarity and Support Vector Machines for Detecting Remote Protein Evolutionary and Structural Relationships (JCB 2003)


Non-linear mapping to a feature space

Φ(·): x_i ↦ Φ(x_i), x_j ↦ Φ(x_j)

$$L(\alpha) = \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j \, y_i y_j \, \Phi(x_i) \cdot \Phi(x_j)$$
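
As a minimal sketch of how this objective is evaluated, the Python below computes the standard SVM dual for a given set of multipliers. The degree-2 feature map `phi` is a hypothetical choice made for illustration (the slide does not specify Φ); it is picked because its inner product equals the polynomial kernel (x·z)², which shows why Φ never has to be computed explicitly in practice.

```python
def phi(x):
    """Hypothetical feature map for 2-D inputs: degree-2 monomials.
    Chosen so that phi(x) . phi(z) == (x . z) ** 2."""
    x1, x2 = x
    return (x1 * x1, 2 ** 0.5 * x1 * x2, x2 * x2)


def dot(u, v):
    """Plain inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def dual_objective(alpha, xs, ys, feature_map=phi):
    """L(alpha) = sum_i alpha_i - 1/2 sum_i sum_j alpha_i alpha_j y_i y_j phi(x_i).phi(x_j)."""
    feats = [feature_map(x) for x in xs]
    n = len(xs)
    linear = sum(alpha)
    quad = sum(
        alpha[i] * alpha[j] * ys[i] * ys[j] * dot(feats[i], feats[j])
        for i in range(n)
        for j in range(n)
    )
    return linear - 0.5 * quad


if __name__ == "__main__":
    xs = [(1.0, 0.0), (0.0, 1.0)]
    ys = [1, -1]
    print(dual_objective([1.0, 1.0], xs, ys))
    # Inner product in feature space vs. kernel evaluated in input space
    print(dot(phi((1, 2)), phi((3, 4))), dot((1, 2), (3, 4)) ** 2)
```

Training an SVM means maximizing this quantity over the multipliers subject to the usual constraints; the sketch only evaluates it.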


[Figure: binary phylogenetic profiles x, y, z (rows shown: 1 1 1 1 0; 1 1 1 1 1; 0 1 1 1 1), compared in two columns, Hamming distance (3 and 3) and tree-based distance (0.5 and 0.1).]

Data: phylogenetic profiles

- How to account for correlations among profile components?

– Profile extension (Narra & Liao, SNPD 04)

– Transductive learning (Craig & Liao, ICMLA’05, SAC’06, IJBDM, 2007)
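
To make the motivation concrete, here is a small Python sketch of the Hamming distance between binary phylogenetic profiles. The profiles are made-up examples, not the ones from the figure. The point is that Hamming distance counts differing positions and treats every genome as independent, so it cannot tell whether the differing genomes are close relatives or distant ones.

```python
def hamming(p, q):
    """Number of positions at which two equal-length binary profiles differ."""
    if len(p) != len(q):
        raise ValueError("profiles must have the same length")
    return sum(a != b for a, b in zip(p, q))


if __name__ == "__main__":
    # Hypothetical profiles over six genomes (1 = homolog present).
    ref = [1, 1, 1, 1, 0, 0]
    a = [1, 1, 1, 0, 1, 0]  # differs from ref at positions 3 and 4
    b = [0, 1, 1, 1, 0, 1]  # differs from ref at positions 0 and 5
    print(hamming(ref, a), hamming(ref, b))
```

Both comparisons give the same distance even though the differences fall on different genomes, which is the gap a tree-aware measure is meant to close.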


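[Figure: a phylogenetic tree with the binary profile 1 1 0 1 0 0 0 1 1 at its leaves and scores 1, 0.33, 0.67, 0.34, 0.5, 0.75, 0.55 at its internal nodes.]

Extended profile: 1 1 0 1 0 0 0 1 1 followed by 1 0.33 0.67 0.34 0.5 0.75 0.55

Post-order traversal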
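
A minimal Python sketch of extending a profile by post-order traversal follows. The scoring rule (each internal node takes the mean of its children's scores), the nested-tuple tree encoding, and the function name `extend_profile` are assumptions for illustration; the published method may score internal nodes differently.

```python
def extend_profile(tree, profile):
    """Append internal-node scores, in post-order, to a leaf profile.

    tree    : nested tuples whose leaves are integer indices into `profile`
              (a hypothetical encoding chosen for this sketch)
    profile : sequence of 0/1 values, one per genome (leaf)
    Assumed rule: an internal node's score is the mean of its children's scores.
    """
    internal = []

    def visit(node):
        if isinstance(node, int):          # leaf: look up the profile bit
            return float(profile[node])
        scores = [visit(child) for child in node]
        score = sum(scores) / len(scores)
        internal.append(score)             # recorded after the children: post-order
        return score

    visit(tree)
    return list(profile) + internal


if __name__ == "__main__":
    tree = ((0, 1), (2, 3))                # two clades of two genomes each
    print(extend_profile(tree, [1, 1, 0, 1]))
```

Because the appended scores summarize whole clades, two profiles that differ within one clade end up closer in the extended space than two that differ across clades.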


Sequence Models (HMMs and beyond)

Motivations: What is responsible for the function?

– Patterns/motifs

– Secondary structure

To capture long range correlations of bio sequences

– Transporter proteins

– RNA secondary structure

Methods: generative versus discriminative

– Linear dependent processes

– Stochastic grammars

– Model equivalence


TMMOD: An improved hidden Markov model for predicting transmembrane topology

(Kahsay, Gao & Liao. Bioinformatics 2005)
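
For readers unfamiliar with how an HMM labels a sequence, here is a toy Viterbi decoder in Python. It is not the TMMOD architecture: the two states (`M` for membrane, `L` for loop), the two-letter alphabet (`H` hydrophobic, `P` polar), and all probabilities are invented for illustration only.

```python
import math


def viterbi(obs, states, start, trans, emit):
    """Most probable state path for `obs` under a discrete HMM (log-space)."""
    # Scores for the first observation
    V = [{s: math.log(start[s]) + math.log(emit[s][obs[0]]) for s in states}]
    back = []
    for o in obs[1:]:
        row, ptr = {}, {}
        for s in states:
            # Best predecessor state for landing in s at this position
            prev = max(states, key=lambda p: V[-1][p] + math.log(trans[p][s]))
            row[s] = V[-1][prev] + math.log(trans[prev][s]) + math.log(emit[s][o])
            ptr[s] = prev
        V.append(row)
        back.append(ptr)
    # Trace back from the best final state
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return "".join(reversed(path))


# Hypothetical parameters: states tend to persist, and membrane
# segments favor hydrophobic residues.
STATES = ("M", "L")
START = {"M": 0.5, "L": 0.5}
TRANS = {"M": {"M": 0.9, "L": 0.1}, "L": {"M": 0.1, "L": 0.9}}
EMIT = {"M": {"H": 0.8, "P": 0.2}, "L": {"H": 0.2, "P": 0.8}}

if __name__ == "__main__":
    print(viterbi("HHHHPPPP", STATES, START, TRANS, EMIT))
```

A topology predictor uses many more states (for example, separate inside and outside loops and helix caps), but decoding works the same way.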


Mod.     Reg.  Data set  Correct topology  Correct location  Sensitivity  Specificity
TMMOD 1  (a)   S-83      65 (78.3%)        67 (80.7%)        97.4%        97.4%
         (b)   S-83      51 (61.4%)        52 (62.7%)        71.3%        71.3%
         (c)   S-83      64 (77.1%)        65 (78.3%)        97.1%        97.1%
TMMOD 2  (a)   S-83      61 (73.5%)        65 (78.3%)        99.4%        97.4%
         (b)   S-83      54 (65.1%)        61 (73.5%)        93.8%        71.3%
         (c)   S-83      54 (65.1%)        66 (79.5%)        99.7%        97.1%
TMMOD 3  (a)   S-83      70 (84.3%)        71 (85.5%)        98.2%        97.4%
         (b)   S-83      64 (77.1%)        65 (78.3%)        95.3%        71.3%
         (c)   S-83      74 (89.2%)        74 (89.2%)        99.1%        97.1%
TMHMM          S-83      64 (77.1%)        69 (83.1%)        96.2%        96.2%
PHDtm          S-83      (85.5%)           (88.0%)           98.8%        95.2%
TMMOD 1  (a)   S-160     117 (73.1%)       128 (80.0%)       97.4%        97.0%
         (b)   S-160     92 (57.5%)        103 (64.4%)       77.4%        80.8%
         (c)   S-160     117 (73.1%)       126 (78.8%)       96.1%        96.7%
TMMOD 2  (a)   S-160     120 (75.0%)       132 (82.5%)       98.4%        97.2%
         (b)   S-160     97 (60.6%)        121 (75.6%)       97.7%        95.6%
         (c)   S-160     118 (73.8%)       135 (84.4%)       98.4%        97.2%
TMMOD 3  (a)   S-160     120 (75.0%)       133 (83.1%)       97.8%        97.6%
         (b)   S-160     110 (68.8%)       124 (77.5%)       94.5%        98.1%
         (c)   S-160     135 (84.4%)       143 (89.4%)       98.3%        98.1%
TMHMM          S-160     123 (76.9%)       134 (83.8%)       97.1%        97.7%


Inferring Regulatory Networks from Time Course Expression Data (Gandhi, Cogburn & Liao, 2008)

[Figure panels: expression profile clustering (K-means); binary heat map; Boolean network algorithm.]
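
The binarize-then-infer idea behind a Boolean network analysis can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the published algorithm: each gene is thresholded at its own mean, only single-regulator rules with a one-step delay are considered, and the K-means clustering step is omitted. The gene names and expression values are invented.

```python
def binarize(series):
    """Turn a time course into 0/1 by thresholding at its own mean (assumed rule)."""
    mean = sum(series) / len(series)
    return [1 if v > mean else 0 for v in series]


def consistent_single_input_rules(states, target):
    """Find regulators whose state at time t determines the target at time t+1.

    states : dict mapping gene name -> list of 0/1 values over time
    Returns (gene, "activates") if target[t+1] == gene[t] for every t,
    or (gene, "represses") if target[t+1] == 1 - gene[t] for every t.
    """
    rules = []
    tgt = states[target]
    for gene, vals in states.items():
        pairs = list(zip(vals[:-1], tgt[1:]))
        if all(r == t for r, t in pairs):
            rules.append((gene, "activates"))
        elif all(r != t for r, t in pairs):
            rules.append((gene, "represses"))
    return rules


if __name__ == "__main__":
    # Hypothetical expression levels at five time points.
    expression = {
        "A": [0.1, 0.9, 0.8, 0.2, 0.1],
        "B": [0.2, 0.1, 0.9, 0.9, 0.1],
        "C": [0.9, 0.1, 0.2, 0.8, 0.9],
    }
    states = {g: binarize(v) for g, v in expression.items()}
    print(states)
    print(consistent_single_input_rules(states, "B"))
```

With short time courses many rules are consistent by chance, so in practice candidate rules would need further filtering or validation.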


GENOMIC COMPARISON OF BACTERIAL SPECIES BASED ON METABOLIC CHARACTERISTICS (Jain, Wang, Boyd, Liao, ISIBM 2009)
