Classification of real and pseudo microRNA precursors using local structure-sequence features and support vector machine

Chenghai Xue, Fei Li, Tao He, Guo-Ping Liu, Yanda Li, and Xuegong Zhang

CISC 841 Bioinformatics Nehar



Background: miRNAs

Single-stranded RNAs, ~20-25 nucleotides long, that play a regulatory role in gene expression.

Transcribed as a long primary miRNA (pri-miRNA) with a hairpin structure.

pri-miRNA is processed by the nuclear RNase III Drosha into a ~60-70 nt pre-miRNA.

pre-miRNA is actively transported from the nucleus to the cytoplasm by Exportin-5.

Cleaved into the ~20-25 nt mature miRNA.


Background: The ‘hairpin loop’

Sequence of nucleotides in which two segments can form base pairs with each other, but a segment between them cannot.


Background: The ‘hairpin loop’

The sequence

    ---CCTGCXXXXXXXGCAGG---

forms the hairpin structure

    ---C G---
       C G
       T A
       G C
       C G
      X   X
      X   X
       X X
        X
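The pairing in the diagram can be checked mechanically. The following is a minimal sketch, not taken from the paper: it assumes plain Watson-Crick pairing on the DNA-style letters used on the slide, and the helper name `arms_pair` and the fixed stem length of 5 are illustrative choices.

```python
# Watson-Crick partners for the DNA-style letters used on the slide.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def arms_pair(five_prime_arm: str, three_prime_arm: str) -> bool:
    """Return True if the arms pair base by base when the 3' arm is read backwards."""
    if len(five_prime_arm) != len(three_prime_arm):
        return False
    return all(
        COMPLEMENT.get(a) == b
        for a, b in zip(five_prime_arm, reversed(three_prime_arm))
    )


sequence = "CCTGCXXXXXXXGCAGG"
stem_len = 5  # assumed: the five paired positions drawn on the slide

left = sequence[:stem_len]            # 5' arm
loop = sequence[stem_len:-stem_len]   # unpaired X positions
right = sequence[-stem_len:]          # 3' arm

print(left, loop, right, arms_pair(left, right))
```

The two arms pair only because the 3' arm is the reverse complement of the 5' arm; the X positions in between stay unpaired and form the loop.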


Background: The ‘hairpin loop’

The pre-miRNA 'hairpin' is an important secondary structure for identifying miRNAs.

Since mature miRNAs are very short (~20 nt), sequence alignment is not very useful for identifying miRNAs.

The solution is to make use of the hairpin structure of the pre-miRNA.


The problem

Many sequence segments fold into similar stem-loop hairpin structures.

So existing methods for identifying miRNAs must use comparative genomics information in addition to structural features. An example: filter out hairpins not conserved in related species.

This implies an inability to identify miRNAs without close known homologues.

Furthermore, comparative genomics approaches cannot be applied to species for which no closely related species has been sequenced.


Proposed solution

Ab initio (from first principles) classification of real pre-miRNAs versus "pseudo" pre-miRNAs, i.e. non-pre-miRNA sequences that have the hairpin structure.

Get a set of novel features that combine local structure and sequence information of pre-miRNA stem-loops.

Use an SVM to classify hairpins as pre-miRNA or pseudo pre-miRNA.


The datasets

Sets of human pre-miRNA and pseudo-miRNA hairpins collected to train the SVM and evaluate performance.

Human pre-miRNAs downloaded from the miRNA registry database. Only pre-miRNAs without multiple loops considered (193, or ~93% of the database).

Pseudo and candidate miRNA hairpins: segments that have a stem-loop structure similar to pre-miRNAs but are not pre-miRNAs.

  CODING dataset and CONSERVED-HAIRPIN dataset.
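One way to apply the "without multiple loops" filter is sketched below. It assumes the secondary structure is available in dot-bracket notation and reads "multiple loops" as "more than one hairpin loop"; both the reading and the function names are assumptions, not the authors' stated procedure.

```python
import re


def count_hairpin_loops(dot_bracket: str) -> int:
    """Count hairpin loops: an opening bracket, any run of dots, then a closing bracket."""
    return len(re.findall(r"\(\.*\)", dot_bracket))


def has_single_loop(dot_bracket: str) -> bool:
    """True for a simple stem-loop, False for branched structures or no pairing at all."""
    return count_hairpin_loops(dot_bracket) == 1


single = "((((...))))"                 # one stem, one loop
branched = "((..((...))..((...))..))"  # two inner hairpins

print(has_single_loop(single), has_single_loop(branched))
```

Each hairpin loop is closed by exactly one innermost base pair, so counting `(`, dots, `)` runs counts the loops without parsing the whole structure.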


The Coding dataset

Collected from protein-coding regions.

Used as negative samples in training and validation of the classifier.

Length distribution kept identical to that of pre-miRNAs.

Criteria for selection:

  Minimum of 18 base pairings on the stem of the hairpin.

  Maximum of -15 kcal/mol free energy of the secondary structure.

  (The numbers correspond to the limits for genuine human pre-miRNAs.)

8,494 pre-miRNA-like hairpins in this dataset.
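The two selection criteria can be written as a small filter. This is a sketch under stated assumptions: the structure is given in dot-bracket notation, the free energy comes from an external folding tool and is passed in as a number, and the function names and example values are hypothetical.

```python
MIN_BASE_PAIRS = 18       # minimum base pairings, from the slide
MAX_FREE_ENERGY = -15.0   # kcal/mol, from the slide (more negative = more stable)


def count_base_pairs(dot_bracket: str) -> int:
    """Each '(' has exactly one partner ')', so counting '(' counts the base pairs."""
    return dot_bracket.count("(")


def is_pre_mirna_like(dot_bracket: str, free_energy: float) -> bool:
    """Apply both criteria: enough pairings and a low enough free energy."""
    return (
        count_base_pairs(dot_bracket) >= MIN_BASE_PAIRS
        and free_energy <= MAX_FREE_ENERGY
    )


# Made-up example: a 20-pair stem with a 6-nt loop and an assumed energy value.
structure = "(" * 20 + "." * 6 + ")" * 20
print(is_pre_mirna_like(structure, -25.3))
```

Note that "maximum of -15 kcal/mol" means the energy must be -15 or lower, which is why the comparison is `<=`.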


The Conserved-hairpin dataset

Extracted from the genome region at positions 56,000,001 – 57,000,000 on human chromosome 19 (UCSC db).

Used as a candidate dataset to evaluate the classifier.

2,444 hairpins from sequences conserved between human and mouse.

Most hairpins are likely to be pseudo-miRNAs. In fact, there are only 3 known miRNAs in this dataset.


Training and Test sets

For classification experiments, one training set and two test sets were built from the 3 datasets.

TR-C: Training set.

  163 human pre-miRNAs (+ve samples) from the 193 human pre-miRNAs.

  168 pseudo pre-miRNAs (-ve samples) from the Coding dataset.

TE-C: Test set 1.

  Remaining 30 human pre-miRNAs; 1000 pseudo pre-miRNAs (avoiding those in TR-C).

Conserved-hairpin dataset: Test set 2.


Two further test sets

Apply the SVM trained using the previous sets to two further test sets.

Cross-Species test set

  581 pre-miRNAs from 11 species.

Updated test set

  New batch of reported human miRNAs.

  Includes 39 non-redundant pre-miRNAs without multiple loops.


Local contiguous structure-sequence features

Local sequence features are important in pre-miRNAs.

Authors' claim: the distribution of local sub-structures (i.e. continuously paired or unpaired structures) in pre-miRNAs is significantly distinct from that in pseudo pre-miRNAs.

Use a combination of local structure and sequence information to classify real vs. pseudo miRNA hairpins.

Focus on the information in 3 adjacent nucleotides (triplet elements).

“(” and “)” mean paired, at the 5’-end and 3’-end respectively; “.” means unpaired. The paper doesn’t make the 5’/3’ distinction, so both are written as “(”.


Structure-sequence features

8 possible structure compositions for each triplet [“(((”, “((.”, “(..”, and so on].

32 structure-sequence combinations (4 nucleotides U, C, G, A × 8 structures) if we consider the middle nt.


Structure-sequence features

e.g. “U(((” => middle nt is U and all three nts are paired.

Count the appearances of each triplet element to get a 32-dimensional feature vector (normalized).
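The counting step can be sketched as follows. This is an illustration rather than the authors' code: it assumes the structure is given in dot-bracket notation, that every interior position of the hairpin contributes one triplet, and that "normalized" means dividing by the total count. The slides do not spell out either of the last two points.

```python
from itertools import product

NUCLEOTIDES = "ACGU"
# 4 middle nucleotides x 8 structure triplets = 32 feature names such as "U(((".
FEATURES = [nt + "".join(s) for nt in NUCLEOTIDES for s in product("(.", repeat=3)]


def triplet_features(sequence: str, dot_bracket: str) -> list[float]:
    """Return the 32 triplet-element frequencies for one hairpin."""
    if len(sequence) != len(dot_bracket):
        raise ValueError("sequence and structure must have the same length")
    sequence = sequence.upper().replace("T", "U")
    # No 5'/3' distinction: every paired position is written as "(".
    structure = dot_bracket.replace(")", "(")
    counts = dict.fromkeys(FEATURES, 0)
    for i in range(1, len(sequence) - 1):
        key = sequence[i] + structure[i - 1:i + 2]
        if key in counts:  # skips ambiguous letters such as N
            counts[key] += 1
    total = sum(counts.values())
    return [counts[f] / total if total else 0.0 for f in FEATURES]


# Toy hairpin, far shorter than a real pre-miRNA.
vector = triplet_features("GGGAAAUCCC", "(((....)))")
print(len(vector), vector[FEATURES.index("A...")])
```

Each hairpin, whatever its length, maps to a vector of the same size, which is what lets a standard SVM take it as input.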


SVM Classification

The SVM classifier is trained with TR-C and applied to the other test sets.

On TE-C, 28/30 human pre-miRNAs and 881/1000 pseudo-miRNAs were correctly identified.

On the Conserved-hairpin set, 2174/2444 structures were classified as false miRNAs.
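The reported counts translate into rates as follows. The counts are from the slide; the variable names, and reading the first two ratios as sensitivity and specificity, are conventional labels added here.

```python
def rate(correct: int, total: int) -> float:
    """Fraction of items handled correctly."""
    return correct / total


sensitivity = rate(28, 30)            # real pre-miRNAs recognised
specificity = rate(881, 1000)         # pseudo hairpins rejected
conserved_rejected = rate(2174, 2444) # conserved hairpins called non-miRNA

print(f"sensitivity        {sensitivity:.1%}")
print(f"specificity        {specificity:.1%}")
print(f"conserved rejected {conserved_rejected:.1%}")
```

The conserved-hairpin figure is not a pure specificity, since that set contains 3 known miRNAs, but with only 3 positives among 2,444 hairpins the two numbers are nearly the same.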


SVM Classification

The triplet elements reflect contiguous fine structures and sequence composition. For instance, “(((” => stacking of paired bases, and “...” => bulge loops.

The success of the classifier shows that these features reflect intrinsic characteristics of pre-miRNAs.

“(((” appears at higher frequency in pre-miRNAs, and “...” appears more often in pseudo miRNAs.


SVM Classification

[Figure: Average frequency of triplets in the training dataset]


SVM Classification

The observations above (“(((” more frequent in pre-miRNAs, “...” more frequent in pseudo miRNAs) can be linked to the stability of the secondary structure. Stacking of more continuously paired nts decreases free energy, so pre-miRNAs are more stable.


SVM Classification

Sequence information

  The frequency of the same triplet structure with different middle nts varies within real pre-miRNAs, and between real and pseudo miRNAs.


SVM Classification

[Figure: Average frequency of triplets in the training dataset]


SVM Classification across species

Applied the classifier trained on human data to other species (Cross-Species test set).

Good performance in identifying true pre-miRNAs.

581 known pre-miRNAs from 11 species; 90.9% overall accuracy.


SVM Classification across species


Conclusion

Ab initio methods for distinguishing true pre-miRNAs from pre-miRNA-like hairpin structures are very important.

The triplet-SVM classifier describes fine-grained sequence-structure characteristics.

90% accuracy on human data.

Up to 90% accuracy on 11 other species (including plants and viruses) without using comparative genomics information.

The current specificity of about 89% is not enough for genome-wide applications.
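A rough calculation shows why 89% specificity falls short at genome scale. The conserved-hairpin counts below are from the slides; the 100,000-hairpin figure is a purely hypothetical genome-wide candidate count chosen for illustration, and the function name is an assumption.

```python
def expected_false_positives(n_negatives: int, specificity: float) -> int:
    """Negatives wrongly called miRNA, rounded to the nearest whole hairpin."""
    return round(n_negatives * (1.0 - specificity))


# Conserved-hairpin set (1 Mb of chromosome 19), counts from the slides.
total_hairpins = 2444
rejected = 2174
known_mirnas = 3

called_positive = total_hairpins - rejected
# Upper bound: assumes all 3 known miRNAs are among the positive calls.
max_known_fraction = known_mirnas / called_positive

hypothetical_negatives = 100_000  # assumed candidate count, not from the paper

print(called_positive)
print(f"{max_known_fraction:.1%}")
print(expected_false_positives(hypothetical_negatives, 0.89))
```

Because real pre-miRNAs are rare among candidate hairpins, even a small false-positive rate produces far more false calls than true ones; some of the positive calls could still be undiscovered miRNAs, so the bound only concerns the known ones.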