MSc Project Ppt
NON-CODING RNA PREDICTION OF CLINICALLY
IMPORTANT MYCOPLASMA BY COMPARATIVE
GENOMIC ANALYSIS
Dissertation submitted to the Madurai Kamaraj University in partial fulfillment
of the requirements for the degree of Master of Science in Biotechnology
Regn. No:A242009
School of Biotechnology
Madurai Kamaraj University
Madurai
OBJECTIVES:
• To choose the best possible approach to predict the
ncRNA
• To standardize the procedure required for the
approach selected.
• To identify and characterize the ncRNAs
from clinically important Mycoplasma.
• To form the base for automating
the ncRNA prediction.
Past
• Sequence similarity search, Statistical analysis, Transcription signal analysis,
Comparative genomic analysis.
• Existing methods are biased to particular classes of ncRNAs only.
• Examples: tRNAscan-SE, MiRscan, etc.
QRNA - A Blend
• Secondary structure alone is not statistically significant for the detection of
ncRNAs.
• Sequences that code for proteins and perform important functions
are conserved across related organisms.
QRNA was developed to screen the conserved RNA secondary structures from the
background of the other conserved sequences.
OUTLINE
INTERGENIC REGIONS OF ORGANISM OF INTEREST
↓
SEARCH FOR HOMOLOGY ACROSS RELATED ORGANISMS (blastn)
↓
PARSE THE ALIGNMENTS WITH CERTAIN CUTOFFS (Perl scripts)
↓
THE ALIGNMENTS WERE GIVEN AS INPUT FOR QRNA
↓
PUTATIVE ncRNA
PROTEIN CODING REGION→INTERGENIC REGION
.ptt file
↓
Co-ordinates of protein coding regions
↓
Intergenic region co-ordinates
↓
Intergenic region co-ordinates with
difference > 50 nucleotides
↓
Range file
↓
Intergenic sequence extraction by EMBOSS application
extractseq -regions @rangefile -separate
GENOME LENGTH COMPARISON OF
THE MYCOPLASMA
Organism           Genome size (bp)
M.gallisepticum    996,422
M.genitalium       580,074
M.mycoides         1,211,703
M.penetrans        1,358,633
M.pneumoniae       816,394
M.pulmonis         963,879
[Bar graph: Genome Size Comparison; genome length (0 to 1,500,000) for M.pen, M.myc, M.gal, M.pul, M.pne and M.gen]
M.gen- Mycoplasma genitalium
M.pne- Mycoplasma pneumoniae
M.pul- Mycoplasma pulmonis
M.gal- Mycoplasma gallisepticum
M.myc- Mycoplasma mycoides
M.pen- Mycoplasma penetrans
MYCOPLASMA GENOME – INTERGENIC REGION
BAR GRAPH SHOWING THE PERCENTAGE OF INTERGENIC
REGION IN THE GENOME OF MYCOPLASMA
[Bar graph: percentage of intergenic region (0% to 100%) for M.pen, M.myc, M.gal, M.pul, M.pne, M.gen and Homo sapiens]
PROTEIN TABLE OF THE GENOME
Mycoplasma genitalium G37 complete genome - 0..580074
480 proteins
Location      Strand  Length  PID      Gene   Synonym  Code  COG  Product
735..1829     +       364     3844620  MG001  -        -     -    (dnaN)
1829..2761    +       310     1045670  MG002  -        -     -    dnaJ
2846..4798    +       650     1045671  MG003  -        -     -    (gyrB)
4813..7323    +       836     1045672  MG004  -        -     -    (gyrA)
7295..8548    +       417     1045673  MG005  -        -     -    (serS)
8552..9184    +       210     1045674  MG006  -        -     -    (tmk)
9157..9921    +       254     1045675  MG007  -        -     -    hypothetical
9924..11252   +       442     1045676  MG008  -        -     -    (tdhF)
…… …….. … ….. ……….. ……… .. .. .. …
Protein Co-ordinates
735 1829
1829 2761
2846 4798
4813 7323
7295 8548
8552 9184
9157 9921
……. …….
→
Intergenic Co-ordinates
1 734
2762 2845
4799 4812
7324 7294
8549 8551
9185 9156
……. …….
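The step from protein co-ordinates to intergenic co-ordinates can be sketched in Python as below. This is only an illustrative sketch, not the script used in the project: the function name `raw_intergenic` is hypothetical, and the input is the first eight M. genitalium entries from the protein table. Each region runs from the base after one gene to the base before the next, so overlapping genes give a region whose end is smaller than its start.

```python
def raw_intergenic(genes, first_base=1):
    """Return (start, end, length) for the gap in front of each gene.

    genes: (start, end) protein co-ordinates, 1-based and inclusive,
    in the order they appear in the protein table.
    Overlapping neighbours give a length <= 0; these are kept on purpose
    so that a later curing step can discard them.
    """
    regions = []
    prev_end = first_base - 1
    for start, end in genes:
        gap_start, gap_end = prev_end + 1, start - 1
        regions.append((gap_start, gap_end, gap_end - gap_start + 1))
        prev_end = end
    return regions


# First eight protein co-ordinates from the M. genitalium protein table.
genes = [(735, 1829), (1829, 2761), (2846, 4798), (4813, 7323),
         (7295, 8548), (8552, 9184), (9157, 9921), (9924, 11252)]

regions = raw_intergenic(genes)
for start, end, length in regions:
    print(start, end, length)
```

The one-base overlap between the first two genes also yields a region of length -1 (1830 to 1828), which is discarded in the same way as the other negative-length regions.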
CURING
Raw intergenic coordinates
Starting  Ending  Length
1         734     734
2762      2845    84
4799      4812    14
7324      7294    -29
8549      8551    3
9185      9156    -28
9922      9923    2
11253     11251   -1
12041     12068   28
12726     12701   -24
13566     13569   4
14434     14395   -38
15317     15555   239
Intergenic region coordinates which are
more than 50 nucleotides in length
Starting  Ending  Length
1         734     734
2762      2845    84
15317     15555   239
Curing of Intergenic Regions
[Bar graph: No. of Intergenic Regions (0 to 1200), Before and After curing, for M.pen, M.myc, M.gal, M.pul, M.pne and M.gen]
GRAPH SHOWING THE CULLING OF THE INTERGENIC SEQUENCES BY THE
C PROGRAMME THAT SELECTS THE REGIONS WHOSE LENGTH IS GREATER THAN OR EQUAL
TO 50 NUCLEOTIDES ONLY
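The curing step can be sketched in Python as follows. This is a stand-in for illustration, not the C programme itself: the names `cure` and `range_file_text` are hypothetical, the cutoff is left as a parameter, and the one "start end" pair per line layout of the range file is an assumption about what extractseq accepts through `-regions @rangefile`.

```python
def cure(regions, min_length=50):
    """Keep only (start, end) regions at least min_length nucleotides long.

    Regions from overlapping genes have end < start, so their length is
    zero or negative and they are dropped by the same test.
    """
    return [(s, e) for s, e in regions if e - s + 1 >= min_length]


def range_file_text(regions):
    """Write one 'start end' pair per line (assumed range-file layout)."""
    return "".join(f"{s} {e}\n" for s, e in regions)


# Raw intergenic co-ordinates as listed on the slide.
raw = [(1, 734), (2762, 2845), (4799, 4812), (7324, 7294), (8549, 8551),
       (9185, 9156), (9922, 9923), (11253, 11251), (12041, 12068),
       (12726, 12701), (13566, 13569), (14434, 14395), (15317, 15555)]

kept = cure(raw)
print(range_file_text(kept), end="")
```

For these thirteen raw regions only three survive, which matches the cured list shown on the slide.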
INTERGENIC SEQUENCES
>L43967_2762_2845 Mycoplasma genitalium G37 intergenic sequence
AAAACCTTTCATTTTTAATGTGTTATAATTATTTGTTATGCCATAAATTTAGTTTGTGGCAAAAGCTTCTGTACTGTTTATTTA
>L43967_15317_15555 Mycoplasma genitalium G37 intergenic sequence
ACCCTCAACCTCCTGAGTGCAAATCAGGTGCTCTATCAGTTGAGCTACATCCCCATTATTGGTGGAAGTAAATGGACTTGAACCATCGACCTCACCCTTATCAGGGGTGTGCTCTAACCAACTGAGCTATACTTCCAAGCATAATCCTAAGGGTATTTAACTAATTATTATAACAATTTTAATTTAACCAAAATACCCCTCGAATTTTAACAGTTTTTATAATCAAAACAGCTAATTTT
>L43967_19760_19824 Mycoplasma genitalium G37 intergenic sequence
ATAAATTTAATAGTGTTGAAAGACAAACATTATTAATTTTTGATCAGCTAAATAAAACAAAGCAA
>L43967_20356_20543 Mycoplasma genitalium G37 intergenic sequence
CTCAAAAAACTAATACATCAAACTTCAACCGTTTACTTTTTTATGAACAAGCACTACAAAGGTTTTATGAAGAATTATTTCAAATAGATTATTTAAGAAGATTTGAAAACATTCCCATTAAAGATAAGAATCAAATTGCGCTTTTTAAAACTGTTTTTGATGATTACAAAACCATTGATTTAGCAGAA
…………………………………………………………………………………………………………..
Intergenic sequences extracted in Fasta format
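How a record of this kind relates to the genome co-ordinates can be sketched in Python as below. The project used the EMBOSS application for the extraction; this hypothetical helper `intergenic_fasta` only illustrates the 1-based inclusive co-ordinates and the header pattern seen above, and the toy genome string is made up.

```python
def intergenic_fasta(genome, accession, start, end, organism, width=60):
    """Format one intergenic region as a FASTA record.

    start and end are 1-based and inclusive, as in the range file.
    """
    seq = genome[start - 1:end]
    header = f">{accession}_{start}_{end} {organism} intergenic sequence"
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return "\n".join([header] + lines) + "\n"


toy_genome = "ACGT" * 25  # made-up 100 nt sequence, for illustration only
record = intergenic_fasta(toy_genome, "L43967", 11, 30,
                          "Mycoplasma genitalium G37")
print(record, end="")
```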
Similarity Search - WU BLAST 2.0
• Six genome databases were made, each excluding one organism
• Intergenic sequences of each organism were searched for similarity (blastn)
against the database that does not contain that organism's genome
Organism          Database created   Organisms in database
M.gallisepticum   gempppdb           M.genitalium, M.mycoides, M.penetrans, M.pneumoniae, M.pulmonis
M.genitalium      gampppdb           M.gallisepticum, M.mycoides, M.penetrans, M.pneumoniae, M.pulmonis
M.mycoides        ggpppdb            M.gallisepticum, M.genitalium, M.penetrans, M.pneumoniae, M.pulmonis
M.penetrans       ggmpnpudb          M.gallisepticum, M.genitalium, M.mycoides, M.pneumoniae, M.pulmonis
M.pneumoniae      ggmpepudb          M.gallisepticum, M.genitalium, M.mycoides, M.penetrans, M.pulmonis
M.pulmonis        ggmpepndb          M.gallisepticum, M.genitalium, M.mycoides, M.penetrans, M.pneumoniae
Table showing the list of databases made and the organisms in each
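The leave-one-out scheme behind these databases can be sketched in Python as below. This is a minimal illustration only: the function name `leave_one_out` is hypothetical, and the sketch lists database members without building any BLAST database.

```python
ORGANISMS = ["M.gallisepticum", "M.genitalium", "M.mycoides",
             "M.penetrans", "M.pneumoniae", "M.pulmonis"]


def leave_one_out(organisms):
    """Map each query organism to the genomes that go into its database."""
    return {query: [o for o in organisms if o != query]
            for query in organisms}


databases = leave_one_out(ORGANISMS)
for query, members in databases.items():
    print(f"{query}: search against {', '.join(members)}")
```

Excluding the query genome keeps each intergenic sequence from simply matching itself, so every hit comes from a related organism.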
Parsing alignments - Factors
• A Perl script, blastn2qrnadepth.pl, is used to parse the blast alignments
• Factors considered in parsing
– I trimming
• Evalue
• Minimum and Maximum Identity of alignments
• Length of the alignment
– II trimming
• Score
• Depth of alignments
• Shift
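The two trimming passes can be sketched in Python as below. This is a simplified illustration, not the logic of blastn2qrnadepth.pl: the `Hit` record and both function names are hypothetical, the default cutoffs are the ones printed in the parsing report file, the second pass is reduced to "keep the best-scoring hits", and the shift factor is not modelled. The three example hits are made up.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Hypothetical record for one blastn alignment."""
    evalue: float
    percent_id: float
    length: int
    score: int


def first_trimming(hits, max_evalue=0.01, min_id=0.0, max_id=100.0,
                   min_length=1):
    """Keep hits that pass the E-value, identity and length cutoffs."""
    return [h for h in hits
            if h.evalue <= max_evalue
            and min_id <= h.percent_id <= max_id
            and h.length >= min_length]


def second_trimming(hits, depth=1):
    """Simplified stand-in: keep the `depth` best hits ranked by score."""
    return sorted(hits, key=lambda h: h.score, reverse=True)[:depth]


# Made-up hits for one query.
hits = [Hit(1e-30, 98.0, 180, 700),
        Hit(0.5, 60.0, 40, 60),
        Hit(1e-5, 66.0, 62, 91)]

after_first = first_trimming(hits)
after_second = second_trimming(after_first)
print(len(hits), len(after_first), len(after_second))
```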
Parsing alignments – QRNA input
• Perl script generates various files
– QRNA input file : filename.q file
• It is a collection of sequences in fasta format, in which each pair of
sequences holds the two components of an alignment, with
gaps left in place.
– Parsing report file : filename.q.rep
• It is a report of the blastn alignments that have been
pruned in the process of creating the QRNA input file.
QRNA input file
>L43967_15317_15555-1>179-Mycoplasma
ACCCTCAACCTCCTGAGTGCAAATCAGGTGCTCTATCAGTTGAGCTACATCCCCATTATT
GGTGGAAGTAAATGGACTTGAACCATCGACCTCACCCTTATCAGGGGTGTGCTCTAACCA
ACTGAGCTATACTTCCAAGCATAATCCTAAGGGTAT-TTAACTA-ATTATTATAACAATT
T
>gb-U00089--19096>19275-Mycoplasma
ACCCTCAACCTCCTGAGTGCAAATCAGGTGCTCTATCAGTTGAGCTACATCCCCATTATT
GGTGGAAGTAAATGGACTTGAACCATCGACCTCACCCTTATCAGGGGTGTGCTCTAACCA
ACTGAGCTATACTTCCAGGCAAAATCTTC-GTACAGGTTCGCTTCATAATTATATTAATT
T
>L43967_19760_19824-5<65-Mycoplasma
TTGCTTTGTTTTATTTAGCTGATCAA-AAATTAATAATGTTTGTCTTTCAACACTATTAA
AT
>emb-BX293980.1--57200>57261-Mycoplasma
TTGTTTTGTTTTATTTAATTGATCAATAAATTGATTTAGTTTATCTTTATTTATTAATAA
AT
Parsing Report File
FILE: genblast
DIR: /home/kalyankpy/coput2/blast//
FIRST TRIMMING
Minimum length = 1
Maximum Evalue = 0.01
Minimum %id = 0
Maximum %id = 100
SECOND TRIMMING
Alignments culled by = SC
Depth of alignments = 1
shift = 1
113-QUERY: L43967_546708_546877 Mycoplasma genitalium G37 intergenic sequence
Total # alignments: 1121 After First trimming: 88 After Second trimming: 2
57-QUERY: L43967_325878_326027 Mycoplasma genitalium G37 intergenic sequence
Total # alignments: 152 After First trimming: 3 After Second trimming: 3
……………………………………………………………………………………………….
……………………………………………………………………………………………….
Total #Queries 122
Total #Alignments 53927 ave_len = 309.5
After first trimming 18851 ave_len = 552.6
After second trimming 386 ave_len = 404.2
No. of blastn hits selected for qrna input
Organism   No. of alignments   No. of Blast hits
M.pen      560263              1852
M.myc      430551              1787
M.gal      154026              850
M.pul      360830              1012
M.pne      44433               565
M.gen      53927               386
GRAPH SHOWING NUMBER OF ALIGNMENTS
SELECTED FOR QRNA INPUT FOR EACH
GENOME THROUGH THE PERL SCRIPT
QRNA – PARAMETERS
• Scanning window approach
– Window =150 nt; Extension = 50 nt
• Maximum length 9999999
• Local Viterbi algorithm
• RIBOPROB matrix
• Shuffling the sequence maintaining the composition
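The scanning window approach can be sketched in Python as below. This is an illustration only: the function name `scan_windows` is hypothetical, the 50 nt extension is treated as the step between window starts (the QRNA output header reports it as "slide = 50"), and the shortening of the final window at the end of the alignment is an assumption rather than documented QRNA behaviour.

```python
def scan_windows(length, window=150, slide=50):
    """Return 0-based, half-open (start, end) windows over an alignment."""
    windows = []
    start = 0
    while start < length:
        end = min(start + window, length)
        windows.append((start, end))
        if end == length:
            break  # the alignment end has been reached
        start += slide
    return windows


# Example: an alignment of 664 columns.
windows = scan_windows(664)
print(len(windows), windows[0], windows[-1])
```

The first window covers columns 0 to 149, the same span as the first window reported in the QRNA output for the 664-column alignment.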
QRNA OUTPUT
#---------------------------------------------------------------------
# qrna 2.0.1 (Tue Aug 19 11:30:55 CDT 2003) using squid 1.5m (Sept 1997)
#---------------------------------------------------------------------
# PAM model = BLOSUM62
#---------------------------------------------------------------------
# RNA model = /mix_tied_linux.cfg
# RIBOPROB matrix = /RIBOPROB85-60.mat
#---------------------------------------------------------------------
# seq file = /home/kalyankpy/perlscriptresult/genblast.q
# #seqs: 772 (max_len = 3420)
#---------------------------------------------------------------------
# window version: window = 150 slide = 50 -- length range = [0,9999999]
#---------------------------------------------------------------------
# 1 [both strands] (sre_shuffled)
>L43967_1_734-90>722-Mycoplasma (664)
>gb-U00089--130>767-Mycoplasma (664)
length of whole alignment after removing common gaps: 664
Divergence time (variable): 0.401
[alignment ID = 61.75 MUT = 29.67 GAP = 8.58
………………………………………………………… ……………….. ( CONTD..)
length alignment: 150 (id=61.33) (mut=32.67) (gap=6.00)(sre_shuffled)
posX: 0-149 [0-145](146) -- (0.42 0.08 0.06 0.43)
posY: 0-149 [0-144](145) -- (0.37 0.11 0.06 0.46)
L43967_1_734-90 TTAATTTTATTAAAACTATAACTTATTTTTTATAAACATTCTATGTTTTT
gb-U00089--130> TTTATTTTATTAAAATTATAATGTATTTTTGTTAAATTTT.TAATTCTTT
………………………………………………………………………………………………………………………………………………………………………………
LOCAL_DIAG_VITERBI -- [Inside SCFG]
OTH ends *(+) = (0..[150]..149)
OTH ends (-) = (0..[150]..149)
COD ends *(+) = (120..[27]..146)
COD ends (-) = (41..[12]..52)
RNA ends *(+) = (0..[21]..20)
RNA ends (-) = (0..[150]..149)
winner = OTH
OTH = 184.281 COD = 166.408 RNA = 179.710
logoddspostOTH = 0.000 logoddspostCOD = -17.873 logoddspostRNA = -4.571
sigmoidalOTH = 4.571 sigmoidalCOD = -17.932 sigmoidalRNA = -4.571
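The relation between the three model scores and the derived values in this window can be checked with a short Python sketch. The relations used here are inferred from the numbers shown, not taken from the QRNA documentation: the winner is taken as the highest-scoring model, logoddspost as each score minus the OTH score, and sigmoidal as the log2 posterior odds of one model against the other two, assuming equal priors and scores in bits. All function names are hypothetical.

```python
import math

# Model scores for the window shown above.
scores = {"OTH": 184.281, "COD": 166.408, "RNA": 179.710}


def winner(scores):
    """Model with the highest score."""
    return max(scores, key=scores.get)


def logodds_vs_oth(scores):
    """Score of each model relative to the OTH (null) model."""
    return {m: s - scores["OTH"] for m, s in scores.items()}


def sigmoidal(scores):
    """log2 posterior odds of each model against the other two.

    Assumes equal priors and scores expressed in bits.
    """
    out = {}
    for model, s in scores.items():
        others = [v for k, v in scores.items() if k != model]
        top = max(others)
        # log2 of the summed likelihoods of the competing models
        log2_sum = top + math.log2(sum(2.0 ** (v - top) for v in others))
        out[model] = s - log2_sum
    return out


print(winner(scores))
print({m: round(v, 3) for m, v in logodds_vs_oth(scores).items()})
print({m: round(v, 3) for m, v in sigmoidal(scores).items()})
```

Under these assumptions the computed values agree with the printed ones to within rounding, and the RNA model trails the OTH model by about 4.6 bits in this window.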
Number of non-coding RNAs predicted
[Bar graph: No. of ncRNAs predicted (0 to 60) for M.pen, M.myc, M.gal, M.pul, M.pne and M.gen]
Number of ncRNA predicted for
each organism
Range of Non-coding RNA
[Graph: Length Range of Non-coding RNA predicted; length (nt), 0 to 350, for M.pen, M.myc, M.gal, M.pul, M.pne and M.gen]
PICTURE SHOWING THE LENGTH
RANGE OF NON-CODING RNAs.
(Vertical bars represent the spread of scores
and horizontal bars represent the average)
Putative Vs Annotated
• The predicted ncRNAs were searched for similarity against the biochemically characterized ncRNAs of bacteria (Non-coding RNA database at http://biobases.ibch.poznan.pl/nc, updated 2002)
• One putative ncRNA was found to be similar to the Mc_MCS4 ncRNA of Mycoplasma capricolum.
• Mc_MCS4 was already characterized as having extensive homology with
the eukaryotic U6 snRNA.
• Another motif in one of the putative ncRNAs was found to be conserved across E.coli, S.typhi and K.pneumoniae as a part of the MicF ncRNA in these organisms.
• MicF was characterised as regulating the expression of the OmpF protein in
these organisms.
•Similarity was also found with OxyS ncRNA of E.coli.
•OxyS was found to modulate the expression of various genes in response to
Hydrogen peroxide.
- In Eukaryotes
• Similarity was observed with a few miRNAs present in the miRNA database (Rfam miRNA registry)
• The same stretch of sequence was present in Human,
Rat and Mouse miRNA.
• Small stretches of similarity were also observed with various ncRNAs that play a role in the regulation of development.
Sequences producing High-scoring Segment Pairs: Score P(N) N
hsa-mir-190 MI0000486 Homo sapiens miR-190 stem-loop 91 0.26 1
rno-mir-190 MI0000933 Rattus norvegicus miR-190 stem-loop 91 0.26 1
mmu-mir-190 MI0000232 Mus musculus miR-190 stem-loop 86 0.48 1
>hsa-mir-190 MI0000486 Homo sapiens miR-190 stem-loop
Length = 85
Minus Strand HSPs:
Score = 91 (19.7 bits), Expect = 0.31, P = 0.26
Identities = 45/68 (66%), Positives = 45/68 (66%), Strand = Minus / Plus
Query: 77 AGGTTTAGGTGTTCT-TATTT-ATTTATTAGGTTGTTTAGTT--TC-AATTATTTTTGGA 23
||| | |||| | | ||| || |||||||||||| | || || || ||| | | |
Sbjct: 4 AGGCCTCTGTGTGATATGTTTGATATATTAGGTTGTT-ATTTAATCCAACTATATATCAA 62
Query: 22 ATACTAGT 15
| | || |
Sbjct: 63 ACA-TATT 69
>Hs_NTT
Length = 17,572
Plus Strand HSPs:
Score = 116 (23.5 bits), Expect = 0.025, P = 0.024
Identities = 60/94 (63%), Positives = 60/94 (63%), Strand = Plus / Plus
Query: 11 TATTTAATATTTATAATTGCTATTTAGCATCTTAAAA-AAGA-CG-TCTTT-AAA-TATA 65
|| |||| | || ||| | | || | |||| | ||| | |||| ||| ||||
Sbjct: 5336 TACATAAT-TAGATCATTTATTCTAAGTAAATTAAGAGAAGCTCTATCTTCCAAAATATA 5394
Query: 66 GATAGTTATACTAATTAGAAAATAGTTAAT-AAG 98
|||| | || ||| |||| | ||||| |||
Sbjct: 5395 GATATCTCTAGCAAT-AGAAGAGTTTTAATTAAG 5427
CONCLUSIONS
• Comparative genomic analysis was selected for the ncRNA prediction.
• Procedure for the prediction was standardized.
• One of the putative ncRNAs was found to be similar to an already characterized ncRNA from the same genus.
• Conserved region of MicF was found to be present in the putative ncRNA also.
• Identification of the eukaryotic miRNA counterpart in Mycoplasma.
Future Plans
• To develop programmes for getting the intergenic
region co-ordinates given the protein table file as
input.
• To verify the genuineness of the predictions beyond
the homologous regions found in bacteria.
• To extend the prediction procedure for Eukaryotes.
• To develop the procedure required for classification
of the predicted ncRNAs into subclasses.
• To identify the functions of the putative ncRNAs by
searching their effector targets.
• To automate the whole procedure.
ACKNOWLEDGMENTS
Dr. Z. A. Rafi
Dr. S. Krishnaswamy
The Whole SBT family
Ministry of Human Resource Development
Department of Education
Department of Science and Technology
Department of Biotechnology
All my classmates