MSc Project Ppt

NON-CODING RNA PREDICTION OF CLINICALLY

IMPORTANT MYCOPLASMA BY COMPARATIVE

GENOMIC ANALYSIS

Dissertation submitted to the Madurai Kamaraj University in partial fulfillment

of the requirements for the degree of Master of Science in Biotechnology

Regn. No:A242009

School of Biotechnology

Madurai Kamaraj University

Madurai

OBJECTIVES:

• To choose the best possible approach to predict the

ncRNA

• To standardize the procedure required for the

approach selected.

• Identification and characterization of the ncRNAs

from clinically important Mycoplasma.

• To form the base for automating the ncRNA

prediction procedure.

Past

• Sequence similarity search, Statistical analysis, Transcription signal analysis,

Comparative genomic analysis.

• Existing methods are biased to particular classes of ncRNAs only.

• e.g. tRNAscan-SE, MiRscan, etc.

QRNA - A Blend

• Secondary structure alone is not statistically significant for the detection of

ncRNAs.

• Sequences that code for proteins and perform important functions

are also conserved across related organisms.

QRNA was developed to distinguish conserved RNA secondary structures from the

background of other conserved sequences.

OUTLINE

INTERGENIC REGIONS OF ORGANISM OF INTEREST

↓

SEARCH FOR HOMOLOGY ACROSS RELATED ORGANISMS (blastn)

↓

PARSE THE ALIGNMENTS WITH CERTAIN CUTOFFS (Perl scripts)

↓

THE ALIGNMENTS ARE GIVEN AS INPUT TO QRNA

↓

PUTATIVE ncRNA

PROTEIN CODING REGION→INTERGENIC REGION

.ptt file

↓

Co-ordinates of protein coding regions

↓

Intergenic region co-ordinates

↓

Intergenic region co-ordinates with

difference > 50 nucleotides

↓

Range file

↓

Intergenic sequence extraction by EMBOSS application

extractseq -regions @rangefile -separate
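The steps above can be sketched in Python as follows. This is only a minimal illustration, not the programme used in the project: the function name `intergenic_ranges`, its parameters and the one-range-per-line layout of the range file are assumptions made for the example. The co-ordinates are the first entries of the M.genitalium protein table, and the genome length is cut short so that only this stretch is considered.

```python
def intergenic_ranges(genes, genome_length, min_len=50):
    """Return (start, end) gaps between genes that are longer than min_len.

    Co-ordinates are 1-based and inclusive, as in the .ptt file.
    Overlapping genes give a gap of zero or negative length, which is dropped.
    """
    ranges = []
    prev_end = 0
    for start, end in sorted(genes):
        gap_start, gap_end = prev_end + 1, start - 1
        if gap_end - gap_start + 1 > min_len:
            ranges.append((gap_start, gap_end))
        # Keep the furthest end seen so far, so nested genes are handled.
        prev_end = max(prev_end, end)
    # Region after the last gene, up to the end of the genome.
    if genome_length - prev_end > min_len:
        ranges.append((prev_end + 1, genome_length))
    return ranges


# First protein co-ordinates of M.genitalium G37 (from the protein table).
genes = [(735, 1829), (1829, 2761), (2846, 4798), (4813, 7323),
         (7295, 8548), (8552, 9184), (9157, 9921), (9924, 11252)]

# Hypothetical genome_length: truncated to the last gene listed here.
ranges = intergenic_ranges(genes, genome_length=11252)

# Assumed range file layout: one "start end" pair per line.
range_file_text = "\n".join(f"{s} {e}" for s, e in ranges)
print(range_file_text)
```

For the genes listed, only the regions 1 to 734 and 2762 to 2845 pass the length cutoff.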

GENOME LENGTH COMPARISON OF

THE MYCOPLASMA

Organism          Genome size
M.gallisepticum   996,422
M.genitalium      580,074
M.mycoides        1,211,703
M.penetrans       1,358,633
M.pneumoniae      816,394
M.pulmonis        963,879

M.gen - Mycoplasma genitalium

M.pne - Mycoplasma pneumoniae

M.pul - Mycoplasma pulmonis

M.gal - Mycoplasma gallisepticum

M.myc - Mycoplasma mycoides

M.pen - Mycoplasma penetrans

[Figure: Genome Size Comparison. Bar graph of genome length (axis 0 to 1,500,000) for M.pen, M.myc, M.gal, M.pul, M.pne and M.gen.]

MYCOPLASMA GENOME – INTERGENIC REGION

BAR GRAPH SHOWING THE PERCENTAGE OF INTERGENIC

REGION IN THE GENOME OF MYCOPLASMA

[Figure: bar graph (axis 0% to 100%) for M.pen, M.myc, M.gal, M.pul, M.pne, M.gen and Homo sapiens.]

PROTEIN TABLE OF THE GENOME

Mycoplasma genitalium G37 complete genome - 0..580074

480 proteins

Location      Strand  Length  PID      Gene   Synonym  Code  COG  Product
735..1829     +       364     3844620  MG001  -        -     -    (dnaN)
1829..2761    +       310     1045670  MG002  -        -     -    dnaJ
2846..4798    +       650     1045671  MG003  -        -     -    (gyrB)
4813..7323    +       836     1045672  MG004  -        -     -    (gyrA)
7295..8548    +       417     1045673  MG005  -        -     -    (serS)
8552..9184    +       210     1045674  MG006  -        -     -    (tmk)
9157..9921    +       254     1045675  MG007  -        -     -    hypothetical
9924..11252   +       442     1045676  MG008  -        -     -    (tdhF)
……            ..      ..      ……       ……     ..       ..    ..   …

Protein Co-ordinates → Intergenic Co-ordinates

Protein Co-ordinates
735     1829
1829    2761
2846    4798
4813    7323
7295    8548
8552    9184
9157    9921
…….     …….

→

Intergenic Co-ordinates
1       734
2762    2845
4799    4812
7324    7294
8549    8551
9185    9156
…….     …….

CURING

Raw intergenic co-ordinates

Starting  Ending  Length
1         734     734
2762      2845    84
4799      4812    14
7324      7294    -29
8549      8551    3
9185      9156    -28
9922      9923    2
11253     11251   -1
12041     12068   28
12726     12701   -24
13566     13569   4
14434     14395   -38
15317     15555   239

Intergenic region co-ordinates which are

more than 50 nucleotides in length

Starting  Ending  Length
1         734     734
2762      2845    84
15317     15555   239

[Figure: Curing of Intergenic Regions. Bar graph of the number of intergenic regions (axis 0 to 1200) before and after curing for M.pen, M.myc, M.gal, M.pul, M.pne and M.gen.]

GRAPH SHOWING THE CULLING OF THE INTERGENIC SEQUENCES BY THE

C PROGRAMME THAT SELECTS THE REGIONS WHOSE LENGTH IS GREATER THAN OR EQUAL

TO 50 NUCLEOTIDES ONLY

INTERGENIC SEQUENCES

>L43967_2762_2845 Mycoplasma genitalium G37 intergenic sequence
AAAACCTTTCATTTTTAATGTGTTATAATTATTTGTTATGCCATAAATTTAGTTTGTGGCAAAAGCTTCTGTACTGTTTATTTA
>L43967_15317_15555 Mycoplasma genitalium G37 intergenic sequence
ACCCTCAACCTCCTGAGTGCAAATCAGGTGCTCTATCAGTTGAGCTACATCCCCATTATTGGTGGAAGTAAATGGACTTGAACCATCGACCTCACCCTTATCAGGGGTGTGCTCTAACCAACTGAGCTATACTTCCAAGCATAATCCTAAGGGTATTTAACTAATTATTATAACAATTTTAATTTAACCAAAATACCCCTCGAATTTTAACAGTTTTTATAATCAAAACAGCTAATTTT
>L43967_19760_19824 Mycoplasma genitalium G37 intergenic sequence
ATAAATTTAATAGTGTTGAAAGACAAACATTATTAATTTTTGATCAGCTAAATAAAACAAAGCAA
>L43967_20356_20543 Mycoplasma genitalium G37 intergenic sequence
CTCAAAAAACTAATACATCAAACTTCAACCGTTTACTTTTTTATGAACAAGCACTACAAAGGTTTTATGAAGAATTATTTCAAATAGATTATTTAAGAAGATTTGAAAACATTCCCATTAAAGATAAGAATCAAATTGCGCTTTTTAAAACTGTTTTTGATGATTACAAAACCATTGATTTAGCAGAA
…………………………………………………………………………………………………………..

Intergenic sequences extracted in Fasta format
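The FASTA headers shown here appear to carry the accession and the intergenic co-ordinates, separated by underscores. Assuming that naming pattern holds for every record, a small Python check (the function name `parse_intergenic_header` is made up for this example) can confirm that an extracted sequence has the length implied by its co-ordinates.

```python
def parse_intergenic_header(header):
    """Split '>ACCESSION_START_END description' into (accession, start, end)."""
    name = header.lstrip(">").split()[0]
    accession, start, end = name.rsplit("_", 2)
    return accession, int(start), int(end)


# First record from the extracted intergenic sequences.
header = ">L43967_2762_2845 Mycoplasma genitalium G37 intergenic sequence"
sequence = ("AAAACCTTTCATTTTTAATGTGTTATAATTATTTGTTATGCCATAAATTTAGTTTGTGG"
            "CAAAAGCTTCTGTACTGTTTATTTA")

accession, start, end = parse_intergenic_header(header)
# Inclusive co-ordinates: the expected length is end - start + 1.
length_matches = len(sequence) == end - start + 1
print(accession, start, end, length_matches)
```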

Similarity Search - WU BLAST 2.0

Organism          Database created   Organisms in database
M.gallisepticum   gempppdb           M.genitalium, M.mycoides, M.penetrans, M.pneumoniae, M.pulmonis
M.genitalium      gampppdb           M.gallisepticum, M.mycoides, M.penetrans, M.pneumoniae, M.pulmonis
M.mycoides        ggpppdb            M.gallisepticum, M.genitalium, M.penetrans, M.pneumoniae, M.pulmonis
M.penetrans       ggmpnpudb          M.gallisepticum, M.genitalium, M.mycoides, M.pneumoniae, M.pulmonis
M.pneumoniae      ggmpepudb          M.gallisepticum, M.genitalium, M.mycoides, M.penetrans, M.pulmonis
M.pulmonis        ggmpepndb          M.gallisepticum, M.genitalium, M.mycoides, M.penetrans, M.pneumoniae

Table showing the list of databases made and the organisms in each

• Six genome databases were made, each excluding one organism.

• Intergenic sequences of each organism were searched for similarity (blastn)

against the database that does not contain that organism's genome.
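The leave-one-out design of the six databases can be written down compactly. The Python sketch below only works out which genomes belong in the database searched by each query organism. It does not build any BLAST database, and the function name `leave_one_out` is an illustrative choice.

```python
ORGANISMS = [
    "M.gallisepticum", "M.genitalium", "M.mycoides",
    "M.penetrans", "M.pneumoniae", "M.pulmonis",
]


def leave_one_out(organisms):
    """Map each query organism to the genomes of all the other organisms."""
    return {query: [o for o in organisms if o != query] for query in organisms}


databases = leave_one_out(ORGANISMS)
for query, members in databases.items():
    print(f"{query}: search against {', '.join(members)}")
```

Excluding the query genome keeps an intergenic sequence from simply matching itself in the search.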

Parsing alignments - Factors

• The Perl script blastn2qrnadepth.pl is used to parse the blastn alignments.

• Factors considered in parsing

– I trimming

• Evalue

• Minimum and Maximum Identity of alignments

• Length of the alignment

– II trimming

• Score

• Depth of alignments

• Shift
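As a rough illustration of the first trimming, the Python sketch below filters hits on E-value, percent identity and alignment length. It is not the logic of blastn2qrnadepth.pl itself: the `Hit` record, the function name and the example hits are invented, and the default thresholds are the ones printed in the parsing report. The second trimming (score, depth, shift) is left out because the slides do not describe how those factors are applied.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    query: str
    evalue: float
    identity: float  # percent identity of the alignment
    length: int      # alignment length in nucleotides
    score: float


def first_trimming(hits, min_len=1, max_evalue=0.01, min_id=0.0, max_id=100.0):
    """Keep hits that pass the length, E-value and identity cutoffs."""
    return [
        h for h in hits
        if h.length >= min_len
        and h.evalue <= max_evalue
        and min_id <= h.identity <= max_id
    ]


# Invented example hits, for illustration only.
hits = [
    Hit("q1", 1e-20, 92.0, 180, 310.0),
    Hit("q1", 0.5, 70.0, 40, 35.0),     # fails the E-value cutoff
    Hit("q2", 0.009, 66.0, 62, 91.0),
]
kept = first_trimming(hits)
print([h.query for h in kept])
```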

Parsing alignments – QRNA input

• The Perl script generates various files

– QRNA input file : filename.q file

• It is a collection of sequences in FASTA format, where each consecutive pair of

sequences forms the two components of an alignment, with

gaps left in place.

– Parsing report file : filename.q.rep

• It is a report of the blastn alignments that have been

pruned in the process of creating the QRNA input file.
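To make the pairing in the .q file concrete, here is a small Python reader that groups consecutive FASTA records into (query, subject) pairs. It is a sketch under the assumption that records alternate strictly between the two members of an alignment. The function name `read_q_pairs` and the toy records are invented for the example.

```python
def read_q_pairs(text):
    """Group consecutive FASTA records into (query, subject) alignment pairs."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            records.append([line[1:], ""])
        elif records:
            # Sequence lines may be wrapped; join them, keeping gap characters.
            records[-1][1] += line
    if len(records) % 2:
        raise ValueError("expected an even number of records")
    return [
        (tuple(records[i]), tuple(records[i + 1]))
        for i in range(0, len(records), 2)
    ]


# Toy input, not taken from the project data.
toy = ">query-1>5\nAC-GT\n>subject-10>14\nACAGT\n"
pairs = read_q_pairs(toy)
(q_name, q_seq), (s_name, s_seq) = pairs[0]
print(q_name, q_seq, s_name, s_seq)
```

Because gaps are left in place, the two sequences of a pair have the same length, which gives a quick sanity check on a parsed file.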

QRNA input file

>L43967_15317_15555-1>179-Mycoplasma

ACCCTCAACCTCCTGAGTGCAAATCAGGTGCTCTATCAGTTGAGCTACATCCCCATTATT

GGTGGAAGTAAATGGACTTGAACCATCGACCTCACCCTTATCAGGGGTGTGCTCTAACCA

ACTGAGCTATACTTCCAAGCATAATCCTAAGGGTAT-TTAACTA-ATTATTATAACAATT

T

>gb-U00089--19096>19275-Mycoplasma

ACCCTCAACCTCCTGAGTGCAAATCAGGTGCTCTATCAGTTGAGCTACATCCCCATTATT

GGTGGAAGTAAATGGACTTGAACCATCGACCTCACCCTTATCAGGGGTGTGCTCTAACCA

ACTGAGCTATACTTCCAGGCAAAATCTTC-GTACAGGTTCGCTTCATAATTATATTAATT

T

>L43967_19760_19824-5<65-Mycoplasma

TTGCTTTGTTTTATTTAGCTGATCAA-AAATTAATAATGTTTGTCTTTCAACACTATTAA

AT

>emb-BX293980.1--57200>57261-Mycoplasma

TTGTTTTGTTTTATTTAATTGATCAATAAATTGATTTAGTTTATCTTTATTTATTAATAA

AT

Parsing Report File

FILE: genblast

DIR: /home/kalyankpy/coput2/blast//

FIRST TRIMMING

Minimum length = 1

Maximum Evalue = 0.01

Minimum %id = 0

Maximum %id = 100

SECOND TRIMMING

Alignments culled by = SC

Depth of alignments = 1

shift = 1

113-QUERY: L43967_546708_546877 Mycoplasma genitalium G37 intergenic sequence

Total # alignments: 1121 After First trimming: 88 After Second trimming: 2

57-QUERY: L43967_325878_326027 Mycoplasma genitalium G37 intergenic sequence

Total # alignments: 152 After First trimming: 3 After Second trimming: 3

……………………………………………………………………………………………….

……………………………………………………………………………………………….

Total #Queries 122

Total #Alignments 53927 ave_len = 309.5

After first trimming 18851 ave_len = 552.6

After second trimming 386 ave_len = 404.2

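No. of Blast hits

[Figure: bar graph for M.pen, M.myc, M.gal, M.pul, M.pne and M.gen showing the number of alignments and the number of blastn hits selected for QRNA input.]

Data labels on the graph, in the order extracted from the slide:

No. of blastn hits selected for qrna input: 850, 1012, 565, 386, 1852, 1787

No. of alignments: 53927, 44433, 360830, 154026, 430551, 560263

GRAPH SHOWING NUMBER OF ALIGNMENTS

SELECTED FOR QRNA INPUT FOR EACH

GENOME THROUGH THE PERL SCRIPT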

QRNA – PARAMETERS

• Scanning window approach

– Window =150 nt; Extension = 50 nt

• Maximum length 9999999

• Local Viterbi algorithm

• RIBOPROB matrix

• Shuffling the sequence maintaining the composition
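The scanning window setting can be pictured with a short Python sketch that lists the windows laid over an alignment of a given length, using a 150 nt window moved in 50 nt steps. How QRNA treats the final, shorter stretch is not stated on the slides, so ending with one last window that reaches the end of the alignment is an assumption here, as is the function name `scan_windows`.

```python
def scan_windows(length, window=150, slide=50):
    """Return 0-based inclusive (start, end) windows over an alignment."""
    if length <= 0:
        return []
    windows = []
    start = 0
    while True:
        end = min(start + window, length)
        windows.append((start, end - 1))
        if end == length:
            # Assumed behaviour: stop once a window reaches the alignment end.
            break
        start += slide
    return windows


# A 664-column alignment, the length reported in the QRNA output.
windows = scan_windows(664)
print(windows[0], windows[1], windows[-1], len(windows))
```

The first window, 0 to 149, matches the posX and posY ranges shown in the QRNA output.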

QRNA OUTPUT

#---------------------------------------------------------------------

# qrna 2.0.1 (Tue Aug 19 11:30:55 CDT 2003) using squid 1.5m (Sept 1997)

#---------------------------------------------------------------------

# PAM model = BLOSUM62

#---------------------------------------------------------------------

# RNA model = /mix_tied_linux.cfg

# RIBOPROB matrix = /RIBOPROB85-60.mat

#---------------------------------------------------------------------

# seq file = /home/kalyankpy/perlscriptresult/genblast.q

# #seqs: 772 (max_len = 3420)

#---------------------------------------------------------------------

# window version: window = 150 slide = 50 -- length range = [0,9999999]

#---------------------------------------------------------------------

# 1 [both strands] (sre_shuffled)

>L43967_1_734-90>722-Mycoplasma (664)

>gb-U00089--130>767-Mycoplasma (664)

length of whole alignment after removing common gaps: 664

Divergence time (variable): 0.401

[alignment ID = 61.75 MUT = 29.67 GAP = 8.58]

………………………………………………………… ……………….. ( CONTD..)

length alignment: 150 (id=61.33) (mut=32.67) (gap=6.00)(sre_shuffled)

posX: 0-149 [0-145](146) -- (0.42 0.08 0.06 0.43)

posY: 0-149 [0-144](145) -- (0.37 0.11 0.06 0.46)

L43967_1_734-90 TTAATTTTATTAAAACTATAACTTATTTTTTATAAACATTCTATGTTTTT

gb-U00089--130> TTTATTTTATTAAAATTATAATGTATTTTTGTTAAATTTT.TAATTCTTT

………………………………………………………………………………………………………………………………………………………………………………

LOCAL_DIAG_VITERBI -- [Inside SCFG]


OTH ends *(+) = (0..[150]..149)

OTH ends (-) = (0..[150]..149)

COD ends *(+) = (120..[27]..146)

COD ends (-) = (41..[12]..52)

RNA ends *(+) = (0..[21]..20)

RNA ends (-) = (0..[150]..149)

winner = OTH

OTH = 184.281 COD = 166.408 RNA = 179.710

logoddspostOTH = 0.000 logoddspostCOD = -17.873 logoddspostRNA = -4.571

sigmoidalOTH = 4.571 sigmoidalCOD = -17.932 sigmoidalRNA = -4.571
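In the output above, the winner is the model with the highest score, and each logoddspost value equals that model's score minus the OTH score. The Python lines below reproduce this arithmetic from the three printed scores. They describe only what can be read off this one window and are not a statement of how QRNA computes its scores internally; the variable names are illustrative.

```python
# Scores printed for the window above.
scores = {"OTH": 184.281, "COD": 166.408, "RNA": 179.710}

# The winning model is the one with the highest score.
winner = max(scores, key=scores.get)

# Score of each model relative to the OTH model.
relative_to_oth = {
    model: round(score - scores["OTH"], 3) for model, score in scores.items()
}
print(winner, relative_to_oth)
```

Here COD gives 166.408 - 184.281 = -17.873 and RNA gives 179.710 - 184.281 = -4.571, matching the logoddspostCOD and logoddspostRNA values.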

No. of ncRNAs predicted

[Figure: bar graph of the number of ncRNAs predicted (axis 0 to 60) for M.pen, M.myc, M.gal, M.pul, M.pne and M.gen.]

Number of ncRNA predicted for

each organism

Length Range of Non-coding RNA predicted

[Figure: length in nt (axis 0 to 350) of the predicted ncRNAs for M.pen, M.myc, M.gal, M.pul, M.pne and M.gen.]

PICTURE SHOWING THE LENGTH

RANGE OF NON-CODING RNAs.

(Vertical bars represent the spread of scores

and horizontal bars represent the average)

Putative Vs Annotated

• The predicted ncRNAs were searched for similarity against the biochemically characterized ncRNAs of Bacteria (Non-coding RNA database at http://biobases.ibch.poznan.pl/nc, updated 2002).

• One was found to be similar to the Mc_MCS4 ncRNA of Mycoplasma capricolum.

• Mc_MCS4 has already been characterized as having extensive homology with

the eukaryotic U6 snRNA.

• Another motif in one of the putative ncRNAs was found to be conserved across E.coli, S.typhi and K.pneumoniae as a part of the MicF ncRNA in these organisms.

• MicF has been characterized as regulating the expression of the OmpF protein in

these organisms.

• Similarity was also found with the OxyS ncRNA of E.coli.

• OxyS was found to modulate the expression of various genes in response to

hydrogen peroxide.

- In Eukaryotes

• Similarity was observed with a few miRNAs present in the miRNA database (Rfam miRNA registry).

• The same stretch of sequence was present in Human,

Rat and Mouse miRNA.

• Small stretches of similarity were also observed with various ncRNAs that play a role in the regulation of development.

Sequences producing High-scoring Segment Pairs: Score P(N) N

hsa-mir-190 MI0000486 Homo sapiens miR-190 stem-loop 91 0.26 1

rno-mir-190 MI0000933 Rattus norvegicus miR-190 stem-loop 91 0.26 1

mmu-mir-190 MI0000232 Mus musculus miR-190 stem-loop 86 0.48 1

>hsa-mir-190 MI0000486 Homo sapiens miR-190 stem-loop

Length = 85

Minus Strand HSPs:

Score = 91 (19.7 bits), Expect = 0.31, P = 0.26

Identities = 45/68 (66%), Positives = 45/68 (66%), Strand = Minus / Plus

Query: 77 AGGTTTAGGTGTTCT-TATTT-ATTTATTAGGTTGTTTAGTT--TC-AATTATTTTTGGA 23

||| | |||| | | ||| || |||||||||||| | || || || ||| | | |

Sbjct: 4 AGGCCTCTGTGTGATATGTTTGATATATTAGGTTGTT-ATTTAATCCAACTATATATCAA 62

Query: 22 ATACTAGT 15

| | || |

Sbjct: 63 ACA-TATT 69

>Hs_NTT

Length = 17,572

Plus Strand HSPs:

Score = 116 (23.5 bits), Expect = 0.025, P = 0.024

Identities = 60/94 (63%), Positives = 60/94 (63%), Strand = Plus / Plus

Query: 11 TATTTAATATTTATAATTGCTATTTAGCATCTTAAAA-AAGA-CG-TCTTT-AAA-TATA 65

|| |||| | || ||| | | || | |||| | ||| | |||| ||| ||||

Sbjct: 5336 TACATAAT-TAGATCATTTATTCTAAGTAAATTAAGAGAAGCTCTATCTTCCAAAATATA 5394

Query: 66 GATAGTTATACTAATTAGAAAATAGTTAAT-AAG 98

|||| | || ||| |||| | ||||| |||

Sbjct: 5395 GATATCTCTAGCAAT-AGAAGAGTTTTAATTAAG 5427

CONCLUSIONS

• Comparative genomic analysis was selected for the ncRNA prediction.

• Procedure for the prediction was standardized.

• One of the putative ncRNAs was found to be similar to an already characterized ncRNA from the same genus.

• A conserved region of MicF was also found to be present in a putative ncRNA.

• Identification of the eukaryotic miRNA counterpart in Mycoplasma.

Future Plans

• To develop programmes for getting the intergenic

region co-ordinates given the protein table file as

input.

• To verify the genuineness of the predictions beyond

the homologous regions found in bacteria.

• To extend the prediction procedure to Eukaryotes.

• To develop the procedure required for classification

of the predicted ncRNAs into subclasses.

• To identify the functions of the putative ncRNAs by

searching for their effector targets.

• To automate the whole procedure.

ACKNOWLEDGMENTS

Dr. Z. A. Rafi

Dr. S. Krishnaswamy

The Whole SBT family

Ministry of Human Resource Development

Department of Education

Department of Science and Technology

Department of Biotechnology

All my classmates
