Towards a model for -1 frameshift sites
1
Towards a model for -1 frameshift sites
Alain Denise (1,2), Michaël Bekaert (1), Laure Bidou (1), Guillemette Duchateau-Nguyen (1),
Jean-Paul Forest (2), Christine Froidevaux (2),
Isabelle Hatin (1), Jean-Pierre Rousset (1), Michel Termier (1)
(1) IGM (Institut de Génétique et Microbiologie)
(2) LRI (Laboratoire de Recherche en Informatique)
Université Paris-Sud, Orsay
2
Translation
5’ CAU AUG GAU UAC AUG GUC UAA GAU 3’
mRNA
3
Translation
5’ CAU AUG GAU UAC AUG GUC UAA GAU 3’
The ribosome reads bases in triplets (codons), starting from a START codon
4
Translation
5’ CAU AUG GAU UAC AUG GUC UAA GAU 3’
The ribosome synthesizes one amino acid per codon
9
Translation
5’ CAU AUG GAU UAC AUG GUC UAA GAU 3’
Synthesis continues until a STOP codon is read
1 mRNA gives 1 protein
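The reading rule on these slides (start at a START codon, read one triplet at a time, stop at a STOP codon) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' tooling: the function name `translate` is made up for this sketch, the standard genetic code is assumed, and the first AUG found is taken as the START codon.

```python
# Standard genetic code, built from the conventional U, C, A, G ordering.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(mrna, start_codon="AUG"):
    """Translate from the first START codon up to the first in-frame STOP."""
    seq = mrna.replace(" ", "").upper()
    start = seq.find(start_codon)  # assumption: the first AUG is the START
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(seq) - 2, 3):
        amino_acid = CODON_TABLE[seq[i:i + 3]]
        if amino_acid == "*":  # STOP codon: synthesis ends here
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("CAU AUG GAU UAC AUG GUC UAA GAU"))  # MDYMV
```

On the slide's mRNA, reading starts at the first AUG and ends at UAA, so the leading CAU and trailing GAU are not translated.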
10
Experimental fact
• Some mRNAs encode two distinct proteins with the same 5’ end
11
Programmed -1 frameshifting
Non-deterministic event
ORF1a: 0 phase, from START0 to STOP0 (usual translation)
ORF1b: -1 phase, ending at STOP-1 (-1 frameshift)
1 mRNA gives 2 distinct proteins in a precise ratio
12
Typical -1 frameshift site [Brierley, 1989]
NNX XXY YYZ
AUG P SP
S1
L1
S2
L2
L’1
Slippery sequence Secondary structure
5’
3’
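As a rough illustration of the slippery-sequence part of this model, the heptamer X XXY YYZ can be searched for with a regular expression. The sketch below encodes only what the slide shows (three identical bases, three identical bases, then any base). It puts no further constraint on X, Y or Z, ignores the reading frame, and does not look at the downstream secondary structure. The names `SLIPPERY` and `find_slippery_sites` are hypothetical.

```python
import re

# X XXY YYZ: three identical bases, three identical bases, then any base.
# The lookahead lets overlapping candidates be reported as well.
SLIPPERY = re.compile(r"(?=(([ACGU])\2\2([ACGU])\3\3[ACGU]))")


def find_slippery_sites(mrna):
    """Return (0-based position, heptamer) for every X XXY YYZ candidate."""
    seq = mrna.replace(" ", "").upper()
    return [(m.start(), m.group(1)) for m in SLIPPERY.finditer(seq)]


# IBV slippery sequence followed by its spacer, as written on the slides.
print(find_slippery_sites("UAU UUA AAC GGG UAC"))  # [(2, 'UUUAAAC')]
```

Position 2 is the third base of the first codon, which matches the NNX XXY YYZ layout in the 0 phase.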
13
IBV frameshift site
UAU UUA AAC
AUG
S1
S2
Slippery sequence Pseudoknot
5’
3’
GGGUAC
UGACGAUGGGG
GCUG AUACCCC
A G G C U C G
U C C G A G C
G
UUGC
GAAA
15
Translation with frameshift
UAU UUA AAC GGG UAC
AUG
5’
3’
UGACGAUGGGG
GCUG AUACCCC
A G G C U C G
U C C G A G C
G
UUGC
GAAA
16
Translation with frameshift
UAU UUA AAC GGG UAC
5’
3’
UGACGAUGGGG
GCUG AUACCCC
A G G C U C G
U C C G A G C
G
UUGC
GAAA
17
Translation with frameshift
UAU UUA AAC GGG UAC
5’
3’
UGACGAUGGGG
GCUG AUACCCC
A G G C U C G
U C C G A G C
G
UUGC
GAAA
-1 shift
18
UA UUU AAA CGG GUA CGG GGU AGC AGU
Translation with frameshift
5’
3’
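The change of reading frame shown on these slides can be reproduced by splitting the same nucleotide string at different offsets. The sketch below is illustrative only. `split_codons` and `read_with_minus1_shift` are hypothetical names, and the slip is modelled simply as "resume reading one base upstream of where the 0 phase stopped", which is an assumption made for this example.

```python
def split_codons(seq, offset=0):
    """Split seq into complete triplets, starting at the given offset."""
    return [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]


def read_with_minus1_shift(seq, slip_end):
    """Read in the 0 phase up to slip_end, then step back one base and go on.

    slip_end is the 0-based index just past the last codon read in the
    0 phase (assumed here to be the end of the slippery sequence).
    """
    return split_codons(seq[:slip_end]) + split_codons(seq, slip_end - 1)


ibv = "UAUUUAAACGGGUACGGGGUAGCAGU"

print(split_codons(ibv, 0))  # 0 phase: UAU UUA AAC GGG UAC ...
print(split_codons(ibv, 2))  # -1 phase: UUU AAA CGG GUA CGG GGU AGC AGU
print(read_with_minus1_shift(ibv, 9))  # 0 phase up to AAC, then CGG GUA ...
```

In this simple model the C at the end of the slippery sequence is read twice: once as the last base of AAC and once as the first base of CGG.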
22
Goals
To improve the known model for viral frameshift sites
To identify new frameshift sites in viral and non-viral genomes
23
Our approach
Biological sequences
Formal models
Prediction tools
In silico and in vivo validation
Applications to other genomes
represent, explain, predict
24
IBV frameshift site: spacer
5’
3’
GGGUAC
25
Spacer consensus
HAST-1 UAC AAA
BEV UGU UG
EAV UGA GAG
HCV GAG UC
IBV GGG UAC
MHV GGG UU
TGEV GAG
RCNMV UAG GC
BWYV GGA GUG
PLRV GGG CAA
BLV UAA UAG A
FIV UGG AAG GC
HIV-1 GGG AAG AU
HTLV-2 UCC UUA A
JSR UGG GUG A
MMTV gag-pro UUG UAA A
MMTV pro-pol UGA U
RSV UAG GGA
SRV-1 GGA CUG A
Consensus UGG UAG A
          GAA GUA
26
Lab experiments
lacZ luc
-1 phase
pSV40 lacZ luc
0 phase
pSV40 FS signal
FS signal N
Test construct
Control construct
Expression reporter FS reporter
27
Spacer: lab experiments
Spacer                   relative FS rate
wild-type IBV   GGGUA    100
U mutant        UGGUA    100
A mutant        AGGUA    55
C mutant        CGGUA    32
CC mutant       CCGUA    70
CCU mutant      CCUUA    49
28
Refining the model: Machine learning
• To identify relevant properties that characterize FS sites
• Disjunctive learning: not all sequences frameshift for the same reasons [Giedroc et al., 2000]
29
Annotating data: spacer
5’
3’
GGGUAC
30
Example of data: SP
• SP = GGGUAC
– number of A = 1; C = 1; G = 3; U = 1;
– % of A = 17; C = 17; G = 50; U = 17;
– first = G;
– last = C;
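The spacer attributes listed above are simple to derive from the sequence. A minimal sketch follows. The function name `spacer_features` and the dictionary layout are illustrative choices, not the format actually used for the learning data, and percentages are rounded to the nearest integer.

```python
from collections import Counter


def spacer_features(spacer):
    """Compute base counts, base percentages, first and last base of a spacer."""
    seq = spacer.replace(" ", "").upper()
    if not seq:
        raise ValueError("empty spacer")
    counts = Counter(seq)
    return {
        "length": len(seq),
        "count": {b: counts.get(b, 0) for b in "ACGU"},
        # Percentages are taken over the spacer length and rounded.
        "percent": {b: round(100 * counts.get(b, 0) / len(seq)) for b in "ACGU"},
        "first": seq[0],
        "last": seq[-1],
    }


features = spacer_features("GGGUAC")
print(features["count"])    # {'A': 1, 'C': 1, 'G': 3, 'U': 1}
print(features["percent"])  # {'A': 17, 'C': 17, 'G': 50, 'U': 17}
```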
31
Annotating data: stem 1
UGACGAUGGGG
GCUG AUACCCC
5’
3’
32
Example of data: stem 1
• S1 =
– 5' side : GGGGUAGCAGU
– 3' side : CCCCAUAGUCG
– stability : -20.7 kcal/mol
33
Annotating data: full sequence
U UUA AAC
5’
3’
GGGUAC
UGACGAUGGGG
GCUG AUACCCC
A G G C U C G
U C C G A G C
G
UUGC
GAAA
34
Example of data : FS rate
FS rate = 22 %
35
GloBo
Disjunctive learning algorithm
Suited to small amounts of data
Won the PTE challenge on analogous data
36
Example of rules
If
SP length 5 and number of G in S1.5’ bottom half 3 and
number of G in S1.5’ 4 and %T in S2.5’ 30 and %G in S2.5’ 70
then FS rate 5%
If %G in S1.5' bottom half 80 and %C in L1 45
then FS rate 5%
If
SP length 5 and S1.3' length 6 and %C in S1.3' 45
then FS rate 5%
...
37
Covering and prediction
If
SP length 5 and number of G in S1.5’ bottom half 3 and
number of G in S1.5’ 4 and %T in S2.5’ 30 and %G in S2.5’ 70
then FS rate 5%
Covering of examples : 70 %
Examples predicted in test set : 80 %
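To make "covering" concrete: a rule is a conjunction of conditions on annotated attributes, and an example is covered when it satisfies all of them. The sketch below is a hedged illustration, not GloBo itself. The comparison operators are assumptions (they do not appear in this transcript), the feature names are invented, and the four examples are toy values, not data from the study.

```python
import operator


def rule_covers(rule, example):
    """True if the example satisfies every condition of the rule."""
    return all(op(example[feature], threshold) for feature, op, threshold in rule)


def coverage(rule, examples):
    """Fraction of examples covered by the rule."""
    if not examples:
        return 0.0
    return sum(rule_covers(rule, e) for e in examples) / len(examples)


# Hypothetical encoding of one rule; the operators (<=, >=) are assumed.
rule = [
    ("sp_length", operator.le, 5),
    ("s1_3p_length", operator.ge, 6),
    ("pct_c_s1_3p", operator.ge, 45),
]

# Toy examples, invented for illustration only.
toy_examples = [
    {"sp_length": 5, "s1_3p_length": 6, "pct_c_s1_3p": 50},   # covered
    {"sp_length": 6, "s1_3p_length": 6, "pct_c_s1_3p": 50},   # spacer too long
    {"sp_length": 4, "s1_3p_length": 11, "pct_c_s1_3p": 45},  # covered
    {"sp_length": 5, "s1_3p_length": 6, "pct_c_s1_3p": 20},   # too little C
]

print(coverage(rule, toy_examples))  # 0.5
```

A disjunctive learner keeps several such rules, so an example missed by one rule can still be covered by another.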
38
Is R1 relevant for frameshifting?
                Stem 1 5’-side   relative FS rate   R1
wild-type IBV   GGGGU AUCAGU     100                yes
mutant 1        GGUCG AUCAGU     41                 yes
mutant 2        GGGGU UCUACA     55                 yes
mutant 3        GCUCG AUCAGU     36                 no
mutant 4        GCCCU AUCAGU     73                 no
39
Covering and prediction
If
SP length 5 and S1.3' length 6 and %C in S1.3' 45
then FS rate 5%
Covering of examples : 45 %
Examples predicted in test set : 40 %
40
Conclusion
• Spacer:
– correlation between primary sequence and FS rate has been established
– systematic experimentation going on
41
Conclusion
Biological sequences
Formal models
Prediction tools
In silico and in vivo validation
Applications to other genomes
58
Spacer
Virus : Sequence
HAST-I : U A C A A A
BEV : U G U U G
EAV : U G A G A G
HCV : G A G U C
IBV : G G G U A C
MHV : G G G U U
TGEV : G A G
RCNMV : U A G G C
BWYV : G G A G U G
PLRV : G G G C A A
BLV : U A A U A G A
FIV : U G G A A G G C
HIV-1 : G G G A A G A U
HTLV-II : U C C U U A A
JSR : U G G G U G A
MMTV : U U G U A A A
MMTV : U G A U
RSV : U A G G G A
SRV-1 : G G A C U G A
Consensus : U G G U A G A
            G A A G U A