The Pennsylvania State University
The Graduate School
PHYSICAL BIOINFORMATICS METHODS TO
UNDERSTAND THE CAUSES AND CONSEQUENCES OF
VARIABLE CODON TRANSLATION RATES
A Dissertation in
Bioinformatics and Genomics
by
Nabeel Ahmed
© 2019 Nabeel Ahmed
Submitted in Partial Fulfillment
of the Requirements
for the Degree of
Doctor of Philosophy
August 2019
The dissertation of Nabeel Ahmed was reviewed and approved* by the following:
Edward P. O’Brien
Associate Professor of Chemistry
Dissertation Adviser
Chair of Committee
István Albert
Associate Professor of Bioinformatics
Sarah M. Assmann
Waller Professor of Biology
Naomi S. Altman
Professor of Statistics and Bioinformatics
Cooduvalli S. Shashikant
Associate Professor of Molecular and Developmental Biology
Chair, Intercollege Graduate Degree Program in Bioinformatics and Genomics
*Signatures are on file in the Graduate School.
ABSTRACT
The process of translating the genetic information encoded in an mRNA molecule to a
protein is crucial to cellular life and plays an important role in regulating gene expression.
The steady state in vivo protein concentrations are determined in part at the level of
translation. Therefore, uncovering the mechanisms of translational control can help us
understand a crucial component of cellular dynamics. The rates at which individual codons
are translated play an important role in deciding the fate of nascent proteins and affect the
downstream cellular processes they take part in. Hence, measurement of the translation
rates at all codon positions within a transcript would help us understand their role in
regulating co-translational processes such as protein folding and chaperone binding. With
the development of high-throughput Next Generation Sequencing technology in the last
decade, a method called Ribo-Seq can capture a transcriptome-wide snapshot of
translation at nucleotide resolution. However, no gold-standard method for extracting
translation rates from Ribo-Seq data exists and there have been contradictory biological
inferences drawn from different analysis methods. In this dissertation, I present novel
methods based on mathematical optimization and chemical kinetic modeling to correctly
identify the A-site within Ribo-Seq reads and quantify absolute codon translation rates.
This dissertation also highlights two novel biological insights and discoveries, namely i)
that the primary structure of a protein encodes translation rate information through pairs
of evolutionarily selected amino acids and ii) that translation kinetics and co-translational
chaperone binding are coordinated.
In Chapter 1, I describe the current state of research in translation and how
translation rates have been estimated previously. I also discuss current methods for
analyzing Ribo-Seq data and their limitations.
In Chapter 2, I report a method that solves the essential first step of determining
where the A-site of the ribosome was located on ribosome-protected mRNA fragments generated
by Ribo-Seq. It is well known that, during translation elongation, the A-site of a ribosome
can occupy only the coding region between the second codon and the stop codon of a transcript. Turning
this fundamental fact into a mathematical optimization problem, I identify the offset of the
A-site from the 5′ end of the fragment that maximizes the number of reads between the
second and stop codons of a transcript. A-site offset tables are generated for a wide range
of fragment sizes obtained from Ribo-Seq data for S. cerevisiae and mouse embryonic
stem cells. I present results showing that this method outperforms 11 other contemporary
methods for estimating the A-site position using known A-site stalling signals in polyproline
motifs.
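The offset-selection idea described for Chapter 2 can be illustrated with a minimal sketch. This is not the Integer Programming algorithm of the dissertation: it replaces the solver with a brute-force scan over candidate offsets for a single fragment size and frame on a single toy transcript. The function names, the candidate offsets, the toy coordinates, and the choice to count the stop codon as an allowed A-site position are all assumptions made for illustration.

```python
def count_reads_in_cds(five_prime_ends, offset, cds_start, stop_start):
    """Count reads whose inferred A-site nucleotide (5' end + offset) lies
    between the second codon and the stop codon of the transcript."""
    first_allowed = cds_start + 3  # first nucleotide of the second codon
    last_allowed = stop_start + 2  # last nucleotide of the stop codon (inclusive; an assumption)
    return sum(first_allowed <= p + offset <= last_allowed for p in five_prime_ends)


def best_offsets(five_prime_ends, cds_start, stop_start, candidates):
    """Score every candidate offset and return the offset(s) with the highest
    in-CDS read count. A tie means the offset is ambiguous for this input."""
    scores = {
        d: count_reads_in_cds(five_prime_ends, d, cds_start, stop_start)
        for d in candidates
    }
    top = max(scores.values())
    return [d for d in candidates if scores[d] == top], scores


# Hypothetical toy transcript (0-based nucleotide coordinates):
# start codon at 30, stop codon at 60, so allowed A-site positions are 33..62.
reads = [18, 18, 21, 30, 45, 45]  # 5' end positions of reads of one size and frame
winners, scores = best_offsets(reads, cds_start=30, stop_start=60, candidates=[12, 15, 18])
print(winners, scores)  # [15] {12: 4, 15: 6, 18: 4}
```

In this toy case only an offset of 15 nucleotides places all six reads inside the allowed region. Smaller offsets push reads onto the start codon and larger offsets push them past the stop codon, which is the signal the optimization exploits.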
In Chapter 3, I present a method for estimating absolute codon translation rates
based on chemical kinetic modeling of translation. Applying this method to high-coverage
transcripts, I show that codon translation rates vary by up to 26-fold in S.
cerevisiae and that even the same codon type, at different positions on a single transcript, can
have very different translation rates. Several molecular factors, including cognate tRNA
concentration, downstream mRNA secondary structure, and the presence of proline in the P-site,
are identified that influence the translation rate of the codon in the A-site. Hence, codon
translation rates are determined mostly by the context of the region flanking the codon
within a transcript.
In Chapter 4, I describe the novel discovery that the chemical identity of pairs of
amino acids, when located in the P-site and A-site of the ribosome, can causally and
predictably influence codon translation rates. Analysis of Ribo-Seq data from S. cerevisiae
revealed correlations indicating that particular pairs of amino acids, when present
in the P-site and A-site, can slow down or speed up the translation of the codon in the
A-site. To test for causation, twelve amino acid mutations were introduced into the primary
structure of non-essential S. cerevisiae proteins that the bioinformatic analysis predicts
will either speed up, slow down, or cause no change in translation rate when the mutated
residue is in the P-site. In all cases, the resulting change in ribosome density at the A-site
matches the prediction. Enrichment/depletion analyses of these amino acid pairs across
the proteome suggest evolutionary pressures are selecting against slow-translating pairs
of amino acids, but retaining them in regions where they might aid the efficiency of co-
translational processes.
Chapter 5 of this dissertation presents the first evidence of
coordination between translation kinetics and co-translational binding of chaperones.
Using an in vivo selective ribosome profiling approach, the binding profile of the Hsp70
chaperone Ssb was characterized and correlated with codon translation rates obtained
from ribosome profiling. It was found that periods of Ssb binding to the nascent
polypeptide chain outside the ribosome exit tunnel were correlated with faster translation
of mRNA segments within the ribosome. This translational speedup is maintained in a
strain with Ssb deleted, indicating that the speedup is caused by features encoded within
the mRNA. I demonstrate that the distribution of molecular factors highlighted in Chapters
3 and 4 across these mRNA fragments causes a speedup of translation in these fragments
to coincide with binding of Ssb.
In Chapter 6, I summarize my findings and their implications for characterizing the
principles of translation kinetics and their influence on co-translational processes. The
methods presented in this dissertation will hopefully provide an easy-to-implement,
standardized protocol for processing Ribo-Seq data: correctly mapping the reads using
the provided offset table and then quantifying absolute rates. The identification of a novel factor such as
amino acid pairs should motivate researchers to investigate the importance of these pairs and
the potential role that loss of this pairing at sensitive sites plays in causing disorders. The
coordination of translation kinetics with co-translational folding should open up avenues to
investigate the loss of chaperone binding due to altered translation kinetics caused by
synonymous mutations. Finally, the methods and studies described in this dissertation
demonstrate the integration of useful information from next-generation sequencing datasets
with chemical kinetic models. The projects in this dissertation also showcase the power of
biophysical modelling in explaining the dynamics of cellular processes and offer a
multidisciplinary perspective on biology from the physical sciences.
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
ACKNOWLEDGEMENTS
Chapter 1 INTRODUCTION
1.1 Overview
1.2 Translation and its importance
1.3 Previous estimates of translation rates
1.4 Ribo-Seq measures the location and number of actively translating ribosomes
1.4.1 Approaches for identifying the A-site
1.5 Approaches for estimating translation rates using Ribo-Seq
1.6 Molecular factors influencing translation elongation
1.7 Influence of translation kinetics on co-translational processes
1.8 Objectives of dissertation
Chapter 2 IDENTIFYING A- AND P-SITE LOCATIONS ON RIBOSOME-PROTECTED MRNA
FRAGMENTS USING INTEGER PROGRAMMING
2.1 Abstract
2.2 Introduction
2.3 Results
2.3.1 Integer Programming Algorithm
2.3.2 Illustrating the Integer Programming optimization procedure
2.3.3 A-site locations in S. cerevisiae Ribo-Seq data are fragment size and frame dependent
2.3.4 Higher coverage leads to more unique offsets
2.3.5 Consistency across different datasets
2.3.6 Robustness of the offset table to threshold variation
2.3.7 Testing the Integer Programming algorithm against artificial Ribo-Seq data
2.3.8 A-site offsets in mouse embryonic stem cells
2.3.9 Integer Programming does not yield unique offsets for E. coli
2.3.10 Reproducing known PPX and XPP motifs that lead to translational slowdown
2.3.11 Greater A-site location accuracy than other methods
2.4 Discussion
2.5 Methods
2.5.1 Ribo-Seq datasets
2.5.2 Gene selection, analyses and statistical tests
2.6 Acknowledgements
2.7 Data Availability
Chapter 3 A CHEMICAL KINETIC BASIS FOR MEASURING TRANSLATION ELONGATION RATES
FROM RIBOSOME PROFILING DATA
3.1 Abstract
3.2 Author Summary
3.3 Introduction
3.4 Results
3.4.1 Theory
3.4.2 Application
3.5 Discussion
3.6 Methods
3.6.1 Simulated steady state ribosome profiling data
3.6.2 In silico measurement of average protein synthesis and codon translation times
3.6.3 Analysis of ribosome profiling and RNA-Seq data
3.6.4 Assignment of mRNA secondary structure
Chapter 4 EVOLUTIONARILY SELECTED AMINO ACID PAIRS ENCODE TRANSLATION-ELONGATION
RATE INFORMATION
4.1 Abstract
4.2 Main Text
Chapter 5 EVOLUTIONARILY-ENCODED TRANSLATION KINETICS COORDINATE
CO-TRANSLATIONAL SSB CHAPERONE BINDING IN YEAST
5.1 Abstract
5.2 Introduction
5.3 Results
5.3.1 Selective Profiling of Ssb-Bound Ribosomes
5.3.2 Coordination of Ssb Binding with Translation Elongation Rates
5.4 Discussion
5.5 Methods
5.5.1 Translation kinetics analysis
5.5.2 Speed-up of translation
5.5.3 Contribution of mRNA versus Ssb binding
5.5.4 Enrichment/Depletion of Fast/Slow codons
5.5.5 Upstream charged residues
5.5.6 Downstream mRNA secondary structure
Chapter 6 CONCLUSIONS AND FUTURE DIRECTIONS
6.1 Conclusions
6.2 Future Directions
6.2.1 Synonymous mutations and diseases
6.2.2 Test phenotypic effect of loss of amino acid pairing due to mutations
6.2.3 Causally test the effect of altered translation kinetics on Ssb chaperone binding
Appendix A CHAPTER 2 SUPPORTING INFORMATION
A.1 Supporting Figures
A.2 Supplementary Tables
Appendix B CHAPTER 3 SUPPORTING INFORMATION
B.1 Supplementary Methods
B.1.1 Derivation of Eq. (3.3) from Eq. (3.1) and Eq. (3.2)
B.1.2 Estimation of 𝝉 < 𝒊 >
B.2 Supplementary Figures
B.3 Supplementary Tables
Appendix C CHAPTER 4 SUPPORTING INFORMATION
C.1 Methods
C.1.1 Details of Experiments
C.1.2 Computational analyses of Ribo-Seq data
C.2 Supplementary Figures
C.3 Supplementary Tables
Appendix D CHAPTER 5 SUPPORTING INFORMATION
D.1 Derivations Demonstrating that the Fold Enrichment Is Directly Proportional to the
Ssb-Binding Probability
D.1.1 Proof 1: Demonstration that the FE is directly proportional to the probability of Ssb
binding
D.1.2 Proof 2: Demonstration that SeRP reads are a function of the elongation rate, and
that the Fold Enrichment metric controls for this effect
REFERENCES
LIST OF FIGURES
Figure 1.1. Type of rates involved in translation.
Figure 1.2. Overview of Ribo-Seq
Figure 2.1. The A-site location can be defined as an offset from the 5′ end of
ribosome-protected fragments.
Figure 2.2. mRNA fragment size distribution for S. cerevisiae Ribo-Seq dataset from Pop
and co-workers (A) and the Pooled dataset (B).
Figure 2.3. Distribution of offset values from the Integer Programming algorithm applied
to transcripts from S. cerevisiae.
Figure 2.4. Increasing coverage identifies A-site locations for 𝑆 and 𝐹 combinations that
were initially ambiguous.
Figure 2.5. Several PPX and XPP motifs lead to ribosomal stalling in S. cerevisiae.
Figure 2.6. The Integer Programming algorithm correctly assigns greater ribosome
density than other methods to the Glycine in PPG motifs in S. cerevisiae and to
Glutamic acid in PPE motifs in mESCs.
Figure 3.1. Eq. (3.5) accurately determines codon translation times from simulated
ribosome profiles.
Figure 3.2. Wide variability in individual codon translation rates in vivo.
Figure 3.3. Molecular factors shaping the variability of individual codon translation rates.
Figure 4.1. Computational analyses of Ribosome profiling data demonstrate that identity
of amino acids in the P- and A-sites can influence the translation speed of the A-site codon.
Figure 4.2. Ribosome profiling experiments in which mutations are made to the P-site
residue measure changes in translation speed that are consistent with the predictions from
Figure 4.1b.
Figure 4.3. Depending on the amino acid pair, translation speed is influenced by either the
identity of the tRNA pairs, the amino acid pairs, or both.
Figure 4.4. Evolution selects for fast-translating pairs across the proteome but enriches
slow-translating pairs across interdomain linker regions.
Figure 5.1. Schematic representing the ribosome footprint x obtained from selective
Ribo-Seq when Ssb is bound to the region of nascent chain n amino acids upstream of x.
Figure 5.2. Altered Translation Kinetics of Ssb-Bound Ribosomes
Figure 5.3. Identifying Ssb-Bound mRNA Segments and the Molecular Origins of
Translation Acceleration
Figure 6.1. Illustration of the hypothesis that a change in translation-elongation rates will
lead to disruption of Ssb binding.
Figure A.1. Fragment size distribution in (A) Pooled Ribo-Seq data in mouse embryonic
stem cells (mESCs) and (B) Pooled Ribo-Seq data in Escherichia coli.
Figure A.2. Pairwise comparison of fragment-size and frame distributions between genes
in S. cerevisiae.
Figure A.3. Integer Programming algorithm correctly reproduces the true A-site offsets
from Artificial Ribo-Seq data.
Figure A.4. Meta-gene analysis in Pooled Ribo-Seq data reveal excess ribosome density
in E. coli genes beyond CDS regions.
Figure A.5. Stalling at PPE and PPD motifs are reproduced in mESCs.
Figure A.6. Sequence-independent translational pause observed post-initiation in S.
cerevisiae and mESCs.
Figure A.7. The Integer Programming algorithm correctly assigns greater ribosome density
to the Glycine residue in PPG motifs than other methods in S. cerevisiae.
Figure B.1. Comparison of the properties of the 117- and 364-transcript data sets from
studies of Nissley et al.9 and Williams et al.114, respectively, to the entire S. cerevisiae
transcriptome.
Figure B.2. Translation time distributions for the 64 codon types.
Figure B.3. Codon translation rates are highly correlated across datasets and with rates
from method of Dao Duc and Song.
Figure B.4. Molecular factors shaping the variability of individual codon translation rates
in the dataset from Williams et al.114.
Figure C.1. The percent change in median normalized ribosome density 𝜌 for a given pair
of amino acids in the P-site and A-site, relative to any other amino acid being in the P-site
(Eq. C.2).
Figure C.2. The sign of the percent change in ribosome density (Eq. C.2) for the fast and
slow translating amino acid pairs remains the same after controlling for different molecular
factors known to influence translation speed.
Figure C.3. The ribosome profiling data for all the mutant strains demonstrate consistent
fragment size distribution, strong 3 nt periodicity, robust frame distribution and high
pairwise correlation of individual transcript's ribosome profiles.
Figure C.4. Ribosome profiles of mutant and wild-type strains are highly correlated.
Figure C.5. Optimal and non-optimal codons are equally distributed between the domain
and linker regions of proteins for both fast- and slow-translating amino acid pairs.
Figure C.6. Fast-translating amino acid pairs are enriched in those transcript segments
that are being translated when the chaperone Ssb is bound to the nascent chain.
Figure C.7. Translation speed differences are not explained by wobble decoding in the
P- and A-sites.
Figure C.8. Samples prepared in the same phase (single batch on same day) exhibit
higher correlations than samples prepared in different phases.
LIST OF TABLES
Table 2.1. A-site locations (nucleotide offsets from 5′ end) determined by applying the
Integer Programming algorithm to the Pooled dataset in S. cerevisiae are shown as a
function of fragment size and frame.
Table A.1. Number of genes for the various fragment size and frame combinations that
meet the criteria of at least 1 read per codon on average in the Pop and Pooled datasets
of S. cerevisiae.
Table A.2. Initial offset tables after application of Integer Programming algorithm to Pop
and Pooled datasets in S. cerevisiae.
Table A.3. For unique offsets described in Table 2.1, the robustness to variation in
parameters and consistency across different Ribo-Seq datasets are described with
additional sub columns.
Table A.4. Input A-site offset tables used in the creation of artificial Ribo-Seq data (table
below, see Methods). Offset A-site tables (next page) output by the Integer Programming
method when applied to artificial Ribo-Seq data constructed using the input tables (Top)
and P(𝑆, 𝐹) distribution with mode (28, 0) and variance 𝜆 = 48 (Distribution 5 in Figure
A.3).
Table A.5. Initial offset table after application of Integer Programming algorithm to a
Pooled dataset in mESCs consisting of all genes. Offset table after application of Integer
Programming algorithm to a Pooled dataset of E. coli.
Table A.6. A-site locations (nucleotide offsets from 5′ end) determined by applying the
Integer Programming algorithm to the Pooled dataset in mESCs are shown as a function
of fragment size and frame.
Table A.7. Number of genes in the combination of fragment size and frame meeting the
criteria of at least 1 read per codon on average in mESCs and E. coli Pooled datasets.
Table A.8. Median normalized ribosome densities for 61 codon types were correlated with
tRNA abundance for the Integer Programming method and 11 other contemporary
methods (see Methods for details).
Table A.9. Publicly available datasets used in the study.
Table A.10. A-site offsets determined using the publicly available R packages – Plastid38,
RiboProfiling92 and riboWaltz37.
Table B.1. Statistics for the translation time distributions of 64 codon types obtained from
the Nissley dataset.
Table B.2. Statistics for the translation time distributions of 64 codon types obtained from
the Williams dataset.
Table C.1. Ribo-Seq was obtained from five different published studies.
Table C.2. Details on the 12 single amino acid mutations that were made across 5 different
genes.
Table C.3. Statistics of read mapping for ribosome profiling experiments for the mutant
strains carried out in this study.
Table C.4. Three mutations to gene YOL109W to test the contribution of amino acid and
tRNA identity.
ACKNOWLEDGEMENTS
First and foremost, I would like to thank God Almighty for always keeping me motivated
for the long and challenging journey of a PhD. I am grateful for the intellect that God has
bestowed upon me to contribute towards pushing our understanding of nature and life
even if it is only bit by bit. Learning about nature and getting to know the interplay of
the complex network of molecular machines that together create functioning biological
systems have always amazed me and brought me closer to God Almighty.
I would like to thank the National Science Foundation, National Institutes of
Health, and Human Frontier Science Program for funding the work described in this
dissertation in part. Any opinions, findings, and conclusions or recommendations
expressed in this dissertation are my own and those of my collaborators and do not
necessarily reflect the views of these funding agencies.
I would like to thank my dissertation advisor, Ed O’Brien, without whose constant
support and encouragement, this PhD would not have been possible. Ed has been a
wonderful advisor who always made sure to bring the best work out of me and taught me
to think about my research from different perspectives. I have learned a great deal about
how to propose and execute a research project from Ed and this will go a long way for me
to have a successful career as a scientist. I would also like to thank my committee
members, Professors Istvan Albert, Sarah Assmann and Naomi Altman, for their
thoughtful questions and criticisms that helped me improve upon my research projects. I
am grateful to Shashi who played an important role in bringing me to Penn State and
constantly provided support and encouragement during our meetings.
I would also like to extend my gratitude to my wonderful collaborators at University
of Heidelberg whose contributions made it possible to experimentally validate most of my
computational research findings. I would like to thank Bernd and Günter for providing
resources, insightful ideas and feedback that made it possible to ask the pertinent
research questions and extract exciting findings from our analyses. I would like to thank
Ulrike for having patience and running long and challenging experiments for our projects.
I am grateful to Kristina for working with me and Ed on Ssb project and providing all data
and useful insights needed to execute our part of the project. I would also like to thank
Pietro at University of Cambridge for working together on development of computational
methods and his diplomatic statements that helped us swiftly respond to harsh reviewers’
comments. Coming back to people who have been in closer physical proximity, I would
like to thank other members of the O’Brien lab with whom I had a great time working
over the past 5 years. Thank you Ajeet, Dan, Sarah, Joe, Dave, Fabio, Ben, Ian, Yang
and Yiyun for always being supportive whenever I have reached out to you for help.
Finally, I need to acknowledge my gratitude and thanks to the most important
people in my life. I am indebted to my father who inspired me to undertake a career in
science. His steadfast support and constant encouragement have kept me focused on my
research and convinced me to never give up. His wonderful achievements as a
hydrogeologist have always inspired me and I hope I can achieve even half of what he
achieved in his scientific career. I would like to thank my Mom for always believing in
me and always encouraging me to never stop trying. Thanks to my brother, Adeel for
always being there for me and my grandmother for her love, prayers and wishes. I would
also like to honor the memory of two individuals who are no more but would have been
very proud to see me attain a PhD. To Baji, my paternal grandmother, I wish you could be
here to see me finish my PhD. It was her hard labor that uplifted our family out of poverty,
made sure my father received his education and subsequently led us to achieve highest
academic honors. Also, to my maternal grandfather who always made sure to make me
understand the value of education and knowledge during my childhood. I know that you
would be proud of my achievement.
Lastly, I need to thank my better half, my wife Anam. The last 2 years of my life
have been the most wonderful ever since I met you. I am always amazed by the positive
attitude you bring to all discussions we have. I am grateful to you for having the patience
to bear with me – with my rants, complaints and long hours away at work. I am grateful to
you for always making things easy for me. It would not have been possible to complete
my dissertation without your unwavering love and support.
Chapter 1
INTRODUCTION
1.1 Overview
This chapter introduces the background and motivation for all the studies presented in
this dissertation. First, I describe the recent evidence demonstrating the importance of
translation in determining the protein abundance in vivo. Next, I discuss earlier single
gene methodologies and sequence-based measures used as estimates of translation
rates. Then, I introduce Ribosome Profiling, also known as Ribo-Seq, whose data forms
the basis for many of the projects in this dissertation, the current methods to model Ribo-
Seq data and their limitations. Next, I detail the evidence that translation kinetics has
downstream effects on co-translational processes. Lastly, I outline how the research
projects discussed in Chapters 2, 3, 4 & 5 in this dissertation overcome the limitations of
current analysis methods and how the developed methods offer novel biological insights.
1.2 Translation and its importance
Proteins play an integral role in the functioning of a cell. Their cellular concentrations are
determined dynamically through the processes of transcription, translation and
degradation. Through advances in mass spectrometry, it has been possible to directly
characterize proteins from cells, but the estimates of their concentrations are qualitative
at best and it has been difficult to detect lowly expressed proteins1. mRNA copy numbers
are easy to measure through inexpensive microarray studies and recently by high-
throughput RNA sequencing. Consequently, gene expression has been mostly quantified
by mRNA levels that act as a proxy for the final protein levels. Schwanhäusser et al.2 used
pulse labeling of radioactive variants of amino acids and nucleosides in a population of
unperturbed mouse fibroblast cells to determine the turnover and half-lives of proteins
and their corresponding mRNA transcripts in a single experiment. The mRNA and protein
levels quantified in the same experiment demonstrated that only 40% of variability in
protein levels is explained by mRNA levels. According to their model, the translation rate
constants are better predictors of protein levels than mRNA levels are. Therefore,
uncovering the mechanisms of translational control of gene expression can help us
understand a crucial understudied component of cellular dynamics.
Translation is the process by which the genomic information encoded in
messenger RNA (mRNA) is converted into a newly synthesized (“nascent”) protein3.
Translation occurs through the action of the ribosome, a polymerase that initiates
translation by binding at the start codon on an mRNA molecule. Next, the ribosome
elongates (i.e., synthesizes) the nascent protein by uni-directionally sliding along the
transcript, one codon at a time, catalyzing peptide bond formation. Translation terminates
once the ribosome reaches the stop codon. A codon is a triplet of nucleotides, and the
61 sense codons encode the 20 naturally occurring amino acids – the building blocks of
proteins. The ribosome reads off this codon information and catalyzes peptide bond
formation between amino acid groups that are bound to transfer RNA (tRNA). The
ribosome contains three sites in which tRNA molecules can reside – the acceptor site (A-
site), the peptidyl site (P-site), and the exit site (E-site). The A-site contains the codon
that is being translated and binds the cognate amino-acylated-tRNA molecule, the P-site
contains the tRNA to which the nascent protein is covalently attached, and the E-site
contains the deacylated-tRNA that is ejected from the ribosome before the next codon is
translated.
The rates associated with translation (Figure 1.1) determine the time scales of
protein synthesis4,5, influence protein expression levels6, and have recently been shown
to influence the structure and function of the protein produced7–11. These rates include
the initiation rate (how fast the ribosome binds to the start codon), individual codon
translation rates at the A-site (how fast the ribosome moves from one codon position to
the next), and the average elongation rate (how fast the ribosome moves from one codon
position to the next, averaged over all the codon positions in a transcript). During the
elongation step of translation, the ribosome synthesizes a protein by sliding along an
[Figure 1.1 schematic: ribosomes on an mRNA at codon positions 1, …, 𝑗 − 1, 𝑗, 𝑗 + 1, …, 𝑁C,
with stages labeled Initiation (rate 𝛼), Elongation (rates 𝑘A,2, …, 𝑘A,𝑗, 𝑘A,𝑗+1, 𝑘A,𝑗+2)
and Termination (rate 𝛽).]
Figure 1.1. Types of rates involved in translation. Translation is initiated by the binding
of ribosome subunits to the mRNA transcript at rate 𝛼. The ribosome then elongates at
rate 𝑘A to each successive codon until it reaches the stop codon, 𝑁C, where translation is
terminated with rate 𝛽 and the full-length protein (blue string) is released.
mRNA molecule and translates different codons into amino acids at different rates12. The
rate at which individual codons are translated by the ribosome can determine whether a
nascent protein will fold and function, misfold and malfunction, aggregate or efficiently
translocate to a different cellular compartment13,14,15. Hence, measurement of the
translation rates at all codon positions within a transcript would be crucial to uncover the
mechanism of translational control. Translation elongation rate is synonymous with
codon translation rate. The mean translation time of a codon is the inverse of the codon’s
translation rate and these three terms are used interchangeably throughout this
dissertation.
1.3 Previous estimates of translation rates
Direct measurement of codon translation rates in vivo is nontrivial and translation
efficiency (rate of translation initiation or protein synthesis) has typically been estimated
by measures of codon usage bias and tRNA abundance, which have been found to be
correlated with protein abundance16. Despite the degeneracy in the genetic code, the
frequency of usage of synonymous codons is highly biased17. This phenomenon is
referred to as codon usage bias. Frequent codons are generally correlated with high tRNA
abundance18 and the bias is more strongly observed in highly expressed genes across
diverse organisms19. Due to the evolutionarily conserved nature of codon usage bias,
translational efficiency was often approximated by indexes of codon usage20 and tRNA
abundance21. The intuitive hypothesis has been that frequent codons are translated
faster than rare codons and hence the biased codon usage and tRNA abundance have
co-evolved for the efficient use of translational machinery17. Though studies have shown
that substituting frequent codons with rare codons decreases overall protein
abundance22, there is no direct biochemical evidence that a change in translation
elongation rate causes a decrease in protein synthesis.
In the 1980s and 1990s, enzymology23 and cell biology24 assays were developed to
measure average translation-elongation rates one gene at a time or averaged over a
cell’s translatome. The enzymology techniques involved controlling the time at which
initiation of a transcript occurred, and then monitoring the subsequent appearance of
enzymatic activity. The time point at which the enzyme’s specific activity saturated,
divided by the enzyme’s length in residues, provided a measure of the transcript’s
average codon translation time, the inverse of its average codon translation speed.
Alternatively, the cell biology assays would
simultaneously measure the total mass of newly synthesized mRNAs and proteins
produced in cells over some time period via pulse-
chase experiments, and then fit those data to a model
that reported the average elongation rate, among
other quantities. A drawback of the enzymology
approach is that it is not high throughput – the
measurements can only be done one gene at a time.
Additionally, to be accurate, this approach requires
that any acquisition of enzymatic activity occur on a
faster characteristic time scale than that of protein
synthesis. The cell biology approach is prone to large
errors because gross measurements of total protein
mass were used and the results depend on the details
of the model used to extract the rates.
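The arithmetic behind the enzymology estimate described above can be sketched in a few lines of Python. This is only an illustration: the function names and the example numbers (a 300-residue enzyme whose specific activity saturates 60 s after initiation) are hypothetical and are not taken from the cited studies.

```python
def mean_codon_translation_time(saturation_time_s, length_residues):
    """Average time (s) spent per codon: saturation time divided by protein length."""
    if length_residues <= 0 or saturation_time_s <= 0:
        raise ValueError("saturation time and length must be positive")
    return saturation_time_s / length_residues


def mean_elongation_rate(saturation_time_s, length_residues):
    """Average elongation rate (codons per second), the inverse of the mean time."""
    return 1.0 / mean_codon_translation_time(saturation_time_s, length_residues)


# Hypothetical example values, chosen only for illustration.
mean_time = mean_codon_translation_time(60.0, 300)  # seconds per codon
mean_rate = mean_elongation_rate(60.0, 300)         # codons per second
print(mean_time, mean_rate)
```

The estimate is a single transcript-wide average, which is why it cannot resolve differences between individual codon positions.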
1.4 Ribo-Seq measures the location and number of
actively translating ribosomes
Ribo-Seq25 is a Next-Generation Sequencing
technique in which translation is rapidly halted in cells
through the use of antibiotics or flash freezing.
Subsequent cell lysis and mRNA digestion of the
lysate using an RNase enzyme26 (Figure 1.2a) results
in a pool of ribosome-protected mRNA fragments that
is amplified and sequenced. The number and length of
mRNA fragments that map to the coding sequences
(CDSs) of transcripts is a function of the location and
number of ribosomes that were sitting at a particular
location on different copies of the same transcript
when translation was halted. When a ribosome dwells
for a longer time at a particular codon position, more
reads map to it relative to a codon position that is
translated faster on the same transcript (Figure 1.2b).
Hence, the read distribution across a CDS is, in part,
a function of the individual translation elongation rates
of each codon. The advent of Ribo-Seq provided a
[Figure 1.2 panels: (a) workflow steps Nuclease Digestion, RNA isolation, Sequencing and
alignment, with a plot of No. of reads versus Nucleotide position across the Coding
sequence region; (b) plots of No. of reads at codon positions 𝑗 and 𝑗 + 1.]
Figure 1.2. Overview of Ribo-Seq. (a) Steps of Ribo-Seq experiment: Unprotected mRNA
fragments (purple regions not covered by green ribosomes) are digested by nuclease
enzyme such that only ribosome-protected mRNA fragments are isolated and subsequently
sequenced and aligned to the transcriptome to generate the ribosome profile. (b) Slow
translation leads to a higher number of reads at codon position 𝑗 compared to fast
translation at codon 𝑗 + 1.
codon level resolution of translation that can be used to estimate translation elongation
rates.
Over the past decade since the introduction of Ribo-Seq, several biases in the
experimental protocol have been identified and improvements have been proposed to
avoid such biases27. The most prominent bias that has been quantified is the effect of
cycloheximide (CHX) drug treatment that has been used to halt translation in earlier Ribo-
Seq studies. It was shown that the translation arrest induced by CHX was not perfect and
it led to continued elongation and distortion of ribosome density across downstream
codons28,29. Improved protocols now use flash-freezing for halting translation and
different ribonucleases for mRNA digestion are available for different organisms. One of
the fundamental computational challenges in the analysis of Ribo-Seq data is to map
reads to the correct codon position within the resulting ribosome-protected mRNA
fragments. To quantify individual codon translation rates, we must be able to accurately
identify which codon was being translated at the ribosome’s A-site. Otherwise, ribosome
density will be assigned to the wrong codon and the measured rates will be erroneous.
In Ribo-Seq experiments26, however, the location of the ribosome’s A-site on a ribosome-
protected mRNA fragment is not known a priori; additional information and assumptions
must be introduced to estimate its location.
1.4.1 Approaches for identifying the A-site
Recent methods25,29–38 to estimate the A-site location are based on heuristic and
statistical learning approaches. For a canonical ribosome-protected fragment of 28 nt,
the A-site has been identified to be 15 nt from 5΄ end (see Figure 2.1A)25. This information
is used by many methods as a heuristic to qualitatively guess the location of the A-site
for non-canonical fragment lengths. Most of these methods can only be applied to a
narrow range of fragment lengths25,29,35,39 and hence do not utilize all of the reads
generated in a Ribo-Seq experiment. Others use simple heuristics, such as pausing at
codons of certain amino acids in response to specific growth media and drug treatments.
For example, drug treatment with 3-amino-1,2,4-triazole depletes the cellular
concentration of tRNAHis, and is expected to lead to a higher ribosome density when
histidine codons are in the A-site31. With such methods, the A-site location in S.
cerevisiae Ribo-Seq datasets has been estimated to be 15 nt from the 5΄ end of
ribosome-protected fragments of size 28 nt25,40, 16 nt for fragment size 29 nt40, and 15
nt from the 5΄ end of fragments that are 30 nt in length35. Additionally, frame-specific
offsets of 14 to 17 nt from the 5΄ end for fragments between 28 and 30 nt in length are
used29,41. Alternatively, the Center-weighted Method smooths the ribosome density
across several codons34, and thus translation properties of individual codons cannot be
accurately ascertained with this approach. Therefore, to accurately identify the A-site
location, an approach is needed that is firmly rooted in biological principles that can also
be applied to the wide range of fragment lengths generated by a Ribo-Seq experiment.
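To make the role of these offsets concrete, the following Python sketch assigns reads to A-site codons using a fixed offset per fragment length. It is a minimal illustration under stated assumptions, not the procedure of any of the cited methods: the offset table simply reuses the S. cerevisiae values quoted above, the offset is taken to be the distance from the 5΄ end of the fragment to the first nucleotide of the A-site codon, read positions are given relative to the first nucleotide of the start codon, and the reading frame of the read is ignored. The function names and the toy reads are hypothetical.

```python
from collections import Counter

# Offset (nt) from the 5' end of a fragment to the first nt of the A-site codon,
# keyed by fragment length. Values quoted in the text above; illustrative only.
A_SITE_OFFSETS = {28: 15, 29: 16, 30: 15}


def a_site_codon(read_start, read_length, offsets=A_SITE_OFFSETS):
    """Return the 0-based codon index in the A-site, or None if it cannot be assigned.

    read_start is the position of the read's 5' end relative to the first
    nucleotide of the start codon (0-based, may be negative).
    """
    offset = offsets.get(read_length)
    if offset is None:
        return None  # fragment length without a known offset is discarded
    a_site_nt = read_start + offset
    if a_site_nt < 0:
        return None  # A-site would lie upstream of the coding sequence
    return a_site_nt // 3


def codon_read_counts(reads, offsets=A_SITE_OFFSETS):
    """Count reads per A-site codon from (read_start, read_length) pairs."""
    counts = Counter()
    for read_start, read_length in reads:
        codon = a_site_codon(read_start, read_length, offsets)
        if codon is not None:
            counts[codon] += 1
    return counts


# Hypothetical toy reads: (5' end position, fragment length).
toy_reads = [(-12, 28), (0, 28), (-1, 29), (5, 27)]
profile = codon_read_counts(toy_reads)
print(profile)
```

In this toy input the 27 nt read is dropped because no offset is defined for its length, which illustrates how methods restricted to a narrow range of fragment lengths leave part of the data unused.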
1.5 Approaches for estimating translation rates using Ribo-Seq
Ribo-Seq overcomes the drawbacks of both enzymology and cell biology assay-based
methods to measure translation rates: it is high-throughput; it directly measures ribosome
positions on individual transcripts; and it measures a signal that is proportional to time
spent by the ribosome on a codon. Therefore, several analytical methods4,35,39,42–44 have
been developed that often estimate qualitative, relative differences in translation speed.
For example, in one method35 the “Ribosome Residence Time” of each codon type is
estimated as the proportion of Ribo-Seq reads for that codon type relative to the average
number of reads present in a local 20-codon window centered at the codon of interest.
However, this provides only a rough, relative measure of translation rates between
different codon types and imposes the assumption that every instance of a codon type
translates at the same rate. A simple thought experiment reveals the large errors that
can arise from this
“local window” approach. Consider a 100-codon transcript in which the first half is
uniformly translated twice as slowly as the second half, resulting in twice as much
ribosome density in the first half compared to the second. Further, assume that the codon
in the 75th position (in the fast-translating region) is the only codon that translates slowly,
with 50% more reads than in its local window. With these conditions, codon 75 is being
translated at approximately the transcript’s average codon translation speed. And yet,
applying this local window approach we would incorrectly conclude that codon 75 is being
translated 1.5 times slower than the average codon translation speed. Dana and Tuller44 defined a
translation efficiency index for the mRNA transcripts called Mean Typical Decoding Rates
(MTDR) which is the geometric mean of translation rates of all codons within the
transcript. Pop et al.39 model a ribosome flow process while softly constraining the
translation rate of a codon type to be the same throughout the cell. Gardin et al.35 do the
same but use relative ribosome densities in 20-codon windows, which has the effect of
reducing the variability in translation speeds. Thus, many of these methods ignore the
variability in the translation rate of the same codon type at different positions within
the same transcript, and while all measure relative rates between codons, none measures
absolute translation rates of individual codons.
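The thought experiment about the 100-codon transcript can be checked numerically. The Python sketch below is an illustration under stated assumptions: read density is expressed in arbitrary units (2 per codon in the slow half, 1 per codon in the fast half), the local window is taken as the 10 codons on each side of the codon of interest with that codon excluded from the average, and the function names are hypothetical rather than part of any published tool.

```python
def local_window_ratio(density, index, half_width=10):
    """Reads at a codon relative to the mean of its neighbors in a local window."""
    lo = max(0, index - half_width)
    hi = min(len(density), index + half_width + 1)
    neighbors = density[lo:index] + density[index + 1:hi]
    return density[index] / (sum(neighbors) / len(neighbors))


def transcript_ratio(density, index):
    """Reads at a codon relative to the mean over the whole transcript."""
    return density[index] / (sum(density) / len(density))


# Codons 1-50 are translated twice as slowly as codons 51-100,
# so they carry twice the ribosome density (arbitrary units).
density = [2.0] * 50 + [1.0] * 50
density[74] = 1.5  # codon 75 (0-based index 74): 50% more reads than its neighbors

local = local_window_ratio(density, 74)
overall = transcript_ratio(density, 74)
print(f"local window ratio: {local:.3f}, transcript-wide ratio: {overall:.3f}")
```

The local window ratio marks codon 75 as a clear outlier, whereas its density is almost exactly the transcript-wide mean, so its dwell time is close to the average for this transcript.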
1.6 Molecular factors influencing translation elongation
As described previously, codon translation rates were estimated using measures of
codon usage that correlated with cognate tRNA abundance16. In S. cerevisiae, 61 codon
types are decoded by 42 tRNA families, and each tRNA family is encoded by a variable
number of gene copies across the genome45. Since there are only 42 tRNA types for
decoding 61 codon types, multiple codons are decoded by the wobble decoding mechanism,
in which the third nucleotide of the codon and its partner in the anticodon do not exhibit Watson-Crick
complementarity46. A codon optimality measure has been used in the literature taking
into account the cognate tRNA interactions and wobble base pairing47,48. Optimal codons
were defined as codons used commonly across the genome and decoded by tRNAs with
higher gene copy number. Non-optimal codons were mostly rare codons decoded by
less abundant tRNAs or through the wobble decoding mechanism.
With the development of Ribo-Seq, the codon translation rates obtained were
correlated with codon optimality. Some studies35,44 showed that biased codon usage
strongly correlates with codon translation rate while others39,42,43 demonstrated that
synonymous codons do not differ in their codon translation rates. However, the
discrepancies were attributed to technical biases in Ribo-Seq, specifically the use of
cycloheximide29. An improved Ribo-Seq study found that codon translation rates are
correlated but could explain only 27% of the variation in the rates41. Wobble decoding
has been shown to slow translation in metazoans using Ribo-Seq49 but no definitive
evidence exists for any systematic slowdown caused by the wobble decoding mechanism in
other organisms.
Advances in structural methods, single-molecule methods and Ribo-Seq have
identified several other factors that can potentially influence translation50. These include
features of both the mRNA and the nascent chain. mRNA secondary structure can be a barrier
to translocation of the ribosome along the transcript and it can result in a slowdown of
translation while the structure is unwound by helicase activity of the ribosome51,52.
Analysis of initial Ribo-Seq data has also found a correlation between ribosome density and
folding energy of mRNA secondary structures53,54. Other features that can influence
translation kinetics are tRNA modifications that can alter decoding efficiency55 and stress
conditions that can change the dynamic pool of tRNA thus affecting the decoding of
different codon types56,57.
The nascent chain features having an influence on translation rates include the
presence of proline residues. Proline is a well-established poor peptidyl donor and
acceptor when present in the P- and A-sites respectively58,59. Ribo-Seq studies confirmed
that the presence of proline leads to a slowdown of translation60. The slowdown of
translation is extensive for polyproline motifs, which require external translation factors
to rescue translation61–66. This phenomenon has been determined through
enzymology61,62 and toeprinting67 studies and has been extensively characterized to be
rescued by factors like EF-P and eIF5A in E. coli and S. cerevisiae respectively. Positively
charged residues are an additional nascent chain feature that can influence the codon
translation rate by interacting with the negatively charged tunnel resulting in a slowdown
of translation at the A-site42,68,69.
As the methods advance to study the translation process dynamically in real
time50, more factors may be discovered influencing the rate of translation elongation.
1.7 Influence of translation kinetics on co-translational processes
Translation is a resource intensive process and efficient production of proteins is
required to maintain protein homeostasis. Protein maturation is a multi-step process
requiring several factors to act in a timely fashion. A misstep can disrupt the protein
homeostasis potentially driving pathogenesis of diseases. Without changing the protein
abundance, this disruption can cause the protein to misfold and lead to aggregation
causing cytotoxicity. The ribosome, as a catalytic macromolecular complex, maintains a
balance between efficient protein production and ensuring that the proteins are
functionally active. This is achieved by the non-uniform pattern of translation kinetics
where variability of translation rates creates periods of fast and slow translation to
efficiently and accurately generate a functional proteome70.
Multi-domain proteins tend to fold in a domain-wise fashion such that they can
avoid large-scale non-native interactions71. Translation is a sequential process and it can
allow the separation of time scales for different domains of a multi-domain protein to fold.
However, there is still a danger that the nascent polypeptide, which can access a large
conformational space upon leaving the ribosome exit tunnel, will misfold72,73. The nascent
polypeptide needs to be supervised during the elongation phase to avoid any non-native
interactions. A network of molecular chaperones assists with the processing of nascent
polypeptides by helping avoid misfolded nascent chain conformations while the rest of
the polypeptide is being synthesized inside the ribosome72–74. A network of factors also
exists to facilitate the co-translational protein maturation steps of assembly of large
protein complexes75 and membrane targeting76. Alterations of translation kinetics have been
demonstrated to affect these co-translational processes but their mechanism of
coordination is not well understood77,78.
It was hypothesized that optimizing the mRNA sequence by replacing non-
optimal codons with optimal codons should result in an increase in efficiency of protein
production79. However, multiple lines of evidence show that optimizing the
mRNA sequence increases the efficiency of protein production but can often lead to loss
of functionality77,80,81, in some cases leading to widespread aggregation of proteins82.
Optimizing the FRQ protein in Neurospora, for example, led to the loss of circadian
rhythm83. Evolutionary selection pressures have shaped codon usage such that optimal
and non-optimal codons are distributed in clusters to create periods of faster and slower
translation48. It has been found that optimal codons are essential for maintaining the
fidelity of translation at structurally sensitive sites where a slowdown can result in
mistranslation leading to misfolding84. It was also expected that non-optimal codons would
be enriched in interdomain linker regions to slow down translation and facilitate co-
translational domain folding. It has been seen from single protein studies, for example,
that mutating optimal codons to non-optimal codons downstream of an N-terminal domain
makes the designed protein YKB fold with increased efficiency85. Another study identified a
rare codon cluster downstream of a domain of the SufI protein, whose folding efficiency was
perturbed when the cluster was mutated to common codons81. Clusters of non-optimal
codons were found to be present between secondary structural motifs within structural
domains86. Ribo-Seq data has demonstrated that there is a slowdown of translation in
interdomain linkers41 but no systematic enrichment of non-optimal codons was observed
across interdomain linkers in a large-scale analysis of 121, 120 and 51 multi-domain
proteins in E. coli, H. sapiens and S. cerevisiae, respectively87. This indicates that there are molecular
factors which need to be identified that are influencing translation and causing a
slowdown in interdomain linkers.
1.8 Objectives of dissertation
This introduction highlights the importance of translation in regulating gene expression,
its influence on downstream co-translational processes and current challenges
concerning the analysis of Ribo-Seq data. This dissertation aims to address some of
these challenges so that Ribo-Seq data can be efficiently modeled to extract absolute
codon translation rates. This dissertation also aims to find novel insights that can be
gained from analysis of Ribo-Seq to understand the molecular origin of variability in
translation rates as well as any coordination with co-translational processes.
In Chapter 2, I describe a method to accurately identify the A-site within ribosome-
protected fragments. This method implements a probabilistic approach and utilizes the
fundamental feature of translation that the A-site of a ribosome can occupy only the region
between the second and stop codons of a transcript. It overcomes the limitations of the
heuristic approaches used by other methods and can be applied to a wider range of
fragment sizes. The utility of this method is demonstrated by its greater accuracy in
comparison to contemporary methods.
In Chapter 3, I present a method that uses a chemical kinetic model to derive an
equation for calculating codon translation rates of individual codons within an mRNA
transcript from Ribo-Seq data. This is fundamentally different from other analysis
methods of Ribo-Seq data35,39,42,43,88 that build their models of translation assuming a
constant elongation rate for a particular codon type. A chemical kinetic model of
translation will accurately capture the codon translation rates at each codon position with
minimal assumptions and hence is more likely to accurately quantify the role of translation
kinetics in influencing co-translational processes.
In Chapter 4, I describe an analysis of Ribo-Seq data that proposes a novel
molecular factor influencing the translation rate. This analysis demonstrates that the
chemical identity of the amino acid pairs in the P- and A-sites of the ribosome can
influence the codon translation rate and predicts that mutating the P-site amino acid will
lead to either speedup or slowdown of translation rate. This prediction is tested
experimentally for 12 amino acid pairs and all 12 mutations result in a change in speed in
the expected direction. I also demonstrate that evolution selects for fast-translating pairs
relative to slow-translating pairs potentially to increase the efficiency of protein
production. However, local enrichment of slow-translating pairs is observed in interdomain
linkers which can potentially explain the slowdown observed downstream of domain
regions but could not be attributed to enrichment of non-optimal codons. Identification of
amino acid pairs adds another feature of nascent chain mediated regulation of translation
further explaining the origin of variability in translation rates.
In Chapter 5, I demonstrate that the co-translational process of binding of the Hsp70
chaperone Ssb is coordinated with faster translation by the ribosome. The binding of Ssb
to nascent polypeptides is profiled using a method called Selective Ribosome
Profiling89, a variant of Ribo-Seq in which chaperone-bound ribosome-nascent chain
complexes are selected for ribosome profiling. I also describe how faster translation is
encoded within the mRNA, with molecular factors affecting translation rate enriched in a
fashion to accelerate translation in the ribosome during periods of Ssb binding.
Finally, in Chapter 6, I summarize the findings from the studies presented in this
dissertation and their implications for studying the effect of synonymous mutations on
functional protein production and their role in diseases.
Chapter 2
IDENTIFYING A- AND P-SITE LOCATIONS ON RIBOSOME-PROTECTED MRNA
FRAGMENTS USING INTEGER PROGRAMMING
The research presented in this chapter has been published as a research article in
Scientific Reports titled “Identifying A- and P-site locations on ribosome-protected mRNA
fragments using Integer Programming” by Nabeel Ahmed*, Pietro Sormanni*, Prajwal
Ciryam, Michele Vendruscolo, Christopher M. Dobson and Edward P O’Brien (* denotes
co-first authors). The author contributions are stated below: “P.S., P.C. and E.P.O.
conceived the study. N.A., P.S. and E.P.O. designed the computational analyses. P.C.,
M.V., C.M.D. contributed to design of the computational analyses. N.A. and P.S. analyzed
the data. N.A. and E.P.O. wrote the manuscript. All authors reviewed and commented on
the manuscript.” This chapter is being reproduced from the above publication under Open
Access Creative Commons Attribution 4.0 International License (CC BY).
2.1 Abstract
Identifying the A- and P-site locations on ribosome-protected mRNA fragments from Ribo-
Seq experiments is a fundamental step in the quantitative analysis of transcriptome-wide
translation properties at the codon level. Many analyses of Ribo-Seq data have utilized
heuristic approaches applied to a narrow range of fragment sizes to identify the A-site. In
this study, we use Integer Programming to identify the A-site by maximizing an objective
function that reflects the fact that the ribosome’s A-site on ribosome-protected fragments
must reside between the second and stop codons of an mRNA. This identifies the A-site
location as a function of the fragment’s size and its 5′ end reading frame in Ribo-Seq data
generated from S. cerevisiae and mouse embryonic stem cells. The correctness of the
identified A-site locations is demonstrated by showing that this method, as compared to
others, yields the largest ribosome density at established stalling sites. By providing
greater accuracy and utilization of a wider range of fragment sizes, our approach
increases the signal-to-noise ratio of underlying biological signals associated with
translation elongation at the codon length scale.
2.2 Introduction
Translation is a fundamental cellular process and an important step of gene expression
resulting in the production of proteins in cells90. In the past decade the advent of Ribo-Seq
(also known as Ribosome profiling), a high-throughput Next-Generation Sequencing
method25,91, has enabled the transcriptome-wide study of translation. Ribo-Seq involves
rapidly halting translation in cells through the use of antibiotics or flash freezing followed
by cell lysis and then digestion of the lysate using an RNase enzyme26. The resulting pool
of ribosome-protected mRNA fragments is then amplified and sequenced. The number
and length of mRNA fragments that map to the coding sequences (CDSs) of transcripts is
a function of the location and number of ribosomes that were sitting at a particular position
on different copies of the same transcript. Because the locations of the ribosome’s A- and
P-sites on a fragment during the digestion step are not known a priori, additional information
and assumptions must be introduced to estimate them. Since translation occurs
at the A- and P-sites, the identification of these sites is critical to address translation-
related questions. If the A- and P-sites are not accurately identified, then systematic or
random error can diminish the statistical power of any underlying biological signal that
might exist. The identification of the A- and P-sites within ribosome footprints is therefore
fundamental to quantitatively understanding translation at the codon length scale.
Because of the importance of this assignment problem, a number of methods for
identifying the A- and P-sites have been created25,29–34,37,38,92. Many of these approaches
utilize the biological fact that only the P-site is permitted to occupy the start codon during
translation initiation and only the A-site is permitted to occupy the stop codon during
termination. Using such approaches, the A-site location in S. cerevisiae Ribo-Seq
datasets, for example, has been estimated to be 15 nt from the 5′ end of ribosome-
protected mRNA fragments of size 28 nt25,40; 16 nt for fragment size 29 nt40; 15 nt from
the 5′ end of fragments that are 30 nt in length35; and frame-specific offsets of 14 to 17 nt
from the 5′ end for fragments between 28 and 30 nt in length29,41. The P-site location offset
is 3 nt prior to the A-site. Similarly, in mouse embryonic stem cells (mESCs), such
approaches have yielded specific offsets for different fragment lengths33.
Here, we utilize the fundamental biological fact that the A-site on ribosome-protected
fragments must reside within the CDS of a gene under normal growth conditions. We use
this fact to create an objective function that, when maximized, identifies where the
ribosome’s A- and P-sites are most likely to be located on a ribosome-protected mRNA
fragment. We apply our method to S. cerevisiae and mESCs Ribo-Seq datasets and show
that, compared to other methods, our approach has greater accuracy and statistical power
in identifying A- and P-site locations and assigning read density.
2.3 Results
2.3.1 Integer Programming Algorithm
In the analysis of Ribo-Seq data, mRNA fragments are initially aligned onto the reference
transcriptome and their location is reported with respect to their 5′ end. This means that
one fragment will contribute one read that is reported on the genome coordinate to which
the 5′ end nucleotide of the fragment is aligned (Figure 2.1A). In Ribo-Seq data, fragments
of different lengths are observed that can arise from incomplete digestion of RNA and from
the stochastic nature of mRNA cleavage by the RNase used in the experiment (Figures
2.2 and A.1). A central challenge in quantitatively analyzing Ribo-Seq data is to identify
from these Ribo-Seq reads where the A- and P-sites were located at the time of digestion.
It is non-trivial to do this since incomplete digestion and stochastic cleavage can occur at
both ends of the fragment. For example, mRNA digestion resulting in a fragment of size
29 nt can occur in different ways, two of which are illustrated in Figure 2.1B. The quantity
that we need to accurately estimate is the number of nucleotides that separate the codon
in the A-site from the 5′ end of the fragment, which we refer to as the offset and denote ∆.
Knowing ∆ determines the position of the A-site as well as the P-site since the P-site will
always be at ∆ minus 3 nt.
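The relationship between a read's 5′ end, the offset ∆ and the two sites can be sketched in a few lines. This is only an illustration: the function names and the transcript coordinates (a start codon beginning at position 100) are hypothetical and do not come from the published pipeline.

```python
def five_prime_frame(five_prime_pos: int, cds_start: int) -> int:
    """Frame F (0, 1 or 2) of the fragment's 5' end relative to the start codon."""
    return (five_prime_pos - cds_start) % 3


def site_positions(five_prime_pos: int, delta: int) -> tuple[int, int]:
    """Return the (A-site, P-site) coordinates a read is reassigned to for offset delta."""
    a_site = five_prime_pos + delta
    p_site = a_site - 3  # the P-site lies one codon (3 nt) upstream of the A-site
    return a_site, p_site


# Hypothetical read: 5' end 12 nt upstream of a start codon that begins at position 100.
print(five_prime_frame(88, 100))  # frame of the 5' end
print(site_positions(88, 15))     # with an offset of 15 nt, the P-site is on the start codon
```

With an offset of 15 nt this read places the start codon in the P-site and the second codon in the A-site, which is the initiation configuration described in the text.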
Our solution to this problem relies on the biological fact that for canonical transcripts
with no upstream translation the A-site of actively translating ribosomes must be located
between the second codon and the stop codon of the CDS3. Therefore, the optimal offset
value ∆ for fragments of a particular size (𝑆) and reading frame (𝐹) is the one that
maximizes the total number of reads 𝑇(∆|𝑖, 𝑆, 𝐹) between these codons for each gene 𝑖 onto
which the fragments map. The size of an mRNA fragment 𝑆 is measured in
nucleotides, and the frame 𝐹 has values of 0, 1 or 2 as defined by the gene start codon
ATG and corresponds to the frame in which the 5′ end nucleotide of the fragment is located
(Figure 2.1A). The 5′ end frame 𝐹 is a result of RNase digestion and it is distinct from the
reading frame of the ribosome that is typically translating in-frame (frame 0 of A-site). In
other words, for each combination of (𝑆, 𝐹) we shift the 5′ aligned read profile by 3
nucleotides at a time (to preserve the reading frame 𝐹) until we identify the value ∆ that
maximizes the reads between the second and stop codons (Figure 2.1C, see next sub-section).
This procedure is carried out systematically for each fragment size 𝑆 and reading frame 𝐹
separately, as each may have (and we find some have) a different optimal ∆.
This concept can be expressed in terms of Integer Programming93, a mathematical
optimization procedure in which an objective function is maximized subject to integer and
linear constraints. With ∆ as the integer variable to optimize, the objective function in this
case is

$$T(\Delta \mid i, S, F) = \sum_{j=4}^{N_{C,i}} R(j, \Delta \mid i, S, F),$$

where 𝑁𝐶,𝑖 is the number of nucleotides in the
CDS of gene 𝑖 and 𝑅(𝑗, ∆|𝑖, 𝑆, 𝐹) is the number of reads from fragments of size 𝑆 and frame
𝐹 mapped onto gene 𝑖 whose 5′ end is at nucleotide position 𝑗 on the CDS after being
shifted along the transcript by ∆ nucleotides. The optimal ∆, denoted ∆′, for a given (𝑆, 𝐹)
for gene 𝑖 is the value of ∆ that maximizes 𝑇(∆|𝑖, 𝑆, 𝐹) subject to the constraints (i) that
0 ≤ ∆ ≤ 𝑆, and (ii) that $\Delta \bmod 3 = 0$. Constraint (i) enforces the requirement that the A-site
is located between the first and last nucleotide of the fragment of size 𝑆 nts. Constraint (ii)
maintains the frame of the 5′-most nucleotide of the fragment as the Ribo-Seq reads are
shifted by an amount ∆. We enforce Constraint (ii) because we are interested in the
assignment of reads to the A-site at the resolution of a codon, not an individual nucleotide.
If we did not enforce constraint (ii), our algorithm would simply yield equal 𝑇(∆|𝑖, 𝑆, 𝐹)
scores for the two other values of ∆ that would still map the reads on the A-site codon,
but in the two frames where the 5′ end was not in. Therefore, to simplify the determination
of offsets we implemented constraint (ii). Thus, by maximizing 𝑇(∆|𝑖, 𝑆, 𝐹) for the CDS of
each gene in a data set of 𝑁𝑔 genes, we will obtain a set of 𝑁𝑔 values of ∆′. From this
distribution of ∆′ values, the A-site location corresponds to the most probable ∆′ value.
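Because ∆ takes only a handful of integer values, the maximization for one gene and one (𝑆, 𝐹) class can be done by exhaustive search. The sketch below assumes reads are stored as a dictionary from 5′-end position (1-based CDS coordinates, so upstream positions are zero or negative) to read count. The function names and the toy read counts are hypothetical.

```python
def score(reads_by_pos: dict[int, int], delta: int, n_cds: int) -> int:
    """T(delta): reads whose shifted position j satisfies 4 <= j <= n_cds."""
    return sum(count for pos, count in reads_by_pos.items()
               if 4 <= pos + delta <= n_cds)


def best_offset(reads_by_pos: dict[int, int], size: int, n_cds: int):
    """Scan delta = 0, 3, 6, ... <= size; return (delta', {delta: T(delta)})."""
    scores = {d: score(reads_by_pos, d, n_cds) for d in range(0, size + 1, 3)}
    best = max(scores, key=scores.get)  # on a tie, the smallest delta is returned
    return best, scores


# Toy 28 nt fragments in frame 0 on a hypothetical 60 nt CDS.
toy_reads = {-11: 10, -8: 4, 10: 6, 43: 5}
best, scores = best_offset(toy_reads, size=28, n_cds=60)
print(best, scores)
```

Stepping ∆ in multiples of 3 implements Constraint (ii), and bounding the scan at the fragment size implements Constraint (i). How ties are resolved here is an arbitrary choice of this sketch; the secondary criteria described next address near-ties.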
While identifying the ∆′ value for each gene in our data set, we also minimize the
occurrence of false positives by ensuring that the highest score, 𝑇(∆′|𝑖, 𝑆, 𝐹), is significantly
higher than the next highest score, 𝑇(∆′′|𝑖, 𝑆, 𝐹), which occurs at a different offset ∆′′. If
the difference between the top two scores is less than the average number of reads per
codon, we apply the following additional selection criteria. To choose between ∆′ and ∆′′,
we select the one that yields a number of reads at the start codon that is less than
one-fifth of the average number of reads at the second, third and fourth codons. We further
require that the second codon has a greater number of reads than the third codon. The
biological basis for these additional criteria is that the true offset (i.e., the actual location
of the A-site) cannot be located at the start codon, and that the number of reads at the
second codon should be higher on average than the third codon due to contributions from
the initiation step of translation, during which the ribosome is assembling on the mRNA
with the start codon in the P-site. Below, we demonstrate that the results from our method
are largely robust to changes in these thresholds.
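The secondary selection step can be written as two small predicates. The sketch follows the wording used in the Figure 2.1 caption and in Section 2.3.6 (start-codon reads below one-fifth of the mean over the second, third and fourth codons). The function names are hypothetical, and the per-codon read counts in the example are invented for illustration; only the scores 222 and 215 and the mean of 7.85 reads per codon come from Figure 2.1C.

```python
def needs_secondary_check(top_score: float, second_score: float,
                          mean_reads_per_codon: float) -> bool:
    """True when the top two scores differ by less than the mean reads per codon."""
    return (top_score - second_score) < mean_reads_per_codon


def passes_secondary(first_four_codons, start_fraction: float = 0.2) -> bool:
    """first_four_codons: A-site reads at codons 1 (start), 2, 3 and 4 for one candidate offset."""
    start, second, third, fourth = first_four_codons
    mean_234 = (second + third + fourth) / 3
    return start < start_fraction * mean_234 and second > third


# Figure 2.1C: T = 222 (offset 18) versus T = 215 (offset 15), 7.85 reads per codon on average.
print(needs_secondary_check(222, 215, 7.85))
# Invented per-codon counts for the two candidate offsets.
print(passes_secondary((1, 12, 8, 10)))   # candidate offset 18
print(passes_secondary((6, 9, 11, 10)))   # candidate offset 15
```

The text does not state what happens when both candidates, or neither, satisfy the criteria, so the sketch leaves that decision to the caller.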
Figure 2.1. The A-site location can be defined as an offset from the 5′ end of ribosome-
protected fragments. (A) A schematic representation of a translating ribosome (top drawing)
and of the offset ∆ between the Ribo-Seq reads mapped with respect to the 5′ end of the footprints
and centered on the A-site (blue bars). The ribosome is shown protecting a 28 nt fragment with
its 5′ end in reading frame 0, as defined from the ATG start codon of the gene. The E-, P- and A-
sites within the ribosome are indicated. The reads are then shifted from the 5′ end to the A-site
by the offset value ∆. (B) Stochastic nuclease digestion can result in different fragments. The two
most probable variants of a 29 nt footprint with the 5′ end in frame 1 are shown with their
boundaries mapped by dotted lines aligning to the genome which can result in offsets of 15 nt
(top) and 18 nt (bottom), respectively. (C) To illustrate the application of the Integer Programming
algorithm, consider a hypothetical transcript that is 60 nt in length. The first panel shows the
ribosome profile originating from reads assigned to the 5′ end of fragments of size 33 in frame 0.
The start and the stop codon are indicated while the rest of the CDS region is colored light peach.
The algorithm shifts this ribosome profile by 3 nt and calculates the objective function 𝑇(𝛥|𝑖, 𝑆, 𝐹).
The extent of the shift is the offset Δ. Values of 𝑇(𝛥|𝑖, 𝑆, 𝐹) for Δ = 12, 15, 18, 21 nts are indicated.
In this example, the average number of reads per codon is 7.85. The difference between the top
two offsets, 18 (𝑇 = 222) and 15 (𝑇 = 215), is less than the average. Hence, we check the
secondary criteria (Results). Offset 18 meets the criteria that the number of reads in the start
codon is less than one-fifth of the average of reads in second, third and fourth codons and also
that number of reads in the second codon is greater than reads in third codon. Hence, Δ = 18 nt
is the optimal offset for this transcript.
Figure 2.2. mRNA fragment size distribution for S. cerevisiae Ribo-Seq dataset from Pop
and co-workers (A) and the Pooled dataset (B).
2.3.2 Illustrating the Integer Programming optimization procedure
To illustrate this Integer Programming algorithm in action we provide an example using
the hypothetical mRNA shown in Figure 2.1C. The algorithm is as follows: First, for gene 𝑖,
consider 𝑅(𝑗, ∆= 0|𝑖, 𝑆, 𝐹) composed of those fragments of size 𝑆 (= [20,21, … ,35] nt) and
whose 5′ end has been aligned to reading frame 𝐹 (= 0, 1 or 2). Second, for this ribosome
profile, determine the ∆ that maximizes 𝑇(∆|𝑖, 𝑆, 𝐹). Do this by starting from the 5′-end-
aligned ribosome profile (∆=0) and shift it three nucleotides at a time (i.e., obey Constraint
(ii) described previously) towards the 3′ end of the transcript such that ∆ = 0, 3, 6, 9, … , ≤ 𝑆.
At each value of ∆, calculate 𝑇(∆|𝑖, 𝑆, 𝐹) and record its value. Third, after all ∆ values have
been tested, the ∆ that maximizes 𝑇(∆|𝑖, 𝑆, 𝐹) is denoted ∆′, which is the putative location
of the A-site relative to the 5′ end of fragments of size 𝑆 and frame 𝐹 for gene 𝑖. Check if the
secondary-selection criteria are required and apply them when the scores for the top two
offsets differ by less than the average number of reads per codon in the mRNA. Finally,
repeat these steps for every fragment size between 20-35 nts in length and every reading
frame. Thus, for one gene, this procedure yields 48 (=16x3) independent values for ∆′,
one for each fragment size and frame combination.
The fragment-size and frame distributions of ribosome-protected fragments (Figure
2.2) in S. cerevisiae are not gene dependent (Figure A.2), and therefore, neither should
the offset values be gene dependent. Thus, the location of the A-site, relative to the 5′
end of a fragment of size 𝑆 and frame 𝐹, corresponds to the most probable value of the
offset across all the genes in the dataset.
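Turning the per-gene ∆′ values into a single offset for an (𝑆, 𝐹) class amounts to taking the mode of their distribution. The sketch below also applies the 70% cut-off that the chapter uses later to call an offset unique, and otherwise reports the top two candidates. The function name and the example counts are hypothetical.

```python
from collections import Counter


def consensus_offset(per_gene_offsets, threshold: float = 0.70):
    """Return (offsets, fraction): one offset if unique, else the top two candidates."""
    counts = Counter(per_gene_offsets)
    ranked = counts.most_common(2)
    top_offset, top_count = ranked[0]
    fraction = top_count / len(per_gene_offsets)
    if fraction > threshold:
        return (top_offset,), fraction
    return tuple(offset for offset, _ in ranked), fraction


# Invented example: 100 genes, 47 with an offset of 15 nt and 30 with 18 nt.
ambiguous = [15] * 47 + [18] * 30 + [12] * 23
unique = [15] * 72 + [18] * 28
print(consensus_offset(ambiguous))
print(consensus_offset(unique))
```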
2.3.3 A-site locations in S. cerevisiae Ribo-Seq data are fragment size and frame
dependent
We first applied the Integer Programming method to Ribo-Seq data from S. cerevisiae
published by Pop and co-workers39. For each combination of 𝑆 and 𝐹 we first identified
those genes that have at least 1 read per codon on average in their corresponding
ribosome profile. The number of genes meeting this criterion is reported in Table A.1. We
then applied the Integer Programming method to this subset of genes. The resulting
distributions of ∆ values are shown in Figure 2.3A for different combinations of fragment
length and frame. We only show results for fragment sizes between 27 and 33 nt because
greater than 90% of reads map to this range (Figure 2.2A). The most probable offset value
for all fragment sizes between 20 to 35 nt is reported as an offset table (Table A.2).
We see that the optimal ∆ value - that is, the A-site location - changes for different
combinations of 𝑆 and 𝐹, with the most probable values either at 15 or 18 nt. Thus, the
location of the A-site depends on 𝑆 and 𝐹. In most cases, there is one dominant peak for
a given pair of 𝑆 and 𝐹 values. For example, for fragments of size 27 through 30 nt in
frame 0, greater than 70% of their per-gene optimized ∆ values are 15 nt from the 5′ end
of these fragments. Similar results are found for other combinations such as sizes 30, 31
and 32 nt in frame 1 and 28 through 32 nt in frame 2, where optimized ∆ values are 18 nt.
Thus, across the transcriptome, the A-site codon position on these fragments is uniquely
identified.
There are, however, 𝑆 and 𝐹 combinations that have ambiguous A-site locations
based on these distributions. For example, for fragments of size 27 nt in frame 1, 47% of
the gene-optimized ∆ values are at 15 nt while 30% are at 18 nt. Similar results are
observed for fragments 28 and 29 nt in frame 1, and 31 and 32 nt in frame 0. Thus, for
these 𝑆 and 𝐹 combinations there is a similar probability of the A-site being located at one
codon or another, and therefore we cannot uniquely identify the A-site’s location.
2.3.4 Higher coverage leads to more unique offsets
We hypothesized that ambiguity in identifying the A-site for particular 𝑆 and 𝐹
combinations may be due to low coverage (i.e., poor sampling statistics). To test this
hypothesis we pooled the reads from different published Ribo-Seq datasets into a single
dataset with consequently higher coverage and more genes that meet our selection
criteria (Table A.1). Application of our method to this Pooled dataset gives unique offsets
for more 𝑆 and 𝐹 combinations compared to the original Pop dataset (Figure 2.3B and
Table A.2), consistent with our hypothesis. For example, for fragments of size 27 and
frame 1, now we have the unique offset of 15 nt with 72% of gene-optimized ∆ values at
15 nt (Figure 2.3B). However, we still see the ambiguity present for certain (𝑆, 𝐹)
combinations.
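The figures report 95% confidence intervals obtained by bootstrapping. The exact resampling scheme is described in the Methods and is not reproduced here, so the following is only a generic percentile-bootstrap sketch for the fraction of genes with a given offset. The function name, the number of resamples and the example data are assumptions.

```python
import random


def bootstrap_ci(per_gene_offsets, offset, n_boot: int = 2000,
                 alpha: float = 0.05, seed: int = 0):
    """Percentile bootstrap interval for the fraction of genes with the given offset."""
    rng = random.Random(seed)
    n = len(per_gene_offsets)
    fractions = []
    for _ in range(n_boot):
        resample = rng.choices(per_gene_offsets, k=n)  # sample genes with replacement
        fractions.append(resample.count(offset) / n)
    fractions.sort()
    lower = fractions[int((alpha / 2) * n_boot)]
    upper = fractions[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Invented example: 72 of 100 genes have an optimal offset of 15 nt.
offsets = [15] * 72 + [18] * 28
print(bootstrap_ci(offsets, 15))
```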
We employed an additional strategy to increase coverage by restricting our
analysis to genes with greater average reads per codon. If the hypothesis is correct, then
we should see a statistically significant increase in the fraction of genes exhibiting the
most probable ∆ value with increasing read depth. We applied this analysis to the Pooled
dataset and find that some initially ambiguous 𝑆 and 𝐹 combinations become unambiguous
as coverage increases. For example, at an average of 1 read per codon, (𝑆, 𝐹) combinations
of (25, 0), (27, 2) and (30, 1) are ambiguous as they fall below our 70% threshold. However,
we see a statistically significant trend (𝑠𝑙𝑜𝑝𝑒 = 0.5, 𝑝 = 3.94 × 10⁻⁶) for fragments of (25, 0)
in which the 15 nt offset becomes more probable upon increasing the coverage, eventually
crossing the 70% threshold (Figure 2.4A). Similarly, for (27, 2) (𝑠𝑙𝑜𝑝𝑒 = 0.58, 𝑝 =
5.77 × 10⁻⁵) and (30, 1) (𝑠𝑙𝑜𝑝𝑒 = 0.25, 𝑝 = 0.009) there is a trend towards an offset of 18
nt, with more than 70% of genes having this offset at the highest coverage (Figures 2.4B,
C). Hence, for these fragments, increasing coverage uniquely identifies ∆′ and hence the
A-site location. For a few combinations of (𝑆, 𝐹), like (32, 0), the ambiguity is not resolved
even upon very high coverage (Figure 2.4D), which we speculate may be due to inherent
features of nuclease digestion being equally likely for more than one offset.
Figure 2.3. Distribution of offset values from the Integer Programming algorithm applied to
transcripts from S. cerevisiae. The data plotted in (A) are from the Pop dataset, and (B) the Pooled
dataset. The distributions are plotted as a function of the offset value for fragment sizes
of 27 to 33 nt and are shown, from left to right, for frames 0, 1 and 2. For a given fragment size and
frame, the A-site location is at the most probable Δ value in the distribution, provided the offset
occurs for more than 70% of the genes (dashed lines in panels). Error bars represent 95%
Confidence intervals calculated using Bootstrapping. Sample sizes are reported in Table A.1.
Figure 2.4. Increasing coverage identifies A-site locations for 𝑺 and 𝑭 combinations that
were initially ambiguous. Plotted is the percentage of transcripts with a particular Δ value for
different S and F combinations from the Pooled dataset of S. cerevisiae. In each panel, multiple
distributions are plotted corresponding to transcripts with increasing coverage, indicated by the
legend at the bottom. For example, the distributions in blue and red arise from transcripts with,
respectively, at least 1 or 2 reads per codon on average. We observe the A-site location tends
towards 15 nt for S = 25, F = 0 (A) and towards 18 nt for S = 27, F = 2 (B), and S = 30, F = 1 (C).
For S = 32, F = 0 (D), there is no trend even at higher coverage. Note that for S = 27, F = 2 (panel
B), there are fewer than 10 genes with an average greater than 50 reads per codon and hence
we do not include the data points beyond an average greater than 45 reads per codon (see
Methods). Error bars represent 95% Confidence intervals calculated using Bootstrapping.
Thus, high enough coverage yields the optimal offset table represented in Table
2.1, where the offset is the most probable location of the A-site relative to the 5′ end of the
mRNA fragments generated in S. cerevisiae.
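The coverage analysis above can be sketched as follows: for increasing minimum coverage, compute the percentage of retained genes whose optimal offset equals a candidate value, then fit a line to percentage versus coverage cut-off. The data below are synthetic, the function names are hypothetical, and only the slope is computed; the p-value reported in the text would need a significance test that is not shown here. The rule of skipping cut-offs with fewer than 10 genes follows the note in the Figure 2.4 caption.

```python
def fraction_at_offset(genes, offset, min_coverage):
    """genes: (mean reads per codon, optimal offset) pairs. Returns (fraction, genes kept)."""
    kept = [d for coverage, d in genes if coverage >= min_coverage]
    if not kept:
        return None, 0
    return sum(d == offset for d in kept) / len(kept), len(kept)


def ols_slope(xs, ys):
    """Ordinary least-squares slope of ys against xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Synthetic genes: poorly covered genes are noisier in their optimal offset.
genes = [(c, 18 if (c > 10 or c % 2 == 0) else 15) for c in range(1, 41)]
cutoffs, percentages = [], []
for min_cov in (1, 5, 10, 15, 20):
    fraction, n_genes = fraction_at_offset(genes, 18, min_cov)
    if n_genes >= 10:  # skip cut-offs that leave too few genes
        cutoffs.append(min_cov)
        percentages.append(100 * fraction)
slope = ols_slope(cutoffs, percentages)
print(percentages, slope)
```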
Table 2.1. A-site locations (nucleotide offsets from 5′ end) determined by applying the
Integer Programming algorithm to the Pooled dataset in S. cerevisiae are shown as a
function of fragment size and frame. The top two offset values are listed for those 𝑆 and 𝐹
combinations in which the A-site location could not be uniquely determined. For unique offsets, the
most-probable offset value is listed.
Fragment Size   Frame 0   Frame 1   Frame 2
24              15        15/12     18/12
25              15        12/15     18
26              15/12     18/15     18/15
27              15        15        18
28              15        15        18
29              15        15/18     18
30              15        18        18
31              15        18        18
32              18/15     18        18
33              18        18        18
34              18        18        18/21
2.3.5 Consistency across different datasets
Ribo-Seq data is sensitive to experimental protocols that can introduce biases in the
digestion and ligation of ribosome-protected fragments. Pooling datasets together offers
the advantage of higher coverage but it may mask the biases specific to an individual
dataset. To determine whether our unique offsets (Table 2.1) are consistent with results
from individual data sets we applied the Integer Programming algorithm to each individual
dataset. Most of these datasets have low coverage resulting in fewer genes meeting our
filtering criteria. For each unique offset in Table 2.1, we classify it as consistent with an
individual data set provided that the most probable offset from the individual dataset (even
if it does not reach the 70% threshold due to limitations in the depth of coverage) is the
same as in Table 2.1. We find that the vast majority of unique offsets (18 out of 20) in
Table 2.1 are consistent across 75% or more of the individual datasets (statistics reported
in Table A.3). Just two (𝑆, 𝐹) combinations show frequent inconsistencies. (𝑆, 𝐹)
combinations (27, 1) and (27,2) are inconsistent in 33% or more of the individual datasets
(Table A.3). This suggests that researchers who wish to minimize false positives should
discard these (𝑆, 𝐹) combinations when creating A-site ribosome profiles.
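The consistency classification described above reduces to counting how many individual datasets share the most probable offset found in the Pooled data. A minimal sketch, with a hypothetical function name and invented per-dataset offsets:

```python
def consistency(reference_offset: int, dataset_offsets, min_fraction: float = 0.75):
    """Fraction of individual datasets whose most probable offset matches the reference."""
    agree = sum(offset == reference_offset for offset in dataset_offsets)
    fraction = agree / len(dataset_offsets)
    return fraction, fraction >= min_fraction


# Invented most-probable offsets from individual datasets for two (S, F) classes.
print(consistency(15, [15, 15, 15, 18]))  # agrees in 3 of 4 datasets
print(consistency(18, [18, 15, 18]))      # agrees in 2 of 3 datasets
```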
2.3.6 Robustness of the offset table to threshold variation
The Integer Programming algorithm utilizes two thresholds to identify unique offsets. One
is that 70% of genes exhibit the most probable offset, the other, designed to minimize false
positives arising due to sampling noise in the Ribo-Seq data, is that the reads in the first
codon be less than one-fifth of the average reads in the second, third and fourth codon.
While there are good reasons to introduce these threshold criteria, the exact values of
these thresholds are arbitrary. Therefore, we tested whether varying these thresholds
changes the results reported in Table 2.1. We varied the first threshold to 60% and 80%,
and recomputed the offset table. We report whether the unique offset changed by listing
an ‘R’ or ‘S’ (for robust and sensitive, respectively) alongside the reported offset in Table
A.3. We find that two-thirds of the unique (𝑆, 𝐹) combinations do not change (Table A.3).
(𝑆, 𝐹) combinations (25, 0), (25, 2), (27,0), (27, 1), (28, 1), (31, 0), (33, 0) and
(33, 2) become ambiguous when we increased the threshold to 80%.
We varied the second, aforementioned threshold from one-fifth up to one and down
to one-tenth, and we find that all unique (𝑆, 𝐹) combinations except
(25, 2), (33,0), (33, 2) and (34, 1) remain unchanged (reported as ‘R’ in Table A.3). Thus,
in summary, in the vast majority of cases, the unique offsets reported in Table 2.1 depend
very little on specific values of these thresholds.
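For the first threshold, the robust/sensitive labelling can be expressed as a check on whether the uniqueness call made at 70% is unchanged at 60% and 80%. The sketch below takes the fraction of genes at the most probable offset as input; the function name and the example fractions are hypothetical.

```python
def robustness_label(top_fraction: float, baseline: float = 0.70,
                     alternatives=(0.60, 0.80)) -> str:
    """'R' if the unique/ambiguous call at the baseline is unchanged at the alternatives."""
    unique_at_baseline = top_fraction > baseline
    unchanged = all((top_fraction > t) == unique_at_baseline for t in alternatives)
    return "R" if unchanged else "S"


print(robustness_label(0.85))  # unique at 60%, 70% and 80%
print(robustness_label(0.72))  # unique at 70% but ambiguous at 80%
```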
2.3.7 Testing the Integer Programming algorithm against artificial Ribo-Seq data
To test the correctness and robustness of our approach we generated a dataset of
simulated ribosome occupancies across 4,487 S. cerevisiae transcripts and asked
whether our method could accurately determine the A-site locations. Artificial Ribo-Seq
reads were generated from these occupancies assuming a Poissonian distribution in their
(𝑆, 𝐹) values using random footprint lengths similar to that found in experiments (see
Methods and Figures A.3A, B). We investigated the ability of our method to correctly
determine the true A-site locations for four different sets of pre-defined offset values (see
Methods). The Integer Programming algorithm was then applied to the resulting artificial
Ribo-Seq data. We find the offset table generated from the algorithm reproduces the input
offsets used (Figure A.3C and Table A.4). This procedure was repeated for different read
length distributions as well as with different input offsets and we find that the offset tables
generated by our algorithm reproduce the input offset tables in greater than 93% of all
(𝑆, 𝐹) combinations (Figures A.3B, C). The method identifies a small number of ambiguous
offsets due to the low read coverage at the tails of the distributions. This finding
further emphasizes the importance of read coverage as a critical factor in accurately
identifying the A-site.
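A heavily simplified version of this test can be run in a few lines. The sketch below is not the simulation described in the Methods: it assumes a single (𝑆, 𝐹) class, a uniform mean occupancy, Poisson read counts per codon, and A-sites on codons 2 through the stop codon. It then checks whether the offset search recovers the offset used to generate the reads. All names and parameter values are hypothetical.

```python
import math
import random
from collections import Counter


def poisson(rng: random.Random, lam: float) -> int:
    """Draw one Poisson variate with Knuth's multiplication method."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def simulate_gene(rng, n_codons: int, true_offset: int, mean_reads: float = 5.0):
    """5'-end read counts (1-based CDS coordinates) for A-sites on codons 2..n_codons."""
    reads = {}
    for codon in range(2, n_codons + 1):
        first_nt = 3 * (codon - 1) + 1
        count = poisson(rng, mean_reads)
        if count:
            reads[first_nt - true_offset] = count
    return reads


def best_offset(reads, size: int, n_cds: int) -> int:
    """Offset (multiple of 3, at most size) maximizing reads between nt 4 and n_cds."""
    scores = {d: sum(c for pos, c in reads.items() if 4 <= pos + d <= n_cds)
              for d in range(0, size + 1, 3)}
    return max(scores, key=scores.get)


rng = random.Random(1)
true_offset, size, n_codons = 18, 31, 100
recovered = Counter(
    best_offset(simulate_gene(rng, n_codons, true_offset), size, 3 * n_codons)
    for _ in range(50)
)
consensus = recovered.most_common(1)[0][0]
print(consensus, recovered)
```

In this setup an individual gene can return the wrong offset only when it happens to have no reads at the second codon, which mirrors the point made above about coverage.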
2.3.8 A-site offsets in mouse embryonic stem cells
The biological fact that the A-site of a ribosome resides only between the second and stop
codon is not limited to S. cerevisiae and hence the Integer Programming algorithm should
be applicable to Ribo-Seq data from any organism. Therefore, we applied our method to
a Pooled Ribo-Seq dataset of mouse embryonic stem cells (mESCs). The resulting A-site
offset table exhibited ambiguous offsets at all but three (𝑆, 𝐹) combinations (Table A.5). In
mESCs there is widespread translation elongation that occurs beyond the boundaries of
annotated CDS regions in upstream open reading frames (uORFs)94. Enrichment of
ribosome-protected fragments from these translating uORFs can make it difficult for our
algorithm to find unique offsets because they can contribute reads around the start codon
of canonical annotated CDSs. Therefore, we hypothesized that if we apply our algorithm
to only those transcripts devoid of uORFs and possessing a single initiation site then our
algorithm should identify more unique offsets. Ingolia and co-workers33 have
experimentally identified, for well-translated mESC transcripts, the number of initiation sites
and whether uORFs are present. Therefore, we selected those genes that have only one
translation initiation site near the annotated start codon and further restricted our analysis
to transcripts with a single isoform, as multiple isoforms can have different termination
sites.
Application of the Integer Programming algorithm to this set of genes increases the
number of unique offsets from 3 to 13 (𝑆, 𝐹) combinations (Table A.6). Applying the same
robustness and consistency tests as we did in S. cerevisiae reveals that 77% of the unique
offsets are robust to threshold variation, and a similar percentage is consistent across both
individual datasets used to create the Pooled data (Table A.6). Thus, the unique offsets
we report for mESCs are robust and consistent in the vast majority of datasets. This result
also indicates that successful identification of A-site locations requires analysing only
those transcripts that do not contain uORFs.
2.3.9 Integer Programming does not yield unique offsets for E. coli
As a further test of how widely we can apply our algorithm, we applied it to a Pooled Ribo-
Seq dataset from the prokaryotic organism E. coli. The number of genes meeting our filtering
criteria is reported in Table A.7. MNase, the nuclease used in the E. coli Ribo-Seq protocol,
digests mRNA in a biased manner - favoring digestion from the 5′ end over the 3′ end66,95.
Therefore, as done in other studies66,95,96, we applied our algorithm such that we identified
the A-site location as the offset from the 3′ end instead of the 5′ end. Polycistronic mRNAs
(i.e., transcripts containing multiple CDSs) can cause problems for our algorithm due to
closely spaced reads at boundaries of contiguous CDS being scored for different offsets
in both the CDSs. To avoid inaccurate results, we restrict our analysis to the 1,915
monocistronic transcripts that do not have any other transcript within 40 nt upstream or
downstream of the CDS. Based on our experience in the analysis of the mESC dataset, we
filter out transcripts with multiple translation initiation sites as well as transcripts whose
annotated initiation sites have been disputed. Nakahigashi and co-workers97 have used
tetracycline as a translation inhibitor to identify 92 transcripts in E. coli with different initiation
sites from the reference annotation and we exclude these transcripts from our analysis as
well. However, for this high coverage pooled dataset, we find ambiguous offsets for all
(𝑆, 𝐹) combinations (Table A.5). A meta-gene analysis of normalized ribosome density in
the CDS and the 30 nt regions upstream and downstream reveals signatures of translation
beyond the boundaries of the CDS (Figure A.4), especially a higher than average
enrichment of reads a few nucleotides before the start codon. We speculate that the base-
pairing of the Shine-Dalgarno (SD) sequence with the complementary anti-SD sequence
in 16S rRNA98 protects these few nucleotides before the start codon from ribonuclease
digestion and hence results in an enrichment of Ribo-Seq reads. Since these “pseudo”
ribosome-protected fragments cannot be differentiated from actual ribosome-protected
fragments containing a codon with the ribosome’s A-site on it, our algorithm is limited in
its application for this data.
2.3.10 Reproducing known PPX and XPP motifs that lead to translational slowdown
In S. cerevisiae65 and E. coli66,67 certain PPX and XPP polypeptide motifs (in which X
corresponds to any one of the 20 amino acids) can stall ribosomes when the third residue is
in the A-site. Elongation factors eIF5A (in S. cerevisiae) and EF-P (in E. coli) help relieve
the stalling induced by some motifs but not others65. Even in mESCs, Ingolia and co-
workers33 detected PPD and PPE as strong pausing motifs. Therefore, we examined
whether our approach can reproduce the known stalling motifs. We did this by calculating
the normalized read density at the different occurrences of a PPX and XPP motif.
In S. cerevisiae, we observed large ribosome densities at PPG, PPD, PPE and PPN
(Figure 2.5A), all of which were classified as strong stallers in S. cerevisiae65 and also in
E. coli67. In contrast, there is no stalling, on average, at PPP, consistent with other
studies65. This is most likely due to the action of eIF5A. For the XPP motifs, the strongest
stalling was observed for GPP and DPP motifs, which is consistent with the results in S.
cerevisiae and in E. coli (Figure 2.5B). In mESCs, we see the strongest stalling at PPE
and PPD, reproducing the results of Ingolia and co-workers33 (Figure A.5A). For XPP
motifs, we observed very weak stalling only for DPP (Figure A.5B). Thus, our approach to
map the A-site on ribosome footprints enables the accurate detection of established
translation pausing at particular PPX and XPP nascent polypeptide motifs.
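The motif analysis can be sketched as a scan over a protein sequence that collects the A-site read density at the third residue of every motif occurrence. In this sketch the density is normalized by the gene's mean reads per codon, which is an assumption; the exact normalization is given in the Methods. The function name, the toy sequence and the read counts are hypothetical.

```python
import re
from statistics import median


def motif_densities(protein: str, codon_reads, motif: str):
    """Normalized A-site density at the third residue of each motif occurrence.

    protein and codon_reads are aligned per codon; 'X' in the motif matches any residue.
    """
    mean_reads = sum(codon_reads) / len(codon_reads)
    if mean_reads == 0:
        return []
    pattern = motif.replace("X", ".")
    densities = []
    for match in re.finditer(f"(?={pattern})", protein):  # lookahead keeps overlapping hits
        third_residue = match.start() + 2
        densities.append(codon_reads[third_residue] / mean_reads)
    return densities


# Toy gene with two PPG motifs and elevated reads at both glycine codons.
protein = "MAPPGKPPGA"
codon_reads = [2, 2, 2, 2, 10, 2, 2, 2, 8, 2]
ppg = motif_densities(protein, codon_reads, "PPG")
print(ppg, median(ppg))
```

Taking the median over all occurrences across genes, as in Figure 2.5, then gives one value per motif.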
A study of Ribo-Seq data of mammalian cells99 observed a sequence-independent
translation pause when the 5th codon of the transcript is in the P-site. This post-initiation
pausing was also observed in an in vitro study of poly-phenylalanine synthesis where
stalling was observed when the 4th codon was in the P-site100. With the A-site profiles
obtained using our offset tables for S. cerevisiae and mESCs, we also observe these
pausing events when both the 4th and 5th codons are at the P-site (Figure A.6).
2.3.11 Greater A-site location accuracy than other methods
There is no independent experimental method to verify the accuracy of identified A-site
locations using our method or any other method26,27,29–32,37,38,42,89,92,101–103. We argue that
the well-established ribosome pausing at particular PPX sequence motifs is the best
available means to differentiate the accuracy of existing methods. The reason for this is
that these stalling motifs have been identified in E. coli61,62 and S. cerevisiae104 through
orthogonal experimental methods (including enzymology studies and toe printing), and the
exact location of the A-site during such a slowdown is known to be at the codon encoding
the third residue of the motif62. Thus, the most accurate A-site identification method will be
the one that most frequently assigns greater ribosome density to X at each occurrence of
the PPX motif.
Figure 2.5. Several PPX and XPP motifs lead to ribosomal stalling in S. cerevisiae. The
median normalized ribosome density is obtained for all instances of (A) PPX and (B) XPP motifs
in which X corresponds to any one of the 20 naturally occurring amino acids. Using a permutation
test, we determine if the median ribosome density is statistically significant or occurs by random
chance. Statistically significant motifs are highlighted in dark red. This analysis was carried out on
the Pop dataset for transcripts in which at least 50% of codon positions have reads mapped to
them. Error bars are 95% Confidence Intervals for the median obtained using Bootstrapping.
We applied this test to the strongest stalling PPX motifs, i.e., PPG in S. cerevisiae and
PPE in mESCs. In S. cerevisiae, the Integer Programming method yields the greatest
ribosome density at the glycine codon of PPG motif when applied to both the Pooled
(Figure 2.6A) and Pop datasets (Figure A.7A). Examining each occurrence of PPG in our
gene dataset, we find that in a majority of instances our method assigns more ribosome
density to glycine than every other method when applied to both the Pooled (Figure 2.6B,
Wilcoxon signed-rank test (𝑛 = 224), 𝑃 < 0.0005 for all methods except Hussmann (𝑃 =
0.164)) and Pop datasets (Figure A.7B, Wilcoxon signed-rank test (𝑛 = 35), 𝑃 < 10⁻⁵ for
all methods except Hussmann (𝑃 = 0.026) and Ribodeblur (𝑃 = 0.01)). The same
analyses applied to mESCs at PPE motifs show that our method outperforms the other
nine methods (Figures 2.6C-D) with our method assigning greater ribosome density at
glutamic acid for at least 85% of the PPE motifs in our dataset as compared to all other
methods (Figure 2.6D, Wilcoxon signed-rank test (𝑛 = 104), 𝑃 < 10⁻¹⁵ for all methods).
Thus, for S. cerevisiae and mESCs our Integer Programming approach is more accurate
than other methods in identifying the A-site on ribosome-protected fragments.
A large number of molecular factors influence codon translation rates and ribosome
density along transcripts11. One factor is the cognate tRNA concentration, as codons
decoded by cognate tRNA with higher concentrations should have on average lower
ribosome densities35,41,105. Therefore, as an additional qualitative test, we expect that the
most accurate A-site method will yield the largest anti-correlation between the ribosome
density at a codon and its cognate tRNA concentration. This test is only qualitative as the
correlation between codon ribosome-density and cognate tRNA concentration may be
affected by other factors, including codon usage and reuse of recharged tRNAs in the
vicinity of the ribosome106,107. Using tRNA abundances previously estimated from RNA-
Seq experiments on S. cerevisiae41, we find that our Integer Programming method yields
the largest anti-correlation compared to the eleven other methods considered (Table A.8),
further supporting the accuracy of our method. We were unable to run this test in mESCs
as measurements of tRNA concentration have not been reported in the literature.
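As an illustration of this qualitative test, the following minimal Python sketch ranks A-site methods by the correlation between per-codon ribosome density and cognate tRNA concentration. The choice of the Pearson coefficient, the function names, the input layout and the toy numbers are all assumptions made for illustration; the text above does not specify which correlation coefficient was used.

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def most_anticorrelated(density_by_method, trna_conc):
    """Return the method with the most negative density/tRNA correlation.

    density_by_method: {method name: {codon: mean normalized ribosome density}}
    trna_conc:         {codon: cognate tRNA concentration}
    (Both layouts are hypothetical.)
    """
    codons = sorted(trna_conc)
    conc = [trna_conc[c] for c in codons]
    scores = {
        method: pearson([dens[c] for c in codons], conc)
        for method, dens in density_by_method.items()
    }
    return min(scores, key=scores.get), scores

# Toy numbers, invented for illustration only.
trna = {"GCU": 10.0, "GGU": 5.0, "CCG": 1.0}
densities = {
    "method_A": {"GCU": 0.6, "GGU": 1.0, "CCG": 1.8},  # density falls as tRNA rises
    "method_B": {"GCU": 1.1, "GGU": 0.9, "CCG": 1.2},  # much weaker relationship
}
best, scores = most_anticorrelated(densities, trna)
```

Under these assumptions, the method returned in `best` is the one that would be judged most accurate by this test.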
Figure 2.6. The Integer Programming algorithm correctly assigns greater ribosome
density than other methods to the Glycine in PPG motifs in S. cerevisiae and to Glutamic
acid in PPE motifs in mESCs. (A) Normalized ribosome density obtained using the various
methods used to identify the A-site is shown for an instance of PPG motif in gene YLR375W
with G at codon position 303 in the Pooled dataset of S. cerevisiae (The legend indicates the
method and full details for each method can be found in the Methods section). (B) The fraction
of PPG instances (n = 224) at which the Integer Programming method yields greater ribosome
density at glycine compared to every other method. The color-coding is the same as shown in
the legend in panel (A). Our method does better if it assigns greater ribosome density in more
than half the instances (horizontal line in panel B). The Integer Programming method does better
than all other methods (P < 0.0005) except for Hussmann, which is not statistically different
(P = 0.164). (C) Normalized ribosome density is shown for an instance of PPE motif in gene
uc007zma.1 with E at codon position 127 in the Pooled dataset of mouse ESCs (see Legend
and main text for details about methods). (D) The fraction of PPE instances at which the Integer
Programming method yields greater ribosome density at glutamic acid compared to every
other method. The color-coding is the same as shown in the legend of panel (C). The Integer
Programming method does better than all other methods (P < 10⁻¹⁵) in accurately assigning
ribosome density to Glutamic Acid in PPE motifs (n = 104). For the analyses presented in (B)
and (D), two-sided p-values were calculated using the Wilcoxon signed rank test. Error bars
represent the 95% Confidence Interval about the median calculated using Bootstrapping.
2.4 Discussion
We have introduced a method to determine the A- and P-site locations on ribosome-
protected mRNA fragments, and shown that it is more accurate than other methods in
correctly assigning ribosome density to the glycine residue in PPG motifs and glutamic
acid residue in PPE motifs, which are strong translation-stalling sites in S. cerevisiae and
mESCs, respectively. Our method is unique amongst existing methods because it (i) uses
a probabilistic approach to identify the A-site location through Integer Programming
optimization and (ii) has an objective function rooted in the biology of translation – meaning
that its optimization enforces the fact that the A-site location of most reads must have been
between the second and stop codons of the CDSs. To be sure, several methods use
biological features to assign the A-site (such as having more reads around the start and
stop codons than in the UTR25,33). However, ours is the only method that also utilizes
feature (i), which is beneficial because the stochastic nature of mRNA cleavage during the
digestion-step of Ribo-Seq necessitates a probabilistic perspective. Our method is not
entirely probabilistic since we have to set thresholds and apply a secondary criterion to
arrive at a final offset value. These measures are unavoidable due to the variability in
coverage between different genes. However, we find that the results are robust to variation
in thresholds and mostly consistent across different Ribo-Seq datasets. Hence, the
respective A-site offset tables provided for S. cerevisiae (Table 2.1) and mouse embryonic
stem cells (Table A.6) can be applied to any dataset from these organisms.
Noteworthy about our test for accuracy is that it is based on results from orthogonal
experimental techniques. The stalling of translation at glycine in PPG motifs is well-
documented61,62,65,66,104 and in S. cerevisiae the Integer Programming method assigns
higher Ribo-Seq reads at the glycine codon at most instances of PPG compared to other
A-site methods. In mESCs PPE is the strongest stalling motif33. The Integer Programming
method outperforms other methods by assigning, on average, 1.76 times more reads at
the glutamic acid codon compared to other methods. These results indicate that the
Integer Programming method presented in this study is more accurate than existing
methods. One reason for this increase in accuracy, among many possible reasons, may
be that most methods only use reads from around the start codon, while our method uses
reads from around both the start and stop codons.
A potential point of confusion may arise from the distributions shown in Figure 2.3 in
which there are two highly probable offset values, raising the question of whether or not
there are multiple A-site locations for a given fragment size and frame. In almost all
fragment length and frame combinations, there is one unique most probable A-site
location, but this ambiguity can arise from poor read coverage on a gene or stochastic
fluctuations in the extent of digestion on one side of an mRNA fragment compared to the
other. Consider fragment size 28 in frame 1. In the
Pop data set (top, middle panel of Figure 2.3A), approximately half of the genes have
∆= 15 nt, while the others have ∆= 18 nt, meaning the A-site could be at either location.
When we increase the read coverage of the genes, however, we see that the vast majority
of the offsets shift to 15 nt (bottom, middle panel in Figure 2.3B). Thus, the original A-site
ambiguity was not due to multiple, equally possible A-site locations, but rather the true A-
site location was hard to detect without better coverage. Consider another example. For
𝑆 = 27 and 𝐹 = 1 we observe in Figure 2.3A that 8% of genes have an optimal ∆= 0,
seemingly suggesting that the A-site is located at the 5′-end on a subset of fragments.
Spot-checking the ribosome profiles of these genes, we find that these genes contain no
reads in the 27 nt region upstream of the second codon and 27 nt upstream of the stop
codon (data not shown). Thus, the values of 𝑇(∆|𝑖, 𝑆, 𝐹) for all ∆ were equal and the
optimal ∆ was arbitrarily assigned a value of 0. In the higher coverage Pooled dataset,
however, there are only 2% of genes with optimal ∆= 0 for 𝑆 = 27 and 𝐹 = 1. Hence, as
we increase coverage, the proportion of genes with spurious offsets decreases. Thus,
offsets away from the most probable offset arise from sampling issues, not from multiple
A-site locations. This result is also seen in the analysis of the artificial Ribo-Seq data where
our algorithm correctly predicts the true offsets for a majority of (𝑆, 𝐹) combinations while
ambiguous offsets occur only for those (𝑆, 𝐹) combinations with the lowest read coverage.
We note that we set a threshold of 70% to determine a most-probable offset for each
fragment size and reading frame and demonstrated that the results are robust to variation
with this threshold (Table A.3). Therefore, the A-site assignments reported in Table 2.1
represent the most likely location of the A-site relative to the 5′ end of mRNA fragments
produced from Ribo-Seq experiments on S. cerevisiae.
Some (𝑆, 𝐹) combinations (such as 𝑆 = 32 and 𝐹 = 0, in Table 2.1) appear to be
inherently ambiguous, that is, increasing their coverage does not lead to a unique A-site
assignment (Figure 2.4D). We do not know the reason for this result, but we speculate
that these are situations where there are truly multiple equally probable A-site locations.
Another possibility is that the ribosome adopts different conformations in these situations
that result in different read lengths and offsets, leading to ambiguity40. A third possibility is
that the nucleotide context around the start and stop codons for a subset of genes may
influence the offset assignment. While for a majority of (𝑆, 𝐹) combinations, higher
coverage leads to convergence towards a single offset, a minority of genes still have an
offset different from the most probable offset, possibly due to sequence bias. It is possible
that this effect has a bigger impact for (𝑆, 𝐹) combinations with ambiguous offsets. The
important point is that the A-site cannot be accurately assigned in these situations. We
therefore recommend that researchers discard reads from these (𝑆, 𝐹) combinations to
minimize chances of erroneous A-site assignments. We believe it will have negligible
effect on the A-site profiles since these combinations contribute only 2.9% of total reads
in the Pooled dataset.
We have found that the Integer Programming algorithm is sensitive to reads arising
from outside the boundaries of annotated CDS regions from non-canonical sources like
upstream ORFs (uORFs) or Internal Ribosome Entry Sites (IRES). Specifically, applying
our method to Ribo-Seq data from mESCs yielded few unique offsets. It was only after
removing genes that had multiple translation initiation sites, some arising from uORFs,
that the number of unique offsets increased more than four-fold. The reason for this
improvement was that removing these genes satisfied our method’s assumption that
the reads within 40 nt of the start codon arise only from the annotated CDS. Our method
was not able to identify any unique offsets in E. coli Ribo-Seq data even after we controlled
for multiple translation initiation sites. We observed in E. coli a high enrichment of reads
before the start codon after applying the conventional 12 nt offset from 3′ end66 (Figure
A.4) which we speculate may be due to protection of mRNA segments involved in binding
of the Shine-Dalgarno sequence to the ribosome108 and could limit the accuracy of our
method.
The next best method to the Integer Programming method is the approach used in the
study of Hussmann and co-workers29. Hussmann’s method uses a near-neighbor heuristic
to determine frame-specific offsets of +15, 14 and 16 for lengths 28 and 29 for frames 0,
1 and 2 respectively, and offsets of +15, 17 and 16 for length 30. The reason Hussmann’s
method yields comparable results is that its offset table is highly similar to Table 2.1. If the
reading frame is maintained after applying the offset from the 5′ end, then 8 out of 9 of
Hussmann’s offsets are the same as in Table 2.1 with the 9th offset of (29, 1) being
ambiguous in our method. We believe the Integer Programming method is superior
because it more frequently assigns greater ribosome density to glycine in PPG motifs,
exhibits a stronger anti-correlation with cognate tRNA abundances, provides greater statistical
power and is based on biological features of translation rather than heuristic assumptions.
Specifically, Hussmann’s method only uses reads that are 28, 29 and 30 nt in length,
whereas our method uses reads from 24 to 34 nt in length.
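The comparison between frame-correcting offsets (such as Hussmann’s) and frame-maintaining offsets can be made concrete with a short Python sketch. The offsets below are the ones quoted in the text; the definition of the frame as the 5′-end position relative to the start codon modulo 3, and all function names, are assumptions made for illustration.

```python
# Hussmann-style offsets quoted in the text, keyed by (fragment length, frame).
HUSSMANN = {
    (28, 0): 15, (28, 1): 14, (28, 2): 16,
    (29, 0): 15, (29, 1): 14, (29, 2): 16,
    (30, 0): 15, (30, 1): 17, (30, 2): 16,
}

def a_site_codon(five_prime_pos, offset):
    """Codon index (0 = start codon) that contains the assigned A-site.

    five_prime_pos is the read's 5' end in nt relative to the first nucleotide
    of the start codon; it is negative for reads that begin in the 5' UTR.
    """
    return (five_prime_pos + offset) // 3

def frame_maintaining_offset(frame, offset):
    """Multiple-of-3 offset that selects the same codon as a frame-correcting one."""
    return 3 * ((frame + offset) // 3)

# Express every quoted offset as its frame-maintaining equivalent.
equivalent = {key: frame_maintaining_offset(key[1], off)
              for key, off in HUSSMANN.items()}
```

For example, a 28 nt read in frame 1 with an offset of 14 lands on the same codon as the same read shifted by 15 nt with its frame kept, which is why the two kinds of offset table can be compared entry by entry.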
Our method preserves the 3 nt periodicity found in the original 5′-end aligned
mRNA fragments. Therefore, it is not designed for detecting frame-shifting, translation of
upstream ORFs, or novel short peptides. Nevertheless, correct assignment of reads to the
A-site codon is essential in a variety of other analyses, such as determining translation
kinetics, and our method provides the most accurate assignment of ribosome density
compared to other methods (Figure 2.6 and Table A.8).
In summary, we have created a method for A-site identification that is more accurate
than existing methods in S. cerevisiae and mouse embryonic stem cells, utilizes a
fundamental feature of translation to identify the A-site, and we have revealed how the A-site
location changes based on the size of the mRNA fragment and its frame. By increasing
the accuracy and range of fragment sizes for which the A-site can be identified, our
approach can help future studies to measure translation elongation properties at the length
scale of individual codons.
2.5 Methods
2.5.1 Ribo-Seq datasets
2.5.1.1 S. cerevisiae. Published Ribo-Seq data from S. cerevisiae were obtained from
GSM1557447 used in the study of Pop and co-workers39. Raw fastq files were
downloaded and pre-processed using Fastx-toolkit (v0.013)
(http://hannonlab.cshl.edu/fastx_toolkit/index.html) as stated in the methods of the original
study. The adapter sequence CTGTAGGCACCATCAAT was stripped using FastQ clipper
and low-quality reads were filtered by FastQ quality filter. The processed reads were
aligned first to the ribosomal RNA sequences using Bowtie 2 (v2.2.3)109. The reads which
did not align to the ribosomal sequences were then aligned to the Saccharomyces
cerevisiae assembly R64-2-1 (UCSC: sacCer3) using Tophat (v2.0.13)110 with up to two
mismatches allowed. Gene annotations were obtained from Saccharomyces Genome
Database (http://www.yeastgenome.org/) on May 4, 2016 for 6,572 protein-coding genes.
Reads were assigned to the nucleotide positions according to the 5′ end.
The pooled Ribo-Seq dataset was formed by combining reads from all replicates of
S. cerevisiae Ribo-Seq data published in studies in which cycloheximide (CHX) was not
used to induce translation arrest9,28,35,39–41,55,111–114. It has been demonstrated that CHX
pre-treatment leads to distortion of ribosome profiles due to ribosome slippage even after
CHX treatment28,29. The distorted ribosome profiles can spill across the CDS boundaries
thus limiting the application of the Integer Programming algorithm. Hence, our analysis only
used those datasets without CHX pre-treatment. The list of all the utilized datasets is
reported in Table A.9. The raw reads from each study were processed according to the
reported method in the original study. If the method is not reported in the original study,
we used cutadapt (v1.14)115 to pre-process the raw reads. The alignment and assignment
of reads to gene transcripts was done as above for the Pop dataset39.
2.5.1.2 Mouse embryonic stem cells. The “no drug” sample for mouse embryonic stem
cells (mESCs) measured by Ingolia and co-workers33 was utilized in this study. Since CHX
treatment has been shown to artificially alter ribosome profiles in S. cerevisiae, we
considered it prudent not to use mESC samples pre-treated with CHX. To increase the
coverage we pooled reads from another untreated Ribo-Seq sample of mESCs published
in the study of Hurt and co-workers116. The linker sequence
CTGTAGGCACCATCAATTCGTATGCCGTCTTCTGCTTGAA for Ingolia’s dataset and
the poly-A adapter sequence for Hurt’s dataset were trimmed using cutadapt (v1.14)115.
The trimmed reads were first aligned to ribosomal RNA sequences using Bowtie2
(v2.2.3)109 and the filtered reads were subsequently aligned to mm10 reference
transcriptome consisting of 21,185 genes obtained from UCSC knownGene database
using Tophat (v2.0.13)110 with up to two mismatches allowed. For a gene with multiple
isoforms, only the isoform with the longest CDS was included in the reference
transcriptome. For transcripts with no information on the 5′ UTR region, we included 40 nt
of genomic sequence upstream from the start codon for successful alignment of reads
around the start codon and effective application of the Integer Programming algorithm.
Translation initiation site data was obtained from Table S3 of the study of Ingolia and co-
workers33. We selected genes that have only one translation initiation site coding for only
a canonical CDS product. From these genes, only genes containing a single isoform were
selected, resulting in 430 genes in our final dataset.
2.5.1.3 Escherichia coli. Wild-type Ribo-Seq data for E. coli were obtained from studies
of Li and co-workers (2012)117, Li and co-workers (2014)118 and Woolstenhulme and co-
workers66. The accession numbers of the samples used are provided in Table A.9. The
respective linker sequences in each sample were trimmed using cutadapt (v1.14)115.
Reads were initially aligned to ribosomal RNA sequences using Bowtie2 (v2.2.3)109 and
the remaining reads were aligned to the E. coli reference genome build NC_000913.3 using Tophat
(v2.0.13)110 with up to two mismatches allowed. Gene annotations were obtained for 4314
genes from RefSeq database corresponding to NC_000913.3.
2.5.2 Gene selection, analyses and statistical tests
2.5.2.1 Selection of genes. To obtain good sampling statistics, we selected for analysis
only those genes that have on average greater than 1 read per codon per fragment length
per reading frame. This means that different sets of genes can be used in the Integer
Programming algorithm depending on the fragment length and frame under scrutiny. The
average number of reads per codon was calculated on the CDS region of the gene and
an additional upstream region corresponding to the size of the fragment length being
considered. Genes in which more than 1% of the total number of mapped reads, for a
given 𝑆 and 𝐹, mapped to multiple locations across the genome were discarded from
further analysis.
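The two gene filters described above can be sketched in Python as follows. This is a minimal illustration for a single (𝑆, 𝐹) combination; the function name, the input dictionary layout and the toy values are hypothetical and are not taken from the published source code.

```python
def select_genes(genes, fragment_length, min_reads_per_codon=1.0,
                 max_multimap_fraction=0.01):
    """Return names of genes passing both filters for one (S, F) combination.

    genes: {name: {"cds_nt": CDS length in nt,
                   "reads": reads of this (S, F) mapped to CDS + upstream region,
                   "multimapped": how many of those reads map to multiple loci}}
    """
    kept = []
    for name, g in genes.items():
        # Region considered = CDS plus an upstream stretch as long as the fragment.
        n_codons = (g["cds_nt"] + fragment_length) / 3
        reads_per_codon = g["reads"] / n_codons
        multimap_fraction = g["multimapped"] / g["reads"] if g["reads"] else 0.0
        if (reads_per_codon > min_reads_per_codon
                and multimap_fraction <= max_multimap_fraction):
            kept.append(name)
    return kept

# Toy values, invented for illustration only.
toy = {
    "geneA": {"cds_nt": 300, "reads": 500, "multimapped": 2},   # passes both filters
    "geneB": {"cds_nt": 300, "reads": 50, "multimapped": 0},    # < 1 read per codon
    "geneC": {"cds_nt": 300, "reads": 500, "multimapped": 50},  # 10% multimapped
}
selected = select_genes(toy, fragment_length=28)
```

Because the filter is applied separately for each fragment length and frame, the set of retained genes can differ between (𝑆, 𝐹) combinations, as noted above.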
2.5.2.2 Identifying unique offsets. We defined the most probable offset ∆′ to have a
unique, unambiguously identified A-site if at least 70% of genes in the dataset had an
offset equal to ∆′, and further required that there be at least 10 genes in the dataset.
Otherwise, the A-site location is defined as ambiguous for the fragment size and frame
under scrutiny. In the Results section, we show the A-site location is largely robust to
moderate variation in this 70% threshold.
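The uniqueness criterion can be written compactly in Python. In this sketch the input is assumed to be a list holding one optimal offset per gene for a given fragment size and frame; the function name and return format are hypothetical.

```python
from collections import Counter

def classify_offset(gene_offsets, threshold=0.70, min_genes=10):
    """Return (unique_offset_or_None, fraction_of_genes_with_top_offset).

    gene_offsets: one optimal offset (in nt) per gene for a single (S, F).
    The offset is reported as unique only if at least `min_genes` genes are
    present and at least `threshold` of them share the most common offset.
    """
    if not gene_offsets:
        return None, 0.0
    top_offset, count = Counter(gene_offsets).most_common(1)[0]
    fraction = count / len(gene_offsets)
    is_unique = len(gene_offsets) >= min_genes and fraction >= threshold
    return (top_offset if is_unique else None), fraction

# Toy inputs, invented for illustration only.
unique_case = classify_offset([15] * 8 + [18] * 2)     # 80% of 10 genes agree
ambiguous_case = classify_offset([15] * 6 + [18] * 4)  # only 60% agree
too_few_genes = classify_offset([15] * 5)              # fewer than 10 genes
```

Returning the fraction alongside the offset makes it straightforward to repeat the classification with other threshold values when checking robustness.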
2.5.2.3 High coverage test. To test for the effect of depth of coverage on the A-site
location we increased the average number of reads per codon required for a gene to be
included in the analyzed dataset from 1 to values up to 50. Three requirements have to
be met for an ambiguous offset to be identified as unique as coverage is increased. As
before, 70% of the genes had to have the most probable offset with at least 10 genes in
the dataset. In addition, there must be a statistically significant increasing trend in the
most probable offset with increasing coverage. This requirement prevents fluctuations
above 70% due to statistical error from being counted as a unique offset. This trend is
calculated using Linear Regression Analysis.
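A minimal sketch of the trend calculation is given below, assuming that the quantity regressed is the fraction of genes with the most probable offset as a function of the coverage threshold. The function name and the toy values are hypothetical. The sketch returns the least-squares slope and its t statistic; a P-value would then be obtained from Student’s t distribution with n − 2 degrees of freedom using a statistics library.

```python
from math import sqrt

def linear_trend(x, y):
    """Ordinary least-squares slope of y on x and the slope's t statistic."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    # Residual sum of squares around the fitted line.
    sse = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    std_err = sqrt(sse / (n - 2) / sxx)
    t_stat = slope / std_err if std_err > 0 else float("inf")
    return slope, t_stat

# Toy values, invented for illustration only: minimum reads per codon
# versus fraction of genes having the most probable offset.
coverage = [1, 10, 20, 30, 40, 50]
fraction = [0.55, 0.62, 0.68, 0.71, 0.76, 0.80]
slope, t_stat = linear_trend(coverage, fraction)
```

In this toy example the slope is positive and the t statistic is large, which corresponds to the increasing trend required for an ambiguous offset to be reclassified as unique.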
2.5.2.4 Test using Artificial Ribo-Seq data. To construct artificial ribosome occupancies,
we used Gillespie’s algorithm119 to simulate translation across S. cerevisiae mRNA
transcripts. During the simulations, we saved snapshots every 100 steps recording the A-
site codon location and creating a histogram of ribosome occupancies across the
transcript. To be consistent with the sampling statistics of the experimental Pooled S.
cerevisiae data, we carried out our analysis on the same 4,487 transcripts that met our
filtering criteria for (𝑆, 𝐹) = (28, 0), and normalized our simulated ribosome occupancies
such that they sum up to the total number of reads mapped to that transcript in the
experimental data. We then created different fragment size and 5′ end reading frame
distributions (Figure A.3A, B). Specifically, since the reads are counts, we use Poisson
statistics by treating each (𝑆, 𝐹) as an event in the order:
(20, 0), (20, 1), (20, 2), … , (35, 0), (35, 1) and (35, 2). Six shifted Poisson distributions of
different variances (𝜆 = 4, 8, 16, 24, 48, 80) were generated. The distributions were shifted
such that the mode of the distribution was at (𝑆, 𝐹) = (28, 0), which is typically found in
experiments, with probabilities summing up to 1 between (20, 0) and (35, 2). Two
additional read length distributions were also considered with modes at (𝑆, 𝐹) = (24, 0)
and (𝑆, 𝐹) = (32, 0) with 𝜆 = 8. Four different sets of offset tables were used as an input
to generate the artificial Ribo-Seq reads from the simulated ribosome occupancies for
each of these distributions. These four offset sets are i) a constant offset of 15 nucleotides
for all (𝑆, 𝐹)s, ii) a constant offset of 18 for all (𝑆, 𝐹)s, iii) a constant offset of 12 for 𝑆 =
20, 21, … , 26, 27 and a constant offset of 18 for 𝑆 = 28, 29, … , 34, 35, and iv) the “top offset” values
for (𝑆, 𝐹) combinations identified using our algorithm in the experimental Pooled S.
cerevisiae data (i.e., the offset values of Table 2.1). These input offset tables were
compared to the ‘output’ offset table generated by applying the IP algorithm on the artificial
Ribo-Seq data to test the correctness of our method.
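The construction of the shifted Poisson distributions over the ordered (𝑆, 𝐹) events can be sketched as follows. The exact way the shift, truncation and renormalization were carried out is an assumption here (the Poisson mode is placed at the chosen event and probabilities are renormalized over the 48 events), and the function names are hypothetical.

```python
from math import exp, floor, lgamma, log

# The 48 ordered events (20, 0), (20, 1), (20, 2), ..., (35, 2).
EVENTS = [(size, frame) for size in range(20, 36) for frame in range(3)]

def poisson_pmf(k, lam):
    """Poisson probability of k events with mean lam (0 for negative k)."""
    if k < 0:
        return 0.0
    return exp(k * log(lam) - lam - lgamma(k + 1))

def shifted_poisson(lam, mode_event=(28, 0)):
    """Probability of each (S, F) event with the Poisson mode moved to mode_event."""
    mode_index = EVENTS.index(mode_event)
    shift = mode_index - floor(lam)  # a Poisson mode is located at floor(lam)
    raw = [poisson_pmf(i - shift, lam) for i in range(len(EVENTS))]
    total = sum(raw)                 # renormalize over the truncated range
    return {event: p / total for event, p in zip(EVENTS, raw)}

distributions = {lam: shifted_poisson(lam) for lam in (4, 8, 16, 24, 48, 80)}
short_reads = shifted_poisson(8, mode_event=(24, 0))
long_reads = shifted_poisson(8, mode_event=(32, 0))
```

Note that for integer 𝜆 a Poisson distribution has two equally probable modes (𝜆 and 𝜆 − 1), so in this sketch the event immediately preceding the chosen mode carries the same probability.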
2.5.2.5 Statistical significance of PPX and XPP motifs. To test if the normalized read
density distribution of a PPX or XPP motif is due to random chance, we calculated the P-
value using a permutation test120. For the total number of instances of a PPX/XPP motif,
we randomly selected an equal number of instances of any other three-residue motif and
determined the median normalized read density at the third codon position of the motif,
thereby creating a random distribution. We repeated this procedure 10,000 times and
calculated the fraction of iterations that had a median density equal to or greater than the
one observed for that PPX/XPP motif. This fraction is equal to the P-value. The instances
of PPX and XPP motifs are identified from those transcripts that have at least 50% of
codon positions with 1 read or more.
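The permutation test can be sketched in Python as shown below. Sampling without replacement, the fixed random seed, the function name and the toy data are assumptions made for illustration; `background_densities` stands for the normalized read densities at the third codon of all other three-residue motif instances.

```python
import random
from statistics import median

def permutation_p_value(motif_densities, background_densities,
                        n_iter=10_000, seed=0):
    """Fraction of random draws whose median is >= the observed motif median.

    motif_densities:      densities at the third codon of each motif instance
    background_densities: list of densities at the third codon of instances of
                          any other three-residue motif (at least as many
                          entries as motif_densities)
    """
    rng = random.Random(seed)
    observed = median(motif_densities)
    n = len(motif_densities)
    hits = 0
    for _ in range(n_iter):
        sample = rng.sample(background_densities, n)  # equal number of instances
        if median(sample) >= observed:
            hits += 1
    return hits / n_iter

# Toy data, invented for illustration only.
background = [i / 100 for i in range(200)]   # densities between 0 and 1.99
p_strong = permutation_p_value([3.0] * 10, background, n_iter=1000)
p_none = permutation_p_value([0.0] * 10, background, n_iter=1000)
```

With 10,000 iterations the smallest non-zero P-value that can be resolved is 10⁻⁴, so a returned value of 0 should be read as P < 10⁻⁴.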
2.5.2.6 Comparison with other A-site mapping methods. We compared the
performance of the Integer Programming algorithm with other methods by calculating the
difference in normalized read density between the Integer Programming A-site value and
the compared method’s A-site value at the third codon of PPG and PPE motifs, which are
associated with ribosome pausing in S. cerevisiae and mESCs respectively.
In S. cerevisiae, A-site ribosome profiles were obtained for the Integer Programming
method by applying the offsets listed in Table 2.1 for fragment sizes 24 to 34 nt. For
methods used by Martens and co-workers31 and Hussmann and co-workers29 specifically
in S. cerevisiae, A-site profiles were obtained by applying the offsets for specific fragment
sizes as stated in the Methods sections of those studies. We included a constant heuristic
offset of 15 nt which has been used in several studies of S. cerevisiae Ribo-Seq
data25,43,60,101. The constant offset of 15 nt has been applied to a wide range of fragment
lengths across studies including 22-32 nt25, 27-30 nt60, 28 nt43, 27-34 nt101. To be
conservative, we apply a constant offset of 15 nt to fragments between 27 and 30 nt only.
Similarly, we also include a method where a constant offset of 18 nt is applied to fragments
between 27 and 30 nt to compare to the performance of the Integer Programming method.
For mESCs, Ingolia and co-workers33 implemented length specific offsets of 15, 16 and
17 nts from the 5′ end, respectively, for fragments of size 29-30 nt, 31-33 nt and 34-35 nt.
Several studies have also implemented a constant offset of 15 nt for the range of fragment sizes
25-35 nt94,121. Similar to S. cerevisiae, we also implement a constant offset of 18 nt for the
fragment size range of 25-35 nt.
Few general methods have been proposed to determine A-site locations in any
organism. We implemented the methods Plastid38, RiboProfiling92 and riboWaltz37 which
are publicly available as R packages. The A-site offset tables generated using these
methods for our analyzed datasets in S. cerevisiae and mESCs are presented in Table
A.10. To determine the A-site profiles using the ‘ribodeblur’ method created by Wang and
co-workers32, we ran the source code available in GitHub
(https://github.com/Kingsford-Group/ribodeblur-analysis/releases/tag/v0.1) on our datasets
and added a custom Python script to generate the ‘deblurred’ A-site profiles. For Rpbp102,
the publicly available
software was downloaded and run locally to obtain the A-site offsets. We also applied the
center-weighted method as described by Becker and co-workers89; for reads greater than
23 nt, we trim 11 nt from both ends of the fragment and distribute the read equally among
the remaining nucleotides. For the scikit-ribo method30, the source code was downloaded and
was successfully run for S. cerevisiae datasets to obtain the A-site profiles. Scikit-ribo
could not be run on mouse ESC data as the currently available version of the source code
contains bugs resulting in inaccurate annotation assignments for higher eukaryotic
genomes.
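Of the methods above, the center-weighted assignment is simple enough to illustrate directly. The following sketch follows the description given in the text (trim 11 nt from both ends of reads longer than 23 nt and spread the read equally over what remains); the function name and the choice to return an empty mapping for shorter reads are assumptions.

```python
def center_weighted(five_prime_pos, length, trim=11, min_length=24):
    """Spread one read uniformly over the nucleotides left after trimming.

    five_prime_pos: position of the read's 5' end (nt)
    length:         read length (nt)
    Returns {nucleotide position: fractional read count}; reads of 23 nt or
    fewer are not assigned in this sketch.
    """
    if length < min_length:
        return {}
    first = five_prime_pos + trim            # first nucleotide kept
    last = five_prime_pos + length - trim    # one past the last nucleotide kept
    n_kept = last - first                    # equals length - 2 * trim
    return {pos: 1.0 / n_kept for pos in range(first, last)}

# A 28 nt read starting at position 100 covers positions 100-127;
# after trimming, its weight is shared by positions 111-116.
weights = center_weighted(100, 28)
```

Summing these fractional counts over all reads gives a nucleotide-level profile in which each read still contributes a total weight of one.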
Instances of PPG motifs (in S. cerevisiae) and PPE motifs (in mESCs) used for analysis
are selected from genes in which at least 90% of codon positions have at least 1 read in
their 5′ aligned ribosome profiles in the CDS region and an upstream region of 18 nt. An
instance of a motif is included for analysis only if its ribosome density is greater than 1.5
times the average ribosome density at the third codon position in the A-site profile of any
of the compared methods. We use the Wilcoxon signed rank test to determine if there is a
statistically significant difference between the normalized read density at the third codon
of motif instances obtained by Integer Programming and other methods.
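For readers who wish to reproduce this comparison, a self-contained sketch of a two-sided Wilcoxon signed rank test is given below. It uses the normal approximation with a tie correction and no continuity correction, which is an assumption and may differ in detail from the implementation used for the reported P-values; the function names and toy data are hypothetical.

```python
from math import erfc, sqrt

def wilcoxon_signed_rank(x, y):
    """Two-sided Wilcoxon signed rank test (normal approximation).

    x, y: paired densities (e.g. Integer Programming vs. another method) at the
    third codon of each motif instance. Zero differences are dropped and tied
    absolute differences receive average ranks. Returns (W_plus, p_value).
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1        # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        t = j - i + 1                     # size of this tie group
        tie_term += t ** 3 - t
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean_w = n * (n + 1) / 4
    var_w = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48
    z = (w_plus - mean_w) / sqrt(var_w)
    return w_plus, erfc(abs(z) / sqrt(2))

def fraction_greater(x, y):
    """Fraction of paired instances in which x exceeds y."""
    return sum(a > b for a, b in zip(x, y)) / len(x)

# Toy data, invented for illustration only.
ip_density = [float(i + 1) for i in range(20)]
other_density = [float(i) for i in range(20)]
w_plus, p_value = wilcoxon_signed_rank(ip_density, other_density)
```

The helper `fraction_greater` corresponds to the quantity plotted in Figure 2.6B and D, namely the fraction of motif instances at which one method assigns greater density than another.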
2.6 Acknowledgements
We thank Ajeet Sharma for providing us with computer code to simulate translation using
Gillespie’s algorithm and the members of the O’Brien Lab for critical feedback on the
manuscript. P.S. is supported by a Borysiewicz Biomedical Fellowship from the University
of Cambridge. This work was supported by National Science
Foundation ABI grant 1759860 to E.P.O.
2.7 Data Availability
All source code is available in the GitHub repository
https://github.com/nabeel-bioinfo/Asite_IP_method
Chapter 3
A CHEMICAL KINETIC BASIS FOR MEASURING TRANSLATION ELONGATION
RATES FROM RIBOSOME PROFILING DATA
The research presented in this chapter is part of a study published in PLOS Computational
Biology titled “A Chemical Kinetic Basis for Measuring Translation Initiation and Elongation
Rates from Ribosome Profiling data” by Ajeet K. Sharma*, Pietro Sormanni*, Nabeel
Ahmed*, Prajwal Ciryam, Ulrike A. Friedrich, Günter Kramer and Edward P. O’Brien (*
denotes co-first authors). This publication describes three methods for measuring,
respectively, translation initiation rates, the average translation elongation rate, and
individual codon translation rates. Nabeel Ahmed’s contribution to this study was the
development of the method for measuring individual codon translation rates and for
estimating the molecular factors influencing the rate of translation elongation.
Portions of the text of this chapter are reproduced from the above publication with
permission from PLOS under the Open Access Creative Commons Attribution License (CC
BY). Only the research contributed by Nabeel Ahmed related to codon translation rates is
presented in this chapter. The in silico validation analysis for codon translation rates
presented in this chapter was carried out by Ajeet K. Sharma for data provided by Nabeel
Ahmed.
3.1 Abstract
Analysis methods based on simulations and optimization have been previously developed
to estimate relative translation rates from next-generation sequencing data. Translation
involves molecules and chemical reactions; hence, bioinformatics methods consistent
with the laws of chemistry and physics are more likely to produce accurate results. Here,
we derive simple equations based on chemical kinetic principles to measure the individual
codon translation rates from ribosome profiling experiments. Our methods reproduce the
known rates from ribosome profiles generated from detailed simulations of translation. By
applying our methods to data from S. cerevisiae, we find that the extracted rates reproduce
expected correlations with various molecular properties in agreement with previous
reports that used other approaches. Our analysis further reveals that a codon can exhibit
up to 26-fold variability in its translation rate depending upon its context within a transcript.
This broad distribution means that the average translation rate of a codon is not
representative of the rate at which most instances of that codon are translated, and it
suggests that translational regulation might be used by cells to a greater degree than
previously thought.
3.2 Author Summary
The process of translating the genetic information encoded in an mRNA molecule to a
protein is crucial to cellular life and plays a major role in regulating gene expression. The
translation initiation rate of a transcript is a direct measure of the rate of protein synthesis
and is the key kinetic parameter defining translational control of the gene’s expression.
Translation rates of individual codons play a considerable role in coordinating co-
translational processes like protein folding and protein secretion and thus contribute to the
proper functioning of the encoded protein. Direct measurement of these rates in vivo is
nontrivial and recent next generation sequencing methods like ribosome profiling offer an
opportunity to quantify these rates for the entire translatome. In this study, we develop
chemical kinetic models to measure absolute rates and quantify the influence of different
molecular factors in shaping the variability of these rates at codon resolution. These new
analysis methods are significant to the field because they allow scientists to measure
absolute rates of translation from Next-Generation Sequencing data, provide analysis
tools rooted in the physical sciences rather than heuristic or ad hoc approaches, and allow
for the quantitative, rather than qualitative study of translation kinetics.
3.3 Introduction
Translation-associated rates influence in vivo protein abundance, structure and function.
It is therefore crucial to be able to accurately measure these rates. The ribosome
synthesizes a protein in three steps: initiation, elongation, and termination122–124.
Translation is initiated at the start codon of the mRNA transcript by the ribosomal subunits
that form a stable translation-initiation complex125,126. During the elongation step, the
ribosome moves along the mRNA transcript, decoding individual codons and adding
residues to the growing nascent chain. Translation is terminated when the ribosome’s
A-site reaches the stop codon, resulting in release of the fully synthesized protein. The
initiation and elongation phases of translation
contribute to protein levels inside a cell; indeed, alteration of their rates can cause protein
abundance to vary by up to five orders of magnitude5,127,128, and alter protein structure and
function11. Termination does not influence the cellular concentration of proteins as it is not
a rate-limiting step129. Therefore, knowledge of translation initiation and codon translation
rates is important for understanding the regulation of gene expression.
Significant efforts have been made to extract these rates from data generated from
ribosome profiling experiments39,130–132, a technique that measures the relative ribosome
density across transcripts25. A number of methods do not measure the actual rates
but instead report a ratio of rates or other related quantities35,41–43,60.
Estimates of translation-initiation and codon translation rates have helped identify the
molecular determinants of these rates. For example, estimated initiation rates correlate
with the stability of mRNA structure near the start codon and in the 5' untranslated
region4,41,129,132 indicating mRNA structure can influence initiation. Similarly, codon
translation rates have been found to positively correlate with their cognate tRNA
abundance41,105, and anti-correlate with the presence of downstream mRNA secondary
structure53,54 and positively charged nascent-chain residues inside the ribosome exit
tunnel42,133. Some of these findings are controversial as different analysis methods and
data have led to contradictory results concerning the role of tRNA concentration27,33,41,43,
positively charged residues42,60 and coding sequence (CDS) length41,129,131,132. Moreover,
the accuracy of these methods is unknown because orthogonal, high-throughput
experimental measurements of translation rates do not exist.
In the absence of data that could differentiate the accuracy of different methods,
we argue that the methods most likely to be accurate will be those that are constrained by
and account for the chemistry and physics of the translating system. Here, we present
such methods, derived from chemical kinetic principles, that permit the extraction of
individual codon translation rates from Next-Generation Sequencing (NGS) data. These
methods are verified against artificial ribosome profiling data generated from detailed
simulations of the translation process where the translation rates are known a priori. We
then apply these methods to in vivo ribosome profiling data and extract the transcriptome-
wide translation-initiation and codon translation rates in S. cerevisiae and transcriptome-
wide average elongation rate in mouse stem cells. We show that the translation rate
parameters correlate with factors known to modulate these rates, and assign absolute
numbers to these rates.
3.4 Results
3.4.1 Theory
3.4.1.1 Measuring individual codon translation rates
To derive a mathematical expression for extracting codon translation rates from ribosome
profiling data we assumed steady state conditions in which the flux of ribosomes at each
codon position is equal to the rate of protein synthesis
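$$\frac{N^{\mathrm{ribo}}_{2,i}}{\tau(2,i)} = \frac{N^{\mathrm{ribo}}_{3,i}}{\tau(3,i)} = \cdots = \frac{N^{\mathrm{ribo}}_{j,i}}{\tau(j,i)} = \cdots = \frac{N^{\mathrm{ribo}}_{N_c(i),i}}{\tau(N_c(i),i)}. \qquad \text{(Eq. 3.1)}$$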
In Eq. (3.1), 𝑁𝑗,𝑖ribo and 𝜏(𝑗, 𝑖) are, respectively, the steady-state number of ribosomes and
the average translation time of the 𝑗𝑡ℎ codon position within copies of transcript 𝑖 in a given
experimental sample. The mean total time of synthesis ⟨𝑇(𝑖)⟩ of transcript 𝑖 is, by
definition, equal to
⟨𝑇(𝑖)⟩ = 𝜏(2, 𝑖) + 𝜏(3, 𝑖) + ⋯ + 𝜏(𝑁𝑐(𝑖), 𝑖), (Eq. 3.2)
Solving Eqs. (3.1) and (3.2) for 𝜏(𝑗, 𝑖) (see derivation in Appendix B) yields
$$\tau(j,i) = \frac{N^{\mathrm{ribo}}_{j,i}}{\sum_{l=2}^{N_c(i)} N^{\mathrm{ribo}}_{l,i}} \, \langle T(i) \rangle. \qquad \text{(Eq. 3.3)}$$
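The algebra connecting Eqs. (3.1) and (3.2) to Eq. (3.3) is short enough to sketch here; the full derivation is in Appendix B. The symbol $J_i$ for the common steady-state flux is introduced only for this illustration and is not part of the notation used elsewhere in this chapter.

$$J_i \equiv \frac{N^{\mathrm{ribo}}_{j,i}}{\tau(j,i)} \quad \Rightarrow \quad \tau(j,i) = \frac{N^{\mathrm{ribo}}_{j,i}}{J_i} \quad \text{for every } j,$$

$$\langle T(i) \rangle = \sum_{l=2}^{N_c(i)} \tau(l,i) = \frac{1}{J_i} \sum_{l=2}^{N_c(i)} N^{\mathrm{ribo}}_{l,i} \quad \Rightarrow \quad J_i = \frac{\sum_{l=2}^{N_c(i)} N^{\mathrm{ribo}}_{l,i}}{\langle T(i) \rangle},$$

$$\tau(j,i) = \frac{N^{\mathrm{ribo}}_{j,i}}{J_i} = \frac{N^{\mathrm{ribo}}_{j,i}}{\sum_{l=2}^{N_c(i)} N^{\mathrm{ribo}}_{l,i}} \, \langle T(i) \rangle.$$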
As is the convention in the field26, we assume that ribosome profiling reads at the 𝑗𝑡ℎ codon
position of transcript 𝑖, 𝑐(𝑗, 𝑖), are directly proportional to 𝑁𝑗,𝑖ribo. This relationship can be
expressed as
𝑁𝑗,𝑖ribo = 𝑎𝑗,𝑖𝑐(𝑗, 𝑖), (Eq. 3.4)
where 𝑎𝑗,𝑖 is a constant of proportionality. 𝑎𝑗,𝑖 values have not been experimentally
measured, but they are commonly assumed to be constant for all codon positions in a
single experiment26. That is, 𝑎𝑗,𝑖 = 𝑎𝑖 for all 𝑖 and 𝑗. Using Eq. (3.4) with 𝑎𝑗,𝑖 = 𝑎𝑖 in Eq.
(3.3) yields
$$\tau(j,i) = \frac{c(j,i)}{\sum_{l=2}^{N_c(i)} c(l,i)} \, \langle T(i) \rangle. \qquad \text{(Eq. 3.5)}$$
Eq. (3.5) indicates that we can determine the individual codon translation rates from
ribosome profiling reads provided we know the average total synthesis time of the
transcript. Eq. (3.5) can be connected to the expression for normalized ribosome density,
derived in the SI of Weinberg et al.41, where 𝜏(𝑗, 𝑖)/[𝜏(𝑖)𝑁𝑐(𝑖)] is the normalized ribosome density
and is expressed as a function of 𝑐(𝑗, 𝑖)s. Eq. (3.5) is also related to a metric used in the
simulations of the study of Dao Duc and Song132 to estimate the codon translation rates. It is
important to note that 𝜏(𝑗, 𝑖) is the actual codon translation time, which includes the time
delay caused by ribosome-ribosome interactions and is distinct from the intrinsic
translation rate of a codon 𝜔(𝑗, 𝑖). 𝜏(𝑗, 𝑖) is equal to the inverse of 𝜔(𝑗, 𝑖)𝑓(𝑗, 𝑗 + ℓ, 𝑖)134,
where 𝑓(𝑗, 𝑗 + ℓ, 𝑖) is the conditional probability that, given that a ribosome is at the 𝑗th
codon position, there is no ribosome at the (𝑗 + ℓ)th codon position.
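Eq. (3.5) amounts to rescaling each codon's share of the reads by the synthesis time of the transcript. The following is a minimal sketch of that calculation, not the analysis code used in this work; the function name, the example read counts, and the synthesis time of 1.0 s are hypothetical.

```python
def codon_translation_times(reads, synthesis_time):
    """Apply Eq. (3.5): tau(j, i) = c(j, i) / sum_l c(l, i) * <T(i)>.

    reads          -- read counts c(j, i) for codon positions 2..N_c(i)
    synthesis_time -- average synthesis time <T(i)> of the transcript (seconds)
    """
    total = sum(reads)
    if total == 0:
        raise ValueError("transcript has no mapped reads")
    # Each codon receives a share of <T(i)> proportional to its read count.
    return [synthesis_time * c / total for c in reads]


# Hypothetical read counts for five codon positions and an assumed <T(i)> of 1.0 s.
reads = [10, 30, 20, 25, 15]
times = codon_translation_times(reads, synthesis_time=1.0)
print(times)
```

By construction the returned times sum to the synthesis time, which is the constraint expressed by Eq. (3.2).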
3.4.2 Application
3.4.2.1 In silico validation of our methods
As a first step to test the accuracy of the translation rates measured with Eq. (3.5), we
applied it to S. cerevisiae ribosome profiles generated by Gillespie simulations135 in
which all of the underlying rates are known (see Methods). If our analysis method is
accurate, then a necessary condition is that it reproduces these rates from the simulated
profiles. We applied Eq. (3.5) to the steady-state ribosome profiles and find that the
individual codon translation times are accurately measured by our method (Figure 3.1,
median 𝑅² = 0.96 and median slope = 1.00). Thus, the analysis method we have created
can in principle accurately capture the translation rate parameters.
There are several points worth noting concerning these tests. First, the rates used
in the simulation model are realistic, having been taken from literature values129,136.
Second, the depth of coverage in the simulated ribosome profiles is in the same range as
experiments, e.g., having 26 million reads arising from coding sequences41. Third, Eq.
(3.5) requires knowledge of the average synthesis time of a protein, which is experimentally
difficult to measure. Therefore, in the above analyses we used the approximation that the
average synthesis time of a protein is proportional to the number of codons in its transcript,
multiplied by the transcriptome-wide average codon translation time (Eq. (B.5))33,137.
However, when we increase our read coverage from 7.1 million to 35.5 billion reads and
use the exact synthesis time of a protein, 𝑅² between the estimated and true codon
translation rates exceeds 0.99 for all 85 transcripts in the simulated data (see Methods).
Thus, our model is reasonably accurate when approximate protein synthesis times are
used (Eq. (B.5)) and the coverage is similar to typical experiments, and highly accurate
when the exact synthesis time is used and coverage is high.
Figure 3.1. Eq. (3.5) accurately determines codon translation times from simulated ribosome
profiles. (A) Average translation time of a codon in the S. cerevisiae transcript YER009W is plotted as
a function of its position within the transcript. The true codon translation times in the simulations are
plotted as green boxes; blue and black data points are the translation times measured using Eq.
(3.5). Blue data points were calculated using the average protein synthesis time measured from the
simulations and relative ribosome density calculated using a large number of in silico ribosome
profiling reads. Black data points were calculated using the average protein synthesis time
estimated from the scaling relationship (Eq. (B.5)) and the relative ribosome density calculated from
the in silico reads, whose number was equal to the number of reads aligned to the same transcript
in the experiment41.
(B) Measured codon translation times, plotted with black and blue data points in (A), are plotted
against true codon translation times in the simulations in the top and bottom panels, respectively. (C)
Probability distribution of the 𝑅² correlation between the true and calculated codon translation times
for the 85 S. cerevisiae transcripts. (D) Probability distribution of the slope of the best-fit lines
between the estimated and true codon translation times for the 85 S. cerevisiae transcripts. The
high 𝑅² in (C) and median slope of 1.00 in (D) indicate that Eq. (3.5) can, in principle, accurately
measure absolute rates.
3.4.2.2 Measurement of individual codon translation rates
To extract individual codon translation times along a coding sequence we applied Eq. (3.5)
to 117 and 364 high-coverage transcripts from ribosome profiling data reported,
respectively, in the studies of Nissley et al.9 and Williams et al.114 (see Methods). The number
of transcripts in each of these datasets is small compared to the size of the S. cerevisiae
transcriptome. Therefore, to determine whether these subsets of transcripts are
representative of the whole transcriptome, we compared the distributions of different
physicochemical properties in these two sets to those of the total transcriptome. We find that the
subset of transcripts from Nissley et al.9 has 6.6% higher mean GC content but a very
similar mode of the length distribution and similar codon usage relative to the total transcriptome
(Figure B.1A-C). For Williams et al.’s ribosome profiling dataset114 we again find that the
mode of the length distribution and the codon usage are similar to those of the S. cerevisiae
transcriptome, with 5.3% higher mean GC content (Figure B.1D-F). This indicates that the
sets of transcripts we analyze are largely representative of the properties of the
transcriptome, but have slightly higher GC content.
Upon extracting individual codon translation times from these ribosome profiling
data, we first characterized the distribution of translation times for the 61 sense codons
(Table B.1). We find an approximately three-fold difference between the median translation times of
the fastest and slowest codons in the Nissley dataset9. The fastest and slowest codons
are AUU and CCG, which are translated in 127±2 and 344±37 ms (median ± standard
error), respectively. The variability in translation times for a given codon type is even
larger, as illustrated by the wide distributions of their translation times in the Nissley dataset
(Figures 3.2A, B.2A). Figure 3.2B shows an example where the AAG codon is translated
with translation times ranging from 59 ms at codon position 413 to 363 ms at codon
position 196 in the S. cerevisiae transcript YAL038W. We find a 16-fold variability in codon
translation times across the transcriptome even if we ignore the extremities of the
distributions by only considering the translation times between the 5th and 95th percentiles
of all codon types. Similar ranges are found in the Williams dataset, where there is a
26-fold variability in translation times and a 3.9-fold difference in median translation times of the
fastest (AUC) and slowest (CGA) codons, which are translated with median times of 128±2
and 496±61 ms, respectively (Table B.2, Figure B.2B). The medians and standard
deviations of translation time distributions are well correlated between the above two
datasets (Figures B.3A, B). The study of Dao Duc and Song132 also infers the individual
codon translation rates and a very high correlation is observed between the rates obtained
using our method and the rates found in that study (Figure B.3C).
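The fold variability quoted above compares the upper and lower ends of a distribution of translation times after trimming its tails. A minimal sketch of one way to compute such a ratio is given below. It assumes linear interpolation between order statistics and a single pooled list of times; the exact percentile convention and pooling used in this work are not specified here, and the function names are hypothetical.

```python
def percentile(sorted_values, pct):
    """Percentile by linear interpolation between neighbouring order statistics."""
    pos = (len(sorted_values) - 1) * pct / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1.0 - frac) + sorted_values[hi] * frac


def fold_variability(times, lower_pct=5, upper_pct=95):
    """Ratio of the upper to the lower percentile of the translation times."""
    ordered = sorted(times)
    return percentile(ordered, upper_pct) / percentile(ordered, lower_pct)


# Hypothetical translation times in ms, evenly spaced from 1 to 101.
example_times = list(range(1, 102))
print(fold_variability(example_times))
```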
3.4.2.3 Molecular factors flanking the A-site shape the variability of individual codon
translation rates
A number of molecular factors have been shown or suggested to influence the translation
rate of a codon in the A-site, including tRNA concentration, mRNA structure, wobble-base
pairing, and proline residues at or near the ribosome P-site29,35,41–43,53,54,105. Here, we test
whether the presence or absence of these factors correlates with changes in the translation
speed that we measure. We first examined whether the cognate tRNA concentration
correlates with our translation times. We find that the median codon translation times
negatively correlate with the abundance of cognate tRNA (Figures 3.3A and B, 𝜌 =
−0.51 (p-value = 0.0006) and 𝜌 = −0.50 (p-value = 0.0009), respectively), indicating that
codons with lower cognate tRNA concentrations are typically translated more slowly.
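A correlation of this kind can be sketched as follows. The text reports 𝜌 without naming the estimator; the sketch assumes a Spearman rank correlation, which is an assumption on our part, and uses hypothetical tRNA abundances and median translation times. It computes only the coefficient, not the p-value.

```python
def average_ranks(values):
    """Rank values from 1..n, assigning tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Pearson correlation of the rank-transformed data."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical values: tRNA abundance (arbitrary units) and median codon time (ms).
trna_abundance = [1, 2, 3, 4]
median_time_ms = [300, 250, 260, 150]
print(spearman_rho(trna_abundance, median_time_ms))
```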
The presence of a proline amino acid at the ribosome’s P-site can slow down
translation due to its stereochemistry58. We tested whether such an effect was present in
our data set by calculating the percentage difference in median translation time at the
A-site when proline is present at the P-site versus when it is not present at the P-site. We
find a 19% increase in median translation time when proline is present (Figure 3.3C,
Mann-Whitney U test, p-value = 2.2 × 10⁻³²), indicating that proline does systematically
slow down translation in vivo.
Figure 3.2. Wide variability in individual codon translation rates in vivo. (A) Probability density
functions for translation times of AUU, GAC and UGG codons in the Nissley dataset. Median translation
times for AUU, GAC and UGG codons are 127, 208 and 331 ms, respectively. (B) The translation
time profile of the S. cerevisiae transcript YAL038W from the Nissley dataset is shown between codon
positions 150 and 450. The AAG codon (colored red) is translated in 362.8 ms at the 196th codon
position and in 58.6 ms at the 413th codon position.
It has been found that the presence of downstream mRNA secondary structure
can slow down translation at the A-site51–54. To test for this effect, we plotted the
difference in the median translation time at the A-site when mRNA secondary structure is
present versus when it is not present at a given number of codon positions downstream
of the ribosome A-site. Structured versus unstructured nucleotides were identified using
DMS data138. We find that when secondary structure is present 4 codons downstream of
the A-site, placing that structure directly at the leading edge of the ribosome, there is on
average a 6.7% increase in codon translation time at the A-site (Figure 3.3D,
Mann-Whitney U test, p-value = 2.7 × 10⁻¹⁴). A slowdown is also found when we
cross-reference our codon translation times with mRNA structure data from PARS139, which
measures the presence of mRNA structure in vitro (Figure 3.3E, Mann-Whitney U test,
p-value = 5.6 × 10⁻⁹).
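The percentage difference in median translation time, together with a bootstrap confidence interval of the kind shown as error bars in Figure 3.3, can be sketched as below. This is an illustrative percentile bootstrap under our own assumptions (independent resampling of the two groups, a fixed random seed, hypothetical function names and data); it does not reproduce the significance testing.

```python
import random
import statistics


def pct_diff_median(with_factor, without_factor):
    """Percentage change in median time when the factor is present vs. absent."""
    m_with = statistics.median(with_factor)
    m_without = statistics.median(without_factor)
    return 100.0 * (m_with - m_without) / m_without


def bootstrap_ci(with_factor, without_factor, n_boot=10_000, seed=0):
    """95% percentile-bootstrap interval for the percentage difference."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        # Resample each group with replacement, keeping its original size.
        a = rng.choices(with_factor, k=len(with_factor))
        b = rng.choices(without_factor, k=len(without_factor))
        stats.append(pct_diff_median(a, b))
    stats.sort()
    return stats[int(0.025 * n_boot)], stats[int(0.975 * n_boot) - 1]


# Hypothetical translation times (ms) with and without downstream structure.
structured = [214, 200, 230, 221, 208]
unstructured = [200, 190, 210, 205, 195]
print(pct_diff_median(structured, unstructured))
print(bootstrap_ci(structured, unstructured, n_boot=2000))
```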
Wobble base pairing between the codon and the anticodon stem-loop of the tRNA has been
found to slow down translation compared to Watson-Crick base pairing in
bacteria140 and metazoans49. For each pair of codon types that are decoded by the same
tRNA molecule, by Watson-Crick base pairing in one instance and wobble base pairing in
the other, we tested whether the two codon types are translated at different rates. We find
that there is no systematic difference in median translation times between codons that are
decoded by either mechanism (Figure 3.3F, Wilcoxon signed-rank test, p-value = 0.46),
indicating that, at least in S. cerevisiae, wobble base pairing does not slow down in vivo
translation elongation.
These results were reproduced using another dataset114, which also shows that
codon translation times anti-correlate with tRNA concentration (Figures B.4A-B, 𝜌 =
−0.58 (p-value = 7.8 × 10⁻⁵) and 𝜌 = −0.56 (p-value = 0.0002), respectively) and exhibit a
significant slowdown when a proline is present at the P-site
(Figure B.4C, Mann-Whitney U test, p-value = 3.0 × 10⁻²⁷) and when mRNA structure is present
downstream of the A-site (Figures B.4D-E, Mann-Whitney U test, p-values =
3.6 × 10⁻⁵ and 8.7 × 10⁻⁴, respectively). Similarly, we found no difference between the
translation rates of codons that are decoded by Watson-Crick and wobble base pairing
(Figure B.4F, Wilcoxon signed-rank test, p-value = 0.88).
Figure 3.3. Molecular factors shaping the variability of individual codon translation rates.
(A-B) Median translation times of codon types are negatively correlated with cognate tRNA
abundance estimated by (A) gene copy number and (B) RNA-Seq gene expression. (C) Probability
distributions of the translation times of codons with a proline-encoding codon at the P-site and of
all other codons are plotted in green and blue, respectively. (D-E) Percentage difference
in median translation times when mRNA structure is present relative to when it is not present is
plotted as a function of codon position after the A-site. Grey bars indicate results that are not
statistically significant. Error bars are the 95% C.I. calculated using 10⁴ bootstrap cycles;
significance is assessed using the Mann-Whitney U test corrected with the Benjamini-Hochberg
FDR method for multiple-hypothesis correction. mRNA structure information used in (D) and (E)
is provided by in vivo DMS and in vitro PARS data, respectively. (F) Scatter plot of the median
translation times of pairs of codon types that are decoded by the same tRNA molecule. The red
line is the identity line. The list of tRNA molecule names and decoded codon types was taken from
Cannarozzi et al.176. Error bars are the standard error about the median translation time calculated
with 10⁴ bootstrap cycles.
3.5 Discussion
We have presented a method for measuring elongation rates from ribosome profiling data.
What distinguishes our approach from many others is that it uses simple equations derived
from chemical kinetic principles, it does not require simulations or a large number of
parameters, and it yields absolute rather than relative rates. We demonstrated that our
approach provides accurate results when applied to test data sets (Figure 3.1), and
reproduced previously reported correlations between translation speed and various
molecular factors (Figures 3.3 and B.4), suggesting the rates obtained by this method are
reasonable.
A novel finding concerning elongation rates is that in S. cerevisiae the translation
time of a codon depends dramatically on its context within a transcript. The range of
individual codon translation times spans up to 26-fold, from 45 to 1,194 ms,
even after discarding the top and bottom 5% of this distribution as possible outliers. The
codon AAG in gene YAL038W, for example, occurs 36 times along this gene’s transcript.
At the 196th codon position AAG is translated in 363 ms, and at the 413th position AAG is
translated in 59 ms. Thus, the same codon in different contexts can be translated at very
different speeds. Characterizing the distribution of mean times of translation of different
occurrences of the same codon reveals a broad distribution (Figure 3.2A), whose
coefficient of variation is often close to 0.5 (Tables B.1 and B.2). This means that the
standard deviation is half of the average translation time of a codon. This leads to the
important finding that the average translation rate of a codon type is not representative of
the rate at which most instances of that codon type are translated. These results are
consistent with the findings that a large number of molecular factors determine codon
translation rates in vivo141, thus giving rise to a broad distribution of rates (Figures 3.2A,
B.2) and these factors have been shown to cause a bias towards slower translation in the
first 200 codons of the transcript54.
A molecular factor that has not been quantified in this study is ribosome queuing.
Currently, the conventional ribosome profiling protocol isolates only monosomes and the
monosome-protected fragments are extracted and sequenced. However, should
ribosomes queue along a transcript, disomes and trisomes are likely to be produced that
are not accounted for in current datasets. Recent studies132,142 have attempted to quantify
the extent of ribosomal queuing but several challenges remain. One of the central
challenges is to correctly identify the location of A-sites of ribosomes translating disome-
and trisome-protected mRNA fragments. Current ribosome profiling datasets that include
disomes have very sparse coverage, which limits the application of our method but more
importantly suggests that the occurrence of disomes, and hence of queuing, may be rather
rare under normal growth conditions. However, under stress conditions, ribosome queuing
has the potential to become frequent for some genes and potentially decrease the
accuracy of our method unless the disome and trisome fragments are included. As
advances in ribosome profiling experiments are made to generate high coverage data and
improve the A-site identification on disomes and trisomes, our method will be able to more
accurately quantify the rates of translation elongation under non-standard growth
conditions.
A number of approaches have been developed to measure codon translation times
including simulation-based approaches131,132 that extract rates by comparing the local
distribution of ribosome profiling reads with simulated ribosome densities, others that
optimize an objective function39 or fit a normalized-footprint-count distribution of a codon
to an empirical function130, and yet others that measure relative codon translation times
by quantifying the enrichment of ribosome read density using a variety of procedures35,43.
In contrast, Eq. (3.5) allows individual, absolute codon translation rates to be calculated
directly from the ribosome profile along the transcript. Another distinction is that a number
of these methods39,43,130,131 assume that all occurrences of a codon across the
transcriptome must be translated at the same rate. This assumption cannot be correct as
it is known that non-local aspects of translation (such as mRNA structure) can influence
the translation speed of individual codons. Eq. (3.5) does not make this assumption, and
therefore its extracted rates will better reflect the naturally occurring variation of codon
translation times across a transcript.
The codon usage in a transcript, and associated translation rates, can affect
various co- and post-translational processes involving nascent proteins11. Therefore, the
accurate knowledge of codon translation times measured using Eq. (3.5) will help provide
a better quantitative understanding of how translation speed can impact the efficiency of
co-translational processes, such as protein folding, chaperone binding, and numerous
other processes involving the nascent protein. Coupled with molecular biology techniques
that can knock out various genes and their functions in cells, Eq. (3.5) provides the
opportunity to quantitatively examine whether co-translational processes can cause
translation speed changes.
Ribosome profiles have ill-quantified sequencing biases27 that can potentially
produce reads that are not proportional to the underlying number of ribosomes at a
particular codon position. This could lead to errors in the extraction of translation rate
parameters using our methods143. It has been demonstrated that using translational
inhibitors like cycloheximide leads to distortion of ribosome profiles due to inefficient arrest
of translation28,29. This was one of the primary reasons why initial studies using
cycloheximide did not observe a correlation of codon translation rates and cognate tRNA
concentration. While there is often a strong correlation between the total number of
mapped reads per transcript between datasets from different studies, the correlation is
often poor at the individual nucleotide level101. This “noise” at this resolution has been
attributed to sparse read coverage101, choice of ribonuclease for digestion144, and the
methods used to halt elongation in the ribosome profiling protocol28,29. Restricting our
analyses to transcripts with high coverage contributes to more reproducible results, as can
be seen by the high correlation between the two datasets used in this study (Figure B.3).
Experimental improvements that minimize bias have been developed26,41,144,145, such as
using flash-freezing for arresting translation and utilizing short microRNA library
generation techniques146, but sequence-dependent biases can still exist, for example due
to varying efficiencies of linker ligation147. As experimental techniques are improved to
minimize bias, the accuracy of the rates extracted using our methods will also increase.
The absence of accurate translation rate parameters is an impediment to
quantitatively modeling the process of translation. By measuring translation rate
parameters using a chemical-kinetic framework, our method can contribute to ongoing
efforts2,148 to understand how the sequence features of an mRNA molecule can regulate
gene expression. More broadly, the approach we have taken in this study is to utilize ideas
from chemistry and physics to analyze Next-Generation Sequencing data, a branch of
bioinformatics we refer to as physical bioinformatics. We expect that this
physical-science-based approach will prove useful in understanding other large biological data sets
concerning translation and complement the conventional computer science approaches to
bioinformatics.
3.6 Methods
3.6.1 Simulated steady state ribosome profiling data
We carried out protein synthesis simulations using the inhomogeneous ℓ-TASEP
model142,148–151. In this model, with ℓ = 10 and the A-site of the ribosome located at the
6th codon within the ribosome-protected
mRNA fragment, a new translation-initiation event stochastically occurs on transcript 𝑖 with
rate 𝛼(𝑖) when the first six codons of the transcript are not occupied by another ribosome25.
The ribosome then stochastically moves along the transcript from codon position 𝑗 to 𝑗 + 1
with rate 𝜔(𝑗, 𝑖) if no ribosome is present at the (𝑗 + ℓ)th codon position. A ribosome
stochastically terminates the translation process with rate 𝛽 when its A-site encounters
the stop codon. Note that our simulation model does not account for other processes such
as ribosome recycling152 and drop-off153.
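The model described above can be sketched as a small Gillespie simulation of a single transcript. This is an illustrative reimplementation under our own assumptions, not the simulation code used in this work: A-site positions use a hypothetical 0-based index in which index 0 is the first A-site position after initiation and index len(omega) is the stop codon, initiation is allowed when the nearest ribosome is at least ℓ positions ahead, and occupancy is accumulated as time-weighted dwell time rather than from periodic snapshots. All function and variable names are hypothetical.

```python
import random


def simulate_ltasep(omega, alpha, beta, ell=10, n_complete=1000, seed=0):
    """Gillespie simulation of one transcript in an l-TASEP-like model.

    omega      -- intrinsic elongation rates (1/s); omega[j] applies when the
                  A-site is at sense-codon index j
    alpha      -- initiation rate (1/s)
    beta       -- termination rate (1/s)
    ell        -- ribosome footprint in codons
    n_complete -- stop after this many termination events
    Returns (occupancy, completed); occupancy[j] is the total time that any
    A-site spent at index j, and index len(omega) is the stop codon.
    """
    rng = random.Random(seed)
    n = len(omega)
    ribos = []                      # A-site indices, leading ribosome first
    occupancy = [0.0] * (n + 1)
    done = 0
    while done < n_complete:
        # Collect the events that exclusion currently allows: (rate, ribosome index).
        events = []
        if not ribos or ribos[-1] >= ell:
            events.append((alpha, -1))          # -1 marks initiation
        for k, pos in enumerate(ribos):
            if pos == n:
                events.append((beta, k))        # termination at the stop codon
            elif k == 0 or ribos[k - 1] != pos + ell:
                events.append((omega[pos], k))  # hop from j to j + 1
        total = sum(rate for rate, _ in events)

        # Exponential waiting time; credit it to every occupied position.
        dt = rng.expovariate(total)
        for pos in ribos:
            occupancy[pos] += dt

        # Pick one event with probability proportional to its rate.
        r = rng.random() * total
        for rate, k in events:
            r -= rate
            if r < 0:
                break
        if k == -1:
            ribos.append(0)
        elif ribos[k] == n:
            ribos.pop(k)
            done += 1
        else:
            ribos[k] += 1
    return occupancy, done


# Hypothetical rates for a six-codon toy transcript.
omega = [8.0, 5.0, 10.0, 4.0, 8.0, 6.0]
occupancy, completed = simulate_ltasep(omega, alpha=0.1, beta=35.0, n_complete=500)
dwell = [t / completed for t in occupancy[:-1]]   # mean time per ribosome at each codon
print(dwell)
```

When initiation is slow relative to elongation, ribosomes rarely block one another and the mean dwell time at each codon approaches 1/𝜔; faster initiation lengthens the dwell times through queuing.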
85 S. cerevisiae mRNA transcripts were selected to test our codon translation rate
measurement method. They were chosen based on the filtering criteria that each codon
has at least 3 reads in the ribosome profiling data reported in the study of Weinberg et al.41.
We used the translation-initiation rates reported in Ciandrini et al.129 in our simulations for
these transcripts. We used codon translation rates from Fluitt and Viljoen’s model for all
61 sense codons136 and set the translation-termination rate129 to 35 s⁻¹. We set ℓ = 10
codons in our simulations because it is the canonical mRNA fragment length that is
protected by ribosomes against nucleotide digestion in ribosome profiling experiments25.
We simulated the translation of these 85 S. cerevisiae mRNA sequences using
Gillespie’s algorithm135 to generate the in silico ribosome profiling data. During the
simulations, we recorded ribosome locations across the transcript every 100 steps, which
we found minimized the time-correlation between successive saved snapshots. The codon
positions of the ribosome’s A-site in each of these snapshots, summed over all snapshots,
constituted the in silico generated ribosome profile for the transcript. We ran the
simulations until the total number of in silico ribosome profiling reads was equal to the
total number of reads aligned to the same transcript in the experimental
ribosome profiling data reported in the study of Weinberg et al.41. This allowed us to create a
realistic level of statistical sampling in our in silico ribosome profiles. Each of the
uncorrelated snapshots can be thought of as a separate copy of the mRNA transcript. Thus,
the total number of these snapshots was equal to the mRNA copy number in our in silico
experiment, which we used to calculate the 𝜌(𝑗, 𝑖) values.
3.6.2 In silico measurement of average protein synthesis and codon translation
times
To calculate the codon translation times (Eq. (3.5)) from in silico ribosome profiles we
need the average time a ribosome takes to synthesize a protein from a given transcript.
We measured this quantity from our simulations by recording the time it takes a ribosome
to traverse from the start codon to the stop codon in the transcript. The average synthesis
time of a protein was then calculated from 10,000 individual ribosomes.
We also calculated the average synthesis time of a protein using a scaling
relationship that uses the transcriptome-wide average codon translation time (Appendix
B, Eq. (B.5)). To calculate this quantity, we first computed the average codon translation
time for each transcript by dividing the average protein synthesis time of a transcript by its
CDS length. We then calculated the transcriptome-wide average codon translation time
using the average codon translation time of each transcript.
Testing the accuracy of Eq. (3.5) requires the real codon translation times which
we measured by setting a separate clock at each codon position of a transcript in our
simulations. These clocks measured the time difference between successive arrival and
departure of a ribosome at each codon position. To calculate the average codon
translation time at each codon position at least 10,000 such measurements were made.
3.6.3 Analysis of ribosome profiling and RNA-Seq data
3.6.3.1 Datasets: To calculate the codon translation rates, we apply our method to high-
coverage ribosome profiling datasets of wild type S. cerevisiae reported in Nissley et al.9
and Williams et al.114 with NCBI accession numbers GSM1949551 and GSM1495503,
respectively. In our analysis, reads were preprocessed and mapped to the sacCer3 reference
genome as described in Nissley et al.9. To maintain the accuracy of read assignment,
transcripts in which multiply mapped reads constitute more than 0.1% of the total reads
mapping to the CDS region were not considered in the analysis. A-site positions in
ribosome profiling reads were assigned according to the offset table generated using an
Integer Programming algorithm which maximizes the reads between the second and stop
codon of transcripts154. The offset table for S. cerevisiae is taken from Table 2.1 of Chapter
2 (also Ahmed et al.154).
3.6.3.2 Selection of genes for codon translation rates: To extract individual codon
translation rates, we restrict our analysis to genes that have at least 3 reads at every codon
position of the transcript. We find that 117 and 364 genes meet this criterion in the data
sets of Nissley et al.9 and Williams et al.114, respectively. This stringent requirement is
necessary since Eq. (3.5) would predict codons with zero reads to be translated in zero
time. Reads at the start codon and the second codon have contributions from the
translation initiation process; therefore, we ignored these codon positions in our
calculations of translation time distributions and correlation with molecular factors. As
stated before, transcripts in which multi-mapped reads exceeded 0.1% of the total
reads mapped to the transcript were discarded. Genes with overlapping coding sequence
regions as well as those containing introns (which make up less than 6% of the S. cerevisiae genome)
were not considered in the analysis to avoid overlap of ribosome profiles.
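The coverage criterion above can be sketched as a simple filter. This is an illustration under stated assumptions: the input format (a list of A-site read counts per codon for each gene) and both function names are hypothetical, and the multi-mapping, overlap and intron filters are not shown.

```python
def select_genes(codon_reads, min_reads=3):
    """Return genes with at least min_reads reads at every codon position.

    codon_reads: {gene_name: [reads at codon 1, reads at codon 2, ...]}
                 (hypothetical format).
    """
    return [
        gene
        for gene, counts in codon_reads.items()
        if counts and min(counts) >= min_reads
    ]


def positions_for_analysis(counts):
    """Drop the start codon and the second codon, whose reads carry
    contributions from translation initiation."""
    return counts[2:]
```

Because a single codon with fewer than three reads disqualifies the whole gene, the filter is stringent by construction, which is consistent with only a few hundred genes passing it.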
3.6.3.3 Miscellaneous: (a) Since experimental measurements of ⟨𝑇(𝑖)⟩ are not available
for S. cerevisiae, we use Eq. (B.5) to estimate ⟨𝑇(𝑖)⟩ with ⟨𝜏𝐴⟩ = 200 ms, as reported in the
literature155,156. (b) The measures for tRNA abundance based on gene copy number and
RNA-Seq measurements were obtained from Table S2 of Weinberg et al.41.
3.6.4 Assignment of mRNA secondary structure
Both DMS and PARS data provide information about base-paired nucleotides within an
mRNA molecule. We considered a codon to be structured if at least two of its three
nucleotides were base-paired or if one nucleotide was base-paired and the structure
information for the other two nucleotides was not available.
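The codon-level rule stated above can be written as a small predicate. The encoding of each nucleotide as True (base-paired), False (unpaired) or None (no structure information) is an assumption made for this sketch.

```python
def codon_is_structured(paired):
    """Apply the codon-level structure rule.

    paired: sequence of three values, one per nucleotide:
            True (base-paired), False (unpaired), None (no data).
    """
    n_paired = sum(1 for p in paired if p is True)
    n_missing = sum(1 for p in paired if p is None)
    # Structured if >= 2 nucleotides are paired, or if exactly one is
    # paired and the other two have no structure information.
    return n_paired >= 2 or (n_paired == 1 and n_missing == 2)
```

For example, a codon with one paired, one unpaired and one missing nucleotide is not called structured under this rule.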
DMS data for S. cerevisiae were downloaded from the GEO database with accession
number GSE45803 138. The reads from all in vivo replicates were pooled together and then
aligned to the ribosomal RNA sequences using Bowtie 2 (v2.2.3)109. The reads that did
not align to the ribosomal RNA sequences were then aligned to the Saccharomyces
cerevisiae assembly R64-2-1 (UCSC: sacCer3) using TopHat (v2.0.13)110. In our analysis,
A and C nucleotides were considered base-paired when the DMS signal was below the
threshold of 0.2 and considered unstructured if the DMS signal was greater than 0.5. A
and C nucleotides with DMS signal between 0.2 and 0.5 were considered ambiguous and
classified together with U and G nucleotides, which do not react with DMS. Codons
involving such nucleotides were not considered in our analysis.
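The nucleotide-level DMS thresholds can be sketched as below. The function name and the string labels are illustrative, and treating signals exactly equal to 0.2 or 0.5 as ambiguous is an assumption, since the text does not specify the boundary cases.

```python
def classify_dms(base, signal, paired_below=0.2, unpaired_above=0.5):
    """Classify one nucleotide from its DMS signal.

    base:   'A', 'C', 'G' or 'U'.
    signal: DMS signal for that nucleotide (ignored for G and U,
            which do not react with DMS).
    Returns 'paired', 'unpaired' or 'ambiguous'.
    """
    if base not in ("A", "C"):
        return "ambiguous"
    if signal < paired_below:
        return "paired"
    if signal > unpaired_above:
        return "unpaired"
    # Signals in [0.2, 0.5] are treated as ambiguous (boundary handling
    # is an assumption of this sketch).
    return "ambiguous"
```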
PARS data were downloaded from genie.weizmann.ac.il/pubs/PARS10 with
PARS scores available for all transcripts, except YDR461W and YNL145W, which were
excluded from our analysis. Nucleotides with a PARS score greater than 0 were
considered base-paired139.
Chapter 4 EVOLUTIONARILY SELECTED AMINO ACID PAIRS ENCODE
TRANSLATION-ELONGATION RATE INFORMATION
This chapter is formatted as a 2,500-word manuscript that was recently submitted to a
peer-reviewed journal. The authors are Nabeel Ahmed, Ulrike A. Friedrich, Pietro
Sormanni, Prajwal Ciryam, Bernd Bukau, Günter Kramer and Edward P. O’Brien. The
contributions of each author to this study are as follows: N.A. and E.P.O. conceived the study. U.A.F.,
G.K. and B.B. carried out experiments to generate S. cerevisiae mutant strains and ran
Ribo-Seq on these strains. P.S. and P.C. contributed to analysis methods of published
Ribo-Seq data and annotation of domain boundaries in S. cerevisiae. N.A. analyzed the
data. N.A. and E.P.O. wrote the manuscript.
4.1 Abstract
The speed of translation is generally considered to be encoded within messenger RNA
molecules and influenced by intracellular conditions. Here, using a combination of
mutational experiments, bioinformatic and evolutionary analyses, we show that particular
pairs of amino acids and their associated tRNA molecules predictably and causally encode
translation rate information within the primary structures of proteins when these pairs are
present in the A- and P-sites of the ribosome. For some pairs, it is solely the amino acid
identity or tRNA identity that determines the variation in translation speed, while for others,
the speed is determined by a combination of these two factors. The fast-translating pairs
of amino acids are enriched seven-fold relative to the slow-translating pairs across the
Saccharomyces cerevisiae proteome, while the slow-translating pairs are enriched
downstream of domain boundaries. Thus, translation rate information is causally encoded
in the primary structures of proteins via pairs of amino acids, and signatures of
evolutionary selection pressure indicate their use in coordinating co-translational
processes.
4.2 Main Text
Variation in translation-elongation kinetics along a transcript’s coding sequence plays an
important role in the maintenance of cellular protein homeostasis by regulating co-
translational protein folding, localization, and maturation11,70,81. Codon translation rates are
influenced by a range of molecular factors, including the presence of particular tripeptide
sequence motifs composed of one or more prolines58,60–62,65 and positively charged
nascent-chain residues within the negatively charged ribosome exit tunnel42,133,
suggesting that the primary structures of proteins encode translation-elongation rate
information, in some cases through their influence on ribosome catalysis. Since the
ribosome catalyzes peptide bond formation between 400 unique amino acid pairs when
they reside in the P- and A-sites of the ribosome, we postulated that the chemical identity
of some of these pairs might predictably and causally alter codon translation rates.
To test our hypothesis, we used ribosome profiling applied to Saccharomyces
cerevisiae. Ribosome profiling is a high-throughput technique that measures ribosome
densities that are directly proportional to the location and number of ribosomes translating
different codon positions across a transcriptome25. The measured ribosome density ρ at
a codon is inversely proportional to the speed at which ribosomes translate that
codon26,143. We analyzed the translational profiles of 364 high-coverage transcripts
measured in six independent, published data sets41,111,113,114,157 (Table C.1). For each of
the 400 unique pairs of amino acids that can reside in the P- and A-sites — which for a
given pair we denote as (X,Z), where X is the amino acid in the P-site and Z is the amino
acid in the A-site — we compared the normalized ribosome density distribution, [𝜌(𝑋, 𝑍)],
arising from all instances of the pair (X,Z) in the data set versus the distribution [𝜌(~𝑋, 𝑍)]
arising from all instances of Z being in the A-site but X not being present in the P-site. For
example, for the pair denoted (N, R), N is in the P-site and R is in the A-site, while (~N, R)
corresponds to the 19 other naturally occurring amino acids that can be in the P-site when
R is in the A-site (Figure 4.1a). The percent change in the median of [𝜌(𝑋, 𝑍)] relative to
the median of [𝜌(~𝑋, 𝑍)] quantifies the influence of the identity of the P-site amino acid on
the rate of translation relative to the rate when any other amino acid is in the P-site (Eq.
C.2). This approach controls for cognate tRNA concentration effects because the A-site
amino acid is fixed. Applying this analysis to each of the six published datasets, we
obtained six matrices reporting the percent change in ribosome density when a particular
amino acid pair is present in the P- and A-sites (Figure C.1). We focus only on highly
reproducible results by taking the intersection of those pairs that exhibit a consistent sign
change in all datasets and a percent change that is statistically significant in at least four
of the datasets. We found 84 pairs in which the presence of a particular P-site amino acid
is correlated with faster translation (green-shifted colors in Figure 4.1b) and 73 pairs in
which the identity of the P-site amino acid is correlated with slower translation (red-shifted
colors in Figure 4.1b). The results for the remaining pairs are not significant or consistent
Figure 4.1. Bioinformatic analyses of ribosome profiling data indicate that the identity of amino
acids in the P- and A-sites can predictably alter the translation speed of the A-site codon. (a) A
ribosome with the amino acids N and R in the P- and A-sites, respectively. From ribosome profiling data,
we calculated the distribution of ribosome densities in the A-site from all instances of (N, R) in our dataset
and compared the result to the distributions of all other instances of R in the A-site when N is not present
in the P-site, denoted [(~N, R)] (top panel). (b) Each box in the matrix indicates, for a pair of amino acids
in the P- and A-sites, the percent change in median normalized ribosome density ρ when that particular
amino acid is in the P-site compared to any other amino acid in the P-site, keeping the A-site amino acid
unchanged (Eq. C.2). The sign of the percent change must be consistent in all 6 analyzed ribosome
profiling datasets and statistically significant in at least 4 out of the 6 datasets, otherwise the box is colored
gray. * corresponds to any of the stop codons being present in the A-site. (c) Comparison of distributions
of amino acid pairs where R is kept constant at the A-site while the P-site is mutated from N to S. The
distributions of normalized ribosome densities for P- and A-site pairs (N, R) and (S, R), which differ
significantly from each other, are shown (Mann-Whitney U test, 𝑝 = 4.45 × 10−17). The median normalized
ribosome densities of the two distributions differ by 53.4%, and the odds of a change in translation speed
when (N, R) is mutated to (S, R) or vice versa are 2.98 (Eq. C.4). (d) The estimated percent difference
values for all 7,980 mutations of amino acid pairs with a constant A-site are plotted with respect to the
statistical significance of the difference between the distributions (see Methods in Appendix C). We
estimate that mutating the P-site will lead to significant changes in translation speed in 4,254 (53%) of
these mutations. (e) For the significant combinations of amino acid pairs, the distribution of the odds that
mutating any instance of the pair results in a change in speed is plotted.
across the datasets (gray boxes in Figure 4.1b). These results suggest that the identity of
the P-site amino acid in 157 pairs of amino acids can predictably accelerate or retard
translation.
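The comparison of [ρ(X, Z)] with [ρ(~X, Z)] can be sketched as below. Eq. C.2 is given in Appendix C and is not reproduced in this chapter, so the form used here (percent change of one median relative to the other) is an assumption based on the description in the text; the record format and function name are likewise illustrative.

```python
from statistics import median


def percent_change_in_median(records, x, z):
    """Percent change in median density for pair (x, z) versus (~x, z).

    records: iterable of (p_site_aa, a_site_aa, normalized_density)
             tuples (hypothetical format).
    Returns None if either group is empty.
    """
    with_x = [rho for p, a, rho in records if a == z and p == x]
    without_x = [rho for p, a, rho in records if a == z and p != x]
    if not with_x or not without_x:
        return None
    reference = median(without_x)
    # Positive values mean higher density, i.e. slower translation,
    # because density is inversely proportional to translation speed.
    return 100.0 * (median(with_x) - reference) / reference
```

Fixing the A-site amino acid z in both groups is what keeps the A-site identity constant across the comparison.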
To test whether the potentially confounding factors of tripeptide motifs65, positively
charged upstream residues42,133, downstream mRNA structure52,53, or cognate tRNA
concentration41,105 explain the speed changes in Figure 4.1b, we controlled for each of these
factors separately and found that even in their absence, the sign of the speed change is
preserved in 156 of the 157 pairs (Figure C.2). Thus, while these molecular factors can
contribute to codon translation rates, they do not explain the sign of the speed change we
observed.
Figure 4.1b predicts that by keeping the A-site amino acid fixed and mutating the
P-site amino acid, it is possible to accelerate or retard translation. For example, when
comparing the amino acid pairs (N,R) to (S,R), where R is the amino acid in the A-site, we
find a median ribosome density difference of 53% (Figure 4.1c). Hence, we predict that
the codon encoding R in the (S,R) pair will be translated faster than the codon encoding
R in the (N,R) pair. We predict that there will be a translation rate change for 53% of the
7,980 possible P-site mutations in amino acid pairs where the A-site is fixed (Figure 4.1d).
Because we are dealing with overlapping distributions (Figure 4.1c), we can calculate the
odds (Eq. C.4) that a given mutation changes the speed in a particular direction; for example, a mutation from (N,R) to (S,R) will accelerate translation
with 3-to-1 odds. We calculated these odds for each of the possible 7,980 mutations and
found a broad distribution (Figure 4.1e). With odds of 5.7-to-1, mutating (W,G) to (P,G)
will retard translation, while mutations with odds of 1-to-1, such as (V,W) to (H,W), are
equally likely to accelerate translation as they are to retard it when Val is mutated
to His for different instances of (V,W) across the proteome.
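One way to turn two overlapping density distributions into odds is a pairwise comparison of all values, sketched below. This is only a plausible formulation chosen for illustration: Eq. C.4 is defined in Appendix C and may differ, and the function name is hypothetical.

```python
def odds_of_speedup(rho_before, rho_after):
    """Odds that a random instance of the mutated pair has lower density
    (faster translation) than a random instance of the original pair.

    rho_before, rho_after: lists of normalized ribosome densities for
    the original and mutated pairs. Ties are ignored.
    """
    faster = sum(1 for b in rho_before for a in rho_after if a < b)
    slower = sum(1 for b in rho_before for a in rho_after if a > b)
    if slower == 0:
        return float("inf")
    return faster / slower
```

Under this formulation, odds near 1 arise when the two distributions overlap almost completely, which matches the qualitative reading of 1-to-1 odds in the text.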
To experimentally test these predictions, we introduced 12 non-synonymous
mutations into various positions of five non-essential S. cerevisiae genes that are not
involved in translation, and no mutations were made at functional sites of the encoded
proteins158 (Table C.2). Five of the mutations we predicted will accelerate translation, five
we predicted will retard translation, and two we predicted to have minimal effects on
translation speed when the mutated residue is present in the P-site. To ensure precise
measurements at codon resolution, we performed ribosome profiling experiments at
unconventionally high read depths, with an average of 86 million mapped exome reads
Figure 4.2. Experiments demonstrate that the identity of amino acids in the P- and A-sites can
predictably alter the translation speed of the A-site codon, consistent with the predictions from Fig
4.1b. Normalized ribosome density (Eq. C.1) upon mutation at five pairs of residues that are predicted to
retard translation (a), five other pairs that are predicted to accelerate translation (b), and two negative control
mutations that are predicted to have little to no effect on translation speed (c) were measured in S.
cerevisiae. The gene name and pair of amino acids before and after mutation are listed above the panels
in (a) through (c). Full details concerning the mutations are provided in Table C.2. In each panel, the
normalized ribosome density measured at the A-site residue is reported for the wild-type sequence
transcript (blue data points) and mutated sequence (orange data points). Each data point corresponds to
one biological replicate; the horizontal bar indicates the mean value. The difference between the medians
in each panel is statistically significant (Fisher-Pitman permutation test; 𝑝 = 0.036 for all subpanels in (a)
and for mutations in YOL*, YKL* and YLR* in (b); 𝑝 = 0.002 for two mutations in YHR* in (b); 𝑝 = 0.002 and
𝑝 = 0.004 for the two subpanels in (c), respectively). The distribution of percent differences in ribosome
density between the mutant and wild-type sequences for the data in panels (a) and (b) is shown as a blue
box plot in panel (d), and for the negative control sequences in panel (c), the distribution is shown as an
orange box plot in panel (d). The mutations in the negative controls do show a statistically significant
difference in normalized densities compared to wild-type (Fisher-Pitman permutation test, 𝑝 = 0.002 and
𝑝 = 0.004). However, these mutations exhibit a 2.5-fold reduction in effect size (d), consistent with the
predictions from Fig. 4.1b.
per sample, and totaling 1.7 billion mapped exome reads across samples (Table C.3). The
resulting ribosome profiles exhibit strong 3-nt periodicity, with 87% of mapped reads in
frame zero at a fragment size of 28 nt, and a very strong correlation between profiles for
the same gene across samples (Pearson 𝑟 = 0.96, Figures C.3 and C.4), indicating that
technical biases are minimal and that any remaining biases will cancel out when we carry
out a relative comparison between wild type and mutant results. Comparing the
normalized ribosome densities between the wild-type and mutant strains (Figure 4.2) we
find that the direction of change in ribosome density at the A-site is consistent with the
predictions from Figure 4.1b. Two of these ten mutations include a proline in the P-site,
for which we observe a speedup when the pair (P,G) is mutated to (E,G) in the gene
YMR122W-A (Figure 4.2a), while mutating (Q,D) to (P,D) in YOL109W leads to a
slowdown of translation (Figure 4.2b). These two mutations serve as positive controls
because the presence of proline is well established to retard translation58,60. Two additional
mutations were incorporated as negative controls and are predicted to cause little change
in the rate of translation (i.e., mutations that switch between gray boxes in Figure 4.1b).
We found that while the normalized ribosome densities of these mutants are statistically
different from that of the wild type (Figure 4.2c), the median effect size on translation speed
was 2.5-fold lower than what we observed for the other 10 mutations (Figure 4.2d). These
results are consistent with the hypothesis that the P-site amino acid can predictably and
significantly alter the translation rate of the A-site codon.
These amino acid mutations also change the identity of the tRNA molecule.
Therefore, the change in tRNA identity could also be the cause of the altered translation
speeds. To bioinformatically estimate the relative contributions of amino acid versus tRNA
identity to changes in speed, we projected the normalized ribosome density to the 42
unique tRNA molecules that can reside in the P- and A-sites. Based on a comparison of
these distributions, we can calculate the contributions of tRNA and amino acids to the
change in translation speed (Eq. C.5, and see Methods in Appendix C). We estimate that
the contribution of these two factors is a continuum for different pairs of amino acids
(Figure 4.3a). For some pairs, such as when (E,N) is mutated to (H,N), we estimate that
the change in speed is driven entirely by the amino acid identity, while other pairs, such
as (S,G) mutated to (E,G), are driven entirely by tRNA identity; many pairs are driven by
Figure 4.3. Depending on the amino acid pair, translation speed is influenced by either
the identity of the tRNA pair, the amino acid pair, or both. (a) A bioinformatic analysis that
models all 7,980 possible mutations to the P-site residue for a given starting amino acid pair
(Eq. C.5 and Methods in Appendix C) estimates that speed changes are caused by a
combination of the change in tRNA identity and amino acid identity. Plotted is the probability
density of the estimated contribution of the change in amino acid identity to the speed change.
For some pairs, upon mutation, the speed change is entirely due to the amino acid change (i.e.,
data near 100%), for some, the speed change is entirely due to the tRNA molecule change (i.e.,
data near 0%), and for others, the speed change is due to a combination of the two changes.
(b) For the same amino acid mutation, two mutants are created using synonymous codons
decoded by different tRNAs. (c-e) To experimentally measure the contribution of amino acid
identity, a given non-synonymous mutation in the P-site was encoded using two different
synonymous codons, each resulting in the same amino acid mutation but decoded by different
tRNA molecules (see (b)). For the mutations YOL109W (G,G)→(S,G) (c) and YOL109W
(Q,D)→(P,D) (d), the change in ribosome density from wild-type was similar for both
synonymous mutants (Mutant 1 vs Mutant 2, Fisher-Pitman permutation test, 𝑝 = 0.1857 and
𝑝 = 0.7714, respectively), and hence, the amino acid is the predominant cause for the change
in translation speed. For the mutation YOL109W (N,R)→(S,R) (e), the speedup was seen for
only one mutant while the other mutant exhibits a normalized ribosome density indistinguishable
from that of the wild-type (Wild type vs. Mutant 2, Fisher-Pitman permutation test, 𝑝 = 0.1857),
indicating in this case that the tRNA identity is the predominant cause for the change in speed
upon mutation.
a mixture of these two factors. To experimentally estimate the relative contribution of
amino acid identity, we took the three mutations that we previously incorporated into the
gene YOL109W (Table C.2) and created a new gene construct with the same three amino
acid mutations but used synonymous codons that are decoded by different tRNA
molecules (Table C.4). For the mutation (N,R) to (S,R), for example, we previously used
the codon UCC to mutate N to S. In the new strain, we used the synonymous codon UCG,
which is decoded by a different tRNA molecule 45 (Figure 4.3b). For the mutants (G,G) to
(S,G) and (Q,D) to (P,D), there was a change in normalized ribosome density that was in
the same direction and similar in magnitude regardless of the tRNA molecule used
(Figures 4.3c, d). This indicates that for these mutations, the change in amino acid identity
in the P-site is the primary cause of the change in translation rate in the A-site (Figures
4.3c, d). In contrast, for the mutation (N,R) to (S,R), we observed a change in ribosome
density when one tRNA molecule was used but no change in ribosome density when
another tRNA molecule was used (Figure 4.3e), indicating that the tRNA identity was the
primary cause. Our bioinformatic analysis predicted that for the mutations (G,G) to (S,G),
and (Q,D) to (P,D), the amino acid change is the major contributor to the observed change
in speed. For the mutation (N,R) to (S,R), involving different tRNAs, we were unable to
make a bioinformatic prediction because the sample size was too small (n < 10) to apply
any statistical test. Thus, the bioinformatic predictions and experimental results are
qualitatively consistent, indicating that in some cases it is the amino acid identity that
causes the change in speed and in others it is the tRNA identity.
If evolutionary selection pressures have acted to encode translation rate
information in the primary structures of proteins, then there should be a non-random
distribution of fast and slow-translating pairs of amino acids across the proteome. To test
this hypothesis, we calculated the enrichment and depletion of all 400 pairs of amino acids
across the S. cerevisiae proteome relative to the occurrence expected from a random
pairing. We selected the top 20% of the amino acid pairs that were enriched across the
proteome and the bottom 20% that were depleted and determined how many of the 84
fast-translating and 73 slow-translating amino acid pairs were present in either of these
quintiles. The odds ratio of fast-translating pairs being enriched across the proteome and
slow-translating pairs being depleted was 7.5 (Eq. C.7, 𝑝 = 0.0011, Fisher’s exact test),
indicating that selection pressures have indeed selected for the presence of fast-
translating pairs and selected against slow-translating pairs (Figure 4.4a) across the
proteome.
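The enrichment analysis above can be sketched in two steps: a fold enrichment of each adjacent amino acid pair relative to random pairing, and an odds ratio with Fisher's exact test on a 2x2 table. The expected frequency under random pairing (product of single-residue frequencies), the layout of the 2x2 table, and all function names are assumptions of this sketch; Eq. C.7 in Appendix C defines the actual odds ratio.

```python
from collections import Counter
from math import comb


def pair_fold_enrichment(proteins):
    """Observed / expected frequency of each adjacent residue pair, where
    the expectation is the product of single-residue frequencies."""
    singles, pairs = Counter(), Counter()
    for seq in proteins:
        singles.update(seq)
        pairs.update(zip(seq, seq[1:]))
    n_single, n_pair = sum(singles.values()), sum(pairs.values())
    return {
        (x, z): (count / n_pair)
        / ((singles[x] / n_single) * (singles[z] / n_single))
        for (x, z), count in pairs.items()
    }


def odds_ratio(a, b, c, d):
    """Odds ratio of the 2x2 table [[a, b], [c, d]], e.g. rows = fast/slow
    pairs, columns = enriched/depleted quintile (assumed layout)."""
    return (a * d) / (b * c)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value: sum of hypergeometric
    probabilities of all tables no more likely than the observed one."""
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(k):
        return comb(row1, k) * comb(row2, col1 - k) / total

    p_observed = prob(a)
    k_min, k_max = max(0, col1 - row2), min(row1, col1)
    return sum(
        p
        for p in (prob(k) for k in range(k_min, k_max + 1))
        if p <= p_observed * (1 + 1e-9)
    )
```

A fold enrichment above 1 indicates that a pair occurs more often than expected from its constituent residue frequencies, and a value below 1 indicates depletion.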
[Figure 4.4, panels a-c. (a) Percent change in median ρ (slowdown or speedup of translation) versus fold enrichment of amino acid pair relative to random chance (depletion or enrichment), for insignificant, fast and slow pairs. (b) Percent change in P in linker (L) versus domain (D) regions as a function of the window size of the linker region, for slow and fast pairs. (c) Schematic of the domain, exit tunnel and linker, with window sizes of 10, 20 and 30.]
Despite the broad preference for fast-translating pairs, we found that the slow-translating
amino acid pairs were locally enriched by 15% (95% CI: [6%, 23%], p<0.0001, n=170 domains,
random permutation test) in linker regions relative to domain regions of S. cerevisiae cytosolic
proteins (Figures 4.4b,c). In these linker regions, which start 30 residues downstream of domain
boundaries, 21% of the amino acid pairs, on average, are slow-translating pairs, and in the
extreme cases of genes YDR432W and YGL203C, there are 18 slow pairs in a 30 residue stretch.
Codon usage does not explain this enrichment of slow-translating pairs, as we found no difference
in the frequency of non-optimal codon usage between linker and domain regions (Figure C.5).
These results indicate that a number of slow-translating pairs exist in linker regions that can
cumulatively lead to a slowdown of translation as domains emerge from the ribosome exit tunnel,
which may aid in co-translational folding.
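The linker-versus-domain comparison can be sketched as a percent change in the fraction of slow pairs together with a random permutation test. The permutation scheme (shuffling position labels between the two regions), the use of the difference in fractions as the test statistic, and the function names are assumptions of this sketch and may differ from the procedure in Appendix C.

```python
import random


def percent_change(f_linker, f_domain):
    """Percent change of the linker fraction relative to the domain fraction."""
    return 100.0 * (f_linker - f_domain) / f_domain


def permutation_p_value(linker_flags, domain_flags, n_perm=10_000, seed=0):
    """One-sided permutation p-value for enrichment in linker regions.

    linker_flags, domain_flags: lists of 0/1 values, 1 where the pair at
    that position is a slow-translating pair (hypothetical encoding).
    """
    def fraction(flags):
        return sum(flags) / len(flags)

    observed = fraction(linker_flags) - fraction(domain_flags)
    pooled = list(linker_flags) + list(domain_flags)
    n_linker = len(linker_flags)
    rng = random.Random(seed)  # fixed seed for reproducibility
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = fraction(pooled[:n_linker]) - fraction(pooled[n_linker:])
        if diff >= observed - 1e-12:
            hits += 1
    # Add-one correction so the p-value is never exactly zero.
    return (hits + 1) / (n_perm + 1)
```

The difference in fractions has the same sign as the percent change and avoids division by zero when a shuffled domain group happens to contain no slow pairs.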
When the Hsp70 chaperone Ssb is bound to ribosome-nascent chain complexes,
translation is faster than when Ssb is not bound, possibly because chaperone binding prevents
nascent chain folding and hence allows translation to become uncoupled from folding and to
proceed faster159. We examined if the fast-translating amino acid pairs we identified contributed
to this speedup. We found that the fast-translating amino acid pairs were enriched by at least 4%
(95% CI: [2.3%, 6.1%], p=0.0001, n=425, random permutation test) in regions translated while
Ssb is bound, suggesting that these pairs do make a contribution (Figure C.6). Taken together,
these results indicate that across the primary structures of proteins, evolutionary pressures have
Figure 4.4. Evolution selects for fast-translating pairs across the proteome but enriches slow-
translating pairs across interdomain linker regions. (a) The enrichment and depletion of amino acid
pairs across the S. cerevisiae proteome is plotted against the percent change in median normalized
ribosome densities (ρ) of amino acid pairs taken from Figure 4.1b. Among the top 20% enriched and top
20% depleted set, the odds ratio of fast-translating pairs being enriched and slow-translating pairs being
depleted is 7.5 (Eq. C.7, Fisher’s exact test, 𝑝 = 0.0011). (b) The enrichment of fast- and slow-translating
pairs in linker (L) regions relative to domain (D) regions. The percent change is calculated as
$\frac{f(X,L) - f(X,D)}{f(X,D)} \times 100\%$, where 𝑓(𝑋, 𝐿) is the fraction of either slow or fast pairs in the linker region, and 𝑓(𝑋, 𝐷) is the fraction
of either slow or fast pairs in the domain regions. A positive percent change would indicate an enrichment
in the linker region, while a negative value would indicate a depletion. As a test of robustness, 𝑓(𝑋, 𝐿) was
computed over different window sizes in the linker region, discarding the first 30 residues after the domain
to account for those residues being in the ribosome exit tunnel, as illustrated in panel (c). 𝑛 = 170 for a
window size of 30 residues. For all window sizes, the percent change was significant (𝑝 < 0.05) for slow-
translating pairs and insignificant (𝑝 > 0.05) for fast-translating pairs. p-values were computed using the
random permutation test, and error bars in (b) represent 95% CI calculated from bootstrapping.
selected for amino acid pairs that exhibit faster translation, and along a transcript, fast- and slow-
translating pairs are enriched locally in regions that are associated with co-translational folding
and chaperone binding.
We have demonstrated that the chemical identity of pairs of amino acids and tRNA
molecules present in the P- and A-sites can predictably and causally accelerate or retard
translation in the A-site of S. cerevisiae ribosomes. Two essential and unique features of our
analyses of ribosome profiling data are Eqs. C.2 and C.5. Eq. C.2 holds a fixed amino acid identity
in the A-site while varying the amino acid in the P-site. This feature keeps the cognate tRNA
concentration and the accommodation time into the A-site constant, thereby allowing us to isolate
the effect of the P-site amino acid on translation rates in the A-site. On the other hand, Eq. C.5
allows us to distinguish between the relative contributions of the change in amino acid and tRNA
identity when a mutation is made in the P-site. Confounding molecular factors, such as tripeptide
sequence motifs and many others, that can also affect translation speed do not explain our results
(Figure C.2). For example, a recent study160 observed that 17 pairs of very rare codons, when
present in the P- and A-sites, inhibit translation due to the interaction of their predominantly
inosine-modified, wobble-decoding tRNAs. To test whether this mechanism explains our slow-
translating pairs of amino acids, we removed all instances of codon pairs decoded by wobble
base pairing and found that 72 out of 73 of our slow-translating pairs remained slow (Figure C.7).
Thus, wobble base pairing does not explain our observations. In total, we identified 157 amino
acid pairs that could change the translation speed and verified these predictions experimentally
for 10 of these pairs. A surprising result is the large number of amino acids, including glycine and
aspartic acid, that can retard translation when present in the P-site. The molecular mechanism by
which these amino acids retard translation is not known, but we hypothesize this may be due to
a much slower step of peptide bond formation, as has been observed biochemically with proline.
Pairs whose effect arises primarily due to tRNA identity could influence a large number of different
steps during the translation-elongation cycle, including hybrid state formation and translocation.
Conversely, there are few reports in the literature on molecular factors whose presence
accelerates translation; however, we have identified 84 putative pairs that have this effect and
experimentally verified five of these pairs (Figures 4.1b and 4.2b). Determination of the molecular
cause of this speedup would be a fruitful area of future research. Evolutionary selection pressures
select only against phenotypic traits, not genotype. Therefore, the enrichment of fast-translating
amino acid pairs across the S. cerevisiae transcriptome and the clusters of slow- and fast-
translating pairs along transcripts that are correlated with co-translational processes suggest that
the elongation kinetics encoded by these pairs influence organismal phenotype and fitness. These
results suggest the surprising possibility that there exist disease-causing amino acid mutations
that do not alter the final folded structures of proteins but instead alter the co-translational behavior
and processing of the nascent proteins via altered elongation kinetics. In summary, elongation
kinetics are causally and predictably encoded in protein primary structures through pairs of amino
acids, with broad implications for protein sequence evolution, translational control of gene
expression, and disease.
Chapter 5
EVOLUTIONARILY-ENCODED TRANSLATION KINETICS COORDINATE CO-
TRANSLATIONAL SSB CHAPERONE BINDING IN YEAST
The research presented in this chapter was first published as part of a large-scale
study in Cell titled “Profiling Ssb-Nascent Chain Interactions Reveals Principles of Hsp70-
Assisted Folding” by Kristina Döring, Nabeel Ahmed, Trine Riemer, Harsha Garadi
Suresh, Yevhen Vainshtein, Markus Habich, Jan Riemer, Matthias P. Mayer, Edward P.
O’Brien, Günter Kramer and Bernd Bukau.
Only the research contributed by Nabeel Ahmed is discussed in this chapter. It
was also presented at the 62nd Annual Meeting of the Biophysical Society, and the abstract
was published in Biophysical Journal under the title “Evolutionarily-Encoded Translation Kinetics
Coordinate Co-Translational SSB Chaperone Binding in Yeast” by Nabeel Ahmed, Kristina
Döring, Günter Kramer, Bernd Bukau and Edward P. O’Brien.
Portions of the text of this chapter are reproduced from the above publications
with permission from Cell Press under the journal publishing agreement, which allows
authors to use the publication for inclusion in a thesis or dissertation.
5.1 Abstract
Chaperones can bind to ribosomes and nascent polypeptides co-translationally to
assist protein folding. It was not previously known whether there is any interplay between
chaperone binding and translation kinetics. To study this interplay, we utilize high-throughput,
transcriptome-wide quantitative data on translation kinetics and chaperone
binding from ribosome profiling and selective ribosome profiling, respectively. In
vivo selective ribosome profiling has shown that the yeast Hsp70 chaperone Ssb associates
broadly with a major fraction of nascent proteins. Using the ribosome footprint densities
as a measure of local translation rate, we find that mRNA segments within the ribosome
are translated faster during periods of Ssb binding to the nascent polypeptides compared
to when Ssb is not bound. The acceleration of translation is maintained even when Ssb is
knocked out, implying that faster translation is inherently encoded within the mRNA
sequence. Testing for mRNA features that can slow translation, we find that mRNA
segments translated by Ssb-engaged ribosomes are enriched for fast-translated codons
(having higher cognate tRNA concentrations), are depleted for slowly translated codons,
and contain fewer proline codons. In addition, mRNA segments located 1-15 nucleotides
downstream of Ssb-bound ribosomes have reduced mRNA secondary structure. Finally,
nascent chain segments located in the ribosome tunnel of Ssb-bound ribosomes have
average numbers of positively charged residues but are enriched in negatively charged
residues. Taken together, these evolutionarily-encoded mRNA and nascent chain features
cause faster translation during Ssb binding. This finding is significant as any alteration of
translation kinetics due to synonymous mutations can potentially disrupt efficient binding
of Ssb leading to possible misfolding of the protein without any change in its sequence.
5.2 Introduction
Proteins attain their correctly folded conformation and localization through a complex set of
processes that are prone to errors, are sensitive to perturbations, and necessitate
coordination between mechanisms161,162. Networks of chaperones have evolved in
eukaryotic cells to engage the nascent polypeptide chain appearing outside the ribosome
exit tunnel163. This engagement is important as the nascent chain must be assisted to
avoid any non-native interactions that can result in the protein being misfolded73.
Chaperones also assist the nascent chain to attain its functional form72,164 or to coordinate
with targeting factors for efficient translocation to other organelles165. This engagement of
chaperones occurs as the mRNA transcript is being translated sequentially within the
ribosome and the synthesized polypeptide is exiting the ribosome. Since these two
processes occur on similar time scales, it is likely that there is some
coordination between them to achieve successful binding of chaperones along with
efficient translation elongation. Evidence of coordination between a co-translational
process and translation kinetics is seen for co-translational folding, where a
slowdown of translation in interdomain linker regions can facilitate co-translational
folding of the domain that has exited the ribosome81,166.
Ssb is an Hsp70 chaperone that has been found to bind ribosomes to facilitate early
folding167. Ssb acts in coordination with a network of chaperones including the nascent chain
associated complex (NAC) and the two proteins Ssz1 and the J-protein Zuo1, which
constitute the ribosome-associated complex (RAC)168. Recent studies have characterized
the binding patterns of Ssb to show that it prefers to engage with nascent chains with
domains enriched in aggregation-prone, hydrophobic and intrinsically disordered
regions169. In S. cerevisiae, two isoforms of Ssb are present: Ssb1 and Ssb2 that differ by
only four amino acids. Deletion of both Ssb isoforms results in defects in ribosome
biogenesis, sensitivity against cold, slow growth as well as reduced efficiency of protein
folding164. Hence Ssb is an important chaperone required for generation of a functional
proteome.
With the advent of next-generation sequencing technology and the development of
methods like Ribo-Seq, it has become possible to capture translation at the transcriptome level,
which is referred to as the ‘translatome’. Characterizing the translatome allows us to
determine properties of translation kinetics and understand the molecular origin of variable
rates of translation elongation. Combining the power of Ribo-Seq with biochemical
approaches of crosslinking and immunoprecipitation, a subset of ribosomes can be
isolated that associates with a specific factor of interest. Selective Ribo-Seq is a variant of
Ribo-Seq that implements this approach and the data obtained can be modeled to extract
the profile of binding of a factor to the ribosome-nascent chain complex. In this study, we
use selective Ribo-Seq for Ssb-bound ribosomes to test whether there is any coordination
of binding of Ssb with translation elongation kinetics measured by ribosome profiling.
5.3 Results
5.3.1 Selective Profiling of Ssb-Bound Ribosomes
Selective Ribo-Seq is a variant of Ribo-Seq where a subset of ribosomes is isolated that
are associated with a specific factor/chaperone at the time of translation arrest34,89. This is
achieved by cross-linking the specific factor of interest with the ribosome-nascent chain
(RNC) complex. The RNCs cross-linked with the factor are isolated from other
monosomes by immunoprecipitation. The ribosome footprints obtained from this subset of
ribosomes are those undergoing translation when the factor of interest is bound to the
RNC. At the transcript level, the number of reads mapping to the transcript can lead to
identification of protein substrates that are bound by the factor during their synthesis. The
reads mapped to individual codon positions can be modeled to extract a profile of binding
state of the factor across different regions of the mRNA transcript. However, the number
of reads obtained is a function of the translation rate as well as of chaperone binding.
In a profile of raw reads, a fast-translating chaperone-bound region may go
undetected next to a slow-translating chaperone-bound region, which has a higher
ribosome density owing to translation kinetics alone. To accurately determine the chaperone
binding profile, the number of reads at each codon in the Sel-Ribo-Seq dataset is
normalized by the number of reads at that codon position from the Ribo-Seq dataset. This
ratio is termed the fold enrichment (FE), which indicates whether the codon is being translated
during periods of chaperone binding outside the ribosome’s exit tunnel. The theoretical
basis of the FE measure is demonstrated by a derivation of equations based on biophysical
principles (Appendix D).
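As a rough illustration of the normalization described above, the following Python sketch computes a per-codon FE profile from two read-count vectors. The function name, the example counts, and the choice to return None for codons with zero Ribo-Seq reads are assumptions made for illustration only; the derivation itself is in Appendix D, and any library-size scaling applied in the actual analysis is not shown.

```python
def fold_enrichment(sel_reads, ribo_reads):
    """Per-codon fold enrichment: Sel-Ribo-Seq reads / Ribo-Seq reads.

    Codons with zero Ribo-Seq reads are returned as None (an assumption;
    the high-coverage filter used in this chapter avoids such codons).
    """
    if len(sel_reads) != len(ribo_reads):
        raise ValueError("profiles must cover the same number of codons")
    profile = []
    for sel, ribo in zip(sel_reads, ribo_reads):
        profile.append(sel / ribo if ribo > 0 else None)
    return profile


# Hypothetical read counts for five codons of one transcript.
sel = [10, 40, 6, 30, 0]
ribo = [10, 20, 12, 10, 5]
fe = fold_enrichment(sel, ribo)
print(fe)
```

In this toy profile the fourth codon has fewer selective reads than the second (30 versus 40) but a higher FE, because its Ribo-Seq density is lower; this is the situation in which raw reads alone would understate binding in a fast-translated region.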
Selective Ribo-Seq is applied on S. cerevisiae with Ssb as the chaperone of
interest. The set of ribosome-protected mRNA footprints obtained from Ribo-Seq is termed
the translatome while those from selective Ribo-Seq are termed Ssb-bound translatome.
It should be noted that Ssb-bound translatome represents the mRNA fragments x that are
being translated at the ribosome’s peptidyl transferase center (PTC) while Ssb is
physically interacting with the part of nascent chain that is n amino acids upstream of x
and has already been translated and has emerged from the exit tunnel (Figure 5.1).
With the availability of a translatome-wide Ssb-binding profile, molecular principles
of Ssb binding were ascertained by co-authors of the publication159 that includes the study
discussed in this chapter. A few important inferences are: i) ~72% of detected proteins are
identified as Ssb substrates, including proteins from all major compartments, ii) Ssb
preferentially binds positively charged sequences closer to the ribosomal surface and iii)
the activity of Ssb was found to depend on RAC but not NAC.
5.3.2 Coordination of Ssb Binding with Translation Elongation Rates
Translatome and Ssb-bound translatome represent the translation kinetics profile and the
Ssb binding profile respectively. The analysis for correlating translation rates with Ssb
binding profile is restricted to high coverage transcripts with at least one read per codon
Figure 5.1. Schematic representing the ribosome footprint x obtained from selective Ribo-
Seq when Ssb is bound to the region of nascent chain n amino acids upstream of x.
for statistical robustness (Figure 5.3A). In a meta-analysis in which all Ssb-bound translated
regions are normalized to a constant-length window and aligned, the ribosome occupancy
dips as the aligned profile shifts from the Ssb-unbound to the Ssb-bound region and
increases again once the Ssb-bound region has been passed (Figure 5.2A). The dip in ribosome
occupancy represents the speedup of translation that occurs during the period of Ssb
binding. For a more comprehensive analysis, thresholds are used to define Ssb-bound
and unbound segments (See Methods for details). These thresholds are based on the
percentiles from the Cumulative Distribution Function of the FE measure (Figures 5.3B, C
and D). The Ssb-bound and unbound mRNA segments are compared in independent
Ribo-Seq datasets analyzing translation in wild-type cells. mRNA segments translated by
Ssb-bound ribosomes are generally translated faster than Ssb-unbound ribosomes
(Figure 5.2B). The difference in observed local translation speed in wild-type (WT) cells
varies from 10% to 38% (Figure 5.2B) depending on the stringency of the selection criteria used
to define bound and unbound segments (Figure 5.3D).
The acceleration of translation during periods of Ssb binding can be a result of two
non-exclusive mechanisms. The first mechanism is that binding of Ssb triggers speed up
of translation. The second is that intrinsic features of the mRNA or the nascent chain
accelerate translation. The hypothesis of the first mechanism is tested by analyzing
relative elongation speed in translatomes of ssb1Δssb2Δ cells. It is observed that
accelerated translation is maintained even in the absence of Ssb (Figure 5.2B) although
to a slightly reduced extent (Figure 5.3E). This effect is uniformly observed for all
thresholds used to identify mRNA segments translated by Ssb-bound or unbound
ribosomes and the Ssb contribution to the translation speed-up is limited (up to 15%;
Figure 5.2C). In Chapter 3, I have demonstrated that translation rate of a codon can
depend on its cognate tRNA concentration, presence of proline residues in the P-site as
well as presence of downstream secondary structure. Since the accelerated translation is
encoded by intrinsic features of mRNA and nascent chain, it can be hypothesized that
these factors may play a role in facilitating faster translation. Indeed, it is found that Ssb-
bound translated segments are enriched in fast-translated codons and depleted in slow-
translated codons (Figure 5.2D). Features that slow translation like mRNA secondary
structure are found to be depleted 1-15 nt downstream of Ssb-bound translated segments
(Figure 5.2E). Similarly, proline residues are depleted thus avoiding slowdown of
translation (Figure 5.3F). Nascent chain features like positively charged residues can
interact with negatively charged tunnel and pause translation. It is observed that Ssb-
Figure 5.2. Altered Translation Kinetics of Ssb-Bound Ribosomes (A) Average ribosome
densities in translatomes and Ssb-bound translatomes related to Ssb binding. 95% CI is
shaded. (B) Change in translation speed for Ssb-bound and unbound ribosomes in WT and
ssb1Δssb2Δ translatomes (Wilcoxon rank-sum test, p < 0.0001 for all thresholds). Error bars
show 95% CI. (C) Contribution of Ssb binding and mRNA features to faster translation. Error
bars show 95% CI. (D) Enrichment of fast codons and depletion of slow codons in bound
versus unbound segments for indicated thresholds (Wilcoxon signed-rank test, p < 0.05 for
[P95, P5], p < 0.0001 for other thresholds). Error bars show 95% CI. (E) Change in DMS
reactivity of bound versus unbound segments reflecting the probability of secondary structure
formation with the indicated offsets from each nucleotide. Differences are significant up to an
offset of 15 nt (paired t test, p < 0.0001).
Figure 5.3. Identifying Ssb-Bound mRNA Segments and the Molecular Origins of Translation
Acceleration (A) Overlap of high coverage genes with Ssb bound mRNA segments in Ssb1 WT
translatome in two biological replicates. (B) Probability distribution of Fold Enrichment (FE) values
across the high coverage gene set. (C) Cumulative Distribution Function (CDF) of FE values.
Percentiles were used to define the stringency thresholds to identify Ssb bound and unbound
periods. P5, P50 and P95 are shown in red. (D) An example of a Ssb binding profile for the gene
CYS3 using different thresholds for the bound and unbound segment definition. The first 120 nt
and last 60 nt are excluded from the analysis as well as nucleotide positions exhibiting fold
enrichment values between the thresholds (white). Ssb bound (green) and unbound (red) segments
are either defined using the P50/P50 threshold (upper panel) or the P95/P5 threshold resulting in fewer
nucleotide positions (i.e., more white space) that are used for the statistical analysis (lower panel).
(E) The differences in speed-up between WT and ssb1Δssb2Δ translatomes. (Wilcoxon signed-
rank test, p < 0.0001 for all thresholds). Error bars show 95% CI. (F) Depletion of proline residues
in Ssb bound compared to unbound segments for the indicated thresholds. (Wilcoxon signed-rank
test, p < 0.0001 for all thresholds). Error bars show 95% CI. (G) Percent change of probability of
finding positively and negatively charged residues in upstream regions of Ssb bound compared to
upstream regions of unbound segments for the indicated thresholds. (Wilcoxon signed-rank test, p
< 0.0001 for negatively charged residues and p > 0.05 for positively charged residues for all
thresholds). Error bars show 95% CI. (H) Change in translation speed for Ssb bound and unbound
ribosomes in publicly available RP datasets. GEO: GSE63789 (Pop), GSE69414 (Young),
GSE61011 (Williams), GSE61012 (Jan), GSE52968 (Guydosh), GSE75322 (Nissley), GSE67387
(Nedialkova), GSE51164 (Gardin), GSE53268 (Weinberg). Error bars show 95% CI. The samples
from these datasets were chosen based on the criteria that they do not use CHX for pretreatment
and hence capture in vivo translation dynamics reliably.
bound translated segments contain average numbers of positively charged residues but
are enriched in negatively charged residues (Figure 5.3G). The distribution of these
features indicates that evolution encodes the faster translation within the mRNA transcript
and nascent chain such that it is coordinated with the binding of Ssb.
5.4 Discussion
The results presented in this study demonstrate for the first time that translation kinetics
are evolutionarily encoded to coincide with the binding of Hsp70 chaperone Ssb. The
analysis for testing this coordination has been possible due to development of methods
like Ribo-Seq and selective Ribo-Seq that can capture the translatome as well as the
factor-bound translatome. In this study, selective Ribo-Seq was used to characterize the
binding profile of Hsp70 chaperone Ssb and correlate the binding profile with translation
elongation kinetics to demonstrate that Ssb binding coincides with faster translation of
mRNA within the ribosome. As a test of robustness, this analysis was carried out in nine
other published Ribo-Seq datasets from S. cerevisiae that did not use cycloheximide
(CHX) as a pretreatment (Figure 5.3H). It was found that seven out of nine of the datasets
exhibit this speedup with Ssb-bound regions (the origin of the inconsistency with two of
the datasets is unknown). The evolutionary encoding of kinetics is through the distribution
of molecular factors that influence translation rates across the transcript such that they
create periods of faster translation during periods of Ssb binding. These factors were
correlated with translation kinetics and discussed extensively in Chapters 3 and 4. The
results of this study indicate that these molecular factors are playing an important role in
a co-translational process relevant for creating a functional proteome.
During stress conditions, it has been shown that the interaction of Hsp70
chaperones with RNCs is altered and it subsequently results in pausing of translation
elongation170,171. Deletion of Hsp70 chaperones also resulted in elongation pause even in
absence of heat stress indicating that inhibition of Hsp70 activity is a mechanism of stress
response to induce a global pause of translation170. However, the results demonstrated in
this study are distinct from these findings, since the faster translation is encoded within the
features of the mRNA and nascent chain rather than being an effect of chaperone binding. This is
demonstrated by the conservation of faster translation upon deletion of the Ssb isoforms, with no
induced ribosome stalling. What, then, is the biological reason for evolving faster
translation kinetics upon chaperone binding? Engagement of chaperones like Ssb can
help the nascent chain avoid misfolded intermediates. Hence these are periods of protein
synthesis that can proceed at a faster pace without regard for the co-translational folding
of the nascent chain. We speculate that the acceleration of translation serves to increase the
efficiency of protein production during periods of Ssb binding, when the chaperone Ssb
prevents the nascent chain from acquiring non-canonical conformations. Evolution has
selected for the faster translation to coincide with Ssb binding and potentially optimize the
efficiency of protein production in the cell.
5.5 Methods
5.5.1 Translation kinetics analysis
High coverage genes were obtained from the list of substrates as those that have greater than zero
reads at every codon position along the transcript. The first 40 codons as well as the last 20
codons were excluded from the subsequent analyses of these transcripts since these
regions can be influenced by initiation and termination, respectively. Ssb bound and
unbound segments were initially defined using the peak detection algorithm in which a
region is defined as Ssb bound if its Fold Enrichment (FE) value is greater than 1.5 over
a stretch of at least 15 nt. To study the effect of translation rate at the extremities of Ssb-
binding probabilities, varying stringency thresholds were set to define the Ssb bound and
unbound segments. These thresholds are defined by the percentiles from the Cumulative
Distribution Function of FE values (Figures 5.3B and C). Setting an initial threshold of P50,
every nucleotide position with an FE value higher than P50 was classified as Ssb bound
and every nucleotide position with an FE value lower than the P50 threshold was classified
as Ssb unbound (Figure 5.3D). For all other pairs of thresholds, e.g., (P95, P5), all positions
with FE values higher than the upper threshold (e.g., P95) were classified as Ssb bound
while all values below the lower threshold (e.g., P5) were classified as Ssb unbound. The
other positions with FE values between the thresholds were excluded from the analysis.
Ssb bound and unbound segments were defined in the Ssb1-GFP strain background.
These regions were then used to perform the relative translation speed analysis in
independent translatomes (WT and ssb1Δssb2Δ).
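The percentile-based classification described above can be sketched in Python as follows. This is a minimal illustration, not the code used for the analysis: the nearest-rank percentile convention, the function names and the toy FE values are all assumptions, and the pooled FE distribution is replaced by a made-up list.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile (p between 0 and 100); one of several conventions."""
    ordered = sorted(values)
    rank = max(1, math.ceil(len(ordered) * p / 100))
    return ordered[rank - 1]


def classify_positions(fe_profile, upper, lower, skip_start=120, skip_end=60):
    """Label nucleotide positions 'B' (bound), 'UB' (unbound) or None (excluded).

    skip_start and skip_end default to 120 nt and 60 nt, i.e. the first 40
    and last 20 codons that are excluded from the analysis.
    """
    n = len(fe_profile)
    labels = [None] * n
    for i in range(skip_start, n - skip_end):
        if fe_profile[i] > upper:
            labels[i] = "B"
        elif fe_profile[i] < lower:
            labels[i] = "UB"
    return labels


# Hypothetical pooled FE values standing in for the high-coverage gene set.
pooled_fe = list(range(1, 101))
p95 = percentile(pooled_fe, 95)
p5 = percentile(pooled_fe, 5)

# Toy profile; the skips are shortened to 1 nt so the example stays small.
profile = [100, 2, 96, 50, 3, 99, 1, 97]
labels = classify_positions(profile, upper=p95, lower=p5, skip_start=1, skip_end=1)
print(p95, p5, labels)
```

Passing the same value for upper and lower reproduces the P50/P50 case, in which only positions exactly equal to the threshold are left unclassified.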
5.5.2 Speed-up of translation
The translation rate was calculated as the inverse of the average number of ribosome
reads per nucleotide, and this rate was computed for the Ssb bound and unbound segments.
To control for expression level differences across the genes, the percent
change in translation rate was calculated for each gene separately using the equation
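\[ \%\ \text{change} = \frac{\langle R_{B} \rangle^{-1} - \langle R_{UB} \rangle^{-1}}{\langle R_{UB} \rangle^{-1}} \times 100\% \]
where \(\langle R_{B} \rangle\) and \(\langle R_{UB} \rangle\) are the average number
of reads per codon in the Ssb bound (B) and unbound (UB) segments. The statistical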
significance of the speed-up across the gene dataset was calculated using the Wilcoxon
rank-sum test. Error bars in the associated plots are 95% CI about the median calculated
using the Bootstrapping method (Figure 5.2B).
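The per-gene percent change and a bootstrap confidence interval about the median can be sketched in Python as follows. This is an illustrative sketch under stated assumptions: the percentile bootstrap, the number of resamples, the fixed seed and the example values are not taken from the original analysis.

```python
import random
import statistics


def percent_change(mean_reads_bound, mean_reads_unbound):
    """Percent change in translation rate, with rate = 1 / mean read density."""
    rate_b = 1.0 / mean_reads_bound
    rate_ub = 1.0 / mean_reads_unbound
    return (rate_b - rate_ub) / rate_ub * 100.0


def bootstrap_median_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI about the median (an assumed bootstrap variant)."""
    rng = random.Random(seed)
    medians = sorted(
        statistics.median(rng.choices(values, k=len(values)))
        for _ in range(n_resamples)
    )
    cut = int(n_resamples * alpha / 2)
    return medians[cut], medians[n_resamples - 1 - cut]


# Hypothetical gene: 8 reads per codon when bound, 10 when unbound.
example = percent_change(8, 10)

# Hypothetical per-gene percent changes for a small gene set.
per_gene = [10.0, 12.0, 15.0, 20.0, 22.0, 25.0, 30.0, 38.0]
ci_low, ci_high = bootstrap_median_ci(per_gene)
print(round(example, 6), ci_low, ci_high)
```

Because the rate is the inverse of the read density, the expression simplifies to (R_UB / R_B − 1) × 100%, so a gene with lower read density in its bound segments yields a positive percent change.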
5.5.3 Contribution of mRNA versus Ssb binding
For every gene in our dataset, we use the percent change calculation described above
to estimate the contribution of mRNA and Ssb binding to the translation speed-up using
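the equations
\[ \%\ \text{contribution of mRNA} = \frac{\%\ \text{change}_{\Delta Ssb}}{\%\ \text{change}_{WT}} \times 100\% \]
and
\[ \%\ \text{contribution of Ssb binding} = 100 - \%\ \text{contribution of mRNA}, \]
where \(\%\ \text{change}_{\Delta Ssb}\) and \(\%\ \text{change}_{WT}\) correspond to the % change in the ssb1Δssb2Δ and WT cells, respectively.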
The error bars in the associated plots are 95% CI about the median calculated using the
Bootstrapping method (Figure 5.2C).
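As a small worked example of this partition, the Python sketch below applies the two expressions to one hypothetical gene. The function name and the input values are assumptions chosen only to show the arithmetic; they are not results from this study.

```python
def contributions(pct_change_dssb, pct_change_wt):
    """Split the WT speed-up into mRNA-encoded and Ssb-dependent parts (in %)."""
    if pct_change_wt == 0:
        raise ValueError("WT percent change must be non-zero")
    mrna = pct_change_dssb / pct_change_wt * 100.0
    return mrna, 100.0 - mrna


# Hypothetical gene: 20% speed-up in WT, 17% in ssb1Δssb2Δ cells.
mrna_part, ssb_part = contributions(17.0, 20.0)
print(round(mrna_part, 6), round(ssb_part, 6))
```

Here most of the speed-up persists without Ssb, so most of it is assigned to the mRNA and nascent chain features. Note that the ratio is unstable for genes whose WT percent change is close to zero.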
5.5.4 Enrichment/Depletion of Fast/Slow codons
The 61 sense codons were classified as being either Fast or Slow translating based on
the local tAI values reported in Tuller et al.172. The 31 codons with the highest tAI values
were classified as ‘Fast’ and the remaining 30 codons as ‘Slow’. The probabilities of finding
Fast and Slow codons in the B and UB segments were then calculated, and the percent
change in these values between the segments was computed. The statistical significance of
this difference was computed using the paired Permutation test120. The 95% CIs for the percent
change in probability were calculated using Bootstrapping. The enrichment/depletion of
proline residues was determined in the same manner.
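The Fast/Slow classification and the enrichment calculation can be sketched in Python as follows. The tAI values below are invented placeholders for four codons rather than the values from Tuller et al., the function names are hypothetical, and expressing the percent change relative to the unbound segments is an assumption about how the comparison was normalized.

```python
def split_fast_slow(tai_by_codon, n_fast):
    """Rank codons by tAI and return (fast_set, slow_set)."""
    ranked = sorted(tai_by_codon, key=tai_by_codon.get, reverse=True)
    return set(ranked[:n_fast]), set(ranked[n_fast:])


def fraction_in(codons, codon_set):
    """Probability that a codon drawn from `codons` belongs to `codon_set`."""
    return sum(c in codon_set for c in codons) / len(codons)


def percent_change_in_probability(bound_codons, unbound_codons, codon_set):
    """Percent change of the probability in bound relative to unbound segments."""
    p_b = fraction_in(bound_codons, codon_set)
    p_ub = fraction_in(unbound_codons, codon_set)
    return (p_b - p_ub) / p_ub * 100.0


# Placeholder tAI values (hypothetical); the real analysis ranks all 61
# sense codons and takes the top 31 as Fast.
toy_tai = {"GCT": 0.9, "GGT": 0.8, "CCG": 0.2, "CGA": 0.1}
fast, slow = split_fast_slow(toy_tai, n_fast=2)

bound = ["GCT", "GGT", "GCT", "CGA"]
unbound = ["GCT", "CGA", "CCG", "CGA"]
fast_change = percent_change_in_probability(bound, unbound, fast)
slow_change = percent_change_in_probability(bound, unbound, slow)
print(fast_change, round(slow_change, 2))
```

The same two helper functions apply to proline by passing the set of proline codons instead of the Fast or Slow set.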
5.5.5 Upstream charged residues
To test for enrichment/depletion of charged residues in the exit tunnel, we defined a 30-
residue window upstream of the Ssb bound and unbound segments along with the region
itself. The probabilities of finding a positively charged residue (K, R, H) and a negatively
charged residue (D, E) were compared between the defined upstream regions of Ssb
bound and unbound segments. We find that the results do not change even if overlapping
upstream positions of the Ssb bound and unbound segments are excluded from this
analysis.
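A minimal Python sketch of the window definition and the residue counting is given below. It assumes the standard positively charged residues K, R and H and the negatively charged residues D and E, 0-based segment coordinates with an exclusive end, and a toy protein sequence; the function names are hypothetical.

```python
POSITIVE = set("KRH")  # assumed standard positively charged residues
NEGATIVE = set("DE")


def upstream_window(protein, seg_start, seg_end, width=30):
    """Residues from `width` positions upstream of a segment through its end.

    seg_start and seg_end are 0-based with seg_end exclusive (an assumption).
    """
    return protein[max(0, seg_start - width):seg_end]


def charged_fractions(window):
    """Return (fraction positive, fraction negative) for a residue window."""
    pos = sum(r in POSITIVE for r in window) / len(window)
    neg = sum(r in NEGATIVE for r in window) / len(window)
    return pos, neg


# Toy protein and segment; the window is shortened to 3 residues for display.
protein = "MKDEAAKRHL"
window = upstream_window(protein, seg_start=5, seg_end=8, width=3)
pos_frac, neg_frac = charged_fractions(window)
print(window, round(pos_frac, 3), round(neg_frac, 3))
```

Computing these fractions for the windows of bound and unbound segments, and then taking their percent change, gives the quantity plotted in Figure 5.3G.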
5.5.6 Downstream mRNA secondary structure
In vivo mRNA secondary structure information for all yeast genes was taken from Rouskin
et al.138. ‘A’ and ‘C’ bases react with DMS if they are not base-paired into the mRNA’s
secondary structure. Hence, DMS reactivity is inversely related to the probability of
the nucleotide position forming secondary structure. DMS reactivities of ‘A’ and ‘C’
nucleotides within the Ssb bound and the unbound segments were compared as a function
of nucleotide offset downstream of each nucleotide position. The significance of the
change in DMS reactivity was assessed using the paired t test.
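The offset-based comparison can be sketched in Python as follows. This is an illustrative sketch only: the sequence, the DMS reactivities and the position lists are invented, the function name is hypothetical, and the paired t test applied to the resulting values is not reproduced here.

```python
def mean_downstream_reactivity(sequence, dms, positions, offset):
    """Mean DMS reactivity `offset` nt downstream of the given positions.

    Only A and C bases are used, since these are the bases DMS reports on.
    Returns None when no usable base is found.
    """
    values = []
    for pos in positions:
        target = pos + offset
        if target < len(sequence) and sequence[target] in "AC":
            values.append(dms[target])
    return sum(values) / len(values) if values else None


# Invented 10-nt transcript with one reactivity value per nucleotide.
sequence = "ACGUACGUAC"
dms = [0.9, 0.8, 0.0, 0.0, 0.2, 0.1, 0.0, 0.0, 0.7, 0.6]

bound_positions = [4, 5]
unbound_positions = [0, 1]
bound_mean = mean_downstream_reactivity(sequence, dms, bound_positions, offset=4)
unbound_mean = mean_downstream_reactivity(sequence, dms, unbound_positions, offset=4)
print(bound_mean > unbound_mean)
```

In this toy case the bases 4 nt downstream of the bound positions are more reactive than those downstream of the unbound positions, which corresponds to less secondary structure downstream of the bound segment. Repeating the calculation for each offset gives a profile like the one in Figure 5.2E.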
Chapter 6 CONCLUSIONS AND FUTURE DIRECTIONS
6.1 Conclusions
The two goals of the research studies described in this dissertation are i) to demonstrate
that methods based on chemical kinetics and mathematical optimization can increase the
accuracy of modeling Ribo-Seq data to study the properties of translation elongation and
ii) to gain novel biological insights by uncovering previously unknown molecular factors
and determining how translation kinetics coordinates with co-translational processes. I
achieved both goals in this dissertation.
Ribo-Seq generates a snapshot of translation by isolating and sequencing
ribosome-protected mRNA fragments. However, the nuclease digestion step of Ribo-Seq
is imperfect giving rise to a wide distribution of fragment sizes and it is non-trivial to identify
which codon within the variable sized fragment was being translated. The method
presented in Chapter 2 improves the identification of A-site within ribosome-protected
fragments thus overcoming an important challenge for an accurate analysis of Ribo-Seq
data. This method uses Integer Programming to optimize the assignment of reads
between the second and stop codons of a transcript: the A-site offset from the 5′ end is
chosen such that it maximizes the reads in this region. The Integer Programming method
outperforms 11 other methods by assigning more ribosome density signal at the A-site
stalling site of polyproline motifs that has been determined through orthogonal biochemical
experiments61,62,104. The offset tables generated for S. cerevisiae and mouse embryonic
stem cells are easy to apply for any Ribo-Seq dataset in these organisms and the method
itself has been made readily available online for researchers to generate A-site offset
tables in their organism of study.
The method I presented in Chapter 3 to measure codon translation rates has key
advantages: (1) it is based on chemical kinetic theory, (2) it simultaneously utilizes all the
reads along the CDS, (3) it does not assume a sense codon is translated at the same
speed in all of its different sequence contexts, meaning the true variability of translation
rates is captured, and (4) it measures an absolute translation rate as compared to a
relative difference in rates. This method is applied to high coverage S. cerevisiae Ribo-
Seq data and a 26-fold variability in translation rates is observed. To explain this
variability, translation rates are correlated with molecular factors, showing that cognate tRNA concentration,
the presence of proline in the P-site and mRNA secondary structure 4 codons downstream
of A-site can influence the translation rate of the A-site codons. These factors have been
known previously to affect translation elongation but our method enabled measurement
of a stronger correlation and at codon resolution. For example, downstream mRNA
secondary structure has been estimated roughly through folding energy calculated based
on thermodynamic principles. However, in this analysis, I used DMS and PARS data that
provide high-throughput in vivo and in vitro profiles of mRNA secondary structure. The
correlation of translation slowdown with mRNA secondary structure is highest at 4 codons
downstream of the A-site that places it right at the mouth of the ribosome-mRNA channel
where the ribosome is likely to unwind the mRNA structure. A novel insight from
the application of the method presented in Chapter 2 is that the wobble-decoding
mechanism was not found to systematically slow down translation in S. cerevisiae. This effect
had been shown only in metazoans49 and only for one rare codon, CGA, in S. cerevisiae,
where strong inhibition of translation was attributed to the wobble decoding mechanism46.
In Chapter 4, I demonstrated a novel molecular factor that predictably and
causally influences translation rates. The chemical identity of the amino acid pairs in the
P- and A-sites can influence the rate at which the A-site codon will be translated.
Bioinformatic analyses predict that mutating the P-site amino acid can result in a
significant alteration of translation rate for ~54% of amino acid pairs and 10 of these
predictions are experimentally validated. I also demonstrated that evolution cares about
this encoding of translation rate information within the primary structure of the protein.
Fast-translating pairs are enriched 8 times more than slow-translating pairs across the
proteome. The slowdown of translation in interdomain linkers41 can be attributed to
enrichment of slow-translating pairs and not to non-optimal codons, a mechanism that has been
hypothesized but has not been established conclusively47. This nascent-chain encoded
feature and the above analyses provide evidence that the protein sequence is optimized
such that slow-translating pairs are locally enriched to assist co-translational folding while,
in the absence of any functional need, fast-translating pairs are enriched to potentially
increase the efficiency of protein production.
Chapter 5 provides the first direct evidence of coordination between translation
kinetics and the binding of a chaperone, which in this case was the Hsp70 chaperone
Ssb. Analysis of Selective Ribo-Seq and Ribo-Seq data provided transcriptome-wide Ssb
binding profiles and translation rate profiles, respectively. Correlating these two
measures demonstrates faster translation within the ribosome’s PTC coordinated with
binding of Ssb to nascent chain outside the ribosome and this effect is proportional to
probability of Ssb binding. I also show that the faster translation is encoded within the
mRNA transcript with molecular factors that can influence translation enriched or
depleted in Ssb-bound translated mRNA segments. This demonstrates yet again that
evolution has optimized the mRNA codon choice and nascent chain features encoding
translation rates to coordinate with co-translational processes and generate a functional
proteome.
6.2 Future Directions
6.2.1 Synonymous mutations and diseases
In Chapter 1, I highlighted the evidence demonstrating that altered translation kinetics
can have an influence on co-translational processes like protein folding, targeting and
assembly. The experimental approach in multiple studies to demonstrate this effect was
to introduce synonymous mutations that do not change the nascent chain’s primary
structure but alter translation rates and determine the relative loss/gain of protein activity.
The future direction of research in this area should be to identify synonymous mutations
that are enriched in disease conditions and establish their association by determining
whether the synonymous mutation alters the translation kinetics and how it
disrupts a co-translational process, resulting in the loss/gain of protein activity. The
methods developed in Chapters 2 and 3 offer a simple and quantitative approach to
determine absolute codon translation rates. Ribo-Seq experiments can be run on
samples from patients and healthy individuals to determine differential translation
elongation rates at sites of synonymous codon substitutions.
Synonymous mutations have already been established in some cases to lead
to protein aggregation diseases and cancer173. Therefore, it is important to study the
variability in codon translation rates that can arise due to synonymous mutations. This
approach can potentially explain the molecular mechanism of disease states and enable
therapeutic manipulation through development of rationally-designed mRNA sequences.
6.2.2 Test phenotypic effect of loss of amino acid pairing due to mutations
In Chapter 4, I demonstrated the evolutionary selection of amino acid pairs that encode
translation rate information. If selective pressures have acted on certain pairs of residues
because they encode translation rate information, then there must be some phenotypic
effect that is diminished upon mutating these pairs of residues. This leads to the
hypothesis that mutating these pairs is likely to alter the structure and function of the
encoded protein. To test this hypothesis, one can bioinformatically identify five yeast
proteins that are non-essential, cytoplasmic, monomeric enzymes (since these are easier
to assay than multimeric enzymes) and that are also predicted to have conserved,
translation-rate-encoding pairs of residues. Single amino acid mutations in these five
proteins can be chosen, as done in Chapter 4, that are likely to cause the largest change
in translation speed, but now under the assumption that these will be the most likely to
influence function. These mutations can be implemented in vivo by an experimental
collaborator with expertise in functional protein assays. The hypothesis will be supported
if the mutations both change the translation speed and decrease the protein’s relative
specific activity.
This future direction can dramatically expand the scope of this idea by
demonstrating that protein primary structure also encodes translation-rate information in
some pairs of residues so as to kinetically guide the co-translational acquisition of
structure and function in nascent proteins. Future research questions include how these
pairs of amino acids are modulating translation speed. I have demonstrated in Chapter 4
that both the identity of the amino acids in the P- and A-sites as well as the two tRNAs
aligned in these two positions have varied contributions to the change in translation speed.
One hypothesis is that if the amino acids are playing a major role, the catalysis of peptide
bond formation is rate limiting for these pairs, and hence mutating the residue at the P-
site can potentially switch peptide bond formation to the non-rate limiting regime and vice
versa. However, the role of tRNA interactions in influencing the translation rate is
unknown. One can speculate that depending on the identity of the two tRNAs, the P-site tRNA
can facilitate or restrict the accommodation of aminoacyl-tRNA into the A-site, influencing
the rate of translation. Further research is required to uncover evidence for this
mechanism and its molecular principles.
6.2.3 Causally test the effect of altered translation kinetics on Ssb chaperone
binding
In Chapter 5, I demonstrated that evolution has encoded translation rate information in S.
cerevisiae transcripts that correlates with the co-translational binding of the chaperone Ssb.
However, the correlation between faster translation and Ssb binding does not establish
causation. As a future direction, causation can be tested using the following procedure
(visually represented in Figure 6.1): i) Ssb-bound translating regions are determined using
the wild-type Ssb binding profiles. ii) Synonymous mutations are introduced in the
identified Ssb-bound translated regions such that they alter the translation kinetics from
81
fast to slow without changing the encoded protein sequence. iii) Selective Ribo-Seq is
carried out in the mutated strain to estimate the profile of Ssb binding. Loss of Ssb binding
will support our hypothesis that altering evolutionarily encoded translation kinetics will
disrupt the binding of this chaperone to nascent polypeptides. This can also be a potential
mechanism for a synonymous mutation in diseased condition reducing the efficiency of
chaperone binding and causing a downstream phenotypic effect. The evidence from the
analyses presented in Chapter 5 and inferences drawn from them will hopefully motivate
experimental researchers and structural biologists to study the role of chaperone binding
of nascent chain in final maturation of the protein.
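Step ii) of the procedure can be illustrated with a minimal sketch. It assumes, as a simplification, that the median translation time of a codon type (values copied from Table B.1 for a handful of codons) is an adequate proxy for its speed; the function names and the example sequence are hypothetical, and sequence-context effects such as those discussed in Chapter 4 are ignored.

```python
# Median translation times (ms) for a few codon types, copied from Table B.1.
MEDIAN_MS = {
    "AAG": 145, "AAA": 153,                          # Lys
    "GCU": 156, "GCC": 164, "GCA": 238, "GCG": 258,  # Ala
    "UUU": 165, "UUC": 169,                          # Phe
}
AMINO_ACID = {
    "AAG": "K", "AAA": "K",
    "GCU": "A", "GCC": "A", "GCA": "A", "GCG": "A",
    "UUU": "F", "UUC": "F",
}

def slowest_synonym(codon):
    """Return the synonymous codon with the largest median translation time."""
    aa = AMINO_ACID[codon]
    synonyms = [c for c in MEDIAN_MS if AMINO_ACID[c] == aa]
    return max(synonyms, key=MEDIAN_MS.get)

def slow_down(codons):
    """Recode a region with the slowest synonymous codon at every position."""
    return [slowest_synonym(c) for c in codons]

def translate(codons):
    """Translate a codon list into a one-letter amino acid string."""
    return "".join(AMINO_ACID[c] for c in codons)

# Hypothetical Ssb-bound region (not taken from any real gene).
wild_type = ["AAG", "GCU", "UUU", "GCC"]
mutant = slow_down(wild_type)

print(mutant)
print(translate(wild_type) == translate(mutant))
print(sum(MEDIAN_MS[c] for c in wild_type), sum(MEDIAN_MS[c] for c in mutant))
```

In practice the choice of mutations would need to account for the observed translation rates at the specific codon positions rather than codon-type medians alone.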
[Figure 6.1 image: panels a and b, each with a Wild-type and a Mutant plot; plotted axes are Fold Enrichment (0 to 25) versus Codon Position (0 to 500).]
Figure 6.1. Illustration of the hypothesis that a change in translation-elongation
rates will lead to disruption of Ssb binding. (a) Analysis in Chapter 5 demonstrated
that faster translation rates are encoded in those regions of an mRNA where Ssb tends
to bind the nascent chain (left panel). Strongly correlated regions of translation speed
and Ssb binding in genes can be identified and synonymous mutations will be
introduced that result in a slowdown in translation without changing the protein
sequence (right panel). If the proposed hypothesis is correct, these synonymous
mutations will disrupt Ssb binding. (b) The signal peaks from Sel-Ribo-Seq signify that
Ssb is bound to the nascent chain. Ssb binding to regions 1 and 2 (yellow regions on
nascent chain) will result in a binding peak in the Sel-Ribo-Seq profile (left panel). If
Ssb binding is disrupted due to a slowdown of translation downstream of region 2,
there will be loss of signal when the mutated region is translated (right panel).
Appendix A
CHAPTER 2 SUPPORTING INFORMATION
A.1 Supporting Figures
Figure A.1. Fragment size distribution in (A) Pooled Ribo-Seq data in mouse embryonic stem cells
(mESCs) and (B) Pooled Ribo-Seq data in Escherichia coli.
Figure A.2. Pairwise comparison of fragment-size and frame distributions between genes in
S. cerevisiae. (A) The heat map reports the pairwise Hellinger distance177 between the probability
densities of the fragment-size and frame distributions of individual genes. Only genes in the Pooled
data set that have at least 1 read per codon for fragment sizes between 24 and 34 nt were analyzed,
resulting in 210 genes in this analysis. (B) The probability density distribution of Hellinger distances
reported in (A). The Hellinger distance metric is bounded between 0 and 1; 0 indicates identical
distributions, while 1 indicates that the distributions are divergent. All pairwise Hellinger distances are
less than 0.45 and only 11% of pairwise distances are greater than 0.1. Hence, the distributions of
reads across fragment sizes and frames are highly similar and therefore exhibit very little
dependence on gene identity.
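For reference, the Hellinger distance between two discrete distributions can be computed as in the short sketch below. It assumes each gene's (fragment size, frame) distribution has been flattened into a list of probabilities in a common bin order; the function names and the example counts are illustrative only.

```python
import math

def normalize(counts):
    """Convert raw read counts per (fragment size, frame) bin to probabilities."""
    total = sum(counts)
    return [c / total for c in counts]

def hellinger(p, q):
    """Hellinger distance between two discrete distributions on the same bins."""
    if len(p) != len(q):
        raise ValueError("distributions must share the same bins")
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared / 2.0)

# Hypothetical read counts for two genes over the same four bins.
gene_a = normalize([10, 40, 30, 20])
gene_b = normalize([12, 38, 32, 18])

print(hellinger(gene_a, gene_b))            # small value: similar distributions
print(hellinger([1.0, 0.0], [0.0, 1.0]))    # non-overlapping distributions
```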
[Figure A.3 image. Panel A: P(𝑆, 𝐹) (0.00 to 0.07) versus Fragment Size (20 to 34 nt) for Frame 0, Frame 1 and Frame 2. Panel C is tabulated below.]
Read length     Input offset table
distribution    Constant offset of 15   Constant offset of 18   Mixed offsets of 12 and 18   Top offsets from experimental data
1               100%                    93%                     100%                         93%
2               95%                     95%                     100%                         95%
3               96.5%                   100%                    96.5%                        100%
4               100%                    100%                    100%                         97%
5               100%                    96%                     100%                         98%
6               100%                    100%                    100%                         100%
7               100%                    100%                    100%                         100%
8               95%                     95%                     95%                          100%
Figure A.3. The Integer Programming algorithm correctly reproduces the true A-site offsets
from artificial Ribo-Seq data. (A) An example of a Shifted Poisson distribution with mode at
(𝑆, 𝐹) = (28, 0) and a variance 𝜆 = 48 that was used to generate artificial ribosome-protected
fragments (see Methods for details). The reads generated from this distribution are subjected to
the Integer Programming algorithm for four different input offset tables. These input offset tables
are shown in Table A.4. (B) Six read length distributions with their mode at (28, 0)
were generated with Poisson variances 𝜆 = 4, 8, 16, 24, 48, 80 and labeled 1 through 6,
respectively. The distribution in (A) is the distribution labeled 5 in (B). Two more distributions
were generated with variance 𝜆 = 8 but with modes at (24, 0) and (32, 0) and labeled 7 and
8, respectively. (C) The percentage of offsets that the Integer Programming algorithm correctly
identifies in the artificial Ribo-Seq data created based on the eight read length distributions shown
in (B) and for each of the four different input offset tables (see Methods for details) used to generate
the artificial Ribo-Seq reads. The four input offset tables and the corresponding output offset
tables generated by the Integer Programming algorithm for distribution 5 are shown in Table A.4.
Figure A.4. Meta-gene analysis in Pooled Ribo-Seq data reveals excess ribosome density in
E. coli genes beyond CDS regions. A-site profiles are obtained for S. cerevisiae using unique offsets
from Table 2.1 obtained after application of the Integer Programming algorithm. For E. coli, we use a
constant offset of 12 nt from the 3΄ end as used by Woolstenhulme and co-workers66. For mESCs, we
use the unique offsets from Table A.6. We plot the meta-gene profiles for the dataset of all genes as well
as for the subset of filtered genes containing only single isoforms with one translation start site.
Figure A.5. Stalling at PPE and PPD motifs is reproduced in mESCs. The median normalized
ribosome density is obtained for all instances of (A) PPX and (B) XPP motifs in which X corresponds
to any one of the 20 naturally occurring amino acids (or a stop codon, for instance PP*). Using a
permutation test, we determine whether the median ribosome density is statistically different from the
average ribosome density. Statistically significant motifs are highlighted in dark red. This analysis
was carried out on the Pooled dataset for transcripts in which at least 50% of codon positions have
reads mapped to them. Error bars are 95% Confidence Intervals for the median obtained using
Bootstrapping120. The dashed line at Normalized Ribo Density = 1 marks the average density; a
value > 1 indicates that the motif is translated more slowly than average. The dashed line
is unrelated to the statistical significance of motifs determined by the permutation test.
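The bootstrapped confidence interval for a median can be sketched as follows. This is a generic percentile bootstrap written under the assumption that resampling is done over motif instances; the exact bootstrap variant used for the figure is not specified here, and the density values are hypothetical.

```python
import random
import statistics

def bootstrap_median_ci(values, n_boot=10000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the median of `values`."""
    rng = random.Random(seed)      # fixed seed so the sketch is reproducible
    n = len(values)
    medians = sorted(
        statistics.median(rng.choices(values, k=n))  # resample with replacement
        for _ in range(n_boot)
    )
    lower = medians[int(n_boot * alpha / 2)]
    upper = medians[int(n_boot * (1 - alpha / 2)) - 1]
    return lower, upper

# Hypothetical normalized ribosome densities for the instances of one motif.
densities = [0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 1.5, 2.0, 2.4]
low, high = bootstrap_median_ci(densities, n_boot=2000)
print(statistics.median(densities), low, high)
```

A motif would be read as slower than average when its whole interval lies above 1, although significance in the figure is assessed separately by the permutation test.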
Figure A.6. Sequence-independent translational pause observed post-initiation in S.
cerevisiae and mESCs. Meta-gene analysis at the codon level with reads mapped to the P-site is
shown. There is a mild but distinct pausing of translation when the 4th and 5th codons are in the
P-site. This effect is seen in both the Pooled and Pop datasets of S. cerevisiae as well as the filtered
genes dataset of mouse embryonic stem cells (mESCs).
Figure A.7. The Integer Programming algorithm correctly assigns greater ribosome density to the
glycine residue in PPG motifs than other methods in S. cerevisiae. (A) Normalized ribosome density at
the predicted A-site, using different methods to determine the A-site, is shown for an instance of the PPG motif in
gene YDR226W in which the glycine of the PPG motif is at codon position 16, for the Pop dataset in S. cerevisiae. (B)
The fraction of PPG instances (𝑛 = 35) at which the Integer Programming method yields greater ribosome density
at glycine than the compared method. The color-coding is the same as shown in the legend of panel (A). Our
method does better if it assigns greater ribosome density in more than half the instances (horizontal line in panel
B). Integer Programming yields significantly higher ribosome density at G in the PPG motifs than all other
methods (for Hussmann 𝑃 = 0.026, for ribodeblur 𝑃 = 0.01 and for others 𝑃 < 10^-5). Two-sided P-values
were calculated using the Wilcoxon signed rank test. Error bars are 95% Confidence Intervals about the median
calculated using Bootstrapping120.
A.2 Supplementary Tables
Table A.1. Number of genes for the various fragment size and frame combinations that meet the
criteria of at least 1 read per codon on average in the Pop and Pooled datasets of S. cerevisiae.
Fragment size   Pop dataset                 Pooled dataset
                Frame 0 Frame 1 Frame 2     Frame 0 Frame 1 Frame 2
20 34 7 57 156 50 98
21 45 60 41 224 129 139
22 99 55 48 234 115 199
23 73 47 95 162 107 290
24 58 91 72 161 352 193
25 155 69 55 647 251 194
26 105 64 175 481 241 916
27 159 255 161 1096 2213 878
28 1081 248 333 4487 1861 1468
29 850 330 1437 3919 1504 4474
30 528 876 1139 3184 3041 3835
31 276 610 643 2089 2251 3164
32 70 279 181 799 1897 1789
33 39 82 9 474 1076 322
34 1 2 0 237 194 71
35 0 0 0 58 40 33
Table A.2. Initial offset tables after application of Integer Programming algorithm to Pop and
Pooled datasets in S. cerevisiae.
Fragment size   Pop dataset                 Pooled dataset
                Frame 0 Frame 1 Frame 2     Frame 0 Frame 1 Frame 2
20 0/6 ND* 9/0 6/15 6/0 9/0
21 15/6 9/15 9/0 15/6 9/0 9/18
22 9/0 9/18 18/9 9/15 9/0 18/9
23 9/15 12/15 18/12 9/15 9/18 18/12
24 15/12 12/18 18/12 15/9 12/15 18/12
25 12/15 12/18 18/12 15/12 12/15 18/12
26 15/12 15/18 18/15 15/12 12/15 18/15
27 15 15/18 18/15 15 15 18/15
28 15 18/15 18 15 15 18/15
29 15 18/15 18 15 15/18 18
30 15 18 18 15 18/15 18
31 18/15 18 18 15 18 18
32 18/15 18 18 18/15 18 18
33 18/0 18/0 ND 18/15 18 18/15
34 ND ND ND 18/15 18/15 18/21
35 ND ND ND 18/15 15/18 21/18
*ND = Not Defined. The number of genes for certain 𝑆 and 𝐹 combinations is less than 10 and hence
the Integer Programming algorithm is not applied to these combinations.
Table A.3. For unique offsets described in Table 2.1, the robustness to variation in
parameters and consistency across different Ribo-Seq datasets are described with
additional sub columns. The two sub columns in the top row indicate whether the unique offset is
sensitive (S) or robust (R) to parameter variation: a change in the threshold used to classify the
most probable offset as unique from 60% to 80% (left sub column), and variation in the threshold of
the secondary selection criterion 𝑅(1) < 𝑎 ∗ 𝑀𝑒𝑎𝑛(𝑅(2), 𝑅(3), 𝑅(4)), where 𝑎 ranges from 1 to 1/10
(right sub column). The bottom row specifies the consistency of the unique offset across individual
Ribo-Seq datasets. For example, for fragment size 27 in frame 0, 15 is the unique offset, which is
sensitive (S in left sub column) to a change in threshold from 60% to 80% and robust (R in right
sub column) to a change in the secondary selection parameter 𝑎 from 1 to 1/10. It is also consistent in 9 out
of 12 datasets for which more than 10 genes meet our filtering criteria.
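Each cell lists the offset; a unique offset is followed by its two sensitivity flags (left and right sub column) and, in parentheses, the number of datasets in which it is consistent. Cells with two values (e.g. 15/12) are offsets that could not be uniquely determined.
Fragment Size   Frame 0               Frame 1               Frame 2
24              15 R R (4 of 4)       15/12                 18/12
25              15 S R (5 of 7)       12/15                 18 S S (4 of 4)
26              15/12                 18/15                 18/15
27              15 S R (9 of 12)      15 S R (7 of 12)      18 R R (6 of 9)
28              15 R R (14 of 17)     15 S R (10 of 13)     18 R R (10 of 12)
29              15 R R (14 of 15)     15/18                 18 R R (15 of 16)
30              15 R R (15 of 15)     18 R R (12 of 16)     18 R R (16 of 16)
31              15 S R (12 of 13)     18 R R (13 of 15)     18 R R (15 of 16)
32              18/15                 18 R R (11 of 11)     18 R R (9 of 10)
33              18 S S (6 of 6)       18 R R (7 of 7)       18 S S (4 of 4)
34              18 R R (2 of 2)       18 R S (2 of 2)       18/21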
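To show how an offset table of this kind is used, the sketch below assigns a read to an A-site codon. It rests on assumptions that should be checked against the Methods: the frame is taken as the position of the read's 5΄ end within its codon, and the offset is counted in nucleotides from the codon containing the 5΄ end to the A-site codon. The dictionary holds only the unique offsets for fragment sizes 28 and 30 from Table A.3, and the function name is hypothetical.

```python
# Unique A-site offsets (nt) keyed by (fragment size, frame), from Table A.3.
UNIQUE_OFFSETS = {
    (28, 0): 15, (28, 1): 15, (28, 2): 18,
    (30, 0): 15, (30, 1): 18, (30, 2): 18,
}

def asite_codon(five_prime, length, cds_start, offsets=UNIQUE_OFFSETS):
    """Return the 0-based codon index of the A-site, or None if no unique offset.

    five_prime and cds_start are 0-based transcript coordinates of the read's
    5' end and of the first nucleotide of the start codon, respectively.
    """
    relative = five_prime - cds_start
    frame = relative % 3                    # assumed frame definition
    offset = offsets.get((length, frame))
    if offset is None:
        return None                         # ambiguous or undefined combination
    return relative // 3 + offset // 3      # codon of 5' end plus offset in codons

print(asite_codon(100, 28, 100))   # frame 0, offset 15
print(asite_codon(102, 28, 100))   # frame 2, offset 18
print(asite_codon(100, 26, 100))   # no unique offset in this sketch
```

Reads whose fragment size and frame have no unique offset are simply skipped here; how such reads are handled in the actual analysis is not covered by this sketch.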
Table A.4. Input A-site offset tables used in the creation of artificial Ribo-Seq data (table below,
see Methods). Offset A-site tables (next page) output by the Integer Programming method when
applied to artificial Ribo-Seq data constructed using the input tables (Top) and P(𝑺, 𝑭) distribution
with mode (𝟐𝟖, 𝟎) and variance 𝝀 = 𝟒𝟖 (Distribution 5 in Figure A.3).
Input Offset tables
Fragment size   Constant offset of 15     Constant offset of 18     Mixed offsets of 12 and 18   Top offsets from exp. data
                Frame 0 Frame 1 Frame 2   Frame 0 Frame 1 Frame 2   Frame 0 Frame 1 Frame 2      Frame 0 Frame 1 Frame 2
20 15 15 15 18 18 18 12 12 12 6 6 9
21 15 15 15 18 18 18 12 12 12 15 9 9
22 15 15 15 18 18 18 12 12 12 15 6 18
23 15 15 15 18 18 18 12 12 12 15 18 18
24 15 15 15 18 18 18 12 12 12 15 15 12
25 15 15 15 18 18 18 12 12 12 15 12 18
26 15 15 15 18 18 18 12 12 12 15 15 18
27 15 15 15 18 18 18 12 12 12 15 15 18
28 15 15 15 18 18 18 18 18 18 15 15 18
29 15 15 15 18 18 18 18 18 18 15 15 18
30 15 15 15 18 18 18 18 18 18 15 18 18
31 15 15 15 18 18 18 18 18 18 15 18 18
32 15 15 15 18 18 18 18 18 18 15 18 18
33 15 15 15 18 18 18 18 18 18 18 18 18
34 15 15 15 18 18 18 18 18 18 18 18 18
35 15 15 15 18 18 18 18 18 18 18 18 18
Output Offset tables
Fragment size   Constant offset of 15     Constant offset of 18     Mixed offsets of 12 and 18   Top offsets from exp. data
                Frame 0 Frame 1 Frame 2   Frame 0 Frame 1 Frame 2   Frame 0 Frame 1 Frame 2      Frame 0 Frame 1 Frame 2
20 15 15 15 18 18 18 12 12 12 6 6 9
21 15 15 15 18 18 18 12 12 12 15 9 9
22 15 15 15 18 18 18 12 12 12 15 6 18
23 15 15 15 18 18 18 12 12 12 15 18 18
24 15 15 15 18 18 18 12 12 12 15 15 12
25 15 15 15 18 18 18 12 12 12 15 12 18
26 15 15 15 18 18 18 12 12 12 15 15 18
27 15 15 15 18 18 18 12 12 12 15 15 18
28 15 15 15 18 18 18 18 18 18 15 15 18
29 15 15 15 18 18 18 18 18 18 15 15 18
30 15 15 15 18 18 18 18 18 18 15 18 18
31 15 15 15 18 18 18 18 18 18 15 18 18
32 15 15 15 18 18 18 18 18 18 15 18 18
33 15 15 15 18 18 18 18 18 18 18 18 18
34 15 15 15 18 18 18 18 18 18 18 18 18
35 15 15 15 18 18/15 18/15 18 18 18 18 18 18/15
Table A.5. Initial offset table after application of Integer Programming algorithm to a Pooled dataset
in mESCs consisting of all genes. Offset table after application of Integer Programming algorithm
to a Pooled dataset of E. coli.
Fragment size   mESCs Pooled dataset        E. coli Pooled dataset
                Frame 0 Frame 1 Frame 2     Frame 0 Frame 1 Frame 2
20 6/3 6/9 0/15 12/9 15/0 9/0
21 6/0 6/0 0/18 12/9 3/12 9/0
22 0/9 0/3 9/15 12/0 12/0 9/0
23 6/15 9/0 18/9 12/3 0/12 9/0
24 9/15 3/0 18/0 12/3 12/0 0/9
25 9/6 9/6 12/15 12/3 12/3 9/6
26 12/6 9/12 15/0 12/3 3/0 9/3
27 12/9 12/21 15/9 3/12 3/6 9/3
28 12/15 12/21 15/12 3/12 3/6 3/0
29 15/12 15/12 18/15 3/12 3/6 3/9
30 15 15/18 18/15 12/9 9/3 9/6
31 15/12 15/18 18/15 12/3 9/6 9/3
32 12/15 18/15 18 12/9 3/9 9/3
33 15/12 18/15 18 12/9 9/3 9/3
34 15/18 15/18 18/12 12/9 9/12 9/6
35 ND* ND ND 12/9 12/9 9/6
36 ND ND ND 12/15 9/12 9/12
37 ND ND ND 12/6 12/9 9/12
38 ND ND ND 12/15 12/15 9/12
39 ND ND ND 12/15 12/3 9/3
40 ND ND ND 12/15 15/3 9/3
Table A.6. A-site locations (nucleotide offsets from 5΄ end) determined by applying the
Integer Programming algorithm to the Pooled dataset in mESCs are shown as a function of
fragment size and frame. The dataset consists of only genes that have a single isoform and only
one translation start site. The top two offset values are listed for those 𝑆 and 𝐹 combinations in
which the A-site location could not be uniquely determined. The description of the sub columns is
the same as Table A.3.
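Each cell lists the offset; a unique offset is followed by its two sensitivity flags (left and right sub column) and, in parentheses, the number of datasets in which it is consistent.
Fragment Size   Frame 0             Frame 1             Frame 2
28              15/12               15/12               15 S R (1 of 2)
29              15 R R (2 of 2)     15/18               15/18
30              15 R R (2 of 2)     15/18               18/15
31              15 R R (2 of 2)     15 S S (1 of 2)     18 R R (2 of 2)
32              15 R R (2 of 2)     18/15               18 R R (2 of 2)
33              15 R R (2 of 2)     18 R R (2 of 2)     18 R R (2 of 2)
34              15/12               18 S S (2 of 2)     18 R R (2 of 2)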
Table A.7. Number of genes in the combination of fragment size and frame meeting the criteria of
at least 1 read per codon on average in mESCs and E. coli Pooled datasets. The mESCs Pooled
dataset consist of genes that are single isoform and have only one defined translation initiation site.
Fragment size   mESCs Pooled dataset        E. coli Pooled dataset
                Frame 0 Frame 1 Frame 2     Frame 0 Frame 1 Frame 2
20 8 7 10 313 243 330
21 10 0 1 440 270 377
22 10 1 10 532 416 431
23 15 15 18 610 471 645
24 41 20 19 806 471 655
25 61 19 52 816 610 603
26 52 43 75 765 625 742
27 73 38 45 952 681 825
28 126 55 119 1001 849 868
29 187 125 208 988 840 981
30 230 191 257 1072 791 956
31 197 192 280 1042 898 916
32 103 187 237 994 823 993
33 47 125 108 1060 761 943
34 17 55 45 1008 842 856
35 0 0 0 891 740 924
36 0 0 0 943 598 799
37 0 0 0 827 618 625
38 0 0 0 640 494 596
39 0 0 0 588 323 461
40 0 0 0 440 288 278
Table A.8. Median normalized ribosome densities for 61 codon types were correlated with tRNA
abundance for the Integer Programming method and 11 other contemporary methods (see
Methods for details). The tRNA abundance values were obtained from Table S2 of the study by
Weinberg and co-workers41.
Method                Spearman's rho   p-value
Integer Programming   -0.583           6.39 × 10^-5
Heuristic +18         -0.581           6.76 × 10^-5
Plastid               -0.580           6.98 × 10^-5
RiboProfiling         -0.575           8.55 × 10^-5
riboWaltz             -0.574           8.65 × 10^-5
Hussmann              -0.571           9.53 × 10^-5
Martens               -0.571           9.82 × 10^-5
Heuristic +15         -0.570           9.94 × 10^-5
ribodeblur            -0.570           9.94 × 10^-5
Scikit-ribo           -0.567           1.09 × 10^-4
Rpbp                  -0.566           1.12 × 10^-4
Center-weighted       -0.517           5.31 × 10^-4
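The rank correlation reported in this table can be reproduced in outline with the following sketch, which computes Spearman's rho as the Pearson correlation of average ranks. The input values are hypothetical and only illustrate the expected negative relationship between tRNA abundance and median ribosome density; p-values are not computed.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0        # average rank of the tied block
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def spearman_rho(x, y):
    """Spearman's rank correlation coefficient."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical values for five codon types.
trna_abundance = [1, 2, 3, 4, 5]
median_density = [2.0, 1.6, 1.7, 1.1, 0.9]
print(spearman_rho(trna_abundance, median_density))
```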
Table A.9. Publicly available datasets used in the study.
Dataset (first author name)   Year of publication   Number of replicates*   GEO Study   Accession numbers of samples used
Saccharomyces cerevisiae
Pop                           2014                  1                       GSE63789    GSM1557447
Guydosh                       2014                  1                       GSE52968    GSM1279568
Jan                           2014                  1                       GSE61012    GSM1495525
Williams                      2014                  1                       GSE61011    GSM1495503
Gerashchenko                  2014                  1                       GSE59573    GSM1439584
Gardin                        2014                  2                       GSE51164    GSM1239959, GSM1239960
Lareau                        2014                  3                       GSE58321    GSM1406453, GSM1406454, GSM1406455
Nedialkova                    2015                  3                       GSE67387    GSM1646015, GSM1646016, GSM1646017
Young                         2015                  1                       GSE69414    GSM1700885
Weinberg                      2016                  1                       GSE53268    GSM1289257
Nissley                       2016                  2                       GSE75322    GSM1949550, GSM1949551
Mouse embryonic stem cells (mESCs)
Ingolia                       2011                  1                       GSE30839    GSM765298
Hurt                          2013                  1                       GSE41785    GSM1024298
Escherichia coli
Li                            2012                  2                       GSE35641    GSM872393, GSM872394
Li                            2014                  1                       GSE53767    GSM1300279
Woolstenhulme                 2015                  2                       GSE64488    GSM1572273, GSM1572275
* For datasets with more than one replicate, all replicates were used to create the Pooled dataset.
Table A.10. A-site offsets determined using the publicly available software packages Plastid38,
RiboProfiling92 and riboWaltz37. These methods generate a P-site offset as output for each
fragment length. The A-site offsets below are obtained by adding 3 nt to the P-site offsets.
S. cerevisiae Pop data S. cerevisiae Pooled data mESCs Pooled data
Fragment size Plastid RiboProfiling riboWaltz Plastid RiboProfiling riboWaltz Plastid RiboProfiling riboWaltz
20 16 7 16 16 7 16 NA NA NA
21 16 7 16 16 7 13 NA NA NA
22 16 7 15 16 7 15 NA NA NA
23 16 10 15 16 10 15 NA NA NA
24 16 10 17 16 10 15 NA NA NA
25 16 11 15 16 11 15 16 13 15
26 16 11 16 16 11 16 16 14 16
27 16 14 14 16 14 14 16 15 15
28 16 15 15 16 15 15 13 16 15
29 16 16 16 16 16 16 13 6 15
30 16 16 16 16 16 16 15 15 15
31 16 16 16 16 16 16 13 13 16
32 16 17 17 16 17 17 16 13 16
33 16 17 17 16 17 17 16 14 16
34 16 13 15 16 13 15 17 13 14
35 16 13 16 16 13 15 NA NA NA
Appendix B
CHAPTER 3 SUPPORTING INFORMATION
B.1 Supplementary Methods
B.1.1 Derivation of Eq. (3.3) from Eq. (3.1) and Eq. (3.2)
Eq. (3.1), restated below, defines the steady state condition of translation. Eq. (3.2) is the
mean synthesis time of transcript 𝑖, which is the sum of the translation times of the
elongating codons of transcript 𝑖.
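\[ \frac{N^{\mathrm{ribo}}_{2,i}}{\tau(2,i)} = \frac{N^{\mathrm{ribo}}_{3,i}}{\tau(3,i)} = \cdots = \frac{N^{\mathrm{ribo}}_{j,i}}{\tau(j,i)} = \cdots = \frac{N^{\mathrm{ribo}}_{N_c(i),i}}{\tau(N_c(i),i)} \tag{Eq. 3.1} \]
\[ \langle T(i) \rangle = \tau(2,i) + \tau(3,i) + \cdots + \tau(N_c(i),i) \tag{Eq. 3.2} \]
The translation time of a codon position 𝑙 in transcript 𝑖 can be expressed (through a simple
algebraic rearrangement of Eq. (3.1)) in terms of the translation time of any other codon
position 𝑗 as
\[ \tau(l,i) = \frac{\tau(j,i)\, N^{\mathrm{ribo}}_{l,i}}{N^{\mathrm{ribo}}_{j,i}}. \tag{Eq. B.1} \]
For each codon position, 𝑙 = 2, 3, 4, … , 𝑁𝑐(𝑖), we substitute Eq. (B.1) into Eq. (3.2), which
yields
\[ \langle T(i) \rangle = \frac{\tau(j,i)\, N^{\mathrm{ribo}}_{2,i}}{N^{\mathrm{ribo}}_{j,i}} + \frac{\tau(j,i)\, N^{\mathrm{ribo}}_{3,i}}{N^{\mathrm{ribo}}_{j,i}} + \cdots + \frac{\tau(j,i)\, N^{\mathrm{ribo}}_{N_c(i),i}}{N^{\mathrm{ribo}}_{j,i}}. \tag{Eq. B.2} \]
We then pull out the common terms, yielding
\[ \langle T(i) \rangle = \frac{\tau(j,i)}{N^{\mathrm{ribo}}_{j,i}} \left[ N^{\mathrm{ribo}}_{2,i} + N^{\mathrm{ribo}}_{3,i} + \cdots + N^{\mathrm{ribo}}_{N_c(i),i} \right], \tag{Eq. B.3} \]
where the term in square brackets on the right-hand side of Eq. (B.3) can be expressed
as a summation, yielding
\[ \langle T(i) \rangle = \frac{\tau(j,i)}{N^{\mathrm{ribo}}_{j,i}} \sum_{l=2}^{N_c(i)} N^{\mathrm{ribo}}_{l,i}. \tag{Eq. B.4} \]
Rearranging Eq. (B.4) yields Eq. (3.3) in the main text:
\[ \tau(j,i) = \frac{N^{\mathrm{ribo}}_{j,i}}{\sum_{l=2}^{N_c(i)} N^{\mathrm{ribo}}_{l,i}} \langle T(i) \rangle. \tag{Eq. 3.3} \]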
B.1.2. Estimation of ⟨𝝉(𝒊)⟩
We estimated the synthesis time of a protein by using the finding that it scales linearly with
the number of elongating codons in a transcript 137
⟨𝑇(𝑖)⟩ = (𝑁𝑐(𝑖) − 1)⟨𝜏A⟩. (Eq. B.5)
In Eq. (B.5), ⟨𝜏A⟩ is the transcriptome-wide average codon translation time. This
approximation is supported both by experimental results33 and a theoretical analysis that
indicates this estimate is typically within 5% of the true synthesis time137.
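As a worked illustration of Eq. (3.3) combined with Eq. (B.5), the sketch below converts the ribosome read counts at the elongating codon positions of one transcript into codon translation times. The function name, the read counts and the value used for the transcriptome-wide average codon translation time are all hypothetical.

```python
def codon_translation_times(ribo_counts, mean_codon_time_ms):
    """Translation time (ms) of each elongating codon position of a transcript.

    ribo_counts[k] is the read count at codon position k + 2, so the list
    covers positions 2..N_c(i) and has N_c(i) - 1 entries.
    """
    total = sum(ribo_counts)
    if total == 0:
        raise ValueError("transcript has no mapped reads")
    n_elongating = len(ribo_counts)                        # N_c(i) - 1
    synthesis_time = n_elongating * mean_codon_time_ms     # Eq. (B.5)
    return [count / total * synthesis_time for count in ribo_counts]  # Eq. (3.3)

# Hypothetical transcript with four elongating codons and an assumed
# transcriptome-wide average codon translation time of 200 ms.
times = codon_translation_times([10, 30, 20, 40], mean_codon_time_ms=200.0)
print(times)
print(sum(times))   # equals the synthesis time from Eq. (B.5)
```

By construction the individual times sum to the estimated synthesis time, which is the content of Eq. (3.2).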
B.2 Supplementary Figures
Figure B.1. Comparison of the properties of the 117- and 364-transcript data sets from
studies of Nissley et al.9 and Williams et al.114, respectively, to the entire S. cerevisiae
transcriptome. Probability distributions of CDS length and percent GC content from the data set
of 117-transcripts from Nissley et al.9 (green) and from the entire transcriptome (blue) are plotted in
(A) and (B), respectively. (C) Scatter plot of the codon usage in the whole genome versus the 117-
transcript data set from Nissley et al.9. (D), (E) and (F) are the same as (A), (B) and (C), respectively,
except that the 364-transcript data set from Williams et al.114 is used.
Figure B.2. Translation time distributions for the 64 codon types. (A) The translation time
distribution for each codon type is shown for the dataset of Nissley et al.9. Each distribution is shown
ignoring the extreme 5th percentiles at both ends of the distribution. The codons are sorted based
on the medians of their translation time distributions. There are only three instances of CGG and
one instance of CGA in our gene subset and hence their boxplots are not noticeable. (B) Same as (A)
but for the dataset of Williams et al.114. The sorting is the same as in (A).
Figure B.3. Codon translation rates are highly correlated across datasets and with rates from
the method of Dao Duc and Song. (A) The medians of the translation time distributions of the 64
codon types are highly correlated between the datasets of Nissley et al.9 and Williams et al.114. (B)
The standard deviations of these translation time distributions are also highly correlated for the two
datasets, indicating that the variability of translation times is reproducible across datasets. (C) The
codon translation rates obtained using Eq. (3.5) for the dataset from Weinberg et al.41 are correlated
with codon translation rates inferred in the study of Dao Duc and Song132 on the same dataset.
Figure B.4. Molecular factors shaping the variability of individual codon translation rates in
the dataset from Williams et al.114. (A-B) Median translation times of codon types are negatively
correlated with cognate tRNA abundance estimated by (A) gene copy number and (B) RNA-Seq
gene expression. (C) Probability distributions of the translation time of codons that are followed by
a proline-encoding codon and of all other codons are plotted in green and blue,
respectively. (D-E) Percentage difference in median translation times when mRNA structure is
present relative to when it is not present, as a function of codon position after the A-site. Grey bars
indicate results that are not statistically significant. Error bars are the 95% C.I. calculated using 10^4
bootstrap cycles; significance is assessed using the Mann-Whitney U test corrected with the
Benjamini-Hochberg FDR method for multiple-hypothesis correction. mRNA structure information
used in (D) and (E) was taken from in vivo DMS and in vitro PARS data, respectively. (F) Scatter
plot of the median translation times of pairs of codon types that are decoded by the same tRNA
molecule. The red line is the identity line. The list of tRNA molecules and the codons they decode
was taken from Cannarozzi et al.176. Error bars are standard errors about the median calculated
with 10^4 bootstrap cycles.
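The Benjamini-Hochberg correction mentioned in this caption can be sketched as below. This is the standard step-up procedure returning adjusted p-values; the input p-values are hypothetical and stand in for the per-position Mann-Whitney U test results.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])   # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone adjusted values.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted

# Hypothetical p-values for four codon positions after the A-site.
p_values = [0.01, 0.04, 0.03, 0.20]
adjusted = benjamini_hochberg(p_values)
significant = [q < 0.05 for q in adjusted]
print(adjusted)
print(significant)
```

Positions whose adjusted p-value is not below the chosen FDR threshold would correspond to the grey bars in panels (D) and (E).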
B.3 Supplementary Tables
108 T
ab
le B
.1. S
tatistics f
or
the
tra
nsla
tio
n tim
e d
istr
ibution
s o
f 64 c
odon
typ
es o
bta
ine
d f
rom
the N
issle
y d
ata
set
Co
do
n
typ
e
Am
ino
acid
Nu
mb
er
of
su
ch
co
do
ns
in o
ur
da
ta s
et
Mean
tran
sla
tio
n
tim
e (
ms)
Med
ian
tran
sla
tio
n
tim
e (
ms)
Sta
n
da
rd
de
via
tio
n
(ms)
Vari
an
ce
(ms
2)
5%
pe
rcen
tile
ob
serv
ed
in
Gen
e n
am
e /
co
do
n
po
sit
ion
95%
pe
rcen
tile
ob
serv
ed
in
Gen
e n
am
e /
co
do
n
po
sit
ion
Tim
e 5
%
pe
rcen
til
e
Ob
serv
e
d (
ms)
Tim
e 9
5%
pe
rcen
tile
Ob
serv
ed
(ms)
Min
imu
m
tim
e
Ob
serv
ed
(ms)
Maxim
um
tim
e
Ob
serv
ed
(ms)
Co
eff
icie
nt
of
Vari
ati
on
AU
U
I 908
146 +
/- 3
127 +
/- 2
97
9482
YH
L034C
/176
YE
R091C
/305
49
.93
6 2
90
.03
1 1
7.5
90
14
91
.588
0
.66
4
AU
C
I 836
143 +
/- 2
131 +
/- 3
70
4891
YN
L064C
/178
YG
L245W
/66
51
.52
9 2
74
.48
5 9
.06
3 4
70
.16
2 0
.49
0
GU
U
V
1052
148 +
/- 2
133 +
/- 2
80
6411
YO
R198C
/160
YO
R369C
/10
5
7.3
82
27
9.6
13
23
.18
0 1
06
0.5
20
0.5
41
AC
C
T
621
156 +
/- 4
138 +
/- 3
99
9854
YH
R208W
/107
YH
L034C
/91
5
7.5
37
29
9.6
18
21
.03
7 1
17
2.0
43
0.6
35
AA
U
N
401
161 +
/- 6
139 +
/- 4
113
12753
YB
R162W
-A/1
2
YO
R027W
/70
61
.52
8 3
15
.66
3 2
1.4
00
13
33
.213
0
.70
2
AA
C
N
994
153 +
/- 2
142 +
/- 3
73
5307
YD
R454C
/145
YG
R155W
/232
59
.44
1 2
85
.12
1 1
3.9
94
52
3.3
29
0.4
77
AA
G
K
1692
174 +
/- 3
145 +
/- 2
143
20352
YP
L061W
/449
YLR
150W
/115
5
3.9
13
37
5.8
49
17
.67
1 2
60
8.5
37
0.8
22
AC
U
T
748
162 +
/- 4
146 +
/- 4
104
10849
YLR
197W
/298
YP
L106C
/269
59
.97
6 3
10
.07
7 6
.06
5 1
46
4.1
46
0.6
42
CA
A
Q
931
169 +
/- 3
152 +
/- 3
90
8050
YG
R209C
/98
YO
L040C
/3
65
.18
1 3
21
.00
5 1
9.4
96
84
0.4
33
0.5
33
AA
A
K
820
178 +
/- 4
153 +
/- 3
116
13484
YB
R249C
/335
YD
R071C
/116
5
7.4
98
35
8.5
80
10
.54
3 1
41
1.5
64
0.6
52
UU
A
L
541
172 +
/- 4
153 +
/- 4
88
7786
YB
R025C
/340
YN
L244C
/39
6
6.8
52
34
0.5
22
20
.95
3 6
59
.73
8 0
.51
2
GU
C
V
750
167 +
/- 3
155 +
/- 3
83
6903
YD
L067C
/27
YA
L038W
/256
68
.92
5 2
98
.32
6 1
6.8
30
78
9.1
76
0.4
97
GC
U
A
1508
167 +
/- 2
156 +
/- 2
79
6168
YP
R069C
/176
YP
L028W
/258
66
.83
6 3
10
.78
9 1
5.7
83
75
7.5
90
0.4
73
UU
G
L
1331
175 +
/- 2
160 +
/- 3
88
7668
YO
R210W
/24
YD
L055C
/156
7
2.6
14
32
0.2
05
24
.59
2 1
25
1.3
41
0.5
03
GC
C
A
734
179 +
/- 3
164 +
/- 4
93
8608
YP
R069C
/27
YB
L030C
/124
66
.83
6 3
38
.18
8 1
5.4
38
11
66
.562
0
.52
0
UU
U
F
423
182 +
/- 6
165 +
/- 4
113
12687
YC
R053W
/287
YP
R069C
/72
67
.12
6 3
39
.32
3 1
8.3
07
16
10
.375
0
.62
1
UU
C
F
792
179 +
/- 3
169 +
/- 3
79
6223
YD
L055C
/211
YP
R035W
/108
73
.49
0 3
12
.21
9 1
9.2
20
63
0.3
72
0.4
41
109
Tab
le B
.1. S
tatistics f
or
the
tra
nsla
tio
n tim
e d
istr
ibution
s o
f 64 c
odon
typ
es o
bta
ine
d f
rom
the N
issle
y d
ata
set
Co
do
n
typ
e
Am
ino
acid
Nu
mb
er
of
su
ch
co
do
ns
in o
ur
da
ta s
et
Mean
tran
sla
tio
n
tim
e (
ms)
Med
ian
tran
sla
tio
n
tim
e (
ms)
Sta
n
da
rd
de
via
tio
n
(ms)
Vari
an
ce
(ms
2)
5%
perc
en
tile
ob
serv
ed
in
Gen
e n
am
e /
co
do
n
po
sit
ion
95%
perc
en
tile
ob
serv
ed
in
Gen
e n
am
e /
co
do
n
po
sit
ion
Tim
e 5
%
pe
rcen
tile
Ob
serv
ed
(ms)
Tim
e 9
5%
pe
rcen
tile
Ob
serv
ed
(ms)
Min
imu
m
tim
e
Ob
serv
ed
(ms)
Maxim
um
tim
e
Ob
serv
ed
(ms)
Co
eff
icie
nt
of
Vari
ati
on
UA
C
Y
634
233 +
/- 4
216 +
/- 4
101
10180
YG
R175C
/319;
YG
R175C
/457
YC
R012W
/123
10
4.7
67
41
4.6
40
33
.22
8 8
17
.72
8 0
.43
3
CC
A
P
943
245 +
/- 6
218 +
/- 4
171
29352
YG
L253W
/425
YG
L253W
/74
88
.62
7 4
59
.41
2 2
.57
3 3
89
8.7
17
0.6
98
CG
A
R
1
218 +
/- 0
218 +
/- 0
0
0
YE
R072W
/20
YE
R072W
/20
21
8.2
77
21
8.2
77
21
8.2
77
21
8.2
77
0.0
00
AC
A
T
188
259 +
/- 9
228 +
/- 1
0
124
15368
YP
L106C
/491
YB
R249C
/44
11
0.1
00
51
1.0
94
57
.57
1 8
07
.89
6 0
.47
9
CA
C
H
309
248 +
/- 6
236 +
/- 6
109
11783
YLL018C
/487
YIL
053W
/51
94
.93
8 4
65
.45
8 2
5.8
52
76
1.3
73
0.4
40
CC
U
P
272
254 +
/- 7
237 +
/- 8
122
14834
YLL018C
/520
YD
R226W
/208
10
7.5
96
46
9.2
14
62
.33
1 7
31
.43
2 0
.48
0
GC
A
A
251
253 +
/- 7
238 +
/- 8
111
12238
YD
R454C
/156
YLR
197W
/16
9
5.1
05
49
3.1
38
45
.49
1 7
41
.95
6 0
.43
9
CA
G
Q
124
253 +
/- 1
0
243 +
/- 1
5
111
12254
YH
R019C
/84
YO
R027W
/554
11
1.9
13
43
6.6
67
58
.39
2 6
94
.59
8 0
.43
9
AG
C
S
121
253 +
/- 9
247 +
/- 1
3
97
9456
YK
L192C
/70
YG
R175C
/308
1
22
.95
1 4
19
.06
6 6
1.4
18
65
1.2
62
0.3
83
CU
C
L
43
288 +
/- 2
2
250 +
/- 2
0
143
20370
YE
R009W
/3
YC
R053W
/56
13
7.9
84
63
4.6
43
42
.33
6 6
95
.23
6 0
.49
7
GG
C
G
224
263 +
/- 8
250 +
/- 7
126
15922
YD
R023W
/391
YLR
109W
/12
9
0.9
85
45
1.5
25
20
.53
4 9
44
.99
4 0
.47
9
GU
G
V
145
280 +
/- 1
8
253 +
/- 1
3
216
46686
YJR
104C
/7
YH
R064C
/118
9
1.3
20
47
5.0
38
56
.89
9 2
38
2.0
25
0.7
71
GA
G
E
319
268 +
/- 7
254 +
/- 7
117
13606
YO
R198C
/415
YO
R285W
/51
12
2.9
62
47
5.4
07
43
.04
9 1
09
4.6
66
0.4
37
GC
G
A
54
272 +
/- 1
9
258 +
/- 2
2
136
18474
YG
R282C
/13
YE
R055C
/53
83
.66
4 4
99
.33
9 3
5.6
73
65
5.7
17
0.5
00
UG
C
C
48
292 +
/- 1
9
263 +
/- 1
1
132
17535
YN
L244C
/89
YF
L045C
/46
12
6.1
19
57
5.0
68
59
.14
0 6
73
.77
4 0
.45
2
AU
A
I 79
279 +
/- 1
4
266 +
/- 1
7
122
14942
YE
R120W
/121;
YE
R120W
/211
YB
R162W
-
A/4
0
10
7.4
75
47
4.6
45
85
.15
1 6
65
.91
1 0
.43
7
CU
G
L
86
301 +
/- 1
3
279 +
/- 1
5
122
14892
YP
R062W
/88
YG
R155W
/307
12
6.5
17
52
2.7
21
11
1.9
13
72
0.7
22
0.4
05
110
Tab
le B
.1. S
tatistics f
or
the
tra
nsla
tio
n tim
e d
istr
ibution
s o
f 64 c
odon
typ
es o
bta
ine
d f
rom
the N
issle
y d
ata
set
Co
do
n
typ
e
Am
ino
acid
Nu
mb
er
of
su
ch
co
do
ns
in o
ur
da
ta s
et
Mean
tran
sla
tio
n
tim
e (
ms)
Med
ian
tran
sla
tio
n
tim
e (
ms)
Sta
n
dard
de
via
tio
n
(ms)
Vari
an
ce
(ms
2)
5%
pe
rcen
tile
ob
serv
ed
in
Gen
e n
am
e /
co
do
n
po
sit
ion
95%
pe
rcen
tile
ob
serv
ed
in
Gen
e n
am
e /
co
do
n
po
sit
ion
Tim
e 5
%
perc
en
til
e
Ob
serv
e
d (
ms)
Tim
e 9
5%
perc
en
tile
Ob
serv
ed
(ms)
Min
imu
m
tim
e
Ob
serv
ed
(ms)
Maxim
um
tim
e
Ob
serv
ed
(ms)
Co
eff
icie
nt
of
Vari
ati
on
UA
U
Y
237
187 +
/- 7
171 +
/- 8
107
11425
YD
L084W
/285
YD
L208W
/20
6
7.3
25
34
9.4
52
26
.30
5 1
06
9.2
03
0.5
72
AU
G
M
479
192 +
/- 4
177 +
/- 5
91
8234
YN
L244C
/92
YD
R071C
/78
7
5.6
71
34
7.3
75
35
.34
1 6
42
.12
8 0
.47
4
GA
U
D
900
203 +
/- 4
179 +
/- 3
130
16971
YO
R298C
-A/5
Y
DL084W
/229
7
2.7
71
41
4.3
06
21
.40
0 2
16
1.4
27
0.6
40
Table B.1 (continued): Statistics for the translation time distributions of 64 codon types obtained from the Nissley dataset

Codon | Amino acid | Number of such codons in our data set | Mean translation time (ms) | Median translation time (ms) | Standard deviation (ms) | Variance (ms^2) | 5% percentile observed in gene name / codon position | 95% percentile observed in gene name / codon position | Time 5% percentile observed (ms) | Time 95% percentile observed (ms) | Minimum time observed (ms) | Maximum time observed (ms) | Coefficient of variation
GGU | G | 1695 | 210 ± 3 | 183 ± 3 | 131 | 17063 | YHL034C/151 | YGL253W/80 | 74.904 | 441.325 | 5.785 | 1289.947 | 0.624
UCU | S | 740 | 201 ± 3 | 187 ± 4 | 94 | 8888 | YPR069C/220 | YLR197W/108 | 77.119 | 373.185 | 19.869 | 715.833 | 0.468
UGU | C | 234 | 197 ± 6 | 189 ± 10 | 95 | 9059 | YML022W/98 | YDR461W/33 | 74.744 | 353.326 | 39.265 | 730.381 | 0.482
GAA | E | 1886 | 208 ± 2 | 190 ± 2 | 100 | 9964 | YER120W/147 | YOR198C/120 | 88.509 | 393.477 | 10.838 | 1042.420 | 0.481
AGA | R | 912 | 209 ± 4 | 192 ± 3 | 109 | 11800 | YHL034C/116 | YMR260C/66 | 81.146 | 404.265 | 8.397 | 1388.206 | 0.522
CAU | H | 211 | 203 ± 6 | 192 ± 8 | 93 | 8574 | YJR104C/81 | YPL106C/244 | 78.724 | 395.461 | 33.632 | 469.642 | 0.458
CGU | R | 231 | 212 ± 7 | 193 ± 6 | 102 | 10431 | YPR035W/333 | YAL012W/227 | 77.041 | 417.247 | 44.304 | 614.631 | 0.481
GUA | V | 122 | 231 ± 13 | 200 ± 9 | 143 | 20506 | YPL061W/295 | YBR109C/122 | 97.307 | 460.739 | 67.898 | 1314.736 | 0.619
AGU | S | 149 | 214 ± 7 | 205 ± 10 | 82 | 6774 | YKL216W/4 | YHR019C/60 | 88.192 | 365.191 | 33.320 | 488.187 | 0.383
GAC | D | 850 | 231 ± 4 | 208 ± 4 | 127 | 16255 | YGR155W/256 | YLR197W/68 | 91.080 | 426.497 | 25.190 | 2165.045 | 0.550
CUU | L | 132 | 221 ± 9 | 209 ± 9 | 101 | 10294 | YDL084W/177 | YHR179W/28 | 88.040 | 383.426 | 18.307 | 601.273 | 0.457
CUA | L | 245 | 234 ± 7 | 210 ± 8 | 106 | 11131 | YOR063W/43 | YCR053W/80 | 97.119 | 414.959 | 58.488 | 744.231 | 0.453
UCA | S | 173 | 232 ± 9 | 211 ± 10 | 113 | 12790 | YGL245W/374 | YDL208W/106 | 83.326 | 417.823 | 62.427 | 895.306 | 0.487
UCC | S | 527 | 235 ± 5 | 213 ± 5 | 114 | 12929 | YDR023W/403 | YER091C/60 | 95.534 | 451.996 | 27.950 | 693.370 | 0.485
GGA | G | 106 | 325 ± 17 | 292 ± 15 | 171 | 29181 | YGL123W/26 | YER087C-B/49 | 147.589 | 620.494 | 99.216 | 1436.145 | 0.526
GGG | G | 56 | 327 ± 22 | 297 ± 23 | 168 | 28104 | YDR276C/25 | YHR183W/431 | 127.021 | 660.716 | 97.684 | 851.402 | 0.514
AGG | R | 73 | 321 ± 14 | 305 ± 12 | 118 | 14042 | YLL018C/495 | YLL018C/112 | 164.559 | 531.652 | 132.341 | 708.869 | 0.368
CGC | R | 21 | 402 ± 52 | 321 ± 34 | 238 | 56430 | YOR332W/166 | YER055C/38 | 165.248 | 777.543 | 143.443 | 1216.508 | 0.592
ACG | T | 40 | 343 ± 18 | 327 ± 30 | 116 | 13428 | YHR072W-A/27 | YER087C-B/48 | 170.112 | 534.314 | 134.468 | 563.298 | 0.338
UCG | S | 40 | 334 ± 21 | 329 ± 24 | 133 | 17635 | YHR005C-A/60 | YGL245W/246 | 124.474 | 593.084 | 25.317 | 602.517 | 0.398
CCC | P | 52 | 347 ± 21 | 331 ± 30 | 154 | 23588 | YDR071C/16 | YGR037C/20 | 145.673 | 551.434 | 44.304 | 790.127 | 0.444
UGG | W | 259 | 370 ± 12 | 331 ± 11 | 193 | 37364 | YHR208W/346 | YDR454C/71 | 158.226 | 719.231 | 31.630 | 1467.249 | 0.522
CGG | R | 3 | 443 ± 86 | 337 ± 139 | 150 | 22411 | YOR027W/586 | YOR332W/114 | 336.707 | 654.383 | 336.707 | 654.383 | 0.339
CCG | P | 18 | 390 ± 38 | 344 ± 37 | 161 | 26061 | YIL051C/23 | YLR109W/94 | 248.418 | 595.042 | 163.092 | 875.221 | 0.413
UAA | STOP | 78 | 719 ± 31 | 669 ± 33 | 275 | 75830 | YLR325C/79 | YGL245W/709 | 400.859 | 1303.804 | 311.219 | 1700.931 | 0.382
UAG | STOP | 19 | 821 ± 79 | 829 ± 119 | 346 | 119416 | YPR069C/294 | YDR002W/202 | 380.453 | 1304.308 | 293.879 | 1539.106 | 0.421
UGA | STOP | 20 | 989 ± 96 | 976 ± 169 | 432 | 187022 | YDL055C/362 | YLL018C/558 | 419.940 | 1620.272 | 289.486 | 1670.280 | 0.437
Table B.2. Statistics for the translation time distributions of 64 codon types obtained from the Williams dataset

Codon | Amino acid | Number of such codons in our data set | Mean translation time (ms) | Median translation time (ms) | Standard deviation (ms) | Variance (ms^2) | 5% percentile observed in gene name / codon position | 95% percentile observed in gene name / codon position | Time 5% percentile observed (ms) | Time 95% percentile observed (ms) | Minimum time observed (ms) | Maximum time observed (ms) | Coefficient of variation
AUC | I | 2637 | 146 ± 2 | 128 ± 2 | 93 | 8570 | YOR230W/291 | YGR175C/134 | 50.127 | 297.278 | 6.983 | 1803.182 | 0.637
GUU | V | 3339 | 145 ± 1 | 129 ± 1 | 83 | 6844 | YCR005C/418 | YMR217W/260 | 50.113 | 284.106 | 8.200 | 1154.668 | 0.572
AUU | I | 3462 | 151 ± 2 | 133 ± 1 | 102 | 10438 | YGR175C/456; YGR175C/476 | YOR374W/202 | 45.417 | 313.639 | 7.693 | 1489.136 | 0.675
ACC | T | 1929 | 162 ± 3 | 135 ± 2 | 119 | 14179 | YLR075W/42 | YKL182W/945 | 50.741 | 354.598 | 13.187 | 1567.686 | 0.735
GUC | V | 2203 | 152 ± 2 | 137 ± 2 | 85 | 7146 | YCR005C/453 | YNL244C/57 | 52.499 | 293.333 | 6.731 | 896.451 | 0.559
CAA | Q | 3143 | 159 ± 2 | 138 ± 2 | 100 | 9910 | YLR438C-A/50 | YBR025C/201 | 51.701 | 325.435 | 12.109 | 1513.847 | 0.629
ACU | T | 2660 | 173 ± 4 | 141 ± 1 | 184 | 33816 | YER091C/93 | YLR438C-A/74 | 52.642 | 347.137 | 10.595 | 5487.420 | 1.064
UUA | L | 2283 | 158 ± 2 | 142 ± 2 | 93 | 8668 | YEL002C/129 | YNL138W/17 | 51.554 | 333.305 | 16.139 | 923.412 | 0.589
AAC | N | 2972 | 166 ± 2 | 148 ± 2 | 89 | 7896 | YIL053W/76 | YGR175C/408 | 60.964 | 322.051 | 8.205 | 1036.469 | 0.536
GCC | A | 2412 | 166 ± 2 | 148 ± 2 | 94 | 8808 | YHR183W/285 | YER178W/149 | 57.752 | 325.726 | 9.723 | 1161.843 | 0.566
AAU | N | 1980 | 168 ± 2 | 149 ± 2 | 102 | 10345 | YBR149W/288 | YOR136W/328 | 53.514 | 348.880 | 14.678 | 1252.347 | 0.607
GCU | A | 4226 | 164 ± 1 | 149 ± 1 | 85 | 7237 | YOR362C/277 | YOR153W/598 | 58.229 | 315.195 | 8.476 | 952.493 | 0.518
UCU | S | 2868 | 167 ± 2 | 150 ± 2 | 99 | 9839 | YGL187C/7 | YLR027C/131 | 49.386 | 335.070 | 9.201 | 1401.965 | 0.593
UUC | F | 2654 | 168 ± 2 | 152 ± 2 | 91 | 8261 | YKR039W/362 | YGR240C/318 | 58.562 | 327.965 | 6.150 | 1203.317 | 0.542
AAG | K | 4640 | 194 ± 2 | 155 ± 2 | 170 | 28836 | YBR106W/186 | YFL014W/88 | 55.203 | 444.931 | 12.704 | 3719.243 | 0.876
UUU | F | 2031 | 172 ± 2 | 157 ± 2 | 100 | 9998 | YHR183W/209 | YKR013W/3 | 57.752 | 338.884 | 14.397 | 2170.589 | 0.581
UCC | S | 1923 | 177 ± 2 | 160 ± 3 | 100 | 9962 | YMR202W/103 | YOL040C/29 | 55.896 | 359.800 | 11.840 | 953.487 | 0.565
UGU | C | 849 | 178 ± 3 | 161 ± 3 | 90 | 8157 | YNL036W/134; YNL036W/171 | YOL030W/100 | 65.589 | 357.105 | 18.686 | 646.112 | 0.506
UUG | L | 4170 | 179 ± 2 | 161 ± 2 | 98 | 9566 | YGL253W/395 | YFL022C/162 | 61.461 | 351.227 | 15.646 | 1097.382 | 0.547
CAU | H | 1081 | 185 ± 3 | 164 ± 3 | 111 | 12292 | YPL231W/11 | YHR183W/40 | 54.535 | 397.662 | 10.099 | 1156.453 | 0.600
AUG | M | 1855 | 182 ± 2 | 167 ± 3 | 94 | 8778 | YKR093W/499; YKR093W/504; YKR093W/592 | YLR355C/309 | 60.368 | 355.879 | 14.570 | 732.746 | 0.516
CGU | R | 827 | 192 ± 4 | 167 ± 5 | 112 | 12580 | YOR153W/270 | YPL061W/352 | 63.891 | 391.435 | 14.882 | 1084.517 | 0.583
GAU | D | 3453 | 200 ± 2 | 175 ± 2 | 132 | 17319 | YDR322C-A/64 | YBR126C/440 | 65.557 | 415.094 | 9.584 | 3165.478 | 0.660
UAU | Y | 1370 | 191 ± 3 | 176 ± 4 | 98 | 9692 | YKR093W/601 | YFL045C/113 | 68.417 | 375.264 | 5.457 | 700.141 | 0.513
CUA | L | 1151 | 194 ± 3 | 178 ± 3 | 99 | 9864 | YLR058C/276 | YIL002W-A/55 | 73.640 | 368.670 | 15.538 | 1105.856 | 0.510
AAA | K | 3419 | 216 ± 3 | 179 ± 2 | 158 | 25021 | YDR023W/396 | YOR230W/48 | 67.419 | 476.203 | 11.626 | 2478.969 | 0.731
UCA | S | 1079 | 200 ± 3 | 180 ± 5 | 113 | 12661 | YJL174W/239 | YMR251W-A/28 | 67.573 | 397.397 | 13.867 | 900.991 | 0.565
GAC | D | 2746 | 202 ± 2 | 181 ± 2 | 114 | 12982 | YIL033C/216 | YGL256W/60 | 68.915 | 398.995 | 10.869 | 1979.487 | 0.564
GGU | G | 5179 | 223 ± 3 | 182 ± 2 | 182 | 33239 | YKL085W/24 | YPR062W/14 | 68.338 | 488.777 | 3.205 | 3821.545 | 0.816
AGU | S | 851 | 201 ± 4 | 184 ± 4 | 119 | 14210 | YLR179C/3 | YLL018C/280 | 65.516 | 402.156 | 12.536 | 1931.933 | 0.592
CUU | L | 744 | 204 ± 4 | 184 ± 4 | 107 | 11518 | YLR300W/4 | YBR283C/339 | 72.092 | 408.580 | 15.083 | 928.204 | 0.525
GAA | E | 5887 | 206 ± 1 | 184 ± 2 | 113 | 12806 | YLR257W/293 | YKL060C/194 | 73.388 | 409.345 | 4.872 | 1827.645 | 0.549
GUA | V | 698 | 204 ± 4 | 185 ± 5 | 111 | 12431 | YKL216W/212 | YBR162W-A/3 | 69.354 | 405.219 | 15.563 | 936.084 | 0.544
CAC | H | 967 | 208 ± 4 | 186 ± 4 | 113 | 12699 | YJR070C/99; YJR070C/139 | YBR196C/182 | 78.094 | 413.248 | 15.636 | 1100.286 | 0.543
AGA | R | 2735 | 218 ± 3 | 191 ± 2 | 135 | 18286 | YLL018C/544 | YGR285C/247 | 74.884 | 436.541 | 14.215 | 2057.610 | 0.619
UAC | Y | 2184 | 216 ± 2 | 196 ± 2 | 104 | 10745 | YGR065C/507 | YGR060W/81 | 80.622 | 414.201 | 14.565 | 897.806 | 0.481
AGC | S | 593 | 217 ± 4 | 205 ± 5 | 100 | 9952 | YKR013W/277 | YGR189C/36 | 84.721 | 398.825 | 20.199 | 691.115 | 0.461
CCA | P | 2706 | 238 ± 3 | 206 ± 3 | 157 | 24516 | YDR353W/273 | YLR355C/67 | 83.462 | 498.575 | 13.049 | 3102.975 | 0.660
CAG | Q | 749 | 237 ± 5 | 212 ± 5 | 128 | 16333 | YIL043C/280 | YMR002W/113 | 76.823 | 476.071 | 19.626 | 938.037 | 0.540
GCA | A | 1429 | 239 ± 3 | 220 ± 4 | 112 | 12647 | YPL265W/408 | YBL099W/273 | 92.280 | 450.671 | 20.289 | 888.451 | 0.469
GGC | G | 1089 | 277 ± 6 | 233 ± 4 | 210 | 43985 | YFL005W/198 | YEL060C/101 | 83.235 | 574.735 | 22.471 | 3323.463 | 0.758
GAG | E | 1685 | 255 ± 3 | 234 ± 4 | 136 | 18480 | YMR203W/268 | YOL038W/87 | 90.501 | 503.180 | 15.029 | 1531.599 | 0.533
CCU | P | 1348 | 271 ± 4 | 235 ± 5 | 156 | 24443 | YLR179C/55 | YPL265W/4 | 88.640 | 559.105 | 32.797 | 1605.779 | 0.576
AUA | I | 584 | 273 ± 6 | 236 ± 8 | 156 | 24467 | YOR142W/240 | YKL216W/198 | 88.930 | 575.939 | 44.454 | 1299.116 | 0.571
ACA | T | 1168 | 260 ± 4 | 237 ± 4 | 146 | 21360 | YDL135C/83 | YGR282C/6 | 94.614 | 508.126 | 18.585 | 1737.988 | 0.562
GUG | V | 1010 | 264 ± 5 | 238 ± 5 | 148 | 21771 | YDL126C/629 | YDL022W/39 | 90.448 | 547.216 | 34.919 | 1954.111 | 0.561
UGC | C | 281 | 273 ± 9 | 244 ± 7 | 153 | 23403 | YDR098C/176 | YIL078W/261 | 87.530 | 538.704 | 34.766 | 1268.004 | 0.560
UCG | S | 454 | 282 ± 7 | 253 ± 7 | 147 | 21493 | YKR093W/130 | YGR286C/110 | 92.564 | 563.532 | 23.710 | 1200.743 | 0.521
CUC | L | 233 | 283 ± 9 | 256 ± 8 | 139 | 19262 | YER012W/75 | YLR056W/94 | 105.993 | 514.689 | 37.126 | 949.260 | 0.491
CGC | R | 117 | 286 ± 14 | 260 ± 20 | 150 | 22393 | YNL111C/83 | YLR257W/95 | 117.606 | 567.535 | 12.228 | 812.040 | 0.524
GCG | A | 413 | 307 ± 8 | 271 ± 11 | 173 | 29868 | YGR020C/74 | YLR027C/150 | 111.118 | 620.641 | 29.918 | 1346.682 | 0.564
GGA | G | 651 | 320 ± 10 | 271 ± 6 | 245 | 59912 | YKL216W/223 | YOR276W/92 | 107.046 | 688.650 | 19.426 | 3936.725 | 0.766
AGG | R | 435 | 318 ± 9 | 273 ± 9 | 183 | 33659 | YCR005C/420 | YOL038W/49; YOL038W/119 | 107.385 | 635.596 | 35.710 | 1658.500 | 0.575
UGG | W | 1079 | 325 ± 6 | 275 ± 5 | 211 | 44365 | YLR056W/332 | YOL058W/144 | 85.781 | 691.126 | 21.297 | 2176.320 | 0.649
CUG | L | 631 | 305 ± 7 | 281 ± 6 | 165 | 27089 | YLR027C/5 | YOR187W/21 | 106.613 | 606.904 | 36.642 | 1338.078 | 0.541
GGG | G | 501 | 326 ± 9 | 287 ± 8 | 190 | 36116 | YBR286W/110 | YKL016C/24 | 102.816 | 723.000 | 14.643 | 1358.255 | 0.583
ACG | T | 441 | 323 ± 8 | 296 ± 7 | 163 | 26726 | YPL078C/66 | YFL037W/72 | 117.919 | 637.251 | 40.996 | 1537.909 | 0.505
CCC | P | 426 | 356 ± 10 | 315 ± 10 | 210 | 44062 | YGL202W/182; YGL202W/440 | YOR153W/830 | 115.242 | 677.244 | 45.795 | 1696.036 | 0.590
CGG | R | 26 | 427 ± 54 | 360 ± 70 | 273 | 74374 | YOR052C/71 | YFL038C/108 | 83.936 | 940.193 | 68.312 | 1052.479 | 0.639
CCG | P | 178 | 504 ± 23 | 455 ± 21 | 311 | 96474 | YPL265W/117 | YDL084W/440 | 170.989 | 1194.465 | 39.140 | 1827.127 | 0.617
CGA | R | 31 | 482 ± 45 | 496 ± 61 | 250 | 62599 | YML001W/104 | YDL181W/70 | 94.581 | 848.701 | 42.430 | 1212.105 | 0.519
UAA | STOP | 203 | 185 ± 13 | 144 ± 8 | 190 | 36114 | YJL166W/95 | YCR004C/248 | 45.035 | 424.392 | 21.395 | 2208.019 | 1.027
UAG | STOP | 77 | 311 ± 20 | 255 ± 24 | 176 | 31145 | YFL045C/255 | YDL022W/392 | 98.080 | 630.206 | 62.133 | 852.048 | 0.566
UGA | STOP | 84 | 376 ± 32 | 317 ± 22 | 296 | 87558 | YDL067C/60 | YPL059W/151 | 92.845 | 880.435 | 26.669 | 2174.416 | 0.787
Appendix C
CHAPTER 4 SUPPORTING INFORMATION
C.1 Methods
C.1.1 Details of Experiments
C.1.1.1 Design of mutant strains
There are 7,980 possible amino acid mutations for all combinations of amino acid pairs
where the P-site amino acid is mutated while keeping the A-site amino acid unchanged.
Bioinformatic analyses of published Ribo-Seq data predicted that 4,134 out of these 7,980
possible mutations will result in a significant change in speed (Figure 4.1d). To
experimentally validate our bioinformatic predictions, we chose 5 mutations that can speed
up translation and 5 mutations that can slow down translation. Two more mutations were
created where our bioinformatic analysis predicted no significant change in speed. These
12 mutations were chosen to represent as many different combinations of amino acids as
possible while requiring mutations in only a small number of genes.
Mutations (P, G) → (E, G) and (Q, D) → (P, D) were chosen to act as positive controls, since
proline is known to slow down translation when present in the P-site. (G, G) → (S, G)
and (S, G) → (G, G) were implemented to test the complementarity of the mutations, i.e.,
if mutating the P-site from G → S significantly changes translation speed, does S
→ G have the same effect in the opposite direction? To experimentally verify whether
the effects on speed of G → S and S → G also occur with more than one A-site amino acid, we
carried out similar mutations with T in the A-site, i.e., the mutations (G, T) → (S, T) and (S,
T) → (G, T). The rest of the mutations were chosen to represent amino acids not
represented in the above mutations. The locations of these mutations were chosen such that
the normalized ribosome density in the published datasets at these instances of the amino
acid pair is close to the median, avoiding instances at the extreme tails of the
distribution. The mutations were placed on 5 non-essential, highly expressed genes across which
they could be distributed at instances of these amino acid pairs. The chosen
genes were not involved in the process of ribosome biogenesis or translation. The mutated
positions on the selected genes were chosen such that they were not at the functional
sites or sites subjected to post-translational modifications as defined in the
Saccharomyces Genome Database158. The gene name and location of mutations are
listed in Table C.2.
We denote the five mutant strains as YKL*, YMR*, YLR*, YOL* and YHR* that
were created such that each strain contained mutations in a single gene: YKL096W-A,
YMR122W-A, YLR109W, YOL109W and YHR179W respectively. To assess the effect of
tRNA versus amino acid identity on translation, an additional mutant strain, denoted
YOL**, was created which contained the same amino acid mutations as YOL*, but using
a synonymous set of codons. Details concerning these two sets of synonymous mutations
are provided in Table C.4.
Ribosome profiling was carried out in two phases. In the first phase, two replicates
of mutant strains YKL*, YMR*, YLR* and YOL* were subject to Ribosome profiling. For
the single mutation in YKL096W-A, the mutant ribosome densities were from two
replicates of YKL* while the wild-type ribosome densities were from YMR*, YLR* and YOL*
which contained the endogenous transcript for YKL096W-A. A similar procedure was
followed for the four other mutations where we predict a speedup of translation and three
slowdown mutations present on these four genes.
In the second phase, Ribosome profiling was run for four replicates of YHR*, YOL*
and YOL**. YOL* and YOL** contained the same amino acid mutations but differed in
terms of the set of synonymous codons used. YHR* contained four mutations. Two
mutations were the negative control mutations. The other two were mutations predicted to
slow down translation, bringing the total number of slowdown mutations to 5. The 8 samples (4
replicates each of YOL* and YOL**) were used as wild type for mutations in YHR179W,
while the 4 replicates of YHR* served as wild-type samples containing the endogenous
YOL109W gene against the YOL109W mutations in YOL* and YOL**. The number of
replicates was increased to 4 in the second phase to provide a large enough sample size for a
valid statistical test. The normalized ribosome densities were not compared across the
two phases, as Ribo-Seq samples prepared on different days show poor correlation of
ribosome densities at the codon level (Figure C.8).
C.1.1.2 Strain Construction of mutants
A two-step procedure omitting selection markers in the final construct was used for mutant
strain construction. First, the gene of interest was replaced in the strain BY4741 by a
K. lactis URA3 cassette according to Janke et al.174. Second, the desired mutant gene
enclosing overhangs (45nt/60nt) was constructed by PCR and used to replace the
introduced URA3 cassette by homologous recombination. Candidates were selected on
5-Fluoroorotic Acid (5-fluorouracil-6-carboxylic acid monohydrate; 5-FOA) containing
plates and insertion of the correct mutations was verified by colony PCR and DNA
sequencing.
C.1.1.3 Ribosome profiling and library preparation
200 mL of cells were grown in YPD to an OD600 nm of 0.5, rapidly filtered (All-Glass Filter
90mm, Millipore), flash frozen in liquid nitrogen, mixed with 600 µL frozen lysis buffer (20
mM Tris-HCl pH 8.0, 140 mM KCl, 6 mM MgCl2, 0.1% NP-40, 0.1 mg/ml CHX, 1 mM
PMSF, 2x Complete EDTA-free protease inhibitors (5056489001, Roche), 0.02 U/ml
DNase I (4716728001, Roche), 20 mg/mL leupeptin, 20 mg/mL aprotinin, 10 mg/mL E-64,
40 mg/mL bestatin) and pulverized by mixer milling (2 min, 30 Hz, MM400, Retsch).
Thawed cell lysates were cleared by centrifugation (20,000xg, 5 min, 4°C) and digested
by RNase I (AM2295, Ambion; 125 U/1mg nucleic acid) for 1 hr (25°C, 650 rpm) to obtain
ribosome footprints. The reaction was stopped by adding 10 µl SUPERase-In (AM2696,
Ambion). The digested lysate was loaded onto 10-50% (w/v) sucrose gradients and
centrifuged at 35,000 rpm, 4°C for 2.5 hrs. Gradients were fractionated and monosome
fractions were collected, pooled and used for RNA purification by hot acid-phenol
extraction. 5 µg of purified RNA was depleted for rRNA using the Ribo-Zero Gold for Yeast
kit (MRZY1306, Illumina). Deep sequencing libraries were prepared following the protocol
described in Döring et al.159 and sequenced on a HiSeq 2000 (Illumina).
C.1.2 Computational analyses of Ribo-Seq data
C.1.2.1 Analysis of Ribo-Seq datasets
Wild-type Ribo-Seq datasets were obtained from five different published
studies9,41,111,113,114 whose accession numbers are provided in Table C.1. The raw reads
for each of these published datasets were preprocessed according to the steps specified
in the Methods of the respective study. The sequenced reads for all Ribo-Seq datasets of
mutant strains prepared for this study were subject to a uniform preprocessing step. The
raw reads were first trimmed of their 3′ adapter sequence
CTGTAGGCACCATCAATTCGTATGCCGTCTTCTGCTTG using cutadapt v1.14 115. The
reads were also subjected to a quality filter with a threshold of at least 20 during the cutadapt run.
For all downstream analyses, a uniform protocol was used as specified below.
Preprocessed reads were first mapped to a set of ribosomal RNA sequences using
Bowtie2 109, and the unmapped reads were subsequently mapped to the rest of the S.
cerevisiae reference genome sacCer3 using Tophat2 110. Custom Python scripts were
implemented for all downstream analyses. Mapped reads were first quantified by their 5′
ends on individual gene transcripts, and A-site positions were assigned according to Table
2.1, which is also published in Ahmed et al.154. To maintain the accuracy of read assignment,
transcripts in which multiply mapped reads constitute more than 0.1% of the reads
mapped to the CDS region were excluded from the analysis. To minimize noise and
increase confidence in the results, we restricted our dataset to transcripts that have
at least 3 reads mapped at every codon position. Applying this filter, we obtain 364 genes
for Williams’ dataset114, which has the highest coverage among the published datasets.
Hence, we use this dataset for all downstream analyses.
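The two transcript-level filters described above can be sketched as follows. This is only an illustration and not the custom scripts used in this study: the function name `passes_filters`, its inputs, and the choice to measure the multiply mapped fraction against the summed per-codon reads are all assumptions.

```python
def passes_filters(codon_reads, multimapped_reads,
                   min_reads=3, max_multimap_fraction=0.001):
    """Return True if a transcript passes both coverage filters.

    codon_reads: read counts per codon position of the CDS (hypothetical input).
    multimapped_reads: number of multiply mapped reads in the CDS (hypothetical input).
    """
    total_reads = sum(codon_reads)
    if total_reads == 0:
        return False
    # Filter 1: multiply mapped reads may make up at most 0.1% of CDS reads.
    if multimapped_reads / total_reads > max_multimap_fraction:
        return False
    # Filter 2: every codon position must have at least `min_reads` reads.
    return all(count >= min_reads for count in codon_reads)


well_covered = passes_filters([1000, 1200, 900], multimapped_reads=1)
has_gap = passes_filters([1000, 0, 900], multimapped_reads=0)
```

Here `well_covered` is accepted, while `has_gap` is rejected because one codon position has no reads.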
C.1.2.2 Estimation of translation speed change for amino acid pairs
The normalized ribosome density $\rho$ for every codon position $j$ in transcript $i$ is calculated
by dividing the number of Ribo-Seq reads $R_{j,i}$ mapped to that position by the average number of
reads per codon mapped to transcript $i$, which consists of $N_{C,i}$ codons:
\[ \rho_{j,i} = \frac{R_{j,i}}{\sum_{k} R_{k,i} / N_{C,i}} \]   [Eq. C.1]
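As a small worked illustration of Eq. C.1, the sketch below normalizes a toy read profile. The function name `normalized_density` and the read counts are hypothetical; the only property relied on is that the normalized values average to 1 across the transcript.

```python
def normalized_density(reads):
    """Eq. C.1: divide each codon's read count by the mean reads per codon."""
    n_codons = len(reads)
    mean_reads = sum(reads) / n_codons
    return [count / mean_reads for count in reads]


# Toy transcript with four codon positions (made-up counts).
profile = normalized_density([4, 8, 12, 16])
```

With a mean of 10 reads per codon, the four positions have densities of 0.4, 0.8, 1.2 and 1.6, so values above 1 mark positions with more ribosome occupancy than the transcript average.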
$\rho$ values are binned into an individual distribution for every amino acid pair (X, Z), where X
is in the P-site and Z is in the A-site. The distribution $[\rho(X, Z)]$ is populated by $\rho_{j,i}$ for
each codon position $j$ in transcript $i$ such that $((j-1)_{AA}, j_{AA}) = (X, Z)$. The indices $j$ and $i$
are dropped from $[\rho(X, Z)]$ since this is an aggregated distribution of all instances of $\rho_{j,i}$
for the amino acid pair (X, Z). The speedup or slowdown of translation caused by an amino
acid pair is estimated using the percent change in the median of the distribution $[\rho(X, Z)]$
relative to the median of the distribution $[\rho(\sim X, Z)]$:
\[ \text{Percent change} = \frac{\mathrm{Median}[\rho(X,Z)] - \mathrm{Median}[\rho(\sim X,Z)]}{\mathrm{Median}[\rho(\sim X,Z)]} \times 100\,\% \]   [Eq. C.2]
(~X, Z) represents the set of all pairs of amino acids where Z is in the A-site and X is not
in the P-site. A positive percent change (red shades in Figure 4.1b) indicates that, for
amino acid pair (X, Z), the presence of X leads to slower translation (higher values of 𝜌)
of Z compared to when X is not present in the P-site. A negative percent change (green
shades in Figure 4.1b) indicates that Z is translated faster when X is present in the P-site
than when X is not present in the P-site.
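The percent change of Eq. C.2 can be sketched on toy data as below. The function name `percent_change` and the tuple layout `(P-site amino acid, A-site amino acid, density)` are assumptions made for illustration, and the densities are invented.

```python
from statistics import median


def percent_change(instances, x, z):
    """Eq. C.2 for pair (x, z); instances are (p_site_aa, a_site_aa, rho) tuples."""
    with_x = [rho for p, a, rho in instances if a == z and p == x]
    without_x = [rho for p, a, rho in instances if a == z and p != x]
    baseline = median(without_x)
    return (median(with_x) - baseline) / baseline * 100.0


# Made-up densities: P in the P-site before G, versus other residues before G.
toy_instances = [
    ("P", "G", 2.0), ("P", "G", 3.0),
    ("A", "G", 1.0), ("S", "G", 1.0),
    ("P", "D", 9.0),  # different A-site, ignored for the (P, G) comparison
]
change_pg = percent_change(toy_instances, "P", "G")
```

In this toy case the median density with P in the P-site is 2.5 against a baseline of 1.0, a positive change of 150%, which under the convention above would indicate a slowdown.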
The distribution $[\rho(X, Z)]$ is plotted in Figure 4.1d for two pairs, (N, R) and (S, R). $\rho$ is plotted
along the X-axis and the probability density $P(\rho)$ is plotted on the Y-axis, calculated as
\[ P(\rho(X, Z)) = \frac{\sum_{i} \sum_{j=2} \rho_{(j-1,j) = (X,Z)} \left( \Theta(\rho_{j,i} - \delta\rho) - \Theta(\rho_{j,i} + \delta\rho) \right)}{\sum_{i} \sum_{k=2} \rho_{(k-1,k) = (X,Z)}} \]   [Eq. C.3]
where $\delta\rho$ is the bin width of the histogram. $\Theta(\rho_{j,i} - \delta\rho)$ and $\Theta(\rho_{j,i} + \delta\rho)$ are
Heaviside step functions that determine whether the term $\rho_{j,i}$ is included in
a particular bin of width $\delta\rho$.
An odds measure is calculated for whether a P-site mutation (X, Z) → (B, Z)
will result in a change of translation rate. First, the differences in normalized
ribosome density are calculated for all pairwise combinations of instances from the two
distributions. The odds is the ratio of the number of positive differences to the number of
negative differences if we predict a slowdown, and the inverse ratio if we predict a speedup of translation.
\[ \text{Odds}_{\text{translation rate change}} = \frac{N_{\{\rho(X,Z) - \rho(B,Z)\} > 0}}{N_{\{\rho(X,Z) - \rho(B,Z)\} < 0}} \]   [Eq. C.4]
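A minimal sketch of the odds in Eq. C.4 follows, counting signs over all pairwise differences between two toy distributions. The function name `odds_of_rate_change` is hypothetical, ties (differences of exactly zero) are simply not counted, and returning infinity when there are no negative differences is an assumption of this sketch.

```python
from itertools import product


def odds_of_rate_change(rho_xz, rho_bz):
    """Eq. C.4: ratio of positive to negative pairwise differences rho(X,Z) - rho(B,Z)."""
    diffs = [a - b for a, b in product(rho_xz, rho_bz)]
    n_positive = sum(d > 0 for d in diffs)
    n_negative = sum(d < 0 for d in diffs)
    if n_negative == 0:
        return float("inf")  # assumption: no negative differences at all
    return n_positive / n_negative


# Made-up densities for the original pair (X, Z) and the mutated pair (B, Z).
odds = odds_of_rate_change([2.0, 3.0], [1.0, 2.5])
```

For these values three of the four pairwise differences are positive and one is negative, giving odds of 3. The reciprocal gives the odds in the opposite direction.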
C.1.2.3 Reproducibility of the trends across different datasets
Normalized ribosome density profiles were calculated from the six Ribo-Seq data sets for
the 364 high coverage genes identified in the Williams dataset114. [𝜌(𝑋, 𝑍)] and [𝜌(~𝑋, 𝑍)]
were then calculated for all pairs of amino acids in the P- and A-sites. Instances of zero
A-site reads were not included in the distributions. The percent change in median
normalized ribosome density 𝜌 is considered to be reproducible if it is positive (slowdown)
or negative (speedup) in all 6 datasets and statistically significant in at least 4 out of the 6
datasets. It is these reproducible results that are reported in Figure 4.1b.
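The reproducibility criterion above can be written as a short check. In this sketch the function name `is_reproducible` is hypothetical and the significance threshold `alpha = 0.05` is an assumption, since the threshold is not stated in this section.

```python
def is_reproducible(percent_changes, p_values, alpha=0.05, min_significant=4):
    """True if the sign agrees in every dataset and enough datasets are significant."""
    all_positive = all(pc > 0 for pc in percent_changes)
    all_negative = all(pc < 0 for pc in percent_changes)
    n_significant = sum(p < alpha for p in p_values)
    return (all_positive or all_negative) and n_significant >= min_significant


# Made-up results for one amino acid pair across six datasets.
consistent = is_reproducible(
    [12.0, 8.5, 15.1, 9.9, 4.2, 20.3],
    [0.001, 0.02, 0.0004, 0.03, 0.2, 0.6],
)
sign_flip = is_reproducible(
    [12.0, -8.5, 15.1, 9.9, 4.2, 20.3],
    [0.001, 0.02, 0.0004, 0.03, 0.01, 0.01],
)
```

The first pair is a slowdown in all six datasets and significant in four, so it qualifies. The second is rejected because one dataset shows the opposite sign, even though all six comparisons are significant.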
C.1.2.4 Percent contribution of amino acid vs tRNA
In S. cerevisiae, there are 41 unique tRNA molecules (excluding the start codon tRNAMet)
carrying the 20 amino acids of the nascent polypeptides. In the P- and A-sites of the
ribosome, the two neighboring amino acids interact to form a peptide bond, and the two
tRNAs may also interact and thereby influence the translation speed at the A-site. To
distinguish these two effects, we derive an equation that quantifies the percent
contribution of amino acid and tRNA identity to the translation speed. For a comparison of amino
acid pairs X-Z and B-Z, the percent contribution of tRNA is given by
\[ \%\ \text{contribution of tRNA} = \frac{\left(\mathrm{Max}_{\rho}\{B_{t1}, B_{t2}, \ldots, B_{tm}\} - \mathrm{Min}_{\rho}\{X_{t1}, X_{t2}, \ldots, X_{tn}\}\right) - \left(\mathrm{Min}_{\rho}\{B_{t1}, B_{t2}, \ldots, B_{tm}\} - \mathrm{Max}_{\rho}\{X_{t1}, X_{t2}, \ldots, X_{tn}\}\right)}{\mathrm{Max}_{\rho}\{B_{t1}, B_{t2}, \ldots, B_{tm}\} - \mathrm{Min}_{\rho}\{X_{t1}, X_{t2}, \ldots, X_{tn}\}} \times 100 \]   [Eq. C.5]
subject to the condition
\[ \mathrm{Max}_{\rho}\{B_{t1}, B_{t2}, \ldots, B_{tm}\} \geq \mathrm{Min}_{\rho}\{B_{t1}, B_{t2}, \ldots, B_{tm}\} \geq \mathrm{Max}_{\rho}\{X_{t1}, X_{t2}, \ldots, X_{tn}\} \geq \mathrm{Min}_{\rho}\{X_{t1}, X_{t2}, \ldots, X_{tn}\} \]
\[ \%\ \text{contribution of amino acid} = 100 - \%\ \text{contribution of tRNA} \]   [Eq. C.6]
The sets $\{B_{t1}, B_{t2}, \ldots, B_{tm}\}$ and $\{X_{t1}, X_{t2}, \ldots, X_{tn}\}$ are composed of the $m$ and $n$ different
tRNAs covalently attached to amino acids B and X, respectively. The percent contribution
can only be computed when the median normalized ribosome densities of the individual
tRNAs of amino acid X do not overlap with those of the individual tRNAs of B. In cases where
there is an overlap, Eq. C.5 cannot be applied, and these cases are therefore excluded from
this analysis.
As an example, consider the case in which the change in tRNA identity contributes
100% to the corresponding change in translation speed. In such a case there will be a tRNA
for X whose normalized ribosome density $\rho_{X_{tx}}$ is equal to the normalized ribosome
density of a tRNA of B, $\rho_{B_{tb}}$. Since there is no overlap of the normalized ribosome
densities, $\mathrm{Min}_{\rho}\{B\} = \rho_{B_{tb}}$ and $\mathrm{Max}_{\rho}\{X\} = \rho_{X_{tx}}$. This results in
$\mathrm{Min}_{\rho}\{B\} = \mathrm{Max}_{\rho}\{X\}$, and applying this to Eq. C.5 gives
\[ \%\ \text{contribution of tRNA} = \frac{\left(\mathrm{Max}_{\rho}\{B_{t1}, B_{t2}, \ldots, B_{tm}\} - \mathrm{Min}_{\rho}\{X_{t1}, X_{t2}, \ldots, X_{tn}\}\right) - (0)}{\mathrm{Max}_{\rho}\{B_{t1}, B_{t2}, \ldots, B_{tm}\} - \mathrm{Min}_{\rho}\{X_{t1}, X_{t2}, \ldots, X_{tn}\}} \times 100 = 100\,\% \]
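Now consider the other end of the spectrum, in which tRNA identity contributes nothing
to the change in translation speed and only amino acid identity matters. In such a case, the
median normalized ribosome densities will be equal for all tRNAs of X and, similarly, for all
tRNAs of B. Hence,
\[ \mathrm{Max}_{\rho}\{B\} = \mathrm{Min}_{\rho}\{B\} \quad \text{and} \quad \mathrm{Max}_{\rho}\{X\} = \mathrm{Min}_{\rho}\{X\} \]
Rearranging the terms in Eq. C.5,
\[ \%\ \text{contribution of tRNA} = \frac{\left(\mathrm{Max}_{\rho}\{B\} - \mathrm{Min}_{\rho}\{B\}\right) - \left(\mathrm{Min}_{\rho}\{X\} - \mathrm{Max}_{\rho}\{X\}\right)}{\mathrm{Max}_{\rho}\{B\} - \mathrm{Min}_{\rho}\{X\}} \times 100 = 0\,\% \]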
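The two limiting cases above can be checked numerically with a short sketch of Eq. C.5 and Eq. C.6. The function name `percent_trna_contribution` and the density values are hypothetical. As in the stated condition, the sketch assumes the tRNAs of B have the higher median densities, and it returns `None` when the two sets overlap or when the denominator is zero.

```python
def percent_trna_contribution(rho_b, rho_x):
    """Eq. C.5 for per-tRNA median densities of amino acids B and X.

    Returns None when the ordering condition (no overlap, B above X) fails.
    """
    max_b, min_b = max(rho_b), min(rho_b)
    max_x, min_x = max(rho_x), min(rho_x)
    if min_b < max_x:
        return None  # overlapping tRNA densities: Eq. C.5 is not applicable
    span = max_b - min_x
    if span == 0:
        return None  # all densities identical: the ratio is undefined
    return (span - (min_b - max_x)) / span * 100.0


# tRNA identity explains everything: the slowest X tRNA equals the fastest B tRNA.
full_trna = percent_trna_contribution([2.0, 3.0], [1.0, 2.0])
# tRNA identity explains nothing: all tRNAs of an amino acid behave alike.
no_trna = percent_trna_contribution([3.0, 3.0], [1.0, 1.0])
# Eq. C.6: the amino acid contribution is the complement.
amino_acid_share = 100.0 - percent_trna_contribution([3.0, 4.0], [1.0, 2.0])
```

The first call reproduces the 100% case and the second the 0% case derived above. The third gives an intermediate split of roughly two thirds tRNA and one third amino acid for these invented values.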
C.1.2.5 Enrichment/Depletion of amino acid pairs
Enrichment/depletion of an amino acid pair across the proteome of S. cerevisiae is
calculated by dividing the observed probability of finding the amino acid pair by the
probability expected if the pair formed by random chance, that is, the product of the
probabilities of the individual amino acids across the proteome. This ratio is a measure of
enrichment/depletion of the amino acid pair, which we call the enrichment score. To test
for evolutionary selection of amino acid pairs that significantly influence translation speed,
we take the top 20% (80 out of 400 amino acid pairs) of amino acid pairs with the
highest enrichment scores (highly enriched across the transcriptome) and the bottom 20%
of amino acid pairs with the lowest enrichment scores (highly depleted across the
transcriptome). Fisher’s exact test is used to test the hypothesis that fast pairs are
more likely to be enriched while slow pairs are more likely to be depleted across the S.
cerevisiae transcriptome. The odds ratio of fast pairs being enriched and slow pairs being
depleted is calculated as shown below:
\[ \text{Odds ratio} = \frac{\text{Enriched}(\text{Fast amino acid pairs}) \times \text{Depleted}(\text{Slow amino acid pairs})}{\text{Enriched}(\text{Slow amino acid pairs}) \times \text{Depleted}(\text{Fast amino acid pairs})} \]   [Eq. C.7]
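The enrichment score and the odds ratio of Eq. C.7 can be illustrated with the sketch below. The function names `enrichment_scores` and `odds_ratio`, the toy sequence, and the counts are all hypothetical; the sketch assumes pairs are counted as adjacent residues within each protein.

```python
from collections import Counter


def enrichment_scores(proteins):
    """Observed pair probability divided by the product of single-residue probabilities."""
    aa_counts = Counter()
    pair_counts = Counter()
    for seq in proteins:
        aa_counts.update(seq)
        pair_counts.update(zip(seq, seq[1:]))  # adjacent residue pairs
    n_aa = sum(aa_counts.values())
    n_pairs = sum(pair_counts.values())
    scores = {}
    for (first, second), count in pair_counts.items():
        observed = count / n_pairs
        expected = (aa_counts[first] / n_aa) * (aa_counts[second] / n_aa)
        scores[(first, second)] = observed / expected
    return scores


def odds_ratio(enriched_fast, depleted_slow, enriched_slow, depleted_fast):
    """Eq. C.7 from the four cell counts of the 2x2 contingency table."""
    return (enriched_fast * depleted_slow) / (enriched_slow * depleted_fast)


scores = enrichment_scores(["AAB"])   # toy one-protein "proteome"
ratio = odds_ratio(30, 25, 10, 15)    # made-up counts of amino acid pairs
```

A score above 1 means the pair occurs more often than expected from the individual amino acid frequencies. The same four counts form the 2x2 table on which a Fisher's exact test, for example `scipy.stats.fisher_exact`, could be run.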
C.1.2.6 Classification of downstream linker and domain regions
A database of 864 S. cerevisiae proteins with annotated domain boundaries was created.
The domain boundaries were identified based on the criteria of Ciryam et al.15, which were
used to identify domain boundaries in E. coli.
The hypothesis of co-translational folding influenced by translation kinetics states
that once the entire domain has emerged from the ribosome exit tunnel, translation occurs
slowly to allow the domain to fold co-translationally80. An additional study14 also states that
while a partial domain region of the nascent polypeptide is emerging from the ribosome exit
tunnel, translation is likely to occur faster to avoid misfolded intermediates. To test
whether the enrichment of our fast and slow amino acid pairs contributes to this
faster/slower translation, we first identify pairs of domain and linker regions that meet
the criteria below. In our analysis, we define the domain region as the region being
translated when the nascent chain segment constituting the domain is outside the exit
tunnel. The growing nascent chain moves through the ribosome’s exit tunnel, which usually
contains thirty residues. Hence, the first residue of the domain appears outside the exit
tunnel when the 31st residue is being translated in the A-site. To account for these 30
residues in the exit tunnel, we consider our translated domain region to begin at the 31st
residue from the defined domain start and to end 30 residues downstream of the domain end,
which is when the most C-terminal residue of the domain appears outside the ribosome exit
tunnel. The second region, which we call the linker region and which has a fixed window size,
begins when the entire domain has emerged from the ribosome exit tunnel, that is, at the 31st
residue downstream of the end of the domain, and does not overlap with the next domain.
Based on the window size, we determine the linker region starting from this 31st residue. If
the linker is shorter than the sum of the 30 residues in the tunnel and the window size, then this
domain-linker pair is not considered in the analysis for this window size. The
enrichment/depletion of fast and slow amino acid pairs is calculated in the linker region
relative to the domain region, and the statistical significance is estimated using a random
permutation test. The amino acid pairs are kept intact while they are randomly permuted
across a pooled set of domain and linker regions. For each iteration of the permutation
test, the probability of fast and slow amino acid pairs is calculated across both the
downstream regions. Error bars are calculated using bootstrapping 120.
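The index arithmetic for the translated domain and linker regions can be sketched as follows. This is one reading of the description above, not the code used in this study: the function name `translated_regions`, the 1-based inclusive residue indexing, and the use of the next domain's start to measure linker length are assumptions.

```python
TUNNEL_LENGTH = 30  # residues assumed to sit inside the exit tunnel


def translated_regions(domain_start, domain_end, next_domain_start, window):
    """Return (domain_region, linker_region) as 1-based inclusive residue ranges.

    linker_region is None when the linker is shorter than tunnel length + window.
    """
    # The domain is emerging while residues start+30 .. end+30 are being translated.
    domain_region = (domain_start + TUNNEL_LENGTH, domain_end + TUNNEL_LENGTH)
    linker_length = next_domain_start - domain_end - 1
    if linker_length < TUNNEL_LENGTH + window:
        return domain_region, None
    # The linker window opens at the 31st residue after the domain end.
    first = domain_end + TUNNEL_LENGTH + 1
    linker_region = (first, first + window - 1)
    return domain_region, linker_region


long_linker = translated_regions(10, 100, next_domain_start=200, window=20)
short_linker = translated_regions(10, 100, next_domain_start=140, window=20)
```

For a domain spanning residues 10 to 100, the translated domain region is residues 40 to 130 and a 20-residue linker window covers residues 131 to 150. With the next domain starting at residue 140, the 39-residue linker is too short and the pair is skipped for this window size.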
C.1.2.7 Enrichment/depletion of amino acid pairs in Ssb-bound translated regions
In a previous study159, regions of mRNA were identified that are translated by the ribosome
when the Hsp70 chaperone Ssb is bound to the nascent polypeptide chain and when it is not
bound. The Fold Enrichment (FE) measure is the ratio of selective Ribo-Seq reads to Ribo-Seq
reads, and the positions in its profile across an mRNA transcript where FE is above a threshold
define the regions that are translated when Ssb is bound. We define 5 thresholds based
on the percentile values of FE from the cumulative distribution function of FE values
(see Figure S6 from the study of Döring and co-workers159). For example, for thresholds (P80,
P20), every nucleotide position with an FE value greater than P80 was classified as an Ssb-bound
translated region (B region), while every position with an FE value lower than P20
was defined as an Ssb-unbound translated region (UB region). The remaining nucleotides
with FE values between P20 and P80 were ignored. Similarly, 5 thresholds are defined to
represent the increasing differential of Ssb binding strength as we move from left to right on the
X-axis in Figure C.5.
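The threshold-based labelling of positions can be sketched as below. The function names, the nearest-rank percentile definition, and the FE values are assumptions for illustration; the percentile convention used in the original analysis is not stated here.

```python
def nearest_rank_percentile(values, percent):
    """Nearest-rank percentile for an integer `percent` (assumed convention)."""
    ordered = sorted(values)
    rank = (percent * len(ordered) + 99) // 100  # ceiling of percent * n / 100
    return ordered[max(rank, 1) - 1]


def classify_positions(fe_values, low_percent=20, high_percent=80):
    """Label each position 'B' (Ssb-bound), 'UB' (Ssb-unbound) or None (ignored)."""
    low = nearest_rank_percentile(fe_values, low_percent)
    high = nearest_rank_percentile(fe_values, high_percent)
    labels = []
    for fe in fe_values:
        if fe > high:
            labels.append("B")
        elif fe < low:
            labels.append("UB")
        else:
            labels.append(None)
    return labels


# Made-up fold-enrichment profile along ten nucleotide positions.
fe_profile = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
labels = classify_positions(fe_profile)
```

With P20 = 1.0 and P80 = 4.0 for this toy profile, the first position is labelled UB, the last two are labelled B, and the positions in between are ignored. Widening or narrowing the percentile pair changes how strict the two classes are.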
In the study of Döring and co-workers159, it was shown that translation was faster, on
average, in the Ssb-bound translated regions relative to Ssb-unbound regions. The
presence or absence of several molecular factors, such as downstream mRNA secondary
structure, optimal codons, and proline content, was shown to correlate with the increased
translation speed in the Ssb-bound regions. To test whether amino acid pairs also
contribute as a molecular factor, we test the hypothesis that fast amino acid pairs are
enriched and slow amino acid pairs are depleted across Ssb-bound translated regions
relative to Ssb-unbound translated regions. A permutation test is used to measure the
statistical significance of the percent change in the probabilities of fast and slow amino acid
pairs in the Ssb-bound translated regions (B region) relative to the Ssb-unbound translated regions
(UB region), which is plotted on the Y-axis in Figure C.5. Error bars are calculated using
bootstrapping 120.
C.1.2.8 Miscellaneous details
(1) Mann-Whitney U test is used to assess statistical significance between the normalized
ribosome density distributions of amino acid pairs. The p-values are corrected with the
Benjamini-Hochberg FDR method for multiple-hypothesis correction.
(2) The confounding molecular factors being controlled in Figure C.2 are defined below.
The instances of amino acid pairs containing the concerned molecular factor are not
considered in the analysis in which the molecular factor is controlled. The five factors we
consider are [a] mRNA secondary structure 4 to 6 codons downstream of A-site, [b]
Positively charged residues 2 to 5 residues upstream of A-site, [c] Presence of sequence
motifs like PPX, XPP and motifs identified by Schuller et al. in wild-type S. cerevisiae65,
[d] Instances containing non-optimal codons of A-site amino acid as defined by Pechmann
and Frydman48 and [e] A-site and P-site codon instances which are decoded through
Wobble base-pairing. The Mann-Whitney U test is used to compare the probabilities of each of
these five factors between the two distributions to determine whether there is any bias towards
one distribution, in which case the statistically significant difference could be attributed to the confounding factor.
(3) The sample size of mutant strains is either 2 or 4 and hence application of Mann-
Whitney U test is not feasible. An exact Fisher-Pitman permutation test is used which
overcomes the problem of low sample size. This test is applied to determine the statistical
significance of the difference between the normalized ribosome densities of mutant and
wild type strains. For non-overlapping distributions with 𝑛 = 2 and 𝑛 = 4 in one
distribution, the p-value equals 0.036 and 0.002 respectively. Hence, we get the same p-
values for all 8 comparisons with 𝑛 = 2 and 4 comparisons with 𝑛 = 4 of normalized
ribosome densities in mutant versus wild-type in Figure 4.2.
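The multiple-testing correction in item (1) and the exact permutation test in item (3) can be sketched as follows. This is an illustrative sketch only: the function names and example values are hypothetical, the test is written as one-sided on the difference of means, and the wild-type sample sizes of 6 and 8 are an assumption chosen because 1/C(8,2) = 1/28 ≈ 0.036 and 1/C(12,4) = 1/495 ≈ 0.002 reproduce the quoted p-values for non-overlapping distributions.

```python
from itertools import combinations


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def exact_permutation_pvalue(group_a, group_b):
    """One-sided exact permutation p-value for mean(group_a) > mean(group_b)."""
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    total = sum(pooled)

    def mean_difference(sum_a):
        return sum_a / n_a - (total - sum_a) / (len(pooled) - n_a)

    observed = mean_difference(sum(group_a))
    as_extreme = 0
    n_splits = 0
    # Enumerate every way of assigning n_a of the pooled values to group A.
    for indices in combinations(range(len(pooled)), n_a):
        n_splits += 1
        if mean_difference(sum(pooled[i] for i in indices)) >= observed - 1e-12:
            as_extreme += 1
    return as_extreme / n_splits


# Hypothetical, non-overlapping normalized ribosome densities.
mutant_n2 = [2.0, 2.2]
wild_type_n6 = [0.8, 0.9, 1.0, 1.1, 1.2, 1.3]
p_n2 = exact_permutation_pvalue(mutant_n2, wild_type_n6)

mutant_n4 = [2.0, 2.1, 2.2, 2.3]
wild_type_n8 = [0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 1.3]
p_n4 = exact_permutation_pvalue(mutant_n4, wild_type_n8)

adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

Because the smallest attainable p-value depends only on the group sizes, every non-overlapping comparison with the same sample sizes gives the same p-value.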
C.2 Supplementary Figures
Figure C.1. The percent change in median normalized ribosome density 𝝆 for a given pair
of amino acids in the P-site and A-site, relative to any other amino acid being in the P-site
(Eq. C.2). The published datasets are listed based on the name of the first author of the study,
namely, Williams114 (a), Jan113 (b), two biological replicates from the study of Nissley9 (c, d),
Weinberg41 (e) and Young111 (f). The accession numbers of these samples are listed in Table
C.1. The legend is the same as in Figure 4.1b.
Figure C.2. The sign of the percent change in ribosome density (Eq. C.2) for the fast and slow
translating amino acid pairs remains the same after controlling for different molecular factors
known to influence translation speed. (a) 84 fast translating pairs (dark blue) and 73 slow translating
pairs (dark orange) shown in Figure 4.1b not controlling for any molecular factors. (b) The 157
significant amino acid pairs after controlling for downstream mRNA secondary structure. The direction
of the median speed change for all 157 pairs remains the same but 72 (46%) pairs lose statistical
significance (light orange and light blue colors for slow and fast, respectively) after controlling for the
factor. (c) Same analysis as (b) but controlling for positively charged residues present upstream of the
P-site. The direction of the speed change for all 157 pairs remains the same, 42 (27%) pairs lose
statistical significance. (d) Same analysis as (b) but controlling for non-optimal codons in the A-site.
The direction of the median speed change for 156 out of 157 pairs remains the same but 53 (34%) pairs
lose statistical significance. (e) Similar analysis as (b) but controlling for stalling motifs. The direction
of the speed change for all 157 pairs remains the same, 4 pairs (2.5%) lose statistical significance. The
loss of statistical significance is primarily due to a decrease in the sample size after filtering out of
instances of the given molecular factor from the [𝜌(𝑋, 𝑍)] and [𝜌(~𝑋, 𝑍)] distributions that are compared.
Figure C.3. The ribosome profiling data for all the mutant strains demonstrate consistent
fragment size distribution, strong 3 nt periodicity, robust frame distribution and high pairwise
correlation of individual transcript’s ribosome profiles. (a) Fragment size distribution of reads
mapped to the CDS regions by 5′ end including 50 nt region upstream of start codon for the mutant
strains created in this study. YOL* strain was subjected to ribosome profiling twice on different days
denoted as YOL*-I and YOL*-II for phases I and II respectively (see Methods for details). (b) The
distribution of reads of fragment size 28 whose 5′ end have aligned to reading frame 0, 1 or 2 for all
mutant strains. (c) The pairwise correlation of ribosome profiles for 108 high coverage genes that have
at least 3 reads at every codon position. The pairwise correlation is carried out only between samples
prepared during the same phase (See Methods for details). The median Pearson r is 0.96 indicating
very high correlation between ribosome profiles of genes across different samples. (d) Meta-gene
profile of normalized read counts for fragments of size 28 mapped by the 5′ end and plotted in a 100
nt region starting from -18 nucleotide position with respect to first nucleotide of start codon up to
nucleotide position 82. For analyses in (a), (b) and (d), for all mutant strains, the normalized read counts
were averaged across all replicates.
Figure C.4. Ribosome profiles of mutant and wild-type strains are highly correlated. (a-e) The
normalized ribosome densities 𝜌 in the mutated gene, averaged over all replicates, for the mutant and wild-type
reference samples are plotted on the X- and Y-axis, respectively. The normalized ribosome
density at the codon position that is in the A-site when the mutated position is in the P-site is shown in red. In
all cases, the median Pearson r between the individual replicates is greater than 0.92.
Figure C.5. Optimal and non-optimal codons are equally distributed between the domain
and linker regions of proteins for both fast- and slow-translating amino acid pairs. Probability
of observing optimal or non-optimal codons in either domain or linker regions, and whether the
codons are part of the fast- or slow-translating amino acid pairs identified in Figure 4.1b. Comparison of
domain versus linker regions for both optimal and non-optimal codons, and of optimal versus non-optimal
codons within both domain and linker regions in slow pairs, shows that they are equally
distributed (p-values > 0.05, Wilcoxon signed-rank test, corrected for multiple testing by the
Bonferroni method).
Figure C.6. Fast-translating amino acid pairs are enriched in those
transcript segments that are being translated when the chaperone Ssb is
bound to the nascent chain. As done in a previous publication159, 5 different
thresholds listed on the x-axis were defined based on the percentile values of the
Cumulative Distribution Function of Fold Enrichment (FE) metric (see Methods
for details). For each of these thresholds, Ssb-bound translating regions (B
segments) were defined by the nucleotide positions that have FE values greater
than the upper threshold (e.g., P75 for threshold (P75, P25)) while Ssb-unbound
translating regions (UB segments) are defined by the nucleotide positions with FE
less than the lower threshold (e.g., P25 for threshold (P75, P25)). For fast-translating
amino acid pairs (green) and slow-translating amino acid pairs (red), the percent
change in probability of finding these pairs in the B segments relative to UB
segments is reported on the y-axis. A positive percent change indicates
enrichment and negative percent change indicates depletion of the pairs.
Significance of enrichment/depletion is calculated using the random permutation
test (***: p < 0.0001, **: p < 0.01, *: p < 0.05). Error bars represent 95% CI and
were estimated using the Bootstrapping method. These results indicate that the
fast translating pairs of amino acids are enriched in those segments being
translated when Ssb is bound to the ribosome nascent chain complex.
Figure C.7. Translation speed differences are not explained by wobble decoding in the P- and
A-sites. (a) After controlling for all wobble base pairing tRNAs in both P- and A-sites, the sign of the
percent change in median normalized ribosome density 𝜌 (Eq. C.2) remains the same in 156 out of
the 157 fast and slow translating pairs identified in Figure 4.1b. The coloring scheme is the same as in
Figure C.2. (b) The percent difference in medians of the distribution of normalized ribosome density 𝜌
of all possible 7,980 amino acid pair mutations to the P-site is plotted on the X-axis with its statistical
significance being represented by the negative log of p-value plotted on the Y-axis. These distributions
are compared after controlling for instances of amino acid pairs where either the P-site or the A-site
codon is decoded through a Wobble base pairing mechanism. 2,758 amino acid pair mutations are
statistically significant. (c) After filtering for Wobble base pairing, the probability density distribution of
the odds of mutating any instance of P-site amino acid resulting in a significant change of translation
rate is plotted. The median odds are 1.72, similar to the 1.68 value observed in the original dataset
(Figure 4.1e).
Figure C.8. Samples prepared in the same phase (single batch on same day) exhibit higher
correlations than samples prepared in different phases. For 118 genes having at least 3 reads
per codon in the highest coverage replicate of YOL* mutant strain prepared in Phase I, pairwise
correlations are run between the normalized ribosome densities across the CDS in this YOL*
replicate with the highest coverage replicate of all other mutant strains. The boxplot of R² values is
plotted here for correlating the normalized ribosome density profiles of 118 genes of sample YOL*
from Phase I with mutant strains from Phase I and II. YOL* mutant strain prepared in Phase I has
highly correlated ribosome densities at individual codon positions with mutant strains YMR*, YKL*
and YLR* which were also prepared in Phase I. The correlation of codon level ribosome density for
118 YOL* genes was lower when correlated with ribosome profiles of strains YHR*, YOL* and YOL**
prepared in Phase II. Hence, wild-type replicates were chosen for our mutations (Figure 4.2) such
that they have been prepared in the same phase.
C.3 Supplementary Tables
Table C.1. Ribo-Seq data were obtained from five different published studies. The study and the sample
accession numbers are listed below.
Dataset (first author name)   Year of publication   Number of replicates   GEO Study   Accession numbers of samples used
Jan                           2014                  1                      GSE61012    GSM1495525
Williams                      2014                  1                      GSE61011    GSM1495503
Young                         2015                  1                      GSE69414    GSM1700885
Weinberg                      2016                  1                      GSE53268    GSM1289257
Nissley                       2016                  2                      GSE75322    GSM1949550, GSM1949551
Table C.2. Details on the 12 single amino acid mutations that were made across 5 different
genes. For a given amino acid pair in the A- and P-sites, the columns, from left to right, report the
amino acid in the P-site (‘P-site’), the amino acid that the P-site is mutated to (‘Mutated P-site’), the
amino acid in the A-site (‘A-site’), the Gene name (‘Gene’), the A-site codon number (‘A-site codon
No.’), the wild-type and mutated codon in the P-site (P-site codon and Mutated P-site codon,
respectively), and the codon in the A-site (‘A-site codon’).
P-site   Mutated P-site   A-site   Gene         A-site codon No.   P-site codon   Mutated P-site codon   A-site codon
Slow → Fast Translation mutations
P        E                G        YMR122W-A    56                 GGC            GAA                    CCA
N        S                R        YOL109W      106                AAC            UCC                    CGU
D        F                G        YKL096W-A    32                 GGU            UUC                    GAC
G        S                T        YLR109W      162                ACC            UCU                    GGU
G        S                G        YOL109W      99                 GGC            UCU                    GGU
Fast → Slow Translation mutations
Q        P                D        YOL109W      14                 CAA            CCA                    GAU
S        G                G        YLR109W      140                GGU            GGU                    AGU
S        G                T        YKL096W-A    62                 ACC            GGU                    AGC
V        H                K        YHR179W      339                GUG            CAC                    AAG
E        K                E        YHR179W      150                GAA            AAA                    GAA
Negative Control Mutations
V        Y                F        YHR179W      251                GUC            UAC                    UUC
L        N                A        YHR179W      129                CUU            AAC                    GCU
Table C.3: Statistics of read mapping for ribosome profiling experiments for the
mutant strains carried out in this study. 1.7 billion reads were mapped to the exome in
total for all samples with an average of 86 million reads per sample.
Sample        Total reads    Mapped to rRNA                 Mapped to exome
              (millions)     Reads (millions)   % reads     Reads (millions)   % reads
Phase I samples
YMR* rep1     37.10          9.04               24.37%      24.76              66.75%
YMR* rep2     56.06          11.90              21.23%      25.02              44.63%
YKL* rep1     59.79          15.65              26.18%      41.46              69.35%
YKL* rep2     65.39          17.04              26.05%      44.88              68.64%
YOL* rep1     123.58         28.99              23.46%      83.56              67.62%
YOL* rep2     59.10          14.76              24.97%      41.51              70.25%
YLR* rep1     55.39          8.54               15.43%      44.03              79.49%
YLR* rep2     57.00          5.34               9.36%       44.67              78.36%
Phase II samples
YHR* rep1     195.94         27.96              14.27%      158.14             80.71%
YHR* rep2     163.22         30.91              18.94%      124.51             76.28%
YHR* rep3     151.04         24.84              16.45%      116.16             76.91%
YHR* rep4     159.08         19.18              12.06%      131.19             82.46%
YOL* rep1     158.03         29.91              18.93%      118.97             75.28%
YOL* rep2     148.15         16.85              11.38%      99.34              67.05%
YOL* rep3     149.12         24.39              16.36%      114.79             76.98%
YOL* rep4     141.42         23.34              16.50%      104.26             73.72%
YOL** rep1    146.85         18.53              12.62%      74.68              50.85%
YOL** rep2    157.73         22.76              14.43%      125.87             79.80%
YOL** rep3    141.19         20.74              14.69%      112.96             80.00%
YOL** rep4    141.42         23.34              16.50%      104.26             73.72%
Average                                                     86.75
Total reads mapped to exome                                 1735.10
Table C.4. Three mutations to gene YOL109W to test the contribution of amino acid and
tRNA identity. Columns are the same as in Table C.2, except the two synonymous mutations that
encode for the same mutated residue at the P-site, are labeled ‘Mutant 1 P-site codon’ and ‘Mutant
2 P-site codon’.
P-site   Mutated P-site   A-site   Gene       A-site codon No.   P-site codon   Mutant 1 P-site codon   Mutant 2 P-site codon   A-site codon
G        S                G        YOL109W    99                 GGC            UCU                     AGC                     GGU
Q        P                D        YOL109W    14                 CAA            CCA                     CCU                     GAU
N        S                R        YOL109W    106                AAC            UCC                     UCG                     CGU
Appendix D
CHAPTER 5 SUPPORTING INFORMATION
This appendix contains proofs for the Fold Enrichment (FE) measure used in Chapter 5 to
determine the Ssb-bound translated segments. This was published as Data_S1
Supplementary file for the study published in Cell titled “Profiling Ssb-Nascent Chain
Interactions Reveals Principles of Hsp70-Assisted Folding” by Kristina Döring, Nabeel
Ahmed, Trine Riemer, Harsha Garadi Suresh, Yevhen Vainshtein, Markus Habich, Jan
Riemer, Matthias P. Mayer, Edward P. O’Brien, Günter Kramer and Bernd Bukau. The
text below is being reproduced with permission from CellPress under the Journal
publishing agreement that allows authors to use the publication for inclusion in a thesis or
dissertation.
D.1 Derivations Demonstrating that the Fold Enrichment Is Directly Proportional to
the Ssb-Binding Probability and that, in the Fold Enrichment, the Contribution of
the Elongation Rate Is Eliminated
There are two proofs provided below. The first demonstrates that the Fold
Enrichment (FE) is directly proportional to the probability of Ssb binding. The second
demonstrates that the number of SeRP reads is a function of the elongation rate, and that
by using Fold Enrichment, the contribution of the elongation rate is eliminated and the
probability of Ssb binding isolated.
D.1.1 Proof 1: Demonstration that the FE is directly proportional to the probability
of Ssb binding
The probability of finding Ssb bound to a nascent chain when codon $i$ of transcripts from
gene $j$ is being translated at a given instant in time is defined as
$$P_{\mathrm{bound}}(i,j) = \frac{N_{\mathrm{bound}}(i,j)}{N_{\mathrm{Total}}(i,j)}, \tag{1}$$
where $N_{\mathrm{bound}}(i,j)$ is the number of ribosomes that have Ssb bound when codon $i$ is being
translated, and $N_{\mathrm{Total}}(i,j)$ is the total number of ribosomes actively translating codon $i$.
The FE at codon $i$ of gene $j$ is defined as $\mathrm{FE}(i,j) = S(i,j)/R(i,j)$, where $S(i,j)$ is the number of
reads arising from selective ribosome profiling (SeRP) that map to codon $i$ of gene $j$, and $R(i,j)$ is
the number of reads arising from ribosome profiling (RP) that map to the same location.
Experiments have shown that the number of RNA-Seq reads is directly proportional to the
number of mRNA molecules (Figure 2C in Mortazavi et al.175). We assume this holds for
RP and SeRP reads as well, since they are also Next Generation Sequencing methods.
Therefore, $R(i,j)$ is directly proportional to the total number of ribosomes at $i$ and $j$; that
is, $R(i,j) \propto N_{\mathrm{Total}}(i,j)$, and $R(i,j) = a\,N_{\mathrm{Total}}(i,j)$, where $a$ is a constant of proportionality.
This equation can be algebraically rearranged to find
$$N_{\mathrm{Total}}(i,j) = \frac{R(i,j)}{a}. \tag{2}$$
Likewise, $S(i,j)$ is most likely directly proportional to the number of ribosomes that
have Ssb bound at codon position $i$ on transcript $j$, and hence $S(i,j) = b\,N_{\mathrm{bound}}(i,j)$ and
$$N_{\mathrm{bound}}(i,j) = \frac{S(i,j)}{b}, \tag{3}$$
where $b$ is a constant of proportionality.
Substituting Eqs. 2 and 3 into Eq. 1 results in $P_{\mathrm{bound}}(i,j) = \frac{a\,S(i,j)}{b\,R(i,j)}$. Substituting
our definition of $\mathrm{FE}(i,j)$ into this equation yields $P_{\mathrm{bound}}(i,j) = \frac{a}{b}\,\mathrm{FE}(i,j)$. Therefore,
$$P_{\mathrm{bound}}(i,j) \propto \mathrm{FE}(i,j). \tag{4}$$
This demonstrates that the fold enrichment that is experimentally measured is directly
proportional to the probability of an Ssb molecule being bound to the nascent chain.
D.1.2 Proof 2: Demonstration that SeRP reads are a function of the elongation rate,
and that the Fold Enrichment metric controls for this effect.
The total number of ribosomes on transcript $j$, $N_{\mathrm{Total}}(j) = \sum_{k=1}^{N_C} N_{\mathrm{Total}}(k,j)$, where $N_C$ is
the number of codons in the transcript, is equal to
$$N_{\mathrm{Total}}(j) = k_{\mathrm{int},j}\,N_{\mathrm{mRNA},j}\,\tau_{S,j}, \tag{5}$$
where $k_{\mathrm{int},j}$, $N_{\mathrm{mRNA},j}$ and $\tau_{S,j}$ are, respectively, the initiation rate, mRNA copy number and
average synthesis time of transcript $j$. The synthesis time is the sum of the codon
translation times, therefore $\tau_{S,j} = \sum_{k=1}^{N_C} \tau_A(k,j)$, where $\tau_A(k,j)$ is the average translation time
of codon $k$ in transcript $j$. Substituting this expression into Eq. 5 and comparing the two sums over codons term by term, it can be seen that
$$N_{\mathrm{Total}}(i,j) = k_{\mathrm{int},j}\,N_{\mathrm{mRNA},j}\,\tau_A(i,j). \tag{6}$$
Inserting Eq. 3 into Eq. 1 we get $S(i,j) = b\,N_{\mathrm{Total}}(i,j)\,P_{\mathrm{bound}}(i,j)$. Substituting Eq. 6
into this equation we get a key result
$$S(i,j) = b\,k_{\mathrm{int},j}\,N_{\mathrm{mRNA},j}\,\tau_A(i,j)\,P_{\mathrm{bound}}(i,j). \tag{7}$$
Thus, we have demonstrated that the number of SeRP reads, $S(i,j)$, is not just a
function of the probability of Ssb binding, but also a function of the codon translation time,
among other factors. Thus, the probability of binding cannot be defined by the SeRP
reads alone. We need additional information to define the bound regions.
Likewise, it can be shown using Eq. 2 that $R(i,j) = a\,k_{\mathrm{int},j}\,N_{\mathrm{mRNA},j}\,\tau_A(i,j)$. Substituting
this expression and Eq. 7 into our definition of FE, we get
$$\mathrm{FE}(i,j) = \frac{S(i,j)}{R(i,j)} = \frac{b\,k_{\mathrm{int},j}\,N_{\mathrm{mRNA},j}\,\tau_A(i,j)\,P_{\mathrm{bound}}(i,j)}{a\,k_{\mathrm{int},j}\,N_{\mathrm{mRNA},j}\,\tau_A(i,j)} = \frac{b}{a}\,P_{\mathrm{bound}}(i,j). \tag{8}$$
These additional factors (including elongation speed) all cancel out and we have thereby
isolated the effect of Ssb binding:
$$P_{\mathrm{bound}}(i,j) \propto \mathrm{FE}(i,j). \tag{9}$$
In Eqs. 8 and 9 we have demonstrated that by dividing the SeRP reads by the RP reads
we eliminate the effects of elongation times, initiation rates, and mRNA copy number
and measure the effect of the probability of Ssb binding.
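The cancellation in Eq. 8 can be checked numerically with a small sketch. All parameter values below are hypothetical and chosen only for illustration; the variable names mirror the symbols in Eqs. 2 to 8 and do not come from any published code.

```python
# Hypothetical parameters for one transcript (illustrative values only).
a = 2.0          # proportionality constant between RP reads and ribosomes
b = 0.5          # proportionality constant between SeRP reads and Ssb-bound ribosomes
k_int = 0.1      # initiation rate
n_mrna = 50.0    # mRNA copy number

# Codons 0 and 1 share the same binding probability but differ in translation time,
# as do codons 2 and 3.
tau_a = [0.05, 0.20, 0.10, 0.40]     # average codon translation times tau_A(i, j)
p_bound = [0.10, 0.10, 0.60, 0.60]   # Ssb-binding probabilities P_bound(i, j)

# Eq. 6: number of ribosomes at each codon.
n_total = [k_int * n_mrna * tau for tau in tau_a]

# R = a * N_Total (from Eq. 2) and S = b * N_Total * P_bound (Eq. 7).
rp_reads = [a * n for n in n_total]
serp_reads = [b * n * p for n, p in zip(n_total, p_bound)]

# Eq. 8: the fold enrichment depends only on P_bound and the constant b / a.
fold_enrichment = [s / r for s, r in zip(serp_reads, rp_reads)]
ratio_to_p_bound = [fe / p for fe, p in zip(fold_enrichment, p_bound)]
```

In this toy example the SeRP reads at codons 0 and 1 differ fourfold because of the translation times alone, while the fold enrichment is identical at both codons and `ratio_to_p_bound` equals b/a everywhere.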
REFERENCES
1. Vogel, C. Translation’s coming of age. Mol. Syst. Biol. 7, 498 (2011).
2. Schwanhäusser, B., Busse, D., Li, N., Dittmar, G., Schuchhardt, J., Wolf, J., Chen, W. & Selbach, M. Global quantification of mammalian gene expression control. Nature 473, 337–342 (2011).
3. Cooper, G. Translation of mRNA. The Cell: A Molecular Approach. (Sinauer Associates, 2000). at <https://www.ncbi.nlm.nih.gov/books/NBK9839/>
4. Shah, P., Ding, Y., Niemczyk, M., Kudla, G. & Plotkin, J. B. Rate-limiting steps in yeast protein translation. Cell 153, 1589–601 (2013).
5. Espah Borujeni, A. & Salis, H. M. Translation Initiation is Controlled by RNA Folding Kinetics via a Ribosome Drafting Mechanism. J. Am. Chem. Soc. 138, 7016–7023 (2016).
6. Nissley, D. A. & O’Brien, E. P. Timing Is Everything: Unifying Codon Translation Rates and Nascent Proteome Behavior. J. Am. Chem. Soc. 136, 17892–17898 (2014).
7. Trovato, F. & O’Brien, E. P. Fast Protein Translation Can Promote Co- and Posttranslational Folding of Misfolding-Prone Proteins. Biophys. J. 112, 1807–1819 (2017).
8. Sharma, A. K., Bukau, B. & O’Brien, E. P. Physical Origins of Codon Positions That Strongly Influence Cotranslational Folding: A Framework for Controlling Nascent-Protein Folding. J. Am. Chem. Soc. 138, 1180–1195 (2016).
9. Nissley, D. A., Sharma, A. K., Ahmed, N., Friedrich, U. A., Kramer, G., Bukau, B. & O’Brien, E. P. Accurate prediction of cellular co-translational folding indicates proteins can switch from post- to co-translational folding. Nat. Commun. 7, 10341 (2016).
10. Sharma, A. K. & O’Brien, E. P. Increasing Protein Production Rates Can Decrease the Rate at Which Functional Protein is Produced and Their Steady-State Levels. J. Phys. Chem. B 121, 6775–6784 (2017).
11. Sharma, A. K. & O’Brien, E. P. Non-equilibrium coupling of protein structure and function to translation–elongation kinetics. Curr. Opin. Struct. Biol. 49, 94–103 (2018).
12. Curran, J. F. & Yarus, M. Rates of aminoacyl-tRNA selection at 29 sense codons in vivo. J. Mol. Biol. 209, 65–77 (1989).
13. Komar, A. A. A pause for thought along the co-translational folding pathway. Trends Biochem. Sci. 34, 16–24 (2009).
14. O’Brien, E. P., Vendruscolo, M. & Dobson, C. M. Kinetic modelling indicates that fast-translating codons can coordinate cotranslational protein folding by avoiding misfolded intermediates. Nat. Commun. 5, 2988 (2014).
15. Ciryam, P., Morimoto, R. I., Vendruscolo, M., Dobson, C. M. & O’Brien, E. P. In vivo translation rates can substantially delay the cotranslational folding of the Escherichia coli cytosolic proteome. Proc. Natl. Acad. Sci. U. S. A. 110, E132-40 (2013).
16. Gingold, H. & Pilpel, Y. Determinants of translation efficiency and accuracy. Mol. Syst. Biol. 7, 481 (2011).
17. Plotkin, J. B. & Kudla, G. Synonymous but not the same: the causes and consequences of codon bias. Nat. Rev. Genet. 12, 32–42 (2011).
18. Dong, H., Nilsson, L. & Kurland, C. G. Co-variation of tRNA Abundance and Codon Usage in Escherichia coli at Different Growth Rates. J. Mol. Biol. 260, 649–663 (1996).
19. Trotta, E. Selection on codon bias in yeast: A transcriptional hypothesis. Nucleic Acids Res. 41, 9382–9395 (2013).
20. Sharp, P. M. & Li, W.-H. The codon adaptation index - a measure of directional synonymous codon usage bias, and its potential applications. Nucleic Acids Res. 15, 1281–1295 (1987).
21. dos Reis, M., Savva, R. & Wernisch, L. Solving the riddle of codon usage preferences: A test for translational selection. Nucleic Acids Res. 32, 5036–5044 (2004).
22. Carlini, D. B. & Stephan, W. In vivo introduction of unpreferred synonymous codons into the Drosophila Adh gene results in reduced levels of ADH protein. Genetics 163, 239–243 (2003).
23. Nicola, A. V., Chen, W. & Helenius, A. Co-translational folding of an alphavirus capsid protein in the cytosol of living cells. Nat. Cell. Biol. 1, 341–345 (1999).
24. Dennis, P. P. & Bremer, H. Modulation of Chemical Composition and Other Parameters of the Cell at Different Exponential Growth Rates. EcoSal Plus 3, (2008).
25. Ingolia, N. T., Ghaemmaghami, S., Newman, J. R. S. & Weissman, J. S. Genome-wide analysis in vivo of translation with nucleotide resolution using ribosome profiling. Science 324, 218–223 (2009).
26. Ingolia, N. T., Brar, G. A., Rouskin, S., McGeachy, A. M. & Weissman, J. S. The ribosome profiling strategy for monitoring translation in vivo by deep sequencing of ribosome-protected mRNA fragments. Nat. Protoc. 7, 1534–1550 (2012).
27. Dana, A. & Tuller, T. Determinants of Translation Elongation Speed and Ribosomal Profiling Biases in Mouse Embryonic Stem Cells. PLoS Comput. Biol. 8, (2012).
28. Gerashchenko, M. V. & Gladyshev, V. N. Translation inhibitors cause abnormalities in ribosome profiling experiments. Nucleic Acids Res. 42, (2014).
29. Hussmann, J. A., Patchett, S., Johnson, A., Sawyer, S. & Press, W. H. Understanding Biases in Ribosome Profiling Experiments Reveals Signatures of Translation Dynamics in Yeast. PLoS Genet. 11, e1005732 (2015).
30. Fang, H., Huang, Y.-F., Radhakrishnan, A., Siepel, A., Lyon, G. J. & Schatz, M. C. Scikit-ribo Enables Accurate Estimation and Robust Modeling of Translation Dynamics at Codon Resolution. Cell Syst. 6, 180-191.e4 (2018).
31. Martens, A. T., Taylor, J. & Hilser, V. J. Ribosome A and P sites revealed by length analysis of ribosome profiling data. Nucleic Acids Res. 43, 3680 (2015).
32. Wang, H., McManus, J. & Kingsford, C. Accurate Recovery of Ribosome Positions Reveals Slow Translation of Wobble-Pairing Codons in Yeast. J. Comput. Biol. 24, 486–500 (2017).
33. Ingolia, N. T., Lareau, L. F. & Weissman, J. S. Ribosome profiling of mouse embryonic stem cells reveals the complexity and dynamics of mammalian proteomes. Cell 147, 789–802 (2011).
34. Oh, E., Becker, A. H., Sandikci, A., Huber, D., Chaba, R., Gloge, F., Nichols, R. J., Typas, A., Gross, C. A., Kramer, G., Weissman, J. S. & Bukau, B. Selective ribosome profiling reveals the cotranslational chaperone action of trigger factor in vivo. Cell 147, 1295–1308 (2011).
35. Gardin, J., Yeasmin, R., Yurovsky, A., Cai, Y., Skiena, S. & Futcher, B. Measurement of average decoding rates of the 61 sense codons in vivo. Elife 3, e03735 (2014).
36. Popa, A., Lebrigand, K., Paquet, A., Nottet, N., Robbe-Sermesant, K., Waldmann, R. & Barbry, P. RiboProfiling: a Bioconductor package for standard Ribo-seq pipeline processing. F1000Research 5, 1309 (2016).
37. Lauria, F., Tebaldi, T., Bernabò, P., Groen, E. J. N., Gillingwater, T. H. & Viero, G. riboWaltz: Optimization of ribosome P-site positioning in ribosome profiling data. PLoS Comput. Biol. 14, 1–20 (2018).
38. Dunn, J. G. & Weissman, J. S. Plastid: nucleotide-resolution analysis of next-generation sequencing and genomics data. BMC Genomics 17, 958 (2016).
39. Pop, C., Rouskin, S., Ingolia, N. T., Han, L., Phizicky, E. M., Weissman, J. S. & Koller, D. Causal signals between codon bias, mRNA structure, and the efficiency of translation and elongation. Mol. Syst. Biol. 10, 770 (2014).
40. Lareau, L. F., Hite, D. H., Hogan, G. J. & Brown, P. O. Distinct stages of the translation elongation cycle revealed by sequencing ribosome-protected mRNA fragments. Elife 2014, 1–16 (2014).
41. Weinberg, D. E., Shah, P., Eichhorn, S. W., Hussmann, J. A., Plotkin, J. B. & Bartel, D. P. Improved Ribosome-Footprint and mRNA Measurements Provide Insights into Dynamics and Regulation of Yeast Translation. Cell Rep. 14, 1787–1799 (2016).
42. Charneski, C. A. & Hurst, L. D. Positively Charged Residues Are the Major Determinants of Ribosomal Velocity. PLoS Biol. 11, e1001508 (2013).
43. Qian, W., Yang, J. R., Pearson, N. M., Maclean, C. & Zhang, J. Balanced codon usage optimizes eukaryotic translational efficiency. PLoS Genet. 8, e1002603 (2012).
44. Dana, A. & Tuller, T. Mean of the Typical Decoding Rates: A New Translation Efficiency Index Based on the Analysis of Ribosome Profiling Data. G3 Genes|Genomes|Genetics 5, 73–80 (2014).
45. Hani, J. tRNA genes and retroelements in the yeast genome. Nucleic Acids Res. 26, 689–696 (2002).
46. Letzring, D. P., Dean, K. M. & Grayhack, E. J. Control of translation efficiency in yeast by codon-anticodon interactions. RNA 16, 2516–2528 (2010).
47. Chaney, J. L. & Clark, P. L. Roles for Synonymous Codon Usage in Protein Biogenesis. Annu. Rev. Biophys. 44, 143–166 (2015).
48. Pechmann, S. & Frydman, J. Evolutionary conservation of codon optimality reveals hidden signatures of cotranslational folding. Nat. Struct. Mol. Biol. 20, 237–43 (2013).
49. Stadler, M. & Fire, A. Wobble base-pairing slows in vivo translation elongation in metazoans. RNA 17, 2063–2073 (2011).
50. Choi, J., Grosely, R., Prabhakar, A., Lapointe, C. P., Wang, J. & Puglisi, J. D. How Messenger RNA and Nascent Chain Sequences Regulate Translation Elongation. Annu. Rev. Biochem. 87, 421–449 (2018).
51. Qu, X., Wen, J.-D., Lancaster, L., Noller, H. F., Bustamante, C. & Tinoco, I. The ribosome uses two active mechanisms to unwind messenger RNA during translation. Nature 475, 118–121 (2011).
52. Wen, J.-D., Lancaster, L., Hodges, C., Zeri, A. C., Yoshimura, S. H., Noller, H. F., Bustamante, C. & Tinoco, I. Following translation by single ribosomes one codon at a time. Nature 452, 598–603 (2008).
53. Tuller, T., Waldman, Y. Y., Kupiec, M. & Ruppin, E. Translation efficiency is determined by both codon bias and folding energy. Proc. Natl. Acad. Sci. 107, 3645–3650 (2010).
54. Tuller, T., Veksler-Lublinsky, I., Gazit, N., Kupiec, M., Ruppin, E. & Ziv-Ukelson, M. Composite effects of gene determinants on the translation speed and density of ribosomes. Genome Biol. 12, (2011).
55. Nedialkova, D. D. & Leidel, S. A. Optimization of Codon Translation Rates via tRNA Modifications Maintains Proteome Integrity. Cell 161, 1606–1618 (2015).
56. Goodarzi, H., Nguyen, H. C. B., Zhang, S., Dill, B. D., Molina, H. & Tavazoie, S. F. Modulated expression of specific tRNAs drives gene expression and cancer progression. Cell 165, 1416–1427 (2016).
57. Yona, A. H., Bloom-Ackermann, Z., Frumkin, I., Hanson-Smith, V., Charpak-Amikam, Y., Feng, Q., Boeke, J. D., Dahan, O. & Pilpel, Y. tRNA genes rapidly change in evolution to meet novel translational demands. Elife 2, 1–17 (2013).
58. Pavlov, M. Y., Watts, R. E., Tan, Z., Cornish, V. W., Ehrenberg, M. & Forster, A. C. Slow peptide bond formation by proline and other N-alkylamino acids in translation. Proc. Natl. Acad. Sci. U. S. A. 106, 50–54 (2009).
59. Johansson, M., Ieong, K.-W., Trobro, S., Strazewski, P., Aqvist, J., Pavlov, M. Y. & Ehrenberg, M. pH-sensitivity of the ribosomal peptidyl transfer reaction dependent on the identity of the A-site aminoacyl-tRNA. Proc. Natl. Acad. Sci. 108, 79–84 (2010).
60. Artieri, C. G. & Fraser, H. B. Accounting for biases in riboprofiling data indicates a major role for proline in stalling translation. Genome Res. 24, 2011–2021 (2014).
61. Ude, S., Lassak, J., Starosta, A. L., Kraxenberger, T., Wilson, D. N. & Jung, K. Translation elongation factor EF-P alleviates ribosome stalling at Polyproline Stretches. Science 339, 82–86 (2013).
62. Doerfel, L. K., Wohlgemuth, I., Kothe, C., Peske, F., Urlaub, H. & Rodnina, M. V. EF-P Is Essential for Rapid Synthesis of Proteins Containing Consecutive Proline Residues. Science 339, 85–88 (2013).
63. Starosta, A. L., Lassak, J., Peil, L., Atkinson, G. C., Virumäe, K., Tenson, T., Remme, J., Jung, K. & Wilson, D. N. Translational stalling at polyproline stretches is modulated by the sequence context upstream of the stall site. Nucleic Acids Res. 42, 10711–10719 (2014).
64. Gutierrez, E., Shin, B. S., Woolstenhulme, C. J., Kim, J. R., Saini, P., Buskirk, A. R. & Dever, T. E. eIF5A promotes translation of polyproline motifs. Mol. Cell 51, 35–45 (2013).
65. Schuller, A. P., Wu, C. C. C., Dever, T. E., Buskirk, A. R. & Green, R. eIF5A Functions Globally in Translation Elongation and Termination. Mol. Cell 66, 194-205.e5 (2017).
66. Woolstenhulme, C. J., Guydosh, N. R., Green, R. & Buskirk, A. R. High-Precision analysis of translational pausing by ribosome profiling in bacteria lacking EFP. Cell Rep. 11, 13–21 (2015).
67. Peil, L., Starosta, A. L., Lassak, J., Atkinson, G. C., Virumae, K., Spitzer, M., Tenson, T., Jung, K., Remme, J. & Wilson, D. N. Distinct XPPX sequence motifs induce ribosome stalling, which is rescued by the translation elongation factor EF-P. Proc. Natl. Acad. Sci. 110, 15265–15270 (2013).
68. Sabi, R. & Tuller, T. A comparative genomics study on the effect of individual amino acids on ribosome stalling. BMC Genomics 16, S5 (2015).
69. Lu, J. & Deutsch, C. Electrostatics in the Ribosomal Tunnel Modulate Chain Elongation Rates. J. Mol. Biol. 384, 73–86 (2008).
70. Stein, K. C. & Frydman, J. The stop-and-go traffic regulating protein biogenesis: How translation kinetics controls proteostasis. J. Biol. Chem. 294, 2076–2084 (2018).
71. Frydman, J., Nimmesgern, E., Ohtsuka, K. & Ulrich, B. F. Folding of nascent polypeptide chains in high molecular chaperones. Nature 370, 111–117 (1994).
72. Kramer, G., Boehringer, D., Ban, N. & Bukau, B. The ribosome as a platform for co-translational processing, folding and targeting of newly synthesized proteins. Nat. Struct. Mol. Biol. 16, 589–597 (2009).
73. Pechmann, S., Willmund, F. & Frydman, J. The Ribosome as a Hub for Protein Quality Control. Mol. Cell 49, 411–421 (2013).
74. Preissler, S. & Deuerling, E. Ribosome-associated chaperones as key players in proteostasis. Trends Biochem. Sci. 37, 274–283 (2012).
75. Shiber, A., Döring, K., Friedrich, U., Klann, K., Merker, D., Zedan, M., Tippmann, F., Kramer, G. & Bukau, B. Cotranslational assembly of protein complexes in eukaryotes revealed by ribosome profiling. Nature 561, 268–272 (2018).
76. Chartron, J. W., Hunt, K. C. L. & Frydman, J. Cotranslational signal-independent SRP preloading during membrane targeting. Nature 536, 224–228 (2016).
77. Thommen, M., Holtkamp, W. & Rodnina, M. V. Co-translational protein folding: progress and methods. Curr. Opin. Struct. Biol. 42, 83–89 (2017).
78. Yu, C. H., Dang, Y., Zhou, Z., Wu, C., Zhao, F., Sachs, M. S. & Liu, Y. Codon Usage Influences the Local Rate of Translation Elongation to Regulate Co-translational Protein Folding. Mol. Cell 59, 744–754 (2015).
79. Elena, C., Ravasi, P., Castelli, M. E., Peirú, S. & Menzella, H. G. Expression of codon optimized genes in microbial systems: Current industrial applications and perspectives. Front. Microbiol. 5, 1–8 (2014).
80. Buhr, F., Jha, S., Thommen, M., Mittelstaet, J., Kutz, F., Schwalbe, H., Rodnina, M. V. & Komar, A. A. Synonymous Codons Direct Cotranslational Folding toward Different Protein Conformations. Mol. Cell 61, 341–351 (2016).
81. Zhang, G., Hubalewska, M. & Ignatova, Z. Transient ribosomal attenuation coordinates protein synthesis and co-translational folding. Nat. Struct. Mol. Biol. 16, 274–280 (2009).
82. Fedyunin, I., Lehnhardt, L., Böhmer, N., Kaufmann, P., Zhang, G. & Ignatova, Z. tRNA concentration fine tunes protein solubility. FEBS Lett. 586, 3336–3340 (2012).
83. Zhou, M., Guo, J., Cha, J., Chae, M., Chen, S., Barral, J. M., Sachs, M. S. & Liu, Y. Non-optimal codon usage affects expression, structure and function of clock protein FRQ. Nature 495, 111–115 (2013).
84. Zhou, T., Weems, M. & Wilke, C. O. Translationally optimal codons associate with structurally sensitive sites in proteins. Mol. Biol. Evol. 26, 1571–1580 (2009).
85. Sander, I. M., Chaney, J. L. & Clark, P. L. Expanding Anfinsen’s Principle: Contributions of Synonymous Codon Selection to Rational Protein Design. J. Am. Chem. Soc. ASAP, (2014).
86. Chaney, J. L., Steele, A., Carmichael, R., Rodriguez, A., Specht, A. T., Ngo, K., Li, J., Emrich, S. & Clark, P. L. Widespread position-specific conservation of synonymous rare codons within coding sequences. PLoS Comput. Biol. 13, 1–19 (2017).
87. Saunders, R. & Deane, C. M. Synonymous codon usage influences the local protein structure observed. Nucleic Acids Res. 38, 6719–6728 (2010).
88. Dana, A. & Tuller, T. Efficient Manipulations of Synonymous Mutations for Controlling Translation Rate: An Analytical Approach. J. Comput. Biol. 19, 200–231 (2012).
89. Becker, A. H., Oh, E., Weissman, J. S., Kramer, G. & Bukau, B. Selective ribosome profiling as a tool for studying the interaction of chaperones and targeting factors with nascent polypeptide chains and ribosomes. Nat. Protoc. 8, 2212–39 (2013).
90. Calkhoven, C. F., Müller, C. & Leutz, A. Translational control of gene expression and disease. Trends Mol. Med. 8, 577–583 (2002).
91. Ingolia, N. T. Ribosome Footprint Profiling of Translation throughout the Genome. Cell 165, 22–33 (2016).
92. Popa, A., Lebrigand, K., Paquet, A., Nottet, N., Robbe-Sermesant, K., Waldmann, R. & Barbry, P. RiboProfiling: a Bioconductor package for standard Ribo-seq pipeline processing. F1000Research 5, 1309 (2016).
93. Sierksma, G. Linear and Integer Programming Theory and Practice. (CRC Press, 2001). at <http://openlibrary.org/books/OL8124799M/Linear_Integer_Programming>
94. Ingolia, N. T., Brar, G. A., Stern-Ginossar, N., Harris, M. S., Talhouarne, G. J. S., Jackson, S. E., Wills, M. R. & Weissman, J. S. Ribosome Profiling Reveals Pervasive Translation Outside of Annotated Protein-Coding Genes. Cell Rep. 8, 1365–1379 (2014).
95. O’Connor, P. B. F., Li, G. W., Weissman, J. S., Atkins, J. F. & Baranov, P. V. rRNA:mRNA pairing alters the length and the symmetry of mRNA-protected fragments in ribosome profiling experiments. Bioinformatics 29, 1488–1491 (2013).
96. Mohammad, F., Woolstenhulme, C. J., Green, R. & Buskirk, A. R. Clarifying the Translational Pausing Landscape in Bacteria by Ribosome Profiling. Cell Rep. 14, 686–694 (2016).
97. Nakahigashi, K., Takai, Y., Kimura, M., Abe, N., Nakayashiki, T., Shiwa, Y., Yoshikawa, H., Wanner, B. L., Ishihama, Y. & Mori, H. Comprehensive identification of translation start sites by tetracycline-inhibited ribosome profiling. DNA Res. 23, 193–201 (2016).
98. Malys, N. Shine-Dalgarno sequence of bacteriophage T4: GAGG prevails in early genes. Mol. Biol. Rep. 39, 33–39 (2012).
99. Han, Y., Gao, X., Liu, B., Wan, J., Zhang, X. & Qian, S. B. Ribosome profiling reveals sequence-independent post-initiation pausing as a signature of translation. Cell Res. 24, 842–851 (2014).
100. Haase, N., Holtkamp, W., Lipowsky, R., Rodnina, M. & Rudorf, S. Decomposition of time-dependent fluorescence signals reveals codon-specific kinetics of protein synthesis. Nucleic Acids Res. 46, (2018).
101. Diament, A. & Tuller, T. Estimation of ribosome profiling performance and reproducibility at various levels of resolution. Biol. Direct 11, 24 (2016).
102. Malone, B., Atanassov, I., Aeschimann, F., Li, X., Großhans, H. & Dieterich, C. Bayesian prediction of RNA translation from ribosome profiling. Nucleic Acids Res. 45, 2960–2972 (2016).
103. Sabi, R. & Tuller, T. A comparative genomics study on the effect of individual amino acids on ribosome stalling. BMC Genomics 16, S5 (2015).
104. Gutierrez, E., Shin, B. S., Woolstenhulme, C. J., Kim, J. R., Saini, P., Buskirk, A. R. & Dever, T. E. eIF5A promotes translation of polyproline motifs. Mol. Cell 51, 35–45 (2013).
105. Dana, A. & Tuller, T. The effect of tRNA levels on decoding times of mRNA codons. Nucleic Acids Res. 42, 9171–9181 (2014).
106. Brackley, C. A., Romano, M. C. & Thiel, M. The dynamics of supply and demand in mRNA translation. PLoS Comput. Biol. 7, (2011).
107. Rudorf, S. & Lipowsky, R. Protein synthesis in E. coli: Dependence of codon-specific elongation on tRNA concentration and codon usage. PLoS One 10, 1–22 (2015).
108. Sonenberg, N. & Hinnebusch, A. G. Regulation of Translation Initiation in Eukaryotes: Mechanisms and Biological Targets. Cell 136, 731–745 (2009).
109. Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357–359 (2012).
110. Kim, D., Pertea, G., Trapnell, C., Pimentel, H., Kelley, R. & Salzberg, S. L. TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol. 14, R36 (2013).
111. Young, D. J., Guydosh, N. R., Zhang, F., Hinnebusch, A. G. & Green, R. Rli1/ABCE1 Recycles Terminating Ribosomes and Controls Translation Reinitiation in 3′UTRs In Vivo. Cell 162, 872–884 (2015).
112. Guydosh, N. R. & Green, R. Dom34 rescues ribosomes in 3′ untranslated regions. Cell 156, 950–962 (2014).
113. Jan, C. H., Williams, C. C. & Weissman, J. S. Principles of ER cotranslational translocation revealed by proximity-specific ribosome profiling. Science 346, 748–751 (2014).
114. Williams, C. C., Jan, C. H. & Weissman, J. S. Targeting and plasticity of mitochondrial proteins revealed by proximity-specific ribosome profiling. Science 346, 748–751 (2014).
115. Martin, M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal 17, 10 (2011).
116. Hurt, J. A., Robertson, A. D. & Burge, C. B. Global analyses of UPF1 binding and function reveals expanded scope of nonsense-mediated mRNA decay. Genome Res. 23, 1636–1650 (2013).
117. Li, G.-W., Oh, E. & Weissman, J. S. The anti-Shine–Dalgarno sequence drives translational pausing and codon choice in bacteria. Nature 484, 538–541 (2012).
118. Li, G. W., Burkhardt, D., Gross, C. & Weissman, J. S. Quantifying absolute protein synthesis rates reveals principles underlying allocation of cellular resources. Cell 157, 624–635 (2014).
119. Gillespie, D. T. Exact stochastic simulation of coupled chemical reactions. J. Phys. Chem. 81, 2340–2361 (1977).
120. Good, P. Permutation, Parametric, and Bootstrap Tests of Hypothesis. (Springer Series in Statistics, 2005). doi:10.1007/978-0-387-98135-2
121. Reid, D. W. & Nicchitta, C. V. Primary role for endoplasmic reticulum-bound ribosomes in cellular translation identified by ribosome profiling. J. Biol. Chem. 287, 5518–5527 (2012).
122. Chowdhury, D. Stochastic mechano-chemical kinetics of molecular motors: A multidisciplinary enterprise from a physicist’s perspective. Phys. Rep. 529, 1–197 (2013).
123. Marshall, R. A., Aitken, C. E., Dorywalska, M. & Puglisi, J. D. Translation at the single-molecule level. Annu. Rev. Biochem. 77, 177–203 (2008).
124. Sharma, A. K. & Chowdhury, D. Template-directed biopolymerization: Tape-copying Turing machines. Biophys. Rev. Lett. 7, 135–175 (2012).
125. Jackson, R. J., Hellen, C. U. T. & Pestova, T. V. The mechanism of eukaryotic translation initiation and principles of its regulation. Nat. Rev. Mol. Cell Biol. 11, 113–127 (2010).
126. Hinnebusch, A. G. The Scanning Mechanism of Eukaryotic Translation Initiation. Annu. Rev. Biochem. 83, 779–812 (2014).
127. Spriggs, K. A., Bushell, M. & Willis, A. E. Translational Regulation of Gene Expression during Conditions of Cell Stress. Mol. Cell 40, 228–237 (2010).
128. Kervestin, S. & Amrani, N. Translational regulation of gene expression. Genome Biol. 5, 359 (2004).
129. Ciandrini, L., Stansfield, I. & Romano, M. C. Ribosome Traffic on mRNAs Maps to Gene Ontology: Genome-wide Quantification of Translation Initiation Rates and Polysome Size Regulation. PLoS Comput. Biol. 9, e1002866 (2013).
130. Dana, A. & Tuller, T. Mean of the Typical Decoding Rates: A New Translation Efficiency Index Based on the Analysis of Ribosome Profiling Data. G3-Genes Genomes Genet. 5, 73–80 (2015).
131. Gritsenko, A. A., Hulsman, M., Reinders, M. J. T. & de Ridder, D. Unbiased Quantitative Models of Protein Translation Derived from Ribosome Profiling Data. PLoS Comput. Biol. 11, e1004336 (2015).
132. Dao Duc, K. & Song, Y. S. The impact of ribosomal interference, codon usage, and exit tunnel interactions on translation elongation rate variation. PLoS Genet. 14, e1001508 (2018).
133. Requião, R. D., de Souza, H. J. A., Rossetto, S., Domitrovic, T. & Palhano, F. L. Increased ribosome density associated to positively charged residues is evident in ribosome profiling experiments performed in the absence of translation inhibitors. RNA Biol. 13, 561–568 (2016).
134. Dao Duc, K., Saleem, Z. H. & Song, Y. S. Theoretical analysis of the distribution of isolated particles in totally asymmetric exclusion processes: Application to mRNA translation rate estimation. Phys. Rev. E 97, 12106 (2018).
135. Gillespie, D. T. Exact stochastic simulation of coupled chemical reactions. J. Phys. Chem. 81, 2340–2361 (1977).
136. Fluitt, A., Pienaar, E. & Viljoen, H. Ribosome kinetics and aa-tRNA competition determine rate and fidelity of peptide synthesis. Comput. Biol. Chem. 31, 335–346 (2007).
137. Sharma, A. K., Ahmed, N. & O’Brien, E. P. Determinants of translation speed are randomly distributed across transcripts resulting in a universal scaling of protein synthesis times. Phys. Rev. E 97, 22409 (2018).
138. Rouskin, S., Zubradt, M., Washietl, S., Kellis, M. & Weissman, J. S. Genome-wide probing of RNA structure reveals active unfolding of mRNA structures in vivo. Nature 505, 701–705 (2014).
139. Kertesz, M., Wan, Y., Mazor, E., Rinn, J. L., Nutter, R. C., Chang, H. Y. & Segal, E. Genome-wide measurement of RNA secondary structure in yeast. Nature 467, 103–107 (2010).
140. Sorensen, M. A. & Pedersen, S. Absolute in vivo translation rates of individual codons in Escherichia coli. The two glutamic acid codons GAA and GAG are translated with a threefold difference in rate. J. Mol. Biol. 222, 265–280 (1991).
141. Brar, G. A. Beyond the Triplet Code: Context Cues Transform Translation. Cell 167, 1681–1692 (2016).
142. Diament, A., Feldman, A., Schochet, E., Kupiec, M., Arava, Y. & Tuller, T. The extent of ribosome queuing in budding yeast. PLoS Comput. Biol. 14, e1005951 (2018).
143. Ingolia, N. T. Ribosome Footprint Profiling of Translation throughout the Genome. Cell 165, 22–33 (2016).
144. Gerashchenko, M. V. & Gladyshev, V. N. Ribonuclease selection for ribosome profiling. Nucleic Acids Res. 45, e6 (2017).
145. Lecanda, A., Nilges, B. S., Sharma, P., Nedialkova, D. D., Schwarz, J., Vaquerizas, J. M. & Leidel, S. A. Dual randomization of oligonucleotides to reduce the bias in ribosome-profiling libraries. Methods 107, 89–97 (2016).
146. Pfeffer, S., Lagos-Quintana, M. & Tuschl, T. Cloning of Small RNA Molecules. Curr. Protoc. Mol. Biol. 72, 26.4.1-26.4.18 (2005).
147. Levin, J. Z., Yassour, M., Adiconis, X., Nusbaum, C., Thompson, D. A., Friedman, N., Gnirke, A. & Regev, A. Comprehensive comparative analysis of strand-specific RNA sequencing methods. Nat. Methods 7, 709–715 (2010).
148. Zur, H. & Tuller, T. Predictive biophysical modeling and understanding of the dynamics of mRNA translation and its evolution. Nucleic Acids Res. 44, 9031–9049 (2016).
149. Shaw, L. B., Zia, R. K. P. & Lee, K. H. Totally asymmetric exclusion process with extended objects: A model for protein synthesis. Phys. Rev. E 68, 021910 (2003).
150. Chou, T., Mallick, K. & Zia, R. K. P. Non-equilibrium statistical mechanics: from a paradigmatic model to biological transport. Reports Prog. Phys. 74, 116601 (2011).
151. Lakatos, G. & Chou, T. Totally asymmetric exclusion processes with particles of arbitrary size. J. Phys. A: Math. Gen. 36, 2027–2041 (2003).
152. Fernandes, L. D., de Moura, A. & Ciandrini, L. Gene length as a regulator for ribosome recruitment and protein synthesis: theoretical insights. Sci. Rep. 7, 17409 (2017).
153. Bonnin, P., Kern, N., Young, N. T., Stansfield, I. & Romano, M. C. Novel mRNA-specific effects of ribosome drop-off on translation rate and polysome profile. PLoS Comput. Biol. (2017). doi:10.1371/journal.pcbi.1005555
154. Ahmed, N., Sormanni, P., Ciryam, P., Vendruscolo, M., Dobson, C. M. & O’Brien, E. P. Identifying A- and P-site locations on ribosome-protected mRNA fragments using Integer Programming. Sci. Rep. 9, 6256 (2019).
155. Bonven, B. & Gulløv, K. Peptide chain elongation rate and ribosomal activity in Saccharomyces cerevisiae as a function of the growth rate. Mol. Gen. Genet. 170, 225–30 (1979).
156. Karpinets, T. V., Greenwood, D. J., Sams, C. E. & Ammons, J. T. RNA:protein ratio of the unicellular organism as a characteristic of phosphorous and nitrogen stoichiometry and of the cellular requirement of ribosomes for protein synthesis. BMC Biol. 4, 30 (2006).
157. Nissley, D. A., Sharma, A. K., Ahmed, N., Friedrich, U. A., Kramer, G., Bukau, B. & O’Brien, E. P. Accurate prediction of cellular co-translational folding indicates proteins can switch from post- to co-translational folding. Nat. Commun. 7, (2016).
158. Cherry, J. M., Hong, E. L., Amundsen, C., Balakrishnan, R., Binkley, G., Chan, E. T., Christie, K. R., Costanzo, M. C., Dwight, S. S., Engel, S. R., Fisk, D. G., Hirschman, J. E., Hitz, B. C., Karra, K., Krieger, C. J., Miyasato, S. R., Nash, R. S., Park, J., Skrzypek, M. S., Simison, M., Weng, S. & Wong, E. D. Saccharomyces Genome Database: the genomics resource of budding yeast. Nucleic Acids Res. 40, D700-5 (2012).
159. Döring, K., Ahmed, N., Riemer, T., Suresh, H. G., Vainshtein, Y., Habich, M., Riemer, J., Mayer, M. P., O’Brien, E. P., Kramer, G. & Bukau, B. Profiling Ssb-Nascent Chain Interactions Reveals Principles of Hsp70-Assisted Folding. Cell 170, 298-311.e20 (2017).
160. Fields, S., Gamble, C. E., Grayhack, E. J., Brule, C. E. & Dean, K. M. Adjacent Codons Act in Concert to Modulate Translation Efficiency in Yeast. Cell 166, 679–690 (2016).
161. Bukau, B., Deuerling, E., Pfund, C. & Craig, E. A. Getting newly synthesized proteins into shape. Cell 101, 119–122 (2000).
162. Frydman, J. Folding of Newly Translated Proteins in vivo: The Role of Molecular Chaperones. Annu. Rev. Biochem. (2001).
163. Albanèse, V., Yam, A. Y. W., Baughman, J., Parnot, C. & Frydman, J. Systems analyses reveal two chaperone networks with distinct functions in eukaryotic cells. Cell 124, 75–88 (2006).
164. Koplin, A., Preissler, S., Ilina, Y., Koch, M., Scior, A., Erhardt, M. & Deuerling, E. A dual function for chaperones SSB-RAC and the NAC nascent polypeptide-associated complex on ribosomes. J. Cell Biol. 189, 57–68 (2010).
165. Ariosa, A., Lee, J. H., Wang, S., Saraogi, I. & Shan, S. Regulation by a chaperone improves substrate selectivity during cotranslational protein targeting. Proc. Natl. Acad. Sci. 112, E3169–E3178 (2015).
166. Jacobson, G. N. & Clark, P. L. Quality over quantity: Optimizing co-translational protein folding with non-’optimal’ synonymous codons. Curr. Opin. Struct. Biol. 38, 102–110 (2016).
167. Nelson, R. J., Ziegelhoffer, T., Nicolet, C., Werner-Washburne, M. & Craig, E. A. The translation machinery and 70 kd heat shock protein cooperate in protein synthesis. Cell 71, 97–105 (1992).
168. Gautschi, M., Mun, A., Ross, S. & Rospert, S. A functional chaperone triad on the yeast ribosome. Proc. Natl. Acad. Sci. 99, 4209–4214 (2002).
169. Willmund, F., Del Alamo, M., Pechmann, S., Chen, T., Albanèse, V., Dammer, E. B., Peng, J. & Frydman, J. The cotranslational function of ribosome-associated Hsp70 in eukaryotic protein homeostasis. Cell 152, 196–209 (2013).
170. Shalgi, R., Hurt, J. A., Krykbaeva, I., Taipale, M., Lindquist, S. & Burge, C. B. Widespread Regulation of Translation by Elongation Pausing in Heat Shock. Mol. Cell 49, 439–452 (2013).
171. Liu, B., Han, Y. & Qian, S. B. Cotranslational Response to Proteotoxic Stress by Elongation Pausing of Ribosomes. Mol. Cell 49, 453–463 (2013).
172. Tuller, T., Carmi, A., Vetsigian, K., Navon, S., Dorfan, Y., Zaborske, J., Pan, T., Dahan, O., Furman, I. & Pilpel, Y. An evolutionarily conserved mechanism for controlling the efficiency of protein translation. Cell 141, 344–354 (2010).
173. Sauna, Z. E. & Kimchi-Sarfaty, C. Understanding the contribution of synonymous mutations to human disease. Nat. Rev. Genet. 12, 683–91 (2011).
174. Janke, C., Magiera, M. M., Rathfelder, N., Taxis, C., Reber, S., Maekawa, H., Moreno-Borchart, A., Doenges, G., Schwob, E., Schiebel, E. & Knop, M. A versatile toolbox for PCR-based tagging of yeast genes: new fluorescent proteins, more markers and promoter substitution cassettes. Yeast 21, 947–62 (2004).
175. Mortazavi, A., Williams, B. A., McCue, K., Schaeffer, L. & Wold, B. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat. Methods 5, 621–628 (2008).
176. Cannarozzi, G., Schraudolph, N. N., Faty, M., von Rohr, P., Friberg, M. T., Roth, A. C., Gonnet, P., Gonnet, G. & Barral, Y. A role for codon order in translation dynamics. Cell 141, 355–367 (2010).
177. Hazewinkel, M. Encyclopedia of Mathematics. (Springer Science+Business Media B.V. / Kluwer Academic Publishers).
VITA
Nabeel Ahmed
Education
Ph.D., Bioinformatics and Genomics, The Pennsylvania State University Aug 2019
M.S. (Honors), Biological Sciences, Birla Institute of Technology and Science Jul 2013
M.S. (Technology), Information Systems, Birla Institute of Technology and Science Jul 2013
Publications
1) N. Ahmed, U. F. Friedrich, …, G. Kramer, E. P. O’Brien. “Evolutionarily selected amino acid
pairs encode translation-elongation rate information.” (Submitted)
2) A. K. Sharma*, P. Sormanni*, N. Ahmed*, P. Ciryam, U. F. Friedrich, G. Kramer and E. P.
O’Brien. “A chemical kinetic basis for measuring initiation and elongation rates from
ribosome profiling data.” PLOS Comp Biology. 15(5): e1007070 (2019).
3) N. Ahmed*, P. Sormanni*, P. Ciryam, M. Vendruscolo, C. M. Dobson and E. P. O’Brien.
“Identifying A- and P-site locations on ribosome-protected mRNA fragments using Integer
Programming.” Scientific Reports, 9:6256 (2019).
4) A. K. Sharma, N. Ahmed and E. P. O’Brien, “Determinants of translation speed are
randomly distributed across transcripts resulting in a universal scaling of protein synthesis
times.” Phys. Rev. E, 97, 022409 (2018).
5) K. Döring, N. Ahmed, …, E. P. O’Brien, G. Kramer and B. Bukau. “Profiling Ssb-Nascent
Chain Interactions Reveals Principles of Hsp70-Assisted Folding.” Cell, 170, 298–311.e20
(2017).
6) D.A. Nissley, A. K. Sharma, N. Ahmed, U. Friedrich, G. Kramer, B. Bukau and E. P.
O’Brien. “Accurate prediction of cellular co-translational folding indicates proteins can
switch from post- to co-translational folding.” Nature Communications, 7, 10341 (2016).
*Co-first authors
Invited Oral Presentations
Biophysical Society 62nd Annual Meeting Feb 2018
Penn State Genomics Seminar series Mar 2016 and Sep 2017
Selected Poster Presentations
Triennial Ribosome 2019 meeting Jan 2019
EMBL Symposium ‘The Complex Life of RNA’ Oct 2018
Cold Spring Harbor Laboratory meeting on ‘Translational Control’ Sep 2018
From Computational Biophysics to Systems Biology 2017 May 2017
Selected Awards and Honors
Penn State Huck Institutes of Life Sciences Graduate Travel Award 2018
Penn State Eberly College of Science Braddock Scholarship 2013