Deep proteome and transcriptome mapping of a human cervical cancer cell line

Deep proteome and transcriptome mapping of a human cancer cell line

BARBA | BIADOMANG | CUA | DAYANAN | LOPEZ

AUTHORS: N. Nagaraj, J. Wisniewski, T. Geiger, J. Cox, M. Kircher, J. Kelso, S. Paabo, M. Mann

JOURNAL: Molecular Systems Biology

DATE PUBLISHED: October 29, 2011

The human genome comprises a mere 20,000 protein-coding genes.

RNA-Seq Transcriptomics

• Transcripts of between 8,000 and 16,000 protein-coding genes

High-res MS-based Proteomics

• Essentially complete proteome of a model organism (yeast)

• Limited to 4,000-6,000 protein groups in mammalian systems

“…explore a human proteome in the depth achievable with current technology and to compare it with the corresponding transcriptome.”

METHODOLOGY

HeLa Cell Lysate

• Flash freezing

• Store at -80 °C

• Cell lysis

• Sonication

• Centrifugation

• Protein content determination

Protein Fractionation by Gel Filtration

• 0.1 mL cell lysate

• Load onto GL column

• Elution with buffer

Protein Digestion and Peptide Fractionation

• Removal of detergent

• Protein digestion

– Lys-C

– Glu-C

– Trypsin

Mass Spectrometry

• Purification

• RP C18 chromatography

• MS

RNA Sequencing

• Extraction

• Quantification

• RNA library preparation

• Enrichment

• Fragmentation

• RNA fragments copied into DNA

• Blunt ends conversion

• Addition of deoxyadenosine

• Ligation of forked adaptors

• Amplification

• Sequencing

Data Availability

Gene and Transcript Quantification

Data analysis

RESULTS AND DISCUSSION

Sample: The HeLa cells

• HeLa cells

– Human cervical carcinoma cell line

• “Immortal cells”: can grow indefinitely and be frozen for decades

• Standardized the field of tissue culture

– Named after Henrietta Lacks by a scientist at Johns Hopkins Hospital

• A piece of her tumor was taken

• Her cells never died

– Prolific growth may be due to HPV18

• HPV18 viral proteins E6 and E7 suppress the p53 and pRb gene products, respectively.

Proteome Coverage Study

• Objective: to achieve maximum proteome coverage at a reasonable measurement time

– Procedure: Investigate effects of protein fractionation, proteolytic digestion, peptide fractionation, and reverse phase chromatography


• Protein fractionation

– Gel filtration: separation based on size and shape

• Proteolytic digestion

– Trypsin: cleaves at the C-terminal side of lysine and arginine residues

– Glu-C: cleaves at the C-terminal side of glutamic acid residues

– Lys-C: cleaves at the carboxyl side of lysine residues

– Note: Protein digestion heavily affects protein characterization and identification by mass spectrometry

• Overlapping fragments = larger sequence coverage
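The cleavage rules above can be sketched as a toy in silico digestion. This is a minimal illustration only: it assumes complete cleavage with no missed sites and no proline exception, and the sequence `toy` is hypothetical, not a real protein.

```python
# Simplified cleavage rules: cut C-terminal to the listed residues.
# Assumption: complete digestion, no missed cleavages, no proline rule.
CLEAVAGE_RESIDUES = {
    "trypsin": "KR",  # lysine and arginine
    "lys-c": "K",     # lysine only
    "glu-c": "E",     # glutamic acid
}


def digest(sequence, enzyme):
    """Split a protein sequence after each residue the enzyme targets."""
    residues = CLEAVAGE_RESIDUES[enzyme]
    peptides, current = [], ""
    for amino_acid in sequence:
        current += amino_acid
        if amino_acid in residues:
            peptides.append(current)
            current = ""
    if current:  # keep the C-terminal remainder
        peptides.append(current)
    return peptides


toy = "MAEKLRGEDKPTR"  # hypothetical sequence for illustration
for enzyme in CLEAVAGE_RESIDUES:
    print(enzyme, digest(toy, enzyme))
```

Because each enzyme cuts at different residues, the three peptide sets overlap along the sequence, which is what raises sequence coverage.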


• Pipette-based prefractionations

– Strong anion exchange resin

• Independent of pH

• Mostly used for deep coverage of the composition of the sample or if specific peptides should be enriched

• Reverse phase chromatography

– Reduces the complexity of the peptide mixture by separating peptides according to their polarity before tandem mass spectrometry


• LC-MS/MS analysis

– Peptide MS spectra

• Interpretation by comparison with lists generated from theoretical digestion of proteins

– Fragment MS/MS spectra

• Interpretation by comparison with theoretical fragmentation of peptides

– Elution time of a peptide is based on its polarity

– Fractionation is repeated extensively to make the peptide mixture less complex, thereby increasing the number of peptides for which tandem mass spectra are acquired


• This procedure is referred to as “shotgun” proteomics

– Most successful strategy for achieving extensive proteome coverage

– Summary: proteins are extracted from their biological source and subjected to enzymatic digestion, and the resulting peptide mixtures are analysed by LC-MS/MS

• Additional augmented fractionation steps for proteins/peptides can also be conducted

MaxQuant Computational Proteomics Environment

• MaxQuant

– Quantitative proteomics software package designed for analyzing large mass spectrometric data sets

– Has an integrated search engine, Andromeda

– Supported instrument: LTQ-Orbitrap

• Orbitrap: ions circulate around a central, spindle-shaped electrode

• Highly accurate: the axial oscillation frequency, determined with high precision, is inversely proportional to the square root of m/z.
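The Orbitrap relation between axial oscillation frequency and m/z can be written out as a short worked equation. Here k stands for an instrument-dependent field constant; the symbol names are chosen for illustration and do not come from the slides.

```latex
\omega_z = \sqrt{\frac{k}{m/z}}
\quad\Longrightarrow\quad
\frac{m}{z} = \frac{k}{\omega_z^{2}}
```

Since the frequency depends only on m/z and the field constant, measuring it precisely gives a precise mass, and an ion with four times the m/z oscillates at half the frequency.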


• Data set: 2 337 336 high-resolution fragmentation spectra with high-accuracy precursor masses

• Search Engine: Andromeda

– Algorithm: uses a probability-based approach to match tandem mass (MS/MS) spectra to peptide sequences in databases

– Median peptide score: 121; only 6% of peptides scored below 60

• Each score corresponds to the sum of the highest ion scores for each distinct sequence


• Average identification rate of fragmentation spectra: 43%

• Average absolute mass deviation of the precursors and of the matched fragment masses: 1.2 and 4.8 p.p.m., respectively
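For readers unfamiliar with the p.p.m. unit, a mass deviation can be computed as sketched below. The m/z values are hypothetical and chosen only to produce a deviation of the same size as the precursor figure quoted above.

```python
def ppm_error(observed_mz, theoretical_mz):
    """Relative mass deviation in parts per million (p.p.m.)."""
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6


# Hypothetical precursor: theoretical m/z 500.0000, measured 500.0006.
deviation = ppm_error(500.0006, 500.0)
print(f"{deviation:.2f} p.p.m.")
```

The absolute value of this quantity, averaged over all identified spectra, gives the kind of summary figure reported on the slide.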


• Result of analysis

– Number of peptides identified and quantified

• 163 784 peptides

– FDR (false discovery rate): 1%

– Out of 163 784 peptides

• 84 051 from tryptic digestion

• 52 108 from Lys-C digestion

• 44 704 from Glu-C digestion
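The per-enzyme counts add up to more than the total, which is a quick way to see the overlap between digests. The sketch below assumes that each per-enzyme figure counts distinct peptide sequences within that digest and that the total counts distinct sequences across all three.

```python
total_unique = 163_784
per_enzyme = {"trypsin": 84_051, "Lys-C": 52_108, "Glu-C": 44_704}

# Sum of the three digests, counting shared peptides more than once.
summed = sum(per_enzyme.values())

# Excess over the total: identifications that repeat a peptide
# already seen in another digest (under the assumption above).
redundant = summed - total_unique

print("sum of per-enzyme counts:", summed)
print("redundant identifications:", redundant)
for enzyme, count in per_enzyme.items():
    print(f"{enzyme}: {count / total_unique:.1%} of all peptides")
```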


• Result of analysis

– From the obtained data, MaxQuant identified 10 255 proteins with 99% confidence

• Lower bound of the number of proteins expressed in HeLa cells

– Overlapping fragments from enzymatic cleavage were observed

• Tryptic digestion: yielded the highest number of identifications

• Lys-C digestion: 85% overlapped with trypsin

• Glu-C digestion: 5.2% novel identifications

• <5% of all proteins were identified by only one peptide

• Taken together, >25% median sequence coverage
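Sequence coverage is the fraction of a protein's residues covered by at least one identified peptide. The sketch below shows how a peptide from a second enzyme raises coverage; the sequence and peptides are hypothetical, and exact string matching is an assumption made for simplicity.

```python
def sequence_coverage(protein, peptides):
    """Fraction of residues covered by at least one peptide."""
    covered = [False] * len(protein)
    for peptide in peptides:
        start = protein.find(peptide)
        while start != -1:  # mark every occurrence of the peptide
            for i in range(start, start + len(peptide)):
                covered[i] = True
            start = protein.find(peptide, start + 1)
    return sum(covered) / len(protein)


toy = "MAEKLRGEDKPTR"  # hypothetical protein sequence
print(sequence_coverage(toy, ["MAEK"]))           # one tryptic peptide
print(sequence_coverage(toy, ["MAEK", "KLRGE"]))  # plus a Glu-C peptide
```

Overlapping residues are counted once, so the gain from an extra digest comes only from regions the first enzyme did not cover.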

ENSEMBL database and GENSCAN predictions

• ENSEMBL-annotated human protein-coding genes

– MS/MS spectra: searched against the ENSEMBL database with GENSCAN predictions

– 10 255 proteins were mapped to 9207 human protein-coding genes

• Largest number of identified genes on chromosome 1

• Smallest number of identified genes on chromosome 21

– GENSCAN predictions: >1900 peptides not matching known ENSEMBL genes

Completeness of Detected Proteome

• Inspecting the macrocomplexes which are functionally necessary

• Proteasome, spliceosome, histone-modifying complexes and respiratory chain complexes

• CORUM protein complex database

CORUM protein complex database

• Collection of experimentally verified mammalian protein complexes

– protein complex function

– Localization

– subunit composition

– literature references

• Mean proteome coverage of all CORUM protein complexes was >95%

• Transcriptome coverage: 96.5%

• Complexes with lower coverage, attributable to cell-type specificity, include:

• Sarcoglycan-sarcospan: provides structural integrity in muscle tissues

• SNARE: for neurotransmitter release in synapses

• ITGA2b-ITGB3: a fibronectin receptor that plays a crucial role in coagulation

Complex | Normally expressed in | % Coverage

Sarcoglycan-sarcospan | Muscle | 20

SNARE (Soluble N-ethylmaleimide-sensitive factor Attachment protein Receptor) | Neuronal tissue | 40

ITGA2b-ITGB3 | Platelets | 50

• 5% of the HeLa cell population was in mitosis

– 61 of 63 proteins in a reference set of cell cycle-specific proteins were detected

– High coverage of most metabolic pathways pertaining to basic cellular function

• Comprehensiveness of the proteome is hard to determine by comparison with pathway databases because they contain cell type-specific proteins

Quantitative Analysis

• Deep-sequencing transcriptomics

– Proteomics data: >90% complete

• Transcriptome + proteome data

– 10,000-12,000 genes expressed in HeLa cells

• iBAQ (intensity-based absolute quantification)

– Sums the individual peptide signals in MS and normalizes by the number of observable peptides of the protein

– Estimates the absolute amount of each protein
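The iBAQ idea described above can be sketched in a few lines. This is an illustration of the normalization only, not MaxQuant's implementation; the function name, the intensities and the count of observable peptides are all hypothetical.

```python
def ibaq(peptide_intensities, n_observable_peptides):
    """Summed peptide intensity divided by the number of
    theoretically observable peptides of the protein."""
    if n_observable_peptides <= 0:
        raise ValueError("protein needs at least one observable peptide")
    return sum(peptide_intensities) / n_observable_peptides


# Hypothetical protein: three peptides measured, four observable in theory.
intensities = [2.0e6, 1.0e6, 3.0e6]
value = ibaq(intensities, 4)
print(f"iBAQ = {value:.3g}")
```

Dividing by the number of observable peptides keeps long proteins, which produce many peptides, from looking more abundant than short ones at the same copy number.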


• The 40 most abundant proteins comprised 25% of the proteome

– Filamin A, pyruvate kinase, enolase, vimentin, Hsp60

• 600 proteins account for 75% of the HeLa cell proteome mass


• Contribution of each protein to the total mass, combined with knowledge of the number of cells in the initial sample

– Used to roughly estimate the absolute copy number of the proteins in HeLa cells
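The copy-number estimate can be sketched as a mass-to-molecules conversion. Every input value below (mass fraction, sample protein mass, cell count, molecular weight) is hypothetical and chosen only to show the arithmetic; none of them is taken from the paper.

```python
AVOGADRO = 6.02214076e23  # molecules per mole


def copies_per_cell(mass_fraction, sample_protein_g, n_cells, mol_weight):
    """Rough copy number of one protein per cell.

    mass_fraction: share of total protein mass (0..1) for this protein
    sample_protein_g: total protein mass in the sample, in grams
    n_cells: number of cells in the initial sample
    mol_weight: molecular weight of the protein, in g/mol
    """
    protein_per_cell_g = sample_protein_g / n_cells
    moles = mass_fraction * protein_per_cell_g / mol_weight
    return moles * AVOGADRO


# Hypothetical inputs: 0.01% of proteome mass, 1.5 mg protein
# from 10 million cells, 50 kDa protein.
estimate = copies_per_cell(1e-4, 1.5e-3, 1e7, 50_000)
print(f"{estimate:,.0f} copies per cell")
```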


• Ranked distribution of proteins

– 90% of proteins are within a factor of 60 above or below the median protein copy number of 18,000 molecules per cell

– The lower-abundance half of the proteins accounts for <2% of the total protein mass
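The "factor of 60" range translates into concrete copy numbers as follows. This is plain arithmetic on the two figures quoted above; the variable names are chosen for illustration.

```python
import math

median_copies = 18_000  # molecules per cell
factor = 60

low = median_copies / factor   # lower edge of the 90% range
high = median_copies * factor  # upper edge of the 90% range

print(f"90% of proteins: {low:,.0f} to {high:,.0f} copies per cell")
print(f"span: {math.log10(high / low):.2f} orders of magnitude")
```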


• Protein abundance values

– Used to estimate the proportional contribution of any individual protein, protein complex or protein class to the total proteome


• Ribosomes (encoded by only 1% of human genes)

– 195 proteins contributed 6% of total protein mass

• Actin cytoskeleton contributes four-fold more to the proteome mass than expected from the number of genes and proteins


• Integral membrane

– 25% of the genome

– 7.6% protein mass


• Protein folding

– 2% of the identified proteome by number

– 8% of proteome mass


Percentage of total | Integral membrane proteins | “Protein folding” proteins

Human genome | 25 | 2

Protein mass | 7.6 | 8

• Differences are due to cell-type specific functions of these proteins
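The gap between gene share and mass share can be expressed as a simple ratio. The percentages are those quoted on the slides; the ratio itself is an illustrative measure and is not taken from the paper. For protein folding, the 2% figure is the share of the identified proteome by number.

```python
# (share by gene or protein count in %, share of protein mass in %)
categories = {
    "ribosomes": (1.0, 6.0),
    "integral membrane": (25.0, 7.6),
    "protein folding": (2.0, 8.0),
}


def mass_to_count_ratio(count_pct, mass_pct):
    """>1 means the class is over-represented by mass, <1 under-represented."""
    return mass_pct / count_pct


ratios = {name: mass_to_count_ratio(*pcts) for name, pcts in categories.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x")
```

A ratio above 1 marks a class of few but highly abundant proteins, while a ratio below 1 marks a class with many genes but little protein mass.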

Structural proteins and proteins in basic machineries > Regulatory proteins

Ribosome proteins form tight cluster at the top end

Proteasome is also abundant, but not its regulatory subunits (a factor of 100 less)

Cytoskeletal and metabolic proteins extend over a broad range

Enolase – highest expression value

Glycogen phosphorylase – 100,000-fold less at protein level and 10,000-fold less at transcript level

Regulatory proteins such as protein kinases and transcription factors have, on average, lower expression than the structural proteins

Each category spans a large expression range

Expression levels can provide starting points for systems biological modeling of the cell

“Given the rapid technological progress in both fields, we predict that the required depth of 10,000–12,000 genes will be routinely reachable soon.”

TRANSCRIPTOME RNA-Seq

PROTEOME High-res MS

THANK YOU
