Towards the Precision
Medicine Era: Computational challenges
Ron Shamir, CS, TAU
Fall 2016 seminar
http://www.cs.tau.ac.il/~rshamir/seminar/16/precmedsem16.html
Lecture 1 Outline
• A little bit of biology
• Gene expression
• Protein-protein networks
• Protein-DNA networks
• Functional enrichment
• About the seminar
• Your opportunity to ask lots of questions!!!
Gregor Mendel laws of inheritance,“gene” 1866
Watson and Crick DNA Discovery 1953
Genome Project 2003
DNA and Chromosomes
• DNA: a molecule built from 4 bases: A, C, G, T
• Chromosome: contiguous stretch of DNA
• Genome: totality of DNA material
Genes: Recipes for Proteins
• Gene: a DNA segment that specifies the sequence of a protein.
• RNA: copy of DNA of a gene; “manufacturer instructions” for a protein
http://morgan.rutgers.edu/MorganWebFrames/Level1/Page1/p1.html
DNA → RNA → protein
(DNA to RNA: transcription; RNA to protein: translation)
Analogy: DNA is the hard disk, RNA is one program, the protein is its output
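The DNA → RNA → protein flow can be sketched in a few lines of Python. This is a minimal illustration, not part of the lecture: it assumes the input is the coding strand of a gene, that the reading frame starts at the first base, and that the standard genetic code applies. The names `transcribe` and `translate` are made up for this sketch.

```python
# Standard genetic code, listed in TCAG order (a widely used compact encoding).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map each RNA codon (U instead of T) to its one-letter amino acid; "*" is stop.
CODON_TABLE = {
    (a + b + c).replace("T", "U"): AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def transcribe(coding_dna):
    """Transcription (simplified): copy the coding strand, writing U for T."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna):
    """Translation (simplified): read codons from position 0 until a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    mrna = transcribe("ATGGCCTAA")
    print(mrna, translate(mrna))
```

In the analogy of the slide, the DNA string plays the role of the hard disk, the mRNA string is one program copied from it, and the returned protein string is the program's output.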
© Ron Shamir
The busy chef
• The profile of the cell: which genes are expressed as mRNAs and at what quantities.
20,000 recipes 10,000 dishes, in different quantities
Cooking 10,000 dishes
DNA RNA protein
One of many computational challenges in the Human Genome project:
Assemble a puzzle of 27 million pieces
Complexity
• ~3,000,000,000 letters in the genome
• 2,278,100 letters in the Bible
• => one genome = a stack of ~1,000 Bibles
• ~20,000 genes in the genome
• Hard to identify
• Harder to figure their function
• Even harder to figure how they work together
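The genome-versus-Bible comparison is simple arithmetic on the two letter counts quoted on this slide. A quick check, using only those two numbers:

```python
# Letter counts as quoted on the slide.
genome_letters = 3_000_000_000
bible_letters = 2_278_100

# How many Bibles' worth of letters fit in one genome.
bibles_per_genome = genome_letters / bible_letters

if __name__ == "__main__":
    print(f"One genome is about {bibles_per_genome:.0f} Bibles")
```

The ratio comes out at roughly 1,300, which agrees with the slide's "~1,000" as an order-of-magnitude figure.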
Enter Bioinformatics
• The marriage of CS and Biology
• Responds to the explosion of biological data, and builds on the IT revolution
September 15 2016: 220,731,315,250 bases
Biology is becoming an information science
Now that we know the human genome sequence, what’s next?
• Find out the function of genes/proteins
• Understand gene regulation
• Figure out how genes, proteins interact: gene networks, development, …
• Understand human DNA variations
• Figure out the medical implications of all the above
• Research driven by new genome-wide high throughput technologies
• Key computational challenge: integration
DNA chips / Microarrays
• Simultaneous measurement of expression levels of all genes.
• Perform 10^5-10^6 measurements in one experiment
• Allow global view of cellular processes.
Measured now primarily by deep sequencing (NGS): up to 10^10 bases in one experiment
The Raw Data
[Figure: the “Raw Data” matrix of expression levels, with axes labeled genes and experiments]
Entries of the Raw Data matrix: ratios/absolute values/…
• Expression pattern for each gene
• Profile for each experiment/condition/sample/chip
Needs normalization!
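The slide does not say which normalization is meant. As one possible illustration only, the sketch below log-transforms a small expression matrix and centers each experiment on its median. The layout (rows are genes, columns are experiments), the pseudocount of 1 and the function name `log2_median_center` are all assumptions made for this example.

```python
from math import log2
from statistics import median


def log2_median_center(matrix):
    """Log2-transform raw values, then subtract each column's median.

    `matrix` is a list of rows (assumed: one row per gene), each row a list
    of non-negative raw expression values (assumed: one per experiment).
    """
    # Pseudocount of 1 avoids log2(0); this is a common but arbitrary choice.
    logged = [[log2(value + 1.0) for value in row] for row in matrix]
    n_experiments = len(logged[0])
    column_medians = [
        median(row[j] for row in logged) for j in range(n_experiments)
    ]
    return [
        [row[j] - column_medians[j] for j in range(n_experiments)]
        for row in logged
    ]


if __name__ == "__main__":
    raw = [[1, 3], [3, 7], [7, 15]]  # 3 genes x 2 experiments, toy values
    for row in log2_median_center(raw):
        print(row)
```

After this step each experiment has median 0 on the log scale, so a profile from a brighter chip or a deeper sequencing run is no longer systematically shifted relative to the others.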
GEO
• Nearly 2 million expression profiles
• All publicly available, well organized
• A vast, underutilized resource.
Protein interaction networks
Protein-protein interactions (PPIs)
• Low throughput measurements: accurate, scarce
• High throughput: more abundant, noisy
• Large, readily available resource
Regulation of Transcription
• A gene’s transcription regulation is mainly encoded in the DNA in a region called the promoter
• Each promoter contains several short DNA subsequences, called binding sites (BSs) that are bound by specific proteins called transcription factors (TFs)
[Figure: two TFs bound to binding sites (BSs) in the promoter of a gene, drawn 5’ to 3’]
Regulation of Transcription (II)
• By binding to a gene’s promoter, TFs promote or repress the recruitment of the transcription machinery
• The conditions that govern a gene’s transcription are determined by the specific combination of BSs in its promoter
[Figure: Gene 1 and Gene 2]
Modeling TF binding sites: Position Weight Matrix (PWM)
Position   1     2     3     4     5     6
A          0.1   0.8   0     0.7   0.2   0
C          0     0.1   0.5   0.1   0.4   0.6
G          0     0     0.5   0.1   0.4   0.1
T          0.9   0.1   0     0.1   0     0.3
ATGCAGGATACACCGATCGGTA   0.0605
GGAGTAGAGCAAGTCCCGTGA    0.0605
AAGACTCTACAATTATGGCGT    0.0151
Score: product of base probabilities. Need to set score threshold for hits.
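The scoring rule can be sketched as follows. The PWM values are the slide's matrix with columns ordered left to right by motif position, a reading that is an assumption but one that reproduces the three listed scores. The function names and the threshold value in the demo are illustrative only.

```python
# PWM from the slide: one list per base, one entry per motif position (1..6).
PWM = {
    "A": [0.1, 0.8, 0.0, 0.7, 0.2, 0.0],
    "C": [0.0, 0.1, 0.5, 0.1, 0.4, 0.6],
    "G": [0.0, 0.0, 0.5, 0.1, 0.4, 0.1],
    "T": [0.9, 0.1, 0.0, 0.1, 0.0, 0.3],
}
MOTIF_LENGTH = 6


def window_score(window):
    """Score of one 6-base window: product of the per-position probabilities."""
    score = 1.0
    for position, base in enumerate(window):
        score *= PWM[base][position]
    return score


def scan(sequence):
    """Yield (start, window, score) for every window in the sequence."""
    for start in range(len(sequence) - MOTIF_LENGTH + 1):
        window = sequence[start:start + MOTIF_LENGTH]
        yield start, window, window_score(window)


def best_hit(sequence):
    """Highest-scoring window in the sequence."""
    return max(scan(sequence), key=lambda hit: hit[2])


def hits(sequence, threshold):
    """All windows whose score reaches the (user-chosen) threshold."""
    return [hit for hit in scan(sequence) if hit[2] >= threshold]


if __name__ == "__main__":
    for seq in ("ATGCAGGATACACCGATCGGTA",
                "GGAGTAGAGCAAGTCCCGTGA",
                "AAGACTCTACAATTATGGCGT"):
        start, window, score = best_hit(seq)
        print(seq, window, round(score, 4))
```

Under this reading the best windows in the three example sequences are TACACC, TAGAGC and TACAAT, with scores that round to 0.0605, 0.0605 and 0.0151. A threshold of, say, 0.05 (a hypothetical choice) would accept the first two and reject the third.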
Protein-DNA interactions
• Can be predicted using PWMs (look for hits in the promoters)
• Can be measured experimentally (ChIP-chip, ChIP-seq, PBM, …)
• The result in all cases: for each TF, a list of gene targets
• Presentable as a network
• We often combine the PPI and the PDI networks
Goal
• Challenge: Detect active functional modules: connected subnetwork of proteins whose genes are co-expressed
• “Where is the action in the network in a particular experiment?”
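To make the goal concrete, here is a deliberately naive sketch, not the method of any paper covered in the seminar. It keeps only the network edges whose two genes have highly correlated expression profiles, then reports the connected components of what remains. The correlation threshold, the minimum module size and all names are assumptions, and requiring correlation only along edges is a simplification of "co-expressed".

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if one is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def coexpressed_modules(edges, expression, min_corr=0.8, min_size=3):
    """Connected components of the network restricted to co-expressed edges."""
    adjacency = {}
    for a, b in edges:
        if a in expression and b in expression:
            if pearson(expression[a], expression[b]) >= min_corr:
                adjacency.setdefault(a, set()).add(b)
                adjacency.setdefault(b, set()).add(a)

    modules, seen = [], set()
    for start in adjacency:
        if start in seen:
            continue
        # Depth-first search to collect one connected component.
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        if len(component) >= min_size:
            modules.append(component)
    return modules


if __name__ == "__main__":
    toy_edges = [("p1", "p2"), ("p2", "p3"), ("p3", "p4")]
    toy_expression = {
        "p1": [1.0, 2.0, 3.0],
        "p2": [2.0, 4.0, 6.0],
        "p3": [1.0, 2.0, 3.1],
        "p4": [3.0, 2.0, 1.0],
    }
    print(coexpressed_modules(toy_edges, toy_expression))
```

In the toy input, p4 is connected to p3 in the network but anti-correlated with it, so the reported module is p1, p2, p3. Real methods score whole subnetworks statistically rather than thresholding single edges.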
What is the Gene Ontology?
• Set of biological phrases (terms) which are applied to genes: – protein kinase – apoptosis – membrane
24th Feb 2006 Jane Lomax
What is the Gene Ontology?
• Genes are linked, or associated, with GO terms by trained curators at genome databases; these links are known as ‘gene associations’ or GO annotations
• Allows biologists to make inferences across large numbers of genes without researching each one individually
GO structure
[Figure: GO graph with is_a and part_of relations and an annotated gene A. Clark et al., 2005]
Reminder: Hypergeometric score
• Urn with N balls of which m are red.
• Draw n balls at random w/o replacement
• X = no. of red balls drawn
$$P(X = k) = \frac{\binom{m}{k}\binom{N-m}{n-k}}{\binom{N}{n}}$$
$$HG(N, m, n, k) = \sum_{k' \ge k} P(X = k')$$
The p-value for the chance that the draw is random measures enrichment
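The two formulas translate directly into code. The sketch below uses exact integer binomials from the standard library; the function names `hg_pmf` and `hg_tail` are chosen for this example.

```python
from math import comb


def hg_pmf(N, m, n, k):
    """P(X = k): exactly k red balls when drawing n of N balls, m of them red."""
    # Outside the feasible range of k the probability is zero.
    if k < max(0, n - (N - m)) or k > min(m, n):
        return 0.0
    return comb(m, k) * comb(N - m, n - k) / comb(N, n)


def hg_tail(N, m, n, k):
    """HG(N, m, n, k): probability of drawing at least k red balls."""
    return sum(hg_pmf(N, m, n, j) for j in range(max(k, 0), min(m, n) + 1))


if __name__ == "__main__":
    # Toy urn: 10 balls, 3 red, draw 4, observe 3 red.
    print(hg_tail(10, 3, 4, 3))
```

For the toy urn the only way to see at least 3 red balls is to draw all 3, which has probability 7/210 = 1/30.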
GO Enrichment
• I have a set of genes/proteins. Is it enriched for a particular function?
• One function: use Hypergeometric p-val
• Testing all functions: use HG but correct for multiple testing (Bonferroni/FDR)
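The two corrections named on the slide can be sketched as below. Bonferroni multiplies each p-value by the number of tests. For FDR the slide does not name a procedure, so the Benjamini-Hochberg step-up adjustment is assumed here as the usual choice. Function names are illustrative.

```python
def bonferroni(pvalues):
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    n = len(pvalues)
    return [min(1.0, p * n) for p in pvalues]


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values (FDR), returned in input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * n / rank)
        adjusted[index] = running_min
    return adjusted


if __name__ == "__main__":
    # Hypothetical HG p-values for four GO terms.
    pvals = [0.01, 0.04, 0.03, 0.005]
    print(bonferroni(pvals))
    print(benjamini_hochberg(pvals))
```

A term is then reported as enriched when its adjusted p-value falls below the chosen level, for example 0.05. Bonferroni controls the chance of any false positive and is stricter; FDR controls the expected fraction of false positives among the reported terms.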
The seminar
Guidelines
• You will need to dig deeply for the methods: supplements (on journal websites), previous papers, ...
• See seminar website for resources
• (Re)start with the basics: definitions, examples
• Papers contain more than you can cover: select your presentation focus wisely
Guidelines (2)
• Provide intuition and examples to motivate your method
• Add something original that you thought of (and don’t hide that!)
• Focus more on the algorithms than on the results (rule of thumb: 60-40)
Planning your presentation
• Start: 3:10, Break: 4:00-4:10, Talk end: 4:40, followed by 5 min for questions, then open discussion
• Use mostly slides, and the board sparingly
• Rehearse your talk!
• Plan contingencies in case you run out of time
• At the end, summarize the paper, repeating the main results. Discuss strengths, weaknesses, steps ahead.
The questionnaire
• Prepare a short (4-5 item) questionnaire on the paper
• Level should be basic, but require reading the paper
• Distribute it to students after the seminar
• Students will bring in their answers next week, and you will grade them.
Determining the final grade:
• Understanding the material: 35%
• Presenting the material: 35%
• Good choice of which material to present: 10%
• Active participation in the seminar (discussions and question sheets): 20%
• Bonus for originality: 10%
• Exceeding the allotted time: -10%!!
Lecture 2 - Outline
• Precision medicine
• One story
• Your opportunity to ask lots of questions!!!
Precision medicine
Precision medicine
• Precisely tailoring therapies to subcategories of disease, often defined by genomics
• Unlike “personalized medicine”, avoids the (mis)interpretation of per-patient drug development
• Medicine has always been personalized: the difference is new biomedical technologies
The Precision Medicine Initiative: A New National Effort Euan A. Ashley, JAMA. 2015;313(21):2119-2120. doi:10.1001/jama.2015.3595.
Problems with current medicine
• Even for successful drugs, the effect may be achieved in only a minority of the cohort
• High NNT: average number of patients needed to treat to help one patient (often >10 for drugs; >50 for prevention)
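NNT is conventionally computed as the reciprocal of the absolute risk reduction, that is, the event rate without treatment minus the event rate with treatment. A minimal sketch with hypothetical rates (the numbers below are invented for illustration and do not come from the lecture):

```python
def number_needed_to_treat(control_event_rate, treated_event_rate):
    """NNT = 1 / absolute risk reduction (rates given as fractions in [0, 1])."""
    absolute_risk_reduction = control_event_rate - treated_event_rate
    if absolute_risk_reduction <= 0:
        raise ValueError("treatment does not reduce the event rate")
    return 1.0 / absolute_risk_reduction


if __name__ == "__main__":
    # Hypothetical trial: 10% of untreated and 8% of treated patients have the event.
    print(number_needed_to_treat(0.10, 0.08))
```

With these made-up rates the risk reduction is 2 percentage points, so about 50 patients must be treated for one to benefit, which is the scale the slide quotes for prevention.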
PM and Genetic disease
• Cystic Fibrosis: mutated chloride channel. The drug ivacaftor helps in cases where the channel reaches the cell surface. The subclass of patients that can benefit from it was identified by a mutation.
• Six mutation-dependent categories identified
Towards Precision Medicine, Euan A. Ashley, Nat Rev Genetics 16
PM and Genetic disease (2)
• Precision oncology: identifying and targeting diseased pathways expressed in a tumor may help more than histology. “A better microscope”
• Study suggested that in 96% of undiagnosed primary tumors a genomic alteration could be identified and that in 85% of cases, it is potentially treatable by a known drug.
PM and Genetic disease (3)
• Clopidogrel: highly successful for heart attack prevention during surgery, but prescribing required prior testing for mutations in CYP2C19
• Prevention! Screening high-risk families for relevant mutations can be cost-effective and life saving
Examples of precision medicine
Pharmacogenomics
• Avoid the “one size fits all” in drug prescription
• The use of genomic information to individualize drug prescribing
• Pharmacology + Genomics
• Analyze how the genetic makeup of a person affects his/her drug response
• Develop effective, safe medications and doses that will be tailored to a person's genetic makeup
• Most genetic tests are now done after diagnosis and delay prescription; in the future: preemptive testing
• CPIC maintains a list of gene variants and actionable drugs
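Preemptive testing can be pictured as a lookup at prescription time: if the relevant genes were tested in advance, nothing delays the prescription. The sketch below is purely illustrative. The table holds only the clopidogrel/CYP2C19 pair mentioned earlier in the lecture, it is not CPIC data, and the names `GENES_TO_TEST` and `missing_tests` are hypothetical.

```python
# Illustrative drug -> genes table. Only the pair named in the lecture is listed;
# a real system would draw on a curated resource such as the CPIC list.
GENES_TO_TEST = {
    "clopidogrel": ["CYP2C19"],
}


def missing_tests(drug, genes_already_tested):
    """Genes that should still be tested before prescribing `drug`."""
    required = GENES_TO_TEST.get(drug.lower(), [])
    return [gene for gene in required if gene not in genes_already_tested]


if __name__ == "__main__":
    # Patient with no prior testing: the prescription waits for a CYP2C19 test.
    print(missing_tests("Clopidogrel", set()))
    # Patient tested preemptively: nothing is missing, no delay.
    print(missing_tests("Clopidogrel", {"CYP2C19"}))
```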
The inevitable conclusion
• To improve medicine and make it more precise and personal, we need to know the genome sequence of the individual and his/her medical history.
• To make use of such information we first need to collect such data on many patients and analyze it seriously
• The time is ripe to do it!
Projects around the world
• US precision medicine initiative: assembly of a 1M cohort of individuals willing to share their electronic medical record data and genomic data.
  – 1st generation: data from genotyping chips containing 1-2 million SNPs or enhanced exome sequencing.
  – 2nd generation: genome sequencing
Projects around the world (2) • United Kingdom 100,000 Genomes Project
Projects around the world (3)
• United Kingdom and Denmark already have large-scale biobanks.
• US Million Veteran Program reports recruitment currently at more than 300 000 individuals, with thousands having been sequenced and hundreds of thousands having been genotyped.
• USA eMERGE consortium combines electronic medical record data and genomic data from almost 200 000 individuals.
• USA Global Alliance for Genomics and Health aims for the establishment of a common framework of harmonized approaches for effective and responsible sharing of genomic and clinical data.
• National Human Genome Research Institute created the Electronic Medical Records and Genomics Network, which now includes 10 EHR-based DNA repositories and >350 000 subjects
Projects around the world (4)
• 23andme collected data from ~1 million individuals willing to contribute their time and DNA to research.
• Regeneron partnered with the Geisinger Health System to connect the exome sequence with EMR data from hundreds of thousands of patients.
• Kaiser Permanente Northern California Research Program on Genes, Environment and Health biobank included ~200K consented subjects with saliva or blood samples linked to comprehensive longitudinal EHR data and self-reported demographic and behavioral information. A subset of 110K+ of these individuals have genome-wide genotype and telomere length data available, forming the Genetic Epidemiology Research on Adult Health and Aging cohort (2014 numbers)
Electronic Health Records (EHRs)
• Created and maintained by HMOs, hospitals and clinical practice environments.
• The EHR is a mix of structured and narrative text data.
• Structured data: billing codes, laboratory tests, medication prescriptions, and certain standardized document elements (e.g., height, weight, vital signs, problem lists).
• EHR billing codes:
  – Diagnosis-related groups to categorize hospitalizations
  – International Classification of Diseases (ICD) codes to describe diagnoses and morbidities
  – Current Procedural Terminology codes to describe procedures.
• Narrative or text data: provider notes, especially those portions entered as “free” or unstructured text (the bulk of the data). Can be structured by NLP.
• Scanned data in analog form, e.g. radiographic images, scanned text documents. Cannot easily be searched for content.
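The three kinds of EHR content listed above can be pictured as one record type. This is a hypothetical data model for illustration only: the field names, the placeholder values and the keyword search are assumptions, not a real EHR schema, and real text mining would use NLP rather than substring matching.

```python
from dataclasses import dataclass, field


@dataclass
class EHRRecord:
    """Hypothetical record mixing structured, narrative and scanned data."""
    patient_id: str
    # Structured data: directly queryable.
    drg_codes: list[str] = field(default_factory=list)   # hospitalizations
    icd_codes: list[str] = field(default_factory=list)   # diagnoses
    cpt_codes: list[str] = field(default_factory=list)   # procedures
    lab_tests: dict[str, float] = field(default_factory=dict)
    medications: list[str] = field(default_factory=list)
    # Narrative data: free text, the bulk of the record.
    notes: list[str] = field(default_factory=list)
    # Scanned data: only file references; content is not searchable as text.
    scanned_files: list[str] = field(default_factory=list)


def search_notes(record, keyword):
    """Crude stand-in for NLP: notes that contain the keyword, case-insensitive."""
    return [note for note in record.notes if keyword.lower() in note.lower()]


if __name__ == "__main__":
    record = EHRRecord(
        patient_id="hypothetical-001",
        icd_codes=["<ICD code placeholder>"],
        notes=["Patient reports chest pain.", "No known allergies."],
        scanned_files=["xray_0001.png"],
    )
    print(search_notes(record, "chest"))
```

The structured fields can be filtered directly, the notes need text processing before they can be queried reliably, and the scanned files carry no searchable content at all.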
Mobile health
• Mobile wearable devices can measure people’s activity and other factors continuously and accurately.
• A natural target: physical fitness. Easily measurable and a greater risk factor for all-cause mortality than smoking, diabetes, and obesity
• MyHeartCounts: cardiovascular mobile health study; recruited 30 000 smartphone users in 2 weeks
What to sequence per individual
• Gene panel: capture and sequence selected genes (a few dozen to a few hundred) at high coverage (~100x)
• Exome sequencing: the exons and regulatory regions (~10 Mb, 10-50x)
• Whole genome sequencing (WGS, 30x)
  – Processing requires 1 TB
  – Final VCF file: 1 GB
• Tradeoffs: cost, speed, sensitivity, clinical standards
• Storage and analysis challenges!! The data size of genomics will soon surpass that of online video and particle physics
Puckelwartz et al., Supercomputing for the parallelization of whole genome analysis, Bioinformatics 2014
Stephens, Z. D. et al. Big Data: astronomical or genomical? PLOS Biol. 13, e1002195 (2015).
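The sequencing volume behind each option is roughly target size times coverage. The sketch below applies that rule to the figures quoted in the lecture (a ~3,000,000,000-letter genome at 30x, a ~10 Mb exome at 10-50x). It ignores read overhead, quality values and file formats, so it estimates sequenced bases, not bytes on disk.

```python
def raw_bases(target_size_bases, coverage):
    """Approximate number of bases sequenced: target size times mean coverage."""
    return target_size_bases * coverage


GENOME_SIZE = 3_000_000_000   # letters in the genome, as quoted in Lecture 1
EXOME_SIZE = 10_000_000       # ~10 Mb, as quoted on this slide

if __name__ == "__main__":
    print("WGS at 30x:  ", raw_bases(GENOME_SIZE, 30))
    print("Exome at 10x:", raw_bases(EXOME_SIZE, 10))
    print("Exome at 50x:", raw_bases(EXOME_SIZE, 50))
```

Whole genome sequencing at 30x comes to 90 billion bases per individual, a few hundred times more than an exome at the coverages quoted, which is one way to read the cost and storage tradeoffs listed above.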
The actionable genome
• 1500-2000 drugs FDA-approved to date
• Most drugs have a specific protein that they target or are otherwise linked to.
• In that case we say that the gene is druggable or actionable
• No. of druggable genes: ~4500
http://www.raps.org/Regulatory-Focus/
Mutation types • Somatic vs germline • SNV: single nucleotide variation • Indel: insertion or deletion • CNV: copy number variation • SV: structural variation
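For small variants, the SNV/indel distinction can be read off the reference and alternate alleles, as in the sketch below. This is an assumption-laden illustration: it takes a single REF/ALT pair as plain strings, and it says nothing about CNVs or SVs, which are usually called from other evidence, or about somatic versus germline status, which requires comparing samples. The function name and labels are made up for this example.

```python
def classify_small_variant(ref, alt):
    """Label a REF/ALT allele pair as SNV, insertion, deletion or other."""
    if ref == alt:
        return "no change"
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"
    if len(alt) > len(ref):
        return "indel (insertion)"
    if len(alt) < len(ref):
        return "indel (deletion)"
    return "multi-base substitution"


if __name__ == "__main__":
    for ref, alt in [("A", "G"), ("A", "ACT"), ("ACT", "A"), ("AC", "GT")]:
        print(ref, alt, classify_small_variant(ref, alt))
```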
Somatic mutation frequencies
Lawrence … Getz, Mutational heterogeneity in cancer, Nature 2013
Many other issues
• Security and privacy:
  – Need to maintain data security, patient privacy
  – De-identification of the data works to some extent
  – But a full genome uniquely identifies the individual
• Need informed consent of the individual to use data
• Should the person be informed of his/her results?
• EHRs: noisy, incomplete, many biases
• Genomic data: still lacks clinical-level standards
Sources
• Ashley, The Precision Medicine Initiative, JAMA 2015
• Ashley, Towards precision medicine, Nature Rev Genetics, Sept 2016
• Hall et al., Merging Electronic Health Record Data and Genomics for Cardiovascular Research, Circ Cardiovasc Genet, April 2016