The Role of The Statisticians in Personalized Medicine: An Overview of Statistical Methods in...
Setia Pramana 1
The Role of Statisticians in Personalized Medicine: An Overview of Statistical Methods in Bioinformatics
Setia Pramana
STIS Jakarta, 8 August 2014
Setia Pramana 2
Outline
• Drug Development
• Personalized Medicine
• Central Dogma
• Microarray Data Analysis
• Next Generation Sequencing
• Summary
Setia Pramana 3
Drug Development
• Takes 10-15 years
• Costs millions of USD
• Who: pharmaceutical, biotechnology, and device companies; universities and government research agencies
• Regulators: the US Food and Drug Administration (FDA), BPOM
• Evaluate:
– Safety: can people take it?
– Efficacy: does it do anything in humans?
– Effectiveness: is it better than, or at least as good as, what is currently available?
– Do the benefits outweigh the risks?
Setia Pramana 4
Drug Development
• The stages:
- Drug discovery
- Pre-clinical development
- Clinical development (4 phases)
• Statisticians are involved in all stages
• The stages are highly regulated
• Results are based on the majority of patients
• But patients are not all the same!
Setia Pramana 5
Patient Heterogeneity
Setia Pramana 6
Patient Heterogeneity
• We are all different in:
- Physiological and demographic characteristics
- Medical history
- Genetic/genomic characteristics
• What works for a patient with one set of characteristics might not work for another!
Setia Pramana 7
Patient Heterogeneity
• “One size does not fit all”
• Use a patient’s characteristics to determine the best treatment for him/her
• Genomic information has great potential
--> Personalized medicine: “The right treatment for the right patient at the right time”
Setia Pramana 8
Personalized Medicine
• The ability to determine an individual's unique molecular characteristics and to use those genetic distinctions to diagnose more finely an individual's disease, select treatments that increase the chances of a successful outcome and reduce possible adverse reactions.
• Personalized medicine is also the ability to predict an individual's susceptibility to diseases and thus to try to shape steps that may help avoid a disease or reduce the extent to which an individual will experience it.
Setia Pramana 9
Subgroup Identification and Targeted Treatment
• Determine subgroups of patients who share certain characteristics and would respond better to a particular treatment
• Discover biomarkers that can identify the subgroup
• Focus on finding and treating a subgroup
Setia Pramana 10
Subgroup Identification and Targeted Treatment
Genotype → Phenotype → Intervention → Outcome
• Genotype: mutations/SNPs, gene/protein expression, epigenetics
• Phenotype: diseases, disability, etc.
• Intervention: drugs, therapies, regimens
Personalized medicine
Setia Pramana 11
Advanced Biomedical Technologies
• High-throughput microarrays and molecular imaging to monitor SNPs and gene and protein expression
• Next-generation sequencing
Setia Pramana 12
First… a Bit of Biology
Setia Pramana 13
Central Dogma
Source: http://compbio.pbworks.com
Setia Pramana 14
Gene
• The full DNA sequence of an organism is called its genome
• A gene is a segment of DNA that specifies the sequence of one or more proteins.
Setia Pramana 15
Genomics
• The study of all the genes of a cell or tissue at:
– the DNA level (genotype), e.g., GWAS, SNP, CNV, etc.
– the mRNA level (transcriptomics), i.e., gene expression
– the protein level (proteomics)
• Functional genomics: the study of the function of specific genes, their relation to diseases, their associated proteins, and their participation in biological processes.
Setia Pramana 16
Microarrays
Setia Pramana 17
Microarray
• DNA microarrays are a biotechnology that allows the expression of thousands of genes to be monitored simultaneously.
Setia Pramana 18
Applications
• Drugs with high efficacy and few or no side effects
• Disease-related genes
• Biological discovery
– new and better molecular diagnostics
– new molecular targets for therapy
– finding and refining biological pathways
• Molecular diagnosis of leukemia, breast cancer, etc.
• Appropriate treatment for a given genetic signature
• Potential new drug targets
Setia Pramana 19
Microarray
Overview of the process of generating high throughput gene expression data using microarrays.
Setia Pramana 20
Pipeline
• Experiment design, lab work, image processing
• Signal summarization (RMA, GCRMA)
• Normalization
• Data analysis:
– Differentially expressed genes
– Clustering
– Classification
– Etc.
• Networks / pathways (GSEA, etc.)
• Biological interpretation
Setia Pramana 21
Microarray Data Structure
Setia Pramana 22
Preprocessed Data
Genes   C1    C2    C3    T1    T2    T3
G8522   6.78  6.55  6.37  6.89  6.78  6.92
G8523   6.52  6.61  6.72  6.51  6.59  6.46
G8524   5.67  5.69  5.88  7.43  7.16  7.31
G8525   5.64  5.91  5.61  7.41  7.49  7.41
G8526   4.63  4.85  5.72  5.71  5.47  5.79
G8528   7.81  7.58  7.24  7.79  7.38  8.60
G8529   4.26  4.20  4.82  3.11  4.94  3.08
G8530   7.36  7.45  7.31  7.46  7.53  7.35
G8531   5.30  5.36  5.70  5.41  5.73  5.77
G8532   5.84  5.48  5.93  5.84  5.73  5.75
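To make the idea of finding differentially expressed genes concrete, here is a minimal sketch that ranks the genes in the preprocessed table by an ordinary Welch t-statistic (treated T1-T3 versus control C1-C3). It is only an illustration: the names `expression`, `welch_t`, and `ranked` are made up for this example, and a plain t-statistic stands in for the moderated tests (SAM, limma) that are normally preferred with so few arrays.

```python
import math
from statistics import mean, variance

# Rows copied from the preprocessed table: (controls C1-C3, treated T1-T3).
expression = {
    "G8522": ([6.78, 6.55, 6.37], [6.89, 6.78, 6.92]),
    "G8523": ([6.52, 6.61, 6.72], [6.51, 6.59, 6.46]),
    "G8524": ([5.67, 5.69, 5.88], [7.43, 7.16, 7.31]),
    "G8525": ([5.64, 5.91, 5.61], [7.41, 7.49, 7.41]),
    "G8526": ([4.63, 4.85, 5.72], [5.71, 5.47, 5.79]),
    "G8528": ([7.81, 7.58, 7.24], [7.79, 7.38, 8.60]),
    "G8529": ([4.26, 4.20, 4.82], [3.11, 4.94, 3.08]),
    "G8530": ([7.36, 7.45, 7.31], [7.46, 7.53, 7.35]),
    "G8531": ([5.30, 5.36, 5.70], [5.41, 5.73, 5.77]),
    "G8532": ([5.84, 5.48, 5.93], [5.84, 5.73, 5.75]),
}


def welch_t(control, treated):
    """Welch two-sample t-statistic for (treated mean - control mean)."""
    standard_error = math.sqrt(
        variance(control) / len(control) + variance(treated) / len(treated)
    )
    return (mean(treated) - mean(control)) / standard_error


results = []
for gene, (control, treated) in expression.items():
    difference = mean(treated) - mean(control)
    results.append((gene, difference, welch_t(control, treated)))

# Genes with the largest absolute t-statistic come first.
ranked = sorted(results, key=lambda row: abs(row[2]), reverse=True)
for gene, difference, t_stat in ranked:
    print(f"{gene}  diff={difference:+.2f}  t={t_stat:+.2f}")
```

On these ten rows, G8525 and G8524 stand out clearly from the rest, which matches what the eye sees in the table: both are about 1.5 units higher in the treated arrays.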
Setia Pramana 23
Challenges
• Massive data, difficult to visualize
• Too few samples (columns), usually < 100
• Too many genes (rows), usually > 10,000
• Testing so many genes is likely to produce false positives
• For exploration, a large set of all relevant genes is desired
• For diagnostics or identification of therapeutic targets, the smallest set of genes is needed
• The model needs to be explainable to biologists
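The false-positive problem is usually handled with a multiple testing adjustment. The slides do not say which one is used, so the sketch below assumes the Benjamini-Hochberg false discovery rate procedure, a common choice in microarray analysis. The function name and the p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical raw p-values for four genes.
raw = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(raw)
print(adjusted)
```

A gene is then called significant when its adjusted p-value falls below the chosen false discovery rate, for example 0.05.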
Setia Pramana 24
Types of Microarray Data Analysis
• Gene selection
– find genes for therapeutic targets
• Classification (supervised)
– identify disease (biomarker study)
– predict outcome / select best treatment
• Clustering (unsupervised)
– find new biological classes / refine existing ones
– understand regulatory relationships/pathways
– exploration
Setia Pramana 25
Gene Selection
• Modified t-test
• Significance Analysis of Microarrays (SAM)
• Limma (linear models for microarray data)
• Linear mixed model
• Logistic regression
• Lasso (least absolute shrinkage and selection operator)
• Elastic net
• Etc.
Setia Pramana 26
Visualization
• Dimensionality reduction
• PCA (Principal Component Analysis)
• Biplot
• Heatmap
• Multidimensional scaling
• Etc.
Setia Pramana 27
Clustering
What to cluster:
• Cluster the genes
• Cluster the arrays/conditions
• Cluster both simultaneously
Methods:
• K-means
• Hierarchical
• Biclustering algorithms
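As an illustration of clustering genes, here is a small k-means sketch in plain Python. It takes four rows from the preprocessed data table, centers each gene on its own mean so that the clustering reflects the expression pattern rather than the overall level, and splits them into two clusters. The deterministic farthest-first initialization and all function names are choices made for this example only.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, n_iter=20):
    """Plain k-means with a deterministic farthest-first initialization."""
    centers = [points[0]]
    while len(centers) < k:
        # Add the point that is farthest from its nearest existing center.
        centers.append(
            max(points, key=lambda p: min(squared_distance(p, c) for c in centers))
        )
    for _ in range(n_iter):
        labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        new_centers = []
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers


# Four genes from the preprocessed table (C1-C3, T1-T3).
genes = {
    "G8523": [6.52, 6.61, 6.72, 6.51, 6.59, 6.46],
    "G8524": [5.67, 5.69, 5.88, 7.43, 7.16, 7.31],
    "G8525": [5.64, 5.91, 5.61, 7.41, 7.49, 7.41],
    "G8530": [7.36, 7.45, 7.31, 7.46, 7.53, 7.35],
}
# Center each gene on its own mean.
centered = [[v - sum(row) / len(row) for v in row] for row in genes.values()]

labels, centers = kmeans(centered, k=2)
for name, label in zip(genes, labels):
    print(name, "-> cluster", label)
```

The two genes that rise under treatment (G8524, G8525) end up in one cluster and the two flat genes in the other.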
Setia Pramana 28
Clustering
• Cluster or Classify genes according to tumors
• Cluster tumors according to genes
Setia Pramana 30
Classification
• Linear discriminant analysis
• K-nearest neighbor
• Logistic regression
• L1-penalized logistic regression
• Neural network
• Support vector machines
• Random forest
• Etc.
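Of the classifiers listed, k-nearest neighbor is the simplest to write down. The sketch below is a generic illustration, not the classifier used in any study mentioned in these slides: the two-gene expression profiles and the labels `subtype_1` and `subtype_2` are invented.

```python
import math
from collections import Counter


def knn_predict(train_profiles, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training profiles."""
    neighbors = sorted(
        zip(train_profiles, train_labels),
        key=lambda pair: math.dist(pair[0], query),  # Euclidean distance
    )[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical training data: expression of two genes per sample.
train_profiles = [
    [1.0, 1.0], [1.2, 0.9], [0.8, 1.1],   # samples labeled subtype_1
    [3.0, 3.0], [3.1, 2.9], [2.9, 3.2],   # samples labeled subtype_2
]
train_labels = ["subtype_1"] * 3 + ["subtype_2"] * 3

print(knn_predict(train_profiles, train_labels, [1.1, 1.0]))
print(knn_predict(train_profiles, train_labels, [3.0, 3.1]))
```

In practice the features would be normalized first, since Euclidean distance is sensitive to the scale of each gene.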
Classification of Malaria Subtypes
Aim: To improve understanding of host protein profiles during disease progression, especially in children.
• Identify a panel of proteins that can distinguish between different subtypes.
• Implement L1-penalized logistic regression.
Penalized Logistic Regression
• Logistic regression is a supervised method for binary or multi-class classification.
• In high-dimensional data (e.g., microarrays) there are more variables than observations, so classical logistic regression does not work.
• Other problems: variables are correlated (multicollinearity), and the model overfits.
• Solution: introduce a penalty for model complexity.
35
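Penalized Logistic Regression
Logistic model:
$$\Pr(y_i = 1 \mid x_i) = \pi_i = \frac{\exp(\beta_0 + x_i^\top \beta)}{1 + \exp(\beta_0 + x_i^\top \beta)}$$
Maximize the log-likelihood:
$$\ell(\beta_0, \beta) = \sum_{i=1}^{n} \left[ y_i \log \pi_i + (1 - y_i) \log(1 - \pi_i) \right]$$
• L1-penalization (Lasso): maximize the penalized log-likelihood
$$\ell(\beta_0, \beta) - \lambda \sum_{j=1}^{p} |\beta_j|$$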
36
• Shrinks all regression coefficients (β) toward zero and sets some of them exactly to zero.
• Performs parameter estimation and variable selection at the same time.
• The choice of λ is crucial; it is chosen via k-fold cross-validation.
• The procedure is implemented in an R package called penalized.
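To show how the L1 penalty sets coefficients exactly to zero, here is a minimal sketch of L1-penalized logistic regression fitted by proximal gradient descent (a gradient step followed by soft-thresholding). This is not the algorithm of the R package penalized. The function names, step size, toy data, and the choice to scale the log-likelihood by 1/n are all assumptions made for illustration.

```python
import math


def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(value, threshold):
    """Shrink toward zero; values within the threshold become exactly 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_l1_logistic(X, y, lam, step=0.5, n_iter=500):
    """Minimize the average negative log-likelihood + lam * sum(|beta_j|).

    The intercept b0 is not penalized.
    """
    n, p = len(X), len(X[0])
    b0, beta = 0.0, [0.0] * p
    for _ in range(n_iter):
        probs = [sigmoid(b0 + sum(b * x for b, x in zip(beta, row))) for row in X]
        resid = [pi - yi for pi, yi in zip(probs, y)]
        grad0 = sum(resid) / n
        grad = [sum(r * row[j] for r, row in zip(resid, X)) / n for j in range(p)]
        b0 -= step * grad0
        beta = [
            soft_threshold(bj - step * gj, step * lam) for bj, gj in zip(beta, grad)
        ]
    return b0, beta


# Toy data (hypothetical): feature 0 carries signal, feature 1 is pure noise
# that takes the values +1 and -1 equally often within each label.
base = [(-2.0, 0), (-1.0, 0), (-0.5, 1), (0.5, 0), (1.0, 1), (2.0, 1)]
X, y = [], []
for x0, label in base:
    for x1 in (1.0, -1.0):
        X.append([x0, x1])
        y.append(label)

b0, beta = fit_l1_logistic(X, y, lam=0.1)
_, beta_heavy = fit_l1_logistic(X, y, lam=1.0)
print("lam=0.1:", beta)        # signal kept, noise coefficient exactly zero
print("lam=1.0:", beta_heavy)  # penalty large enough to zero out everything
```

The second fit illustrates why the choice of λ matters: too large a penalty removes the informative variable as well, which is what cross-validation is meant to guard against.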
37
L1 Penalized Logistic Regression
Classification of Severe Malaria Anemia vs. Uncomplicated Malaria group
38
AUC: 0.86
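For readers unfamiliar with the AUC: it is the probability that a randomly chosen positive case receives a higher predicted score than a randomly chosen negative case, with ties counted as one half. The sketch below computes it directly from that definition; the labels and scores are invented and are not the malaria data.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Hypothetical predictions: three of the four positive/negative pairs are ordered correctly.
example_labels = [0, 0, 1, 1]
example_scores = [0.1, 0.4, 0.35, 0.8]
print(auc(example_labels, example_scores))  # 0.75
```

An AUC of 0.5 corresponds to random guessing and 1.0 to perfect separation.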
Setia Pramana 39
Dose-response Microarray Studies
Setia Pramana 40
Dose-response Microarray Studies
Implemented in the R packages IsoGene and IsoGeneGUI.
Setia Pramana 41
Dose-response Microarray Studies
Setia Pramana 42
Gene Signature for Prostate Cancer
Setia Pramana 43
Gene Signature for Prostate Cancer
Setia Pramana 44
Gene Signature for Prostate Cancer
Setia Pramana 45
Next Generation Sequencing
Setia Pramana 46
Next Generation Sequencing
Reading the order of bases of DNA fragments
Setia Pramana 48
NGS is used for:
• Whole-genome re-sequencing
• Metagenomics
• Cancer genomics
• Exome sequencing (targeted)
• RNA sequencing (RNA-seq)
• ChIP-seq
• Genomic epidemiology
Setia Pramana 49
Next Generation Sequencing
• Produces massive data, and fast
• The problem is storage and analysis
RNA-seq Pipeline
• Align to a reference genome using TopHat.
[Figure: alignment to the reference. Source: Trapnell et al., 2010]
Pramana et al., NBBC 2013 50
RNA-seq Pipeline
• Measure gene expression using Cufflinks: FPKM (Fragments Per Kilobase of transcript per Million mapped reads).
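The FPKM definition can be written as FPKM = 10^9 × C / (N × L), where C is the number of fragments mapped to the transcript, N the total number of mapped fragments in the sample, and L the transcript length in base pairs. The sketch below shows only this normalization; Cufflinks itself estimates transcript abundances with a statistical model, so this is not its implementation, and the numbers are made up.

```python
def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments Per Kilobase of transcript per Million mapped fragments."""
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)


# Hypothetical example: 500 fragments on a 2,000 bp transcript
# in a sample with 20 million mapped fragments.
value = fpkm(fragments=500, transcript_length_bp=2_000, total_mapped_fragments=20_000_000)
print(value)  # 12.5
```

Dividing by length and by sequencing depth is what makes values comparable across transcripts of different sizes and across samples with different numbers of reads.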
[Figure: isoform/transcript-level FPKM and gene-level FPKM for a reference gene with two transcripts (Transcript 1, Transcript 2) across Samples 1-3. Source: Trapnell et al., 2013]
Pramana et al., NBBC 2013 51
Setia Pramana 52
Setia Pramana 53
Subtype-specific Transcripts/Isoforms
• Breast invasive carcinoma (BRCA) from The Cancer Genome Atlas (TCGA) project.
• 329 tumor samples.
• Platform: Illumina
• Paired-end reads (length 50 bp).
• 20-100 million reads
Subtype-specific Transcripts/Isoforms
• To discover transcripts/isoforms that are significantly (highly or lowly) expressed only in a certain cancer subtype.
Pramana et al., NBBC 2013 54
Analysis Flow
329 TCGA samples
• Discovery set: 179 samples
• Validation set: 150 TCGA samples, plus external samples
Classification into molecular subtypes
- Use Swedish microarray data as training data
- Based on gene-level FPKM
- Median and variance normalization
- K-nearest neighbor
- Classifier gene selection
Subtype-specific transcripts
- Transcript-level FPKM of all genes
- For each transcript: robust contrast tests
- Multiple testing adjustment
Pramana et al., NBBC 2013 55
Setia Pramana 56
Subtype-specific Transcripts/Isoforms
Setia Pramana 57
Subtype-specific Transcripts/Isoforms
Setia Pramana 58
Subtype-specific Transcripts/Isoforms
Setia Pramana 59
Software?
• R is growing rapidly, especially in bioinformatics
– Statistics, data analysis, machine learning
– Free
– High quality
– Open source
– Extensible (you can submit and publish your own package!)
– Can be integrated with other languages (C/C++, Java, Python)
– Large, active user community
– Command-based (a drawback)
Setia Pramana 60
My Current Research
• Integration of Somatic Mutation, Expression and Functional Data Reveals Potential Driver Genes Predictive of Breast Cancer Survival (KI, Ewha Univ, Brescia Univ).
• Molecular Subtyping of Breast Cancers using RNA-Sequence Data (KI, Ewha Univ, Brescia Univ).
• The genomic surveillance of drug-resistant tuberculosis (FKUI, NUS).
• Genomics screening for prostate cancer (KI).
• Molecular subtyping of malaria (KI, Scilab, Eijkman Inst.).
• Health Technology Assessment (FKUI, Depkes).
Setia Pramana 61
Summary
• Statistics plays an important role in developing personalized medicine
• This multidisciplinary field needs collaboration among different experts
• Bioinformatician is one of the sexiest jobs
• Big Data in medicine: numerous opportunities to be explored and discovered
Setia Pramana 62
Thank you for your attention….