Math, Stats and CS in Public Health and Medical Research
Jessica Minnier, OHSU, Lewis & Clark College Mathematics Colloquium, 3.19.14
Math, Stats and CS in Public Health and Medical Research
or, “What is Biostatistics?”
“Biostatistics (a portmanteau of biology and statistics; sometimes referred to as biometry or biometrics) is the application of statistics to a wide range of topics in biology.” – Wikipedia
“Bioinformatics is an interdisciplinary scientific field that develops methods for storing, retrieving, organizing and analyzing biological data” – Wikipedia
“Computational biology involves the development and application of data-analytical and theoretical methods, mathematical modeling and computational simulation techniques to the study of biological, behavioral, and social systems.” – Wikipedia
Sample (n = 1)
¨ L&C mathematics major (2007), CS minor
¨ PhD in Biostatistics (2007-2012)
¤ “Inference and Prediction for High Dimensional Data via Penalized Regression and Kernel Machine Methods”
¨ Postdoc (2012-2013)
¤ Cancer risk prediction with gene-environment interactions
¨ Assistant Professor (2013-now)
¤ Division of Biostatistics
¤ Department of Public Health & Preventive Medicine
¤ School of Medicine (soon to be School of Public Health)
¤ Oregon Health & Science University
Outline
¨ Biostatistics and Bioinformatics/Computational Biology
¤ More interesting definitions, research examples, case studies
¤ Types of careers
¨ My trajectory
¤ LC math to grad school to jobs
¨ Resources and advice
Biostatistics, in the news.
Comics from Jim Borgman; XKCD; also fun: http://stats.stackexchange.com/questions/423/what-is-your-favorite-data-analysis-cartoon
In summary: A poor understanding of statistics makes everyone look bad.
Biostatistics, in the news
Forbes
Applied math?
¨ Applied mathematics often studies deterministic models (engineering and mechanics, population models, cryptography)
¨ Some questions can’t be solved by deterministic models, but a partial answer can be given with statistics
¤ Does smoking cause lung cancer? (inference from observational studies)
¤ Is it going to rain tomorrow? (stochastic model)
¤ Do statins lower cholesterol? (randomized trial)
Rafa Irizarry’s math major talk: https://www.youtube.com/watch?v=gXeWdvHKTQQ
Example data
¨ Collection of measurements from a sampled population
¨ Measurements of a lab experiment
¨ Medical images of subjects’ brains over time
¨ Results of a clinical trial
¨ Gene expression from different types of cultured tissue
¨ Simulated data modeling HIV progression
¨ Values from electronic medical records sampled retrospectively
¨ 3 million genetic mutations from 20,000 subjects
Brian Caffo’s MOOC: Biostatistics Bootcamp I, lecture 1
Inform medical decisions
¨ A large clinical trial in 2002 by the Women’s Health Initiative was stopped early due to preliminary data showing that hormone replacement therapy had a negative health impact.
¨ These data contradicted prior evidence on the efficacy of HRT for postmenopausal women.
¨ Statistical decision to end the trial, prevent further harm
Brian Caffo’s MOOC: Biostatistics Bootcamp I, lecture 1; JAMA 2002;288(3):321-333
Inform medical decisions
¨ Guidelines for mammogram screening based on probabilities of false positives and negatives, cost-benefit analyses, survival analysis
¨ Analysis of adverse effects in a clinical trial determines drug safety, dosage, subpopulations
¨ Even the general public must weigh risk when making their own medical decisions
¨ Experts cannot make decisions without data
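The screening bullet above turns on how false positives interact with a low disease prevalence. A minimal sketch of that arithmetic using Bayes' rule follows. The prevalence, sensitivity, and specificity are hypothetical round numbers chosen only for illustration; they are not taken from any screening guideline or from this talk.

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """P(disease | positive test), computed with Bayes' rule."""
    true_positive = prevalence * sensitivity
    false_positive = (1.0 - prevalence) * (1.0 - specificity)
    return true_positive / (true_positive + false_positive)


# Hypothetical inputs: 1% prevalence, 90% sensitivity, 90% specificity.
ppv = positive_predictive_value(prevalence=0.01, sensitivity=0.90, specificity=0.90)
print(f"P(disease | positive screen) = {ppv:.3f}")
```

With these made-up inputs, most positive screens are false positives even though the test looks accurate, which is why guidelines weigh such probabilities against costs and benefits.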
Bioinformatics & Computational biology
¨ Sequencing the human genome (aligning, matching, searching)
¨ Algorithms for turning massive information from electronic medical records into useful predictors of disease progression
¨ Machine learning algorithms for risk prediction models with large and complex data (imaging, genetic)
¨ Analysis of networks (protein interactions, genetic pathways, social behavior influencing health outcomes)
¨ Simulation of complex data (methylation patterns in the genome)
Biomathematics
¨ Mathematical models to study infectious disease progression (in a population or in a body’s cells)
¨ Steady-state simulations of cancer cell growth
¨ Usually in joint biostatistics/biomathematics or applied mathematics departments, some epidemiology
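As one concrete instance of a mathematical model for infectious disease in a population, here is a sketch of the classic SIR (susceptible, infected, recovered) compartment model, stepped forward with Euler's method. The talk does not name a specific model, so the choice of SIR, the function name `simulate_sir`, and all parameter values are assumptions made for illustration.

```python
def simulate_sir(beta, gamma, s0, i0, r0, days, dt=0.1):
    """Euler integration of dS/dt = -beta*S*I/N, dI/dt = beta*S*I/N - gamma*I, dR/dt = gamma*I."""
    s, i, r = float(s0), float(i0), float(r0)
    n = s + i + r
    history = [(0.0, s, i, r)]
    steps = int(round(days / dt))
    for k in range(1, steps + 1):
        new_infections = beta * s * i / n * dt
        new_recoveries = gamma * i * dt
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        history.append((k * dt, s, i, r))
    return history


# Hypothetical parameters: transmission rate 0.3/day, recovery rate 0.1/day.
history = simulate_sir(beta=0.3, gamma=0.1, s0=990, i0=10, r0=0, days=160)
peak_time, _, peak_infected, _ = max(history, key=lambda row: row[2])
print(f"Infections peak near day {peak_time:.0f} at about {peak_infected:.0f} people")
```

This is a deterministic model; stochastic versions replace the fixed flows with random draws.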
Where do we work? (non-random sample = my classmates)
¨ Assistant professors: OHSU School of Medicine, UNC School of Medicine, UIUC Statistics Dept, University of New Mexico School of Medicine
¨ Consultant/Manager, Analysis Group
¨ Assistant Member, RAND Corporation
¤ Nonprofit global policy think tank
¨ Computational Biologist, Genentech
¨ Instructors: UPenn School of Medicine, Harvard School of Public Health
¨ Research Associate, Dana-Farber Cancer Institute
¨ Statistician, Partners HealthCare
¨ Other possibilities:
¤ Government: National Institutes of Health, Food and Drug Administration, Centers for Disease Control and Prevention, WHO, health departments in foreign countries
¤ Google, Intel, etc.
¤ Liberal arts colleges or smaller universities focused on teaching
¤ Pharma, consulting, labs, hospitals, hospital research centers, research institutes, universities
Real data, please?
¨ Two examples…
Case study 1: RNA-Seq Data
¨ RNA sequencing uses Next Generation Sequencing (NGS) to detect and quantify the RNA in a biological sample at a moment in time
¨ Studies the dynamic transcriptome of a cell
¨ The problem: compare gene expression in heart vs. brain tissue. Which genes are turned off in heart and on in brain?
Case study 1: RNA-Seq Data
¨ Step 1: Biologists collect samples, send to lab for sequencing
¨ Step 2: Genetic material is transformed into millions of ‘reads’
¤ AACTAGACCTGG
¨ Step 3: The reads are mapped to the genome, transformed into counts for each gene
¨ Step 4: The distribution of gene counts for different tissues is compared
RNA-seq: Step 3
¨ Step 3: The reads are mapped to the genome, transformed into counts for each gene
¨ Computational biologists developed fast searching algorithms to map a short read (likely containing errors) to a genome with millions of base pairs, much repetition, some variability (SNPs)
RNA-seq: Step 3
¨ Bowtie (Langmead 2009 Genome Biology) uses a Burrows-Wheeler index to cut mapping time to less than a day (it used to take days, if not months) http://www.cs.jhu.edu/~langmea/resources/lecture_notes/bwt_and_fm_index.pdf
¨ TopHat (Trapnell 2009 Bioinformatics) can detect splice junctions, which arise when a gene codes for multiple proteins via alternatively spliced mRNA
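To give a feel for the Burrows-Wheeler idea, here is a toy sketch of an FM index with backward search. It is not Bowtie's implementation: it finds exact matches only (no sequencing errors or SNPs), builds the suffix array naively, and stores everything uncompressed. The genome string and the function names `build_fm_index` and `find_read` are made up for illustration.

```python
def build_fm_index(genome):
    """Build a toy FM index: suffix array, first-column offsets, occurrence table."""
    text = genome + "$"  # '$' marks the end and sorts before A, C, G, T
    # Naive suffix array: fine for a toy genome, far too slow for a real one.
    suffix_array = sorted(range(len(text)), key=lambda k: text[k:])
    # Burrows-Wheeler transform: the character preceding each sorted suffix.
    bwt = "".join(text[k - 1] for k in suffix_array)
    alphabet = sorted(set(text))
    # first[c] = number of characters in the text that sort before c.
    first, total = {}, 0
    for ch in alphabet:
        first[ch] = total
        total += text.count(ch)
    # occ[c][i] = number of times c appears in bwt[:i].
    occ = {ch: [0] for ch in alphabet}
    for b in bwt:
        for ch in alphabet:
            occ[ch].append(occ[ch][-1] + (ch == b))
    return suffix_array, first, occ


def find_read(read, index):
    """Return the sorted start positions where the read matches the genome exactly."""
    suffix_array, first, occ = index
    lo, hi = 0, len(suffix_array)  # half-open range of matching suffixes
    for ch in reversed(read):  # backward search: extend the match leftwards
        if ch not in first:
            return []
        lo = first[ch] + occ[ch][lo]
        hi = first[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(suffix_array[lo:hi])


genome = "GGTTAACTAGACCTGGCATAACTAG"  # made-up toy genome
index = build_fm_index(genome)
print(find_read("AACTAG", index))        # a short read that occurs more than once
print(find_read("AACTAGACCTGG", index))  # the example read from Step 2
```

The search cost depends on the read length rather than the genome length, which is the property that makes this kind of index attractive for mapping millions of short reads.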
RNA-seq: Step 4
¨ Step 4: The distribution of gene counts for different tissues is compared
¨ Bioinformaticians and biostatisticians clean the data, normalize the data, and conduct statistical tests to determine if certain genes are expressed in one tissue differently than another
¨ Tests based on models: negative binomial distribution of counts, likelihood ratio tests
¨ Clustering algorithms
¨ Study genetic pathway enrichment, up- or down-regulated genes
¨ Biologists then study these genes more closely
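A sketch of the kind of test mentioned above: a likelihood ratio test for one gene, comparing read counts in two tissues under a negative binomial model. To keep it short, the dispersion parameter `size` is treated as known and library sizes are assumed equal; real RNA-seq tools estimate dispersion from the data and normalize first. The counts and the value of `size` are invented for illustration.

```python
import math


def nb_loglik(counts, mu, size):
    """Negative binomial log-likelihood with mean mu > 0 and variance mu + mu**2 / size."""
    total = 0.0
    for y in counts:
        total += (math.lgamma(y + size) - math.lgamma(size) - math.lgamma(y + 1)
                  + size * math.log(size / (size + mu))
                  + y * math.log(mu / (size + mu)))
    return total


def nb_likelihood_ratio_test(group_a, group_b, size):
    """Test H0: both groups share one mean, against H1: each group has its own mean."""
    def mean(values):
        return sum(values) / len(values)

    pooled = list(group_a) + list(group_b)
    # With size fixed, the maximum likelihood estimate of the mean is the sample mean.
    loglik_null = nb_loglik(pooled, mean(pooled), size)
    loglik_alt = (nb_loglik(group_a, mean(group_a), size)
                  + nb_loglik(group_b, mean(group_b), size))
    statistic = max(0.0, 2.0 * (loglik_alt - loglik_null))
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Hypothetical read counts for one gene in four heart and four brain samples.
heart = [5, 8, 6, 7]
brain = [120, 95, 140, 110]
statistic, p_value = nb_likelihood_ratio_test(heart, brain, size=10.0)
print(f"LRT statistic = {statistic:.1f}, p-value = {p_value:.2g}")
```

In practice this is repeated for every gene, so the p-values also need a multiple-testing adjustment.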
Heatmap and dendrogram from a clustering algorithm comparing genes in cultured mouse heart and brain tissues
Case study 2: Electronic Medical Records
¨ Medical and health records are becoming increasingly digitized
¨ EMR can contain records of health measurements (blood pressure), diagnoses (depression), treatments prescribed (statins), family history information, and even detailed descriptions of doctor visits (clinician notes)
¨ Among thousands of patients, some have dozens of records while others have just two
¨ Question: How to select subjects with bipolar disorder from a large pool of patients?
Case study 2: Electronic Medical Records
¨ Step 1: All the records must be collected, stored, put in a database, managed, tracked
¨ Step 2: A small subset must be read by a team of clinicians and scored as “case” versus “control”
¨ Step 3: Transform codes and paragraphs of words into predictors of disease
¨ Step 4: Determine important predictors of disease and build a prediction model with these variables
¨ Step 5: Validate the model, assess its performance
¨ Step 6: Implement the model in larger pool of subjects to select the bipolar cases for a future genetic study
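Step 5 asks how well the model separates cases from controls. One common summary, assumed here since the talk does not name a metric, is the area under the ROC curve (AUC): the probability that a randomly chosen case gets a higher predicted risk than a randomly chosen control. The scores and labels below are invented.

```python
def auc(scores, labels):
    """AUC as the fraction of case-control pairs ranked correctly (ties count one half)."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for c in cases:
        for k in controls:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(cases) * len(controls))


# Hypothetical predicted risks for five clinician-labeled patients (1 = case).
predicted = [0.9, 0.8, 0.4, 0.3, 0.2]
truth = [1, 0, 1, 0, 0]
print(f"AUC = {auc(predicted, truth):.3f}")
```

To be an honest estimate, the scores should come from patients who were not used to fit the model.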
EMR: Step 1
¨ Step 1: All the records must be collected, stored, put in a database, managed, tracked
¨ Computer scientists and bioinformaticians must perform these steps (SQL, anyone? MUMPS? Python, Perl…)
¨ Doing this efficiently is no small task
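A small sketch of what Step 1 can look like with SQL, using Python's built-in `sqlite3` module and an in-memory database. The table layout, column names, placeholder diagnosis codes, and rows are all hypothetical; a real EMR system is far larger and more complicated.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE visits (
           patient_id INTEGER,
           visit_date TEXT,
           diagnosis_code TEXT,
           note TEXT
       )"""
)

# Made-up records; the diagnosis codes are placeholders, not real billing codes.
rows = [
    (1, "2013-01-05", "CODE_A", "Patient reports manic episode"),
    (1, "2013-03-12", "CODE_A", "Lithium continued"),
    (1, "2013-06-30", "CODE_B", "Routine follow-up"),
    (2, "2013-02-20", "CODE_C", "Knee pain after running"),
    (3, "2013-04-01", "CODE_A", "Mood swings"),
    (3, "2013-09-15", "CODE_B", "Routine follow-up"),
]
conn.executemany("INSERT INTO visits VALUES (?, ?, ?, ?)", rows)

# Keep only patients with enough records to be worth scoring.
query = """SELECT patient_id, COUNT(*) AS n_visits
           FROM visits
           GROUP BY patient_id
           HAVING COUNT(*) >= ?
           ORDER BY patient_id"""
eligible = conn.execute(query, (2,)).fetchall()
print(eligible)
conn.close()
```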
EMR: Step 3
¨ Step 3: Transform codes and paragraphs of words into predictors of disease
¨ Natural language processing (NLP) is used by bioinformaticians to mine the paragraphs of data for terms that occur often in cases and less often in controls
¨ Certain words in a doctor’s note become possible predictors of disease
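A minimal sketch of the idea behind Step 3: count how often each word appears in notes from cases versus controls, and rank words by the difference. Real clinical NLP handles negation, misspellings, and medical vocabularies; this toy version only splits on letters. The notes and function names are invented for illustration.

```python
import re
from collections import Counter


def tokenize(note):
    """Lowercase a note and return its set of distinct words."""
    return set(re.findall(r"[a-z]+", note.lower()))


def document_frequency(notes):
    """Count, for each word, how many notes contain it."""
    counts = Counter()
    for note in notes:
        counts.update(tokenize(note))
    return counts


def rank_terms(case_notes, control_notes):
    """Rank words by (share of case notes containing them) minus (share of control notes)."""
    case_df = document_frequency(case_notes)
    control_df = document_frequency(control_notes)
    terms = set(case_df) | set(control_df)
    diffs = {
        t: case_df[t] / len(case_notes) - control_df[t] / len(control_notes)
        for t in terms
    }
    return sorted(diffs.items(), key=lambda kv: (-kv[1], kv[0]))


# Made-up clinician notes.
case_notes = [
    "Patient reports manic episode and little sleep",
    "History of manic episode, started lithium",
    "Mood swings; lithium continued",
]
control_notes = [
    "Patient reports knee pain after running",
    "Routine visit, mood stable",
    "Reports poor sleep due to work stress",
]
ranked = rank_terms(case_notes, control_notes)
for term, diff in ranked[:3]:
    print(f"{term}: {diff:+.2f}")
```

Each highly ranked word can then become a candidate predictor (for example, an indicator for whether it appears in a patient's notes).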
EMR: Steps 4-6
¨ Steps 4-6: Determine important predictors of disease, build a prediction model with these variables, assess/validate performance, implement model
¨ Biostatisticians develop
¤ high dimensional regression methods or machine learning methods
¤ to select important predictors and build models
¤ to predict outcomes based on a large number of variables (e.g., LASSO, support vector machines)
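To show how a LASSO penalty selects predictors, here is a small sketch of L1-penalized logistic regression fit by proximal gradient descent (a gradient step followed by soft-thresholding). It implements the plain LASSO rather than the adaptive LASSO, and it is not the method used in the study; the data, the penalty `lam`, and the step size are invented. The first predictor is meant to stand for a word that is more common in cases, the second for an uninformative word.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; values inside the band become exactly 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_lasso_logistic(X, y, lam, step=0.1, n_iter=2000):
    """Minimize mean logistic loss + lam * sum(|coef|); the intercept is not penalized."""
    n, p = len(X), len(X[0])
    intercept, coefs = 0.0, [0.0] * p
    for _ in range(n_iter):
        grad0, grad = 0.0, [0.0] * p
        for row, label in zip(X, y):
            linear = intercept + sum(b * x for b, x in zip(coefs, row))
            resid = sigmoid(linear) - label
            grad0 += resid / n
            for j in range(p):
                grad[j] += resid * row[j] / n
        intercept -= step * grad0
        coefs = [soft_threshold(b - step * g, step * lam) for b, g in zip(coefs, grad)]
    return intercept, coefs


# Hypothetical data: 4 cases then 4 controls; columns are two word indicators.
X = [[1, 1], [1, 0], [1, 1], [0, 0], [0, 1], [0, 0], [0, 1], [1, 0]]
y = [1, 1, 1, 1, 0, 0, 0, 0]
intercept, coefs = fit_lasso_logistic(X, y, lam=0.05)
print("coefficients:", [round(b, 3) for b in coefs])
```

Refitting over a grid of `lam` values and plotting each coefficient against `lam` gives a solution path; larger penalties push more coefficients to exactly zero.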
Regularized logistic regression with NLP predictors
Solution path for coefficients of predictors based on adaptive LASSO
Back to me.
¨ Began with Yung-Pin’s research project on CpG islands (related to new field of epigenetics)
¨ Enjoyed journal clubs/biostatistics meetings at OHSU
¨ Pure math vs. applied math vs. something else
¨ Did you want to be a doctor? Do you want to help people?
¨ Ended up in grad school, what did I learn?
Biostatistics grad school
¨ Statistics ≠ pure math!
¨ A master’s would have helped with intuition, but is not usually funded
¨ Research universities ≠ Lewis & Clark!
¨ Depend on self-teaching, your classmates, and especially the TAs to get by (when interviewing, meet the students!)
¨ Light teaching load, (hopefully) heavy collaborative/consulting load
¨ Lots of women in public health (like LC)!
¨ Grad school is always hard.
Bioinformatics grad school
¨ So far mostly the same
¨ More focused on biology
¨ Incorporating more biology training, wet labs
¨ Software/Bioconductor/R package development
¨ Diverging from traditional biostat?
Helpful classes
¨ Statistics and probability (obviously)
¨ All the computer science classes, ever (Python, more C!)
¨ Linear algebra
¨ Genetics (molecular biology would have been nice, though no biology required for biostat)
¨ Advanced calculus/real analysis (for theoretical classes such as Prob II and Inference II and writing my thesis, not always required)
¨ Discrete math
¨ Abstract Algebra (don’t worry, not required either)
¨ Liberal arts education in general
Helpful skills
¨ LaTeX
¨ R
¨ Python or Perl
¨ Unix, cluster/cloud computing
¨ Teaching/tutoring
¨ Research experience!
¨ Programming, software development
¨ C, Fortran
¨ GitHub
¨ You must enjoy talking to people, collaborating, explaining math/stat/cs to non-mathematical people!
Pros & Cons
Pros
¨ Interesting & meaningful research problems
¨ Always in demand, more so every day
¨ Collaborate with clinicians, biologists, researchers of all kinds
¨ Salary isn’t too shabby
Cons
¨ Soft money :(
¨ Grants, grants, always grants (but not necessarily our own)
Last thoughts
¨ Consider Epidemiology
¨ Applied vs. Theoretical research
¨ My day: mostly programming and writing code (cleaning data + analysis, simulations), lots of meetings, a bit of pen & paper research and thinking of new grants, reading articles, reading clinical trial protocols, sample size and power calculations
¨ This will vary depending on where you work
¨ Masters vs. PhD
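For a flavor of the sample size and power calculations mentioned above, here is a sketch of the textbook normal-approximation formula for comparing two group means with equal group sizes and a common standard deviation. The effect size, standard deviation, and function names are assumptions for illustration; real trial protocols usually need more careful calculations (t-distributions, dropout, interim analyses).

```python
import math
from statistics import NormalDist


def n_per_group(delta, sd, alpha=0.05, power=0.80):
    """Subjects per arm to detect a mean difference delta with a two-sided test."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1.0 - alpha / 2.0)
    z_beta = z.inv_cdf(power)
    return math.ceil(2.0 * ((z_alpha + z_beta) * sd / delta) ** 2)


def approx_power(n, delta, sd, alpha=0.05):
    """Approximate power with n subjects per arm (ignores the negligible opposite tail)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1.0 - alpha / 2.0)
    standard_error = sd * math.sqrt(2.0 / n)
    return z.cdf(abs(delta) / standard_error - z_alpha)


# Hypothetical design: detect a difference of half a standard deviation.
n = n_per_group(delta=0.5, sd=1.0)
print(f"{n} subjects per group, power = {approx_power(n, 0.5, 1.0):.3f}")
```

Halving the detectable difference roughly quadruples the required sample size, which is often the first thing a collaborator needs to hear.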
More talks like this
¨ Excellent overview of bioinformatics & computational biology fields and careers in medicine by Dr. Shannon McWeeney (http://www.biodevlab.org/) at OHSU https://ohsu.adobeconnect.com/_a46054336/p61byw86754/?launcher=false&fcsContent=true&pbMode=normal
¨ Rafa Irizarry’s (at HSPH http://rafalab.dfci.harvard.edu/) math major talk: https://www.youtube.com/watch?v=gXeWdvHKTQQ
¨ Plenty of interesting talks at JSM, the big statistical meeting/conference; it will be nearby in Seattle in August 2015. http://www.amstat.org/meetings/jsm/2014/index.cfm (in Boston this year); http://www.amstat.org/meetings/jsm.cfm
Learning resources
¨ Summer Institute for Training in Biostatistics (for undergrads) http://www.nhlbi.nih.gov/funding/training/redbook/sibsweb.htm
¤ U Wisc at Madison, Columbia, Emory, Boston U, NC State, U of Iowa, U of Minnesota, U of Pittsburgh (All of the websites have “What is Biostatistics?” pages)
¨ MOOCs (Massive Open Online Courses)
¤ Learn R http://www.flaviobarros.net/2014/03/14/online-multimedia-resources-learn-r
¤ Learn biostats https://www.coursera.org/course/biostats
¤ Learn statistical learning https://class.stanford.edu/courses/HumanitiesScience/StatLearning/Winter2014/about
¤ Learn bioinformatics http://www.langmead-lab.org/teaching-materials/ and http://rosalind.info/problems/list-view/
¨ UW’s Summer Institutes (scholarships for students)
¤ Statistical Genetics; Statistics and Modeling in Infectious Diseases; Statistics for Clinical Research
¨ Comprehensive list of job postings for statistics/biostatistics/bioinformatics: http://www.stat.ufl.edu/jobs/
The internet
¨ YouTube
¤ Rafa Irizarry’s YouTube channel (especially http://youtu.be/gXeWdvHKTQQ)
¨ Simply Statistics blog (http://simplystatistics.org/)
¨ R-bloggers
¨ Getting Genetics Done blog (http://gettinggeneticsdone.blogspot.com/)
¨ FiveThirtyEight (http://fivethirtyeight.com/)
¨ Neat summary measure of types of research done in various departments (biased toward east coast) https://muschellij2.shinyapps.io/ENAR_Over_Time/
Questions?