Math, Stats and CS in Public Health and Medical Research

Jessica Minnier, OHSU, Lewis & Clark College Mathematics Colloquium, 3.19.14

Math, Stats and CS in Public Health and Medical Research or, “What is Biostatistics?”

“Biostatistics (a portmanteau of biology and statistics; sometimes referred to as biometry or biometrics) is the application of statistics to a wide range of topics in biology.” – Wikipedia

“Bioinformatics is an interdisciplinary scientific field that develops methods for storing, retrieving, organizing and analyzing biological data” – Wikipedia

“Computational biology involves the development and application of data-analytical and theoretical methods, mathematical modeling and computational simulation techniques to the study of biological, behavioral, and social systems.” – Wikipedia

Sample (n = 1)

¨  L&C mathematics major (2007), CS minor

¨  PhD in Biostatistics (2007-2012)

¤  “Inference and Prediction for High Dimensional Data via Penalized Regression and Kernel Machine Methods”

¨  Postdoc (2012-2013)

¤  Cancer risk prediction with gene-environment interactions

¨  Assistant Professor (2013-now)

¤  Division of Biostatistics

¤  Department of Public Health & Preventive Medicine

¤  School of Medicine (soon to be School of Public Health)

¤  Oregon Health & Science University

Outline

¨  Biostatistics and Bioinformatics/Computational Biology

¤  More interesting definitions, research examples, case studies

¤  Types of careers

¨  My trajectory

¤  LC math to grad school to jobs

¨  Resources and advice

Biostatistics, in the news.

Comics from Jim Borgman; XKCD; also fun: http://stats.stackexchange.com/questions/423/what-is-your-favorite-data-analysis-cartoon

In summary: A poor understanding of statistics makes everyone look bad.

Biostatistics, in the news

Forbes


Applied math?

¨  Applied mathematics often studies deterministic models (engineering and mechanics, population models, cryptography)

¨  Some questions can’t be solved by deterministic models, but a partial answer can be given with statistics

¤  Does smoking cause lung cancer? (inference from observational studies)

¤  Is it going to rain tomorrow? (stochastic model)

¤  Do statins lower cholesterol? (randomized trial)

Rafa Irizarry’s math major talk: https://www.youtube.com/watch?v=gXeWdvHKTQQ

Example data

¨  Collection of measurements from a sampled population

¨  Measurements of a lab experiment

¨  Medical images of subjects’ brains over time

¨  Results of a clinical trial

¨  Gene expression from different types of cultured tissue

¨  Simulated data modeling HIV progression

¨  Values from electronic medical records sampled retrospectively

¨  3 million genetic mutations from 20,000 subjects

Brian Caffo’s MOOC: Biostatistics Bootcamp I, lecture 1

Inform medical decisions

¨  A large clinical trial in 2002 by the Women’s Health Initiative was stopped early due to preliminary data showing that hormone replacement therapy had a negative health impact.

¨  These data contradicted prior evidence on the efficacy of HRT for postmenopausal women.

¨  Statistical decision to end the trial, prevent further harm

Brian Caffo’s MOOC: Biostatistics Bootcamp I, lecture 1; JAMA 2002;288(3):321-333

Inform medical decisions

¨  Guidelines for mammogram screening based on probabilities of false positives and negatives, cost-benefit analyses, survival analysis

¨  Analysis of adverse effects in a clinical trial determines drug safety, dosage, subpopulations

¨  Even the general public must weigh risk when making their own medical decisions

¨  Experts cannot make decisions without data
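The screening bullet above depends on how false positives and false negatives translate into what a test result means for one patient. A minimal sketch of that arithmetic using Bayes' rule follows. The sensitivity, specificity, and prevalence values are hypothetical round numbers chosen for illustration, not figures from any screening guideline.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """P(disease | positive test), by Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


def negative_predictive_value(sensitivity, specificity, prevalence):
    """P(no disease | negative test), by Bayes' rule."""
    true_neg = specificity * (1.0 - prevalence)
    false_neg = (1.0 - sensitivity) * prevalence
    return true_neg / (true_neg + false_neg)


# Hypothetical inputs (assumptions for illustration only).
ppv = positive_predictive_value(sensitivity=0.90, specificity=0.90, prevalence=0.01)
npv = negative_predictive_value(sensitivity=0.90, specificity=0.90, prevalence=0.01)
print(f"PPV = {ppv:.3f}, NPV = {npv:.4f}")
```

With these made-up inputs, only about 1 in 12 positive results is a true case, even though the test is right 90% of the time in both directions. That gap is why screening for a rare condition needs careful cost-benefit analysis.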

Bioinformatics & Computational biology

¨  Sequencing the human genome (aligning, matching, searching)

¨  Algorithms for turning massive information from electronic medical records into useful predictors of disease progression

¨  Machine learning algorithms for risk prediction models with large and complex data (imaging, genetic)

¨  Analysis of networks (protein interactions, genetic pathways, social behavior influencing health outcomes)

¨  Simulation of complex data (methylation patterns in the genome)

Biomathematics

¨  Mathematical models to study infectious disease progression (in a population or in a body’s cells)

¨  Steady-state simulations of cancer cell growth

¨  Usually in joint biostatistics/biomathematics or applied mathematics departments, some epidemiology

Where do we work? (non-random sample = my classmates)

¨  Assistant professors: OHSU School of Medicine, UNC School of Medicine, UIUC Statistics Dept, University of New Mexico School of Medicine

¨  Consultant/Manager, Analysis Group

¨  Assistant Member, RAND Corporation

¤  Nonprofit global policy think tank

¨  Computational Biologist, Genentech

¨  Instructors: UPenn School of Medicine, Harvard School of Public Health

¨  Research Associate, Dana Farber Cancer Institute

¨  Statistician, Partners Health Care

¨  Other possibilities:

¤  Government: National Institutes of Health, Food and Drug Administration, Centers for Disease Control and Prevention, WHO, health departments in foreign countries

¤  Google, Intel, etc.

¤  Liberal arts colleges or smaller universities focused on teaching

¤  Pharma, Consulting, Labs, Hospitals, Hospital Research Centers, Research Institutes, Universities

Real data, please?

¨  Two examples…

Case study 1: RNA-Seq Data

¨  RNA sequencing uses Next Generation Sequencing (NGS) to measure the presence and quantity of RNA in a biological sample at a given moment in time

¨  Studies the dynamic transcriptome of a cell

¨  The problem: How does gene expression compare in heart vs. brain tissue? Which genes are turned off in heart and on in brain?

Case study 1: RNA-Seq Data

¨  Step 1: Biologists collect samples, send to lab for sequencing

¨  Step 2: Genetic material is transformed into millions of ‘reads’

¤  AACTAGACCTGG

¨  Step 3: The reads are mapped to the genome, transformed into counts for each gene

¨  Step 4: The distribution of gene counts for different tissues is compared

RNA-seq: Step 3

¨  Step 3: The reads are mapped to the genome, transformed into counts for each gene

¨  Computational biologists developed fast searching algorithms to map a short read (likely containing errors) to a genome with billions of base pairs, much repetition, some variability (SNPs)

RNA-seq: Step 3

¨  Bowtie (Langmead 2009 Genome Biology) incorporated Burrows-Wheeler indexing to shorten the mapping to less than a day (it used to take days, if not months) http://www.cs.jhu.edu/~langmea/resources/lecture_notes/bwt_and_fm_index.pdf

¨  TopHat (Trapnell 2009 Bioinformatics) can detect splicing junctions where certain genes code for multiple proteins via alternatively spliced mRNA
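To give a feel for the Burrows-Wheeler indexing idea mentioned above, here is a toy sketch of the transform and of exact-match counting by backward search. It is not Bowtie's implementation: the reference string is made up, the suffix sort is naive, and real aligners add compressed lookup tables and tolerance for mismatches.

```python
from collections import Counter


def suffix_array(text):
    """Start positions of all suffixes in sorted order (naive; toy inputs only)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt_from_suffix_array(text, sa):
    """Burrows-Wheeler transform: the character preceding each sorted suffix."""
    return "".join(text[i - 1] for i in sa)


def count_occurrences(bwt, pattern):
    """Count exact matches of pattern using backward search on the BWT."""
    # first[c] = number of characters in the text that sort before c
    counts = Counter(bwt)
    first, total = {}, 0
    for c in sorted(counts):
        first[c] = total
        total += counts[c]

    def occ(c, k):
        # Number of c's in bwt[:k]; real tools precompute this table.
        return bwt[:k].count(c)

    lo, hi = 0, len(bwt)  # half-open range of sorted suffixes still matching
    for c in reversed(pattern):
        if c not in first:
            return 0
        lo = first[c] + occ(c, lo)
        hi = first[c] + occ(c, hi)
        if lo >= hi:
            return 0
    return hi - lo


# Hypothetical toy "genome"; "$" marks the end and sorts before A, C, G, T.
reference = "GATTACAGATTACA" + "$"
sa = suffix_array(reference)
bwt = bwt_from_suffix_array(reference, sa)
print(bwt, count_occurrences(bwt, "TTA"))
```

The search processes the read from its last character to its first and never scans the reference itself, which is why the index pays off when millions of reads are mapped against one genome.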

RNA-seq: Step 4

¨  Step 4: The distribution of gene counts for different tissues is compared

¨  Bioinformaticians and biostatisticians clean the data, normalize the data, and conduct statistical tests to determine whether certain genes are expressed differently in one tissue than in another

¨  Tests based on models: negative binomial distribution of counts, likelihood ratio tests

¨  Clustering algorithms

¨  Study genetic pathway enrichment, up- or down-regulated genes

¨  Biologists then study these genes more closely
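As a sketch of the model-based tests mentioned above, one common way to write the negative binomial model and its likelihood ratio test is shown below. The notation is introduced here for illustration and is not taken from the slides: $Y_{gj}$ is the read count for gene $g$ in sample $j$, $s_j$ is a normalization factor for sample $j$, $\phi_g$ is a gene-specific dispersion, and $x_j$ indicates brain (1) versus heart (0) tissue.

```latex
\begin{align*}
Y_{gj} &\sim \mathrm{NB}(\mu_{gj}, \phi_g), &
\mathrm{Var}(Y_{gj}) &= \mu_{gj} + \phi_g \mu_{gj}^2, \\
\log \mu_{gj} &= \log s_j + \beta_{g0} + \beta_{g1} x_j, \\
H_0 &: \beta_{g1} = 0, &
\Lambda_g &= 2\left[\ell_g(\hat\beta_{g0}, \hat\beta_{g1}) - \ell_g(\tilde\beta_{g0}, 0)\right]
  \overset{\text{approx.}}{\sim} \chi^2_1 .
\end{align*}
```

Here $\ell_g$ is the log-likelihood for gene $g$, hats denote unrestricted maximum likelihood estimates, and the tilde denotes the estimate under $H_0$. Because thousands of genes are tested at once, the resulting p-values are typically adjusted for multiple comparisons.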

Heatmap and dendrogram from clustering algorithm comparing genes in cultured mouse heart and brain tissues

Case study 2: Electronic Medical Records

¨  Medical and health records are becoming increasingly digitized

¨  EMR can contain records of health measurements (blood pressure), diagnoses (depression), treatments prescribed (statins), family history information, and even detailed descriptions of doctor visits (clinician notes)

¨  Thousands of patients can have dozens of records, some can have just 2

¨  Question: How to select subjects with bipolar disorder from a large pool of patients?

Case study 2: Electronic Medical Records

¨  Step 1: All the records must be collected, stored, put in a database, managed, tracked

¨  Step 2: A small subset must be read by a team of clinicians and scored as “case” versus “control”

¨  Step 3: Transform codes and paragraphs of words into predictors of disease

¨  Step 4: Determine important predictors of disease and build a prediction model with these variables

¨  Step 5: Validate the model, assess its performance

¨  Step 6: Implement the model in a larger pool of subjects to select the bipolar cases for a future genetic study

EMR: Step 1

¨  Step 1: All the records must be collected, stored, put in a database, managed, tracked

¨  Computer scientists and bioinformaticians must perform these steps (SQL, anyone? MUMPS? Python, perl…)

¨  Efficiency in this setting is no small task

EMR: Step 3

¨  Step 3: Transform codes and paragraphs of words into predictors of disease

¨  Natural language processing (NLP) is used by bioinformaticians to mine the paragraphs of data for terms that occur often in cases and less often in controls

¨  Certain words in a doctor’s note become possible predictors of disease
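A very small sketch of the term-mining idea in Step 3: count how many notes mention each term, then rank terms by how much more often they appear in case notes than in control notes. The notes, the function names, and the scoring rule are all made up for illustration. Real clinical NLP pipelines handle negation, misspellings, and medical vocabularies.

```python
import re
from collections import Counter


def document_frequency(notes):
    """For each term, count how many notes mention it at least once."""
    df = Counter()
    for note in notes:
        terms = set(re.findall(r"[a-z]+", note.lower()))
        df.update(terms)
    return df


def rank_terms(case_notes, control_notes, min_notes=2):
    """Rank terms by (share of case notes) minus (share of control notes)."""
    case_df = document_frequency(case_notes)
    control_df = document_frequency(control_notes)
    scores = {}
    for term in set(case_df) | set(control_df):
        if case_df[term] + control_df[term] < min_notes:
            continue  # too rare to be a stable predictor
        scores[term] = (case_df[term] / len(case_notes)
                        - control_df[term] / len(control_notes))
    # Highest score first; ties broken alphabetically.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Entirely made-up notes for illustration; not real patient data.
case_notes = [
    "patient reports manic episode and started lithium",
    "history of manic episode, mood unstable",
    "lithium level checked, mood improved",
]
control_notes = [
    "patient reports knee pain after running",
    "blood pressure stable, mood normal",
    "follow up for knee pain",
]

ranked = rank_terms(case_notes, control_notes)
for term, score in ranked[:3]:
    print(f"{term}: {score:+.2f}")
```

Terms that score near zero (here, words like "patient") appear about equally in both groups and carry little information. The high-scoring terms become candidate predictors for the modeling steps that follow.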

EMR: Steps 4-6

¨  Steps 4-6: Determine important predictors of disease, build a prediction model with these variables, assess/validate performance, implement model

¨  Biostatisticians develop

¤  high dimensional regression methods or machine learning methods

¤  to select important predictors and build models

¤  to predict outcomes based on a large number of variables (e.g., LASSO, support vector machines)
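To illustrate how a LASSO penalty selects predictors, here is a minimal coordinate-descent sketch. It is a simplification: it fits a linear (not logistic) regression with a plain (not adaptive) LASSO penalty, and the tiny data set and tuning value `lam` are hypothetical. The objective minimized is (1/(2n)) * sum of squared residuals + lam * sum of |beta_j|.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values within [-lam, lam] become exactly 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=100):
    """Minimize (1/(2n)) * ||y - X b||^2 + lam * ||b||_1 one coefficient at a time."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0   # correlation of feature j with the partial residual
            norm = 0.0  # squared length of feature j
            for i in range(n):
                pred_others = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_others)
                norm += X[i][j] ** 2
            beta[j] = soft_threshold(rho / n, lam) / (norm / n)
    return beta


# Hypothetical toy data: y depends strongly on feature 0, barely on feature 1,
# and not at all on feature 2.
X = [[1, 1, 1],
     [1, -1, -1],
     [-1, 1, -1],
     [-1, -1, 1]]
y = [2.1, 1.9, -1.9, -2.1]

beta = lasso_coordinate_descent(X, y, lam=0.5)
print(beta)
```

The soft-thresholding step is what sets weak coefficients exactly to zero, so fitting the model and selecting variables happen at once. In this toy design the strong predictor is kept (shrunk toward zero) while the other two are dropped.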

Regularized logistic regression with NLP predictors

Solution path for coefficients of predictors based on adaptive LASSO

Back to me.

¨  Began with Yung-Pin’s research project on CpG islands (related to new field of epigenetics)

¨  Enjoyed journal clubs/biostatistics meetings at OHSU

¨  Pure math vs. applied math vs. something else

¨  Did you want to be a doctor? Do you want to help people?

¨  Ended up in grad school, what did I learn?

Biostatistics grad school

¨  Statistics ≠ pure math!

¨  A master’s would have helped with intuition, but is not usually funded

¨  Research universities ≠ Lewis & Clark!

¨  Depend on self-teaching, your classmates, and especially the TAs to get by (when interviewing, meet the students!)

¨  Light teaching load, (hopefully) heavy collaborative/consulting load

¨  Lots of women in public health (like LC)!

¨  Grad school is always hard.

Bioinformatics grad school

¨  So far mostly the same

¨  More focused on biology

¨  Incorporating more biology training, wet labs

¨  Software/Bioconductor/R package development

¨  Diverging from traditional biostat?

Helpful classes

¨  Statistics and probability (obviously)

¨  All the computer science classes, ever (Python, more C!)

¨  Linear algebra

¨  Genetics (molecular biology would have been nice, though no biology required for biostat)

¨  Advanced calculus/real analysis (for theoretical classes such as Prob II and Inference II and writing my thesis, not always required)

¨  Discrete math

¨  Abstract Algebra (don’t worry, not required either)

¨  Liberal arts education in general

Helpful skills

¨  LaTeX

¨  R

¨  Python or Perl

¨  Unix, cluster/cloud computing

¨  Teaching/tutoring

¨  Research experience!

¨  Programming, software development

¨  C, Fortran

¨  GitHub

¨  You must enjoy talking to people, collaborating, explaining math/stat/cs to non-mathematical people!

Pros & Cons

Pros

¨  Interesting & meaningful research problems

¨  Always in demand, more so every day

¨  Collaborate with clinicians, biologists, researchers of all kinds

¨  Salary isn’t too shabby

Cons

¨  Soft money ☹

¨  Grants, grants, always grants (but not necessarily our own)

Last thoughts

¨  Consider Epidemiology

¨  Applied vs. Theoretical research

¨  My day: mostly programming and writing code (cleaning data + analysis, simulations), lots of meetings, a bit of pen-and-paper research and thinking of new grants, reading articles, reading clinical trial protocols, sample size and power calculations

¨  This will vary depending on where you work

¨  Master’s vs. PhD

More talks like this

¨  Excellent overview of bioinformatics & computational biology fields and careers in medicine by Dr. Shannon McWeeney (http://www.biodevlab.org/) at OHSU https://ohsu.adobeconnect.com/_a46054336/p61byw86754/?launcher=false&fcsContent=true&pbMode=normal

¨  Rafa Irizarry’s (at HSPH http://rafalab.dfci.harvard.edu/) math major talk: https://www.youtube.com/watch?v=gXeWdvHKTQQ

¨  Plenty of interesting talks at JSM, the big statistics conference (in Boston this year: http://www.amstat.org/meetings/jsm/2014/index.cfm); it will be nearby in Seattle in August 2015: http://www.amstat.org/meetings/jsm.cfm

Learning resources

¨  Summer Institute for Training in Biostatistics (for undergrads) http://www.nhlbi.nih.gov/funding/training/redbook/sibsweb.htm

¤  U Wisc at Madison, Columbia, Emory, Boston U, NC State, U of Iowa, U of Minnesota, U of Pittsburgh (All of the websites have “What is Biostatistics?” pages)

¨  MOOCs (Massive Open Online Courses)

¤  Learn R http://www.flaviobarros.net/2014/03/14/online-multimedia-resources-learn-r

¤  Learn biostats https://www.coursera.org/course/biostats

¤  Learn statistical learning https://class.stanford.edu/courses/HumanitiesScience/StatLearning/Winter2014/about

¤  Learn bioinformatics http://www.langmead-lab.org/teaching-materials/ and http://rosalind.info/problems/list-view/

¨  UW’s Summer Institutes (scholarships for students)

¤  Statistical Genetics; Statistics and Modeling in Infectious Diseases; Statistics for Clinical Research

¨  Comprehensive list of job postings for statistics/biostatistics/bioinformatics: http://www.stat.ufl.edu/jobs/

The internet

¨  YouTube

¤  Rafa Irizarry’s YouTube channel (especially http://youtu.be/gXeWdvHKTQQ)

¨  Simply Statistics blog (http://simplystatistics.org/)

¨  R-bloggers

¨  Getting Genetics Done blog (http://gettinggeneticsdone.blogspot.com/)

¨  FiveThirtyEight (http://fivethirtyeight.com/)

¨  Neat summary measure of types of research done in various departments (biased toward east coast) https://muschellij2.shinyapps.io/ENAR_Over_Time/

Questions?

¨  [email protected]
