Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in...
![Page 1: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing](https://reader034.fdocuments.net/reader034/viewer/2022042712/5f9b9a9756881f2e7566cd21/html5/thumbnails/1.jpg)
Week 1 Notes/Slides/Information for the Purdue Big Data for Biologists Workshop
2016
Fleet 2016
![Page 2: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing](https://reader034.fdocuments.net/reader034/viewer/2022042712/5f9b9a9756881f2e7566cd21/html5/thumbnails/2.jpg)
Welcome!
Day 1 Session 1:
WELCOME
Encouraging Big Data Thinking in Biomedical Research
![Page 3: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing](https://reader034.fdocuments.net/reader034/viewer/2022042712/5f9b9a9756881f2e7566cd21/html5/thumbnails/3.jpg)
Introductions
Who are you? Where are you from?
What is your research interest?
Why are you interested in “big data”?
![Page 4: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing](https://reader034.fdocuments.net/reader034/viewer/2022042712/5f9b9a9756881f2e7566cd21/html5/thumbnails/4.jpg)
Instructors and Teaching Assistants
Main Instructors:
• James C. Fleet, PhD (Nutrition Science)
• Wanqing Liu, PhD (Medicinal Chemistry and Molecular Pharmacology)
• Pete Pascuzzi, PhD (Libraries)
• Min Zhang, PhD (Statistics)
Teaching Assistants:
• Chen Chen (Statistics)
• Min Ren (Statistics)
• Harley Schawadron
![Page 5: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing](https://reader034.fdocuments.net/reader034/viewer/2022042712/5f9b9a9756881f2e7566cd21/html5/thumbnails/5.jpg)
Guest Lecturers
Week 1:
• Doug Crabill (Purdue University)
• Sean Davis (National Cancer Institute)
• Nadia Atallah (Purdue University)
• Ingenuity Systems Staff
Week 2:
• Jennifer Neville (Purdue University)
• Shi Li Lin (Ohio State University)
• Martin Lundquist (Johns Hopkins University)
![Page 6: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing](https://reader034.fdocuments.net/reader034/viewer/2022042712/5f9b9a9756881f2e7566cd21/html5/thumbnails/6.jpg)
Strengths/Weaknesses of Hypothesis-Driven Science
[Figure: artwork by K. Hokusai, annotated with labels A through I]
![Page 7: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing](https://reader034.fdocuments.net/reader034/viewer/2022042712/5f9b9a9756881f2e7566cd21/html5/thumbnails/7.jpg)
Disruptive Technologies Propel Scientific Advancement
• Genomic: GEO, ENCODE, GTEx
• Proteomic: ProteomicsDB
• Phenotypes: JAX, TCGA
• Metabolomic: HMDB
• Genetic: WebQTL/JAX, 1000 Genomes
What can we do with all this information?
Sequenced genomes helped scientists see that data generation need not be a barrier to scientific progress.
2001 Breakthrough of the Year
![Page 8: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing](https://reader034.fdocuments.net/reader034/viewer/2022042712/5f9b9a9756881f2e7566cd21/html5/thumbnails/8.jpg)
New technologies bring challenges to using and integrating data
12 Brain-imaging modalities
Requires Multidimensional Thinking and Analytical Tools
Dinov (2016) J Med Stat Inform 4:3
[Figure axis label: Brain-imaging modality]
![Page 9: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing](https://reader034.fdocuments.net/reader034/viewer/2022042712/5f9b9a9756881f2e7566cd21/html5/thumbnails/9.jpg)
What is Biological Big Data Science (BBDS)?
An approach that integrates multi-dimensional, high-density, variously formatted data types to gain insight into the function or regulation of biological systems.
• Data Volume: Mb, Gb, Tb, Pb
• Data Velocity: batch, periodic, near real time, real time
• Data Variety: “omic”, image, population, physiologic
Two new “V’s”: Veracity, Value
![Page 10: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing](https://reader034.fdocuments.net/reader034/viewer/2022042712/5f9b9a9756881f2e7566cd21/html5/thumbnails/10.jpg)
Big Data in Biomedicine Has Two “Flavors”
“Omics” Driven
• Genotyping
• Gene expression
• NGS data
Basic Biological Understanding and Treatment Discovery
Payer-provider Driven
• Electronic Medical Records
• Pharmacy information
• Insurance Records
Optimize Healthcare Delivery and Economics
![Page 11: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing](https://reader034.fdocuments.net/reader034/viewer/2022042712/5f9b9a9756881f2e7566cd21/html5/thumbnails/11.jpg)
Biological Big Data Science (BBDS)
Focus: Characterization, Analysis
Levels of Scale: Gene, Cell, Organism, Community, Global
Analytical Disciplines
Computational Disciplines
Life Sciences
Where do traditional biomedical researchers fit?
![Page 12: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing](https://reader034.fdocuments.net/reader034/viewer/2022042712/5f9b9a9756881f2e7566cd21/html5/thumbnails/12.jpg)
Central Dogma of Biology Has Been Reshaped by Big Data
Modified from Doerge (2002) NRG 3:43
Phenotype
1000 Genomes, dbSNP
GTEx, BioGPS
ENCODE
ProteomicsDB
![Page 13: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing](https://reader034.fdocuments.net/reader034/viewer/2022042712/5f9b9a9756881f2e7566cd21/html5/thumbnails/13.jpg)
Is Traditional Science Outdated?
Hypothesis-driven vs Data-driven
Hypothesis-driven:
• Needs ‘falsifiable’ hypothesis
• Needs appropriate data
• Statistical significance
• Small effects in lots of data
• Determines Causation
Data-driven:
• No hypothesis needed
• Exploratory?
• Full data not needed
• Post-hoc explanation
• Statistical significance?
• “Thin” data?
“faced with massive data, this approach to science - hypothesize, model, test — is becoming obsolete.”
Chris Anderson, Editor, Wired Magazine, 2008
![Page 14: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing](https://reader034.fdocuments.net/reader034/viewer/2022042712/5f9b9a9756881f2e7566cd21/html5/thumbnails/14.jpg)
The Wrong Way to Do Bioinformatics
Experimental Scientist
Well Designed Experiment 1
And then a miracle happens….
Results
![Page 15: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing](https://reader034.fdocuments.net/reader034/viewer/2022042712/5f9b9a9756881f2e7566cd21/html5/thumbnails/15.jpg)
The Challenge of BBDS
Experimental Scientist
Well Designed Experiment 1
Core Facility: Sample Analysis
Raw Data
Computing, Algorithms, Statistics
Processed Data 1
Computing, Algorithms, Statistics
Processed Data “n”
Computing, Algorithms, Statistics
Statistics, Functional Analysis, Visualization
Results
![Page 16: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing](https://reader034.fdocuments.net/reader034/viewer/2022042712/5f9b9a9756881f2e7566cd21/html5/thumbnails/16.jpg)
Big Data Challenges in Biomedicine
1) Locating data
2) Getting access to data
3) Organizing, Managing, and Processing data
4) Computational issues
• Infrastructure
• Speed (computer and algorithms)
5) Analytical methods
6) Training researchers to use BBD effectively
From the NIH “Big Data to Knowledge” (BD2K) program
![Page 17: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing](https://reader034.fdocuments.net/reader034/viewer/2022042712/5f9b9a9756881f2e7566cd21/html5/thumbnails/17.jpg)
Stages of Big Data Education
From NAIVE to EXPERT, with increasing Skill + Independence:
• AWARENESS: Appreciate Value
• FAMILIARITY: Vocabulary to Talk with Experts to Define Goals
• LOW LEVEL ABILITY: Use Pre-existing Software
• HIGH LEVEL ABILITY: Code, Develop Tools
• EXPERT: Creative Work, Novel Approaches
![Page 18: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing](https://reader034.fdocuments.net/reader034/viewer/2022042712/5f9b9a9756881f2e7566cd21/html5/thumbnails/18.jpg)
What We’ll Cover During the Workshop
Week 1
Unit 1: Microarray
Unit 2: Next Generation Sequencing
Week 2
Unit 3: Biomarker Discovery
Unit 4: Genetic Variation
Biological Goals:
• Regulation of gene expression
• Phenotype prediction
• Genotype-phenotype relationships
![Page 19: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing](https://reader034.fdocuments.net/reader034/viewer/2022042712/5f9b9a9756881f2e7566cd21/html5/thumbnails/19.jpg)
What We Will Cover During the Workshop
Week 1
Unit 1: Microarray
Unit 2: Next Generation Sequencing
Week 2
Unit 3: Biomarker Discovery
Unit 4: Genetic Variation
Technical Goals:
• Analysis pipelines
• Statistical issues
• Visualization
• Functional annotation
• Databases
• Project management
• Computation and programming
![Page 20: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing](https://reader034.fdocuments.net/reader034/viewer/2022042712/5f9b9a9756881f2e7566cd21/html5/thumbnails/20.jpg)
Day 1 Session 2: Working with the Purdue Computer Infrastructure
Doug Crabill
Department of Statistics
Purdue University
![Page 21: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing](https://reader034.fdocuments.net/reader034/viewer/2022042712/5f9b9a9756881f2e7566cd21/html5/thumbnails/21.jpg)
Sites to Understand Computing
• UNIX operating system
• Learn UNIX
• http://www.tutorialspoint.com/unix/index.htm
• Linux operating system
• http://www.tutorialspoint.com//operating_system/os_linux.htm
• R coding
• http://bioinformatics.knowledgeblog.org/2011/06/21/using-r-a-guide-for-complete-beginners/
• https://www.r-project.org/about.html
![Page 22: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing](https://reader034.fdocuments.net/reader034/viewer/2022042712/5f9b9a9756881f2e7566cd21/html5/thumbnails/22.jpg)
Monday
BREAK #1
![Page 23: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing](https://reader034.fdocuments.net/reader034/viewer/2022042712/5f9b9a9756881f2e7566cd21/html5/thumbnails/23.jpg)
Day 1 Session 3: Data Repositories and Pre-processed Data Sites
James C. Fleet, PhD
Distinguished Professor
Department of Nutrition Science
Pete Pascuzzi, PhD
Assistant Professor
Purdue Libraries
![Page 24: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing](https://reader034.fdocuments.net/reader034/viewer/2022042712/5f9b9a9756881f2e7566cd21/html5/thumbnails/24.jpg)
| Data Archives | Web link | Description |
| --- | --- | --- |
| NIH Data Sharing Repositories | https://www.nlm.nih.gov/NIHbmic/nih_data_sharing_repositories.html | Trans-NIH BioMedical Informatics Coordinating Committee (BMIC) sites |
| Gene Expression Omnibus (GEO) | http://www.ncbi.nlm.nih.gov/geo/ | NCBI; transcriptome and ChIP-seq datasets |
| ArrayExpress | http://www.ebi.ac.uk/arrayexpress/ | EMBL-EBI repository to archive functional genomics data |
| European Nucleotide Archive (ENA) | http://www.ebi.ac.uk/ena | Comprehensive record of the world's nucleotide sequencing information |
| The Cancer Genome Atlas (TCGA) | http://cancergenome.nih.gov/ | Multi-"omic" phenotype characterization of tumors |
| Proteomics IDEntifications (PRIDE) | http://www.ebi.ac.uk/pride/archive/ | European proteomics datasets |
| Metabolomics Workbench | http://metabolomicsworkbench.org/standards/nominatecompounds.php | Metabolomic datasets |
![Page 25: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing](https://reader034.fdocuments.net/reader034/viewer/2022042712/5f9b9a9756881f2e7566cd21/html5/thumbnails/25.jpg)
![Page 26: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing](https://reader034.fdocuments.net/reader034/viewer/2022042712/5f9b9a9756881f2e7566cd21/html5/thumbnails/26.jpg)
GEO Datasets Week 1
• GSE15947: Time course of 1,25(OH)2 D treated RWPE1 cells
• GSE54783: The Osteoblast to Osteocyte Transition: Epigenetic changes and response to the vitamin D3 hormone
GSE #: Accession number for a complete dataset that is submitted to GEO
GSM #: Accession number for a specific sample within a dataset
GPL #: Accession number for the platform used to generate a dataset
SRX #: Accession number for a sample generated by NGS that is deposited in the Sequence Read Archive (SRA)
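The accession prefixes above can be illustrated with a short Python sketch that labels an accession string by its prefix. The function name `describe_accession` and the lookup table are hypothetical helpers written for this example; they are not part of GEO or of any NCBI tool.

```python
# Hypothetical lookup: accession prefixes from the slide mapped to their meaning.
ACCESSION_TYPES = {
    "GSE": "complete dataset submitted to GEO",
    "GSM": "specific sample within a dataset",
    "GPL": "platform used to generate a dataset",
    "SRX": "NGS sample deposited in the SRA",
}


def describe_accession(accession: str) -> str:
    """Describe an accession such as 'GSE15947' from its three-letter prefix."""
    cleaned = accession.strip().upper()
    prefix, number = cleaned[:3], cleaned[3:]
    # Require a known prefix followed only by digits.
    if prefix not in ACCESSION_TYPES or not number.isdigit():
        raise ValueError(f"Unrecognized accession: {accession!r}")
    return ACCESSION_TYPES[prefix]


for acc in ("GSE15947", "GSE54783"):
    print(acc, "->", describe_accession(acc))
```

Both Week 1 datasets carry the GSE prefix, so each refers to a complete dataset rather than to a single sample.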
![Page 27: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing](https://reader034.fdocuments.net/reader034/viewer/2022042712/5f9b9a9756881f2e7566cd21/html5/thumbnails/27.jpg)
| Data Archives | Web link | Description |
| --- | --- | --- |
| Oncomine | https://www.oncomine.org/resource/login.html | 715 microarray datasets from 19 cancers |
| Gene Expression across Normal and Tumor tissue (GENT) | http://medical-genome.kribb.re.kr/GENT/ | Gene expression patterns in human cancer from Affy Chips (+ 1000 cell lines) |
| BioGPS | http://biogps.org/#goto=welcome | Human/rat/mouse transcript data by tissue |
| cBioPortal | http://www.cbioportal.org/ | TCGA cancer genomics |
| Genotype-Tissue Expression project (GTEx) | http://www.gtexportal.org/home/ | Human, multi-tissue gene expression and gene variation for eQTL |
| Immunological Genome Project (ImmGen) | http://www.immgen.org/ | Transcriptome data from cultured mouse immune cells |
| Human Brain Transcriptome | http://hbatlas.org/ | Transcriptome and associated metadata for developing and adult human brain |
| NHLBI Kidney Transcriptome database | https://hpcwebapps.cit.nih.gov/ESBL/Database/Transcriptomic/index.html | Segment-specific expression in rat kidney |
| Kidney Systems Biology Project | https://hpcwebapps.cit.nih.gov/ESBL/Database/ | Multi-omic database from rat and mouse studies |
| Saccharomyces Genome Database | http://www.yeastgenome.org/transcriptome-data-in-yeastmine | Integrated biological information on budding yeast |
| miRBase | http://mirbase.org/ | Published miRNA sequences and annotation. Expression dataset links available |
![Page 28: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing](https://reader034.fdocuments.net/reader034/viewer/2022042712/5f9b9a9756881f2e7566cd21/html5/thumbnails/28.jpg)
Day 1 Session 4: Understanding R and Bioconductor
James C. Fleet, PhD
Distinguished Professor
Department of Nutrition Science
Pete Pascuzzi, PhD
Assistant Professor
Purdue Libraries
![Page 29: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing](https://reader034.fdocuments.net/reader034/viewer/2022042712/5f9b9a9756881f2e7566cd21/html5/thumbnails/29.jpg)
![Page 30: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing](https://reader034.fdocuments.net/reader034/viewer/2022042712/5f9b9a9756881f2e7566cd21/html5/thumbnails/30.jpg)
![Page 31: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing](https://reader034.fdocuments.net/reader034/viewer/2022042712/5f9b9a9756881f2e7566cd21/html5/thumbnails/31.jpg)
![Page 32: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing](https://reader034.fdocuments.net/reader034/viewer/2022042712/5f9b9a9756881f2e7566cd21/html5/thumbnails/32.jpg)
RStudio is a Graphical User Interface (GUI) that lets you use R more conveniently
Script Window
Command Window
Bioconductor package Window
![Page 33: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing](https://reader034.fdocuments.net/reader034/viewer/2022042712/5f9b9a9756881f2e7566cd21/html5/thumbnails/33.jpg)
Day 1 Session 5-6:
Microarray Overview
Data Processing and QC
James C. Fleet, PhD
Distinguished Professor
Department of Nutrition Science
Pete Pascuzzi, PhD
Assistant Professor
Purdue Libraries
![Page 34: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing](https://reader034.fdocuments.net/reader034/viewer/2022042712/5f9b9a9756881f2e7566cd21/html5/thumbnails/34.jpg)
1,25(OH)2 D Induces Gene Transcription through the Vitamin D Receptor
25-OH D → 1,25(OH)2 D (CYP27B1)
Fleet et al. (2012) Biochem J 441:61
Classical Role: Regulate whole body calcium metabolism
NGS project: Osteoblast and osteocytes
Novel Role: Prevent and treat cancer
Microarray project: Normal prostate epithelial cell
![Page 35: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing](https://reader034.fdocuments.net/reader034/viewer/2022042712/5f9b9a9756881f2e7566cd21/html5/thumbnails/35.jpg)
VDR Deletion Accelerates Cancer Development in Mice
[Figure, three panels]
• Colon Cancer: APCmin mice, tumor size (mm) in WT, HT, and KO mice (5 months)
• Breast Cancer: MMTV-neu mice, tumor incidence shown as tumor-free mice (%) vs. age (mo), WT vs. KO
• Prostate Cancer: LPB-Tag mice, tumor weight (mg) vs. age (wks), WT vs. KO
Zinser and Welsh 2004, Carcinogenesis 25:2361
Larriba et al., 2011, PLoS One 6:e23524
Mordan-Mcombs et al., 2010, JSBMB 121:368
![Page 36: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing](https://reader034.fdocuments.net/reader034/viewer/2022042712/5f9b9a9756881f2e7566cd21/html5/thumbnails/36.jpg)
Strategies of Cancer Prevention
Umar et al. (2012) Nat. Rev. Cancer 12:835
[Figure: timeline (y) around Diagnosis, contrasting Cancer Prevention (Survival without Cancer) with Cancer Treatment (Survival with Cancer)]
![Page 37: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing](https://reader034.fdocuments.net/reader034/viewer/2022042712/5f9b9a9756881f2e7566cd21/html5/thumbnails/37.jpg)
Research Paradigm: Study Gene Expression with DNA Microarray
Kovalenko et al. (2010) 1,25 dihydroxyvitamin D-mediated orchestration of anticancer, transcript-level effects in the immortalized, non-transformed prostate epithelial cell line, RWPE1. BMC Genomics 11:26
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2820456/
![Page 38: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing](https://reader034.fdocuments.net/reader034/viewer/2022042712/5f9b9a9756881f2e7566cd21/html5/thumbnails/38.jpg)
Study Design
• RWPE1 cells
• Human prostate epithelial cell line (ATCC CRL-11609)
• 54 y old white male
• Peripheral zone of normal healthy prostate
• HPV18 immortalized
• Not tumorigenic in nude mice
Kovalenko et al. (2010) BMC Genomics 11:26
60% Confluent Culture
100 nM 1,25(OH)2 D or EtOH control
6 h, 24 h, 48 h
Media change
2 x 3 factorial design, n = 4/group (24 total samples)
GEO ID = GSE15947
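To make the sample count concrete, the following Python sketch enumerates the 2 x 3 factorial design with n = 4 per group. The variable names and the dictionary layout are illustrative assumptions; they do not reflect how samples are labeled in GSE15947.

```python
from itertools import product

treatments = ["EtOH control", "100 nM 1,25(OH)2 D"]  # 2 treatment levels
time_points_h = [6, 24, 48]                          # 3 time points
replicates_per_group = 4                             # n = 4/group

# One record per sample: every treatment x time combination, replicated 4 times.
samples = [
    {"treatment": trt, "time_h": t, "replicate": rep}
    for trt, t in product(treatments, time_points_h)
    for rep in range(1, replicates_per_group + 1)
]

n_groups = len({(s["treatment"], s["time_h"]) for s in samples})
print(f"{n_groups} groups, {len(samples)} samples")
```

Two treatments times three time points gives 6 groups, and 6 groups times 4 replicates gives the 24 samples stated on the slide.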
![Page 39: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing](https://reader034.fdocuments.net/reader034/viewer/2022042712/5f9b9a9756881f2e7566cd21/html5/thumbnails/39.jpg)
Affymetrix Microarray Analysis Workflow
Well Designed Study
High Quality RNA
Affymetrix analysis
Affymetrix Raw Data
Affymetrix QC analysis
Process Data
• RMA
• Normalize
• MAS5 (P/A)
Additional QC analysis
Data Filters
• P/A call
• Fold
Normalized Vetted Data
Statistical Analysis
Differentially Expressed Gene List
Clustering and visualization
Pathway and Geneset Analysis
Network building
Interpret Experiment
![Page 40: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing](https://reader034.fdocuments.net/reader034/viewer/2022042712/5f9b9a9756881f2e7566cd21/html5/thumbnails/40.jpg)
Affymetrix Tiling Arrays
11-25 probes or probe pairs per gene
Perfect Match = 25 bp
Mismatch in one base: used to define “background”
Algorithms from Affymetrix and others define “expression”
![Page 41: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing](https://reader034.fdocuments.net/reader034/viewer/2022042712/5f9b9a9756881f2e7566cd21/html5/thumbnails/41.jpg)
Affymetrix GeneChip Microarrays
Probe Array Pre-hybridization
cDNA library and Biotin Label
Hybridization to Array
Probe Array Post-hybridization
Probe Set
Probe Pair (PM/MM)
Probe Cell: 40 x 10^7 copies/cell, 25 base long probes
U133 Plus 2 GeneChip: 54,000 probe sets
U133 Plus 2 GeneChip: 11 probe pairs/gene
![Page 42: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing](https://reader034.fdocuments.net/reader034/viewer/2022042712/5f9b9a9756881f2e7566cd21/html5/thumbnails/42.jpg)
Residual Plots of Scanned GeneChips
http://plmimagegallery.bmbolstad.com/
Residuals: Red = (+), Blue = (-), White = 0
• High quality (mostly white)
• Too Variable (intense red/blue)
• Scratched
• Uneven (edge drying)
• Uneven
![Page 43: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing](https://reader034.fdocuments.net/reader034/viewer/2022042712/5f9b9a9756881f2e7566cd21/html5/thumbnails/43.jpg)
Relative Log Expression (RLE)
Provides a measure of reproducibility of gene expression data that can be compared across batches, experiments or trials.
Large spread and non-zero = bad
Outside control limits = bad
[Plot annotations: Control limits, IQR limits]
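A minimal Python sketch of how RLE values are commonly computed: for each probe set, the median log2 expression across arrays is subtracted from each array's value, and each array is then summarized by the median and spread of its RLE values. The expression matrix below is invented for illustration, and the function names are hypothetical; in the workshop setting these values would typically come from a Bioconductor package rather than hand-written code.

```python
from statistics import median, quantiles


def relative_log_expression(log_expr):
    """log_expr: rows = probe sets, columns = arrays, values = log2 expression.
    Returns the same shape with each row's median subtracted."""
    return [[value - median(row) for value in row] for row in log_expr]


def array_summaries(rle):
    """Median and interquartile range of the RLE values for each array."""
    summaries = []
    for column in zip(*rle):
        q1, _, q3 = quantiles(column, n=4)
        summaries.append({"median": median(column), "iqr": q3 - q1})
    return summaries


# Made-up data: 4 probe sets x 3 arrays; the third array runs consistently high.
log_expr = [
    [7.0, 7.1, 8.0],
    [9.0, 9.2, 10.1],
    [5.0, 4.9, 6.2],
    [11.0, 11.1, 12.0],
]
summaries = array_summaries(relative_log_expression(log_expr))
for i, s in enumerate(summaries, start=1):
    print(f"array {i}: median RLE = {s['median']:.2f}, IQR = {s['iqr']:.2f}")
```

An array whose median RLE sits away from zero, or whose spread is much larger than that of the other arrays, is the kind of array the slide flags as "bad".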
![Page 44: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing](https://reader034.fdocuments.net/reader034/viewer/2022042712/5f9b9a9756881f2e7566cd21/html5/thumbnails/44.jpg)
Normalized Unscaled Standard Error (NUSE)
Values have no units. Used to assess relative quality of arrays within an analysis set
Large spread and non-zero = bad
Outside control limits = bad
[Plot annotations: Control limits, IQR limits]
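A minimal Python sketch of the normalization behind NUSE, assuming the usual definition: the standard error of each probe set's expression estimate on each array is divided by the median standard error of that probe set across arrays, so a typical array scores about 1. The standard errors below are invented; in practice they would come from a probe-level model fit (for example, the Bioconductor affyPLM package), which this sketch does not perform.

```python
from statistics import median


def nuse(standard_errors):
    """standard_errors: rows = probe sets, columns = arrays.
    Each row is divided by its own median, which removes the units."""
    return [[se / median(row) for se in row] for row in standard_errors]


# Made-up standard errors: 3 probe sets x 3 arrays; the third array is noisier.
se = [
    [0.10, 0.11, 0.20],
    [0.05, 0.05, 0.09],
    [0.20, 0.22, 0.35],
]
per_array_median = [median(column) for column in zip(*nuse(se))]
for i, value in enumerate(per_array_median, start=1):
    print(f"array {i}: median NUSE = {value:.2f}")
```

Because every probe set is scaled by its own median, the values are only meaningful relative to the other arrays in the same analysis set, which matches the note on the slide.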
![Page 45: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing](https://reader034.fdocuments.net/reader034/viewer/2022042712/5f9b9a9756881f2e7566cd21/html5/thumbnails/45.jpg)
![Page 46: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing](https://reader034.fdocuments.net/reader034/viewer/2022042712/5f9b9a9756881f2e7566cd21/html5/thumbnails/46.jpg)
Monday
BREAK #2
![Page 47: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing](https://reader034.fdocuments.net/reader034/viewer/2022042712/5f9b9a9756881f2e7566cd21/html5/thumbnails/47.jpg)
Day 1 Session 7: Differential Gene Expression
James C. Fleet, PhD
Distinguished Professor
Department of Nutrition Science
Pete Pascuzzi, PhD
Assistant Professor
Purdue Libraries
![Page 48: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing](https://reader034.fdocuments.net/reader034/viewer/2022042712/5f9b9a9756881f2e7566cd21/html5/thumbnails/48.jpg)
Statistics for Biologists
http://www.nature.com/collections/qghhqm
Nature Publishing:
Collections of short articles on basic statistical concepts for biologists
![Page 49: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing](https://reader034.fdocuments.net/reader034/viewer/2022042712/5f9b9a9756881f2e7566cd21/html5/thumbnails/49.jpg)
What is a p-Value?
The probability, assuming the null hypothesis is true, of obtaining an effect at least as extreme as the one you observed.
Control Mean = 260
Treatment Mean = 330.6
J. Frost, Minitab Blog
![Page 50: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing](https://reader034.fdocuments.net/reader034/viewer/2022042712/5f9b9a9756881f2e7566cd21/html5/thumbnails/50.jpg)
What is type I error?
Rejecting the null hypothesis when it is in fact true: a false positive.

| | Null Hypothesis is True | Null Hypothesis is False |
| --- | --- | --- |
| Reject Null Hypothesis | Type I error (False positive, α) | Correct Inference (True positive) |
| Fail to Reject Null Hypothesis | Correct Inference (True negative) | Type II error (False negative, β) |
![Page 51: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing](https://reader034.fdocuments.net/reader034/viewer/2022042712/5f9b9a9756881f2e7566cd21/html5/thumbnails/51.jpg)
Why is type I error a problem for “omics” studies?
“large p, small n problem”
100 repeats of same experiment: 5 potential false positives when α = 0.05
25,000 transcripts on an array = 1,250 false positives at α = 0.05
For m independent tests: $P(\text{at least one Type I error among } m \text{ tests}) = 1 - (1 - \alpha)^m$
e.g. When α = .05 and m = 10, P = 0.401
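The numbers on this slide can be checked with a few lines of Python. This is a sketch that assumes the m tests are independent and uses the standard expression 1 − (1 − α)^m for the chance of at least one Type I error; the function names are made up for the example.

```python
def familywise_error_rate(alpha: float, m: int) -> float:
    """P(at least one Type I error among m independent tests at level alpha)."""
    return 1.0 - (1.0 - alpha) ** m


def expected_false_positives(alpha: float, m: int) -> float:
    """Expected number of false positives when all m null hypotheses are true."""
    return alpha * m


print(round(familywise_error_rate(0.05, 10), 3))        # 0.401
print(round(expected_false_positives(0.05, 25_000)))    # 1250
```

With only 10 tests the chance of at least one false positive is already about 40%, and with 25,000 transcripts it is effectively certain.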
![Page 52: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing](https://reader034.fdocuments.net/reader034/viewer/2022042712/5f9b9a9756881f2e7566cd21/html5/thumbnails/52.jpg)
How do we control type I error in “omics” studies?
Familywise Error Rate (FWER) Correction:
FWER is the probability of making even one false discovery in a set of comparisons.
e.g. Bonferroni test
α/(# comparisons) = 0.05/25,000 = 0.000002
Very conservative but useful if the goal is to only find the changes that are most reliable (and likely the largest)
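A short Python sketch of the Bonferroni calculation above. The helper names and the four example p-values are invented for illustration.

```python
def bonferroni_threshold(alpha: float, n_comparisons: int) -> float:
    """Per-test significance threshold: alpha divided by the number of comparisons."""
    return alpha / n_comparisons


def bonferroni_significant(p_values, alpha=0.05):
    """True for each p-value at or below the Bonferroni threshold."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return [p <= threshold for p in p_values]


# Threshold for the array example on the slide (25,000 comparisons).
print(f"{bonferroni_threshold(0.05, 25_000):.1e}")

# Made-up p-values: with 4 comparisons the threshold is 0.05 / 4 = 0.0125.
print(bonferroni_significant([0.0001, 0.004, 0.03, 0.2]))
```

In the small example, a p-value of 0.03 would pass an uncorrected 0.05 cutoff but fails after correction, which is the conservative behavior the slide describes.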
![Page 53: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing](https://reader034.fdocuments.net/reader034/viewer/2022042712/5f9b9a9756881f2e7566cd21/html5/thumbnails/53.jpg)
How do we control type I error in “omics” studies?
False Discovery Rate (FDR):
e.g. Benjamini and Hochberg procedure
(i/m)Q = (rank / # comparisons) × (false discovery rate)

| Gene | P value | Rank | (i/m)Q at 5% FDR |
| --- | --- | --- | --- |
| Gene 1 | 0.0001 | 1 | 0.000002 |
| Gene 2 | 0.0004 | 2 | 0.000004 |
| Gene 1500 | 0.049 | 1500 | 0.003 |
| Gene 25,000 | 0.988 | 25000 | 0.05 |

Accepts a set rate of false positives within a number of comparisons (e.g. 5% FDR means 5/100 “significant” comparisons are likely false positives)
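A minimal Python sketch of the Benjamini and Hochberg step-up procedure, written from its standard definition: rank the p-values, find the largest rank i whose p-value is at or below (i/m)Q, and declare every test up to that rank significant. The function name and the example p-values are assumptions made for illustration; in R the same adjustment is usually obtained with `p.adjust(p, method = "BH")`.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans: True where a test is significant at FDR q."""
    m = len(p_values)
    # Indices sorted by ascending p-value, so rank 1 is the smallest p.
    order = sorted(range(m), key=lambda idx: p_values[idx])
    # Largest rank i whose p-value is at or below its critical value (i/m) * q.
    largest_passing_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= (rank / m) * q:
            largest_passing_rank = rank
    # Every test ranked at or below that cutoff is declared significant.
    significant = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= largest_passing_rank:
            significant[idx] = True
    return significant


# Made-up p-values, m = 4: critical values are 0.0125, 0.025, 0.0375, 0.05.
print(benjamini_hochberg([0.001, 0.03, 0.035, 0.9]))
```

In this example 0.03 exceeds its own critical value of 0.025, yet it is still declared significant because the next-ranked p-value (0.035) passes its critical value of 0.0375. That step-up behavior is what makes the procedure less conservative than Bonferroni.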
![Page 54: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing](https://reader034.fdocuments.net/reader034/viewer/2022042712/5f9b9a9756881f2e7566cd21/html5/thumbnails/54.jpg)
Data Reduction as a Strategy to Minimize Type I Error Problem
Filter out genes with:
• Low expression
or
• High variation
Hackstadt and Hess (2009) BMC Bioinformatics 10:11

| Filter | # “Present” |
| --- | --- |
| MAS5 (use M=P): At least 4/24 = P | 28,883 |
| MAS5 (use M=P): 75% total = P | 21,448 |
| MAS5 (use M=P): 75% P in at least 1 group | 25,985 |
| MAS5 (use M=P): 50% total = P | 23,793 |
| MAS5 (use M=P): 50% P in at least 1 group | 29,260 |
| Drop Bottom 25%: 75% total = P | 40,597 |
| Drop Bottom 25%: 75% P in at least 1 group | 45,689 |
| Drop genes SD/mean>0.25: 75% total = P | 18,661 |

54,677 genes on the U133 Plus 2.0 array
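The two kinds of present-call filter listed above ("% total = P" and "% P in at least 1 group") can be sketched in Python as follows. The function names, group labels, and detection calls are invented for illustration, and marginal calls are counted as present to mirror the "use M=P" note; this is not the code used to produce the counts on the slide.

```python
def fraction_present(calls):
    """Fraction of MAS5-style detection calls that are present ('P' or 'M')."""
    return sum(call in ("P", "M") for call in calls) / len(calls)


def passes_total_filter(calls, min_fraction):
    """Keep the gene if enough of all samples are called present."""
    return fraction_present(calls) >= min_fraction


def passes_group_filter(calls, groups, min_fraction):
    """Keep the gene if at least one group has enough present calls."""
    by_group = {}
    for call, group in zip(calls, groups):
        by_group.setdefault(group, []).append(call)
    return any(fraction_present(c) >= min_fraction for c in by_group.values())


# Made-up gene: present in all control samples, absent in all treated samples.
calls = ["P", "P", "P", "M", "A", "A", "A", "A"]
groups = ["control"] * 4 + ["treated"] * 4

print(passes_total_filter(calls, 0.75))          # False: only 50% present overall
print(passes_group_filter(calls, groups, 0.75))  # True: 100% present in control
```

The group-based filter keeps genes that are switched on or off by a treatment, which a filter on the total fraction of present calls would discard.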
![Page 55: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing](https://reader034.fdocuments.net/reader034/viewer/2022042712/5f9b9a9756881f2e7566cd21/html5/thumbnails/55.jpg)
Processing and Statistics Influence the DEG List

| Processing | Statistic | DEG at 48 h |
| --- | --- | --- |
| gcRMA | SAM | 3566 |
| RMA | SAM | 2249 |
| RMA | limma | 1021 |