Hadoop ecosystem for health/life sciences
![Page 1: Hadoop ecosystem for health/life sciences](https://reader038.fdocuments.net/reader038/viewer/2022103000/554a374cb4c905863d8b45f8/html5/thumbnails/1.jpg)
1
Hadoop ecosystem for life sciences
Uri Laserson
30 September 2013
![Page 2: Hadoop ecosystem for health/life sciences](https://reader038.fdocuments.net/reader038/viewer/2022103000/554a374cb4c905863d8b45f8/html5/thumbnails/2.jpg)
2
About the speaker
• Currently “Data Scientist” at Cloudera
• PhD in Biomedical Engineering at MIT/Harvard (2005-2012)
• Focused on next-generation DNA sequencing technology in George Church’s lab
• Co-founded Good Start Genetics (2007-)
  • First application of next-gen sequencing to genetic carrier screening
![Page 3: Hadoop ecosystem for health/life sciences](https://reader038.fdocuments.net/reader038/viewer/2022103000/554a374cb4c905863d8b45f8/html5/thumbnails/3.jpg)
3
Agenda
• Historical context
• Introduction to Hadoop ecosystem
• Genomics on Hadoop
• Other use cases in life sciences
![Page 4: Hadoop ecosystem for health/life sciences](https://reader038.fdocuments.net/reader038/viewer/2022103000/554a374cb4c905863d8b45f8/html5/thumbnails/4.jpg)
4
Historical Context
![Page 5: Hadoop ecosystem for health/life sciences](https://reader038.fdocuments.net/reader038/viewer/2022103000/554a374cb4c905863d8b45f8/html5/thumbnails/5.jpg)
5
1999!
![Page 6: Hadoop ecosystem for health/life sciences](https://reader038.fdocuments.net/reader038/viewer/2022103000/554a374cb4c905863d8b45f8/html5/thumbnails/6.jpg)
6
Indexing the Web
• Web is Huge
  • Hundreds of millions of pages in 1999
• How do you index it?
  • Crawl all the pages
  • Rank pages based on relevance metrics
  • Build search index of keywords to pages
  • Do it in real time!
![Page 7: Hadoop ecosystem for health/life sciences](https://reader038.fdocuments.net/reader038/viewer/2022103000/554a374cb4c905863d8b45f8/html5/thumbnails/7.jpg)
7
![Page 8: Hadoop ecosystem for health/life sciences](https://reader038.fdocuments.net/reader038/viewer/2022103000/554a374cb4c905863d8b45f8/html5/thumbnails/8.jpg)
8
Databases in 1999
1. Buy a really big machine
2. Install expensive DBMS on it
3. Point your workload at it
4. Hope it doesn’t fail
5. Ambitious: buy another big machine as backup
![Page 9: Hadoop ecosystem for health/life sciences](https://reader038.fdocuments.net/reader038/viewer/2022103000/554a374cb4c905863d8b45f8/html5/thumbnails/9.jpg)
9
![Page 10: Hadoop ecosystem for health/life sciences](https://reader038.fdocuments.net/reader038/viewer/2022103000/554a374cb4c905863d8b45f8/html5/thumbnails/10.jpg)
10
Database Limitations
• Didn’t scale horizontally
  • High marginal cost ($$$)
• No real fault-tolerance story
• Vendor lock-in ($$$)
• SQL unsuited for search ranking
  • Complex analysis (PageRank)
  • Unstructured data
![Page 11: Hadoop ecosystem for health/life sciences](https://reader038.fdocuments.net/reader038/viewer/2022103000/554a374cb4c905863d8b45f8/html5/thumbnails/11.jpg)
11
![Page 12: Hadoop ecosystem for health/life sciences](https://reader038.fdocuments.net/reader038/viewer/2022103000/554a374cb4c905863d8b45f8/html5/thumbnails/12.jpg)
12
Google does something different
• Designed their own storage and processing infrastructure
  • Google File System (GFS) and MapReduce (MR)
• Goals: KISS
  • Cheap
  • Scalable
  • Reliable
![Page 13: Hadoop ecosystem for health/life sciences](https://reader038.fdocuments.net/reader038/viewer/2022103000/554a374cb4c905863d8b45f8/html5/thumbnails/13.jpg)
13
Google does something different
• It worked!
• Powered Google Search for many years
• General framework for large-scale batch computation tasks
• Still used internally at Google to this day
![Page 14: Hadoop ecosystem for health/life sciences](https://reader038.fdocuments.net/reader038/viewer/2022103000/554a374cb4c905863d8b45f8/html5/thumbnails/14.jpg)
14
Google benevolent enough to publish
• 2003: Google File System paper
• 2004: MapReduce paper
![Page 15: Hadoop ecosystem for health/life sciences](https://reader038.fdocuments.net/reader038/viewer/2022103000/554a374cb4c905863d8b45f8/html5/thumbnails/15.jpg)
15
Birth of Hadoop at Yahoo!
• 2004-2006: Doug Cutting and Mike Cafarella implement GFS/MR
• 2006: Spun out as Apache Hadoop
• Named after Doug’s son’s yellow stuffed elephant
![Page 16: Hadoop ecosystem for health/life sciences](https://reader038.fdocuments.net/reader038/viewer/2022103000/554a374cb4c905863d8b45f8/html5/thumbnails/16.jpg)
16
Industry strategy: Copy Google
| Google | Open-source | Function |
|---|---|---|
| GFS | HDFS | Distributed file system |
| MapReduce | MapReduce | Batch distributed data processing |
| Bigtable | HBase | Distributed DB/key-value store |
| Protobuf/Stubby | Thrift or Avro | Data serialization/RPC |
| Pregel | Giraph | Distributed graph processing |
| Dremel/F1 | Cloudera Impala | Scalable interactive SQL (MPP) |
| FlumeJava | Crunch | Abstracted data pipelines on Hadoop |

(HDFS and MapReduce together make up Hadoop.)
![Page 17: Hadoop ecosystem for health/life sciences](https://reader038.fdocuments.net/reader038/viewer/2022103000/554a374cb4c905863d8b45f8/html5/thumbnails/17.jpg)
17
Overview of core technology
![Page 18: Hadoop ecosystem for health/life sciences](https://reader038.fdocuments.net/reader038/viewer/2022103000/554a374cb4c905863d8b45f8/html5/thumbnails/18.jpg)
18
HDFS design assumptions
• Based on Google File System
• Files are large (GBs to TBs)
• Failures are common
  • Massive scale means failures very likely
  • Disk, node, or network failures
• Accesses are large and sequential
• Files are append-only
![Page 19: Hadoop ecosystem for health/life sciences](https://reader038.fdocuments.net/reader038/viewer/2022103000/554a374cb4c905863d8b45f8/html5/thumbnails/19.jpg)
19
HDFS properties
• Fault-tolerant
  • Gracefully responds to node/disk/network failures
• Horizontally scalable
  • Low marginal cost
• High-bandwidth
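Figure: HDFS storage distribution. The input file is split into blocks 1 to 5, and each block is stored on three of the five nodes (Node A to Node E). Each node holds three blocks: {2, 4, 5}, {1, 2, 5}, {1, 3, 4}, {2, 3, 5}, {1, 3, 4}.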
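The replication idea behind this figure can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the round-robin placement policy and the function names `place_blocks` and `readable_after_failures` are hypothetical and are not how HDFS actually chooses nodes (real placement is rack-aware).

```python
from itertools import combinations

def place_blocks(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes.

    Round-robin placement is an illustrative assumption, not the
    actual HDFS block placement policy.
    """
    placement = {}
    for block in range(num_blocks):
        placement[block + 1] = [
            nodes[(block + r) % len(nodes)] for r in range(replication)
        ]
    return placement

def readable_after_failures(placement, failed):
    """True if every block still has at least one replica on a live node."""
    return all(
        any(node not in failed for node in replicas)
        for replicas in placement.values()
    )

nodes = ["A", "B", "C", "D", "E"]
placement = place_blocks(5, nodes)

# With three replicas per block, any two simultaneous node failures
# still leave every block readable.
survives_any_two = all(
    readable_after_failures(placement, set(pair))
    for pair in combinations(nodes, 2)
)
```

With three replicas, a block only becomes unreadable when all three of its nodes are down at once, which is why the file survives any two node failures in this toy setup.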
![Page 20: Hadoop ecosystem for health/life sciences](https://reader038.fdocuments.net/reader038/viewer/2022103000/554a374cb4c905863d8b45f8/html5/thumbnails/20.jpg)
20
MapReduce computation
![Page 21: Hadoop ecosystem for health/life sciences](https://reader038.fdocuments.net/reader038/viewer/2022103000/554a374cb4c905863d8b45f8/html5/thumbnails/21.jpg)
21
MapReduce
• Structured as
  1. Embarrassingly parallel “map stage”
  2. Cluster-wide distributed sort (“shuffle”)
  3. Aggregation “reduce stage”
• Data-locality: process the data where it is stored
• Fault-tolerance: failed tasks automatically detected and restarted
• Schema-on-read: data need not be stored conforming to a rigid schema
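The three stages can be sketched with the classic word count, here as a single-process Python simulation. This is a minimal sketch, not the Hadoop API: the function names `map_stage`, `shuffle`, and `reduce_stage` are made up for illustration, and a real job would run the map and reduce tasks on many nodes.

```python
from collections import defaultdict

def map_stage(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle: sort by key and group values so each key reaches one reducer."""
    grouped = defaultdict(list)
    for key, value in sorted(pairs):
        grouped[key].append(value)
    return grouped

def reduce_stage(key, values):
    """Reduce: aggregate all values emitted for one key."""
    return key, sum(values)

documents = ["the quick brown fox", "the lazy dog", "The fox"]

# Each document could be mapped on a different node; here we loop serially.
mapped = [pair for doc in documents for pair in map_stage(doc)]
counts = dict(reduce_stage(k, v) for k, v in shuffle(mapped).items())
```

Because `map_stage` looks at one record at a time and `reduce_stage` at one key at a time, both stages parallelize trivially; the only cluster-wide coordination is the shuffle.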
![Page 22: Hadoop ecosystem for health/life sciences](https://reader038.fdocuments.net/reader038/viewer/2022103000/554a374cb4c905863d8b45f8/html5/thumbnails/22.jpg)
22
WordCount example
![Page 23: Hadoop ecosystem for health/life sciences](https://reader038.fdocuments.net/reader038/viewer/2022103000/554a374cb4c905863d8b45f8/html5/thumbnails/23.jpg)
23
HPC separates compute from storage
Storage infrastructure
• Proprietary, distributed file system
• Expensive
Compute cluster
• High-performance hardware
• Low failure rate
• Expensive
Big network pipe ($$$)
User typically works by manually submitting jobs to scheduler, e.g., LSF, Grid Engine, etc.
HPC is about compute. Hadoop is about data.
![Page 24: Hadoop ecosystem for health/life sciences](https://reader038.fdocuments.net/reader038/viewer/2022103000/554a374cb4c905863d8b45f8/html5/thumbnails/24.jpg)
24
Hadoop colocates compute and storage
Compute cluster / Storage infrastructure
• Commodity hardware
• Data-locality
• Reduced networking needs
HPC is about compute. Hadoop is about data.
![Page 25: Hadoop ecosystem for health/life sciences](https://reader038.fdocuments.net/reader038/viewer/2022103000/554a374cb4c905863d8b45f8/html5/thumbnails/25.jpg)
25
HPC is lower-level than Hadoop
• HPC only exposes job scheduling
• Parallelization typically occurs through MPI
  • Very low-level communication primitives
• Difficult to horizontally scale by simply adding nodes
  • Large data sets must be manually split
• Failures must be dealt with manually
• Hadoop has fault-tolerance, data locality, horizontal scalability
![Page 26: Hadoop ecosystem for health/life sciences](https://reader038.fdocuments.net/reader038/viewer/2022103000/554a374cb4c905863d8b45f8/html5/thumbnails/26.jpg)
26
Sqoop
Bidirectional data transfer between Hadoop and almost any SQL database with a JDBC driver
![Page 27: Hadoop ecosystem for health/life sciences](https://reader038.fdocuments.net/reader038/viewer/2022103000/554a374cb4c905863d8b45f8/html5/thumbnails/27.jpg)
27
Flume
A streaming data collection and aggregation system for massive volumes of data, such as RPC services, Log4J, Syslog, etc.
Figure: several clients sending events to a smaller set of Flume agents.
![Page 28: Hadoop ecosystem for health/life sciences](https://reader038.fdocuments.net/reader038/viewer/2022103000/554a374cb4c905863d8b45f8/html5/thumbnails/28.jpg)
28
Cloudera Impala
Modern MPP database built on top of HDFS
Designed for interactive queries on terabyte-scale data sets.
![Page 29: Hadoop ecosystem for health/life sciences](https://reader038.fdocuments.net/reader038/viewer/2022103000/554a374cb4c905863d8b45f8/html5/thumbnails/29.jpg)
29
Cloudera Search
• Interactive search queries on top of HDFS
• Built on Solr and SolrCloud
• Near-realtime indexing of new documents
![Page 30: Hadoop ecosystem for health/life sciences](https://reader038.fdocuments.net/reader038/viewer/2022103000/554a374cb4c905863d8b45f8/html5/thumbnails/30.jpg)
30
Benefits of Hadoop ecosystem
• Inexpensive commodity compute/storage
  • Tolerates random hardware failure
• Decreased need for high-bandwidth network pipes
  • Co-locate compute and storage
  • Exploit data locality
• Simple horizontal scalability by adding nodes
  • MapReduce jobs effectively guaranteed to scale
• Fault-tolerance/replication built-in. Data is durable
• Large ecosystem of tools
• Flexible data storage. Schema-on-read. Unstructured data.
![Page 31: Hadoop ecosystem for health/life sciences](https://reader038.fdocuments.net/reader038/viewer/2022103000/554a374cb4c905863d8b45f8/html5/thumbnails/31.jpg)
31
Scaling Genomics
![Page 32: Hadoop ecosystem for health/life sciences](https://reader038.fdocuments.net/reader038/viewer/2022103000/554a374cb4c905863d8b45f8/html5/thumbnails/32.jpg)
32
![Page 33: Hadoop ecosystem for health/life sciences](https://reader038.fdocuments.net/reader038/viewer/2022103000/554a374cb4c905863d8b45f8/html5/thumbnails/33.jpg)
33
NCBI Sequence Read Archive (SRA)
Today…1.14 petabytes
One year ago…609 terabytes
![Page 34: Hadoop ecosystem for health/life sciences](https://reader038.fdocuments.net/reader038/viewer/2022103000/554a374cb4c905863d8b45f8/html5/thumbnails/34.jpg)
34
Every ‘ome has a -seq
| ‘ome | -seq |
|---|---|
| Genome | DNA-seq |
| Transcriptome | RNA-seq, FRT-seq, NET-seq |
| Methylome | Bisulfite-seq |
| Immunome | Immune-seq |
| Proteome | PhIP-seq, Bind-n-seq |
![Page 35: Hadoop ecosystem for health/life sciences](https://reader038.fdocuments.net/reader038/viewer/2022103000/554a374cb4c905863d8b45f8/html5/thumbnails/35.jpg)
35
Genomics ETL
GATK best practices
![Page 36: Hadoop ecosystem for health/life sciences](https://reader038.fdocuments.net/reader038/viewer/2022103000/554a374cb4c905863d8b45f8/html5/thumbnails/36.jpg)
36
Genomics ETL
.fastq → (short read alignment) → .bam → (genotype calling) → .vcf

• Short read alignment is embarrassingly parallel
• Pileup/variant calling requires distributed sort
• GATK is a reimplementation of MapReduce; could run on Hadoop
• Already available Hadoop tools
  • Crossbow: short read alignment/variant calling
  • Hadoop-BAM: distributed bamtools
  • BioPig: manipulating large fasta/q
  • SEAL: Hadoop-enabled BWA
  • Contrail: de-novo assembly
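Why pileup needs a distributed sort can be seen in a toy Python sketch: each read can be processed independently, but counting the bases at one locus requires bringing together the output of every read that covers it. The read tuples, the ungapped-alignment assumption, and the name `map_read` are illustrative only; this is not how GATK or the tools listed above are implemented.

```python
from collections import Counter
from itertools import groupby

# Toy aligned reads: (chromosome, 1-based start, read sequence).
# Alignments are assumed ungapped to keep the example short.
reads = [
    ("chr1", 3, "ACGT"),
    ("chr1", 1, "TTAC"),
    ("chr1", 4, "CGTA"),
]

def map_read(read):
    """Map: emit ((chrom, pos), base) for every base the read covers."""
    chrom, start, seq = read
    return [((chrom, start + i), base) for i, base in enumerate(seq)]

# Map stage is independent per read (embarrassingly parallel).
emitted = [kv for read in reads for kv in map_read(read)]

# Shuffle: sort by genomic coordinate so bases at the same locus are adjacent.
emitted.sort(key=lambda kv: kv[0])

# Reduce: count the bases observed at each position (the pileup).
pileup = {
    locus: Counter(base for _, base in group)
    for locus, group in groupby(emitted, key=lambda kv: kv[0])
}
```

A variant caller would then inspect each per-locus counter; on a cluster, the sort step is exactly the MapReduce shuffle.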
![Page 37: Hadoop ecosystem for health/life sciences](https://reader038.fdocuments.net/reader038/viewer/2022103000/554a374cb4c905863d8b45f8/html5/thumbnails/37.jpg)
37
Use case 1: Scaling a genome center pipeline
• Currently at 5k genomes (150 TB incl. raw), looking to scale to 25k now (1 PB) and eventually 100k (requiring 4 PB)
• Current throughput
  • >1300 samples per month
  • >12 TB raw data per month
• Data ultimately served from MySQL database
  • 750 GB of processed variant data
  • 25k genomes requires >3.5 TB in MySQL
• Complex 4-tier storage system, including tape, filer, and RDBMS
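A back-of-the-envelope check of these figures, assuming storage grows linearly with the number of genomes (an assumption for illustration; the 1 PB and 4 PB targets on the slide presumably include headroom):

```python
# Figures taken from the slide; linear scaling is an illustrative assumption.
current_genomes = 5_000
current_storage_tb = 150       # including raw data
current_variant_db_gb = 750    # processed variants served from MySQL

def projected_storage_tb(genomes):
    """Total storage in TB if per-genome footprint stays constant."""
    return genomes * current_storage_tb / current_genomes

def projected_variant_db_tb(genomes):
    """Processed variant data in TB if per-genome footprint stays constant."""
    return genomes * current_variant_db_gb / current_genomes / 1_000

gb_per_genome = current_storage_tb * 1_000 / current_genomes  # 30 GB per genome
storage_25k = projected_storage_tb(25_000)      # 750 TB, under the 1 PB target
storage_100k = projected_storage_tb(100_000)    # 3,000 TB, under the 4 PB target
variants_25k = projected_variant_db_tb(25_000)  # 3.75 TB, consistent with ">3.5 TB"
```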
![Page 38: Hadoop ecosystem for health/life sciences](https://reader038.fdocuments.net/reader038/viewer/2022103000/554a374cb4c905863d8b45f8/html5/thumbnails/38.jpg)
38
Use case 1: Scaling a genome center pipeline
• Database serves population genetics applications and case/control studies
• Unify all data processing into HDFS
• Replace MySQL with Impala on Hadoop for increased scalability
• Possibly move raw data processing into MapReduce
![Page 39: Hadoop ecosystem for health/life sciences](https://reader038.fdocuments.net/reader038/viewer/2022103000/554a374cb4c905863d8b45f8/html5/thumbnails/39.jpg)
39
Use case 2: Querying large, integrated data sets
• Biotech client has thousands of genomes
• Want to expose ad hoc querying functionality on large scale
  • e.g., vcftools/PLINK-SEQ on terabyte-scale data sets
• Integrating data with public data sets (e.g., ENCODE, UCSC browser)
  • Terabyte-scale annotation sets
• Currently, these capabilities (e.g., data joins) are often manually implemented
![Page 40: Hadoop ecosystem for health/life sciences](https://reader038.fdocuments.net/reader038/viewer/2022103000/554a374cb4c905863d8b45f8/html5/thumbnails/40.jpg)
40
Use case 2: Querying large, integrated data sets
• Hadoop allows all data to be centrally stored and accessible
• Impala exposes a SQL query interface to data sets in Hadoop
![Page 41: Hadoop ecosystem for health/life sciences](https://reader038.fdocuments.net/reader038/viewer/2022103000/554a374cb4c905863d8b45f8/html5/thumbnails/41.jpg)
41
Variant-filtering example
• “Give me all SNPs that are:
  • on chromosome 5
  • absent from dbSNP
  • present in COSMIC
  • observed in breast cancer samples
  • absent from prostate cancer samples
  • overlapping a DNase hypersensitivity site
  • overlapping a ChIP-seq site for a particular TF”
• On the full 1000 Genomes data set (~37 billion variants), the query finishes in a couple of seconds
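The shape of such a query can be sketched in SQL. The example below runs it on an in-memory SQLite database purely as a stand-in for Impala; the table names, column names, and rows are all hypothetical, and only four of the filters (chromosome, dbSNP, COSMIC, DNase overlap) are shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical toy schema; real variant and annotation tables would differ.
    CREATE TABLE variants (chrom TEXT, pos INTEGER, ref TEXT, alt TEXT);
    CREATE TABLE dbsnp    (chrom TEXT, pos INTEGER);
    CREATE TABLE cosmic   (chrom TEXT, pos INTEGER);
    CREATE TABLE dnase    (chrom TEXT, site_start INTEGER, site_end INTEGER);

    INSERT INTO variants VALUES
        ('5', 100, 'A', 'G'), ('5', 250, 'C', 'T'),
        ('5', 900, 'G', 'A'), ('7', 100, 'T', 'C');
    INSERT INTO dbsnp  VALUES ('5', 250);
    INSERT INTO cosmic VALUES ('5', 100), ('5', 250), ('5', 900), ('7', 100);
    INSERT INTO dnase  VALUES ('5', 50, 150), ('7', 50, 150);
""")

query = """
    SELECT DISTINCT v.chrom, v.pos, v.ref, v.alt
    FROM variants v
    JOIN cosmic c ON c.chrom = v.chrom AND c.pos = v.pos          -- present in COSMIC
    JOIN dnase  d ON d.chrom = v.chrom
                 AND v.pos BETWEEN d.site_start AND d.site_end    -- overlaps a DNase site
    LEFT JOIN dbsnp s ON s.chrom = v.chrom AND s.pos = v.pos
    WHERE v.chrom = '5'                                           -- on chromosome 5
      AND s.pos IS NULL                                           -- absent from dbSNP
    ORDER BY v.pos
"""
hits = conn.execute(query).fetchall()
conn.close()
```

Each filter in the slide becomes either a join against an annotation table or an anti-join (`LEFT JOIN ... IS NULL`), which is the kind of data join that is otherwise implemented by hand.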
![Page 42: Hadoop ecosystem for health/life sciences](https://reader038.fdocuments.net/reader038/viewer/2022103000/554a374cb4c905863d8b45f8/html5/thumbnails/42.jpg)
42
All-vs-all eQTL
• Possible to generate trillions of hypothesis tests
  • 10^7 loci x 10^4 phenotypes x 10s of tissues = 10^12 p-values
  • Tested below on 120 billion associations
• Example queries:
  • “Given 5 genes of interest, find top 20 most significant eQTLs (cis and/or trans)”
    • Finishes in several seconds
  • “Find all cis-eQTLs across the entire genome”
    • Finishes in a couple of minutes
    • Limited by disk throughput
![Page 43: Hadoop ecosystem for health/life sciences](https://reader038.fdocuments.net/reader038/viewer/2022103000/554a374cb4c905863d8b45f8/html5/thumbnails/43.jpg)
43
All-vs-all eQTL
• “Find all SNPs that are:
  • in LD with some lead SNP or eQTL of interest
  • aligned with some functional annotation of interest”
• Still in testing, but likely finishes in seconds
Schaub et al, Genome Research, 2012
![Page 44: Hadoop ecosystem for health/life sciences](https://reader038.fdocuments.net/reader038/viewer/2022103000/554a374cb4c905863d8b45f8/html5/thumbnails/44.jpg)
44
Genomics summary
• ETL (raw data to analysis-ready data)
• Data integration
  • e.g., interactively queryable UCSC genome browser
• De novo assembly
• NLP on scientific literature
![Page 45: Hadoop ecosystem for health/life sciences](https://reader038.fdocuments.net/reader038/viewer/2022103000/554a374cb4c905863d8b45f8/html5/thumbnails/45.jpg)
45
Other use cases
Clinical data
Manufacturing
![Page 46: Hadoop ecosystem for health/life sciences](https://reader038.fdocuments.net/reader038/viewer/2022103000/554a374cb4c905863d8b45f8/html5/thumbnails/46.jpg)
46
Use case 3: Clinical document queries for EHR company
• EHR company wants to expose query functionality to clinicians
  • >16 million clinical documents with free text; processed through NLP pipeline
  • >500 million lab results
• Perform subject expansion on search queries via ontologies
  • e.g., “myocardial infarction” will match “heart disease”
• Search functionality implemented with Lucene (serving) on top of HBase (processing/storage/indexing)
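Subject expansion can be illustrated with a toy ontology in Python. This is a sketch under assumptions: the `PARENT` mapping is invented, expansion toward broader concepts is one plausible reading of the slide's example, and plain substring matching stands in for the Lucene index used in the actual system.

```python
# Hypothetical toy ontology: each term maps to its broader (parent) concept.
PARENT = {
    "myocardial infarction": "ischemic heart disease",
    "ischemic heart disease": "heart disease",
    "heart disease": "cardiovascular disease",
}

def expand_query(term):
    """Return the term plus all broader concepts reachable in the ontology."""
    expanded = [term]
    while expanded[-1] in PARENT:
        expanded.append(PARENT[expanded[-1]])
    return expanded

def search(term, documents):
    """Return ids of documents whose text mentions any expanded term."""
    terms = expand_query(term.lower())
    return [
        doc_id
        for doc_id, text in documents.items()
        if any(t in text.lower() for t in terms)
    ]

documents = {
    1: "Patient has a history of heart disease.",
    2: "Acute myocardial infarction, admitted to ICU.",
    3: "Type 2 diabetes, well controlled.",
}
matches = search("myocardial infarction", documents)
```

Without expansion, the query would miss document 1, which never uses the literal phrase the clinician typed.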
![Page 47: Hadoop ecosystem for health/life sciences](https://reader038.fdocuments.net/reader038/viewer/2022103000/554a374cb4c905863d8b45f8/html5/thumbnails/47.jpg)
47
Use case 3: Clinical document queries for EHR company
• Interested in recommendation engine-enabled queries, like:
• Clinician searches “diabetes” and has relevant lab results already highlighted when opening a patient’s record
• Clinician wants to know what other conditions might be correlated with a finding of interest
![Page 48: Hadoop ecosystem for health/life sciences](https://reader038.fdocuments.net/reader038/viewer/2022103000/554a374cb4c905863d8b45f8/html5/thumbnails/48.jpg)
48
Use case 3: Clinical document queries for EHR company
“Find other patients similar to mine”
• The Stanford system is limited to search
• Recommendation engines allow a button “find similar”
![Page 49: Hadoop ecosystem for health/life sciences](https://reader038.fdocuments.net/reader038/viewer/2022103000/554a374cb4c905863d8b45f8/html5/thumbnails/49.jpg)
49
Use case 4: Insurance company
• Data from 30 different EHRs across multiple business units
• High variance in ICD9 coding between locales
• Use NLP and machine learning to improve ICD9 coding to reduce variance in diagnosis
![Page 50: Hadoop ecosystem for health/life sciences](https://reader038.fdocuments.net/reader038/viewer/2022103000/554a374cb4c905863d8b45f8/html5/thumbnails/50.jpg)
50
Use case 5: Pharma company variance in yields
• Pharma company performs large batch fermentations of their product
• Find high levels of variance in their yield
• Fermentations are automated and highly instrumented
  • e.g., dissolved oxygen, nutrients, COAs, temperature, etc.
• Perform time series analysis on fermentation runs to predict yields and determine which variables control variance
![Page 51: Hadoop ecosystem for health/life sciences](https://reader038.fdocuments.net/reader038/viewer/2022103000/554a374cb4c905863d8b45f8/html5/thumbnails/51.jpg)
51
Use case 6: AgTech company integrating data sources
• Multiple reference genome sequences
• Genotyping on thousands of samples
• Weather data
• Soil data
• Microbiome data
• Yield data
• Geo data
• All integrated in HBase
![Page 52: Hadoop ecosystem for health/life sciences](https://reader038.fdocuments.net/reader038/viewer/2022103000/554a374cb4c905863d8b45f8/html5/thumbnails/52.jpg)
52
Use case 6: AgTech company integrating data sources
• Can increase crop yields ~15% by “printing” seeds onto a field
• Support search queries by name, ontology concepts, protein families, creation dates, assembly/chromosome positions, SNPs
• Import any annotation data in CSV/GFF
• Integration with cloning tools
• Supports a web front-end for easy access
![Page 53: Hadoop ecosystem for health/life sciences](https://reader038.fdocuments.net/reader038/viewer/2022103000/554a374cb4c905863d8b45f8/html5/thumbnails/53.jpg)
53
Conclusions
![Page 54: Hadoop ecosystem for health/life sciences](https://reader038.fdocuments.net/reader038/viewer/2022103000/554a374cb4c905863d8b45f8/html5/thumbnails/54.jpg)
Highly heterogeneous data
54
• COMMUNICATIONS: Location-based advertising
• HEALTH CARE: Patient sensors, monitoring, EHRs; quality of care
• LAW ENFORCEMENT & DEFENSE: Threat analysis, social media monitoring, photo analysis
• EDUCATION & RESEARCH: Experiment sensor analysis
• FINANCIAL SERVICES: Risk & portfolio analysis; new products
• ON-LINE SERVICES / SOCIAL MEDIA: People & career matching; website optimization
• UTILITIES: Smart meter analysis for network capacity
• CONSUMER PACKAGED GOODS: Sentiment analysis of what’s hot; customer service
• MEDIA / ENTERTAINMENT: Viewers / advertising effectiveness
• TRAVEL & TRANSPORTATION: Sensor analysis for optimal traffic flows; customer sentiment
• LIFE SCIENCES: Clinical trials; genomics
• RETAIL: Consumer sentiment; optimized marketing
• AUTOMOTIVE: Auto sensors reporting location, problems
• HIGH TECH / INDUSTRIAL MFG.: Mfg. quality; warranty analysis
• OIL & GAS: Drilling exploration sensor analysis
©2013 Cloudera, Inc. All Rights Reserved.
![Page 55: Hadoop ecosystem for health/life sciences](https://reader038.fdocuments.net/reader038/viewer/2022103000/554a374cb4c905863d8b45f8/html5/thumbnails/55.jpg)
55
Flexibility
• Store any data
• Run any analysis and processing
• Keeps pace with the rate of change of incoming data
Scalability
• Proven growth to PBs/1,000s of nodes
• No need to rewrite queries, automatically scales
• Keeps pace with the rate of growth of incoming data
Efficiency
• Cost per TB at a fraction of other options
• Keep all of your data alive in an active archive
• Powering the “data beats algorithm” movement
The Cloudera Enterprise Platform for Big Data
![Page 56: Hadoop ecosystem for health/life sciences](https://reader038.fdocuments.net/reader038/viewer/2022103000/554a374cb4c905863d8b45f8/html5/thumbnails/56.jpg)
56
Cloudera Hadoop Stack
![Page 57: Hadoop ecosystem for health/life sciences](https://reader038.fdocuments.net/reader038/viewer/2022103000/554a374cb4c905863d8b45f8/html5/thumbnails/57.jpg)
57