Rethinking data-intensive science using scalable analytics systems

Frank Austin Nothaft, Matt Massie, Timothy Danford, Zhao Zhang, Uri Laserson, Carl Yeksigian, Jey Kottalam, Arun Ahuja, Jeff Hammerbacher, Michael Linderman, Michael J. Franklin, Anthony D. Joseph, David A. Patterson
AMPLab, University of California, Berkeley; Cloudera, San Francisco, CA; Icahn School of Medicine at Mount Sinai, New York, NY; GenomeBridge, Cambridge, MA


Abstract

• In this paper, we describe ADAM, an example genomics pipeline that leverages the open-source Apache Spark and Parquet systems to achieve a 28x speedup over current genomics pipelines, while reducing cost by 63%. From building this system, we were able to distill a set of techniques for implementing scientific analyses efficiently using commodity “big data” systems.


Background

[Figure. Source: NIH National Human Genome Research Institute]

Characteristics of scientific analysis systems


Layering

• Physical Storage coordinates data writes to physical media.

• Data Distribution manages access, replication, and distribution of the files that have been written to storage media.

• Materialized Data defines how data is encoded and laid out on storage. This layer determines I/O bandwidth and compression.


Layering

• Data Schema specifies the representation of data, and forms the narrow waist of the stack that separates data access from execution.

• Evidence Access provides primitives for processing data, and enables the transformation of data into different views and traversals.

• Presentation enhances the data schema with convenience methods for performing common tasks and accessing common derived fields from a single element.


Layering

• Applications use the evidence access and presentation layers to compose algorithms for performing an analysis; a minimal end-to-end sketch of these layers follows below.
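As a rough illustration (not ADAM's actual code), the Scala sketch below threads one record type through the schema, presentation, evidence access, and application layers. The AlignmentRecord fields, the file path, and the thresholds are assumptions for illustration only.

```scala
import org.apache.spark.sql.SparkSession

// Data Schema layer: a narrow-waist record type shared by all higher layers.
// (Fields are illustrative; ADAM's real schema is richer.)
case class AlignmentRecord(contigName: String, start: Long, sequence: String, mapq: Int)

object LayeringSketch {
  // Presentation layer: convenience methods over schema fields.
  implicit class RichRead(val r: AlignmentRecord) extends AnyVal {
    def end: Long = r.start + r.sequence.length // a common derived field
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("layering-sketch").getOrCreate()
    import spark.implicits._

    // Evidence Access layer: load the materialized (Parquet) data as a typed view.
    val reads = spark.read.parquet("hdfs:///data/reads.parquet").as[AlignmentRecord]

    // Application layer: compose the layers into an analysis
    // (keep well-mapped reads at least 50 bases long).
    val kept = reads.filter(r => r.mapq >= 30 && r.end - r.start >= 50)
    println(s"reads kept: ${kept.count()}")

    spark.stop()
  }
}
```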


Case studies


Parquet

• Open source, created by Twitter and Cloudera, based on Google's Dremel

• Columnar File Format

• Limits I/O to only the data that is needed (see the sketch after this list)

• Compresses very well: ADAM files are 5-25% smaller than BAM files without loss of data

• Three levels of parallelism: file/row group, column chunk, and page
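As a minimal sketch of how columnar projection limits I/O, a Spark read can select just the columns a query needs, so Parquet skips the remaining column chunks entirely. The path and field names below are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object ProjectionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("projection-sketch").getOrCreate()

    // Only the two projected columns are read from disk;
    // Parquet never touches the other column chunks.
    val positions = spark.read
      .parquet("hdfs:///data/reads.parquet") // hypothetical path
      .select("contigName", "start")         // column projection

    positions.show(5)
    spark.stop()
  }
}
```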


Parquet/Spark integration

• One row group in Parquet maps to one partition in Spark

• We interact with Parquet via input/output formats

• Spark builds and executes a computation directed acyclic graph (DAG), and manages data locality and error handling/retries (see the sketch below)
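A minimal sketch of this execution model, assuming illustrative column names: transformations only extend the DAG, and the final action triggers locality-aware scheduling with per-task retries.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object DagSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("dag-sketch").getOrCreate()

    // Each Parquet row group becomes (roughly) one input partition.
    val reads = spark.read.parquet("hdfs:///data/reads.parquet")

    // Lazy transformations: these only build up the DAG.
    val perContig = reads
      .filter(col("mapq") >= 30)
      .groupBy(col("contigName"))
      .count()

    // The action executes the DAG; Spark schedules tasks for data
    // locality and transparently retries failed tasks.
    perContig.show()
    spark.stop()
  }
}
```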


Performance


• We evaluated ADAM against the GATK [14], SAMtools [32], Picard [51], and Sambamba [50], measuring the performance of base quality score recalibration (BQSR), INDEL realignment (IR), duplicate marking (DM), sort, and Flagstat (FS).

Genomics Workloads

• This data is shown in Table 2. Although ADAM is more expensive than the best legacy tool (Sambamba [50]) for sorting and duplicate marking, it is less expensive for every other stage. In total, ADAM reduces end-to-end analysis cost by 63% over a pipeline built solely from legacy tools.

• Table 3 describes the instance types.

• We achieve near-linear speedup out to 128 nodes.

Conclusion

• By rethinking the architecture of scientific data management systems, we achieve performance parity with legacy tools on single-node systems, while providing linear strong scaling out to 128 nodes. By making it easy to scale scientific analyses across many commodity machines, we enable the use of smaller, less expensive computers, leading to a 63% cost improvement and a 28x improvement in read preprocessing pipeline latency.


Q&A
