Rethinking data-intensive science using scalable analytics systems


Transcript of Rethinking data-intensive science using scalable analytics systems

Page 1: Rethinking data intensive science using scalable analytics systems

Rethinking data-intensive science using scalable analytics systems

Frank Austin Nothaft, Matt Massie, Timothy Danford, Zhao Zhang, Uri Laserson, Carl Yeksigian, Jey Kottalam, Arun Ahuja, Jeff Hammerbacher, Michael Linderman, Michael J. Franklin, Anthony D. Joseph, David A. Patterson. AMPLab, University of California, Berkeley; Cloudera, San Francisco, CA; Icahn School of Medicine at Mount Sinai, New York, NY; GenomeBridge, Cambridge, MA

Page 2

Abstract

• In this paper, we describe ADAM, an example genomics pipeline that leverages the open-source Apache Spark and Parquet systems to achieve a 28x speedup over current genomics pipelines, while reducing cost by 63%. From building this system, we were able to distill a set of techniques for implementing scientific analyses efficiently using commodity “big data” systems.

Page 3

Background

Source: NIH National Human Genome Research Institute

Page 4

Characteristics of science analysis systems

Page 5

Layering

• Physical Storage coordinates data writes to physical media.

• Data Distribution manages access, replication, and distribution of the files that have been written to storage media.

• Materialized Data defines how data is encoded and laid out in storage. This layer determines I/O bandwidth and compression.

Page 6

Layering

• Data Schema specifies the representation of data, and forms the narrow waist of the stack that separates access from execution.

• Evidence Access provides primitives for processing data, and enables the transformation of data into different views and traversals.

• Presentation enhances the data schema with convenience methods for performing common tasks and accessing common derived fields from a single element.

Page 7

Layering

• Applications use the evidence access and presentation layers to compose algorithms for performing an analysis.
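The layered decomposition above can be summarized in a minimal sketch. This is not the ADAM API, just an illustration of the stack described on the preceding slides, with the data schema as the narrow waist that decouples the storage layers below from the compute layers above.

```python
# The seven layers from the preceding slides, bottom to top.
# Comments summarize each layer's role as stated in the talk.
STACK = [
    "Physical Storage",    # coordinates data writes to physical media
    "Data Distribution",   # access, replication, distribution of stored files
    "Materialized Data",   # on-disk encoding; determines I/O bandwidth, compression
    "Data Schema",         # narrow waist: separates access from execution
    "Evidence Access",     # processing primitives; views and traversals
    "Presentation",        # convenience methods over the schema
    "Applications",        # compose the lower layers into analyses
]

def narrow_waist(stack):
    """Return the layer that decouples storage (below) from compute (above)."""
    return stack[stack.index("Data Schema")]
```

The design point is that layers can be swapped independently, as long as each side honors the schema in the middle.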

Page 8

Case studies

Page 9

Parquet

• Open-source software created by Twitter and Cloudera, based on Google's Dremel

• Columnar File Format

• Limit I/O to only data that is needed

• Compresses very well: ADAM files are 5-25% smaller than BAM files, with no loss of data

• 3 layers of parallelism: File/row group, Column chunk, Page
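The "limit I/O to only the data that is needed" point can be seen in a toy sketch. This is not the Parquet implementation; it just contrasts row-major records with a column-major layout (one contiguous chunk per field, as in a Parquet column chunk), where scanning one field never touches the others. The read records are hypothetical data for illustration.

```python
# Row-major: each record stores every field together, so reading
# one field means touching every record in its entirety.
reads = [
    {"name": "r1", "sequence": "ACGT", "mapq": 60},
    {"name": "r2", "sequence": "TTAG", "mapq": 37},
]

# Column-major: one contiguous chunk per field, analogous to a
# column chunk inside a Parquet row group.
columns = {field: [r[field] for r in reads] for field in reads[0]}

def scan_mapq(columns):
    # Only the mapq chunk is read; name and sequence are never touched.
    return columns["mapq"]
```

Columnar chunks also compress better than interleaved rows, since values of one type and distribution sit next to each other, which is part of why the ADAM files come out smaller than BAM.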

Page 10

Parquet/Spark integration

• One row group in Parquet maps to one partition in Spark

• We interact with Parquet via input/output formats

• Spark builds and executes the computation as a directed acyclic graph (DAG), and manages data locality and error handling/retries
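The DAG-execution idea can be sketched in a few lines. This is a toy model, not Spark's scheduler: a computation is declared as a graph of stages (the stage names here are hypothetical), stages run in dependency order, and a failed stage is retried a bounded number of times.

```python
# Each stage lists the stages it depends on; this forms the DAG.
# Stage names are illustrative, not a real pipeline definition.
deps = {
    "load": [],
    "recalibrate": ["load"],
    "sort": ["load"],
    "write": ["recalibrate", "sort"],
}

def topo_order(deps):
    """Return the stages in an order that respects all dependencies."""
    order, seen = [], set()
    def visit(node):
        if node in seen:
            return
        for parent in deps[node]:
            visit(parent)
        seen.add(node)
        order.append(node)
    for node in deps:
        visit(node)
    return order

def run(deps, task, max_retries=3):
    """Execute each stage in dependency order, retrying failed stages."""
    results = {}
    for stage in topo_order(deps):
        for _ in range(max_retries):
            try:
                results[stage] = task(stage)
                break
            except RuntimeError:
                continue
        else:
            raise RuntimeError(f"stage {stage} failed after {max_retries} attempts")
    return results
```

In Spark the same bookkeeping additionally tracks which partitions live on which node, so tasks are scheduled next to their data.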

Page 11

Performance

Page 12

Genomics Workloads

• We evaluated ADAM against the GATK [14], SAMtools [32], Picard [51], and Sambamba [50], measuring the performance of base quality score recalibration (BQSR), INDEL realignment (IR), duplicate marking (DM), sort, and Flagstat (FS).
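Of the benchmarked stages, Flagstat is the simplest to picture: a single pass over the reads, tallying categories by inspecting bits of each read's flag field. A minimal sketch (not any of the benchmarked implementations), using flag bit values from the SAM specification; the input flags in the test are hypothetical:

```python
# Flag bits from the SAM specification.
PAIRED, UNMAPPED, DUPLICATE = 0x1, 0x4, 0x400

def flagstat(flags):
    """Tally per-category read counts in one pass over the flag fields."""
    counts = {"total": 0, "paired": 0, "mapped": 0, "duplicates": 0}
    for flag in flags:
        counts["total"] += 1
        counts["paired"] += bool(flag & PAIRED)       # bit 0x1 set
        counts["mapped"] += not (flag & UNMAPPED)     # bit 0x4 clear
        counts["duplicates"] += bool(flag & DUPLICATE)  # bit 0x400 set
    return counts
```

Because each read is scored independently, this kind of pass parallelizes trivially across partitions, which is why Flagstat-style workloads scale so well on Spark.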

Page 13

Genomics Workloads

• This data is shown in Table 2. Although ADAM is more expensive than the best legacy tool (Sambamba [50]) for sorting and duplicate marking, ADAM is less expensive for all other stages. In total, using ADAM reduces the end-to-end analysis cost by 63% over a pipeline constructed solely out of legacy tools.

Page 14

Genomics Workloads

• Table 3 describes the instance types.

Page 15

Genomics Workloads

• We achieve near-linear speedup across 128 nodes
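"Near-linear speedup" here refers to strong scaling: keep the problem size fixed, add nodes, and measure how close the speedup comes to the node count. A minimal sketch of the metric; the numbers in the test are hypothetical, not the measured results:

```python
def speedup(t1, tn):
    """Strong-scaling speedup: single-node runtime over n-node runtime."""
    return t1 / tn

def efficiency(t1, tn, n):
    """Fraction of ideal linear scaling achieved on n nodes (1.0 = perfect)."""
    return speedup(t1, tn) / n
```

An efficiency near 1.0 out to 128 nodes is what lets the pipeline run on many smaller, cheaper machines without wasting the extra hardware.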

Page 16

Conclusion

• By rethinking the architecture of scientific data management systems, we achieve performance parity with legacy tools on single-node systems, while providing linear strong scaling out to 128 nodes. By making it easy to scale scientific analyses across multiple commodity machines, we enable the use of smaller, less expensive computers, leading to a 63% cost improvement and a 28x improvement in read preprocessing pipeline latency.

Page 17

Q&A
