Rethinking data-intensive science using scalable analytics systems


Transcript of Rethinking data-intensive science using scalable analytics systems

Page 1: Rethinking data intensive science using scalable analytics systems

Rethinking data-intensive science using scalable analytics systems

Frank Austin Nothaft, Matt Massie, Timothy Danford, Zhao Zhang, Uri Laserson, Carl Yeksigian, Jey Kottalam, Arun Ahuja, Jeff Hammerbacher, Michael Linderman, Michael J. Franklin, Anthony D. Joseph, David A. Patterson. AMPLab, University of California, Berkeley; Cloudera, San Francisco, CA; Icahn School of Medicine at Mount Sinai, New York, NY; GenomeBridge, Cambridge, MA

Page 2

Abstract

• In this paper, we describe ADAM, an example genomics pipeline that leverages the open-source Apache Spark and Parquet systems to achieve a 28x speedup over current genomics pipelines, while reducing cost by 63%. From building this system, we were able to distill a set of techniques for implementing scientific analyses efficiently using commodity “big data” systems.

Page 3

Background

Source: NIH National Human Genome Research Institute

Page 4

Characteristics of science analysis systems

Page 5

Layering

• Physical Storage coordinates data writes to physical media.

• Data Distribution manages access, replication, and distribution of the files that have been written to storage media.

• Materialized Data defines how data is encoded and laid out in storage. This layer determines I/O bandwidth and compression.

Page 6

Layering

• Data Schema specifies the representation of data, and forms the narrow waist of the stack that separates access from execution.

• Evidence Access provides primitives for processing data, and enables the transformation of data into different views and traversals.

• Presentation enhances the data schema with convenience methods for performing common tasks and accessing common derived fields from a single element.

Page 7

Layering

• Applications use the evidence access and presentation layers to compose algorithms for performing an analysis.
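The layered decomposition above can be summarized in a minimal sketch. This is not the ADAM API, just an illustration of the stack described on the preceding slides, with the data schema as the narrow waist that decouples the storage layers below from the compute layers above.

```python
# The seven layers from the preceding slides, bottom to top.
# Comments summarize each layer's role as stated in the talk.
STACK = [
    "Physical Storage",    # coordinates data writes to physical media
    "Data Distribution",   # access, replication, distribution of stored files
    "Materialized Data",   # on-disk encoding; determines I/O bandwidth, compression
    "Data Schema",         # narrow waist: separates access from execution
    "Evidence Access",     # processing primitives; views and traversals
    "Presentation",        # convenience methods over the schema
    "Applications",        # compose the lower layers into analyses
]

def narrow_waist(stack):
    """Return the layer that decouples storage (below) from compute (above)."""
    return stack[stack.index("Data Schema")]
```

The design point is that layers can be swapped independently, as long as each side honors the schema in the middle.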

Page 8

Case studies

Page 9

Parquet

• Open-source software created by Twitter and Cloudera, based on Google's Dremel

• Columnar File Format

• Limit I/O to only data that is needed

• Compresses very well: ADAM files are 5-25% smaller than BAM files, with no loss of data

• 3 layers of parallelism: File/row group, Column chunk, Page
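The "limit I/O to only the data that is needed" point can be seen in a toy sketch. This is not the Parquet implementation; it just contrasts row-major records with a column-major layout (one contiguous chunk per field, as in a Parquet column chunk), where scanning one field never touches the others. The read records are hypothetical data for illustration.

```python
# Row-major: each record stores every field together, so reading
# one field means touching every record in its entirety.
reads = [
    {"name": "r1", "sequence": "ACGT", "mapq": 60},
    {"name": "r2", "sequence": "TTAG", "mapq": 37},
]

# Column-major: one contiguous chunk per field, analogous to a
# column chunk inside a Parquet row group.
columns = {field: [r[field] for r in reads] for field in reads[0]}

def scan_mapq(columns):
    # Only the mapq chunk is read; name and sequence are never touched.
    return columns["mapq"]
```

Columnar chunks also compress better than interleaved rows, since values of one type and distribution sit next to each other, which is part of why the ADAM files come out smaller than BAM.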

Page 10

Parquet/Spark integration

• One row group in Parquet maps to one partition in Spark

• We interact with Parquet via input/output formats

• Spark builds and executes the computation as a directed acyclic graph (DAG), and manages data locality and error handling/retries
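The DAG-execution idea can be sketched in a few lines. This is a toy model, not Spark's scheduler: a computation is declared as a graph of stages (the stage names here are hypothetical), stages run in dependency order, and a failed stage is retried a bounded number of times.

```python
# Each stage lists the stages it depends on; this forms the DAG.
# Stage names are illustrative, not a real pipeline definition.
deps = {
    "load": [],
    "recalibrate": ["load"],
    "sort": ["load"],
    "write": ["recalibrate", "sort"],
}

def topo_order(deps):
    """Return the stages in an order that respects all dependencies."""
    order, seen = [], set()
    def visit(node):
        if node in seen:
            return
        for parent in deps[node]:
            visit(parent)
        seen.add(node)
        order.append(node)
    for node in deps:
        visit(node)
    return order

def run(deps, task, max_retries=3):
    """Execute each stage in dependency order, retrying failed stages."""
    results = {}
    for stage in topo_order(deps):
        for _ in range(max_retries):
            try:
                results[stage] = task(stage)
                break
            except RuntimeError:
                continue
        else:
            raise RuntimeError(f"stage {stage} failed after {max_retries} attempts")
    return results
```

In Spark the same bookkeeping additionally tracks which partitions live on which node, so tasks are scheduled next to their data.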

Page 11

Performance

Page 12

Genomics Workloads

• We evaluated ADAM against the GATK [14], SAMtools [32], Picard [51], and Sambamba [50], measuring the performance of base quality score recalibration (BQSR), INDEL realignment (IR), duplicate marking (DM), sort, and Flagstat (FS).
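Of the benchmarked stages, Flagstat is the simplest to picture: a single pass over the reads, tallying categories by inspecting bits of each read's flag field. A minimal sketch (not any of the benchmarked implementations), using flag bit values from the SAM specification; the input flags in the test are hypothetical:

```python
# Flag bits from the SAM specification.
PAIRED, UNMAPPED, DUPLICATE = 0x1, 0x4, 0x400

def flagstat(flags):
    """Tally per-category read counts in one pass over the flag fields."""
    counts = {"total": 0, "paired": 0, "mapped": 0, "duplicates": 0}
    for flag in flags:
        counts["total"] += 1
        counts["paired"] += bool(flag & PAIRED)       # bit 0x1 set
        counts["mapped"] += not (flag & UNMAPPED)     # bit 0x4 clear
        counts["duplicates"] += bool(flag & DUPLICATE)  # bit 0x400 set
    return counts
```

Because each read is scored independently, this kind of pass parallelizes trivially across partitions, which is why Flagstat-style workloads scale so well on Spark.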

Page 13

Genomics Workloads

• This data is shown in Table 2. Although ADAM is more expensive than the best legacy tool (Sambamba [50]) for sorting and duplicate marking, ADAM is less expensive for all other stages. In total, using ADAM reduces the end-to-end analysis cost by 63% over a pipeline constructed solely out of legacy tools.

Page 14

Genomics Workloads

• Table 3 describes the instance types.

Page 15

Genomics Workloads

• We achieve near-linear speedup across 128 nodes
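"Near-linear speedup" here refers to strong scaling: keep the problem size fixed, add nodes, and measure how close the speedup comes to the node count. A minimal sketch of the metric; the numbers in the test are hypothetical, not the measured results:

```python
def speedup(t1, tn):
    """Strong-scaling speedup: single-node runtime over n-node runtime."""
    return t1 / tn

def efficiency(t1, tn, n):
    """Fraction of ideal linear scaling achieved on n nodes (1.0 = perfect)."""
    return speedup(t1, tn) / n
```

An efficiency near 1.0 out to 128 nodes is what lets the pipeline run on many smaller, cheaper machines without wasting the extra hardware.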

Page 16

Conclusion

• By rethinking the architecture of scientific data management systems, we achieve performance parity with legacy tools on single-node systems, while providing linear strong scaling out to 128 nodes. By making it easy to scale scientific analyses across multiple commodity machines, we enable the use of smaller, less expensive computers, leading to a 63% cost improvement and a 28x improvement in read preprocessing pipeline latency.

Page 17

Q&A
