
Mango: Interactive Exploration on Large Genomic Datasets

Alyssa Morrow
University of California-Berkeley
[email protected]

Eric Tu
University of California-Berkeley
[email protected]

Anthony Joseph
University of California-Berkeley
[email protected]

David Patterson
University of California-Berkeley
[email protected]

ABSTRACT

Current genomics visualization tools are intended for a single node environment and lack the scalability required to visualize multiple whole genome samples. Data from the 1000 Genomes Project [8] provides 1.6 terabytes of variant data and over 14 terabytes of alignment data. However, typical genomic visualizations materialize less than 40 kbp (kilobase pairs, 40,000 bases), only 3.3e−7% of the genome. Mango is a genome browser that selectively materializes and organizes genomic data to provide fast in-memory queries. Mango materializes data from persistent storage as the user requests different regions of the genome. This data is efficiently partitioned and organized in memory using interval trees. This interval-based organizational structure supports ad hoc queries, filters, and joins across multiple samples at a time, enabling exploratory interaction with genomic data.

Mango is built on top of Spark [6] and ADAM [15]. Leveraging Spark as Mango's cluster computing framework enables scalable, distributed computations on terabytes of genomic data. Mango leverages ADAM's genomic file formats, which can be stored in persistent storage and accessed by Spark. Both ADAM and Mango are part of the Big Data Genomics project at the University of California-Berkeley [20]. Spark, ADAM, and Mango are all published under the Apache 2 license.

Keywords

Genomics, Genome Browser, Scalable Visualization, Scientific Computing, Spark

1. INTRODUCTION

The rise of next generation sequencing in the past decade has led to a dramatic increase in the availability of genomic data [14]. The 1000 Genomes Project, completed in 2013, sequenced 2,504 individuals across 26 populations, generating over 15 TB of genomic data [8].

Current projects such as the 100,000 Genomes Project in England are in the process of producing over 20 PB of raw genomic data to be consumed by the scientific community [9]. These initiatives include both alignment data, or aligned sequences from a DNA sequencer, and variant data, or points of differentiation in a population calculated from a process called variant calling. This has left the scientific community with a massive amount of formatted genomic data to interpret.

This increasing availability of genomic data has spawned a myriad of genomic analysis pipelines intended to preprocess raw sequences, align reads (arranging short snippets of DNA), and identify variants in an efficient manner. GATK [23] is the first end-to-end standardized preprocessing pipeline that fully processes sequence data. Open source distributed genomics tools, such as ADAM and OpenCB, enable the use of cloud computing clusters for genome processing [15, 16, 17]. This reduces the cost and complexity of batch processing for genome alignment, preprocessing, and variant calling relative to a high performance computing cluster [16]. Specifically, the ADAM project from UC Berkeley developed a distributed end-to-end genomic preprocessing pipeline that has led to a 20x speedup over standard preprocessing techniques.

Although the cost of sequencing and preprocessing genomic data has significantly decreased, there has been little progress on platforms for ad hoc interactive analysis on top of this data. Many genome visualization tools do not scale past a single node and can only visualize genomic files smaller than 60 GB. Furthermore, these tools generally exhibit latencies of up to 2 s, well above the recommended 500 ms threshold for interactive response times [10].

In this paper we present Mango, a distributed genome visualization tool intended to view and summarize terabytes of genomic data. Mango is designed with the assumption that users of a genome browser are viewing datasets too large to fit in memory. Therefore, Mango implements an ad hoc data tiling approach, interactively prefetching and formatting small sections of genomic data into multiple visualization layers before the user requests them. This design model bypasses expensive preprocessing of all data while maintaining interactive latencies.

Mango is built on top of Spark, a distributed computing platform, and ADAM, a distributed scientific analysis tool intended for large genomic workloads. Intended contributions of Mango to distributed visualization techniques include the following:

• Ad hoc data tiling to bypass static preprocessing of large datasets

• Simple prefetching mechanisms based on scientificworkload trends

• Achievement of interactive latencies on large datasets

2. BACKGROUND

Two major trends that have spurred interest in distributed genomic analysis have been 1) the increasing availability of genomic data and 2) the increasing availability of open-source platforms enabling scientific computing in a distributed environment. Nothaft et al. discuss rethinking scientific workloads in a distributed environment based on the assumption that we can no longer fit genomic workloads into memory [15].

Existing tools that comply with standardized genome processing protocols include projects such as GATK, OpenCB, and ADAM. These tools provide an API for functionality such as genome alignment and variant calling. Both ADAM and the GATK leverage distributed computing platforms in their analysis pipelines and file formats. Apache Spark is utilized by both of these pipelines, and provides fault tolerant, in-memory computing [6].

2.1 ADAM

ADAM is a distributed genomics analysis pipeline which provides a well supported platform upon which to build genomic visualization [16]. ADAM uses Apache Avro [18] to define data schemas for genomic workloads, which are stored in persistent storage as Apache Parquet [13]. These Parquet files can then be loaded into ADAM, which leverages Spark's distributed execution and fault tolerance to perform preprocessing and manipulations over genomic data.

As shown in figure 1, ADAM uses a narrow waist of schemas to segregate an application's use of genomic formats from underlying data distribution mechanisms. ADAM supports both legacy genomic formats and Parquet formats, and provides a toolkit to convert between the two. Supported legacy formats used by Mango include VCF files for storing variant data, BAM files for alignment data, and BED files for annotation data. While Parquet formats achieve a 63 percent reduction in end to end processing cost (utilizing Amazon EC2 compute) compared to legacy formats, support for legacy formats is important to bridge genomics to big data systems. Parquet also allows for significant compression of data, offering about 1.25x compression on gzipped BAM files and 1.66x compression on gzipped VCF files [15].

These files are loaded via ADAM APIs into Spark resilient distributed datasets (RDDs), upon which genomic algorithms can be run. Specifically, these algorithms correspond to the preprocessing steps in the GATK pipeline.

Figure 1: ADAM Stack

2.2 Genome Visualization

Genome visualization tools enable users to gain insight into genomic data through a visual interface, as an alternative to bioinformatics command-line and script output. Genomic data takes many different forms, such as alignment data (short snippets of DNA output from sequencer machines), variant data (positional polymorphisms from a reference), and annotation data (external genomic features). All these data types are referenced by their location on the genome, and individual records are accessed by the chromosome they are located on, as well as their start and end base position.

Genomic visualizations can provide a powerful means of discovering underlying trends in genomic data. During variant analysis, a user wants to explore the correlation between variation in a person's genetic makeup and that individual's phenotype, variant frequencies across a population, and point wise comparisons across samples.

Similarly, queries on alignment data are used to compare multiple whole genome samples at specific loci for point-wise mutations. These queries are commonly used while visualizing oncogenomics data [11]. Visualization of alignment data is also used to validate and compare variant calling pipelines to confirm or contradict discovered variants.


Given this description of workloads for variant and alignment data, we constrain Mango to handle the following workloads:

Variant Data:

1. Observe the frequency of variants at all points in the genome across a given population

2. Zoom in and compare individual genotypes against variant frequencies and other individual genotypes

Alignment Data:

1. Compare point mutations across samples

2. Estimate mutation probabilities computed from raw alignment data
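As a concrete illustration of variant workload 1 above, the sketch below computes per-position variant frequency across a small population. This is a single-node Python illustration, not Mango's Scala implementation; the sample names and genotype layout are hypothetical.

```python
# Illustrative sketch: per-position variant frequency across a population.
# genotypes maps sample -> list of (position, has_variant) calls.
from collections import defaultdict

def variant_frequencies(genotypes):
    """Return dict of position -> fraction of samples carrying a variant."""
    counts = defaultdict(int)
    samples = len(genotypes)
    for calls in genotypes.values():
        for position, has_variant in calls:
            if has_variant:
                counts[position] += 1
    return {pos: n / samples for pos, n in counts.items()}

# Hypothetical genotype calls for three 1000 Genomes sample names
population = {
    "NA12878": [(100, True), (250, False)],
    "NA12891": [(100, True), (250, True)],
    "NA12892": [(100, False), (250, True)],
}
print(variant_frequencies(population))  # positions 100 and 250 each at 2/3
```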

3. RELATED WORK

3.1 Genome Browsers

Two popular genome browsers are the UCSC Genome Browser [3] and the Integrated Genomics Viewer (IGV) [1]. We begin by briefly summarizing features of these browsers in terms of their functionality, intuitiveness, and performance.

3.1.1 UCSC Genome Browser

The UCSC Genome Browser [3] is available as a web service, and offers access to a database of sequence data and annotations for over 40 species. Users can upload their own files and view them through the web service.

Functionality: The UCSC Genome Browser contains a large number of tracks that can display many annotations from different sources on the same screen at once. These tracks are preformatted from a preselected list of data. Clicking on objects takes users to a web page containing detailed information about the region of interest.

Intuitiveness: The UCSC Genome Browser utilizes a "track view" of genomic data, where all genomic data is aligned along a one-dimensional genomic axis. Information such as the reference genome, sequenced reads, variants, and genomic features is displayed as horizontal tracks, or rows, aligned along a genomic coordinate axis. Users navigate through their data by moving along the genomic axis. Tracks are selected from a predefined menu, which toggles these tracks on and off.

Performance: The UCSC Genome Browser is implemented as a web service. The browser returns static views with second-level latency. Computation and bandwidth are limited to servicing data from a MySQL database.

3.1.2 IGV

IGV is a desktop application that allows visualization of alignment, variant, and annotation data [1].

Functionality: IGV allows users to load data from legacy genomics file formats. The tool displays all record attributes related to the data being displayed on a single page. This can be contrasted to the UCSC Genome Browser, where the fields of the data object are obtained by clicking and viewing separate web pages. These different genomics formats can be viewed side-by-side and navigated concurrently.

Intuitiveness: IGV, like the UCSC Genome Browser, displays data in tracks. However, unlike the UCSC Genome Browser, each interaction does not trigger a refresh to a different web page. Instead, IGV allows interactive panning of the data. Interactions typically involve panning over small regions (< 1000 bp), and zooming in and out at higher resolutions. Different information is displayed at different resolutions.

Performance: IGV implements a multiresolution format to scale to the size of the data being displayed. This multiresolution format allows quick zooming and panning of data, which can contain multiple samples. However, IGV stops displaying reads after 1 kb (kilobase, 1000 bases), and stops displaying any information after 40 kb, which is only 3.3e−7% of the genome.

Both IGV and the UCSC Genome Browser require data to be viewed at very high resolution, making it difficult to explore coarse trends in the data. Technical limitations force these tools to compromise features for latency. When loading large regions, IGV takes several seconds to load information, and hangs when loading multiple files at once.

3.2 Data Format and Organization

There have been many scientific visualization tools that implement specialized data tiling techniques to visualize multiple zoom levels across large datasets. ForeCache is a general-purpose visualization tool intended to explore large scientific datasets stored in SciDB [2]. ForeCache supports precomputation across four different zoom levels for numerical data. However, ForeCache requires precomputation of all data tiles in the dataset before a user can begin to use the visualization tool. This incurs massive unnecessary start-up overhead, especially when the user is going to view only a small section of the data. Another constraint of ForeCache is its tiling support for only numerical data. Genomic workloads focus heavily on sequence, or text, workloads. Operations such as detecting positional differences via sequence convolution are not achievable with a numerical approach.

IGV precomputes and exploits data tiling for genome specific workloads [1]. IGV offers multiple levels of data tiles, including full genome, individual chromosomes, and fine grained alignment records. IGV requires tiles to be precomputed by the user with an external command line tool. In the absence of precomputed data tiles, IGV resorts to runtime computation of data tiles. One constraint of IGV is that it only supports data tiling and visualization on a single node. Additionally, data tiling is only supported for numerical values, such as frequency of base coverage. As a user zooms out past 40,000 base pair resolution, sequence data is not visible.

3.3 Prediction

Many visualization tools have offered prediction mechanisms to prefetch regions of data based on user history, aggregated user trends, and data formats. Hotspot models recommend regions that are viewed frequently by other users [4]. Momentum models prefetch data with the assumption that a user's next move will be the same as their last [5]. ForeCache implements a two layer recommendation and prefetching model to account for both user and data trends. User trends are represented as a Markov model, where probability traversal models a series of user actions while exploring their dataset. ForeCache also uses linear l2 distance between data tiles to recommend new regions of a dataset to users.

4. DESIGN

Requirements for a distributed visualization tool include interactive response times and multiple view resolutions based on the user's zoom level on a given chromosome. To achieve interactive response times, we implemented a series of optimizations at the data servicing level for accessing highly selective subregions of data with low latency. However, these optimizations are insufficient to achieve interactive latency if a user queries a region that does not exist in the cache. Therefore, we implement a simple prefetching mechanism to preformat surrounding regions asynchronously during periods of idle use. To support multiple resolutions based on user zoom level, Mango materializes six view resolutions and returns the view closest to the size of the region queried. Multiple resolutions allow the user to view all regions of the genome with relevant summary statistics and low latency.

Mango builds on top of ADAM and its genomic file formats for feature, variant, and alignment data. Although the Spark platform was not initially intended for handling low latency visualization workloads, maintaining the ability to integrate visualization into an existing ecosystem allows us to utilize the same genomic algorithms provided by ADAM in a unified environment. From a usability standpoint, building on top of Spark facilitates easier development of genomic visualization because we are able to combine both pipelined workloads and visualization workloads with the same language and frameworks. Leveraging Parquet files allows Mango to seamlessly plug into the Hadoop ecosystem [21], which regularly supports Parquet as the default columnar storage format. This format can be transferred to Hadoop based pipelines other than ADAM.

4.1 Mango Stack

Mango utilizes a stack-oriented model, shown in figure 2. The elements of the stack are put into three functional groups: cluster, server/master node, and endhost/client. Mango supports modularity in its visualization and optimizations, allowing different levels of the stack to be replaced or used in existing genomic pipelines.

Figure 2: Mango stack

4.1.1 Cluster

The cluster layer includes the ADAM core and Parquet, combined with data access optimizations. Our primary contribution at the cluster layer is the implementation of the Interval RDD, a version of Spark's resilient distributed dataset (RDD) optimized for two-dimensional ranged queries, and a caching layer called Lazy Materialization.

Interval RDD.

We implement a variant of Spark's RDD called an Interval RDD that organizes records into an interval tree, optimizing overlapping range queries. Performing this type of query over Spark's primary abstraction for data, an RDD, currently requires a full scan of the data. Consider the query which gets all 2-dimensional segments overlapping or contained within the start and end value of an interval:

RDD.filter(record.start < interval.end AND record.end > interval.start)

Even if records are sorted by start value, worst case performance requires all data to be scanned. Meanwhile, interval trees provide O(log(n)) search for intervals within a given range. Interval trees insert and delete nodes in the same manner as a binary search tree. However, every node in an interval tree additionally stores the maximum end value found in its subtree. Therefore, during search, a given subtree is only traversed if the range minimum is less than the maximum of the subtree.
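The augmented-tree idea above can be sketched as follows. This is a minimal, unbalanced single-node illustration in Python (the Interval RDD itself is a Spark/Scala structure): a binary search tree keyed on interval start, where each node carries the maximum end value of its subtree so that non-overlapping subtrees are pruned during search.

```python
# Minimal interval tree sketch: BST keyed on start, augmented with the
# maximum end value per subtree, as described in the text.

class Node:
    def __init__(self, start, end, value):
        self.start, self.end, self.value = start, end, value
        self.max_end = end            # maximum end value in this subtree
        self.left = self.right = None

def insert(node, start, end, value):
    if node is None:
        return Node(start, end, value)
    if start < node.start:
        node.left = insert(node.left, start, end, value)
    else:
        node.right = insert(node.right, start, end, value)
    node.max_end = max(node.max_end, end)
    return node

def search(node, start, end, hits):
    """Collect values of all intervals overlapping [start, end)."""
    if node is None or node.max_end <= start:
        return hits                   # nothing in this subtree can overlap
    search(node.left, start, end, hits)
    if node.start < end and node.end > start:
        hits.append(node.value)
    if node.start < end:              # right subtree starts at node.start or later
        search(node.right, start, end, hits)
    return hits

root = None
for s, e, name in [(0, 100, "readA"), (50, 150, "readB"), (300, 400, "readC")]:
    root = insert(root, s, e, name)
print(search(root, 90, 120, []))      # overlaps: ['readA', 'readB']
```

A production version would use a balanced tree to guarantee the O(log(n)) bound.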

Genomic range lookups are a key use case for Interval RDDs. Records are keyed by intervals defined over a 2-dimensional axis. For genomic data, the axis is the location on a chromosome, and the key is the tuple of the start and end nucleotide base positions on that chromosome.

Lazy Materialization.

To efficiently cache and monitor data being queried by the client, we implement a cache manager called Lazy Materialization which enables incremental data fetching and formatting into a working set. This working set, which can be an Interval RDD or any Spark primitive, is lazily populated with data that is stored in memory.

Lazy Materialization fetches data from storage per query, then uses materialized views to bring into memory the immediate region around the queried interval. The layer keeps track of past queries and keeps in memory the data associated with them. While initial queries incur the overhead of loading from storage, future queries on the same interval return immediately from the in-memory cache.

The motivating factor behind this design is that visualization applications tend to access data in localized areas. Because of the coordinate nature of genomic data, users might be interested in a particular gene, and will look in the immediate area around that particular gene rather than jump to random locations in the genome.

Lazy Materialization dynamically fetches data depending on what a user requests and does not require the entire dataset to be loaded immediately. As opposed to running batch operations or transformations over an entire dataset, users are not sure what data they may access over a session in a visualization application. By constructing a caching layer that scales to user requests with a dynamically constructed working set, we use only the memory and compute resources needed for our workload.

Working Set Primitives.

As mentioned above, the Lazy Materialization structure is backed by either an Interval RDD or a Spark primitive. In this project, we use different structures for variant and alignment data.

Alignment data is loaded using Interval RDDs. Common operations over alignment data involve querying a region for all overlapping two-dimensional segments, and performing computation over the data. The Interval RDD addresses both considerations by storing data in an optimized structure while providing the same rich procedural processing semantics as RDDs.

Variant data is loaded using DataFrames, introduced as part of the SparkSQL package [7]. DataFrames provide the semantics of relational processing within the Spark ecosystem, while providing a query optimizer to speed up simple data access patterns. Because variant data is the summary and result of computed alignment data, access patterns are simpler than for alignment data and fall in line with declarative queries.

Variants are typically single point mutations, not intervals. In addition, visualization involves looking at regions of the genome that have a high frequency of variants. DataFrames are a good fit for these two properties: DataFrames provide optimized filter and count operations that can quickly query a region of variant data, or return a count of how many variants are in that region.

Table 1: Data layers and corresponding zoom levels

Layer | Resolution Range (bp)
0     | 0 to 1000
1     | 0 to 5000
2     | 5000 to 10,000
3     | 10,000 to 100,000
4     | 100,000 to 1,000,000
5     | 1,000,000+
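The mapping from a queried region's width to a view layer can be sketched from table 1. This is our reading of "returns the view closest to the size of the region queried"; the selection rule below is an assumption, not Mango's exact logic.

```python
# Sketch: map a query width in bp to one of the six view layers in table 1.

def choose_layer(width_bp):
    # Upper bounds of the resolution ranges in table 1; layer 5 is unbounded.
    bounds = [1000, 5000, 10000, 100000, 1000000]
    for layer, bound in enumerate(bounds):
        if width_bp <= bound:
            return layer
    return 5

print(choose_layer(800))       # layer 0: raw data
print(choose_layer(60000))     # layer 3: convolved summary
print(choose_layer(2000000))   # layer 5
```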

In addition, DataFrames allowed us to use flattened Avro schemas not supported in ADAM. For the purposes of predicate pushdown, applying a predicate on nested schemas results in significantly worse performance. Because DataFrames understand the schema of Parquet data on disk, we could use flattened data to improve query times.
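The two declarative access patterns described above, a region filter and a variant count, can be sketched over a plain in-memory table standing in for a SparkSQL DataFrame. The column names of the flattened variant schema here are illustrative, not ADAM's actual schema.

```python
# Sketch of the DataFrame-style region filter and count over variant data.
# Rows stand in for records of a flattened variant schema (columns illustrative).

variants = [
    {"contig": "chr20", "position": 61098, "sample": "NA12878", "alt": "T"},
    {"contig": "chr20", "position": 61270, "sample": "NA12878", "alt": "A"},
    {"contig": "chr20", "position": 62045, "sample": "NA12891", "alt": "C"},
]

def filter_region(table, contig, start, end):
    # analogous to df.filter((contig == c) & (start <= position < end))
    return [row for row in table
            if row["contig"] == contig and start <= row["position"] < end]

def count_region(table, contig, start, end):
    # analogous to df.filter(...).count(), the summary used when zoomed out
    return len(filter_region(table, contig, start, end))

print(count_region(variants, "chr20", 61000, 62000))  # 2 variants in region
```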

4.1.2 Server/Master Node

The main functions of the data servicing layer are servicing client requests, formatting data into multiple visualization layers to be consumed by a visualization front end, and prefetching data based on user queries.

Data Layers.

Both alignment and variant data are preformatted into six layers of resolution based on the user's zoom level. The layers and the bp range they support are shown in table 1.

The six layers shown in table 1 are divided into three categories: raw data, mismatch summary, and convolved data. Raw data includes visualization of raw alignment data, and is shown for resolutions from 0 to 5000 bp. For alignment data, this includes raw sequence data and per sequence mismatches, insertions, and deletions at every point, as shown in figure 3. For variant data, layer 0 includes raw point wise variants at per sample granularity.

The second resolution available for alignment data is summary mismatches, shown in figure 4. At this resolution, raw sequence data is replaced with the frequency of mismatches, insertions, and deletions at all points in the genome.

The last four layers, shown in figure 5, use image convolution to create summary visualizations at zoomed out regions of the genome while avoiding per base computation. Each of the remaining layers covers an order of magnitude of resolution. The resolution ranges for the four remaining convolution layers can be found in table 1.

Figure 3: Alignment Data View: High Resolution

Figure 4: Alignment Data View: Medium Resolution

Figure 5: Alignment Data View: Low Resolution

For alignment data, computation of the last four layers involves convolving the sequence data with a chosen stride and patch size determined by the zoom level. An example of sequence convolution is shown in figure 6. The resulting sequences are aggregated and compared to a convolved reference to calculate differential estimations in genome comparisons. Variant data is similarly convolved, except that positions where variants exist are set to 1. Areas are convolved to return a numerical value representing the presence of variants over a given range.
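The variant-data case above can be sketched as a simple box convolution over a binary presence track. The patch and stride values here are illustrative, not Mango's actual zoom-level parameters.

```python
# Sketch of the convolution step for variant data: positions carrying a
# variant are set to 1, then a box filter with a zoom-dependent patch size
# and stride reduces the track to one summary value per output bin.

def convolve_presence(presence, patch, stride):
    """presence: list of 0/1 flags per base position.
    Returns one summed value per patch, advancing by `stride` bases."""
    out = []
    for start in range(0, len(presence) - patch + 1, stride):
        out.append(sum(presence[start:start + patch]))
    return out

track = [0] * 20
for pos in (3, 4, 11):          # variants at these base positions
    track[pos] = 1

print(convolve_presence(track, patch=5, stride=5))  # [2, 0, 1, 0]
```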

4.1.3 Server/Master Node

The server/master node group services data to the front end via a web service. To service data, we use Scalatra [24], which provides a lightweight web application framework for Scala. We chose Scalatra because of its integration with both the Scala programming language and front end technologies. Using a Scala-based framework allows Scala, Spark, and web service code to be written within the same class.

Figure 6: Example of sequence convolution

4.1.4 Endhost/Client

The endhost/client group includes front end technologies used to generate visualizations. We use D3.js [22], which provides custom visualizations within a front end environment. D3.js supports HTTP requests for JSON data, which provides a simple abstraction for requesting data. We currently use HTTP requests as the interface between the server/master node and the endhost/client.

4.2 Prefetching

We implement a basic prefetching algorithm to asynchronously fetch and materialize data during idle periods of user interaction. The front end periodically issues HTTP requests to prefetch data after outstanding user requests have been completed. This prefetching scheme materializes regions of data to the left and right of the currently viewed region, or of previously viewed regions. Parameters for prefetching include 1000 bp to the left and right of the current window for high resolution raw data views, and 100,000 bp to the left and right for summary statistics.

5. EVALUATION

Local experiments to evaluate parity against single node genome browsers were run on a 2015 MacBook Pro with a 3.1 GHz Intel Core i7 processor and 16 GB of 1867 MHz DDR3 memory. Evaluation of Mango in a distributed environment was run on a 64 node cluster, each node with an Intel E5-2670 2.6 GHz 8 core CPU, 256 GB of RAM, and four 1 TB HDFS hard drives.

5.1 Workload Description

Evaluation criteria described in this section assess query patterns for both variant and alignment workloads. Chosen query patterns for genomic data were modeled after scientific workload trends discovered by [2]. These assumptions model scientific query patterns as an iteration of zooming in and panning across a given region of interest. The assumptions made in scientific workload exploration include the following:

• When transitioning between zoom levels, users traverse all levels and do not skip levels.

• Users initially view zoomed out regions of their data and hone in on smaller regions during exploration.

• Users exhibit zooming and panning in a non random pattern. This implies that previously viewed regions inform the user of regions to view next.

Following these assumptions, we have constructed separate workloads for both variant and alignment data. Figure 7 demonstrates the chosen query patterns for both datasets.

Figure 7: Query zoom levels across variant and alignment data workloads

For both workloads, queries 7 to 9 indicate panning out of the current view region at high (100 to 3000 bp) resolution.

The variant workload, shown in red, is initialized at a 50,000,000 bp region on one chromosome. The query zooms in to a region of 1000 bp, then zooms out to 3000 bp. The variant workload is tested at lower resolution than the alignment workload because variant data summarizes alignment data, and therefore displays more information in a smaller number of pixels.

The alignment data workload, shown in black in figure 7, has a higher resolution query pattern than the variant workload. The alignment workload starts at 1,000,000 bp as the coarsest level of granularity and zooms in to 100 bp, switching between resolutions of 100 and 1000 bp for queries 6 to 9. Because alignment data is used to analyze pointwise mutations, the resolution for alignment data is often much higher.
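The zoom-then-pan pattern underlying both workloads can be sketched as a simple generator; the specific levels and pan counts below are illustrative, not the exact benchmark sequence:

```python
# Illustrative generator for a zoom-then-pan query workload following the
# assumptions above: users start coarse, traverse every level without
# skipping, and pan repeatedly at the finest resolution.

def zoom_pan_workload(levels, pans):
    """Yield (kind, width_bp) queries: one zoom per level, coarse to fine,
    followed by `pans` panning queries at the finest level."""
    queries = [("zoom", w) for w in levels]   # traverse all levels in order
    queries += [("pan", levels[-1])] * pans   # high resolution panning
    return queries

# Alignment-style workload: 1,000,000 bp down to 100 bp, then three pans.
workload = zoom_pan_workload([1_000_000, 100_000, 10_000, 1_000, 100], pans=3)
print(workload[0], workload[-1])
```

A variant-style workload would simply use coarser levels (e.g. stopping at 1000 bp) with the same structure.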

5.2 Local Parity

In this section we demonstrate a local comparison to IGV, the state of the art genome browser run on a single node. Like Mango, IGV is intended to visualize personal variant and alignment data. For variant workloads, we demonstrate latency on an 18 GB variant file from 2504 samples of chromosome 20 from the 1000 Genomes Project [8]. For alignment workloads, we demonstrate latency on a 318 MB single sample consisting of alignment data from chromosome 20. For both workloads, chromosome 20 was chosen due to the availability of medium coverage alignment data from the 1000 Genomes Project. Chromosome 20 is a smaller chromosome, consisting of only 62,435,965 bases.

5.2.1 Alignment Data

Figure 8 shows latency on a single node for an alignment workload against IGV, Mango, and Mango without prefetching. As shown in figure 8, IGV shows a spike in latency of 9411 ms at the 10K query. This is triggered by the first resolution at which IGV materializes alignment data. Otherwise, IGV holds constant latency around 2000 ms. Mango without prefetch shows a spike in latency for the 1K Pan query, as this query triggers a cache miss and requires retrieving raw data from persistent storage. With prefetching enabled, there still exists a spike in query time at the 1K Pan query. This may be due to overlap between prefetch and user queries, as the preformatting of raw alignment data was not yet complete.

For Mango both with and without prefetching enabled, there is a dip in latency at the 10K query which rises back after the first 1K query. This is due to low latency in convolution calculations at the 10K query. Higher resolution data at 1K triggers raw computation of mismatches.

Figure 8: Local response to alignment record workload


5.2.2 Variant Data

Figure 9 shows local latency results on the variant workload. Average latency is slightly higher compared to a method without prefetching, because the working set of data is larger with prefetching. As a result, fetching the same region over the larger working set results in the Spark job having to check more blocks of memory for data satisfying the specified predicate. This adverse behavior relates to the unionAll operation in DataFrames, which increases the number of tasks when querying on the unioned dataset. RDDs do not display this behavior, but using RDDs would require the use of ADAM's nested schemas, which would significantly increase the cost of predicate pushdown. This higher prefetch latency is more noticeable locally because a local machine has fewer executors than a cluster to distribute the large number of Spark tasks.

Figure 9: Local Variant Workload

5.3 Horizontal Scalability

Mango was tested in a distributed environment to compare latency across computational resources, both with and without prefetching. Because there is no comparable software that supports distributed genome browsing, we compare Mango across a variable set of computational resources with and without the prefetching mechanism.

Tests on both variant and alignment workloads were run against 8, 16, 32, 64, 128, and 256 executors, each with 12 GB of memory, where we assigned each executor to one CPU core.

5.3.1 Alignment Data

The alignment data workload was run against two 250 GB high coverage full genome alignment files of samples NA12878 and NA12891 from the 1000 Genomes Project. These were the largest files produced by the 1000 Genomes Project. Results query sample NA12878 across chromosome 20.

Shown in figure 10 are latency results from chromosome 20 against the alignment workload. The dip at the 10K query and spike at 1K Pan can be explained similarly to the local evaluation by convolution speedup and a cache miss, respectively. However, cache misses incur a much higher cost in a distributed setting due to fetching remotely from HDFS. Iterative addition of executors for alignment data shows significant benefit when loading from persistent storage, shown in the latency reduction of queries 1M and 1K Pan from 8 to 256 executors.

Figure 10: Alignment query workload with no prefetching

Figure 11 shows the latency results from alignment data with prefetching enabled. With prefetching enabled, the 1K Pan query does not incur real time data fetching from persistent storage, thus maintaining close to interactive thresholds.

Figure 11: Alignment query workload with prefetching


5.3.2 Variant Data

Figures 12 and 13 show latency results for the variant workload without and with prefetching, respectively, over the largest variant file from the 1000 Genomes Project, a 106 GB dataset queried across chromosome 1, the largest chromosome. Similarly to the local variant workload, we see a spike in latency around 9 seconds in the 1K and 1K Pan queries, triggered by cache misses. Unlike the alignment record workload, variant queries maintain an approximately interactive threshold with 16 or more executors. Figure 13 demonstrates latency with prefetching enabled. Latency with prefetching is greater due to the DataFrame issue previously mentioned, but this is less of an issue with more executors to handle the tasks.

Figure 12: Variant query workload with no prefetching

Figure 13: Variant query workload with prefetching

6. FUTURE WORK

6.1 Advanced Prefetching Mechanisms

The prefetching mechanisms implemented in this report were simplistic, fetching regions to the left and right of the currently viewed region to speed up panning interactions. We can improve these mechanisms by completing a user study and building a prefetching model based on past user interactions. From user-derived data, we can determine the size of the region to materialize and the maximum latency allowed for running prefetching jobs. These user interactions can be modeled via common visualization prediction algorithms [5][4]. Coupled with the amount of compute and storage resources given to Mango, we can better determine the size and location of prefetched regions.

The second strategy to improve prefetching is a recommendation model which materializes regions of data similar to regions the user has viewed. [2] calculates the l2 distance between data tiles to recommend regions similar to what the user is currently viewing. With a genome browser, we can define more specific prediction methods because we do not support such a general workload. Such suggestion algorithms could include patterns in insertions, deletions, and the distribution of mismatch density.
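The tile-similarity idea from [2] can be sketched as follows; the feature vectors here (e.g. binned mismatch density) and the region names are hypothetical:

```python
# Sketch of l2-distance tile recommendation, after [2]: suggest the stored
# region whose feature vector is closest to the currently viewed region.
# Feature vectors and region names below are hypothetical examples.
import math

def l2(a, b):
    """Euclidean (l2) distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def recommend(current, candidates):
    """Return the candidate region id with minimum l2 distance to `current`."""
    return min(candidates, key=lambda rid: l2(current, candidates[rid]))

viewed = [0.9, 0.1, 0.4]  # e.g. mismatch density summarized into three bins
tiles = {"chr20:10k": [0.8, 0.2, 0.5], "chr20:50k": [0.1, 0.9, 0.9]}
print(recommend(viewed, tiles))  # chr20:10k
```

For a genome browser, the feature vectors could encode insertion, deletion, and mismatch density per bin, as suggested above.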

6.2 Achieving Interactive Latency

While Mango outperforms existing tools over large regions of data, latencies do not always meet the 500 ms threshold required for interactive applications. Currently, while a region of data may be persisted on a cluster and return quickly when queried, that query still requires a Spark job to be run, leading to latency on the order of hundreds of milliseconds. We can shortcut this redundant request by caching the results of the server at the client. If the data being requested already exists in the cache, the application can simply access that data locally without waiting for a round trip call to the server. We can implement the same prefetching mechanisms described for the server to further improve the utility of a client cache.

6.3 Unified Compute Across Data Layers

A key principle guiding our implementation of data layers was to provide constant computation times for all zoom levels of data. As seen in figures ??, 10, and 11, we observe consistent spikes in computation time when computing both high resolution and very coarse resolution views. Although coarse resolution naturally requires higher disk access times, simplifying computation for high resolution views would provide latencies consistent with convolved views.

6.4 Notebook Form Factor

Reproducibility is important in scientific analysis. While Mango provides an interactive genome browser interface, user interaction is difficult to reproduce. We can extend Mango to a notebook form factor that visually displays the result of a query. This form factor requires integration into an existing notebook tool such as the Spark notebook [19], which can already perform analysis from ADAM.

7. CONCLUSION

In this report we presented Mango, a genome browser that leverages distributed computing for ad hoc genomic exploration. Mango scales horizontally, achieving latencies comparable to local state of the art browsers on both variant and alignment workloads, while reducing latency with additional compute resources in a distributed environment.

We demonstrated the optimizations required to accommodate a visualization workload of high data selectivity and low latency in a distributed environment. Mango selectively materializes data on user request and organizes the data efficiently for future access.

With the rise of available genomic data, it is crucial that clinicians and researchers have sufficient tools to interactively explore their data. Mango supports scalable visualization that will enable necessary ad hoc exploration of these increasing datasets.

8. REFERENCES

[1] Thorvaldsdóttir, H., Robinson, J. T., and Mesirov, J. P. "Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration." Oxford Journals: Briefings in Bioinformatics (2013), pp. 178-192.

[2] Battle, L., Chang, R., and Stonebraker, M. "Dynamic Prefetching of Data Tiles for Interactive Visualization." Computer Science and Artificial Intelligence Laboratory Technical Report MIT-CSAIL-TR-2015-031 (2015).

[3] Dreszer, T., et al. "The UCSC Genome Browser database: extensions and updates 2011." Nucleic Acids Res. 2012 Jan;40 (Database issue):D918-23. PMID: 22086951; PMC: PMC3245018.

[4] Li, R., Guo, R., Xu, Z., and Feng, W. "A Prefetching Model Based on Access Popularity for Geospatial Data in a Cluster-based Caching System." Int. J. Geogr. Inf. Sci., 26(10):1831-1844, Oct. 2012.

[5] Yesilmurat, S. and Isler, V. "Retrospective Adaptive Prefetching for Interactive Web GIS Applications." Geoinformatica, 16(3):435-466, July 2012.

[6] Apache. Spark. http://spark.apache.org/

[7] Armbrust, M., Xin, R., Lian, C., Huai, Y., Liu, D., Bradley, J., Meng, X., Kaftan, T., Franklin, M., Ghodsi, A., and Zaharia, M. "Spark SQL: Relational Data Processing in Spark." ACM SIGMOD Int. Conf. on Management of Data, pp. 1383-1394 (2015).

[8] European Bioinformatics Institute. 1000 GenomesProject. http://www.1000genomes.org/

[9] Genomics England. 100,000 Genomes Project. https://www.genomicsengland.co.uk/the-100000-genomes-project/

[10] Liu, Z. and Heer, J. "The Effects of Interactive Latency on Exploratory Visual Analysis." IEEE TVCG (2014).

[11] Schroeder, M., Gonzalez-Perez, A., and Lopez-Bigas, N. "Visualizing Multidimensional Cancer Genomics Data." BioMed Central Ltd (2013).

[12] Apache. Zeppelin. https://zeppelin.incubator.apache.org/

[13] Apache. Parquet. https://parquet.apache.org/

[14] Stein, L. "The case for cloud computing in genome informatics." Genome Biol. 2010;11:207. doi: 10.1186/gb-2010-11-5-207.

[15] Nothaft, Frank Austin, et al. ”Rethinkingdata-intensive science using scalable analyticssystems.” Proceedings of the 2015 ACM SIGMODInternational Conference on Management of Data.ACM, 2015.

[16] Massie, Matt, et al. "ADAM: Genomics formats and processing patterns for cloud scale computing." EECS Department, University of California, Berkeley, Tech. Rep. UCB/EECS-2013-207 (2013).

[17] Gonzalez, Cristina Y., et al. ”Multicore andcloud-based solutions for genomic variant analysis.”Euro-Par 2012: Parallel Processing Workshops.Springer Berlin Heidelberg, 2013.

[18] Apache. Avro. https://avro.apache.org/

[19] Spark Notebook. http://spark-notebook.io/

[20] Big Data Genomics. http://bdgenomics.org/

[21] Apache. Hadoop. http://hadoop.apache.org.

[22] Bostock, Michael, Vadim Ogievetsky, and Jeffrey Heer. "D3: Data-Driven Documents." IEEE Transactions on Visualization and Computer Graphics 17.12 (2011): 2301-2309.

[23] McKenna, Aaron, et al. ”The Genome AnalysisToolkit: a MapReduce framework for analyzingnext-generation DNA sequencing data.” Genomeresearch 20.9 (2010): 1297-1303.

[24] Scalatra. http://www.scalatra.org/