
Extending Apache Spark with a Mediation Layer

Dimitris Stripelis

Information Sciences Institute
USC Viterbi School of Engineering
Marina Del Rey, California, 90292

[email protected]

Chrysovalantis Anastasiou
Integrated Media Systems Center
USC Viterbi School of Engineering
Los Angeles, California, 90089

[email protected]

José Luis Ambite
Information Sciences Institute

USC Viterbi School of Engineering
Marina Del Rey, California, 90292

[email protected]

ABSTRACT

With the recent growth of data volumes in many disciplines of both industry and academia, many new Big Data Management systems have emerged to provide scalable tools for efficient data storage, processing, and analysis. However, most of these systems offer little support for efficiently integrating multiple external sources under a uniform schema and a single query access point, which greatly simplifies further analytics. In this work, we present Spark Mediator, a system that extends the logical data integration capabilities of Apache Spark. As a use case, we show the application of Spark Mediator to the integration of schizophrenia neuroimaging data and compare it with previous data integration systems.

CCS CONCEPTS

• Information systems → Mediators and data integration; Federated databases; MapReduce-based systems;

KEYWORDS

Big Data Integration, Federated Databases, Mediator, Apache Spark

ACM Reference Format:
Dimitris Stripelis, Chrysovalantis Anastasiou, and José Luis Ambite. 2018. Extending Apache Spark with a Mediation Layer. In SBD'18: Semantic Big Data, June 10, 2018, Houston, TX, USA. ACM, New York, NY, USA, 6 pages. https://doi.org/10.1145/3208352.3208354

1 INTRODUCTION

Over the last decade the database market has witnessed an enormous increase in the number of available data management engines. Within the physical infrastructure of any organization, it is very common for relational databases to coexist with NoSQL, NewSQL, SQL-on-Hadoop, graph, streaming, and other engines, each one tailored to optimize the execution of different data processing tasks. It is apparent that the observation 'no size fits all' [23] has finally prevailed. To handle this multi-store heterogeneity and the variety in query languages and data schemata, we need to introduce a single query access point that operates over a unified data model and distributes the query execution workload to the respective data stores. Towards this end, we present

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
SBD'18, June 10, 2018, Houston, TX, USA
© 2018 Association for Computing Machinery.
ACM ISBN 978-1-4503-5779-1/18/06. . . $15.00
https://doi.org/10.1145/3208352.3208354

the Spark Mediator middleware, which allows seamless integration and querying of disparate data sources through a harmonized data model and a query rewriting component.

Motivating Use Case. Our motivating example for developing the Spark Mediator middleware is the SchizConnect virtual data integration system [1], which integrates several leading schizophrenia neuroimaging data sources (schizconnect.org). The study of complex diseases, such as schizophrenia, requires the integration of data from multiple heterogeneous data sources [26] to obtain a sufficient number of subjects to reach statistically valid conclusions. Each source features a diverse combination of data models, data types, and values, which renders their harmonization a very challenging task. Many multi-site consortia have been formed to address this issue through the introduction of domain-specific standardization techniques. For example, the Human Imaging Database (HID) [15] of the Functional Biomedical Informatics Research Network (FBIRN) [11] consortium is a federated database where each site follows the same standard schema. Other prominent consortia within the neuroscience domain that follow this practice are the Mind Clinical Imaging Consortium [16] and the ENIGMA Network [25]. However, none of these federated platforms provides the necessary toolset to efficiently perform large-scale analytics. Spark Mediator offers an alternative solution to this data integration and analysis problem by introducing a new mediation layer to Apache Spark [31] and leveraging its analytics stack.

Our decision to extend Apache Spark with a mediation layer was due to Spark's widespread adoption across multiple disciplines for executing expensive data analysis tasks and its versatile distributed programming model (MLlib, Streaming, GraphX), which simplifies the implementation of scalable solutions.

Mediators and Federation. In the virtual data integration (or mediation) approach that Spark Mediator implements, the data reside at their original sources and are mapped to a harmonized virtual schema. All the schema mappings are described declaratively by logical formulas. When a user issues a query expressed over the harmonized schema, the mediator rewrites the original query into source-level queries. During query rewriting, the mediator consults the defined schema mapping rules and identifies which data sources are required to answer the query. In addition, the integration system generates and optimizes a distributed query evaluation plan that accesses the sources and composes the answers to the user query. SchizConnect is a fully-fledged system that adopts this approach [1]. Spark Mediator falls under the category of mediator [30] and federated data management systems [13]. These systems were extensively studied during the 1980s and 1990s; their goal was to support transparent access to multiple backends with different querying capabilities. Some of the most notable systems developed during that period were TSIMMIS [5],


Garlic [4], and the Information Manifold [17]. Spark Mediator adheres to the same integration design principles and provides similar federated querying capabilities.

Relation to Polystores. More recently, the Polystore architecture [9] and many systems that follow this architectural paradigm have been proposed, including BigDAWG [9], ESTOCADA [3], MyriaDB [28], and others [6, 14, 18]. All these systems share many operational and querying functionalities with the mediator and federated systems architecture. However, Spark Mediator differs from Polystore systems in terms of data ownership, data modeling, and query processing. Spark Mediator is agnostic to the underlying sources and each source acts independently, whereas in the Polystore setting all the sources are managed together as an integrated set, and data placement and replication are handled by the Polystore system. With respect to data modeling and query processing, Spark Mediator supports only domain-specific functionality under a standardized data model that all the sources need to agree on, and relies on Spark SQL's [2] relational and procedural processing engine. Conversely, a Polystore system allows the existence of different data models and query notations encapsulated inside user-facing abstractions called islands of information (e.g., in BigDAWG).

One of the main contributions of this work is the introduction of a new virtual integration layer in the Big Data Integration (BDI) landscape [8] on top of the Apache Spark framework. We also present how this layer has direct applicability within the clinical and biomedical research community through a real-world use case over neuroimaging data in the SchizConnect ecosystem. Finally, we propose and develop an extensible distributed query processing strategy that enhances the current query federation execution pipeline of Spark SQL [2].

The paper is structured as follows. Section 2 describes the design of Spark Mediator, Section 3 explains how logical integration is achieved within the SchizConnect system, Section 4 introduces the query federation engine and optimization strategy, and Section 5 presents the experimental results. We conclude with a short discussion in Section 6.

2 MEDIATION FOR DATA ANALYTICS

The Mediator's importance is twofold: it adds a new virtual integration layer to Spark's highly-performant analytics stack, and it extends Spark's efficiency as a distributed query federation engine. In our previous experience [24], such a mediator is an essential tool for large-scale infrastructures that need to execute expensive data analysis operations over a group of heterogeneous sources.

The Spark Mediator middleware exposes a harmonized virtual schema (i.e., a domain schema) over the data of the remote sources. This unified view allows the data to reside at the remote sources and remain structured under their original schemata. During query execution, the Mediator receives the submitted query over the domain schema and rewrites it into source-level queries through a set of schema mappings: declarative logical rules that relate the source schemata to the target/domain schema. Based on these schema mappings, the Mediator is able to reconcile any semantic discrepancies among the remote data sources. Upon completion of query rewriting, the Mediator parses the source-level queries and generates a logically optimized distributed query evaluation plan, which it forwards to the Spark SQL engine for execution.

Figure 1: Spark Mediator Diagram. (Flow: a domain query is rewritten, using the schema mappings, into a source-level query; planning and optimization produce an optimized logical plan, which the Spark SQL execution engine evaluates against data sources 1..n; the result is returned as a Spark SQL DataFrame.)

Once the source-level queries are executed, the Mediator returns the query result set as a Spark SQL DataFrame. This procedure is presented in Figure 1.
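To make this flow concrete, the sketch below shows how a client might drive it. This is a minimal illustration under assumed names: the Mediator trait, its query method, and the example query string are hypothetical stand-ins, since the paper does not spell out the programmatic API.

import org.apache.spark.sql.DataFrame

// Hypothetical interface standing in for the Mediator described above.
trait Mediator {
  // Rewrites a domain-schema query into source-level queries, has Spark SQL
  // execute the optimized plan, and returns the answers as a DataFrame.
  def query(domainSql: String): DataFrame
}

object MediationFlowSketch {
  def run(mediator: Mediator): Unit = {
    // A query over the harmonized domain schema (cf. the predicates in Figure 2).
    val subjects: DataFrame =
      mediator.query("SELECT * FROM subject WHERE dx = 'Schizophrenia'")

    // The result is an ordinary Spark SQL DataFrame, ready for further
    // analytics with Spark's stack (MLlib, GraphX, ...).
    subjects.groupBy("source", "sex").count().show()
  }
}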

3 MOTIVATION: DATA INTEGRATION IN NEUROIMAGING

As motivation, we describe SchizConnect [1], an integration use case on schizophrenia neuroimaging that applies a mediator approach. Spark Mediator improves on the previous architecture since it embeds the system in the powerful analytics stack of Spark.

3.1 Participating Sources

The SchizConnect data are distributed across several publicly available data sources. The FBIRN Phase II @ UCI [11] provides cross-sectional multisite data from 251 subjects, including structural and functional magnetic resonance imaging scans. The FBIRN repository is hosted on the PostgreSQL DBMS of the HID system [15]. The NUSDAST @ XNAT Central [29] source contains data from 368 subjects together with collected sMRI scans. These data are stored inside the eXtensible Neuroimaging Archiving Toolkit (XNAT) [20] public repository, which provides a REST API that accepts queries only in the XNAT-specific XML format and returns XML data. The COBRE & MCICShare @ COINS Data Exchange [22] contains data for 198 subjects from the COBRE project and 212 subjects from the MCICShare project. Due to the dynamic data packaging architecture of the COINS system, which prevents immediate return of data for a submitted query, the COINS data are replicated in a dedicated MySQL DBMS at USC/ISI.

3.2 Domain Schema

Integrating data from multiple diverse sources under a homogeneous schema is a challenging task.


Project(source:STRING, study:STRING, projectid:NUMBER, description:STRING)
Subject(source:STRING, study:STRING, subjectid:STRING, age:NUMBER, sex:STRING, dx:STRING)
In_Project(source:STRING, study:STRING, subjectid:STRING, projectid:NUMBER)
Imaging_Protocol(source:STRING, study:STRING, subjectid:STRING, szc_protocol_hier:STRING, img_date:DATE, notes:STRING, datauri:STRING, maker:STRING, model:STRING, field_strength:NUMBER, site:STRING)
Cognitive_Assessment(source:STRING, study:STRING, site:STRING, subjectid:STRING, assessment:STRING, assessment_description:STRING)
Clinical_Assessment(source:STRING, study:STRING, site:STRING, subjectid:STRING, assessment:STRING, assessment_description:STRING)

Figure 2: SchizConnect Domain Model (selected predicates).

semantics for describing its’ data values and types. Bridging thesesemantic heterogeneities requires a solid understanding of howschema elements from one source are related to schema elementsof the other sources. To manage this complexity, a unified domainschema (also called the target, mediated, or global schema) is de-fined, and the schemata of the sources are mapped to the domainschema using logical rules, known as schema mappings [7].

The domain schema does not need to include every element from each source, but rather only those elements that are necessary to answer specific types of queries. The domain schema can easily be extended with semantic descriptions of additional sources as they arrive, and thus support new query types over these newly defined sources. The SchizConnect domain schema relies on this minimalistic design in order to grow incrementally: as new sources with neuroimaging data are found, the domain schema can be extended to incorporate them as well.

The current SchizConnect domain schema follows the relational model, as shown in Figure 2. The first attribute in all the predicates is source, which indicates the data provider; currently, the system has four sources: HID, XNAT, COINS, and Mappings. Each attribute is annotated with one of the data types currently supported by the Mediator: string, number, and date.

The Project predicate contains the name and the description of the studies in the data sources. The Subject predicate contains the demographic and diagnostic information for individual participants. The In_Project predicate indicates in which studies and projects each subject has participated. The Imaging_Protocol predicate refers to MRI records for a particular subject, along with various scanner metadata. The Cognitive_Assessment predicate retains information on the neuropsychological assessments of the subjects, while the Clinical_Assessment predicate holds the assessments of different symptoms in the subjects.

3.3 Schema Mappings

The schema mappings describe the relationships between the source schemata and the domain schema. These mappings are expressed as a set of declarative logical rules. In the database theory literature such mappings are also known as source-to-target tuple-generating dependencies (st-tgds) [10]. The Mediator uses these declarative rules to define how the predicates of the domain schema relate to the predicates of the source schemata. The rules are logical implications of the form:

$\forall \bar{X}, \bar{Y} \;\; \Phi_S(\bar{X}, \bar{Y}) \rightarrow \exists \bar{Z} \;\; \Psi_G(\bar{X}, \bar{Z})$

where the antecedent ($\Phi_S$) is a conjunctive formula over the source schemata (S) predicates, and the consequent ($\Psi_G$) is a conjunctive formula over the domain schema (G) predicates.

subject ("UCI_HID", "fBIRNPhaseII__0010", SUBJECTID, AGE, SEX, DX) <- subject_age ("UCI_HID", STUDY, SUBJECTID, AGE) ^ (STUDY = "fBIRNPhaseII__0010") ^ subject_sex ("UCI_HID", STUDY, SUBJECTID, SEX) ^ HIDPSQLResource_nc_fbirn_phaseii_dx_mview (SUBJECTID, DX)

subject_age ("UCI_HID", "fBIRNPhaseII__0010", SUBJECTID, AGE) <- HIDPSQLResource_nc_subjexperiment (uniqueid1, tableid, owner, modtime, moduser, nc_experiment_uniqueid, SUBJECTID, nc_researchgroup_uniqueid) ^ (nc_experiment_uniqueid = 9610) ^ HIDPSQLResource_nc_assessmentinteger_mview (nc_assessmentdata_uniqueid, scoreorder,

textvalue, textnormvalue, comments, AGE, datanormvalue, storedassessmentid, ASSESSMENTID, assessmentname, SCORENAME, scoretype, ISVALIDATED, isranked, SUBJECTID, site, uniqueid) ^

(SCORENAME = "Age") ^ (ISVALIDATED = "TRUE")subject_sex ("UCI_HID", "fBIRNPhaseII__0010", SUBJECTID, SEX) <- HIDPSQLResource_nc_subjexperiment (uniqueid1, tableid, owner, modtime, moduser, nc_experiment_uniqueid, SUBJECTID, nc_researchgroup_uniqueid) ^ (nc_experiment_uniqueid = 9610) ^ HIDPSQLResource_nc_assessmentvarchar (tableid2, nc_assessmentdata_uniqueid2,

scoreorder2, owner2, modtime2, moduser2, textvalue2, textnormvalue2,comments2, SEX_SRC, datanormvalue2, storedassessmentid2, ASSESSMENTID2, SCORENAME2, scoretype2, ISVALIDATED2, isranked2, SUBJECTID, entryid2, keyerid2, raterid2, classification2, uniqueid2) ^

(SCORENAME2 = "Gender") ^ (ISVALIDATED2 = "TRUE") ^ MappingsMySQLResource_value_mappings (SEX, "UCI_HID", SEX_SRC, id)

imaging_protocol ("UCI_HID", "fBIRNPhaseII__0010", SUBJECTID, SZC_PROTOCOL_HIER, DATE,NOTES, DATAURI, MAKER, MODEL, FIELD_STRENGTH, SITE, "") <-

HIDPSQLResource_nc_subjexperiment (uniqueid, tableid, owner, modtime, moduser, nc_experiment_uniqueid, SUBJECTID, nc_researchgroup_uniqueid) ^

(nc_experiment_uniqueid = 9610) ^ HIDPSQLResource_nc_scannersbyscan_mview_phase3 (SUBJECTID, SITE, componentid,

segmentid, SOURCE_PROTOCOL, DATE, nc_colequipment_uniqueid, SOURCE_MAKE, SOURCE_MODEL, DATAURI, NOTES) ^

MappingsMySQLResource_protocol_mappings (SZC_PROTOCOL_HIER, "UCI_HID", SOURCE_PROTOCOL, ID1) ^

MappingsMySQLResource_scanner_mappings (MAKER, MODEL, FIELD_STRENGTH, "UCI_HID", SOURCE_MAKE, SOURCE_MODEL, ID2)

subject ("XNAT_CENTRAL", "NUSDAST", SUBJECT_LABEL, AGE, SEX, DX) <- XnatSubjectResource_xnat__subjectData (project, SUBJECT_LABEL, AGE, SEX, SRC_DX, QS) ^ MappingsMySQLResource_dx_mappings (DX, "XNAT_CENTRAL", 777, SRC_DX, id, "NUSDAST")

Figure 3: SchizConnect Sample Schema Mappings.

In SchizConnect the schema mappings are defined as Global-As-View (GAV) rules, which are st-tgds with a single predicate in the consequent.

Figure 3 illustrates some of the schema mappings currently defined in the domain schema of SchizConnect, with domain predicates (e.g., subject_age) appearing alongside source predicates (e.g., HIDPSQLResource_nc_subjexperiment). The first rule indicates that the subject domain predicate provides data that come from the UCI_HID data source and pertain to the participating subjects of the given study. This rule also shows that a domain predicate can be constructed through a join with other (auxiliary) domain predicates, in this case subject_age and subject_sex.

Rules defined with the same consequent denote a union. For instance, the subject domain predicate in the presented schema is obtained as the union of the first two rules, each drawing on a different data source (UCI_HID and XNAT_CENTRAL, respectively).¹ Constraints can also appear in the body (antecedent) of the rules (e.g., nc_experiment_uniqueid = 9610). The query engine pushes these selections down to the sources.

Finally, the last rule shows the definition of the imaging_protocol domain predicate. This rule employs the MappingsMySQLResource data source, which contains mapping tables for normalizing attribute values retrieved from the corresponding sources. The SchizConnect schema mapping collection contains many rules that perform this kind of normalization, but they are omitted from Figure 3 due to space constraints. The main reason for this normalization step is that even attributes originating from the same source expose different values and codes for the same concept. To resolve these attribute discrepancies, a more sophisticated hierarchical ontology structure is used [27].

¹We removed a similar rule for COINS for brevity.

3.4 Query Rewriting

At runtime the Mediator uses the schema mappings to reformulate a query expressed in terms of the domain schema into a set of equivalent executable source-level queries. The result of this reformulation is a logical query plan [7]. Apart from GAV schema mappings, the Mediator also supports query rewriting for LAV (Local-as-View) [19] and GLAV (Global-and-Local-as-View) rules; however, these two types of mapping rules are not required for the SchizConnect domain.

Figure 4 presents a GAV rewriting example. If, for instance, we want to select all the participants involved in the current studies accessible through SchizConnect, along with all their associated demographic and diagnostic information, the Mediator unfolds the domain query (top query in Fig. 4) into a union of conjunctive source-level queries. The final rewritten query (bottom query in Fig. 4) is a union of conjunctive queries over the data sources (XNAT_CENTRAL, UCI_HID, COINS). In this specific example the COINS data source is included twice, since it contributes two different studies.

SELECT * FROM subject

(SELECT 'UCI_HID' as source, 'fBIRNPhaseII__0010' as study,
        T8.subjectid as subjectid, T4.szc_protocol_hier as szc_protocol_hier,
        T6.date as img_date, T6.description as notes, T6.datauri as datauri,
        T2.maker as maker, T2.model as model,
        T2.field_strength as field_strength, T6.site as site
 FROM MappingsMySQLResource_scanner_mappings T2,
      MappingsMySQLResource_protocol_mappings T4,
      HIDPSQLResource_nc_scannersbyscan_mview_phase3 T6,
      HIDPSQLResource_nc_subjexperiment T8
 WHERE T2.source_make = T6.source_make
   AND T2.source_model = T6.source_model
   AND T2.source = 'UCI_HID'
   AND T4.source_protocol = T6.source_protocol
   AND T4.source = 'UCI_HID'
   AND T6.subjectid = T8.subjectid
   AND T8.nc_experiment_uniqueid = 9610)
UNION
(SELECT 'XNAT_CENTRAL' as source, 'NUSDAST' as study,
        T12.SUBJECT_LABEL as subjectid,
        T10.szc_protocol_hier as szc_protocol_hier,
        T12.SCAN_DATE as img_date, T12.SCAN_ID as notes,
        Concat(T12.IMAGE_ID, '/scans/', T12.SCAN_ID) as datauri,
        'SIEMENS' as maker, 'VISION1.5T' as model,
        1.5 as field_strength, 'WUSTL' as site
 FROM MappingsMySQLResource_protocol_mappings T10,
      XnatMRSessionResource_xnat__mrSessionData T12
 WHERE T10.source_protocol = T12.SCAN_TYPE
   AND T10.source = 'NUSDAST')
UNION
(SELECT 'COINS' as source, 'COBRE' as study, T14.ANONYMIZATION_ID as subjectid,
        T14.szc_protocol_hier as szc_protocol_hier, T14.SCAN_DATE as img_date,
        'notes' as notes, T14.SERIES_ID as datauri,
        T14.SCANNER_MANUFACTURER as maker, T14.SCANNER_LABEL as model,
        T14.FIELD_STRENGTH as field_strength, T14.SCANNER_SITE as site
 FROM COINSMySQLResource_series_v T14
 WHERE T14.STUDY_ID = 1139)
UNION
(SELECT 'COINS' as source, 'MCICShare' as study, T16.ANONYMIZATION_ID as subjectid,
        T16.szc_protocol_hier as szc_protocol_hier, T16.SCAN_DATE as img_date,
        'notes' as notes, T16.SERIES_ID as datauri,
        T16.SCANNER_MANUFACTURER as maker, T16.SCANNER_LABEL as model,
        T16.FIELD_STRENGTH as field_strength, T16.SCANNER_SITE as site
 FROM COINSMySQLResource_series_v T16
 WHERE T16.STUDY_ID = 5720)

Figure 4: SchizConnect Query Rewriting.

4 OPTIMIZING SPARK SQL FEDERATION

Spark SQL provides a convenient API for defining new data sources and for performing various source-specific query optimizations, such as pushing projections and selections down to the external source. However, the available optimizations are still limited for more expensive operations, such as joining tables that belong to the same remote database. For instance, assume that multiple Spark SQL tables are defined over the same external database. When a join operation is applied over them, Spark will naively bring the source data to the Spark cluster

and execute the joins locally. In some cases this may be efficient, but in most cases it is preferable for the database to take advantage of its indexes and statistics to evaluate a set of join operations. Another important limitation is that when more complex operations are applied to an external database, Spark SQL does not exploit the database's indexes or usage statistics to optimize the query's execution plan. As of Spark 2.2, a new cost-based optimization (CBO) framework computes various table- and column-level statistics (e.g., cardinality, max, min) that the Catalyst optimizer leverages to decide which query plan to execute. Even though this CBO framework can speed up Spark workloads, it does not take into account operations that could be executed directly inside the external sources and thereby accelerate Spark SQL's federated execution. In the experiments section we show the impact of CBO computation on query execution.
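For reference, the snippet below (e.g., typed into spark-shell) shows how CBO is enabled and fed in Spark 2.2; the configuration keys and ANALYZE TABLE statements are standard Spark SQL, while the table and column names are illustrative. Note that these statistics cover only tables registered in Spark's catalog; nothing here reaches into the external databases' own indexes or statistics.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("cbo-demo").getOrCreate()

// Enable the cost-based optimizer and cost-based join reordering (both are
// off by default in Spark 2.2).
spark.conf.set("spark.sql.cbo.enabled", "true")
spark.conf.set("spark.sql.cbo.joinReorder.enabled", "true")

// Table-level statistics: row count and size in bytes.
spark.sql("ANALYZE TABLE subject COMPUTE STATISTICS")
// Column-level statistics: distinct count, min, max, null count, etc.
spark.sql("ANALYZE TABLE subject COMPUTE STATISTICS FOR COLUMNS subjectid, age, dx")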

Optimization Strategy. Pushing joins to be evaluated at the data sources often yields more efficient plans, but such a capability is not implemented in Spark SQL, which is natively designed to push only projections and selections down to the external sources. Therefore, we designed and implemented a simple but extensible mechanism that allows us to push joins down to the data sources as sub-queries. Specifically, for the source-level queries generated by the Mediator, the strategy is to group the tables that belong to the same external data source whenever a join operation is applied over them.

This approach generates a consolidated source-level sub-query for each data source, with all the necessary joins, projections, and selections. As an example, assume a source-level query that joins three Spark SQL tables, two of which come from the same external database. Our strategy generates a sub-query for those two tables, pushes it down to the database for execution, retrieves the result, and then performs an in-memory join of this result with the third table. For a small cost of several milliseconds during query rewriting, this simple strategy proved very effective for many queries, as shown in the experiments section.
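The sketch below reproduces the effect of this strategy with Spark's stock JDBC source, reusing the SparkSession spark from the previous snippet: the two HID tables are handed to PostgreSQL as one consolidated sub-query via the dbtable option, so the database evaluates the join and selection with its own indexes and statistics. The connection details are placeholders, and the sub-query is a simplified hand-written stand-in for what the Mediator generates automatically.

// Consolidated sub-query over two HID tables (cf. Figure 4); PostgreSQL
// executes the join and the selection, and Spark sees only the reduced result.
val hidSubquery =
  """(SELECT se.subjectid, sc.site, sc.datauri
    |  FROM nc_subjexperiment se
    |  JOIN nc_scannersbyscan_mview_phase3 sc ON se.subjectid = sc.subjectid
    |  WHERE se.nc_experiment_uniqueid = 9610) AS hid_scans""".stripMargin

val hidScans = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://hid.example.org:5432/hid") // placeholder
  .option("user", "hid_user")                                  // placeholder
  .option("password", "***")
  .option("dbtable", hidSubquery)
  .load()

// Only this reduced result is joined in memory with tables from other sources.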

Once the execution plan of the rewritten query is computed and the above optimization is applied, the Mediator uses Spark's native connectors to access the external data sources. Spark provides excellent support for querying databases through JDBC connectors and offers the ability to partition the result of a query across the Spark workers based on one or more columns. For the UCI_HID PostgreSQL database and the COINS and Mappings MySQL databases, the default JDBC connector is used. For accessing the XNAT Central repository, we implemented a custom wrapper on top of the Spark SQL sources API.
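The partitioned JDBC read mentioned above looks as follows: Spark turns the scan into numPartitions range predicates over the chosen numeric column, so the executors fetch from the database in parallel. The signature is Spark's standard DataFrameReader.jdbc; the URL, credentials, bounds, and partition column are illustrative placeholders.

import java.util.Properties

val props = new Properties()
props.setProperty("user", "coins_user") // placeholder credentials
props.setProperty("password", "***")

val series = spark.read.jdbc(
  "jdbc:mysql://coins.example.org:3306/coins", // placeholder URL
  "series_v",                                  // COINS table (cf. Figure 4)
  "study_id",                                  // numeric column to partition on
  1L,                                          // assumed lower bound
  10000L,                                      // assumed upper bound
  4,                                           // number of partitions
  props)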

A drawback of our approach is that it forgoes Spark's out-of-the-box JDBC caching ability. When loading data over JDBC, Spark usually caches the individual tables that have been queried and can later reuse them. With our approach, even though Spark caches the sub-query result, it cannot reuse it for subsequent operations. We cite this as a future improvement, since we have already designed a solution that can reuse the materialized result.

Spark XNAT Central Wrapper. We developed a customizable data source wrapper for the XNAT Central repository [20] by extending the Spark SQL sources API. The wrapper implements the PrunedFilteredScan interface, which captures all the projections and selections passed at query time to the external XNAT repository, and returns the retrieved data as a collection of RDD objects. The wrapper converts all the selections to XML filters and generates the required XML query file on-the-fly. Once the XML file is created, the wrapper sends it directly to the XNAT service for execution. This filtering optimization drastically reduces the size of the data shipped from the XNAT service back to Spark, since it allows the XNAT framework to evaluate the sub-query and drop records that are redundant for the final query result set.
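A structural sketch of such a wrapper is shown below. The Spark SQL sources API types (BaseRelation, PrunedFilteredScan, Filter) are real; the simplified schema and the two private helpers are hypothetical stand-ins for the wrapper's XML query generation and its call to the XNAT REST service.

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, Filter, PrunedFilteredScan}
import org.apache.spark.sql.types._

class XnatRelation(xnatUrl: String, @transient val sqlContext: SQLContext)
    extends BaseRelation with PrunedFilteredScan {

  // Simplified relational view over XNAT subject data (cf. Figure 3).
  override def schema: StructType = StructType(Seq(
    StructField("project", StringType),
    StructField("subject_label", StringType),
    StructField("age", IntegerType),
    StructField("sex", StringType),
    StructField("dx", StringType)))

  // Spark hands over the pruned columns and the pushable filters; we turn
  // them into an XNAT XML search document and let XNAT evaluate it, so only
  // the surviving records travel back to the cluster.
  override def buildScan(requiredColumns: Array[String],
                         filters: Array[Filter]): RDD[Row] = {
    val xmlQuery = buildXmlQuery(requiredColumns, filters)
    runXnatSearch(xnatUrl, xmlQuery)
  }

  // Hypothetical helpers: generate the XML query document and execute it
  // against the XNAT REST service, parsing the XML response into rows.
  private def buildXmlQuery(cols: Array[String], fs: Array[Filter]): String = ???
  private def runXnatSearch(url: String, xml: String): RDD[Row] = ???
}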

5 EXPERIMENTAL RESULTS

For our experiments we used version 2.2.1 of Apache Spark on an Ubuntu 16.04 machine equipped with an Intel Core i7-950 processor at 3.07GHz and 16GB of RAM. The processor has 4 cores (8 hardware threads) with private L1 (32KB each for data and instructions) and L2 (256KB) caches and 8MB of shared L3 cache. For the query execution we initialized Spark in standalone mode with 6 cores and 12GB of RAM. We ran the queries using 3 Spark executors, allocating 2 cores and 2GB of RAM to each one.

Table 1: Structure of source-level queries

#   Structure   Size (#preds)   Result Size
1   U4CQ        21              567
2   U3CQ        12              1111
3   U2CQ        10              9129
4   U16CQ       127             2104
5   U16CQ       127             1234
6   U9CQ        58              1566
7   U3CQ        8               21745
8   U9CQ        53              20725

We carefully selected eight domain queries whose respective source-level rewrites could test the effectiveness of Spark SQL's federation engine as well as the effectiveness of our simple optimization strategy. All source-level queries are unions of conjunctive queries. The number of conjunctive queries in each union varies, and so does the number of predicates involved in each conjunctive query of the union. The structure of the source-level queries is described in Table 1. For instance, query 4 is a union of 16 conjunctive queries (U16CQ), involves a total of 127 predicates, and returns 2104 rows.

While executing the federated queries over Spark, we observed poor performance for the built-in UNION operator between the conjunctive queries, because of Spark's expensive data shuffling and the number of comparisons this operation performs between the left and right operands to determine the distinct records in their result sets. Instead, we replaced the UNION operator with UNION ALL, which naively merges the result sets of all the conjunctive queries, and we compute the distinct records only once, on the final set. This simple reformulation of the query plan, together with our pushdown optimization strategy, gave a significant boost to the execution of our federated queries.
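At the DataFrame level, this reformulation amounts to the sketch below: Spark's union has UNION ALL semantics, so the per-source result sets are concatenated cheaply and distinct() deduplicates once at the end. The helper name and the conjunctiveResults parameter are illustrative.

import org.apache.spark.sql.DataFrame

// Merge the results of the rewritten conjunctive queries with UNION ALL
// semantics, then remove duplicates in a single pass over the final set.
def unionAllThenDistinct(conjunctiveResults: Seq[DataFrame]): DataFrame =
  conjunctiveResults.reduce(_ union _).distinct()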

Figure 5 illustrates the query response times for the version of OGSA-DAI [12] that is used as the execution engine of the SchizConnect mediator,² for Spark Native (i.e., no UNION ALL, no pushdown, no CBO), for Spark with the UNION ALL and pushdown optimizations (Spark U-Pushdown), for Spark Native with CBO enabled, and for Spark U-Pushdown with CBO enabled. The execution time shown for each approach and each query corresponds to the average of ten consecutive executions. Spark U-Pushdown and Spark U-Pushdown CBO outperform Spark Native and Spark Native CBO, respectively, across all queries except queries 4 and 5 (explained below), which demonstrates the effectiveness of our optimization strategy. Compared to OGSA-DAI, the execution time of all Spark approaches is better (e.g., query 7) or almost equivalent (e.g., query 1), except for queries 4 and 5, where OGSA-DAI outperforms Spark thanks to its smarter cost-based optimization of federated queries with many source-level queries.

Figure 5: Execution time for 8 queries. (Bar chart: execution time in seconds, 0–50, per query Q1–Q8, with Q4 and Q5 starred; series: OGSA-DAI, Spark Native, Spark U-Pushdown, Spark Native CBO, Spark U-Pushdown CBO.)

We do not include execution times for queries 4 and 5 under Spark U-Pushdown and Spark U-Pushdown CBO, because our greedy optimization strategy generates a source-level sub-query for the HID PostgreSQL database that is very expensive to execute. The sub-query exceeds the 3-minute execution-time cut-off, due to the size of the involved join operations and the large number of tuples returned; just counting the result set of the sub-query within PostgreSQL takes around 50 seconds. If we disable the greedy pushdown strategy and simply use the UNION ALL reformulation, the execution time is better than Spark Native with or without CBO and comparable to OGSA-DAI. These queries show that a more sophisticated, cost-based pushdown optimization strategy could further reduce overall execution time.

We believe that we can further improve query execution time by adding more optimization rules to our current strategy. One such rule would be to detect sub-queries that share a common structure and push them only once to the remote data source, while performing the remaining operations on-the-fly within Spark for each conjunctive query. In effect, we would execute a non-recursive datalog program in which common subexpressions are defined as datalog rules. Ultimately, an extension of the cost-based optimizer with join pushdown would explore whether there are combinations of join operations that would perform better if executed within Spark rather than inside the remote sources.

²In these experiments, OGSA-DAI is extended with statistics collection from the remote sources and a simple but effective cost-based optimizer.

6 DISCUSSION

In this paper we presented a new virtual integration layer for Big Data Integration built on top of the Apache Spark framework. We also presented how our Mediator has direct applicability in clinical and biomedical research through a real-world use case in the neuroimaging domain. The naive execution of joins in Spark SQL led us to design and implement a strategy for pushing joins down to external data sources as sub-queries, hence improving the query performance of Spark SQL's federation pipeline. As future work, we aim to improve our query optimization techniques by evaluating more complex operations and caching mechanisms for the sub-queries, and to support query execution and rewriting for other query engines as well. Our goal is to eventually integrate our techniques into Spark SQL's Catalyst optimizer and make them publicly available to the research community. Ultimately, we want to tailor the query optimization of the mediator to the types of analytic operations in Spark and Spark MLlib [21], so that it improves the overall performance of the system and facilitates the analysis of complex heterogeneous datasets.

ACKNOWLEDGMENTS

This work is supported in part by the National Institutes of Health (NIH) under grants 5U01MH097435 (SchizConnect) and 1U24EB021996 (PRISMS). The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of NIH or the U.S. Government.

REFERENCES

[1] Jose Luis Ambite, Marcelo Tallis, Kathryn Alpert, David B Keator, Margaret King, Drew Landis, George Konstantinidis, Vince D Calhoun, Steven G Potkin, Jessica A Turner, et al. 2015. SchizConnect: virtual data integration in neuroimaging. In International Conference on Data Integration in the Life Sciences. Springer, 37–51.

[2] Michael Armbrust, Reynold S Xin, Cheng Lian, Yin Huai, Davies Liu, Joseph K Bradley, Xiangrui Meng, Tomer Kaftan, Michael J Franklin, Ali Ghodsi, et al. 2015. Spark SQL: Relational data processing in Spark. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. ACM, 1383–1394.

[3] Francesca Bugiotti, Damian Bursztyn, Alin Deutsch, Ioana Ileana, and Ioana Manolescu. 2015. Invisible glue: scalable self-tuning multi-stores. In Conference on Innovative Data Systems Research (CIDR).

[4] Michael J Carey, Laura M Haas, Peter M Schwarz, Manish Arya, William F Cody, Ronald Fagin, Myron Flickner, Allen W Luniewski, Wayne Niblack, Dragutin Petkovic, et al. 1995. Towards heterogeneous multimedia information systems: The Garlic approach. In Research Issues in Data Engineering, 1995: Distributed Object Management, Proceedings. RIDE-DOM'95. Fifth International Workshop on. IEEE, 124–131.

[5] Sudarshan Chawathe, Hector Garcia-Molina, Joachim Hammer, Kelly Ireland, Yannis Papakonstantinou, Jeffrey Ullman, and Jennifer Widom. 1994. The TSIMMIS project: Integration of heterogeneous information sources. (1994).

[6] Dong Deng, Raul Castro Fernandez, Ziawasch Abedjan, Sibo Wang, Michael Stonebraker, Ahmed K Elmagarmid, Ihab F Ilyas, Samuel Madden, Mourad Ouzzani, and Nan Tang. 2017. The Data Civilizer System. In CIDR.

[7] AnHai Doan, Alon Halevy, and Zachary Ives. 2012. Principles of Data Integration. Elsevier.

[8] Xin Luna Dong and Divesh Srivastava. 2015. Big data integration. Synthesis Lectures on Data Management 7, 1 (2015), 1–198.

[9] Jennie Duggan, Aaron J Elmore, Michael Stonebraker, Magda Balazinska, Bill Howe, Jeremy Kepner, Sam Madden, David Maier, Tim Mattson, and Stan Zdonik. 2015. The BigDAWG polystore system. ACM SIGMOD Record 44, 2 (2015), 11–16.

[10] Ronald Fagin, Phokion G Kolaitis, Renée J Miller, and Lucian Popa. 2005. Data exchange: semantics and query answering. Theoretical Computer Science 336, 1 (2005), 89–124.

[11] Gary H Glover, Bryon A Mueller, Jessica A Turner, Theo GM Van Erp, Thomas T Liu, Douglas N Greve, James T Voyvodic, Jerod Rasmussen, Gregory G Brown, David B Keator, et al. 2012. Function biomedical informatics research network recommendations for prospective multicenter functional MRI studies. Journal of Magnetic Resonance Imaging 36, 1 (2012), 39–54.

[12] Alistair Grant, Mario Antonioletti, Alastair C Hume, Amy Krause, Bartosz Dobrzelecki, Michael J Jackson, Mark Parsons, Malcolm P Atkinson, and Elias Theocharopoulos. 2008. OGSA-DAI: Middleware for data integration: Selected applications. In eScience, 2008. eScience'08. IEEE Fourth International Conference on. IEEE, 343–343.

[13] Dennis Heimbigner and Dennis McLeod. 1985. A federated architecture for information management. ACM Transactions on Information Systems (TOIS) 3, 3 (1985), 253–278.

[14] Manos Karpathiotakis, Ioannis Alagiannis, and Anastasia Ailamaki. 2016. Fast queries over heterogeneous data through engine customization. Proceedings of the VLDB Endowment 9, 12 (2016), 972–983.

[15] David B Keator, Jeffrey S Grethe, D Marcus, B Ozyurt, Syam Gadde, Sean Murphy, Steve Pieper, D Greve, R Notestine, H Jeremy Bockholt, et al. 2008. A national human neuroimaging collaboratory enabled by the Biomedical Informatics Research Network (BIRN). IEEE Transactions on Information Technology in Biomedicine 12, 2 (2008), 162–172.

[16] Margaret D King, Dylan Wood, Brittny Miller, Ross Kelly, Drew Landis, William Courtney, Runtang Wang, Jessica A Turner, and Vince D Calhoun. 2014. Automated collection of imaging and phenotypic data to centralized and distributed data repositories. Frontiers in Neuroinformatics 8 (2014), 60.

[17] Thomas Kirk, Alon Y. Levy, Yehoshua Sagiv, and Divesh Srivastava. 1995. The Information Manifold. In Working Notes of the AAAI Spring Symposium on Information Gathering in Heterogeneous, Distributed Environments. Technical Report SS-95-08, AAAI Press, Menlo Park, California.

[18] Boyan Kolev, Carlyna Bondiombouy, Patrick Valduriez, Ricardo Jiménez-Peris, Raquel Pau, and José Pereira. 2016. The CloudMdsQL multistore system. In Proceedings of the 2016 International Conference on Management of Data. ACM, 2113–2116.

[19] George Konstantinidis and José Luis Ambite. 2011. Scalable query rewriting: a graph-based approach. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data. ACM, 97–108.

[20] Daniel S Marcus, Timothy R Olsen, Mohana Ramaratnam, and Randy L Buckner. 2007. The extensible neuroimaging archive toolkit. Neuroinformatics 5, 1 (2007), 11–33.

[21] Xiangrui Meng, Joseph Bradley, Burak Yavuz, Evan Sparks, Shivaram Venkataraman, Davies Liu, Jeremy Freeman, DB Tsai, Manish Amde, Sean Owen, et al. 2016. MLlib: Machine learning in Apache Spark. The Journal of Machine Learning Research 17, 1 (2016), 1235–1241.

[22] Adam Scott, William Courtney, Dylan Wood, Raul De la Garza, Susan Lane, Runtang Wang, Margaret King, Jody Roberts, Jessica A Turner, and Vince D Calhoun. 2011. COINS: an innovative informatics and neuroimaging tool suite built for large heterogeneous datasets. Frontiers in Neuroinformatics 5 (2011), 33.

[23] Michael Stonebraker and Ugur Cetintemel. 2005. "One size fits all": an idea whose time has come and gone. In Data Engineering, 2005. ICDE 2005. Proceedings. 21st International Conference on. IEEE, 2–11.

[24] Dimitris Stripelis, José Luis Ambite, Yao-Yi Chiang, Sandrah P Eckel, and Rima Habre. 2017. A scalable data integration and analysis architecture for sensor data of pediatric asthma. In Data Engineering (ICDE), 2017 IEEE 33rd International Conference on. IEEE, 1407–1408.

[25] Paul M Thompson, Jason L Stein, Sarah E Medland, Derrek P Hibar, Alejandro Arias Vasquez, Miguel E Renteria, Roberto Toro, Neda Jahanshad, Gunter Schumann, Barbara Franke, et al. 2014. The ENIGMA Consortium: large-scale collaborative analyses of neuroimaging and genetic data. Brain Imaging and Behavior 8, 2 (2014), 153–182.

[26] Jessica A Turner. 2014. The rise of large-scale imaging studies in psychiatry. GigaScience 3, 1 (2014), 29.

[27] Jessica A Turner, Danielle Pasquerello, Matthew D Turner, David B Keator, Kathryn Alpert, Margaret King, Drew Landis, Vince D Calhoun, Steven G Potkin, Marcelo Tallis, et al. 2015. Terminology development towards harmonizing multiple clinical neuroimaging research repositories. In International Conference on Data Integration in the Life Sciences. Springer, 104–117.

[28] Jingjing Wang, Tobin Baker, Magdalena Balazinska, Daniel Halperin, Brandon Haynes, Bill Howe, Dylan Hutchison, Shrainik Jain, Ryan Maas, Parmita Mehta, et al. 2017. The Myria big data management and analytics system and cloud services. In CIDR.

[29] Lei Wang, Alex Kogan, Derin Cobia, Kathryn Alpert, Anthony Kolasny, Michael I Miller, and Daniel Marcus. 2013. Northwestern University schizophrenia data and software tool (NUSDAST). Frontiers in Neuroinformatics 7 (2013), 25.

[30] Gio Wiederhold. 1992. Mediators in the architecture of future information systems. Computer 25, 3 (1992), 38–49.

[31] Matei Zaharia, Mosharaf Chowdhury, Michael J Franklin, Scott Shenker, and Ion Stoica. 2010. Spark: Cluster computing with working sets. HotCloud 10, 10-10 (2010), 95.
