Succinct Spark: Fast Interactive Queries on Compressed RDDs by Rachit Agarwal


  • Interactive Queries on Compressed RDD

    Succinct Spark

    Rachit Agarwal, AMPLab

    [email protected]

    Twitter: @_ragarwal_

  • No secondary indexes, no data scans, no data decompression

    Succinct: a distributed compressed data store

    Point queries

    search, random access, range queries, regular expressions

    Unified interface

    Unstructured data, key-value store, document store, tables

  • Interactive point queries

    Random access

    Search

    Range queries

    Regular expressions

    Aggregate queries

    Updates

    Graph queries

  • 0, 10, 14, 16, 19, 26, 29

    1, 4, 5, 8, 20, 22, 24

    2, 15, 17, 27

    3, 6, 7, 9, 12, 13, 18, 23 ..

    11, 21

    Data scans: low storage, high latency

    Indexes: high storage, low latency

    Existing systems, e.g., search()

    Search( )
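
The trade-off on this slide can be sketched in plain Scala. This is only an illustration, not code from the talk: the object name `ScanVsIndex`, the word-level index, and the sample text are all assumptions. A scan keeps no extra state but reads the whole input on every query, while an index answers from precomputed offset lists that must be stored alongside the data.

```scala
object ScanVsIndex {
  // Data scan: walk the whole input on every query; no extra storage needed.
  def scanSearch(data: String, query: String): Seq[Int] =
    if (query.isEmpty) Seq.empty
    else (0 to data.length - query.length).filter(i => data.startsWith(query, i))

  // Index (hypothetical word-level index): precompute word -> starting offsets.
  // Lookups are fast, but the map is additional storage on top of the data.
  def buildIndex(data: String): Map[String, Seq[Int]] =
    "\\S+".r.findAllMatchIn(data).toList
      .groupBy(_.matched)
      .map { case (word, ms) => word -> ms.map(_.start) }

  // Index lookup: no scan, but only whole indexed words can be found.
  def indexSearch(index: Map[String, Seq[Int]], word: String): Seq[Int] =
    index.getOrElse(word, Seq.empty)
}
```

Note that the scan finds arbitrary substrings, whereas this particular index only answers queries for the tokens it was built on.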

  • [Plot: query latency vs. input size; curves: data scans, indexes]

    Indexes in faster storage

    Indexes in slower storage, scans in faster storage

    Scans in slower storage

    Executing queries off slower storage

    Existing systems at scale (qualitatively)

  • Succinct

    Low storage, low latency

    Queries executed directly on the compressed representation

    What makes Succinct unique

    No additional indexes

    Query responses embedded within the compressed representation

    No data scans: functionality of indexes

    No decompression

    Queries directly on the compressed representation (except for data access queries)

  • [Plot: query latency vs. input size; curves: data scans, indexes, Succinct]

    Avoiding data scans

    Avoiding queries off slower storage

    Succinct tradeoffs

  • Original input

    Input: flat (unstructured) files

    Succinct

    Search: returns offsets of arbitrary strings in uncompressed file

    Count: returns count of arbitrary strings in uncompressed file

    Extract: returns data at arbitrary offsets in uncompressed file

    Search() = {0, 10, 14, 16, 19, 26, 29}

    Count() = 7

    Extract(0, 5) = { , , , , }

    Append( , , , , )

    Range queries

    Succinct Data Model and Functionality
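
To pin down what these operations return, here is a minimal reference sketch over an uncompressed in-memory string. It is not how Succinct works internally: it scans the raw input, which is exactly what Succinct is designed to avoid. The object name `FlatFileSemantics` and the use of `String` instead of bytes are assumptions made for illustration.

```scala
object FlatFileSemantics {
  // Search: offsets of every occurrence of `query` in the uncompressed input.
  def search(data: String, query: String): Seq[Int] =
    if (query.isEmpty) Seq.empty
    else (0 to data.length - query.length).filter(i => data.startsWith(query, i))

  // Count: number of occurrences of `query` in the uncompressed input.
  def count(data: String, query: String): Int = search(data, query).size

  // Extract: `length` characters starting at `offset` (clipped at the end of input).
  def extract(data: String, offset: Int, length: Int): String =
    data.slice(offset, offset + length)

  // Append: new data is added at the end; existing offsets stay valid.
  def append(data: String, more: String): String = data + more
}
```

For example, on the input `abracadabra`, `search(data, "abra")` yields offsets 0 and 7, and `count` always equals the size of the corresponding `search` result.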

  • Supported, but traded off in favor of point queries on compressed data

    Preprocessing time

    CPU (data access)

    Sequential scan throughput

    In-place updates

    What do we lose?

    Succinct tradeoffs

  • No secondary indexes, no data scans, no data decompression

    Succinct: a distributed compressed data store

    Point queries

    search, random access, range queries, regular expressions

    Unified interface

    Unstructured data, key-value store, document store, tables

  • With all the powerful queries on values, documents, columns

    Unstructured data

    Key-value stores (Voldemort, Dynamo)

    Document store (Elasticsearch, MongoDB)

    Tables (Cassandra, BigTable)

    And many more.

    Unified interface

    Succinct Data Model: Flat File Interface

  • Search(Column1, )

    Search()

    Succinct Flat File Interface: Unification
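
One way a column-level search could be reduced to a flat-file search is sketched below. The encoding is an assumption for illustration only and is not taken from the slides: each value is wrapped in a delimiter character specific to its column, rows end with a newline, and neither the delimiters nor newlines are assumed to occur in the data. All names (`FlatFileUnification`, `delim`, `searchColumn`) are hypothetical.

```scala
object FlatFileUnification {
  // Hypothetical per-column delimiter; assumed never to occur in the data.
  private def delim(col: Int): Char = (0x01 + col).toChar
  private val rowEnd: Char = '\n'

  // Serialize a table into one flat string: every value is wrapped in its
  // column's delimiter, and every row is terminated by `rowEnd`.
  def toFlatFile(rows: Seq[Seq[String]]): String =
    rows.map { row =>
      row.zipWithIndex.map { case (v, c) => s"${delim(c)}$v${delim(c)}" }.mkString + rowEnd
    }.mkString

  // Plain flat-file search: offsets of `query` in `data`.
  def flatSearch(data: String, query: String): Seq[Int] =
    if (query.isEmpty) Seq.empty
    else (0 to data.length - query.length).filter(i => data.startsWith(query, i))

  // Search(column, value) expressed as Search(delimited value) on the flat file;
  // each hit offset is mapped back to a row number by counting row terminators.
  def searchColumn(flat: String, col: Int, value: String): Seq[Int] =
    flatSearch(flat, s"${delim(col)}$value${delim(col)}")
      .map(off => flat.take(off).count(_ == rowEnd))
}
```

With this encoding the same value in different columns maps to different byte sequences, so a single flat-file `search` can serve tables, key-value pairs, and documents alike.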

  • Where are we?

    Succinct, Succinct Spark

    Where are we going?

    Industry collaboration, Succinct++

    Succinct: a distributed compressed data store

  • System (prototyped & tested)

    As a library

    C++, Java, Scala

    for ease of integration

    All functionalities supported

    Succinct

    Succinct: Where are we?

  • A Spark package

    Enables new functionalities

    Document stores

    Point queries

    Faster filters

    Compressed RDDs: more in-memory

    Dataframes API not so mature

    Queries on compressed RDDs

    Succinct Spark

    Succinct: Where are we?

  • If you are already using Spark

    New functionalities

    Document store, key-value store

    search on documents, values

    Faster operations into RDDs

    random access, filters

    avoid scans

    More in-memory: compressed RDDs, no decompression overheads

    Succinct Spark

  • import edu.berkeley.cs.succinct._                  // Import classes

    val rdd = ctx.textFile(...).map(_.getBytes)        // Create an RDD

    val succinctRDD = rdd.succinct                     // Compress using Succinct

    val bytes = succinctRDD.extract(50, 100)           // Extract 100 bytes from offset 50

    val count = succinctRDD.count("Berkeley")          // Count # occurrences of "Berkeley"

    val offsets = succinctRDD.search("Berkeley")       // Find all occurrences of "Berkeley"

    Succinct Spark: SuccinctRDD (unstructured data)

  • import edu.berkeley.cs.succinct.kv._                             // Import classes

    val kvRDD = rdd.zipWithIndex.map(t => (t._2, t._1.getBytes))    // Load data

    val succinctKVRDD = kvRDD.succinctKV                             // Compress using Succinct

    val value = succinctKVRDD.get(0)                                 // Get value for key 0

    val valueData = succinctKVRDD.extract(0, 50, 100)                // Extract 100 bytes at offset 50 in the value for key 0

    val keys = succinctKVRDD.search("Berkeley")                      // Find all keys for values that contain "Berkeley"

    Succinct Spark: SuccinctKVRDD (document store)
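
The meaning of the three key-value operations above can be sketched with an ordinary in-memory `Map`. This only mirrors the comments on the slide; it is not the Succinct Spark implementation, and the object name, the `Option`/`Set` return types, and the use of `String` values are assumptions.

```scala
object KVSemantics {
  // get(key): the full value stored under `key`, if any.
  def get(store: Map[Long, String], key: Long): Option[String] =
    store.get(key)

  // extract(key, offset, length): `length` characters at `offset` inside the value for `key`.
  def extract(store: Map[Long, String], key: Long, offset: Int, length: Int): Option[String] =
    store.get(key).map(_.slice(offset, offset + length))

  // search(query): all keys whose values contain `query`.
  def search(store: Map[Long, String], query: String): Set[Long] =
    store.collect { case (k, v) if v.contains(query) => k }.toSet
}
```

Unlike the flat-file `search`, which returns offsets, the key-value `search` returns keys, which is what makes it usable as a document store lookup.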

  • 5x Amazon EC2 servers, 30 GB RAM each

    Wikipedia dataset, 40 GB

    Spark, Elasticsearch

    search queries

    # occurrences: 1-10k

    Succinct Evaluation

  • Take-away: Succinct Spark is 2.75x faster than Elasticsearch while being 2.5x more space efficient (data fits in memory for all systems)

    Succinct Spark Evaluation (search latency)

  • Succinct Spark now supports Regular Expressions!

    SuccinctRDD

    val matches = succinctRDD.regexSearch("William.*Clinton")        // Find all matches for the RegEx William.*Clinton

    SuccinctKVRDD

    val matchKeys = succinctKVRDD.regexSearch("William.*Clinton")    // Find all keys for values that contain matches for the RegEx William.*Clinton
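
As a rough reference for what the two regex queries mean, the sketch below evaluates the same pattern with Scala's standard regex support on uncompressed data. It says nothing about how Succinct evaluates regular expressions internally, and the return shapes (offset and length pairs for flat data, a set of keys for key-value data) are assumptions rather than the package's documented types.

```scala
object RegexSemantics {
  // Flat data: (offset, length) of every match of `pattern` in `data`.
  def regexSearch(data: String, pattern: String): Seq[(Int, Int)] =
    pattern.r.findAllMatchIn(data).map(m => (m.start, m.end - m.start)).toList

  // Key-value data: keys whose values contain at least one match of `pattern`.
  def regexSearchKeys(store: Map[Long, String], pattern: String): Set[Long] = {
    val re = pattern.r
    store.collect { case (k, v) if re.findFirstIn(v).isDefined => k }.toSet
  }
}
```

Because `.*` is greedy in `java.util.regex`, a pattern such as `William.*Clinton` matches from the first `William` to the last `Clinton` on a line.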

  • Take-away: Succinct significantly speeds up RegEx queries even when all the data fits in memory for all systems

    Succinct Spark Evaluation (RegEx latency)

  • val jsonDoc = succinctJsonRDD.get(0)                      // Get JSON document with id 0

    val ids1 = succinctJsonRDD.filter("city", "Berkeley")     // Filter JSON documents where city = Berkeley

    val ids2 = succinctJsonRDD.search("AMPLab")               // Search for JSON documents containing "AMPLab"

    Succinct Spark now supports JSON documents!

  • More testing, benchmarking

    Succinct Spark Dataframes

    New functionalities

    Where are we going?

  • Queries on compressed and encrypted data

    BlowFish

    Succinct Encryption

    Succinct Graphs

    Queries on compressed graphs

    New functionalities

    [Plot: storage vs. query latency; points: Succinct, BlowFish, Indexes]

  • AND MANY MORE!

    succinct.cs.berkeley.edu