Building Multi-Petabyte Data Warehouses with ClickHouse

  • Date posted: 17-Mar-2018

  • Building Multi-Petabyte Data Warehouses with ClickHouse

    Alexander Zaitsev

    LifeStreet, Altinity

    Percona Live Dublin, 2017

    Altinity

  • Who am I

    Graduated Moscow State University in 1999

    Software engineer since 1997

    Developed distributed systems since 2002

    Focused on high performance analytics since 2007

    Director of Engineering in LifeStreet

    Co-founder of Altinity

  • AdTech company (ad exchange, ad server, RTB, DMP etc.) since 2006

    10,000,000,000+ events/day

    2K/event

    3 months retention (90-120 days)

    10B * 2K * [90-120] = [1.8-2.4] PB
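
The sizing arithmetic on this slide can be reproduced with a few lines of Python. This is only a sketch: it assumes "2K" means 2,000 bytes per event and that PB is the decimal unit (10**15 bytes), neither of which the slide states explicitly.

```python
# Back-of-the-envelope storage estimate from the slide.
# Assumptions (not stated in the deck): 2K = 2,000 bytes, PB = 10**15 bytes.
EVENTS_PER_DAY = 10_000_000_000
BYTES_PER_EVENT = 2_000
PB = 10**15


def retained_pb(days: int) -> float:
    """Uncompressed volume, in PB, held for the given retention window."""
    return EVENTS_PER_DAY * BYTES_PER_EVENT * days / PB


low = retained_pb(90)    # shortest retention window on the slide
high = retained_pb(120)  # longest retention window on the slide
print(f"{low:.1f}-{high:.1f} PB")
```

Under these assumptions the daily volume is 20 TB, which matches the "20TB data/day" figure quoted later in the migration timeline.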

  • Tried/used/evaluated:

    MySQL (TokuDB, ShardQuery)

    InfiniDB

    MonetDB

    InfoBright EE

    Paraccel (now RedShift)

    Oracle

    Greenplum

    Snowflake DB

    Vertica

    ClickHouse

  • Flashback: ClickHouse at 08/2016

    1-2 months in Open Source

    Internal Yandex product, no other installations

    No support, roadmap, communicated plans

    3 official devs

    A number of visible limitations (and many invisible)

    Stories of other doomed open-sourced DBs

  • Develop production system with that?

  • ClickHouse is/was missing:

    Transactions

    Constraints

    Consistency

    UPDATE/DELETE

    NULLs (added few months ago)

    Milliseconds

    Implicit type conversions

    Standard SQL support

    Partitioning by any column (date only)

    Enterprise operation tools

  • SQL developers' reaction:

  • But we tried and succeeded

  • Before you go:

    Confirm your use case

    Check benchmarks

    Run your own

    Consider limitations, not features

    Make a POC

  • Migration problem: basic things do not fit

  • Main Challenges

    Efficient schema

      Use ClickHouse bests

      Workaround limitations

    Reliable data ingestion

    Sharding and replication

    Client interfaces

  • LifeStreet Use Case

    Publisher/Advertiser performance

    Campaign/Creative performance prediction

    Realtime algorithmic bidding

    DMP

  • LifeStreet Requirements

    Load 10B rows/day, 500 dimensions/row

    Ad-hoc reports on 3 months of data

    Low data and query latency

    High Availability

  • Multi-Dimensional Analysis

    N-dimensional cube

    M-dimensional projection

    slice

    OLAP query: aggregation + filter + group by

    Range filter

    Query result

    Disclaimer: averages lie

  • Typical schema: star

    Facts

    Dimensions

    Metrics

    Projections

  • Star Schema Approach

    De-normalized: dimensions in a fact table

      Single table, simple

      Simple queries, no joins

      Data cannot be changed

      Sub-efficient storage

      Sub-efficient queries

    Normalized: dimension keys in a fact table, separate dimension tables

      Multiple tables

      More complex queries with joins

      Data in dimension tables can be changed

      Efficient storage

      More efficient queries

  • Normalized schema: traditional approach - joins

    Limited support in ClickHouse (1 level, cascade sub-selects for multiple)

    Dimension tables are not updatable

  • Dictionaries - ClickHouse dimensions approach

    Lookup service: key -> value

    Supports different external sources (files, databases etc.)

    Refreshable

  • Dictionaries. Example

    SELECT country_name, sum(imps) FROM T ANY INNER JOIN dim_geo USING (geo_key) GROUP BY country_name;

    vs

    SELECT dictGetString('dim_geo', 'country_name', geo_key) country_name, sum(imps) FROM T GROUP BY country_name;
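
The difference between the two queries can be illustrated outside ClickHouse. The sketch below is a toy model, not ClickHouse internals: the fact rows, the `dim_geo` contents and both function names are hypothetical. It contrasts building a hash table from the dimension rows for each query (join style) with probing a map that is already loaded in memory (dictionary style).

```python
from collections import defaultdict

# Hypothetical fact rows (geo_key, imps) and hypothetical dimension data.
facts = [(1, 100), (2, 50), (1, 25), (3, 10)]
dim_geo_rows = [(1, "US"), (2, "MX"), (3, "FR")]
dim_geo_dict = dict(dim_geo_rows)  # loaded once, reused by every query


def report_with_join(fact_rows, dim_rows):
    """Join style: materialise the dimension as a hash table for this query."""
    lookup = dict(dim_rows)
    totals = defaultdict(int)
    for geo_key, imps in fact_rows:
        if geo_key in lookup:  # an INNER JOIN drops unmatched rows
            totals[lookup[geo_key]] += imps
    return dict(totals)


def report_with_dictionary(fact_rows, dictionary, default=""):
    """Dictionary style: plain key -> value lookup, default for missing keys."""
    totals = defaultdict(int)
    for geo_key, imps in fact_rows:
        totals[dictionary.get(geo_key, default)] += imps
    return dict(totals)


joined = report_with_join(facts, dim_geo_rows)
looked_up = report_with_dictionary(facts, dim_geo_dict)
print(joined, looked_up)
```

When every key is present both reports agree; they differ only for unmatched keys, which the join drops and the lookup groups under the default value.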

  • Dictionaries. Configuration

    ...

    ...

    ...

    ...

    ...

  • Dictionaries. Sources

    file

    mysql table

    clickhouse table

    odbc data source

    executable script

    http service

  • Dictionaries. Layouts

    flat

    hashed

    cache

    complex_key_hashed

    range_hashed

  • Dictionaries. range_hashed

    'Effective Dated' queries

    id

    start_date

    end_date

    dictGetFloat32('srv_ad_serving_costs', 'ad_imps_cpm', toUInt64(0), event_day)
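
An effective-dated lookup of this kind can be sketched as follows. This is a toy model under stated assumptions: the rate rows, the inclusive end date and the function names are hypothetical, and the real range_hashed layout is implemented differently inside ClickHouse.

```python
from datetime import date

# Hypothetical effective-dated rows: (id, start_date, end_date, value).
RATES = [
    (0, date(2017, 1, 1), date(2017, 6, 30), 0.10),
    (0, date(2017, 7, 1), date(2017, 12, 31), 0.12),
]


def build_range_dict(rows):
    """Group the date ranges by id so that a lookup only scans one key."""
    index = {}
    for key, start, end, value in rows:
        index.setdefault(key, []).append((start, end, value))
    return index


def range_lookup(index, key, day, default=0.0):
    """Return the value whose [start, end] range contains `day` (assumed inclusive)."""
    for start, end, value in index.get(key, []):
        if start <= day <= end:
            return value
    return default


costs = build_range_dict(RATES)
print(range_lookup(costs, 0, date(2017, 3, 15)))
```

The lookup takes the same two inputs as the `dictGetFloat32` call on the slide: a key and the event date.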

  • Dictionaries. Update values

    By timer (default)

    Automatic for MySQL MyISAM

    Using invalidate_query

    Manually touching config file

    Warning: N dict * M nodes = N * M DB connections

  • Dictionaries. Restrictions

    'Normal' keys are only UInt64

    No on demand update (added in Sep 2017, 1.1.54289)

    Every cluster node has its own copy

    XML config (DDL would be better)

  • Dictionaries vs. Tables

    + No JOINs

    + Updatable

    + Always in memory for flat/hash (faster)

    - Not a part of the schema

    - Somewhat inconvenient syntax

  • Tables

    Engines

    Sharding/Distribution

    Replication

  • Engine = ?

    In memory:

      Memory

      Buffer

      Join

      Set

    On disk:

      Log, TinyLog

      MergeTree family

    Interface:

      Distributed

      Merge

      Dictionary

    Special purpose:

      View

      Materialized View

      Null

  • MergeTree

    What is 'merge'

    PK sorting and index

    Date partitioning

    Query performance

    [Diagram: Block 1 and Block 2 merged into a single Merged block, with a PK index]

    See details at: https://medium.com/@f1yegor/clickhouse-primary-keys-2cf2a45d7324
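
The two ideas on this slide, merging sorted blocks and a sparse primary-key index, can be sketched in a few lines. This is a simplified model, not the MergeTree implementation: the integer keys, the granularity value and all function names are hypothetical.

```python
import heapq
from bisect import bisect_left, bisect_right


def merge_blocks(*blocks):
    """Merge blocks that are each already sorted by key into one sorted block."""
    return list(heapq.merge(*blocks))


def build_sparse_index(block, granularity):
    """Keep the key of every `granularity`-th row: one mark per granule."""
    return [block[i] for i in range(0, len(block), granularity)]


def granules_for_range(index, lo, hi):
    """Return (first, last) granule numbers that may hold keys in [lo, hi].

    bisect_left is used for `lo` so that duplicates of a mark key sitting at
    the end of the previous granule are not skipped.
    """
    first = max(bisect_left(index, lo) - 1, 0)
    last = max(bisect_right(index, hi) - 1, 0)
    return first, last


block1 = [1, 4, 7, 10]
block2 = [2, 3, 8, 12]
merged = merge_blocks(block1, block2)
marks = build_sparse_index(merged, granularity=3)
print(merged, marks, granules_for_range(marks, 5, 8))
```

Because the index stores only one key per granule it stays small enough to keep in memory, and a range filter on the key reads a few granules instead of the whole block.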

  • MergeTree family

    Replicated + (Replacing | Collapsing | Summing | Aggregating | Graphite) + MergeTree

  • Data Load

    Multiple formats are supported, including CSV, TSV, JSONs, native binary

    Error handling

    Simple Transformations

    Load locally (better) or distributed (possible)

    Temp tables help

    Replicated tables help with de-dup

  • The power of Materialized Views

    MV is a table, i.e. engine, replication etc.

    Updated synchronously

    Summing/AggregatingMergeTree - consistent aggregation

    Alters are problematic
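
The combination of a synchronously updated view and a summing engine can be modelled as below. This is a toy sketch: `SummingView`, `combine` and the sample rows are hypothetical, and it only illustrates why the aggregate is the same before and after a background merge.

```python
from collections import defaultdict


def combine(parts):
    """Sum the metric of rows that share the same key across all parts."""
    totals = defaultdict(int)
    for part in parts:
        for key, value in part.items():
            totals[key] += value
    return dict(totals)


class SummingView:
    """Toy stand-in for a materialized view stored in a summing table."""

    def __init__(self):
        self.parts = []  # every insert adds one pre-aggregated part

    def on_insert(self, rows):
        """Runs as part of each insert into the fact table (synchronous update)."""
        part = defaultdict(int)
        for key, metric in rows:
            part[key] += metric
        self.parts.append(dict(part))

    def merge(self):
        """Background merge: collapse all parts into one."""
        self.parts = [combine(self.parts)]

    def select(self):
        """Queries still sum by key, since unmerged parts may repeat a key."""
        return combine(self.parts)


view = SummingView()
view.on_insert([("US", 3), ("MX", 1), ("US", 2)])
view.on_insert([("US", 5), ("FR", 4)])
before = view.select()
view.merge()
after = view.select()
print(before, after)
```

The query result does not depend on whether the merge has run, which is one way to read "consistent aggregation" on the slide.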

  • Data Load Diagram

    [Diagram: data flow inside a ClickHouse node. Components: log files, realtime producers, temp tables (local), buffer tables (local), fact tables (shard), two SummingMergeTree tables (shard), MySQL, dictionaries. Edge labels: INSERT, buffer flush, MV.]

  • Updates and deletes

    Dictionaries are refreshable

    Replacing and Collapsing merge trees

      'eventually' updates

      SELECT FINAL

    Partitions
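
The "eventual update" idea behind a replacing engine can be sketched as keeping only the newest row per key at read or merge time. The tuple layout (key, version, value), the sample rows and the function name are hypothetical; this is not how ClickHouse implements FINAL.

```python
def select_final(rows):
    """Return one value per key, taken from the row with the highest version.

    On a version tie the later row wins, an assumption made for this sketch.
    """
    latest = {}
    for key, version, value in rows:
        if key not in latest or version >= latest[key][0]:
            latest[key] = (version, value)
    return {key: value for key, (_, value) in latest.items()}


# An "update" is just a new row for the same key with a higher version.
rows = [
    ("campaign-1", 1, "draft"),
    ("campaign-2", 1, "live"),
    ("campaign-1", 2, "live"),
]
current = select_final(rows)
print(current)
```

Until old versions are merged away both rows exist on disk, so a plain read may see either one; deduplicating at read time trades query speed for an up-to-date view.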

  • Sharding and Replication

    Sharding and Distribution => Performance

      Fact tables and MVs distributed over multiple shards

      Dimension tables and dicts replicated at every node (local joins and filters)

    Replication => Reliability

      2-3 replicas per shard

      Cross DC

  • Distributed Query

    SELECT foo FROM distributed_table GROUP BY col1

    Server 1, 2 or 3

    SELECT foo FROM local_table GROUP BY col1

    Server 1

    SELECT foo FROM local_table GROUP BY col1

    Server 2

    SELECT foo FROM local_table GROUP BY col1

    Server 3
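
The scatter-gather pattern in this diagram can be sketched as a local aggregation per server followed by a merge on the initiating node. The sketch assumes `foo` is a sum-style aggregate, which the slide does not specify; the shard contents and function names are hypothetical.

```python
from collections import Counter

# Hypothetical local tables: each server holds its own (col1, value) rows.
shards = {
    "server1": [("a", 1), ("b", 2)],
    "server2": [("a", 3)],
    "server3": [("b", 4), ("c", 5)],
}


def local_group_by(rows):
    """What each server runs: aggregate only its own local rows."""
    partial = Counter()
    for col1, value in rows:
        partial[col1] += value
    return partial


def distributed_group_by(all_shards):
    """What the initiating server runs: merge the partial results per key."""
    result = Counter()
    for rows in all_shards.values():
        result.update(local_group_by(rows))  # Counter.update adds the counts
    return dict(result)


totals = distributed_group_by(shards)
print(totals)
```

Only the small per-key partial results cross the network, not the raw rows, which is why spreading fact tables over more shards improves query performance.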

  • Replication

    Per table topology configuration:

      Dimension tables replicate to any node

      Fact tables replicate to mirror replica

    ZooKeeper to communicate the state

      State: what blocks/parts to replicate

    Asynchronous => faster and reliable enough

    Synchronous => slower

    Isolate query to replica

    Replication queues

  • SQL

    Supports basic SQL syntax

    Non-standard JOINs implementation:

      1 level only

      ANY vs ALL

      only USING

    Aliasing everywhere

    Array and nested data types, lambda-expressions, ARRAY JOIN

    GLOBAL IN, GLOBAL JOIN

    Approximate queries

    Some analytical functions

  • Hardware and Deployment

    Load is CPU intensive => more cores

    Query is disk intensive => faster disks

      10-12 SATA RAID10

      SAS/SSD => x2 performance for x2 price for x0.5 capacity

    10 TB/server seems optimal

    ZooKeeper: keep in one DC for fast quorum

      Remote DCs work badly (e.g. East and West coast in US)

  • Main Challenges Revisited

    Design efficient schema

      Use ClickHouse bests

      Workaround limitations

    Design sharding and replication

    Reliable data ingestion

    Client interfaces

  • Migration project timelines

    August 2016: POC

    October 2016: first test runs

    December 2016: production scale data load:

      10-50B events/day, 20TB data/day

      12 x 2 servers with 12 x 4TB RAID10

    March 2017: Client API ready, starting migration

      30+ client types, 20 req/s query load

    May 2017: extension to 20 x 3 servers

    June 2017: migration completed!

      2-2.5PB uncompressed data

  • Few examples

    :) select count(*) from dw.ad8_fact_event where access_day=today()-1;

    SELECT count(*)
    FROM dw.ad8_fact_event
    WHERE access_day = (today() - 1)

    count()
    7585106796

    1 rows in set. Elapsed: 0.503 sec. Processed 12.78 billion rows, 25.57 GB (25.41 billion rows/s., 50.82 GB/s.)

  • :) select dictGetString('dim_country', 'country_code', toUInt64(country_key)) country_code, count(*) cnt from dw.ad8_fact_event where access_day=today()-1 group by country_code order by cnt desc limit 5;

    SELECT
        dictGetString('dim_country', 'country_code', toUInt64(country_key)) AS country_code,
        count(*) AS cnt
    FROM dw.ad8_fact_event
    WHERE access_day = (today() - 1)
    GROUP BY country_code
    ORDER BY cnt DESC
    LIMIT 5

    country_code  cnt
    US            2159011287
    MX            448561730
    FR            433144172
    GB            352344184
    DE            336479374

    5 rows in set. Elapsed: 2.478 sec. Processed 12.78 billion rows, 55.91 GB (5.16 billion rows/s., 22.57 GB/s.)

  • :) SELECT
        dictGetString('dim_country', 'country_code', toUInt64(country_key)) AS country_code,
        sum(cnt) AS cnt
    FROM
    (
        SELECT
            country_key,
            count(*) AS cnt
        FROM dw.ad8_fact_event
        WHERE access_day = (today() - 1)
        GROUP BY country_key
        ORDER BY cnt DESC
        LIMIT 5
    )
    GROUP BY country_code
    ORDER BY cnt DESC

    country_code  cnt
    US            2159011287
    MX            448561730
    FR            433144172
    GB            352344184
    DE            336479374

    5 rows in set. Elapsed: 1.471 sec. Processed 12.80 billion rows, 55.94 GB (8.70 billion rows/s., 38.02 GB/s.)

  • :) SELECT
        countDistinct(name) AS num_cols,
        formatReadableSize(sum(data_compressed_bytes) AS c) AS comp,
        formatReadableSize(sum(data_uncompressed_bytes) AS r) AS raw,
        c / r AS comp_ratio
    FROM lf.columns
    WHERE table = 'ad8_fact_event_shard'

    num_cols  comp        raw       comp_ratio
    308       325.98 TiB  4.71 PiB  0.06757640834769944

    1 rows in set. Elapsed: 0.289 sec. Processed 281.46 thousand rows, 33.92 MB (973.22 thousand rows/s., 117.28 MB/s.)
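
The figures printed by the console in these examples can be cross-checked with simple arithmetic. The sketch below uses only numbers that appear in the output above, and assumes TiB and PiB are the binary units (2**40 and 2**50 bytes); small differences from the reported values come from the rounding of the displayed sizes.

```python
TIB = 2**40  # assumed binary units, as the TiB/PiB suffixes suggest
PIB = 2**50

# Compression ratio from the readable sizes shown in the last example.
compressed_bytes = 325.98 * TIB
raw_bytes = 4.71 * PIB
comp_ratio = compressed_bytes / raw_bytes

# Scan throughput of the first example: rows processed / elapsed seconds.
rows_per_second = 12.78e9 / 0.503

print(f"compression ratio ~ {comp_ratio:.3f}")
print(f"throughput ~ {rows_per_second / 1e9:.1f} billion rows/s")
```

Both results agree with the reported comp_ratio and rows/s figures to within rounding.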

  • ClickHouse at fall 2017

    1+ year Open Source

    100+ prod installs worldwide

    Public changelogs, roadmap, and