Using Riak for Events storage and analysis at Booking.com
Damien Krotkine



  • Damien Krotkine

    • Software Engineer at Booking.com

    • github.com/dams

    • @damsieboy

    • dkrotkine

  • KEY FIGURES

    • 600,000 hotels

    • 212 countries

    • 800,000 room nights every 24 hours

    • 43 million+ guest reviews

    • 155+ offices worldwide

    • 8,600 people

    • not a small website…

  • INTRODUCTION

  • (Diagram: the www, mobile, and API front ends, plus the back-end
    systems (web, mobile, and API servers, databases, caches, load
    balancers, the availability cluster, email, etc.), all send their
    events to the events storage. Events carry info about subsystem
    status.)

  • WHAT IS AN EVENT?

  • EVENT STRUCTURE

    • Provides info about subsystems

    • Data: a deep HashMap

    • Timestamp

    • Type + Subtype

    • The rest: specific data

    • Schema-less

  • { timestamp => 12345,
      type      => 'WEB',
      subtype   => 'app',
      dc        => 1,
      action => {
          is_normal_user => 1,
          pageview_id    => '188a362744c301c2',
          # ...
      },
      tuning => {
          the_request => 'GET /display/...',
          bytes_body  => 35,
          wallclock   => 111,
          nr_warnings => 0,
          # ...
      },
      # ...
    }

  • { type      => 'FAV',
      subtype   => 'fav',
      timestamp => 1401262979,
      dc        => 1,
      tuning => {
          flatav => {
              cluster       => '205',
              sum_latencies => 21,
              role          => 'fav',
              num_queries   => 7
          }
      }
    }

  • EVENTS FLOW PROPERTIES

    • Read-only

    • Schema-less

    • Continuous, sequential, timed

    • 15K events per second

    • 1.25 billion events per day

    • peak at 70 MB/s, min 25 MB/s

    • 100 GB per hour

  • USAGE

  • ASSESS THE NEEDS

    • Before thinking about storage

    • Think about the usage

  • USAGE

    1. GRAPHS
    2. DECISION MAKING
    3. SHORT TERM ANALYSIS
    4. A/B TESTING

  • GRAPHS

    • Graph in real time (a few seconds of lag)

    • Graph as many systems as possible

    • General platform health check


  • DASHBOARDS

  • META GRAPHS

  • USAGE

    1. GRAPHS
    2. DECISION MAKING
    3. SHORT TERM ANALYSIS
    4. A/B TESTING

  • DECISION MAKING

    • Strategic decisions (use facts)

    • Long term or short term

    • Technical and non-technical reporting

  • USAGE

    1. GRAPHS
    2. DECISION MAKING
    3. SHORT TERM ANALYSIS
    4. A/B TESTING

  • SHORT TERM ANALYSIS

    • From 10 seconds ago to 8 days ago

    • Code deployment checks and rollback

    • Anomaly Detector

  • USAGE

    1. GRAPHS
    2. DECISION MAKING
    3. SHORT TERM ANALYSIS
    4. A/B TESTING

  • A/B TESTING

    • Our core philosophy: use facts

    • It means: do A/B testing

    • Concept of Experiments

    • Events provide data to compare

  • EVENT AGGREGATION

  • EVENT AGGREGATION

    • Group events

    • Granularity we need: one second (see the sketch below)
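
    A minimal sketch of what per-second grouping can look like (the helper
    names are hypothetical; the talk does not show this code):

    use strict;
    use warnings;

    my %buckets;    # epoch second => arrayref of events

    # Each event is a hashref with a 'timestamp' field (see the examples above).
    sub add_event {
        my ($event) = @_;
        push @{ $buckets{ int $event->{timestamp} } }, $event;
    }

    # When a second is complete, hand its bucket over for serialization,
    # compression, and storage.
    sub take_epoch {
        my ($epoch) = @_;
        return delete $buckets{$epoch};
    }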

  • SERIALIZATION

    • JSON didn’t work for us (slow, big, lacking features)

    • Created Sereal in 2012

    • "Sereal, a new, binary data serialization format that provides high-performance, schema-less serialization"

    • https://github.com/Sereal/Sereal

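
    A minimal Sereal round-trip in Perl (a sketch; compression options are
    left at their defaults here):

    use strict;
    use warnings;
    use Sereal::Encoder;
    use Sereal::Decoder;

    my $event = {
        timestamp => 1401262979,
        type      => 'FAV',
        subtype   => 'fav',
    };

    # Encode the schema-less event into a compact binary blob ...
    my $blob = Sereal::Encoder->new->encode($event);

    # ... and get the same structure back on the reading side.
    my $copy = Sereal::Decoder->new->decode($blob);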

  • (Diagram sequence: events stream in from the web, API, and database
    servers; a LOGGER collects them, groups them into 1-second buckets,
    reserializes and compresses each bucket, and pushes it to the events
    storage. Many LOGGERs feed the events storage in parallel.)

  • STORAGE

  • WHAT WE WANT

    • Storage security

    • Mass write performance

    • Mass read performance

    • Easy administration

    • Very scalable

  • WE CHOSE RIAK

    • Security: cluster, distributed, very robust

    • Good and predictable read / write performance

    • The easiest to set up and administer

    • Advanced features (MapReduce, triggers, 2i, CRDTs …)

    • Riak Search

    • Multi Datacenter Replication

  • CLUSTER

    • Commodity hardware

    • All nodes serve data

    • Data replication

    • Gossip between nodes

    • No master

    • Distributed system

    (Diagram: a ring of servers; hash(key) determines where data lives.)

  • KEY VALUE STORE

    • Namespaces: buckets

    • Values: opaque or CRDTs

  • RIAK: ADVANCED FEATURES

    • MapReduce

    • Secondary indexes (2i)

    • Riak Search

    • Multi DataCenter Replication

  • MULTI-BACKEND FOR STORAGE

    • Bitcask

    • Eleveldb

    • Memory

  • BACKEND: BITCASK

    • Log-based storage backend

    • Append-only files (AOF)

    • Advanced expiration

    • Predictable performance (1 disk-seek max)

    • Perfect for sequential data

  • CLUSTER CONFIGURATION

  • DISK SPACE NEEDED

    • 8 days

    • 100 GB per hour

    • Replication 3

    • 100 GB × 24 h × 8 days × 3 replicas = 57.6 TB

    • Need ~60 TB

  • HARDWARE

    • 12 then 16 nodes (soon 24)

    • 12 CPU cores (Xeon 2.5 GHz)

    • 192 GB RAM

    • 1 Gbit/s network

    • 8 TB (RAID 6) per node

    • Cluster total space: 128 TB

  • DATA DESIGN

  • (Diagram: each second of events flows from the web, API, and database
    servers into the events storage as blobs: 1 blob per
    EPOCH / DC / CELL / TYPE / SUBTYPE, in chunks of 500 KB max.)

  • DATA

    • Bucket name: “data“

    • Key: “12345:1:cell0:WEB:app:chunk0“

    • Value: List of events (Hashmaps), serialized & compressed

    • 200 keys per second

  • METADATA

    • Bucket name: “metadata“

    • Key: “1428415043-2“ (EPOCH-DC)

    • Value: list of data keys:

[ “1428415043:1:cell0:WEB:app:chunk0“, 
 “1428415043:1:cell0:WEB:app:chunk1“ 
… 
 “1428415043:4:cell0:EMK::chunk3“ ]

    • Stored as pipe-separated values (PSV)
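
    For illustration, composing the metadata entry for one epoch and DC
    might look like this (a sketch following the key scheme above):

    use strict;
    use warnings;

    my ($epoch, $dc) = (1428415043, 1);

    # Data keys follow EPOCH:DC:CELL:TYPE:SUBTYPE:CHUNK.
    my @data_keys = (
        "$epoch:$dc:cell0:WEB:app:chunk0",
        "$epoch:$dc:cell0:WEB:app:chunk1",
    );

    my $metadata_key   = "$epoch-$dc";          # e.g. "1428415043-1"
    my $metadata_value = join '|', @data_keys;  # pipe-separated values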

  • WRITE DATA

  • PUSH DATA IN

    • In each DC, in each cell, Loggers push to Riak

    • Use ProtoBuf

    • Every second:

    • Push data values to Riak, async

    • Wait for success

    • Push metadata

  • JAVA

    Bucket DataBucket = riakClient.fetchBucket("data").execute();
    DataBucket.store("12345:1:cell0:WEB:app:chunk0", Data1).execute();
    DataBucket.store("12345:1:cell0:WEB:app:chunk1", Data2).execute();
    DataBucket.store("12345:1:cell0:WEB:app:chunk2", Data3).execute();

    Bucket MetaDataBucket = riakClient.fetchBucket("metadata").execute();
    MetaDataBucket.store("12345-1", metaData).execute();
    riakClient.shutdown();

  • Perl

    my $client = Riak::Client->new(…);

    $client->put(data => '12345:1:cell0:WEB:app:chunk0', $data1);
    $client->put(data => '12345:1:cell0:WEB:app:chunk1', $data2);
    $client->put(data => '12345:1:cell0:WEB:app:chunk2', $data3);

    $client->put(metadata => '12345-1', $metadata, 'text/plain' );

  • READ DATA

  • READ ONE SECOND

    • For one second (a given epoch)

    • Request the metadata key for that EPOCH-DC

    • Parse value

    • Filter out unwanted types / subtypes

    • Fetch the keys from the “data” bucket

  • Perl

    my $client = Riak::Client->new(…);
    my @array = split '\|', $client->get(metadata => '1428415043-1');
    my @filtered_array = grep { /WEB/ } @array;
    $client->get(data => $_) foreach @filtered_array;

  • READ AN INTERVAL

    • For an interval epoch1 -> epoch2

    • Generate the list of epochs

    • Fetch in parallel
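
    A sketch of such an interval read, reusing the Riak::Client calls shown
    above and forking one fetcher per epoch (Parallel::ForkManager is our
    choice here, not necessarily what the talk used):

    use strict;
    use warnings;
    use Riak::Client;
    use Parallel::ForkManager;

    my ($epoch1, $epoch2, $dc) = (1428415043, 1428415052, 1);
    my $pm = Parallel::ForkManager->new(8);    # at most 8 concurrent fetchers

    for my $epoch ($epoch1 .. $epoch2) {
        $pm->start and next;    # one child process per epoch
        my $client = Riak::Client->new(…);
        my @keys = split '\|',
            ( $client->get( metadata => "$epoch-$dc" ) // '' );
        $client->get( data => $_ ) for grep { /WEB/ } @keys;
        $pm->finish;
    }
    $pm->wait_all_children;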

  • RIAK CLUSTER

  • (Monitoring graphs: CPU, disk IOPS, disk I/O %, and disk space
    reclaimed over one day.)

  • REAL TIME PROCESSING OUTSIDE OF RIAK

  • STREAMING

    • Fetch 1 second every second

    • Or a range (e.g. the last 10 min)

    • Fetch all epochs from Riak

    • Use data on the client side
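
    The fetch-every-second loop can be as simple as this sketch
    (process_epoch is a hypothetical callback doing the reads shown earlier):

    use strict;
    use warnings;

    my $lag = 5;    # stay a few seconds behind real time

    sub process_epoch {
        my ($epoch) = @_;
        # fetch metadata + data for $epoch, as in READ ONE SECOND
    }

    while (1) {
        process_epoch( time() - $lag );
        sleep 1;
    }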

  • EXAMPLES

    • Streaming => Graphite ( every sec )

    • Streaming => Anomaly Detector ( last 2 min )

    • Streaming => Experiment analysis ( last day )

    • Every minute => Hadoop

    • Manual request => test, debug, investigate

    • Batch fetch => ad hoc analysis

    • => Huge numbers of read requests

  • (Diagram: the events storage feeds the graphite cluster, the Anomaly
    Detector, the experiment cluster, the hadoop cluster, mysql analysis,
    and manual requests, each consumer pulling around 50 MB/s.)

  • THIS IS REALTIME

    • 1 second of data

    • Stored in < 1 sec

    • Available after < 1 sec

    • Issue: network saturation

  • REAL TIME PROCESSING INSIDE RIAK

  • THE IDEA

    • Instead of:

      • fetch data

      • crunch data (e.g. average)

      • produce a small result

    • Do:

      • bring the code to the data

      • crunch the data on Riak

      • fetch the result

  • WHAT TAKES TIME

    • Takes a lot of time:

      • fetching data out: network issue

      • decompressing: CPU time issue

    • Takes almost no time:

      • crunching data

  • MAPREDUCE

    • Input: epoch-dc

    • Map1: metadata keys => data keys

    • Map2: data crunching

    • Reduce: aggregate

    • Realtime: OK

    • network usage: OK

    • CPU time: NOT OK
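
    For reference, a MapReduce job can be submitted over Riak's HTTP
    /mapred endpoint. The sketch below is illustrative only: the
    JavaScript phase stands in for the real metadata-to-data-keys map and
    the crunching phases, and the host name is hypothetical.

    use strict;
    use warnings;
    use HTTP::Tiny;
    use JSON::PP qw(encode_json);

    my $job = {
        inputs => [ [ 'metadata', '1428415043-1' ] ],
        query  => [
            # Map phase: placeholder for "metadata keys => data keys" + crunching.
            { map => { language => 'javascript',
                       source   => 'function(v) { return [v.values[0].data]; }' } },
            # A reduce phase would aggregate the per-key results here.
        ],
    };

    my $res = HTTP::Tiny->new->post(
        'http://riak-node:8098/mapred',
        { headers => { 'Content-Type' => 'application/json' },
          content => encode_json($job) },
    );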

  • HOOKS

    • Every time metadata is written

    • Post-Commit hook triggered

    • Crunch data on the nodes

  • Riak post-commit hook

    (Diagram: on each node host, the Riak service sends the metadata key
    over a socket to a local REST service; the REST service decompresses
    the data, processes all tasks, and sends the result away for storage.)

  • HOOK CODE

    metadata_stored_hook(RiakObject) ->
        Key = riak_object:key(RiakObject),
        Bucket = riak_object:bucket(RiakObject),
        %% the metadata key is "EPOCH-DC"
        [ Epoch, DC ] = binary:split(Key, <<"-">>),
        MetaData = riak_object:get_value(RiakObject),
        %% the value is a pipe-separated list of data keys
        DataKeys = binary:split(MetaData, <<"|">>, [ global ]),
        %% Hostname: the local node's host name (resolution omitted on the slide)
        send_to_REST(Epoch, Hostname, DataKeys),
        ok.

  • send_to_REST(Epoch, Hostname, DataKeys) ->
        Method = post,
        URL = "http://" ++ binary_to_list(Hostname)
              ++ ":5000?epoch=" ++ binary_to_list(Epoch),
        HTTPOptions = [ { timeout, 4000 } ],
        Options = [ { body_format, string },
                    { sync, false },
                    { receiver, fun(_ReplyInfo) -> ok end } ],
        Body = iolist_to_binary(mochijson2:encode(DataKeys)),
        httpc:request(Method, {URL, [], "application/json", Body},
                      HTTPOptions, Options),
        ok.

  • REST SERVICE

    • In Perl, using PSGI (WSGI-like), Starman, preforks

    • Allows writing data crunchers in Perl

    • Also supports loading code on demand
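
    A minimal PSGI sketch of such a companion service (the endpoint and
    parameter names are assumptions based on the hook code above):

    use strict;
    use warnings;
    use Plack::Request;
    use JSON::PP qw(decode_json);

    my $app = sub {
        my $req   = Plack::Request->new(shift);
        my $epoch = $req->parameters->{epoch};

        # The hook POSTs the data keys as a JSON array (see send_to_REST).
        my $data_keys = decode_json( $req->content || '[]' );

        # ... fetch those keys locally, decompress once, run all registered
        # crunching tasks, then store the small results ...

        [ 200, [ 'Content-Type' => 'text/plain' ], [ "ok\n" ] ];
    };

    $app;    # run under Starman with: plackup -s Starman app.psgi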

  • ADVANTAGES

    • CPU usage and execution time can be capped

    • Data is local to processing

    • The two systems are decoupled

    • REST service written in any language

    • Data processing done all at once

    • Data is decompressed only once

  • DISADVANTAGES

    • Only for incoming data (streaming), not old data

    • Can’t easily use cross-second data

    • What if the companion service goes down?

  • FUTURE

    • Use this companion to generate optional small values

    • Use Riak Search to index and search those

  • THE BANDWIDTH PROBLEM

  • PUT - bad case

    • n_val = 3

    • inside usage = 3 x outside usage

  • PUT - good case

    • n_val = 3

    • inside usage = 2 x outside usage

  • GET - bad case

    • inside usage = 3 x outside usage

  • GET - good case

    • inside usage = 2 x outside usage

  • Network usage (PUT and GET):

    • 3 x 13/16 + 2 x 3/16 = 2.81
      (with 16 nodes, the coordinating node is one of the key's 3 primary
      nodes only 3/16 of the time)

    • plus gossip

    • inside network > 3 x outside network

  • Usually it’s not a problem

    • But in our case: big values, constant PUTs, lots of GETs

    • Sadly, only 1 Gbit/s

    • => network bandwidth issue

  • THE BANDWIDTH SOLUTIONS

  • THE BANDWIDTH SOLUTIONS

    1. Optimize GET for network usage, not speed
    2. Don’t choose a node at random

  • GET - bad case

    • n_val = 1

    • inside usage = 1 x outside usage

  • GET - good case

    • n_val = 1

    • inside usage = 0 x outside usage

  • WARNING

    • Possible only because data is read-only

    • Data has internal checksum

    • No conflict possible

    • Corruption detected

  • RESULT

    • Practical network usage cut in half!

  • THE BANDWIDTH SOLUTIONS

    1. Optimize GET for network usage, not speed
    2. Don’t choose a node at random

  • bucket = “metadata”

    • key = “12345”

    Hash = hashFunction(bucket + key)

    RingStatus = getRingStatus()

    PrimaryNodes = Fun(Hash, RingStatus)

  • (Diagram: the client re-implements hashFunction() and getRingStatus()
    so it can talk directly to the right node.)

  • THE IDEA

    • Do the hashing on the client

    • By default:

      “chash_keyfun”:{"mod":"riak_core_util",
                      "fun":"chash_std_keyfun"},

    • Look at the source on GitHub: riak_core/src/chash.erl

  • THE HASH FUNCTION

    • Easy to re-implement client-side

    • sha1 as a BigInt

    • example: sha1(bucket + key) = 2450
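
    The sha1-as-BigInt part is easy in Perl. Note that riak_core actually
    hashes the Erlang term encoding of {Bucket, Key}, so a faithful client
    must reproduce that encoding; the concatenation below is a
    simplification:

    use strict;
    use warnings;
    use Digest::SHA qw(sha1_hex);
    use Math::BigInt;

    my ($bucket, $key) = ('metadata', '12345');

    # Simplified: riak_core hashes term_to_binary({Bucket, Key}),
    # not a plain concatenation.
    my $hash = Math::BigInt->from_hex( sha1_hex( $bucket . $key ) );
    print $hash->bstr, "\n";    # a 160-bit integer: the position on the ring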

  • RING STATUS

    • If hash = 2450 => nodes 16, 17, 18

    $ curl -s -XGET http://$host:8098/riak_preflists/myring
    { "0"                                                : "riak@node1",
      "5708990770823839524233143877797980545530986496"   : "riak@node2",
      "11417981541647679048466287755595961091061972992"  : "riak@node3",
      "17126972312471518572699431633393941636592959488"  : "riak@node4",
      …etc… }
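
    Given the ring above (partition index => node), the owner of a hash is
    the first partition boundary strictly greater than it, wrapping around
    past the last partition; a sketch:

    use strict;
    use warnings;
    use Math::BigInt;

    # $ring maps partition index (as a string) to node name, as returned
    # by the endpoint above.
    sub owner_node {
        my ($hash, $ring) = @_;
        my @indexes = sort { Math::BigInt->new($a)->bcmp( Math::BigInt->new($b) ) }
                      keys %$ring;
        for my $idx (@indexes) {
            return $ring->{$idx} if $hash->bcmp( Math::BigInt->new($idx) ) < 0;
        }
        return $ring->{ $indexes[0] };    # wrapped around the ring
    }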

  • WARNING

    • Possible only if:

      • the node list is monitored

      • in case of a failed node, we default to random

      • data is requested in a uniform way

  • RESULT

    • Network usage reduced even further!

    • Especially for GETs

  • CONCLUSION

  • CONCLUSION

    • We used only Riak Open Source

    • No training, self-taught, small team

    • Riak is a great solution

    • Robust, fast, scalable, easy

    • Very flexible and hackable

    • Helps us continue scaling

  • Q&A: @damsieboy