Firehose Engineering

51
Firehose Engineering designing high-volume data collection systems Josh Berkus FOSS4G-NA Wash. DC 2012

description

 

Transcript of Firehose Engineering

Page 1: Firehose Engineering

Firehose Engineeringdesigninghigh-volumedata collection systems

Josh BerkusFOSS4G-NA

Wash. DC 2012

Page 2: Firehose Engineering
Page 3: Firehose Engineering

Firehose Database Applications (FDA)

(1) very high volume of data input from many automated producers

(2) continuous processing and aggregation of incoming data

Page 4: Firehose Engineering

Mozilla Socorro

Page 5: Firehose Engineering

Upwind

Page 6: Firehose Engineering

GPS Applications

Page 7: Firehose Engineering

Firehose Challenges

Page 8: Firehose Engineering

1. Volume

● 100's to 1000's facts/second● GB/hour

Page 9: Firehose Engineering

1. Volume

● spikes in volume● multiple uncoorindated sources

Page 10: Firehose Engineering

1. Volume

volume always grows over time

Page 11: Firehose Engineering

2. Constant flow

since data arrives 24/7 …

while the user interface can be down, data collection can never be down

Page 12: Firehose Engineering

ETL

2. Constant flow● can't stop

receiving to process

● data can arrive out of order

Page 13: Firehose Engineering

3. Database size

● terabytes to petabytes● lots of hardware● single-node DBMSes aren't enough● difficult backups, redundancy,

migration● analytics are resource-consumptive

Page 14: Firehose Engineering

3. Database size

● database growth● size grows quickly● need to expand storage● estimate target data size● create data ageing policies

Page 15: Firehose Engineering

3. Database size

“We will decide on a data retention policy when we run out

of disk space.”– every business user everywhere

Page 16: Firehose Engineering

4.

Page 17: Firehose Engineering

many components= many failures

Page 18: Firehose Engineering

4. Component failure

● all components fail● or need scheduled downtime● including the network

● collection must continue● collection & processing must

recover

Page 19: Firehose Engineering

solving firehose problems

Page 20: Firehose Engineering

socorro project

Page 21: Firehose Engineering

http://crash-stats.mozilla.com

Page 22: Firehose Engineering
Page 23: Firehose Engineering

Mozilla Socorro

collectors

processors

webservers

reports

Page 24: Firehose Engineering
Page 25: Firehose Engineering

socorro data volume

● 3000 crashes/minute● avg. size 150K

● 40TB accumulated raw data● 500GB accumulated metadata /

reports

Page 26: Firehose Engineering

dealing with volume

load balancers collectors

Page 27: Firehose Engineering

dealing with volume

monitor processors

Page 28: Firehose Engineering

dealing with size

data40TB

expandible

metadataviews500GBfixed size

Page 29: Firehose Engineering

dealing with component failure

● 30 Hbase nodes

● 2 PostgreSQL servers

● 6 load balancers

● 3 ES servers● 6 collectors● 12 processors● 8 middleware &

web servers

… lots of failures

Lots of hardware ...

Page 30: Firehose Engineering

load balancing & redundancy

load balancers collectors

Page 31: Firehose Engineering

elastic connections

● components queue their data● retain it if other nodes are down

● components resume work automatically

● when other nodes come back up

Page 32: Firehose Engineering

elastic connections

collector

reciever local file queue

crash mover

Page 33: Firehose Engineering

server management

● puppet● controls configuration of all servers● makes sure servers recover● allows rapid deployment of

replacement nodes

Page 34: Firehose Engineering

Upwind

Page 35: Firehose Engineering

Upwind

● speed● wind speed● heat● vibration● noise● direction

Page 36: Firehose Engineering

Upwind

1. maximize power generation

2. make sure turbine isn't damaged

Page 37: Firehose Engineering

dealing with volume

each turbine:

90 to 700 facts/second

windmills per farm: up to 100

number of farms: 40+

est. total: 300,000 facts/second

(will grow)

Page 38: Firehose Engineering

dealing with volume

localstorage

historian analyticdatabase

reports

localstorage

historian analyticdatabase

reports

Page 39: Firehose Engineering

dealing with volume

localstorage

historian analyticdatabase

localstorage

historian analyticdatabase

masterdatabase

Page 40: Firehose Engineering

multi-tenant partitioning

● partition the whole application● each customer gets their own

toolchain

● allows scaling with the number of customers

● lowers efficiency● more efficient with virtualization

Page 41: Firehose Engineering

dealing with:constant flow and size

historianminutebuffer

hourstable

daystable

monthstable

yearstable

historianminutebuffer

hourstable

daystable

monthstable

historianminutebuffer

hourstable

daystable

Page 42: Firehose Engineering

time-based rollups

● continuously accumulate levels of rollup

● each is based on the level below it● data is always appended, never

updated● small windows == small resources

Page 43: Firehose Engineering

time-based rollups

● allows:● very rapid summary reports for

different windows● retaining different summaries for

different levels of time● batch/out-of-order processing● summarization in parallel

Page 44: Firehose Engineering

firehose tips

Page 45: Firehose Engineering

data collection must be:

● continuous● parallel● fault-tolerant

Page 46: Firehose Engineering

data processing must be:

● continuous● parallel● fault-tolerant

Page 47: Firehose Engineering

every component must be able to fail

● including the network● without too much data loss● other components must

continue

Page 48: Firehose Engineering

5 tools to use

1. queueing software

2. buffering techniques

3. materialized views

4. configuration management

5. comprehensive monitoring

Page 49: Firehose Engineering

4 don'ts

1. use cutting-edge technology

2. use untested hardware

3. run components to capacity

4. do hot patching

Page 50: Firehose Engineering

master your own

Page 51: Firehose Engineering

Contact● Josh Berkus: [email protected]

● blog: blogs.ittoolbox.com/database/soup

● PostgreSQL: www.postgresql.org● pgexperts: www.pgexperts.com

● Postgres BOF 3:30PM Today● pgCon, Ottawa, May 2012● Postgres Open, Chicago, Sept. 2012

The text and diagrams in this talk is copyright 2011 Josh Berkus and is licensed under the creative commons attribution license. Title slide image and other fireman images are licensed from iStockPhoto and may not be reproduced or redistributed separate from this presentation. Socorro images are copyright 2011 Mozilla Inc.