Spotting Hadoop in the wild


Description: Practical Hadoop use cases from Last.fm and Massive Media

Transcript of Spotting Hadoop in the wild

Page 1: Spotting Hadoop in the wild

Spotting Hadoop in the wild
Practical use cases from Last.fm and Massive Media

@klbostee


Page 2: Spotting Hadoop in the wild

• “Data scientist is a job title for an employee who analyses data, particularly large amounts of it, to help a business gain a competitive edge” —WhatIs.com

• “Someone who can obtain, scrub, explore, model and interpret data, blending hacking, statistics and machine learning” —Hilary Mason, bit.ly


Pages 3-4: Spotting Hadoop in the wild

• 2007: Started using Hadoop as PhD student

• 2009: Data & Scalability Engineer at Last.fm

• 2011: Data Scientist at Massive Media

• Created Dumbo, a Python API for Hadoop

• Contributed some code to Hadoop itself

• Organized several HUGUK meetups


Page 5: Spotting Hadoop in the wild

What are those yellow things?


Page 6: Spotting Hadoop in the wild

Core principles

• Distributed

• Fault tolerant

• Sequential reads and writes

• Data locality


Page 7: Spotting Hadoop in the wild

Pars pro toto (the part stands for the whole)

[Diagram: the Hadoop ecosystem: Pig and Hive on top of MapReduce, HBase and ZooKeeper alongside, everything built on HDFS]

Hadoop itself is basically the kernel that provides a file system and task scheduler


Pages 8-12: Spotting Hadoop in the wild

Hadoop file system

[Diagram: three DataNodes storing the replicated blocks of File A and File B; one Hadoop block spans many Linux blocks]

No random writes!
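To make the block model concrete, here is a toy sketch in plain Python (not Hadoop code; block size, replication factor, and node names are illustrative) of how a file gets chopped into large blocks and each block replicated across DataNodes:

```python
# Toy model of how HDFS chops a file into blocks and replicates them.
# Real placement is rack-aware and decided by the NameNode; this is a sketch.

BLOCK_SIZE = 64 * 1024 * 1024   # 64 MB, the default HDFS block size circa 2012
REPLICATION = 3
DATANODES = ["datanode1", "datanode2", "datanode3", "datanode4"]

def place_blocks(file_size):
    """Yield (block_index, replica_nodes) for a file of the given size."""
    n_blocks = (file_size + BLOCK_SIZE - 1) // BLOCK_SIZE  # ceiling division
    for i in range(n_blocks):
        # Naive round-robin placement, just to show the replication idea.
        replicas = [DATANODES[(i + r) % len(DATANODES)] for r in range(REPLICATION)]
        yield i, replicas

for block, nodes in place_blocks(200 * 1024 * 1024):  # a 200 MB file -> 4 blocks
    print("block %d -> %s" % (block, ", ".join(nodes)))
```

Since blocks are only ever written once and appended sequentially, losing a node just means re-replicating its blocks from the surviving copies.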


Pages 13-15: Spotting Hadoop in the wild

Hadoop task scheduler

[Diagram: a TaskTracker runs alongside each DataNode; the tasks of Job A and Job B are scheduled on the nodes that hold their input blocks]
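Since the deck mentions Dumbo, the author's Python API for Hadoop, a word count in that style makes the map/reduce split concrete. This is the canonical Dumbo example; paths and launch options depend on your setup:

```python
# wordcount.py -- canonical Dumbo word count. Hadoop runs the mapper on
# TaskTrackers local to the input blocks (data locality) and the reducer
# on the grouped, sorted intermediate keys.

def mapper(key, value):
    for word in value.split():
        yield word, 1

def reducer(key, values):
    yield key, sum(values)

if __name__ == "__main__":
    import dumbo
    dumbo.run(mapper, reducer, combiner=reducer)
```

A job like this is launched with something like `dumbo start wordcount.py -input ... -output ...`; the scheduler then tries to run each map task on a node that already holds its input block.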


Page 16: Spotting Hadoop in the wild

Some practical tips

• Install a distribution

• Use compression

• Consider increasing your block size

• Watch out for small files
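Several of these tips map to ordinary Hadoop configuration. A sketch using the Hadoop 1.x property names of that era (values are illustrative, not recommendations):

```xml
<!-- hdfs-site.xml: larger blocks mean fewer map tasks and less NameNode metadata -->
<property>
  <name>dfs.block.size</name>
  <value>134217728</value><!-- 128 MB instead of the 64 MB default -->
</property>

<!-- mapred-site.xml: compress intermediate map output to save disk and network I/O -->
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
```

Small files hurt for the same reason big blocks help: every file costs NameNode memory and typically spawns its own map task, so many tiny files are better packed into a few large ones.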


Page 17: Spotting Hadoop in the wild

HBase

[Diagram: the same ecosystem picture, now with HBase highlighted]

HBase is a database on top of HDFS that can easily be accessed from MapReduce


Pages 18-20: Spotting Hadoop in the wild

Data model

[Diagram: a table with sorted row keys; columns X and Y belong to column family A, columns U and V to column family B]

• Configurable number of versions per cell

• Each cell version has a timestamp

• TTL can be specified per column family
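As a concrete illustration of this model, here is a sketch using happybase, a Python client for HBase's Thrift gateway (host, table, and values are made up; versions and TTL are column-family properties fixed at table creation):

```python
import happybase  # Python client for HBase's Thrift gateway

connection = happybase.Connection("hbase-host")  # hypothetical host

# Versions and TTL are set per column family when the table is created.
connection.create_table("metrics", {
    "a": dict(max_versions=3),               # keep 3 timestamped versions per cell
    "b": dict(time_to_live=30 * 24 * 3600),  # expire cells after 30 days
})

table = connection.table("metrics")
table.put(b"row1", {b"a:x": b"42"})           # each put creates a new cell version
row = table.row(b"row1", include_timestamp=True)
print(row)  # something like {b'a:x': (b'42', 1326326400000)}
```

Reading back with `include_timestamp=True` shows that every cell version carries its own timestamp, exactly as the bullets above describe.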


Pages 21-27: Spotting Hadoop in the wild

Random becomes sequential

[Diagram: every write is appended to a commit log and inserted into a sorted in-memory memstore; when the memstore fills up, it is flushed to HDFS in one sequential write]

High write throughput!

+ efficient scans
+ free empty cells
+ no fragmentation
+ ...
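A minimal toy sketch of that write path in Python (not HBase code; the flush threshold is arbitrary) shows why every disk write ends up sequential even though the incoming keys arrive in random order:

```python
import bisect

class ToyRegionStore:
    """Toy sketch of HBase's write path: append to a commit log,
    keep KeyValues sorted in memory, flush sequentially when full."""

    def __init__(self, flush_limit=4):
        self.commit_log = []    # stands in for the sequential log file on HDFS
        self.memstore = []      # sorted list of (key, value) pairs
        self.flushed_files = [] # stands in for sorted files written to HDFS
        self.flush_limit = flush_limit

    def put(self, key, value):
        self.commit_log.append((key, value))        # sequential write #1
        bisect.insort(self.memstore, (key, value))  # stays sorted in RAM
        if len(self.memstore) >= self.flush_limit:
            self.flush()

    def flush(self):
        # One big sequential write of an already-sorted file: sequential write #2
        self.flushed_files.append(list(self.memstore))
        self.memstore = []
        self.commit_log = []  # safe to discard once the data is flushed

store = ToyRegionStore()
for k in ["banana", "apple", "cherry", "date"]:  # random insertion order
    store.put(k, "v")
print(store.flushed_files)  # one sorted file: apple, banana, cherry, date
```

Because flushed files are sorted, scans are merges over sorted runs, and empty cells simply never get written, which is where the "free empty cells" and "efficient scans" bullets come from.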


Pages 28-32: Spotting Hadoop in the wild

Horizontal scaling

[Diagram: the sorted row key space is split into regions; each RegionServer serves several regions]

• Each region has its own commit log and memstores
• Moving regions is easy since the data is all in HDFS
• Strong consistency, since each region is served by only one RegionServer at a time
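Client-side region lookup is essentially a binary search over the sorted region start keys; a toy sketch (keys and server names are invented):

```python
import bisect

# Regions cover contiguous ranges of the sorted row key space; each is
# identified here by its start key, much like the entries HBase clients
# cache from the META table.
REGION_START_KEYS = ["", "g", "n", "t"]        # region boundaries (illustrative)
REGION_SERVERS = ["rs1", "rs2", "rs1", "rs3"]  # which server hosts each region

def locate(row_key):
    """Return (region_index, server) responsible for row_key."""
    i = bisect.bisect_right(REGION_START_KEYS, row_key) - 1
    return i, REGION_SERVERS[i]

print(locate("klbostee"))  # -> (1, 'rs2'): the region starting at 'g'
```

When a region grows too big it splits at a middle key, and because the actual data lives in HDFS, reassigning the new regions to other servers is just a metadata change.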


Page 33: Spotting Hadoop in the wild

Some practical tips

• Restrict the number of regions per server

• Restrict the number of column families

• Use compression

• Increase file descriptor limits on nodes

• Use a large enough buffer when scanning
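The last tip, illustrated with happybase (hypothetical host and table): the scan buffer decides how many rows come back per round trip, so a tiny buffer makes full scans RPC-bound:

```python
import happybase

connection = happybase.Connection("hbase-host")  # hypothetical host
table = connection.table("metrics")

# batch_size controls how many rows each round trip to the server returns;
# with the default tiny buffers, large scans spend their time on RPC latency.
for key, data in table.scan(row_start=b"en||be", batch_size=1000):
    print(key, data)  # replace with real processing
```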


Page 34: Spotting Hadoop in the wild

Look, a herd of Hadoops!


Page 35: Spotting Hadoop in the wild

• “Last.fm lets you effortlessly keep a record of what you listen to from any player. Based on your taste, Last.fm recommends you more music and concerts” —Last.fm

• Over 60 billion tracks scrobbled since 2003

• Started using Hadoop in 2006, before Yahoo


Page 36: Spotting Hadoop in the wild

• “Massive Media is the social media company behind the successful digital brands Netlog.com and Twoo.com. We enable members to meet nearby people instantly” —MassiveMedia.eu

• Over 80 million users on web and mobile

• Using Hadoop for about a year now


Pages 37-39: Spotting Hadoop in the wild

Hadoop adoption

                                  Last.fm   Massive Media
1. Business intelligence             √            √
2. Testing and experimentation       √            √
3. Fraud and abuse detection         √            √
4. Product features                  √            √
5. PR and marketing                  √


Page 40: Spotting Hadoop in the wild

Business intelligence


Page 41: Spotting Hadoop in the wild

Testing and experimentation


Pages 42-43: Spotting Hadoop in the wild

Fraud and abuse detection

Page 44: Spotting Hadoop in the wild

Product features


Page 45: Spotting Hadoop in the wild

PR and marketing


Page 46: Spotting Hadoop in the wild

Let’s dive into the first use case!


Pages 47-48: Spotting Hadoop in the wild

Goals and requirements

• Timeseries graphs of 1000 or so metrics

• Segmented over about 10 dimensions

1. Scale with very large number of events

2. History for graphs must be long enough

3. Accessing the graphs must be instantaneous

4. Possibility to analyse in detail when needed


Pages 49-50: Spotting Hadoop in the wild

Attempt #1

• Log table in MySQL

• Generate graphs from this table on-the-fly

1. Large number of events    √
2. Long enough history       ✗
3. Instantaneous access      ✗
4. Analyse in detail         √


Pages 51-52: Spotting Hadoop in the wild

Attempt #2

• Counters in MySQL table

• Update counters on every event

1. Large number of events    ✗
2. Long enough history       √
3. Instantaneous access      √
4. Analyse in detail         ✗


Pages 53-54: Spotting Hadoop in the wild

Attempt #3

• Put log files in HDFS through syslog-ng

• MapReduce on logs and write to HBase

1. Large number of events    √
2. Long enough history       √
3. Instantaneous access      √
4. Analyse in detail         √


Pages 55-57: Spotting Hadoop in the wild

Architecture

[Diagram: syslog-ng → HDFS → MapReduce → HBase, with a realtime processing path and ad-hoc results alongside the batch flow]
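A sketch of the batch path in the same Dumbo style as earlier: a job that rolls raw log lines from HDFS up into the hourly counters the HBase schema stores. The log format here is invented for illustration:

```python
# Sketch of the MapReduce stage: aggregate raw events into counters
# keyed by segment and time bucket, ready to be written into HBase.

def mapper(key, line):
    # Hypothetical log line: "2012-01-12T14:05:02 en be signup"
    timestamp, language, country, metric = line.split()
    hour = timestamp[:13]  # truncate to hourly granularity
    yield (language, country, hour, metric), 1

def reducer(key, values):
    yield key, sum(values)  # one counter per segment and hour

if __name__ == "__main__":
    import dumbo
    dumbo.run(mapper, reducer, combiner=reducer)
```

The reducer output maps naturally onto the schema described next: the segment and timestamp parts become the row key, and the metric becomes the column.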


Page 58: Spotting Hadoop in the wild

HBase schema

• Separate table for each time granularity

• Global segmentations in row keys
  • <language>||<country>||...|||<timestamp>
  • * for “not specified”
  • trailing *s are omitted

• Further segmentations in column keys
  • e.g. payments_via_paypal, payments_via_sms

• Related metrics in same column family
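A small sketch of a row key builder following these rules (only two of the roughly ten segmentation dimensions are shown; the rest stay elided as in the slide):

```python
def row_key(timestamp, language="*", country="*"):
    """Build a row key like <language>||<country>|||<timestamp>,
    using * for "not specified" and dropping trailing *s."""
    parts = [language, country]
    while parts and parts[-1] == "*":  # trailing *s are omitted
        parts.pop()
    return "||".join(parts) + "|||" + timestamp

print(row_key("2012-01-12", "en", "be"))  # en||be|||2012-01-12
print(row_key("2012-01-12", "en"))        # en|||2012-01-12  (country unspecified)
print(row_key("2012-01-12"))              # |||2012-01-12    (global rollup)
```

Because row keys are sorted, all timestamps for one segment sit next to each other, so drawing a timeseries graph is a single cheap scan over a contiguous key range.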


Page 59: Spotting Hadoop in the wild

Questions?
