Data Infrastructure for a World of Music

Lars Albertsson, Data Engineer @Spotify

Focus on challenges & needs

Description

The millions of people that use Spotify each day generate a lot of data, roughly a few terabytes per day. What does it take to handle datasets of that scale, and what can be done with them? I will briefly cover how Spotify uses data to provide a better music listening experience and to strengthen its business. Most of the talk will be spent on our data processing architecture, and how we leverage state-of-the-art data processing and storage tools such as Hadoop, Cassandra, Kafka, Storm, Hive, and Crunch. Last, I'll present observations and thoughts on innovation in the data processing (aka Big Data) field.

Transcript of Data Infrastructure for a World of Music

Page 1: Data Infrastructure for a World of Music

Lars Albertsson, Data Engineer @Spotify

Focus on challenges & needs

Data infrastructure for a world of music

Page 2: Data Infrastructure for a World of Music

1. Clients generate data
2. ???
3. Make profit

Users create data

Page 3: Data Infrastructure for a World of Music

Why data?

Reporting to partners, from day 1: record labels, ad buyers, marketing

Analytics: KPIs, ads, business insights (growth, retention, funnels)

Features: recommendations, search, top lists, notifications

Product development: A/B testing

Operations: root cause analysis, latency, planning

Customer support

Legal

Data purpose

Page 4: Data Infrastructure for a World of Music

Different needs: speed vs quality

Reporting to partners, from day 1: record labels, ad buyers, marketing (daily + monthly)

Analytics: KPIs, ads, business insights (growth, retention, funnels)

Features: recommendations, search, top lists, notifications

Product development: A/B testing

Operations: root cause analysis, latency, planning

Customer support

Legal

Data purpose

Page 5: Data Infrastructure for a World of Music

Most user actions: played songs, playlist modifications, web navigation, UI navigation

Service state changes: user, notifications

Incoming: content, social integration

Data purpose

What data?

Page 6: Data Infrastructure for a World of Music

26M monthly active users
6M subscribers
55 markets
20M songs, 20K new / day
1.5B playlists
4 data centres
10 TB from users / day
400 GB from services / day
61 TB generated in Hadoop / day
600 Hadoop nodes
6500 MapReduce jobs / day
18 PB in HDFS

Data purpose

Much data?

Page 7: Data Infrastructure for a World of Music

Data purpose

Data is true

Page 8: Data Infrastructure for a World of Music

Data purpose

Data is true

Page 9: Data Infrastructure for a World of Music

Get raw data
Refine
Make it useful

Data infrastructure

Page 10: Data Infrastructure for a World of Music

2008:

> for h in all_hosts; do rsync ${h}:/var/log/syslog /incoming/$h/$date; done
> echo '0 * * * * run_all_hourly_jobs.sh' | crontab

Dump to Postgres, make graph

Still living with some of this…

Data infrastructure

It all started very basic

Page 11: Data Infrastructure for a World of Music

Data infrastructure

Collect, crunch, use/display

[Architecture diagram: Gateway and Playlist service send logs onto the Kafka message bus (cross-site via Kafka@lon) into HDFS; service DBs are dumped to HDFS; MapReduce jobs feed SQL for reports and Cassandra for recommendations.]

Page 12: Data Infrastructure for a World of Music

Data infrastructure

Fault scenarios

[Same architecture diagram, annotated with fault scenarios at each hop: service hosts, logs, the Kafka message bus, Kafka@lon, HDFS, MapReduce, and the serving stores.]

Page 13: Data Infrastructure for a World of Music

Most datasets are produced daily

Consumers want data after morning coffee

For each line, bottom level represents a good day

Destabilisation is the norm

Delay factors all over the infrastructure - client to display

Producers are not stakeholders

Data infrastructure

Shit happens

Page 14: Data Infrastructure for a World of Music

Get raw data from: clients through GWs, GWs, service logs, service databases

To HDFS

Data collection

Page 15: Data Infrastructure for a World of Music

Data collection

Data collection

[Collection diagram: Gateway and Playlist service logs flow over the Kafka message bus (via Kafka@lon) into HDFS; service DBs are dumped to HDFS.]

Sources of truth

MapReduce?

Need to wait for “all” data for a time slot (hour)

What is all? Can we get all?

Most consumers want 9x% quickly.

Reruns are complex.
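In practice, the "what is all?" question often reduces to a threshold decision. A minimal sketch, with hypothetical names and a made-up threshold, of deciding whether an hourly bucket is complete enough to crunch:

```python
def bucket_ready(expected_hosts, delivered_hosts, threshold=0.98):
    """Return True when enough of the expected hosts have delivered
    their logs for an hourly bucket to be considered complete."""
    if not expected_hosts:
        return True
    delivered = len(set(expected_hosts) & set(delivered_hosts))
    return delivered / len(expected_hosts) >= threshold

# 99 of 100 hosts reporting clears a 98% threshold.
print(bucket_ready(range(100), range(99)))  # True
```

Consumers who want "9x% quickly" get a low threshold; reruns can later rebuild the bucket at a stricter one.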

Page 16: Data Infrastructure for a World of Music

1. Rsync from hosts. Get list from hosts DB.
- Rsync fragile, frequent network issues.
- DB info often stale.
- Often waiting for a dead host or omitting a host.

2. Push logs over Kafka. Wait for hosts according to hosts DB.
+ Kafka better. Application-level cross-site routing.
- Kafka unreliable by design. Implement end-to-end acking.

3. Use Kafka as in #2. Determine active hosts by snooping metrics.
+ Reliable? host metric.
- End-to-end stability and host enumeration not scalable.
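The end-to-end acking mentioned in step 2 can be pictured as a thin bookkeeping layer on the producer side (a toy sketch, not Spotify's implementation): keep every message until the consumer confirms it landed, and retransmit the rest.

```python
class AckingProducer:
    """Toy end-to-end acking over an unreliable, best-effort channel:
    messages stay pending until acknowledged, and unacknowledged
    messages can be retransmitted."""

    def __init__(self, send):
        self.send = send      # send(msg_id, payload): best-effort delivery
        self.pending = {}     # msg_id -> payload, awaiting ack
        self.next_id = 0

    def publish(self, payload):
        msg_id = self.next_id
        self.next_id += 1
        self.pending[msg_id] = payload
        self.send(msg_id, payload)
        return msg_id

    def ack(self, msg_id):
        # Consumer confirmed the message landed; stop tracking it.
        self.pending.pop(msg_id, None)

    def retransmit(self):
        # Resend everything not yet acknowledged end to end.
        for msg_id, payload in self.pending.items():
            self.send(msg_id, payload)

delivered = []
producer = AckingProducer(lambda i, m: delivered.append((i, m)))
first = producer.publish('play:track1')
producer.publish('play:track2')
producer.ack(first)
producer.retransmit()   # only the unacked message goes out again
```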

Data collection

Log collection evolution

Page 17: Data Infrastructure for a World of Music

Single solution cannot fit all needs. Choose reliability or low latency.

Reliable path with store and forward:
Service hosts must not store state.
Synchronous handoff to HA Kafka with large replay buffer.

Best effort path similar:
No acks, asynchronous handoff.

Message producers know appropriate semantics.
For critical data: handoff failure -> stop serving users.

Measuring loss is essential
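Measuring loss can be as simple as comparing per-host produced counters against what actually landed, for the same time slot. A sketch with made-up numbers:

```python
def loss_fraction(produced, landed):
    """Estimate message loss for one time slot. Both arguments map
    host -> message count: what each host claims to have produced
    vs. what arrived at the other end (e.g. in HDFS)."""
    total_produced = sum(produced.values())
    total_landed = sum(landed.get(h, 0) for h in produced)
    if total_produced == 0:
        return 0.0
    return 1.0 - total_landed / total_produced

# 990 of 1000 messages arrived: 1% loss.
loss = loss_fraction({'gw1': 600, 'gw2': 400}, {'gw1': 594, 'gw2': 396})
```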

Data collection

Log collection future

Page 18: Data Infrastructure for a World of Music

~1% loss is ok, assuming that it is measured.
Few % time slippage is ok, if unbiased.
Biased slippage is not ok.
Timestamp to use for bucketing: client, GW, HDFS?
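One possible answer to the bucketing question, as a sketch (the field names are invented): prefer the client timestamp, fall back to gateway and then HDFS arrival time, and truncate to the hour. Client clocks on devices are often wrong or missing, which is exactly why a fallback chain is needed.

```python
from datetime import datetime

def bucket_hour(event):
    """Pick a timestamp for hourly bucketing: client clock first,
    then gateway, then HDFS arrival time as a last resort."""
    ts = event.get('client_ts') or event.get('gw_ts') or event['hdfs_ts']
    return ts.replace(minute=0, second=0, microsecond=0)

e = {'client_ts': None,
     'gw_ts': datetime(2014, 3, 1, 13, 37, 5),
     'hdfs_ts': datetime(2014, 3, 1, 14, 2, 0)}
print(bucket_hour(e))  # 2014-03-01 13:00:00
```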

Some components are HA (Cassandra, ZooKeeper). Most are unreliable. Client devices are very unreliable.

Buffers in “stateless” components cause loss.

Crunching delay is inconvenient. Crunching wrong data is expensive.

Data crunching

Data is false?

Page 19: Data Infrastructure for a World of Music

Core databases dumped daily (user x 2, playlist, metadata).
Determinism required - delays inevitable.
Slave replication issues common.
No good solution:

Sqoop live - non-deterministic
Postgres commit log replay - not scalable
Cassandra full dumps - resource heavy

Solution - convert to event processing?
Experimenting with Netflix Aegisthus for Cassandra -> HDFS.
Facebook has MySQL commit log -> event conversion.
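The "convert to event processing" idea, in the spirit of Aegisthus and the Facebook approach, turns each commit-log mutation into a consumable event instead of re-reading full dumps. A sketch with an invented record layout:

```python
def commitlog_to_events(entries):
    """Turn database commit-log entries (hypothetical layout) into an
    event stream: each mutation becomes one event that downstream
    jobs can consume incrementally."""
    for entry in entries:
        yield {
            'type': '%s_%s' % (entry['table'], entry['op']),
            'key': entry['key'],
            'data': entry.get('row'),
            'ts': entry['ts'],
        }

events = list(commitlog_to_events([
    {'table': 'playlist', 'op': 'update', 'key': 'p1',
     'row': {'name': 'Morning'}, 'ts': 1400000000},
]))
print(events[0]['type'])  # playlist_update
```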

Data collection

Database dumping

Page 20: Data Infrastructure for a World of Music

We have raw data, sorted by host and hour

We want e.g. active users by country and product over the last month

Data crunching

Page 21: Data Infrastructure for a World of Music

Data crunching

End goal example - business insights

Page 22: Data Infrastructure for a World of Music

1. Split by message type, per hour.
2. Combine multiple sources for similar data, per day - a core dataset.
3. Join activity datasets, e.g. tracks played or user activity, with ornament datasets, e.g. track metadata, user demographics.
4a. Make reports for partners, e.g. labels, advertisers.
4b. Aggregate into SQL or add metadata for Hive exploration.
4c. Build indexes (search, top lists), denormalise, and put in Cassandra.
4d. Run machine learning (recommendations) and put in Cassandra.
4e. Make notification decisions and send out.
...
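Steps 1 and 3 above can be sketched in plain Python (the real jobs run as MapReduce; names and record shapes here are invented):

```python
def split_by_type(raw):
    """Step 1: split raw messages into per-type datasets."""
    buckets = {}
    for msg in raw:
        buckets.setdefault(msg['type'], []).append(msg)
    return buckets

def join_with_meta(plays, meta):
    """Step 3: join an activity dataset with an ornament dataset,
    keyed by track_id."""
    return [dict(p, **meta.get(p['track_id'], {})) for p in plays]

raw = [{'type': 'EndSong', 'track_id': 't1'},
       {'type': 'Nav', 'page': 'home'}]
meta = {'t1': {'title': 'Song A'}}
by_type = split_by_type(raw)
joined = join_with_meta(by_type['EndSong'], meta)
```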

Data crunching

Typical data crunching


Page 23: Data Infrastructure for a World of Music

Data crunching

Core dataset example: users

Page 24: Data Infrastructure for a World of Music

Generate - organic
Transfer - Kafka
Process - Python MapReduce. Bad idea. Big data ecosystem is 99% JVM -> moving to Crunch.
Test - in production. Not acceptable. Working on it. No available tools.
Deploy - CI + Debian packages. Low isolation. Looking at containers (Docker).
Monitor - organic

Cycle time for code-test-debug: 21 days

Data crunching

Data processing platform

Page 25: Data Infrastructure for a World of Music

Online storage: Cassandra, Postgres
Offline storage: HDFS
Transfer: Kafka, Sqoop
Processing engine: Hadoop MapReduce in YARN
Processing languages: Luigi Python MapReduce, Crunch, Pig
Mining: Hive, Postgres, Qlikview
Real-time processing: Storm (mostly experimental)

Trying out:
Spark - better for iterative algorithms (ML), future of MapReduce?
Giraph and other graph tools
More stable infrastructure: Docker, Azkaban

Data crunching

Technology stack

Page 26: Data Infrastructure for a World of Music

def mapper(self, items):
    for item in items:
        if item.type == 'EndSong':
            yield (item.track_id, 1, item)
        else:  # Track metadata
            yield (item.track_id, 0, item)

def reducer(self, key, values):
    # Metadata sorts first (flag 0), so meta is set before the
    # EndSong items for the same track arrive.
    meta = None
    for item in values:
        if item.type != 'EndSong':
            meta = item
        else:
            yield add_meta(meta, item)

Data crunching

Crunching tools - four joins

select * from tracks inner join metadata on tracks.track_id = metadata.track_id;

join tracks by track_id, metadata by track_id;

PTable<String, Pair<EndSong, TrackMeta>> joined = Join.innerJoin(endSongTable, metaTable);

Vanilla MapReduce - fragile
SQL / Hive - exploration & display

Pig - deprecated

Crunch - future for processing pipelines

Page 27: Data Infrastructure for a World of Music

Lots of opportunities in PBs of data. Opportunities to get lost.

Organising data

Page 28: Data Infrastructure for a World of Music

Mostly organic - frequent discrepancies.

Agile feature dev -> easy schema change. Currently requires a client lib release.

Avro meta format in backend: good Hadoop integration, not the best option in clients.

Some clients are hard to upgrade, e.g. old phones, hifi, cars.

Utopia (aka Google): client schema change -> automatic Hive/SQL/dashboard/report change.
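For reference, an Avro record schema is plain JSON, which the writer embeds alongside the data. A hypothetical EndSong schema (the fields are invented for illustration), where the optional field with a default shows how Avro leaves room for schema change:

```python
import json

# Hypothetical Avro schema for an EndSong message. Avro files carry
# their schema with the data, which is what makes the Hadoop
# integration convenient.
end_song_schema = {
    "type": "record",
    "name": "EndSong",
    "fields": [
        {"name": "track_id", "type": "string"},
        {"name": "user_id", "type": "string"},
        {"name": "ms_played", "type": "long"},
        # Optional field with a default: readers of older data keep working.
        {"name": "country", "type": ["null", "string"], "default": None},
    ],
}

schema_json = json.dumps(end_song_schema)  # what you'd hand to an Avro writer
```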

Data crunching

Schemas

Page 29: Data Infrastructure for a World of Music

Today:

if date < datetime(2012, 10, 17):
    ...  # Use old format
else:
    ...

Not scalable.
Few tools available. HCatalog?

Solution(?): Encapsulate each dataset in library. Owners decide compatibility vs reformat strategy. Version the interface. (Twitter)
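The encapsulation idea in code — a hypothetical dataset library where the owning team hides the 2012-10-17 format change behind one interface, so consumers never write their own date-based branching (formats here are invented):

```python
from datetime import datetime

CUTOVER = datetime(2012, 10, 17)

class EndSongDataset:
    """Hypothetical dataset wrapper: the owners decide the
    compatibility vs. reformat strategy and version the interface;
    consumers only call records()."""

    def records(self, date, raw_rows):
        if date < CUTOVER:
            return [self._parse_v1(r) for r in raw_rows]
        return [self._parse_v2(r) for r in raw_rows]

    def _parse_v1(self, row):   # old format: "track|ms"
        track, ms = row.split('|')
        return {'track_id': track, 'ms_played': int(ms)}

    def _parse_v2(self, row):   # new format: "track,ms,country"
        track, ms, country = row.split(',')
        return {'track_id': track, 'ms_played': int(ms),
                'country': country}

ds = EndSongDataset()
rows = ds.records(datetime(2013, 1, 1), ['t1,30000,SE'])
```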

Data crunching

Data evolution

Page 30: Data Infrastructure for a World of Music

Many redundant calculations.

Data discovery: home-grown tool.

Retention policy:
Save the raw data (S3).
Be brutal and delete.

Data crunching

What is out there?

Page 31: Data Infrastructure for a World of Music

Technology is easy to change, humans hard

Our most difficult challenges are cultural

Organising yourself

Page 32: Data Infrastructure for a World of Music

Failing jobs, dead jobs
Dead data
Data growth
Reruns
Isolation: configuration, memory, disk, Hadoop resources
Technical debt: testing, deployment, monitoring, remediations
Cost

Be stringent with software engineering practices or suffer. Most data organisations suffer.

Data crunching

Staying in control

Page 33: Data Infrastructure for a World of Music

History:
Data service department
Core data + platform department
Data platform department

Self-service spurs data usage.
Data producers and consumers have domain knowledge; data infrastructure engineers do not.
Data producers prioritise online services over offline.
Producing and consuming are closely tied, yet often organisationally separated.

Data crunching

Who owns what?

Page 34: Data Infrastructure for a World of Music

Dos:

Solve domain-specific or unsolved things
Use stuff from leaders (Kafka)
Monitor aggressively
Have 50+% backend engineers
Focus on the data feature developer needs
Separate raw and generated data
Hadoop was a good bet, Spark even better?

Data crunching

Things learnt in the fire

Don’ts:

Choose your own path (Python)
Use ad-hoc formats
Build stuff with < 3 years horizon
Accumulate debt
Use SQL in data pipelines
Have SPOFs - no excuse anymore
Rely on host configurations
Collect data with pull
Vanilla MapReduce
"Data is special" - no SW practices

Page 35: Data Infrastructure for a World of Music

Innovation originates at Google (~10^7 data-dedicated machines):
MapReduce, GFS, Dapper, Pregel, Flume

Open source variants by the big dozen (10^5 - 10^6):
Yahoo, Netflix, Twitter, LinkedIn, Amazon, Facebook. US only.
Hadoop, HDFS, ZooKeeper, Giraph, Crunch, Cassandra

Improved by serious players (10^3 - 10^4):
Spotify, AirBnB, FourSquare, Prezi, King. Mostly US.

Used by beginners (10^1 - 10^2)

Big Data innovation

Innovation in Big Data - four tiers

Page 36: Data Infrastructure for a World of Music

Not much in infrastructure:
Supercomputing legacy: MPI still in use
Berkeley: Spark, Mesos - cooperation with Yahoo and Twitter
Containers: Xen, VMware

Data processing theory:
Bloom filters, stream processing (e.g. Count-Min Sketch)

Machine learning

Big Data innovation

Innovation from academia

Page 37: Data Infrastructure for a World of Music

Fluid architectures / private clouds:
Large pools of machines
Services and jobs are independent of hosts
Mesos, Curator are scratching at the problem
Google Borg = Utopia

LAMP stack for Big Data

End-to-end developer testing:
Client modification to insights SQL change
Running on developer machine, in IDE

Scale is not an issue - efficiency & productivity is

Big Data innovation

Innovation is needed, examples