Post on 03-Aug-2015
The Rise of Digital Audio: Dwelling between BIG Data and Fast Data
Philippe-Alexandre Leroux | Chief Operating Officer
Bogdan Bocse | Solutions Architect
The way we consume music has evolved
Music is part of our lives, just not like before
We can now consume music in many different ways
• On Demand
• Live Radios
• Custom Radios
It’s now interactive, connected and tailored around users… = new opportunities for publishers & advertisers
So what’s different now?
What does it mean for the industry?
Fewer people are buying CDs
Publishers and Artists need new revenue models
Advertisers want Digital Audio to be as easy to buy as Display or Video
= Great opportunity for an Ad Tech company to power the Digital Audio Revolution!
Where does AdsWizz fit in that?
We are NOT an airline
We power the Digital Audio revolution
• Audience Analytics
• Ad Serving
• Audio Streaming
• SSP
• DSP
• Real-Time Bidding
• Real-Time Reports
• Supply Intelligence
• Content Analysis
• Mobile SDKs
• Real-Time Ad Insertion
• Traffic Forecasting
Some numbers
• 5B+ impressions per month
• 3,500+ broadcast stations
• 10,000 custom stations
• 1,000 podcast shows
• 100+ Amazon nodes
• 1+ million concurrent sessions
• 100 Swizzers
• 7 offices worldwide
Some of the cool brands we work with
How do we use Big Data?
* It’s not just for showing off
Understand user trends
[Chart: UK Online Listening Media Day — listening volume in half-hour intervals from 0:00 to 23:30, annotated with the commute, the lunch break and the daily peak]
Real-time user profiling
RTB is like the stock exchange, but with ads
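To make the stock-exchange analogy concrete: RTB exchanges commonly run a second-price auction, where the highest bidder wins the impression but pays the runner-up's price. A minimal sketch (the bidder names and CPM prices below are made up for illustration, not real exchange data):

```python
# Second-price auction: the highest bid wins, but the winner
# pays the second-highest price.
def run_auction(bids):
    """bids: dict mapping bidder name -> bid price (CPM)."""
    ranked = sorted(bids.items(), key=lambda kv: kv[1], reverse=True)
    winner, _ = ranked[0]
    # With a single bidder, the winner pays their own bid.
    clearing_price = ranked[1][1] if len(ranked) > 1 else ranked[0][1]
    return winner, clearing_price

winner, price = run_auction({"dsp_a": 2.50, "dsp_b": 3.10, "dsp_c": 1.80})
# dsp_b wins, but pays dsp_a's price of 2.50
```

As on a stock exchange, the whole auction has to clear in tens of milliseconds, which is what pushes the data infrastructure toward the "fast" side of the stack.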
Traditional “small” data solutions simply don’t work
For every single transaction we collect 20+ data points
Applied to 5+ billion monthly impressions
A database which grows by 1TB per day
Good luck serving close to real-time queries with MySQL
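A quick back-of-envelope check shows how these numbers add up to roughly 1 TB per day. The per-data-point size below is an assumption for illustration only; it is not a figure from the deck:

```python
# Back-of-envelope estimate of daily database growth from the
# figures above. bytes_per_data_point is an assumed average
# record size, chosen purely for illustration.
impressions_per_month = 5_000_000_000
data_points_per_impression = 20
bytes_per_data_point = 300  # assumption

impressions_per_day = impressions_per_month / 30
bytes_per_day = impressions_per_day * data_points_per_impression * bytes_per_data_point
tb_per_day = bytes_per_day / 1e12
print(f"~{tb_per_day:.1f} TB/day")  # ~1.0 TB/day
```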
Yeah, yeah, it’s all BIG. What else?
• Fast: Cache-Aside Pattern (Redis, Memcached)
• Complex Query: Data Warehousing (Redshift, Hadoop)
• Structured Query: Sorted key-value stores (HBase, DynamoDB)
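The cache-aside pattern named above can be sketched in a few lines. This is a minimal illustration: a plain dict stands in for Redis or Memcached, and `load_profile_from_db` is a hypothetical loader, not part of any AdsWizz API:

```python
# Cache-aside: the application checks the cache first, and on a
# miss loads from the backing store and populates the cache itself.
cache = {}  # stand-in for a Redis/Memcached client

def load_profile_from_db(user_id):
    # Hypothetical slow backing-store lookup.
    return {"user_id": user_id, "segments": ["music", "uk"]}

def get_profile(user_id):
    profile = cache.get(user_id)          # 1. try the cache
    if profile is None:                   # 2. cache miss
        profile = load_profile_from_db(user_id)
        cache[user_id] = profile          # 3. populate the cache
    return profile
```

In production the dict would be a Redis or Memcached client and each entry would carry a TTL, so stale profiles expire instead of living in the cache forever.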
Use Case #1: Handling User Profiles
Use Case #2: Distributed Worker
Use Case #3: Distributed Worker +Data Warehouse
Use Case #4: Distributed Worker +Data Warehouse + State Store
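The deck does not spell these use cases out, but the general shape of use case #4 can be sketched: workers pull events off a queue, keep running per-user state in a state store, and buffer events for bulk loading into the warehouse. Everything below (the queue, field names, the dict standing in for a sorted key-value store) is illustrative, not AdsWizz code:

```python
import queue

# Illustrative distributed-worker loop. Per-user state lives in a
# state store (a dict here; HBase/DynamoDB in the stacks named
# earlier), and raw events are batched for the data warehouse.
events = queue.Queue()
state_store = {}     # stand-in for a sorted key-value store
warehouse_batch = [] # buffer flushed in bulk to the warehouse

def process_event(event):
    user_id = event["user_id"]
    state = state_store.setdefault(user_id, {"listens": 0})
    state["listens"] += 1          # update the running state
    warehouse_batch.append(event)  # buffer for the bulk load

for e in [{"user_id": "u1"}, {"user_id": "u1"}, {"user_id": "u2"}]:
    events.put(e)
while not events.empty():
    process_event(events.get())
```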
An evolving tech stack
Join the ride
We are looking for new Swizzers to join:
• BIG DATA ENGINEER FOR DATA SCIENCE TEAM
• MAD DEVOPS NINJA
• INCIDENT MANAGER
• SUPER VILLAIN (ÜBER JAVA DEVELOPER)
• SENIOR MOBILE DEVELOPER (ANDROID/iOS)
• SENIOR QA INTEGRATION ENGINEER
• PHP / AngularJS DEVELOPER
• SENIOR IT PROJECT MANAGER
…jobs@adswizz.com
Philippe-Alexandre Leroux | Chief Operating Officer | philalex.leroux@adswizz.com
Bogdan Bocșe | Solutions Architect | Bogdan.bocse@adswizz.com
@followadswizz
Backup Slides(on the off-chance 20 minutes are enough)
What’s it called? What does it mean?
Volumetry: If it’s less than 100 GB, don’t bother calling it Big Data.
Atomic Query Size: Are you reading 10 or 10 million records per transaction?
Query Load: Do you expect 5 or 5,000 queries per second?
Response Time: Do you expect your data store to answer in 1 ms, 10 ms or 10 s?
Immutability: Once your data is written, does it stay written?
Strict Consistency: Do you need changes to be instantly visible to all readers?
Data Freshness: Do you need the absolute latest data, to the millisecond?
ACID Compliance: If you work with ordering or payments, you want transactions.
Query Accuracy: Is there room for error in the results of your queries?
Persistence/Durability: Should data be stored on a permanent medium (HDD, SSD)?
High Availability: Is the data store required to stay available throughout hardware and network failures?
• Big: cost grows linearly with data size; no performance degradation with size
• Flexible: on-the-fly queries
• Accurate: exact computation vs. estimated results; strict consistency
• Fast: fast reads, fast writes, fast updates
Each of these trades off against cost & complexity.
Redshift: Queries at Scale
• Tables have sort keys (like indexes)
• Tables have one distribution key, which defines how data is split over nodes
• Tables are split in sorted regions; each region has several slices spread across nodes
• Data is split across several instances
• Each column has its own compression type
• SSD-enabled (200 GB per node)
• The results…
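The sort and distribution keys above are declared when the table is created. A hypothetical DDL illustrating both (the table and column names are invented for this sketch, and it is wrapped in Python only so the snippet is self-contained):

```python
# Hypothetical Redshift table. DISTKEY controls how rows are spread
# across nodes; SORTKEY controls the on-disk sort order, letting the
# engine skip blocks on range queries. All names are illustrative.
ddl = """
CREATE TABLE impressions (
    event_time  TIMESTAMP,
    user_id     VARCHAR(64),
    station_id  INTEGER,
    ad_id       INTEGER
)
DISTKEY (user_id)      -- co-locate one user's rows on one node
SORTKEY (event_time);  -- fast range-restricted scans over time
"""
print(ddl)
```

Choosing the distribution key well matters: joins and aggregations on `user_id` stay node-local, while a poor choice forces data to be shuffled between nodes at query time.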
The Results
• The query on the previous slide (it is actually 4-5 A4 pages long)
• Over 39,031,958 rows (100-150 GB)
• Took 4.039 s
* The data store holds 3 TB over 12 instances
Ordered-Bucket Sampling
Let’s say we want to sample 20% of events for a specific scenario. We split events into 10 buckets, depending on the hash of their “user id”.
[Diagram: events hashed into Buckets #1 through #10]
er1bhUygQoRrPvonNRyw -(hash)-> Bucket 3
2m9bGzQQMs7162ObeRt -(hash)-> Bucket 7
(…)
Then we sample only those events from Bucket #1 and Bucket #2.
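A minimal sketch of this scheme; the choice of md5 and the field names are illustrative, since any stable hash of the user id works:

```python
import hashlib

NUM_BUCKETS = 10
SAMPLED_BUCKETS = {0, 1}  # 2 of 10 buckets = a 20% sample

def bucket_of(user_id):
    # Use md5 rather than Python's built-in hash(), which is salted
    # per process: the same user must land in the same bucket on
    # every machine and every run.
    digest = hashlib.md5(user_id.encode()).hexdigest()
    return int(digest, 16) % NUM_BUCKETS

def is_sampled(user_id):
    return bucket_of(user_id) in SAMPLED_BUCKETS
```

Because the bucketing is keyed on the user id, a sampled user contributes all of their events, so per-user sequences stay intact, unlike sampling individual events at random.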