Viki Analytics Infrastructure
BigData Singapore Meetup
Oct 2013
Viki’s Data Pipeline
1. Collecting Data
What data do we collect?
• Clickstream data
• An event is some user interaction or product-related action
• A client (web/mobile) sends these events as HTTP calls
• Format: JSON
  – Schema-less
  – Flexible
{"origin":"tv_show_show", "app_ver":"2.9.3.151", "uuid":"80833c5a760597bf1c8339819636df04", "user_id":"5298933u", "vs_id":"1008912v-1380452660-7920", "app_id":"100004a”, "event":”video_play","timed_comment":"off”, "stream_quality":"variable”, "bottom_subtitle":"en", "device_size":"tablet", "feature":"auto_play", "video_id":"1008912v", ”subtitle_completion_percent":"100", "device_id":"iPad2,1|6.1.3|apple", "t":"1380452846", "ip":"99.232.169.246”, "country":"ca", "city_name":"Toronto”, "region_name":"ON"}…
How to keep this data clean?
• Problem: Clients often send erroneous data, e.g. a missing parameter.
• Solution: We write client libraries for each client to enforce "world peace".
PS: there is no such thing as "world peace".
How to collect > 60M events a day?
• fluentd
  – Scalable
  – Extensible
  – Lets you send data to Hadoop, MongoDB, PostgreSQL, etc.
• Writes to Hadoop (Treasure Data), Amazon S3, MongoDB
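A collection layer like the one described might look something like the following fluentd configuration. This is a minimal sketch, not Viki's actual config; the port, tag pattern, and credential placeholders are illustrative, and the td/s3/mongo output plugins are assumed to be installed:

<source>
  type http                  # clients POST JSON events over HTTP
  port 8888                  # illustrative port
</source>

<match viki.**>
  type copy                  # fan each event out to all three sinks
  <store>
    type tdlog               # Treasure Data (hosted Hadoop)
    apikey YOUR_TD_API_KEY
    auto_create_table true
  </store>
  <store>
    type s3                  # Amazon S3, kept as backup
    aws_key_id YOUR_AWS_KEY
    aws_sec_key YOUR_AWS_SECRET
    s3_bucket viki-events-backup
  </store>
  <store>
    type mongo               # MongoDB for real-time data
    database analytics
    collection events
  </store>
</match>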
Where do we store it?
• Hadoop (Treasure Data)
  – It's fast and easy to set up!
  – We don't have the money or time to hire a Hadoop engineer.
  – We retrieve data from Hadoop in batch jobs.
• Amazon S3
  – Backup
• MongoDB
  – Real-time data
2. Retrieving & Processing Data
2. Retrieving & Processing Data
• Centralizing All Data Sources
• Cleaning Data
• Transforming Data
• Managing Job Dependencies
Getting All Data To One Place
• Port data from different production databases into PG
• Retrieve click-stream data from Hadoop to PG
a) Production Databases → Analytics DB (PostgreSQL):

thor db:cp --source prod1 --destination analytics -t public.* --force-schema prod1
thor db:cp --source A --destination B -t reporting.video_plays --increment
{"origin":"tv_show_show", "app_ver":"2.9.3.151", "uuid":"80833c5a760597bf1c8339819636df04", "user_id":"5298933u", "vs_id":"1008912v-1380452660-7920", "app_id":"100004a”, "event":”video_play","timed_comment":"off”, "stream_quality":"variable”, "bottom_subtitle":"en", "device_size":"tablet", "feature":"auto_play", "video_id":"1008912v", ”subtitle_completion_percent":"100", "device_id":"iPad2,1|6.1.3|apple", "t":"1380452846", "ip":"99.232.169.246”, "country":"ca", "city_name":"Toronto”, "region_name":"ON"}…
date source partner event video_id country cnt
2013-09-29 ios viki video_play 1008912v ca 2
2013-09-29 android viki video_play 1008912v us 18
…
b) Click-stream Data (Hadoop) Analytics DB:
Hadoop
PostgreSQL
Aggregation (Hive)
Export Output / Sqoop
SELECT
  SUBSTR(FROM_UNIXTIME(time), 0, 10) AS `date_d`,
  v['source'],
  v['partner'],
  v['event'],
  v['video_id'],
  v['country'],
  COUNT(1) AS cnt
FROM events
WHERE TIME_RANGE(time, '2013-09-29', '2013-09-30')
  AND v['event'] = 'video_play'
GROUP BY
  SUBSTR(FROM_UNIXTIME(time), 0, 10),
  v['source'],
  v['partner'],
  v['event'],
  v['video_id'],
  v['country'];
Simple Aggregation SQL
The Data Is Not Clean!
But… event properties and names change as we develop:

Old version: {"user_id": "152u", "country": "sg"}
New version: {"user_id": "152", "country_code": "sg"}
SELECT
  SUBSTR(FROM_UNIXTIME(time), 0, 10) AS `date_d`,
  v['app_id'] AS `app_id`,
  CASE
    WHEN v['app_ver'] LIKE '%_ax' THEN 'axis'
    WHEN v['app_ver'] LIKE '%_kd' THEN 'amazon'
    WHEN v['app_ver'] LIKE '%_kf' THEN 'amazon'
    WHEN v['app_ver'] LIKE '%_lv' THEN 'lenovo'
    WHEN v['app_ver'] LIKE '%_nx' THEN 'nexian'
    WHEN v['app_ver'] LIKE '%_sf' THEN 'smartfren'
    WHEN v['app_ver'] LIKE '%_vp' THEN 'samsung_viki_premiere'
    ELSE LOWER(v['partner'])
  END AS `partner`,
  CASE
    WHEN (v['app_id'] = '65535a' AND v['site'] IN ('www.viki.com', 'viki.com', 'www.viki.mx', 'viki.mx', '')) THEN 'direct'
    WHEN (v['event'] = 'pv' OR v['app_id'] = '100000a') THEN 'direct'
    WHEN (v['app_id'] = '65535a' AND v['site'] NOT IN ('www.viki.com', 'viki.com', 'www.viki.mx', 'viki.mx', '')) THEN 'embed'
    WHEN (v['source'] = 'mobile' AND v['os'] = 'android') THEN 'android'
    WHEN (v['source'] = 'mobile' AND v['device_id'] LIKE '%|apple') THEN 'ios'
    ELSE TRIM(v['source'])
  END AS `source`,
  LOWER(CASE
    WHEN LENGTH(TRIM(COALESCE(v['country'], v['country_code']))) = 2
    THEN TRIM(COALESCE(v['country'], v['country_code']))
    ELSE NULL
  END) AS `country`,
  COALESCE(v['device_size'], v['device']) AS `device`,
  COUNT(1) AS `cnt`
FROM events
WHERE time >= 1380326400
  AND time <= 1380412799
  AND v['event'] = 'video_play'
GROUP BY
  SUBSTR(FROM_UNIXTIME(time), 0, 10),
  v['app_id'],
  CASE
    WHEN v['app_ver'] LIKE '%_ax' THEN 'axis'
    WHEN v['app_ver'] LIKE '%_kd' THEN 'amazon'
    WHEN v['app_ver'] LIKE '%_kf' THEN 'amazon'
    WHEN v['app_ver'] LIKE '%_lv' THEN 'lenovo'
    WHEN v['app_ver'] LIKE '%_nx' THEN 'nexian'
    WHEN v['app_ver'] LIKE '%_sf' THEN 'smartfren'
    WHEN v['app_ver'] LIKE '%_vp' THEN 'samsung_viki_premiere'
    ELSE LOWER(v['partner'])
  END,
  CASE
    WHEN (v['app_id'] = '65535a' AND v['site'] IN ('www.viki.com', 'viki.com', 'www.viki.mx', 'viki.mx', '')) THEN 'direct'
    WHEN (v['event'] = 'pv' OR v['app_id'] = '100000a') THEN 'direct'
    WHEN (v['app_id'] = '65535a' AND v['site'] NOT IN ('www.viki.com', 'viki.com', 'www.viki.mx', 'viki.mx', '')) THEN 'embed'
    WHEN (v['source'] = 'mobile' AND v['os'] = 'android') THEN 'android'
    WHEN (v['source'] = 'mobile' AND v['device_id'] LIKE '%|apple') THEN 'ios'
    ELSE TRIM(v['source'])
  END,
  LOWER(CASE
    WHEN LENGTH(TRIM(COALESCE(v['country'], v['country_code']))) = 2
    THEN TRIM(COALESCE(v['country'], v['country_code']))
    ELSE NULL
  END),
  COALESCE(v['device_size'], v['device']);
(Not so) simple Aggregation SQL
(Hadoop → PostgreSQL)

UPDATE "reporting"."cl_main_2013_09"
SET source = 'embed', partner = 'partner1'
WHERE app_id = '100105a'
  AND (source != 'embed' OR partner != 'partner1');

UPDATE "reporting"."cl_main_2013_09"
SET app_id = '100105a'
WHERE (source = 'embed' AND partner = 'partner1')
  AND (app_id != '100105a');

UPDATE reporting.cl_main_2013_09
SET user_id = user_id || 'u'
WHERE RIGHT(user_id, 1) ~ '[0-9]';

UPDATE "reporting"."cl_main_2013_09"
SET app_id = '100106a'
WHERE (source = 'embed' AND partner = 'partner2')
  AND (app_id != '100106a');

UPDATE reporting.cl_main_2013_09
SET source = 'raynor', partner = 'viki', app_id = '100000a'
WHERE event = 'pv'
  AND source IS NULL AND partner IS NULL AND app_id IS NULL;

…even after import

Cleaning Up Data Takes Lots of Time
(Import data: 30%, Clean up data: 70%)
Transforming Data
• Centralizing All Data Sources
• Cleaning Data
• Transforming Data
• Managing Job Dependencies
Transforming Data
(Diagram: Analytics DB (PostgreSQL) with tables …, Table A, Table B, …)
a) Reducing Table Size By Dropping a Dimension

video_plays_with_video_id (20M records):

date        source  partner  event       video_id  country  cnt
2013-09-29  ios     viki     video_play  1v        ca       2
2013-09-29  ios     viki     video_play  2v        ca       18
…

video_plays (4M records, PostgreSQL):

date        source  partner  event       country  cnt
2013-09-29  ios     viki     video_play  ca       20
…
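Expressed in SQL, the roll-up is a single aggregation that sums counts over the dropped video_id dimension. A minimal sketch reusing the table and column names from the slide:

-- Roll video_plays_with_video_id up into video_plays by dropping video_id
-- (assumes video_plays has the same columns minus video_id):
INSERT INTO video_plays (date, source, partner, event, country, cnt)
SELECT date, source, partner, event, country, SUM(cnt) AS cnt
FROM video_plays_with_video_id
GROUP BY date, source, partner, event, country;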
b) Injecting Extra Fields For Analysis

containers (1) → videos (n)

containers, before (PostgreSQL):

id  title
1c  Game of Thrones
2c  My Girlfriend Is A Gumiho
…

containers, with video_count injected:

id  title                      video_count
1c  Game of Thrones            30
2c  My Girlfriend Is A Gumiho  16
…
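A denormalized field like this can be maintained with a correlated UPDATE. A minimal sketch, assuming videos references its parent container through a container_id column (an assumed FK name, not confirmed by the slides):

-- Inject video_count into containers from the 1:n containers → videos relation:
UPDATE containers c
SET video_count = (
  SELECT COUNT(*)
  FROM videos v
  WHERE v.container_id = c.id   -- container_id is an assumed column name
);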
Chunk Tables By Month
video_plays (parent table)
  video_plays_2013_06
  video_plays_2013_07
  video_plays_2013_08
  video_plays_2013_09
  …

ALTER TABLE video_plays_2013_09 INHERIT video_plays;

ALTER TABLE video_plays_2013_09
  ADD CHECK (date >= '2013-09-01' AND date < '2013-10-01');
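Adding the next month's chunk follows the same pattern, and the CHECK constraints are what let the planner skip irrelevant chunks. A sketch using standard PostgreSQL inheritance (the October table name is illustrative):

-- Create and attach a new monthly chunk:
CREATE TABLE video_plays_2013_10 (LIKE video_plays INCLUDING DEFAULTS);
ALTER TABLE video_plays_2013_10 INHERIT video_plays;
ALTER TABLE video_plays_2013_10
  ADD CHECK (date >= '2013-10-01' AND date < '2013-11-01');

-- With constraint exclusion on, a query against the parent scans only
-- the chunks whose CHECK constraint overlaps the WHERE clause:
SET constraint_exclusion = partition;
SELECT SUM(cnt)
FROM video_plays
WHERE date >= '2013-09-01' AND date < '2013-10-01';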
Managing Job Dependency
• Centralizing All Data Sources
• Cleaning Data
• Transforming Data
• Managing Job Dependencies
Managing Job Dependency
…
tableA
tableB
…
Analytics DB (PostgreSQL)
Azkaban: cron + dependency management
(Viki Cron Dependency Graph)
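In Azkaban, each job is a small properties file and the dependencies key defines the edges of the graph. A hypothetical pair of job files (job names and commands are illustrative, not Viki's actual jobs; the thor command is taken from the earlier slide):

# port_production_dbs.job
type=command
command=thor db:cp --source prod1 --destination analytics -t public.* --force-schema prod1

# aggregate_video_plays.job — runs only after its dependencies succeed
type=command
command=psql analytics -f aggregate_video_plays.sql
dependencies=port_production_dbs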
Data Presentation
Dashboard
• Yes, a dashboard on Rails.
• We have a daily logship process to port the data over to the dashboard server:

thor db:logship -t big_table
Data Visualization
• Tableau is slow if working directly on PostgreSQL
• Export compressed CSVs to the Tableau server (Windows)
• Line charts do solve most problems
Engineering involvement in report creation
• Bad idea!
• Enter Query Reports: fast report churn rate!
“Give me six hours to chop down a tree and I will spend the first four sharpening the axe” – Abraham Lincoln
Query Reports
Summary report
• Higher-level view of metrics
• See changes over time
(screenshot)
Data Explorer
"The world is your oyster"
One more thing! (Viki Live)
Recap
Lessons Learnt
• Line charts can solve most problems
• Chart your data quickly
• Our dataset is not that big
Simple DIY Suggestion
• Put QueryReports on top of your database, or Tableau Desktop.
• Use Mixpanel/KISSMetrics for Product Analytics
• fluentd writes data to Postgres (hstore)
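On the last point: with events stored in an hstore column, Postgres can be queried directly for product metrics. A minimal sketch, assuming an events table whose key/value pairs land in an hstore column named v (the table and column names are assumptions):

-- Requires the hstore extension: CREATE EXTENSION hstore;
SELECT v -> 'video_id' AS video_id,   -- hstore's -> operator returns text
       COUNT(*)        AS cnt
FROM events
WHERE v -> 'event' = 'video_play'
GROUP BY v -> 'video_id';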
We are hiring!
Thank you!
Viki’s Data Pipeline