Viki Analytics Infrastructure
BigData Singapore Meetup
Oct 2013
Viki’s Data Pipeline
1. Collecting Data
What data do we collect?
• Clickstream data
• An event is some user interaction or product-related action
• A client (web/mobile) sends these events as HTTP calls
• Format: JSON
  – Schema-less
  – Flexible
{"origin":"tv_show_show", "app_ver":"2.9.3.151", "uuid":"80833c5a760597bf1c8339819636df04", "user_id":"5298933u", "vs_id":"1008912v-1380452660-7920", "app_id":"100004a”, "event":”video_play","timed_comment":"off”, "stream_quality":"variable”, "bottom_subtitle":"en", "device_size":"tablet", "feature":"auto_play", "video_id":"1008912v", ”subtitle_completion_percent":"100", "device_id":"iPad2,1|6.1.3|apple", "t":"1380452846", "ip":"99.232.169.246”, "country":"ca", "city_name":"Toronto”, "region_name":"ON"}…
How to keep this data clean?
• Problem: Clients often send erroneous data, e.g. a missing parameter.
• Solution: We write client libraries for each client to enforce "world peace".
PS: there is no such thing as "world peace".
How to collect > 60M events a day?
• fluentd
  – Scalable
  – Extensible
  – Lets you send data to Hadoop, MongoDB, PostgreSQL, etc.
• Writes to Hadoop (Treasure Data), Amazon S3, MongoDB
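A collection layer like the one described might look something like the following fluentd configuration. This is a minimal sketch, not Viki's actual config; the port, tag pattern, and credential placeholders are illustrative, and the td/s3/mongo output plugins are assumed to be installed:

<source>
  type http                  # clients POST JSON events over HTTP
  port 8888                  # illustrative port
</source>

<match viki.**>
  type copy                  # fan each event out to all three sinks
  <store>
    type tdlog               # Treasure Data (hosted Hadoop)
    apikey YOUR_TD_API_KEY
    auto_create_table true
  </store>
  <store>
    type s3                  # Amazon S3, kept as backup
    aws_key_id YOUR_AWS_KEY
    aws_sec_key YOUR_AWS_SECRET
    s3_bucket viki-events-backup
  </store>
  <store>
    type mongo               # MongoDB for real-time data
    database analytics
    collection events
  </store>
</match>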
Where do we store it?
• Hadoop (Treasure Data)
  – It's fast and easy to set up!
  – We don't have the money or time to hire a Hadoop engineer.
  – We retrieve data from Hadoop in batch jobs.
• Amazon S3
  – Backup
• MongoDB
  – Real-time data
2. Retrieving & Processing Data
2. Retrieving & Processing Data
• Centralizing All Data Sources
• Cleaning Data
• Transforming Data
• Managing Job Dependencies
Getting All Data To One Place
• Port data from different production databases into PG
• Retrieve click-stream data from Hadoop to PG
a) Production Databases → Analytics DB (PostgreSQL):

thor db:cp --source prod1 --destination analytics -t public.* --force-schema prod1
thor db:cp --source A --destination B -t reporting.video_plays --increment
{"origin":"tv_show_show", "app_ver":"2.9.3.151", "uuid":"80833c5a760597bf1c8339819636df04", "user_id":"5298933u", "vs_id":"1008912v-1380452660-7920", "app_id":"100004a”, "event":”video_play","timed_comment":"off”, "stream_quality":"variable”, "bottom_subtitle":"en", "device_size":"tablet", "feature":"auto_play", "video_id":"1008912v", ”subtitle_completion_percent":"100", "device_id":"iPad2,1|6.1.3|apple", "t":"1380452846", "ip":"99.232.169.246”, "country":"ca", "city_name":"Toronto”, "region_name":"ON"}…
date source partner event video_id country cnt
2013-09-29 ios viki video_play 1008912v ca 2
2013-09-29 android viki video_play 1008912v us 18
…
b) Click-stream Data (Hadoop) Analytics DB:
Hadoop
PostgreSQL
Aggregation (Hive)
Export Output / Sqoop
SELECT
  SUBSTR(FROM_UNIXTIME(time), 0, 10) AS `date_d`,
  v['source'],
  v['partner'],
  v['event'],
  v['video_id'],
  v['country'],
  COUNT(1) AS cnt
FROM events
WHERE TIME_RANGE(time, '2013-09-29', '2013-09-30')
  AND v['event'] = 'video_play'
GROUP BY
  SUBSTR(FROM_UNIXTIME(time), 0, 10),
  v['source'],
  v['partner'],
  v['event'],
  v['video_id'],
  v['country'];
Simple Aggregation SQL
The Data Is Not Clean!
But… event properties and names change as we develop:

Old version: {"user_id": "152u", "country": "sg"}
New version: {"user_id": "152", "country_code": "sg"}
SELECT
  SUBSTR(FROM_UNIXTIME(time), 0, 10) AS `date_d`,
  v['app_id'] AS `app_id`,
  CASE
    WHEN v['app_ver'] LIKE '%_ax' THEN 'axis'
    WHEN v['app_ver'] LIKE '%_kd' THEN 'amazon'
    WHEN v['app_ver'] LIKE '%_kf' THEN 'amazon'
    WHEN v['app_ver'] LIKE '%_lv' THEN 'lenovo'
    WHEN v['app_ver'] LIKE '%_nx' THEN 'nexian'
    WHEN v['app_ver'] LIKE '%_sf' THEN 'smartfren'
    WHEN v['app_ver'] LIKE '%_vp' THEN 'samsung_viki_premiere'
    ELSE LOWER(v['partner'])
  END AS `partner`,
  CASE
    WHEN (v['app_id'] = '65535a' AND v['site'] IN ('www.viki.com', 'viki.com', 'www.viki.mx', 'viki.mx', '')) THEN 'direct'
    WHEN (v['event'] = 'pv' OR v['app_id'] = '100000a') THEN 'direct'
    WHEN (v['app_id'] = '65535a' AND v['site'] NOT IN ('www.viki.com', 'viki.com', 'www.viki.mx', 'viki.mx', '')) THEN 'embed'
    WHEN (v['source'] = 'mobile' AND v['os'] = 'android') THEN 'android'
    WHEN (v['source'] = 'mobile' AND v['device_id'] LIKE '%|apple') THEN 'ios'
    ELSE TRIM(v['source'])
  END AS `source`,
  LOWER(CASE
    WHEN LENGTH(TRIM(COALESCE(v['country'], v['country_code']))) = 2
    THEN TRIM(COALESCE(v['country'], v['country_code']))
    ELSE NULL
  END) AS `country`,
  COALESCE(v['device_size'], v['device']) AS `device`,
  COUNT(1) AS `cnt`
FROM events
WHERE time >= 1380326400
  AND time <= 1380412799
  AND v['event'] = 'video_play'
GROUP BY
  SUBSTR(FROM_UNIXTIME(time), 0, 10),
  v['app_id'],
  CASE
    WHEN v['app_ver'] LIKE '%_ax' THEN 'axis'
    WHEN v['app_ver'] LIKE '%_kd' THEN 'amazon'
    WHEN v['app_ver'] LIKE '%_kf' THEN 'amazon'
    WHEN v['app_ver'] LIKE '%_lv' THEN 'lenovo'
    WHEN v['app_ver'] LIKE '%_nx' THEN 'nexian'
    WHEN v['app_ver'] LIKE '%_sf' THEN 'smartfren'
    WHEN v['app_ver'] LIKE '%_vp' THEN 'samsung_viki_premiere'
    ELSE LOWER(v['partner'])
  END,
  CASE
    WHEN (v['app_id'] = '65535a' AND v['site'] IN ('www.viki.com', 'viki.com', 'www.viki.mx', 'viki.mx', '')) THEN 'direct'
    WHEN (v['event'] = 'pv' OR v['app_id'] = '100000a') THEN 'direct'
    WHEN (v['app_id'] = '65535a' AND v['site'] NOT IN ('www.viki.com', 'viki.com', 'www.viki.mx', 'viki.mx', '')) THEN 'embed'
    WHEN (v['source'] = 'mobile' AND v['os'] = 'android') THEN 'android'
    WHEN (v['source'] = 'mobile' AND v['device_id'] LIKE '%|apple') THEN 'ios'
    ELSE TRIM(v['source'])
  END,
  LOWER(CASE
    WHEN LENGTH(TRIM(COALESCE(v['country'], v['country_code']))) = 2
    THEN TRIM(COALESCE(v['country'], v['country_code']))
    ELSE NULL
  END),
  COALESCE(v['device_size'], v['device']);
(Not so) simple Aggregation SQL
(Hadoop → PostgreSQL)

UPDATE "reporting"."cl_main_2013_09"
SET source = 'embed', partner = 'partner1'
WHERE app_id = '100105a'
  AND (source != 'embed' OR partner != 'partner1');

UPDATE "reporting"."cl_main_2013_09"
SET app_id = '100105a'
WHERE (source = 'embed' AND partner = 'partner1')
  AND (app_id != '100105a');

UPDATE reporting.cl_main_2013_09
SET user_id = user_id || 'u'
WHERE RIGHT(user_id, 1) ~ '[0-9]';

UPDATE "reporting"."cl_main_2013_09"
SET app_id = '100106a'
WHERE (source = 'embed' AND partner = 'partner2')
  AND (app_id != '100106a');

UPDATE reporting.cl_main_2013_09
SET source = 'raynor', partner = 'viki', app_id = '100000a'
WHERE event = 'pv'
  AND source IS NULL AND partner IS NULL AND app_id IS NULL;

…even after import

Cleaning Up Data Takes Lots of Time
(Import data: 30%, Clean up data: 70%)
Transforming Data
• Centralizing All Data Sources
• Cleaning Data
• Transforming Data
• Managing Job Dependencies
Transforming Data
(Diagram: Analytics DB (PostgreSQL) with tables …, Table A, Table B, …)
a) Reducing Table Size By Dropping a Dimension

video_plays_with_video_id (20M records):

date        source  partner  event       video_id  country  cnt
2013-09-29  ios     viki     video_play  1v        ca       2
2013-09-29  ios     viki     video_play  2v        ca       18
…

video_plays (4M records, PostgreSQL):

date        source  partner  event       country  cnt
2013-09-29  ios     viki     video_play  ca       20
…
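Expressed in SQL, the roll-up is a single aggregation that sums counts over the dropped video_id dimension. A minimal sketch reusing the table and column names from the slide:

-- Roll video_plays_with_video_id up into video_plays by dropping video_id
-- (assumes video_plays has the same columns minus video_id):
INSERT INTO video_plays (date, source, partner, event, country, cnt)
SELECT date, source, partner, event, country, SUM(cnt) AS cnt
FROM video_plays_with_video_id
GROUP BY date, source, partner, event, country;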
b) Injecting Extra Fields For Analysis

containers (1) → videos (n)

containers, before (PostgreSQL):

id  title
1c  Game of Thrones
2c  My Girlfriend Is A Gumiho
…

containers, with video_count injected:

id  title                      video_count
1c  Game of Thrones            30
2c  My Girlfriend Is A Gumiho  16
…
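A denormalized field like this can be maintained with a correlated UPDATE. A minimal sketch, assuming videos references its parent container through a container_id column (an assumed FK name, not confirmed by the slides):

-- Inject video_count into containers from the 1:n containers → videos relation:
UPDATE containers c
SET video_count = (
  SELECT COUNT(*)
  FROM videos v
  WHERE v.container_id = c.id   -- container_id is an assumed column name
);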
Chunk Tables By Month
video_plays (parent table)
  video_plays_2013_06
  video_plays_2013_07
  video_plays_2013_08
  video_plays_2013_09
  …

ALTER TABLE video_plays_2013_09 INHERIT video_plays;

ALTER TABLE video_plays_2013_09
  ADD CHECK (date >= '2013-09-01' AND date < '2013-10-01');
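Adding the next month's chunk follows the same pattern, and the CHECK constraints are what let the planner skip irrelevant chunks. A sketch using standard PostgreSQL inheritance (the October table name is illustrative):

-- Create and attach a new monthly chunk:
CREATE TABLE video_plays_2013_10 (LIKE video_plays INCLUDING DEFAULTS);
ALTER TABLE video_plays_2013_10 INHERIT video_plays;
ALTER TABLE video_plays_2013_10
  ADD CHECK (date >= '2013-10-01' AND date < '2013-11-01');

-- With constraint exclusion on, a query against the parent scans only
-- the chunks whose CHECK constraint overlaps the WHERE clause:
SET constraint_exclusion = partition;
SELECT SUM(cnt)
FROM video_plays
WHERE date >= '2013-09-01' AND date < '2013-10-01';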
Managing Job Dependency
• Centralizing All Data Sources
• Cleaning Data
• Transforming Data
• Managing Job Dependencies
Managing Job Dependency
…
tableA
tableB
…
Analytics DB (PostgreSQL)
Azkaban: cron + dependency management
(Viki Cron Dependency Graph)
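In Azkaban, each job is a small properties file and the dependencies key defines the edges of the graph. A hypothetical pair of job files (job names and commands are illustrative, not Viki's actual jobs; the thor command is taken from the earlier slide):

# port_production_dbs.job
type=command
command=thor db:cp --source prod1 --destination analytics -t public.* --force-schema prod1

# aggregate_video_plays.job — runs only after its dependencies succeed
type=command
command=psql analytics -f aggregate_video_plays.sql
dependencies=port_production_dbs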
Data Presentation
Dashboard
• Yes, a dashboard on Rails.
• We have a daily logship process to port the data over to the dashboard server:

thor db:logship -t big_table
Data Visualization
• Tableau is slow if working directly on PostgreSQL
• Export compressed CSVs to the Tableau server (Windows)
• Line charts do solve most problems
Engineering involvement in report creation
• Bad idea!
• Enter Query Reports: fast report churn rate!
“Give me six hours to chop down a tree and I will spend the first four sharpening the axe” – Abraham Lincoln
Query Reports
Summary report
• Higher-level view of metrics
• See changes over time
(screenshot)
Data Explorer
"The world is your oyster"
One more thing! (Viki Live)
Recap
Lessons Learnt
• Line charts can solve most problems
• Chart your data quickly
• Our dataset is not that big
Simple DIY Suggestion
• Put QueryReports on top of your database, or Tableau Desktop.
• Use Mixpanel/KISSMetrics for Product Analytics
• fluentd writes data to Postgres (hstore)
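On the last point: with events stored in an hstore column, Postgres can be queried directly for product metrics. A minimal sketch, assuming an events table whose key/value pairs land in an hstore column named v (the table and column names are assumptions):

-- Requires the hstore extension: CREATE EXTENSION hstore;
SELECT v -> 'video_id' AS video_id,   -- hstore's -> operator returns text
       COUNT(*)        AS cnt
FROM events
WHERE v -> 'event' = 'video_play'
GROUP BY v -> 'video_id';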
We are hiring!
Thank you!
Viki’s Data Pipeline