Extracting Insights from Data at Twitter

Prasad Wagle, Technical Lead, Core Data and Metrics, Data Platform, twitter.com/prasadwagle

Jan 26, 2016

● What are the properties of Big Data at Twitter?

● Where do we store it and how do we process it?

● What do we learn from the data?

Overview of the talk

● Velocity: Rate at which data is created

○ 313 million monthly active users. (June 2016)

○ Hundreds of millions of Tweets are sent per day. TPS record: one-second peak of 143,199 Tweets per second

○ 100 Billion interaction events per day

● Volume: 100s of petabytes of data

● Variety: Tweets, Users, Client events and many more

○ Client event logs have a unified Thrift format for a wide variety of application events

3Vs of Big Data @Twitter

Data Processing Big Picture

● Production systems

● Batch: Hadoop (HDFS, MapReduce)

● Real-time: Eventbus, Kafka Streams

● Data Access Layer (DAL), Pipeline Orchestration

● Analytics Tools

○ Batch: Scalding, Spark

○ Real-time: Heron

○ Lambda (Batch + Real-time): Summingbird, TSAR

○ Interactive: Presto, Vertica

● Analytics Front-ends: R, Custom Dashboards, Tableau, Apache Zeppelin, Command line tools

Data Platform

● Batch Processing Engine - Hadoop

● Real-time Processing Engine - Heron

● Core Data Libraries - Scalding, Summingbird, TSAR, Parquet

● Data Pipeline - Data Access Layer (DAL), Orchestration

● Interactive SQL - Presto, Vertica

● Data Visualization - Tableau, Apache Zeppelin

● Core Data and Metrics

Data Platform Projects

● Largest Hadoop clusters in the world, some > 10K nodes

● Store 100s of petabytes of data

● More than 100K daily jobs

● Improvements to open source Hadoop software

● hRaven - tool that collects runtime data of Hadoop jobs and lets users visualize job metrics

○ YARN Timeline Server is next-gen hRaven

● Log pipeline software (Scribe -> HDFS)

○ Scribe is being replaced by Flume

Hadoop

● Heron - a real-time, distributed, fault-tolerant stream processing engine

● Successor to Storm, API compatible with Storm

● Analyze data as it is being produced

● > 400 real-time jobs, 500 B events / day processed, 25 - 200 ms latency

● Use cases

○ Real-time impression and engagement counts

○ Real-time trends, recommendations, spam detection

Real-time Processing

● Tools that make it easy to create MapReduce and Heron jobs

● Scalding

○ Scala DSL on top of Cascading

● Summingbird

○ Lambda architecture: real-time and batch

● TSAR: TimeSeries AggregatoR

○ DSL implemented on top of Summingbird

Core Data Libraries

● DAL is a service that simplifies the discovery, usage, and maintainability of data

● Users work with logical datasets

● Physical dataset describes the serialization of a logical dataset to a specific location (Hadoop, Vertica) and format

● Logical dataset can simultaneously exist in multiple places

● Users can use logical dataset name to consume data with different tools like Scalding, Presto

Data Access Layer (DAL)
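The split between logical and physical datasets can be pictured as a small data model. This is an illustrative sketch only, not DAL's actual API: the class names, fields, dataset name, and locations below are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PhysicalDataset:
    """One serialization of a logical dataset (hypothetical model)."""
    system: str    # storage system, e.g. "hadoop" or "vertica"
    location: str  # path or table name; the values used below are made up
    fmt: str       # serialization format, e.g. "parquet"


@dataclass
class LogicalDataset:
    """What users refer to by name, independent of where it is stored."""
    name: str
    physical: list = field(default_factory=list)

    def resolve(self, system):
        """Return the physical copies of this dataset available on `system`."""
        return [p for p in self.physical if p.system == system]


# One logical dataset that exists in two places at once.
tweets = LogicalDataset("tweets_verified")
tweets.physical.append(PhysicalDataset("hadoop", "/hypothetical/path/tweets", "parquet"))
tweets.physical.append(PhysicalDataset("vertica", "hypothetical_schema.tweets", "native"))

hadoop_copies = tweets.resolve("hadoop")
```

A tool only needs the logical name plus the system it runs against; the lookup picks the matching physical copy.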

● Eagleeye web application is front-end for end users

● Users discover datasets with Eagleeye

● Eagleeye displays metadata like owners and schema

● Application access to datasets is recorded

● Enables Eagleeye to show dependency graphs for a dataset - jobs that produce a dataset and jobs that consume it

Data Access Layer (DAL)

Data Discovery
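Recorded accesses are enough to derive the producer and consumer sets for each dataset. The sketch below assumes a hypothetical access log of (job, dataset, mode) tuples; the job and dataset names are invented for illustration and do not describe how Eagleeye stores this information.

```python
from collections import defaultdict

# Hypothetical access log: (job, dataset, mode), where mode is "read" or "write".
access_log = [
    ("ingest_client_events", "client_events", "write"),
    ("impressions_daily", "client_events", "read"),
    ("impressions_daily", "tweet_impressions", "write"),
    ("dashboard_refresh", "tweet_impressions", "read"),
]


def build_dependency_index(log):
    """Group jobs by dataset into producers (writers) and consumers (readers)."""
    index = defaultdict(lambda: {"producers": set(), "consumers": set()})
    for job, dataset, mode in log:
        role = "producers" if mode == "write" else "consumers"
        index[dataset][role].add(job)
    return index


deps = build_dependency_index(access_log)
```

Walking from a dataset to its consumers, and from those jobs to the datasets they write, yields the dependency graph.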

● Statebird service

○ Tracks state of batch jobs

○ Used to manage dependencies

Pipeline Orchestration
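Tracking job state to manage dependencies can be sketched as follows. This is a toy model under stated assumptions, not Statebird's interface: the class, its methods, and the job names are hypothetical.

```python
class JobStateTracker:
    """Toy tracker of batch job runs, keyed by (job name, batch date)."""

    def __init__(self):
        self._done = set()  # completed (job, date) pairs
        self._deps = {}     # job -> list of upstream jobs

    def add_job(self, job, depends_on=()):
        self._deps[job] = list(depends_on)

    def mark_done(self, job, date):
        self._done.add((job, date))

    def can_run(self, job, date):
        """A job may run for `date` once every upstream job finished that date."""
        return all((up, date) in self._done for up in self._deps[job])


tracker = JobStateTracker()
tracker.add_job("client_events_load")
tracker.add_job("daily_active_users", depends_on=["client_events_load"])

before = tracker.can_run("daily_active_users", "2016-01-25")
tracker.mark_done("client_events_load", "2016-01-25")
after = tracker.can_run("daily_active_users", "2016-01-25")
```

The downstream job is blocked until its upstream job reports completion for the same batch date.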

● Interactive means that results of a query are available in the range of seconds to a few minutes

● SQL is still the lingua franca for ad hoc data analysis

● Vertica

○ Columnar architecture, high performance analytics queries

● Presto

○ Data in HDFS in Parquet format

Interactive SQL

● Custom Dashboards

● Apache Zeppelin Strengths

○ Notebook metaphor - notebook is a collection of notes, each note is a collection of paragraphs (queries)

○ Web-based report authoring, collaborative like Google Docs

○ Very easy to create a note and then share it

○ > 2K notes, 18K queries

○ Supports JDBC (Presto, Vertica, MySQL)

○ Open source, easy to add new interpreters like Scalding

Data Visualization

● Tableau Strengths

○ Easy to create reports, does not require SQL expertise

○ Built-in analytics functions, e.g. Rank, Percentile

○ Polished visualizations

○ Row-level security

Data Visualization

● Big part of data analysis is data cleansing

● Makes sense to do this once

● Core Data

○ Create pipelines to create “verified” datasets like Users, Tweets, Interactions

○ Reliable and easy to use

● Core Metrics

○ Create pipelines to compute Twitter’s important metrics

○ DAU, MAU, Tweet Impressions

Core Data and Metrics

Data Processing

● Analytics - Basic Counting

● A/B Testing

● Data Science - Custom analysis

● Data Science - Machine Learning

Data Processing

● Daily/Monthly Active Users

● Number of Tweets, Retweets, Likes

● Tweet Impressions

● Logic is relatively simple

● Challenges: scale and timeliness

○ Results for previous day should be available by 10 am

○ Some metrics are real-time

Basic Counting
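The counting logic itself is simple, as the slide notes; the difficulty lies in running it at scale and on time. A minimal in-memory sketch of distinct-user counting, assuming hypothetical (user_id, date) activity records and a calendar-month simplification of monthly active users:

```python
from collections import defaultdict

# Hypothetical (user_id, date) activity records; a user can appear many times per day.
events = [
    (1, "2016-01-24"), (2, "2016-01-24"), (1, "2016-01-24"),
    (1, "2016-01-25"), (3, "2016-01-25"),
]


def daily_active_users(records):
    """Count distinct users per day."""
    users_by_day = defaultdict(set)
    for user_id, day in records:
        users_by_day[day].add(user_id)
    return {day: len(users) for day, users in users_by_day.items()}


def active_users_in_month(records, month):
    """Count distinct users active on any day whose date starts with `month`."""
    return len({user_id for user_id, day in records if day.startswith(month)})


dau = daily_active_users(events)
mau = active_users_in_month(events, "2016-01")
```

Monthly active users cannot be obtained by summing the daily counts, because a user who is active on several days must be counted once.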

● Goal: find the number of impressions and engagements for a tweet

● Real-time

● Used in analytics.twitter.com

Example - Counting Tweet Impressions

aggregate {
  onKeys(
    (TweetId)                          // Dimension
  ) produce (
    Count                              // Metric
  ) sinkTo (Manhattan)                 // Data Sink
} fromProducer {
  ClientEventSource("client_events")   // Data Source
    .filter { event => isImpressionEvent(event) }
    .map { event => (event.timestamp, ImpressionAttributes(event.tweetId)) }
}

TSAR job

● TSAR job is converted to a Summingbird job

● Summingbird job creates

○ Real-time pipeline with Heron

○ Batch pipeline with Scalding

● Users access results using TSAR query service

● Write once, run batch and real-time

Example - Counting Tweet Impressions
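The "write once, run batch and real-time" idea can be illustrated with plain Python. This is a sketch of the general Lambda pattern, not of TSAR or Summingbird internals: the event layout and helper names are hypothetical, and summing the two partial results at query time is an assumption about how such pipelines are commonly combined.

```python
from collections import Counter


def is_impression_event(event):
    return event["type"] == "impression"


def count_impressions(events):
    """Shared logic: count impression events per tweet id."""
    return Counter(e["tweet_id"] for e in events if is_impression_event(e))


# Hypothetical events: older ones already handled by the batch pipeline,
# recent ones seen only by the real-time pipeline.
batch_events = [
    {"type": "impression", "tweet_id": 10},
    {"type": "impression", "tweet_id": 10},
    {"type": "click", "tweet_id": 10},
    {"type": "impression", "tweet_id": 11},
]
realtime_events = [
    {"type": "impression", "tweet_id": 10},
    {"type": "impression", "tweet_id": 12},
]

batch_counts = count_impressions(batch_events)
realtime_counts = count_impressions(realtime_events)

# Counts add associatively, so the two partial results can simply be summed.
merged = batch_counts + realtime_counts
```

The same counting function serves both pipelines; only the input differs.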

● Experimentation is at the heart of Twitter’s product development cycle

● Expertise needed in Statistics and Technology

A/B Testing Framework

● Goal: an informative experiment

● Minimize false positive and false negative errors

● How many users do we need to sample?

● How long should we run the experiment?

A/B Testing Statistics
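The sample-size question can be made concrete with the standard textbook approximation for a two-sided two-proportion z-test. This is a generic sketch, not necessarily the method Twitter's framework uses; the function name, default error rates, and example conversion rates are assumptions.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_bucket(p_control, p_treatment, alpha=0.05, power=0.8):
    """Approximate users needed per bucket for a two-sided two-proportion z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # bounds the false positive rate
    z_beta = NormalDist().inv_cdf(power)           # bounds the false negative rate
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    effect = p_treatment - p_control
    return ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


# Detecting a lift from 10% to 11% needs far more users than a lift from 10% to 12%.
n_small_effect = sample_size_per_bucket(0.10, 0.11)
n_large_effect = sample_size_per_bucket(0.10, 0.12)
```

The required sample grows with the inverse square of the effect size, which is why small expected changes call for larger buckets or longer experiments.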

● Process 100 B events daily, compute intensive.

● Metrics computed using Scalding pipeline that combines client event logs, internal user models, and other datasets.

● Lightweight statistics are computed in a streaming job using TSAR running on Heron.

A/B Testing Technology

● Cause of spikes and dips in key metrics

● Growth Trends

○ By country, client

● Analysis to understand user behavior

○ Creators vs Consumers

○ Distribution of followers

○ User clusters

● Analysis to inform product feature decisions

Data Science - Custom Analysis

● Recommendations

○ Users: WTF - who to follow

○ Tweets: Algorithmic timeline

● Cortex: deep learning based on the Torch framework

○ Identify NSFW images

○ Recognize what is happening in live feeds

Data Science - Machine Learning

● Product Safety

○ Detect fake accounts

○ Detect tweet spam and abuse

● Ad Targeting

○ Promoted Trends, Accounts and Tweets

○ Show only if it is likely to be interesting and relevant to that user

○ Predict click probability using signals including what a user chooses to follow, how they interact with a Tweet and what they retweet

Machine Learning

● Systems (Hadoop, Vertica)

○ Necessary because higher-level abstractions are leaky

● Programming (Scala, Scalding, SQL)

● Math (Statistics, Linear Algebra)

Ideal Talent Stack

Systems, Programming, Statistics

Data Engineers, Data Scientists

Data Platform and Data Science work hand-in-hand to extract insights from Big Data at Twitter

Summary

Questions?

● TSAR https://blog.twitter.com/2014/tsar-a-timeseries-aggregator

● DAL https://blog.twitter.com/2016/discovery-and-consumption-of-analytics-data-at-twitter

● Heron https://blog.twitter.com/2015/flying-faster-with-twitter-heron

● Heron http://www.slideshare.net/KarthikRamasamy3

● A/B testing https://blog.twitter.com/2015/twitter-experimentation-technical-overview

● A/B testing https://blog.twitter.com/2016/power-minimal-detectable-effect-and-bucket-size-estimation-in-ab-tests

● Algorithmic timeline https://support.twitter.com/articles/164083

● Cortex https://www.technologyreview.com/s/601284/twitters-artificial-intelligence-knows-whats-happening-in-live-video-clips/

● Cortex https://www.wired.com/2015/07/twitters-new-ai-recognizes-porn-dont/

References
