Yahoo! Hack India: Hyderabad 2013 | Building Data Products At Scale

Sanket Patil speaking on Building Data Products At Scale

Transcript of Yahoo! Hack India: Hyderabad 2013 | Building Data Products At Scale

Page 1

BUILDING DATA PRODUCTS AT SCALE

Page 2

DATAWEAVE: WHAT WE DO

• Aggregate large amounts of data publicly available on the web, and serve it to businesses in readily usable forms

• Serve actionable data through APIs, Visualizations, and Dashboards

• Provide a reporting and analytics layer on top of datasets and APIs

Page 3

DATAWEAVE PLATFORM

[Platform diagram: unstructured data, spread across sources and temporally changing (pricing data, open government data, social media data), flows into the Big Data Platform, which serves it out through Data APIs, API Feeds, Data Services, Dashboards, and Visualizations and Widgets.]

Page 4

HOW DOES IT WORK - 1?

• Crawling/Scraping: from a large number of data sources

• Cleaning/Deduplication: remove as much noise as possible (a minimal sketch follows this list)

• Data Normalization: represent related data together in standard forms
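A minimal sketch of the deduplication step, assuming crawled records arrive as Python dicts; the field names and the fingerprinting scheme are illustrative choices, not DataWeave's actual pipeline.

import hashlib

def fingerprint(record):
    # Fingerprint a record by its identifying fields; the choice of
    # ("title", "price", "source") is an assumption for illustration.
    key = "|".join(str(record.get(f, "")).strip().lower()
                   for f in ("title", "price", "source"))
    return hashlib.sha1(key.encode("utf-8")).hexdigest()

def deduplicate(records):
    # Keep the first occurrence of each fingerprint, drop the rest.
    seen = set()
    for record in records:
        fp = fingerprint(record)
        if fp not in seen:
            seen.add(fp)
            yield record

records = [
    {"title": "Acme Phone X", "price": "9999", "source": "store-a"},
    {"title": "  acme phone x ", "price": "9999", "source": "store-a"},  # near-duplicate
    {"title": "Acme Phone X", "price": "9499", "source": "store-b"},
]
print(list(deduplicate(records)))  # the near-duplicate is dropped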

Page 5

HOW DOES IT WORK - 2?

• Store/Index: store optimally to support several complex queries

• Create "Views": on top of data for easy consumption, through APIs, visualizations, dashboards, and reports

• Package data as a product: to solve a bunch of related pain points in a certain domain (e.g., PriceWeave for retail)
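As a sketch of what such a view can look like, here is a precomputed cheapest-offer-per-product table in the spirit of PriceWeave; the offer schema ("product_id", "source", "price") is assumed for illustration.

from collections import defaultdict

def build_cheapest_offer_view(offers):
    # Precompute "cheapest offer per product" once, offline, so an
    # API can answer the query with a plain lookup.
    view = defaultdict(lambda: {"min_price": float("inf"), "source": None})
    for offer in offers:
        entry = view[offer["product_id"]]
        if offer["price"] < entry["min_price"]:
            entry["min_price"] = offer["price"]
            entry["source"] = offer["source"]
    return dict(view)

offers = [
    {"product_id": "p1", "source": "store-a", "price": 9999},
    {"product_id": "p1", "source": "store-b", "price": 9499},
    {"product_id": "p2", "source": "store-a", "price": 499},
]
view = build_cheapest_offer_view(offers)
print(view["p1"])  # {'min_price': 9499, 'source': 'store-b'}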

Page 6

AGGREGATION AND EXTRACTION

[Layered diagram, bottom to top: Public Data on the Web → Aggregation Layer (distributed crawler infrastructure) → Extraction Layer (offline extraction of factual data).]

Page 7

AGGREGATION LAYER

Customized crawler infrastructure

• vertical-specific crawlers

• capable of crawling the "deep web"

Highly Scalable

• 500+ websites on a daily basis

• more with the addition of hardware

Robust to failures (404s, timeouts, server restarts)

• stateless distributed workers

• crawl state maintained in a separate data store
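A minimal sketch of a stateless worker with the failure handling described above. The in-process frontier list and crawl_state dict are stand-ins for a shared queue and a separate data store; a real deployment keeps both outside the worker so any restarted worker can resume.

import time
import urllib.error
import urllib.request

frontier = ["https://example.com/"]   # stand-in for a shared URL queue
crawl_state = {}                      # stand-in for an external state store
MAX_ATTEMPTS = 3

def worker():
    # Stateless by design: all progress lives in crawl_state, so a
    # restarted worker picks up exactly where the last one stopped.
    while frontier:
        url = frontier.pop()
        state = crawl_state.setdefault(url, {"status": "pending", "attempts": 0})
        state["attempts"] += 1
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                body = resp.read()
            state["status"] = "done"
            # hand `body` off to the (offline) extraction layer here
        except (urllib.error.URLError, TimeoutError):
            # covers 404s, DNS failures, and timeouts
            if state["attempts"] < MAX_ATTEMPTS:
                frontier.append(url)  # requeue for a later retry
                time.sleep(1)         # simple backoff
            else:
                state["status"] = "failed"

worker()
print(crawl_state)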

Page 8

DATA EXTRACTION LAYER

• Extract as many data points from crawled pages as possible

• Completely offline process, independent of crawling

• Highly parallelized -- scales in a straightforward manner
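A sketch of the offline extraction step: because each page is processed independently of crawling and of every other page, it parallelizes trivially. The snapshots and the price regex below are illustrative, not DataWeave's actual extractors.

import re
from multiprocessing import Pool

# Stand-in for stored crawl snapshots: (url, html) pairs.
snapshots = [
    ("https://store-a.example/p1", "<span class='price'>Rs. 9,999</span>"),
    ("https://store-b.example/p1", "<span class='price'>Rs. 9,499</span>"),
]

PRICE_RE = re.compile(r"class='price'>Rs\. ([\d,]+)<")

def extract(snapshot):
    # A pure function of its input: it can be re-run at any time,
    # on any machine, without touching the crawlers.
    url, html = snapshot
    match = PRICE_RE.search(html)
    price = int(match.group(1).replace(",", "")) if match else None
    return {"url": url, "price": price}

if __name__ == "__main__":
    with Pool() as pool:
        print(pool.map(extract, snapshots))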

Page 9

NORMALIZATION

[Diagram: the Extraction Layer's offline extraction of factual data feeds the Normalization Layer, which applies machine learning techniques to remove noise, fill gaps in data, represent data, and cluster it, backed by a Knowledge Base.]

Page 10

NORMALIZATION LAYER

• Remove noise, remove duplicates

• Gather data from multiple sources and fill "gaps" in info

• Normalize data points to a standard internal representation

• Cluster related data together using machine learning techniques (a toy sketch follows this list)

• Build a "knowledge base" -- continuous learning

• "Human in the loop" for data validation

Page 11

DATA STORAGE AND SERVING

[Diagram: a Storage Layer (distributed data storage) holds crawl snapshots, processed data, and clustered data; above it, a highly responsive Serving Layer maintains indexes, views, filters, and pre-computed results, serving Data APIs, visualizations, dashboards, and reports.]

Page 12

DATA STORAGE LAYER

• Store snapshots of crawl data -- never throw away raw data! (a sketch follows this list)

• Store processed data -- both individual data points as well as "clusters" of related data points

• Distributed data stores

• Highly scalable -- add more hardware

• Highly available -- replication
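A sketch of the "never throw away raw data" rule: append-only snapshots keyed by URL hash and crawl date. The local filesystem here stands in for a distributed, replicated store.

import hashlib
import json
import time
from pathlib import Path

SNAPSHOT_ROOT = Path("snapshots")  # stand-in for a distributed store

def save_snapshot(url, html):
    # Append-only write: every crawl of a URL gets its own dated file,
    # so no earlier snapshot is ever overwritten.
    url_hash = hashlib.sha1(url.encode("utf-8")).hexdigest()
    day = time.strftime("%Y-%m-%d")
    path = SNAPSHOT_ROOT / url_hash / f"{day}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps({"url": url, "date": day, "html": html}))
    return path

print(save_snapshot("https://store-a.example/p1", "<html>...</html>"))
# snapshots/<sha1-of-url>/<YYYY-MM-DD>.json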

Page 13

SERVING LAYER

This is the system as far as a user is concerned!

Must be highly responsive

Process data offline and periodically push it to the serving layer

• create indexes for fast data retrieval

• create views to serve queries that are known a priori (a sketch follows this list)

• minimize computation to the extent possible
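A sketch of the offline-to-online handoff: a batch job precomputes answers for a query known a priori (here, "price history for a product"), so the serving path is a pure lookup. The dict is a stand-in for the serving store's index.

# Offline job: precompute a view for a query known a priori.
processed = [
    {"product_id": "p1", "date": "2013-08-01", "price": 9999},
    {"product_id": "p1", "date": "2013-08-02", "price": 9499},
]

price_history_index = {}
for row in processed:
    price_history_index.setdefault(row["product_id"], []).append(
        (row["date"], row["price"])
    )

# Online serving path: no per-request computation, just a lookup.
def get_price_history(product_id):
    return price_history_index.get(product_id, [])

print(get_price_history("p1"))  # [('2013-08-01', 9999), ('2013-08-02', 9499)]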

Page 14

DATAWEAVE PLATFORM

[Closing slide: the platform diagram from Page 3, repeated.]