Page 1: Leveraging Big Data: Offline and real-time use cases

Sastry Malladi

Chief Architect, StubHub

January 23, 2015 · Confidential

Page 2: StubHub is about…

• World’s largest ticketing marketplace
• 10M active listings
• We enable “access” to events
• We want to be more!

Page 3: Some Fun Facts about StubHub

• An eBay-owned company
• Over 25 million users and growing
• We sell one ticket per second
• ~8.5 million page views a day, on average
• ~3 million additional page views per day on mobile devices
• ~10 million tickets for sale in sports, concerts and other categories
• ~1 TB of data processed monthly by the analytics infrastructure; this number will go up significantly as we bring in data from many of the unstructured data sources
• ~300 million SQL executions per day
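To put the daily figures above on a per-second footing, the sketch below divides each by the 86,400 seconds in a day. It assumes load is spread evenly across the day, which real traffic is not, and the dictionary keys are illustrative labels rather than terms from the slide.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

# Approximate daily figures quoted on the slide.
daily_figures = {
    "page views": 8_500_000,
    "additional mobile page views": 3_000_000,
    "SQL executions": 300_000_000,
}


def per_second(daily_total: int) -> float:
    """Average rate per second, assuming an even spread over the day."""
    return daily_total / SECONDS_PER_DAY


rates = {name: per_second(total) for name, total in daily_figures.items()}

# "One ticket per second" expressed as a daily total.
tickets_per_day = 1 * SECONDS_PER_DAY

for name, rate in rates.items():
    print(f"{name}: ~{rate:,.0f} per second")
print(f"tickets sold: ~{tickets_per_day:,} per day")
```

Under that even-spread assumption, 300 million SQL executions a day averages out to roughly 3,500 per second, and one ticket per second is 86,400 tickets a day.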

Page 4: Overview

• Use Cases
• Challenges
• Architecture/Solution

Page 5: Big Data - Demystifying terms

Big Data – the 4 V’s
• Volume
• Variety
• Velocity
• Veracity

Hadoop
• Refers to an ecosystem with many supporting tools and frameworks for storing and processing data
• Continuously evolving

Page 6: eBay SEARCH aka Cassini

Page 7: Transition to Big Data …

• Traditional data warehouses can store structured data (e.g. transactions).
• Outside of transaction data, a lot of typical user data is either semi-structured or unstructured.
• Data is any company’s biggest asset. The more a company can leverage it, the better it can serve its customers.
• Many companies are beginning to take a hybrid approach (EDW + Hadoop), but what’s the right balance?
• What’s the long-term evolution/direction?

Page 8: StubHub Vision

Make StubHub a worldwide destination for an end-to-end experience for all fans. This includes discovery, access and sharing post-event experiences.

Page 9: Major Big Data Use Cases that arise out of that vision

Personalization / Recommendations
• Buyer Recommendations: Personalized experience
• Seller Recommendations: Ticket Pricing

Social Interactions
• Support an enriched experience for a fan for the complete event life cycle, from discovery to access to post-event experience

Business Analytics
• Making data available for traditional business analytics reporting, including fraud analysis

Customer 360 / Insights
• A multi-faceted view of a customer, including buying and/or selling preferences, seating preferences, pricing preferences, preferred events, friends, likes/dislikes, transaction history, etc.
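As a rough illustration of what a Customer 360 record could hold, the sketch below models the facets listed above as one Python dataclass. The class name, field names and types, and the `record_purchase` helper are all hypothetical; the deck does not describe StubHub's actual schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Customer360:
    """Hypothetical profile record; fields mirror the facets on the slide."""
    customer_id: str
    buying_preferences: Dict[str, str] = field(default_factory=dict)
    selling_preferences: Dict[str, str] = field(default_factory=dict)
    seating_preferences: List[str] = field(default_factory=list)
    pricing_preferences: Dict[str, float] = field(default_factory=dict)
    preferred_events: List[str] = field(default_factory=list)
    friends: List[str] = field(default_factory=list)
    likes: List[str] = field(default_factory=list)
    dislikes: List[str] = field(default_factory=list)
    transaction_history: List[dict] = field(default_factory=list)

    def record_purchase(self, event: str, section: str, price: float) -> None:
        """Append a purchase and fold it into the derived preferences."""
        self.transaction_history.append(
            {"type": "buy", "event": event, "section": section, "price": price}
        )
        if event not in self.preferred_events:
            self.preferred_events.append(event)
        if section not in self.seating_preferences:
            self.seating_preferences.append(section)
        # Track the highest price paid so far as a crude pricing signal.
        current = self.pricing_preferences.get("max_paid", 0.0)
        self.pricing_preferences["max_paid"] = max(current, price)


profile = Customer360(customer_id="c-001")
profile.record_purchase("Giants vs. Dodgers", "Lower Box", 120.0)
profile.record_purchase("Giants vs. Dodgers", "Bleachers", 45.0)
```

Keeping the raw transaction history alongside the derived preferences means the preferences can be recomputed later if the derivation rules change.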

Page 10: Improve User Experience with Big Data

Page 11: Personalization

Page 12: Personalization

Page 13: Personalization

Page 14: Pricing Analytics

Help sellers price it right
• Demand
• Social media
• Weather
• Team mix

Page 15: Fraud Analytics

• Inherent risk with the increase in digital tickets
• Track more attributes, such as known-good device IDs
• Real-time analytics of customer profiles and interactions
• Train and develop non-linear fraud models
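The known-good device ID idea can be sketched as a simple rule check. This is a minimal illustration and not the non-linear fraud models the slide mentions: the `Order` fields, the in-memory `known_good_devices` lookup, the 500.0 amount threshold and the review rule are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class Order:
    order_id: str
    customer_id: str
    device_id: str
    is_digital_ticket: bool
    amount: float


# Hypothetical lookup: device IDs already seen on legitimate orders, per customer.
known_good_devices: Dict[str, Set[str]] = {
    "c-001": {"dev-aaa", "dev-bbb"},
}


def risk_signals(order: Order, amount_threshold: float = 500.0) -> List[str]:
    """Return the risk signals raised by one order (illustrative rules only)."""
    signals = []
    if order.device_id not in known_good_devices.get(order.customer_id, set()):
        signals.append("unrecognized device")
    if order.is_digital_ticket:
        signals.append("digital ticket")
    if order.amount >= amount_threshold:
        signals.append("high order amount")
    return signals


def needs_review(order: Order) -> bool:
    """Flag an order when an unrecognized device coincides with another signal."""
    signals = risk_signals(order)
    return "unrecognized device" in signals and len(signals) >= 2


trusted = Order("o-1", "c-001", "dev-aaa", True, 900.0)
suspect = Order("o-2", "c-001", "dev-zzz", True, 80.0)
```

Here `trusted` comes from a known-good device and is not flagged despite its amount, while `suspect` combines an unrecognized device with a digital ticket and is sent for review. In a trained model, signals like these would more likely serve as input features than as hard rules.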

Page 16: Challenges

• Many different data sources, both internal and external
• Near-real-time access and processing of raw data
• Near-real-time feeding of the aggregate data back into the OLTP environment
• Structured, semi-structured and unstructured data
• Updates to existing tools and processes to consume data from Hadoop / NoSQL DBs
• Building a platform for the future while continuing to support the existing platform/tools

Page 17: Big Data – Needs an Engineering Mindset

[Diagram; recoverable labels only]
• Data Ingestion
• Job Scheduling & Management
• Data Validation
• Manageability
• Open Source Compliance
• Scalability
• Adaptability: Data Import
• Flexibility: Data Export
• Integration with Visualization tools
• Reliability
• Data Publication

Page 18: Few Example data sources

Structured
• Transactional data (orders, listings)
• Catalog data
• CS / Siebel HR data

Semi to Unstructured
• Click stream data
• External user segmentation data
• External survey data
• Email data
• Social data
• Meta classification data
• Referral data

Page 19: Architecture

[Diagram: Big Data Platform; recoverable labels only]
Data Sources: Structured / Unstructured (Transactions, Click Stream, Social, weblogs, Catalog, …)
Components: ETL, Data Adapter 1, Data Adapter 2, … Data Adapter N, EDW, Data Validation, Hadoop, HDFS, MapReduce, Hive, Pig, Mahout, sqoop, HBase, Solr, zookeeper

Page 20: Architecture (contd.)

[Diagram: Big Data Platform; recoverable labels only]
Data Sources: Structured / Unstructured (Transactions, Click Stream, Social, weblogs, Catalog, …)
Components: ETL, Data Adapter 1, Data Adapter 2, … Data Adapter N, EDW, Data Validation, Hadoop, HDFS, MapReduce, Hive, Pig, Mahout, sqoop, HBase, Solr, zookeeper
Additional labels on this slide: Solr and HBase (second instances), Replication (twice), Website/Mobile apps/partner apps, Reco svc, other, OLTP world, Analytics world, Analytics engine (R / Mahout), BI Tools, Sandbox data

Page 21: Solution – An Ecosystem for Data and Job Management

[Diagram; recoverable labels only]
Stages: Data Ingestion, Data Processing, Data Publication
Adapters: Click Stream Data Adapter, NoSQL Data Adapter, RDBMS Data Adapter
Processing components: Hive, Pig, MapReduce, Mahout, HDFS, oozie
Services: Data Validation Service, MapReduce Service (MARS)
Data stores: RDBMS, NoSQL, File System, EDW, Solr, HBase
Coordination: Zookeeper, Zookeeper Group
Tracking: Data Import: Tracking and Status; Job Scheduling and Management; Data Export: Tracking and Status; Data Streams; Audit Trail
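The three stages named on this slide (ingestion, processing, publication), together with validation and an audit trail, can be sketched as a small in-memory pipeline. This is a toy under stated assumptions: the adapter returns canned click events, the "job" is a plain Python function, and the sink is a dictionary. None of the function names come from the deck, and nothing here talks to Hadoop, Oozie or MARS.

```python
from datetime import datetime, timezone
from typing import Callable, Dict, Iterable, List

Record = Dict[str, object]

# Audit trail: one entry per completed stage, in the spirit of the slide's
# "Tracking and Status" and "Audit Trail" labels.
audit_trail: List[Dict[str, object]] = []


def audit(stage: str, records: int) -> None:
    """Append a timestamped status entry for a completed stage."""
    audit_trail.append({
        "stage": stage,
        "records": records,
        "at": datetime.now(timezone.utc).isoformat(),
    })


def click_stream_adapter() -> Iterable[Record]:
    """Hypothetical adapter returning canned events instead of a real feed."""
    return [
        {"user": "u1", "page": "/event/123"},
        {"user": "u2", "page": "/event/123"},
        {"user": None, "page": "/event/456"},  # fails validation: no user
    ]


def validate(records: Iterable[Record]) -> List[Record]:
    """Keep only records that carry both a user and a page."""
    return [r for r in records if r.get("user") and r.get("page")]


def count_views_by_page(records: Iterable[Record]) -> Dict[str, int]:
    """Stand-in for a processing job: view counts per page."""
    counts: Dict[str, int] = {}
    for r in records:
        page = str(r["page"])
        counts[page] = counts.get(page, 0) + 1
    return counts


def run_pipeline(adapter: Callable[[], Iterable[Record]],
                 sink: Dict[str, int]) -> None:
    """Ingest, validate, process and publish, auditing each stage."""
    raw = list(adapter())
    audit("ingestion", len(raw))
    valid = validate(raw)
    audit("validation", len(valid))
    result = count_views_by_page(valid)
    audit("processing", len(result))
    sink.update(result)  # publication: write aggregates to the target store
    audit("publication", len(result))


published: Dict[str, int] = {}
run_pipeline(click_stream_adapter, published)
```

Passing the adapter in as a callable keeps the pipeline independent of any one source, which is the point of having separate click stream, NoSQL and RDBMS adapters.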

Page 22: Our NoSQL choices

We use both HBase and Mongo.

Mongo
• Convenient JSON object/document storage for retrieval by web applications

HBase
• Distributed storage: data is partitioned and stored across the cluster
• Flexible table-like structure: HBase is a multi-dimensional, column-oriented data store, supporting both structured and unstructured data storage with consistency
• Scalable: scaling HBase is a matter of adding more nodes to the cluster
• High availability: using multiple masters enables high availability

[Diagram: HBase Master, Region Server 1, Region Server 2]
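The "multi-dimensional, column-oriented" description can be made concrete with a toy model. HBase's data model is usually described as a sorted map from row key to column (family:qualifier) to timestamped cell versions. The class below imitates that shape in plain Python; it is an in-memory stand-in with hypothetical names, not the HBase client API, and it ignores regions, persistence and consistency.

```python
from collections import defaultdict
from typing import Iterable, Iterator, Optional


class ToyTable:
    """In-memory stand-in for an HBase-style table (not a real client)."""

    def __init__(self, column_families: Iterable[str]) -> None:
        # Column families are fixed up front; qualifiers are free-form per row.
        self.column_families = set(column_families)
        # row key -> "family:qualifier" -> timestamp -> value
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row_key: str, column: str, value: str, timestamp: int) -> None:
        """Store one cell version under an existing column family."""
        family = column.split(":", 1)[0]
        if family not in self.column_families:
            raise KeyError(f"unknown column family: {family}")
        self._rows[row_key][column][timestamp] = value

    def get(self, row_key: str, column: str) -> Optional[str]:
        """Return the newest version of a cell, or None if absent."""
        versions = self._rows.get(row_key, {}).get(column)
        if not versions:
            return None
        return versions[max(versions)]

    def scan(self, start_row: str, stop_row: str) -> Iterator[str]:
        """Yield row keys in sorted order within [start_row, stop_row)."""
        for key in sorted(self._rows):
            if start_row <= key < stop_row:
                yield key


table = ToyTable(["profile", "activity"])
table.put("user#001", "profile:name", "Ann", timestamp=1)
table.put("user#001", "profile:name", "Ann B.", timestamp=2)
table.put("user#002", "activity:last_event", "concert", timestamp=1)
```

Rows do not need to share the same columns, which is what lets one table hold both regular and irregular records. Because row keys are kept in sorted order, a range scan such as `table.scan("user#001", "user#003")` visits contiguous keys, and that ordering is what makes partitioning by key range across region servers possible.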

Page 23: Data Driven Culture

• Data insights should fuel creativity
• Define & drive SMART goals deeper into the organization
• Data-driven decision making
• Easy access to data

Page 24: Q & A

Thank you! [email protected]