Leveraging Big Data Offline and real-time use cases · January 23, 2015 Confidential 16 Many...
Confidential Slide 1
Leveraging Big Data: Offline and real-time use cases
Sastry Malladi
Chief Architect, StubHub
January 23, 2015 Confidential 2
We enable “access” to events
StubHub is about…
We want to be more!!!
World’s largest ticketing marketplace
10M active listings
An eBay-owned company
Over 25 million users and growing
We sell one ticket per second
~8.5 million page views a day, on average
~3 million additional page views per day on mobile devices
~10 million tickets for sale in sports, concerts, and other categories
~1 TB of data processed monthly by the analytics infrastructure. This number will go up significantly as we bring in data from many of the unstructured data sources.
~300 million SQL executions/day
Some Fun Facts about StubHub
Use Cases
Challenges
Architecture/Solution
Overview
Big Data – the 4 V’s
Volume
Variety
Velocity
Veracity
Hadoop
Refers to an ecosystem of many supporting tools and frameworks for storing and processing data
Continuously evolving
Big Data - Demystifying terms
eBay SEARCH aka Cassini
Traditional data warehouses can store structured data (e.g., transactions).
Outside of transaction data, a lot of typical user data is either semi-structured or unstructured.
Data is any company’s biggest asset. The more a company can leverage it, the better it can serve its customers.
Many companies are beginning to leverage a hybrid approach (EDW + Hadoop), but what’s the right balance?
What’s the long-term evolution/direction?
Transition to Big Data …
StubHub Vision
Make StubHub a worldwide destination for an end-to-end experience for all fans. This includes discovery, access, and sharing post-event experiences.
Major Big Data Use Cases that arise out of that vision
Personalization / Recommendations
• Buyer Recommendations: Personalized experience
• Seller Recommendations: Ticket Pricing
Social Interactions
• Support an enriched experience for a fan across the complete event life cycle, from discovery to access to post-event experience.
Business Analytics
• Making data available for traditional business analytics reporting, including fraud analysis
Customer 360 / Insights
• A multi-faceted view of a customer, including buying and/or selling preferences, seating preferences, pricing preferences, preferred events, friends, likes/dislikes, transaction history, etc.
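The multi-faceted customer view described above could be modeled as one record that merges preference and history facets. The following is a minimal sketch in Python. The class name `Customer360`, its field names, and the helper methods are hypothetical illustrations, not StubHub's actual schema.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Customer360:
    """Hypothetical aggregate view of one customer (all field names are assumptions)."""
    user_id: str
    seating_preferences: list[str] = field(default_factory=list)
    preferred_events: list[str] = field(default_factory=list)
    friends: list[str] = field(default_factory=list)
    # Each transaction is a dict such as {"event": "concert-1", "price": 120.0}.
    transactions: list[dict] = field(default_factory=list)

    def top_events(self, n: int = 3) -> list[str]:
        """Return the n events this customer transacted on most often."""
        counts = Counter(t["event"] for t in self.transactions)
        return [event for event, _ in counts.most_common(n)]

    def average_price(self) -> float:
        """Average transaction price, or 0.0 when there is no history."""
        if not self.transactions:
            return 0.0
        return sum(t["price"] for t in self.transactions) / len(self.transactions)
```

Keeping the facets in one record means a recommendation or pricing service can read a single object per customer instead of joining several sources at request time.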
Improve User Experience with Big Data
Personalization
Pricing Analytics
Help sellers price it right
Demand
Social media
Weather
Team mix
Fraud Analytics
• Inherent risk with increase in digital tickets
Track more attributes such as Known Good Device Ids
Real time analytics of customer profiles and interactions
Train and develop non-linear fraud models
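As a rough illustration of how a known-good device check can feed a non-linear score, here is a minimal Python sketch. The device IDs, the feature names, and every weight are placeholders invented for illustration. A real model would be trained on labeled data, and nothing here reflects StubHub's actual fraud models.

```python
import math

# Hypothetical set of device IDs previously seen on legitimate orders.
KNOWN_GOOD_DEVICE_IDS = {"device-a1", "device-b2"}


def fraud_score(device_id: str, orders_last_hour: int, is_digital_ticket: bool) -> float:
    """Return a score in (0, 1); higher means more suspicious.

    The weights below are placeholders, not fitted values.
    """
    unknown_device = 0.0 if device_id in KNOWN_GOOD_DEVICE_IDS else 1.0
    digital = 1.0 if is_digital_ticket else 0.0
    # The product term makes the score non-linear in its inputs: a burst of
    # orders only raises the score when it comes from an unknown device.
    z = (-3.0
         + 1.5 * unknown_device
         + 0.5 * digital
         + 0.8 * unknown_device * orders_last_hour)
    # Logistic squashing maps z to the open interval (0, 1).
    return 1.0 / (1.0 + math.exp(-z))
```

With these placeholder weights, the same order burst scores much higher from an unrecognized device than from a known-good one, which is the kind of feature interaction a purely linear rule cannot express.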
Many different data sources
Both internal and external
Near-real-time access and processing of raw data
Near-real-time feeding of the aggregate data back into the OLTP environment
Structured, semi-structured, and unstructured data
Updates to existing tools and processes to consume data from Hadoop / NoSQL DBs
Building a platform for the future while continuing to support existing platform/tools
Challenges
Big Data – Needs an Engineering Mindset
Data Ingestion
Job Scheduling & Management
Data Validation
Manageability
Open Source Compliance
Scalability
Adaptability: Data Import
Flexibility: Data Export
Integration with Visualization tools
Reliability
Data Publication
Structured
Transactional Data (orders, listings)
Catalog data
CS / Siebel HR data
Semi to Unstructured
Click stream data
External user segmentation data
External survey data
Email data
Social data
Meta classification data
Referral data
A few example data sources
Architecture
[Diagram]
Data Sources (Structured / Unstructured): Transactions, Click Stream, Social, weblogs, Catalog, …
ETL
Data Adapter 1, Data Adapter 2, …, Data Adapter N
EDW
Big Data Platform (Hadoop): Data Validation, Hive, Pig, HDFS, Mahout, Sqoop, MapReduce, HBase, Solr, ZooKeeper
Architecture (contd.)
[Diagram]
Data Sources (Structured / Unstructured): Transactions, Click Stream, Social, weblogs, Catalog, …
ETL
Data Adapter 1, Data Adapter 2, …, Data Adapter N
EDW
Big Data Platform (Hadoop): Data Validation, Hive, Pig, HDFS, Mahout, Sqoop, MapReduce, HBase, Solr, ZooKeeper
Solr, HBase
Website/Mobile apps/partner apps
Reco svc, other
Replication
OLTP world
Analytics world
Analytics engine (R / Mahout), BI Tools, Sandbox data
Replication
Solution – An Ecosystem for Data and Job Management
[Diagram with three stages: Data Ingestion, Data Processing, Data Publication]
Adapters: Click Stream Data Adapter, NoSQL Data Adapter, RDBMS Data Adapter
Hadoop components: Hive, Oozie, Pig, HDFS, Mahout, MapReduce
Services: Data Validation Service, MapReduce Service (MARS)
Stores: RDBMS, NoSQL, File System, EDW, Solr, HBase
Coordination: ZooKeeper (Zookeeper Group)
Tracking: Data Import: Tracking and Status; Job Scheduling and Management; Data Export: Tracking and Status; Data Streams; Audit Trail
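The ingestion, validation, and publication stages named in the diagram above can be pictured as a small pipeline that also records import status and an audit trail. The following is a minimal Python sketch. `PipelineRun`, `run_import`, and the sample click stream records are hypothetical names and data, and the sketch does not use or represent the actual MARS or Data Validation Service interfaces.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable


@dataclass
class PipelineRun:
    """Hypothetical status record for one import (tracking and audit trail)."""
    source: str
    imported: int = 0
    rejected: int = 0
    audit_trail: list[str] = field(default_factory=list)


def run_import(source: str,
               adapter: Callable[[], Iterable[dict]],
               is_valid: Callable[[dict], bool],
               publish: Callable[[dict], None]) -> PipelineRun:
    """Pull records from an adapter, validate each one, and publish the valid ones."""
    run = PipelineRun(source=source)
    for record in adapter():
        if is_valid(record):
            publish(record)
            run.imported += 1
        else:
            run.rejected += 1
            run.audit_trail.append(f"rejected record from {source}: {record!r}")
    # Always close the run with a summary entry so every import is traceable.
    run.audit_trail.append(f"{source}: imported={run.imported} rejected={run.rejected}")
    return run


# Usage with made-up click stream records; the second one lacks a user field.
published: list[dict] = []
run = run_import(
    "clickstream",
    adapter=lambda: [{"user": "u1", "page": "/event/1"}, {"page": "/event/2"}],
    is_valid=lambda r: "user" in r,
    publish=published.append,
)
```

Passing the adapter, validator, and publisher in as functions keeps each data source pluggable, which matches the deck's picture of adding Data Adapter 1 through N without changing the rest of the flow.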
We use both HBase and Mongo
Mongo: convenient JSON object/document storage for retrieval by web applications
HBase:
• Distributed storage: data is partitioned and stored across the cluster.
• Flexible table-like structure: HBase is a multi-dimensional, column-oriented data store that supports both structured and unstructured data storage with consistency.
• Scalable: scaling HBase is a matter of adding more nodes to the cluster.
• High availability: running multiple masters enables high availability.
Our NoSQL choices
[Diagram: HBase Master, Region Server 1, Region Server 2]
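To make "data is partitioned and stored across the cluster" concrete: HBase splits a table into regions by sorted row-key range, and each region is served by a region server. The Python sketch below imitates that lookup in memory. The split point `"m"` and the server names are assumptions for illustration, and the code does not call any HBase client API.

```python
from bisect import bisect_right

# Hypothetical region boundaries: each region covers [start_key, next_start_key).
# The empty string marks the start of the first region.
REGION_START_KEYS = ["", "m"]
REGION_SERVERS = ["region-server-1", "region-server-2"]


def locate_region_server(row_key: str) -> str:
    """Return the server whose region's key range contains row_key."""
    # bisect_right counts how many start keys are <= row_key; the region
    # that owns the key is the last of those.
    index = bisect_right(REGION_START_KEYS, row_key) - 1
    return REGION_SERVERS[index]
```

Because regions are key ranges rather than hash buckets, adding capacity amounts to splitting a range and assigning the new region to another server, which is the sense in which scaling is "a matter of adding more nodes".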
Data Driven Culture
Data insights should fuel creativity
Define & drive SMART Goals deeper into the organization
Data driven decision making
Easy access to data