Spark Summit Keynote by Seshu Adunuthula

Enterprise Data Platform and role of Apache Spark

Seshu Adunuthula (@SeshuAd)

78% of GMV is fixed price $29.99

42% GMV VIA MOBILE

291M MOBILE APP DOWNLOADS GLOBALLY

1.3B LISTINGS CREATED VIA MOBILE EVERY WEEK

162M ACTIVE BUYERS

25M ACTIVE SELLERS

800M ACTIVE LISTINGS

8.8M NEW LISTINGS EVERY WEEK

Q4 2015

eBay Marketplace

*Q3 2015 data

Evolving from our Auction Roots

79% of items sold are new

84% of items are fixed price

63% of transactions offer free shipping

Data is eBay’s Most Important Asset

Data Sciences

Personalization: User propensity modeling based on 5 quarters of behavioral data. Cluster-based unstructured data. User (Badges, Activity Synopsis, Word Cloud), Deals Personalization.

Merchandising: Similar items. Recommending similar items on key placements on site and mobile. Powered, among other things, by co-clicked items.

Structured Data: Deal discovery leverages a machine learning model.

Trust: Predictive machine learning models for fraud prevention, account takeover, loss prevention, bad buying experience prevention, and buyer/seller risk prediction.

Shipping: Delivery experience. Building a model to predict more accurate and shorter delivery estimates.
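The Merchandising item says similar-item recommendations are powered in part by co-clicked items. A minimal sketch of that idea follows, assuming click logs are already grouped into sessions. The function names and the sample sessions are hypothetical and are not taken from the talk.

```python
from collections import Counter, defaultdict
from itertools import combinations


def co_click_counts(sessions):
    """Count how often each pair of items is clicked in the same session."""
    counts = defaultdict(Counter)
    for items in sessions:
        # Deduplicate so repeated clicks within one session count once.
        for a, b in combinations(sorted(set(items)), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts


def similar_items(counts, item, k=3):
    """Return up to k items most often co-clicked with `item`."""
    return [other for other, _ in counts[item].most_common(k)]


# Hypothetical sessions, each a list of clicked item ids.
sessions = [
    ["camera", "lens", "tripod"],
    ["camera", "lens"],
    ["camera", "bag"],
    ["lens", "tripod"],
]
counts = co_click_counts(sessions)
print(similar_items(counts, "camera", k=1))
```

A production recommender would likely normalize these raw counts by item popularity and combine them with other signals, as the slide's "among other things" suggests.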

BI & Analytical Apps

Search Backend: 950M items / ~15TB indexes in 2.5 hours on a daily basis. 2M item / 11TB index subsets generated near real-time in 3 minutes.

Search Sciences: Preparation of search ranking/spam/recall factors or data (like phrase table, query-rewrite table, etc.), including many pipelines built on top of the search Scala platform.

Structured Data: Convert 800M item listings into product pages that are automatically curated and persisted.

Traffic – Paid Search: 10% efficiency lift for paid search (Amber model). Identify low-performing items (Google search).

Data Preparation: Bot detection, data transformation, and sessionization for user behavior and core data sets.

Experimentation: A/B testing for new features and user experiences released on the site.
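The Data Preparation item mentions bot detection and sessionization. The sketch below shows one common way to do both on a small scale. The 30-minute inactivity gap and the per-user event ceiling are assumptions chosen for illustration, not values from the talk, and all names are hypothetical.

```python
from collections import defaultdict

SESSION_GAP_SECONDS = 30 * 60  # assumed inactivity timeout
BOT_MAX_EVENTS = 1000          # assumed per-user event ceiling


def sessionize(events, gap=SESSION_GAP_SECONDS, bot_max_events=BOT_MAX_EVENTS):
    """Group (user_id, timestamp) events into per-user sessions.

    Users with more than `bot_max_events` events are dropped as likely bots.
    A new session starts when the time since the previous event exceeds `gap`.
    """
    by_user = defaultdict(list)
    for user_id, ts in events:
        by_user[user_id].append(ts)

    sessions = {}
    for user_id, stamps in by_user.items():
        if len(stamps) > bot_max_events:
            continue  # crude volume-based bot filter
        stamps.sort()
        user_sessions = [[stamps[0]]]
        for ts in stamps[1:]:
            if ts - user_sessions[-1][-1] > gap:
                user_sessions.append([ts])    # gap exceeded: open a new session
            else:
                user_sessions[-1].append(ts)  # still within the current session
        sessions[user_id] = user_sessions
    return sessions


# Hypothetical events: (user_id, timestamp in seconds).
events = [("u1", 0), ("u1", 600), ("u1", 4000), ("u2", 100)]
result = sessionize(events)
print(result)
```

At the data volumes described in these slides, the same grouping would typically run as a distributed job (for example on Hadoop or Spark) rather than in a single process.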

Behavioral Data

Search

SPD: Provides global B2C and C2C seller performance overview

Nous, DNA: Provide self-service product experience analysis on product health, behavior analytics, and product domain-specific reporting

System Services

Sherlock Monitoring: Application logs (CAL logs) are stored and processed on Hadoop.

Data is eBay’s DNA

Enterprise Data Platform


Agile Data Warehouse: Populated by batch. Used by humans. How: sets of data.

Data Streams: Populated by streams. Used by systems. How: sets of data.

Data Services: Populated by services. Used by applications. How: specific calls.

Enterprise Data Platform

EDW: Structured Data, SQL, Interactive Relational Analysis

Hadoop: Semi-Structured Data, Java/Scala…, Batch and Ad-Hoc Algorithmic Analysis

Spectrum: Relational Analysis to Programmatic Exploration

Users along the spectrum: Analysts/BU PM/Executives/Tools; Analysts/Scientists/Tools; Scientists/Engineers

Agile Data Warehouse


Simplify Access to DW

Cross Platform VDMs

Geo Distributed Caches

Data Virtualization

Apache Kylin: Extreme OLAP Engine


[Architecture diagram. Labels: 3rd Party App (Web App, Mobile…); SQL-Based Tool (BI Tools: Tableau…); REST API; JDBC/ODBC; REST Server; Query Engine; Routing; Metadata; Cube Build Engine; Hadoop Hive (Star Schema Data); OLAP Cube (HBase, Key Value Data); Data Cube; SQL; Low Latency - Seconds; Mid Latency - Minutes]

MOLAP Cubes

ANSI SQL on Hadoop

Interactive Query on Billions of Rows
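The MOLAP idea behind this slide is to pre-aggregate measures for combinations of dimensions so that a query reads a small precomputed result instead of scanning raw rows. The toy sketch below illustrates that principle only. It is not Kylin's implementation, and the function names, column names, and sample rows are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_cube(rows, dimensions, measure):
    """Pre-aggregate SUM(measure) for every subset of the dimensions."""
    cube = {}
    for r in range(len(dimensions) + 1):
        for dims in combinations(dimensions, r):
            cuboid = defaultdict(float)
            for row in rows:
                key = tuple(row[d] for d in dims)
                cuboid[key] += row[measure]
            cube[dims] = dict(cuboid)
    return cube


def query(cube, dimensions, group_by):
    """Answer a GROUP BY from the precomputed cuboid, without touching rows."""
    # Keep the dimension order used at build time so the lookup key matches.
    dims = tuple(d for d in dimensions if d in group_by)
    return cube[dims]


# Hypothetical fact rows in a star-schema style.
dimensions = ["site", "category"]
rows = [
    {"site": "US", "category": "Electronics", "gmv": 100.0},
    {"site": "US", "category": "Fashion", "gmv": 50.0},
    {"site": "DE", "category": "Electronics", "gmv": 70.0},
]
cube = build_cube(rows, dimensions, "gmv")
print(query(cube, dimensions, ["site"]))
```

The trade-off is build time and storage: the number of dimension subsets grows as 2^n for n dimensions, which is why cube engines generally need ways to limit which combinations are materialized.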


Data Streams Platform

[Diagram labels: Apache Kafka; streams (Behavior, TXN, User, Risk); Streaming Apps; Stream Processing Clusters (Sandbox, Staging, Pool1, Pool2, Pool3); Open; Sample Streams; Needs Approval; Rheos App Manager (Configuration, Deployment); GitHub; Tora ETL; Connectors (Hadoop, Teradata)]

Real-time Representation of EDW Datasets

Centralized Shared Data Streams

DW populated using the Streams Platform


Q3 2015

Data Services

Data Search & Discovery

Collaborative Analytics

Data Governance

Spades – Spark Provisioning & Deployment Environment

• Execution environment tailored for Spark

• Governance of Big Data Apps with a PaaS layer

• Application life cycle spanning development to deployment

Data Pipelines

Behavioral Data Pipeline

[Diagram labels: eBay Visitors; Application Servers; CAL; Central App Logging; Ingest Servers; Listeners; Hadoop; Singularity; Analytics Platform & Delivery; Analysts]

Behavioral Data: A/B Testing

Transactional Data Pipelines

• Initial pattern

• More data available faster on Hadoop

• Leverage Hadoop SQL / Spark

• Subsequent pattern

• >10% datasets available on Hadoop

• New innovation avenues via open source

• Give humans more Teradata capacity

• Teradata data available no later than before

• <95% extract / daily batch

• >5% stream / frequent batch
