Apache Hadoop India Summit 2011 talk "Hadoop Avatar at eBay" by Srinivasan Rengarajan and Mohit Soni

Post on 16-Apr-2017



Avatar at eBay

Srinivasan Rengarajan (srengarajan@ebay.com)

Mohit Soni (mosoni@ebay.com)

Courtesy: Anil Madan (amadan@ebay.com)


• 2007: Research team builds a 4-node cluster
– Subset of Click Stream and EDW data
– Innovation with Mobius Query Language
– Visualization and click path analysis

• 2009 Sept: Search clusters
– Machine Learning Ranking cluster of 28 nodes
– Search relevance cluster of 10 nodes
– Subset of Click Stream and EDW data

• 2010 May: Athena* exploratory cluster of 532 nodes
– Platform teams join hands with Search/Research to build a larger cluster
– Build it as a core competency for advanced insights into complex data
– Rapid build-out, with timelines pulled in by a couple of months

* Athena is the goddess of civilization, wisdom, strength, strategy, craft, justice and skill in Greek mythology.

MIT's Athena ushered the world into a new era of distributed systems when it started in the mid-80s.


Infrastructure


• Enterprise Nodes
– Sun 64-bit, Red Hat Linux
– 2 Quad Core Nehalem, 72GB RAM, 4TB
– Servers
    • NameNode(s)
    • Job Tracker
    • Zookeeper
    • HBase Master
    • Ganglia Server
    • eBay (Cloudera) HUE

• Data Nodes
– SGI-Rackables, CentOS, 1U, 5.3PB
– 2 Quad Core Nehalem, 36GB RAM, 10TB
– HBase on 20 nodes

• Network
– TOR 1Gbps
– Core switches uplink 40Gbps
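The 5.3PB figure above is consistent with simple arithmetic, sketched here under the assumption (not stated on the slide) that all 532 Athena nodes are data nodes with 10TB of disk each. The replication factor of 3 is the HDFS default, not a value given in the talk.

```python
# Back-of-the-envelope check of the raw storage figure on the slide.
# Assumption: all 532 Athena nodes are data nodes with 10 TB of disk each.
DATA_NODES = 532
DISK_PER_NODE_TB = 10
TB_PER_PB = 1000  # decimal units


def raw_capacity_pb(nodes: int, tb_per_node: float) -> float:
    """Total disk across the cluster, in petabytes."""
    return nodes * tb_per_node / TB_PER_PB


def usable_capacity_pb(raw_pb: float, replication: int = 3) -> float:
    """Space left for unique data when every block is stored `replication` times.

    replication=3 is the HDFS default; the talk does not state the value used.
    """
    return raw_pb / replication


raw = raw_capacity_pb(DATA_NODES, DISK_PER_NODE_TB)
print(f"raw capacity: {raw:.2f} PB")
print(f"unique data at 3x replication: {usable_capacity_pb(raw):.2f} PB")
```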


Ecosystem


Hadoop Core (HDFS,Common)

MapReduce (Java, Streaming, Pipes, Scala)

Data Access (HBase, Pig, Hive)

Tools & Libraries (HUE, UC4, Oozie, Mobius, Mahout)

Monitoring & Alerting (Ganglia, Nagios)

• MapReduce
– Sourcing data: primarily Java
– Applications using Perl, Scala, Python…

• Data Access Frameworks
– HBase – for EDW data
– Pig – data pipelines
– Hive – ad hoc queries
– MQL – Mobius Query Language

• Monitoring & Alerting
– Ganglia, Nagios

• Tools
– HUE/Mobius – lifecycle of user jobs
– UC4 – scheduling
– Oozie – user workflow and data pipelines
– Mahout – data mining

Administration

• Groups
– Built to support multiple groups
– Job invocation uses the group name
– Fair Scheduler
    • Allocations based on investment
    • Weights
    • Minimum share of mappers and reducers
    • poolMaxJobsDefault
    • userMaxJobsDefault
    • defaultMinSharePreemptionTimeout
    • fairSharePreemptionTimeout

• Auth & Auth
– HUE – custom module to use corp. credentials
– CLI* – PAM custom module
– Security* – Implement token interface to replace Kerberos with SAML

* Work in Progress
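The Fair Scheduler settings listed above normally live in an XML allocations file. The sketch below generates such a file with one pool per group. The pool names and every numeric value are hypothetical (the slides give none), and the element names follow the MapReduce v1 Fair Scheduler allocation format as best recalled, so they should be checked against the documentation for the Hadoop version in use.

```python
import xml.etree.ElementTree as ET

# Hypothetical pools (one per group) and values; the slides do not list the
# real allocations. A larger weight stands for a larger investment.
POOLS = {
    "search": {"weight": 2.0, "minMaps": 200, "minReduces": 100},
    "research": {"weight": 1.0, "minMaps": 100, "minReduces": 50},
}

# Cluster-wide settings named on the slide; the values are placeholders.
CLUSTER_DEFAULTS = {
    "poolMaxJobsDefault": 20,
    "userMaxJobsDefault": 5,
    "defaultMinSharePreemptionTimeout": 300,  # seconds
    "fairSharePreemptionTimeout": 600,  # seconds
}


def build_allocations(pools: dict, defaults: dict) -> str:
    """Return the allocations document as an XML string."""
    root = ET.Element("allocations")
    for name, settings in pools.items():
        pool = ET.SubElement(root, "pool", name=name)
        for key, value in settings.items():
            ET.SubElement(pool, key).text = str(value)
    for key, value in defaults.items():
        ET.SubElement(root, key).text = str(value)
    return ET.tostring(root, encoding="unicode")


xml_text = build_allocations(POOLS, CLUSTER_DEFAULTS)
print(xml_text)
```

Generating the file from a small table like `POOLS` keeps the link between a group's investment and its weight and minimum share in one reviewable place.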

Data Sourcing Patterns


[Diagram labels: Click Stream, EDW, Images; Search Indices, Analytics Reporting, Algorithmic Models; Acquisition, Description]

• Source: Click Stream (Session, Event, Session Container)
– Preparation/Format: Session/Event data streamed as LZO/Text; Session Container generates SequenceFiles
– Pattern, Session/Event data: build an index and use LzoTextInputFormat for splits, based on the work done by Johan Oskarsson/Twitter
– Pattern, Session Container: 'Value to Type Conversion' pattern; secondary sort with reduce-side join

• Source: EDW (Item, Transaction, User, Feedback, Bids)
– Preparation/Format: streamed as GZIP/Text; generate SequenceFile/HBase snapshot from the previous day's snapshot and the current day's data; Hive StorageHandlers point to the SequenceFile/HBase snapshot
– Pattern: TotalOrderPartitioner with RandomSamplers to identify partition ranges for reducers; create HBase regions using HFile; update RegionServers using the Ruby script loadtable.rb
– Concerns: HBase append performance, HFile flush (HBASE-1923)
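Two of the patterns named above, secondary sort with a reduce-side join and sampled range partitioning in the style of TotalOrderPartitioner, can be illustrated without a cluster. The following is a plain-Python simulation, not Hadoop code and not the authors' implementation; the record layouts and function names are hypothetical.

```python
import bisect
import random
from itertools import groupby


# --- Secondary sort with a reduce-side join --------------------------------
# Composite key (session_id, tag, timestamp): tag 0 marks the session record,
# tag 1 marks events, so sorting delivers the session before its events and
# the events in timestamp order.
def map_records(sessions, events):
    for sid, info in sessions:
        yield (sid, 0, 0), info
    for sid, ts, name in events:
        yield (sid, 1, ts), name


def reduce_side_join(sessions, events):
    # sorted() stands in for the shuffle; groupby() groups on session_id only.
    shuffled = sorted(map_records(sessions, events), key=lambda kv: kv[0])
    joined = []
    for sid, group in groupby(shuffled, key=lambda kv: kv[0][0]):
        session_info = None
        for (_, tag, _), value in group:
            if tag == 0:
                session_info = value
            elif session_info is not None:  # inner join: skip orphan events
                joined.append((sid, session_info, value))
    return joined


# --- Range partitioning from a random sample -------------------------------
def sample_split_points(keys, num_reducers, sample_size, seed=0):
    """Pick num_reducers - 1 boundaries from a random sample of the keys."""
    rng = random.Random(seed)
    sample = sorted(rng.sample(keys, min(sample_size, len(keys))))
    step = len(sample) / num_reducers
    return [sample[int(step * i)] for i in range(1, num_reducers)]


def partition_for(key, split_points):
    """Map a key to a reducer index so that reducer i only sees keys that
    sort before everything sent to reducer i + 1."""
    return bisect.bisect_right(split_points, key)


sessions = [("s1", "US"), ("s2", "IN")]
events = [("s2", 5, "bid"), ("s1", 9, "view"), ("s1", 3, "search")]
print(reduce_side_join(sessions, events))

splits = sample_split_points(list(range(1000)), num_reducers=4, sample_size=100)
print(splits, partition_for(500, splits))
```

Because every reducer receives one contiguous key range, concatenating the reducer outputs in order yields a globally sorted result, which is the property that makes it possible to write pre-sorted files per HBase region.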

Search Use Case – Machine Learned Ranking


[Diagram labels: ClickStream, Items, Users, Feedback; Classifiers; Ranking Function; Great Search Results]

• Goal
– Enhance search relevance for eBay's items.

• Hadoop Usage
– Build a ranking function that takes multiple factors into account, like price, listing format, seller track record, relevance.
– Ability to add new factors to validate hypotheses
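As a rough illustration of a ranking function that combines several factors and lets new ones be added to test a hypothesis, here is a minimal weighted-sum sketch. The factor definitions, field names, and weights are all hypothetical; in a machine-learned ranker the weights would be fitted from data rather than set by hand, and the talk does not describe the model actually used.

```python
from typing import Callable, Dict, List

Item = Dict[str, float]
Factor = Callable[[Item], float]

# Hypothetical factors; each maps an item to a score where higher is better.
FACTORS: Dict[str, Factor] = {
    "relevance": lambda item: item["relevance"],
    "price": lambda item: 1.0 / (1.0 + item["price"]),  # cheaper scores higher
    "seller_track_record": lambda item: item["seller_rating"],
    "listing_format": lambda item: item["is_fixed_price"],
}

# Hand-set placeholder weights; a learned ranker would fit these.
WEIGHTS: Dict[str, float] = {
    "relevance": 0.6,
    "price": 0.1,
    "seller_track_record": 0.2,
    "listing_format": 0.1,
}


def score(item: Item, factors: Dict[str, Factor], weights: Dict[str, float]) -> float:
    """Weighted sum of all factor scores for one item."""
    return sum(weights[name] * factor(item) for name, factor in factors.items())


def rank(items: List[Item], factors: Dict[str, Factor], weights: Dict[str, float]) -> List[Item]:
    """Items ordered from best to worst score."""
    return sorted(items, key=lambda it: score(it, factors, weights), reverse=True)


item_a = {"id": 1, "relevance": 0.9, "price": 20.0, "seller_rating": 0.5,
          "is_fixed_price": 1.0, "free_shipping": 0.0}
item_b = {"id": 2, "relevance": 0.6, "price": 10.0, "seller_rating": 0.99,
          "is_fixed_price": 0.0, "free_shipping": 1.0}

baseline = rank([item_a, item_b], FACTORS, WEIGHTS)

# Testing a hypothesis means adding one factor and comparing the rankings.
trial_factors = {**FACTORS, "free_shipping": lambda item: item["free_shipping"]}
trial_weights = {**WEIGHTS, "free_shipping": 0.5}
trial = rank([item_a, item_b], trial_factors, trial_weights)

print([it["id"] for it in baseline], [it["id"] for it in trial])
```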

Research Use Case – Description Data Mining

• Goal
– Extend catalog coverage

• Hadoop Usage
– Leverage data mining/machine learning techniques to turn inventory descriptions into name-value pairs in a completely unsupervised way


BARBIE 1999 "PREMIERE NIGHT"
Home Shopping Special Edition
Gorgeous Doll With Beautiful Blond Hair / In A Gown Of Purple And Silver
New / Never Removed From Box / Doll Is In Mint Condition / Remember This Beauty Is 11 Years Old
Free Shipping To US Only / Will Ship International / Please E-mail For Cost
Feel Free To Ask Me Any Questions Or Concerns
Smoke - Free Environment
Free Shipping

Year: 1999
Model: premiere night
Edition: home shopping special
Hair: blond
Gown: purple and silver
Condition: new / never removed from box / mint
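To make the target output concrete, the sketch below pulls a few name-value pairs out of a description with hand-written rules. This is deliberately not the unsupervised technique the talk refers to, whose details are not given; it is a rule-based stand-in that only shows the shape of the input and output. The vocabulary and patterns are hypothetical.

```python
import re

HAIR_COLORS = ("blond", "brunette", "red", "black")  # hypothetical vocabulary


def extract_pairs(description: str) -> dict:
    """Rule-based stand-in for the extraction step (not the unsupervised method)."""
    text = description.lower()
    pairs = {}

    # A four-digit number starting with 19 or 20 is taken to be the year.
    year = re.search(r"\b(?:19|20)\d{2}\b", text)
    if year:
        pairs["Year"] = year.group(0)

    # A known color word directly before "hair".
    hair = re.search(r"\b(" + "|".join(HAIR_COLORS) + r")\s+hair\b", text)
    if hair:
        pairs["Hair"] = hair.group(1)

    # One fixed phrase for condition.
    if "never removed from box" in text:
        pairs["Condition"] = "never removed from box"

    return pairs


listing = (
    'BARBIE 1999 "PREMIERE NIGHT" Gorgeous Doll With Beautiful Blond Hair / '
    "New / Never Removed From Box / Remember This Beauty Is 11 Years Old"
)
print(extract_pairs(listing))
```

Rules like these break as soon as sellers phrase things differently, which is one reason to prefer a learned, unsupervised approach over hand-maintained patterns at catalog scale.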

Platform – Details
• Metrics – Job Statistics, System/Disk Consumption, Utilization
• Infrastructure – Publish/Subscribe ETL tools, low latency data movement
• Development – Tools, Environment, IDE
• Architecture – Schemas, Metadata, Governance, Policies
• Operations – Administration, Configuration, Monitoring
• Reporting – Visualization, BI Generation, Information delivery
• Security – User & Group Management, Auth & Auth


Clusters – Details
• Exploratory – Strategic investment, 1000-5000 nodes
• Production – Site facing, low latency, high availability
• Use Case Specific – Advertising, Trust & Safety, Merchandizing


Acknowledgments

• Athena Team

• Cloudera Inc.

• Community