Leveraging Hadoop to mine customer insights in a developing market

Leveraging Hadoop in Wikimart. Roman Zykov, Head of analytics, http://wikimart.ru. London, Big Data World Europe, 20th September 2012

description

I spoke at the Big Data World Europe conference in London on 18th September 2012: http://www.terrapinn.com/2012/big-data-world-europe/

The full text of the speech is at http://webkpis.com/2012/11/hadoop-implementation-in-wikimart/

• Incorporating Hadoop technology within your infrastructure to cut costs and increase the scale of your operations

• Understanding how Hadoop can provide insightful data analysis to the end user

• Combining Hadoop with existing enterprise systems to deepen your insight and discover previously hidden trends

• Will Hadoop replace the need for relational data warehousing systems?

Transcript of Leveraging Hadoop to mine customer insights in a developing market

Page 1: Leveraging Hadoop to mine customer insights in a developing market

Leveraging Hadoop in Wikimart Roman Zykov

Head of analytics http://wikimart.ru

London, Big Data World Europe, 20th September 2012

Page 2: Leveraging Hadoop to mine customer insights in a developing market

Key problem

To be or not to be….

Hadoop

Introduction

Page 3: Leveraging Hadoop to mine customer insights in a developing market

Key tasks for Wikimart

What

• BI tasks

• Web analytics (in-house solution)

• Recommendations on site

• Data services for marketing

Who

• Core analytics team

• Analytics members in other departments

• IT site operations

Page 4: Leveraging Hadoop to mine customer insights in a developing market

Problem

Too time-consuming or too expensive?

• Data volume

• # of data services

Page 5: Leveraging Hadoop to mine customer insights in a developing market

Map Reduce

[Diagram labels: DATA, Standalone, Map Reduce]
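The Map Reduce model named on this slide can be illustrated with a minimal single-process sketch. It is not taken from the talk: the page-view records and the function names (map_phase, shuffle, reduce_phase, run_job) are assumptions made for illustration, and a real Hadoop job would spread the same three steps across cluster nodes.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Map: emit a (product_id, 1) pair for one page-view record."""
    user_id, product_id = record
    yield product_id, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single total."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle and reduce in sequence on an in-memory list."""
    pairs = chain.from_iterable(map_phase(r) for r in records)
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Hypothetical page-view log: (user_id, product_id)
views = [("u1", "tv"), ("u2", "tv"), ("u1", "phone"), ("u3", "tv")]
view_counts = run_job(views)
print(view_counts)
```

Because each map call sees one record and each reduce call sees one key, both phases can be parallelised without changing the logic.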

Page 6: Leveraging Hadoop to mine customer insights in a developing market

Our idea

New platform for “Big Data” tasks only

• Start research on Map Reduce software

• First patient - recommendation engine

Difficulties

- no planned budget -> Hadoop is free

- no experts -> learn it

- no hardware -> virtual cluster

Page 7: Leveraging Hadoop to mine customer insights in a developing market

Requirements for Hadoop

• Easily scalable

• Easy deployment

• Easy integration

• Less low-level Java coding

• SQL-like queries

Page 8: Leveraging Hadoop to mine customer insights in a developing market

Data flow

Data feeds DWH

Page 9: Leveraging Hadoop to mine customer insights in a developing market

Accomplishments

Recommendations

• Collaborative filtering (item-to-item on browsing history, Pig)

• Similar products (item attributes, Pig)

• Most popular items (browsing history + orders, HiveQL)

• Internal and external search recommendations (HiveQL)
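Item-to-item collaborative filtering on browsing history can be sketched as counting how often two products are viewed in the same session. The slides say Wikimart built this in Pig; the Python below is only an illustration of the idea, with hypothetical session data and a hypothetical item_to_item helper, and it uses raw co-view counts rather than any particular similarity measure from the talk.

```python
from collections import Counter, defaultdict
from itertools import combinations


def item_to_item(sessions, top_n=2):
    """Return, for each item, the items most often viewed in the same session."""
    co_views = defaultdict(Counter)
    for items in sessions.values():
        # Count each unordered pair once per session, in both directions.
        for a, b in combinations(sorted(set(items)), 2):
            co_views[a][b] += 1
            co_views[b][a] += 1
    return {
        item: [other for other, _ in counts.most_common(top_n)]
        for item, counts in co_views.items()
    }


# Hypothetical browsing history: session id -> viewed product ids
sessions = {
    "s1": ["tv", "hdmi_cable", "soundbar"],
    "s2": ["tv", "hdmi_cable"],
    "s3": ["phone", "case"],
    "s4": ["tv", "soundbar", "hdmi_cable"],
}
recommendations = item_to_item(sessions)
print(recommendations["tv"])
```

The pair counting is the expensive step: the number of pairs grows much faster than the input, which is one way intermediate data can become large even when the source data is modest.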

Some statistics after 1 year

• >10% of revenue

• 3 months to launch

• Tens of gigabytes are processed daily, in about 2 hours

• 1 crash only (cluster lost power)

Decision: invest in a hardware cluster

Page 10: Leveraging Hadoop to mine customer insights in a developing market

End user

Internal high-level languages

• HiveQL

• Pig

Reporting

• Pre-aggregated data for OLAP

• RDBMS - front end

• OLAP and reporting software should support HiveQL
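Pre-aggregating data for OLAP means rolling raw events up to the grain the reports need before loading them into the front-end database. A minimal sketch, assuming hypothetical (date, category, revenue) events and a hypothetical pre_aggregate helper; in the setup described here the equivalent roll-up would be written in HiveQL or Pig.

```python
from collections import defaultdict

# Hypothetical raw events: (date, category, revenue)
events = [
    ("2012-09-01", "tv", 300.0),
    ("2012-09-01", "tv", 450.0),
    ("2012-09-01", "phone", 200.0),
    ("2012-09-02", "tv", 500.0),
]


def pre_aggregate(rows):
    """Roll raw events up to one row per (date, category)."""
    totals = defaultdict(lambda: {"orders": 0, "revenue": 0.0})
    for date, category, revenue in rows:
        cell = totals[(date, category)]
        cell["orders"] += 1
        cell["revenue"] += revenue
    return dict(totals)


daily = pre_aggregate(events)
print(daily[("2012-09-01", "tv")])
```

Only the small aggregated table needs to reach the RDBMS, so report queries stay fast regardless of how large the raw event log is.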

Page 11: Leveraging Hadoop to mine customer insights in a developing market

Data Integration

• Sqoop

• Parallel data exchange with RDBMS (MS SQL, MySQL, Oracle, Teradata…)

• Incremental updates

• HDFS, Hive, HBase

• Talend Open Studio
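The incremental updates mentioned for Sqoop rest on a simple idea: remember the highest value of a check column already imported and fetch only rows above it on the next run. The sketch below illustrates that idea only and is not Sqoop's implementation; the in-memory SQLite database, the orders table, and the incremental_import helper are all assumptions standing in for a real source RDBMS.

```python
import sqlite3


def incremental_import(conn, last_value):
    """Fetch only rows whose id is above the last imported id."""
    rows = conn.execute(
        "SELECT id, product FROM orders WHERE id > ? ORDER BY id",
        (last_value,),
    ).fetchall()
    # Keep the old watermark when nothing new has arrived.
    new_last = rows[-1][0] if rows else last_value
    return rows, new_last


# In-memory SQLite stands in for the source RDBMS (an assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, product TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, "tv"), (2, "phone")])

# First run imports everything and records the watermark.
first_batch, last_id = incremental_import(conn, last_value=0)

# A new order arrives; the second run picks up only that row.
conn.execute("INSERT INTO orders VALUES (3, 'soundbar')")
second_batch, last_id = incremental_import(conn, last_value=last_id)
print(second_batch)
```

This approach works for append-only tables with a monotonically increasing key; tables whose rows are updated in place need a last-modified timestamp as the check column instead.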

Page 12: Leveraging Hadoop to mine customer insights in a developing market

Hadoop vs RDBMS

• Will never replace RDBMS:

• Latency

• Weak capabilities of HiveQL vs SQL

• Suited only to some offline processing tasks:

• Machine learning

• Queries on big tables

• ….

• Real time: NoSQL

Page 13: Leveraging Hadoop to mine customer insights in a developing market

Hadoop myth

Terabytes?

Petabytes?

Big tasks!

Page 14: Leveraging Hadoop to mine customer insights in a developing market

Conclusion

• Hadoop is not rocket science

• Intermediate data can be Big Data

Starter kit

• Hadoop management system

• Virtual hardware (cloud, virtual servers, etc.)

• Offline data tasks

• Pig or HiveQL

• Sqoop: import data from existing data sources

Page 15: Leveraging Hadoop to mine customer insights in a developing market

Thank you!!!


linkedin.com/in/romanzykov