Data Science and Machine Learning for eCommerce and Retail
Dr. Andrei Lopatenko Director of Engineering,
Recruit Institute of Technology Recruit Holdings
formerly Walmart Labs, Google (twice), Apple (twice)
ML for eCommerce
• Search and browse for commerce sites and applications
• Help users to find and discover items they will purchase
• Maximize revenue/profit per user session
Search data size
• Catalogue items
• 8 M items now (compare ~400 M at Amazon / eBay)
• ×10 in the near future
• 2 K of text description per item, plus images
• Several hundred structured attributes per catalog
Search – user searches
• Tens of millions per day
• Tens of billions of sessions per year
• Online sales: 13.2 B per year (http://fortune.com/2015/11/17/walmart-ecommerce/)
• Offline store sales: 500 B per year (8% of the USA economy) in ~11 K stores
• Number of transactions: ~10 B (public data)
ML addressable problems
• Learning to rank
• Given a query, what is the list of items with the highest probability of conversion (purchase), ATC (add to cart), or page view?
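A minimal sketch of the pointwise view of this problem, assuming a model has already produced, for each candidate item, a predicted probability of purchase, add to cart, and page view for the query. The item IDs, the probabilities, and the weights in `WEIGHTS` are hypothetical values chosen for illustration; they are not from the talk.

```python
from dataclasses import dataclass

# Hypothetical business weights per outcome (an assumption, not from the talk).
WEIGHTS = {"purchase": 1.0, "atc": 0.3, "view": 0.05}

@dataclass
class Candidate:
    item_id: str
    p_purchase: float  # predicted P(purchase | query, item)
    p_atc: float       # predicted P(add to cart | query, item)
    p_view: float      # predicted P(page view | query, item)

def score(c: Candidate) -> float:
    """Weighted sum of the predicted outcome probabilities."""
    return (WEIGHTS["purchase"] * c.p_purchase
            + WEIGHTS["atc"] * c.p_atc
            + WEIGHTS["view"] * c.p_view)

def rank(candidates):
    """Return candidates ordered by descending score."""
    return sorted(candidates, key=score, reverse=True)

candidates = [
    Candidate("tv-a", 0.02, 0.05, 0.30),
    Candidate("tv-b", 0.05, 0.08, 0.20),
    Candidate("tv-c", 0.01, 0.02, 0.60),
]
ranked_ids = [c.item_id for c in rank(candidates)]
```

With these made-up numbers, the item with the highest purchase probability ranks first even though another item attracts more page views; changing the weights changes which outcome the ranking favours.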
ML addressable problems
• Typeahead
• Given a sequence of characters typed by the user, what are the most probable completions, and what are the most probable items the user wants to buy?
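One simple baseline for the completion half of this problem is to suggest the most frequent logged queries that start with the typed prefix. The sketch below assumes a small in-memory query log; the log contents and the function names `build_index` and `complete` are hypothetical. A production system would more likely use a trie or a search index, and would also model the probability of purchase.

```python
from collections import Counter

def build_index(query_log):
    """Count how often each full query appears in the log."""
    return Counter(q.strip().lower() for q in query_log)

def complete(prefix, index, k=3):
    """Return up to k most frequent logged queries starting with prefix."""
    prefix = prefix.lower()
    matches = [(q, n) for q, n in index.items() if q.startswith(prefix)]
    # Sort by descending frequency, then alphabetically for stable output.
    matches.sort(key=lambda qn: (-qn[1], qn[0]))
    return [q for q, _ in matches[:k]]

# Hypothetical query log.
log = ["samsung tv", "samsung tv", "samsung phone",
       "sandals", "samsung tv", "sandals"]
index = build_index(log)
suggestions = complete("sam", index)
```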
ML addressable problems
• Spell correction
• Given a user query, what is the query the user actually wanted to type?
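A common baseline for this task combines edit distance with word frequency: generate every string one edit away from the typed word and keep the most frequent one that appears in a known vocabulary. The sketch below follows that idea under simplifying assumptions (single words, lowercase ASCII, at most one edit). The vocabulary and its counts are hypothetical, and this is not necessarily the method used in the systems the talk describes.

```python
from collections import Counter
import string

def edits1(word):
    """All strings one edit (delete, transpose, replace, insert) away from word."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct(word, vocab):
    """Most frequent known word within one edit; the word itself if none is found."""
    if word in vocab:
        return word
    candidates = [w for w in edits1(word) if w in vocab]
    return max(candidates, key=lambda w: vocab[w]) if candidates else word

# Hypothetical vocabulary with query-log frequencies.
vocab = Counter({"samsung": 900, "sandals": 400, "tshirt": 700})
fixed = correct("samsnug", vocab)
```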
ML addressable problems
• Cold start
• Given a new item with its set of attributes and no history of sales or exposure on the site, predict item sales and item sales per query
ML addressable problems
• Prediction of LHN (left-hand navigation)
• Given a user query, what is the best set of facets and facet values, i.e. the set with the highest probability of users interacting with them and finally buying an item?
ML addressable problems
• Query understanding
• Given a query, build a semantic parse of the query and tag tokens with attributes: blue tshirts for teenagers -> blue:color tshirts:type for:opt teenagers:agerestriction10-20
• Classification: blue tshirts for teenagers -> type:apparel, price preference: 10-30, releaseyearpreference: 2014-2016
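The token tagging step can be illustrated with a lookup baseline that reproduces the example parse for "blue tshirts for teenagers". The lexicon, the stopword list, and the function name `tag_query` are hypothetical; a real query-understanding system would more likely use a trained sequence model, with a lexicon like this as one feature source.

```python
# Hypothetical lexicon mapping tokens to attribute tags (illustration only).
LEXICON = {
    "blue": "color",
    "red": "color",
    "tshirts": "type",
    "shoes": "type",
    "teenagers": "agerestriction10-20",
}
# Function words that carry no attribute are tagged "opt", as in the example.
STOPWORDS = {"for", "with", "the"}

def tag_query(query):
    """Tag each whitespace-separated token with an attribute, 'opt', or 'unknown'."""
    tagged = []
    for token in query.lower().split():
        if token in LEXICON:
            tagged.append((token, LEXICON[token]))
        elif token in STOPWORDS:
            tagged.append((token, "opt"))
        else:
            tagged.append((token, "unknown"))
    return tagged

parse = tag_query("blue tshirts for teenagers")
```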
ML addressable problems
• Related searches
• Given a query, which queries are either semantically close to it or represent co-occurring user interests?
• Nike shoes -> adidas shoes, sport shoes
• Coffee mugs -> travel mugs, photo coffee mugs, cappuccino cups
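One way to approximate the "co-occurring interests" side of related searches is to count which queries appear together in the same user session. The sketch below assumes sessions are available as lists of query strings; the session data and the function names are hypothetical. It does not address semantic closeness, which would need a separate signal such as query embeddings.

```python
from collections import Counter, defaultdict
from itertools import permutations

def build_cooccurrence(sessions):
    """For each query, count how often every other query shares a session with it."""
    co = defaultdict(Counter)
    for session in sessions:
        # set() so a query repeated within one session is counted once.
        for a, b in permutations(set(session), 2):
            co[a][b] += 1
    return co

def related(query, co, k=2):
    """Top-k co-occurring queries, most frequent first (ties broken alphabetically)."""
    ranked = sorted(co[query].items(), key=lambda qn: (-qn[1], qn[0]))
    return [q for q, _ in ranked[:k]]

# Hypothetical search sessions.
sessions = [
    ["nike shoes", "adidas shoes"],
    ["nike shoes", "sport shoes"],
    ["nike shoes", "adidas shoes", "sport shoes"],
    ["coffee mugs", "travel mugs"],
]
co = build_cooccurrence(sessions)
nike_related = related("nike shoes", co)
```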
ML addressable problems
• Product discovery
• help users explore the product assortment
• drive users to diverse products
• reduce the risk of selecting irrelevant items
• help find price, quality, brand, etc. alternatives
• reduce pigeonhole risk
• provide relevant data to make a decision
ML addressable problems
• Image similarity
• Given images of an item, return other items whose images are visually appealing to users who like the original item (appealing by shape? color? texture?), leading to high conversion in recommendations
ML addressable problems
• Voice search
• Given voice input, reply with a list of the best items
• “what are the cheapest samsung tvs in the store”
• “what is best deal on queen bed today?”
ML addressable problems
• Extraction of item attributes
• Given an item, what are its attributes: brand, color, size (wheel, screen, height, S/M/XL, Queen/Twin/King/Full), gender, pattern, shape, features
ML addressable problems
• Representations of users
• actions on websites/apps -> searches, clicks, browsing behaviour
• product -> purchase preferences, reviews, ratings, return rates
ML addressable problems
• Title generation: how to generate the title that yields the maximum conversion rate
• Which product attributes to select for the title?
Limits
• Most models should be served in production
• 50 ms per prediction
• Part of a big system, memory limit ~10 G
Retail
• Key directions which require machine learning:
• discounting tools
• coupons and rewards
• loyalty
• inventory management
Inventory management
• Customers want to buy products
• Customers have diverse needs
• Products should be in stock, ideally in warehouses close to customers
• but it is expensive to store products
• Problem: how many products of each type should be stored, and when should product supply be refilled?
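The "when to refill" question is often approached with the textbook reorder-point formula: reorder when stock falls to the expected demand over the supplier lead time plus a safety stock. The sketch below assumes daily demand is independent and roughly normal; the demand figures and lead time are hypothetical, and `z=1.65` corresponds to a service level of roughly 95% under that assumption. This is a standard baseline, not a method stated in the talk.

```python
import math

def reorder_point(daily_demand_mean, daily_demand_std, lead_time_days, z=1.65):
    """Expected demand over the lead time plus safety stock.

    safety stock = z * std of daily demand * sqrt(lead time in days)
    """
    expected = daily_demand_mean * lead_time_days
    safety = z * daily_demand_std * math.sqrt(lead_time_days)
    return expected + safety

# Hypothetical item: 40 units/day on average, std 10, 4-day lead time.
rop = reorder_point(daily_demand_mean=40, daily_demand_std=10, lead_time_days=4)
```

With these numbers the expected lead-time demand is 160 units and the safety stock is about 33 units, so an order would be triggered at roughly 193 units on hand.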
Customer intelligence
• Retail
• analyze sales data, find anomalies, explain them
• low sales of umbrellas during the last month in Northern California stores
• No rain? (integration with external data about weather conditions)
• Seasonal / the same as last year / time series
• Competitors
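A first-pass check for an anomaly like the umbrella example is to compare the latest month against the same calendar month in earlier years, which accounts for seasonality in a crude way. The sketch below uses a z-score with a threshold of 3; the sales figures, the threshold, and the function name are hypothetical. Explaining the anomaly (weather, competitors) would require joining external data, which is not shown.

```python
from statistics import mean, stdev

def seasonal_zscore(current, same_month_history):
    """Standard score of the current value against the same month in prior years."""
    mu = mean(same_month_history)
    sigma = stdev(same_month_history)  # sample standard deviation
    if sigma == 0:
        return 0.0  # no variation in history, so no basis for a score
    return (current - mu) / sigma

# Hypothetical umbrella unit sales for the same month in four prior years.
history = [1000, 1100, 900, 1000]
z = seasonal_zscore(400, history)
is_anomaly = abs(z) > 3
```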
Fraud detection
• identify fraudulent transactions online
• Hundreds of fraud schemes detected daily
• Global retail shrinkage was $119 billion in 2011, an average of 1.45% of retail sales.
• from stolen credit cards to replaced price tags and price discounts given by high-level managers to achieve personal goals
Propensity Modeling for Marketing Campaigns
• build effective email/Facebook/Google ads campaigns addressing the proper customer at the proper time at the proper cost
• behavior-based customer segmentation and clustering with demographic, lifestyle, and attitudinal information
Online Grocery
• which items can be replaced by other items, and by which items
• data are individual purchases in chain grocery stores, drug stores, and online grocery shopping
• the problem: find which items can be replaced by other items if they are not in the store, in order to fulfill the order
Dynamic pricing
• define the best price
• continuously scrape prices of competitors, predict demand by price, know the expenses
• online commerce sites change prices every 10 minutes
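The three inputs listed above (competitor prices, predicted demand by price, and expenses) can be combined by choosing the candidate price with the highest expected profit. The sketch below assumes a made-up linear demand curve; the coefficients in `predicted_demand`, the unit cost, and the candidate prices are all hypothetical, and in practice the demand function would be a learned model.

```python
def predicted_demand(price, competitor_price):
    """Hypothetical linear demand: falls with our price, rises with the competitor's."""
    return max(0.0, 100.0 - 8.0 * price + 3.0 * competitor_price)

def expected_profit(price, unit_cost, competitor_price):
    """Margin per unit times predicted units sold."""
    return (price - unit_cost) * predicted_demand(price, competitor_price)

def best_price(candidates, unit_cost, competitor_price):
    """Candidate price with the highest expected profit."""
    return max(candidates,
               key=lambda p: expected_profit(p, unit_cost, competitor_price))

prices = [8.0, 9.0, 10.0, 11.0, 12.0]
choice = best_price(prices, unit_cost=6.0, competitor_price=10.0)
```

Re-running this selection whenever freshly scraped competitor prices arrive is one simple way to picture the frequent price updates mentioned above.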
Challenges
• Data volumes: Walmart transactions, 10 million per day
• Computations: complicated modeling techniques
Data storage
• Volumes of data:
• 10 M transactions per day over 5 years = ~18 billion transactions -> ~1 TB
• Catalog: 500 M items * 2 K each -> ~1 TB
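The estimates above can be checked with a few lines of arithmetic. The sketch reads "1T" as about one terabyte and "2K" as about 2,000 bytes of text per item; both readings are assumptions. The per-transaction size is not given in the slides, so it is derived here from the stated total rather than taken from the source.

```python
TRANSACTIONS_PER_DAY = 10_000_000
YEARS = 5
transactions = TRANSACTIONS_PER_DAY * 365 * YEARS  # 18.25 billion

TB = 10**12  # one terabyte in bytes (decimal)

# Record size implied if all transactions fit in about 1 TB (derived, not stated).
implied_bytes_per_transaction = TB / transactions  # roughly 55 bytes

CATALOG_ITEMS = 500_000_000
BYTES_PER_ITEM = 2_000  # "2K" read as about 2 KB of text per item (assumption)
catalog_bytes = CATALOG_ITEMS * BYTES_PER_ITEM  # exactly 1 TB under this reading

print(f"{transactions:,} transactions, "
      f"~{implied_bytes_per_transaction:.0f} bytes each for 1 TB")
print(f"catalog text: {catalog_bytes / TB:.1f} TB")
```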
Data Storage
• but with video: petabytes of data; RetailNext: 75 PB per year from 30,000+ sensors
• Walmart: 500 PB
• eBay: 40 PB in 2013 (transactions + online behaviours)
Data processing
• Rebuild the model over fresh data:
• typically daily: add daily data (millions of transactions, hundreds of millions of behavior units) to the yearly data store (billions of transactions, hundreds of billions to trillions of behavior units)
• build a model to serve in production the next day
Data processing
• some models, such as fraud detection and dynamic pricing, should be almost online (10-15 minutes)
• built over data such as daily transactions or web crawls of competitors' sites
Serving online
• online commerce WML: thousands to tens of thousands of queries per second at peak times
• complicated ranking and recommendation algorithms
• 50 ms limit
Serving online
• price, in-store availability: millions of requests per second at peak times
• item information: millions of requests per second
• serving stack: Solr/Lucene/Elasticsearch, Cassandra, MongoDB, Oracle, CouchDB, Node.js/Java solutions, etc.
Data processing
• Hadoop / Spark clusters
• a lot of I/O
• HDFS provides the redundancy, so RAID is not necessary; RAID is slow to write, and Hadoop writes a lot
• SAN and NAS are not good either
• so bare metal with DAS (direct-attached storage)
Data Processing
• more servers, cheaper servers
• more small disks are better than fewer large disks
• allocate the cluster 100% to Hadoop
Data processing
• Hadoop masters vs workers
• large clusters: masters with > 64 G RAM, dual Ethernet NICs, dual quad-core CPUs
• workers: 64 G+ memory, SAS 6 Gb/s disk controllers, 2 Ethernet cards, 2 × 6-core processors, 15 M cache; Intel Hyper-Threading and QPI are good to have
Data Processing
• big models, deep learning
• Nvidia DGX-1 and the like
• Pascal GPUs, NVLink interconnect
• Tesla K40 and K80 work pretty well too
• may require a lot of tuning: http://timdettmers.com/2015/03/09/deep-learning-hardware-guide/
• hard to buy: big data solutions are considered profit generators, HPC servers are not
Serving online
• Typically large memory, but not necessarily (for example, Elasticsearch/Solr degrades over 64 G)
• CPUs: more cores rather than faster cores
• Disks: SSD, RAID 0, no NAS; in many conditions, optimize for how easy it is to change drives rather than for SSD endurance
eCommerce example
• Database servers
• Unified hardware platform: from HP
• HP DL line:
• 4 CPU sockets
• 256 GB RAM
• network interfaces
• not much HDD, data is in NAS
eCommerce example
• cloud servers:
• purchased by racks: 40 in a rack
• 2 CPU sockets
• 198 G
• 18-core CPU
• SSD
network requirements
• 1 network card per server: a big mistake; 1 switch per rack
• 3 cards per server, for three typical data flows:
• production
• “administrative” (Docker etc.)
• analytics
example
• application servers vs big data servers
• application servers (Java, Node.js apps):
• 1 TB SSD, RAID 5
• big data servers:
• 5 TB SAS
Questions?