Analytics and Data Mining Industry Overview
gregory-piatetsky-shapiro -
Transcript of Analytics and Data Mining Industry Overview
(c) KDnuggets 2011 1
Analytics Industry Overview: To Big Data and Beyond!
Gregory Piatetsky
www.KDnuggets.com/gps.html
My Data Path
• PhD in applying Machine Learning to databases
• Researcher at GTE Labs – started first project on Knowledge Discovery in Databases in 1989
• Organized first 3 KDD workshops (1989-93), cofounded KDD conferences and ACM SIGKDD
• Chief Scientist at analytics startup 1998-2001
• Chair, SIGKDD, 2005-2009
• Analytics/Data Mining Consultant, 2001-
KDnuggets
• Stands for Knowledge Discovery Nuggets
• 1993 - started KDnuggets News email newsletter (~12,000 email subscribers now)
• early website in 1994, www.KDnuggets.com in 1997
  – 2011 best year, 45,000-50,000 unique visitors/month
• twitter.com/kdnuggets ~3,000 followers
• facebook.com/kdnuggets page
• group: KDnuggets Analytics & Data Mining
• Recently featured on CNN
KDnuggets mission
Cover Analytics and Data Mining field:
• News, Jobs, Software, Data (most popular)
• Also Academic positions, CFP, Companies, Consulting, Courses, Meetings, Polls, Publications, Solutions, Webcasts
• Subscribe to bi-weekly KDnuggets News at www.kdnuggets.com/subscribe.html
Analyzing Data or …
• Statistics
• Data mining
• Knowledge Discovery in Data
• KDD
• Analytics
• Data Science
• …?
Core:
Finding Useful Patterns in Data
History
• Statistics: 1800 –
• Data dredging, data “fishing”: 1960s
• Data Mining: 1980 –
• Database Mining ~ 1985 (was HNC trademark, not used)
• Knowledge Discovery in Data: 1989 –
  – KDD workshop in 1989
• Analytics: 2006 –
  – Google Analytics, “Competing on Analytics” book
• Data Science: 2010 –
Pre-history
From Google Ngram viewer – English language books
Note: Our analysis uses only English language data. Other languages, especially Chinese, need to be considered for the full picture.
Statistics is the biggest term in the 20th century, but data mining and analytics appear in the late 1990s
Recent History: Analytics, Data Mining, Knowledge Discovery
Analytics has been used since 1800, but started to rise in 2005.
Data Mining jumps around 1996 (soon after first KDD conference) but declines after 2003 (TIA controversy, associated with gov. invasion of privacy).
Knowledge Discovery appears in 1989, jumps in 1996, and plateaus after 2000.
Google N-gram Results case sensitive
Different capitalizations change counts, but using lowercase is probably appropriate to measure general popularity.
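A minimal sketch of why capitalization changes n-gram counts. It does not use the Google Ngram data or any Ngram API; the sample sentence and the helper `bigram_counts` are made up for illustration.

```python
import re
from collections import Counter

def bigram_counts(text, fold_case=False):
    """Count adjacent word pairs; optionally lowercase the text first."""
    if fold_case:
        text = text.lower()
    words = re.findall(r"[A-Za-z]+", text)
    return Counter(zip(words, words[1:]))

# Made-up sample text, for illustration only.
sample = "Data Mining is popular. Many call it data mining; some write Data mining."

exact = bigram_counts(sample)                     # case-sensitive counts
folded = bigram_counts(sample, fold_case=True)    # all capitalizations merged

print(exact[("Data", "Mining")])   # 1
print(exact[("data", "mining")])   # 1
print(folded[("data", "mining")])  # 3
```

Each case-sensitive variant sees only part of the usage, while the case-folded count combines them. Note that this simple tokenizer ignores sentence boundaries, so a pair such as "data. Mining" would also be counted.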
Earliest use of “data mining” 1962?
Source: Google Books
After eliminating many “following data. Mining cost is” examples, which refer to mining of minerals, and books from “1958” that have a CD attached (errors in book year)
The earliest “data mining” reference I found is
Google Trends: After 2006, Data Mining < Analytics
Google Trends: Analytics observations
Google Analytics introduced, Dec 2005
Competing on Analytics book, Apr 2007
December vacation drop
Half of “Analytics” searches are for “Google Analytics”
Excluding Google Analytics
Google Insights: searches for data mining, analytics -google are most popular in India, US
Data Mining >> Predictive Analytics
Business, Predictive, Text Analytics
Analytics > Data Mining > Data Science
Data Science, Big Data
Analytics Today
KDnuggets Polls Findings
www.KDnuggets.com/polls/
Where did you apply analytics/data mining?
(bar chart, axis 0% to 30%; avg 2.4 industries)
CRM/consumer analytics
Banking
Health care/HR
Fraud Detection
Direct Marketing/Fundraising
Finance
Telecom/Cable
Science
Insurance
Advertising
Education
Web usage mining
Credit Scoring
Retail
Medical/Pharma
Manufacturing
e-Commerce
Social Networks
Search/Web content mining
Government/Military
Biotech/Genomics
Investment/Stocks
Entertainment/Music
Security/Anti-terrorism
Travel/Hospitality
Social Policy/Survey analysis
Junk email/Anti-spam
Other
www.KDnuggets.com/polls/2010/analytics-data-mining-industries-applications.html
Data Types Analyzed/Mined
www.KDnuggets.com/polls/2011/data-types-analyzed-mined.html
Data Types w. Most Growth in 2011
• location/geo/mobile data
• music / audio
• time series
• Genomics, according to John Mattison
Largest Dataset Analyzed?
2011 median dataset size ~10-20 GB, vs 8-10 GB in 2010.
Increase in 10 GB to 1 PB range
www.KDnuggets.com/polls/2011/largest-dataset-analyzed-data-mined.html
Largest Dataset Analyzed by Region
Which methods/algorithms did you use for data analysis in 2011?
Decision Trees
Regression
Clustering
Statistics
Visualization
Time series/Sequence analysis
Support Vector (SVM)
Association rules
Ensemble methods
Text Mining
Neural Nets
Boosting
Bayesian
Bagging
Factor Analysis
Anomaly/Deviation detection
Social Network Analysis
Survival Analysis
Genetic algorithms
Uplift modeling
(bar chart: % analysts who used it, axis 0% to 70%)
www.KDnuggets.com/polls/2011/algorithms-analytics-data-mining.html
Algorithms with highest Industry Affinity
www.KDnuggets.com/polls/2011/algorithms-analytics-data-mining.html
“Academic” algorithms: lowest Industry affinity
www.KDnuggets.com/polls/2011/algorithms-analytics-data-mining.html
Cloud Analytics is not common (yet)
www.KDnuggets.com/polls/2011/algorithms-analytics-data-mining.html
JOBS AND SKILLS
Shortage of Skills
• McKinsey: shortage by 2018 in the US of
  – 140,000-190,000 people with deep analytical skills
  – 1.5 M managers/analysts with the know-how to use the analysis of big data to make effective decisions.
Source: www.mckinsey.com/mgi/publications/big_data/
Job data: Data Scientist
Jobs: Data Mining >> Data Scientist
“Ground” Analytics (LinkedIn Skills)
~ 75,000 with Data Mining skill
~ 7,000 with Predictive Modeling
Also ~ 20,000 with Predictive Analytics (not related to Predictive Modeling??)
Cloud (Big Data) Analytics Skills
Analytics LinkedIn Skills
Machine Learning
Predictive Analytics
Text Mining
MapReduce
Data Tsunami
• In 2010 enterprises stored 7 exabytes = 7,000,000,000 GB of new data (McKinsey)
• 90 percent of the world's data has been generated in the past two years (IBM)
Image with apologies to KDD-2011
Big Data Aspects?
• Volume – Terabytes to Petabytes …
• Velocity – online streaming
• Variety – numbers, text, links, images, audio, video, …
Volume + Velocity => No consistency
• CAP Theorem (Eric Brewer, 2000)
For highly scalable distributed systems, you can only have two of the following:
  – 1) consistency,
  – 2) high availability, and
  – 3) (network) partition tolerance (network failure tolerance)
http://www.julianbrowne.com/article/viewer/brewers-cap-theorem
Implication: Big data solutions must stop worrying about consistency if they want high availability
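The trade-off on this slide can be sketched with a toy model. This is an illustration under strong simplifying assumptions, not how any real datastore is implemented: the class `TwoNodeStore`, its two replicas, and the "CP"/"AP" mode names are all hypothetical.

```python
class TwoNodeStore:
    """Toy store with two replicas holding one value (hypothetical, illustration only)."""

    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.values = {"a": None, "b": None}  # one value per replica
        self.partitioned = False              # True = replicas cannot talk

    def write(self, replica, value):
        """Try to write via one replica; return True if the write was accepted."""
        other = "b" if replica == "a" else "a"
        if self.partitioned:
            if self.mode == "CP":
                # Prefer consistency: refuse the write rather than let replicas diverge.
                return False
            # Prefer availability: accept locally; the other replica is now stale.
            self.values[replica] = value
            return True
        # No partition: replicate to both nodes.
        self.values[replica] = value
        self.values[other] = value
        return True

    def consistent(self):
        return self.values["a"] == self.values["b"]


for mode in ("CP", "AP"):
    store = TwoNodeStore(mode)
    store.write("a", 1)          # replicated normally
    store.partitioned = True     # network failure between replicas
    accepted = store.write("a", 2)
    print(mode, "accepted:", accepted, "consistent:", store.consistent())
```

During the partition, the "CP" store stays consistent but rejects the write (it is not available), while the "AP" store accepts the write and its replicas disagree.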
Big Data
• 2nd Industrial Revolution
• Do old activities better
• Create new activities/businesses
Application areas
• Doing old things better
  – Churn prediction
  – Direct marketing/Customer modeling
  – Recommendations
  – Fraud detection
  – Security/Intelligence
  – …
• Competition will level companies
Limit to Predicting Customer Behavior?
• There is fundamental randomness in human behavior; once we find first-level effects, more data or better algorithms will give diminishing returns in most cases
• Example: Netflix Prize: the most advanced algorithms were only a few percent better than basic algorithms
Direct Marketing: Random and Model-sorted Lists
(chart: CPH, Cumulative Pct Hits, vs. Pct list, for Random and Model lists)
5% of random list have 5% of hits.
5% of model-score ranked list have 21% of hits.
Lift(5%) = 21%/5% = 4.2
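The lift arithmetic on this slide can be sketched as follows. The function names and the ten-record toy list are assumptions for illustration; only the 21% and 5% figures come from the slide.

```python
def cumulative_pct_hits(scores, hits, pct_list):
    """Percent of all hits found in the top pct_list percent of the score-ranked list."""
    # Rank records from highest to lowest model score.
    ranked = [h for _, h in sorted(zip(scores, hits), key=lambda p: p[0], reverse=True)]
    total_hits = sum(ranked)
    if total_hits == 0:
        raise ValueError("no hits in the list")
    n_top = round(len(ranked) * pct_list / 100)
    return 100.0 * sum(ranked[:n_top]) / total_hits

def lift(cph, pct_list):
    """Lift = cumulative pct hits divided by pct of list contacted."""
    return cph / pct_list

# Figures from the slide: the top 5% of the model-ranked list holds 21% of hits.
print(lift(21, 5))  # 4.2
# A random list holds 5% of hits in any 5% of the list, so its lift is 1.
print(lift(5, 5))   # 1.0

# Made-up toy list: 10 records, 3 hits, two of them among the top-scored 20%.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
toy_hits = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
toy_cph = cumulative_pct_hits(toy_scores, toy_hits, 20)
print(round(toy_cph, 1), round(lift(toy_cph, 20), 2))
```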
Most lift curves are surprisingly similar
Study of lift curves in banking, telecom
Best lift curves are similar
Special point T=Target percentage
Lift(T) ~ sqrt (1/T)
G. Piatetsky-Shapiro, B. Masand, Estimating Campaign Benefits and Modeling Lift, in Proceedings of KDD-99 Conference, ACM Press, 1999.
(chart: Lift, axis 0 to 14, vs. 100*T%, axis 0 to 25; series: Actual lift(T), Est. lift(T))
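The rule of thumb Lift(T) ~ sqrt(1/T) can be tabulated with a short sketch. It assumes T is given as a fraction between 0 and 1 (the slide's "Target percentage" divided by 100); the function name `estimated_lift` is hypothetical, and the values are only the estimate, not the actual lifts from the cited study.

```python
import math

def estimated_lift(target_fraction):
    """Rule-of-thumb lift at the special point T: sqrt(1/T), with T as a fraction."""
    if not 0 < target_fraction <= 1:
        raise ValueError("target_fraction must be in (0, 1]")
    return math.sqrt(1.0 / target_fraction)

# Estimated lift for a few target percentages.
for pct in (1, 4, 25):
    print(f"T = {pct}%: estimated lift ~ {estimated_lift(pct / 100):.1f}")
```

Under this estimate, rarer targets allow higher lift: about 10 at T = 1%, 5 at T = 4%, and 2 at T = 25%.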
Big Data Enables New Things !
– Google – first big success of big data
– Social networks (facebook, Twitter, LinkedIn, …): success depends on network size, i.e. big data
– Location analytics
– Health-care
  • Personalized medicine
– Semantics and AI?
  • Imagine IBM Watson, Siri in 2020?
Big Data Growth By Industry
Source: http://www.mckinsey.com/mgi/publications/big_data/
Research and Industry Disconnect?
• Uplift modeling – needs more research
• Association rules need fewer papers
• Data Mining with Privacy research – industry use?
• KDD conference aims to bring researchers and industry people together
Hot Growth Areas
• Social Analytics
  – Klout
  – many twitter micro-analytics (twitalyzer, TweetEffect, TweetStats)
• Mobile Analytics
  – Privacy and data tracks (KDD Lab, Pisa)
Big Data Bubble?
Gartner Hype Cycle
Big Data