Data Mining
Introduction
Prithwis Mukerjee, Ph.D.
Prithwis Mukerjee 2
Why Suddenly Data Mining ?
Fluid, dynamic business environment Markets are, or seem to be, saturated Customers are aggresive and disloyal Speed is essential
“The quick and the dead”
Availability of data Vast amounts of data are generated, stored
electronically and are waiting to be processed !
Availability of tools and techniques Mathematical tools are available to “process”
this data This processing is significantly different from MIS / EDP
style of data processing
Software tools are available to implement some of these mathematical models
Prithwis Mukerjee 3
The Business Environment
Customer Behaviour Customers have access to more channels
Newer retail formats Online stores
Customers have access to more suppliers Increased commoditisation Customer loyalty is not assured any more
Market Saturation Multiple suppliers operating in each market Niche market based on demographics and preferences
Competition has intensified
Need for speed Product life cycles are getting shorter Time to market
Prithwis Mukerjee 4
New way of looking at customer
Customer Relationship Management Intimacy, collaboration, one-to-one partnerships
are necessary
Need to ask ... What classes of customers do we have ? Are
there subclasses in terms of behaviour ? How can we sell more to existing customers ?
What exactly are they buying now ? Is there a pattern in the way our customers
behave ? Who are my good customers ?
To whom we should sell more Who are my bad customers ?
Who are likely to default or defraud ?
Prithwis Mukerjee 5
Availability of vast amounts of data
ERP and OLTP systems With their centralised RDBMSs are huge pools
of firmwide data that can overwhelm even the most dedicated manager
Datawarehouse Technology has resulted in equally huge pools
of historical data
Storage Capacity Inexpensive Ultra high capacity
Prithwis Mukerjee 6
Availability of vast amounts of data
Cards Credit & Debit Cards Loyalty Card Result in capture of huge pools of data
Transactional Data Capture Point of sales systems, Bar code readers Capture vast amount of transaction data at
increasing levels of granularity What was sold ? Product, SKU When was it sold ? Date, time How was it sold ? Discount, Promotions
Beyond simple sales Telephone calls, frequent flyer data
Prithwis Mukerjee 7
Trends leading to Data Flood
More data is generated: Web, text, images … Business transactions,
calls, ... Scientific data:
astronomy, biology, etc
More data is captured: Storage technology
faster and cheaper DBMS can handle
bigger DB
Prithwis Mukerjee 8
Data Growth
In 2 years (2003 to 2005), the size of the largest database TRIPLED!
Prithwis Mukerjee 9
Coming to the point ....
Data mining Is the process of extracting unknown, valid
and actionable information from large databases and then using this information to make crucial business decisions.
DatabaseData Mining Tools
Data Presentationand Visualisation Tools
Decisions
Prithwis Mukerjee 10
Knowledge Discovery Definition
Knowledge Discovery in Data is the
non-trivial process of identifying valid novel potentially useful and ultimately understandable patterns in data.
from Advances in Knowledge Discovery and Data Mining, Fayyad, Piatetsky-Shapiro, Smyth, and Uthurusamy, (Chapter 1), AAAI/MIT Press 1996
Prithwis Mukerjee 11
Old Wine in new bottles ?
Are these the same as data mining ? SQL queries against large databases Multidimensional database analysis Online analytical processing Sophisticated graphic visualisation “Classical” statistical analysis
ANOVA ? Regression ? Correrelation ?
What is missing ? Discovery of information without a previously
articulated or formulated hypothesis
Prithwis Mukerjee 12
Related Fields
Statistics
MachineLearning
Databases
Visualization
Data Mining and Knowledge Discovery
Prithwis Mukerjee 13
Statistics, Machine Learning, Data Mining
Statistics: more theory-based more focused on testing hypotheses
Machine learning more heuristic focused on improving performance of a learning agent also looks at real-time learning and robotics – areas not
part of data mining
Data Mining and Knowledge Discovery integrates theory and heuristics focus on the entire process of knowledge discovery,
including data cleaning, learning, and integration and visualization of results
Distinctions are fuzzy
Data Mining Tasks
Prithwis Mukerjee 15
Some Definitions
Instance (also Item or Record): an example, described by a number of
attributes, e.g. a day can be described by temperature,
humidity and cloud status
Attribute or Field measuring aspects of the Instance, e.g.
temperature
Class (Label) grouping of instances, e.g. days good for
playing
Prithwis Mukerjee 16
Major Data Mining Tasks
Classification predicting an item
class
Clustering finding clusters in
data
Associations e.g. A & B & C occur
frequently
Visualization to facilitate human
discovery
Summarization describing a group
Deviation Detection finding changes
Estimation predicting a
continuous value
Link Analysis finding relationships
And So on ...
Prithwis Mukerjee 17
Classification
Learn a method for predicting the instance class from pre-labeled (classified) instances
Many approaches: Statistics, Decision Trees, Neural Networks, ...
Prithwis Mukerjee 18
Clustering
Find “natural” grouping of instances given un-labeled data
Prithwis Mukerjee 19
Association Rules & Frequent Itemsets
TID Produce
1 MILK, BREAD, EGGS
2 BREAD, SUGAR
3 BREAD, CEREAL
4 MILK, BREAD, SUGAR
5 MILK, CEREAL
6 BREAD, CEREAL
7 MILK, CEREAL
8 MILK, BREAD, CEREAL, EGGS
9 MILK, BREAD, CEREAL
TransactionsFrequent Itemsets:
Milk, Bread (4)Bread, Cereal (3)Milk, Bread, Cereal (2)…
Rules:Milk => Bread (66%)
Prithwis Mukerjee 20
Visualization & Data Mining
Visualizing the data to facilitate human discovery
Presenting the discovered results in a visually "nice" way
Prithwis Mukerjee 21
Summarization
Describe features of the selected group
Use natural language and graphics
Usually in Combination with Deviation detection or other methods
Average length of stay in this study area rose 45.7 percent, from 4.3 days to 6.2 days, because ...
Prithwis Mukerjee 22
Data Mining Central Quest
Find true patterns and avoid overfitting (finding seemingly signifcantbut really random patterns due to searching too many possibilites)
Classification Methods
Prithwis Mukerjee 24
Classification
Learn a method for predicting the instance class from pre-labeled (classified) instances
Many approaches: Regression, Decision Trees,Bayesian,Neural Networks, ...
Given a set of points from classes what is the class of new point ?
Prithwis Mukerjee 25
Classification: Linear Regression
Linear Regression
w0 + w1 x + w2 y >= 0
Regression computes wi from data to minimize squared error to ‘fit’ the data
Not flexible enough
Prithwis Mukerjee 26
Regression for Classification
Any regression technique can be used for classification Training: perform a regression for each class, setting
the output to 1 for training instances that belong to class, and 0 for those that don’t
Prediction: predict class corresponding to model with largest output value (membership value)
For linear regression this is known as multi-response linear regression
Prithwis Mukerjee 27
Classification: Decision Trees
X
Y
if X > 5 then blueelse if Y > 3 then blueelse if X > 2 then greenelse blue
52
3
Prithwis Mukerjee 28
DECISION TREE
An internal node is a test on an attribute.
A branch represents an outcome of the test, e.g., Color=red.
A leaf node represents a class label or class label distribution.
At each node, one attribute is chosen to split training examples into distinct classes as much as possible
A new instance is classified by following a matching path to a leaf node.
Prithwis Mukerjee 29
Weather Data: Play or not Play?
Notruehighmildrain
Yesfalsenormalhotovercast
Yestruehighmildovercast
Yestruenormalmildsunny
Yesfalsenormalmildrain
Yesfalsenormalcoolsunny
Nofalsehighmildsunny
Yestruenormalcoolovercast
Notruenormalcoolrain
Yesfalsenormalcoolrain
Yesfalsehighmildrain
Yesfalsehighhotovercast
Notruehighhotsunny
Nofalsehighhotsunny
Play?WindyHumidityTemperatureOutlook
Note:Outlook is theForecast,no relation to Microsoftemail program
Prithwis Mukerjee 30
overcast
high normal falsetrue
sunny rain
No NoYes Yes
Yes
Example Tree for “Play?”
Outlook
HumidityWindy
Prithwis Mukerjee 31
Classification: Neural Nets
Can select more complex regions
Can be more accurate
Also can overfit the data – find patterns in random noise
Prithwis Mukerjee 32
Classification: other approaches
Naïve Bayes
Rules
Support Vector Machines
Genetic Algorithms
…
See www.KDnuggets.com/software/
Prithwis Mukerjee 33
Direct Marketing Paradigm
Find most likely prospects to contact
Not everybody needs to be contacted
Number of targets is usually much smaller than number of prospects
Typical Applications retailers, catalogues, direct mail (and e-mail)
customer acquisition, cross-sell, attrition prediction
...
Prithwis Mukerjee 34
Direct Marketing Evaluation
Accuracy on the entire dataset is not the right measure
Approach
develop a target model
score all prospects and rank them by decreasing score
select top P% of prospects for action
How do we decide what is the best subset of prospects ?
Prithwis Mukerjee 35
Model-Sorted List
…4897N0.925
2422
2734
…
3820
2478
1024
1746
CustID
N0.06100
…N0.1199
…
…
…
…
…
Age
Y0.934
……
Y0.943
N0.952
Y0.971
Target
Score
No
Use a model to assign score to each customerSort customers by decreasing scoreExpect more targets (hits) near the top of the list
3 hits in top 5% of the list
If there 15 targets overall, then top 5 has 3/15=20% of targets
Data Mining Applications
Prithwis Mukerjee 37
Data Mining Applications
Science: Chemistry, Physics, Medicine Biochemical
analysis Remote sensors on
a satellite Telescopes – star
galaxy classification
Medical Image analysis
Bioscience Sequence-based
analysis Protein structure
and function prediction
Protein family classification
Microarray gene expression
Prithwis Mukerjee 38
Microarrays: Classifying Leukemia
Leukemia: Acute Lymphoblastic (ALL) vs Acute Myeloid (AML), Golub et al, Science, v.286, 1999
72 examples (38 train, 34 test), about 7,000 genes
ALL AML
Visually similar, but genetically very different
Best Model: 97% accuracy,1 error (sample suspected mislabelled)
Prithwis Mukerjee 39
Microarray Potential Applications
New and better molecular diagnostics Jan 11, 2005: FDA approved Roche Diagnostic AmpliChip,
based on Affymetrix technology
New molecular targets for therapy few new drugs, large pipeline, …
Improved treatment outcome Partially depends on genetic signature
Fundamental Biological Discovery finding and refining biological pathways
Personalized medicine ?!
Prithwis Mukerjee 40
Pharmaceutical companies, Insurance and Health care, Medicine Drug development Identify successful
medical therapies Claims analysis,
fraudulent behavior Medical diagnostic
tools Predict office visits
Data Mining Applications
Financial Industry, Banks, Businesses, E-commerce Stock and
investment analysis Identify loyal
customers vs. risky customer
Predict customer spending
Risk management Sales forecasting
Prithwis Mukerjee 41
Retail and Marketing Customer buying
patterns/demographic characteristics
Mailing campaigns Market basket
analysis Trend analysis
Data Mining Applications
Prithwis Mukerjee 42
Application: Direct Marketing and CRM Most major direct marketing companies are
using modeling and data mining
Most financial companies are using customer modeling
Modeling is easier than changing customer behaviour
Example
Verizon Wireless reduced customer attrition rate from 2% to 1.5%, saving many millions of $
Prithwis Mukerjee 43
Application: Security and Fraud Detection Credit Card Fraud Detection
over 20 Million credit cards protected by Neural networks (Fair, Isaac)
Securities Fraud Detection
NASDAQ KDD system
Phone fraud detection
AT&T, Bell Atlantic, British Telecom/MCI
Prithwis Mukerjee 44
Fraud Detection and Management (1)
Applications widely used in health care, retail, credit card
services, telecommunications (phone card fraud), etc.
Approach use historical data to build models of fraudulent
behavior and use data mining to help identify similar instances
Examples auto insurance: detect a group of people who stage
accidents to collect on insurance money laundering: detect suspicious money transactions
(US Treasury's Financial Crimes Enforcement Network) medical insurance: detect professional patients and ring
of doctors and ring of references
Prithwis Mukerjee 45
Fraud Detection and Management (2)
Detecting inappropriate medical treatment Australian Health Insurance Commission
identifies that in many cases blanket screening tests were requested (save Australian $1m/yr).
Detecting telephone fraud Telephone call model: destination of the call,
duration, time of day or week. Analyze patterns that deviate from an expected norm.
British Telecom identified discrete groups of callers with frequent intra-group calls, especially mobile phones, and broke a multimillion dollar fraud.
Retail Analysts estimate that 38% of retail shrink is due
to dishonest employees.
Prithwis Mukerjee 46
Application: e-Commerce
Amazon.com recommendations
if you bought (viewed) X, you are likely to buy Y
Netflix
If you liked "Monty Python and the Holy Grail",
you get a recommendation for "This is Spinal Tap"
Comparison shopping
Froogle, mySimon, Yahoo Shopping, …
Prithwis Mukerjee 47
Example : Processing Loan Applications
Given: questionnaire with financial and personal information
Problem: should money be lend?
Borderline cases referred to loan officers
But: 50% of accepted borderline cases defaulted!
Solution: reject all borderline cases?
Borderline cases are most active customers!
Prithwis Mukerjee 48
Enter Machine Learning
Given: 1000 training examples of borderline cases
20 attributes: age, years with current employer,years at
current address, years with the bank, years at current job, other credit cards
Learned rules predicted 2/3 of borderline cases correctly!
Rules could be used to explain decisions to customers
Prithwis Mukerjee 49
Case study 2:Screening images
Given: radar satellite images of coastal waters
Problem: detecting oil slicks in those images
Oil slicks = dark regions with changing size and shape
Look-alike dark regions can be caused by weather conditions (e.g. high wind)
Expensive process requiring highly trained personnel
Prithwis Mukerjee 50
Dark regions extracted from normalized image
Attributes: size of region, shape, area, intensity, sharpness
and jaggedness of boundaries, proximity of other regions, info about background
Constraints: Scarcity of training examples (oil slicks are
rare!) Unbalanced data: most dark regions aren’t oil
slicks Regions from same image form a batch Requirement is adjustable false-alarm rate
Enter Machine Learning
Prithwis Mukerjee 51
Data Mining Applications ..
Prediction & Description Would this customer buy this product ? Is this customer likely to leave ?
Relationship Marketing What kind of products have been bought by
this customer ? What kind of marketing strategy has this
customer responded to ?
Outlier identification and Fraud detection Locating unusual cases and behaviours
Customer Profiling & Segmentation Is the bottomline that we are all looking at ...
Prithwis Mukerjee 52
Data Mining Challenges
Computationally expensive to investigate all possibilities
Dealing with noise/missing information and errors in data
Choosing appropriate attributes/input representation
Finding the minimal attribute space
Finding adequate evaluation function(s)
Extracting meaningful information
Not overfitting
Prithwis Mukerjee 53
Are All “Discovered” Patterns Interesting?
Interestingness measures: A pattern is interesting if
it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user
Objective vs. subjective measures: Objective: based on statistics and structures of
patterns support and confidence
Subjective: based on user’s belief in the data unexpectedness, novelty, action ability, etc.
Completeness - Find all the interesting patterns Can a data mining system find all the interesting patterns? Association vs. classification vs. clustering
Privacy Issues
Prithwis Mukerjee 55
Data Mining, Privacy, and Security
TIA: Terrorism (formerly Total) Information Awareness Program – TIA program closed by Congress in 2003 because of
privacy concerns
However, in 2006 we learn that NSA is analyzing US domestic call info to find potential terrorists Invasion of Privacy or Needed Intelligence?
Prithwis Mukerjee 56
Criticism of Analytic Approaches to Threat Detection:
Data Mining will be ineffective - generate millions of false positives and invade privacy
First, can data mining be effective?
Prithwis Mukerjee 57
Can Data Mining and Statistics be Effective for Threat Detection?
Criticism: Databases have 5% errors, so analyzing 100 million suspects will generate 5 million false positives
Reality: Analytical models correlate many items of information to reduce false positives.
Example: Identify one biased coin from 1,000. After one throw of each coin, we cannot After 30 throws, one biased coin will stand out
with high probability. Can identify 19 biased coins out of 100 million
with sufficient number of throws
Prithwis Mukerjee 58
Another Approach: Link Analysis
Can find unusual patterns in the network structure
Prithwis Mukerjee 59
Analytic technology can be effective
Data Mining is just one additional tool to help analysts
Combining multiple models and link analysis can reduce false positives
Today there are millions of false positives with manual analysis
Analytic technology has the potential to reduce the current high rate of false positives
Prithwis Mukerjee 60
Data Mining with Privacy
Data Mining looks for patterns, not people!
Technical solutions can limit privacy invasion Replacing sensitive personal data with anon.
ID Give randomized outputs Multi-party computation – distributed data …
Bayardo & Srikant, Technological Solutions for Protecting Privacy, IEEE Computer, Sep 2003
Prithwis Mukerjee 61
Summary
Data Mining and Knowledge Discovery are needed to deal with the flood of data
Knowledge Discovery is a process !
Avoid overfitting (finding random patterns by searching too many possibilities)
Prithwis Mukerjee 62
Additional Resources
www.KDnuggets.com data mining software, jobs, courses, etc
www.acm.org/sigkddACM SIGKDD – the professional society for data mining