Data Mining and Knowledge Discovery in...
1
Data Analytics
Fall 2014
Rattikorn Hewett
Computer Science Department
Texas Tech University
2
Class Information
Contact:
Tel: 325-742-3527
E-mail: [email protected]
Course Materials:
http://redwood.cs.ttu.edu/~hewett/teach.html
3
Acknowledgements
Materials in this course are adapted from
various sources including our texts and
data mining courses by:
Prof. Jeff Ullman, Stanford University
Prof. Chris Clifton, Purdue University
Prof. Osmar Zaiane, University of Alberta
4
Texts
Data Mining: Concepts and Techniques by
J. Han and M. Kamber, Morgan Kaufmann
2000
Data Mining: Practical Machine Learning
Tools and Techniques with Java
Implementations by I. Witten and E. Frank,
Morgan Kaufmann 1999.
5
What you should get out of this course
Concepts and techniques in data analytics, data
mining and knowledge discovery in data (KDD)
Understanding underlying processes and
algorithms
Experience with tools
Exposure to complex applications and research
in data analytics
6
Evaluation
Projects/reports 60%
Paper presentation 35%
Class participation 5%
There will be implementation projects
and research papers to read, review and
present
7
Remarks
Academic integrity: read the statement of
Academic Conduct for Engineering
students (see the syllabus)
Citation: unless noted, work submitted
should reflect your own capabilities
If unsure, acknowledge sources and help
8
Data Analytics:
Overview
9
Outline: Part I
What are data analytics, data mining and KDD?
Why is it a new multidisciplinary subject?
Research Community & Resources
Where do we see data analytics being used?
10
Motivation
Advanced technology
for data collection
generation and storage
Computerization of
business and government
transactions and documents
Flood of undigested data
+
Useful knowledge
For Decision-making
Can we automate this process?
11
What we need
New technologies that can intellectually and automatically
assist humans in analyzing and transforming
rapidly growing volumes of digital data into useful information
KDD (Knowledge Discovery in Databases)
[Fayyad et al., 96]
12
Why KDD?
Manual analysis and interpretation
Slow, expensive and highly subjective
Databases are rapidly growing in size
Hundreds of millions of objects
Hundreds to thousands of attributes
Need to scale up human analysis capabilities
to cope with data overload problem
13
Data mining, a KDD process
[Figure: Databases or data warehouse → Pre-processing → selected, cleaned data → Data Mining → patterns → Post-processing (refinement) → useful information]
• Data mining is the core step of discovery in KDD
• Blindly applying data mining can lead to meaningless and invalid patterns
• Pre- and post-processing are essential to ensure that useful knowledge is derived from the data
14
Data Mining - Then
A term (~1983) used in the statistics community for
“overusing data to draw invalid inferences”
Bonferroni’s theorem suggests that if there are too many possible conclusions, some will be true for purely statistical reasons, with no physical validity
Famous example: ESP tests by Joseph Rhine at Duke in the 1950s, which declared students who guessed every card correctly to have ESP
Data mining had a negative connotation
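The Bonferroni argument above can be checked with a little arithmetic. The sketch below is only an illustration: the number of subjects, the number of cards, and the two-outcome guess are assumed values, not figures from the slides.

```python
# Hypothetical numbers chosen for illustration; none come from the slides.
num_subjects = 1000   # people tested
num_cards = 10        # cards each person guesses
num_outcomes = 2      # equally likely outcomes per card

# Probability that one person guesses every card correctly by pure chance.
p_all_correct = (1 / num_outcomes) ** num_cards

# Expected number of "perfect" guessers if nobody has any special ability.
expected_false_discoveries = num_subjects * p_all_correct

# Probability that at least one person is perfect by chance alone.
p_at_least_one = 1 - (1 - p_all_correct) ** num_subjects

print(f"P(one person all correct) = {p_all_correct:.6f}")
print(f"Expected perfect guessers = {expected_false_discoveries:.3f}")
print(f"P(at least one)           = {p_at_least_one:.3f}")
```

Under these assumptions roughly one "perfect" subject is expected even when nobody has any ability, which is why a pattern found by testing many hypotheses needs separate validation.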
15
Data Mining - Now
Extraction of “interesting” information
(knowledge) from huge amount of data
Discovery of useful summaries of data
(Ullman)
Alternative terms:
Data analysis, pattern analysis, data dredging, data
exploration, data understanding, data summarization,
data abstraction, KDD (other places) etc.
A misnomer?
Data Analytics
A new buzzword in business intelligence
Data leverage in specific applications or functional
processes to enable context-specific insight that is
actionable (by Gartner)
Scientific process of transforming data into insight for
making better decisions (by INFORMS)
In this class …
Data Analytics ~ Data Science ~ Data Mining
~ KDD?
Used with Big Data
17
Data Mining & our daily life
Groceries:
Beer -- Diapers (add Chips)
Wine -- Chocolate -- Flowers
Internet: Google search
E-commerce:
Amazon.com
Expedia.com
18
Outline: Part I
What are data analytics, data mining and KDD?
Why is it a new multidisciplinary subject?
Research Community & Resources
Where do we see data mining being used?
19
KDD Process
adapted from: Chris Clifton, Purdue University and
U. Fayyad, et al. (1995), “From Knowledge Discovery to Data
Mining: An Overview,” Advances in Knowledge Discovery and
Data Mining, U. Fayyad et al. (Eds.), AAAI/MIT Press
[Figure: KDD process. Data → Selection → Target Data → Preprocessing → Preprocessed Data → Data Mining → Patterns → Interpretation/Evaluation → Knowledge]
Iterative process
Preprocessing may take 60% of effort
20
KDD Process
1. Data cleaning: remove noise & inconsistent data
2. Data integration: from multiple sources
3. Data transformation and reduction: transform or
consolidate data into forms appropriate for data mining, select relevant data
4. Data mining: extracts patterns
5. Pattern evaluation/interpretation: by using
interestingness measures
6. Knowledge Presentation: visualization and knowledge
representation are used to present the mined knowledge to the user
Stored in Data Warehouse
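The six steps above can be sketched end to end on toy data. This is a minimal illustration only: the two record sources, the field names, and the frequency count used as the "mining" step are all assumptions, not part of the course material.

```python
from collections import Counter

# Two hypothetical sources with noisy, inconsistent records.
store_a = [{"customer": "ann", "item": "Beer"}, {"customer": "bob", "item": None}]
store_b = [{"customer": "ann", "item": "diapers "}, {"customer": "cy", "item": "beer"}]

def clean(records):
    """Step 1: drop records with missing items and normalize item names."""
    return [
        {"customer": r["customer"], "item": r["item"].strip().lower()}
        for r in records
        if r["item"]
    ]

def integrate(*sources):
    """Step 2: combine cleaned records from multiple sources."""
    return [r for source in sources for r in source]

def transform(records):
    """Step 3: keep only the attribute relevant to the mining task."""
    return [r["item"] for r in records]

def mine(items):
    """Step 4: a trivial 'pattern': how often each item occurs."""
    return Counter(items)

def evaluate(patterns, min_count):
    """Step 5: keep patterns that pass a simple interestingness threshold."""
    return {item: n for item, n in patterns.items() if n >= min_count}

def present(patterns):
    """Step 6: render the surviving patterns for the user."""
    return [f"{item}: {n}" for item, n in sorted(patterns.items())]

records = integrate(clean(store_a), clean(store_b))
report = present(evaluate(mine(transform(records)), min_count=2))
print(report)
```

Each function maps to one numbered step, so a real system could swap in a different mining algorithm or interestingness measure without touching the other stages.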
21
Data Mining Algorithms
Data Set
Data Mining
Algorithm
provides:
• prediction/classification of unseen cases
• understanding relationships among variables
many different types, including:
• classification algorithms (e.g., C4.5)
• association algorithms (e.g., Apriori)
• causal learning algorithms (e.g., PC)
many possible characteristics:
• deterministic/stochastic relationships
• static/dynamic processes
Model
(Pattern or
Knowledge)
22
Data Mining
Involves:
Fitting models to observed data as in
Statistics
Generalizing models that represent behaviors of
the system generating the data as in
Machine Learning
Finding patterns in observed data as in
Pattern Recognition
23
Interdisciplinary KDD
KDD
Databases
Knowledge Acquisition
Data Warehousing
Pre-processing
Pattern Recognition
Statistics
Machine Learning
Data Analytics
Expert Systems
Visualization, HCI
Computer Graphics
Other AI areas
Post-processing
Information Retrieval: Indexing, Inverted files
High Performance Computing: Parallel and Distributed Computing
Data Infrastructures Big Data Analytics
24
Data Analytics/Mining
Must cope with at least three issues:
Very large amount of data
Not all data can fit in main memory
Scalability in size and complexity
“Scalable” if run time grows linearly in proportion to size
Efficiency
High performance algorithms are desired
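One way to cope with data that does not fit in main memory is to process it in a single pass with constant memory, so run time grows linearly with size. The sketch below assumes a hypothetical `stream_records` generator standing in for a disk or network reader.

```python
def stream_records(n):
    """Hypothetical stand-in for reading records one at a time from disk."""
    for i in range(n):
        yield float(i % 100)

def one_pass_summary(records):
    """Compute count, mean and maximum in a single pass with O(1) memory."""
    count = 0
    mean = 0.0
    maximum = float("-inf")
    for value in records:
        count += 1
        mean += (value - mean) / count  # incremental mean update
        maximum = max(maximum, value)
    return count, mean, maximum

count, mean, maximum = one_pass_summary(stream_records(100_000))
print(count, round(mean, 3), maximum)
```

Only three numbers are kept in memory regardless of how many records are streamed, which is the property the scalability requirement asks for.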
25
Data Mining – A new discipline?
How is it different from existing fields?
Statistics – hypothesis testing
Machine learning – assumes all data fits in main memory
Database systems – typically do not infer/generalize data
Pattern Recognition – hard for high volume and high
dimensional data
All – not explicitly concerned with efficiency and huge
amount of data
26
Data Mining – in database context
Can be thought of as
Algorithms for executing very complex queries on
non-main-memory data
An advanced form of on-line analytical processing (OLAP)
OLAP – supports summarization, consolidation,
aggregation and viewing in multiple perspectives
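The OLAP operations named above (summarization, aggregation, viewing from multiple perspectives) can be illustrated with a small roll-up. The fact table, dimension names, and `roll_up` helper below are hypothetical and chosen only for this sketch.

```python
from collections import defaultdict

# Hypothetical fact table: (year, quarter, country, sales amount).
facts = [
    (2013, "Q1", "US", 100),
    (2013, "Q2", "US", 150),
    (2013, "Q1", "CA", 80),
    (2014, "Q1", "US", 120),
    (2014, "Q1", "CA", 90),
]

# Column position of each dimension in a fact row.
DIMENSIONS = {"year": 0, "quarter": 1, "country": 2}

def roll_up(rows, group_by):
    """Sum the sales measure over every dimension not listed in group_by."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[DIMENSIONS[d]] for d in group_by)
        totals[key] += row[3]
    return dict(totals)

by_year = roll_up(facts, ["year"])
by_year_country = roll_up(facts, ["year", "country"])
print(by_year)
print(by_year_country)
```

Calling `roll_up` with different `group_by` lists gives different perspectives on the same data, from coarse (per year) to finer (per year and country).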
27
Outline: Part I
What are data analytics, data mining and KDD?
Why is it a multidisciplinary subject?
Why is it a new discipline?
Research Community & Resources
Where do we see data mining being used?
28
KDD Research Community
Key founders:
Usama Fayyad, JPL (then Microsoft, now has his own company,
Digimine)
Gregory Piatetsky-Shapiro (then GTE, now his own data mining
consulting company, Knowledge Stream Partners)
Rakesh Agrawal (IBM Research)
1989 IJCAI Workshop on Knowledge Discovery in
Databases (Piatetsky-Shapiro)
Knowledge Discovery in Databases (G. Piatetsky-Shapiro and
W. Frawley, 1991)
29
KDD Research Community (contd)
1991-1994 Workshops on Knowledge Discovery in Databases
Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996)
1995-1998 International Conferences on Knowledge Discovery in Databases and Data Mining (KDD’95-98)
Journal of Data Mining and Knowledge Discovery (1997)
1998 ACM SIGKDD, SIGKDD’1999-2001 conferences, and SIGKDD Explorations
More conferences on data mining
PAKDD (1997), PKDD (1997), SIAM-Data Mining (2001), (IEEE) ICDM (2001), etc.
30
KDD Research Community (contd)
Other research community in related fields:
Statistics
Machine Learning
Clustering
Visualization
Databases
Information Retrieval
Distributed and Parallel Computation
31
Useful Resources
KDNuggets (http://www.kdnuggets.com)
Weka 3 – open source data mining software
(http://www.cs.waikato.ac.nz/ml/weka/index.html)
UCI machine learning repository
(http://archive.ics.uci.edu/ml/)
KDD archive (http://kdd.ics.uci.edu/)
32
Outline: Part I
What is data mining and KDD?
Why is it a multidisciplinary subject?
Why is it a new discipline?
Research Community & Resources
Where do we see data mining being used?
33
Example Applications
Marketing & Retailing
Cross reference of items
Market-basket analysis to find associations of items bought to increase retail sales (e.g., diapers and beer, adding chips)
Purchase recommendation
Customer profiling to advertise to most likely buyers (e.g., hot items, amazon.com)
Customer retention
From purchasing records – loyalty card and credit card transactions – detect changes in customer consumption to adjust price/quality
34
Example Applications
Finance and investment
Credit and Loan
Use bank-loan records (of factors that may influence loan payment) to build a predictive model to decide whether a loan should be granted
Predict trends of stock investment
E.g., LBS Capital Management has managed portfolios totaling $600 million since 1993
Identify potential money laundering & financial crimes from reports of large cash transactions
E.g., FAIS of the U.S. Treasury's Financial Crimes Enforcement Network
35
Example Applications
Telecommunication
Detect fraud
E.g., use records on phone services - destination, time, duration - to detect patterns that deviate from the expected norm
Improve availability or promote sales of communication services
E.g., from communication traffic records, associate communication needs and events to avoid overload of communication facilities
36
Example Applications
Manufacturing & Engineering
Construct control model for controlling
manufacturing processes (e.g., semi-conductor
industries)
Inventory Forecast – avoid overstock
Improve aviation safety, from FAA’s pilot deviation
database and NTSB’s accident and incident database
Describe types of human errors (e.g., mistakes, slips,
others) that caused accidents
Predict accident problems
37
Example Applications
Science
Earth & Environmental Science
Construct predictive model for lake inflows from solar activity and climate conditions
Bioinformatics
Comparing genotypes of people with/without a condition allowed discovery of a set of genes that together account for many cases of diabetes
Astronomy
Skycat and Sloan Sky Survey – clustering sky objects by their radiation levels – distinguish galaxies, stars
38
Example Applications
Internet
Web Search (e.g., Google)
Find pages with matching contents, rank, and
summarize content
E-commerce
IBM Surf-Aid analyzes web access logs to target customers, improve web organization or identify pages for advertisement
FIREFLY – music recommendation agents
39
Example Applications
Sport & Entertainment
IBM’s Advanced Scout: analyzes NBA game statistics to gain competitive advantage for NY Knicks and Miami Heat
Sharp Lab: uses data mining to summarize sport video
Homeland Security
Intelligent analysis
Surveillance cameras – detect suspected individuals
A closer look
41
Outline: Part II
Data Mining
Input/Output
Tasks & Functionalities
System Architecture & System Categories
Mining the Data
Steps
Tools & Demos
Challenges and Issues
42
Input: What kind of data to be mined?
Forms:
Structured data: Relational (or Object-oriented or
Object-relational) Databases, Data Warehouses,
Transactional Databases
Semi-structured data: web pages, XML, html, other
special purpose domain
Unstructured data: text, e-mail
43
Examples
A relational database:
Relation: customer
Cust_ID | Name | Contact | Credit_info

A multidimensional data cube used in data warehousing
[Figure: data cube; recoverable axis labels: Date, Country]

A transactional database:
Date/Time/Register   Fish  Turkey  Cranberries  Wine  ...
12/6 13:15 2         N     Y       Y            N     ...
12/6 13:16 3         Y     N       N            Y     ...
44
Input: What kind of data to be mined?
Types of media & content:
Multimedia: Image/Audio/Video
Spatial Databases: Maps, Geographic database
Temporal and Time series Database
WWW (Web pages, Web access logs)
Heterogeneous database: an interconnected set of
different types of stand-alone databases
Legacy database: a group of heterogeneous
databases created in the past
45
Data Sources: Where are the data from?
Public Scientific databases
National laboratories and data centers (e.g., NOAA, human genome, NASA’s EOS, DOD & Intelligence)
Health-related service databases (e.g., benefits, medical analysis)
Financial, Commercial and Business
transactions (e.g., credit card transactions, loyalty cards,
discount coupons, customer complaint calls)
News group, e-mail, documents
46
Output: What are the mined outputs?
Knowledge Types: (depends on data mining tasks)
Descriptions of general properties - Summary reports
Answers of complex queries
Patterns (or Models) of regularities - Classification
models (classifiers), Categories or Clusters of data
Pattern of Irregularities
Sequences or trends of regularities
Inferences on available data
Predictive models for predicting unseen cases
47
Output: What are the mined outputs?
Forms: (depends on data mining functions)
Texts or query languages
Mathematical models, e.g., Neural net or regression models
Symbolic models, e.g.,
Rules – association rules, DNF forms
Decision Trees
Bayesian network
Visual presentation
48
Examples
Decision trees:
[Figure: decision tree for credit risk. Internal nodes: Income (L, M, H), Credit history (U, Bad, Good), Debt (H, L). Leaves: H Risk, M Risk, L Risk.]
Rules: LHS → RHS
Color = yellow & shape = cylinder-like → fruit = banana
Turkey → Cranberries, with support 90% and confidence 80%
Event = Failed Midterm & Unfinished Project → Future Event = Drop or Fail the course
49
Examples (cont.)
Visualization of file organization using ring
visualization representation
From NSF and Science Magazine
Visualization Grand Challenge
First Prize in category illustration.
50
Outline: Part II
Data Mining
Input/Output
Tasks & Functionalities
System Architecture & System Categories
Mining the Data
Steps
Tools & Demos
Challenges and Issues
51
Data Mining Tasks
Discovery: (patterns in various granularities from databases)
Description: find human-interpretable patterns
describing general properties of data
Prediction: find patterns that predict future
behavior by using variables in the data to predict
other unknown variable values
Verification: find patterns that confirm user’s hypothesis
52
How?
Summarize
Cluster
Classify
Identify Sequences/links/dependencies
Detect Deviation
53
Data Mining Functionality
Characterization:
Summarizes general features of objects in a target
concept (or class or pattern to describe)
Concept description
Discrimination:
Compares general features of objects between a target
class and a contrasting class
Concept comparison
54
Data Mining Functionality (cont.)
Association:
Studies the frequency of items occurring together in
transaction databases
Ex: buys(x, beer) → buys(x, nuts)
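Support and confidence for a rule such as the beer and nuts example can be computed directly from transaction counts. The baskets below are made up for illustration, and the helper names `support` and `confidence` are assumptions of this sketch.

```python
# Hypothetical market baskets, one set of items per transaction.
transactions = [
    {"beer", "nuts", "diapers"},
    {"beer", "nuts"},
    {"beer", "chips"},
    {"nuts", "milk"},
    {"beer", "nuts", "chips"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(lhs, rhs, baskets):
    """Estimate of P(rhs | lhs): support(lhs and rhs) / support(lhs)."""
    return support(set(lhs) | set(rhs), baskets) / support(lhs, baskets)

rule_support = support({"beer", "nuts"}, transactions)
rule_confidence = confidence({"beer"}, {"nuts"}, transactions)
print(f"beer -> nuts: support={rule_support:.2f}, confidence={rule_confidence:.2f}")
```

Because confidence divides the rule's support by the support of its left-hand side, which is at most 1, a rule's confidence can never be lower than its support.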
Prediction:
Predicts some unknown or missing values based on
known data
Ex: Forecast stock values based on company records,
political climates and economy
55
Data Mining Functionality (cont.)
Classification:
Describes data in a given class based on class features
of known classes (labeled data)
Supervised learning
Ex: Classify housing prices based on locations and
conditions
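As a minimal sketch of supervised classification, the code below labels a new case with the class of its nearest labeled example (1-nearest neighbor). The features, training data, and class names are hypothetical, and 1-NN is just one simple classifier chosen for brevity.

```python
import math

# Hypothetical labeled examples:
# (distance to city center in km, condition score 1-10) -> price class.
training = [
    ((2.0, 9.0), "high"),
    ((3.0, 8.0), "high"),
    ((15.0, 4.0), "low"),
    ((20.0, 6.0), "low"),
]

def classify(features, labeled_examples):
    """Predict the label of the closest training example (1-nearest neighbor)."""
    nearest = min(labeled_examples, key=lambda ex: math.dist(features, ex[0]))
    return nearest[1]

near_center = classify((4.0, 7.0), training)
far_out = classify((18.0, 5.0), training)
print(near_center, far_out)
```

The labels in `training` are what make this supervised: the classifier never invents classes, it only generalizes from known ones.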
Clustering:
Groups data in classes (or categories or clusters) based
on similarity of their features
Unsupervised learning
* Min. inter-class similarity and Max. intra-class similarity
56
Data Mining Functionality (cont.)
Outlier analysis:
Identifies and explains exceptions (surprises)
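One simple way to flag exceptions is a z-score test: a value is an outlier if it lies more than a chosen number of standard deviations from the mean. The threshold and the call-duration data below are assumptions for illustration, and z-scores are only one of many outlier criteria.

```python
from statistics import mean, pstdev

def find_outliers(values, threshold=2.0):
    """Return values whose z-score magnitude exceeds the threshold."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical call durations in minutes; one call is far from the rest.
call_minutes = [12, 15, 11, 14, 13, 12, 16, 240, 13, 14]
outliers = find_outliers(call_minutes)
print(outliers)
```

Detection is only half of the task described above; explaining why the flagged value is exceptional still requires domain knowledge.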
Time-series analysis:
Identifies trends and deviations; sequential patterns,
similar sequences
57
Outline: Part II
Data Mining
Input/Output
Tasks & Functionalities
System Architecture & System Categories
Mining the Data
Steps
Tools & Demos
Challenges and Issues
58
System Architectures
[Figure: layered system architecture, top to bottom]
Graphical user interface
Pattern evaluation
Data mining engine
Database or data warehouse server
Data cleaning & data integration; filtering
Databases; Data Warehouse
Knowledge base
59
System Categories
Data Mining systems can be classified
based on
Types of knowledge to be discovered
Types of data to be mined
Types of techniques applied
Types of application domains
60
System Categories
Data Mining systems can be classified
based on
Types of knowledge to be discovered
Summary, comparison, association, classification
knowledge, deviation, trends
Knowledge can be at various levels of
abstractions, e.g., year, quarter, month, date, time
61
System Categories
Data Mining systems can be classified
based on
Types of data to be mined
Transaction data, time-series data, spatial data,
text data, www data, heterogeneous/distributed
data
62
System Categories
Data Mining systems can be classified
based on
Types of data models and techniques used
Database-oriented
Machine learning models
Statistical models
Visualization models
63
System Categories
Data Mining systems can be classified
based on
Types of application domains
Text mining systems
Web mining systems
Gene sequence analyzers
Multimedia mining systems
Micro array data analysis systems
64
Outline: Part II
Data Mining
Input/Output
Tasks & Functionalities
System Architecture & System Categories
Mining the Data
Steps
Tools & Demos
Challenges and Issues
65
Steps in mining the data
Learning the application domain
relevant prior knowledge and goals of application
Creating a target data set: data selection
Data cleaning and preprocessing: (may take 60% of effort!)
Data reduction and transformation
Find useful features, dimensionality/variable reduction, invariant
representation.
Choosing functions of data mining
summarization, classification, regression, association, clustering.
Choosing the mining algorithm(s)
Data mining: search for patterns of interest
Pattern evaluation and knowledge presentation
visualization, transformation, removing redundant patterns, etc.
Use of discovered knowledge
66
Some Data Mining Tools & Systems
C4.5, a decision tree learning system [Quinlan, 1994] See5
SOM, Self-organizing Maps [Kohonen, 1995]
Neural Net with Back Propagation learning [Rumelhart, 89]
CBA, Classifier Based on Association rule mining [Liu et al., 1998]
SORCER, Second-Order Relation Compaction for Extraction of
Rules [Hewett and Leuchner, 2002]
Naïve Bayes Classifier (Microsoft)
Tetrad, a Bayes net learning system (CMU)
BNT, Bayes Net Toolbox (MIT)
67
Some Data Mining Suites
DBMiner, IBM’s DataQuest Group
WEKA, Machine learning group at Waikato University
Many more can be found at www.kdnuggets.com
Let’s see them in action ….
68
Outline: Part II
Data Mining
Input/Output
Tasks & Functionalities
System Architecture & System Categories
Mining the Data
Steps
Tools & Demos
Challenges and Issues
69
Issues in Data Mining
User Interface issues
Performance issues
Data source issues
Security and Social issues
Mining Methodology issues
70
User Interface issues
Visualization issues:
Understandability and interpretation of results
Information representation and rendering
Interactivity
Manipulation of mined knowledge
Focus and refine tasks
Focus and refine results
71
Performance issues
Efficiency and scalability of mining
algorithms
Need at least linear time complexity
algorithms or bounded computation
Sampling
Parallelism
Incremental – can we use divide and
conquer?
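For some mining tasks the answer to the divide-and-conquer question is yes. Item counting is one example: counts from separate partitions can be computed independently (or in parallel) and merged by addition. The partitions and helper names below are hypothetical.

```python
from collections import Counter

def count_items(baskets):
    """Count item occurrences in one partition of the data."""
    counts = Counter()
    for basket in baskets:
        counts.update(basket)
    return counts

def merge_counts(partial_counts):
    """Combine per-partition counts; addition makes the merge order-independent."""
    total = Counter()
    for counts in partial_counts:
        total += counts
    return total

# Two hypothetical partitions, e.g. data from two days or two machines.
partition_1 = [["beer", "nuts"], ["beer", "chips"]]
partition_2 = [["nuts", "milk"], ["beer", "nuts", "chips"]]

merged = merge_counts([count_items(partition_1), count_items(partition_2)])
whole = count_items(partition_1 + partition_2)
print(merged == whole, merged["beer"])
```

The same merge step supports incremental mining: when a new partition arrives, only that partition is counted and added to the running total.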
72
Data source issues
Diversity of data types
Handling complex types of data
Is it possible to build a system that performs
well on all kinds of data?
Data Collection
Many collect data for archive
Identify problems before mining them
73
Security and Social issues
Social Impacts
Private/sensitive data are mined without
consent
New implicit knowledge is disclosed
(confidentiality, integrity)
Knowledge sharing
Regulations
There is need for data mining policy to
protect data security, integrity and privacy
74
Mining Methodology issues
Mining different types of knowledge from diverse data
type (e.g., bio, stream, Web)
Incorporation with background knowledge
Handling noise and missing data
Performance: efficiency, effectiveness and scalability
Parallel, distributed and Incremental mining methods
Evaluation: the interestingness problem
Knowledge fusion: Integration of discovered knowledge
with existing one
75
The Interestingness Problems
Is all that is discovered “interesting”?
No.
How do we measure “interestingness”?
Objective: uses statistics based on frequency
of occurrences – e.g., regular – might miss
important rare events
Subjective: user’s beliefs
76
Measures of “interestingness”
A pattern is “interesting” if it is:
Easy to understand by humans
Valid on test data with some degree of
certainty
Potentially useful (for users)
Novel, or validates the user’s hypothesis
77
The Interestingness Problems (cont)
Can the data mining system find all
interesting patterns? completeness
??? Read text and tell me in next class
Can the data mining system find only
interesting patterns? optimality
Yes, in some.
E.g., mining query optimization