Data Mining and Knowledge Discovery in...

1

Data Analytics

Fall 2014

Rattikorn Hewett

Computer Science Department

Texas Tech University

2

Class Information

Contact:

Tel: 325-742-3527

E-mail: [email protected]

Course Materials:

http://redwood.cs.ttu.edu/~hewett/teach.html

3

Acknowledgements

Materials in this course are adapted from

various sources including our texts and

data mining courses by:

Prof. Jeff Ullman, Stanford University

Prof. Chris Clifton, Purdue University

Prof. Osmar Zaiane, University of Alberta

4

Texts

Data Mining: Concepts and Techniques by

J. Han and M. Kamber, Morgan Kaufmann

2000

Data Mining: Practical Machine Learning

Tools and Techniques with Java

Implementations by I. Witten and E. Frank,

Morgan Kaufmann 1999.


5

What you should get out of this course

Concepts and techniques in data analytics, data

mining and knowledge discovery in data (KDD)

Understanding underlying processes and

algorithms

Experience with tools

Exposure to complex applications and research

in data analytics

6

Evaluation

Projects/reports 60%

Paper presentation 35%

Class participation 5%

There will be implementation projects

and research papers to read, review and

present

7

Remarks

Academic integrity: read the statement of

Academic Conduct for Engineering

students (see the syllabus)

Citation: unless noted, work submitted

should reflect your own capabilities.

If unsure, acknowledge sources and help.

8

Data Analytics:

Overview

9

Outline: Part I

What are data analytics, data mining and KDD?

Why is it a new multidisciplinary subject?

Research Community & Resources

Where do we see data analytics being used?

10

Motivation

Advanced technology

for data collection,

generation and storage

Computerization of

business and government

transactions and documents

Flood of undigested data

+

Useful knowledge

For Decision-making

Can we automate this process?

11

What we need

New technologies that can intellectually and automatically

assist humans in analyzing and transforming

rapidly growing volumes of digital data into useful information

KDD (Knowledge Discovery in Databases)

[Fayyad et al., 96]

12

Why KDD?

Manual analysis and interpretation

Slow, expensive and highly subjective

Databases are rapidly growing in size

Hundreds of millions of objects

Hundreds to thousands of attributes

Need to scale up human analysis capabilities

to cope with data overload problem

13

Data mining, a KDD process

[Figure: Databases or data warehouse → Pre-processing → Selected, cleaned data → Data Mining → Patterns → Post-processing (refinement) → Useful information]

• Data mining is the core discovery step in KDD

• Blindly applying data mining can lead to meaningless and invalid patterns

• Pre- and post-processing are essential to ensure that useful knowledge is derived from the data

14

Data Mining - Then

The term (~1983) in statistics community for

“overusing data to draw invalid inferences”

Bonferroni’s theorem suggests that if there are too many possible conclusions, some will appear true for purely statistical reasons, with no physical validity

Famous example: ESP tests by Joseph (J. B.) Rhine at Duke in the 1950s – students who guessed every card correctly were declared to have ESP

Data mining had a negative connotation
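The ESP example can be illustrated with a small simulation. This is only a sketch: the number of students (1000), the number of cards (10) and the two-color guessing setup are assumptions chosen for illustration, not figures from the original experiment. It shows that when enough people guess at random, a perfect score is expected to turn up by chance alone.

```python
import random

def expected_perfect_guessers(num_students, num_cards):
    """Expected number of perfect scores if every guess is a fair coin flip."""
    return num_students * 0.5 ** num_cards

def count_perfect_guessers(num_students, num_cards, seed=0):
    """Simulate random guessing and count students who get every card right."""
    rng = random.Random(seed)
    perfect = 0
    for _ in range(num_students):
        # Each card has two possible colors; each guess is right with p = 0.5.
        if all(rng.random() < 0.5 for _ in range(num_cards)):
            perfect += 1
    return perfect

# Assumed sizes for illustration only.
print("expected by chance:", expected_perfect_guessers(1000, 10))
print("observed in one simulated run:", count_perfect_guessers(1000, 10))
```

With these assumed sizes the expected count is close to one, so finding a "perfect" student says nothing about ESP.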

15

Data Mining - Now

Extraction of “interesting” information

(knowledge) from huge amount of data

Discovery of useful summaries of data

(Ullman)

Alternative terms: Data analysis, pattern analysis, data dredging, data

exploration, data understanding, data summarization,

data abstraction, KDD (other places) etc.

A misnomer?

Data Analytics

A new buzzword in business intelligence

Data leverage in specific applications or functional

processes to enable context-specific insight that is

actionable (by Gartner)

Scientific process of transforming data into insight for

making better decisions (by INFORMS)

In this class …

Data Analytics ~ Data Science ~ Data Mining ~ KDD?

Used with Big Data

17

Data Mining & our daily life

Groceries:

Beer -- Diapers (add Chips)

Wine -- Chocolate -- Flowers

Internet: Google search

E-commerce:

Amazon.com

Expedia.com

18

Outline: Part I

What are data analytics, data mining and KDD?

Why is it a new multidisciplinary subject?


Where do we see data mining being used?

19

KDD Process

adapted from: Chris Clifton, Purdue University and

U. Fayyad et al. (1996), “From Data Mining to Knowledge

Discovery: An Overview,” Advances in Knowledge Discovery and

Data Mining, U. Fayyad et al. (Eds.), AAAI/MIT Press

[Figure: KDD process diagram. Data → Selection → Target Data → Preprocessing → Preprocessed Data → Data Mining → Patterns → Interpretation/Evaluation → Knowledge]

Preprocessing may take 60% of effort

Iterative process

20

KDD Process

1. Data cleaning: remove noise & inconsistent data

2. Data integration: from multiple sources

3. Data transformation and reduction: transform or

consolidate data into forms appropriate for data mining, select relevant data

4. Data mining: extracts patterns

5. Pattern evaluation/interpretation: by using

interestingness measures

6. Knowledge Presentation: visualization and knowledge

representation are used to present the mined knowledge to the user

Stored in Data Warehouse
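The six steps above can be sketched end to end on toy data. Everything here is hypothetical and chosen only for illustration: the record fields (`cust_id`, `age`, `item`), the age bands, the `min_support` threshold and the function names. The "mining" step is a simple frequency count, not any particular algorithm from the texts.

```python
from collections import Counter

def clean(records):
    """Step 1 (cleaning): drop records that contain missing (None) values."""
    return [r for r in records if all(v is not None for v in r.values())]

def integrate(customers, purchases):
    """Step 2 (integration): join the two sources on the shared cust_id key."""
    by_id = {c["cust_id"]: c for c in customers}
    return [{**by_id[p["cust_id"]], **p}
            for p in purchases if p["cust_id"] in by_id]

def transform(records):
    """Step 3 (transformation/reduction): keep two attributes, discretize age."""
    return [{"age_band": "under_30" if r["age"] < 30 else "30_plus",
             "item": r["item"]} for r in records]

def mine(records):
    """Step 4 (mining): count each (age_band, item) co-occurrence."""
    return Counter((r["age_band"], r["item"]) for r in records)

def evaluate(counts, total, min_support):
    """Step 5 (evaluation): keep patterns whose relative frequency is high enough."""
    return {p: n / total for p, n in counts.items() if n / total >= min_support}

def run_pipeline(customers, purchases, min_support=0.4):
    data = transform(integrate(clean(customers), clean(purchases)))
    return evaluate(mine(data), len(data), min_support)

# Hypothetical toy data; customer 3 has a missing age and is removed by cleaning.
customers = [{"cust_id": 1, "age": 25}, {"cust_id": 2, "age": 41},
             {"cust_id": 3, "age": None}]
purchases = [{"cust_id": 1, "item": "beer"}, {"cust_id": 1, "item": "diapers"},
             {"cust_id": 2, "item": "wine"}, {"cust_id": 3, "item": "beer"},
             {"cust_id": 1, "item": "beer"}]

# Step 6 (presentation): print the surviving patterns, strongest first.
for pattern, freq in sorted(run_pipeline(customers, purchases).items(),
                            key=lambda kv: -kv[1]):
    print(pattern, round(freq, 2))
```

Each step is a separate function so that the iterative nature of the process is visible: changing the cleaning rule or the threshold means rerunning only the later steps.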

21

Data Mining Algorithms

Data Set

Data Mining

Algorithm

provides:

• prediction/classification of unseen cases

• understanding relationships among variables

many different types, including:

• classification algorithms (e.g., C4.5)

• association algorithms (e.g., Apriori)

• causal learning algorithms (e.g., PC)

many possible characteristics:

• deterministic/stochastic relationships

• static/dynamic processes

Model

(Pattern or

Knowledge)

22

Data Mining

Involves:

Fitting models to observed data as in

Statistics

Generalizing models that represent behaviors of

the system generating the data as in

Machine Learning

Finding patterns in observed data as in

Pattern Recognition

23

Interdisciplinary KDD

KDD

Databases

Knowledge Acquisition

Data Warehousing

Pre-processing

Pattern Recognition

Statistics

Machine Learning

Data Analytics

Expert Systems

Visualization, HCI

Computer Graphic

Other AI areas

Post-processing

Information Retrieval: Indexing, Inverted files

High Performance Computing:Parallel and Distributed Computing

Data Infrastructures Big Data Analytics

24

Data Analytics/Mining

Must cope with at least three issues:

Very large amount of data

Not all data can fit in main memory

Scalability in size and complexity

“Scalable” if run time grows linearly in proportion to size

Efficiency

High performance algorithms are desired
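One way to cope with data that does not fit in main memory is to process it in a single pass while keeping only a constant amount of state. The sketch below uses Welford's running update for mean and variance; this technique is brought in here as an illustration and is not taken from the slides. Run time grows linearly with the number of values, which matches the notion of "scalable" above.

```python
def running_mean_variance(values):
    """One pass over any iterable; only three numbers are kept in memory."""
    count, mean, m2 = 0, 0.0, 0.0
    for x in values:
        count += 1
        delta = x - mean
        mean += delta / count          # update the running mean
        m2 += delta * (x - mean)       # accumulate squared deviations
    variance = m2 / count if count else 0.0  # population variance
    return count, mean, variance

# The generator stands in for a file or database cursor too large to load at once.
stream = (float(i) for i in range(1, 6))
print(running_mean_variance(stream))
```

The same function works unchanged on a generator that reads records from disk, because it never asks for the data a second time.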

25

Data Mining – A new discipline?

How is it different from existing fields?

Statistics – hypothesis testing

Machine learning – assumes all data fits in main memory

Database systems – typically do not infer/generalize data

Pattern Recognition – hard for high volume and high

dimensional data

All – not explicitly concerned with efficiency and huge

amount of data

26

Data Mining – in database context

Can be thought of as

Algorithms for executing very complex queries on

non-main-memory data

An advanced form of on-line analytical processing (OLAP)

OLAP – supports summarization, consolidation,

aggregation and viewing in multiple perspectives
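The OLAP operations named above (summarization, aggregation, viewing from multiple perspectives) can be sketched as a roll-up over a small fact table. The `roll_up` helper, the `sales` rows and their field names are hypothetical; real OLAP servers do this over data cubes in a warehouse rather than over in-memory lists.

```python
from collections import defaultdict

def roll_up(facts, dimensions, measure):
    """Sum `measure` for each combination of values of the chosen dimensions."""
    totals = defaultdict(float)
    for row in facts:
        key = tuple(row[d] for d in dimensions)
        totals[key] += row[measure]
    return dict(totals)

# Hypothetical fact table: one row per (year, quarter, country) sales figure.
sales = [
    {"year": 2014, "quarter": "Q1", "country": "US", "amount": 100.0},
    {"year": 2014, "quarter": "Q1", "country": "CA", "amount": 40.0},
    {"year": 2014, "quarter": "Q2", "country": "US", "amount": 60.0},
]

print(roll_up(sales, ["quarter", "country"], "amount"))  # detailed view
print(roll_up(sales, ["country"], "amount"))             # rolled up over time
print(roll_up(sales, ["year"], "amount"))                # rolled up over country
```

Choosing a different list of dimensions gives a different perspective on the same data, which is the core idea behind viewing a cube at several levels of abstraction.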

27

Outline: Part I

What are data analytics, data mining and KDD?

Why is it a multidisciplinary subject?

Why is it a new discipline?



28

KDD Research Community

Key founders:

Usama Fayyad, JPL (then Microsoft, now has his own company,

Digimine)

Gregory Piatetsky-Shapiro (then GTE, now his own data mining

consulting company, Knowledge Stream Partners)

Rakesh Agrawal (IBM Research)

1989 IJCAI Workshop on Knowledge Discovery in

Databases (Piatetsky-Shapiro)

Knowledge Discovery in Databases (G. Piatetsky-Shapiro and

W. Frawley, 1991)

29

KDD Research Community (contd)

1991-1994 Workshops on Knowledge Discovery in Databases

Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996)

1995-1998 International Conferences on Knowledge Discovery in Databases and Data Mining (KDD’95-98)

Journal of Data Mining and Knowledge Discovery (1997)

1998 ACM SIGKDD, SIGKDD’1999-2001 conferences, and SIGKDD Explorations

More conferences on data mining

PAKDD (1997), PKDD (1997), SIAM-Data Mining (2001), (IEEE) ICDM (2001), etc.

30

KDD Research Community (contd)

Other research community in related fields:

Statistics

Machine Learning

Clustering

Visualization

Databases

Information Retrieval

Distributed and Parallel Computation

31

Useful Resources

KDNuggets (http://www.kdnuggets.com)

Weka 3 – open source data mining software

(http://www.cs.waikato.ac.nz/ml/weka/index.html)

UCI machine learning repository

(http://archive.ics.uci.edu/ml/)

KDD archive (http://kdd.ics.uci.edu/)

32

Outline: Part I

What is data mining and KDD?

Why is it a multidisciplinary subject?

Why is it a new discipline?



33

Example Applications

Marketing & Retailing

Cross reference of items

Market-basket analysis to find associations of items bought to increase retail sales (e.g., diapers and beer, adding chips)

Purchase recommendation

Customer profiling to advertise to most likely buyers (e.g., hot items, amazon.com)

Customer retention

From purchasing records – loyalty card and credit card transactions – detect changes in customer consumption to adjust price/quality

34


Finance and investment

Credit and Loan

Use bank-loan records (of factors that may influence loan payment) to build a predictive model to decide whether a loan should be granted

Predict trends of stock investment

E.g., LBS Capital Management has managed portfolios totaling $600 million since 1993

Identify potential money laundering & financial crimes from reports of large cash transactions

E.g., FAIS of U.S. Treas. Financial Crimes Enforcement Network

35


Telecommunication

Detect fraud

E.g., use records on phone services (destination, time, duration) to detect patterns that deviate from the expected norm

Improve availability or promote sales of communication services

E.g., from communication traffic records, associate communication needs and events to avoid overload of communication facilities

36


Manufacturing & Engineering

Construct control model for controlling

manufacturing processes (e.g., semi-conductor

industries)

Inventory Forecast – avoid overstock

Improve aviation safety, from FAA’s pilot deviation

database and NTSB’s accident and incident database

Describe types of human errors (e.g., mistakes, slips,

others) that caused accidents

Predict accident problems

37


Science

Earth & Environmental Science

Construct predictive model for lake inflows from solar activity and climate conditions

Bioinformatics

Comparing genotypes of people with/without a condition allowed discovery of a set of genes that together account for many cases of diabetes

Astronomy

Skycat and Sloan Sky Survey – clustering sky objects by their radiation levels – distinguish galaxies, stars

38


Internet

Web Search (e.g., Google)

Find pages with matching contents, rank, and

summarize content

E-commerce

IBM Surf-Aid analyzes web access logs to target customers, improve web organization or identify pages for advertisement

FIREFLY – music recommendation agents

39


Sport & Entertainment

IBM’s Advanced Scout: analyzes NBA game statistics to gain competitive advantage for NY Knicks and Miami Heat

Sharp Lab: uses data mining to summarize sport video

Homeland Security

Intelligent analysis

Surveillance cameras – detect suspected individuals

40

A closer look

41

Outline: Part II

Data Mining

Input/Output

Tasks & Functionalities

System Architecture & System Categories

Mining the Data

Steps

Tools & Demos

Challenges and Issues

42

Input: What kind of data to be mined?

Forms:

Structured data: Relational (or Object-oriented or

Object-relational) Databases, Data Warehouses,

Transactional Databases

Semi-structured data: web pages, XML, html, other

special purpose domain

Unstructured data: text, e-mail

43

Examples

A relational database:

Relation: customer

Cust_ID | Name | Contact | Credit_info

A multidimensional data cube used in data warehousing

[Figure: data cube; visible axis labels: Date, Country]

A transactional database:

Date/Time/Register   Fish   Turkey   Cranberries   Wine   ...

12/6 13:15 2         N      Y        Y             N      ...

12/6 13:16 3         Y      N        N             Y      ...

44

Input: What kind of data to be mined?

Types of media & content:

Multimedia: Image/Audio/Video

Spatial Databases: Maps, Geographic database

Temporal and Time series Database

WWW (Web pages, Web access logs)

Heterogeneous database: an interconnected set of

different types of stand-alone databases

Legacy database: a group of heterogeneous

databases created in the past

45

Data Sources: Where are the data from?

Public Scientific databases

National laboratories and data centers (e.g., NOAA, human genome, NASA’s EOS, DOD & Intelligence)

Health-related service databases (e.g., benefits, medical analysis)

Financial, Commercial and Business

transactions (e.g., credit card transactions, loyalty cards,

discount coupons, customer complaint calls)

News group, e-mail, documents

46

Output: What are the mined outputs?

Knowledge Types: (depends on data mining tasks)

Descriptions of general properties

Summary reports

Answers of complex queries

Patterns (or Models) of regularities - Classification

models (classifiers), Categories or Clusters of data

Patterns of irregularities

Sequences or trends of regularities

Inferences on available data

Predictive models for predicting unseen cases

47

Output: What are the mined outputs?

Forms: (depends on data mining functions)

Texts or query languages

Mathematical models, e.g., Neural net or regression models

Symbolic models, e.g.,

Rules – association rules, DNF forms

Decision Trees

Bayesian network

Visual presentation

48

Examples

Decision trees:

[Figure: decision tree for credit risk. Test nodes: Income (L / M / H), Credit history (U / Bad / Good), Debt (H / L). Leaves: H Risk, M Risk, L Risk]

Rules: LHS → RHS

Color = yellow & shape = cylinder-like → fruit = banana

Turkey → Cranberries, with support 90% and confidence 80%

Event = Failed Midterm & Unfinished Project → Future Event = Drop or Fail the course
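Support and confidence for a rule of the form LHS → RHS can be computed directly from a list of transactions. The sketch below assumes the commonly used definitions (support is the fraction of transactions containing both sides; confidence is that fraction divided by the fraction containing the LHS); the slides do not spell out which definitions they use. The `baskets` data is made up for illustration.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    if not transactions:
        return 0.0
    wanted = set(items)
    hits = sum(1 for t in transactions if wanted <= set(t))
    return hits / len(transactions)

def confidence(transactions, lhs, rhs):
    """Estimate of P(rhs | lhs): support of both sides over support of lhs."""
    lhs_support = support(transactions, lhs)
    if lhs_support == 0:
        return 0.0
    return support(transactions, set(lhs) | set(rhs)) / lhs_support

# Hypothetical market baskets.
baskets = [
    {"turkey", "cranberries"},
    {"turkey", "cranberries", "wine"},
    {"turkey"},
    {"fish", "wine"},
]

print("support:", support(baskets, {"turkey", "cranberries"}))
print("confidence:", confidence(baskets, {"turkey"}, {"cranberries"}))
```

Algorithms such as Apriori search for all rules whose support and confidence exceed user-chosen thresholds; this sketch only scores one rule that is already given.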

49

Examples (cont.)

Visualization of file organization using ring

visualization representation

From NSF and Science Magazine

Visualization Grand Challenge

First Prize in category illustration.

50

Outline: Part II

Data Mining

Input/Output



Mining the Data

Steps

Tools & Demos

Challenges and Issues

51

Data Mining Tasks

Discovery: (patterns in various granularities from databases)

Description: find human-interpretable patterns

describing general properties of data

Prediction: find patterns that predict future

behavior by using variables in the data to predict

other unknown variable values

Verification: find patterns that confirm user’s hypothesis

52

How?

Summarize

Cluster

Classify

Identify Sequences/links/dependencies

Detect Deviation

53

Data Mining Functionality

Characterization:

Summarizes general features of objects in a target

concept (or class or pattern to describe)

Concept description

Discrimination:

Compares general features of objects between a target

class and a contrasting class

Concept comparison

54

Data Mining Functionality (cont.)

Association:

Studies the frequency of items occurring together in

transaction databases

Ex: buys(x, beer) → buys(x, nuts)

Prediction:

Predicts some unknown or missing values based on

known data

Ex: Forecast stock values based on company records,

political climates and economy

55


Classification:

Describes data in a given class based on class features

of known classes (labeled data)

Supervised learning

Ex: Classify housing prices based on locations and

conditions

Clustering:

Groups data in classes (or categories or clusters) based

on similarity of their features

Unsupervised learning

* Min. inter-class similarity and max. intra-class similarity

56


Outlier analysis:

Identifies and explains exceptions (surprises)

Time-series analysis:

Identifies trends and deviations; sequential patterns,

similar sequences
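Outlier analysis, as described above, can be sketched with a simple z-score test: flag values that lie more than a chosen number of standard deviations from the mean. This is one basic approach among many and is an assumption of this example, as are the threshold of 2.0 and the made-up call durations.

```python
import statistics

def find_outliers(values, threshold=2.0):
    """Return values whose distance from the mean exceeds threshold * stdev."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    if stdev == 0:
        return []                      # all values identical: nothing deviates
    return [x for x in values if abs(x - mean) / stdev > threshold]

# Hypothetical phone-call durations in minutes; one call is unusually long.
durations = [5, 6, 5, 7, 6, 5, 6, 120]
print(find_outliers(durations))
```

A mean-based test like this is itself sensitive to extreme values, so practical systems often prefer more robust statistics or model-based deviation detection.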

57

Outline: Part II

Data Mining

Input/Output



Mining the Data

Steps

Tools & Demos

Challenges and Issues

58

System Architectures

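[Figure: layered architecture of a data mining system, top to bottom: Graphical user interface; Pattern evaluation; Data mining engine; Database or data warehouse server; Data cleaning & data integration, Filtering; Databases and Data Warehouse. A Knowledge base is also shown.]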

59

System Categories

Data Mining systems can be classified

based on

Types of knowledge to be discovered

Types of data to be mined

Types of techniques applied

Types of application domains

60

System Categories


based on

Types of knowledge to be discovered

Summary, comparison, association, classification

knowledge, deviation, trends

Knowledge can be at various levels of

abstractions, e.g., year, quarter, month, date, time

61

System Categories


based on

Types of data to be mined

Transaction data, time-series data, spatial data,

text data, www data, heterogeneous/distributed

data

62

System Categories


based on

Types of data models and techniques used

Database-oriented

Machine learning models

Statistical models

Visualization models

63

System Categories


based on

Types of application domains

Text mining systems

Web mining systems

Gene sequence analyzers

Multimedia mining systems

Micro array data analysis systems

64

Outline: Part II

Data Mining

Input/Output



Mining the Data

Steps

Tools & Demos


65

Steps in mining the data

Learning the application domain

relevant prior knowledge and goals of application

Creating a target data set: data selection

Data cleaning and preprocessing: (may take 60% of effort!)

Data reduction and transformation

Find useful features, dimensionality/variable reduction, invariant

representation.

Choosing functions of data mining

summarization, classification, regression, association, clustering.

Choosing the mining algorithm(s)

Data mining: search for patterns of interest

Pattern evaluation and knowledge presentation

visualization, transformation, removing redundant patterns, etc.

Use of discovered knowledge

66

Some Data Mining Tools & Systems

C4.5, a decision tree learning system [Quinlan, 1994] See5

SOM, Self-organizing Maps [Kohonen, 1995]

Neural Net with Back Propagation learning [Rumelhart, 89]

CBA, Classifier Based on Association rule mining [Liu et al., 1998]

SORCER, Second-Order Relation Compaction for Extraction of

Rules [Hewett and Leuchner, 2002]

Naïve Bayes Classifier (Microsoft)

Tetrad, a Bayes net learning system (CMU)

BNT, Bayes Net Toolbox (MIT)

67

Some Data Mining Suites

DBMiner, IBM’s DataQuest Group

WEKA, Machine learning group at Waikato University

Many more can be found at www.kdnuggets.com

Let’s see them in action ….

68

Outline: Part II

Data Mining

Input/Output



Mining the Data

Steps

Tools & Demos


69

Issues in Data Mining

User Interface issues

Performance issues

Data source issues

Security and Social issues

Mining Methodology issues

70

User Interface issues

Visualization issues:

Understandability and interpretation of results

Information representation and rendering

Interactivity

Manipulation of mined knowledge

Focus and refine tasks

Focus and refine results

71

Performance issues

Efficiency and scalability of mining

algorithms

Need at least linear time complexity

algorithms or bounded computation

Sampling

Parallelism

Incremental – can we use divide and

conquer?
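Sampling, listed above, is one way to bound computation on very large data. The sketch below uses reservoir sampling, which draws a uniform random sample of fixed size in one pass without knowing the stream length in advance. The choice of this particular technique is an assumption of the example; the slides only name sampling in general.

```python
import random

def reservoir_sample(stream, k, seed=None):
    """Keep a uniform random sample of up to k items from a one-pass stream."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)       # fill the reservoir first
        else:
            j = rng.randint(0, i)     # randint is inclusive on both ends
            if j < k:
                sample[j] = item      # replace with probability k / (i + 1)
    return sample

# range(1_000_000) stands in for a data source too large to hold in memory.
print(reservoir_sample(range(1_000_000), 5, seed=42))
```

Memory use depends only on `k`, so a mining algorithm can be run on the sample even when the full data set cannot be loaded.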

72

Data source issues

Diversity of data types

Handling complex types of data

Is it possible to build a system that performs

well on all kinds of data?

Data Collection

Many collect data for archive

Identify problems before mining them

73

Security and Social issues

Social Impacts

Private/sensitive data are mined without

consent

New implicit knowledge is disclosed

(confidentiality, integrity)

Knowledge sharing

Regulations

There is need for data mining policy to

protect data security, integrity and privacy

74

Mining Methodology issues

Mining different types of knowledge from diverse data

types (e.g., bio, stream, Web)

Incorporation with background knowledge

Handling noise and missing data

Performance: efficiency, effectiveness and scalability

Parallel, distributed and Incremental mining methods

Evaluation: the interestingness problem

Knowledge fusion: Integration of discovered knowledge

with existing one

75

The Interestingness Problems

Is all that is discovered “interesting”?

No.

How do we measure “interestingness”?

Objective: use statistics based on frequency

of occurrences (e.g., regularity); might miss

important rare events

Subjective: user’s beliefs

76

Measures of “interestingness”

A pattern is “interesting” if it is:

Easy to understand by humans

Valid on test data with some degree of

certainty

Potentially useful (for users)

Novel, or validates the user’s hypothesis

77

The Interestingness Problems (cont)

Can the data mining system find all

interesting patterns? completeness

??? Read text and tell me in next class

Can the data mining system find only

interesting patterns? optimality

Yes, in some.

E.g., mining query optimization
