Advanced Databases: Data Mining

Dr Theodoros Manavis
[email protected]

Data Mining Definition

• Data Mining is:
  (1) The efficient discovery of previously unknown, non-trivial, implicit, valid, potentially useful, understandable patterns in large datasets.
  (2) The analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner.

• Alternative names:
  – Data mining: a misnomer?
  – Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
• What is not data mining?
  – Query processing
  – Expert systems or small ML/statistical programs

Data Mining vs. Data Query

• Data query, e.g.:
  – A list of all customers who use a credit card to buy a PC
  – A list of all MIS students who have a GPA of 3.5 or higher and have studied four or fewer semesters
• Data mining problems, e.g.:
  – What is the likelihood of a customer purchasing a PC with a credit card?
  – Given the characteristics of an MIS student, predict her GPA in the coming term
  – What are the characteristics of MIS undergrad students?
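The difference between a query and a mining question can be sketched in a few lines of Python. The records and field names below are made up for illustration and are not part of the lecture: the query returns the rows that match a condition, while the mining-style question estimates a likelihood from the same rows (here as a plain relative frequency, which is only one possible estimator).

```python
# Hypothetical purchase records; field names and values are illustrative only.
purchases = [
    {"customer": "A", "item": "PC", "payment": "credit card"},
    {"customer": "B", "item": "PC", "payment": "cash"},
    {"customer": "C", "item": "TV", "payment": "credit card"},
    {"customer": "D", "item": "PC", "payment": "credit card"},
    {"customer": "E", "item": "TV", "payment": "cash"},
]

# Data query: list the customers who used a credit card to buy a PC.
pc_by_card = [p["customer"] for p in purchases
              if p["item"] == "PC" and p["payment"] == "credit card"]

# Mining-style question: how likely is a credit card purchase to be a PC?
# Estimated here as the relative frequency among credit card purchases.
card_purchases = [p for p in purchases if p["payment"] == "credit card"]
likelihood = sum(p["item"] == "PC" for p in card_purchases) / len(card_purchases)

print(pc_by_card)
print(round(likelihood, 2))
```

A real mining task would replace the relative frequency with a model learned from many more attributes, but the shape of the answer (an estimate rather than a list of rows) stays the same.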


Knowledge Discovery Process

Data Selection

• Select the information about people who have subscribed to a magazine

Cleaning

• Pollution: typing errors, people moving from one place to another without notifying a change of address, people giving incorrect information about themselves
  – Pattern recognition algorithms
• Lack of domain consistency

Enrichment

• Need extra information about the clients, consisting of date of birth, income, amount of credit, and whether or not an individual owns a car or a house
• We select only those records that have enough information to be of value (row)
• Project the fields in which we are interested (column)

Coding

• Code the information which is too detailed
  – Address to region
  – Birth date to age
  – Divide income by 1000
  – Divide credit by 1000
  – Convert cars yes-no to 1-0
  – Convert purchase date to month numbers starting from 1990
• The way in which we code the information will determine the type of patterns we find
• Coding has to be performed repeatedly in order to get the best results
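The coding steps listed above can be sketched as one small function. Everything specific in this sketch is an assumption made for illustration: the record layout, the field names, the city-to-region lookup, and the convention that January 1990 is month 1.

```python
from datetime import date

# Hypothetical lookup from city to region; the mapping is an assumption.
REGION_OF_CITY = {"Athens": "Attica", "Thessaloniki": "Central Macedonia"}

def code_record(record, today):
    """Return a coarser, coded version of one client record."""
    born = record["birth_date"]
    # Age in whole years: subtract one if this year's birthday is still ahead.
    age = today.year - born.year - ((today.month, today.day) < (born.month, born.day))
    bought = record["purchase_date"]
    return {
        "region": REGION_OF_CITY.get(record["city"], "unknown"),  # address to region
        "age": age,                                               # birth date to age
        "income_k": record["income"] / 1000,                      # income in thousands
        "credit_k": record["credit"] / 1000,                      # credit in thousands
        "car": 1 if record["owns_car"] == "yes" else 0,           # yes-no to 1-0
        # Month number counted from January 1990 = 1 (assumed convention).
        "purchase_month": (bought.year - 1990) * 12 + bought.month,
    }

client = {
    "city": "Athens",
    "birth_date": date(1960, 6, 15),
    "income": 18500,
    "credit": 2300,
    "owns_car": "yes",
    "purchase_date": date(1991, 3, 10),
}
coded = code_record(client, today=date(1995, 1, 1))
print(coded)
```

Because the choice of coding shapes the patterns that can be found, a function like this is usually revised and re-run several times, for example with different region groupings or age bands.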


Data Mining and Business Intelligence

Layers, listed from the highest to the lowest potential to support business decisions:

• Making Decisions
• Data Presentation: Visualization Techniques
• Data Mining: Information Discovery
• Data Exploration: Statistical Analysis, Querying and Reporting
• Data Warehouses / Data Marts: OLAP, MDA
• Data Sources: Paper, Files, Information Providers, Database Systems, OLTP

Roles, from the top layers to the bottom layers: End User, Business Analyst, Data Analyst, DBA

Examples of Large Datasets

• Government: IRS, NGA, …
• Large corporations
  – WALMART: 20M transactions per day
  – MOBIL: 100 TB geological databases
  – AT&T: 300M calls per day
  – Credit card companies
• Scientific
  – NASA, EOS (Earth Observing System) project: 50 GB per hour
  – Environmental datasets

Examples of Data mining Applications

1. Fraud detection: credit cards, phone cards

2. Marketing: customer targeting

3. Data Warehousing: Walmart

4. Astronomy

5. Molecular biology

The Data Mining Process

• 1. Goal identification:
  – Define problem
  – Relevant prior knowledge and goals of application
• 2. Creating a target data set: data selection
• 3. Data preprocessing: (may take 60%-80% of effort!)
  – removal of noise or outliers
  – strategies for handling missing data fields
  – accounting for time sequence information
• 4. Data reduction and transformation:
  – Find useful features, dimensionality/variable reduction.

The Data Mining Process

• 5. Data Mining:
  – Choosing functions of data mining:
    • summarization, classification, regression, association, clustering.
  – Choosing the mining algorithm(s):
    • which models or parameters
  – Search for patterns of interest
• 6. Presentation and Evaluation:
  – visualization, transformation, removing redundant patterns, etc.
• 7. Taking action:
  – incorporating into the performance system
  – documenting
  – reporting to interested parties

An example: Customer Segmentation

• 1. Marketing department wants to perform a segmentation study on the customers of AllElectronics Company
• 2. Decide on relevant variables from a data warehouse on customers, sales, promotions
  – Customers: name, ID, income, age, education, ...
  – Sales: history of sales
  – Promotion: promotion types, durations, ...
• 3. Handle missing income, addresses, ...; determine outliers, if any
• 4. Generate new index variables representing wealth of customers
  – Wealth = a*income + b*#houses + c*#cars ...
  – Make necessary transformations of the scores so that some data mining algorithms work more efficiently
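Step 4 can be illustrated with a short sketch of the wealth index. The weights a, b, c and the customer values are hypothetical, and min-max scaling is just one possible transformation; the slide does not say which one is meant.

```python
# Hypothetical weights a, b, c and customer values; none come from the lecture.
A, B, C = 0.001, 20.0, 5.0

customers = [
    {"id": 1, "income": 30000, "houses": 1, "cars": 1},
    {"id": 2, "income": 90000, "houses": 2, "cars": 3},
    {"id": 3, "income": 15000, "houses": 0, "cars": 0},
]

def wealth(cust):
    """Wealth = a*income + b*#houses + c*#cars."""
    return A * cust["income"] + B * cust["houses"] + C * cust["cars"]

scores = [wealth(cust) for cust in customers]

# Min-max transformation to [0, 1], so that the index is on a comparable
# scale with other rescaled variables used by a distance-based algorithm.
lo, hi = min(scores), max(scores)
scaled = [(s - lo) / (hi - lo) for s in scores]

print(scores)
print(scaled)
```

Rescaling matters because distance-based methods such as clustering would otherwise be dominated by whichever variable has the largest numeric range.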

Example: Customer Segmentation cont.

• 5.a: Choose clustering as the data mining functionality, as it is the natural one for a segmentation study, so as to find groups of customers with similar characteristics
• 5.b: Choose a clustering algorithm
  – e.g. K-means or any suitable one for that problem
• 5.c: Apply the algorithm
  – Find clusters or segments
• 6. Make reverse transformations, visualize the customer segments
• 7. Present the results in the form of a report to the marketing department
  – Implement the segmentation as part of a DSS so that it can be applied repeatedly at certain intervals as new customers arrive
  – Develop marketing strategies for each segment
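Steps 5.b and 5.c can be sketched with a minimal K-means in plain Python. This is an illustrative sketch, not the implementation used in the course: the customer values are invented, and taking the first k points as initial centroids is an assumption made to keep the run deterministic.

```python
import math

def kmeans(points, k, iterations=100):
    """Basic K-means (Lloyd's algorithm) on a list of numeric tuples."""
    # Assumption: the first k points serve as initial centroids; practical
    # implementations use random or smarter initialisation.
    centroids = [tuple(p) for p in points[:k]]
    labels = []
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                  for p in points]
        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: assignments can no longer change
        centroids = new_centroids
    return labels, centroids

# Hypothetical customers as (age, income in thousands).
customers = [(25, 20), (27, 22), (24, 19), (52, 80), (55, 85), (50, 78)]
labels, centroids = kmeans(customers, k=2)
print(labels)
print(centroids)
```

Each resulting label is a segment number, and the centroids summarise the "typical" customer of each segment, which is what the marketing report in step 7 would describe.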

Two Styles of Data Mining

• Descriptive data mining
  – characterize the general properties of the data in the database
  – finds patterns in data, and
  – the user determines which ones are important
• Predictive data mining
  – perform inference on the current data to make predictions
  – we know what to predict
• Not mutually exclusive
  – used together
  – Descriptive → predictive
• E.g. customer segmentation (descriptive, by clustering)
• Followed by a risk assignment model (predictive, by ANN)

Supervised vs. Unsupervised Learning

• Supervised learning (classification, prediction)
  – Supervision: The training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
  – New data is classified based on the training set
• Unsupervised learning (summarization, association, clustering)
  – The class labels of the training data are unknown
  – Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data

Descriptive Data Mining

• Discovering new patterns inside the data
• Used during the data exploration steps
• Typical questions answered by descriptive data mining:
  – what is in the data
  – what does it look like
  – are there any unusual patterns
  – what does the data suggest for customer segmentation
• Users may have no idea
  – which kind of patterns may be interesting
• Patterns at various granularities
  – Geography (country - city - region - street)
  – Student (university - faculty - department - minor)
• Functionalities of descriptive data mining
  – Clustering (e.g. customer segmentation)
  – Summarization
  – Visualization
  – Association (e.g. market basket analysis)

Prediction: A model is a black box

inputs X1, X2 → Model → output Y

• X: vector of independent variables or inputs
• Y: dependent variable or output, a single variable or a vector
• Y = f(X): an unknown function
• The user does not care what the model is doing: it is a black box
• The user is interested in the accuracy of its predictions

Predictive Data Mining

• Using known examples the model is trained
  – the unknown function is learned from data
• The more data with known outcomes is available
  – the better the predictive power of the model
• Used to predict outcomes whose inputs are known but whose output values are not realized yet
• Never 100% accurate
• The performance of a model on past data is not important
  – Not important how well it predicts the known outcomes
• Its performance on unknown data is much more important

Typical questions answered by predictive models

• Who is likely to respond to our next offer– based on history of previous marketing campaigns

• Which customers are likely to leave in the next six months

• What transactions are likely to be fraudulent– based on known examples of fraud

• What is the total amount a customer will spend in the next month

Why Data Preprocessing?

• Data in the real world is dirty
  – incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  – noisy: containing errors or outliers
  – inconsistent: containing discrepancies in codes or names
• No quality data, no quality mining results!
  – Quality decisions must be based on quality data
  – Data warehouse needs consistent integration of quality data
  – Required for both OLAP and Data Mining!

Why can Data be Incomplete?

• Attributes of interest are not available (e.g., customer information for sales transaction data)

• Data were not considered important at the time of transactions, so they were not recorded!

• Data not recorded because of misunderstanding or malfunctions

• Data may have been recorded and later deleted!

• Missing/unknown values for some data

Data Cleaning

• Data cleaning tasks

– Fill in missing values

– Identify outliers and smooth out noisy data

– Correct inconsistent data
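The three cleaning tasks can be sketched on a toy column. The values are invented, and the specific choices (median imputation, the 1.5 × IQR outlier rule, clipping as a form of smoothing, a lookup table for inconsistent codes) are common conventions assumed here for illustration, not methods prescribed by the slides.

```python
import statistics

# Hypothetical income column (in thousands); None marks a missing value.
incomes = [32, 35, None, 31, 34, 250, 33, None, 36]

# 1. Fill in missing values, here with the median of the observed values.
observed = [x for x in incomes if x is not None]
median = statistics.median(observed)
filled = [median if x is None else x for x in incomes]

# 2. Identify outliers with the interquartile-range rule, then smooth them
#    by clipping to the accepted range.
q1, _, q3 = statistics.quantiles(observed, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in observed if x < low or x > high]
cleaned = [min(max(x, low), high) for x in filled]

# 3. Correct inconsistent data by mapping known variants to one canonical code.
CANONICAL = {"m": "M", "male": "M", "f": "F", "female": "F"}
genders = ["M", "male", "F", "Female", "f"]
consistent = [CANONICAL[g.lower()] for g in genders]

print(filled)
print(outliers)
print(cleaned)
print(consistent)
```

Whether an extreme value is noise to be smoothed or a genuine rare event worth keeping depends on the application, so the clipping step should be treated as optional.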


Data Mining Functionalities (1/5)

1. Concept description: Characterization and discrimination
– Generalize, summarize, and contrast data characteristics, e.g., big spenders vs. budget spenders

2. Association (correlation and causality)
– Multi-dimensional vs. single-dimensional association
– age(X, “20..29”) ^ income(X, “20..29K”) → buys(X, “PC”) [support = 2%, confidence = 60%]
– contains(T, “computer”) → contains(T, “software”) [support = 1%, confidence = 75%]


Characterisation (concept description): summarizing the characteristics of customers who spend more than $1000 a year at AllElectronics (a database of customers)

– age, employment, income
– drill down on any dimension


Discrimination example (concept description):

• Example 1: Compare the general features of software products
  – whose sales increased by 10% in the last year (target class)
  – whose sales decreased by at least 30% during the same period (contrasting class)
• Example 2: Compare two groups of ‘AllElectronics’ customers
  – I) who shop for computer products regularly (target class)
    • more than two times a month
  – II) who rarely shop for such products (contrasting class)
    • less than three times a year
• The resulting description:
  – 80% of group I customers
    • university education
    • ages 20-40
  – 60% of group II customers
    • seniors or young
    • no university degree

3. Classification and Prediction

– Finding models (functions) that describe and distinguish classes or concepts for future prediction

– E.g., classify people as healthy or sick, or classify transactions as fraudulent or not

– Methods: decision-tree, classification rule, neural network

– Prediction: Predict some unknown or missing numerical values

4. Cluster analysis

– Class label is unknown: Group data to form new classes, e.g., cluster customers of a retail company to learn about characteristics of different segments

– Clustering based on the principle: maximizing the intra-class similarity and minimizing the interclass similarity

5. Outlier analysis

– Outlier: a data object that does not comply with the general behavior of the data

– It can be considered as noise or exception but is quite useful in fraud detection, rare events analysis

6. Trend and evolution analysis

– Trend and deviation: regression analysis

– Sequential pattern mining: click stream analysis

– Similarity-based analysis

7. Other pattern-directed or statistical analyses


Classification: Definition

• Given a collection of records (training set)
  – Each record contains a set of attributes; one of the attributes is the class.
• Find a model for the class attribute as a function of the values of the other attributes.
• Goal: previously unseen records should be assigned a class as accurately as possible.
  – A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
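The training/test division can be sketched as follows. The labelled records are invented, and a 1-nearest-neighbour rule stands in for the model purely because it fits in a few lines; the slides themselves name decision trees, classification rules and neural networks as methods.

```python
import math

# Hypothetical labelled records: ((age, income in thousands), class label).
records = [
    ((25, 20), "no"), ((30, 25), "no"), ((28, 22), "no"), ((35, 30), "no"),
    ((50, 80), "yes"), ((55, 90), "yes"), ((48, 75), "yes"), ((60, 95), "yes"),
    ((27, 24), "no"), ((52, 85), "yes"),
]

# Divide the data: every third record is held out for testing. A fixed split
# keeps the sketch reproducible; a random split is the usual choice.
test_set = records[::3]
training_set = [r for i, r in enumerate(records) if i % 3 != 0]

def classify(x):
    """1-nearest-neighbour model: return the class of the closest training record."""
    _, label = min(training_set, key=lambda r: math.dist(x, r[0]))
    return label

# Accuracy is measured only on records the model has not seen.
correct = sum(classify(x) == label for x, label in test_set)
accuracy = correct / len(test_set)
print(f"test accuracy: {accuracy:.2f}")
```

Measuring accuracy on the held-out records, rather than on the training records, is what gives an estimate of how the model behaves on previously unseen data.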

Classification Example

Training Set (Home Owner: categorical, Marital Status: categorical, Taxable Income: continuous, Default: class)

Tid  Home Owner  Marital Status  Taxable Income  Default
1    Yes         Single          125K            No
2    No          Married         100K            No
3    No          Single          70K             No
4    Yes         Married         120K            No
5    No          Divorced        95K             Yes
6    No          Married         60K             No
7    Yes         Divorced        220K            No
8    No          Single          85K             Yes
9    No          Married         75K             No
10   No          Single          90K             Yes

Test Set

Home Owner  Marital Status  Taxable Income  Default
No          Single          75K             ?
Yes         Married         50K             ?
No          Married         150K            ?
Yes         Divorced        90K             ?
No          Single          40K             ?
No          Married         80K             ?

Training Set → Learn Classifier → Model; the Model is then applied to the Test Set.

Example of a Decision Tree

Training Data: the ten records of the classification example (Home Owner: categorical, Marital Status: categorical, Taxable Income: continuous, Default: class)

Model: Decision Tree (splitting attributes: HO = Home Owner, MarSt = Marital Status, TaxInc = Taxable Income)

HO
  Yes → NO
  No → MarSt
    Married → NO
    Single, Divorced → TaxInc
      < 80K → NO
      > 80K → YES

Another Example of Decision Tree

Training Data: the same ten records (Home Owner: categorical, Marital Status: categorical, Taxable Income: continuous, Default: class)

MarSt
  Married → NO
  Single, Divorced → HO
    Yes → NO
    No → TaxInc
      < 80K → NO
      > 80K → YES

There could be more than one tree that fits the same data!
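That both trees fit the training data can be checked directly. The sketch below writes each tree as nested conditions and compares its output with the Default column of the ten training records. One detail is an assumption: the slides leave an income of exactly 80K undefined, and this sketch sends it to the NO branch.

```python
# Training records from the classification example:
# (home owner, marital status, taxable income in thousands, default).
training = [
    ("Yes", "Single", 125, "No"), ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"), ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"), ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"), ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"), ("No", "Single", 90, "Yes"),
]

def tree_1(home_owner, marital, income):
    """First tree: split on home owner, then marital status, then income."""
    if home_owner == "Yes":
        return "No"
    if marital == "Married":
        return "No"
    return "Yes" if income > 80 else "No"  # exactly 80K goes to "No" (assumption)

def tree_2(home_owner, marital, income):
    """Second tree: split on marital status, then home owner, then income."""
    if marital == "Married":
        return "No"
    if home_owner == "Yes":
        return "No"
    return "Yes" if income > 80 else "No"  # exactly 80K goes to "No" (assumption)

fits_1 = all(tree_1(ho, ms, inc) == default for ho, ms, inc, default in training)
fits_2 = all(tree_2(ho, ms, inc) == default for ho, ms, inc, default in training)
print(fits_1, fits_2)
```

The two functions test the same attributes in a different order, which is why they agree on these records; trees that fit the same training data do not have to agree on every unseen record in general.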

Classification: Application 1

• Direct Marketing
  – Goal: Reduce cost of mailing by targeting a set of consumers likely to buy a new cell-phone product.
  – Approach:
    • Use the data for a similar product introduced before.
    • We know which customers decided to buy and which decided otherwise. This {buy, don’t buy} decision forms the class attribute.
    • Collect various demographic, lifestyle, and company-interaction related information about all such customers.
      – Type of business, where they stay, how much they earn, etc.
    • Use this information as input attributes to learn a classifier model.

From [Berry & Linoff] Data Mining Techniques, 1997

Classification: Application 2

• Fraud Detection
  – Goal: Predict fraudulent cases in credit card transactions.
  – Approach:
    • Use credit card transactions and the information on the account holder as attributes.
      – When does a customer buy, what does he buy, how often he pays on time, etc.
    • Label past transactions as fraud or fair transactions. This forms the class attribute.
    • Learn a model for the class of the transactions.
    • Use this model to detect fraud by observing credit card transactions on an account.

Clustering Definition

• Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that:
  – Data points in one cluster are more similar to one another.
  – Data points in separate clusters are less similar to one another.
• Similarity Measures:
  – Euclidean Distance if attributes are continuous.
  – Other Problem-specific Measures.
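As a small illustration of the definition, the sketch below computes Euclidean distances for two hand-made clusters of 3-D points and compares the average distance within clusters to the average distance between them. The points and the use of simple averages are assumptions chosen for illustration.

```python
import math
from itertools import combinations, product

def euclidean(p, q):
    """Euclidean distance between two points with continuous attributes."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def mean(values):
    values = list(values)
    return sum(values) / len(values)

# Hypothetical 3-D points, already grouped into two clusters.
cluster_a = [(1.0, 1.0, 1.0), (1.5, 1.2, 0.8), (0.9, 1.1, 1.3)]
cluster_b = [(8.0, 8.0, 8.0), (8.5, 7.9, 8.2), (7.8, 8.3, 8.1)]

# Average distance between points of the same cluster.
intra = mean(euclidean(p, q)
             for cluster in (cluster_a, cluster_b)
             for p, q in combinations(cluster, 2))

# Average distance between points of different clusters.
inter = mean(euclidean(p, q) for p, q in product(cluster_a, cluster_b))

print(f"intra-cluster: {intra:.2f}, inter-cluster: {inter:.2f}")
```

A good clustering shows a small intra-cluster value relative to the inter-cluster value; comparing the two is one simple way to judge a grouping.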

Illustrating Clustering

Euclidean Distance Based Clustering in 3-D space.

• Intracluster distances are minimized
• Intercluster distances are maximized

Clustering: Application 1

• Market Segmentation:
  – Goal: subdivide a market into distinct subsets of customers where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.
  – Approach:
    • Collect different attributes of customers based on their geographical and lifestyle related information.
    • Find clusters of similar customers.
    • Measure the clustering quality by observing buying patterns of customers in the same cluster vs. those from different clusters.

Clustering: Application 2

• Document Clustering:
  – Goal: To find groups of documents that are similar to each other based on the important terms appearing in them.
  – Approach: To identify frequently occurring terms in each document. Form a similarity measure based on the frequencies of different terms. Use it to cluster.
  – Gain: Information Retrieval can utilize the clusters to relate a new document or search term to clustered documents.

Illustrating Document Clustering

• Clustering Points: 3204 Articles of Los Angeles Times.
• Similarity Measure: How many words are common in these documents (after some word filtering).

Category       Total Articles  Correctly Placed
Financial      555             364
Foreign        341             260
National       273             36
Metro          943             746
Sports         738             573
Entertainment  354             278
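The table can be turned into per-category placement rates with a few lines of arithmetic. The sketch below only uses the numbers from the table; the variable names are illustrative.

```python
# (category, total articles, correctly placed), copied from the table.
results = [
    ("Financial", 555, 364), ("Foreign", 341, 260), ("National", 273, 36),
    ("Metro", 943, 746), ("Sports", 738, 573), ("Entertainment", 354, 278),
]

# Share of articles placed in the right cluster, per category.
for name, total, correct in results:
    print(f"{name:<14}{correct / total:6.1%}")

# Overall share across all categories.
total_articles = sum(total for _, total, _ in results)
total_correct = sum(correct for _, _, correct in results)
print(f"{'Overall':<14}{total_correct / total_articles:6.1%}")
```

The category totals add up to the 3204 articles mentioned above, and the per-category rates make it easy to see that the National category is placed far less reliably than the others.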

Association Rule Discovery: Definition

• Given a set of records, each of which contains some number of items from a given collection:
  – Produce dependency rules which will predict the occurrence of an item based on occurrences of other items.

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules Discovered:
  {Milk} --> {Coke}
  {Diaper, Milk} --> {Beer}
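The strength of the two discovered rules can be measured with support and confidence, the two figures quoted for association rules earlier in the lecture. The sketch below uses their standard definitions (support: fraction of transactions containing all items of the rule; confidence: that support divided by the support of the antecedent) on the five transactions from the table; the function names are illustrative.

```python
# The five transactions from the table.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

def support(itemset):
    """Fraction of transactions that contain every item of the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Among transactions containing the antecedent, the share that also contain the consequent."""
    return support(antecedent | consequent) / support(antecedent)

for antecedent, consequent in [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Beer"})]:
    s = support(antecedent | consequent)
    c = confidence(antecedent, consequent)
    print(f"{sorted(antecedent)} --> {sorted(consequent)}: "
          f"support = {s:.0%}, confidence = {c:.0%}")
```

A rule-mining algorithm would enumerate candidate itemsets and keep only the rules whose support and confidence exceed chosen thresholds; this sketch only evaluates two rules that are already given.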

Association Rule Discovery: Application

• Marketing and Sales Promotion:
  – Let the rule discovered be {Bagels, … } --> {Potato Chips}
  – Potato Chips as consequent => Can be used to determine what should be done to boost its sales.
  – Bagels in the antecedent => Can be used to see which products would be affected if the store discontinues selling bagels.
  – Bagels in antecedent and Potato Chips in consequent => Can be used to see what products should be sold with Bagels to promote sale of Potato Chips!

Thank You for Your Attention
