DATA
MINING
UNIT 1: INTRODUCTION, KNOWLEDGE OF DATA, DATA PROCESSING
Why Data Mining?
Needs of Data Mining
Kinds of Patterns and Technologies in DM
KDD vs. Data Mining
Machine learning concepts
OLAP
Knowledge Representation
Data Pre-Processing (cleaning, integration, reduction, transformation and discretization)
Applications with mining-aspect examples, such as weather prediction.
WHY DATA MINING?
The explosive growth of data: from terabytes to petabytes.
Data collection and data availability: automated data collection tools, database systems, the Web, a computerized society.
Major sources of abundant data:
Business: Web, e-commerce, transactions, stocks, ...
Science: remote sensing, bioinformatics, scientific simulation, ...
Society and everyone: news, digital cameras, YouTube, ...
We are drowning in data, but starving for knowledge!
"Necessity is the mother of invention": data mining, the automated analysis of massive data sets.
EVOLUTION OF SCIENCES
Before 1600: empirical science.
1600-1950s: theoretical science. Each discipline has grown a theoretical component. Theoretical models often motivate experiments and generalize our understanding.
1950s-1990s: computational science. Over the last 50 years, most disciplines have grown a third, computational branch (e.g. empirical, theoretical, and computational ecology, physics, or linguistics). Computational science traditionally meant simulation; it grew out of our inability to find closed-form solutions for complex mathematical models.
1990-now: data science. The flood of data from new scientific instruments and simulations; the ability to economically store and manage petabytes of data online; the Internet and computing Grid that make all these archives universally accessible. Scientific information management, acquisition, organization, query, and visualization tasks scale almost linearly with data volumes. Data mining is a major new challenge!
EVOLUTION OF DATABASE TECHNOLOGY
1960s: data collection, database creation, IMS and network DBMS.
1970s: relational data model, relational DBMS implementation.
1980s: RDBMS, advanced data models (extended-relational, OO, deductive, etc.), application-oriented DBMS (spatial, scientific, engineering, etc.).
1990s: data mining, data warehousing, multimedia databases, and Web databases.
2000s: stream data management and mining, data mining and its applications, Web technology (XML, data integration) and global information systems.
WHAT IS DATA MINING?
Data mining (knowledge discovery from data): extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data.
Data mining: a misnomer?
Alternative names: knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
Watch out: is everything "data mining"? Simple search and query processing and (deductive) expert systems are not.
DATA MINING
Data mining is the technology which automates the process of pattern and trend discovery in a data warehousing environment.
MOTIVATION FOR DATA MINING
There is gold in the data waiting to be mined. This gold is in the form of useful patterns waiting to be discovered.
The prime motivation for data mining remains optimal usage of data stored and collected over a period of time.
Useful patterns are actionable, i.e. they suggest decisions that could improve some utility.
MOTIVATION FOR DATA MINING
Example: useful patterns mined from a database or data warehouse that a supermarket chain owner can discover and put to use are shown in the figure below.
DATA WAREHOUSING & DATA MINING TECHNOLOGIES
Data is captured using technologies such as bar codes, radio-frequency identification (RFID) tags, scanned texts, QR codes, digital cameras, satellites, and the Internet.
It is humanly impossible to digest and interpret all the collected data without the help of automated tools.
Data warehousing is a technology that allows one to gather, store, and present data in a form suitable for human exploration. It involves the following:
Data cleaning – removing noise and inconsistent data.
Data integration – bringing data from multiple sources to a single location and common format.
OLAP (on-line analytical processing) tools enable us to explore the stored data along multiple dimensions, at any level of granularity, and manually discover patterns.
KDD (knowledge discovery in databases) is the automatic extraction of novel, understandable, and useful patterns from large stores of data. It also includes post-mining steps such as pattern evaluation and knowledge presentation.
HOW DO DISCOVERED PATTERNS HELP?
The above question is best answered by the following figure.
DATA MODELS
A data model is a description of the organization or the structure of data in an information system. The structure of data can be observed at different levels of abstraction which are Conceptual, Logical, & Physical.
Conceptual Data Model – The manner in which users view the overall structure of the data is denoted as a conceptual data model. Entity-relationship models and ontologies are examples of conceptual data models.
Logical Data Model – The way in which a database system views the overall structure of the data is denoted as a logical data model. E.g. relational, object-oriented, and object-relational models.
Physical Data Model – The way the data is actually stored on a disk (in terms of cylinders and tracks) or on other storage media is denoted as a physical data model.
DATA WAREHOUSING & OLAP: USER'S PERSPECTIVE
Data warehousing: a data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data to support the decision-making process of an enterprise. The data sources may be volatile, but the data warehouse is non-volatile and time-variant.
Data warehouse users are analysts who explore data to find patterns. They study how measures are related to dimensions.
The data warehouse is structured in terms of subjects, measurements, and dimensions to facilitate exploration.
The data is organized in a multi-dimensional model, and the data repository is said to be subject-oriented.
NEEDS OF DATA MINING
Nowadays, large quantities of data are being accumulated. The amount of data collected is said to almost double every 9 months.
Seeking knowledge from massive data is one of the most desired capabilities of data mining.
Data can be large in two senses: in terms of size (e.g. image data) or in terms of dimensionality (e.g. gene expression data).
Usually there is a huge gap between the stored data and the knowledge that could be construed from the data. This transition won't occur automatically; that's where data mining comes into the picture.
In exploratory data analysis, some initial knowledge about the data is known, but data mining can help gain a more in-depth knowledge about the data.
KINDS OF PATTERNS AND TECHNOLOGIES IN DM
Data mining deals with the kinds of patterns that can be mined. On the basis of the kind of data to be mined, there are two categories of functions involved in data mining:
Descriptive
Classification and Prediction
Descriptive Function
The descriptive function deals with the general properties of data in the database.
Here is the list of descriptive functions −
Class/Concept Description
Mining of Frequent Patterns
Mining of Associations
Mining of Correlations
Mining of Clusters
CLASS/CONCEPT DESCRIPTION: CHARACTERIZATION AND DISCRIMINATION
Classes of items for sale include computers and printers; concepts of customers include bigSpenders and budgetSpenders.
The goal is to describe individual classes and concepts in summarized, concise, and yet precise terms. Such descriptions of a class or a concept are called class/concept descriptions.
Data characterization: summarizing the data of the class under study (often called the target class).
E.g. the characteristics of software products with sales that increased by 10% in the previous year.
E.g. summarize the characteristics of customers who spend more than $5000 a year at electronics shops. The result found: they are 40 to 50 years old, employed, and have excellent credit ratings.
Data discrimination: a comparison of the general features of the target class data objects against the general features of objects from one or multiple contrasting classes.
E.g. compare customers who shop for computer products regularly (more than twice a month) against those who rarely shop for such products (less than three times a year). The result: 80% of the customers who frequently purchase computer products are between 20 and 40 years old and have a university education, while 60% of the customers who infrequently buy such products are either seniors or youths.
Mining Frequent Patterns
A frequent itemset typically refers to a set of items that often appear together in a transactional data set, e.g. milk and bread, which are frequently bought together in grocery stores by many customers.
A frequently occurring subsequence, such as the pattern that customers tend to purchase first a laptop, followed by a digital camera, and then a memory card, is a (frequent) sequential pattern.
Mining Associations
An example of such a rule, mined from an electronics shop's transactional database, is
buys(X, "computer") ⇒ buys(X, "software") [support = 1%, confidence = 50%]
where X is a variable representing a customer. A confidence, or certainty, of 50% means that if a customer buys a computer, there is a 50% chance that she will buy software as well.
A 1% support means that 1% of all the transactions under analysis show that computer and software are purchased together.
This association rule involves a single attribute or predicate (i.e., buys) that repeats. It is referred to as a single-dimensional association rule.
Dropping the predicate notation, the rule can be written simply as "computer ⇒ software [1%, 50%]".
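As a quick illustration of how support and confidence are computed, here is a minimal sketch; the transaction list and the rule_stats helper are invented for illustration, not from the source:

```python
def rule_stats(transactions, lhs, rhs):
    """Support and confidence of the association rule lhs => rhs."""
    n = len(transactions)
    both = sum(1 for t in transactions if lhs <= t and rhs <= t)
    lhs_count = sum(1 for t in transactions if lhs <= t)
    support = both / n                                   # fraction of all transactions containing lhs and rhs
    confidence = both / lhs_count if lhs_count else 0.0  # estimate of P(rhs | lhs)
    return support, confidence

# hypothetical transactions
baskets = [{"computer", "software"}, {"computer"},
           {"printer"}, {"computer", "software", "printer"}]
support, confidence = rule_stats(baskets, {"computer"}, {"software"})
```

On this toy data the rule computer ⇒ software has support 2/4 = 50% and confidence 2/3; the slide's 1%/50% figures would come from a much larger transaction set.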
IN A RELATIONAL DATABASE RELATED TO PURCHASES, A DATA MINING SYSTEM MAY FIND ASSOCIATION RULES LIKE
Of the customers under study, 2% are 20 to 29 years old with an income of $40,000 to $49,000 and have purchased a laptop (computer).
There is a 60% probability that a customer in this age and income group will purchase a laptop. This is an association involving more than one attribute or predicate (i.e., age, income, and buys). The above rule can be referred to as a multidimensional association rule.
Mining of Correlations
This is a kind of additional analysis performed to uncover interesting statistical correlations between associated attribute-value pairs or between two itemsets, to analyze whether they have a positive, negative, or no effect on each other.
Mining of Clusters
A cluster refers to a group of similar objects. Cluster analysis refers to forming groups of objects that are very similar to each other but highly different from the objects in other clusters.
CLASSIFICATION AND PREDICTION
Classification is the process of finding a model that describes the data classes or concepts.
The purpose is to be able to use this model to predict the class of objects whose class label is unknown.
The derived model is based on the analysis of sets of training data. The derived model can be presented in the following forms:
Classification (IF-THEN) Rules
Decision Trees
Mathematical Formulae
Neural Networks
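A classification model in the IF-THEN rule form can be as simple as the following sketch; the thresholds echo the characterization example earlier in the unit, but the function and attribute names are made up for illustration:

```python
def classify(customer):
    """Toy IF-THEN classification rule for the bigSpenders/budgetSpenders concepts."""
    # IF age is 40..50 AND credit rating is excellent THEN bigSpender
    if 40 <= customer["age"] <= 50 and customer["credit"] == "excellent":
        return "bigSpender"
    return "budgetSpender"

label = classify({"age": 45, "credit": "excellent"})  # "bigSpender"
```

A real system would induce many such rules (or a decision tree encoding them) from labeled training data rather than hand-writing them.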
DATA MINING: CONFLUENCE OF MULTIPLE DISCIPLINES
Data mining draws on database technology, statistics, machine learning, pattern recognition, algorithms, visualization, and other disciplines.
MACHINE LEARNING
Machine learning is a fast-growing discipline.
Supervised learning is basically a synonym for classification.
• E.g.: the postal code recognition problem.
• A set of handwritten postal code images and their corresponding machine-readable translations are used as the training examples, which supervise the learning of the classification model.
Unsupervised learning is essentially a synonym for clustering.
• Eg: an unsupervised learning method can take, as input, a set
of images of handwritten digits.
• It finds 10 clusters of data. These clusters may correspond to the
10 distinct digits of 0 to 9, respectively.
Semi-supervised learning is a class of machine learning techniques that make use of both labeled and unlabeled examples when learning a model.
• Labeled examples are used to learn class models
• Unlabeled examples are used to refine the boundaries between
classes.
• Eg: For a two-class problem, we can think of the set of
examples belonging to one class as the positive examples
and those belonging to the other class as the negative examples.
In the figure, if we do not consider the unlabeled examples, the dashed line is the decision boundary that best partitions the positive examples from the negative examples. Using the unlabeled examples, we can refine the decision boundary to the solid line. Moreover, we can detect that the two positive examples at the top right corner, though labeled, are likely noise or outliers.
Active learning is a machine learning approach that lets
users play an active role in the learning process.
• An active learning approach can ask a user (e.g., a domain expert) to label an example, which may be from a set of unlabeled examples or synthesized by the learning program.
• The goal is to optimize the model quality by actively
acquiring knowledge from human users, given a constraint on
how many examples they can be asked to label.
KNOWLEDGE REPRESENTATION
DATA PRE-PROCESSING
Data Cleaning: Missing Values; Noisy Data; Data Cleaning as a Process
DATA INTEGRATION
Entity Identification Problem; Redundancy and Correlation Analysis; Tuple Duplication; Data Value Conflict Detection and Resolution
DATA REDUCTION
Overview of Data Reduction; Wavelet Transform; Principal Component Analysis; Attribute Subset Selection; Regression and Log-Linear Models; Histograms; Clustering
DATA TRANSFORMATION AND DATA DISCRETIZATION
Data Transformation Strategies Overview; Data Transformation by Normalization; Discretization by Binning; Discretization by Histogram Analysis; Discretization by Cluster, Decision Tree, and Correlation Analyses; Concept Hierarchy Generation for Nominal Data
DATA CLEANING
Importance:
"Data cleaning is one of the three biggest problems in data warehousing" (Ralph Kimball)
"Data cleaning is the number one problem in data warehousing" (DCI survey)
Data cleaning tasks:
Fill in missing values
Identify outliers and smooth out noisy data
Correct inconsistent data
Resolve redundancy caused by data integration
MISSING DATA
Data is not always available. E.g., many tuples have no recorded value for several attributes, such as customer income in sales data.
Missing data may be due to:
equipment malfunction
data that was inconsistent with other recorded data and thus deleted
data not entered due to misunderstanding
certain data not being considered important at the time of entry
failure to register history or changes of the data
Missing data may need to be inferred.
HOW TO HANDLE MISSING DATA?
Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably.
Fill in the missing value manually: tedious + infeasible?
Fill it in automatically with:
a global constant: e.g., "unknown", a new class?!
the attribute mean
the attribute mean for all samples belonging to the same class: smarter
the most probable value: inference-based, such as a Bayesian formula or decision tree
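The "attribute mean" and the smarter "class-wise mean" strategies can be sketched in a few lines of plain Python, with None marking a missing value; the function names and data are hypothetical:

```python
def fill_with_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def fill_with_class_mean(rows, attr, label):
    """Smarter: fill a missing attribute with the mean of the tuple's own class."""
    by_class = {}
    for r in rows:
        if r[attr] is not None:
            by_class.setdefault(r[label], []).append(r[attr])
    means = {c: sum(vs) / len(vs) for c, vs in by_class.items()}
    return [dict(r, **{attr: means[r[label]]}) if r[attr] is None else r
            for r in rows]
```

The class-wise version gives a better estimate because tuples of the same class tend to have similar attribute values.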
NOISY DATA
Noise: random error or variance in a measured variable.
Incorrect attribute values may be due to: faulty data collection instruments; data entry problems; data transmission problems; technology limitations; inconsistency in naming conventions.
Other data problems which require data cleaning: duplicate records; incomplete data; inconsistent data.
HOW TO HANDLE NOISY DATA?
Binning: first sort the data and partition it into (equal-frequency) bins; then smooth by bin means, bin medians, bin boundaries, etc.
Regression: smooth by fitting the data to regression functions.
Clustering: detect and remove outliers.
Combined computer and human inspection: detect suspicious values and check them by a human (e.g., deal with possible outliers).
SIMPLE DISCRETIZATION METHODS: BINNING
Equal-width (distance) partitioning:
Divides the range into N intervals of equal size: a uniform grid.
If A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B - A)/N.
The most straightforward, but outliers may dominate the presentation, and skewed data is not handled well.
Equal-depth (frequency) partitioning:
Divides the range into N intervals, each containing approximately the same number of samples.
Good data scaling, but managing categorical attributes can be tricky.
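The equal-width formula W = (B - A)/N translates directly to code; a minimal sketch, where the function name and the sample values are made up:

```python
def equal_width_bin_index(values, n):
    """Assign each value the index of its equal-width interval, W = (B - A) / n."""
    a, b = min(values), max(values)
    w = (b - a) / n
    # the maximum value falls on the last interval's open edge, so clamp it in
    return [min(int((v - a) / w), n - 1) for v in values]

equal_width_bin_index([0, 3, 5, 10], 2)  # width 5: bins [0, 0, 1, 1]
```

Note how a single extreme value would stretch (b - a) and push most data into one bin, which is the outlier sensitivity the slide mentions.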
BINNING METHODS FOR DATA SMOOTHING
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
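The worked example above can be reproduced with a short sketch in plain Python; rounding the bin means to whole dollars follows the slide:

```python
data = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
n_bins = 3
size = len(data) // n_bins
bins = [data[i * size:(i + 1) * size] for i in range(n_bins)]

# smoothing by bin means: each value replaced by its bin's (rounded) mean
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# smoothing by bin boundaries: each value moved to the nearer of its bin's min/max
by_bounds = [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b] for b in bins]
```

Running this reproduces the three smoothed bins shown above (9/23/29 for means; 4,4,4,15 etc. for boundaries).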
REGRESSION
[Figure: data points fitted by the regression line y = x + 1; a value Y1 at X1 is smoothed to the fitted value Y1'.]
CLUSTER ANALYSIS
[Figure: data points grouped into clusters; values falling outside every cluster are treated as outliers.]
DATA INTEGRATION
Data integration: combines data from multiple sources into a coherent store.
Schema integration: integrate metadata from different sources, e.g., A.cust-id ≡ B.cust-#.
Entity identification problem: identify real-world entities from multiple data sources, e.g., Bill Clinton = William Clinton.
Detecting and resolving data value conflicts: for the same real-world entity, attribute values from different sources may differ. Possible reasons: different representations, different scales (e.g., metric vs. British units).
HANDLING REDUNDANCY IN DATA INTEGRATION
Redundant data occur often when integrating multiple databases.
Object identification: the same attribute or object may have different names in different databases.
Derivable data: one attribute may be a "derived" attribute in another table, e.g., annual revenue.
Redundant attributes may be detected by correlation analysis.
Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality.
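For numeric attributes, correlation analysis typically means the Pearson correlation coefficient; a value near +1 or -1 flags one attribute as potentially derivable from the other. A minimal sketch on hypothetical data:

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric attributes."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# e.g. annual revenue exactly derivable from monthly revenue: correlation is essentially 1.0
r = pearson([1, 2, 3, 4], [12, 24, 36, 48])
```

An attribute pair with |r| close to 1 is a candidate for dropping one of the two during integration.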
DATA TRANSFORMATION
Smoothing: remove noise from data.
Aggregation: summarization, data cube construction.
Generalization: concept hierarchy climbing.
Normalization: scale to fall within a small, specified range (min-max normalization, z-score normalization, normalization by decimal scaling).
Attribute/feature construction: new attributes constructed from the given ones.
DATA TRANSFORMATION: NORMALIZATION
Min-max normalization: to [new_min_A, new_max_A]:
v' = (v - min_A) / (max_A - min_A) × (new_max_A - new_min_A) + new_min_A
Ex.: Let income range from $12,000 to $98,000, normalized to [0.0, 1.0]. Then $73,600 is mapped to (73,600 - 12,000) / (98,000 - 12,000) × (1.0 - 0) + 0 = 0.716.
Z-score normalization (μ: mean, σ: standard deviation):
v' = (v - μ_A) / σ_A
Ex.: Let μ = 54,000 and σ = 16,000. Then v = 73,600 maps to (73,600 - 54,000) / 16,000 = 1.225.
Normalization by decimal scaling:
v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1.
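The three normalization formulas, with the slide's income example, as a runnable sketch (function names are my own):

```python
def min_max(v, mn, mx, new_mn=0.0, new_mx=1.0):
    """Min-max normalization of v from [mn, mx] to [new_mn, new_mx]."""
    return (v - mn) / (mx - mn) * (new_mx - new_mn) + new_mn

def z_score(v, mean, std):
    """Z-score normalization: distance from the mean in standard deviations."""
    return (v - mean) / std

def decimal_scaling(values):
    """Divide by 10^j, with j the smallest integer making every |v'| < 1."""
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

min_max(73600, 12000, 98000)   # ~0.716, as in the slide example
z_score(73600, 54000, 16000)   # 1.225
```

Min-max preserves the relationships among the original values but needs known bounds; z-score is preferable when the min and max are unknown or outliers dominate.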
DATA REDUCTION STRATEGIES
Why data reduction? A database/data warehouse may store terabytes of data, and complex data analysis/mining may take a very long time to run on the complete data set.
Data reduction: obtain a reduced representation of the data set that is much smaller in volume but produces the same (or almost the same) analytical results.
Data reduction strategies:
Data cube aggregation
Dimensionality reduction, e.g., remove unimportant attributes
Data compression
Numerosity reduction, e.g., fit data into models
Discretization and concept hierarchy generation
DATA CUBE AGGREGATION
The lowest level of a data cube (the base cuboid) holds the aggregated data for an individual entity of interest, e.g., a customer in a phone-calling data warehouse.
Multiple levels of aggregation in data cubes further reduce the size of the data to deal with.
Reference the appropriate levels: use the smallest representation which is enough to solve the task.
Queries regarding aggregated information should be answered using the data cube, when possible.
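Moving up an aggregation level can be sketched as summing out one dimension of a tiny two-dimensional cube; the city/quarter names and sales figures below are invented:

```python
from collections import defaultdict

# base cuboid: (city, quarter) -> sales
cube = {("Delhi", "Q1"): 10, ("Delhi", "Q2"): 15,
        ("Mumbai", "Q1"): 7, ("Mumbai", "Q2"): 8}

def roll_up(cube, keep):
    """Keep only dimension `keep` (0 = city, 1 = quarter), summing over the rest."""
    out = defaultdict(int)
    for key, sales in cube.items():
        out[key[keep]] += sales
    return dict(out)

roll_up(cube, 1)  # totals per quarter: {'Q1': 17, 'Q2': 23}
```

The rolled-up cuboid is smaller than the base cuboid, which is exactly the reduction the slide describes: queries about quarterly totals no longer need the per-city detail.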
ATTRIBUTE SUBSET SELECTION
Feature selection (i.e., attribute subset selection):
Select a minimum set of features such that the probability distribution of the different classes given the values of those features is as close as possible to the original distribution given the values of all features.
It reduces the number of patterns, which are then easier to understand.
Heuristic methods (due to the exponential number of choices):
Step-wise forward selection
Step-wise backward elimination
Combining forward selection and backward elimination
Decision-tree induction
EXAMPLE OF DECISION TREE INDUCTION
Initial attribute set: {A1, A2, A3, A4, A5, A6}
[Figure: a decision tree whose internal nodes test A4, A1, and A6, with leaves labeled Class 1 and Class 2.]
> Reduced attribute set: {A1, A4, A6}
HEURISTIC FEATURE SELECTION METHODS
There are 2^d possible sub-features of d features. Several heuristic feature selection methods exist:
Best single features under the feature independence assumption: choose by significance tests.
Best step-wise feature selection: the best single feature is picked first; then the next best feature conditioned on the first; and so on.
Step-wise feature elimination: repeatedly eliminate the worst feature.
Best combined feature selection and elimination.
Optimal branch and bound: use feature elimination and backtracking.
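Step-wise forward selection can be sketched as a greedy loop; the additive scoring function and per-feature weights below are an invented stand-in for a real class-separability measure:

```python
def forward_select(features, score, k):
    """Greedily add, at each step, the feature that maximizes the subset score."""
    selected, remaining = [], list(features)
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected

# hypothetical per-feature usefulness, additive only for illustration
weights = {"A1": 3, "A2": 1, "A3": 1, "A4": 5, "A5": 2, "A6": 4}
score = lambda subset: sum(weights[f] for f in subset)
forward_select(list(weights), score, 3)  # ['A4', 'A6', 'A1']
```

Swapping the direction of the loop (start with all features, repeatedly drop the worst) gives step-wise backward elimination.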
DIMENSIONALITY REDUCTION: PRINCIPAL COMPONENT ANALYSIS (PCA)
Given N data vectors from n dimensions, find k ≤ n orthogonal vectors (principal components) that can best be used to represent the data.
Steps:
Normalize the input data: each attribute falls within the same range.
Compute k orthonormal (unit) vectors, i.e., the principal components.
Each input data vector is a linear combination of the k principal component vectors.
The principal components are sorted in order of decreasing "significance" or strength.
Since the components are sorted, the size of the data can be reduced by eliminating the weak components, i.e., those with low variance. (Using the strongest principal components, it is possible to reconstruct a good approximation of the original data.)
[Figure: axes X1, X2 with principal component directions Y1, Y2.]
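The steps above can be sketched with NumPy via the covariance-eigenvector route; this is a minimal sketch rather than a full implementation (it centers the data but skips full range normalization):

```python
import numpy as np

def pca_project(X, k):
    """Project N x n data onto its k strongest principal components."""
    Xc = X - X.mean(axis=0)                 # center each attribute
    cov = np.cov(Xc, rowvar=False)          # n x n covariance matrix
    vals, vecs = np.linalg.eigh(cov)        # eigenpairs, eigenvalues ascending
    order = np.argsort(vals)[::-1]          # sort components by decreasing variance
    components = vecs[:, order[:k]]
    return Xc @ components                  # N x k reduced representation
```

Projecting onto the single strongest component retains the maximum variance any one direction can capture; the discarded weak (low-variance) components are exactly what data reduction drops.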
KDD VS. DATA MINING
KDD is a process of extracting previously unknown, valid, and actionable (understandable) information from large databases.
Data mining is a step in the KDD process: the application of data analysis and discovery algorithms.
OLAP
Basic idea: converting data into the information that decision makers need.
The concept is to analyze data by multiple dimensions in a structure called a data cube.
The goal of OLAP is to support ad-hoc querying for the business analyst.
Business analysts are familiar with spreadsheets; OLAP extends the spreadsheet analysis model to work with warehouse data. A multidimensional view of data is the foundation of OLAP.
Data Mining Issues
Human interaction - Interfaces are required with both domain and technical experts; a variety of databases and a variety of users lead to numerous data mining techniques; what is required is not known in advance, hence the extraction process needs to be interactive.
Interpretation of results - Experts' requirements and interpretability problems arise; background knowledge or domain expertise is essential to guide the discovery process.
Visualization of results - Visualization helps, but multi-dimensional data is problematic; the discovered knowledge should be expressed in the form of trees, tables, graphs, charts, curves, etc.
Data Mining Issues Continued
Large datasets - Scalability is a problem; algorithms do not scale well with massive real-world datasets; sampling and parallelization are effective tools.
High dimensionality - A conventional database may contain many different attributes, not all of which are relevant; this increases complexity and reduces efficiency (the "dimensionality curse"); remedies include data reduction and dimensionality reduction.
Multimedia data - Found in GIS databases; it proves conventional data mining algorithms ineffective.
Missing data - It is not always possible to ignore missing data, but in preprocessing, data mining algorithms can be used to replace missing data with estimates.
Data Mining Issues Continued
Irrelevant data – Data is reduced by removing irrelevant data.
Noisy data and outliers – Invalid, incorrect data will lead to poor-quality data mining; outliers are very different from the rest and do not fit nicely into the derived model.
Changing data – Data warehouses contain non-volatile data; dynamic data is uploaded and then the algorithms are reapplied.
Integration – KDD requests are one-time needs; data mining functions are now integrated into traditional database systems.
Applications – Effective use of the output of a mining algorithm is a greater challenge than the complexity of the mining algorithm itself.
Data Mining Metrics
How do we measure the effectiveness of the data mining process?
- The KDD process is expensive; the return on investment will be the savings due to decision processes using the results.
- This is difficult to measure and quantify.
- It may be measured as an increase in sales or a reduction in advertising cost.
Social Implications of Data mining
Two sides of the coin
Data mining can be used to improve customer service and satisfaction.
Data mining can be used to confront one's right to privacy.
Omnipresent, invisible data mining affects everyone: profiling is used to label typical characteristics.
DATABASE PROCESSING VS. DATA MINING PROCESSING
Database query: well defined; SQL.
Data mining query: poorly defined; no precise query language.
Database output: a subset of the database.
Data mining output: not a subset of the database.
QUERY EXAMPLES
Database:
– Find all credit applicants with the last name Smith.
– Identify customers who have purchased more than $10,000 in the last month.
– Find all customers who have purchased milk.
Data mining:
– Find all credit applicants who are poor credit risks. (classification)
– Identify customers with similar buying habits. (clustering)
– Find all items which are frequently purchased with milk. (association rules)
OPERATIONAL VS. INFORMATIONAL
Feature        Operational Data     Data Warehouse
Application    OLTP                 OLAP
Use            Precise queries      Ad hoc
Temporal       Snapshot             Historical
Modification   Dynamic              Static
Orientation    Application          Business
Data           Operational values   Integrated
Size           Gigabits             Terabits
Level          Detailed             Summarized
Access         Often                Less often
Response       Few seconds          Minutes
Data schema    Relational           Star/Snowflake
OLAP
On-Line Analytic Processing (OLAP): provides more complex queries than OLTP.
On-Line Transaction Processing (OLTP): traditional database/transaction processing.
Dimensional data; cube view.
Visualization of operations:
Slice: examine a sub-cube. Dice: rotate the cube to look at another dimension. Roll Up / Drill Down.
OLAP OPERATIONS
Single cell; multiple cells; slice; dice; roll up; drill down.
OLTP VS. OLAP
Feature             OLTP                              OLAP
users               clerk, IT professional            knowledge worker
function            day-to-day operations             decision support
DB design           application-oriented              subject-oriented
data                current, up-to-date; detailed,    historical; summarized,
                    flat relational; isolated         multidimensional; integrated, consolidated
usage               repetitive                        ad-hoc
access              read/write; index/hash            lots of scans
                    on primary key
unit of work        short, simple transaction         complex query
# records accessed  tens                              millions
# users             thousands                         hundreds
DB size             100MB-GB                          100GB-TB
metric              transaction throughput            query throughput, response
DATA MINING APPLICATIONS
Business intelligence (BI) technologies provide historical, current, and predictive views of business operations.
BI includes reporting, online analytical processing, business performance management, competitive intelligence, benchmarking, and predictive analytics.
"How important is business intelligence?" It enables effective market analysis, comparing customer feedback on similar products, discovering the strengths and weaknesses of competitors, retaining highly valuable customers, and making smart business decisions.
Classification and prediction techniques are the core of predictive analytics in business intelligence, for which there are many applications in analyzing markets, supplies, and sales.
Clustering plays a central role in customer relationship management, which groups customers based on their similarities.
Using characterization mining techniques, we can better understand the features of each customer group and develop customized customer reward programs.
WEB SEARCH ENGINE
Crawling
Indexing
Target hits
Top related