Data Mining: Concepts & Techniques

Motivation:

Necessity is the Mother of Invention

Data explosion problem

Automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses and other information repositories

We are drowning in data, but starving for knowledge!

Solution: Data warehousing and data mining

Data warehousing and on-line analytical processing

Extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases

Evolution of Database Technology

What Is Data Mining?

Data mining (knowledge discovery in databases):

Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases

Alternative names and their inside stories:

Data mining: a misnomer?

Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.

What is not data mining?

(Deductive) query processing.

Expert systems or small ML/statistical programs

Data Mining: A KDD Process

Data mining: the core of knowledge discovery process

Steps of a KDD Process

Learning the application domain:

relevant prior knowledge and goals of application

Creating a target data set: data selection

Data cleaning and preprocessing: (may take 60% of effort!)

Data reduction and transformation:

Find useful features, dimensionality/variable reduction, invariant representation.

Choosing functions of data mining

summarization, classification, regression, association, clustering.

Choosing the mining algorithm(s)

Data mining: search for patterns of interest

Pattern evaluation and knowledge presentation

visualization, transformation, removing redundant patterns, etc.

Use of discovered knowledge

Knowledge Discovery Process

The whole process of extraction of implicit, previously unknown and potentially useful knowledge from a large database

It includes data selection, cleaning, enrichment, coding, data mining, and reporting

Data mining is the key stage of the knowledge discovery process


Knowledge Discovery Process

Example: the database of a magazine publisher which sells five types of magazines on cars, houses, sports, music and comics

Data mining:

Find interesting categorical properties

Questions:

What is the profile of a reader of a car magazine?

Is there any correlation between an interest in cars and an interest in comics?

The knowledge discovery process consists of six stages

Data Selection

Select the information about people who have subscribed to a magazine

Cleaning

Pollution: typing errors, customers who move without notifying a change of address, people who give incorrect information about themselves

Pattern Recognition Algorithms

Lack of domain consistency
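A minimal sketch of the two cleaning checks named above, under stated assumptions: the records, the field names (`client`, `address`, `birth`, `purchased`) and both helper functions are hypothetical and do not come from the slides. The first helper flags records that share an address and birth date but carry different names (a likely typing error); the second flags a simple domain inconsistency, a purchase dated before the client's birth.

```python
from datetime import date

# Hypothetical subscription records; all names and values are invented.
records = [
    {"client": "Johnson", "address": "1 Downing Street",
     "birth": date(1960, 4, 13), "purchased": date(1994, 4, 15)},
    {"client": "Jonson", "address": "1 Downing Street",
     "birth": date(1960, 4, 13), "purchased": date(1996, 6, 21)},
    {"client": "Clinton", "address": "2 Boulevard",
     "birth": date(1955, 8, 20), "purchased": date(1901, 1, 1)},
]


def likely_duplicates(rows):
    """Pair records with the same address and birth date but different names."""
    first_seen = {}
    pairs = []
    for row in rows:
        key = (row["address"].lower(), row["birth"])
        if key in first_seen and first_seen[key]["client"] != row["client"]:
            pairs.append((first_seen[key]["client"], row["client"]))
        first_seen.setdefault(key, row)
    return pairs


def domain_violations(rows):
    """Return clients whose purchase date precedes their birth date."""
    return [row["client"] for row in rows if row["purchased"] < row["birth"]]


print(likely_duplicates(records))
print(domain_violations(records))
```

Flagged records would normally be reviewed or corrected rather than silently dropped; the rule used to decide what counts as a duplicate is a design choice that depends on the data.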

Enrichment

Need extra information about the clients consisting of date of birth, income, amount of credit, and whether or not an individual owns a car or a house

The new information needs to be easily joined to the existing client records

Extract more knowledge
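The join described above can be sketched as follows. This is an illustration only: the `client_id` key, the field names and the sample values are assumptions, and a real system would more likely perform the join inside the database.

```python
# Existing client records (hypothetical).
clients = [
    {"client_id": 1, "name": "Johnson"},
    {"client_id": 2, "name": "Clinton"},
]

# Extra information obtained from elsewhere, keyed by the same
# hypothetical client_id so that it can be joined easily.
extra = {
    1: {"date_of_birth": "1960-04-13", "income": 18500, "credit": 17800,
        "car_owner": True, "house_owner": False},
}


def enrich(client_rows, extra_info):
    """Left-join extra_info onto client_rows by client_id."""
    enriched = []
    for row in client_rows:
        merged = dict(row)  # copy, so the original record is untouched
        merged.update(extra_info.get(row["client_id"], {}))
        enriched.append(merged)
    return enriched


for row in enrich(clients, extra):
    print(row)
```

Clients with no matching extra information keep only their original fields, which is why the later coding stage selects the records that carry enough information to be of value.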

Coding

We select only those records that have enough information to be of value (row selection)

Project the fields in which we are interested (column projection)

Code the information which is too detailed

Address to region

Birth date to age

Divide income by 1000

Divide credit by 1000

Convert cars yes-no to 1-0

Convert purchase date to month numbers starting from 1990

The way in which we code the information will determine the type of patterns we find

Coding has to be performed repeatedly in order to get the best results
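The coding steps listed above can be sketched as one function. The city-to-region lookup, the field names and the sample record are hypothetical; month numbering assumes that January 1990 is month 1.

```python
from datetime import date

# Hypothetical lookup from city to region number.
REGION_BY_CITY = {"Springfield": 1, "Shelbyville": 2}


def code_record(rec, today):
    """Apply the coding steps to one enriched client record."""
    birth = rec["birth"]
    # Subtract one year if this year's birthday has not happened yet.
    age = today.year - birth.year - (
        (today.month, today.day) < (birth.month, birth.day)
    )
    bought = rec["purchased"]
    return {
        "region": REGION_BY_CITY[rec["city"]],          # address to region
        "age": age,                                     # birth date to age
        "income_k": rec["income"] / 1000,               # income / 1000
        "credit_k": rec["credit"] / 1000,               # credit / 1000
        "car": 1 if rec["car_owner"] == "yes" else 0,   # yes-no to 1-0
        # Months counted from January 1990 (= month 1).
        "month_no": (bought.year - 1990) * 12 + bought.month,
    }


record = {
    "city": "Springfield", "birth": date(1960, 4, 13),
    "income": 18500, "credit": 17800, "car_owner": "yes",
    "purchased": date(1994, 4, 15),
}
coded = code_record(record, today=date(1995, 1, 1))
print(coded)
```

Because the choice of coding determines which patterns can be found, a function like this is typically revised and rerun several times, for example with coarser or finer age and income bands.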

We are interested in the relationships between readers of different magazines

Perform flattening operation
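One plausible reading of the flattening operation is turning one record per subscription into one record per client, with a 0/1 field for each of the five magazine types. The sketch below assumes that reading; the input layout and client names are hypothetical.

```python
# The five magazine types from the publisher example.
MAGAZINES = ["car", "house", "sports", "music", "comics"]


def flatten(subscriptions):
    """Turn (client, magazine) pairs into one 0/1 row per client."""
    flat = {}
    for client, magazine in subscriptions:
        row = flat.setdefault(client, {m: 0 for m in MAGAZINES})
        row[magazine] = 1
    return flat


subscriptions = [
    ("Johnson", "car"),
    ("Johnson", "comics"),
    ("Clinton", "music"),
]
for client, row in flatten(subscriptions).items():
    print(client, row)
```

With one row per client, relationships between readers of different magazines become comparisons between columns of the same row.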

Data mining

We may find the following rules

A customer with credit > 13000 and aged between 22 and 31 who has subscribed to a comics magazine at time T will very likely subscribe to a car magazine five years later

The number of house magazines sold to customers with credit between 12000 and 31000 living in region 4 is increasing

A customer with credit between 5000 and 10000 who reads a comics magazine will very likely become a customer with credit between 12000 and 31000 who reads a sports and a house magazine after 12 years
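To make "very likely" measurable, a candidate rule can be checked against the coded records. The sketch below uses support and confidence, which are standard measures for rules of this kind but are not named in the slides; the four records and the `car_later` field are invented for illustration.

```python
def support_confidence(rows, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent -> consequent."""
    matching = [r for r in rows if antecedent(r)]
    both = [r for r in matching if consequent(r)]
    support = len(both) / len(rows) if rows else 0.0
    confidence = len(both) / len(matching) if matching else 0.0
    return support, confidence


# Hypothetical coded records: credit in thousands, age in years,
# comics = subscribed at time T, car_later = car magazine five years on.
rows = [
    {"credit_k": 17.8, "age": 25, "comics": 1, "car_later": 1},
    {"credit_k": 14.0, "age": 30, "comics": 1, "car_later": 1},
    {"credit_k": 15.0, "age": 28, "comics": 1, "car_later": 0},
    {"credit_k": 9.0, "age": 40, "comics": 0, "car_later": 1},
]


def young_comics_reader(r):
    """Credit > 13000, aged 22 to 31, subscribed to a comics magazine."""
    return r["credit_k"] > 13 and 22 <= r["age"] <= 31 and r["comics"] == 1


def subscribes_to_car(r):
    return r["car_later"] == 1


support, confidence = support_confidence(
    rows, young_comics_reader, subscribes_to_car
)
print(f"support={support:.2f} confidence={confidence:.2f}")
```

On this toy data, two of the three customers matching the condition later subscribe to a car magazine; whether such a proportion counts as "very likely" is a threshold the analyst has to choose.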

Knowledge Discovery Process

Business-Question-Driven Process

Data Mining and Business Intelligence

[Figure: layered pyramid; the potential to support business decisions increases toward the top]

Making Decisions

Data Presentation: Visualization Techniques

Data Mining: Information Discovery

Data Exploration: Statistical Analysis, Querying and Reporting

Data Warehouses / Data Marts: OLAP, MDA

Data Sources: Paper, Files, Information Providers, Database Systems, OLTP

Roles, from top to bottom: End User, Business Analyst, Data Analyst, DBA

Architecture of a Typical Data Mining System

Data Mining: On What Kind of Data?

Relational databases

Data warehouses

Transactional databases

Advanced DB and information repositories

Object-oriented and object-relational databases

Spatial databases

Time-series data and temporal data

Text databases and multimedia databases

Heterogeneous databases

WWW