Basics of data mining, Knowledge Discovery in Databases (KDD),
the KDD process, data mining task primitives, integration of
data mining systems with a database or data warehouse
system, major issues in data mining, and data pre-processing:
data cleaning, data integration and transformation, data
reduction, etc.
If a data mining system is not integrated with a database
or a data warehouse system, there is no system for it to
communicate with. This arrangement is known as the
no-coupling scheme. Here, the main focus is on data mining
design and on developing efficient and effective algorithms
for mining the available data sets.
The list of integration schemes is as follows −
Integration of data mining systems with a database or
data warehouse system
1. No Coupling − In this scheme, the data mining system
does not utilize any functions of the database or data
warehouse. It fetches data from a particular source,
processes it using data mining algorithms, and stores the
result in another file.
2. Loose Coupling − In this scheme, the data mining
system may use some functions of the database and
data warehouse system. It fetches data from the
data repository managed by these systems, performs
data mining on that data, and then stores the mining
result either in a file or in a designated place in
a database or in a data warehouse.
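Loose coupling can be sketched in a few lines: the database system is used only to fetch data and to store the result, while the mining step itself runs outside it. This is a minimal, hypothetical example (the `sales` and `mining_result` tables are invented for illustration), using an in-memory SQLite database to stand in for the managed repository.

```python
# Sketch of loose coupling: the mining program uses the database system
# only to fetch data (via SQL) and to store results; the mining itself
# runs outside the database. Table names here are illustrative only.
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")          # stands in for the managed repository
conn.execute("CREATE TABLE sales (item TEXT)")
conn.executemany("INSERT INTO sales VALUES (?)",
                 [("tv",), ("tv",), ("radio",), ("tv",)])

# Fetch the data from the repository managed by the database system.
rows = [r[0] for r in conn.execute("SELECT item FROM sales")]

# Perform the mining step outside the database (here: item frequencies).
freq = Counter(rows).most_common()

# Store the mining result in a designated place in the database.
conn.execute("CREATE TABLE mining_result (item TEXT, support INTEGER)")
conn.executemany("INSERT INTO mining_result VALUES (?, ?)", freq)
print(freq)  # [('tv', 3), ('radio', 1)]
```

Under no coupling, the `SELECT` would be replaced by reading a flat file, and the result would go to another file instead of a result table.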
3. Semi-tight Coupling − In this scheme, the data
mining system is linked with a database or a data
warehouse system, and in addition, efficient
implementations of a few data mining primitives can be
provided in the database.
4. Tight coupling − In this coupling scheme, the data
mining system is smoothly integrated into the database
or data warehouse system. The data mining subsystem is
treated as one functional component of an information
system.
Major issues in data mining
The major issues in data mining concern mining
methodology, user interaction, performance, and diverse
data types. These issues are as follows:
Mining Methodology and User Interaction
Performance Issues
Diverse Data Types Issues
1. Mining Methodology and User Interaction
Issues
It refers to the following kinds of issues −
Mining different kinds of knowledge in databases −
Different users may be interested in different kinds of
knowledge. Therefore, data mining must cover a broad
range of knowledge discovery tasks.
Interactive mining of knowledge at multiple levels of
abstraction − The data mining process needs to be
interactive so that users can focus the search for
patterns, providing and refining data mining requests based
on the returned results.
Incorporation of background knowledge − Background
knowledge can be used to guide the discovery process and
to express the discovered patterns, not only in concise
terms but also at multiple levels of abstraction.
Data mining query languages and ad hoc data
mining − A data mining query language that allows the user
to describe ad hoc mining tasks should be integrated with
a data warehouse query language and optimized for
efficient and flexible data mining.
Presentation and visualization of data mining
results − Once patterns are discovered, they need to be
expressed in high-level languages and visual
representations that are easily understandable.
Handling noisy or incomplete data − Data cleaning
methods are required to handle noise and incomplete
objects while mining data regularities. Without such
methods, the accuracy of the discovered patterns will
be poor.
Pattern evaluation − The patterns discovered should
be interesting; patterns that merely represent common
knowledge or lack novelty are of little value.
2. Performance Issues
Performance-related issues include the following −
Efficiency and scalability of data mining
algorithms − To effectively extract information from
the huge amounts of data in databases, data mining
algorithms must be efficient and scalable.
Parallel, distributed, and incremental mining
algorithms − Factors such as the huge size of databases,
the wide distribution of data, and the complexity of data
mining methods motivate the development of parallel and
distributed data mining algorithms. These algorithms
divide the data into partitions, which are processed in
parallel, and the results from the partitions are then
merged. Incremental algorithms update existing mining
results when the database changes, rather than mining
all the data again from scratch.
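The partition-and-merge idea above can be sketched very compactly. This is an illustrative toy, not a real distributed miner: the "mining" step is just item counting, and threads stand in for the parallel or distributed workers.

```python
# Sketch of partition-and-merge mining: the data is split into
# partitions, each partition is mined independently (here, item
# counts), and the per-partition results are merged.
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

data = ["tv", "radio", "tv", "pc", "tv", "radio"]

# Divide the data into two partitions.
partitions = [data[i::2] for i in range(2)]

def mine(partition):
    # The mining step, applied to one partition in isolation.
    return Counter(partition)

# Process the partitions in parallel, then merge the partial results.
with ThreadPoolExecutor() as pool:
    partials = list(pool.map(mine, partitions))

merged = sum(partials, Counter())
print(merged)  # Counter({'tv': 3, 'radio': 2, 'pc': 1})
```

An incremental variant would keep `merged` and, when new tuples arrive, mine only the new partition and add its counts in, instead of recounting everything.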
3. Diverse Data Types Issues
Handling of relational and complex types of data − The
database may contain complex data objects, multimedia data
objects, spatial data, temporal data, etc. It is not feasible
for one system to mine all these kinds of data.
Mining information from heterogeneous databases and
global information systems − Data is available from
different data sources on a LAN or WAN. These data sources
may be structured, semi-structured, or unstructured.
Mining knowledge from them therefore adds further
challenges to data mining.
Data Mining Task Primitives
We can specify a data mining task in the form of a
data mining query, which is input to the system.
A data mining query is defined in terms of data
mining task primitives.
These primitives allow us to communicate with the
data mining system in an interactive manner.
List of Data Mining Task Primitives −
Set of task-relevant data to be mined.
Kind of knowledge to be mined.
Background knowledge to be used in discovery
process.
Interestingness measures and thresholds for pattern
evaluation.
Representation for visualizing the discovered patterns.
A data mining query is defined in terms of the following
primitives:
1. Task-relevant data: This is the portion of the database
to be investigated. For example, suppose that you are a
manager of All Electronics in charge of sales in the
United States and Canada, and you would like to study the
buying trends of customers in Canada. Rather than mining
the entire database, you can specify only the relevant
portion of the data, together with the attributes of
interest. These are referred to as relevant attributes.
2. The kinds of knowledge to be mined: This specifies
the data mining functions to be performed, such as
characterization, discrimination, association,
classification, clustering, or evolution analysis. For
instance, if studying the buying habits of customers in
Canada, you may choose to mine associations between
customer profiles and the items that these customers like
to buy.
3. Background knowledge: Users can specify
background knowledge, or knowledge about the domain to
be mined. This knowledge is useful for guiding the
knowledge discovery process, and for evaluating the
patterns found. There are several kinds of background
knowledge; concept hierarchies are one popular form.
4. Interestingness measures: These functions are used to
separate uninteresting patterns from knowledge. They may
be used to guide the mining process, or after discovery, to
evaluate the discovered patterns. Different kinds of
knowledge may have different interestingness measures.
5. Presentation and visualization of discovered
patterns: This refers to the form in which discovered
patterns are to be displayed. Users can choose from
different forms for knowledge presentation, such as rules,
tables, charts, graphs, decision trees, and cubes.
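To make the five primitives concrete, here is one way a mining query might be assembled before being handed to a system. This is purely illustrative: the field names and the query-as-dictionary shape are assumptions, not a real query language.

```python
# Illustrative only: the five data mining task primitives expressed as
# a plain data structure. All field names are hypothetical; a real
# system would use its own query language for this.
mining_query = {
    # 1. Task-relevant data: which portion of the database to mine.
    "task_relevant_data": {
        "database": "AllElectronics",              # assumed example database
        "tables": ["customer", "item", "purchases"],
        "condition": "customer.country = 'Canada'",
    },
    # 2. Kind of knowledge to be mined.
    "kind_of_knowledge": "association",
    # 3. Background knowledge (here, a concept hierarchy).
    "background_knowledge": {
        "concept_hierarchy": ["city", "province", "country"],
    },
    # 4. Interestingness measures and thresholds.
    "interestingness": {"min_support": 0.05, "min_confidence": 0.7},
    # 5. Presentation and visualization of discovered patterns.
    "presentation": ["rules", "tables"],
}
print(mining_query["kind_of_knowledge"])  # association
```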
Data pre-processing is a data mining technique used to
transform raw data into a useful and efficient format.
Steps Involved in Data Pre-processing:
1. Data Cleaning: The data can have many irrelevant and
missing parts. Data cleaning handles these, including
missing data, noisy data, etc.
(a) Missing Data: This situation arises when some values
are missing from the data. It can be handled in various
ways. Some of them are:
Ignore the tuples: This approach is suitable only when the
dataset is quite large and multiple values are missing
within a tuple.
Fill the missing values: There are various ways to do this.
You can fill the missing values manually, with the
attribute mean, or with the most probable value.
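Filling by the attribute mean can be sketched in a few lines. This is a minimal example on one numeric attribute, with `None` standing in for a missing value; the age values are invented for illustration.

```python
# Sketch of filling missing values with the attribute mean.
# None marks a missing entry in a single attribute column.
ages = [25, None, 40, None, 31]

# Compute the mean over the known values only.
known = [v for v in ages if v is not None]
mean = sum(known) / len(known)          # (25 + 40 + 31) / 3 = 32.0

# Replace each missing entry with the mean.
filled = [mean if v is None else v for v in ages]
print(filled)  # [25, 32.0, 40, 32.0, 31]
```

Filling with the most probable value would instead train a predictor (e.g. regression or a decision tree) on the tuples where the attribute is known.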
(b) Noisy Data: Noisy data is meaningless data that cannot
be interpreted by machines. It can be generated by
faulty data collection, data entry errors, etc. It can be
handled in the following ways:
1. Binning Method: This method works on sorted data in
order to smooth it. The data is divided into segments
(bins) of equal size, and each segment is handled
separately: all values in a segment can be replaced by the
segment mean, or boundary values can be used to complete
the task.
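Both smoothing variants can be sketched directly; the nine data values below are invented for illustration.

```python
# Sketch of the binning method: sort the data, split it into
# equal-size bins, then smooth each bin by its mean or its boundaries.
data = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
bin_size = 3
bins = [data[i:i + bin_size] for i in range(0, len(data), bin_size)]
# bins -> [[4, 8, 15], [21, 21, 24], [25, 28, 34]]

# Smoothing by bin means: every value becomes its bin's mean.
by_mean = [[sum(b) / len(b)] * len(b) for b in bins]
print(by_mean)   # [[9.0, 9.0, 9.0], [22.0, 22.0, 22.0], [29.0, 29.0, 29.0]]

# Smoothing by bin boundaries: each value moves to the nearer of the
# bin's minimum and maximum.
by_bound = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
            for b in bins]
print(by_bound)  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```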
2. Regression: Here data can be smoothed by fitting
it to a regression function. The regression used may be
linear (having one independent variable) or multiple
(having multiple independent variables).
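For the linear case, smoothing means fitting y = a + b·x by least squares and replacing each noisy y with its fitted value. A minimal sketch, with invented data that is roughly y = 2x plus noise:

```python
# Sketch of smoothing by simple linear regression (one independent
# variable): fit y = a + b*x by least squares, then replace each
# noisy y with the fitted value on the line.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.0, 9.8]          # roughly y = 2x with noise

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
# Least-squares slope and intercept.
b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
    sum((x - mx) ** 2 for x in xs)
a = my - b * mx

# The smoothed values lie exactly on the fitted line.
smoothed = [a + b * x for x in xs]
```

Here b comes out near 1.95 and a near 0.15, so the smoothed series hugs the underlying trend while the noise is removed.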
3. Clustering: This approach groups similar data into
clusters. Outliers either go undetected or fall
outside the clusters.
2. Data Transformation: This step transforms the data
into forms appropriate for the mining process. It
involves the following:
1. Normalization: This scales the data values into a
specified range, such as -1.0 to 1.0 or 0.0 to 1.0.
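One common way to scale into [0.0, 1.0] is min-max normalization; the attribute values below are invented for illustration.

```python
# Sketch of min-max normalization: scale an attribute into [0.0, 1.0]
# via (v - min) / (max - min).
values = [200, 300, 400, 600, 1000]

lo, hi = min(values), max(values)
normalized = [(v - lo) / (hi - lo) for v in values]
print(normalized)  # [0.0, 0.125, 0.25, 0.5, 1.0]
```

Scaling into [-1.0, 1.0] instead is the same formula stretched and shifted: `2 * (v - lo) / (hi - lo) - 1`.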
2.Attribute Selection: In this strategy, new attributes are
constructed from the given set of attributes to help the
mining process.
3. Discretization: This replaces the raw values of a
numeric attribute with interval labels or conceptual
labels.
4. Concept Hierarchy Generation: Here attributes are
generalized from a lower level to a higher level in the
hierarchy. For example, the attribute "city" can be
generalized to "country".
3. Data Reduction: Data reduction techniques can be
applied to obtain a reduced representation of the data set
that is much smaller in volume, yet closely maintains the
integrity of the original data. That is, mining on the
reduced data set should be more efficient yet produce
the same (or almost the same) analytical results.
Strategies for data reduction include the following.
1. Data cube aggregation, where aggregation operations
are applied to the data in the construction of a data cube.
2. Dimension reduction, where irrelevant, weakly
relevant or redundant attributes or dimensions may be
detected and removed.
3. Data compression, where encoding mechanisms are
used to reduce the data set size.
4. Numerosity reduction, where the data are replaced or
estimated by alternative, smaller data representations such as
parametric models (which need store only the model
parameters instead of the actual data), or nonparametric
methods such as clustering, sampling, and the use of
histograms.
5. Discretization and concept hierarchy generation, where
raw data values for attributes are replaced by ranges or higher
conceptual levels. Concept hierarchies allow the mining of data
at multiple levels of abstraction, and are a powerful tool for
data mining.
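Numerosity reduction (strategy 4 above) can be sketched with two of its nonparametric methods: sampling and histograms. The data here is a stand-in attribute of 100 values; the seed and bucket count are arbitrary choices for the example.

```python
# Sketch of two nonparametric numerosity-reduction methods: a random
# sample kept in place of the full data, and an equal-width histogram
# that stores only bucket counts instead of the raw values.
import random

data = list(range(1, 101))                # stand-in for a large attribute

# Sampling: keep a small random subset (seeded for repeatability).
random.seed(0)
sample = random.sample(data, 10)          # 10 values represent 100

# Histogram: 4 equal-width buckets summarize the distribution.
width = (max(data) - min(data) + 1) / 4   # 25.0
counts = [0] * 4
for v in data:
    counts[min(int((v - min(data)) // width), 3)] += 1
print(counts)  # [25, 25, 25, 25]
```

In both cases the reduced representation (10 values, or 4 counts) is much smaller than the original 100 values, yet mining on it can approximate the same analytical results.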