Basics of data mining, Knowledge Discovery in Databases (KDD),
the KDD process, data mining task primitives, integration of
data mining systems with a database or data warehouse
system, major issues in data mining, and data pre-processing:
data cleaning, data integration and transformation, data
reduction, etc.
If a data mining system is not integrated with a database
or a data warehouse system, there is no system for it to
communicate with. This arrangement is known as the
no-coupling scheme. Here, the main focus is on data mining
design and on developing efficient and effective algorithms
for mining the available data sets.
The list of integration schemes is as follows −
Integration of data mining systems with a database or
data warehouse system
1. No Coupling − In this scheme, the data mining system
does not utilize any functions of the database or data
warehouse. It fetches data from a particular source,
processes it using data mining algorithms, and stores the
result in another file.
2. Loose Coupling − In this scheme, the data mining
system may use some functions of the database and
data warehouse system. It fetches data from the
data repository managed by these systems, performs
data mining on that data, and then stores the mining
result either in a file or in a designated place in
a database or in a data warehouse.
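Loose coupling can be sketched in a few lines: the database system is used only to fetch data and to store the result, while the mining step itself runs outside it. This is a minimal, hypothetical example (the `sales` and `mining_result` tables are invented for illustration), using an in-memory SQLite database to stand in for the managed repository.

```python
# Sketch of loose coupling: the mining program uses the database system
# only to fetch data (via SQL) and to store results; the mining itself
# runs outside the database. Table names here are illustrative only.
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")          # stands in for the managed repository
conn.execute("CREATE TABLE sales (item TEXT)")
conn.executemany("INSERT INTO sales VALUES (?)",
                 [("tv",), ("tv",), ("radio",), ("tv",)])

# Fetch the data from the repository managed by the database system.
rows = [r[0] for r in conn.execute("SELECT item FROM sales")]

# Perform the mining step outside the database (here: item frequencies).
freq = Counter(rows).most_common()

# Store the mining result in a designated place in the database.
conn.execute("CREATE TABLE mining_result (item TEXT, support INTEGER)")
conn.executemany("INSERT INTO mining_result VALUES (?, ?)", freq)
print(freq)  # [('tv', 3), ('radio', 1)]
```

Under no coupling, the `SELECT` would be replaced by reading a flat file, and the result would go to another file instead of a result table.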
3. Semi-tight Coupling − In this scheme, the data
mining system is linked with a database or a data
warehouse system, and in addition, efficient
implementations of a few data mining primitives can be
provided in the database.
4. Tight coupling − In this coupling scheme, the data
mining system is smoothly integrated into the database
or data warehouse system. The data mining subsystem is
treated as one functional component of an information
system.
Major issues in data mining
The major issues in data mining concern mining
methodology, user interaction, performance, and diverse
data types. These issues are as follows:
Mining Methodology and User Interaction
Performance Issues
Diverse Data Types Issues
1. Mining Methodology and User Interaction
Issues
It refers to the following kinds of issues −
Mining different kinds of knowledge in databases −
Different users may be interested in different kinds of
knowledge. Therefore, data mining must cover a broad
range of knowledge discovery tasks.
Interactive mining of knowledge at multiple levels of
abstraction − The data mining process needs to be
interactive so that users can focus the search for
patterns, providing and refining data mining requests based
on the returned results.
Incorporation of background knowledge − Background
knowledge can be used to guide the discovery process and
to express the discovered patterns, not only in concise
terms but also at multiple levels of abstraction.
Data mining query languages and ad hoc data
mining − A data mining query language that allows the user
to describe ad hoc mining tasks should be integrated with
a data warehouse query language and optimized for
efficient and flexible data mining.
Presentation and visualization of data mining
results − Once patterns are discovered, they need to be
expressed in high-level languages and visual
representations that are easily understandable.
Handling noisy or incomplete data − Data cleaning
methods are required to handle noise and incomplete
objects while mining data regularities. Without such
methods, the accuracy of the discovered patterns will
be poor.
Pattern evaluation − The patterns discovered should
be interesting; patterns that merely represent common
knowledge or lack novelty are of little value.
2. Performance Issues
Performance-related issues include the following −
Efficiency and scalability of data mining
algorithms − To effectively extract information from
the huge amounts of data in databases, data mining
algorithms must be efficient and scalable.
Parallel, distributed, and incremental mining
algorithms − Factors such as the huge size of databases,
the wide distribution of data, and the complexity of data
mining methods motivate the development of parallel and
distributed data mining algorithms. These algorithms
divide the data into partitions, which are processed in
parallel, and the results from the partitions are then
merged. Incremental algorithms update existing mining
results when the database changes, rather than mining
all the data again from scratch.
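The partition-and-merge idea above can be sketched very compactly. This is an illustrative toy, not a real distributed miner: the "mining" step is just item counting, and threads stand in for the parallel or distributed workers.

```python
# Sketch of partition-and-merge mining: the data is split into
# partitions, each partition is mined independently (here, item
# counts), and the per-partition results are merged.
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

data = ["tv", "radio", "tv", "pc", "tv", "radio"]

# Divide the data into two partitions.
partitions = [data[i::2] for i in range(2)]

def mine(partition):
    # The mining step, applied to one partition in isolation.
    return Counter(partition)

# Process the partitions in parallel, then merge the partial results.
with ThreadPoolExecutor() as pool:
    partials = list(pool.map(mine, partitions))

merged = sum(partials, Counter())
print(merged)  # Counter({'tv': 3, 'radio': 2, 'pc': 1})
```

An incremental variant would keep `merged` and, when new tuples arrive, mine only the new partition and add its counts in, instead of recounting everything.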
3. Diverse Data Types Issues
Handling of relational and complex types of data − The
database may contain complex data objects, multimedia data
objects, spatial data, temporal data, etc. It is not feasible
for one system to mine all these kinds of data.
Mining information from heterogeneous databases and
global information systems − Data is available from
different data sources on a LAN or WAN. These data sources
may be structured, semi-structured, or unstructured.
Mining knowledge from them therefore adds further
challenges to data mining.
Data Mining Task Primitives
We can specify a data mining task in the form of a
data mining query, which is input to the system.
A data mining query is defined in terms of data
mining task primitives.
These primitives allow us to communicate with the
data mining system in an interactive manner.
List of Data Mining Task Primitives −
Set of task-relevant data to be mined.
Kind of knowledge to be mined.
Background knowledge to be used in discovery
process.
Interestingness measures and thresholds for pattern
evaluation.
Representation for visualizing the discovered patterns.
A data mining query is defined in terms of the following
primitives:
1. Task-relevant data: This is the portion of the database
to be investigated. For example, suppose that you are a
manager of All Electronics in charge of sales in the
United States and Canada, and you would like to study the
buying trends of customers in Canada. Rather than mining
the entire database, you can specify only the relevant
portion of the data, together with the attributes of
interest. These are referred to as relevant attributes.
2. The kinds of knowledge to be mined: This specifies
the data mining functions to be performed, such as
characterization, discrimination, association,
classification, clustering, or evolution analysis. For
instance, if studying the buying habits of customers in
Canada, you may choose to mine associations between
customer profiles and the items that these customers like
to buy.
3. Background knowledge: Users can specify
background knowledge, or knowledge about the domain to
be mined. This knowledge is useful for guiding the
knowledge discovery process, and for evaluating the
patterns found. There are several kinds of background
knowledge; concept hierarchies are one popular form.
4. Interestingness measures: These functions are used to
separate uninteresting patterns from knowledge. They may
be used to guide the mining process, or after discovery, to
evaluate the discovered patterns. Different kinds of
knowledge may have different interestingness measures.
5. Presentation and visualization of discovered
patterns: This refers to the form in which discovered
patterns are to be displayed. Users can choose from
different forms for knowledge presentation, such as rules,
tables, charts, graphs, decision trees, and cubes.
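To make the five primitives concrete, here is one way a mining query might be assembled before being handed to a system. This is purely illustrative: the field names and the query-as-dictionary shape are assumptions, not a real query language.

```python
# Illustrative only: the five data mining task primitives expressed as
# a plain data structure. All field names are hypothetical; a real
# system would use its own query language for this.
mining_query = {
    # 1. Task-relevant data: which portion of the database to mine.
    "task_relevant_data": {
        "database": "AllElectronics",              # assumed example database
        "tables": ["customer", "item", "purchases"],
        "condition": "customer.country = 'Canada'",
    },
    # 2. Kind of knowledge to be mined.
    "kind_of_knowledge": "association",
    # 3. Background knowledge (here, a concept hierarchy).
    "background_knowledge": {
        "concept_hierarchy": ["city", "province", "country"],
    },
    # 4. Interestingness measures and thresholds.
    "interestingness": {"min_support": 0.05, "min_confidence": 0.7},
    # 5. Presentation and visualization of discovered patterns.
    "presentation": ["rules", "tables"],
}
print(mining_query["kind_of_knowledge"])  # association
```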
Data pre-processing is a data mining technique used to
transform raw data into a useful and efficient format.
Steps Involved in Data Pre-processing:
1. Data Cleaning: The data can have many irrelevant and
missing parts. Data cleaning handles these, including
missing data, noisy data, etc.
(a) Missing Data: This situation arises when some values
are missing from the data. It can be handled in various
ways. Some of them are:
Ignore the tuples: This approach is suitable only when the
dataset is quite large and multiple values are missing
within a tuple.
Fill the missing values: There are various ways to do this.
You can fill the missing values manually, with the
attribute mean, or with the most probable value.
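Filling by the attribute mean can be sketched in a few lines. This is a minimal example on one numeric attribute, with `None` standing in for a missing value; the age values are invented for illustration.

```python
# Sketch of filling missing values with the attribute mean.
# None marks a missing entry in a single attribute column.
ages = [25, None, 40, None, 31]

# Compute the mean over the known values only.
known = [v for v in ages if v is not None]
mean = sum(known) / len(known)          # (25 + 40 + 31) / 3 = 32.0

# Replace each missing entry with the mean.
filled = [mean if v is None else v for v in ages]
print(filled)  # [25, 32.0, 40, 32.0, 31]
```

Filling with the most probable value would instead train a predictor (e.g. regression or a decision tree) on the tuples where the attribute is known.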
(b) Noisy Data: Noisy data is meaningless data that cannot
be interpreted by machines. It can be generated by
faulty data collection, data entry errors, etc. It can be
handled in the following ways:
1. Binning Method: This method works on sorted data in
order to smooth it. The data is divided into segments
(bins) of equal size, and each segment is handled
separately: all values in a segment can be replaced by the
segment mean, or boundary values can be used to complete
the task.
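Both smoothing variants can be sketched directly; the nine data values below are invented for illustration.

```python
# Sketch of the binning method: sort the data, split it into
# equal-size bins, then smooth each bin by its mean or its boundaries.
data = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
bin_size = 3
bins = [data[i:i + bin_size] for i in range(0, len(data), bin_size)]
# bins -> [[4, 8, 15], [21, 21, 24], [25, 28, 34]]

# Smoothing by bin means: every value becomes its bin's mean.
by_mean = [[sum(b) / len(b)] * len(b) for b in bins]
print(by_mean)   # [[9.0, 9.0, 9.0], [22.0, 22.0, 22.0], [29.0, 29.0, 29.0]]

# Smoothing by bin boundaries: each value moves to the nearer of the
# bin's minimum and maximum.
by_bound = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
            for b in bins]
print(by_bound)  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```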
2. Regression: Here data can be smoothed by fitting
it to a regression function. The regression used may be
linear (having one independent variable) or multiple
(having multiple independent variables).
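For the linear case, smoothing means fitting y = a + b·x by least squares and replacing each noisy y with its fitted value. A minimal sketch, with invented data that is roughly y = 2x plus noise:

```python
# Sketch of smoothing by simple linear regression (one independent
# variable): fit y = a + b*x by least squares, then replace each
# noisy y with the fitted value on the line.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.0, 9.8]          # roughly y = 2x with noise

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
# Least-squares slope and intercept.
b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
    sum((x - mx) ** 2 for x in xs)
a = my - b * mx

# The smoothed values lie exactly on the fitted line.
smoothed = [a + b * x for x in xs]
```

Here b comes out near 1.95 and a near 0.15, so the smoothed series hugs the underlying trend while the noise is removed.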
3. Clustering: This approach groups similar data into
clusters. Outliers either go undetected or fall
outside the clusters.
2. Data Transformation: This step transforms the data
into forms appropriate for the mining process. It
involves the following:
1. Normalization: This scales the data values into a
specified range, such as -1.0 to 1.0 or 0.0 to 1.0.
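One common way to scale into [0.0, 1.0] is min-max normalization; the attribute values below are invented for illustration.

```python
# Sketch of min-max normalization: scale an attribute into [0.0, 1.0]
# via (v - min) / (max - min).
values = [200, 300, 400, 600, 1000]

lo, hi = min(values), max(values)
normalized = [(v - lo) / (hi - lo) for v in values]
print(normalized)  # [0.0, 0.125, 0.25, 0.5, 1.0]
```

Scaling into [-1.0, 1.0] instead is the same formula stretched and shifted: `2 * (v - lo) / (hi - lo) - 1`.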
2.Attribute Selection: In this strategy, new attributes are
constructed from the given set of attributes to help the
mining process.
3. Discretization: This replaces the raw values of a
numeric attribute with interval labels or conceptual
labels.
4. Concept Hierarchy Generation: Here attributes are
generalized from a lower level to a higher level in the
hierarchy. For example, the attribute "city" can be
generalized to "country".
3. Data Reduction: Data reduction techniques can be
applied to obtain a reduced representation of the data set
that is much smaller in volume, yet closely maintains the
integrity of the original data. That is, mining on the
reduced data set should be more efficient yet produce
the same (or almost the same) analytical results.
Strategies for data reduction include the following.
1. Data cube aggregation, where aggregation operations
are applied to the data in the construction of a data cube.
2. Dimension reduction, where irrelevant, weakly
relevant or redundant attributes or dimensions may be
detected and removed.
3. Data compression, where encoding mechanisms are
used to reduce the data set size.
4. Numerosity reduction, where the data are replaced or
estimated by alternative, smaller data representations such as
parametric models (which need store only the model
parameters instead of the actual data), or nonparametric
methods such as clustering, sampling, and the use of
histograms.
5. Discretization and concept hierarchy generation, where
raw data values for attributes are replaced by ranges or higher
conceptual levels. Concept hierarchies allow the mining of data
at multiple levels of abstraction, and are a powerful tool for
data mining.
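Numerosity reduction (strategy 4 above) can be sketched with two of its nonparametric methods: sampling and histograms. The data here is a stand-in attribute of 100 values; the seed and bucket count are arbitrary choices for the example.

```python
# Sketch of two nonparametric numerosity-reduction methods: a random
# sample kept in place of the full data, and an equal-width histogram
# that stores only bucket counts instead of the raw values.
import random

data = list(range(1, 101))                # stand-in for a large attribute

# Sampling: keep a small random subset (seeded for repeatability).
random.seed(0)
sample = random.sample(data, 10)          # 10 values represent 100

# Histogram: 4 equal-width buckets summarize the distribution.
width = (max(data) - min(data) + 1) / 4   # 25.0
counts = [0] * 4
for v in data:
    counts[min(int((v - min(data)) // width), 3)] += 1
print(counts)  # [25, 25, 25, 25]
```

In both cases the reduced representation (10 values, or 4 counts) is much smaller than the original 100 values, yet mining on it can approximate the same analytical results.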