A Seminar Report On
ARTIFICIAL NEURAL NETWORKS BASED DATA MINING
TECHNIQUES
Submitted in partial fulfilment of the requirements
For the award of degree
Of
INTEGRATED DUAL DEGREE
In
COMPUTER SCIENCE AND ENGINEERING
(With Specialization in Information Technology)
Submitted by
Vaibhav Dhattarwal
CSE-IDD
Enrolment No: 08211018
Under the guidance of
DR. DURGA TOSHINWAL
Professor
ELECTRONICS AND COMPUTER ENGINEERING DEPARTMENT
INDIAN INSTITUTE OF TECHNOLOGY ROORKEE
ROORKEE-247667
AUGUST 2012
Abstract
This report presents an overview of data mining techniques and some of their applications in
various utility networks. Companies have been collecting data for decades, building massive
data warehouses in which to store it. Even though this data is available, very few companies
have been able to realize the actual value stored in it. The question these companies are
asking is how to extract this value. The answer is data mining. Many technologies are
available to data mining practitioners, including artificial neural networks, regression, and
decision trees. Many practitioners are wary of neural networks because of their black-box
nature, even though they have proven themselves in many situations. This report also provides
a brief overview of artificial neural networks and examines their suitability as a tool for
data mining.
Table of Contents
Abstract
Table of Contents
List of Figures
Chapter 1 Introduction
1.1 Objective of the Seminar
Chapter 2 Data Mining
2.1 Data Mining Process
2.2 CRISP-DM Model
Chapter 3 Data Mining Techniques
3.1 Classification
3.2 Clustering
3.3 Regression
3.4 Association Rule
3.5 Neural Networks
Chapter 4 Neural Networks in Data Mining
4.1 Feed Forward Neural Network
4.2 Back Propagation Algorithm
Chapter 5 Applications of Data Mining Techniques
5.1 Specific Application Areas
5.2 Spatial Data Mining
5.3 Multimedia Data Mining
5.4 Web Mining
Chapter 6 Conclusion
References
List of Figures
2.1 Knowledge Discovery in Databases Process
2.2 Cross Industry Standard Process for Data Mining
3.1 Formation of Clusters
3.2 Linear Regression
4.1 An Artificial Neural Network
4.2 A Feed Forward Neural Network
5.1 Spatial Data Mining
5.2 Process Chart for conducting Text Mining
1 Introduction
The development of information technology has generated large numbers of databases and huge
volumes of data in many areas. Companies and organizations now accumulate data at an
enormous rate and from a very broad variety of sources, from customer transactions, credit
card transactions, and bank cash withdrawals to hourly weather data. Many relational database
servers have been built to store such massive quantities of data. As a matter of fact, the
data itself is critical to a company's growth: it contains knowledge that could lead to
important business decisions and take the business to the next level. Yet these data have
mostly been examined only in a superficial manner. Organizations are becoming data rich but
knowledge poor; in other words, "We are drowning in data, but starving for knowledge!"
We need information, but what we have is a huge amount of data flooding companies,
organizations, and even individuals. Because the amount of data is so enormous that humans
cannot process it fast enough to extract information at the right time, machine learning
technology has been developed to address this problem.
The research in databases and information technology has given rise to an approach to store
and manipulate this precious data for further decision making.
Data mining is the term used to describe the process of extracting value from a database. A
Data-warehouse is a location where information is stored. The type of data stored depends
largely on the type of industry and the company. Data mining (the analysis step of the
"Knowledge Discovery in Databases" process, or KDD), a relatively young and
interdisciplinary field of computer science, is the process that attempts to discover patterns in
large data sets. It utilizes methods at the intersection of artificial intelligence, machine
learning, statistics, and database systems. The overall goal of the data mining process is to
extract information from a data set and transform it into an understandable structure for
further use. Aside from the raw analysis step, it involves database and data
management aspects, data pre-processing, model and inference considerations,
interestingness metrics, complexity considerations, post-processing of discovered
structures, visualization, and online updating.
Data mining is the business of answering questions that you have not yet asked; it reaches
deep into databases. Data mining tasks can be classified into two categories: descriptive and
predictive data mining.
Descriptive data mining provides information to understand what is happening inside the data
without a predetermined idea. Predictive data mining allows the user to submit records with
unknown field values and have the system predict those values from patterns previously
discovered in the database. Data mining models can be categorized according to the tasks they
perform: classification and prediction, clustering, and association rules. Classification and
prediction are predictive models, while clustering and association rules are descriptive
models.
The most common task in data mining is classification. It recognizes patterns that describe
the group to which an item belongs by examining existing items that have already been
classified and inferring a set of rules. Similar to classification is clustering, the major
difference being that no groups have been predefined. Prediction is the construction and use
of a model to assess the class of an unlabelled object, or to assess the value or value
ranges that a given object is likely to have. A further task is forecasting, which differs
from prediction in that it estimates the future value of continuous variables based on
patterns within the data.
Four things are required to data-mine effectively: high-quality data, the “right” data, an
adequate sample size and the right tool. There are many tools available to a data mining
practitioner. These include decision trees, various types of regression and neural networks.
1.1 Objective of the Seminar
The introduction of Data Mining and a description of the Data Mining Process are presented
in this seminar report. The objective of this seminar is to present an overview of Data Mining
techniques that are in use and are applicable in various scenarios. The application of these
techniques has also been discussed after an explanation of the implementation of the
technique.
2 Data Mining
Data mining is the process of extracting useful information and patterns from huge volumes of
data. It is also called the knowledge discovery process, knowledge mining from data,
knowledge extraction, or data/pattern analysis; in other words, it can be referred to as
Knowledge Discovery in Databases (KDD). It involves searching large volumes of data for
patterns.
Figure 2.1 Knowledge Discovery in Databases Process
The Knowledge Discovery in Databases (KDD) process is commonly divided into the following
stages:
(1) Selection
(2) Pre-processing
(3) Transformation
(4) Data Mining
(5) Interpretation/Evaluation.
2.1 Data Mining Process
Data Mining is performed on the following types of data:
Relational databases
Data warehouses
Transactional databases
Advanced DB and information repositories
o Object-oriented and object-relational databases
o Spatial databases
o Time-series data and temporal data
o Text databases and multimedia databases
o Heterogeneous and legacy databases
Some of the steps involved in the Data Mining process are:
Data cleaning: The task of this step is to remove noise and inconsistent data.
Data integration: In this step, multiple data sources, such as those mentioned in the
section above, are combined into an integrated collection of data.
Data selection: All the data relevant to the analysis task is retrieved from the database.
Data transformation: The data is transformed or consolidated into forms appropriate for
mining, for example by performing summary or aggregation operations.
Data mining: The critical step, in which intelligent methods are applied in order to
extract data patterns.
Pattern evaluation: This step identifies the truly interesting patterns representing
knowledge, based on certain interestingness measures.
Knowledge presentation: In the final step, various visualization and knowledge
representation techniques are used to present the mined knowledge to the user.
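The cleaning, selection and transformation steps above can be illustrated on a toy record set; all field names and values below are invented for the sketch.

```python
# A toy illustration of the cleaning / selection / transformation steps,
# using plain Python records. All field names and values are invented.
raw = [
    {"customer": "A", "region": "north", "amount": 120.0},
    {"customer": "B", "region": "north", "amount": None},   # noisy record
    {"customer": "C", "region": "south", "amount": 80.0},
    {"customer": "D", "region": "south", "amount": 40.0},
]

# Data cleaning: remove records with missing values.
clean = [r for r in raw if r["amount"] is not None]

# Data selection: keep only the attributes relevant to the analysis task.
selected = [{"region": r["region"], "amount": r["amount"]} for r in clean]

# Data transformation: consolidate by aggregation (total amount per region).
totals = {}
for r in selected:
    totals[r["region"]] = totals.get(r["region"], 0.0) + r["amount"]

print(totals)  # {'north': 120.0, 'south': 120.0}
```

In a real pipeline these steps would run against the database systems listed above rather than in-memory lists, but the shape of the work is the same.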
Data mining has five main functions:
Classification: It infers the defining characteristics of a certain group.
Clustering: It identifies groups of items that share a particular characteristic.
(Clustering differs from classification in that the characteristic is not given in advance,
whereas in classification the classes are predefined.)
Association: It identifies relationships between events that occur at one time.
Sequencing: It is similar to association, except that the relationship exists over a
period of time.
Forecasting: It estimates future values based on patterns within large sets of data.
2.2 Cross-Industry Standard Process for Data Mining (CRISP-DM) Model
Figure 2.2 Cross-Industry Standard Process for Data Mining (CRISP-DM)
1. Business understanding - In this phase, it is essential to understand the business
objectives clearly and to find out what the client really wants to achieve. Next, the
situation is assessed by identifying the resources, assumptions, constraints and other
important factors. Then, from the business objectives and the current situation, goals are
defined that achieve the business objective within that situation. Finally, a sound data
mining plan is established to achieve both the business and the data mining goals.
2. Data understanding - This phase starts with initial data collection from the available
sources in order to become familiar with the data. Data loading and data integration must be
carried out to ensure successful collection. Next, the "surface" properties of the acquired
data are examined carefully and reported. The data is then explored by tackling the data
mining questions, which can be addressed using querying, reporting and visualization.
Finally, the acquired data is checked for completeness and for missing values.
3. Data preparation - Data preparation typically consumes the bulk of the project time, often
cited as around 90%. The outcome of this phase is the final data set. Once the available data
sources have been identified, they need to be selected, cleaned, constructed and formatted
into the desired form.
4. Modelling - Several modelling techniques are selected and applied to the prepared dataset.
A test scenario must be generated to validate each model's quality. One or more models are
created by running the modelling tool on the prepared dataset, and the created models are
assessed carefully to ensure that they meet the business objectives.
5. Evaluation - In the evaluation phase, the model results are assessed in the context of the
business objectives established in the first phase. New business requirements may be raised
at this point because of new patterns discovered in the model results or because of other
factors. Gaining business understanding is an iterative process in data mining. The final
decision to move on to the deployment phase is made in this step.
6. Deployment - The knowledge or information gained through the data mining process needs to
be presented in such a way that it can be used whenever it is needed. In this phase,
deployment, maintenance and monitoring plans are created for rollout and future support. From
the project point of view, a final evaluation summarizes the project experience and reviews
the project to see what could be improved.
3 Data Mining Techniques
3.1. Classification
Classification is the most commonly applied data mining technique. It employs a set of
pre-classified examples to develop a model that can classify the population of records at
large. It is a classic technique based on machine learning: each item in a set of data is
assigned to one of a predefined set of classes or groups. Fraud detection and credit risk
applications are particularly well suited to this type of analysis. The approach frequently
employs decision tree or neural network-based classification algorithms.
The data classification process involves learning and classification. In learning, the
training data are analysed by a classification algorithm. In classification, test data are
used to estimate the accuracy of the classification rules; if the accuracy is acceptable, the
rules can be applied to new data tuples. Classification methods make use of mathematical
techniques such as decision trees, linear programming, neural networks and statistics. In
classification, we build software that can learn how to assign data items to groups.
The classifier-training algorithm uses the pre-classified examples to determine the set of
parameters required for proper discrimination. The algorithm then encodes these parameters
into a model called a classifier.
Types of classification models:
Classification by decision tree induction
Bayesian Classification
Support Vector Machines (SVM)
Classification Based on Associations
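To make the learning/classification split concrete, here is a from-scratch sketch of one of the Bayesian methods listed above, a Gaussian naive Bayes classifier. The training examples, attribute values and class labels are invented for the illustration.

```python
import math

# Minimal Gaussian naive Bayes: learn per-class mean/variance from
# pre-classified examples, then classify by highest log-posterior.
def fit(X, y):
    """Learning step: estimate per-class statistics for each attribute."""
    model = {}
    for c in set(y):
        rows = [x for x, label in zip(X, y) if label == c]
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        vars_ = [sum((v - m) ** 2 for v in col) / n + 1e-9   # guard zero variance
                 for col, m in zip(zip(*rows), means)]
        model[c] = (means, vars_, n / len(y))
    return model

def predict(model, x):
    """Classification step: pick the class with the highest log-posterior."""
    best, best_lp = None, float("-inf")
    for c, (means, vars_, prior) in model.items():
        lp = math.log(prior)
        for v, m, s2 in zip(x, means, vars_):
            lp += -0.5 * math.log(2 * math.pi * s2) - (v - m) ** 2 / (2 * s2)
        if lp > best_lp:
            best, best_lp = c, lp
    return best

# Pre-classified training examples: two well-separated groups.
X = [[1.0, 1.2], [0.9, 1.0], [1.1, 0.8], [5.0, 5.1], [4.8, 5.3], [5.2, 4.9]]
y = ["low", "low", "low", "high", "high", "high"]
model = fit(X, y)
print(predict(model, [1.0, 1.0]))   # low
print(predict(model, [5.0, 5.0]))   # high
```

The `fit` function corresponds to the learning phase described above, and `predict` to applying the resulting classifier to new data tuples.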
3.2. Clustering
Clustering can be defined as the identification of similar classes of objects: a clustering
technique automatically forms meaningful or useful clusters of objects with similar
characteristics. Using clustering techniques, we can identify dense and sparse regions in the
object space and discover the overall distribution pattern and correlations among data
attributes. Because a classification approach can become costly, clustering can also be used
as a pre-processing step for attribute subset selection and classification. In
classification, objects are assigned to predefined classes, whereas in clustering the classes
themselves are discovered and objects are grouped accordingly.
Figure 3.1 Formation of clusters
Types of clustering methods:
Partitioning Methods
Hierarchical methods
Density based methods
Grid-based methods
Model-based methods
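As an illustration of the partitioning methods listed above, here is a minimal k-means sketch in plain Python. The two-dimensional points and the naive first-k initialization are invented for the example.

```python
def kmeans(points, k, iters=20):
    """Plain k-means, a partitioning clustering method.
    Initialization here is naive: the first k points seed the centres."""
    centers = [tuple(p) for p in points[:k]]
    for _ in range(iters):
        # Assignment step: each point joins the cluster of its nearest centre.
        clusters = [[] for _ in range(k)]
        for p in points:
            dist = lambda c: sum((a - b) ** 2 for a, b in zip(p, c))
            clusters[min(range(k), key=lambda i: dist(centers[i]))].append(p)
        # Update step: move each centre to the mean of its cluster.
        for i, cl in enumerate(clusters):
            if cl:
                centers[i] = tuple(sum(v) / len(cl) for v in zip(*cl))
    return centers, clusters

# Two dense regions in object space; one point of each region seeds a centre.
pts = [(0.0, 0.1), (9.0, 9.1), (0.2, 0.0), (0.1, 0.2), (9.2, 9.0), (9.1, 9.2)]
centers, clusters = kmeans(pts, 2)
print(sorted(len(c) for c in clusters))  # [3, 3]
```

The algorithm recovers the dense and sparse regions mentioned above; production implementations add smarter initialization and convergence checks.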
3.3. Regression
Regression analysis helps in understanding how the typical value of the dependent variable
changes when any one of the independent variables is varied, while the other independent
variables are held fixed. Regression analysis estimates the conditional expectation of the
dependent variable given the independent variables. In other words, it estimates the average
value of the dependent variable when the independent variables are fixed.
In all cases, the estimation target is a function of the independent variables called the
regression function. In regression analysis, it is also of interest to characterize the variation of
the dependent variable around the regression function, which can be described by a
probability distribution.
Regression analysis is widely used for prediction and forecasting, where its use has
substantial overlap with the field of machine learning. Regression analysis is also used to
understand which independent variables are related to the dependent variable, and to explore
the forms of these relationships.
In data mining, independent variables are attributes already known and response variables
are what we want to predict. Real-world problems are very difficult to predict because they
may depend on complex interactions of multiple predictor variables. Therefore, more
complex techniques (e.g., logistic regression, decision trees, or neural nets) may be necessary
to forecast future values. The same model types can often be used for both regression and
classification. For example, the CART (Classification and Regression Trees) decision tree
algorithm can be used to build both classification trees (to classify categorical response
variables) and regression trees (to forecast continuous response variables). Neural networks
too can create both classification and regression models.
Figure 3.2 Linear Regression
Types of regression methods
Linear Regression
Multivariate Linear Regression
Nonlinear Regression
Multivariate Nonlinear Regression
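A minimal worked example of the first method listed, linear regression: ordinary least squares for a single independent variable, on invented noise-free data so the fit recovers the line exactly.

```python
def linear_fit(xs, ys):
    """Ordinary least squares for one independent variable:
    returns intercept a and slope b of y ~ a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return a, b

# Noise-free data on the line y = 2x + 1, so the fit recovers it exactly.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
a, b = linear_fit(xs, ys)
print(a, b)  # 1.0 2.0
```

Here `b` estimates how the dependent variable changes as the independent variable varies, and `a + b*x` is the regression function described above.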
3.4. Association Rule
Association is one of the best-known data mining techniques. In association, a pattern is
discovered based on a relationship between a particular item and other items in the same
transaction. Association and correlation analysis is usually used to find frequent itemsets
in large data sets. This kind of finding helps businesses make decisions about, for example,
catalogue design, cross-marketing and customer shopping behaviour analysis. Association rules
are usually required to satisfy a user-specified minimum support and a user-specified minimum
confidence at the same time. Association rule generation is usually split into two separate
steps: first, minimum support is applied to find all frequent itemsets in a database; second,
these frequent itemsets and the minimum confidence constraint are used to form rules.
Association rule algorithms need to be able to generate rules with confidence values less
than one. However, the number of possible association rules for a given dataset is generally
very large, and a high proportion of the rules are usually of little (if any) value.
Types of association rule
Multilevel association rule
Multidimensional association rule
Quantitative association rule
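The two-step generation procedure described above can be sketched in a brute-force way. The transactions and thresholds below are invented, and a practical miner such as Apriori would prune this exhaustive search rather than enumerate every itemset.

```python
from itertools import combinations

# Step (1): find all itemsets meeting minimum support.
# Step (2): form rules from them meeting minimum confidence.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
min_support, min_confidence = 0.5, 0.6

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

items = sorted(set().union(*transactions))
frequent = [frozenset(c)
            for r in range(1, len(items) + 1)
            for c in combinations(items, r)
            if support(set(c)) >= min_support]

rules = []
for itemset in frequent:
    for r in range(1, len(itemset)):
        for antecedent in map(frozenset, combinations(sorted(itemset), r)):
            conf = support(itemset) / support(antecedent)
            if conf >= min_confidence:
                rules.append((set(antecedent), set(itemset - antecedent), conf))

for a, c, conf in rules:
    print(sorted(a), "->", sorted(c), round(conf, 2))
```

Note that rules with confidence below one are kept as long as they clear the threshold, exactly as the text above requires.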
3.5. Neural Networks
An Artificial Neural Network (ANN), usually called a neural network (NN), is a mathematical
or computational model inspired by the structure and functional aspects of biological neural
networks. A neural network consists of an interconnected group of artificial neurons and
processes information using a connectionist approach to computation. In most cases an ANN is
an adaptive system that changes its structure based on external or internal information that
flows through the network during the learning phase. Modern neural networks are non-linear
statistical data modelling tools, usually used to model complex relationships between inputs
and outputs or to find patterns in data.
A neural network is a set of connected input/output units in which each connection has an
associated weight. During the learning phase, the network learns by adjusting the weights so
as to predict the correct class labels of the input tuples. Neural networks have a remarkable
ability to derive meaning from complicated or imprecise data and can be used to extract
patterns and detect trends that are too complex to be noticed by either humans or other
computational techniques. They are well suited to continuous-valued inputs and outputs, are
best at identifying patterns or trends in data, and are well suited to prediction and
forecasting needs.
4 Neural Networks in Data Mining
Neural networks are non-linear statistical data modelling tools. They can be used to model
complex relationships between inputs and outputs, to find patterns in data, and to infer
rules from them. Neural networks are useful in providing information on associations,
classifications, clusters and forecasting. Using neural networks as a tool, data warehousing
firms can harvest information from datasets in the data mining process. Neural networks can
be programmed to store, recognize and associatively retrieve patterns or database entries; to
solve combinatorial optimization problems; to filter noise from measurement data; to control
ill-defined problems; in short, to estimate sampled functions when we do not know the form of
those functions. These two abilities, pattern recognition and function estimation, make
neural networks a very prevalent utility in data mining. With their model-free estimators and
their dual nature, neural networks serve data mining in a variety of ways.
Figure 4.1 An Artificial Neural Network
Depending on the architecture, neural networks provide associations, classifications,
clusters, prediction and forecasting to the data mining industry. A neural network
essentially comprises three pieces: the architecture or model, the learning algorithm, and
the activation functions. Using neural networks, valuable information can be mined from a
mass of historical data and used efficiently, for example in financial areas; hence,
applications of neural networks in financial forecasting have become very popular.
4.1. Feed Forward Neural Network
Figure 4.2 A Feed Forward Neural Network
One of the simplest feed forward neural networks (FFNN), shown in Figure 4.2, consists of
three layers: an input layer, a hidden layer and an output layer. Each layer contains one or
more processing elements (PEs). PEs are meant to simulate the neurons in the brain, which is
why they are often referred to as neurons or nodes. A PE receives inputs from either the
outside world or the previous layer. The connections between the PEs in adjacent layers each
have a weight (parameter) associated with them, and this weight is adjusted during training.
Information only travels in the forward direction through the network; there are no feedback
loops. The simplified process for training an FFNN is as follows:
1. Input data is presented to the network and propagated forward until it reaches the
output layer. This forward pass produces a predicted output.
2. The predicted output is subtracted from the actual output, and an error value for the
network is calculated.
3. The neural network then uses supervised learning, in most cases back propagation, to
train the network. Back propagation is a learning algorithm for adjusting the weights.
It starts with the weights between the output layer PEs and the last hidden layer PEs
and works backwards through the network.
4. Once back propagation has finished, the forward pass starts again, and the cycle
continues until the error between the predicted and actual outputs is minimized.
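Steps 1 and 2 above can be sketched with NumPy for a hypothetical 2-3-1 network; the weights here are random placeholders that training would later adjust.

```python
import numpy as np

# Forward pass and error computation for a 2-3-1 feed forward network.
rng = np.random.default_rng(0)
W_hidden = rng.normal(size=(2, 3))      # weights: input layer -> hidden layer
W_output = rng.normal(size=(3, 1))      # weights: hidden layer -> output layer

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([[0.5, -1.0]])             # one input record
target = np.array([[1.0]])              # its known (actual) output

hidden = sigmoid(x @ W_hidden)          # propagate through the hidden layer
predicted = sigmoid(hidden @ W_output)  # step 1: predicted output
error = target - predicted              # step 2: error value for the network
```

Steps 3 and 4, the weight adjustment, are the subject of the back propagation algorithm described next.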
4.2. The Back Propagation Algorithm
Back propagation, or backward propagation of error, is a common method of teaching artificial
neural networks how to perform a given task. It trains the network so as to minimize an
objective function, and it performs learning on layered feed-forward ANNs: the artificial
neurons are organized in layers and send their signals "forward", and the errors are then
propagated backwards. The algorithm uses supervised learning, which means that we provide it
with examples of the inputs and the outputs we want the network to compute; the error (the
difference between the actual and expected results) is then calculated. The idea of back
propagation is to reduce this error until the ANN has learned the training data.
Algorithm for a 3-layer network:
    initialize the weights in the network
    do
        for each example e in the training set
            O = neural-net-output(network, e)    ; forward pass
            T = teacher output for e
            calculate error (T - O) at the output units
            compute delta_wh for all weights from hidden layer to output layer    ; backward pass
            compute delta_wi for all weights from input layer to hidden layer     ; backward pass continued
            update the weights in the network
    until all examples classified correctly or stopping criterion satisfied
    return the network
The Back Propagation learning algorithm can be divided into two phases:
Phase 1: Propagation
Every propagation involves the following steps:
1. Forward propagation of a training pattern's input through the neural network.
2. Backward propagation of the propagation's output activations through the neural
network using the training pattern's target.
Phase 2: Weight update
For each weight, the following steps are used:
1. Multiply its output delta and input activation to get the gradient of the weight.
2. Move the weight in the opposite direction of the gradient by subtracting a fraction
(the learning rate) of the gradient from it.
Repeat phase 1 and 2 until the performance of the network is satisfactory.
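Putting the two phases together, here is a compact NumPy sketch: a 2-2-1 sigmoid network (no bias terms, kept minimal) trained on the XOR examples. The network size, learning rate and iteration count are illustrative choices, not from the report; the point is only that repeating the two phases drives the error down.

```python
import numpy as np

# Back propagation for a 2-2-1 sigmoid network on the XOR examples.
rng = np.random.default_rng(1)
W1 = rng.normal(size=(2, 2))     # input -> hidden weights
W2 = rng.normal(size=(2, 1))     # hidden -> output weights

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)
lr = 0.5                         # learning rate: the fraction subtracted

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mean_error():
    return float(np.mean((T - sigmoid(sigmoid(X @ W1) @ W2)) ** 2))

before = mean_error()
for _ in range(2000):
    # Phase 1: forward propagation, then backward propagation of deltas.
    H = sigmoid(X @ W1)
    O = sigmoid(H @ W2)
    delta_out = (O - T) * O * (1 - O)             # output-layer delta
    delta_hid = (delta_out @ W2.T) * H * (1 - H)  # hidden-layer delta
    # Phase 2: weight update, against the gradient direction.
    W2 -= lr * H.T @ delta_out
    W1 -= lr * X.T @ delta_hid
after = mean_error()
print(round(before, 4), round(after, 4))
```

This batch version updates the weights once per pass over all examples; the pseudocode above updates them per example (online learning), but the two phases are the same.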
5 Applications of Data Mining
Data mining is a relatively new technology that has not fully matured. Despite this, there are
a number of industries that are already using it on a regular basis. Some of these
organizations include retail stores, hospitals, banks, and insurance companies.
Many of these organizations are combining data mining with such things as statistics, pattern
recognition, and other important tools. Data mining can be used to find patterns and
connections that would otherwise be difficult to find. This technology is popular with many
businesses because it allows them to learn more about their customers and make smart
marketing decisions.
Data mining has a number of applications. The first is market segmentation: finding
behaviours that are common among customers, such as groups of customers who tend to purchase
the same products at the same time. Another application is customer churn analysis, which
estimates which customers are the most likely to stop purchasing our products or services and
move to a competitor. In addition, a company can use data mining to find out which purchases
are the most likely to be fraudulent.
For example, by using data mining in retail stores, we may be able to determine which
products are stolen the most. By finding out which products are stolen the most, steps can be
taken to protect those products and detect those who are stealing them. We can also use data
mining to determine the effectiveness of interactive marketing. Some of the customers will be
more likely to purchase the products online than offline, and we must identify them.
While many businesses use data mining to help increase their profits, it can also be used to
create new businesses and industries. One such industry is the automatic prediction of
behaviours and trends. Using automated prediction, we can gain an advantage over the
competition: instead of simply guessing what the next big trend will be, we can determine it
based on statistics, patterns and logic. Another application of automatic prediction is to
analyse past marketing strategies, determine the best one so far and the reasons for its
success, and so avoid repeating the mistakes of previous marketing campaigns.
Data mining is also a powerful tool for those who deal with finances. A financial institution
such as a bank can predict the number of defaults that will occur among their customers
within a given period of time, and they can also predict the amount of fraud that will occur as
well.
Another application of data mining is the automatic recognition of patterns that were not
previously known. While data mining is a very valuable tool, it is important to realize that
it is not a complete solution: no automated technology can guarantee the success of a
company. It will, however, tip the odds in our favour.
5.1. Specific Application Areas of Data Mining
Data Mining for Financial Data Analysis
A few typical cases:
Design and construction of data warehouses for multidimensional data analysis.
Loan payment prediction and customer credit policy analysis.
Classification and clustering of customers for targeted marketing.
Detection of money laundering and other financial crimes.
Data Mining for the Retail Industry
A few examples of data mining in the retail industry:
Design and construction of data warehouses based on the benefits of data mining
Multidimensional analysis of sales, customers, products, time, and region
Analysis of the effectiveness of sales campaigns
Customer retention—analysis of customer loyalty
Product recommendation and cross-referencing of items
Data Mining for the Telecommunication Industry
Multidimensional analysis of telecommunication data.
Fraudulent pattern analysis and the identification of unusual patterns.
Multidimensional association and sequential pattern analysis.
Mobile telecommunication services.
Use of visualization tools in telecommunication data analysis.
Data Mining for Biological Data Analysis
Semantic integration of heterogeneous, distributed genomic and proteomic databases.
Alignment, indexing, similarity search, and comparative analysis of multiple
nucleotide/protein sequences.
Discovery of structural patterns and analysis of genetic networks and protein
pathways.
Association and path analysis: identifying co-occurring gene sequences and linking
genes to different stages of disease development.
Visualization tools in genetic data analysis.
Data Mining in Other Scientific Applications
Data collection and storage technologies have recently improved, so that today, scientific data
can be amassed at much higher speeds and lower costs. This has resulted in the accumulation
of huge volumes of high-dimensional data, stream data, and heterogeneous data, containing
rich spatial and temporal information. Consequently, scientific applications are shifting from
the “hypothesize-and-test” paradigm toward a “collect and store data, mine for new
hypotheses, confirm with data or experimentation” process. This shift brings about new
challenges for data mining.
5.2. Spatial Data Mining
A spatial database stores a large amount of space-related data, such as maps, pre-processed
remote sensing or medical imaging data, and VLSI chip layout data. Spatial data mining
refers to the extraction of knowledge, spatial relationships, or other interesting patterns not
explicitly stored in spatial databases.
Spatial data mining is the application of data mining methods to spatial data. Its end
objective is to find patterns in data with respect to geography. So far, data mining and
Geographic Information Systems (GIS) have existed as two separate technologies, but data
mining now offers great potential benefits for GIS-based applied decision-making.
Figure 5.1 Spatial Data Mining
Spatial Data Cube Construction and Spatial OLAP
As with relational data, we can integrate spatial data to construct a data warehouse that
facilitates spatial data mining. A spatial data warehouse is a subject-oriented, integrated,
time-variant and non-volatile collection of both spatial and nonspatial data in support of
spatial data mining and spatial-data-related decision-making processes.
There are three types of dimensions in a spatial data cube:
A nonspatial dimension
A spatial-to-nonspatial dimension
A spatial-to-spatial dimension
We can distinguish two types of measures in a spatial data cube:
A numerical measure, containing only numerical data
A spatial measure, containing a collection of pointers to spatial objects
Spatial database systems usually handle vector data that consist of points, lines, polygons
(regions), and their compositions, such as networks or partitions. Typical examples of such
data include maps, design graphs, and 3-D representations of the arrangement of the chains of
protein molecules.
5.3. Multimedia Data Mining
A multimedia database system stores and manages a large collection of multimedia data, such
as audio, video, image, graphics, speech, text, document, and hypertext data, which contain
text, text mark-ups, and linkages Similarity Search in Multimedia Data When searching for
similarities in multimedia data, we can search on either the data description or the data
content approaches:
Colour histogram–based signature
Multi feature composed signature
Wavelet-based signature
Wavelet-based signature with region-based granularity
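The colour histogram-based signature, the simplest of the approaches listed above, can be sketched in a few lines. The tiny flattened "images" below are hypothetical stand-ins for real pixel data, and histogram intersection is used as one common similarity function:

```python
# Sketch of a colour histogram-based signature: each image is reduced to a
# normalized histogram of pixel intensities, and two signatures are compared
# by histogram intersection (closer to 1.0 means more similar).

def histogram_signature(pixels, bins=4, max_val=256):
    """Build a normalized intensity histogram as the image signature."""
    hist = [0] * bins
    for p in pixels:
        hist[p * bins // max_val] += 1
    total = len(pixels)
    return [h / total for h in hist]

def intersection(sig_a, sig_b):
    """Histogram intersection similarity in [0, 1]."""
    return sum(min(a, b) for a, b in zip(sig_a, sig_b))

# Hypothetical flattened grayscale images
img_a = [10, 20, 200, 210, 220, 30, 40, 50, 60]
img_b = [15, 25, 205, 215, 225, 35, 45, 55, 65]   # similar to img_a
img_c = [250, 240, 230, 245, 235, 255, 250, 240, 230]  # uniformly bright

sig_a, sig_b, sig_c = map(histogram_signature, (img_a, img_b, img_c))
print(intersection(sig_a, sig_b))  # high: similar colour content
print(intersection(sig_a, sig_c))  # much lower: dissimilar
```

The signature discards all spatial layout, which is exactly the weakness that the wavelet-based and region-based variants above try to address.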
Multidimensional Analysis of Multimedia Data
To facilitate the multidimensional analysis of large multimedia databases, multimedia data
cubes can be designed and constructed in a manner similar to that for traditional data cubes
from relational data. A multimedia data cube can contain additional dimensions and measures
for multimedia information, such as colour, texture, and shape.
Classification and Prediction Analysis of Multimedia Data
Classification and predictive modelling can be used for mining multimedia data, especially in
scientific research, such as astronomy, seismology, and geo-scientific research.
Mining Associations in Multimedia Data
Three categories of associations can be mined:
Associations between image content and non-image content features
Associations among image contents that are not related to spatial relationships
Associations among image contents related to spatial relationships
Audio and Video Data Mining
An extraordinary amount of audiovisual information is becoming available in digital form, in
digital archives, on the World Wide Web, in broadcast data streams, and in personal and
professional databases, and hence there is a need to mine them.
Visual data mining discovers implicit and useful knowledge from large data sets using data
and/or knowledge visualization techniques.
In general, data visualization and data mining can be integrated in the following ways:
Data visualization
Data mining result visualization
Data mining process visualization
Interactive visual data mining
5.4. Web Mining
Figure 5.2 Process Chart for conducting Text Mining
Text Data Analysis and Information Retrieval
Information retrieval (IR) is a field that has been developing in parallel with database
systems for many years.
Basic Measures for Text Retrieval: Precision and Recall
Precision: This is the percentage of retrieved documents that are in fact relevant to the query
(i.e., "correct" responses). It is formally defined as
precision = |{relevant} ∩ {retrieved}| / |{retrieved}|
Recall: This is the percentage of documents that are relevant to the query and were retrieved.
It is formally defined as
recall = |{relevant} ∩ {retrieved}| / |{relevant}|
Text Retrieval Methods
1) Document selection methods
2) Document ranking methods
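The precision and recall measures described above can be computed directly from the sets of retrieved and relevant documents. The document ids below are hypothetical:

```python
# Precision and recall for text retrieval, computed over sets of document ids.
relevant  = {"d1", "d2", "d3", "d4"}   # documents truly relevant to the query
retrieved = {"d2", "d3", "d5"}         # documents the system returned

hits = relevant & retrieved            # correctly retrieved documents

precision = len(hits) / len(retrieved)  # fraction of retrieved that are relevant
recall    = len(hits) / len(relevant)   # fraction of relevant that were retrieved

print(precision)  # 0.666... (2 of 3 retrieved documents are relevant)
print(recall)     # 0.5 (2 of 4 relevant documents were retrieved)
```

The example also shows the usual trade-off: returning more documents tends to raise recall while lowering precision.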
Text Indexing Techniques
1) Inverted indices
2) Signature files.
Query Processing Techniques: Once an inverted index is created for a document collection, a
retrieval system can answer a keyword query quickly by looking up which documents contain
the query keywords.
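The inverted-index lookup just described can be sketched as follows. The three-document collection and the conjunctive (AND) query semantics are illustrative assumptions:

```python
# A minimal inverted index: map each term to the set of documents containing
# it, then answer a conjunctive keyword query by intersecting posting sets.

from collections import defaultdict

docs = {
    1: "data mining finds patterns in data",
    2: "neural networks learn patterns",
    3: "spatial data mining uses maps",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)       # posting: term -> containing documents

def query(*terms):
    """Return ids of documents containing all of the query terms."""
    postings = [index[t] for t in terms]
    return set.intersection(*postings) if postings else set()

print(query("data", "mining"))  # {1, 3}
print(query("patterns"))        # {1, 2}
```

A production system would add tokenization, stemming, and stop-word removal before indexing, but the lookup step stays this simple.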
Ways of Dimensionality Reduction for Text:
Latent Semantic Indexing
Locality Preserving Indexing
Probabilistic Latent Semantic Indexing
Keyword-Based Association Analysis
Document Classification Analysis
Document Clustering Analysis
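Document clustering analysis, the last item above, typically starts from term-weight vectors compared by cosine similarity. A minimal sketch follows; the four one-line documents and the choice of raw term frequencies with two fixed seed documents (rather than a full k-means iteration) are simplifying assumptions:

```python
# Minimal document clustering sketch: represent each document as a
# term-frequency vector and assign it to the closer of two seed documents
# by cosine similarity.

import math
from collections import Counter

docs = {
    "d1": "stocks fell as markets reacted to rates",
    "d2": "central bank raised interest rates again",
    "d3": "the team won the final match",
    "d4": "striker scored twice in the match",
}

def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

vectors = {d: Counter(text.split()) for d, text in docs.items()}
seeds = ["d1", "d3"]  # one seed document per cluster

clusters = {s: [] for s in seeds}
for d, vec in vectors.items():
    best = max(seeds, key=lambda s: cosine(vec, vectors[s]))
    clusters[best].append(d)

print(clusters)  # finance documents group with d1, sports with d3
```

Replacing raw counts with TF-IDF weights and iteratively recomputing cluster centroids turns this sketch into the standard k-means document clustering procedure.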
Mining the World Wide Web
The World Wide Web serves as a huge, widely distributed, global information service centre
for news, advertisements, consumer information, financial management, education,
government, e-commerce, and many other information services. The Web also contains a rich
and dynamic collection of hyperlink information and Web page access and usage
information, providing rich sources for data mining.
Challenges:
The Web seems to be too huge for effective data warehousing and data mining
The complexity of Web pages is far greater than that of any traditional text document
collection
The Web is a highly dynamic information source
The Web serves a broad diversity of user communities
Only a small portion of the information on the Web is truly relevant or useful
Besides mining Web contents and Web linkage structures, another important task for Web
mining is Web usage mining.
Data Mining for Intrusion Detection
The security of our computer systems and data is at continual risk. The extensive growth of
the Internet and increasing availability of tools and tricks for intruding and attacking
networks have prompted intrusion detection to become a critical component of network
administration. Some areas in which data mining technology may be applied or further
developed for intrusion detection:
1. Development of data mining algorithms for intrusion detection
2. Association and correlation analysis, and aggregation to help select and build
discriminating attributes
3. Analysis of stream data
4. Distributed data mining
5. Visualization and querying tools
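As a tiny illustration of the stream-data analysis idea above, an anomalous connection rate can be flagged with a simple statistical threshold. The traffic counts and the z-score cutoff below are hypothetical, standing in for a learned intrusion-detection model:

```python
# Sketch of stream-based anomaly detection for intrusion detection: flag
# time windows whose connection count deviates strongly from the mean,
# using a z-score threshold in place of a trained model.

import math

def detect_anomalies(counts, threshold=2.5):
    """Return indices of windows whose z-score exceeds the threshold."""
    mean = sum(counts) / len(counts)
    var = sum((c - mean) ** 2 for c in counts) / len(counts)
    std = math.sqrt(var)
    if std == 0:
        return []
    return [i for i, c in enumerate(counts) if abs(c - mean) / std > threshold]

# Hypothetical connections per minute; the spike suggests a scan or flood
traffic = [52, 48, 50, 51, 49, 50, 400, 50, 52]
print(detect_anomalies(traffic))  # [6]
```

Real intrusion detection systems combine many such discriminating attributes (ports, flags, durations) and mine association rules over them rather than thresholding a single count.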
Data Mining System Products and Research Prototypes
Data mining systems should be assessed based on the following features: data types, system
issues, data sources, data mining functions and methodologies, coupling of data mining with
database and/or data warehouse systems, scalability, visualization tools, and data mining
query language and graphical user interface.
Chapter 6 Conclusions
This seminar report provided an overview of Data Mining Process, its techniques and
applications. The following conclusions can be drawn:
I. Data Mining is a crucial step in the Knowledge Discovery in Databases Process but
can only be performed after pre-processing and transformation.
II. Although the basic steps in data mining include data cleaning, selection, and
transformation, the mining functions and techniques are applied only in the vital step
where intelligent methods are used to detect patterns.
III. A model for Data Mining is useful for a company or a data mining practitioner as it
helps in adopting a result-oriented approach.
IV. The Cross Industry Standard Process for Data Mining (CRISP-DM) is an effective
model which considers business requirements at every step.
V. Classification and Clustering techniques are popular and easily applicable in data
mining; however, classification requires prior characteristic information.
VI. Artificial Neural Networks can be deployed to detect patterns and make predictions
which make them capable tools in data mining. A feed forward neural network uses a
back propagation algorithm to train itself.
VII. The application of data mining techniques along with GIS techniques makes for a
potential opportunity to explore various aspects of Spatial Data Mining.
VIII. The growth of data available for processing, as well as of multimedia content and the
World Wide Web, leads to greater opportunities for data mining techniques. However,
the pre-processing, selection, and transformation steps need to be handled first.
Chapter 7 References
[1] M. S. Chen, J. Han, and P. S. Yu. Data mining: An overview from a database
perspective. IEEE Trans. Knowledge and Data Engineering, 8:866-883, 1996.
[2] U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in
Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996.
[3] W. J. Frawley, G. Piatetsky-Shapiro, and C. J. Matheus. Knowledge Discovery in
Databases: An Overview. In G. Piatetsky-Shapiro et al. (eds.), Knowledge Discovery
in Databases. AAAI/MIT Press, 1991.
[4] J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann.
[5] T. Imielinski and H. Mannila. A database perspective on knowledge discovery.
[6] G. Piatetsky-Shapiro, U. M. Fayyad, and P. Smyth. From data mining to knowledge
discovery: An overview. In U. M. Fayyad et al. (eds.), Advances in Knowledge
Discovery and Data Mining, 1-35. AAAI/MIT Press, 1996.
[7] G. Piatetsky-Shapiro and W. J. Frawley. Knowledge Discovery in Databases.
AAAI/MIT Press, 1991.
[8] Portia A. Cerny. Data Mining and Neural Networks from a Commercial Perspective.
[9] Bharati M. Ramageri. Data Mining Techniques and Applications.
[10] Dr. Yashpal Singh and Alok Singh Chauhan. Neural Networks in Data Mining.