A Seminar Report On
ARTIFICIAL NEURAL NETWORKS BASED DATA MINING
TECHNIQUES
Submitted in partial fulfilment of the requirements
For the award of degree
Of
INTEGRATED DUAL DEGREE
In
COMPUTER SCIENCE AND ENGINEERING
(With Specialization in Information Technology)
Submitted by
Vaibhav Dhattarwal
CSE-IDD
Enrolment No: 08211018
Under the guidance of
DR. DURGA TOSHINWAL
Professor
ELECTRONICS AND COMPUTER ENGINEERING DEPARTMENT
INDIAN INSTITUTE OF TECHNOLOGY ROORKEE
ROORKEE-247667
AUGUST 2012
Abstract
This report presents an overview of data mining techniques and some of their applications in
various utility networks. Companies have been collecting data for decades, building massive
data warehouses in which to store it. Even though this data is available, very few companies
have been able to realize the actual value stored in it. The question these companies are
asking is how to extract this value. The answer is data mining. Many technologies are
available to data mining practitioners, including artificial neural networks, regression, and
decision trees. Many practitioners are wary of neural networks because of their black-box
nature, even though they have proven themselves in many situations. This report also provides
a brief overview of artificial neural networks and examines their suitability as a tool for
data mining.
Table of Contents
Abstract
Table of Contents
List of Figures
Chapter 1 Introduction
1.1 Objective of the Seminar
Chapter 2 Data Mining
2.1 Data Mining Process
2.2 CRISP-DM Model
Chapter 3 Data Mining Techniques
3.1 Classification
3.2 Clustering
3.3 Regression
3.4 Association Rule
3.5 Neural Networks
Chapter 4 Neural Networks in Data Mining
4.1 Feed Forward Neural Network
4.2 Back Propagation Algorithm
Chapter 5 Applications of Data Mining Techniques
5.1 Specific Application Areas
5.2 Spatial Data Mining
5.3 Multimedia Data Mining
5.4 Web Mining
Chapter 6 Conclusion
References
List of Figures
2.1 Knowledge Discovery in Databases Process
2.2 Cross Industry Standard Process for Data Mining
3.1 Formation of Clusters
3.2 Linear Regression
4.1 An Artificial Neural Network
4.2 A Feed Forward Neural Network
5.1 Spatial Data Mining
5.2 Process Chart for conducting Text Mining
1 Introduction
The development of information technology has generated large numbers of databases and huge
volumes of data in many areas. Companies and organizations now accumulate data at an
enormous rate and from a very broad variety of sources, from customer transactions, credit
card transactions, and bank cash withdrawals to hourly weather data. Many relational database
servers have been built to store such massive quantities of data. As a matter of fact, the
data itself is critical to a company's growth: it contains knowledge that could lead to
important business decisions and take the business to the next level. Yet these data have
mostly been examined only in a superficial manner. Organizations are becoming data rich but
knowledge poor; in other words, "We are drowning in data, but starving for knowledge!"
We need information, but what we have is a huge amount of data flooding companies,
organizations, and even individuals. Because the amount of data is so enormous that humans
cannot process it fast enough to extract information at the right time, machine learning
technology has been developed to address this problem.
The research in databases and information technology has given rise to an approach to store
and manipulate this precious data for further decision making.
Data mining is the term used to describe the process of extracting value from a database. A
Data-warehouse is a location where information is stored. The type of data stored depends
largely on the type of industry and the company. Data mining (the analysis step of the
"Knowledge Discovery in Databases" process, or KDD), a relatively young and
interdisciplinary field of computer science, is the process that attempts to discover patterns in
large data sets. It utilizes methods at the intersection of artificial intelligence, machine
learning, statistics, and database systems. The overall goal of the data mining process is to
extract information from a data set and transform it into an understandable structure for
further use. Aside from the raw analysis step, it involves database and data
management aspects, data pre-processing, model and inference considerations,
interestingness metrics, complexity considerations, post-processing of discovered
structures, visualization, and online updating.
Data mining is the business of answering questions that you have not yet asked; it reaches
deep into databases. Data mining tasks can be classified into two categories: descriptive and
predictive data mining.
Descriptive data mining provides information to understand what is happening inside the data
without a predetermined idea. Predictive data mining allows the user to submit records with
unknown field values and have the system predict those values from patterns previously
discovered in the database. Data mining models can be categorized according to the tasks they
perform: classification and prediction, clustering, and association rules. Classification and
prediction are predictive models, while clustering and association rules are descriptive
models.
The most common task in data mining is classification. It recognizes patterns that describe
the group to which an item belongs by examining existing items that have already been
classified and inferring a set of rules. Similar to classification is clustering, the major
difference being that no groups have been predefined. Prediction is the construction and use
of a model to assess the class of an unlabelled object, or to assess the value or value
ranges that a given object is likely to have. A further task is forecasting, which differs
from prediction in that it estimates the future value of continuous variables based on
patterns within the data.
Four things are required to data-mine effectively: high-quality data, the “right” data, an
adequate sample size and the right tool. There are many tools available to a data mining
practitioner. These include decision trees, various types of regression and neural networks.
1.1 Objective of the Seminar
The introduction of Data Mining and a description of the Data Mining Process are presented
in this seminar report. The objective of this seminar is to present an overview of Data Mining
techniques that are in use and are applicable in various scenarios. The application of these
techniques has also been discussed after an explanation of the implementation of the
technique.
2 Data Mining
Data mining is the process of extracting useful information and patterns from huge volumes of
data. It is also called the knowledge discovery process, knowledge mining from data,
knowledge extraction, or data/pattern analysis; in other words, it can be referred to as
Knowledge Discovery in Databases (KDD). It involves searching large volumes of data for
patterns.
Figure 2.1 Knowledge Discovery in Databases Process
The Knowledge Discovery in Databases (KDD) process is commonly divided into the following
stages:
(1) Selection
(2) Pre-processing
(3) Transformation
(4) Data Mining
(5) Interpretation/Evaluation.
2.1 Data Mining Process
Data Mining is performed on the following types of data:
Relational databases
Data warehouses
Transactional databases
Advanced DB and information repositories
o Object-oriented and object-relational databases
o Spatial databases
o Time-series data and temporal data
o Text databases and multimedia databases
o Heterogeneous and legacy databases
Some of the steps involved in the Data Mining process are:
Data cleaning: The task of this step is to remove noise and inconsistent data.
Data integration: In this step, multiple data sources, such as those mentioned in the
section above, are combined into an integrated collection of data.
Data selection: All the data relevant to the analysis task is retrieved from the database.
Data transformation: The data is transformed or consolidated into forms appropriate for
mining, for example by performing summary or aggregation operations.
Data mining: The critical step, in which intelligent methods are applied in order to
extract data patterns.
Pattern evaluation: This step identifies the truly interesting patterns representing
knowledge, based on certain interestingness measures.
Knowledge presentation: In the final step, various visualization and knowledge
representation techniques are used to present the mined knowledge to the user.
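The cleaning, selection and transformation steps above can be illustrated on a toy record set; all field names and values below are invented for the sketch.

```python
# A toy illustration of the cleaning / selection / transformation steps,
# using plain Python records. All field names and values are invented.
raw = [
    {"customer": "A", "region": "north", "amount": 120.0},
    {"customer": "B", "region": "north", "amount": None},   # noisy record
    {"customer": "C", "region": "south", "amount": 80.0},
    {"customer": "D", "region": "south", "amount": 40.0},
]

# Data cleaning: remove records with missing values.
clean = [r for r in raw if r["amount"] is not None]

# Data selection: keep only the attributes relevant to the analysis task.
selected = [{"region": r["region"], "amount": r["amount"]} for r in clean]

# Data transformation: consolidate by aggregation (total amount per region).
totals = {}
for r in selected:
    totals[r["region"]] = totals.get(r["region"], 0.0) + r["amount"]

print(totals)  # {'north': 120.0, 'south': 120.0}
```

In a real pipeline these steps would run against the database systems listed above rather than in-memory lists, but the shape of the work is the same.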
Data mining has five main functions:
Classification: It infers the defining characteristics of a certain group.
Clustering: It identifies groups of items that share a particular characteristic.
(Clustering differs from classification in that the characteristic is not given in advance,
whereas in classification the classes are predefined.)
Association: It identifies relationships between events that occur at one time.
Sequencing: It is similar to association, except that the relationship exists over a
period of time.
Forecasting: It estimates future values based on patterns within large sets of data.
2.2 Cross-Industry Standard Process for Data Mining (CRISP-DM) Model
Figure 2.2 Cross-Industry Standard Process for Data Mining (CRISP-DM)
1. Business understanding - In this phase, it is essential to understand the business
objectives clearly and to find out what the client really wants to achieve. Next, the
situation is assessed by identifying the resources, assumptions, constraints and other
important factors. Then, from the business objectives and the current situation, goals are
defined that achieve the business objective within that situation. Finally, a sound data
mining plan is established to achieve both the business and the data mining goals.
2. Data understanding - This phase starts with initial data collection from the available
sources in order to become familiar with the data. Data loading and data integration must be
carried out to ensure successful collection. Next, the "surface" properties of the acquired
data are examined carefully and reported. The data is then explored by tackling the data
mining questions, which can be addressed using querying, reporting and visualization.
Finally, the acquired data is checked for completeness and for missing values.
3. Data preparation - Data preparation typically consumes the bulk of the project time, often
cited as around 90%. The outcome of this phase is the final data set. Once the available data
sources have been identified, they need to be selected, cleaned, constructed and formatted
into the desired form.
4. Modelling - Several modelling techniques are selected and applied to the prepared dataset.
A test scenario must be generated to validate each model's quality. One or more models are
created by running the modelling tool on the prepared dataset, and the created models are
assessed carefully to ensure that they meet the business objectives.
5. Evaluation - In the evaluation phase, the model results are assessed in the context of the
business objectives established in the first phase. New business requirements may be raised
at this point because of new patterns discovered in the model results or because of other
factors. Gaining business understanding is an iterative process in data mining. The final
decision to move on to the deployment phase is made in this step.
6. Deployment - The knowledge or information gained through the data mining process needs to
be presented in such a way that it can be used whenever it is needed. In this phase,
deployment, maintenance and monitoring plans are created for rollout and future support. From
the project point of view, a final evaluation summarizes the project experience and reviews
the project to see what could be improved.
3 Data Mining Techniques
3.1. Classification
Classification is the most commonly applied data mining technique. It employs a set of
pre-classified examples to develop a model that can classify the population of records at
large. It is a classic technique based on machine learning: each item in a set of data is
assigned to one of a predefined set of classes or groups. Fraud detection and credit risk
applications are particularly well suited to this type of analysis. The approach frequently
employs decision tree or neural network-based classification algorithms.
The data classification process involves learning and classification. In learning, the
training data are analysed by a classification algorithm. In classification, test data are
used to estimate the accuracy of the classification rules; if the accuracy is acceptable, the
rules can be applied to new data tuples. Classification methods make use of mathematical
techniques such as decision trees, linear programming, neural networks and statistics. In
classification, we build software that can learn how to assign data items to groups.
The classifier-training algorithm uses the pre-classified examples to determine the set of
parameters required for proper discrimination. The algorithm then encodes these parameters
into a model called a classifier.
Types of classification models:
Classification by decision tree induction
Bayesian Classification
Support Vector Machines (SVM)
Classification Based on Associations
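To make the learning/classification split concrete, here is a from-scratch sketch of one of the Bayesian methods listed above, a Gaussian naive Bayes classifier. The training examples, attribute values and class labels are invented for the illustration.

```python
import math

# Minimal Gaussian naive Bayes: learn per-class mean/variance from
# pre-classified examples, then classify by highest log-posterior.
def fit(X, y):
    """Learning step: estimate per-class statistics for each attribute."""
    model = {}
    for c in set(y):
        rows = [x for x, label in zip(X, y) if label == c]
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        vars_ = [sum((v - m) ** 2 for v in col) / n + 1e-9   # guard zero variance
                 for col, m in zip(zip(*rows), means)]
        model[c] = (means, vars_, n / len(y))
    return model

def predict(model, x):
    """Classification step: pick the class with the highest log-posterior."""
    best, best_lp = None, float("-inf")
    for c, (means, vars_, prior) in model.items():
        lp = math.log(prior)
        for v, m, s2 in zip(x, means, vars_):
            lp += -0.5 * math.log(2 * math.pi * s2) - (v - m) ** 2 / (2 * s2)
        if lp > best_lp:
            best, best_lp = c, lp
    return best

# Pre-classified training examples: two well-separated groups.
X = [[1.0, 1.2], [0.9, 1.0], [1.1, 0.8], [5.0, 5.1], [4.8, 5.3], [5.2, 4.9]]
y = ["low", "low", "low", "high", "high", "high"]
model = fit(X, y)
print(predict(model, [1.0, 1.0]))   # low
print(predict(model, [5.0, 5.0]))   # high
```

The `fit` function corresponds to the learning phase described above, and `predict` to applying the resulting classifier to new data tuples.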
3.2. Clustering
Clustering can be defined as the identification of similar classes of objects: a clustering
technique automatically forms meaningful or useful clusters of objects with similar
characteristics. Using clustering techniques, we can identify dense and sparse regions in the
object space and discover the overall distribution pattern and correlations among data
attributes. Because a classification approach can become costly, clustering can also be used
as a pre-processing step for attribute subset selection and classification. In
classification, objects are assigned to predefined classes, whereas in clustering the classes
themselves are discovered and objects are grouped accordingly.
Figure 3.1 Formation of clusters
Types of clustering methods:
Partitioning Methods
Hierarchical methods
Density based methods
Grid-based methods
Model-based methods
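As an illustration of the partitioning methods listed above, here is a minimal k-means sketch in plain Python. The two-dimensional points and the naive first-k initialization are invented for the example.

```python
def kmeans(points, k, iters=20):
    """Plain k-means, a partitioning clustering method.
    Initialization here is naive: the first k points seed the centres."""
    centers = [tuple(p) for p in points[:k]]
    for _ in range(iters):
        # Assignment step: each point joins the cluster of its nearest centre.
        clusters = [[] for _ in range(k)]
        for p in points:
            dist = lambda c: sum((a - b) ** 2 for a, b in zip(p, c))
            clusters[min(range(k), key=lambda i: dist(centers[i]))].append(p)
        # Update step: move each centre to the mean of its cluster.
        for i, cl in enumerate(clusters):
            if cl:
                centers[i] = tuple(sum(v) / len(cl) for v in zip(*cl))
    return centers, clusters

# Two dense regions in object space; one point of each region seeds a centre.
pts = [(0.0, 0.1), (9.0, 9.1), (0.2, 0.0), (0.1, 0.2), (9.2, 9.0), (9.1, 9.2)]
centers, clusters = kmeans(pts, 2)
print(sorted(len(c) for c in clusters))  # [3, 3]
```

The algorithm recovers the dense and sparse regions mentioned above; production implementations add smarter initialization and convergence checks.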
3.3. Regression
Regression analysis helps in understanding how the typical value of the dependent variable
changes when any one of the independent variables is varied, while the other independent
variables are held fixed. Regression analysis estimates the conditional expectation of the
dependent variable given the independent variables. In other words, it estimates the average
value of the dependent variable when the independent variables are fixed.
In all cases, the estimation target is a function of the independent variables called the
regression function. In regression analysis, it is also of interest to characterize the variation of
the dependent variable around the regression function, which can be described by a
probability distribution.
Regression analysis is widely used for prediction and forecasting, where its use has
substantial overlap with the field of machine learning. Regression analysis is also used to
understand which independent variables are related to the dependent variable, and to explore
the forms of these relationships.
In data mining, independent variables are attributes already known and response variables
are what we want to predict. Real-world problems are very difficult to predict because they
may depend on complex interactions of multiple predictor variables. Therefore, more
complex techniques (e.g., logistic regression, decision trees, or neural nets) may be necessary
to forecast future values. The same model types can often be used for both regression and
classification. For example, the CART (Classification and Regression Trees) decision tree
algorithm can be used to build both classification trees (to classify categorical response
variables) and regression trees (to forecast continuous response variables). Neural networks
too can create both classification and regression models.
Figure 3.2 Linear Regression
Types of regression methods
Linear Regression
Multivariate Linear Regression
Nonlinear Regression
Multivariate Nonlinear Regression
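A minimal worked example of the first method listed, linear regression: ordinary least squares for a single independent variable, on invented noise-free data so the fit recovers the line exactly.

```python
def linear_fit(xs, ys):
    """Ordinary least squares for one independent variable:
    returns intercept a and slope b of y ~ a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return a, b

# Noise-free data on the line y = 2x + 1, so the fit recovers it exactly.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
a, b = linear_fit(xs, ys)
print(a, b)  # 1.0 2.0
```

Here `b` estimates how the dependent variable changes as the independent variable varies, and `a + b*x` is the regression function described above.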
3.4. Association Rule
Association is one of the best-known data mining techniques. In association, a pattern is
discovered based on a relationship between a particular item and other items in the same
transaction. Association and correlation analysis is usually used to find frequent itemsets
in large data sets. This kind of finding helps businesses make decisions about, for example,
catalogue design, cross-marketing and customer shopping behaviour analysis. Association rules
are usually required to satisfy a user-specified minimum support and a user-specified minimum
confidence at the same time. Association rule generation is usually split into two separate
steps: first, minimum support is applied to find all frequent itemsets in a database; second,
these frequent itemsets and the minimum confidence constraint are used to form rules.
Association rule algorithms need to be able to generate rules with confidence values less
than one. However, the number of possible association rules for a given dataset is generally
very large, and a high proportion of the rules are usually of little (if any) value.
Types of association rule
Multilevel association rule
Multidimensional association rule
Quantitative association rule
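The two-step generation procedure described above can be sketched in a brute-force way. The transactions and thresholds below are invented, and a practical miner such as Apriori would prune this exhaustive search rather than enumerate every itemset.

```python
from itertools import combinations

# Step (1): find all itemsets meeting minimum support.
# Step (2): form rules from them meeting minimum confidence.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
min_support, min_confidence = 0.5, 0.6

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

items = sorted(set().union(*transactions))
frequent = [frozenset(c)
            for r in range(1, len(items) + 1)
            for c in combinations(items, r)
            if support(set(c)) >= min_support]

rules = []
for itemset in frequent:
    for r in range(1, len(itemset)):
        for antecedent in map(frozenset, combinations(sorted(itemset), r)):
            conf = support(itemset) / support(antecedent)
            if conf >= min_confidence:
                rules.append((set(antecedent), set(itemset - antecedent), conf))

for a, c, conf in rules:
    print(sorted(a), "->", sorted(c), round(conf, 2))
```

Note that rules with confidence below one are kept as long as they clear the threshold, exactly as the text above requires.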
3.5. Neural Networks
An Artificial Neural Network (ANN), usually called a neural network (NN), is a mathematical
or computational model inspired by the structure and functional aspects of biological neural
networks. A neural network consists of an interconnected group of artificial neurons and
processes information using a connectionist approach to computation. In most cases an ANN is
an adaptive system that changes its structure based on external or internal information that
flows through the network during the learning phase. Modern neural networks are non-linear
statistical data modelling tools, usually used to model complex relationships between inputs
and outputs or to find patterns in data.
A neural network is a set of connected input/output units in which each connection has an
associated weight. During the learning phase, the network learns by adjusting the weights so
as to predict the correct class labels of the input tuples. Neural networks have a remarkable
ability to derive meaning from complicated or imprecise data and can be used to extract
patterns and detect trends that are too complex to be noticed by either humans or other
computational techniques. They are well suited to continuous-valued inputs and outputs, are
best at identifying patterns or trends in data, and are well suited to prediction and
forecasting needs.
4 Neural Networks in Data Mining
Neural networks are non-linear statistical data modelling tools. They can be used to model
complex relationships between inputs and outputs, to find patterns in data, and to infer
rules from them. Neural networks are useful in providing information on associations,
classifications, clusters and forecasting. Using neural networks as a tool, data warehousing
firms can harvest information from datasets in the data mining process. Neural networks can
be programmed to store, recognize and associatively retrieve patterns or database entries; to
solve combinatorial optimization problems; to filter noise from measurement data; to control
ill-defined problems; in short, to estimate sampled functions when we do not know the form of
those functions. These two abilities, pattern recognition and function estimation, make
neural networks a very prevalent utility in data mining. With their model-free estimators and
their dual nature, neural networks serve data mining in a variety of ways.
Figure 4.1 An Artificial Neural Network
Depending on the architecture, neural networks provide associations, classifications,
clusters, prediction and forecasting to the data mining industry. A neural network
essentially comprises three pieces: the architecture or model, the learning algorithm, and
the activation functions. Using neural networks, valuable information can be mined from a
mass of historical data and used efficiently, for example in financial areas; hence,
applications of neural networks in financial forecasting have become very popular.
4.1. Feed Forward Neural Network
Figure 4.2 A Feed Forward Neural Network
One of the simplest feed forward neural networks (FFNN), shown in Figure 4.2, consists of
three layers: an input layer, a hidden layer and an output layer. Each layer contains one or
more processing elements (PEs). PEs are meant to simulate the neurons in the brain, which is
why they are often referred to as neurons or nodes. A PE receives inputs from either the
outside world or the previous layer. The connections between the PEs in adjacent layers each
have a weight (parameter) associated with them, and this weight is adjusted during training.
Information only travels in the forward direction through the network; there are no feedback
loops. The simplified process for training an FFNN is as follows:
1. Input data is presented to the network and propagated forward until it reaches the
output layer. This forward pass produces a predicted output.
2. The predicted output is subtracted from the actual output, and an error value for the
network is calculated.
3. The neural network then uses supervised learning, in most cases back propagation, to
train the network. Back propagation is a learning algorithm for adjusting the weights.
It starts with the weights between the output layer PEs and the last hidden layer PEs
and works backwards through the network.
4. Once back propagation has finished, the forward pass starts again, and the cycle
continues until the error between the predicted and actual outputs is minimized.
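Steps 1 and 2 above can be sketched with NumPy for a hypothetical 2-3-1 network; the weights here are random placeholders that training would later adjust.

```python
import numpy as np

# Forward pass and error computation for a 2-3-1 feed forward network.
rng = np.random.default_rng(0)
W_hidden = rng.normal(size=(2, 3))      # weights: input layer -> hidden layer
W_output = rng.normal(size=(3, 1))      # weights: hidden layer -> output layer

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([[0.5, -1.0]])             # one input record
target = np.array([[1.0]])              # its known (actual) output

hidden = sigmoid(x @ W_hidden)          # propagate through the hidden layer
predicted = sigmoid(hidden @ W_output)  # step 1: predicted output
error = target - predicted              # step 2: error value for the network
```

Steps 3 and 4, the weight adjustment, are the subject of the back propagation algorithm described next.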
4.2. The Back Propagation Algorithm
Back propagation, or backward propagation of error, is a common method of teaching artificial
neural networks how to perform a given task. It trains the network so as to minimize an
objective function, and it performs learning on layered feed-forward ANNs: the artificial
neurons are organized in layers and send their signals "forward", and the errors are then
propagated backwards. The algorithm uses supervised learning, which means that we provide it
with examples of the inputs and the outputs we want the network to compute; the error (the
difference between the actual and expected results) is then calculated. The idea of back
propagation is to reduce this error until the ANN has learned the training data.
Algorithm for a 3-layer network:
    initialize the weights in the network
    do
        for each example e in the training set
            O = neural-net-output(network, e)    ; forward pass
            T = teacher output for e
            calculate error (T - O) at the output units
            compute delta_wh for all weights from hidden layer to output layer    ; backward pass
            compute delta_wi for all weights from input layer to hidden layer     ; backward pass continued
            update the weights in the network
    until all examples classified correctly or stopping criterion satisfied
    return the network
The Back Propagation learning algorithm can be divided into two phases:
Phase 1: Propagation
Every propagation involves the following steps:
1. Forward propagation of a training pattern's input through the neural network.
2. Backward propagation of the propagation's output activations through the neural
network using the training pattern's target.
Phase 2: Weight update
For each weight, the following steps are used:
1. Multiply its output delta and input activation to get the gradient of the weight.
2. Move the weight in the opposite direction of the gradient by subtracting a fraction
(the learning rate) of the gradient from it.
Repeat phase 1 and 2 until the performance of the network is satisfactory.
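Putting the two phases together, here is a compact NumPy sketch: a 2-2-1 sigmoid network (no bias terms, kept minimal) trained on the XOR examples. The network size, learning rate and iteration count are illustrative choices, not from the report; the point is only that repeating the two phases drives the error down.

```python
import numpy as np

# Back propagation for a 2-2-1 sigmoid network on the XOR examples.
rng = np.random.default_rng(1)
W1 = rng.normal(size=(2, 2))     # input -> hidden weights
W2 = rng.normal(size=(2, 1))     # hidden -> output weights

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)
lr = 0.5                         # learning rate: the fraction subtracted

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mean_error():
    return float(np.mean((T - sigmoid(sigmoid(X @ W1) @ W2)) ** 2))

before = mean_error()
for _ in range(2000):
    # Phase 1: forward propagation, then backward propagation of deltas.
    H = sigmoid(X @ W1)
    O = sigmoid(H @ W2)
    delta_out = (O - T) * O * (1 - O)             # output-layer delta
    delta_hid = (delta_out @ W2.T) * H * (1 - H)  # hidden-layer delta
    # Phase 2: weight update, against the gradient direction.
    W2 -= lr * H.T @ delta_out
    W1 -= lr * X.T @ delta_hid
after = mean_error()
print(round(before, 4), round(after, 4))
```

This batch version updates the weights once per pass over all examples; the pseudocode above updates them per example (online learning), but the two phases are the same.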
5 Applications of Data Mining
Data mining is a relatively new technology that has not fully matured. Despite this, there are
a number of industries that are already using it on a regular basis. Some of these
organizations include retail stores, hospitals, banks, and insurance companies.
Many of these organizations are combining data mining with such things as statistics, pattern
recognition, and other important tools. Data mining can be used to find patterns and
connections that would otherwise be difficult to find. This technology is popular with many
businesses because it allows them to learn more about their customers and make smart
marketing decisions.
Data mining has a number of applications. The first is market segmentation: finding
behaviours that are common among customers, such as groups of customers who tend to purchase
the same products at the same time. Another application is customer churn analysis, which
estimates which customers are the most likely to stop purchasing our products or services and
move to a competitor. In addition, a company can use data mining to find out which purchases
are the most likely to be fraudulent.
For example, by using data mining in retail stores, we may be able to determine which
products are stolen the most. By finding out which products are stolen the most, steps can be
taken to protect those products and detect those who are stealing them. We can also use data
mining to determine the effectiveness of interactive marketing. Some of the customers will be
more likely to purchase the products online than offline, and we must identify them.
While many businesses use data mining to help increase their profits, it can also be used to
create new businesses and industries. One such industry is the automatic prediction of
behaviours and trends. Using automated prediction, we can gain an advantage over the
competition: instead of simply guessing what the next big trend will be, we can determine it
based on statistics, patterns and logic. Another application of automatic prediction is to
analyse past marketing strategies, determine the best one so far and the reasons for its
success, and so avoid repeating the mistakes of previous marketing campaigns.
Data mining is also a powerful tool for those who deal with finances. A financial institution
such as a bank can predict the number of defaults that will occur among their customers
within a given period of time, and they can also predict the amount of fraud that will occur as
well.
Another application of data mining is the automatic recognition of patterns that were not
previously known. While data mining is a very valuable tool, it is important to realize that
it is not a complete solution: no automated technology can guarantee the success of a
company. It will, however, tip the odds in our favour.
5.1. Specific Application Areas of Data Mining
Data Mining for Financial Data Analysis
A few typical cases:
Design and construction of data warehouses for multidimensional data analysis.
Loan payment prediction and customer credit policy analysis.
Classification and clustering of customers for targeted marketing.
Detection of money laundering and other financial crimes.
Data Mining for the Retail Industry
A few examples of data mining in the retail industry:
Design and construction of data warehouses based on the benefits of data mining
Multidimensional analysis of sales, customers, products, time, and region
Analysis of the effectiveness of sales campaigns
Customer retention—analysis of customer loyalty
Product recommendation and cross-referencing of items
Data Mining for the Telecommunication Industry
Multidimensional analysis of telecommunication data.
Fraudulent pattern analysis and the identification of unusual patterns.
Multidimensional association and sequential pattern analysis.
Mobile telecommunication services.
Use of visualization tools in telecommunication data analysis.
Data Mining for Biological Data Analysis
Semantic integration of heterogeneous, distributed genomic and proteomic databases.
Alignment, indexing, similarity search, and comparative analysis of multiple
nucleotide/protein sequences.
Discovery of structural patterns and analysis of genetic networks and protein
pathways.
Association and path analysis: identifying co-occurring gene sequences and linking
genes to different stages of disease development.
Visualization tools in genetic data analysis.
Data Mining in Other Scientific Applications
Data collection and storage technologies have recently improved, so that today, scientific data
can be amassed at much higher speeds and lower costs. This has resulted in the accumulation
of huge volumes of high-dimensional data, stream data, and heterogeneous data, containing
rich spatial and temporal information. Consequently, scientific applications are shifting from
the “hypothesize-and-test” paradigm toward a “collect and store data, mine for new
hypotheses, confirm with data or experimentation” process. This shift brings about new
challenges for data mining.
5.2. Spatial Data Mining
A spatial database stores a large amount of space-related data, such as maps, pre-processed
remote sensing or medical imaging data, and VLSI chip layout data. Spatial data mining
refers to the extraction of knowledge, spatial relationships, or other interesting patterns not
explicitly stored in spatial databases.
Spatial data mining is the application of data mining methods to spatial data. Its end
objective is to find patterns in data with respect to geography. So far, data mining and
Geographic Information Systems (GIS) have existed as two separate technologies, but data
mining now offers great potential benefits for GIS-based applied decision-making.
Figure 5.1 Spatial Data Mining
Spatial Data Cube Construction and Spatial OLAP
As with relational data, we can integrate spatial data to construct a data warehouse that
facilitates spatial data mining. A spatial data warehouse is a subject-oriented, integrated,
time-variant and non-volatile collection of both spatial and nonspatial data in support of
spatial data mining and spatial-data-related decision-making processes.
There are three types of dimensions in a spatial data cube:
A nonspatial dimension
A spatial-to-nonspatial dimension
A spatial-to-spatial dimension
We can distinguish two types of measures in a spatial data cube:
A numerical measure, containing only numerical data
A spatial measure, containing a collection of pointers to spatial objects
Spatial database systems usually handle vector data that consist of points, lines, polygons
(regions), and their compositions, such as networks or partitions. Typical examples of such
data include maps, design graphs, and 3-D representations of the arrangement of the chains of
protein molecules.
5.3. Multimedia Data Mining
A multimedia database system stores and manages a large collection of multimedia data, such
as audio, video, image, graphics, speech, text, document, and hypertext data, which contain
text, text mark-ups, and linkages Similarity Search in Multimedia Data When searching for
similarities in multimedia data, we can search on either the data description or the data
content approaches:
Colour histogram–based signature
Multi feature composed signature
Wavelet-based signature
Wavelet-based signature with region-based granularity
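The colour histogram-based signature, the simplest of the approaches listed above, can be sketched in a few lines. The tiny flattened "images" below are hypothetical stand-ins for real pixel data, and histogram intersection is used as one common similarity function:

```python
# Sketch of a colour histogram-based signature: each image is reduced to a
# normalized histogram of pixel intensities, and two signatures are compared
# by histogram intersection (closer to 1.0 means more similar).

def histogram_signature(pixels, bins=4, max_val=256):
    """Build a normalized intensity histogram as the image signature."""
    hist = [0] * bins
    for p in pixels:
        hist[p * bins // max_val] += 1
    total = len(pixels)
    return [h / total for h in hist]

def intersection(sig_a, sig_b):
    """Histogram intersection similarity in [0, 1]."""
    return sum(min(a, b) for a, b in zip(sig_a, sig_b))

# Hypothetical flattened grayscale images
img_a = [10, 20, 200, 210, 220, 30, 40, 50, 60]
img_b = [15, 25, 205, 215, 225, 35, 45, 55, 65]   # similar to img_a
img_c = [250, 240, 230, 245, 235, 255, 250, 240, 230]  # uniformly bright

sig_a, sig_b, sig_c = map(histogram_signature, (img_a, img_b, img_c))
print(intersection(sig_a, sig_b))  # high: similar colour content
print(intersection(sig_a, sig_c))  # much lower: dissimilar
```

The signature discards all spatial layout, which is exactly the weakness that the wavelet-based and region-based variants above try to address.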
Multidimensional Analysis of Multimedia Data
To facilitate the multidimensional analysis of large multimedia databases, multimedia data
cubes can be designed and constructed in a manner similar to that for traditional data cubes
from relational data. A multimedia data cube can contain additional dimensions and measures
for multimedia information, such as colour, texture, and shape.
Classification and Prediction Analysis of Multimedia Data
Classification and predictive modelling can be used for mining multimedia data, especially in
scientific research, such as astronomy, seismology, and geo-scientific research.
Mining Associations in Multimedia Data
Three categories of associations can be mined:
Associations between image content and non-image content features
Associations among image contents that are not related to spatial relationships
Associations among image contents related to spatial relationships
Audio and Video Data Mining
An extraordinary amount of audiovisual information is becoming available in digital form, in
digital archives, on the World Wide Web, in broadcast data streams, and in personal and
professional databases, and hence there is a need to mine them.
Visual data mining discovers implicit and useful knowledge from large data sets using data
and/or knowledge visualization techniques.
In general, data visualization and data mining can be integrated in the following ways:
Data visualization
Data mining result visualization
Data mining process visualization
Interactive visual data mining
5.4. Web Mining
Figure 5.2 Process Chart for conducting Text Mining
Text Data Analysis and Information Retrieval
Information retrieval (IR) is a field that has been developing in parallel with database
systems for many years.
Basic Measures for Text Retrieval: Precision and Recall
Precision: This is the percentage of retrieved documents that are in fact relevant to the query
(i.e., "correct" responses). It is formally defined as
precision = |{relevant} ∩ {retrieved}| / |{retrieved}|
Recall: This is the percentage of documents that are relevant to the query and were retrieved.
It is formally defined as
recall = |{relevant} ∩ {retrieved}| / |{relevant}|
Text Retrieval Methods
1) Document selection methods
2) Document ranking methods
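The precision and recall measures described above can be computed directly from the sets of retrieved and relevant documents. The document ids below are hypothetical:

```python
# Precision and recall for text retrieval, computed over sets of document ids.
relevant  = {"d1", "d2", "d3", "d4"}   # documents truly relevant to the query
retrieved = {"d2", "d3", "d5"}         # documents the system returned

hits = relevant & retrieved            # correctly retrieved documents

precision = len(hits) / len(retrieved)  # fraction of retrieved that are relevant
recall    = len(hits) / len(relevant)   # fraction of relevant that were retrieved

print(precision)  # 0.666... (2 of 3 retrieved documents are relevant)
print(recall)     # 0.5 (2 of 4 relevant documents were retrieved)
```

The example also shows the usual trade-off: returning more documents tends to raise recall while lowering precision.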
Text Indexing Techniques
1) Inverted indices
2) Signature files.
Query Processing Techniques: Once an inverted index is created for a document collection, a
retrieval system can answer a keyword query quickly by looking up which documents contain
the query keywords.
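The inverted-index lookup just described can be sketched as follows. The three-document collection and the conjunctive (AND) query semantics are illustrative assumptions:

```python
# A minimal inverted index: map each term to the set of documents containing
# it, then answer a conjunctive keyword query by intersecting posting sets.

from collections import defaultdict

docs = {
    1: "data mining finds patterns in data",
    2: "neural networks learn patterns",
    3: "spatial data mining uses maps",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)       # posting: term -> containing documents

def query(*terms):
    """Return ids of documents containing all of the query terms."""
    postings = [index[t] for t in terms]
    return set.intersection(*postings) if postings else set()

print(query("data", "mining"))  # {1, 3}
print(query("patterns"))        # {1, 2}
```

A production system would add tokenization, stemming, and stop-word removal before indexing, but the lookup step stays this simple.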
Ways of Dimensionality Reduction for Text:
Latent Semantic Indexing
Locality Preserving Indexing
Probabilistic Latent Semantic Indexing
Keyword-Based Association Analysis
Document Classification Analysis
Document Clustering Analysis
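Document clustering analysis, the last item above, typically starts from term-weight vectors compared by cosine similarity. A minimal sketch follows; the four one-line documents and the choice of raw term frequencies with two fixed seed documents (rather than a full k-means iteration) are simplifying assumptions:

```python
# Minimal document clustering sketch: represent each document as a
# term-frequency vector and assign it to the closer of two seed documents
# by cosine similarity.

import math
from collections import Counter

docs = {
    "d1": "stocks fell as markets reacted to rates",
    "d2": "central bank raised interest rates again",
    "d3": "the team won the final match",
    "d4": "striker scored twice in the match",
}

def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

vectors = {d: Counter(text.split()) for d, text in docs.items()}
seeds = ["d1", "d3"]  # one seed document per cluster

clusters = {s: [] for s in seeds}
for d, vec in vectors.items():
    best = max(seeds, key=lambda s: cosine(vec, vectors[s]))
    clusters[best].append(d)

print(clusters)  # finance documents group with d1, sports with d3
```

Replacing raw counts with TF-IDF weights and iteratively recomputing cluster centroids turns this sketch into the standard k-means document clustering procedure.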
Mining the World Wide Web
The World Wide Web serves as a huge, widely distributed, global information service centre
for news, advertisements, consumer information, financial management, education,
government, e-commerce, and many other information services. The Web also contains a rich
and dynamic collection of hyperlink information and Web page access and usage
information, providing rich sources for data mining.
Challenges:
The Web seems to be too huge for effective data warehousing and data mining
The complexity of Web pages is far greater than that of any traditional text document
collection
The Web is a highly dynamic information source
The Web serves a broad diversity of user communities
Only a small portion of the information on the Web is truly relevant or useful
Besides mining Web contents and Web linkage structures, another important task for Web
mining is Web usage mining.
Data Mining for Intrusion Detection
The security of our computer systems and data is at continual risk. The extensive growth of
the Internet and increasing availability of tools and tricks for intruding and attacking
networks have prompted intrusion detection to become a critical component of network
administration. Some areas in which data mining technology may be applied or further
developed for intrusion detection:
1. Development of data mining algorithms for intrusion detection
2. Association and correlation analysis, and aggregation to help select and build
discriminating attributes
3. Analysis of stream data
4. Distributed data mining
5. Visualization and querying tools
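As a tiny illustration of the stream-data analysis idea above, an anomalous connection rate can be flagged with a simple statistical threshold. The traffic counts and the z-score cutoff below are hypothetical, standing in for a learned intrusion-detection model:

```python
# Sketch of stream-based anomaly detection for intrusion detection: flag
# time windows whose connection count deviates strongly from the mean,
# using a z-score threshold in place of a trained model.

import math

def detect_anomalies(counts, threshold=2.5):
    """Return indices of windows whose z-score exceeds the threshold."""
    mean = sum(counts) / len(counts)
    var = sum((c - mean) ** 2 for c in counts) / len(counts)
    std = math.sqrt(var)
    if std == 0:
        return []
    return [i for i, c in enumerate(counts) if abs(c - mean) / std > threshold]

# Hypothetical connections per minute; the spike suggests a scan or flood
traffic = [52, 48, 50, 51, 49, 50, 400, 50, 52]
print(detect_anomalies(traffic))  # [6]
```

Real intrusion detection systems combine many such discriminating attributes (ports, flags, durations) and mine association rules over them rather than thresholding a single count.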
Data Mining System Products and Research Prototypes
Data mining systems should be assessed based on the following features: data types, system
issues, data sources, data mining functions and methodologies, coupling of data mining with
database and/or data warehouse systems, scalability, visualization tools, and data mining
query language and graphical user interface.
Chapter 6 Conclusions
This seminar report provided an overview of Data Mining Process, its techniques and
applications. The following conclusions can be drawn:
I. Data Mining is a crucial step in the Knowledge Discovery in Databases Process but
can only be performed after pre-processing and transformation.
II. Although the basic steps in data mining include data cleaning, selection, and
transformation, the mining functions and techniques are applied only in the vital step
where intelligent methods are used to detect patterns.
III. A model for Data Mining is useful for a company or a data mining practitioner as it
helps in adopting a result-oriented approach.
IV. The Cross Industry Standard Process for Data Mining (CRISP-DM) is an effective
model which considers business requirements at every step.
V. Classification and Clustering techniques are popular and easily applicable in data
mining; however, classification requires prior characteristic information.
VI. Artificial Neural Networks can be deployed to detect patterns and make predictions
which make them capable tools in data mining. A feed forward neural network uses a
back propagation algorithm to train itself.
VII. The application of data mining techniques along with GIS techniques makes for a
potential opportunity to explore various aspects of Spatial Data Mining.
VIII. The growth of data available for processing, as well as of multimedia content and the
World Wide Web, leads to greater opportunities for data mining techniques. However,
the pre-processing, selection, and transformation steps need to be handled first.
Chapter 7 References
[1] M. S. Chen, J. Han, and P. S. Yu. Data mining: An overview from a database
perspective. IEEE Trans. Knowledge and Data Engineering, 8:866-883, 1996.
[2] U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in
Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996.
[3] W. J. Frawley, G. Piatetsky-Shapiro, and C. J. Matheus. Knowledge Discovery in
Databases: An Overview. In G. Piatetsky-Shapiro et al. (eds.), Knowledge Discovery
in Databases. AAAI/MIT Press, 1991.
[4] J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann.
[5] T. Imielinski and H. Mannila. A database perspective on knowledge discovery.
[6] G. Piatetsky-Shapiro, U. M. Fayyad, and P. Smyth. From data mining to knowledge
discovery: An overview. In U. M. Fayyad et al. (eds.), Advances in Knowledge
Discovery and Data Mining, 1-35. AAAI/MIT Press, 1996.
[7] G. Piatetsky-Shapiro and W. J. Frawley. Knowledge Discovery in Databases.
AAAI/MIT Press, 1991.
[8] Portia A. Cerny. Data Mining and Neural Networks from a Commercial Perspective.
[9] Bharati M. Ramageri. Data Mining Techniques and Applications.
[10] Dr. Yashpal Singh and Alok Singh Chauhan. Neural Networks in Data Mining.