Brief Overview of Data Mining


Data Mining: Looking Beyond the Tip of the Iceberg

Sarabjot S. Anand and John G. Hughes, Faculty of Informatics, University of Ulster (Jordanstown Campus), Northern Ireland

1. What is Data Mining?

Over the past two decades there has been a huge increase in the amount of data being stored in databases, as well as in the number of database applications in business and the scientific domain. This explosion in the amount of electronically stored data was accelerated by the success of the relational model for storing data and by the development and maturing of data retrieval and manipulation technologies. While the technology for storing data developed fast enough to keep up with demand, little attention was paid to developing software for analysing the data until recently, when companies realised that hidden within these masses of data was a resource that was being ignored. The Database Management Systems used to manage these data sets at present only allow the user to access information explicitly present in the databases, i.e. the data itself. The data stored in a database is only a small part of the 'iceberg of information' available from it. Contained implicitly within this data is knowledge about many aspects of the business, waiting to be harnessed and used for more effective business decision support. This extraction of knowledge from large data sets is called Data Mining, or Knowledge Discovery in Databases, and is defined as the non-trivial extraction of implicit, previously unknown and potentially useful information from data [FRAW91]. The obvious benefits of Data Mining have resulted in a lot of resources being directed towards its development.

Almost in parallel with the developments in the database field, machine learning research was maturing with the development of a number of sophisticated techniques based on different models of human learning. Learning by example, case-based reasoning, learning by observation and neural networks are some of the most popular learning techniques that were being used to create the ultimate thinking machine.

Figure 1: Data Mining

While the main concern of database technologists was to find efficient ways of storing, retrieving and manipulating data, the main concern of the machine learning community was to develop techniques for learning knowledge from data. It soon became clear that what was required for Data Mining was a marriage between the technologies developed in the database and machine learning communities.

Data Mining can be considered to be an inter-disciplinary field involving concepts from Machine Learning, Database Technology, Statistics, Mathematics, Clustering and Visualisation, among others.

So how does Data Mining differ from Machine Learning? After all, the goal of both technologies is learning from data. Data Mining is about learning from existing real-world data rather than from data generated specifically for the learning task. In Data Mining the data sets are large, so the efficiency and scalability of algorithms are important. As mentioned earlier, the data from which data mining algorithms learn is pre-existing, real-world data. Typically, therefore, it contains many missing values and much noise, and it is not static, i.e. it is prone to updates. However, as the data is stored in databases, efficient methods for data retrieval are available that can be used to make the algorithms more efficient. Also, Domain Knowledge in the form of integrity constraints is available that can be used to constrain the learning algorithm's search space.

Data Mining is often accused of being a new buzzword for Database Management System (DBMS) reports. This is not true. Using a DBMS report a company could generate reports such as:

Last month's sales for each service type
Sales per service grouped by customer sex or age bracket
A list of customers who lapsed their insurance policy

However, using Data Mining techniques the following questions may be answered:

What characteristics do my customers who lapse their policy have in common, and how do they differ from my customers who renew their policy?
Which of my motor insurance policy holders would be potential customers for my House Contents Insurance policy?

Clearly, Data Mining provides added value to DBMS reports and answers questions that DBMS reports cannot answer.

2. Characteristics of a Potential Customer for Data Mining

Most of the challenges faced by data miners stem from the fact that data stored in real-world databases was not collected with discovery as the main objective. Storage, retrieval and manipulation of the data were the main objectives of storing the data in databases. Thus most companies interested in data mining possess data with the following typical characteristics:

The stored data is large and noisy
Conventional methods of data analysis are not useful due to the complexity of the data structures and the size of the data
The data is distributed and heterogeneous, due to most of the data having been collected over time in legacy systems

The sheer size of the databases in real-world applications causes efficiency problems. The noise in the data and its heterogeneity cause problems in terms of the accuracy of the discovered knowledge and the complexity of the discovery algorithms required.

3. Aspects of Data Mining

In this section we discuss a number of issues that need to be addressed by any serious data mining package.

Uncertainty Handling: Nothing is certain in this world, and therefore any system that tries to model a real-world scenario must allow for a representation of uncertainty. A number of uncertainty models have been proposed in the Artificial Intelligence community. Though no consensus has been reached as to which model is best, it is recognised that attention must be paid to selecting a model that is suitable for the problem at hand. Most Data Mining systems tend to employ the Bayesian probability model, though some support for Fuzzy Logic, Rough Sets and Evidence Theory has been shown as well.

Dealing with Missing Values: Missing values can occur in databases for two reasons: firstly, a value may not be available at the present time (incomplete information) and, secondly, no value may be appropriate because of some other attribute's value in the tuple. Within the relational model missing values are represented as NULLs. Facilities must be provided to deal with NULL values within a Data Mining system, either by filling in these values before the discovery process is undertaken or by taking NULLs into account within the discovery process, perhaps by using a model of uncertainty like Evidence Theory that allows an explicit expression of ignorance. A number of methodologies have been suggested in the machine learning literature, e.g. NULL as an attribute value, using the most common attribute value, and decision tree techniques (a minimal sketch of the most-common-value approach is given after this list).

Dealing with Noisy Data: Noise in real-world databases is a fact of life. Discovery techniques used for Data Mining therefore need to be able to handle noisy data. Compared to symbolic learning techniques like decision tree induction, Neural Network techniques tend to generalise and learn classification knowledge better in the presence of noise. Though a number of statistically based techniques have been used in machine learning, more robust techniques for dealing with noise are required if useful discovery from data is to be performed.

Efficiency of Algorithms: Machine Learning algorithms, though highly sophisticated and general, become very inefficient when used for learning from large data sets. In Data Mining the data sets are very large, and therefore the need to create new, efficient and more specific algorithms is very important.

Constraining Knowledge Discovered to only Useful or Interesting Knowledge: From large amounts of data an even larger amount of knowledge can be discovered. What is required, therefore, is techniques that prioritise the knowledge in terms of its usefulness or interestingness with respect to the present needs of the user. At present the uncertainty and support of the knowledge, knowledge about the user domain and some measure of interestingness are used. The measure of interestingness is accepted as being subjective, as what is interesting to one user may be of no interest to another. However, some aspects of interestingness can be automated, and a number of measures have been suggested, e.g. the J-measure of Smyth and Goodman and Piatetsky-Shapiro's measure based on statistical independence.

Incorporating Domain Knowledge: Very often some reliable knowledge about the discovery domain may be available to the user. An important question is how to use this knowledge to discover better knowledge in a more efficient way.

Size and Complexity of Data: Compared to typical machine learning problems, the data sets in Data Mining are much larger, noisier and more incomplete. Also, the data used for discovering knowledge in Data Mining was not collected or stored for the purpose of discovery. Most data has been collected over a period of time and lies in different formats in legacy systems. Thus, the heterogeneity and distribution of data are of particular interest to Data Mining, and techniques are required for integrating heterogeneous and distributed data.

Data Selection: Due to the large amounts of data, the efficiency of Data Mining algorithms is important. One way of improving the efficiency of Data Mining techniques is by reducing the amount of data. A lot of work has been done in Machine Learning with respect to relevance; similar techniques need to be employed in Data Mining.

Understandability of Discovered Knowledge: Knowledge discovered using Data Mining techniques must be in a form that can be understood by the user, since at the end of the day a user will only be able to use the knowledge for decision making if he or she can understand it. This is the main failing of Neural Networks, as they are unintelligible black boxes. Decision Trees can also become very large and opaque when a large training data set is used.

Consistency between Data and Discovered Knowledge: Data stored in databases may be updated from time to time. Techniques are required for updating the knowledge discovered from the data so that it remains consistent with the updates made to the data.
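As a concrete illustration of the missing-values aspect above, the following is a minimal sketch (not from the original text) of the 'most common attribute value' strategy. The records and the occupation attribute are hypothetical, and nothing beyond the Python standard library is assumed.

```python
# Minimal sketch: filling NULLs (None) with the most common value of the
# attribute, one of the missing-value strategies mentioned above.
# The records and the 'occupation' attribute are hypothetical examples.
from collections import Counter

records = [
    {"customer_id": 1, "occupation": "engineer"},
    {"customer_id": 2, "occupation": None},        # missing value (NULL)
    {"customer_id": 3, "occupation": "teacher"},
    {"customer_id": 4, "occupation": "engineer"},
]

def fill_with_most_common(rows, attribute):
    """Replace None values of `attribute` with its most common non-NULL value."""
    observed = [r[attribute] for r in rows if r[attribute] is not None]
    most_common = Counter(observed).most_common(1)[0][0]
    return [
        {**r, attribute: most_common} if r[attribute] is None else r
        for r in rows
    ]

print(fill_with_most_common(records, "occupation"))
```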

4. Classification of Data Mining Problems

Agrawal et al. [AGRA93] classify Data Mining problems into three categories:

Classification: Consider a bank that gives loans to its customers. The bank would obviously find it useful to be able to predict which new customers would be a good investment and which would not. Using data collected about its previous customers, the bank would like to know the attributes that make a customer a good or a bad investment. What is required is a set of rules that partition the data into two exclusive groups - one of good investments and the other of bad investments. Such rules are called classification rules, as they classify the given data into a fixed number of groups. The data on old customers (for whom the group they belong to is known) is called the training set, from which the classification rules are discovered. The classification rules can then be used to determine which group a new customer belongs to.

Two approaches have been employed within machine learning to learn classifications: Neural Network based approaches and induction based approaches. Both have a number of advantages and disadvantages. Neural Networks may take longer to train than a rule induction approach, but they are known to be better at learning to classify in situations where the data is noisy. However, as it is difficult to explain why a Neural Network made a particular classification, they are often dismissed as unsuitable for real Data Mining. Rule induction based approaches to classification are normally Decision Tree based. Decision Trees can get very large and cumbersome when the training set is large, which is the case in Data Mining, and though they are not black boxes like Neural Networks, they too become difficult to understand.

Both Neural Network [ANAN95] and Tree Induction [AGRA92] techniques have been employed for Data Mining, along with statistical techniques [CHAN91].

Association: This involves rules that associate one attribute of a relation with another. For example, if we have a table containing information about people living in Belfast, an association rule could be of the type

(Age < 25) ∧ (Income > 10000) → (Car_model = Sports)

This rule associates the Age and Income of a person with the type of car he drives.

Set-oriented approaches [AGRA93, AGRA94, AGRA95] developed by Agrawal et al. are the most efficient techniques for the discovery of such rules. Other approaches include attribute-oriented induction techniques [HAN94], Information Theory based induction [SMYT91] and Minimal-Length Encoding based induction [PEDN91].
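As an informal illustration (not taken from the set-oriented algorithms cited above), the sketch below computes the support and confidence of the example rule over a small hypothetical table; the column names mirror the rule and all values are made up.

```python
# Minimal sketch: support and confidence of the association rule
# (Age < 25) AND (Income > 10000) -> (Car_model = Sports)
# over a small hypothetical table of people. Values are illustrative only.
people = [
    {"Age": 23, "Income": 12000, "Car_model": "Sports"},
    {"Age": 24, "Income": 15000, "Car_model": "Sports"},
    {"Age": 22, "Income": 11000, "Car_model": "Saloon"},
    {"Age": 40, "Income": 30000, "Car_model": "Estate"},
    {"Age": 21, "Income": 9000,  "Car_model": "Sports"},
]

antecedent = lambda p: p["Age"] < 25 and p["Income"] > 10000
consequent = lambda p: p["Car_model"] == "Sports"

n_total = len(people)
n_antecedent = sum(1 for p in people if antecedent(p))
n_both = sum(1 for p in people if antecedent(p) and consequent(p))

support = n_both / n_total          # fraction of all tuples matching both sides
confidence = n_both / n_antecedent  # fraction of antecedent tuples also matching the consequent
print(f"support = {support:.2f}, confidence = {confidence:.2f}")
```

Support here is the fraction of all tuples satisfying both sides of the rule, while confidence is the fraction of tuples satisfying the antecedent that also satisfy the consequent.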

Sequences: This involves rules that are based on temporal data. Suppose we have a database of natural disasters. If from such a database we conclude that whenever there was an earthquake in Los Angeles, Mt. Kilimanjaro erupted the next day, such a rule would be a sequence rule. Such rules are useful for making predictions, which could be useful for making market gains or for taking preventive action against natural disasters. The factor that differentiates sequence rules from other rules is the temporal factor.

Techniques have been developed for discovering sequence relationships using Discrete Fourier Transforms to map time sequences to the frequency domain [AGRA93, AGRA95]. This technique is based on two observations:

for most sequences of practical interest only the first few frequencies are strong
Fourier transforms preserve the Euclidean distance between sequences in the time and frequency domains

Another technique uses Dynamic Time Warping, a technique used in the speech recognition field, to find patterns in temporal data [BERN94].
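The following is a minimal numpy sketch of the two observations above, not the algorithm of [AGRA93, AGRA95] itself: with an orthonormal DFT the Euclidean distance between two (synthetic, illustrative) sequences is the same in the time and frequency domains, and truncating to the first few coefficients gives a cheap lower bound on that distance.

```python
# Sketch of the two observations behind DFT-based sequence matching.
# Illustrative only; the sequences here are synthetic.
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 64)
x = np.sin(2 * np.pi * 3 * t) + 0.1 * rng.standard_normal(64)
y = np.sin(2 * np.pi * 3 * t + 0.3) + 0.1 * rng.standard_normal(64)

# Orthonormal DFT preserves Euclidean distance (Parseval's theorem).
X, Y = np.fft.fft(x, norm="ortho"), np.fft.fft(y, norm="ortho")
d_time = np.linalg.norm(x - y)
d_freq = np.linalg.norm(X - Y)

# Keeping only the first few coefficients under-estimates the true distance,
# so it can be used as a cheap filter before an exact comparison.
k = 4
d_truncated = np.linalg.norm(X[:k] - Y[:k])

print(f"time-domain distance         = {d_time:.4f}")
print(f"frequency-domain distance    = {d_freq:.4f}")      # equal to d_time
print(f"first-{k}-coefficients bound = {d_truncated:.4f}")  # <= d_time
```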

5. A Data Mining Model

5.1 Data Pre-processing: As mentioned in section 2, data stored in the real world is full of anomalies that need to be dealt with before sensible discovery can be made. This Data Pre-processing/Cleansing may be done using visualisation or statistical tools. Alternatively, a Data Warehouse (see section 8.1) may be built prior to the Data Mining tools being applied.

Data Pre-processing involves removing outliers in the data, predicting and filling in missing values, noise reduction, data dimensionality reduction and heterogeneity resolution. Some of the tools commonly used for data pre-processing are interactive graphics, thresholding and Principal Component Analysis.

5.2 Data Mining Tools: The Data Mining tools consist of the algorithms that automatically discover patterns from the pre-processed data. The tool chosen depends on the mining task at hand.

5.3 User Bias: The user is central to any discovery/mining process. User Biases in the Data Mining model are a way for the user to direct the Data Mining tools to areas of the database that are of interest. User Bias may consist of:

Attributes of interest in the databases
Goal of discovery
A minimum degree of support and confidence in any knowledge discovered
Domain Knowledge
Prior Knowledge/Beliefs about the domain

6. Data Mining Technologies

Various approaches have been used for Data Mining based on inductive learning [IMAM93], Bayesian statistics [WUQ91], information theory [SMYT91], fuzzy sets [YAGE91], rough sets [ZIAR91], Relativity Strength [BELL93], Evidence Theory [ANAN95], etc.

6.1 Machine Learning

To be able to make a machine mimic the intelligent behaviour of humans has been a long-standing goal of Artificial Intelligence researchers, who have taken their inspiration from a variety of sources such as psychology, cognitive science and neurocomputing.

Machine Learning paradigms can be classified into two classes: symbolic and non-symbolic paradigms. Neural Networks are the most common non-symbolic paradigm, while rule induction is a symbolic paradigm.

6.1.1 Neural Networks: Neural Networks are a non-symbolic paradigm of Machine Learning that finds its inspiration in neuroscience. The realisation that most symbolic learning paradigms are not satisfactory in a number of domains regarded by humans as trivial, e.g. pattern recognition, led to research into trying to model the human brain.

The human brain consists of a network of approximately 10^11 neurones. Each biological neurone consists of a number of nerve fibres called dendrites connected to the cell body where the cell nucleus is located. The axon is a long, single fibre that originates from the cell body and branches near its end into a number of strands. At the ends of these strands are the transmitting ends of the synapses, which connect to other biological neurones through the receiving ends of the synapses found on the dendrites as well as the cell bodies of biological neurones. A single axon typically makes thousands of synapses with other neurones. The transmission process is a complex chemical process which effectively increases or decreases the electrical potential within the cell body of the receiving neurone. When this electrical potential reaches a threshold value (the action potential), the neurone enters its excitatory state and is said to fire. It is the connectivity of the neurones that gives these simple 'devices' their real power. The figure above shows a typical biological neurone.

An Artificial Neurone [HERT91] (or Processing Element, PE) is a highly simplified model of the biological neurone (see figure). As in biological neurones, an artificial neurone has a number of inputs, a cell body (consisting of the summing node and the semi-linear function node in the figure) and an output, which can be connected to a number of other artificial neurones.

Neural Networks are densely interconnected networks of PEs, together with a rule to adjust the strength of the connections between the units in response to externally supplied data. Using neural networks as the basis for a computational model has its origins in pioneering work conducted by McCulloch and Pitts in 1943 [McCU43]. They suggested a simple model of a neurone that computed the weighted sum of the inputs to the neurone and output a 1 or a 0 according to whether the sum was over a threshold value or not. A zero output would correspond to the inhibitory state of the neurone, while an output of 1 would correspond to its excitatory state. But the model was far from a true model of a biological neurone since, for a start, a biological neurone's output is a continuous function rather than a step function. The step function has been replaced by other, more general, continuous functions called activation functions. The most popular of these is the sigmoid function, defined as:

f(x) = 1 / (1 + e^(-x))
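A minimal sketch of a single artificial neurone as described above, assuming only the Python standard library; the inputs, weights and threshold are illustrative. It shows the weighted sum passed through either the original McCulloch-Pitts style step function or the sigmoid activation function.

```python
# Minimal sketch of a single artificial neurone: weighted sum of inputs
# followed by an activation function. Weights and inputs are illustrative.
import math

def weighted_sum(inputs, weights, bias=0.0):
    return sum(x * w for x, w in zip(inputs, weights)) + bias

def step(net, threshold=0.0):
    """McCulloch-Pitts style output: fire (1) or not (0)."""
    return 1 if net > threshold else 0

def sigmoid(net):
    """Continuous activation function f(x) = 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-net))

inputs  = [0.5, 0.2, 0.9]          # hypothetical input signals
weights = [0.4, -0.6, 0.8]         # hypothetical connection strengths

net = weighted_sum(inputs, weights)
print("net input      :", round(net, 3))
print("step output    :", step(net))
print("sigmoid output :", round(sigmoid(net), 3))
```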

The overall behaviour of a network is determined by its connectivity rather than by the detailed operation of any element. Different topologies of neural networks are suitable for different tasks, e.g. Hopfield Networks for optimization problems, the Multi-layered Perceptron for classification problems and Kohonen Networks for data coding.

There are three main ingredients to a neural network:

the neurones and the links between them
the algorithm for the training phase
a method for interpreting the response from the network in the testing phase

The learning algorithms used are normally iterative, e.g. the back-propagation algorithm, attempting to reduce the error in the output of the network. Once the error is reduced (not necessarily minimised) the network can be used to classify other, unseen objects.

Though neural networks seem an attractive concept, they have a number of disadvantages. Firstly, the learning process is very slow compared to other learning methods. The learned knowledge is in the form of a network, and it is difficult for a user to interpret it (the same is a disadvantage of using decision trees). Interactive user intervention in the learning process, which is normally required in Data Mining applications, is difficult to incorporate. However, neural networks are known to perform better than symbolic learning techniques on the noisy data found in most real-world data sets.

6.1.2 Rule Induction: Automating the process of learning has enthralled AI researchers for some years now. The basic idea is to build a model of the environment using sets of data describing the environment. The simplest model, clearly, is to store all the states of the environment along with all the transitions between them over time. For example, a chess game may be modelled by storing each state of the chess board along with the transitions from one state to the other. But the usefulness of such a model is limited, as the number of states and transitions between them is practically infinite. Thus, it is unlikely that a state occurring in the future would match, exactly, a state from the past. A better model, therefore, would be to store abstractions/generalisations of the states and the associated transitions. The process of generalisation is called induction.

Each generalisation of the states is called a class and has a class description associated with it. The class description defines the properties that a state must have to be a member of the associated class.

The process of building a model of the environment using examples of states of the environment is called Inductive Learning. There are two basic types of inductive learning:

Supervised Learning
Unsupervised Learning

In Supervised Learning the system is provided with examples of states and a class label for each example, defining the class that the example belongs to. Supervised Learning techniques are then used on the examples to find a description for each of the classes. The set of examples is called the training data set. Supervised learning may be classified into Single Class Learning and Multiple Class Learning.

In Single Class Learning the supervisor defines a single class by providing examples of states belonging to that class (positive examples). The supervisor may also provide examples of states that do not belong to that class (negative examples). The inductive learning algorithm then constructs a class description that singles out instances of that class from the other examples.

In Multiple Class Learning the examples provided by the supervisor belong to a number of classes. The inductive learning algorithm constructs class descriptions for each of the classes that distinguish states belonging to one class from those belonging to another.

In Unsupervised Learning the classes are not provided by a supervisor. The inductive learning algorithm has to identify the classes by finding similarities between the different states provided as examples. This process is called learning by observation and discovery.
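As an informal illustration of this distinction (not part of the original text), the sketch below applies a supervised decision-tree learner to labelled examples and an unsupervised k-means clustering to the same examples without their labels. It assumes scikit-learn is available, and the tiny (age, income) data set is hypothetical.

```python
# Sketch contrasting supervised and unsupervised inductive learning on a
# tiny hypothetical data set of (age, income) examples.
from sklearn.tree import DecisionTreeClassifier   # supervised: needs class labels
from sklearn.cluster import KMeans                # unsupervised: finds classes itself

X = [[23, 12000], [25, 15000], [45, 40000], [50, 38000], [22, 11000], [48, 42000]]
y = ["young_spender", "young_spender", "established", "established",
     "young_spender", "established"]              # labels supplied by a 'supervisor'

# Supervised learning: class descriptions induced from labelled examples.
tree = DecisionTreeClassifier().fit(X, y)
print(tree.predict([[24, 13000]]))                # classify an unseen state

# Unsupervised learning: the algorithm identifies groups by similarity alone.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(clusters)                                   # cluster index per example
```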

6.2 Statistics: Statistical techniques may be employed for data mining at a number of stages of the mining process. In fact, statistical techniques have long been employed by analysts to detect unusual patterns and to explain patterns using statistical models. However, using statistical techniques and interpreting their results is difficult and requires a considerable amount of statistical knowledge. Data Mining seeks to provide non-statisticians with useful information that is not difficult to interpret. We now discuss how statistical techniques can be used within Data Mining.

(1) Data Cleansing: The presence of data which are erroneous or irrelevant (outliers) may impede the mining process. While such data therefore need to be identified, this task is particularly sensitive, as some outliers may be of considerable interest in providing the knowledge that mining seeks to find: 'good' outliers need to be retained, whilst 'bad' outliers should be removed. Bad outliers may arise from sources such as human or mechanical errors in experimental measurement, from the failure to convert measurements to a consistent scale, or from slippage in time-series measurements. Good outliers are those that may be characteristic of the real-world scenario being modelled. While these are often of particular interest to users, knowledge about them may be difficult to come by and is frequently more critical than knowledge about more commonly occurring situations. The presence of outliers may be detected by methods involving thresholding the difference between particular attribute values and the average, using either parametric or non-parametric methods.
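A minimal sketch of such thresholding, with illustrative values only: a parametric rule flags values more than two standard deviations from the mean, while a non-parametric alternative measures the distance from the median against the inter-quartile range.

```python
# Sketch of outlier detection by thresholding the difference between an
# attribute value and a central value. The data values are illustrative.
import statistics

values = [12.1, 11.8, 12.4, 12.0, 11.9, 35.0, 12.2]   # 35.0 is the suspect value

# Parametric: distance from the mean measured in standard deviations.
mean, sd = statistics.mean(values), statistics.stdev(values)
parametric_outliers = [v for v in values if abs(v - mean) > 2 * sd]

# Non-parametric: distance from the median measured against the inter-quartile range.
q1, _, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1
median = statistics.median(values)
nonparametric_outliers = [v for v in values if abs(v - median) > 1.5 * iqr]

print("parametric     :", parametric_outliers)
print("non-parametric :", nonparametric_outliers)
```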

(2) Exploratory Data Analysis: Exploratory Data Analysis (EDA) concentrates on simple arithmetic and easy-to-draw pictures to provide descriptive statistical measures and presentations, such as frequency counts and table construction (including frequencies and row, column and total percentages), building histograms, and computing measures of location (mean, median) and spread (standard deviation, quartiles and semi-interquartile range, range).

(3) Data Selection: In order to improve the efficiency and increase the time performance of data analysis, it is necessary to provide sampling facilities to reduce the scale of computation. Sampling is an efficient way of finding association rules, and resampling offers opportunities for cross-validation. Hierarchical data structures may be explored by segmentation and stratification.

(4) Attribute Re-definition: We may define new variables which are more meaningful than the original ones, e.g. Body Mass Index (BMI) = Weight / Height squared. Alternatively, we may want to change the granularity of the data, e.g. age in years may be grouped into the age groups 0-20 years, 20-40 years, 40-60 years and 60+ years.
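A minimal sketch of these two kinds of re-definition, using illustrative records only: deriving BMI from weight and height, and coarsening age into the bands mentioned above.

```python
# Sketch of attribute re-definition: a derived attribute (BMI) and a change
# of granularity (age bands). The records are illustrative only.
records = [
    {"weight_kg": 70.0, "height_m": 1.75, "age": 34},
    {"weight_kg": 82.0, "height_m": 1.68, "age": 61},
]

def age_band(age):
    """Coarsen age in years into the bands used in the text."""
    if age < 20:
        return "0-20"
    if age < 40:
        return "20-40"
    if age < 60:
        return "40-60"
    return "60+"

for r in records:
    r["bmi"] = r["weight_kg"] / r["height_m"] ** 2   # BMI = weight / height squared
    r["age_band"] = age_band(r["age"])

print(records)
```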

Principal Component Analysis (PCA) is of particular interest to Data Mining because most Data Mining algorithms have linear time complexity with respect to the number of tuples in the database but are exponential with respect to the number of attributes of the data. Attribute reduction using PCA thus provides a facility to account for a large proportion of the variability of the original attributes by considering only relatively few new attributes (called Principal Components) which are specially constructed as weighted linear combinations of the original attributes. The first Principal Component (PC) is the weighted linear combination of attributes with the maximum variation; the second PC is the weighted linear combination which is orthogonal to the first PC whilst maximising the variation, and so on. The new attributes formed by PCA may possibly be assigned individual meanings themselves if domain knowledge is invoked, or they may be used as inputs to other Knowledge Discovery tools. The facility for PCA requires the partial computation of the eigensystem of the correlation matrix, as the PC weights are the eigenvector components, with the eigenvalues giving the proportions of the variance explained by each PC.
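A minimal numpy sketch of this construction, on illustrative data rather than anything from the paper: the PC weights are taken as the eigenvectors of the correlation matrix, and each eigenvalue's share of the total gives the proportion of variance explained.

```python
# Sketch of PCA via the eigensystem of the correlation matrix, as described
# above. The data matrix (rows = tuples, columns = attributes) is illustrative.
import numpy as np

rng = np.random.default_rng(1)
base = rng.standard_normal(100)
data = np.column_stack([
    base + 0.1 * rng.standard_normal(100),    # two correlated attributes...
    2 * base + 0.1 * rng.standard_normal(100),
    rng.standard_normal(100),                 # ...and one independent attribute
])

corr = np.corrcoef(data, rowvar=False)            # correlation matrix of attributes
eigenvalues, eigenvectors = np.linalg.eigh(corr)  # eigh: symmetric matrix

# Sort eigenpairs by decreasing eigenvalue; columns of `weights` are the PC weights.
order = np.argsort(eigenvalues)[::-1]
eigenvalues, weights = eigenvalues[order], eigenvectors[:, order]

explained = eigenvalues / eigenvalues.sum()       # proportion of variance per PC
standardised = (data - data.mean(axis=0)) / data.std(axis=0)
principal_components = standardised @ weights     # the new attribute values

print("proportion of variance explained:", np.round(explained, 3))
print("first two PCs of first tuple    :", np.round(principal_components[0, :2], 3))
```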

(5) Data Analysis: Statistics provides a number of tools for data analysis, some of which may be employed within Data Mining. These include:

Measures of association and relationships between attributes, such as computation of expected frequencies and construction of cross-tabulations, computation of chi-squared statistics of association, presentation of scatterplots and computation of correlation coefficients. The interestingness of rules may be assessed by considering measures of statistical significance [PIAT91] (a small sketch of the chi-squared computation is given after this list).

Inferential statistics for hypothesis testing, such as construction of confidence intervals and parametric and non-parametric hypothesis tests for average values and for group comparisons.

Classification may be carried out using discriminant analysis (supervised) or cluster analysis (unsupervised).
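A minimal numpy sketch, with illustrative counts only, of the cross-tabulation, expected frequencies and chi-squared statistic of association mentioned in the first item above.

```python
# Sketch: chi-squared statistic of association from a cross-tabulation.
# The observed counts (e.g. sex vs. policy lapsed/renewed) are illustrative.
import numpy as np

observed = np.array([[30, 70],    # rows: values of one attribute
                     [45, 55]])   # columns: values of the other attribute

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
grand_total = observed.sum()

# Expected frequencies under independence of the two attributes.
expected = row_totals @ col_totals / grand_total

chi_squared = ((observed - expected) ** 2 / expected).sum()
print("expected frequencies:\n", expected)
print("chi-squared statistic:", round(float(chi_squared), 3))
```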

6.3 Database Approaches - Set-Oriented Approaches: Set-oriented approaches to data mining attempt to employ the facilities provided by present-day DBMSs to discover knowledge. This allows years of research into database performance enhancement to be used within Data Mining processes. However, SQL is very limited in what it can provide for Data Mining, and therefore techniques based solely on this approach are very limited in applicability. Nevertheless, these techniques have shown that certain aspects of Data Mining can be performed efficiently within the DBMS, providing a challenge for researchers to investigate how data mining operations can be divided into DBMS and non-DBMS operations so as to make the most of both worlds.

6.4 Visualisation: Visualisation techniques are used within the discovery process at two levels. Firstly, visualising the data enhances exploratory data analysis, which is useful for data pre-processing, allowing the user to identify outliers and data subsets of interest. Secondly, visualisation may be used to make underlying patterns in the database more visible. NETMAP, a commercially available Data Mining tool, derives most of its power from this pattern visualisation technique.

7. Knowledge Representation

7.1 Neural Networks (see section 6.1.1)


7.2 Decision Tree

Figure 2: An example Decision Tree

ID3 [QUIN86] is probably the best known classification algorithm in machine learning that uses decision trees. ID3 belongs to the family of TDIDT algorithms (Top-Down Induction of Decision Trees) and has undergone a number of enhancements since its conception, e.g. ACLS [PATE83] and ASSISTANT [KONO84].

A Decision Tree is a tree-based knowledge representation methodology used to represent classification rules. The leaf nodes represent the class labels, while the other nodes represent the attributes associated with the objects being classified. The branches of the tree represent the possible values of the attribute node from which they originate. Figure 2 shows a typical decision tree [QUIN86].

Once the decision tree has been built using a training set of data, it may be used to classify new objects. To do so we start at the root node of the tree and follow the branches associated with the attribute values of the object until we reach a leaf node representing the class of the object.

Clearly, for a given training set of examples there is a large number of possible decision trees that could be generated. The basic idea is to pick the decision tree that would correctly classify the most unseen examples (which is the essence of the induction process). One way of doing this is to generate all the possible decision trees for the training set and pick the simplest tree. Alternatively, the tree could be built in such a way that the final tree is the best. ID3 uses an information-theoretic measure of the 'gain in information' obtained by using a particular attribute as a node in order to decide which attribute to place at each node.
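A minimal sketch of such an information-gain calculation (the tiny training set and attribute names are hypothetical): the entropy of the class labels is compared with the weighted entropy after splitting on each attribute, and the attribute with the largest gain would be chosen for the node.

```python
# Sketch of ID3-style information gain for choosing a node attribute.
# The training examples (attribute values and class labels) are illustrative.
from collections import Counter
from math import log2

examples = [
    {"outlook": "sunny",    "windy": "no",  "class": "play"},
    {"outlook": "sunny",    "windy": "yes", "class": "dont_play"},
    {"outlook": "rain",     "windy": "yes", "class": "dont_play"},
    {"outlook": "rain",     "windy": "no",  "class": "play"},
    {"outlook": "overcast", "windy": "no",  "class": "play"},
]

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def information_gain(rows, attribute):
    """Entropy of the class labels minus the weighted entropy after splitting."""
    base = entropy([r["class"] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r["class"] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

for attr in ("outlook", "windy"):
    print(attr, round(information_gain(examples, attr), 3))
```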

Though decision trees have been used successfully in a number of algorithms, they have a number of disadvantages. Firstly, even for small training sets decision trees can be quite large and thus opaque. Quinlan [QUIN87] points out that it is questionable whether opaque structures like decision trees can be described as knowledge, no matter how well they function. Secondly, in the presence of missing values for attributes of objects in the test data set, trees can have problems with performance. The order of attributes in the tree nodes can also have an adverse effect on performance.

The main advantage of decision trees is their execution efficiency, mainly due to their simple and economical representation, and their ability to perform well even though they lack the expressive power of semantic networks or other first-order logic methods of knowledge representation.

7.3 Rules

Rules are probably the most common form of knowledge representation. A rule is a conditional statement that specifies an action for a certain set of conditions, normally represented as X → Y. The action, Y, is normally called the consequent of the rule and the set of conditions, X, its antecedent. A set of rules is an unstructured group of IF...THEN statements.

The popularity of rules as a method of knowledge representation is mainly due to their simple form. They are easily interpreted by humans, as they are a very intuitive and natural way of representing knowledge, unlike decision trees and neural networks. Also, as a system of rules is unstructured, it is less rigid, which can be advantageous at the early stages of the development of a knowledge-based system.

But representing knowledge as rules has a number of disadvantages. Rules lack variation and are unstructured. Their format is inadequate for representing many types of knowledge, e.g. causal knowledge. As the number of rules in the system increases, the performance of the system decreases and the system becomes more difficult to maintain and modify. New rules cannot be added arbitrarily to the system, as they may contradict existing rules, leading to erroneous conclusions. The degradation in performance of a rule-based system is not graceful.

The lack of structure in rule-based representations makes the modelling of the real world difficult, if not impossible. Thus, a more organised and structured representation for knowledge is desirable, one that can make partial inferences and degrade gracefully with size.

7.4 Frames

Frames are templates for holding clusters of related knowledge about a particular, narrow subject, which is often the name of the frame. The clustering of related knowledge is a more natural way of representing knowledge as a model of the real world.

Each frame consists of a number of slots that contain attributes, rules, hypotheses and graphical information related to the object represented by the frame. These slots may be frames in their own right, giving frames a hierarchical structure. Relationships between frames are taxonomic, and therefore a frame inherits the properties of its 'parent frames'. Thus, if the required information is not contained in a frame, the next frame up in the hierarchy is searched.

Due to this structuring of knowledge, representing knowledge in frames is more complex than a rule-based representation.

8. Related Technologies

8.1 Data Warehousing - "to manage the data that analyses your business"

On-line Transaction Processing (OLTP) systems are inherently inappropriate for decision support querying, hence the need for Data Warehousing. A data warehouse is a Relational Database Management System designed to provide for the needs of decision makers rather than the needs of transaction processing systems. Thus, a data warehouse provides data in a form suitable for business decision support. More specifically, a Data Warehouse allows:

Any business question to be asked
Any data in the enterprise to be included in the analysis
Interactive analysis, and therefore necessarily uninhibited performance, so that the decision-making process is not held up in any way

A Data Warehouse brings together large volumes of business information obtained from transaction processing and legacy operational systems. The information is cleansed and transformed so that it is complete and reliable, and it is stored over time so that trends can be identified. Data Warehouses are normally employed in one of two roles:

A provider of business-relevant information to managers and analysts
A "closed-loop" system performing information-driven functions such as intelligent inventory reordering

OLTP systems are designed to capture, store and manage day-to-day operations. The raw data collected in OLTP systems exists in a number of different formats, such as hierarchical databases, flat files and COBOL datasets in legacy systems, keeping it out of reach of business decision makers:

Ad-hoc queries and reports can take days
The nature of tuning an OLTP application makes rapid retrieval for business analysis impossible
SQL queries cannot deliver the correlated information needed by business analysts

Typically, Data Warehousing software has to deal with large updates within narrow batch windows; getting the data into the warehouse and fully preparing it for use is the key update activity for the warehouse. Therefore, a warehouse must provide for the following requirements:

Data must be read from a number of different feeds e.g. disk files, magnetic tapes

Data must be converted to the database internal format from a variety of formats

Data must be filtered to reject invalid values

Records must be reorganised to match the relational schema

Records must be checked against the existing database to ensure global consistency and referential integrity (inter-table references)
Records must be written to physical storage, observing requirements of data segmentation and physical device placement
Records must be fully and richly indexed
System metadata must be updated
Heterogeneity must be resolved

Additionally, a Data Warehouse must:

Provide mechanisms to continuously guarantee overall data quality
Not have any architectural limits - it must be able to handle terabytes of data
Require minimal storage management activities - no reorganisation should be required, and other management activities must be modular and parallelisable
Allow for hardware failures and continue to make the unaffected parts of the database available, as large databases involve a heavy dependence on hardware
Provide query performance dependent only on the complexity of the query, not on the size of the database

A Data Warehouse presents a dimensional view of the data: users expect to view the data from different perspectives or dimensions. Such functionality is provided by On-line Analytical Processing (OLAP). Two general approaches have emerged to meet this requirement of OLAP:

Self-contained Multi-dimensional Databases (MDDBs): these contain summaries and rich tools for exploring the data; when the MDDB user needs to "drill down", the underlying database is used
Dimensional tools layered above the database

Red Brick Warehouse VPT (Very Large Data Warehouse Support, Parallel Query Processing, Time-Based Data Management) consists of three components:

A Database Server supporting SQL plus decision support extensions (RISQL - Red Brick Intelligent SQL):

Specialised indexes designed solely for retrieval:
B-TREE
STAR: automatically created when tables are created; join processing is greatly enhanced using these indexes as they maintain the relationships between primary keys and foreign keys
PATTERN: fully-inverted text indexes that reduce search time for partial character-string matching

Powerful extensions to SQL:
Business analysis functions that perform sequential calculations
Numeric and string functions to manipulate character strings and numeric values
Macro building capabilities that simplify the use of repetitive SQL and calculations

Standard SQL is a set-oriented language, so all of its operations work on unordered sets of data. This does not allow SQL to answer many useful business questions, e.g. moving averages, ranking and n-tile ordering, and it does not provide even basic statistical functionality.

Example query: What are the top ten products sold during the second quarter of 1993, and what were their rankings by dollars and units sold? (An illustrative sketch of this kind of ranking query is given after this list.)

A high-performance load subsystem called the Table Management Utility (TMU):
Provides the data loading and index-building facilities with the performance necessary in a data warehouse
Can transform OLTP data into a form more appropriate for business data analysis, i.e. the warehouse schema

Gateway technologies supporting client/server access to the warehouse: allow terminal and client/server access to the warehouse
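Standard SQL of the time could not express such rankings directly. The following is a hypothetical pandas sketch, not RISQL, of the kind of answer the example query above asks for; the sales table and its values are purely illustrative.

```python
# Hypothetical sketch (not RISQL) of the "top ten products by dollars and
# units sold in Q2 1993" style of ranking query. The sales data is illustrative.
import pandas as pd

sales = pd.DataFrame({
    "product": ["cola", "root beer", "crisps", "cola", "bread", "crisps"],
    "date":    pd.to_datetime(["1993-04-03", "1993-05-12", "1993-06-20",
                               "1993-05-02", "1993-04-18", "1993-06-01"]),
    "dollars": [120.0, 80.0, 45.0, 150.0, 30.0, 60.0],
    "units":   [60, 40, 30, 75, 20, 40],
})

# Restrict to the second quarter of 1993, total per product, then rank.
q2 = sales[(sales["date"] >= "1993-04-01") & (sales["date"] <= "1993-06-30")]
totals = q2.groupby("product")[["dollars", "units"]].sum()
totals["rank_by_dollars"] = totals["dollars"].rank(ascending=False).astype(int)
totals["rank_by_units"] = totals["units"].rank(ascending=False).astype(int)

print(totals.sort_values("rank_by_dollars").head(10))
```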

8.2 On-line Analytical Processing

Today a business enterprise can prosper or fail based on the availability of accurate and timely information. The Relational Model was introduced in 1970 by E. F. Codd in an attempt to allow the storage and manipulation of large data sets about the day-to-day working of a business. However, just storing the data is not enough. Techniques are required to help the analyst infer information from this data that can be used by the business enterprise to give it an edge over the competition.

Relational Database Management System (RDBMS) products are not very good at transforming the data stored in OLTP applications into useful information. A number of reasons contribute to why RDBMSs are not suitable for business analysis:

End-users want a multi-dimensional, business-oriented view of the data, not a view of which columns are indexed or which are the primary and foreign keys
To make OLTP queries fast, RDBMS applications generally normalise the data into 50-200 tables. Though great for OLTP operations, this is a nightmare for OLAP applications, as it means a large number of joins are required to access the data needed for OLAP
While parallel processing can be useful for table scans, it offers very little performance enhancement for complex joins
SQL is not designed with common business needs in mind. There is no way of using SQL to retrieve information like "the top 10 salespersons", "the bottom 20% of customers", "products with a market share of greater than 25%" or "the sales ratio of cola to root beer"
RDBMSs do not provide common data analysis tools like data rotation, drill-downs, dicing and slicing
To allow truly ad-hoc end-user analysis, the Database Administrator would have to index the database on every possible combination of columns and tables that the end user may ask about, creating an unnecessary overhead for OLTP and for query response times
Locking models, data consistency schemes and caching algorithms are based on the RDBMS being used for OLTP applications, where transactions are small and discrete; long-running, complex queries cause problems in each of these areas

Complex statistical functionality was never intended to be provided with the RDBMS. Providing such functionality was left to user-friendly end-user products like spreadsheets, which were to act as front ends to the RDBMS. Though spreadsheets provide a certain amount of the functionality required by business analysts, none address the need for analysing the data according to its multiple dimensions.


Any product that intends to provide such functionality to business analysts must allow the following:

access to many different types of files
creation of multi-dimensional views of the data
experimentation with various data formats and aggregations
definition and animation of new information models
application of summations and other formulae to these models
drilling down, rolling up, slicing and dicing, and rotation of consolidation paths
generation of a wide variety of reports

On-line Analytical Processing (OLAP) is the name given by E. F. Codd (1993) to the technologies that attempt to address these user requirements. Codd defined OLAP as "the dynamic synthesis, analysis and consolidation of large volumes of multi-dimensional data" and provided 12 rules/requirements for any OLAP system. These were:

Multi-dimensional Conceptual View

Transparency

Accessibility

Consistent Reporting Performance

Client-Server Architecture

Generic Dimensionality

Dynamic Sparse Matrix Handling

Multi User Support

Unrestricted Cross-dimensional Operations

Intuitive Data Manipulation

Flexible Reporting

Unlimited Dimensions and Aggregation Levels

Nigel Pendse provided another definition of OLAP which, unlike Codd's definition, does not mix technology prescriptions with application requirements. He defined OLAP as "Fast Analysis of Shared Multidimensional Information". Within this definition:

FAST means that the system should provide answers to most queries within five seconds, with only the very complex queries taking a maximum of twenty seconds. This is so that the analyst does not lose his or her chain of thought due to delayed responses from the system.
ANALYSIS means that the system can cope with any business logic and statistical analysis that is relevant to the user's application. The analysis functions should be provided to the user in an intuitive way.
SHARED means that the system services multiple users concurrently.
MULTIDIMENSIONAL (the key OLAP requirement) means that the system must provide a multidimensional conceptual view of the data, including support for multiple hierarchies.
INFORMATION is the data and derived information required by the user's application.

9. State of the Art and Limitations of Commercially Available Products

9.1 End-User Products

Most end-user products available in the market do not address many of the aspects of data mining enumerated in section 3. In fact, these packages are really machine learning packages with added facilities for accessing databases. Having said that, they are powerful tools that can be very useful given a clean data set. However, clean data sets are not found in real-world applications, and cleansing large data sets manually is not possible.

In this section we discuss two end-user products available in the UK. However, a much larger number of so-called "data mining tools" exist in the market.

9.1.1 CLEMENTINE

This package, supplied by ISL Ltd., Basingstoke, England, is a very easy to use package for "data mining". The interface has been built with the intention of making it "as easy to use as a spreadsheet". CLEMENTINE uses a Visual Programming Interface for building the discovery model and performing the learning tasks.

Accessible Data Sources: ASCII File format, Oracle, Informix, Sybase and Ingres.

Discovery Paradigms: Decision Tree Induction and Neural Network (Multi-Layer Perceptron).

Data Visualisation: Through interactive histograms, scatter plots, distribution graphs etc.

Data Manipulation: Sampling, Derive New Attributes, Filter Attributes

Hardware Platforms: Most Unix Workstations

Statistical Functionality: Correlations, Standard Deviation etc.

Deploying Applications: A Trained Neural Network or Decision Tree may be exported as C.

Data Mining Tasks Suitability: Classification problems with clean data sets available.

Consultancy: Available

9.1.2 DataEngine

This package is supplied by MIT GmbH, Germany. It also provides a Visual Programming interface.

Accessible Data Sources: ASCII File format, MS-Excel

Discovery Paradigms: Fuzzy Clustering, Fuzzy Rule Induction, Neural Network (Multi-Layered Perceptron,

Kohonen Self-Organising Map), Neuro-Fuzzy Classifier

Data Visualisation: Scatter and line plots, Bar charts and Area plots

Data Manipulation: Missing Data Handling, Selection, Scaling

Hardware Platforms: Most UNIX Workstations & Windows.

Statistical Functionality: Correlation, Linear Regression etc.

Deploying Applications: DataEngine ADL allows the integration of classifiers into other software Environments

Data Mining Tasks Suitability: Classification

Consultancy: Available

9.2 Consultancy Based Products


9.2.1 SGI

Silicon Graphics provide "Tailored Data Mining Solutions", which include hardware support in the form of the CHALLENGE family of database servers and software support in the form of Data Warehousing and Data Mining software. The CHALLENGE servers provide unique Data Visualisation capabilities, an area in which Silicon Graphics are recognised as leaders. The interface provided by SGI allows you to "fly" through visual representations of your data, helping you to identify important patterns and directing you to the next question you should ask within your analysis.

Apart from Visualisation, SGI provides facilities for Profile Generation and Mining for Association Rules.

9.2.2 IBM

IBM provide a number of tools to give users a powerful interface to Data Warehouses.

9.2.2.1 IBM Visualizer: Provides a powerful and comprehensive set of ready-to-use building blocks and development tools that can support a wide range of end-user requirements for query, report writing, data analysis, chart/graph making and business planning.

9.2.2.2 Discovering Association Patterns: IBM's Data Mining group at Almaden pioneered research into efficient techniques for discovering associations in supermarket buying patterns. Their algorithms have been successfully employed in supermarkets in the USA to discover patterns in the supermarket data that could not have been found without data mining.

9.2.2.3 Segmentation or Clustering: Data Segmentation is the process of separating data tuples into a number of sub-groups based on similarities in their attribute values. IBM provides two solutions based on two different discovery paradigms: Neural Segmentation and Binary Segmentation. Neural Segmentation is based on a Neural Network technique called self-organising maps. Binary Segmentation was developed at IBM's European Centre for Applied Mathematics and is based on a technique called relational analysis, which was developed to deal with binary data.

Applications: Basket Analysis

10. Case Studies and Data Mining Success Stories

Case Study 1: Health Care Analysis using Health-KEFIR

KEFIR (Key Findings Reporter) is a domain-independent system, developed by GTE Laboratories, for discovering and explaining key findings, i.e. deviations.

KEFIR consists of four main parts:

Deviation Detection: The inputs to KEFIR are:

Data
Predefined measures for deviation detection, e.g. in healthcare analysis there are standard measures such as Average_hospital_payments_per_capita, Admission_rate_per_1000_people, etc.
Predefined categories to create sub-populations (sectors): the full population in question is the top sector. Based on certain categories this population is sub-divided into sub-sectors recursively, e.g. the Inpatient population may, based on the category Admission Type, be split into Inpatient Surgical, Inpatient Medical, Inpatient Mental, etc.


KEFIR starts by discovering deviations in the predefined measures for the top sector and then discovers deviations in those sub-sectors that seem interesting.

Evaluation/Ordering by Interestingness: Two factors are taken into account when ordering the deviations in descending order of interestingness. These are:

The impact of the deviation, e.g. the effect of the deviation on payments for healthcare
The probability of success of the associated recommendation

These two factors together give a value for the potential saving, which is used to rank the deviations. Statistical significance is another important factor: a deviation may be due to just one very costly case, in which case the chances are it will not occur again; measures like the standard deviation may be useful here.

Explanation: An explanation for a deviation can be found in two ways:

Investigating the underlying formula: if the measure for which the deviation has been detected has a formula associated with it, e.g. Total Payments = Pay_per_day * No_of_days, an explanation may be found by investigating the deviations in the component measures of the formula.
Investigating the sub-populations: an explanation may be found by looking at which sub-population(s) is causing the deviation to occur.

Recommendation: Expert system based, provided by the domain expert. The output is a report including business graphics.

KEFIR needs to be tailored to the specific domain; the healthcare application resulted in Health-KEFIR, a version specific to healthcare. The success of Health-KEFIR shows that Data Mining solutions need to be tailored to the specific domain and that specific solutions to mining tasks are more realistic than multi-purpose solutions.

Case Study 2: Mass Appraisal for Property Taxation

The data set consisted of 416 data items. One in three items was selected as the holdout sample.

For each property the following variables were used as input to the Neural Network:

Ward

Transaction Date (Month and Year)

Size (Area)

Number of Bedrooms

House Class (Detached or Semi-Detached)

Age

House Type (Bungalow, House, CH, FH)

Heating Type (OFCH, FGCH, FSCH, PSCH, PECH, PGCH, GLHO, None, Not Known)

Garage (Drive, Single, None)

The goal was to predict the house price based on the other attributes of the houses. We used a Neural Network for the prediction and achieved an accuracy of approximately 82% with a mean error of approximately 15%. We also explored using a Rule Induction approach; the accuracy achieved by the Neural Network was greater than that achieved through rule induction.

Using simple data visualisation, outliers were spotted and removed from the data. The neural network was then trained on the remaining data set, and its predictive accuracy improved from 82% to approximately 93%, with the mean error reducing to approximately 7.8%.
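As a rough illustration of this kind of set-up, the sketch below trains a small neural network regressor on a handful of invented property records and evaluates it on a one-in-three holdout. The column names cover only a subset of the attributes listed above, and the tiny data set and network size are assumptions made purely for illustration; none of it reflects the actual data or network configuration used in the case study.

```python
# Minimal sketch of the mass-appraisal set-up: property attributes are encoded
# and fed to a neural network that predicts the sale price. The data, column
# names and network size are invented for illustration only.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

data = pd.DataFrame({
    "ward":        ["A", "B", "A", "C", "B", "C"],
    "size_sq_m":   [85, 120, 95, 150, 70, 110],
    "bedrooms":    [2, 3, 3, 4, 2, 3],
    "house_class": ["Semi-Detached", "Detached", "Semi-Detached",
                    "Detached", "Semi-Detached", "Detached"],
    "age_years":   [40, 10, 25, 5, 60, 15],
    "garage":      ["None", "Single", "Drive", "Single", "None", "Drive"],
    "price":       [55_000, 110_000, 70_000, 140_000, 45_000, 95_000],
})

X, y = data.drop(columns="price"), data["price"]
categorical = ["ward", "house_class", "garage"]
numeric = ["size_sq_m", "bedrooms", "age_years"]

model = Pipeline([
    ("encode", ColumnTransformer([
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
        ("num", StandardScaler(), numeric),
    ])),
    ("net", MLPRegressor(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)),
])

# One in three records held out for testing, mirroring the case study.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1 / 3, random_state=0)

model.fit(X_train, y_train)
predictions = model.predict(X_test)
mean_error = (abs(predictions - y_test) / y_test).mean()
print(f"mean percentage error on holdout: {mean_error:.1%}")
```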


Case Study 3: Policy Lapse/Renewal Prediction

A large insurance company had collected a large amount of data on its motor insurance customers. Using this data, the company wanted to be able to predict in advance which policies were going to lapse and which ones were going to be renewed as they neared expiry. The advantage of such a prediction is obvious: as competition in the business world increases, customer retention has become an important issue. The insurance company wanted to target new services at its customers to woo them into renewing their policies rather than moving to other insurance companies. Clearly, rather than targeting customers who are going to renew their policy anyway, it is better to target customers who are more likely to lapse their policy. Data Mining provided the solution for the company.

Of the 34 different attributes stored for each policy, we picked 12 that seemed to have an effect on whether or not the policy was going to lapse. Using a neural network we achieved a predictive accuracy of 71%, and using rule-induction-based techniques we were able to identify the attributes characteristic of a policy that was going to lapse. The whole exercise took approximately 3 weeks to complete. The accuracy achieved by Data Mining equalled the accuracy achieved by the company's statisticians, who undoubtedly spent many more months on their statistical models.
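The text does not name the rule induction technique used, so the following is only a sketch that uses a shallow decision tree as a stand-in rule inducer for the lapse-prediction task; the policy attributes, records and values are invented for illustration and do not reflect the 12 attributes actually chosen.

```python
# Minimal sketch of the lapse-prediction step using a decision tree as a
# stand-in for the rule induction technique (the actual algorithm used in the
# case study is not named in the text). All attribute names and records below
# are invented for illustration.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

policies = pd.DataFrame({
    "years_with_company": [1, 7, 2, 10, 1, 5, 3, 8],
    "premium_increase":   [0.20, 0.02, 0.15, 0.00, 0.25, 0.05, 0.18, 0.01],
    "claims_last_year":   [1, 0, 0, 0, 2, 0, 1, 0],
    "lapsed":             [1, 0, 1, 0, 1, 0, 1, 0],   # 1 = policy lapsed
})

X = policies.drop(columns="lapsed")
y = policies["lapsed"]

# A shallow tree keeps the induced rules readable for the domain expert.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Print the learned rules, i.e. the attribute conditions that characterise
# policies likely to lapse.
print(export_text(tree, feature_names=list(X.columns)))
```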

11. The Mining Kernel System: The University of Ulster's Perspective

The Mining Kernel System (MKS), being developed in the authors' laboratory, embodies the inter-disciplinary nature of Data Mining by providing facilities from Statistics, Machine Learning, Database Technology and Artificial Intelligence. The functionality provided by MKS forms a strong foundation for building the powerful Data Mining tools required to tackle what is clearly recognised to be a complex problem.

MKS is a set of libraries implemented by the authors to take the mundane, repetitive tasks, such as data access, knowledge representation and statistical functionality, out of Data Mining, allowing the user to concentrate on the more complex aspects of the discovery algorithms being implemented. In this section we give the motivation behind each of the libraries within MKS along with a brief description of each.

Figure 2 shows the architecture of MKS.


Figure 2: The Mining Kernel System (MKS)

At present MKS has 7 libraries providing Data Mining facilities. The facilities provided by MKS may be split into two main modules: the Interface Module and the Mining Module.

The Interface Module provides facilities for mining algorithms to interact with their environment. MKS has three distinct interfaces to the outside world: the User Interface, the Data Interface (the VDL Library) and the Knowledge Interface (the KIO Library). The information provided by the user through the user interface includes the data view in the form of the Data Source Mapping file (see section 2.1.2), domain knowledge, syntactic constraints (e.g. antecedent attributes of interest) and the support, uncertainty and interestingness thresholds.

The Mining Module provides core facilities required by most mining algorithms. At present the Mining Module of MKS consists of 5 libraries that provide facilities from conventional Statistics, Machine Learning and Artificial Intelligence. The libraries within this module are: the Statistical Library (STS), the Information Theoretic Library (INF), the Knowledge Representation Library (KNR), the Set Handling Library (SET) and the Evidence Theory Library (EVR).


References

[AGRA92] R. Agrawal, S. Ghosh, T. Imielinski, B. Iyer and A. Swami. An Interval Classifier for Database Mining Applications. Proc. of the 18th Int'l Conf. on VLDB, Pg. 560 - 573, 1992.

[AGRA93] R. Agrawal, T. Imielinski and A. Swami. Database Mining: A Performance Perspective. IEEE Transactions on Knowledge and Data Engineering, Special Issue on Learning and Discovery in Knowledge-Based Databases, 1993.

[AGRA93a] R. Agrawal, T. Imielinski and A. Swami. Mining Association Rules Between Sets of Items in Large Databases. Proc. of the ACM SIGMOD Conf. on Management of Data, 1993.

[AGRA93b] R. Agrawal, C. Faloutsos and A. Swami. Efficient Similarity Search in Sequence Databases. Proc. of the 4th International Conference on Foundations of Data Organisation and Algorithms, 1993.

[AGRA94] R. Agrawal and R. Srikant. Fast Algorithms for Mining Association Rules in Large Databases. Proc. of VLDB94, Pg. 487 - 499, 1994.

[AGRA95] R. Srikant and R. Agrawal. Mining Generalized Association Rules. Proc. of VLDB95, 1995.

[AGRA95a] Fast Similarity Search in the Presence of Noise, Scaling and Translation in Time-Series Databases. Proc. of VLDB95, 1995.

[ANAN95] S. S. Anand, D. A. Bell and J. G. Hughes. A General Framework for Data Mining Based on Evidence Theory. Provisionally accepted by the Data and Knowledge Engineering Journal.

[BELL93] D. A. Bell. From Data Properties to Evidence. IEEE Transactions on Knowledge and Data Engineering, Vol. 5, No. 6, Special Issue on Learning and Discovery in Knowledge-Based Databases, December 1993.

[BERN94] D. J. Berndt and J. Clifford. Using Dynamic Time Warping to Find Patterns in Time Series. KDD94: AAAI Workshop on Knowledge Discovery in Databases, Pg. 359 - 370, July 1994.

[CHAN91] K. C. C. Chan and A. K. C. Wong. A Statistical Technique for Extracting Classificatory Knowledge from Databases. Knowledge Discovery in Databases, Pg. 107 - 124, AAAI/MIT Press, 1991.

[FRAW91] W. J. Frawley, G. Piatetsky-Shapiro and C. J. Matheus. Knowledge Discovery in Databases: An Overview. Knowledge Discovery in Databases, Pg. 1 - 27, AAAI/MIT Press, 1991.

[HAN94] J. Han. Towards Efficient Induction Mechanisms in Database Systems. Theoretical Computer Science 133, Pg. 361 - 385, 1994.

[IMAM93] I. F. Imam, R. S. Michalski and L. Kerschberg. Discovering Attribute Dependencies in Databases by Integrating Symbolic Learning and Statistical Techniques. Working Notes of the Workshop on Knowledge Discovery in Databases, AAAI-93, 1993.

[LU95] H. Lu, R. Setiono and H. Liu. NeuroRule: A Connectionist Approach to Data Mining. Proc. of VLDB95, 1995.

[PEND91] E. P. D. Pednault. Minimal-Length Encoding and Inductive Inference. Knowledge Discovery in Databases, Pg. 71 - 92, AAAI/MIT Press, 1991.

[SMYT91] P. Smyth and R. M. Goodman. Rule Induction Using Information Theory. Knowledge Discovery in Databases, Pg. 159 - 176, AAAI/MIT Press, 1991.

[WUQ91] Q. Wu, P. Suetens and A. Oosterlinck. Integration of Heuristic and Bayesian Approaches in a Pattern-Classification System. Knowledge Discovery in Databases, Pg. 249 - 260, AAAI/MIT Press, 1991.

[YAGE91] R. R. Yager. On Linguistic Summaries of Data. Knowledge Discovery in Databases, Pg. 347 - 366, AAAI/MIT Press, 1991.

[ZIAR91] W. Ziarko. The Discovery, Analysis and Representation of Data Dependencies in Databases. Knowledge Discovery in Databases, Pg. 195 - 212, AAAI/MIT Press, 1991.