CHAPTER 1
INTRODUCTION
1.1 DATA MINING
A promising area of contemporary research and development in Computer
Science is data mining (Usama et al 1996; Han and Kamber, 2006; Hand et al
2001; Tognazzini, 2003). Data mining in recent times has been attracting more
and more attention from an extensive range of diverse groups of people. Progress
in digital acquisition and storage technology has resulted in the growth of huge
databases. This has occurred in all areas of human endeavour, from the mundane
(such as supermarket transaction data, credit card usage records, telephone call
details and government statistics) to the more exotic (such as images of
astronomical bodies, molecular databases and medical records). The interest has
grown in the possibility of tapping these data and extracting from them
information that might be of value to the owner of the database. Information
retrieval alone is no longer enough for decision-making. Confronted with
huge collections of data, we now have new needs to help us make better
managerial decisions: automatic summarization of data, extraction of the
“essence” of the information stored and the discovery of patterns in
raw data. The discipline concerned with this task is known as data mining. Data
mining is often defined as finding hidden information in a database. Alternatively
it has been called exploratory data analysis, data driven discovery and deductive
learning. Data mining techniques have been successfully applied in many
different fields including marketing, manufacturing, process control, fraud
detection and network management (Klaus, 2002).
Data mining techniques are heavily used to search for relationships and
information ‘hidden’ in transaction data. The impact of data mining on
information science has been discussed by Chen and Liu (2004), reviewing
personalized environments, electronic commerce and search engines. A web
crawler design for data mining was proposed by Thelwall (2001) in order to
extract information with less processing time. Furthermore, a mining method was
presented by Chang and Lee (2005) for retrieving sequential patterns over online
data streams, which is useful for retrieving embedded knowledge. Owing to the
current explosion of information and the availability of cheap storage, collecting
enormous volumes of data has become feasible over the last few decades. The ultimate
intent of this massive data collection is to utilize the information to achieve
competitive benefits, by determining formerly unidentified patterns in data with
the aid of Online Analytical Processing (OLAP) tools. Manual analysis, however,
is a highly tedious process, which illustrates the necessity of an automated means
of ascertaining interesting and concealed patterns in data. Data mining techniques have
increasingly been studied especially in their application in real world databases
(Bigus, 1996; Mitchell, 1997; Sousa et al 1998).
Data mining is a major step in the Knowledge Discovery in Databases
(KDD) process. It consists of applying computational techniques that, under
acceptable computational efficiency limitations, produce a particular enumeration
of patterns (or models) over the data (Mitchell, 1997). Data mining involves the
use of sophisticated data analysis tools to discover previously unknown, valid
patterns and relationships in large data sets. These tools can include statistical
models, mathematical algorithms and machine learning methods (Jeffrey, 2004).
The progress in data mining research has made it possible to implement several
data mining operations efficiently on large databases (Fayyad, 1997).
1.1.1 Kinds of Information
A myriad of data is collected, ranging from simple numerical measurements and
text documents to more complex information such as spatial data, multimedia
channels and hypertext documents. Here is a non-exclusive list of the variety of
information collected in digital form in databases and in flat files.
· Business transactions
· Scientific data
· Medical and personal data
· Surveillance video and pictures
· Satellite sensing
· Games
· Digital media
· CAD and Software engineering data
· Virtual Worlds
· Text reports and memos (e-mail messages)
· The World Wide Web repositories
1.1.2 Kinds of Data to be Mined
In principle, data mining is not specific to one type of media or data. Data
mining should be applicable to any kind of information repository. However,
algorithms and approaches may differ when applied to different types of data
(Han and Kamber, 2006). Indeed, the challenges presented by different types of
data vary significantly. Data mining is being put into use and studied for
databases, including relational databases, object-relational databases and
object-oriented databases, data warehouses, transactional databases, unstructured
and semi structured repositories such as World Wide Web and advanced
databases such as spatial, multimedia, time-series and textual databases and even
flat files. Here are some examples in more detail:
· Flat files: Flat files are actually the most common data source for data
mining algorithms, especially at the research level. Flat files are simple data files
in text or binary format with a structure known by the data mining algorithm to be
applied. The data in these files can be transactions, time-series data, scientific
measurements, etc.
· Relational Databases: Briefly, a relational database consists of a set of
tables containing either values of entity attributes or values of attributes from
entity relationships. Tables have columns and rows, where columns represent
attributes and rows represent tuples. A tuple in a relational table corresponds to
either an object or a relationship between objects and is identified by a set of
attribute values representing a unique key.
The most commonly used query language for relational databases is SQL.
SQL allows retrieval and manipulation of the data stored in the tables, as well as
the calculation of aggregate functions such as average, sum, min, max and count.
For instance, an SQL query to count the videos grouped by category would
be: SELECT category, COUNT(*) FROM Items WHERE type = 'video' GROUP BY category.
Data mining algorithms using relational databases can be more versatile than data
mining algorithms specifically written for flat files, since they can take advantage
of the structure inherent to relational databases. While data mining can benefit
from SQL for data selection, transformation and consolidation, it goes beyond
what SQL could provide, such as predicting, comparing, detecting deviations, etc.
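To make the query concrete, here is a small sketch using Python's built-in sqlite3 module; the Items table and the rows in it are hypothetical, invented only to mirror the video-store example:

```python
import sqlite3

# Hypothetical Items table mirroring the video-store example in the text.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Items (name TEXT, type TEXT, category TEXT)")
conn.executemany(
    "INSERT INTO Items VALUES (?, ?, ?)",
    [("Alien", "video", "sci-fi"),
     ("Jaws", "video", "thriller"),
     ("Blade Runner", "video", "sci-fi"),
     ("Chess", "game", "board")],
)

# Count the videos grouped by category; note that the string literal
# 'video' must be quoted in real SQL.
rows = conn.execute(
    "SELECT category, COUNT(*) FROM Items "
    "WHERE type = 'video' GROUP BY category"
).fetchall()
print(dict(rows))  # counts per category: sci-fi 2, thriller 1
```

The aggregate COUNT(*) here is exactly the kind of consolidation SQL provides; the prediction and deviation-detection tasks mentioned above are what lie beyond it.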
· Data Warehouses: A data warehouse is a repository of data collected
from multiple (often heterogeneous) data sources, intended to be used as a whole
under the same unified schema. A data warehouse gives the option to analyze data
from different sources under the same roof. Suppose, for example, that a video
store becomes a franchise in North America: the many stores belonging to the
company may have different databases and different structures. If an executive
of the company wants to access the data from all
stores for strategic decision-making, future direction, marketing, etc., it would be
more appropriate to store all the data in one site with a homogenous structure that
allows interactive analysis. In other words, data from the different stores would
be loaded, cleaned, transformed and integrated together. To facilitate
decision-making and multi-dimensional views, data warehouses are usually
modelled by a multi-dimensional data structure.
· Transaction Databases: A transaction database is a set of records
representing transactions, each with a time stamp, an identifier and a set of items.
For example, in the case of the video store, the rentals table represents the
transaction database. Each record is a rental contract with a customer identifier, a
date and the list of items rented (i.e., video tapes, games, VCRs, etc.). Since
relational databases do not allow nested tables (i.e., a set as an attribute value),
transactions are usually stored in flat files or in two normalized transaction
tables, one for the transactions and another for the transaction items. One typical
data mining analysis on such data is the so-called market basket analysis or
association rules in which associations between items occurring together or in
sequence are studied.
· Multimedia Databases: Multimedia databases include video, images,
audio and text media. They can be stored on extended object-relational or
object-oriented databases or simply on a file system. Multimedia is characterized
by its high dimensionality, which makes data mining even more challenging.
Data mining from multimedia repositories may require computer vision,
computer graphics, image interpretation and natural language processing
methodologies.
· Time-Series Databases: Time-series databases contain time related
data such as stock market data or logged activities. These databases usually have
a continuous flow of new data coming in, which sometimes creates the need for
challenging real-time analysis. Data mining in such databases commonly
includes the study of trends and correlations between evolutions of different
variables, as well as the prediction of trends and movements of the variables in
time.
· Spatial Databases: Spatial databases are databases that, in addition to
usual data, store geographical information like maps and global or regional
positioning. Such spatial databases present new challenges to data mining
algorithms.
· World Wide Web: The World Wide Web (WWW) is the most
heterogeneous and dynamic repository available. A very large number of authors
and publishers are continuously contributing to its growth and metamorphosis,
and a massive number of users are accessing its resources daily. Data in the
World Wide Web is organised in inter-connected documents. These documents
can be text, audio, video, raw data and even applications. Conceptually, the
World Wide Web comprises three major components: the content of the
Web, which encompasses documents available; the structure of the Web, which
covers the hyperlinks and the relationships between documents; and the usage of
the web, describing how and when the resources are accessed. A fourth
dimension relates to the dynamic nature or evolution of the documents.
Data mining in the World Wide Web, or web mining, tries to address all these
issues and is often divided into web content mining, web structure mining and
web usage mining.
1.1.3 Data Mining and Knowledge Discovery in Databases
The terms Knowledge Discovery in Databases and data mining are often
used interchangeably. Knowledge Discovery in Databases (KDD) is the process
of finding useful information and patterns in data. Data Mining is the use of
algorithms to extract the information and patterns derived by the KDD process.
Data Mining is formally defined as “the non-trivial extraction of implicit,
formerly unidentified, and potentially useful information from data” (Frawley et
al, 1992; Jeffrey, 2004; Fayyad 1997). With the amount of data stored in files,
databases and other repositories increasing enormously, it is necessary to
develop powerful means of analysis, so that interpretation of such data and
the extraction of interesting knowledge can support decision making.
Fig 1.1 An overview of the steps comprising the KDD process [Bradley et al, 1998]
While data mining and knowledge discovery in databases are
frequently treated as synonyms, data mining is actually one step of the knowledge
discovery process. Fig 1.1 shows data mining as a step in an iterative knowledge
discovery process.
The KDD process consists of the following five steps:
1.1.3.1 Selection: The data needed for the data mining process may be obtained
from many different and heterogeneous data sources. This first step obtains
the data from various databases, files and non-electronic resources.
1.1.3.2 Pre-processing: The data to be used by the process may have incorrect or
missing data. There may be anomalous data from multiple sources
involving different data types and metrics. There may be many different
activities performed at this time. Erroneous data must be corrected or
removed, whereas missing data must be supplied or predicted (often using
data mining tools).
1.1.3.3 Transformation: Data from different sources must be converted into a
common format for processing. Some data may also be encoded or reduced
to decrease the number of possible data values being considered.
1.1.3.4 Data mining: Based on the data mining task being performed, this step
applies algorithms to the transformed data to generate the desired results.
1.1.3.5 Interpretation/evaluation: How the data mining results are presented to
the users is extremely important because the usefulness of the results is
dependent on it. Various visualisation and GUI strategies are used at this
last step.
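The five steps above can be sketched end-to-end on toy records; every record and helper in this sketch is hypothetical, and the "mining" step is reduced to a trivial aggregate purely for illustration:

```python
# Illustrative sketch of the five KDD steps on invented records.

raw = [
    {"age": "34", "income": "52000"},
    {"age": "-1", "income": "48000"},   # erroneous record
    {"age": "29", "income": None},      # missing value
]

# 1. Selection: pull the attributes of interest from the source.
selected = [{"age": r["age"], "income": r["income"]} for r in raw]

# 2. Pre-processing: drop erroneous rows, supply missing values.
cleaned = [r for r in selected if int(r["age"]) > 0]
known = [int(r["income"]) for r in cleaned if r["income"] is not None]
default_income = sum(known) // len(known)
for r in cleaned:
    r["income"] = int(r["income"]) if r["income"] is not None else default_income

# 3. Transformation: convert everything to a common numeric format.
data = [(int(r["age"]), r["income"]) for r in cleaned]

# 4. Data mining: here a trivial "model" -- the mean income.
mean_income = sum(inc for _, inc in data) / len(data)

# 5. Interpretation/evaluation: present the result to the user.
print(f"mean income over {len(data)} records: {mean_income}")
```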
It is common to combine some of these steps together. For instance, data
cleaning and data integration can be performed together as a pre-processing phase
to generate a data warehouse. Data selection and transformation can also be
combined where the consolidation of the data is the result of selection or as for
the case of data warehouses, the selection is done on transformed data. KDD
is an iterative process. Once the discovered knowledge is presented to the user,
the evaluation measures can be enhanced and the mining can be further refined or
new data can be selected or further transformed or new data sources can be
integrated, in order to get different and more appropriate results.
1.1.4 Basic Data Mining Tasks
Data mining commonly encompasses a variety of algorithms. In general,
data mining tasks can be classified into two categories: descriptive mining and
predictive mining. Descriptive mining is the process of drawing out the essential
characteristics or general properties of the data in the database. Clustering,
association and sequential mining are some of the descriptive mining techniques.
Predictive mining is the process of inferring patterns from data to make
predictions. Predictive mining techniques involve tasks like classification,
regression and deviation detection. The following are the basic data mining tasks
(Margaret, 2008). These individual tasks may be combined to obtain more
sophisticated data mining applications.
· Classification
Classification maps data into predefined groups or classes. It is often
referred to as supervised learning because the classes are determined before
examining the data. Classification algorithms require that the classes be defined
based on data attribute values. They often describe these classes by looking at the
characteristics of data already known to belong to the classes. Pattern recognition
is a type of classification where an input pattern is classified into one of several
classes based on its similarity to these predefined classes.
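A minimal illustration of classification as supervised learning; the training points, labels and the nearest-neighbour rule below are all invented for the sketch (nearest-neighbour is only one of many classification techniques):

```python
# Classification sketch: assign each new point the class of its nearest
# labelled example. Classes are fixed in advance (supervised learning).

training = [
    ((1.0, 1.0), "low"),
    ((1.2, 0.8), "low"),
    ((8.0, 9.0), "high"),
    ((9.0, 8.5), "high"),
]

def classify(point):
    """Return the class of the closest training example."""
    def dist2(p, q):
        return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
    _, label = min(training, key=lambda ex: dist2(ex[0], point))
    return label

print(classify((1.1, 0.9)))  # "low"
print(classify((8.5, 9.2)))  # "high"
```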
· Regression
Regression is used to map a data item to a real valued prediction variable.
In actuality, regression involves the learning of the function that does this
mapping. Regression assumes that the target data fit into some known type of
function (e.g., linear, logistic, etc.) and then determines the best function of this
type that models the given data. Some type of error analysis is used to determine
which function is “best”.
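As a sketch of this idea, here is an ordinary least-squares fit of a linear function to invented data points, with the sum of squared residuals serving as the error analysis:

```python
# Regression sketch: assume y = a*x + b and learn the "best" (a, b)
# by minimising squared error. Data points are invented (roughly y = 2x).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.1, 8.0]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Closed-form least-squares slope and intercept.
a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
    sum((x - mean_x) ** 2 for x in xs)
b = mean_y - a * mean_x

# Error analysis: sum of squared residuals measures how good the fit is.
sse = sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))
print(round(a, 2), round(b, 2), round(sse, 3))  # slope close to 2
```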
· Time Series Analysis
With time series analysis, the value of an attribute is examined as it varies
over time. The values usually are obtained as evenly spaced time points (daily,
weekly, hourly, etc.). A time series plot is used to visualise the time series. There
are three basic functions performed in time series analysis. In one case, distance
measures are used to determine the similarity between different time series. In the
second case, the structure of the line is examined to determine its behaviour. A
third application would be to use the historical time series plot to predict future
values.
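The first of these functions, a distance measure between series, can be sketched with plain Euclidean distance over invented, evenly spaced values:

```python
import math

# Time-series similarity sketch: pointwise Euclidean distance between
# equal-length series (invented weekly values).
series_a = [10, 12, 11, 13, 15]
series_b = [10, 11, 11, 14, 15]
series_c = [30, 28, 31, 29, 33]

def euclidean(s, t):
    """Pointwise Euclidean distance between equal-length series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(s, t)))

# series_a is far more similar to series_b than to series_c.
print(euclidean(series_a, series_b) < euclidean(series_a, series_c))  # True
```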
· Prediction
Many real-world data mining applications can be seen as predicting future
data states based on past and current data. Prediction can be viewed as a type of
classification. The difference is that prediction is predicting a future state rather
than a current state. Prediction applications include flooding, speech recognition,
machine learning and pattern recognition.
· Clustering
Clustering is similar to classification except that the groups are not
predefined, but rather defined by data alone. Clustering is alternatively referred to
as unsupervised learning or segmentation. It can be thought of as partitioning or
segmenting the data into groups that might or might not be disjoint. The
clustering is usually accomplished by determining the similarity among the data
on predefined attributes. The most similar data are grouped into clusters. Since
the clusters are not predefined, a domain expert is often required to interpret the
meaning of the created clusters.
A special type of clustering is called segmentation, in which a
database is partitioned into disjoint groupings of similar tuples called segments.
Segmentation is often viewed as being identical to clustering; in other circles,
it is viewed as a specific type of clustering applied to a database itself.
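A minimal sketch of clustering with the groups defined by the data alone: a tiny k-means on invented one-dimensional points, with k = 2 and a naive initialisation:

```python
import statistics

# Clustering sketch: k-means on invented 1-D data with k = 2.
# The groups are not predefined; they emerge from the data alone.
points = [1.0, 1.2, 0.8, 8.0, 8.3, 7.9]
centroids = [points[0], points[3]]  # naive initialisation

for _ in range(10):  # a few refinement iterations suffice here
    clusters = {0: [], 1: []}
    for p in points:
        # Assign each point to the nearest centroid.
        nearest = min((0, 1), key=lambda i: abs(p - centroids[i]))
        clusters[nearest].append(p)
    # Move each centroid to the mean of its cluster.
    centroids = [statistics.mean(clusters[i]) for i in (0, 1)]

print(sorted(clusters[0]), sorted(clusters[1]))
# [0.8, 1.0, 1.2] [7.9, 8.0, 8.3]
```

A domain expert would still be needed to say what the two emergent groups mean, as the text notes.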
· Summarization
Summarization maps data into subsets with associated simple descriptions.
Summarization is also called characterisation or generalisation. It extracts or
derives representative information about the database. This may be accomplished
by actually retrieving portions of the data. Alternatively, summary type
information (such as mean of some numeric attribute) can be derived from the
data. The summarization succinctly characterizes the contents of the databases.
· Association Rules
Link analysis, alternatively referred to as affinity analysis or association,
refers to the data mining task of uncovering relationships among data. The best
example of this type of application is to determine association rules. An
association rule is a model that identifies specific types of data associations.
These associations are often used in the retail sales community to identify items that
are frequently purchased together. Associations are also used in many other
applications such as predicting the failure of telecommunication switches.
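A sketch of what an association rule looks like numerically, on invented market baskets; the rule {bread} → {butter} is evaluated by its support and confidence:

```python
# Association rule sketch on invented market baskets: how often are
# bread and butter purchased together, and how often does a basket
# with bread also contain butter?
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "jam"},
]

def support(itemset):
    """Fraction of baskets containing every item in itemset."""
    return sum(itemset <= b for b in baskets) / len(baskets)

sup = support({"bread", "butter"})        # support of the rule
conf = sup / support({"bread"})           # confidence of bread -> butter
print(sup, round(conf, 3))  # 0.5 0.667
```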
· Sequence Discovery
Sequential analysis or sequence discovery is used to determine sequential
patterns in data. These patterns are similar to associations in that data (or events)
are found to be related, but the relationship is based on time. Unlike market
basket analysis, which requires the items to be purchased at the same time, in
sequence discovery the items are purchased over time in some order.
1.2 MOTIVATION FOR THE RESEARCH
As the volume of data in warehouses and on the Internet grows ever faster,
the scalability of mining algorithms is a major concern. Classical association rule
mining algorithms that require multiple passes over the entire database
can take hours or even days to execute, and in the future this problem will only
become worse. A sampling approach can be used to address this scalability problem
(Haas, 1999). In the context of “standard” association rule mining, use of samples
can make mining studies feasible that were formerly impractical due to enormous
time requirements. Indeed, a number of large companies routinely run mining
algorithms on a sample rather than on the entire warehouse (Bin et al, 2002).
Especially, if the data comes as a stream flowing at a faster rate than can be
processed, sampling seems to be the only choice (Yanrong and Raj, 2005). Two
facts support the applicability of sampling in data mining. One is that some data
mining algorithms may not use all data in a data set all the time during their
execution (Fayyad and Smith, 1995). The other is that mining on part of the data
may produce comparable results on the whole data (Toivonen, 1996; Zaki et al
1996).
Sampling is a powerful data reduction technique that has been applied to a
variety of problems in database systems. In general, sampling can significantly
reduce the cost of mining, since the mining algorithms need to deal with only a
small dataset compared to the original database (Yanrong and Raj, 2005).
Sampling-based algorithms also facilitate interactive mining i.e., when the goal is
to obtain one or more “interesting” association rules as quickly as possible, a user
might first mine a very small sample (Bin et al 2002). Sampling has often been
suggested as an effective tool to reduce the size of the dataset operated on, at
some cost to accuracy (Parthasarathy, 2002). For example, statistics collected
from a sample of the database are used to generate near-optimal query execution plans
(Venkatesan et al 2009). A fairly obvious way of reducing the database activity of
knowledge discovery is to use only a random sample of a relation and to find
approximate regularities (Toivonen, 1996). Sampling is one of the principal
means of optimizing the computational cost incurred in association rule mining,
by reducing the time spent on frequent itemset mining.
The importance of sampling for association rule mining has been
recognized by several researchers (Bin et al 2002; Toivonen, 1996; Zaki et al
1996; Surong et al, 2005). The usual approach is to take a random portion of the
database of a previously determined size and then calculate the frequency of
itemsets over the sample, using a lowered minimum support threshold that is
slightly smaller than the user-specified minimum support σ. Also the
computational cost of association rule mining can be reduced in four ways:
· By reducing the number of passes over the database (Gu et al, 2000;
Agarwal et al, 2000; Yuang and Huang, 2005).
· By sampling the database (Bin et al, 2002; Toivonen, 1996; Zaki et al
1996; Surong et al 2005).
· By adding extra constraints on the structure of patterns (Wojciechowski
and Zakrzewiez 2002).
· Through parallelization (Li, 2004; Manning and Keane, 2001).
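The sampling route can be sketched as follows; the database, the roughly 60% item frequency and both threshold values are hypothetical, chosen only to illustrate mining a random sample with a slightly lowered support threshold:

```python
import random

# Sketch of mining a random sample with a slightly lowered support
# threshold. Database, frequencies and thresholds are all hypothetical.
random.seed(42)

# Item "A" appears in roughly 60% of 10,000 transactions.
database = [{"A"} if random.random() < 0.6 else {"B"} for _ in range(10000)]

sample = random.sample(database, 500)   # mine only the sample
min_sup = 0.5                           # user-specified minimum support
lowered = 0.45                          # slightly smaller sample threshold

sample_freq = sum("A" in t for t in sample) / len(sample)
true_freq = sum("A" in t for t in database) / len(database)

# "A" is flagged as a candidate on the sample, and it is indeed
# frequent in the full database.
print(sample_freq >= lowered, true_freq >= min_sup)
```

Lowering the threshold on the sample reduces the chance of missing itemsets that are frequent in the whole database but slightly under-represented in the sample.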
Recent work in the area of approximate aggregation processing shows that
the benefits of sampling are most fully realized when the sampling technique is
tailored to the specific problem at hand as stated by Acharya et al (2000).
The motivation behind the research is that the association rules mined from
large databases with the aid of an efficient sampling technique will generate
effective and relevant association rules. It has been found that the factors like
sample size, sampling method, transaction length, transaction frequency and
more, affect the quality of the samples selected for association rule mining.
Most of the association rule mining algorithms presented in the literature
are designed for mining static databases. Nowadays, the research community has
turned its attention to incremental databases, on account of real-world
applications and the necessity of handling dynamically added records. Motivated
by this interest, researchers have been driven to design innovative incremental
algorithms for association rule mining, because the quantity of data available in
real-life databases is increasing at a tremendous rate.
The mining of association rules on transactional database is usually an
offline process since it is costly to find the association rules in large databases.
With usual market-basket applications, new transactions are generated and old
transactions may be obsolete as time advances. As a result, incremental updating
techniques should be developed for maintenance of the discovered association
rules to avoid redoing mining on the whole updated database. A database may
allow frequent or occasional updates and such updates may not only invalidate
existing association rules but also activate new rules. Thus it is nontrivial to
maintain such discovered rules in large databases. Since the underlying
transaction database has been changed as time advances, some algorithms, such
as Apriori, may have to resort to the regeneration of candidate itemsets for the
determination of new frequent itemsets. This is, however, very costly even if the
incremental data subset is small.
1.3 PROBLEM SPECIFICATION
The volume of electronically accessible data in warehouses and on the
internet is growing faster than the speedup in processing times predicted by
Moore’s law (Winter and Auer, 1998). Scalability of mining algorithms is
therefore a major concern. Classical mining algorithms that require one or more
passes over the entire database can take hours or even days to execute, and in the
future this problem will only worsen. One approach to the scalability problem is
to exploit the fact that approximate answers often suffice and to execute mining
algorithms over a ‘synopsis’ or ‘sketch’. Many of the synopses proposed in the
literature (Charu et al, 1999), however, require one or more expensive passes
over all the data. The use of synopses may thus still fail to adequately address
the scalability problem unless the cost of producing the synopses is amortized
over many queries.
Association rule mining is one of the important issues in data mining and
there is a high demand in industry as extraction of association rules is directly
related to sales. It is used to extract hidden information from large data sets.
Market Basket Analysis (MBA) is one of the common applications; it uses
ARM to extract customers’ purchase behaviours in order to improve sales (Chen
et al 2005; Brin et al 1997). Association rules also find application in various
other areas such as telecommunication networks, market and risk management
and inventory control.
The task of mining association rules is usually performed in transactional
or relational databases, to derive a set of strong association rules. This may
require repeated scans through the database. Therefore, it can result in a huge
amount of processing when working on a very large database. Many efficient
algorithms can significantly improve the performance, in both efficiency and
accuracy, of association rule mining (Gu et al 2000). However, sampling can be a
direct and easy approach to improve efficiency when accuracy is not the sole
concern.
In general, the mining process of association rules can be divided into two steps:
1. Frequent Itemset Generation: generate all sets of items that have support
greater than a certain threshold, called minsupport.
2. Association Rule Generation: generate all association rules that have
confidence greater than a certain threshold called minconfidence from the
frequent itemsets.
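The two steps can be sketched on a toy transaction set (the transactions and thresholds below are invented for illustration, and the frequent-itemset step here is brute force rather than a real Apriori, which would prune candidates level by level):

```python
from itertools import combinations

# Sketch of the two ARM steps on invented transactions.
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "eggs"},
    {"bread", "eggs"},
    {"milk", "eggs"},
]
minsupport, minconfidence = 0.5, 0.6

items = sorted(set().union(*transactions))

def support(itemset):
    """Fraction of transactions containing every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

# Step 1: frequent itemset generation (brute force over all candidates).
frequent = [frozenset(c)
            for k in range(1, len(items) + 1)
            for c in combinations(items, k)
            if support(set(c)) >= minsupport]

# Step 2: rule generation, X -> Y with
# confidence = support(X ∪ Y) / support(X).
rules = []
for fs in frequent:
    for k in range(1, len(fs)):
        for left in combinations(sorted(fs), k):
            lhs = frozenset(left)
            conf = support(fs) / support(lhs)
            if conf >= minconfidence:
                rules.append((set(lhs), set(fs - lhs), conf))

print(len(frequent), len(rules))  # 6 frequent itemsets, 6 rules
```

Even on this toy data the cost is dominated by step 1, which enumerates and counts candidate itemsets against every transaction.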
It has been shown by Venkatesan et al (2009) that the first step of ARM
dictates the computational and I/O requirements, necessitating repeated passes
over the complete database. The literature presents numerous algorithms for
ARM, among which Apriori, developed by Agarwal et al (1994), has been the
most renowned, chiefly owing to its efficiency in knowledge discovery. An
apparent approach for achieving a considerable reduction in the computational
and I/O requirements of frequent itemset generation is sampling, as stated by
Venkatesan et al (2009). However, the Apriori algorithm suffers from two
bottlenecks: (1) complex frequent itemset generation, which consumes most of
the time, space and memory, and (2) the need for multiple scans of the database
(Sotiris and Dimitris, 2006).
The execution of traditional association rule mining algorithms that
necessitate multiple passes over the complete database can consume hours or
even days. To reduce this adverse effect, researchers have recently sought to
develop efficient approaches that reduce the I/O and computational requirements
of ARM techniques. Sampling has been recognized as a dominant data reduction
technique that has been successfully applied to an assortment of data mining
algorithms to reduce computational overhead.
In most sampling based association rule mining algorithms, the samples
are selected randomly from the large database without considering any of the
characteristics of the database. Moreover, it has been highly difficult to determine
an optimal sample size for effectively mining association rules.
Many algorithms have been developed for mining static datasets. It is
nontrivial to maintain the discovered rules as large datasets change; this is the
main idea behind Incremental Association Rule Mining (IARM), which has
recently received much attention from data mining researchers. Applying data
mining techniques to real-world applications is a challenging task because the
databases are dynamic, i.e., they change continuously due to addition, deletion,
modification, etc., of the contained data (Zhang et al 2003). Generally, if the
dataset is incremental in nature, the frequent itemset discovery problem
consumes more time. From time to time, new records are added to an
incremental dataset.
Generally, when compared to the entire dataset, the size of the increments, or the
number of records added to the dataset, is very small. But the validity of the
rules in the updated dataset may get distorted due to the addition of these new
records (Nath et al, 2010). Hence a few new association rules may be created and
a few old ones may become obsolete. When new transactions are inserted into the
original database, traditional batch-mining algorithms resolve this problem by
reprocessing the entire new database. But they require much computational time
and ignore the previously mined knowledge, as stated by Hong et al (2008).
In the real world where large amounts of data grow steadily, some old
association rules can become useless and new databases may give rise to some
implicitly valid patterns or rules. Hence, updating rules or patterns is important.
The FUP algorithm proposed by Cheung et al (1996) is well known in relation to
this issue. A simple method for solving the updating problem is to reapply the
mining to the entire database, but this approach is time consuming. The FUP
algorithm uses information from old frequent itemsets to improve its
performance. The FUP algorithm has been extended to handle insertions as well
as deletions from the original database. Several other approaches to incremental
mining have been proposed (Ayan et al, 1999; Cheung et al, 1997; Hong et al,
2001; Lee et al, 2001; Ng and Lam, 1999; Ng et al, 2001; Sarda and Srinivas,
1998; Thomas et al, 1997; Velosa et al, 2000; Sarasere, 1995).
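The intuition behind FUP-style maintenance can be sketched in a deliberately simplified form: only single-item counts are maintained here (real algorithms such as FUP handle full itemsets), and the toy database and threshold are hypothetical:

```python
from collections import Counter

# Incremental maintenance sketch: keep the old item counts and fold in
# only the new transactions, instead of re-scanning the whole database.
old_db = [{"a", "b"}, {"a"}, {"b", "c"}, {"a", "c"}]
old_counts = Counter(item for t in old_db for item in t)  # mined earlier

new_transactions = [{"a", "b"}, {"c"}]
delta = Counter(item for t in new_transactions for item in t)

updated_counts = old_counts + delta          # no rescan of old_db needed
total = len(old_db) + len(new_transactions)
min_sup = 0.6

frequent = {i for i, c in updated_counts.items() if c / total >= min_sup}
print(sorted(frequent))  # ['a']
```

The saving is that `old_db` is never touched again; only the (much smaller) increment is scanned, which is exactly the cost the batch re-mining approach fails to avoid.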
1.4 OBJECTIVE OF THE RESEARCH
The discovery of association rules is a computationally expensive task.
Further, transaction databases are typically very large. It is therefore
imperative to develop fast, scalable techniques for mining them.
Following are the objectives of this thesis.
· To design a novel sampling based association rule mining algorithm that
- Effectively mines association rules by taking into account factors
such as the sample size and the sampling method.
- Ultimately enhances the speed, efficiency and economy of mining
association rules.
- Achieves a good tradeoff between accuracy and time.
· To design an incremental mining algorithm for growing databases that
- Extends the Pre-FUFP maintenance algorithm proposed by Hong
and Wang (2001).
- Addresses the recent-items handling problem during the incremental
mining process.
- Is specifically designed to handle the recent items with more
weightage, because these items obtain better reachability among
customers.
- Adaptively determines the support value based on the number of
transactions present in the incremental database rather than the
whole updated database.
1.5 SCOPE OF THE RESEARCH
In this thesis, new sampling and incremental algorithms for effectively
mining association rules are proposed. Progressive sampling is employed in the
proposed approach to decide on an optimal sample size for achieving a desirable
number of quality association rules.
The proposed work can operate on all transaction databases. In addition to
that it can be applied to data streams as well as continuous data such as colour
structure descriptors of images. It can also be applied to medical databases to
mine for possible diseases and symptoms, by mapping medical data to a
transaction format suitable for mining association rules and identifying
constraints, which may prove invaluable in future medical decision making.
Another area is direct marketing (retail, banking, insurance and fundraising
industries). Direct marketing is a modern business activity with an aim to
maximize the profit generated from marketing to a selected group of customers.
A key to direct marketing is to select a subset of customers so as to
maximize the profit return while minimizing the cost.
The proposed incremental approach efficiently handles the items that are
included recently in the updated database, based on an adaptive support
threshold; it is most suitable for developing better business strategies and can be
used for sales forecasting.
1.6 RESEARCH CONTRIBUTIONS
Scalability is one of the issues arising in association rule mining. Other
issues, arising in incremental mining, include the efficiency of algorithms for the
task, the conciseness of the results output by these algorithms and the re-mining
of a database after it has been updated with fresh data. Each of these issues is
addressed in this thesis.
The contributions of the thesis are:
(i) A novel progressive sampling based algorithm for the problem of
efficiently sampling for association rules is presented.
(ii) An extensive analysis of the progressive sampling-based approach for
association rule mining is done by using various real life datasets.
(iii) This approach is designed by taking into account the sample size and
the sampling method, with the intent of enhancing the speed, efficiency
and economy of mining association rules.
(iv) An efficient incremental mining approach entitled “Enhanced
Pre-FUFP” is presented.
(v) The proposed approach incorporates the recent-items concept to
further extend the Pre-FUFP maintenance algorithm, addressing the
recent-items handling problem during the incremental mining process;
an adaptive threshold value is incorporated to avoid the costly process
of rescanning.
1.7 CHAPTER ORGANIZATION
The remainder of this dissertation is organized in the following fashion:
Chapter 2 presents the necessary background for the thesis. This includes
ways of improving the efficiency of association rule mining, and reviews
published research on sampling based association rule mining algorithms as well
as incremental mining algorithms that are closely related to this research.
Chapter 3 presents the basic concepts and existing methodologies in
sampling techniques and incremental mining algorithms.
Chapter 4 presents the overall methodology of proposed progressive
sampling approaches for association rule mining from very large transaction
databases.
Chapter 5 presents the overall methodology of proposed algorithm for
incremental mining of Association Rules.
Chapter 6 presents the datasets that have been studied for carrying out the
sampling based as well as the incremental approach to ARM. The experimental
setup, the performance study conducted on the proposed algorithms and the
experimental results are also discussed in this chapter.
Finally, Chapter 7 summarizes and concludes the main contributions of the
proposed work and outlines future avenues to explore.