CHAPTER IV
MODIFIED APRIORI ALGORITHM
4.1. INTRODUCTION
In everyday life, information is collected almost everywhere. For example, at
supermarket checkouts, information about customer purchases is recorded. When
payback or discount cards are used, information about customer purchasing behavior and
personal details can be linked. Evaluation of this information can help retailers devise
more efficient and targeted marketing strategies.
Most established organizations have accumulated masses of information from
their customers over decades. With e-commerce applications [115] growing quickly,
organizations now amass vast quantities of data in months rather than years. Data
Mining, also called Knowledge Discovery in Databases, aims to determine the trends,
patterns, correlations and anomalies in these databases that can assist in making precise
future decisions.
Manual analysis of the huge amounts of information stored in modern databases
is very difficult. Data mining provides tools to reveal unknown information in large,
already stored databases. A well-known data mining technique is Association Rule
Mining [87]. It discovers the interesting relationships, called associations, in a database.
Association rules are very efficient at revealing all the interesting relationships in a
relatively large database with a huge amount of data [102]. The large quantity of
information collected through the set of association rules can be used not only for
illustrating the relationships in the database, but also for differentiating between
different kinds of classes in a database.
Association rule mining identifies remarkable associations or relationships
among large sets of data items [77]. With huge quantities of data constantly being
collected and stored in databases, many industries have become interested in mining
association rules from their databases. For example, the discovery of interesting
association relationships among large volumes of business transaction data can assist
in catalog design, cross-marketing, loss leader analysis, and various other business
decision-making processes. A typical example of association rule mining is Market Basket
Analysis. This method examines customer buying patterns by identifying associations
among the various items that customers place in their shopping baskets. The identification
of such associations helps retailers develop marketing strategies by gaining insight
into which items are frequently purchased together by customers. It is also useful for
examining the purchase behavior of customers, and it assists in increasing sales and
conserving inventory by focusing on point-of-sale transaction data. Market Basket
Analysis thus remains a broad area in which researchers seek to develop better data
mining algorithms.
Mining Association Rules is one of the most important application fields of Data
Mining [54, 83]. Given a set of customer transactions on items, the main intention is to
determine correlations among the sales of items. Consider a market with a large
collection of customer transactions. An association rule has the form X ⟹ Y,
where X is referred to as the antecedent and Y as the consequent. X and Y are
sets of items, and the rule states that customers who purchase X are likely to
purchase Y with a probability of c%, where c is known as the confidence. Such a rule may
be: “Eighty percent of people who purchase cigarettes also purchase matches". Such rules
help answer questions of the form “What is Coca Cola sold with?"; if a user wants
to check the dependency between two items A and B, the rules that have A in the
consequent and B in the antecedent must be determined.
In [17], the author illustrates the reasons for considering that the future of retail
businesses involves selling smaller quantities of more products. Anderson summarizes
this in his “98% rule”, contrasting with the well-known “80/20 rule”.
The “98% rule” means that, in a statistical distribution of the products, only 2% of the
items are very frequent while 98% of the items have a very low frequency of purchase,
creating a long tail distribution. The long tail shape appears in the new markets due to
three factors, namely:
- democratization of the tools of production,
- democratization of the tools of distribution,
- the connection between supply and demand based on online networks.
The long tail distribution depends on the type of retailer: the physical retailer
is restricted by the store’s size, corresponding to a short tail. Online retailers, like
Amazon.com, expanded the number of products, generating a longer tail. Lastly, pure
digital retailers, like Rhapsody, which sells music online and works with no physical
goods, have extended the long tail even further. The emergence of the “98% rule” in the
retail sector has made software that works with many low-frequency items more relevant
and appealing. Market Basket Analysis is an active area of research in the field of
marketing.
4.1.1. Inference From Existing Works
All the existing techniques have their own advantages and disadvantages.
This section provides some of the drawbacks of the existing algorithms and the
techniques to overcome those difficulties.
Among the methods discussed for data mining, the Apriori Algorithm is found to be
better suited for association rule mining. Still, the Apriori Algorithm faces various
difficulties:
- It scans the database a number of times, and every scan generates additional
candidate itemsets to be searched. The database must therefore store a huge
number of candidates, which can exhaust memory, and the heavy I/O load makes
processing take a very long time, resulting in very low efficiency.
- When frequent itemsets are long, the number of candidate combinations, and
hence the computing time, increases significantly.
- When the algorithm hits such a bottleneck, it no longer produces good results;
it must then be improved or re-designed.
The above-mentioned drawbacks can be overcome by modifying the Apriori
Algorithm effectively. The high execution time of the Apriori Algorithm can be reduced
by using the Effective Apriori Algorithm, but this carries a risk of reduced accuracy in
the determined association rules. To overcome this, a novel technique is needed that
enables a better selection of association rules for Market Basket Analysis.
This chapter introduces a modified version of Apriori Algorithm for better results
of the Market Basket Analysis.
4.2. OVERVIEW OF MARKET BASKET ANALYSIS AND APRIORI
This section gives an overview of Market Basket Analysis and the Apriori
Algorithm. The proposed technique analyzes the process of discovering association
rules for big supermarkets using a two-step modified Apriori technique, which may be
described as follows.
A first pass of the modified Apriori Algorithm verifies the existence of
association rules in order to obtain a new repository of transactions that reflects the
observed rules. A second pass of the proposed Apriori mechanism aims at discovering
the rules that are really inter-associated.
4.2.1. Market Basket Analysis
Market Basket Analysis is a mathematical modeling approach depending upon
the theory that if a customer buys a certain group of items, the same customer
is then likely to buy another group of items [71].
It is used to examine the customer’s purchasing behavior and aids in
increasing the sales and maintain inventory by focusing on the point of sale
transaction data.
Apriori Algorithm trains and recognizes product baskets and product
association rules for a given dataset
The input for Market Basket Analysis is a dataset of purchases. A market
basket consists of items bought together in a single trip to the supermarket. The
transaction identification and item identification are the vital properties to be considered
[8]. The quantity bought and the price are not considered. Each transaction denotes a
purchase, which occurred at a particular time and place, and can be associated with an
identified or a non-identified customer.
The dataset with multiple transactions can be shown in a relational table
(transaction, item). Corresponding to each attribute there is a set called domain. The table
(transaction, item) is a set of all transactions 𝑇 = {𝑇1, 𝑇2, 𝑇3, … , 𝑇𝑛} where each
transaction contains a subset of items 𝑇𝑘 = {𝐼𝑎 , 𝐼𝑏 , 𝐼𝑐 … }.
To exemplify this problem, an instance with 5 items and 7 transactions is given in
table 4.1. The domain (item) is equal to {a, b, c, d, e} and the domain (transaction) is
equal to {1, 2, 3, 4, 5, 6, 7}.
Table 4.1 Sample Data to Illustrate Market Basket Analysis

Date    Customer    Transaction    Items
5-Jan   1001        1              b, c, d
5-Jan   1003        2              a, b
5-Jan   1004        3              a, c, e
7-Jan   1001        4              b, c, d, e
7-Jan   1005        5              a, c, d
7-Jan   1003        6              a, d
7-Jan   1006        7              b, c
Based on the attributes (transaction, item), the market basket is defined as
the T items that are bought together most frequently. After obtaining the market basket
with T items, cross-selling can be performed: the next step is to recognize all the
customers having bought T − m items of the basket and suggest the purchase of the m
missing items. For making marketing decisions, Market Basket Analysis is thus a vital
and significant tool supporting the implementation of cross-selling techniques. For
example, if a specific customer's buying profile fits an identified market basket, the
next item will be proposed.
The cross-selling policies lead to the recommender systems. A conventional
Recommender System is developed to recommend new items to frequent customers
depending on previous purchase patterns [123]. One of the first techniques to find the
market basket consists of using the Collaborative Filtering software developed by Net
Perception, which identifies a “soul mate” for each customer [7]. A customer’s “soul
mate” is a customer who has identical tastes and thus the same market basket. The
software is used in hundreds of companies. However, its disadvantage is that it only
compares two individuals, which does not give an overall view. For instance, suppose
customer A bought four books about Image Processing and a book about health tips,
and customer B bought the same four books about Image Processing. The software will
then suggest the book on health tips as the next item for customer B, leading to possible
mistakes.
Hence, Apriori-based algorithms were later adopted to obtain better results in
Market Basket Analysis.
4.2.2. Association Rules and Market Basket Analysis
- Market basket: the collection of items [95] purchased by a customer in a single
transaction (e.g., supermarket, web).
- Transaction: a set of items (itemset).
- Confidence: the measure of certainty or trustworthiness associated with each
discovered pattern.
- Support: the measure of how often the collection of items in an association
occurs together, as a percentage of all transactions.
- Frequent itemset: an itemset that satisfies minimum support.
- Strong association rules: rules that satisfy both a minimum support threshold
and a minimum confidence threshold.

In association rule mining, all frequent itemsets are identified first, and strong
association rules are then generated from the frequent itemsets. Association rule mining
finds all the rules present in the database that satisfy given minimum support and
minimum confidence constraints.
4.2.3. Apriori Approach
- The Apriori Algorithm is the most significant algorithm for mining frequent
itemsets.
- The basic principle of Apriori is: “Any subset of a frequent itemset must be
frequent”.
- Frequent itemsets are used to generate association rules.
- Association rules:
  - unsupervised learning;
  - used for pattern discovery;
  - each rule has the form A -> B, or Left -> Right.

For example: “70% of customers who purchase 2% milk will also purchase whole
wheat bread.”

Data mining using association rules is the process of looking for strong rules:
1. Find the large itemsets (i.e., the most frequent combinations of items); the most
frequently used algorithm for this step is the Apriori Algorithm.
2. Generate association rules from the above itemsets.
The strength of an association rule is measured using the following parameters:
1. Using support/confidence
2. Using dependence framework
Support
Support shows the frequency of the patterns in the rule; it is the percentage of
transactions that contain both A and B, i.e.
Support = Probability (A and B)
Support = (# of transactions involving A and B) / (total number of transactions).
Confidence
Confidence is defined as the strength of implication of a rule. It is the percentage
of transactions that contain B given that they contain A, i.e.,
Confidence = Probability (B if A) = P(B|A)
Confidence = (# of transactions involving A and B) / (total number of transactions
that have A).
Example:

Customer    Items purchased
1           pizza, beer
2           salad, soda
3           pizza, soda
4           salad, tea

If A is “purchased pizza” and B is “purchased soda”, then
Support = P(A and B) = 1/4
Confidence = P(B|A) = 1/2
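Both measures can be computed directly from the transaction list; a minimal Python sketch over the four example transactions:

```python
transactions = [
    {"pizza", "beer"},
    {"salad", "soda"},
    {"pizza", "soda"},
    {"salad", "tea"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item of itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Support of the union divided by the support of the antecedent."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

print(support({"pizza", "soda"}, transactions))       # 0.25, i.e. 1/4
print(confidence({"pizza"}, {"soda"}, transactions))  # 0.5, i.e. 1/2
```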
Confidence does not measure whether the association between A and B is random or not.
For instance, if milk occurs in 30% of all baskets, the information that milk occurs in
30% of all baskets containing bread is useless. However, if milk is present in 50% of all
baskets that contain coffee, that is significant information.
Support allows weeding out the most infrequent combinations, but sometimes it
should be ignored, for example when the transactions involved are valuable and generate
large revenue, or when the products repel each other.
Example:
P (Coke in a basket) = 50%
P (Pepsi in a basket) = 50%
P (Coke and Pepsi in a basket) = 0.001%
If Coke and Pepsi were independent, it would be expected that
P (Coke and Pepsi in a basket) = 0.5 * 0.5 = 0.25.
The fact that the joint probability is much smaller indicates that the products are
dependent and that they repel each other. In order to exploit this information, the
dependence framework is needed.
Dependence framework
Example. To continue the previous example,
Actual (Coke and Pepsi in a basket) = 0.001%
Expected (Coke and Pepsi in a basket) = 50% * 50% = 25%
If items are statistically dependent, the presence of one of the items in the basket
provides a lot of information about the other items. The threshold of statistical dependence
is determined using:
- Chi-square
- Impact
- Lift

Chi-square = (Expected Co-occurrence - Actual Co-occurrence) / Expected Co-occurrence

Pick a small alpha (e.g., 5% or 10%). The number of degrees of freedom equals the
number of items minus 1.
Example: Chi-square (Pepsi and Coke) = (25 - 0.001)/25 = 0.999
Degrees of freedom = 1
Alpha = 5%
From the tables, the critical chi-square value at this level is 3.84; the computed
statistic is compared against this threshold to decide whether Pepsi and Coke are
dependent.
Impact = Actual Co-occurrence / Expected Co-occurrence
Impact = 1 if the products are independent, and ≠ 1 if the products are dependent.
Example: Impact (Pepsi on Coke) = 0.001/25
Lift (A on B) = (Actual Co-occurrence - Expected Co-occurrence) / (Frequency of
occurrence of A)
-1 ≤ Lift ≤ 1
Lift is similar to correlation: it is 0 if A and B are independent, and approaches +1 or -1
if they are dependent. Positive values indicate attraction, and negative values indicate
repulsion.
Example: Lift (Coke on Pepsi) = (0.001 - 25)/50
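These two measures, as the chapter defines them, can be sketched as follows; note that this lift differs from the usual ratio-style lift found elsewhere in the literature:

```python
def impact(actual, expected):
    """Chapter's definition: 1 when independent, != 1 when dependent."""
    return actual / expected

def lift(actual, expected, freq_a):
    """Chapter's definition: 0 when independent, within [-1, 1]."""
    return (actual - expected) / freq_a

# Pepsi/Coke example, in percent units as in the text.
print(impact(0.001, 25))    # far below 1: strong repulsion
print(lift(0.001, 25, 50))  # about -0.5: strong repulsion
```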
Product Triangulation Strategy investigates cross-purchase tilts to answer such
questions. If the most significant tilt occurs when triangulating with respect to
promotion or pricing, the products are substitutes. Pepsi and Coke repel each other and
show no cross-purchase patterns.
The Apriori Algorithm, developed by Agrawal and Srikant [85] and Agrawal et al.
[86], takes all of the transactions in the database into consideration in order to define the
market basket. The technique was originally used in pattern recognition and later gained
popularity with the discovery of the following rule: “on Thursdays, grocery store
customers often purchase diapers and beer together” [74].
The association rules can be evaluated using two measures, namely the support
measure and the confidence measure. The Apriori Algorithm is deployed in many
commercial institutions.
The outputs of the Apriori Algorithm are easy to understand, and many new
patterns can be identified. However, the sheer number of association rules may make the
interpretation of the results difficult. A second weakness is the high computational time
when searching for large itemsets, due to the exponential complexity of the algorithm.
Hence, a novel and effective technique for Market Basket Analysis is very much
needed.
4.3. PROPOSED MODIFIED APRIORI APPROACH FOR MARKET BASKET
ANALYSIS
The proposed Market Basket Analysis applies a modified version of the well-
known Apriori data mining algorithm to guide the users in the selection of the frequent
items.
4.3.1. Modified Apriori Algorithm
There has been considerable research on discovering association rules among
the items of a transaction set. However, when defining user behavior patterns for services
such as a supermarket, an e-learning site, or simply any website, analyzing only the items
composing the transactions is not sufficient. For example, the behavior of a user on a
website cannot be measured well solely by the items the user buys (what); the way the
user buys these items (how) should also be considered, in order to differentiate that user
from other users or to group the user with other users of similar behavior. This leads a
step beyond simple discovery of the association rules generated by the users of the
service; that is to say, the relationship existing between the whole set of users and each
one of the individuals has to be analyzed.
To perform this analysis the following steps are carried out,
i. Discovering existing association rules in the original repository of
transactions.
ii. Rewriting the original transactions to reflect all the association rules that
verify each one of them.
iii. Discovering and using the existing relations between rules discovered in
step 1 in order to automatically divide the set of transaction rules obtained
in step 2.
The behavior of a user at a supermarket can thus be characterized by the data
obtained from the manner in which the user acts in the service, i.e., based on the
particular rules verified by the user even as the user is interacting with the service.
i. Discovering existing association rules in the original repository of transactions
Initially, the original algorithm [91] is applied to discover the association rules
between the items generated by the users. The original pseudocode by Agrawal and
Srikant [91] is presented in Fig 4.1.
Fig 4.1: The original Apriori Algorithm

(1) L1 = {large 1-itemsets};
(2) for (k = 2; Lk-1 ≠ ∅; k++) do begin
(3)   Ck = apriori-gen(Lk-1);
(4)   forall transactions t ∈ D do begin
(5)     Ct = subset(Ck, t); // candidates contained in t
(6)     forall candidates c ∈ Ct do
(7)       c.count++;
(8)   end
(9)   Lk = {c ∈ Ck | c.count ≥ minsup};
(10) end
(11) Answer = ∪k Lk;

In the original algorithm, apriori-gen is a function made up of two phases, namely
union and pruning. In the union phase (as in Fig 4.2), all candidate k-itemsets are
generated.
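The level-wise loop of Fig 4.1 can be sketched in Python (a minimal sketch with absolute support counts; the concrete data structures and names are illustrative assumptions):

```python
from itertools import combinations

def apriori_gen(Lk_1):
    """Union phase: join (k-1)-itemsets into k-itemset candidates;
    pruning phase: drop candidates with an infrequent (k-1)-subset."""
    k = len(next(iter(Lk_1))) + 1
    cands = {p | q for p in Lk_1 for q in Lk_1 if len(p | q) == k}
    return {c for c in cands
            if all(frozenset(s) in Lk_1 for s in combinations(c, k - 1))}

def apriori(transactions, minsup):
    """Level-wise frequent-itemset mining; minsup is an absolute count."""
    items = {i for t in transactions for i in t}
    L = [{frozenset([i]) for i in items
          if sum(i in t for t in transactions) >= minsup}]   # L1
    while L[-1]:
        C = apriori_gen(L[-1])
        counts = {c: 0 for c in C}
        for t in transactions:          # one database scan per level
            for c in C:
                if c <= t:
                    counts[c] += 1
        L.append({c for c, n in counts.items() if n >= minsup})
    return set().union(*L)              # Answer = union of all Lk
```

For instance, with five transactions over items {a, b, c} and minsup = 3, the result contains the three singletons and the three pairs, but not {a, b, c} when the triple occurs only twice.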
Fig 4.2: Union Phase of the original Apriori Algorithm

insert into Ck
select p.item1, ..., p.itemk-1, q.itemk-1
from Lk-1 p, Lk-1 q
where p.item1 = q.item1, ..., p.itemk-2 = q.itemk-2,
      p.itemk-1 < q.itemk-1;

In the pruning phase (as in Fig 4.3), which gives the algorithm its name, all
candidates generated in the union phase that contain some non-frequent (k-1)-itemset
are removed.

Fig 4.3: Pruning Phase of the original Apriori Algorithm

forall itemsets c ∈ Ck
  forall (k-1)-subsets s of c do
    if s ∉ Lk-1 then
      delete c from Ck;

In Fig 4.4, the new implementation of the union and pruning phases for the
Apriori Algorithm is given. By joining the union and pruning phases in the same
function, many insert and delete operations on the dynamic vector Ck are saved; by
relaxing the pruning, many search operations in the tree L of frequent k-itemsets are
saved.
Fig 4.4: Union and pruning phases of the Modified Apriori Algorithm

insert into Ck
select c = {p.item1, ..., p.itemk-1, q.itemk-1}
from Lk-1 p, Lk-1 q
where (p.item1 = q.item1, ..., p.itemk-2 = q.itemk-2,
       p.itemk-1 < q.itemk-1)
and (p.itemk-1, q.itemk-1) ∈ L2

Another important adjustment to the original Apriori Algorithm is the extraction
of the existing rules (function genrules) from the repository of transactions. The original
Apriori genrules is presented in Fig 4.5.
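The combined union-and-pruning of Fig 4.4 replaces the full subset check with a single membership test of the differing item pair in L2; a Python sketch (the itemset representation is an assumption):

```python
def modified_gen(Lk_1, L2):
    """Join (k-1)-itemsets that share a common prefix and keep a
    candidate only if its two differing items form a frequent pair
    in L2 (the relaxed pruning of Fig 4.4)."""
    cands = set()
    prefixes = [tuple(sorted(s)) for s in Lk_1]
    for p in prefixes:
        for q in prefixes:
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                if frozenset((p[-1], q[-1])) in L2:
                    cands.add(frozenset(p) | {q[-1]})
    return cands
```

Checking only the pair (p.itemk-1, q.itemk-1) against L2 avoids generating a candidate and then deleting it again, which is where the insert and delete operations are saved.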
Fig 4.5: The original Apriori Algorithm genrules

// Simple algorithm
forall large itemsets lk, k ≥ 2 do
  call genrules(lk, lk);

// genrules generates all valid rules a ⇒ (lk - a), for all a ⊂ am
procedure genrules(lk: large k-itemset, am: large m-itemset)
(1) A = {(m-1)-itemsets am-1 | am-1 ⊂ am};
(2) forall am-1 ∈ A do begin
(3)   conf = support(lk)/support(am-1);
(4)   if (conf ≥ minconf) then begin
(7)     output the rule am-1 ⇒ (lk - am-1), with confidence = conf and support = support(lk);
(8)     if (m - 1 > 1) then
(9)       call genrules(lk, am-1); // to generate rules with subsets of am-1 as the antecedents
(10)  end
(11) end

To denote a rule, the following notation is used:
Ri = ai ⇒ ci
where Ri is the i-th rule, ai is the antecedent itemset and ci is the consequent itemset. By
lki = ai ∪ ci
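The recursion of Fig 4.5 can be sketched in Python (a sketch only; `support` is assumed to be a precomputed map from frozensets to support values). Exactly as discussed further on, this naive version revisits shared subsets:

```python
from itertools import combinations

def gen_rules(lk, am, support, minconf, out):
    """Emit each rule a(m-1) => (lk - a(m-1)) whose confidence clears
    minconf, recursing on smaller antecedents (sketch of Fig 4.5)."""
    for sub in combinations(sorted(am), len(am) - 1):
        a = frozenset(sub)
        conf = support[lk] / support[a]
        if conf >= minconf:
            out.append((a, lk - a, conf))
            if len(a) > 1:
                gen_rules(lk, a, support, minconf, out)
```

For an itemset {a, b, c} with support 2, pair supports 4 and singleton supports 6, minconf = 0.5 yields exactly the three rules with two-item antecedents.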
we denote the k-itemset that defines rule Ri. genrules takes two k-itemsets as
parameters: the first one is always lki, and the second is a subset of lki from which the
antecedents of the rules derived from lki are extracted. Each time genrules is called, the
confidence of a rule am ⇒ ci = lk - am is computed, so two calls to the support function
are needed. This function has to traverse L down to the level determined by the
number of items of the k-itemset received: k + (k - m) searches.
This characteristic makes the original algorithm inefficient because of the way
the repository L of frequent k-itemsets is stored: the items that compose lk must be
searched from the root of the repository. Therefore, the support of lk is added as a
parameter to genrules to obtain a more efficient implementation, which eliminates many
calls to the support function (only one call is performed before processing each
k-itemset of L, and in the multiple recursive calls the function is not called again for
this parameter).
The creators of the Apriori Algorithm relied on the following statement:
if a ⇒ (l - a) does not hold, then neither does ã ⇒ (l - ã) for any ã ⊂ a. By
rewriting, it follows that for a rule (l - c) ⇒ c to hold, all rules of the form (l - c̃) ⇒
c̃ must also hold, where c̃ is a non-empty subset of c. For example, if the rule AB ⇒
CD holds, then the rules ABC ⇒ D and ABD ⇒ C must also hold.
After that, a second method is proposed, which is quicker under certain
circumstances, namely when minconf is large: if it is detected that conf(ABC ⇒ D) <
minconf, it is assumed unnecessary to check the rules AB ⇒ CD, AC ⇒ BD,
BC ⇒ AD, A ⇒ BCD, B ⇒ ACD and C ⇒ ABD. When rules without high confidence
are also to be detected, however, this becomes a disadvantage that imposes an additional
verification and slows down the execution of the algorithm.
Moreover, the performance of the algorithm is improved by analyzing the
characteristics of the rules generated by it.
Consider the following example:
Let the k-itemset be Ik = {1,2,3,4,5,6}.
genrules(Ik, Ik) will recursively call genrules(Ik, {1,2,3,4,5}), and this will
recursively call genrules(Ik, {1,2,3,4}).
Later, the initial call will invoke genrules(Ik, {1,2,3,4,6}), and this will make
another recursive call to genrules(Ik, {1,2,3,4}).
genrules(Ik, {1,2,3,4}) is thus called twice, and duplicates of all the calls
performed with the subsets of {1,2,3,4} as the second parameter are obtained.
This generates a great number of repeated searches and operations, which is of order
exponential in k.
In the proposed implementation, this problem is overcome by the use of a static
vector in which the already analyzed subsets of Ik (passed as the second parameter) are
stored. In order to apply these properties, the algorithm has been redefined as depicted
in Fig 4.6.
Fig 4.6: Modified Apriori genrules

forall large itemsets lk, k ≥ 2 do
  supp_l_k = support(lk);
  call genrules(lk, lk, supp_l_k);

// genrules generates all valid rules a ⇒ (lk - a), for all a ⊂ am
procedure genrules(lk: large k-itemset, am: large m-itemset, supp_l_k: double)
(×) if (m == k)
(×)   a_m_processed.clear(); // initialize a_m_processed for each new lk
(1) A = {(m-1)-itemsets am-1 | am-1 ⊂ am};
(2) forall am-1 ∈ A do begin
(×)   if am-1 ∈ a_m_processed then
(×)     continue;
(×)   a_m_processed.add(am-1);
(3)   conf = supp_l_k/support(am-1);
(4)   if (conf ≥ minconf) then begin
(7)     output the rule am-1 ⇒ (lk - am-1), with confidence = conf and support = supp_l_k;
(8)     if (m - 1 > 1) then
(9)       call genrules(lk, am-1, supp_l_k); // to generate rules with subsets of am-1 as the antecedents
(10)  end
(11) end
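Fig 4.6's duplicate avoidance can be sketched by keeping a processed set per top-level itemset, mirroring the role of the a_m_processed vector (a sketch; the data structures are illustrative assumptions):

```python
from itertools import combinations

def gen_rules_mod(lk, supp_lk, support, minconf):
    """Like Fig 4.5's genrules, but a `processed` set (playing the role
    of a_m_processed) skips antecedents already handled, and supp(lk)
    is passed in once instead of being recomputed on every call."""
    out, processed = [], set()

    def rec(am):
        for sub in combinations(sorted(am), len(am) - 1):
            a = frozenset(sub)
            if a in processed:      # duplicate subtree: skip it
                continue
            processed.add(a)
            conf = supp_lk / support[a]
            if conf >= minconf:
                out.append((a, lk - a, conf))
                if len(a) > 1:
                    rec(a)

    rec(lk)
    return out
```

With minconf = 0, every proper non-empty subset of lk appears exactly once as an antecedent, instead of being revisited exponentially often.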
This modified algorithm does not check any extra rule when minconf is set to 0%.
The execution time of the Modified Apriori Algorithm is not excessive at all, so it could
be executed in real time. When applied to the transactions performed by the same user,
or by a group of users with homogeneous behavior in a supermarket, the modified
algorithm achieves even lower execution times.
ii. Rewriting transactions to return association rules verified by each transaction
It is more appropriate to study the behavior of a customer according to the rules
the customer “verifies” rather than to the items themselves. Suppose a customer enters
the supermarket on Mondays and buys certain items A, B and C and next a series of items
T1. The customer then buys items A, G and H on Fridays and next another series of items
T2. Also suppose that the items of T1 and T2 are not frequently presented with item A.
As this approach is not considering temporal variables, it cannot detect if it is Monday or
Friday. Thus, when a user enters a supermarket and requests for item A, probably items B
or G will be recommended in the conventional technique.
If the user then buys item G, the method will suggest item H as the most probable
option, next some items of T2, and finally another link to item B with a lower
probability of being bought by the user. Therefore, the fact of buying item A is not as
significant as the possibility of using the behavioral rules already stored (the existing
relations between rules containing the k-itemsets AGH and T2). The key idea is the
capacity to analyze the rules independently of the number of items they contain,
providing a means to obtain the relations between two rules composed of different
items while using only the transactions supplied by the users.
If the rules verified by each original transaction of D are studied, a set of rules (R)
that contains one line for each verified rule of each transaction of D is defined. The goal
of this conversion D-->R is to study the group of rules containing R again by means of
Apriori Algorithm.
At the beginning, this task is simply exploratory, as a lot of time is needed to
convert huge repositories of data into the file R, and it should be done for different
support and confidence thresholds. Rule repositories tend to be bigger than transaction
repositories. Indeed, if a transaction verifies a rule that consists of a given k-itemset lk,
it will then verify all the rules of the k-itemsets contained in lk. For example, if the
transaction ABCD generates one rule, it can then generate the fifty rules inferred by its
subsets:
- 14 rules with 4 items: ABC => D, ABD => C, ..., D => ABC;
- 24 rules with 3 items: AB => C, ..., CD => B;
- 12 rules with 2 items: A => B, ..., D => C.
If a low confidence threshold is used, one transaction of D with three items will
define 12 rules in R, one transaction of D with four items will define 50 rules in R, and
R will thus grow exponentially with the length of the transactions of D.
At this point, the impressive size of R motivates an exploratory study of the
differences between the two repositories, in order to propose a reduction of the data
contained in R without losing any of the information contained in it.
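The rule counts above (14 + 24 + 12 = 50 for a four-item transaction, 12 for a three-item one) can be verified by enumeration:

```python
from itertools import combinations

def rules_from_transaction(items):
    """Enumerate every rule X => Y where X and Y are disjoint non-empty
    subsets of some itemset contained in the transaction."""
    rules = set()
    for k in range(2, len(items) + 1):
        for itemset in combinations(items, k):
            s = frozenset(itemset)
            for r in range(1, k):
                for ante in combinations(itemset, r):
                    a = frozenset(ante)
                    rules.add((a, s - a))
    return rules

print(len(rules_from_transaction("ABCD")))  # 50
print(len(rules_from_transaction("ABC")))   # 12
```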
iii. Automatic division of the repository of rules
Many techniques exist to partition the repository of original data. However, all of
them depend on tight control and classification of the set of items available to the
system. Here, a set of families of rules is defined to partition the repository of rules and
to group the rules that link users interacting with the system. The proposed algorithm is
explained below.
(1) Select the rule with the greatest support; in case of a tie, select the rule with the
highest confidence, and in case of a further tie, select the rule that contains more items:
R1 | supp(R1) = maxi{supp(Ri)} ∧ conf(R1) = maxj{conf(Rj) | supp(Rj) = supp(R1)}
This rule will be the principal element of the first family of rules, ℱ1.
(2) Divide ℛ into subsets of transactions: the first repository ℛ1 will contain all
transactions that verify rule R1, and the second repository ℛ∞ will contain the remaining
transactions.
(3) Run the Apriori Algorithm on ℛ1 and apply step 1 again to select R2, the rule with
the highest frequency jointly with R1.
(4) Check the support of R2 in ℛ∞: if the support of R2 in ℛ1 is greater than its support
in ℛ∞, then add R2 to family ℱ1; otherwise, remove R2 from ℛ1.
(5) Return to step 3 while there are rules still unclassified in ℛ1.
When the definition of the first family of rules has been completed, remove all the
rules belonging to ℱ1, and all transactions left empty, from the rule repository ℛ. Next,
the second family of rules is constructed in the same manner. The algorithm finishes
when ℛ∞ has no rules associated with the other rules, i.e., when all the transactions of
ℛ∞ contain only one rule.
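A much-simplified sketch of steps (1)-(5) follows; the tie-breaking by confidence and the repeated Apriori run are collapsed into frequency counts over rule ids, and all data structures are illustrative assumptions:

```python
def build_families(rule_rows):
    """Greedy partition of a rule repository into families.
    rule_rows: one set of rule ids per transaction."""
    rows = [set(r) for r in rule_rows]
    families = []
    while any(len(r) > 1 for r in rows):
        # Step 1: seed with the most supported rule (ties not refined here).
        support = {}
        for r in rows:
            for rule in r:
                support[rule] = support.get(rule, 0) + 1
        seed = max(support, key=support.get)
        family = {seed}
        inside = [r for r in rows if seed in r]       # repository R1
        outside = [r for r in rows if seed not in r]  # remaining repository
        # Steps 3-4: keep rules that are more frequent inside than outside.
        for rule in {x for r in inside for x in r} - family:
            s_in = sum(rule in r for r in inside) / max(len(inside), 1)
            s_out = sum(rule in r for r in outside) / max(len(outside), 1)
            if s_in > s_out:
                family.add(rule)
        families.append(family)
        # Step 5: drop classified rules and empty transactions.
        rows = [r - family for r in rows if r - family]
    return families
```

Rules that co-occur in the same transactions end up grouped into one family, while isolated rules terminate the loop.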
iv. The proposed Market Basket Analysis System
Finally, an effective system has been designed for Market Basket Analysis with a two-step modified Apriori Algorithm. The main goal of this approach is to obtain a better Market Basket Analysis system that assists the user in purchasing items. The proposed approach can be described as follows:
(1) A user enters the supermarket and selects item 𝐴.
(2) The system can use only the details collected about the items themselves, specifically the original rules with 𝐴 as antecedent; this is the classical suggestion using association rules. As a result, the first recommendation depends only on frequency (the system will suggest only the consequents of the highest-confidence rules that have item 𝐴 as antecedent).
(3) The user selects a second item 𝐵.
(4) The rules are derived from the 𝑘-itemset 𝐼𝑘, and the family ℱ𝑖 to which it belongs is searched for.
If ℱ𝑖 contains rules derived from the rules already satisfied by the user in the current visit, there is then 100% confidence between the discovered rules and the rules already verified by the user. In this case, the system suggests the items of the discovered rules with the greatest support. Unlike many classical methods, rules are discovered here that do not have the chosen items as antecedents, which is highly significant because the analysis entirely ignores the order in which the items were selected.
If ℱ𝑖 does not contain such rules, the rules with the highest confidence (and highest support in the case of a tie) in relation to the rules verified by the user are used. Additionally, in order to solve the classical problem of generating suggestions based only on the antecedent, it is possible to discover rules containing any item already selected by the user, something that cannot be accomplished with the classical approach.
(5) The user chooses a third item 𝐶.
(6) Step 4 is then repeated, but now looking for one or more families based on the rules derived from the 𝑘-itemset 𝐼𝑘 = {𝐴, 𝐵, 𝐶}.
This technique provides a new experience for customers within the supermarket, where they should feel more comfortable purchasing their items. Using the proposed system, they are able to buy at least one additional item more directly.
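The recommendation flow of steps 1-6 can be sketched as follows. The `Rule` fields and the `families` structure are assumptions for illustration; the family branch suggests by support, while the fallback branch is the classical antecedent-based suggestion by confidence:

```python
from dataclasses import dataclass

# Hypothetical rule representation; field names are illustrative assumptions.
@dataclass(frozen=True)
class Rule:
    antecedent: frozenset
    consequent: frozenset
    support: float
    confidence: float

def recommend(basket, families, all_rules):
    """Suggest items for the current basket (steps 2, 4 and 6 above)."""
    basket = frozenset(basket)
    # Prefer a family with rules already satisfied by the user's visit:
    # every item of the rule (antecedent and consequent) is in the basket.
    for family in families:
        matched = [r for r in family if (r.antecedent | r.consequent) <= basket]
        if matched:
            # Suggest items of the family's remaining rules, by support.
            rest = sorted(family - set(matched), key=lambda r: -r.support)
            return [i for r in rest
                    for i in (r.antecedent | r.consequent) - basket]
    # Fallback (classical suggestion): consequents of the highest-confidence
    # rules whose antecedent intersects the items already chosen.
    matching = [r for r in all_rules if r.antecedent & basket]
    matching.sort(key=lambda r: (-r.confidence, -r.support))
    return [i for r in matching for i in r.consequent - basket]
```

Note that the family branch can suggest items even when none of the chosen items appears as an antecedent, which is the property the text highlights as the main advantage over the classical approach.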
4.4. DISCUSSION
In the Apriori Algorithm, the core function is made up of two phases, namely union (join) and pruning. In the union phase, all candidate k-itemsets are generated. The pruning phase gives the algorithm its name: all candidates generated in the union phase that contain some non-frequent (k-1)-itemset are removed. The algorithm scans the database a number of times, and each scan generates additional candidates, creating additional search work for the database. The database must therefore hold a huge amount of data, which can exhaust memory and lead to very long processing times and, as a result, very low accuracy.
In the modified algorithm, a new implementation of the union and pruning phases of the Apriori Algorithm is given. By combining the union and pruning phases in the same function, many insert and delete operations on the dynamic vector Ck are saved. Also, by relaxing the pruning, many search operations in the tree L of frequent k-itemsets are saved.
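A minimal sketch of merging the two phases, assuming itemsets are stored as sorted tuples: each candidate produced by the join is pruned immediately, instead of being inserted into Ck and deleted in a later pass. The function name is an assumption, not the thesis's actual implementation.

```python
from itertools import combinations

def gen_candidates(frequent_k_minus_1):
    """Join frequent (k-1)-itemsets sharing a prefix and prune in one pass."""
    prev = sorted(frequent_k_minus_1)   # sorted tuples of sorted items
    prev_set = set(prev)
    candidates = []
    for i, a in enumerate(prev):
        for b in prev[i + 1:]:
            if a[:-1] != b[:-1]:        # join condition: common (k-2)-prefix
                break                   # sorted order keeps prefixes contiguous
            cand = a + (b[-1],)
            # Prune immediately instead of in a second pass over Ck:
            # keep cand only if every (k-1)-subset is frequent.
            if all(sub in prev_set
                   for sub in combinations(cand, len(cand) - 1)):
                candidates.append(cand)
    return candidates
```

For example, from the frequent 2-itemsets (A,B), (A,C), (B,C) the single candidate (A,B,C) survives, while it would be pruned on the spot if (B,C) were missing from the frequent set.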
The modified algorithm does not check any extra rule when minconf is set to 0%. As the execution time of the Modified Apriori Algorithm is not excessive, it can be executed in real time. When applied to the transactions performed by the same user, or by a group of users with homogeneous behavior in a supermarket, the modified algorithm achieves even lower execution times.
4.5. SUMMARY
In this chapter, a new method for Market Basket Analysis based on the modified Apriori Algorithm has been introduced. It is a modified version of the well-known Apriori data mining algorithm that guides users towards the selection of the best combination of items in a supermarket. The approach mainly analyzes the process of discovering association rules in large repositories of this kind.
Most association rule mining algorithms suffer from excessive execution time and from generating too many association rules. Although the conventional Apriori Algorithm can identify meaningful itemsets and construct association rules, it has the disadvantage of generating numerous candidate itemsets that must be repeatedly compared against the entire database. The conventional algorithm also consumes a large amount of memory. The proposed approach is therefore significant for effective Market Basket Analysis: it helps customers purchase their items with more comfort, which in turn increases the sales of the markets.