
CHAPTER IV

MODIFIED APRIORI ALGORITHM

4.1. INTRODUCTION

In everyday life, information is collected almost everywhere. For example, at supermarket checkouts, information about customer purchases is recorded. When payback or discount cards are used, information about customer purchasing behavior can be linked to personal details. Evaluation of this information can help retailers devise more efficient and customized marketing strategies.

The majority of well-established organizations have accumulated masses of information from their customers for decades. With e-commerce applications [115] growing quickly, organizations now gather vast quantities of data in months rather than years. Data Mining, also called Knowledge Discovery in Databases, aims to determine the trends, patterns, correlations and anomalies in these databases that can assist in making precise future decisions.

Manual analysis of the huge amounts of information stored in modern databases is very difficult. Data mining provides tools to reveal unknown information already stored in large databases. A well-known data mining technique is Association Rule Mining [87]. It is able to discover all the interesting relationships, called associations, in a database. Association rules are very efficient in revealing all the interesting relationships in a relatively large database with a huge amount of data [102]. The large quantity of information collected through the set of association rules can be used not only for describing the relationships in the database, but also for differentiating between different kinds of classes in a database.

Association rule mining identifies remarkable associations or relationships among large sets of data items [77]. With a huge quantity of data constantly being collected and stored in databases, many industries are becoming interested in mining association rules from their databases. For example, the detection of interesting association relationships among large quantities of business transaction data can assist in catalog design, cross-marketing, loss leader analysis, and various business decision-


making processes. A typical example of association rule mining is Market Basket Analysis. This method examines customer buying patterns by identifying associations among the various items that customers place in their shopping baskets. The identification of such associations can assist retailers in developing marketing strategies by gaining insight into which items are frequently purchased together by customers. It helps to examine the purchasing behavior of customers, and assists in increasing sales and conserving inventory by focusing on point-of-sale transaction data. Market Basket Analysis therefore remains a broad area for researchers aiming to develop better data mining algorithms.

Mining association rules, also known as Market Basket Analysis, is one of the most important application fields of Data Mining [54, 83]. Given a set of customer transactions on items, the main intention is to determine correlations among the sales of items. Consider a market with a large collection of customer transactions. An association rule has the form X ⇒ Y, where X is referred to as the antecedent and Y as the consequent. X and Y are sets of items, and the rule states that customers who purchase X are likely to purchase Y with a probability of c%, where c is known as the confidence. Such a rule may be: "Eighty percent of people who purchase cigarettes also purchase matches". Such rules help to answer questions of the kind "What is Coca Cola sold with?"; if the user wants to check the dependency between two items A and B, it is required to determine the rules that have A in the consequent and B in the antecedent.

In [17], the author illustrates the reasons for considering that the future of retail businesses involves selling smaller quantities of more products. Anderson summarizes this concept as the "98% rule", contrasting with the well-known "80/20 rule". The "98% rule" means that, in the statistical distribution of the products, only 2% of the items are very frequent and 98% of the items have a very low frequency of purchase, creating a long-tail distribution. The long-tail shape appears in the new markets due to three factors, namely:

- Democratization of the tools of production,
- Democratization of the tools of distribution,
- The connection between supply and demand based on online networks.


The long-tail distribution depends on the type of retailer: the physical retailer is restricted by the store's size, corresponding to a short tail. Online retailers, like Amazon.com, expanded the number of products, generating a longer tail. Lastly, pure digital retailers, like Rhapsody, which sells music online and works with no physical goods, have extended the long tail even further. The emergence of the "98% rule" in the retail sector has made software that works with many low-frequency items more relevant and appealing. Market Basket Analysis is an active area of research in the field of marketing.

4.1.1. Inference From Existing Works

All the existing techniques have their own advantages and disadvantages. This section summarizes some of the drawbacks of the existing algorithms and the techniques to overcome those difficulties.

Among the methods discussed for data mining, the Apriori Algorithm is found to be better suited for association rule mining. Still, the Apriori Algorithm faces various difficulties:

- It scans the database a number of times, and every scan creates additional candidates to be checked. This creates additional search work for the database, which must therefore store a huge number of candidate itemsets. The result is a lack of memory for the additional data, a heavy I/O load and a very long processing time, i.e. very low efficiency.
- When the frequent itemsets become long, the computing time increases significantly.
- The algorithm thus faces a bottleneck; in such situations it does not produce good results, and it becomes necessary to improve or re-design the algorithm.

The above-mentioned drawbacks can be overcome by modifying the Apriori Algorithm effectively. The execution-time complexity of the Apriori Algorithm can be addressed by the Effective Apriori Algorithm; however, this has the possibility of leading to a lack of accuracy in determining the association rules. To overcome this, a novel


technique is needed, which will help in a better selection of association rules for Market Basket Analysis.

This chapter introduces a modified version of the Apriori Algorithm for better results in Market Basket Analysis.

4.2. OVERVIEW OF MARKET BASKET ANALYSIS AND APRIORI

This chapter introduces a modified version of the Apriori Algorithm for effective Market Basket Analysis. The technique mainly analyzes the process of discovering association rules for big supermarkets by a two-step modified Apriori technique, which may be described as follows. A first pass of the modified Apriori Algorithm verifies the existence of association rules in order to obtain a new repository of transactions that reflects the observed rules. A second pass of the proposed Apriori mechanism aims at discovering the rules that are really inter-associated.

4.2.1. Market Basket Analysis

- Market Basket Analysis is a mathematical modeling approach based on the theory that if a customer buys a certain group of items, the same customer is then likely to buy another group of items [71].
- It is used to examine the customer's purchasing behavior, and aids in increasing sales and maintaining inventory by focusing on point-of-sale transaction data.
- The Apriori Algorithm learns and recognizes product baskets and product association rules for a given dataset.

The input for Market Basket Analysis is a dataset of purchases. A market basket consists of the items bought together in a single trip to the supermarket. The transaction identifier and the item identifier are the vital attributes to be considered [8]; the quantity bought and the price are not considered. Each transaction denotes a purchase that occurred at a particular time and place, and can be associated with an identified or a non-identified customer.


The dataset with multiple transactions can be represented in a relational table (transaction, item). Corresponding to each attribute there is a set called its domain. The table (transaction, item) is a set of all transactions T = {T1, T2, T3, ..., Tn}, where each transaction contains a subset of items Tk = {Ia, Ib, Ic, ...}. To exemplify this problem, an instance with 5 items and 7 transactions is given in Table 4.1. The domain(item) is equal to {a, b, c, d, e} and the domain(transaction) is equal to {1, 2, 3, 4, 5, 6, 7}.

Table 4.1: Sample data to illustrate Market Basket Analysis

Date    Customer    Transaction    Items
5-Jan   1001        1              b, c, d
5-Jan   1003        2              a, b
5-Jan   1004        3              a, c, e
7-Jan   1001        4              b, c, d, e
7-Jan   1005        5              a, c, d
7-Jan   1003        6              a, d
7-Jan   1006        7              b, c
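For illustration, the transactions of Table 4.1 can be written down directly as a small data structure. The following Python sketch (illustrative only; the variable names are not from the thesis) stores each transaction as a set of items and counts how often each individual item occurs, which is the raw material for the support computations discussed below.

    from collections import Counter

    # Transactions of Table 4.1, keyed by transaction id
    transactions = {
        1: {"b", "c", "d"},
        2: {"a", "b"},
        3: {"a", "c", "e"},
        4: {"b", "c", "d", "e"},
        5: {"a", "c", "d"},
        6: {"a", "d"},
        7: {"b", "c"},
    }

    # Number of transactions in which each item appears (its absolute support)
    item_counts = Counter(item for items in transactions.values() for item in items)
    print(item_counts)   # item 'c', for example, appears in 5 of the 7 transactions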

Based on the attributes (transaction, item), the market basket is defined as the T items that are most frequently bought together. After obtaining the market basket with T items, cross-selling can be performed: the next step is to recognize all the customers who have bought T−m items of the basket and to suggest the purchase of the m missing items. In order to make marketing decisions, Market Basket Analysis is a vital and significant tool supporting the implementation of cross-selling techniques. For example, if a specific customer's buying profile fits an identified market basket, the next item of that basket will be proposed. Cross-selling policies lead to recommender systems. A conventional Recommender System is developed to recommend new items to frequent customers


depending on their previous purchase patterns [123]. One of the first techniques to find the market basket consists of using the Collaborative Filtering software developed by Net Perception, which identifies a "soul mate" for each customer [7]. A customer's "soul mate" is a customer who has identical tastes and thus the same market basket. The software is used in hundreds of companies, but its disadvantage is that it only compares two individuals, which does not give an overall view. For instance, customer A bought four books about Image Processing and a book about health tips, and customer B bought the same four books about Image Processing. The software will therefore suggest the book on health tips as the next item for customer B, leading to possible mistakes. Hence, Apriori-based algorithms were later used to obtain better results in Market Basket Analysis.

4.2.2. Association Rules and Market Basket Analysis

A market basket is the collection of items [95] purchased by a customer in a single transaction (e.g. supermarket, web).

- Transaction: a set of items (itemset).
- Confidence: the measure of certainty or trustworthiness associated with each discovered pattern.
- Support: the measure of how often the collection of items in an association occurs together, as a percentage of all transactions.
- Frequent itemset: an itemset that satisfies minimum support.
- Strong association rules: rules that satisfy both a minimum support threshold and a minimum confidence threshold.

In association rule mining, all frequent itemsets are identified, and strong association rules are then generated from the frequent itemsets. Association rule mining finds all the rules present in the database that satisfy given minimum support and minimum confidence constraints.


4.2.3. Apriori Approach

- The Apriori Algorithm is the most significant algorithm for frequent itemset mining.
- The basic principle of Apriori is: "Any subset of a frequent itemset must be frequent".
- Frequent itemsets are used to generate association rules.

Association rules:
- Unsupervised learning
- Used for pattern discovery
- Each rule has the form A -> B, or Left -> Right

For example: "70% of customers who purchase 2% milk will also purchase whole wheat bread."

Data mining using association rules is the process of looking for strong rules:
1. Find the large itemsets (i.e. the most frequent combinations of items).
2. The most frequently used algorithm for this step is the Apriori Algorithm.
3. Generate association rules for the above itemsets.

The strength of an association rule is measured using one of the following frameworks:
1. Support/confidence
2. The dependence framework

Support

Support shows the frequency of the patterns in the rule; it is the percentage of transactions that contain both A and B, i.e.

Support = Probability(A and B)
Support = (number of transactions involving both A and B) / (total number of transactions).


Confidence

Confidence is defined as the strength of implication of a rule. It is the percentage of transactions that contain B given that they contain A, i.e.

Confidence = Probability(B if A) = P(B | A)
Confidence = (number of transactions involving both A and B) / (total number of transactions that contain A).

Example:

Customer    Item purchased    Item purchased
1           pizza             beer
2           salad             soda
3           pizza             soda
4           salad             tea

If A is "purchased pizza" and B is "purchased soda", then
Support = P(A and B) = 1/4
Confidence = P(B | A) = 1/2
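These two measures can be computed directly from the transaction list. The short Python sketch below reproduces the values of the example above (Support = 1/4, Confidence = 1/2); it is an illustration only, and the function names are not taken from the thesis.

    # Each transaction is the set of items bought by one customer
    baskets = [
        {"pizza", "beer"},
        {"salad", "soda"},
        {"pizza", "soda"},
        {"salad", "tea"},
    ]

    def support(baskets, items):
        # Fraction of all transactions containing every item in `items`
        items = set(items)
        return sum(1 for b in baskets if items <= b) / len(baskets)

    def confidence(baskets, antecedent, consequent):
        # Support of (antecedent and consequent) divided by support of the antecedent
        return support(baskets, set(antecedent) | set(consequent)) / support(baskets, antecedent)

    print(support(baskets, {"pizza", "soda"}))        # 0.25 = 1/4
    print(confidence(baskets, {"pizza"}, {"soda"}))   # 0.5  = 1/2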

Confidence does not measure whether the association between A and B is random or not. For instance, if milk occurs in 30% of all baskets, the information that milk occurs in 30% of the baskets containing bread is useless. However, if milk is present in 50% of all baskets that contain coffee, that is significant information. Support allows weeding out most infrequent combinations, but sometimes it should be ignored, for example if the transaction is valuable and generates a large revenue, or if the products repel each other.


Example:

P(Coke in a basket) = 50%
P(Pepsi in a basket) = 50%
P(Coke and Pepsi in a basket) = 0.001%

If Coke and Pepsi were independent, it would be expected that P(Coke and Pepsi in a basket) = 0.5 × 0.5 = 0.25. The fact that the joint probability is much smaller indicates that the products are dependent and that they repel each other. In order to exploit this information, the dependence framework is needed.

Dependence framework

Example (continuing the previous example):

Actual(Coke and Pepsi in a basket) = 0.001%
Expected(Coke and Pepsi in a basket) = 50% × 50% = 25%

If items are statistically dependent, the presence of one of the items in the basket provides a lot of information about the other items. The threshold of statistical dependence is determined using:

- Chi-square
- Impact
- Lift

Chi-square = (Expected co-occurrence − Actual co-occurrence) / Expected co-occurrence

Pick a small alpha (e.g. 5% or 10%). The number of degrees of freedom equals the number of items minus 1.

Example: Chi-square(Pepsi and Coke) = (25 − 0.001) / 25 = 0.999
Degrees of freedom = 2
Alpha = 5%


From the statistical tables, the critical value of chi-square is 3.84, which is higher than the computed chi-square; hence Pepsi and Coke are considered dependent.

Impact = Actual co-occurrence / Expected co-occurrence

Impact = 1 if the products are independent, and ≠ 1 if the products are dependent.

Example: Impact(Pepsi on Coke) = 0.001 / 25

Lift(A on B) = (Actual co-occurrence − Expected co-occurrence) / (Frequency of occurrence of A)

−1 ≤ Lift ≤ 1

Lift is similar to correlation: it is 0 if A and B are independent, and approaches +1 or −1 if they are dependent; +1 indicates attraction and −1 indicates repulsion.

Example: Lift(Coke on Pepsi) = (0.001 − 25) / 50

The Product Triangulation Strategy investigates cross-purchase tilts to answer such questions. If the most significant tilt occurs when triangulating with respect to promotion or pricing, the products are substitutes. Pepsi and Coke repel each other and show no cross-purchase patterns.
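As a small numerical check of these measures, the following Python sketch plugs in the Coke/Pepsi figures used above, with probabilities expressed as fractions; it merely evaluates the formulas as stated in the text and is not taken from the thesis.

    p_coke, p_pepsi = 0.50, 0.50
    actual = 0.00001              # P(Coke and Pepsi in a basket) = 0.001%
    expected = p_coke * p_pepsi   # 0.25, i.e. 25%

    chi_square = (expected - actual) / expected          # ≈ 0.999..., as in the text
    impact = actual / expected                           # ≈ 0.00004, far from 1, so the items are dependent
    lift_coke_on_pepsi = (actual - expected) / p_coke    # ≈ -0.5, negative, indicating repulsion

    print(chi_square, impact, lift_coke_on_pepsi)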

The Apriori Algorithm, developed by Agrawal and Srikant [85] and Agrawal et al. [86], takes all of the transactions in the database into consideration in order to define the market basket. The technique was originally used in pattern recognition and later gained popularity with the discovery of the following rule: "on Thursdays, grocery store customers often purchase diapers and beer together" [74]. The association rules can be evaluated using two measures, namely the support measure and the confidence measure. The Apriori Algorithm is implemented in many commercial institutions. The outputs of the Apriori Algorithm are easy to understand and many new patterns can be identified. However, the sheer number of association rules may make the interpretation of the results difficult. A second weakness of the algorithm is its high computational time when it searches for large itemsets, due to the exponential complexity of the algorithm.


Hence, a novel and effective technique for Market Basket Analysis is very much needed.

4.3. PROPOSED MODIFIED APRIORI APPROACH FOR MARKET BASKET ANALYSIS

The proposed Market Basket Analysis applies a modified version of the well-known Apriori data mining algorithm to guide the users in the selection of the frequent items.

4.3.1. Modified Apriori Algorithm

There has been various research on the discovery of association rules between the items of a transaction set. However, when defining user behavior patterns for services such as a supermarket, an e-learning site, or simply any website, the analysis of the items composing the transactions should not be considered alone. For example, the behavior of a user on a website cannot be measured well only by the items the user buys (what); the way the user buys these items (how) should also be considered, in order to differentiate that user from other users or to group that user with other users of similar behavior. This leads to a step beyond the simple discovery of the association rules generated by the users of the service; that is to say, the relationship existing between the whole set of users and each one of the individuals has to be analyzed.

To perform this analysis, the following steps are carried out:

i. Discovering the existing association rules in the original repository of transactions.
ii. Rewriting the original transactions to reflect all the association rules that each one of them verifies.
iii. Discovering and using the existing relations between the rules found in step i in order to automatically divide the set of transaction rules obtained in step ii.

In this way, the behavior of a user at a supermarket can be characterized by the data obtained from the manner in which the user acts in the service, i.e. based on the particular rules verified by the user even while the user is interacting with the service.


i. Discovering existing association rules in the original repository of transactions

Initially, the original algorithm [91] for discovering the association rules between the items generated by the users is applied. The original pseudocode by Agrawal and Srikant [91] is presented in Fig 4.1.

Fig 4.1: The original Apriori Algorithm

(1)  L_1 = {large 1-itemsets};
(2)  for (k = 2; L_{k-1} ≠ ∅; k++) do begin
(3)      C_k = apriori-gen(L_{k-1});
(4)      forall transactions t ∈ D do begin
(5)          C_t = subset(C_k, t);   // candidates contained in t
(6)          forall candidates c ∈ C_t do
(7)              c.count++;
(8)      end
(9)      L_k = {c ∈ C_k | c.count ≥ minsup};
(10) end
(11) Answer = ∪_k L_k;

In the original algorithm, apriori-gen is a function made up of two phases, namely union and pruning. In the union phase (Fig 4.2), all candidate k-itemsets are generated.


Fig 4.2: Union phase of the original Apriori Algorithm

insert into C_k
select p.item_1, ..., p.item_{k-1}, q.item_{k-1}
from L_{k-1} p, L_{k-1} q
where p.item_1 = q.item_1, ..., p.item_{k-2} = q.item_{k-2},
      p.item_{k-1} < q.item_{k-1};

In the pruning phase (Fig 4.3), which gives the algorithm its name, all candidates generated in the union phase that contain some non-frequent (k-1)-itemset are removed.

Fig 4.3: Pruning phase of the original Apriori Algorithm

forall itemsets c ∈ C_k
    forall (k-1)-subsets s of c do
        if s ∉ L_{k-1} then
            delete c from C_k;
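To make the interaction between these two phases concrete, the following Python sketch expresses apriori-gen (union followed by pruning) in executable form; it is an illustrative reading of Figs 4.2 and 4.3, not code from the thesis.

    from itertools import combinations

    def apriori_gen(L_prev, k):
        # Generate candidate k-itemsets C_k from the frequent (k-1)-itemsets in L_prev
        L_prev = set(L_prev)              # frozensets of size k-1
        candidates = set()
        # Union (join) phase: combine two (k-1)-itemsets whose union has exactly k items
        for p in L_prev:
            for q in L_prev:
                union = p | q
                if len(union) == k:
                    candidates.add(frozenset(union))
        # Pruning phase: drop any candidate that has a non-frequent (k-1)-subset
        return {c for c in candidates
                if all(frozenset(s) in L_prev for s in combinations(c, k - 1))}

    # Example: frequent 2-itemsets over the items a, b, c
    L2 = {frozenset("ab"), frozenset("ac"), frozenset("bc")}
    print(apriori_gen(L2, 3))             # {frozenset({'a', 'b', 'c'})}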


In Fig 4.4, the new implementation of the union and pruning phases for the Apriori Algorithm is given. By joining the union and pruning phases in the same function, many insert and delete operations on the dynamic vector C_k are saved. Also, by relaxing the pruning, many search operations in the tree L of frequent k-itemsets are saved.

Fig 4.4: Union and pruning phases of the Modified Apriori Algorithm

insert into C_k
select c = {p.item_1, ..., p.item_{k-1}, q.item_{k-1}}
from L_{k-1} p, L_{k-1} q
where (p.item_1 = q.item_1, ..., p.item_{k-2} = q.item_{k-2},
       p.item_{k-1} < q.item_{k-1})
and (p.item_{k-1}, q.item_{k-1}) ∈ L_2

Another important adjustment to the original Apriori Algorithm concerns the extraction of the existing rules (function genrules) from the repository of transactions. The original Apriori genrules is presented in Fig 4.5.


Fig 4.5: The original Apriori genrules

// Simple algorithm
forall large itemsets l_k, k ≥ 2 do
    call genrules(l_k, l_k);

// genrules generates all valid rules a ⇒ (l_k − a), for all a ⊂ a_m
procedure genrules(l_k: large k-itemset, a_m: large m-itemset)
(1)  A = {(m-1)-itemsets a_{m-1} | a_{m-1} ⊂ a_m};
(2)  forall a_{m-1} ∈ A do begin
(3)      conf = support(l_k) / support(a_{m-1});
(4)      if (conf ≥ minconf) then begin
(7)          output the rule a_{m-1} ⇒ (l_k − a_{m-1}), with confidence = conf and support = support(l_k);
(8)          if (m − 1 > 1) then
(9)              call genrules(l_k, a_{m-1});   // generate rules with subsets of a_{m-1} as antecedents
(10)     end
(11) end

To denote a rule, the following notation is used:

    R_i = a_i ⇒ c_i

where R_i is the i-th rule, a_i is the antecedent itemset and c_i is the consequent itemset. The k-itemset that defines rule R_i is denoted by

    l_k^i = a_i ∪ c_i


The procedure genrules takes two k-itemsets as parameters: the first one is always l_k^i, and the second is a subset of l_k^i from which the antecedents of the rules derived from l_k^i are extracted. Each time genrules is called, the confidence of a rule a_m ⇒ c_i, where c_i = l_k − a_m, is obtained; hence, two calls to the support function are needed. This function has to traverse L down to the level determined by the number of items of the k-itemset received, i.e. k + (k − m) searches.

This characteristic makes the original algorithm inefficient, owing to the way the repository L of frequent k-itemsets is stored: the items that compose l_k have to be searched from the root of the repository. Furthermore, in the proposed approach the support of l_k is added as a parameter to genrules to obtain a more efficient implementation of the algorithm, which eliminates many calls to the support function (only one call is performed before processing each k-itemset of L, and in the multiple recursive calls this function is not invoked again for that itemset).

The creators of the Apriori Algorithm relied on the following statement: if the rule a ⇒ (l − a) does not hold, then neither does a′ ⇒ (l − a′) for any a′ ⊂ a. Rewriting this, for a rule (l − c) ⇒ c to hold, all rules of the form (l − c′) ⇒ c′ must also hold, where c′ is a non-empty subset of c. For example, if the rule AB ⇒ CD holds, then the rules ABC ⇒ D and ABD ⇒ C must also hold.

Based on this, a second method was proposed, which is quicker under certain circumstances: when minconf is large, if it is detected that conf(ABC ⇒ D) < minconf, it is assumed that it is not necessary to check the rules AB ⇒ CD, AC ⇒ BD, BC ⇒ AD, A ⇒ BCD, B ⇒ ACD and C ⇒ ABD. When the goal is also to detect rules with low confidence, this is rather a disadvantage, since it implies an additional verification and slows down the execution of the algorithm.

Moreover, the performance of the algorithm is improved by analyzing the characteristics of the rules it generates. Consider the following example: let the k-itemset be l_k = {1, 2, 3, 4, 5, 6}.


genrules(l_k, l_k) will recursively call genrules(l_k, {1,2,3,4,5}), which in turn will recursively call genrules(l_k, {1,2,3,4}). Later, the initial call will invoke genrules(l_k, {1,2,3,4,6}), and this will make another recursive call to genrules(l_k, {1,2,3,4}). Thus genrules(l_k, {1,2,3,4}) is called twice, and consequently all of the calls performed with the subsets of {1,2,3,4} as the second parameter are duplicated. This generates a great number of repeated searches and operations, which grows exponentially with k. In the proposed implementation, this problem is overcome by the use of a static vector in which the subsets of l_k already analyzed as the second parameter are stored. In order to apply these properties, the algorithm has been redefined as depicted in Fig 4.6.


Fig 4.6: Modified Apriori genrules

forall large itemsets l_k, k ≥ 2 do
    supp_l_k = support(l_k);
    call genrules(l_k, l_k, supp_l_k);

// genrules generates all valid rules a ⇒ (l_k − a), for all a ⊂ a_m
procedure genrules(l_k: large k-itemset, a_m: large m-itemset, supp_l_k: double)
(×)  if (m == k) then
(×)      a_m_processed.clear();   // initialize a_m_processed for each new l_k
(1)  A = {(m-1)-itemsets a_{m-1} | a_{m-1} ⊂ a_m};
(2)  forall a_{m-1} ∈ A do begin
(×)      if a_{m-1} ∈ a_m_processed then
(×)          continue;
(×)      a_m_processed.add(a_{m-1});
(3)      conf = supp_l_k / support(a_{m-1});
(4)      if (conf ≥ minconf) then begin
(7)          output the rule a_{m-1} ⇒ (l_k − a_{m-1}), with confidence = conf and support = support(l_k);
(8)          if (m − 1 > 1) then
(9)              call genrules(l_k, a_{m-1}, supp_l_k);   // generate rules with subsets of a_{m-1} as antecedents
(10)     end
(11) end
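The duplicate-avoidance idea of Fig 4.6 can be sketched in Python as follows. This is an illustrative reconstruction, not the thesis implementation: the support table `supports`, the threshold `min_conf` and all other names are assumptions made for the example.

    from itertools import combinations

    def gen_rules(l_k, supports, min_conf):
        # Generate rules from the frequent itemset l_k, skipping antecedents already processed
        supp_l_k = supports[l_k]
        processed = set()            # plays the role of a_m_processed in Fig 4.6
        rules = []

        def recurse(a_m):
            m = len(a_m)
            for a in map(frozenset, combinations(a_m, m - 1)):
                if a in processed:   # the (x)-marked lines: skip duplicated antecedents
                    continue
                processed.add(a)
                conf = supp_l_k / supports[a]
                if conf >= min_conf:
                    rules.append((a, l_k - a, conf, supp_l_k))
                    if m - 1 > 1:
                        recurse(a)   # subsets of a as antecedents

        recurse(l_k)
        return rules

    # Hypothetical supports for the itemset {A, B, C} and its subsets
    supports = {
        frozenset("ABC"): 0.2,
        frozenset("AB"): 0.3, frozenset("AC"): 0.25, frozenset("BC"): 0.4,
        frozenset("A"): 0.5, frozenset("B"): 0.6, frozenset("C"): 0.5,
    }
    for antecedent, consequent, conf, supp in gen_rules(frozenset("ABC"), supports, 0.5):
        print(set(antecedent), "=>", set(consequent), round(conf, 2), supp)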


This modified algorithm does not check any extra rule when minconf is set to 0%. The execution time of the Modified Apriori Algorithm is not excessive at all, so it could be executed in real time. If applied to the transactions performed by the same user, or by a group of users with homogeneous behavior in a supermarket, this modified algorithm achieves even lower execution times.

ii. Rewriting transactions to reflect the association rules verified by each transaction

It is more appropriate to study the behavior of a customer according to the rules the customer "verifies" rather than according to the items themselves. Suppose a customer enters the supermarket on Mondays and buys certain items A, B and C and then a series of items T1, while the same customer buys items A, G and H on Fridays and then another series of items T2. Also suppose that the items of T1 and T2 are not frequently purchased together with item A. As this approach does not consider temporal variables, it cannot detect whether it is Monday or Friday. Thus, when a user enters the supermarket and requests item A, the conventional technique will probably recommend item B or G. If the user then buys item G, the method will suggest item H as the most probable option, next some items of T2, and finally another link to item B, with a lower probability of being bought by the user. Therefore, the fact of buying item A is not as significant as the possibility of using the behavioral rules already stored (the existing relations between the rules containing the k-itemsets AGH and T2). The key idea is the capability of analyzing the rules independently of the number of items they contain, providing a means to obtain the relations between rules composed of different items, using only the transactions supplied by the users.

If the rules verified by each original transaction of D are studied, a set of rules R that contains one line for each verified rule of each transaction of D is defined. The goal of this conversion D → R is to study the resulting set of rules R again by means of the Apriori Algorithm. At the beginning, this task is merely exploratory, as a lot of time is needed to build the file R for huge repositories of data, and it should be done for different support and confidence thresholds. Rule repositories tend to be bigger than the transaction repositories. Indeed, if a transaction verifies a rule that consists of a given k-itemset l_k, it


will then verify all the rules of the k-itemsets contained in l_k. For example, if the transaction ABCD generates one rule, it can then generate the fifty rules inferred by its subsets:

- 14 rules with 4 items: ABC ⇒ D, ABD ⇒ C, ..., D ⇒ ABC;
- 24 rules with 3 items: AB ⇒ C, ..., CD ⇒ B;
- 12 rules with 2 items: A ⇒ B, ..., D ⇒ C.

If a low confidence threshold is used, one transaction of D with three items will define twelve rules in R, one transaction of D with four items will define fifty rules in R, and R will then grow exponentially with the length of the transactions of D.
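These counts can be verified with a few lines of Python. The sketch below (purely illustrative) counts every rule antecedent ⇒ consequent that can be formed from the transaction ABCD and from its 3-item and 2-item subsets.

    from itertools import combinations

    def rules_from_itemset(itemset):
        # Number of rules a => (itemset - a) with a non-empty proper subset a as antecedent
        items = sorted(set(itemset))
        return sum(1 for r in range(1, len(items)) for _ in combinations(items, r))

    four = rules_from_itemset("ABCD")                                    # 14 rules with 4 items
    three = sum(rules_from_itemset(s) for s in combinations("ABCD", 3))  # 24 rules with 3 items
    two = sum(rules_from_itemset(s) for s in combinations("ABCD", 2))    # 12 rules with 2 items
    print(four, three, two, four + three + two)                          # 14 24 12 50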

At this point, the impressive size of R leads to an exploratory study of the existing differences between the two repositories, in order to propose a reduction of the data contained in R without losing any of the information contained in it.

iii. Automatic division of the repository of rules

Many techniques exist to partition the repository of original data; however, all of them depend on tight control and classification of the set of items available to the system. Here, a set of families of rules is defined in order to partition the repository of rules and to group those rules that link users when they interact with the system. The proposed algorithm is explained below.

(1) Select the rule with the highest support; in case of a tie, select the rule with the highest confidence, and in case of a further tie, select the rule that contains more items:

    R_1 | supp(R_1) = max_i{supp(R_i)} ∧ (conf(R_1) = max_j{conf(R_j) | supp(R_j) = supp(R_1)})

This rule will be the principal element of the first family of rules, ℱ1.

(2) Divide ℛ into subsets of transactions. The first repository ℛ1 will contain all the transactions with the rule R_1, and the second repository ℛ∞ will contain the remaining transactions.

(3) Run the Apriori Algorithm on ℛ1 and apply step 1 again to select R_2, the rule with the highest frequency jointly with R_1.

(4) Check the support of R_2 in ℛ∞: if the support of R_2 in ℛ1 is greater than its support in ℛ∞, then add R_2 to family ℱ1; otherwise, remove R_2 from ℛ1.


(5) Go back to step 3 while there are rules still not classified in ℛ1.

When the definition of the first family of rules has been completed, all the rules belonging to ℱ1 and all empty transactions are removed from the rule repository ℛ. Next, the second family of rules is constructed in the same manner. The algorithm finishes when ℛ∞ has no rules associated with the other rules, i.e. when all the transactions of ℛ∞ contain only one rule.
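A minimal sketch of this family-building loop might look as follows in Python. It is a simplified illustration under several assumptions: support is measured as plain co-occurrence frequency, the confidence and itemset-length tie-breaks are omitted, and the Apriori re-run of step 3 is replaced by a direct frequency count, so it should be read as pseudocode in Python syntax rather than as the thesis implementation.

    from collections import Counter

    def build_families(rule_transactions):
        # rule_transactions: list of sets, each holding the rule ids verified by one transaction
        remaining = [set(t) for t in rule_transactions]
        families = []
        while True:
            counts = Counter(r for t in remaining for r in t)
            if not counts or counts.most_common(1)[0][1] <= 1:
                break                                   # no rule co-occurs with others any more
            seed = counts.most_common(1)[0][0]          # step (1): rule with the highest support
            family = {seed}
            r1 = [t for t in remaining if seed in t]    # step (2): transactions verifying the seed
            r_rest = [t for t in remaining if seed not in t]
            candidates = Counter(r for t in r1 for r in t if r not in family)
            for rule, count_in_r1 in candidates.most_common():    # steps (3)-(5)
                count_in_rest = sum(1 for t in r_rest if rule in t)
                if count_in_r1 > count_in_rest:         # step (4): keep the rule in the family
                    family.add(rule)
            families.append(family)
            # remove the family's rules and the transactions left empty
            remaining = [t - family for t in remaining if t - family]
        return families

    print(build_families([{"R1", "R2"}, {"R1", "R2", "R3"}, {"R3"}, {"R4"}]))   # [{'R1', 'R2'}, {'R3'}]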

iv. The proposed Market Basket Analysis System

Finally, an effective system has been designed for Market Basket Analysis with the two-step modified Apriori Algorithm. The main goal of this approach is to obtain a better Market Basket Analysis system that facilitates the user in purchasing the items. The proposed approach can be described as follows:

(1) A user enters the supermarket and selects item A.

(2) The system can use only the details collected about the selected item, specifically the original rules with A as antecedent; this is the classical suggestion using association rules. As a result, the first recommendation depends only on frequency: the system suggests only the consequents of the rules with the highest confidence that have item A as antecedent.

(3) The user selects a second item B.

(4) The rules derived from the k-itemset I_k = {A, B} are obtained, and the family ℱi to which they belong is searched for.

- If ℱi contains rules derived from the rules fulfilled by the user in the current visit, there is then a confidence of 100% between the discovered rules and the rules already verified by the user. In this case, the system suggests the items of the discovered rules with the greatest support. Unlike many classical methods, rules are discovered here that have none of the chosen items as antecedents; this is highly significant, as the analysis performed entirely ignores the order of selection of the items.


- If ℱi does not contain such rules, the rules with the highest confidence (and support, in case of a tie) in relation to the rules verified by the user are used. Additionally, with the intention of solving the classical problem of making suggestions based only on the antecedent, it is possible to discover rules involving any item already selected by the user, which cannot be accomplished with the classical approach.

(5) The user chooses a third item C.

(6) Step 4 is then repeated, but now looking for one or more families based on the rules derived from the k-itemset I_k = {A, B, C}; a sketch of this recommendation lookup is given below.
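The lookup in steps (2), (4) and (6) can be pictured with the following Python sketch. It is a simplified stand-in, not the thesis code: the rule repository `rules`, the partition `families` and the scoring by confidence and support are assumptions made for the illustration.

    def recommend(selected, families, rules):
        # rules: rule_id -> (antecedent set, consequent set, confidence, support)
        # families: list of sets of rule ids, as produced by the partitioning step
        selected = set(selected)
        # Rules whose antecedent is already covered by the current basket
        verified = {rid for rid, (a, c, conf, supp) in rules.items() if a <= selected}
        best = (0.0, 0.0, set())                 # (confidence, support, suggested items)
        for family in families:
            if not (family & verified):
                continue                         # only families related to the current visit
            for rid in family:
                a, c, conf, supp = rules[rid]
                suggestion = c - selected
                if suggestion and (conf, supp) > best[:2]:
                    best = (conf, supp, suggestion)
        return best[2]

    # Hypothetical rule repository and families
    rules = {
        "r1": ({"A"}, {"B"}, 0.7, 0.4),
        "r2": ({"A", "B"}, {"C"}, 0.9, 0.3),
        "r3": ({"D"}, {"E"}, 0.8, 0.2),
    }
    families = [{"r1", "r2"}, {"r3"}]
    print(recommend({"A", "B"}, families, rules))   # {'C'}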

This technique provides a new experience to the customers within the supermarket, where the customers should feel more comfortable in purchasing their items: they are able to buy at least one more item directly when using the proposed system.

4.4. DISCUSSION

In the Apriori Algorithm, the apriori-gen function is made up of two phases, namely union and pruning. In the union phase, all candidate k-itemsets are generated. The pruning phase, which gives the algorithm its name, removes all candidates generated in the union phase that contain some non-frequent (k-1)-itemset, and the algorithm scans the database a number of times. Every scan creates additional candidates to be checked, which creates additional search work for the database. Therefore, the database must store a huge number of candidate itemsets. This results in a lack of memory for the additional data and a very long processing time, and hence in very low efficiency.

In the modified algorithm, a new implementation of the union and pruning phases of the Apriori Algorithm is given. By joining the union and pruning phases in the same function, many insert and delete operations on the dynamic vector C_k are saved. Also, by relaxing the pruning, many search operations in the tree L of frequent k-itemsets are saved.

The modified algorithm does not check any extra rule when minconf is set to 0%. As the execution time of the Modified Apriori Algorithm is not excessive, it could be


executed in real time. If applied to the transactions performed by the same user, or by a group of users with homogeneous behavior in a supermarket, this modified algorithm achieves even lower execution times.

4.5. SUMMARY

In this chapter, a new method for Market Basket Analysis based on a modified Apriori Algorithm has been introduced. The chapter presents a modified version of the well-known Apriori data mining algorithm to guide the users towards the selection of the best combination of items in a supermarket. The approach mainly analyzes the process of discovering association rules in this kind of big repository. Most association rule mining algorithms suffer from the problems of excessive execution time and of generating too many association rules. Although the conventional Apriori Algorithm can identify meaningful itemsets and construct association rules, it suffers from the disadvantage of generating numerous candidate itemsets that must be repeatedly compared with the entire database; the conventional algorithm also uses a large amount of memory. Thus, the proposed approach is very significant for effective Market Basket Analysis, and it helps customers purchase their items with more comfort, which in turn increases the sales rate of the markets.
