
FUZZY BICLUSTERING APPROACH FOR WEB COMMUNITIES IDENTIFICATION AND WEB PERSONALIZATION

H. Hannah Inbarani1, K. Thangavel2

1Department of Computer Science, Periyar University, Salem-636 011, India, Email: [email protected]

2Department of Computer Science, Periyar University, Salem-636 011, India, Email: [email protected]

Abstract

Information overload appears to be a growing problem for people in the web era. The

rapid development of web technologies has made the World Wide Web a huge information

source. The existence of huge amounts of data and the lack of well-defined data models for the web

make information retrieval a tedious task. As a result, web users often navigate through a web site

without finding relevant resources. Web personalization is a step to alleviate the information

overload problem thereby helping the user to make interest driven visits. Web usage mining tries

to reveal the underlying access patterns from web transactions or user session data that are

recorded in web log files. Generally, web users navigate through web pages based on their

interest and the coherence between web pages. They may exhibit different types of access

interests during their surfing period. Thus, employing data mining techniques on the observed

usage data may lead to finding the underlying usage patterns. Hence there is a need to develop

an efficient technique for uncovering web user communities, from which relevant pages can be

recommended according to users' preferences. In this paper we propose a Robust Fuzzy Biclustering

approach which captures web communities and recommends pages based on the bicluster

patterns. Experiments were performed using the web log collected from the web server for a

leading IT Services and Solutions company. In order to show the effectiveness of the proposed

Robust Fuzzy Biclustering approach, recommendation results are compared with the existing

approaches such as Conventional Biclustering, CDK-Means, and spectral co-clustering.

Experimental results show the effectiveness of the proposed algorithm over the existing

biclustering approaches.

Keywords: Information overload, Web usage mining, Fuzzy Biclustering, Web Page

Recommendation, Personalization, Co-clustering


1. Introduction

The technology behind personalization has undergone tremendous changes, and several web-based personalization systems have been proposed in recent years. Although personalization can be accomplished in numerous ways, most web personalization techniques fall into four major categories: decision rule-based filtering, content-based filtering, collaborative filtering, and web usage mining. Decision rule-based filtering surveys users to obtain user demographics or static profiles, and then lets web sites manually specify rules based on them. Content-based filtering recommends items that are similar to what a user has liked previously. Collaborative filtering (CF), also called social or group filtering, is the most successful personalization technology to date. Most successful recommender systems on the web typically use explicit user ratings of products or preferences to sort user profile information into peer groups (Sung Ho Ha et al. 2002). Such a system then tells users what products they might want to buy by combining their personal preferences with those of like-minded individuals. However, traditional collaborative and content-based filtering have problems, such as reliance on subjective user ratings and static profiles, or the inability to capture richer semantic relationships among web objects. To overcome these shortcomings, web personalization increasingly incorporates web usage mining techniques. Web usage mining can help improve the scalability, accuracy, and flexibility of recommender systems. It can also reduce the need for obtaining subjective user ratings or registration-based personal preferences.

Web usage mining (WUM) uses data mining algorithms to automatically discover and extract patterns from web usage data and to predict user behavior while users interact with the web; it also helps in the discovery of web communities. Although web usage mining has known limitations, such as sparsity in usage data and regular changes in site content, it also has several advantages over traditional techniques. The data source for web usage mining is generally the server access log, but sometimes a client-side agent collects data.

An interesting problem associated with the web is the definition and delineation of so-called web communities. A community is loosely defined to be a collection of content creators that share a common interest or topic. The systematic extraction of emerging communities is useful for many reasons; for example, communities provide high-quality information to interested users (Jayson E. Rome 2005).

Discovery of web communities, groups of related web pages sharing common interests, is

important for assisting users' information retrieval from the web. There are several different

granularities of overlapping web communities, and this makes the identification of objective

boundaries of web communities difficult (Grieser et al. 2003). This paper focuses on the identification of web communities based on users' navigation behavior. These communities are used in web page prediction. Web prediction systems based on WUM obtain user profiles dynamically from usage patterns, and thus their performance does not degrade over time as the profiles age.

1.1. Motivation

The web usage mining tasks can involve the discovery of association rules, sequential patterns, page view clusters, user clusters, probabilistic models or any other pattern discovery method (Sarabjot Singh Anand et al. 2007). The discovered patterns are used by the online component to provide personalized content to users based on their current navigational activity. The personalized content can take the form of recommended links or products, targeted advertisements, or text and graphics tailored to the user’s preferences. The web server keeps track of the active server session as the user’s browser makes HTTP requests. The recommendation engine considers the active server session in conjunction with the discovered patterns to provide personalized content.

The primary motivation behind the use of clustering in collaborative filtering (Guandong Xu 2008) and web usage mining is to improve the efficiency and scalability of the real-time personalization tasks. In the context of web personalization, this task involves clustering user sessions identified in the preprocessing stage. A variety of clustering techniques can be used for clustering similar users’ sessions based on occurrence patterns of URL (Uniform Resource Locator) references. User sessions can be mapped into a multidimensional space as vectors of URL references.

Another approach for obtaining aggregate usage profiles is to directly compute (overlapping) clusters of page view references based on how often they occur together across user sessions (rather than clustering the sessions themselves). A usage profile obtained in this way is called a page cluster (Bamshad Mobasher et al. 2002).

However, both user clustering (UC) and page view clustering (PC) are one-sided approaches, in the sense that they examine similarities either only between users or only between pages, respectively. In this way, they ignore the clear duality that exists between users and items. Furthermore, UC and PC algorithms cannot detect partial matching of preferences, because their similarity measures consider the entire set of pages or users, respectively. Another limitation of user or page clustering algorithms is that the number of clusters must be given as input based on the structure of the input patterns. Hence the first goal of this work is to simultaneously cluster users and pages based on their URL references.

The flow of information in a completely automated web personalization system can be

prone to significant amounts of error and uncertainty. This uncertainty pervades all stages from

the user’s web navigation patterns to the final recommendations, including the intermediate

stages of logging web usage, preprocessing, segmenting web log data into web user sessions, and

learning a usage model from this data (Olfa Nasraoui et al. 2003). Hence the second goal of this

work is to handle the uncertainty prevailing in the web pattern discovery process.

1.2. Contribution


The simultaneous clustering of users and pages discovers biclusters, which correspond to groups

of users which exhibit high correlation on groups of pages. For page recommendation, biclusters

allow the computation of similarity between a test user and a bicluster only on the pages that are

included in the bicluster. Thus, partial matching of preferences is taken into account too.

Moreover, a user can be matched with several nearest biclusters and thus receive

recommendations that cover the range of his or her various preferences. A simple and robust

biclustering approach was already proposed in our previous work (Hannah Inbarani et al. 2011)

for web page recommendation.

The wide spectrum of uncertainties involved in the web navigation process can be

modeled and handled using well studied formal models of uncertainty in fuzzy set theory and

soft computing. Hence, to address the second goal described above, i.e., to handle uncertainty in the web

pattern discovery process, a fuzzy biclustering approach is introduced in this work for web

personalization.

The contributions of this paper are summarized as follows:

(i) To capture the range of the user’s preferences and to handle the uncertainty that prevails in web navigation patterns, we introduce, for the first time to our knowledge, the application of the Fuzzy Biclustering (FB) algorithm to web personalization. The effectiveness of this approach is compared, using recommendation evaluation metrics, with the spectral co-clustering approach proposed by (Dhillon 2001) for co-clustering of words and documents, the CDK-Means approach proposed by (Pensal et al. 2005), which is a K-Means-like approach for biclustering of categorical data, and the Conventional Biclustering (CB) approach proposed in our previous work (Hannah Inbarani et al. 2011). The results are discussed in section 7.

(ii) A web user profiling approach and a recommendation approach based on fuzzy biclustering are also proposed for web page recommendation.

The rest of this paper is organized as follows: section 2 summarizes the related work, section 3 lists the research issues addressed in this paper, section 4 describes the methodology for the web page recommendation process and the proposed FB approach, and section 5 discusses the performance analysis of FB and the comparative analysis of FB with the CB, CDK-Means and spectral co-clustering approaches.

2. Related Work

Web clustering can involve either grouping of users who present similar browsing patterns or

grouping of pages having related content based on information derived from different sources.

Specifically, user clustering approaches can be based on usage data recorded in web server log files and create web communities, i.e., groups of users with similar browsing behavior (Pallis, G. and Koutsonikola et al. 2006). On the other hand, in web page clustering approaches, information can be extracted from the pages’ content (Hammouda, K.M and Kamel, M.S 2004), from structure, i.e., links between web pages or the pages’ structure as described by the involved tags (Doru Tanasa and Brigitte Trousse 2004), and from usage data, i.e., which pages tend to be accessed by users with similar interests (Nakagawa and Mobasher 2003). Moreover, the clustering results may be beneficial for a wide range of applications such as web site personalization (Nasraoui, O., Soliman, M. Saka, 2008), web caching and prefetching (Li, H-Y et al. 2007), search engines (Liu et al. 2005) and Content Delivery Networks (Pallis, G. and Koutsonikola et al. 2006). In addition, the clustering results can contribute to the enhancement of recommendation engines (Chi, C-C et al. 2008) and to the design of collaborative filtering systems (Srinivasa, N. and Medasani, S 2004).

In user/page clustering approaches, the exact user access patterns are not taken into account. Hence recent studies have used biclustering approaches to exploit this duality between users and pages, by grouping them in both dimensions simultaneously (Liu X., He P. and Yang Q. 2005; Koutsonikola et al. 2009). The goal of these approaches is to identify groups of related web users and pages, which result from the tendency of some users to visit the same set of pages. This behaviour characterizes users’ interests as similar and highly related to the topic that the specific set of pages involves. The obtained results are particularly useful for applications such as e-commerce and recommendation engines, since relations between clients and products may be revealed. These relations are more meaningful than the one-way clustering of users or pages (Koutsonikola et al. 2009).


Usually, the clusters (or biclusters) resulting from web usage mining algorithms do not have crisp boundaries; rather, they have fuzzy or rough boundaries (Hannah Inbarani et al. 2009). Koutsonikola et al. (2009) have proposed a fuzzy biclustering approach to cluster users and pages simultaneously. The limitation of this two-way clustering approach is that it is based on clustering, and so the exact user access patterns cannot be obtained. Hence it is not suitable for page recommendation, because the correlation between pages disappears when the user access patterns are merged in user and page clustering techniques. As a result, the precision of the recommendation sets, as defined in (Fu et al. 1999), will be lower.

The concept of biclustering has been used in (Mirkin B et al. 1996) to perform grouping in a matrix by using both rows and columns. However, biclustering had been used previously in (Hartigan et al. 1972) under the name direct clustering. Recently, biclustering (also known as co-clustering, two-sided clustering, or two-way clustering) has been exploited by many researchers in diverse scientific fields for the discovery of useful knowledge (Cheng Y and Church 2000, Dhillon 2001, Dhillon et al. 2003, Long B et al. 2005). One of these fields is bioinformatics (Tang C 2001), and more specifically, microarray data analysis. The results of each microarray experiment are represented as a data matrix, with different samples as rows and different genes as columns. Other fields are text mining (Dhillon 2001) and web mining (Koutsonikola et al. 2009).

There are several approaches to deal with the biclustering problem. Many different

algorithms for biclustering have already been proposed in the literature (Cheng, Y. and Church

2000 and Tang C 2001). In short, these methods can be classified by (i) the type of biclusters

they find; (ii) the structure of these biclusters; and (iii) the way the biclusters are discovered.

The type of the biclusters is related to the concept of similarity between the elements of

the matrix. For instance, some algorithms search for constant value biclusters, while others

search for coherent values of the elements or even for coherent evolution biclusters (Pablo A. D. de Castro, Fabrício, 2007). The structure of the biclusters can be of many types: single bicluster algorithms, which find only one bicluster in the center of the matrix; exclusive columns and/or rows, in which the biclusters cannot overlap in either columns or rows of the matrix; and arbitrarily positioned, overlapping biclusters and overlapping biclusters with hierarchical


structure. The way the biclusters are discovered refers to the number of biclusters discovered per

run. Some algorithms find only one bicluster, others simultaneously find several biclusters and

some of them find small sets of biclusters at each run.

By performing co-clustering (biclustering or two-way clustering), the users and pages are simultaneously clustered into several co-clusters. Each co-cluster consists of a pair of a highly relevant user cluster and page cluster. Co-clustering offers some advantages such as dimensionality reduction, interpretable document clusters (Dhillon et al. 2003), and improvement in accuracy due to the local model of clustering (Madeira and Oliveira 2004). Fuzzy co-clustering further improves the representation of overlapping clusters by using a fuzzy membership function. These advantages make fuzzy co-clustering a suitable option for categorizing users and pages, particularly those in the World Wide Web.

Web personalization systems have recently attempted to incorporate techniques for web mining. Web mining turns out to be an enabling mechanism to overcome the problems associated with more traditional web personalization techniques, such as collaborative or content-based filtering. Access information, coded into server logs, is processed by applying web mining techniques for several purposes, such as clustering users with similar browsing behavior, extracting interesting usage patterns, and discovering potential correlations between web pages and user groups (Flesca et al. 2005). In a collaborative filtering approach, such systems provide a user with personalized recommendations based on the similarity between his/her profile and the profiles of other users with similar interests. User profiles, representing the information needs and preferences of users, can be inferred from the ratings that users provide on information items, explicitly or implicitly, through their interactions with a system (Girardi A et al. 2007). A user model, a representation of this profile, can be obtained implicitly through the application of web usage mining techniques. The personalization of services offered by a web site is an important step in the direction of alleviating information overload, making the web a friendlier environment for its individual user and hence creating trustworthy relationships between the web site and the visitor-customer (Pierrakos et al. 2003). One of the best illustrations of recommendations is Amazon’s recommendation engine, where a user is informed that “Customers who bought this item also bought this” or “Customers who bought music by this artist also bought music by these artists” (Pabarskaite 2007).


In this paper, we propose a Robust Fuzzy Biclustering technique for simultaneous clustering of users and pages. The proposed approach is compared with the co-clustering approach proposed by (Dhillon 2001) for co-clustering words and documents and with CDK-Means, proposed by (Pensal et al. 2005), which is a K-Means-like approach for biclustering of categorical data. The results are discussed in section 5.

3. Research Issues

In this section, we examine the issues of web page recommendation systems. Table 1

summarizes the symbols that are used in the sequel.

Accuracy

Web personalization is viewed as a data mining task. Hence the accuracy of models learned for this purpose can be evaluated using a number of metrics that have been used in the machine learning and data mining literature, such as Mean Absolute Error (MAE) and the area under the Receiver Operating Characteristic (ROC) curve, depending on the formulation of the learning task. In this work, MAE is used to measure the accuracy of web page recommendation results.

Scalability

The performance and scalability dimension aims to measure the response time of a given recommendation algorithm and how easily it can scale to handle a large number of concurrent requests for recommendations. Typically, these systems need to be able to handle large volumes of recommendation requests without significantly adding to the response time of the web site on which they have been deployed. The proposed approach is scalable because only the recommendation is performed online, while user profile discovery is an offline process. The online part concerns the time it takes to create a recommendation list based on the pages visited in the active user session. As shown in (Symeonidis et al. 2008), the online part of biclustering approaches takes less execution time than that of user/page clustering approaches.

Sparsity

Sparsity refers to the fact that as the number of pages in a web site increases, even the most prolific users of the system will only visit a very small percentage of all pages. As a result, there will be many pairs of users that have no pages in common, and even those that do will not have a large number of common pages. Sparsity can be handled well by selecting an appropriate value for K.

Precision

Precision and recall are standard metrics used in information retrieval. While precision measures the probability that a selected item is relevant, recall measures the probability that a relevant item is selected. Precision and recall are commonly used in evaluating the selection task. Coverage measures the percentage of the universe of items that the recommendation system is capable of recommending. The F1-Measure, which combines precision and coverage, has also been used for this purpose. In this work, precision, coverage and F1-Measure are the metrics used for measuring the prediction process.
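As an illustration of how these metrics could be computed for a single recommendation set, the following minimal sketch may help. The section gives no formulas, so the per-session definitions used here (precision as the fraction of recommended pages that were actually visited, coverage as the fraction of visited pages that were recommended, and F1 as their harmonic mean) are assumptions, and the function name `precision_coverage_f1` is hypothetical.

```python
def precision_coverage_f1(recommended, relevant):
    """Score one recommendation set against the pages the user actually visited.

    Assumed definitions (not taken from the paper):
      precision = |recommended ∩ relevant| / |recommended|
      coverage  = |recommended ∩ relevant| / |relevant|
      F1        = harmonic mean of precision and coverage
    """
    recommended, relevant = set(recommended), set(relevant)
    hits = len(recommended & relevant)
    precision = hits / len(recommended) if recommended else 0.0
    coverage = hits / len(relevant) if relevant else 0.0
    total = precision + coverage
    f1 = 2 * precision * coverage / total if total else 0.0
    return precision, coverage, f1


# Four pages were recommended; the user went on to visit three pages.
scores = precision_coverage_f1(["p1", "p2", "p3", "p4"], ["p2", "p3", "p5"])
print(scores)
```

Averaging these per-session scores over all test sessions would give system-level figures; whether the paper averages in this way is not stated in this section.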

Similarity Measure

The most extensively used similarity measures are based on correlation and cosine similarity (Symeonidis et al. 2008). Specifically, user-based clustering algorithms mainly use Pearson’s correlation, whereas for page view clustering algorithms, the Adjusted Cosine Measure (Mobasher et al. 2002) is preferred. The Adjusted Cosine Measure is a variation of the simple cosine formula that normalizes bias from subjective ratings of different users. In this work, the cosine similarity measure is used to find the similarity of users with patterns.

Table 1. Symbols

Symbol            Definition
nb                Number of biclusters
K                 Number of recommended biclusters
m                 Number of users
n                 Number of pages
UP                User profile matrix of size (nb x n)
P                 Pattern matrix of size (nb x n)
BU                Users in bicluster (nb x m)
BP                Pages in bicluster (nb x n)
µui               User membership values
µpj               Page membership values
nn                Active sub-session size
S                 Session matrix / user access matrix
p1, p2, …, pn     Pages/URLs
u1, u2, …, um     Users

[Figure 1 block labels: Internet, HTTP Request, Web Server Log, Preprocessing, Pattern Discovery, Extracting User Profiles, Matching Module (match current session with user profile), Recommended pages]

4. Methodology

A web personalization system based on web usage mining discovers web usage profiles,

followed by a recommendation system that can respond to the users’ individual interests.

The architecture of the proposed system is shown in Figure 1. In the offline processing,

user sessions are extracted and Fuzzy biclustering approach is used for extracting user access

patterns and the user profiles are generated. In online processing, current session of the user is

matched with user profiles and the most similar profiles are used for page recommendation.

Figure 1. System Architecture (offline module and online module)

The recommendation process consists of two phases.

Offline phase

The three steps of the offline phase are:

(i) Preprocessing

(ii) Pattern discovery (biclustering)

(iii) User profiling

Online phase

The two steps of the online phase are:

(i) Match the active session with user profiles

(ii) Recommend the top N list of pages

4.1 Preprocessing

The data cleaning operation is performed as defined in (Doru Tanasa and Brigitte Trousse 2004), which removes image files and style sheet files. The access log of a web server is a record of all files (URLs) accessed by users on a web site. Each log entry consists of information components such as remotehost, Rfc931, Authuser, date, request, status and bytes.

The sample entries in the web log file are listed in Figure 2.

218.248.30.146 - - [21/Nov/2009:03:10:51 +0530] "POST /make_slides.php HTTP/1.1" 200 740

216.104.15.130 - - [21/Nov/2009:03:20:37 +0530] "GET /messengerplus.php HTTP/1.0" 200 15202

Figure 2. Sample web log file

In the next step, user sessions are identified by the user session identification process and the session matrix is created. The user access matrix is S = {s_ij}, where s_ij = 1 if page j has been visited by user i and s_ij = 0 otherwise. The weight associated with each visited page is represented by W = {w_ij}, where each entry in the weight matrix specifies the number of hits on a specific page, as defined in (Claypool M., 2001). For each user, the weight vector of a navigational session is represented as a sequence of visited pages with corresponding weights {w_11, w_12, w_13, …, w_1n}, where w_ij denotes the weight for page j visited in the ith user session. The sample user access/session matrix is shown in Figure 3. Each row of the user access matrix is called a session vector, user access vector or transaction.

      p1   p2   p3   p4   p5
u1    1    1    1    1    0
u2    0    1    1    1    0
u3    1    0    1    1    1
u4    1    1    1    0    0

Figure 3. Sample user access matrix
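The preprocessing steps described above (parsing log entries, removing image and style sheet requests, and building S and W) can be sketched as follows. This is only an illustrative sketch under stated assumptions: each remote host is treated as one user with a single session, the list of skipped file extensions is a guess, and the names `LOG_PATTERN`, `SKIP_SUFFIXES` and `build_matrices` are hypothetical rather than part of the paper.

```python
import re

# Matches entries laid out like the samples in Figure 2:
# remotehost rfc931 authuser [date] "request" status bytes
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<date>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) [^"]*" (?P<status>\d{3}) (?P<bytes>\S+)$'
)

# Assumed extensions for image and style sheet files removed during cleaning.
SKIP_SUFFIXES = (".gif", ".jpg", ".jpeg", ".png", ".css", ".ico")


def build_matrices(log_lines):
    """Return (users, pages, S, W): binary access matrix S and hit-count matrix W.

    Assumption: one user (and one session) per remote host; the paper's own
    session identification step is not reproduced here.
    """
    hits = {}            # (host, url) -> number of requests
    users, pages = [], []
    for line in log_lines:
        match = LOG_PATTERN.match(line.strip())
        if match is None:
            continue     # skip lines that do not parse
        url = match.group("url")
        if url.lower().endswith(SKIP_SUFFIXES):
            continue     # data cleaning: drop images and style sheets
        host = match.group("host")
        if host not in users:
            users.append(host)
        if url not in pages:
            pages.append(url)
        hits[(host, url)] = hits.get((host, url), 0) + 1
    W = [[hits.get((u, p), 0) for p in pages] for u in users]
    S = [[1 if w > 0 else 0 for w in row] for row in W]
    return users, pages, S, W


sample = [
    '218.248.30.146 - - [21/Nov/2009:03:10:51 +0530] "POST /make_slides.php HTTP/1.1" 200 740',
    '216.104.15.130 - - [21/Nov/2009:03:20:37 +0530] "GET /messengerplus.php HTTP/1.0" 200 15202',
]
users, pages, S, W = build_matrices(sample)
print(pages, S)
```

A production pipeline would typically also split a host's requests into sessions using a timeout and filter out failed requests; those steps are omitted because the section does not specify them.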

4.2 The Biclustering Process

The biclustering process on a User access matrix involves the determination of a set of

clusters taking into account both users and pages. Each bicluster is defined on a subset of users

and a subset of pages. Moreover, two biclusters may overlap, which means that several users or

pages of the session matrix may participate in multiple biclusters. Another important

characteristic of biclusters is that each bicluster should not be fully contained in another

determined bicluster. Overlapping is allowed in order not to miss important biclusters.

Three biclusters formed from the User access matrix in Figure 3 are listed in Figure 4.

Bicluster   Users in the Bicluster    Pages in the Bicluster
B1          BU1 = {u1, u4}            BP1 = {p1, p2, p3}
B2          BU2 = {u1, u2}            BP2 = {p2, p3, p4}
B3          BU3 = {u1, u2, u3}        BP3 = {p3, p4}

Figure 4. Biclusters of the sample user access matrix in Figure 3

4.3 User profiling

The first step in intelligent web personalization is the automatic identification of user profiles. This constitutes the knowledge discovery engine. The discovered user profiles are used to recommend relevant URLs to old and new anonymous users of a web site (Olfa Nasraoui and Chris Petenes 2003).

User profiling is the process of collecting information about the characteristics, preferences, and activities of web communities. The user profiling approach is an efficient and effective basis for web recommendation; it builds on collaborative filtering techniques, which are commonly used in recommender systems.

This information can be collected either explicitly or implicitly. Explicit collection of user

profile data is performed through the use of online registration forms, questionnaires, and the

like. The methods that are applied for implicit collection of user profile data vary from the use of

cookies or similar technologies to the analysis of the users’ navigational behavior that can be

performed using web log mining techniques (Jian-Guo Liu and Wei-Ping Wu 2004).

Mobasher B. et al. (2002) have proposed a potentially effective method, PACT (Profile

Aggregations based on Clustering Transactions), to generate aggregate profiles based on the

centroids of each transaction cluster. However, the centroid of each cluster may represent

different groups of pages without much correlation. Hence in this paper, a robust Fuzzy

biclustering approach is proposed to generate profiles which reveal the implicit relationship that

exists between the pages and users. Discovery of aggregate profiles based on Biclustering was

already proposed in our previous work (Hannah Inbarani et al. 2011).

4.4. Recommendation Process

The goal of personalization based on anonymous web usage data is to compute a recommendation set for the current (active) user session, consisting of the objects (links, ads, text, products, etc.) that most closely match the current user profile. The recommendation engine is the online component of a usage-based personalization system. The procedure for recommendation is described in Figure 8.

5. Proposed work

5.1 Fuzzy Biclustering (FB) approach

In contrast to traditional clustering, a biclustering method produces biclusters, each of

which identifies a correlation between a set of users and a set of pages. The boundary of a

bicluster is usually fuzzy in practice as users and pages can belong to multiple biclusters at the

same time but with different membership degrees. In contrast to a crisp bicluster, which either

contains a user or a page completely or does not contain it at all, a fuzzy bicluster can contain a

user or a page partially, with a membership degree between 0 and 1. To deal with the ambiguity and the

uncertainty underlying web interaction data, fuzzy reasoning appears to be an effective tool.

The fuzzy biclustering algorithm works as follows. In the first step, the distinct patterns of the session matrix S are extracted using the Hadamard product defined in Definition 1. Given that S is made up of nb distinct patterns, the pattern matrix P can be expressed as P = {p_lj}, where l = 1, 2, . . . , nb, nb is the number of distinct patterns, j = 1, 2, . . . , n, and n is the number of pages. In the second step, the pages of the patterns are inserted in the biclusters. The complete description of fuzzy biclustering is shown in Figure 6.

Definition 1: Hadamard product

The Hadamard product, named after the French mathematician Jacques Hadamard, is also known as the entrywise product. Both A and B need to be the same size, but not necessarily square. Formally, for two matrices A and B of the same dimensions, the Hadamard product A ∘ B is a matrix of the same dimensions with elements given by

$$(A \circ B)_{ij} = A_{ij} \, B_{ij}$$

Pattern extraction:

A pattern v can be extracted by the Hadamard product of each row (considered as a user access vector of size 1 x n) with the other rows of the user access matrix, denoted by S_i ∘ S_j, where S_i = {S_i1, S_i2, …, S_in} and S_j = {S_j1, S_j2, …, S_jn}.

The various patterns extracted by the Hadamard product for the sample user access matrix in Figure 3 are listed in Figure 5.

Pattern   Pages
P1        {p2, p3, p4}
P2        {p3, p4}
P3        {p1, p2, p3}

Figure 5. Patterns obtained from the user access matrix in Figure 3

From Figure 5, it can be observed that the extracted patterns contain all the pages in the

biclusters and that the number of biclusters nb is equal to the number of patterns.
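The pattern extraction step can be illustrated on the sample matrix of Figure 3 with the sketch below. It is a simplified reading of the text, not the authors' implementation: it takes the Hadamard product of every pair of rows, keeps distinct results with at least `minp` pages (the default of 2 is an assumption), and then collects the users whose sessions contain each pattern. The helper names are hypothetical. Enumerating all row pairs in this way yields a superset of the three patterns listed in Figure 5; how the paper narrows the candidates down is not stated in this section.

```python
from itertools import combinations

# Sample user access matrix from Figure 3 (rows u1..u4, columns p1..p5).
S = [
    [1, 1, 1, 1, 0],  # u1
    [0, 1, 1, 1, 0],  # u2
    [1, 0, 1, 1, 1],  # u3
    [1, 1, 1, 0, 0],  # u4
]


def hadamard(a, b):
    """Entrywise product of two equal-length binary vectors."""
    return [x * y for x, y in zip(a, b)]


def extract_patterns(S, minp=2):
    """Distinct pairwise Hadamard products with at least minp pages."""
    patterns = []
    for i, j in combinations(range(len(S)), 2):
        pattern = hadamard(S[i], S[j])
        if sum(pattern) >= minp and pattern not in patterns:
            patterns.append(pattern)
    return patterns


def pages_of(pattern):
    """Page labels (p1, p2, ...) present in a pattern."""
    return {f"p{j + 1}" for j, v in enumerate(pattern) if v}


def users_of(pattern, S):
    """User labels whose session vector contains every page of the pattern."""
    return {f"u{i + 1}" for i, row in enumerate(S) if hadamard(row, pattern) == pattern}


for pattern in extract_patterns(S):
    print(sorted(pages_of(pattern)), sorted(users_of(pattern, S)))
```

For the patterns {p1, p2, p3}, {p2, p3, p4} and {p3, p4}, the users returned by `users_of` agree with the biclusters B1, B2 and B3 in Figure 4.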

Algorithm 3: Fuzzy Biclustering(S, m, n, NU, BP, BU, nb)

Input:  Session matrix S(m, n)
        NU   - Number of users
        n    - Number of pages
        m    - Fuzzy index
        minp - Minimum number of pages allowed in a bicluster

Output: nb biclusters

nb = 0   /* Index of bicluster */
Identify the distinct patterns of S and store them in matrix P
/* P - set of distinct patterns */
/* L - number of distinct patterns in S */

Step 1: Extract all the nb distinct patterns.
Step 2: Place all the pages in the extracted pattern l in bicluster BPi.
Step 3: If the extracted pattern exists in user session j, place user j in bicluster BUi.
Step 4: Set the initial page membership μpij for each page in pattern/bicluster i as
        μpij = 1 if page j ∊ bicluster i
               0 otherwise
Step 5: Compute the similarity of user i with all the patterns.
Step 6: Compute the user membership μuij using Eqn. (2).
Step 7: Update the page membership of the pages in the pattern/bicluster using the user membership.
        Update each pattern using

$$P(i,j) = \frac{\sum_{j=1}^{m} (\mu u_{ij})^{m} \, P(i,j)}{\sum_{j=1}^{m} (\mu u_{ij})^{m}}, \quad i = 1, 2, \ldots, L$$

Step 8: Stopping criterion: repeat steps 5 to 7 until the change in P(i, j) between two successive iterations falls below a fixed threshold ε.
Step 9: Set μpij = updated P(i, j).

Output nb                /* Number of biclusters */
Output BU and BP         /* Users and pages in each bicluster */
Output user membership
Output page membership

Figure 6. Fuzzy Biclustering approach

Definition 2: The membership of a user in each bicluster is calculated by computing the similarity of the user's access pattern with the pattern of each bicluster.

$$\mu_{ij} = \frac{\mathrm{simi}(s_i, p_j)}{\sum \mathrm{simi}(s_i, p_j)}, \quad i = 1, 2, \ldots, m, \quad j = 1, 2, \ldots, nb \qquad (2)$$

where Si represents the access pattern of the ith user of the bicluster, j specifies a page in the bicluster, n specifies the number of pages in the bicluster, nb represents the number of distinct patterns in the session matrix and simi(si, pj) represents the similarity of each session with pattern j. Cosine similarity [1] is used for computing the similarity of the user with the pattern.
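A minimal Python sketch of the user membership computation in Eqn. (2) is given below. It assumes that the sum in the denominator runs over all nb patterns, so that the memberships of one user sum to 1; the text does not state the summation index explicitly. The function names and the sample vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def user_memberships(session, patterns):
    """Membership of one session in each bicluster pattern, following Eqn. (2).

    Assumption: similarities are normalised over all patterns.
    """
    sims = [cosine(session, p) for p in patterns]
    total = sum(sims)
    return [s / total if total else 0.0 for s in sims]

# Hypothetical session and three bicluster patterns over four pages.
session = [0, 1, 1, 1]
patterns = [[0, 1, 1, 1], [0, 0, 1, 1], [1, 1, 1, 0]]
mu = user_memberships(session, patterns)
print([round(x, 3) for x in mu])
```

The session is identical to the first pattern, so that bicluster receives the largest membership, while the other two biclusters still receive non-zero values because they share pages with the session.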

The proposed Robust Fuzzy biclustering algorithm seems to be an effective tool for web

personalization because the membership of each page is optimized whenever a new user is added

to the bicluster. These memberships are then used as weights for web page prediction.

5.2 Discovery of Aggregate Profiles Based on Fuzzy Biclustering

In this method, the result of Fuzzy biclustering is used for obtaining user profiles. Fuzzy membership values of pages in the page biclusters are used as weights, and low-support page views, i.e., pages with membership values below the threshold value α, are filtered out. The steps for building a user profile based on Fuzzy biclustering are described in Figure 7.

Algorithm: Building user profile based on Fuzzy Biclustering

Input:  A set of biclusters, membership values and threshold α

        nb - Number of biclusters

Output: Set of user profiles UPj, j = 1, 2, ..., nb

Procedure

Figure 7. Building user profiles based on Fuzzy Biclustering

This fuzzy model generates robust profiles because the weights are determined from the membership values of page views in the biclusters, and the membership values are determined by the Fuzzy biclustering technique. The value of α is taken as 0.4.
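The filtering described in this subsection can be sketched as follows. This is only an illustration based on the prose above: the function name `build_profile`, the dictionary representation of a bicluster, and the page names and membership values are all hypothetical.

```python
def build_profile(page_memberships, alpha=0.4):
    """Build one user profile from the page memberships of one bicluster.

    page_memberships maps a page URL to its fuzzy membership value.
    Pages with membership below alpha are filtered out; the remaining
    memberships are kept as profile weights, highest weight first.
    """
    kept = {page: w for page, w in page_memberships.items() if w >= alpha}
    return dict(sorted(kept.items(), key=lambda kv: kv[1], reverse=True))

# Hypothetical memberships for a single bicluster.
bicluster = {"/c.php": 0.40, "/a.php": 0.90, "/b.php": 0.39}
profile = build_profile(bicluster)
print(profile)
```

Applying the function to each of the nb biclusters would yield the set of profiles UPj.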

5.3 Biclustering based Recommendation

Web personalization aims to provide intelligent online services, such as web recommendations, based on past web user navigation patterns. The biclustering based recommendation process is described in Figure 8.


Algorithm: Biclustering Based Recommendation

Input:  Recommendation threshold α, a set of user profiles generated from the biclusters, and

        t - current sub session

        K - number of biclusters to be recommended

        N - number of pages to be recommended from K biclusters

Output: Recommendation vector R

Step 1: Generate integrated user and page biclusters (co-clusters) using the biclustering algorithm.

Step 2: Generate user profiles using the method specified in Figure 2.

Step 3: Compute the similarity between the user's sub session vector t and the generated user profiles.

Step 4: Sort each row of the similarity matrix in descending order of weight.

Step 5: Include N pages from the top K biclusters in the recommendation vector R if weight wti > threshold α.


Figure 8. Biclustering based Recommendation process

In order to provide recommendations, we have to find the biclusters containing users with preferences that have strong partial similarity with the test user. This stage is executed online and consists of two basic operations:

The formation of the test user's neighborhood, i.e., finding the K nearest biclusters.

The generation of the top-N recommendation list of pages.
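The two online operations can be sketched in Python as below. This is a simplified reading of Figure 8 under several assumptions that the text does not fix: profiles are weight vectors over the same page order as the session vector, similarity is cosine similarity, already visited pages are skipped, and a page's score is its largest weight among the K selected profiles. All names and values are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def recommend(active, profiles, pages, K=4, N=4, alpha=0.4):
    """Return up to N pages drawn from the K profiles nearest to the active sub session."""
    # Operation 1: neighbourhood formation, the K profiles most similar to the session.
    nearest = sorted(profiles, key=lambda prof: cosine(active, prof), reverse=True)[:K]
    # Operation 2: collect unvisited pages whose weight exceeds the threshold alpha.
    scores = {}
    for prof in nearest:
        for page, visited, weight in zip(pages, active, prof):
            if not visited and weight > alpha:
                scores[page] = max(scores.get(page, 0.0), weight)  # assumed scoring rule
    top = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:N]
    return [page for page, _ in top]

# Hypothetical data: four pages, three bicluster profiles, one visited page.
pages = ["/a.php", "/b.php", "/c.php", "/d.php"]
profiles = [
    [0.9, 0.8, 0.0, 0.0],
    [0.0, 0.7, 0.8, 0.5],
    [0.0, 0.0, 0.3, 0.9],
]
active = [0, 1, 0, 0]
R = recommend(active, profiles, pages, K=2, N=2, alpha=0.4)
print(R)
```

The third profile shares no page with the active session, so it is not among the two nearest biclusters and contributes nothing to R.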

6. Experimental Evaluation

6.1 Data set 1

The web access logs from http://www.technmantix.com were used for our experiments. Technmantix is a leading IT Services and Solutions company. The raw web log contains nearly 31415 entries. After preprocessing (data cleaning) the web access logs and removing references to image files and style sheet files, a total of 13375 log entries were identified, and after applying data filtering and session identification, 2599 (maximum data set size) user sessions were identified. The total number of URLs representing pageviews was 362, and after eliminating the image files and style sheet files, the total number of remaining pageview URLs in the training and evaluation sets is 113. Approximately 25% of these transactions were randomly selected as the testing set, and the remaining portion was used as the training set for page recommendation.
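A random split of this kind can be sketched as follows. The function name, the fixed seed, and the use of integer identifiers in place of real sessions are assumptions for the example; the paper does not describe how the random selection was implemented.

```python
import random

def split_sessions(sessions, test_fraction=0.25, seed=0):
    """Randomly split sessions into (training set, testing set)."""
    rng = random.Random(seed)          # fixed seed so the split is reproducible
    shuffled = list(sessions)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# 2599 placeholder session identifiers, matching the number of sessions in data set 1.
train, test = split_sessions(range(2599))
print(len(train), len(test))
```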

6.1.1 Evaluation Metrics

The performance of the FB, CB, CDK-Means and co-clustering methods is measured using four standard measures, namely precision, coverage, F1-Measure and MAE, as defined in (Bamshad Mobasher et al. 2002). These measures are adaptations of the standard measures,


precision and recall, often used in information retrieval. MAE is used for measuring the accuracy

of recommendation process. In this context, precision measures the degree to which the

recommendation engine produces accurate recommendations. On the other hand, coverage

measures the ability of the recommendation engine to produce all of the page views that are

likely to be visited by the user. The precision measure represents the ratio of matches between

the recommendation set and the target set to the size of recommendation set. The coverage

measure represents the ratio of matches to the size of the target set.

Suppose we have a transaction t (taken from the evaluation set), viewed as a set of pageviews, and we use a window nn ⊆ t (of size |nn|) to produce a recommendation set R using the recommendation engine. Then the precision of R with respect to t is defined as:

Precision(R, t) = |R ∩ (t − nn)| / |R|     (3)

and the coverage of R with respect to t is defined as:

Coverage(R, t) = |R ∩ (t − nn)| / |t − nn|     (4)
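Equations (3) and (4) translate directly into set operations. The sketch below is illustrative only: the function name, the convention of returning 0 for an empty denominator, and the page names are assumptions.

```python
def precision_coverage(R, t, window):
    """Precision (Eqn. 3) and coverage (Eqn. 4) of recommendation set R for transaction t.

    window is the part of t used to generate R (the set nn in the text);
    the remaining pages t - nn form the target set.
    """
    R = set(R)
    target = set(t) - set(window)
    hits = R & target
    precision = len(hits) / len(R) if R else 0.0          # assumed value for empty R
    coverage = len(hits) / len(target) if target else 0.0  # assumed value for empty target
    return precision, coverage

# Hypothetical transaction of five pages with a window of two pages.
t = ["/a.php", "/b.php", "/c.php", "/d.php", "/e.php"]
window = ["/a.php", "/b.php"]
R = ["/c.php", "/d.php", "/x.php", "/y.php"]
p, c = precision_coverage(R, t, window)
print(p, c)
```

Here two of the four recommended pages fall in the three-page target set, which gives a precision of 2/4 and a coverage of 2/3.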

6.1.2 Parameter Setting

The minimum number of pages and users in a bicluster is set to 2. For CB, co-clustering and CDK-Means, the implicit ratings obtained from the hits of the users on different pages are used as weights, and the weight of each page in the bicluster is determined as per the user profiling algorithm discussed in (Hannah Inbarani et al. 2011). Unless otherwise specified, the default values for the parameters are K = 4, N = 4 and nn = 2. These values were selected after several runs, based on sensitivity analysis, for the best performance in terms of coverage and precision.

6.2 Recommendation results of Fuzzy Biclustering

The recommendation engine takes a collection of user profiles as input and generates a recommendation set by matching the current user's activity against the discovered patterns. We use a fixed-size sliding window over the current active session to capture the current user's history depth. Thus, a sliding window of size n over the active session allows only the last n visited pages to influence the recommendation value of items in the recommendation set. This sliding window is called the active session window.

In each iteration, each user session t in the evaluation set was divided into two parts. The first nn page views were used for generating recommendations, whereas the remaining portion of t (the target set) was used to evaluate the generated recommendations. For the recommendation process we chose a session window size of 2. The recommendation results for a sample navigation path are given in Table 2.
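The active session window and the evaluation split described above can be sketched as two small helpers. The function names and the sample session are hypothetical.

```python
def active_window(session, n=2):
    """Return the last n visited pages of an active session (the active session window)."""
    return session[max(len(session) - n, 0):]

def split_session(t, nn=2):
    """Split an evaluation session into (first nn page views, remaining target set)."""
    return t[:nn], t[nn:]

# Hypothetical evaluation session of five page views.
t = ["/a.php", "/b.php", "/c.php", "/d.php", "/e.php"]
head, target = split_session(t, nn=2)
print(head, target)
print(active_window(t, n=2))
```

The first helper is what an online recommender would apply to a growing session, while the second reflects the offline evaluation protocol, in which the head generates recommendations and the target set scores them.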

Table 2. Recommendation Results for a Typical User Navigation Path

Pages of Active User session     Recommended Web pages     Recommendation score

/make_slides.php

/website-design-

services.php

/consultancy-services- utomation.php

/software-development- ompany.php

/outsourcing.php

/support.php

/website-application- commerce.php

/web_hosting.html

/hosting/livezilla/server.php

/Products/billing-software.php

About-us/TechCmantiX- infrastructure.php

0.4512

0.5218

0.5477

0.5492

0.5492

0.3162

0.4776

0.5715

0.4666

0.5611


/downloads.php

/support.php

/leadership.php

/register.php

/testimonial.php

/careers.php

/support.php

/website-application-ecommerce.php

/content-writing.php

/billing-automation-with- counts.php

/leads-management-system.php

0.7530

0.5610

0.6073

0.6375

0.7218

0.4807

0.3986

6.3 Performance Analysis for Fuzzy Biclustering (FB)

The required input of the algorithm is the minimum number of pages to be included in a bicluster (minp). In order to discover the best biclusters, it is important to fine-tune this input variable. Figure 9 depicts the average number of pages in a bicluster, which increases with increasing minp.

Figure 9. Average number of pages in the bicluster (x-axis: minimum number of pages per bicluster, minp; y-axis: average number of pages)

Impact of Recommended Number of Biclusters

Figure 10 illustrates the values of F1-Measure, Precision and Coverage for varying values of K. As shown, the best performance is attained for K = 2. When only a small number of biclusters that are very similar to the current active session are recommended, the values of F1-Measure, precision and coverage remain high.

Figure 10. Number of Recommended Biclusters versus F1-Measure, Precision and Coverage (x-axis: number of recommended biclusters; y-axis: recommendation measures)

Impact of membership values and Recommendation Threshold α

The recommendation score is computed based on membership values as explained in 6. Figure 11 illustrates the values of F1-Measure, Precision and Coverage for varying values of the recommendation threshold α. As shown, the best performance is attained for α = 0.8 and 1. As the value of α is increased, the values of F1-Measure, precision and coverage also increase.

Figure 11. Recommendation Threshold versus F1-Measure, Precision and Coverage for FB (x-axis: recommendation threshold; y-axis: recommendation measures)

Impact of sub session size nn

Figure 12 illustrates the values of F1-Measure, Precision and Coverage for varying values of nn. As shown, the best performance is attained for nn = 1. When the value of nn is small, precision and F1-Measure remain high, whereas coverage increases as the value of nn increases.

Figure 12. Active session size versus F1-Measure, Precision and Coverage for FB (x-axis: active session size nn)

6.4 Comparative results for effectiveness

In this section, we compare the performance of Robust Fuzzy Biclustering (FB) with conventional biclustering (CB), CDK-Means and spectral co-clustering. The parameters are set as follows: the size of the recommendation list N is 4 (default value), the number of biclusters recommended is 2, and the size of the training set is 75% (default value). The test set consists of all remaining users, i.e., those not in the training set. Users in the test set are the basis for measuring the examined metrics. The performance comparison of FB, CB, CDK-Means and spectral co-clustering using F1-Measure, Precision and Coverage for the maximum data set size is shown in Figure 13.

Figure 13. Comparative analysis of CDK-Means, Co-clustering, CB and FB (x-axis: CDK-Means, Co-clustering, CB, FB; y-axis: recommendation measure)

Table 3 shows the values of Precision, Coverage and F1-Measure for the maximum session size.

Table 3. Comparison between CDK-Means, spectral co-clustering, CB and FB in terms of Precision, Coverage and F1-Measure

Approach                 Precision   Coverage   F1-Measure

CDK-Means                0.49        0.4362     0.4615

Spectral Co-clustering   0.3534      0.7003     0.4698

CB                       0.503       0.7375     0.5981

FB                       0.6512      0.7512     0.6977

In terms of precision, FB outperforms all the other methods (CDK-Means, co-clustering and CB). The precision of CB is higher than that of CDK-Means, and the precision of CDK-Means is higher than that of co-clustering. In terms of coverage, FB performs better than the other methods. The coverage of the co-clustering approach is slightly lower than that of CB and significantly higher than that of CDK-Means. The overall performance is measured using F1-Measure, and FB again performs better than the other methods. The performance of CB in terms of F1-Measure is better than that of co-clustering and CDK-Means. There is only a slight difference in the F1-Measure of CDK-Means and co-clustering. The F1 measure attains its maximum value when both precision and coverage are maximum.
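As a quick consistency check, the F1-Measure column of Table 3 can be recomputed from the precision and coverage columns, taking F1 as the harmonic mean 2PC / (P + C), which is the standard definition and is assumed here to be the one used. The dictionary layout and variable names are illustrative.

```python
def f1_measure(precision, coverage):
    """Harmonic mean of precision and coverage (0 if both are 0)."""
    total = precision + coverage
    return 2 * precision * coverage / total if total else 0.0

# (precision, coverage, reported F1) taken from Table 3.
table3 = {
    "CDK-Means": (0.49, 0.4362, 0.4615),
    "Spectral Co-clustering": (0.3534, 0.7003, 0.4698),
    "CB": (0.503, 0.7375, 0.5981),
    "FB": (0.6512, 0.7512, 0.6977),
}

differences = {}
for name, (p, c, reported) in table3.items():
    computed = f1_measure(p, c)
    differences[name] = abs(computed - reported)
    print(f"{name}: computed F1 = {computed:.4f}, reported F1 = {reported:.4f}")
```

The recomputed values agree with the reported ones to within about 1e-4, a gap that is consistent with the inputs having been rounded to four decimal places.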

Measure of Accuracy

The performance measure MAE indicates the degree of deviation of the user's desired pages from the recommended set of pages. The MAE of the various biclustering approaches for the maximum size of data set 1 is shown in Figure 14.

Figure 14. Mean Absolute Error (MAE) (y-axis: MAE value)

Complexity of FB

The Robust Fuzzy Biclustering algorithm can be shown to have a complexity of O(m × n × nb × τ), where m is the number of rows of the session matrix S, n is the number of columns in S, nb is the number of biclusters and τ is the number of iterations taken for convergence.

Impact of Test data size

Training/Test data size: Now we test the impact of the size of the training set, which is expressed as a percentage of the total data set size. The results for F1 are given in Figure 15. As expected, when the training set is small, performance degrades for FB. Therefore, adequately large training sets should be used.

Figure 15. F1-Measure for various Training set sizes (x-axis: training set size; y-axis: F1-Measure)

6.5 Data set 2

For the purpose of evaluating the performance and effectiveness of the FB algorithm, experiments were conducted with the preprocessed web access logs of www.microsoft.com, which are available in the UCI repository [http://www.ics.uci.edu/]. This log file records the use of www.microsoft.com by 5000 anonymous, randomly selected users who visited the web site in a one-week timeframe in February 1998, with an average of 5.7 page views per user. The file contains no personally identifiable information. The visits in this data set are recorded in time order, and no preprocessing is required since the data set is given in sessions. The 294 web pages are identified by their title (e.g. "NetShow for PowerPoint") and URL (e.g. "/stream"). The algorithms are applied only to the testing instances available in the UCI repository, taking 294 web pages and 5000 (maximum data set size) users.

6.5.1 Comparative results for effectiveness

In this section, we compare the performance of Robust Fuzzy Biclustering (FB), conventional biclustering (CB), CDK-Means and spectral co-clustering using data set 2. The optimum values of the parameters are set to K = 6, N = 4 and nn = 3 after performing sensitivity analysis.


Precision of page recommendation

The quality of the page recommendation results on data set 2 for 5000 users and 294 web pages is measured by precision, coverage and F1-Measure and is shown in Figure 16.

Figure 16. Precision of Page recommendation (x-axis: FB, CB, CDK-Means, Co-clustering)

It can be observed from Figure 16 that FB performs better than the other biclustering approaches.

Accuracy of Page recommendation

The accuracy of the page recommendation results for data set 2 is illustrated in Figure 17. It can be observed from the figure that FB yields a lower MAE than the other biclustering approaches.

Figure 17. Accuracy of Page recommendation (x-axis: FB, CB, CDK-Means, Co-clustering; y-axis: MAE)


Conclusion

The target of personalization based on web usage data is to compute a recommendation set for the current user session based on the user's past navigation patterns. In this paper, a new personalized recommendation method based on biclustering is proposed to improve web-personalized recommendation. An extensive experimental comparison of the Robust Fuzzy Biclustering approach is made with CB, CDK-Means and spectral co-clustering using the recommendation measures MAE, precision, coverage and F1-Measure. This work improves precision and coverage and reduces MAE at the same time.

We highlight the following observations from our examination:

Our biclustering approaches show significant improvements over existing user and page clustering algorithms in terms of effectiveness, because they exploit the duality of users and pages.

In our experiments, FB slightly outperforms the other biclustering approaches in terms of accurate recommendations. The reason is that the weights are computed based on the membership values of pages in the bicluster and are optimized based on the number of users and their access patterns, thereby making FB more suitable for recommendation.

Summarizing the aforementioned conclusions, it can be inferred that the proposed Fuzzy biclustering algorithm performs better than the existing biclustering algorithms. Hence, the Robust Fuzzy Biclustering approach provides better web-personalized recommendation than other biclustering approaches and is more suitable for web sites in which users navigate through the web pages with much uncertainty.

ACKNOWLEDGEMENT

We thank the anonymous reviewers and editors for their valuable suggestions and ideas, which helped us improve the paper.

REFERENCES


Bamshad Mobasher, Honghua Dai, Tao Luo, Miki Nakagawa: Discovery and Evaluation of Aggregate Usage Profiles for Web Personalization. Data Mining and Knowledge Discovery (6) (2002) 61–82.

Cheng, Y., Church, G.: Biclustering of expression data. In: Proceedings of the ISMB Conference (2000) 93–103.

Dhillon, I.S.: Co-clustering documents and words using bipartite spectral graph partitioning. In: Proceedings of the ACM SIGKDD Conference (2001).

Doru Tanasa, Brigitte Trousse, Advanced Data Preprocessing for Intersites web usage mining. IEEE Intelligent Systems (2004) 59-65.

Dimitrios Pierrakos, Georgios Paliouras, Christos Papatheodorou and Constantine D. Spyropoulos, Web Usage Mining as a Tool for Personalization: A Survey, User Modeling and User-Adapted Interaction 13: 311-372, 2003.

Guandong Xu, Web Mining Techniques for Recommendation and Personalization. Ph.D. Thesis (2008).

Gunter Grieser, Yuzuru Tanaka and Akihiro Yamamoto, Discovery of Web Communities from Positive and Negative Examples, Discovery Science, In: Proceedings of 6th International Conference, DS 2003, Sapporo, Japan, October 17-19, Springer-Verlag, pp: 369-376 (2003).

Hammouda, K.M., Kamel, M.S., Efficient phrase-based document indexing for Web document clustering, IEEE Transactions on Knowledge and Data Engineering 10(6) (2004) 1279–1296.

Hannah Inbarani, H., Thangavel, K., A Robust Biclustering Approach for Effective Web Personalization, Visual Analytics and Interactive Technologies: Data, Text and Web Mining Applications (2011).


Hartigan, J.A., Direct clustering of a data matrix, Journal of the American Statistical Association 67(337) (1972) 123–129.

Jayson E. Rome and Robert M, Towards a formal concept analysis approach to exploring communities on the WWW, B. Ganter and R. Godin (Eds), ICFCA (2005), LNAI 3403, pp: 33-48, 2005.

Jian-Guo Liu, Wei-Ping Wu, Web Usage Mining For Electronic Business Applications, In: Proceedings of the Third International Conference on Machine Learning and Cybernetics, Shanghai (Aug 2004) 57-63.

Li, H.Y., Xie, C.S., Liu, Y., A new method of prefetching I/O requests, In: Proceedings of International Conference on Networking, Architecture and Storage, Guilin, China (July 2007) 217–224.

Liu, X., He, P., Yang, Q.: Mining user access patterns based on web logs, Canadian Conference on Electrical and Computer Engineering, May, Saskatoon Inn, Saskatoon, Saskatchewan, Canada 2280–2283 (2005).

Long, B., Zhang (Mark), Z., Yu, P.S., Co-clustering by block value decomposition, In: Proceedings of the eleventh ACM SIGKDD International Conference on Knowledge discovery in data mining. ACM Press, New York (2005) 635–640.

Mirkin, B.: Mathematical classification and clustering, Kluwer Academic Publishers, Dordrecht (1996).

Nakagawa, M., Mobasher, B., A hybrid web personalization model based on site connectivity, In the fifth international WEBKDD workshop: Web mining as a premise to effective and intelligent web applications (2003) 59–70.


Nasraoui, O., Soliman, M., Saka, E., Badia, A., Germain, R., A web usage mining framework for mining evolving user profiles in dynamic websites, IEEE Transactions on Knowledge and Data Engineering, 20(2) (2008) 202–215.

Olfa Nasraoui, Chris Petenes, An Intelligent Web Recommendation Engine Based on Fuzzy Approximate Reasoning, In: Proceedings of the IEEE International Conference on Fuzzy Systems, 1116-1121 (2003).

Pallis G., VakaliKoutsonikola A, Insight and perspectives for content delivery networks, Communications of the ACM 49(1) (2006) 101–106.

Pablo A. D. de Castro, Fabrício 2007, Applying Biclustering to Text Mining: An Immune-Inspired Approach, In: ICARIS, Vol. 4628, Springer (2007), pp. 83-94.

Panagiotis Symeonidis, Alexandros Nanopoulos, Apostolos N. Papadopoulos and Yannis Manolopoulos, Nearest-biclusters collaborative filtering based on constant and coherent values, Information Retrieval, 1(11), pp. 51-75.

Rosario Girardi A., Leandro Balby Marinho E., A domain model of web recommender systems based on usage mining and collaborative filtering, International Journal of Requirements Engineering (2007) 12: 23–40, Springer Verlag, London.

Ruggero G. Pensa, Celine Robardet, Jean-François Boulicaut, A Bi-clustering Framework for Categorical Data, A. Jorge et al. (Eds.): PKDD 2005, LNAI 3721, Springer Verlag, Berlin Heidelberg (2005) 643–650.

Sarabjot Singh Anand, Bamshad Mobasher, Intelligent Techniques for Web Personalization. ACM Transactions on Internet Technology 7(4) (October 2007).

Sergio Flesca, Sergio Greco, Andrea Tagarelli, Ester Zumpano, Mining User references, Page Content and Usage to Personalize website navigation, World Wide Web: Internet and web information systems, 8, 317–345, 2005, Springer Science + Business Media, Inc.

Srinivasa N., Medasani S., Active fuzzy clustering for collaborative filtering, In: Proceedings of IEEE International Conference on Fuzzy Systems, July, Budapest, Hungary (2004) 1607–1702.

Sung Ho Ha, Helping Online Customers Decide through Web Personalization, IEEE Intelligent Systems (2002) 34-43.

Tang, C., Zhang, L., Zhang, I., Ramanathan, M., Interrelated two-way clustering: An unsupervised approach for gene expression data analysis, In: Proceedings of the 2nd IEEE Int. Symposium on Bioinformatics and Bioengineering (2001) 41–48.

Tsuyoshi Murata, DOI: 10.1007, Discovery of Web Communities from Positive and Negative Examples, Lecture Notes in Computer Science, 2003, Volume 2843/2003, 369-376.

Vassiliki A. Koutsonikola and Athena I. Vakali (2009), A fuzzy bi-clustering approach to correlate web users and pages, Int. J. Knowledge and Web Intelligence, 1(2), 3-23.

Zidrina Pabarskaite & Aistis Raudys, A process of knowledge discovery from web log data: Systematization and critical review, Springer, Journal of Intelligent Information Systems (2007) 28: 79–104.
