PACE: Prefetching and Filtering of Personalized Emails at the Network Edges (Technical Report: CS-MIST-TR-2003-005)

Jayashree Ravi†, Weisong Shi†, and Chengzhong Xu¶

† Department of Computer Science, Wayne State University ¶ Department of ECE, Wayne State University

{jravi123,weisong}@wayne.edu [email protected]

Abstract

In this paper, we present a new technique for generating personalized dynamic pages that display emails, using pre-fetching and caching at the edges of the network. Our approach is to pre-fetch emails from all of an individual's email accounts, cache the necessary templates of the service provider, and create the personalized dynamic page at the edge of the network. We have proposed, developed, and evaluated a prototype, comparing it with popular HTTP-based free email servers, and shown that this technique improves user-perceived latency by up to 93% and saves up to 86% of bandwidth for the most popular email sizes. We also propose centralized spam management at the proxy level, which helps both the user and the service provider eliminate spam at the origin server and thus saves the bandwidth that spam emails would otherwise consume.

1 Introduction

Dynamic and personalized content delivery has recently attracted a great deal of attention from both the commercial and research communities. One reason is the growing popularity of dynamic Web services, exemplified by news sites and personalized sites (e.g., my.yahoo.com), both of which require dynamic generation of content. The other reason is the “trickle-down” effect [10, 15] of widely deployed proxy caches and content delivery networks (CDNs), which effectively filter incoming requests for static Web content and presumably shift the traffic seen on the Internet from popular static objects to less popular objects and dynamic Web content.

Dynamic Web content can be broadly classified into two types. The first type, which we henceforth call simply dynamic pages, is generated without taking the user's session into account. Such a page does not need to know who has accessed it, and at any instant the page generated is the same for every user. The second type, which we call personalized dynamic pages, is generated when the user accesses the page through a secured system. In this case the dynamic page is tailor-made for each user.

The most popular personalized dynamic page today is the one generated for displaying emails. For example, Hotmail.com alone hosts about 84 million free email accounts, whose holders are spread across the globe. Our previous analysis of a medium-sized personalized Web site shows that 90% of the accesses to the site are for emails only [27]. Today most of us have at least one Web-based email account. A Web-based (a.k.a. HTTP-based) email account has the advantage that it can be accessed from any place with an Internet access point. However, for every email access the user is logged on to the main HTTP email servers, and conventional Web caching and CDNs are of little help. Because the page is personalized and needs authentication, Akamai [1] servers simply pass the entire request to the origin servers and pass the response back to the user without caching any data in between. Several techniques have been proposed to cache dynamic pages, as described in Section 2.1. Personalized pages could be cached with these techniques; however, this requires additional per-user information to be stored at the cache. This is the method adopted in CONCA [26].

From the account holder's point of view, a user may have numerous email accounts with different service providers, and the emails belonging to him/her are stored in geographically distant places spread across the globe. However, all this data must eventually flow towards the user, which happens when the user tries to read his/her emails. The user also has to access each email account separately, with a separate authentication for each, even though all accounts deliver the same kind of information, namely emails with attachments, which are sent over SMTP at the back end. The user further has to manage a separate spam list at each email account, even though the lists may be identical and could be copied to the other servers instead of being entered at each server separately, as is done today.

From these observations, we conclude that the caching and generation of personalized pages for emails has to be addressed with a dedicated technique that coherently handles all the common features that make up the email page while leaving enough room for diversification to accommodate the service providers' requirements. In this paper we propose such a technique and develop a prototype called PACE to generate these pages at the edges of the network. PACE is built on five concepts: (1) capture the user's details, pre-fetch the most recent emails from all the email accounts of each user, and store them separately; (2) cache the different templates of the various email service providers; (3) centralized spam management to help the user eliminate spam at the user's origin email servers; (4) a single authentication per user to access all his/her email accounts; (5) dynamically decide the placement of per-user information in the proxy closest to the (possibly nomadic) user, based on the user's access patterns.

We have built and evaluated a PACE prototype. The evaluation results show the following benefits for the most common email sizes: (1) an improvement in user-perceived latency of between 5% and 93%; (2) bandwidth savings of between 75% and 86% from caching the basic templates for email alone. Beyond these measured benefits, we also expect the following: (1) the availability of the service improves because the application is close to the end user; (2) network problems between the proxy and the origin server would not affect the system; (3) the load on the origin servers is reduced, which improves their scalability; (4) additional bandwidth is saved by eliminating spam at the origin server itself.

The rest of this paper is organized as follows. In Section 2, we describe the operation of a regular HTTP email system. The design and implementation of the PACE architecture are described in Sections 3 and 4, respectively. Section 5 presents the performance evaluation. Section 6 discusses related work, and we summarize in Section 7.

2 Background

2.1 Caching Methods for Dynamic Pages

Present-day techniques for caching dynamic Web content can be broadly classified into two methods, content caching and function caching, either of which can be used for server-side or cache-side proxies. In content caching, the HTML page that the application generates is cached as separate fragments/channels. These fragments/channels are maintained as separate objects in the edge server's cache and are dynamically assembled into Web pages using XML/JavaScript-type languages, fetching only the non-cacheable or expired fragments/channels from the origin server in response to user requests. The origin server supports this assembly, and the information exchanged between the origin server and the proxy is XML-type data. Representatives of this approach include Akamai [1], CONCA [26], and Client Side Include (CSI) [24]. All of them are built upon the edge-side include (ESI) technology [29].
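The fragment-based assembly described above can be sketched as follows. This is a minimal illustration under our own assumptions, not the mechanism of any of the cited systems: the class name FragmentCache, the fetch_from_origin callable, and the convention that a time-to-live of zero marks a non-cacheable fragment are all hypothetical.

```python
import time


class FragmentCache:
    """Edge cache holding page fragments, each with its own expiry time."""

    def __init__(self, fetch_from_origin, clock=time.time):
        # Hypothetical callable: fragment name -> (body, ttl_seconds).
        self._fetch = fetch_from_origin
        self._clock = clock
        self._store = {}  # fragment name -> (body, expires_at)

    def get(self, name):
        entry = self._store.get(name)
        if entry is not None and entry[1] > self._clock():
            return entry[0]            # still fresh: served from the edge
        body, ttl = self._fetch(name)  # missing or expired: ask the origin
        if ttl > 0:                    # assumption: ttl <= 0 means non-cacheable
            self._store[name] = (body, self._clock() + ttl)
        return body

    def assemble(self, fragment_names):
        """Build the page by concatenating the fragments in order."""
        return "".join(self.get(n) for n in fragment_names)
```

Only the fragments that are missing, expired, or non-cacheable cause a request to the origin; everything else is served from the edge.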

In function caching, the application itself is replicated and cached along with its associated applications, so that the edge servers run the applications instead of the origin server. What is exchanged between the origin server and the proxy is the application itself. Examples of this method are vMatrix [4], IBM WebSphere [16], Active Cache [6], Gemini, and SEE [19].


2.2 Basic HTTP Email Architecture

A typical HTTP-based email system is shown in Figure 2(a). The user logs in to the HTTP server through a secure protocol such as HTTPS. The HTTP server is in turn connected to a back-end mail server on which the user has an account. The back-end mail server exchanges mail with other mail servers through SMTP.

Another way of viewing emails is through the POP protocol, by configuring an email user agent such as Outlook Express on the client machine. This is a disadvantage for a user who accesses emails from different machines.

Spam handling: In HTTP email systems, spam protection is handled at the origin mail servers, and each of the user's email accounts with the different email servers has to be configured separately for spam. With user agents that use the POP protocol, a spam email is first downloaded to the client machine, and only then does the user agent classify it as spam, by which time network bandwidth has already been consumed.

Generic content of an email page: A typical email page consists of three parts: (1) The basic email belonging to the user. (2) HTML objects that the email service provider embeds with the email, which arrive together with the email as one single page with a single session key associated with the set. These objects make up 18 KB to 40 KB of data, as observed on popular email Web sites such as Yahoo and Hotmail. We refer to this as the basic template in this paper. (3) Objects referenced through ‘src’ attributes in the basic template, which are typically used to retrieve images through separate HTTP GET requests that may or may not be associated with the particular session.

Even if the email itself is 1 byte, a page of 18-40 KB is generated to carry this 1 byte of wanted information. This 18-40 KB of data can be kept at the edge server if the page is generated there instead of at the origin server. Across millions of users, saving 18-40 KB per email will significantly reduce the Internet traffic between the edge server and the origin server. Many solutions are available today for caching part (3) objects; however, we have not come across solutions that address parts (1) and (2). Our contribution in this paper is to pre-fetch and cache parts (1) and (2) and generate the page at the edge.

Figure 1: Email size distribution. [Plot: CDF (0 to 1) versus email size in bytes (1k to 13k).]

The emails that a user receives vary in size depending on what the sender has sent. Whatever the size of the email, the template added by the service provider is of a fixed size. To understand the size distribution of emails, we collected the sizes of the emails that our group members received in the inboxes of their email accounts. The size distribution is shown in Figure 1. More than 90% of the emails that our group members received are smaller than 6 KB, of which more than 80% are smaller than 3 KB. However, the basic template added to each email is between 18 KB and 40 KB. This data reinforces our motivation to generate the dynamic page at the edge by caching the template.
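To make the proportions concrete, the short sketch below computes the share of a page's bytes that belongs to the basic template for a few email sizes. It is back-of-the-envelope arithmetic over the sizes quoted above, not the measurement procedure used in our evaluation; the function name template_share is our own.

```python
def template_share(email_kb, template_kb):
    """Fraction of the page bytes that belong to the provider's basic template."""
    return template_kb / (email_kb + template_kb)


# Email sizes (1, 3, 6 KB) and template sizes (18, 40 KB) taken from the text.
for email_kb in (1, 3, 6):
    for template_kb in (18, 40):
        share = template_share(email_kb, template_kb)
        print(f"email {email_kb} KB + template {template_kb} KB: "
              f"{share:.0%} of the page is template")
```

For an 18 KB template, for instance, a 6 KB email yields a page that is 75% template, and a 3 KB email one that is about 86% template.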

3 Design of PACE

The flow of email in a PACE system is shown in Figure 2(b). As the figure shows, instead of accessing the origin HTTP mail server, the user now accesses a PACE account to check his/her emails, and PACE in turn pre-fetches the emails and caches the templates and other associated objects to generate the email page for the user.

3.1 Content Design

The content of the HTML page generated by PACE is designed with two possible customers in mind:

Figure 2: (a) Flow of emails in regular HTTP/POP email systems; (b) Flow of email in a PACE system.

1. Email service provider: The email service provider can use the services of an Internet service provider that hosts PACE (similar to outsourcing to a CDN) to pre-fetch emails and cache the email service provider's templates, which are plugged in when the emails are delivered to the user. The user could have a free account with PACE as well as with the email service provider, and views different templates for different accounts. This use is in line with the concept of pre-fetching and caching dynamic pages at the edges of the network, for which PACE is primarily designed.

2. User: If the email service provider does not wish to use the services of PACE, the end user can. The end user can create an account with the Internet service provider and pay for the services of a PACE account. In this case PACE simply retrieves the emails and displays them on an HTML page without any advertisements from the email service provider. The user gets all the benefits of an HTTP email account, and PACE works as a Web-based user agent with improved accessibility, latency, and availability, and with additional edge services (e.g., virus scanning and spam management) provided by the service provider.

The Internet service provider (ISP) is the agency that hosts PACE; hence we have not considered the ISP as a customer. However, if an ISP intends to plug in its own advertisements with the emails instead of those of the email service provider, it can do so. In this case the user can open an account with the ISP for free, and the ISP adds its advertisements to the emails retrieved from the email service provider and generates the HTML page. All the other edge services could also be provided with it as a complete package to the user.

With these options, PACE can generate three types of HTML pages for displaying emails, depending on the type of the user's account: (1) an HTML page with emails and advertisements from the email service provider; (2) a simple HTML page with only the user's emails and no advertisements; (3) an HTML page with emails and advertisements from the Internet service provider that hosts PACE. A page could also mix any of the three, depending on which agency is paying for the services of PACE.
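The choice among the three page types can be sketched as a template lookup followed by plugging in the pre-fetched emails. This is an illustrative sketch only: our prototype generates pages with JSP, and the names PageType, PLAIN_TEMPLATE, and render_page, as well as the `$emails` placeholder, are assumptions made for this example.

```python
from enum import Enum
from html import escape
from string import Template


class PageType(Enum):
    PROVIDER_ADS = 1  # template and advertisements of the email service provider
    PLAIN = 2         # the user's emails only
    ISP_ADS = 3       # advertisements of the ISP that hosts PACE


# Hypothetical minimal template for the advertisement-free page.
PLAIN_TEMPLATE = Template("<html><body>$emails</body></html>")


def render_page(page_type, emails, provider_template=None, isp_template=None):
    """Plug the pre-fetched emails into the template chosen by the account type."""
    template = {
        PageType.PROVIDER_ADS: provider_template,
        PageType.PLAIN: PLAIN_TEMPLATE,
        PageType.ISP_ADS: isp_template,
    }[page_type]
    if template is None:
        raise ValueError(f"no cached template for {page_type.name}")
    # Escape the email text so that it cannot inject markup into the page.
    body = "".join(f"<p>{escape(text)}</p>" for text in emails)
    return template.substitute(emails=body)
```

A mixed page, as mentioned above, would correspond to a cached template that combines elements from more than one agency.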

A PACE proxy does not include a mail server. Hence all email accounts have to be created on the respective mail servers of the email service providers, and the account information has to be entered in PACE. To send mail, the user has to access the respective mail server where he/she has an account; PACE provides links that redirect the user to the email service provider's Web site. The PACE design could be extended to other templates too; for example, the templates used for sending emails could also be cached.

3.2 Support for Dynamic Placement of Per User Information

Any user who has created a proxy account may access PACE from many different locations. For example, a student may access the account from home in the mornings and evenings and from school during the daytime. Similarly, an office worker would access it from the office during the daytime. We assume that most users operate from different locations within a radius for which a single proxy is probably the closest to both locations. However, if the locations fall under different PACE proxies, as shown for user 1 in Figure 3(a), then each proxy that is close to one of the user's access points (i.e., home and school/office) becomes a child PACE proxy for the user. Generally the two Internet access points belong to different networks. A parent PACE proxy is identified that is higher up in the network hierarchy and is connected to all the child proxies. This parent proxy keeps all the pre-fetched emails of the user, and the home account of the user is established there. Each child caches the other common cacheable components that make up the dynamic page. The email page is generated at the child proxy by fetching emails from the parent proxy and plugging in the common components at the child. To improve latency, the emails could be replicated from the parent proxy to the child proxies. We intend to study this separately to find the best method of caching/replicating data between child and parent.

Figure 3: (a) Distributed caches dynamically identify the parent node. (b) Structure showing shared and personalized portions.

However, if the user temporarily moves away from the parent zone, we extend the nomadic support of CONCA [26] to handle the situation. Each CONCA node stores per-user state, which allows the state to be recreated on another proxy node that the user is currently close to. When a user travels away from his/her home cache, requests from client applications are routed to whichever cache is nearest to the user's current location. This new cache contacts the user's home cache to obtain the state associated with the user. It can then satisfy the user's requests more efficiently by reusing locally cached content and pre-fetching personalized content from the user's home cache. Each proxy node stores shared and personalized information separately, as shown in Figure 3(b).
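The nomadic case can be illustrated with a small sketch in which the nearest node pulls the user's state from the home node on a miss. This is a simplification under our own assumptions and does not reproduce the CONCA protocol: the class ProxyNode, the function serve, and the home_of lookup table are hypothetical names.

```python
class ProxyNode:
    """A proxy node with separately stored shared and personalized portions."""

    def __init__(self, name):
        self.name = name
        self.shared = {}        # templates and other objects common to all users
        self.personalized = {}  # user id -> per-user state (e.g. pre-fetched emails)


def serve(user_id, nearest, home_of):
    """Return the user's state from the nearest node, pulling it from home on a miss."""
    if user_id not in nearest.personalized:
        home = home_of[user_id]
        # Copy rather than alias, so the visited node keeps its own temporary replica.
        nearest.personalized[user_id] = dict(home.personalized[user_id])
    return nearest.personalized[user_id]
```

The shared portion needs no such transfer, since every node already caches the common templates.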

3.3 Authentication

We extend the CONCA proxy to handle user authentication. The edge server itself can handle trust for the purpose of email retrieval instead of passing the HTTPS request on to the origin server. The user enters the login information for each of his/her email accounts into the PACE database, and PACE uses it to pre-fetch emails. The pre-fetched emails are stored in PACE. We propose a single authentication at PACE for the end user to view the emails of all of his/her email accounts. PACE in turn keeps the authentication information of each individual email account and uses it to pre-fetch the emails for the user. Generally, when a user checks email on one account, he/she would like to check all the email accounts in one go. Hence a single authentication at the edge proxy server saves the time and bandwidth needed for authentication at the individual origin email servers. However, when the user needs to enter credit card information or perform other more trust-dependent activities, such as using a shopping cart, a link is provided that takes the user to the higher degree of trust management at the origin servers.
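The single-authentication idea can be sketched as one PACE login that guards the credentials of several origin accounts. The sketch below is an assumption-laden illustration, not our prototype's code: the class PaceAccount and its methods are hypothetical, and the use of PBKDF2 for the PACE password is our own choice for the example.

```python
import hashlib
import hmac
import os


class PaceAccount:
    """One PACE login guarding the credentials of several origin email accounts."""

    def __init__(self, password):
        self._salt = os.urandom(16)
        self._hash = self._derive(password)
        # mail server -> (login, password). Kept in plain form only for brevity;
        # a real deployment would have to protect these at rest.
        self._origin_credentials = {}

    def _derive(self, password):
        return hashlib.pbkdf2_hmac("sha256", password.encode(), self._salt, 100_000)

    def add_origin_account(self, server, login, password):
        self._origin_credentials[server] = (login, password)

    def login(self, password):
        """Return all origin credentials if the single PACE password matches."""
        if not hmac.compare_digest(self._hash, self._derive(password)):
            raise PermissionError("wrong PACE password")
        return dict(self._origin_credentials)
```

After one successful login, the proxy has everything it needs to reach each origin account on the user's behalf, which is what removes the per-account authentication step.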

Pre-fetching: When a user creates a profile in the proxy, a separate ID is created for the user and is used to create a per-user directory in the proxy, which stores emails and attachments in the personalized portion of the proxy as shown in Figure 3(b). A pre-fetching program called Mail Fetcher runs as a daemon process and checks for new emails at regular intervals. All emails are stored in the per-user directories and the database. After a certain period of time the cache may fill up with the user's emails, and both the origin server and the cache would hold the same set of data. Our intention is not to duplicate the storage of emails. We assume that most people want to see only the latest emails and not the emails they have already read. Once the user logs in and views emails in the cache, they are marked for deletion, and after a time interval they are deleted, which conserves space at PACE. The origin server, however, still holds all the emails, both new and old, and is therefore the primary storage for all the user's emails. When the user wishes to see an old email, the required mail is fetched at that instant and the email page is generated. Generating this page takes longer than viewing any of the latest emails, but we assume that the time would be close to the latency of accessing it from the origin server directly.
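The mark-for-deletion policy described above can be sketched as follows. This is a minimal in-memory illustration with hypothetical names (EmailCache, grace_seconds, fetch_from_origin); the prototype itself stores emails in per-user directories and a database.

```python
class EmailCache:
    """Per-user cache that drops emails some time after they have been viewed."""

    def __init__(self, grace_seconds):
        self.grace = grace_seconds
        self._emails = {}     # message id -> body
        self._viewed_at = {}  # message id -> time at which the user viewed it

    def store(self, msg_id, body):
        """Called by the pre-fetcher for every newly fetched email."""
        self._emails[msg_id] = body

    def view(self, msg_id, now, fetch_from_origin):
        if msg_id not in self._emails:
            # Old email already purged: fetch it from the origin on demand.
            self._emails[msg_id] = fetch_from_origin(msg_id)
        self._viewed_at[msg_id] = now  # mark for deletion
        return self._emails[msg_id]

    def purge(self, now):
        """Delete viewed emails whose grace period has elapsed; return their ids."""
        expired = [m for m, t in self._viewed_at.items() if now - t >= self.grace]
        for m in expired:
            del self._emails[m]
            del self._viewed_at[m]
        return expired
```

Unread emails are never purged in this sketch, which matches the assumption that users mainly want the latest, unseen messages served from the edge.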

3.4 Spam Filtering

Recent studies have shown that 40% of all emails today are spam [2], and the related cost is escalating tremendously [17]. We propose to address this problem in two ways. In the first method, the user can mark any email from any account as spam; this is logged in the database and used to take appropriate action while pre-fetching emails from the origin server. Hence identifying spam once is enough to handle it in all the accounts.

However, identifying the sender's email ID alone is not enough to handle spam. At present, most spam emails are sent with either an invalid or a false sender address, and sometimes the sender address of the spam is that of the user himself/herself. To handle this, we propose to develop a spam management system that studies the user's behavior to identify spam. An agent deployed for each user would regularly study the emails that the user deletes, with or without reading them. If the user has deleted a particular email from a particular account, the agent searches for the same content in the emails of the other accounts. If it finds another email whose content matches by more than a threshold value α, it classifies that email as spam. In this case we intend to focus on the content of the email rather than the sender's email ID. The information from one user's agent can be merged with the results of other users' agents, and putting all the users' data together can help determine spam at a higher level of abstraction. This information on spam, collected at every node individually, can be exchanged with other proxy nodes to tackle spam effectively. Every identified spam mail is assigned a weight based on these methods, and the mail with the highest weight is circulated among the other caches. Identified spam can be deleted at the origin server itself by sending control signals from PACE, instead of using bandwidth to fetch the email to the proxy. This would be especially advantageous for users who use user agents to fetch mail. However, self-detecting spam management is not incorporated in our prototype at present.
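Since the content-matching agent is not part of our prototype, the following is only a sketch of how the threshold test could look. The similarity measure is an assumption: we use the Jaccard similarity of word sets as a stand-in, and the names words, similarity, and classify_spam, as well as the default α of 0.8, are hypothetical.

```python
import re


def words(text):
    """Lower-cased set of word tokens in the text."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def similarity(a, b):
    """Jaccard similarity of the two word sets, between 0.0 and 1.0."""
    wa, wb = words(a), words(b)
    if not wa and not wb:
        return 1.0
    return len(wa & wb) / len(wa | wb)


def classify_spam(deleted_bodies, candidates, alpha=0.8):
    """Return ids of candidate emails matching a deleted email by more than alpha."""
    return [
        msg_id
        for msg_id, body in candidates.items()
        if any(similarity(body, d) > alpha for d in deleted_bodies)
    ]
```

Here deleted_bodies would hold the emails the user deleted in one account and candidates the emails waiting in the other accounts; any other content-similarity measure could be substituted for the Jaccard score.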

4 Implementation

We have designed and developed a prototype in Java, using JSP and Servlets for the user interface. The Jakarta Tomcat 3.2.3 server [3] is used to generate the JSP pages. We use MySQL [22] as the back-end database to store emails and user information.

Our prototype has two main modules. The first is the user interface, which enables the user to create an account with our edge server proxy, PACE. During account creation, the user enters a login and password for the proxy and the details of all his/her email accounts, including the authentication details for each account. For example, if the user has a WSU account, the user enters smtp.wayne.edu as the mail server URL for retrieving mail, together with the login and password for the WSU account. This per-user information is stored in the proxy and used to pre-fetch emails from the origin server. After account creation, the user can log in to his/her account as shown in Figure 4(a) and read emails by clicking on the links provided for each account, without a separate authentication for each account, as shown in Figure 4(b).

Figure 4: A snapshot of dynamic pages generated from PACE. (a) Login page; (b) JSP page generated using the Hotmail template.

Figure 5: Sequences of operations: (a) user interface; (b) Mail Fetcher.

The second module is the Mail Fetcher (see Figure 5(b)), which periodically pre-fetches the emails of all users for all the accounts they have configured for pre-fetching. When the Mail Fetcher checks for mail, it ensures that mails that have already been fetched are not fetched again. This check is done before the mail is retrieved from the server, so no extra bandwidth is spent on emails that were fetched earlier. The complete information about the fetched mails is stored in the MySQL database, and attachments are saved as files on the hard disk in per-user directories.

The inputs for making these personalized dynamic pages are the emails pre-fetched through SMTP and the templates of the email service provider or the ISP, which are stored on the hard disk.

In Tomcat, the class files are compiled the first time a page is invoked. On subsequent invocations, even though the underlying data in the database changes, the page takes less time than on the first invocation because the pre-compiled Java classes are reused. However, we would like to emphasize that how the dynamic pages are created is purely the proxy vendor's choice, and there is no binding between the proxy application used to generate the pages and the origin mail server application.
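The Mail Fetcher's check for already-fetched mail can be sketched as below. The sketch assumes that the mail server exposes a POP3-style unique-id listing (lines of the form "number unique-id", as returned by the UIDL command), which is our assumption rather than a description of the prototype; the functions new_messages and fetch_once and their callable parameters are hypothetical.

```python
def new_messages(uidl_lines, seen_uids):
    """Given UIDL listing lines (b'<number> <unique-id>'), return the
    (message number, unique id) pairs that have not been fetched before."""
    pending = []
    for line in uidl_lines:
        number, uid = line.decode("ascii").split(maxsplit=1)
        if uid not in seen_uids:
            pending.append((int(number), uid))
    return pending


def fetch_once(list_uids, retrieve, seen_uids, store):
    """One polling pass: download only the unseen messages and remember their ids."""
    for number, uid in new_messages(list_uids(), seen_uids):
        store(uid, retrieve(number))  # the body is downloaded only at this point
        seen_uids.add(uid)
```

Because the comparison uses only the short id listing, a message that was fetched in an earlier pass is never downloaded again. With Python's poplib, list_uids and retrieve could be thin wrappers around POP3.uidl() and POP3.retr().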

5 Performance Evaluation

User-perceived latency is the sum of the latencies of the network, the server, and the client. Network and server latencies in turn depend on dynamic factors (e.g., network traffic and server load). We cannot simulate these dynamic factors in our experiments, as there are no correlations or dependencies between them. Client latency depends on the browser program, the client CPU speed, and the other programs running on the same machine. We can keep the client latency constant by running no other program on the client machine and using a single Web browser, in our case Netscape, to render the pages. Against this background, we assume that in a real-world scenario users typically access their accounts from home and from the office/school. Accordingly, we evaluated PACE in two environments:

1. Experimental setup 1 (Home): Tomcat running on Windows XP on a 1.6 GHz Pentium 4 CPU with 512 MB of RAM, connected to the Internet through the Comcast cable Internet network. Two clients were used to measure the latency: one client on the same machine that runs PACE, and another client on Windows 98 connected to the PACE machine through a wireless LAN.

2. Experimental setup 2 (School): Tomcat running on Windows XP, connected to the Internet through the LAN on the WSU campus. Two clients were used to measure the latencies in this setup too: one client on the same machine that runs PACE, and another client on a Linux machine within the WSU campus, connected to the Internet through the WSU campus LAN.

In the real world, PACE would be hosted by ISPs on high-speed clusters with fast disk access, such as SCSI with RAID, which would significantly improve the processing time of a JSP page. Moreover, these clusters would be connected to the Internet directly. Our experimental setup, however, uses a personal computer, and the connection to the Internet is through the local LAN. Traffic to the server competes with other traffic on the LAN, over which we have little control, especially on the WSU campus LAN. The templates are stored on the hard disk, which is accessed every time a response is processed for a client request. Since it is a personal computer, IDE hard disks are used for storage, which are not particularly good in terms of disk access speed. Despite these shortcomings, our experimental results show a good reduction in user-perceived latency when the pages are generated at PACE compared with the origin server.

5.1 Evaluation of User Perceived Latency

For our evaluation we selected the Wayne State University (WSU) Web email service, Hotmail, Yahoo, and Rediffmail [25]. The WSU mail server is situated on the WSU campus itself. We found that the Yahoo servers on which we ran the experiments are located in California, but we were unable to determine the whereabouts of the Hotmail servers. Rediffmail is another free email portal, whose server is located in India. PACE should be especially helpful in reducing cross-continental Internet traffic if all the emails are pre-fetched and the dynamic pages are generated at the edges close to the client.


We evaluated our model by sending the same emails, with cc, to the various accounts. Since WSU enables users to retrieve emails through the POP protocol, our Mail Fetcher pre-fetches the emails from the WSU account. The user-perceived latency of fetching the same emails from the various accounts was measured and compared with that of the PACE account.

Due to the dynamics of the Web, it is very hard to pinpoint the factors that affect latency at any point in time. To overcome this issue, we conducted each measurement by accessing all the Web servers one after the other in a round-robin fashion for a particular dynamic email page. We discarded the occasional large download time for a particular page of any origin server. Overall, we find that the user-perceived latency for a particular origin email server is consistent and varies with a small deviation across multiple invocations of the same page within a particular period of time, although the latency for the same page is different at another point in time. We find no significant time difference between the first invocation and the second or subsequent invocations of the same page; in many cases the second or a subsequent invocation is slower than the first, probably due to network congestion or server load at that time. From this we can reasonably assume that no cache is currently operating in between to help generate these dynamic pages.
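The measurement procedure above can be sketched in a few lines. The round-robin ordering follows the text; the rule for discarding an occasional large download time is not specified in our description, so the cutoff used here (three times the median) is purely an assumption for illustration, as are the function names.

```python
from statistics import mean, median


def round_robin(servers, rounds):
    """Yield (round, server) so that every server is visited once per round."""
    for r in range(rounds):
        for s in servers:
            yield r, s


def trimmed_mean(samples, factor=3.0):
    """Average the samples after dropping any that exceed factor times the median."""
    cutoff = factor * median(samples)  # assumed outlier rule, not from the paper
    kept = [x for x in samples if x <= cutoff]
    return mean(kept)
```

Interleaving the servers in each round means that a transient change in network conditions affects all of them roughly equally instead of biasing the samples of a single server.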

The content of the dynamic pages that PACE generates to display emails was designed for two different customers: the email service provider and the user.

5.1.1 Email Service Provider

We studied a typical email page of Hotmail and of Yahoo and found that Yahoo adds 40 KB of data to the email and Hotmail adds about 18 KB. This is the data (which we call the basic template) that is embedded with the email itself and arrives as one single page. In addition, there are other objects that the client retrieves through separate GET requests. JavaScript files form one set of objects that the browser downloads and runs. Every time the same page is invoked, different advertisement images are downloaded. We cached one such instance, and we assume that changing the images on every load of the page has a negligible effect on latency. The pre-fetched emails are plugged in with the templates and the images/JavaScript files stored at PACE, and a dynamic page is generated for the client. The objects cached at PACE for Yahoo amounted to around 120 KB, apart from the basic template for the email page. For Hotmail we cached around 56 KB of objects over and above the basic template. These 120 KB and 56 KB could already be cached today, as session IDs are not required to fetch them. We do not think that basic templates are cached anywhere at the moment. We studied the user-perceived latency of each of these pages generated at PACE and of retrieving the same pages from the origin servers.

We measured the user-perceived latency, as seen by the client, of accessing the page generated by PACE and the page generated by the origin server within a given interval of time. We cached two templates, the Hotmail template and the Yahoo template, with the associated objects in both cases. We tested two clients with the Home setup (see footnote 1): (1) client and PACE running on the same machine, with results shown in Figure 6; (2) client and PACE running on different machines, with results shown in Figure 7.

These results show that, despite using a personal computer as the Web server, user-perceived latency is reduced for the pages generated at PACE using the Yahoo and Hotmail templates. We noted earlier that most emails are smaller than 6 KB, to which 40-140 KB of template and object data is added. Taking 100 KB as the total size of most of the dynamic pages, we see an improvement in user-perceived latency of 59% and 66% from caching Hotmail objects and of 5% and 23% from caching Yahoo objects.
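The improvement percentages quoted here are relative reductions in latency. The sketch below spells out that arithmetic; the function name improvement is our own, and the timings in the example are made up solely to show the calculation, not taken from our measurements.

```python
def improvement(origin_seconds, pace_seconds):
    """Fractional reduction in latency when PACE serves the page instead of the origin."""
    if origin_seconds <= 0:
        raise ValueError("origin latency must be positive")
    return (origin_seconds - pace_seconds) / origin_seconds


# Hypothetical timings, chosen only to illustrate the arithmetic.
print(f"{improvement(2.0, 0.5):.0%}")  # prints 75%
```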

5.1.2 User

The user is interested in seeing only his or her emails, not the advertisements that come along with them. The user-centric evaluation was carried out by removing all this additional data and generating the page with only the email in it.

¹ Due to space constraints, we have not shown the results for the School setup here.


[Figure 6 plots: (a) With Hotmail template, series Hotmail-Template and Hotmail; (b) With Yahoo template, series Yahoo-Template and Yahoo. x-axis: email size (bytes), 1k to 800k; y-axis: time in seconds.]

Figure 6: User-perceived latency for the page created by PACE using origin server templates and for the page retrieved from the origin server, with the client on the same machine as PACE. Both run on Windows XP connected to the Internet through Comcast Internet cable service.

[Figure 7 plots: (a) With Hotmail template, series Hotmail-Template and Hotmail; (b) With Yahoo template, series Yahoo-Template and Yahoo. x-axis: email size (bytes), 1k to 800k; y-axis: time in seconds.]

Figure 7: User-perceived latency for the page created by PACE using origin server templates and for the page retrieved from the origin server, with the client on Windows 98 and PACE on Windows XP. The client is connected to PACE through a wireless LAN, and the LAN is connected to the Internet through Comcast Internet cable service.


We carried out the experiments for both experimental setups, with two clients in each setup. In the first case the client and PACE run on the same machine, and in the second case they run on different machines. Through the client we accessed the same email page from all the accounts with the various email service providers and from PACE. Figure 8 shows the results for the two clients in the School setup and Figure 9 shows the results for the two clients in the Home setup. The results show that, for the same email, the user-perceived latency differs across email accounts. PACE gave the best performance and Hotmail the worst. Although the performance of WSU webmail was very close to that of PACE, the time taken grows at a higher rate with email size for both WSU webmail and Hotmail. Although Yahoo plugs in more data for every email, the performance of fetching emails from Yahoo was the best during our experimentation. Overall, we see an improvement in user-perceived latency of between 28% and 93% for a dynamic page size of 100 KB.

[Figure 8 plots: panels (a) and (b), series Pace, Wayne, Yahoo, Rediff and Hotmail. x-axis: email size (bytes), 1k to 700k; y-axis: time in seconds.]

Figure 8: (a) User-perceived latency with the client on Linux and PACE on Windows XP, with the client accessing PACE through the WSU campus LAN. (b) User-perceived latency with the client and PACE on the same Windows XP machine connected to the Internet through the WSU campus LAN.

[Figure 9 plots: panels (a) and (b), series Pace, Wayne, Yahoo, Rediff and Hotmail. x-axis: email size (bytes), 1k to 800k; y-axis: time in seconds.]

Figure 9: (a) User-perceived latency with the client on Windows 98 and PACE on Windows XP, with the client accessing PACE through a wireless LAN and the LAN connected to Comcast cable service. (b) User-perceived latency with the client and PACE on the same Windows XP machine connected to the Internet through Comcast cable service.

5.2 Bandwidth Evaluation

By caching the templates at the edge proxy and generating the dynamic page at the edge, bandwidth is used only for retrieving the emails themselves. The bandwidth used to retrieve expired basic templates and objects can be amortized across all the users served from that edge proxy. We can broadly classify the cacheable components into two types. One is the basic template itself, which is used to generate the email page, and the other is the embedded objects referenced by ‘src’ attributes in the basic template, which are retrieved through separate GET requests.

Figure 10 shows the bandwidth savings that can be achieved for different email sizes, both for the basic template alone and for all objects, which include the basic template and the embedded objects. As expected, smaller emails achieve higher bandwidth savings. In this study we have considered only the email page. However, before the email page is retrieved, a main page listing all the emails is generated. This page contains the links to the emails. Different email providers use different header fields of the emails as links; for example, Hotmail uses the subject of the email as the link, while Yahoo uses the sender's name. Our contribution of caching the template applies to this page too; however, we have shown the bandwidth savings for the email page alone.
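The relationship between email size and bandwidth savings can be illustrated with a short Python sketch. It assumes a simple model that is not stated explicitly in this report: the savings fraction is the number of bytes served from the edge cache divided by the total page bytes (cached bytes plus email bytes). The template and object sizes are the approximate figures quoted earlier for Hotmail and Yahoo; the function name `bandwidth_savings` is hypothetical, and the output is not claimed to reproduce Figure 10 exactly.

```python
KB = 1024

# Approximate sizes quoted earlier in the evaluation for each provider.
CACHED_BYTES = {
    "hotmail": {"basic_template": 18 * KB, "embedded_objects": 56 * KB},
    "yahoo": {"basic_template": 40 * KB, "embedded_objects": 120 * KB},
}


def bandwidth_savings(provider, email_bytes, include_objects=True):
    """Fraction of page bytes that need not cross the upstream link.

    Assumed model: savings = cached / (cached + email), where `cached` is
    the basic template alone or the template plus embedded objects.
    """
    sizes = CACHED_BYTES[provider]
    cached = sizes["basic_template"]
    if include_objects:
        cached += sizes["embedded_objects"]
    return cached / (cached + email_bytes)


if __name__ == "__main__":
    for size_kb in (1, 10, 100, 700):
        basic = bandwidth_savings("hotmail", size_kb * KB, include_objects=False)
        full = bandwidth_savings("hotmail", size_kb * KB)
        print(f"{size_kb:>4} KB email: basic={basic:.2f} all objects={full:.2f}")
```

Under this model the savings fall monotonically as the email grows, which matches the qualitative trend described above.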

[Figure 10 plots: (a) Hotmail template; (b) Yahoo template; series Basic template and All objects. x-axis: email size (bytes), 1k to 700k; y-axis: bandwidth savings, 0 to 1.]

Figure 10: Bandwidth savings by caching the templates at the edge proxy and generating the page from the edge.

6 Related Work and Discussion

Our work on PACE builds upon a large body of related work in the general area of Web caching. Instead of describing each effort separately, we group related efforts into three broad categories: dynamic content generation and delivery, edge services framework, and prefetching.

Dynamic content caching. Caching dynamic pages is a challenging area in which much research is concentrated today, including server-side efforts [7, 8, 31], proxy-side efforts [6, 9, 20, 21, 26], and recently client-side efforts [24]. For example, the DUP algorithm [7, 8] dramatically improves performance and cache hit rates by maintaining data dependency information between cached objects and the underlying data. As soon as the system becomes aware of changes in the underlying data by way of triggers, graph traversal algorithms are applied to determine which cached objects are affected by the change. The constraint of this system is that the data storage source and the cache are highly integrated, and the cache is implemented at the source end. Adopting the same approach at the network edge requires more complex integration with the origin server, and the implementation becomes vendor and application specific. Moreover, this works only for dynamic pages that are not personalized; building a session into dynamic page generation is not addressed.
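The dependency-tracking idea described above can be sketched in a few lines of Python. This is a simplified illustration, not the DUP algorithm of [7, 8]: the class name `DependencyCache`, its methods, and the choice to invalidate (rather than regenerate) affected objects are assumptions made for this example.

```python
from collections import defaultdict, deque


class DependencyCache:
    """Cache that records which cached objects depend on which data items."""

    def __init__(self):
        self.cache = {}
        # Maps a node (data item or cached object) to the objects built from it.
        self.dependents = defaultdict(set)

    def put(self, key, value, depends_on=()):
        """Store an object and register the items it was generated from."""
        self.cache[key] = value
        for source in depends_on:
            self.dependents[source].add(key)

    def on_data_change(self, changed):
        """Breadth-first traversal from the changed item.

        Every cached object reachable through dependency edges is removed,
        and the list of removed keys is returned.
        """
        seen = {changed}
        queue = deque([changed])
        invalidated = []
        while queue:
            node = queue.popleft()
            for dependent in self.dependents.get(node, ()):
                if dependent in seen:
                    continue
                seen.add(dependent)
                queue.append(dependent)
                if dependent in self.cache:
                    del self.cache[dependent]
                    invalidated.append(dependent)
        return invalidated


if __name__ == "__main__":
    cache = DependencyCache()
    cache.put("fragment:scores", "<table>", depends_on=["db:scores"])
    cache.put("page:home", "<html>", depends_on=["fragment:scores"])
    cache.put("page:about", "<html>", depends_on=["db:about"])
    print(cache.on_data_change("db:scores"))
```

The sketch also shows why such a scheme is tied to the data source: the trigger that calls `on_data_change` has to originate where the underlying data lives.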

Another technique, used by Akamai [1] and the CONCA cache [26], is to assemble dynamic content on the edge servers using Edge Side Include (ESI) [29] technology. ESI lets a content provider break a dynamic page into fragments with independent cacheability properties. These fragments are maintained as separate objects in the edge server's cache and are dynamically assembled into Web pages in response to user requests. This again works only for pages that are not personalized and for which sessions are not maintained. Whenever authentication is required, Akamai [1] edge servers do not cache any data and the entire request is forwarded to the origin server. Client Side Include [24] assembles the ESI fragments at the client end using the browser. JavaScript/ActiveX objects are used to run the applications on the browser, which fetch the needed fragments from the origin server. This reduces the latency in the “last mile”, which is especially useful for dial-up clients with slow connections. However, personalized dynamic pages are not addressed either.

In this paper we concentrate specifically on the generation and delivery of personalized dynamic email pages. To our knowledge, this is the first effort to move personalized dynamic page generation to the edge.

Edge services framework. The email spam filtering and management proposed in this paper is motivated by the IETF's Open Pluggable Edge Service (OPES) framework [28] and content adaptation [13, 14]. Content adaptation allows the system to inject additional functionality along the data path between client and server. OPES proposes an environment to provide value-added services to end users and/or content providers. Providing services at the edges enables incremental deployment and amortization of operating costs, thus benefiting both the client and the provider [5, 13, 19]. Although content adaptation and filtering have been proposed for a while, PACE is the first real system that supports email spam filtering.
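The idea of injecting services along the data path can be illustrated with a small Python sketch of a pluggable pipeline applied to pre-fetched emails at the proxy. It does not follow the OPES specification or the PACE spam filter, whose details are not given here: the dictionary email representation, the naive keyword filter, and the names `keyword_spam_filter`, `tag_origin`, and `run_pipeline` are hypothetical.

```python
from typing import Callable, Iterable, List, Optional

Email = dict  # hypothetical representation: {"subject": ..., "body": ...}
# An edge service returns a (possibly modified) email, or None to drop it.
EdgeService = Callable[[Email], Optional[Email]]


def keyword_spam_filter(blocked: Iterable[str]) -> EdgeService:
    """Build a naive filter that drops emails containing a blocked phrase."""
    blocked_lower = [phrase.lower() for phrase in blocked]

    def service(email: Email) -> Optional[Email]:
        text = (email.get("subject", "") + " " + email.get("body", "")).lower()
        if any(phrase in text for phrase in blocked_lower):
            return None
        return email

    return service


def tag_origin(email: Email) -> Optional[Email]:
    """Example value-added service: annotate the email as edge-processed."""
    return {**email, "via": "edge-proxy"}


def run_pipeline(emails: Iterable[Email], services: List[EdgeService]) -> List[Email]:
    """Pass each email through the services in order, stopping when dropped."""
    delivered = []
    for email in emails:
        current: Optional[Email] = email
        for service in services:
            current = service(current)
            if current is None:
                break
        if current is not None:
            delivered.append(current)
    return delivered


if __name__ == "__main__":
    inbox = [
        {"subject": "Meeting", "body": "at 3pm"},
        {"subject": "WIN CASH now", "body": ""},
    ]
    print(run_pipeline(inbox, [keyword_spam_filter(["win cash"]), tag_origin]))
```

Because each service is an independent callable, new ones can be added to the list without changing the proxy loop, which is the incremental-deployment property mentioned above.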

Pre-fetching. In this paper we pre-fetch emails from the origin server and use the pre-fetched emails to generate dynamic pages at the edges. Pre-fetching has been demonstrated to be an efficient mechanism for reducing Web access latency. As noted by several researchers [11, 12, 18, 23], there are three distinct pre-fetching scenarios: between clients and servers, between clients and proxies, and between proxies and servers [18]. We use pre-fetching between proxies and servers. Personalized pre-fetching was introduced in [30], in which HTML pages are pre-fetched based on keywords extracted from the pages most often visited by the user. We extend this notion to pre-fetch personalized emails. The effectiveness of pre-fetching is measured by the hit rate of the pre-fetched documents. Unlike pre-fetching in other scenarios, we expect our hit rate to be almost 100%, without any bandwidth waste, when implemented in a commercial environment, as every person would like to view his or her latest emails at all times.
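The hit-rate measure mentioned above can be made concrete with a short Python sketch. It assumes one particular definition, namely the fraction of pre-fetched documents that the user later requests, and counts the bytes of never-requested documents as wasted bandwidth; the function name `prefetch_stats` and the sample sizes are hypothetical.

```python
def prefetch_stats(prefetched, requested):
    """Return (hit_rate, wasted_bytes) for one pre-fetching session.

    prefetched: dict mapping document id -> size in bytes
    requested:  iterable of document ids the user actually opened
    """
    requested_ids = set(requested)
    hits = [doc_id for doc_id in prefetched if doc_id in requested_ids]
    hit_rate = len(hits) / len(prefetched) if prefetched else 0.0
    wasted_bytes = sum(
        size for doc_id, size in prefetched.items() if doc_id not in requested_ids
    )
    return hit_rate, wasted_bytes


if __name__ == "__main__":
    fetched = {"msg-1": 4096, "msg-2": 2048, "msg-3": 6144}
    print(prefetch_stats(fetched, ["msg-1", "msg-2", "msg-3"]))
    print(prefetch_stats(fetched, ["msg-1"]))
```

When the user opens every pre-fetched email, the first case in the usage example, the hit rate is 1.0 and no bytes are wasted, which is the situation the paragraph above expects for email.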

7 Summary and Future Work

Personalized dynamic pages generated to display emails make up an important part of Internet traffic today, making it necessary to find efficient solutions that improve user-perceived latency and reduce the network traffic resulting from emails. In this paper we have presented a new technique in which emails are pre-fetched at the edge servers and dynamic pages are generated at the edges using information from the origin server. By developing a prototype, we have shown that user-perceived latency is reduced when this method is adopted. Centralized spam management at the proxy will help reduce spam traffic, which today accounts for more than 40% of email traffic. Along with this, we have presented a comparative study of user-perceived latencies across many of the popular web email services. We have shown that this technique reduces latency and saves bandwidth.

Our future work includes developing AI-based spam management [30]. We would also like to integrate this with our CONCA proxy and evaluate the model in a more distributed environment that includes the ISP network, placing the proxy with the ISP and caching all types of templates.

References

[1] Akamai Technologies Inc., http://www.akamai.com/.

[2] America Online steps up spam fight by launching litigation offensive against spammers, Apr. 2003, http://media.aoltimewarner.com/media/newmedia/cb_press_view.cfm?release_num=55253129.

[3] Apache Jakarta Project, http://jakarta.apache.org.

[4] A. Awadallah and M. Rosenblum. The vMatrix: A network of virtual machine monitors for dynamic content distribution. Proc. of the 7th International Workshop on Web Caching and Content Distribution (WCW’02), Aug. 2002.

[5] A. Beck and M. Hofmann. Enabling the internet to deliver content-oriented services. Proc. of the 6th International Workshop on Web Caching and Content Distribution (WCW’01), June 2001, http://www.cs.bu.edu/techreports/2001-017-wcw01-proceedings/107_beck.pdf.

[6] P. Cao, J. Zhang, and K. Beach. Active cache: Caching dynamic contents on the web. Proc. of IFIP Int’l Conf. Dist. Sys. Platforms and Open Dist. Processing, pp. 373-388, 1998, http://www.cs.wisc.edu/~cao/papers/active-cache.ps.


[7] J. Challenger, A. Iyengar, and P. Dantzig. A scalable system for consistently caching dynamic web data. Proc. of IEEE Conference on Computer Communications (INFOCOM’99), Mar. 1999.

[8] J. Challenger, A. Iyengar, K. Witting, C. Ferstat, and P. Reed. A publishing system for efficiently creating dynamic web content. Proc. of IEEE Conference on Computer Communications (INFOCOM’00), Mar. 2000.

[9] F. Douglis, A. Haro, and M. Rabinovich. HPP: HTML macro-pre-processing to support dynamic document caching. Proc. of the 1st USENIX Symposium on Internet Technologies and Systems (USITS’97), pp. 83-94, Dec. 1997, http://www.douglis.org/fred/work/papers/hpp.pdf.

[10] R. Doyle, J. Chase, S. Gadde, and A. Vahdat. The trickle-down effect: Web caching and server request distribution. Proc. of the 6th International Workshop on Web Caching and Content Distribution (WCW’01), June 2001.

[11] D. Duchamp. Prefetching hyperlinks. Proc. of the 2nd USENIX Symposium on Internet Technologies and Systems (USITS’99), Oct. 1999.

[12] L. Fan, P. Cao, and Q. Jacobson. Web prefetching between low-bandwidth clients and proxies: Potential and performance. Proceedings of ACM SIGMETRICS’99, May 1999, http://www.cs.wisc.edu/~cao/papers/prepush.ps.gz.

[13] A. Fox, S. Gribble, Y. Chawathe, and E. A. Brewer. Adapting to Network and Client Variation Using Infrastructural Proxies: Lessons and Perspectives. IEEE Personal Communications, Aug. 1998, http://www.cs.washington.edu/homes/gribble/papers/adapt.ps.zip.

[14] X. Fu, W. Shi, A. Akkerman, and V. Karamcheti. CANS: Composable, Adaptive Network Services Infrastructure. Proc. of the 3rd USENIX Symposium on Internet Technologies and Systems (USITS’01), pp. 135-146, Mar. 2001.

[15] J. Gecsei. Determining hit ratios for multilevel hierarchies. IBM J. Res. Dev., July 1974.

[16] IBM Corp. Websphere platform, http://www.ibm.com/websphere.

[17] J. Krim. Spam’s cost to business escalates, Mar. 2003, http://www.washingtonpost.com/ac2/wp-dyn/A17754-2003Mar12.

[18] T. M. Kroeger, D. E. Long, and J. C. Mogul. Exploring the bounds of web latency reduction from caching and prefetching. Proc. of the 1st USENIX Symposium on Internet Technologies and Systems (USITS’97), Dec. 1997, http://www.cse.ucsc.edu/~tmk/publications/ideal .

[19] V. Mastoli, V. Desai, and W. Shi. SEE: a service execution environment for edge services. Proceedings of the 3rd IEEE Workshop on Internet Applications (WIAPP’03), June 2003.

[20] M. Mikhailov and C. E. Wills. Change and relationship-driven content caching, distribution and assembly. Tech. Rep. WPI-CS-TR-01-03, Computer Science Department, WPI, Mar. 2001, http://www.cs.wpi.edu/~cew/papers/tr01-03.pdf.

[21] A. Myers, J. Chuang, U. Hengartner, Y. Xie, W. Zhang, and H. Zhang. A secure and publisher-centric web caching infrastructure. Proc. of IEEE Conference on Computer Communications (INFOCOM’01), Apr. 2001.

[22] MySQL Project, http://www.mysql.org.

[23] J. Pitkow and P. Pirolli. Mining longest repeating subsequences to predict world wide web surfing. Proc. of the 2nd USENIX Symposium on Internet Technologies and Systems (USITS’99), Oct. 1999.

[24] M. Rabinovich, Z. Xiao, F. Douglis, and C. Kalmanek. Moving edge side includes to the real edge – the clients. Proc. of the 4th USENIX Symposium on Internet Technologies and Systems (USITS’03), Mar. 2003.

[25] Rediff Mail, http://www.rediff.com.

[26] W. Shi and V. Karamcheti. CONCA: An architecture for consistent nomadic content access. Workshop on Cache, Coherence, and Consistency (WC3’01), June 2001.

[27] W. Shi, R. Wright, E. Collins, and V. Karamcheti. Workload characterization of a personalized web site and its implication on dynamic content caching. Proc. of the 7th International Workshop on Web Caching and Content Distribution (WCW’02), pp. 1-16, Aug. 2002, http://www.cs.wayne.edu/~weisong/papers/wcw02.pdf.

[28] G. Tomlinson, R. Chen, and M. Hofmann. A model for open pluggable edge services, work in progress, Nov. 2001, http://www.ietf.org/internet-drafts/draft-tomlinson-opes-model-00.txt.

[29] M. Tsimelzon, B. Weihl, and L. Jacobs. ESI language specification 1.0, 2000, http://www.esi.org.

[30] C. Xu and T. Ibrahim. Keyword-based semantic prefetching in internet news services. Accepted by IEEE Transactions on Knowledge and Data Engineering, 2003.

[31] H. Zhu and T. Yang. Class-based cache management for dynamic web content. Proc. of IEEE Conference on Computer Communications (INFOCOM’01), Apr. 2001, http://www.cs.ucsb.edu/projects/swala/cache2001.ps.
