Automatically Detecting Banner Ads in Web Pages

5
Automatically Detecting Banner Ads in Web Pages Douglas Greiman Stanford University, Stanford, CA 94305 [email protected] 1. Introduction This paper describes AdZap, a program for detecting and blocking advertisements on web pages. AdZap uses a set of labeled training data collected from the user as input to a supervised learning algorithm. The trained algorithm then examines images embedded in HTML documents shown to the user and hides images classified as advertisements. 2. Background There are a number of mechanisms for delivering advertisements on the web. For simplicity, AdZap only considers IMG tags, which are the most common mechanism. An HTML document contains embedded images where each image is represented by an IMG element. An IMG element has a set of HTML attributes, and a set of CSS attributes. One of the HTML attributes is a URL pointing to the image file to be displayed in the document. In the rest of this paper, references to “HTML attributes” will mean all attributes except the URL. Advertisement URLs are usually generated by ad-serving software based on the location and content of the document in which they appear, along with other factors such as current advertisement inventory. The same page viewed at different times will display different advertisements, although usually in the same format and the same location on the page. There are a large number of different ad-serving software packages available. Each package constructs URLs in a fairly arbitrary and distinctive way. There are also number of ad-blocking software packages available to detect and block advertisements on web pages. Currently, these ad-blockers use hand-constructed lists of regular expressions. Every URL fetched by the user's browser is compared against the regular expressions. Images that match are hidden from display or replaced with a blank image. Other HTML and CSS attributes of the images are ignored. Lists of regular expressions are reasonably effective for advertisements already seen. However, they require regular maintenance. Furthermore, the great majority of end users are not capable of or willing to craft regular expressions to identify advertisements. Thus, ad-blocking programs rely on lists maintained by small groups of experts and distributed to end users on a regular basis. 3. Classifying Images AdZap constructs labeled feature vectors based on clicks from the user, as described below. There are two possible labels: “ad” and “not- ad”. These vectors are used to train a Naïve Bayes Multivariate Bernoulli event model. The model is retrained each time the user creates a new feature vector. When the model changes, all currently displayed images are reclassified and redisplayed as appropriate. An image has two types of information associated with it: image content, and image context. Image content is the actual pixel data of the image. Image context is all the ancillary information required by the browser to locate and display the image correctly. AdZap, like all other ad-blockers, only uses context information. The actual content of the image is not examined.

Transcript of Automatically Detecting Banner Ads in Web Pages

Page 1: Automatically Detecting Banner Ads in Web Pages

Automatically Detecting Banner Ads in Web PagesDouglas Greiman

Stanford University Stanford CA 94305duggelzgmailcom

1 IntroductionThis paper describes AdZap a program fordetecting and blocking advertisements on webpages AdZap uses a set of labeled training datacollected from the user as input to a supervisedlearning algorithm The trained algorithm thenexamines images embedded in HTMLdocuments shown to the user and hides imagesclassified as advertisements

2 BackgroundThere are a number of mechanisms fordelivering advertisements on the web Forsimplicity AdZap only considers IMG tagswhich are the most common mechanism AnHTML document contains embedded imageswhere each image is represented by an IMGelement An IMG element has a set of HTMLattributes and a set of CSS attributes One ofthe HTML attributes is a URL pointing to theimage file to be displayed in the document Inthe rest of this paper references to ldquoHTMLattributesrdquo will mean all attributes except theURLAdvertisement URLs are usually generated byad-serving software based on the location andcontent of the document in which they appearalong with other factors such as currentadvertisement inventory The same pageviewed at different times will display differentadvertisements although usually in the sameformat and the same location on the pageThere are a large number of different ad-servingsoftware packages available Each packageconstructs URLs in a fairly arbitrary anddistinctive wayThere are also number of ad-blocking software

packages available to detect and blockadvertisements on web pages Currently thesead-blockers use hand-constructed lists ofregular expressions Every URL fetched by theusers browser is compared against the regularexpressions Images that match are hiddenfrom display or replaced with a blank imageOther HTML and CSS attributes of the imagesare ignored

Lists of regular expressions are reasonablyeffective for advertisements already seenHowever they require regular maintenanceFurthermore the great majority of end users arenot capable of or willing to craft regularexpressions to identify advertisements Thusad-blocking programs rely on lists maintainedby small groups of experts and distributed toend users on a regular basis

3 Classifying ImagesAdZap constructs labeled feature vectors basedon clicks from the user as described belowThere are two possible labels ldquoadrdquo and ldquonot-adrdquo These vectors are used to train a NaiumlveBayes Multivariate Bernoulli event model Themodel is retrained each time the user creates anew feature vector When the model changesall currently displayed images are reclassifiedand redisplayed as appropriate

An image has two types of informationassociated with it image content and imagecontext Image content is the actual pixel dataof the image Image context is all the ancillaryinformation required by the browser to locateand display the image correctly AdZap like allother ad-blockers only uses contextinformation The actual content of the image isnot examined

There is a lot of context information availablefor use The IMG element contains HTML andCSS attributes which are key-value pairs withsemantics based on key names The mostimportant is the src attribute whose value isthe URL of the image file to display Thisattribute is treated separately from all others

31 Feature ConstructionAdZap uses a very simple method of creatingfeatures from the URL The URL is brokeninto tokens at non-alphanumeric boundariesSeries of tokens that are separated only by dotsor hyphens are joined together to createadditional larger tokens For example the URLldquohttpspeedpointrollcomPoint-Rollhelliprdquo is translated to the feature setldquourlspeedrdquo ldquourlpointrollrdquoldquourlcomrdquo ldquourlspeedpointrollcomrdquoldquourlPointrdquo ldquourlRollrdquo ldquourlPoint-Rollrdquo hellip The hostname is notdistinguished from the other parts of the URLIt is common for ad URLs to includeredirection steps where the actual address ofthe image file is embedded as a queryparameter inside a URL to a redirector site foruser tracking and other purposes

AdZap uses another simple method of creatingfeatures from the HTML and CSS attributesEach keyvalue pair is concatenated into asingle token For example ltIMG height=rdquo20rdquoborder=rdquo1rdquogt is translated to the feature setldquoattributeheight20rdquoldquoattributeborder1rdquo

Note that attributes with the same key anddifferent values are treated as unrelated by themodel That is ldquoheight20rdquo andldquoheight30rdquo are separate dimensions in thefeature vector with possible values 0 and 1rather than a single dimension ldquoheightrdquo withvalues from 0 to 1000 Also integer valuedattributes are not grouped into a smaller numberof buckets (eg values from ldquoheight25rdquo toldquoheight35rdquo are not normalized toldquoheight30rdquo) This is motivated by the factthat the ad industry has a set of standard ad

sizes specified to the pixel An image with oneof the standard sizes is very likely an ad andconversely a size that is even one pixeldifferent is much less likely to be an ad

Location of the image on the page is alsoinformative banner advertisements arecommonly placed along the top and right sideof web pages and less often on the left side orin the center However location is a difficultattribute to work with since it depends on theshape size and layout of the browser windowand the rest of the document Maximizing thebrowser window adding or removing UIelements to the browser window or evenchanging the font can change the location ofimages AdZap currently ignores attributesspecifying location

4 ImplementationAdZap is a Firefox browser plugin Note thatAdZap is a new piece of software and isunrelated to the ldquoadzapperrdquo package for squidproxies

When a document is first loaded in the browsereach IMG tag is examined and classified asldquoadrdquo or ldquonot-adrdquo by the model Ads are hiddenby rendering them 90 transparent Thismakes ads virtually invisible but still makes theimage visible enough for the user to recoverfrom classification errors

AdZap adds some controls to the browserwindow Clicking on the AdZap button puts thebrowser in ldquozapping moderdquo In zapping modea left-click on an image labels that image as anldquoadrdquo Similarly a right-click labels the imageas ldquonot-adrdquo The image list at the bottom of thewindow contains all the images in the currentdocument and left and right clicks in this listbehave the same as zapping mode clicksLabels created by these clicks are saved in apersistent store and the model is retrained eachtime the user clicks

Figure 1AdZap User Interface

41 Training DataThe first design of AdZap only allowed the userto label ads Every image on a page that wasnot explicitly labeled ldquoadrdquo was implicitlylabeled ldquonot-adrdquo and used to train the model

This implicit labeling works poorlyExperienced web surfers have trainedthemselves to identify and ignore ads at asubconscious level (ad blindness) so it iseasy to overlook ads even when consciouslylooking for them Beyond that pages containdozens sometimes of hundreds of images thatare small unobtrusive or even invisible Forexample CNNrsquos home page has more than 160images A small number are ads or news

photos The majority are company logosinvisible ldquoweb bugsrdquo partner logos icons ofunknown purpose cobranding logos and moreWeb bugs are a particular problem becausetheir URLs are very similar to ad image URLsbut they never get marked as ads because theyare invisible

Finally there are many images that are simplyambiguous For example Forbesrsquo home pagecontains a medium-sized image of a vehiclelinked to ForbesAutoscom On the one handthis is a legitimate navigation link to anothersection of Forbesrsquo web site ForbesAutoscomcontains free auto reviews and other meaningfulcontent On the other hand the imageprominently features a Lincoln Towncar and

clicking the link takes you directly to theLincoln section of ForbesAutoscom completewith ads by Lincoln and links to purchase aLincoln Itrsquos purposely unclear where thedividing line between editorial content andadvertising is

This was a real problem until I realized that forthe great majority of images on the web theirproper classification is ldquodonrsquot know and donrsquotcarerdquo The user cares about the large flashingannoying ads and the large interestingmeaningful photos and other actual contentEverything else is ignorable

The second and current design of AdZapintroduces a category of ldquoignorerdquo All imagessmaller that a certain area are ldquoignorerdquo Thistakes care of web bugs company logos and thelike Images that are very narrow or very shortare also ldquoignorerdquo This takes care of border artand lines Images in ldquoignorerdquo are not used astraining data (unless explicitly relabeled as ldquoadrdquoor ldquonot-adrdquo by the user) and are not classifiedby the model They are displayed normally inthe browser

Furthermore the second design of AdZapallows the user to explicitly label images asldquonot-adrdquo as well as ldquoadrdquo These labeled imagesare used as training data Images that are notlabeled are not used as training data howeverthey are classified by the model and hidden ifclassified as an ldquoadrdquo Since ldquoadrdquo images arerendered mostly transparent instead of totallyhidden this creates a simple feedback loop withthe user When the user notices misclassifiedimages they can explicitly label them and thusimprove the model

5 Experimental ResultsIt was not immediately clear what the best wayto gather training data was One could imagineenumerating all the web pages in existence andvisiting a random sample of them This hassome logistical problems and itrsquos not clear thatthis would actually be representative of a real

user experience anyway

I decided to visit a selection of news sitesThese sites offer a rich selection of newsphotos advertisements and miscellaneousimages My test procedure was to visitnewsgooglecom and for each story on thefront page follow the top three links for thatstory Each story link goes to an article page ona news site like nytimescom On each articlepage I classified every visible image as ldquoadrdquoldquonot-adrdquo or ldquodonrsquot know and donrsquot carerdquo (bynot clicking on that image) I repeated thisprocedure at intervals since the stories onnewsgooglecom change over time until I hadsufficient data points

This training data was used to evaluate themodel using different sets of features as shownon Figure 2 (see next page) Each colorrepresents a model that incorporates only thefeatures listed in the legend above Eachfeature set was evaluated using 20-fold cross-validation on training sets of various sizes

6 ConclusionsTraining a Naiumlve Bayes model on image URLsis very effective at identifying ads The fewerrors that occur have little negative effect onthe userrsquos browsing experience Some error isunavoidable due to the inherent ambiguity ofcertain images

Training a model on HTML and CSS attributesis not very effective at identifying ads Thereare a number of standardized shapes andpositions for banner ads and these types of adsare detected easily However images inunusual places such as the center of the pageor images with unusual shapes are poorlyclassified

Training on both URLs and HTML attributesisnrsquot any better than training on just URLsThis might be due to overtraining

Figure 2Evaluation on various feature sets

7 Future WorkCurrently ad URLs are distinctively differentfrom non-ad URLs However if large numbersof users started using ad-blocking software adcompanies would probably respond by makingtheir URLs indistinguishable from other imageURLs which would be fairly easy If thishappened then ad classification based onHTML attributes like size and location wouldbecome relatively more useful since theseattributes canrsquot be obfuscated like the URL can

It might be possible to improve theperformance of HTML features by morecomplex analysis of web pages For examplemost advertisements are clickable linksmeaning that the IMG element is containedinside an A element The A element has its ownattributes and its own URL describing wherethe user will be sent if they click on theadvertisement

8 ReferencesNielsen Jakob ldquoBanner Blindness Old andNew Findingsrdquo Jacob Nielsonrsquos Alertboxhttpwwwuseitcomalertboxbanner-blindnesshtml (accessed December 15 2007)

Ragget Dave Arnaud Le Hors and Ian Jacobsed ldquoHTML 401 Specificationrdquo W3Chttpwwww3orgTRREC-html40 (accessedDecember 15 2007)ldquoExtensionsrdquo Mozilla FoundationhttpdevelopermozillaorgendocsExtensions(accessed December 15 2007)

ldquoadblockrdquo The Adblock Projecthttpadblockmozdevorg (accessed December15 2007)

Page 2: Automatically Detecting Banner Ads in Web Pages

There is a lot of context information availablefor use The IMG element contains HTML andCSS attributes which are key-value pairs withsemantics based on key names The mostimportant is the src attribute whose value isthe URL of the image file to display Thisattribute is treated separately from all others

31 Feature ConstructionAdZap uses a very simple method of creatingfeatures from the URL The URL is brokeninto tokens at non-alphanumeric boundariesSeries of tokens that are separated only by dotsor hyphens are joined together to createadditional larger tokens For example the URLldquohttpspeedpointrollcomPoint-Rollhelliprdquo is translated to the feature setldquourlspeedrdquo ldquourlpointrollrdquoldquourlcomrdquo ldquourlspeedpointrollcomrdquoldquourlPointrdquo ldquourlRollrdquo ldquourlPoint-Rollrdquo hellip The hostname is notdistinguished from the other parts of the URLIt is common for ad URLs to includeredirection steps where the actual address ofthe image file is embedded as a queryparameter inside a URL to a redirector site foruser tracking and other purposes

AdZap uses another simple method of creatingfeatures from the HTML and CSS attributesEach keyvalue pair is concatenated into asingle token For example ltIMG height=rdquo20rdquoborder=rdquo1rdquogt is translated to the feature setldquoattributeheight20rdquoldquoattributeborder1rdquo

Note that attributes with the same key anddifferent values are treated as unrelated by themodel That is ldquoheight20rdquo andldquoheight30rdquo are separate dimensions in thefeature vector with possible values 0 and 1rather than a single dimension ldquoheightrdquo withvalues from 0 to 1000 Also integer valuedattributes are not grouped into a smaller numberof buckets (eg values from ldquoheight25rdquo toldquoheight35rdquo are not normalized toldquoheight30rdquo) This is motivated by the factthat the ad industry has a set of standard ad

sizes specified to the pixel An image with oneof the standard sizes is very likely an ad andconversely a size that is even one pixeldifferent is much less likely to be an ad

Location of the image on the page is alsoinformative banner advertisements arecommonly placed along the top and right sideof web pages and less often on the left side orin the center However location is a difficultattribute to work with since it depends on theshape size and layout of the browser windowand the rest of the document Maximizing thebrowser window adding or removing UIelements to the browser window or evenchanging the font can change the location ofimages AdZap currently ignores attributesspecifying location

4 ImplementationAdZap is a Firefox browser plugin Note thatAdZap is a new piece of software and isunrelated to the ldquoadzapperrdquo package for squidproxies

When a document is first loaded in the browsereach IMG tag is examined and classified asldquoadrdquo or ldquonot-adrdquo by the model Ads are hiddenby rendering them 90 transparent Thismakes ads virtually invisible but still makes theimage visible enough for the user to recoverfrom classification errors

AdZap adds some controls to the browserwindow Clicking on the AdZap button puts thebrowser in ldquozapping moderdquo In zapping modea left-click on an image labels that image as anldquoadrdquo Similarly a right-click labels the imageas ldquonot-adrdquo The image list at the bottom of thewindow contains all the images in the currentdocument and left and right clicks in this listbehave the same as zapping mode clicksLabels created by these clicks are saved in apersistent store and the model is retrained eachtime the user clicks

Figure 1AdZap User Interface

41 Training DataThe first design of AdZap only allowed the userto label ads Every image on a page that wasnot explicitly labeled ldquoadrdquo was implicitlylabeled ldquonot-adrdquo and used to train the model

This implicit labeling works poorlyExperienced web surfers have trainedthemselves to identify and ignore ads at asubconscious level (ad blindness) so it iseasy to overlook ads even when consciouslylooking for them Beyond that pages containdozens sometimes of hundreds of images thatare small unobtrusive or even invisible Forexample CNNrsquos home page has more than 160images A small number are ads or news

photos The majority are company logosinvisible ldquoweb bugsrdquo partner logos icons ofunknown purpose cobranding logos and moreWeb bugs are a particular problem becausetheir URLs are very similar to ad image URLsbut they never get marked as ads because theyare invisible

Finally there are many images that are simplyambiguous For example Forbesrsquo home pagecontains a medium-sized image of a vehiclelinked to ForbesAutoscom On the one handthis is a legitimate navigation link to anothersection of Forbesrsquo web site ForbesAutoscomcontains free auto reviews and other meaningfulcontent On the other hand the imageprominently features a Lincoln Towncar and

clicking the link takes you directly to theLincoln section of ForbesAutoscom completewith ads by Lincoln and links to purchase aLincoln Itrsquos purposely unclear where thedividing line between editorial content andadvertising is

This was a real problem until I realized that forthe great majority of images on the web theirproper classification is ldquodonrsquot know and donrsquotcarerdquo The user cares about the large flashingannoying ads and the large interestingmeaningful photos and other actual contentEverything else is ignorable

The second and current design of AdZapintroduces a category of ldquoignorerdquo All imagessmaller that a certain area are ldquoignorerdquo Thistakes care of web bugs company logos and thelike Images that are very narrow or very shortare also ldquoignorerdquo This takes care of border artand lines Images in ldquoignorerdquo are not used astraining data (unless explicitly relabeled as ldquoadrdquoor ldquonot-adrdquo by the user) and are not classifiedby the model They are displayed normally inthe browser

Furthermore the second design of AdZapallows the user to explicitly label images asldquonot-adrdquo as well as ldquoadrdquo These labeled imagesare used as training data Images that are notlabeled are not used as training data howeverthey are classified by the model and hidden ifclassified as an ldquoadrdquo Since ldquoadrdquo images arerendered mostly transparent instead of totallyhidden this creates a simple feedback loop withthe user When the user notices misclassifiedimages they can explicitly label them and thusimprove the model

5 Experimental ResultsIt was not immediately clear what the best wayto gather training data was One could imagineenumerating all the web pages in existence andvisiting a random sample of them This hassome logistical problems and itrsquos not clear thatthis would actually be representative of a real

user experience anyway

I decided to visit a selection of news sitesThese sites offer a rich selection of newsphotos advertisements and miscellaneousimages My test procedure was to visitnewsgooglecom and for each story on thefront page follow the top three links for thatstory Each story link goes to an article page ona news site like nytimescom On each articlepage I classified every visible image as ldquoadrdquoldquonot-adrdquo or ldquodonrsquot know and donrsquot carerdquo (bynot clicking on that image) I repeated thisprocedure at intervals since the stories onnewsgooglecom change over time until I hadsufficient data points

This training data was used to evaluate themodel using different sets of features as shownon Figure 2 (see next page) Each colorrepresents a model that incorporates only thefeatures listed in the legend above Eachfeature set was evaluated using 20-fold cross-validation on training sets of various sizes

6 ConclusionsTraining a Naiumlve Bayes model on image URLsis very effective at identifying ads The fewerrors that occur have little negative effect onthe userrsquos browsing experience Some error isunavoidable due to the inherent ambiguity ofcertain images

Training a model on HTML and CSS attributesis not very effective at identifying ads Thereare a number of standardized shapes andpositions for banner ads and these types of adsare detected easily However images inunusual places such as the center of the pageor images with unusual shapes are poorlyclassified

Training on both URLs and HTML attributesisnrsquot any better than training on just URLsThis might be due to overtraining

Figure 2Evaluation on various feature sets

7 Future WorkCurrently ad URLs are distinctively differentfrom non-ad URLs However if large numbersof users started using ad-blocking software adcompanies would probably respond by makingtheir URLs indistinguishable from other imageURLs which would be fairly easy If thishappened then ad classification based onHTML attributes like size and location wouldbecome relatively more useful since theseattributes canrsquot be obfuscated like the URL can

It might be possible to improve theperformance of HTML features by morecomplex analysis of web pages For examplemost advertisements are clickable linksmeaning that the IMG element is containedinside an A element The A element has its ownattributes and its own URL describing wherethe user will be sent if they click on theadvertisement

8 ReferencesNielsen Jakob ldquoBanner Blindness Old andNew Findingsrdquo Jacob Nielsonrsquos Alertboxhttpwwwuseitcomalertboxbanner-blindnesshtml (accessed December 15 2007)

Ragget Dave Arnaud Le Hors and Ian Jacobsed ldquoHTML 401 Specificationrdquo W3Chttpwwww3orgTRREC-html40 (accessedDecember 15 2007)ldquoExtensionsrdquo Mozilla FoundationhttpdevelopermozillaorgendocsExtensions(accessed December 15 2007)

ldquoadblockrdquo The Adblock Projecthttpadblockmozdevorg (accessed December15 2007)

Page 3: Automatically Detecting Banner Ads in Web Pages

Figure 1AdZap User Interface

41 Training DataThe first design of AdZap only allowed the userto label ads Every image on a page that wasnot explicitly labeled ldquoadrdquo was implicitlylabeled ldquonot-adrdquo and used to train the model

This implicit labeling works poorlyExperienced web surfers have trainedthemselves to identify and ignore ads at asubconscious level (ad blindness) so it iseasy to overlook ads even when consciouslylooking for them Beyond that pages containdozens sometimes of hundreds of images thatare small unobtrusive or even invisible Forexample CNNrsquos home page has more than 160images A small number are ads or news

photos The majority are company logosinvisible ldquoweb bugsrdquo partner logos icons ofunknown purpose cobranding logos and moreWeb bugs are a particular problem becausetheir URLs are very similar to ad image URLsbut they never get marked as ads because theyare invisible

Finally there are many images that are simplyambiguous For example Forbesrsquo home pagecontains a medium-sized image of a vehiclelinked to ForbesAutoscom On the one handthis is a legitimate navigation link to anothersection of Forbesrsquo web site ForbesAutoscomcontains free auto reviews and other meaningfulcontent On the other hand the imageprominently features a Lincoln Towncar and

clicking the link takes you directly to theLincoln section of ForbesAutoscom completewith ads by Lincoln and links to purchase aLincoln Itrsquos purposely unclear where thedividing line between editorial content andadvertising is

This was a real problem until I realized that forthe great majority of images on the web theirproper classification is ldquodonrsquot know and donrsquotcarerdquo The user cares about the large flashingannoying ads and the large interestingmeaningful photos and other actual contentEverything else is ignorable

The second and current design of AdZapintroduces a category of ldquoignorerdquo All imagessmaller that a certain area are ldquoignorerdquo Thistakes care of web bugs company logos and thelike Images that are very narrow or very shortare also ldquoignorerdquo This takes care of border artand lines Images in ldquoignorerdquo are not used astraining data (unless explicitly relabeled as ldquoadrdquoor ldquonot-adrdquo by the user) and are not classifiedby the model They are displayed normally inthe browser

Furthermore the second design of AdZapallows the user to explicitly label images asldquonot-adrdquo as well as ldquoadrdquo These labeled imagesare used as training data Images that are notlabeled are not used as training data howeverthey are classified by the model and hidden ifclassified as an ldquoadrdquo Since ldquoadrdquo images arerendered mostly transparent instead of totallyhidden this creates a simple feedback loop withthe user When the user notices misclassifiedimages they can explicitly label them and thusimprove the model

5 Experimental ResultsIt was not immediately clear what the best wayto gather training data was One could imagineenumerating all the web pages in existence andvisiting a random sample of them This hassome logistical problems and itrsquos not clear thatthis would actually be representative of a real

user experience anyway

I decided to visit a selection of news sitesThese sites offer a rich selection of newsphotos advertisements and miscellaneousimages My test procedure was to visitnewsgooglecom and for each story on thefront page follow the top three links for thatstory Each story link goes to an article page ona news site like nytimescom On each articlepage I classified every visible image as ldquoadrdquoldquonot-adrdquo or ldquodonrsquot know and donrsquot carerdquo (bynot clicking on that image) I repeated thisprocedure at intervals since the stories onnewsgooglecom change over time until I hadsufficient data points

This training data was used to evaluate themodel using different sets of features as shownon Figure 2 (see next page) Each colorrepresents a model that incorporates only thefeatures listed in the legend above Eachfeature set was evaluated using 20-fold cross-validation on training sets of various sizes

6 ConclusionsTraining a Naiumlve Bayes model on image URLsis very effective at identifying ads The fewerrors that occur have little negative effect onthe userrsquos browsing experience Some error isunavoidable due to the inherent ambiguity ofcertain images

Training a model on HTML and CSS attributesis not very effective at identifying ads Thereare a number of standardized shapes andpositions for banner ads and these types of adsare detected easily However images inunusual places such as the center of the pageor images with unusual shapes are poorlyclassified

Training on both URLs and HTML attributesisnrsquot any better than training on just URLsThis might be due to overtraining

Figure 2Evaluation on various feature sets

7 Future WorkCurrently ad URLs are distinctively differentfrom non-ad URLs However if large numbersof users started using ad-blocking software adcompanies would probably respond by makingtheir URLs indistinguishable from other imageURLs which would be fairly easy If thishappened then ad classification based onHTML attributes like size and location wouldbecome relatively more useful since theseattributes canrsquot be obfuscated like the URL can

It might be possible to improve theperformance of HTML features by morecomplex analysis of web pages For examplemost advertisements are clickable linksmeaning that the IMG element is containedinside an A element The A element has its ownattributes and its own URL describing wherethe user will be sent if they click on theadvertisement

8 ReferencesNielsen Jakob ldquoBanner Blindness Old andNew Findingsrdquo Jacob Nielsonrsquos Alertboxhttpwwwuseitcomalertboxbanner-blindnesshtml (accessed December 15 2007)

Ragget Dave Arnaud Le Hors and Ian Jacobsed ldquoHTML 401 Specificationrdquo W3Chttpwwww3orgTRREC-html40 (accessedDecember 15 2007)ldquoExtensionsrdquo Mozilla FoundationhttpdevelopermozillaorgendocsExtensions(accessed December 15 2007)

ldquoadblockrdquo The Adblock Projecthttpadblockmozdevorg (accessed December15 2007)

Page 4: Automatically Detecting Banner Ads in Web Pages

clicking the link takes you directly to theLincoln section of ForbesAutoscom completewith ads by Lincoln and links to purchase aLincoln Itrsquos purposely unclear where thedividing line between editorial content andadvertising is

This was a real problem until I realized that forthe great majority of images on the web theirproper classification is ldquodonrsquot know and donrsquotcarerdquo The user cares about the large flashingannoying ads and the large interestingmeaningful photos and other actual contentEverything else is ignorable

The second and current design of AdZapintroduces a category of ldquoignorerdquo All imagessmaller that a certain area are ldquoignorerdquo Thistakes care of web bugs company logos and thelike Images that are very narrow or very shortare also ldquoignorerdquo This takes care of border artand lines Images in ldquoignorerdquo are not used astraining data (unless explicitly relabeled as ldquoadrdquoor ldquonot-adrdquo by the user) and are not classifiedby the model They are displayed normally inthe browser

Furthermore the second design of AdZapallows the user to explicitly label images asldquonot-adrdquo as well as ldquoadrdquo These labeled imagesare used as training data Images that are notlabeled are not used as training data howeverthey are classified by the model and hidden ifclassified as an ldquoadrdquo Since ldquoadrdquo images arerendered mostly transparent instead of totallyhidden this creates a simple feedback loop withthe user When the user notices misclassifiedimages they can explicitly label them and thusimprove the model

5 Experimental ResultsIt was not immediately clear what the best wayto gather training data was One could imagineenumerating all the web pages in existence andvisiting a random sample of them This hassome logistical problems and itrsquos not clear thatthis would actually be representative of a real

user experience anyway

I decided to visit a selection of news sitesThese sites offer a rich selection of newsphotos advertisements and miscellaneousimages My test procedure was to visitnewsgooglecom and for each story on thefront page follow the top three links for thatstory Each story link goes to an article page ona news site like nytimescom On each articlepage I classified every visible image as ldquoadrdquoldquonot-adrdquo or ldquodonrsquot know and donrsquot carerdquo (bynot clicking on that image) I repeated thisprocedure at intervals since the stories onnewsgooglecom change over time until I hadsufficient data points

This training data was used to evaluate themodel using different sets of features as shownon Figure 2 (see next page) Each colorrepresents a model that incorporates only thefeatures listed in the legend above Eachfeature set was evaluated using 20-fold cross-validation on training sets of various sizes

6 ConclusionsTraining a Naiumlve Bayes model on image URLsis very effective at identifying ads The fewerrors that occur have little negative effect onthe userrsquos browsing experience Some error isunavoidable due to the inherent ambiguity ofcertain images

Training a model on HTML and CSS attributesis not very effective at identifying ads Thereare a number of standardized shapes andpositions for banner ads and these types of adsare detected easily However images inunusual places such as the center of the pageor images with unusual shapes are poorlyclassified

Training on both URLs and HTML attributesisnrsquot any better than training on just URLsThis might be due to overtraining

Figure 2Evaluation on various feature sets

7 Future WorkCurrently ad URLs are distinctively differentfrom non-ad URLs However if large numbersof users started using ad-blocking software adcompanies would probably respond by makingtheir URLs indistinguishable from other imageURLs which would be fairly easy If thishappened then ad classification based onHTML attributes like size and location wouldbecome relatively more useful since theseattributes canrsquot be obfuscated like the URL can

It might be possible to improve theperformance of HTML features by morecomplex analysis of web pages For examplemost advertisements are clickable linksmeaning that the IMG element is containedinside an A element The A element has its ownattributes and its own URL describing wherethe user will be sent if they click on theadvertisement

8 ReferencesNielsen Jakob ldquoBanner Blindness Old andNew Findingsrdquo Jacob Nielsonrsquos Alertboxhttpwwwuseitcomalertboxbanner-blindnesshtml (accessed December 15 2007)

Ragget Dave Arnaud Le Hors and Ian Jacobsed ldquoHTML 401 Specificationrdquo W3Chttpwwww3orgTRREC-html40 (accessedDecember 15 2007)ldquoExtensionsrdquo Mozilla FoundationhttpdevelopermozillaorgendocsExtensions(accessed December 15 2007)

ldquoadblockrdquo The Adblock Projecthttpadblockmozdevorg (accessed December15 2007)

Page 5: Automatically Detecting Banner Ads in Web Pages

Figure 2Evaluation on various feature sets

7 Future WorkCurrently ad URLs are distinctively differentfrom non-ad URLs However if large numbersof users started using ad-blocking software adcompanies would probably respond by makingtheir URLs indistinguishable from other imageURLs which would be fairly easy If thishappened then ad classification based onHTML attributes like size and location wouldbecome relatively more useful since theseattributes canrsquot be obfuscated like the URL can

It might be possible to improve theperformance of HTML features by morecomplex analysis of web pages For examplemost advertisements are clickable linksmeaning that the IMG element is containedinside an A element The A element has its ownattributes and its own URL describing wherethe user will be sent if they click on theadvertisement

8 ReferencesNielsen Jakob ldquoBanner Blindness Old andNew Findingsrdquo Jacob Nielsonrsquos Alertboxhttpwwwuseitcomalertboxbanner-blindnesshtml (accessed December 15 2007)

Ragget Dave Arnaud Le Hors and Ian Jacobsed ldquoHTML 401 Specificationrdquo W3Chttpwwww3orgTRREC-html40 (accessedDecember 15 2007)ldquoExtensionsrdquo Mozilla FoundationhttpdevelopermozillaorgendocsExtensions(accessed December 15 2007)

ldquoadblockrdquo The Adblock Projecthttpadblockmozdevorg (accessed December15 2007)