ACAP Technical Framework

Guide to implementation of ACAP Version 1.1
Communication with Crawlers

A component of the ACAP Technical Framework

Issue 1, 9 October 2009


Document history

Version               Release date
Version 1.0 Issue 1   2008-03-18
Version 1.1 Issue 1   2009-10-09


Table of contents

Document history
Acknowledgements
1 Introduction
  1.1 Why implement ACAP Version 1.1
  1.2 Who needs to read this Implementation Guide
  1.3 ACAP and the Robots Exclusion Protocol
  1.4 How this Guide works
    1.4.1 How to read the examples
  1.5 Preparation for implementing ACAP
  1.6 What will and won't happen when you implement ACAP
2 Step-by-step implementation guide
  2.1 Step 1: Define the crawlers to which to address your access and usage policies
    2.1.1 Option 1A: Address all crawlers
    2.1.2 Option 1B: Address a group of named crawlers
    2.1.3 Option 1C: Address all crawlers that can interpret ACAP expressions differently from those that can only interpret conventional REP expressions
  2.2 Step 2: Define the purposes for which a crawler may use your content
  2.3 Step 3: Define the content for which usage is either permitted or prohibited
  2.4 Step 4: Define policies for specific usages
    2.4.1 Option 4A: follow – Express whether or not crawlers may follow links
    2.4.2 Option 4B: index – Express whether or not crawled content may be indexed
    2.4.3 Option 4C: preserve – Express whether or not copies of crawled content may be preserved
    2.4.4 Option 4D: present – Express whether or not a representation of a crawled resource may be delivered for presentation in an end-user's browser
  2.5 Step 5: Define restrictions upon basic usages
    2.5.1 Option 5A: Time-limit restriction of permissions for basic usages
    2.5.2 Option 5B: Permit indexing of content, but specify resource to be used for indexing
    2.5.3 Option 5C: Permit presentation of snippets with maximum snippet length restriction
    2.5.4 Option 5D: Permit presentation of snippets, but specify text to be used
    2.5.5 Option 5E: Permit presentation of content but prohibit specified modifications
    2.5.6 Option 5F: Permit presentation of content but require or prohibit specified presentation 'contexts'
3 Glossary
4 References
Annex – Sample content from robots.txt and HTML files used in testing
  A.1 Introduction
  A.2 De Persgroep samples
    A.2.1 Content of the robots.txt file on test website acap.persgroep.be
    A.2.2 META tags used in HTML test content on website acap.persgroep.be


Acknowledgements

Many of the examples provided in this Implementation Guide have been derived from the robots.txt files and META tags created and used by publishers to test ACAP Version 1.0. We are grateful to all members of the ACAP Technology Working Group for their assistance with reviewing this Implementation Guide prior to publication, and especially to the following for allowing us to use their test results as source material in preparing the examples:

Gert François, Manager New Media, De Persgroep (www.persgroep.be)
David Sommer, Commercial Director, MPS Technologies (www.mpstechnologies.com)
Daniel Neethling, General Manager, Southern Newspapers, Media 24 (www.media24.com)
Edgar Schouten, Reed Business Information – Netherlands (www.reedbusiness.nl)
Paul Mostert, Elsevier Science (www.elsevier.com)
Sébastien Richard, Director, Web Development, Exalead S.A. (www.exalead.com)


1 Introduction

ACAP enables publishers to communicate their policies for access to and use of their content in machine-readable forms to search engines and other content aggregators.

ACAP Version 1.0 was published in November 2007. It has been adopted by more than 1250 online publishers worldwide. ACAP Version 1.0 continues to provide the basis for a quick and simple implementation of ACAP, where the intention of the publisher is to indicate support for ACAP without making changes to their existing policies as expressed in existing robots.txt files and in web pages using Robots META tags.

ACAP Version 1.1 clarifies and extends the functionality of ACAP for existing use cases, as well as addressing the needs of some additional use cases. We recommend that, for any implementation of ACAP beyond the quick and simple implementation described on the ACAP website, ACAP Version 1.1 should be implemented in preference to ACAP Version 1.0.

This Implementation Guide aims to help you understand and implement ACAP Version 1.1. For guidance and a tool to support a quick and simple way of implementing ACAP Version 1.0, go to the ACAP website and click on 'Implement ACAP'.

1.1 Why implement ACAP Version 1.1

ACAP Version 1.1 is designed to meet the needs of a specific set of use cases that focus on the activity of search engines and other "web harvesters" in relation to general web content. ACAP is being extended to support a variety of other business models for the delivery and use of online content; those use cases are not covered by ACAP Version 1.1 and are therefore not dealt with in this version of the Implementation Guide. For further information on how ACAP is being extended for application to business models other than search and web archiving, please visit the ACAP website.

ACAP Version 1.1 represents a small but significant upgrade of ACAP Version 1.0, clarifying existing features and including a small number of new features. There is no absolute need for existing implementations of ACAP Version 1.0 to upgrade (ACAP Version 1.1 is backwards-compatible with ACAP Version 1.0), but new implementers should use the new version for anything other than a "quick and simple implementation", and existing implementers are advised to review the additional features in the new version before deciding whether or not to upgrade.

Search engines typically use software systems, often referred to as crawlers (also known as 'robots', 'bots' or 'spiders'), to crawl web content, follow links within it to other web content, index it, preserve (copy) it and present it in different ways to end-users. ACAP Version 1.1 enables publishers to express which of these usages are permitted or prohibited for which online content and, in the case of permitted usages, express some restrictions on those usages. ACAP Version 1.1 also enables publishers to express that any usage other than the five basic usages (crawl, follow, index, preserve and present) is prohibited.

If you wish to communicate clear and precise policies as to how crawler-based aggregation services, including search engines, should use your website content, you probably need to implement ACAP Version 1.1.

From this point forward, when referring to "ACAP" without any explicit reference to version, we mean "ACAP Version 1.1".

1.2 Who needs to read this Implementation Guide

This Guide is aimed principally at webmasters and website administrators, but is not a full technical specification. The full technical specifications of ACAP can be downloaded from the ACAP website. References to these specifications can be found at the end of this Guide.

1.3 ACAP and the Robots Exclusion Protocol

ACAP makes use of an existing communication protocol that is the most widely used means of communicating with search engine crawlers: the Robots Exclusion Protocol (REP). ACAP makes use of both the major components of REP:

- the robots.txt file that is placed on a web server specifically for communication with crawlers, and
- the Robots META tags which may be embedded in the headers of individual HTML web pages.

In both cases, ACAP uses REP to communicate expressions of what uses of the specified content the publisher wishes to permit. By defining a set of extensions to REP, ACAP enables publishers to express more precisely nuanced permissions than is possible using the conventional REP formats.

It is important to understand that REP takes a permissive approach to crawler communication. If a robots.txt file does not explicitly prohibit a usage, a crawler may well assume it to be permitted. For publishers used to the underlying assumptions of copyright, this may seem counter-intuitive, but it is the way that REP works in practice, and the proper expression of policies using ACAP will take this into account. One consequence of this is that ACAP defines a usage verb "other", in addition to the five basic usage verbs already mentioned, to enable usages other than those explicitly permitted to be prohibited.


1.4 How this Guide works

This Guide walks you through the process of working out how to use ACAP to express your content access and use policies. It does so in a series of steps, and at each step you will be presented with a series of options to choose from, each of which is illustrated by examples to assist you in selecting the options that most closely match your own use case. If you find that it is not clear which options to select or how to adapt the examples to your own case, please contact us for further advice.

If you find that none of the options seems to apply to your use case, it is possible that ACAP does not yet provide a solution for you, in which case we would be pleased to hear from you, so that we can see how ACAP might be extended to meet your needs. However, it is also possible that ACAP already meets your needs, but we have simply not made that clear in the way that the options and examples are presented, in which case we would be delighted to put this right.

At various points you have the option of contacting us to discuss your precise requirements. Please follow this option if that seems appropriate.

NOTE – All examples in the main part of this Implementation Guide are intended to be realistic but fictitious. Any similarity in names of crawlers, domains or resources between the examples in this Guide and real-world examples is unintended and entirely coincidental. By contrast, the examples in the Annex are real-world examples provided by publishers who have tested ACAP Version 1.0 in real business use cases.

1.4.1 How to read the examples

The examples all show how ACAP policies can be expressed either within a robots.txt file or in META tags embedded in HTML pages. In the robots.txt examples the hash symbol # indicates that the remainder of the line is a comment and therefore would be ignored by a crawler.
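The comment rule just described can be sketched as a small helper function (an illustrative sketch, not part of the ACAP specification):

```python
def strip_comment(line: str) -> str:
    """Return a robots.txt line with any '#' comment removed.

    Sketch of the comment rule described above: everything from the
    first '#' to the end of the line is ignored by a crawler, and
    surrounding whitespace is not significant.
    """
    return line.split("#", 1)[0].strip()
```

Note that the '##ACAP version=1.1' marker at the top of the examples takes the form of a comment, so a conventional crawler applying this rule will simply skip it.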

In the META tag examples the tags are shown in the form specified by the HTML 4.01 specification. Only META tags and any associated comments are shown. These must be placed inside the HEAD element of an HTML page, in accordance with the HTML 4.01 DTD and specification.

1.5 Preparation for implementing ACAP

This Guide assumes that you will be using an existing robots.txt file to express most of the policies that apply to your online content, reserving the use of META tags for special cases in which the precise policy varies (at least in certain respects) from one HTML page to another. Other approaches may be appropriate in specific circumstances but are not covered by this Guide.


Is there already a robots.txt file on your website? If not, you will need to study guides to REP, including webmaster guidance provided by the major search engines, before creating a robots.txt file that communicates your web content access and use policies at least so far as is possible with conventional REP. For most practical purposes you cannot implement ACAP without first having a robots.txt file on your website.

For the foreseeable future it must be assumed that many crawlers will not be able to interpret ACAP policies, so we strongly recommend that you add ACAP policies to your robots.txt file without removing its current contents. Crawlers that understand your ACAP policies can be instructed to ignore the pre-existing content of your robots.txt file, as shown in this Guide. For the same reason we also recommend that ACAP permissions be embedded in HTML pages by adding new META tags without replacing existing tags.

If your website content is managed using a web content management system (CMS), we advise you to talk to your CMS supplier before starting to implement ACAP Version 1.1. Your CMS may be able to assist you in managing your content access and usage policies, in which case the advice given in this Guide may not be the best way of meeting your requirements.

For what kinds of content do you need to express access and usage policies? ACAP enables expression of policies for general web content such as HTML pages. ACAP also provides support for expression of policies that depend upon license metadata embedded in photographic images using the PLUS Coalition's License Definition Format. ACAP does not yet enable the expression of complex permissions depending upon properties of audio-visual and other non-textual content.

It's easy to start implementing ACAP now! We strongly recommend that you convert your existing robots.txt file to use ACAP forms of expression, using the tool available for this purpose on the ACAP website. Using this tool, your robots.txt file will be extended to include exactly the same information that it already contains, but expressed in ACAP terms. This will not only serve notice to crawlers that you are implementing ACAP, but will also reduce the amount of effort that may be involved in manually editing your robots.txt file to include ACAP expressions of your access and usage policies. If you wish to modify your robots.txt file to use the richer forms of expression provided by ACAP, the step-by-step guide in the next section shows you how.
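The conversion performed by the tool can be illustrated with a sketch: each conventional REP field is mirrored by the equivalent ACAP field used in the examples in this Guide, so the file states the same policies twice, once in each vocabulary. This is an illustrative sketch only, not the official conversion tool, and it handles only the three fields shown:

```python
# Mapping between conventional REP fields and the equivalent ACAP
# fields, as used in the examples in this Guide.
FIELD_MAP = {
    "User-agent": "ACAP-crawler",
    "Disallow": "ACAP-disallow-crawl",
    "Allow": "ACAP-allow-crawl",
}

def acap_mirror(robots_txt: str) -> str:
    """Return robots.txt content extended with an ACAP restatement."""
    acap_lines = ["# ACAP policies here..."]
    for line in robots_txt.splitlines():
        body = line.split("#", 1)[0].strip()  # ignore comments
        if ":" not in body:
            continue  # skip blank and malformed lines
        field, value = (part.strip() for part in body.split(":", 1))
        if field in FIELD_MAP:
            acap_lines.append(f"{FIELD_MAP[field]}: {value}")
    return ("##ACAP version=1.1\n"
            + robots_txt.rstrip() + "\n\n"
            + "\n".join(acap_lines) + "\n")
```

The original conventional records are left untouched, matching the recommendation above that ACAP policies be added without removing current contents.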

1.6 What will and won't happen when you implement ACAP

ACAP is not a technical protection mechanism. For ACAP to function as intended, crawler operators must voluntarily re-program their crawlers to interpret and act upon ACAP forms of expression in robots.txt and embedded in web content. How a crawler interprets ACAP will depend upon how it has been re-programmed to do so. Without re-programming, a crawler will simply ignore ACAP expressions, both in robots.txt and embedded in web content. This is fortunate, because it means that implementing ACAP will not disrupt the way in which your website is currently crawled and indexed by search engines and other aggregators.

2 Step-by-step implementation guide

2.1 Step 1: Define the crawlers to which to address your access and usage policies

First you need to decide to which crawlers you need to address your access and usage policies. ACAP enables you to define which crawlers may or may not have access to your content.

Crawlers can be addressed in two ways:

- as distinct groups of one or more crawlers addressed by name
- as a single group of all crawlers.

Several crawlers to which the same policies apply can be addressed as a group. Crawlers are expected to check all policies expressed using ACAP to determine which policies apply to them.

ACAP enables crawlers to be addressed in these ways by replicating the functionality in conventional REP that is provided by the User-agent field in robots.txt, but using a syntax that enables crawlers to distinguish conventional REP from the ACAP extensions to REP.

ACAP extends the functionality of conventional REP to enable a single crawler to be addressed by name in a META tag within an HTML page. Multiple named crawlers may be addressed by META tags, but only one named crawler per META tag.

ACAP policies always override conventional REP policies that apply in the same circumstances. See Option 1C for details of how to communicate specifically and differently with all crawlers that can interpret ACAP policies.

2.1.1 Option 1A: Address all crawlers

Any policies that are addressed to all crawlers define 'default' policies that apply to all crawlers unless those policies are overridden by policies addressed to specific crawlers by name.

Example 1.1: Prohibit all crawlers from crawling your content

The most basic policy is one that prohibits crawlers in general from crawling your content. Because of the essentially permissive nature of REP, it is necessary to be explicit about prohibiting crawlers from crawling your content, but it is not necessary to explicitly permit crawlers to crawl your content – silence is taken as permission.

The following example shows how to express that you wish to prohibit all crawlers from crawling your content. The example uses both conventional REP and ACAP expressions in the same robots.txt file.

##ACAP version=1.1
# Conventional policies here...
User-agent: *
Disallow: /

# ACAP policies here...
ACAP-crawler: *
ACAP-disallow-crawl: /

Example 1.2: Restrict all crawlers to crawling only specified content

In the following example, all crawlers are permitted to crawl only a specific page /index.html. This is achieved by prohibiting access generally, but permitting access to the one file.

##ACAP version=1.1
# Conventional policies here...
User-agent: *
Disallow: /
Allow: /$ # A crawler may give up if not allowed to crawl the root
Allow: /index.html$
# The 'Allow' field is a widely-recognised extension to conventional REP

# ACAP policies here...
ACAP-crawler: *
ACAP-disallow-crawl: /
ACAP-allow-crawl: /$
ACAP-allow-crawl: /index.html$

2.1.2 Option 1B: Address a group of named crawlers

A group of crawlers may be addressed by name. In most cases the name of a crawler is advertised every time it crawls a website by being included in each request for a resource from the site.

Restricting which search engine crawlers may have access to content can be done in one of two ways:

1. Explicitly state which named crawlers are permitted access and prohibit access to all others. In example 1.3 permission is given to crawlers named 'SearchBot1' and 'SearchBot2' to crawl all content, while all others are prohibited.

2. Explicitly state which crawlers are prohibited access and permit access to all others, regardless of whether or not they are able to interpret ACAP policies. In example 1.4 'SearchBot1' and 'SearchBot2' are prohibited access to all content.
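The way a crawler picks out the record group addressed to it can be sketched as a small selection function. This is an illustrative assumption about crawler behaviour, not a normative algorithm; case-insensitive name comparison is also an assumption:

```python
def records_for_crawler(groups: dict, crawler_name: str) -> list:
    """Select the policy lines addressed to a given crawler.

    Sketch of the addressing rule described above: a group naming the
    crawler explicitly takes precedence over the catch-all '*' group,
    which supplies the default policies for all other crawlers.
    """
    named = {key.lower(): lines for key, lines in groups.items()}
    return named.get(crawler_name.lower(), named.get("*", []))
```

For instance, with a named group for 'SearchBot1' and a catch-all group, 'SearchBot1' receives its own lines while any other crawler falls back to the '*' group.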

Example 1.3: Permit only named crawlers to crawl content

##ACAP version=1.1
# Conventional policies here...
User-agent: *
Disallow: /

User-agent: SearchBot1
User-agent: SearchBot2
Allow: /

# ACAP policies here...
ACAP-crawler: *
ACAP-disallow-crawl: /

ACAP-crawler: SearchBot1
ACAP-crawler: SearchBot2
ACAP-allow-crawl: /

Example 1.4: Prohibit named crawlers from crawling content

##ACAP version=1.1
# Conventional policies here...
User-agent: SearchBot1
User-agent: SearchBot2
Disallow: /

User-agent: *
Allow: /

# ACAP policies here...
ACAP-crawler: SearchBot1
ACAP-crawler: SearchBot2
ACAP-disallow-crawl: /

ACAP-crawler: *
ACAP-allow-crawl: /

2.1.3 Option 1C: Address all crawlers that can interpret ACAP expressions differently from those that can only interpret conventional REP expressions

A crawler that has not been programmed to interpret ACAP expressions will not be able to act upon your policies in full. If your policies cannot be fully expressed in conventional REP, you will almost certainly wish to communicate different policies to those crawlers that can interpret ACAP.

In the first of the following two examples, crawlers that cannot interpret ACAP policies are prohibited from crawling the site.

Example 1.5: Permit only ACAP-enabled crawlers to crawl all content

In this example all crawlers that can only interpret conventional REP are prohibited from crawling content, while all crawlers that can interpret ACAP expressions are permitted to crawl all content.

##ACAP version=1.1
# Conventional policies here...
User-agent: *
Disallow: /

# ACAP policies here...
ACAP-crawler: *
ACAP-allow-crawl: /
...

Example 1.6: Permit ACAP-enabled crawlers to crawl all content and advise them to ignore conventional REP policies

In this example, the policies expressed in conventional REP are different to those expressed using ACAP. While ACAP policies should always override conventional policies, crawlers that can interpret ACAP policies can be expressly asked to ignore conventional REP policies, as in this example.

##ACAP version=1.1
# Conventional policies here...
User-agent: *
Disallow: /
#...other conventional policies here

# ACAP policies here...
ACAP-ignore-conventional-records
ACAP-crawler: *
ACAP-allow-crawl: /
#... other ACAP policies here

2.2 Step 2: Define the purposes for which a crawler may use your content

In most cases crawlers crawl content for the purpose of using it in a specific service, such as in a general search service or in a news, sports or other themed service. However, in some cases the content accessed by a single crawler may be used for more than one distinct purpose: it may be used in more than one service, or a single service may be delivered in a variety of ways (e.g. through several different web addresses or portals). You may not wish all your content to be used for all the purposes for which a crawler is crawling your content.

ACAP makes it possible to communicate to a crawler that it may crawl content only for certain specified usage purposes. Rather like a specific crawler name, a usage purpose is specified using a label that is recognised by the crawler. ACAP also makes it possible to communicate that a crawler may not crawl content for certain specified purposes. Permissions and prohibitions that relate to a specified set of services always override permissions and prohibitions for which no purpose is specified.

There are no standard labels for search engine services. If the crawler operator does not specify a particular service label to use, it is suggested that the web address (URI) for the service in question be used, as this is usually sufficient to identify the service uniquely. It is expected that individual search engine operators will publish names or labels that their crawlers will recognise and associate with the various services that they deliver.

If you are not concerned about the purposes for which permitted resources are used, there is no need to specify an ACAP usage purpose.

Example 2.1: Prohibit specified crawler from crawling content for use in specified services

The following example denies crawler 'SearchBot1' access to a site for the purpose of using content in services identified respectively by use of the text patterns "news.*" and "images.*". Note that this restriction does not apply to any other crawler. These text patterns are examples of a simple form of regular expression in which the asterisk * is a 'wild card' character representing any sequence of characters at the end of the pattern, so it is assumed that the crawler 'SearchBot1' is capable of correctly interpreting such patterns.

##ACAP version=1.1
# Conventional policies here...

# ACAP policies here...
ACAP-crawler: *
ACAP-allow-crawl: /

ACAP-crawler: SearchBot1
ACAP-usage-purpose: news.*
ACAP-usage-purpose: images.*
ACAP-disallow-crawl: /
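The trailing-wildcard matching assumed of 'SearchBot1' in Example 2.1 can be sketched using Python's shell-style pattern matcher. This is a convenient stand-in, not the normative matching algorithm; the ACAP specification defines the exact rules:

```python
from fnmatch import fnmatchcase

def purpose_matches(service_label: str, pattern: str) -> bool:
    """Sketch: test a service label against a usage-purpose pattern.

    The asterisk is a wild card representing any sequence of
    characters, as in the 'news.*' and 'images.*' patterns above.
    fnmatchcase gives case-sensitive shell-style matching.
    """
    return fnmatchcase(service_label, pattern)
```

A service labelled 'news.world' matches the pattern 'news.*', while 'sports.daily' does not.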

Example 2.2: Permit crawlers to crawl content only for use in a named service

In this example crawlers are only permitted to crawl content for use in services identified by the label "sports".

##ACAP version=1.1
# Conventional policies here...

# ACAP policies here...
ACAP-crawler: *
ACAP-disallow-crawl: /
ACAP-usage-purpose: sports
ACAP-allow-crawl: /


2.3 Step 3: Define the content for which usage is either permitted or prohibited

ACAP policies in robots.txt either permit or prohibit the crawling and use of specified content resources. In all of the examples shown in Steps 1 and 2, except example 1.2, the policies being expressed apply to all content on a website.

A strict use of conventional REP only permits access to be prohibited for the whole website or for named directories or named resources. ACAP enables usages to be explicitly permitted. This feature is based upon a widely-recognised extension to conventional REP that also enables a wider prohibition to be overridden for specific resources, using the 'Allow:' construct.

The resources to which a permission or prohibition policy applies are specified by what is known as a 'resource path pattern'. This is a string against which a crawler will match the resource path part of each content item's web address, i.e. the part of the web address that follows the host name (and occasionally the port number), beginning with the slash character '/'. A slash on its own represents all content on the website. A slash followed by any other characters represents some subset of resources on the website whose resource paths match the pattern of characters. Such patterns of characters are a simple form of regular expression in which asterisks and dollar signs have special meaning. For more information on resource path patterns see the full technical specification of ACAP Version 1.1.
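The special characters just described can be sketched as a translation into a conventional regular expression. This is a simplification ('*' matching any sequence of characters, a trailing '$' anchoring the end of the path, anything else a literal prefix); the full technical specification is authoritative on the exact rules:

```python
import re

def path_pattern_to_regex(pattern: str):
    """Compile a resource path pattern into a compiled regex.

    '*' matches any sequence of characters; a trailing '$' anchors the
    match at the end of the resource path. Any other pattern is a
    prefix match, so '/public/' matches everything below that
    directory.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = "".join(".*" if ch == "*" else re.escape(ch) for ch in pattern)
    return re.compile(body + ("$" if anchored else ""))

def path_matches(path: str, pattern: str) -> bool:
    # Patterns are matched from the start of the resource path.
    return path_pattern_to_regex(pattern).match(path) is not None
```

Under this sketch, '/index.html$' matches exactly '/index.html', while '/public/' matches any path below the /public directory.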

The longer the resource path pattern, the more specific the set of resources that match the pattern will generally be. If one pattern represents a subset of the resources represented by another pattern, any policy using the former (narrower) pattern will override a contradictory policy using the latter (broader) pattern.

For example:

ACAP-disallow-crawl: /

is overridden by

ACAP-allow-crawl: /public/

for content items that match the resource path pattern '/public/' on that web server. This is interpreted to mean that a crawler is permitted to crawl a resource located in the directory /public, or in a sub-directory below this directory, but is prohibited from crawling resources that are located elsewhere.

The above example shows how a permission policy can override a prohibition policy. A prohibition policy can override a permission policy in the same way.

NOTE – It is advised that directory names in resource path patterns should be terminated by a slash, to avoid inconsistencies in interpretation by crawlers.
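The override behaviour just described can be sketched in code. This sketch uses the common longest-matching-pattern-wins heuristic as a stand-in for "a narrower pattern overrides a broader one", and treats patterns as plain path prefixes for simplicity; it is an illustration, not the normative resolution algorithm:

```python
def crawl_permitted(path: str, rules: list) -> bool:
    """Decide whether crawling a path is permitted.

    Each rule is a pair ('allow', pattern) or ('disallow', pattern),
    where the pattern is a plain path prefix. Among all rules whose
    pattern matches the path, the one with the longest pattern decides.
    If no rule matches, REP's permissive default applies and crawling
    is permitted.
    """
    decision, best = True, -1
    for verb, pattern in rules:
        if path.startswith(pattern) and len(pattern) > best:
            decision, best = (verb == "allow"), len(pattern)
    return decision
```

With the rules [('disallow', '/'), ('allow', '/public/')], a path below /public is permitted because the allow pattern is the narrower (longer) match, while any other path falls under the broad disallow.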


Example 3.1: Permit ACAP-enabled crawlers to crawl all content except content located in a specified directory

In this example the robots.txt file has been set to express that only crawlers that can interpret ACAP policies are permitted to crawl the website, while the ACAP policy states that any crawler is explicitly permitted to crawl and index the site as a whole, with the exception of the directory /nocrawl.

##ACAP version=1.1
# Conventional policies here...
User-agent: *
Disallow: /

# ACAP policies here...
ACAP-crawler: *
# explicitly allow access to the whole site...
ACAP-allow-crawl: /
# ...except for the content of directory /nocrawl
ACAP-disallow-crawl: /nocrawl/

Example 3.2: Permit all crawlers only to crawl content in specified directories

An alternative approach is to specify exactly which directories a crawler may crawl. In this example any crawler may crawl the website home page /index.html and specified directories, but otherwise is not permitted to crawl the site. This is expressed using both conventional REP extended by 'Allow:', as well as using the equivalent ACAP forms of expression.

##ACAP version=1.1
# conventional policies here...
User-agent: * # addressed to all crawlers
# prohibit access to all directories...
Disallow: / # but...
# permit access to the home page
Allow: /$ # A crawler may give up if not allowed to crawl the root
Allow: /index.html$
# permit access to the public directory and sub-directories
Allow: /public/
# permit access to the promotion directory and sub-directories
Allow: /promotion/
# permit access to the news directory and sub-directories
Allow: /news/

# ACAP policies here...
ACAP-crawler: * # addressed to all crawlers
# prohibit access to all directories
ACAP-disallow-crawl: / # but...
# permit access to the index page
ACAP-allow-crawl: /$
ACAP-allow-crawl: /index.html$
# permit access to the public directory and sub-directories
ACAP-allow-crawl: /public/
# permit access to the promotion directory and sub-directories
ACAP-allow-crawl: /promotion/
# permit access to the news directory and sub-directories
ACAP-allow-crawl: /news/

The following is a re-statement of Example 3.2, but the ACAP expression shows how a 'resource set' can be defined in order to express a policy more concisely. The resource set 'permitted' is defined to include all resources that match any of the resource path patterns '/index.html', '/public/', '/promotion/' or '/news/'. For more information on resource sets see the full technical specification of ACAP Version 1.1.

##ACAP version=1.1
# conventional policies here...
User-agent: * # addressed to all crawlers
# prohibit access to all directories...
Disallow: / # but...
# permit access to the home page
Allow: /$ # A crawler may give up if not allowed to crawl the root
Allow: /index.html$
# permit access to the public directory and sub-directories
Allow: /public/
# permit access to the promotion directory and sub-directories
Allow: /promotion/
# permit access to the news directory and sub-directories
Allow: /news/
# ACAP policies here...
ACAP-resource-set: permitted /$ /index.html /public/ /promotion/ /news/
ACAP-crawler: * # addressed to all crawlers
# prohibit access to all directories
ACAP-disallow-crawl: / # but...
# permit access to any resources in resource set 'permitted'
ACAP-allow-crawl: the-acap:resource-set:permitted
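A crawler evaluating such a policy needs to expand a resource-set reference back into its member patterns. The sketch below illustrates one way this might be done; the 'the-acap:resource-set:' prefix follows the example above, but the data structures and function names are invented for illustration.

```python
# Sketch only: expands a named resource set into its member path patterns.

resource_sets = {
    "permitted": ["/$", "/index.html", "/public/", "/promotion/", "/news/"],
}

def expand(value):
    """Expand a resource-set reference; pass plain patterns through."""
    prefix = "the-acap:resource-set:"
    if value.startswith(prefix):
        return resource_sets[value[len(prefix):]]
    return [value]

print(expand("the-acap:resource-set:permitted"))
# ['/$', '/index.html', '/public/', '/promotion/', '/news/']
print(expand("/nocrawl/"))  # a plain pattern passes through unchanged
```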


Example 3.3: Permit crawlers to crawl specific files and file types

Many search engine crawlers recognise an extension of the capability of conventional REP to use wildcard characters to create 'globs' or 'patterns' against which many resources sharing some common pattern of characters in their path or filename can be matched. The ACAP extensions to REP also enable such pattern matching. In this example the use of such resource path patterns is shown. The asterisk * represents zero or more characters and the dollar sign $ indicates that the preceding character must be the final character in the path.

##ACAP version=1.1
# Conventional policies here...
User-agent: * # addressed to any crawler
# prohibit access to content generally...
Disallow: /
# but allow access to the home page /index.html
Allow: /$ # A crawler may give up if not allowed to crawl the root
Allow: /index.html$
# and to HTML files in the directory /public
Allow: /public/*.htm$
Allow: /public/*.html$
# ACAP policies here...
ACAP-crawler: * # addressed to any crawler
# prohibit access to content generally...
ACAP-disallow-crawl: /
# but allow access to the home page /index.html
ACAP-allow-crawl: /$
ACAP-allow-crawl: /index.html$
# and to HTML files in the directory /public
ACAP-allow-crawl: /public/*.htm$
ACAP-allow-crawl: /public/*.html$
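The wildcard semantics described for Example 3.3 can be sketched in a few lines: '*' matches any run of characters and a trailing '$' anchors the end of the path. This is an illustrative re-implementation, not a complete or normative one.

```python
import re

# Sketch only: converts an ACAP/REP-style path pattern into a Python
# regular expression ('*' -> any run of characters, trailing '$' -> anchor).

def pattern_to_regex(pattern):
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # escape regex metacharacters, then restore '*' as a wildcard
    regex = re.escape(pattern).replace(r"\*", ".*")
    return re.compile(regex + ("$" if anchored else ""))

def matches(pattern, path):
    # re.match anchors at the start of the path, as REP patterns do
    return pattern_to_regex(pattern).match(path) is not None

print(matches("/public/*.html$", "/public/a/b.html"))  # True
print(matches("/public/*.html$", "/public/b.htmlx"))   # False
print(matches("/$", "/"))                              # True
print(matches("/$", "/index.html"))                    # False
```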

2.4 Step 4: Define policies for specific usages

Apart from the basic usage 'crawl', ACAP can be used to express policies for a number of other usages involved in the way that most search engines (and similar aggregators) deliver their services to end-users. This enables publishers to define more precisely how their content may be used. Permitting a crawler to crawl a resource gives it access to that resource for subsequent use. If crawlers are permitted to crawl, but the policies are silent about other usages, crawlers will interpret silence as permission.

ACAP supports policies that express permission or prohibition of the following usages:

crawl a content resource
follow links from a resource to other resources
index a resource
preserve a persistent copy of a resource
present a resource in some form to an end-user
other usages not explicitly permitted or prohibited

The preceding steps have dealt with the crawl usage. All other usages depend upon being permitted to crawl, so it is not necessary to prohibit other usages if crawling is prohibited.
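The two rules just stated (silence about a usage counts as permission, and every non-crawl usage depends on crawl being permitted) can be sketched as follows. The policy dictionary and defaults are assumptions for illustration, not ACAP data structures.

```python
# Sketch only: evaluates a usage against explicit decisions, applying
# the rules "silence is permission" and "everything depends on crawl".

def usage_permitted(policy, usage):
    if not policy.get("crawl", True):
        return False                 # nothing is permitted without crawl
    return policy.get(usage, True)   # silence is interpreted as permission

policy = {"crawl": True, "preserve": False}
print(usage_permitted(policy, "index"))            # True (silent -> permitted)
print(usage_permitted(policy, "preserve"))         # False (explicit prohibition)
print(usage_permitted({"crawl": False}, "index"))  # False (crawl prohibited)
```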

2.4.1 Option 4A: follow Express whether or not crawlers may follow links

Permission to follow a resource means that web links that it contains may be followed by the crawler in order to find other resources to crawl. However, it may not be necessary, desirable or perhaps even safe for a crawler to follow all the links on a website (some links are not safe for crawlers to follow, such as links that are queries on very large databases containing mapping or similar data). The ACAP follow usage provides a means for specifying permission and prohibition policies for the following of links.

Conventional REP supports the communication of permissions or prohibitions to follow links using META tags embedded in HTML pages, but does not support communication of such policies in robots.txt. ACAP supports the communication of such policies in both robots.txt and in META tags.

Example 4.1: Prohibit crawlers from following links found in specified content

In this example, crawlers are permitted to crawl the content of the directory /public, but are prohibited from following links in content within /public/catalog.

##ACAP version=1.1
# Conventional policies here...
User-agent: *
# prohibit non-ACAP-enabled crawlers from crawling anything
Disallow: /
# ACAP policies here...
ACAP-crawler: *
# prohibit access to content generally
ACAP-disallow-crawl: /
# allow access to /public/ via the root /
ACAP-allow-crawl: /$
ACAP-allow-crawl: /public/
ACAP-disallow-follow: /public/catalog/


Example 4.2: Use an embedded META tag to prohibit crawlers from following links found in a resource

In this example, crawlers are not permitted to follow links in the HTML page in which these META tags are embedded.

<META NAME="robots" CONTENT="nofollow"> <!-- conventional policy -->
<META NAME="robots" CONTENT="ACAP disallow-follow"> <!-- ACAP policy -->
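A crawler reading such a page has to distinguish conventional REP tokens from ACAP directives in the META 'content' value. The sketch below shows one plausible way to split the two forms; the exact parsing rules and the returned structure are assumptions for illustration.

```python
# Sketch only: classifies a robots META 'content' value as either an
# ACAP directive list ("ACAP disallow-follow") or conventional REP tokens.

def parse_robots_meta(content):
    tokens = content.split()
    if tokens and tokens[0].upper() == "ACAP":
        return {"scheme": "ACAP", "directives": tokens[1:]}
    return {"scheme": "REP", "directives": tokens}

print(parse_robots_meta("ACAP disallow-follow"))
# {'scheme': 'ACAP', 'directives': ['disallow-follow']}
print(parse_robots_meta("nofollow"))
# {'scheme': 'REP', 'directives': ['nofollow']}
```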

2.4.2 Option 4B: index Express whether or not crawled content may be indexed

Permission to index enables the derivation and storage of index entries from the crawled content (according to whatever indexing methods are employed by the crawler operator).

At this Step we show how policies for indexing entire resources may be expressed. Step 5 will show how indexing of a resource may be limited to fragments of text within a resource.

Example 4.3: Permit indexing of specified resources

In the simplest case, a publisher may wish to restrict indexing (and possibly other defined usages) to specific resources such as the pages created for public consumption. Assuming that these are to be found in the directory /public (and its sub-directories), Example 4.3 shows how ACAP permissions are used to permit indexing of resources in this directory only. Although in this case the policy is silent about indexing resources outside this directory, the prohibition on crawling other resources means that only resources in /public can be indexed. Other usages are explicitly prohibited in this case.

##ACAP version=1.1
# Conventional policies
User-agent: *
Disallow: /
# ACAP policies
ACAP-crawler: *
# prohibit access to content generally
ACAP-disallow-crawl: /
# allow access to /public/ via the root /
ACAP-allow-crawl: /$
ACAP-allow-crawl: /public/
ACAP-allow-index: /public/
ACAP-disallow-other: /public/

Example 4.4: Prohibit indexing of image files

In this example, the crawler SearchBot1 is permitted to index content in the '/TEXTS' directory but is prohibited from indexing any image files, which are identified by the patterns of characters used in image filename extensions.

##ACAP version=1.1
# Conventional policies here...
User-agent: *
Disallow: /
# ACAP policies here...
ACAP-crawler: *
# prohibit access to content generally
ACAP-disallow-crawl: /
# Policies for SearchBot1
ACAP-crawler: SearchBot1
ACAP-allow-crawl: /$ # always allow crawlers access to the root /
ACAP-allow-crawl: /TEXTS/
ACAP-allow-index: /TEXTS/
ACAP-disallow-index: /TEXTS/*.gif$
ACAP-disallow-index: /TEXTS/*.png$
ACAP-disallow-index: /TEXTS/*.jpg$

The following is a restatement of Example 4.4 using a defined resource set.

##ACAP version=1.1
# Conventional policies here...
User-agent: *
Disallow: /
# ACAP policies here...
ACAP-crawler: *
# prohibit access to content generally
ACAP-disallow-crawl: /
ACAP-resource-set: images /TEXTS/*.gif$ /TEXTS/*.png$ /TEXTS/*.jpg$
# Policies for SearchBot1
ACAP-crawler: SearchBot1
ACAP-allow-crawl: /$ # always allow crawlers access to the root /
ACAP-allow-crawl: /TEXTS/
ACAP-allow-index: /TEXTS/
ACAP-disallow-index: the-acap:resource-set:images

Example 4.5: Permit indexing of HTML content only

In this example, the crawler SearchBot1 is only permitted to index HTML pages in the '/TEXTS' directory but is prohibited from indexing any other content.

##ACAP version=1.1
# Conventional policies here...
User-agent: *
Disallow: /
# ACAP policies here...
ACAP-crawler: *
# prohibit access to content generally
ACAP-disallow-crawl: /
# Policies for SearchBot1
ACAP-crawler: SearchBot1
ACAP-allow-crawl: /$
ACAP-allow-crawl: /TEXTS/
ACAP-disallow-index: /TEXTS/
ACAP-allow-index: /TEXTS/*.html$

Example 4.6: Use an embedded META tag to prohibit crawlers from indexing a resource

In this example, crawlers are not permitted to index the HTML page in which these META tags are embedded.

<META NAME="robots" CONTENT="noindex"> <!-- conventional policy -->
<META NAME="robots" CONTENT="ACAP disallow-index"> <!-- ACAP policy -->

2.4.3 Option 4C: preserve Express whether or not copies of crawled content may be preserved

Permission to preserve enables the persistent storage of crawled content. Terms such as "store" and "cache" have been avoided in ACAP terminology, due to the risk that these terms would be misinterpreted.

The term "preserve" should not be understood to imply 'permanent preservation' in an archival sense, but does imply storage beyond the immediate requirements to process the crawled content for indexing or other immediate purposes. The meaning of the search engine term "cache", implying 'storage until re-crawled', is encompassed by the meaning of the ACAP term "preserve".

When a search engine stores crawled HTML content in a 'cache', it may be necessary to make changes to the way that the content is encoded in HTML to ensure that, when the cached copy is presented in an end-user's browser, it resembles as faithfully as possible the page that was originally crawled. The term "preserve" therefore also conveys the sense that a copy of the crawled content is stored in a form that, when presented, closely resembles the original on the publisher's website, implying that some modification of the crawled content prior to storage may have been involved.

ACAP policies can express whether or not a search engine is permitted to preserve a copy of a crawled resource in its 'cache'. Step 5 will show how permission to preserve may be qualified to limit the time for which a copy of a crawled resource may be stored.

It should be noted that search engines usually generate snippets from cached copies of resources, so if they are prohibited from preserving a resource, they are usually prevented from presenting a snippet for that resource.

Example 4.7: Permit crawlers to index but not to preserve copies of resources

In this example, the crawler SearchBot1 is permitted to crawl and index web content in directory '/public', but not to preserve (i.e. cache) copies of the resources found there.

##ACAP version=1.1
# Conventional policies here...
User-agent: *
Disallow: /
# ACAP policies here...
ACAP-crawler: *
# prohibit access to content generally
ACAP-disallow-crawl: /
# Permissions for SearchBot1
ACAP-crawler: SearchBot1
ACAP-allow-crawl: /$
ACAP-allow-crawl: /public/
ACAP-allow-index: /public/
ACAP-disallow-preserve: /public/

Example 4.8: Permit crawlers to preserve copies of specified resources

In this example, the crawler SearchBot1 is permitted to crawl and index web content in directory '/public', and may cache HTML pages, but may not cache other content from that directory (e.g. images).

##ACAP version=1.1
# Conventional policies here...
User-agent: *
Disallow: /
# ACAP policies here...
ACAP-crawler: *
# prohibit access to content generally
ACAP-disallow-crawl: /
# Permissions for SearchBot1
ACAP-crawler: SearchBot1
ACAP-allow-crawl: /$
ACAP-allow-crawl: /public/
ACAP-allow-index: /public/
ACAP-disallow-preserve: /public/
ACAP-allow-preserve: /public/*.htm$
ACAP-allow-preserve: /public/*.html$

Example 4.9: Use an embedded META tag to prohibit crawlers from preserving a resource

In this example, crawlers are not permitted to preserve the HTML page in which these META tags are embedded.

<!-- There is no conventional REP equivalent -->
<META NAME="robots" CONTENT="ACAP disallow-preserve"> <!-- ACAP policy -->

2.4.4 Option 4D: present Express whether or not a representation of a crawled resource may be delivered for presentation in an end-user's browser

Permission to present enables a search engine or other aggregator to deliver representations of a crawled resource to an end-user's browser.

Resources can be represented in a variety of ways. ACAP supports the expression of policies for the presentation of the following representations of crawled resources:
links to the original resource on the content owner's website
snippets, usually generated by the search engine
thumbnail images, also usually generated by the search engine
preserved copies from the search engine cache, if necessary distinguishing between copies of the most recently-crawled version of a resource and an older version of the same resource
the original resource retrieved by the search engine in real time from the content owner's website.

Example 4.10: Permit presentation of some but not all representations of specified resources

In this example, a publisher permits the specified crawler SearchBot1 to create and present snippets for HTML pages in the '/news' directory and to present thumbnails and links, but does not permit it to present other representations (e.g. cached or original copies) of the same content.

##ACAP version=1.1
# Conventional policies here...
User-agent: *
Disallow: /
# ACAP policies here...
ACAP-crawler: *
# prohibit access to content generally
ACAP-disallow-crawl: /
# Policies for SearchBot1
ACAP-crawler: SearchBot1
ACAP-allow-crawl: /$
ACAP-allow-crawl: /news/
ACAP-allow-index: /news/
ACAP-allow-preserve: /news/
ACAP-disallow-present: /news/
ACAP-allow-present-snippet: /news/*.htm$
ACAP-allow-present-thumbnail: /news/*.htm$
ACAP-allow-present-links: /news/*.htm$
ACAP-allow-present-snippet: /news/*.html$
ACAP-allow-present-thumbnail: /news/*.html$
ACAP-allow-present-links: /news/*.html$

Example 4.11: Prohibit thumbnail images from being presented

In this example all forms of presentation are permitted except for thumbnails.

##ACAP version=1.1
# Conventional policies here...
User-agent: *
Disallow: /
# ACAP policies here...
ACAP-crawler: *
# prohibit access to content generally
ACAP-disallow-crawl: /
# Policies for SearchBot1
ACAP-crawler: SearchBot1
ACAP-allow-crawl: /$
ACAP-allow-crawl: /news/
ACAP-allow-index: /news/
ACAP-allow-preserve: /news/
ACAP-allow-present: /news/
ACAP-disallow-present-thumbnail: /news/

Example 4.12: Use an embedded META tag to prohibit crawlers from presenting a resource snippet

In this example, crawlers are not permitted to present a snippet for the HTML page in which these META tags are embedded.

<!-- There is no conventional REP equivalent -->
<!-- ACAP policy -->
<META NAME="robots" CONTENT="ACAP disallow-present-snippet">

Example 4.13: Prohibit presentation of cached copies of a resource, but permit presentation of the original resource (retrieved in real time from the publisher's website), provided it is an HTML page

In this example, ACAP is used to instruct all crawlers to present only the original version of all resources in the directory '/news'.

##ACAP version=1.1
# Conventional policies here...
User-agent: *
Disallow: /
# ACAP policies here...
ACAP-crawler: *
# prohibit access to content generally...
ACAP-disallow-crawl: /
# ...but allow the content of /news to be crawled
ACAP-allow-crawl: /$
ACAP-allow-crawl: /news/
ACAP-allow-index: /news/
ACAP-allow-preserve: /news/
ACAP-allow-present: /news/
ACAP-disallow-present-currentcopy: /news/
ACAP-disallow-present-oldcopy: /news/
ACAP-disallow-present-original: /news/
ACAP-allow-present-original: /news/*.htm$
ACAP-allow-present-original: /news/*.html$

Example 4.14: Permit a preserved copy to be displayed, provided it is the most recently-crawled version

In this example, ACAP is used to prohibit access to the directory '/archive' for all crawlers with the exception of the crawler called SearchBot1, which is given permission to preserve copies of the resources in '/archive' and to present these preserved copies.

##ACAP version=1.1
# Conventional policies here...
User-agent: *
Allow: / # implied, but included for clarity
Disallow: /archive/
# ACAP policies here...
ACAP-crawler: SearchBot1
# permit access to all content, including /archive
# ACAP-allow-crawl: / # implied, but shown for clarity
ACAP-allow-crawl: /archive/ # overrides conventional policy
# ACAP-allow-index: /archive/ # implied, but shown for clarity
# ACAP-allow-preserve: /archive/ # implied, but shown for clarity
ACAP-allow-present: /archive/
ACAP-disallow-present-oldcopy: /archive/
# ACAP-allow-present-currentcopy: /archive/ # implied

2.5 Step 5: Define restrictions upon basic usages

ACAP policies may be refined to place restrictions upon basic usages. ACAP supports the following types of restrictions:
time limit for indexing, preserving or presentation
which resource to use for indexing
maximum length for presentation of a snippet
modification prior to presentation
presentation context.

2.5.1 Option 5A: Time-limit restriction of permissions for basic usages

There are three ways of expressing a time limit in an ACAP policy, all of which may be applied to the basic usages index, preserve and present. These are:
1. until the resource is re-crawled
2. until a specified date, after which the permitted usage ceases
3. for a specified number of days after the resource is first indexed, after which the permitted usage ceases.
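The three forms can be sketched as a small expiry calculation. The qualifier spellings ('until-recrawled', 'until-YYYY-MM-DD', 'N-days') follow the examples in this section; the function itself and its return conventions are assumptions for illustration.

```python
from datetime import date, timedelta

# Sketch only: computes the expiry implied by a time-limit qualifier.
# None means "no fixed date; the permission lapses at the next crawl".

def expiry(time_limit, first_indexed):
    if time_limit == "until-recrawled":
        return None
    if time_limit.startswith("until-"):
        return date.fromisoformat(time_limit[len("until-"):])
    if time_limit.endswith("-days"):
        days = int(time_limit[:-len("-days")])
        return first_indexed + timedelta(days=days)
    raise ValueError("unrecognised time limit: " + time_limit)

print(expiry("until-2008-03-01", date(2008, 1, 1)))  # 2008-03-01
print(expiry("14-days", date(2008, 1, 1)))           # 2008-01-15
```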


Example 5.1: Permit content to be indexed, preserved and presented until recrawled

Although this is often the default behaviour of a search engine, it may be beneficial to make explicit the restriction that a resource may only be indexed and preserved until it is next re-crawled, implying that older versions of the resource may not remain in the index or be preserved. This example uses ACAP to permit the crawler SearchBot1 to preserve the index page 'index.html' until it is re-crawled.

##ACAP version=1.1
# Conventional policies here...
User-agent: *
Disallow: /
# ACAP policies here...
ACAP-crawler: SearchBot1
ACAP-disallow-crawl: /
# allow to crawl /index.html
ACAP-allow-crawl: /$
ACAP-allow-crawl: /index.html$
ACAP-allow-index: /index.html time-limit=until-recrawled
ACAP-allow-preserve: /index.html time-limit=until-recrawled

Example 5.2: Permit a copy of a resource to be indexed and preserved until a specified date

In this example, ACAP is used to tell all crawlers that content in the '/serial' directory may only be indexed and preserved until the specified date (after which it is to be moved into an archive).

##ACAP version=1.1
# Conventional policies here...
User-agent: *
Disallow: /
# ACAP policies here...
ACAP-crawler: *
# allow /serial to be crawled
ACAP-allow-crawl: /$
ACAP-allow-crawl: /serial/
ACAP-allow-index: /serial/ time-limit=until-2008-03-01
ACAP-allow-preserve: /serial/ time-limit=until-2008-03-01


Example 5.3: Use an embedded META tag to permit crawlers to index and preserve a resource until a specified date

If varying policies are to be applied to different resources, it may be easier to present this in the Robots META tags format, the date being set as appropriate for each case.

In this example the crawler SearchBot1 is permitted to index and preserve the HTML page in which these META tags are embedded until the specified date.

<!-- There is no conventional REP equivalent -->
<!-- ACAP policy -->
<META NAME="SearchBot1"
CONTENT="ACAP allow-preserve time-limit=until-2008-03-01">

Example 5.4: Permit a resource to be indexed and preserved for a specified number of days

This example shows a time limit of 14 days for indexing and preserving content in the directory '/current' for crawler SpecialBot1. Such a limit might be set by prior arrangement with a specialist search engine, which only crawls infrequently and accepts a variety of limits on how long it may hold content in its index, but this is unlikely to be appropriate when communicating with search engines in general.

##ACAP version=1.1
# Conventional policies here...
User-agent: *
Disallow: /
# ACAP policies here...
ACAP-crawler: SpecialBot1
ACAP-disallow-crawl: /
# allow to crawl /current
ACAP-allow-crawl: /$
ACAP-allow-crawl: /current/
ACAP-allow-index: /current/ time-limit=14-days
ACAP-allow-preserve: /current/ time-limit=14-days

2.5.2 Option 5B: Permit indexing of content, but specify resource to be used for indexing

ACAP can be used to specify a particular resource to be used for indexing. This may be specified in a number of ways.

Warning! Some of the ways in which this feature may be used could be treated by search engines as 'cloaking' – an attempt to deceive a search engine as to the true nature of the content of a page – which could cause a website to be blacklisted, so this feature should be used with caution. In some cases, a special arrangement may need to be made with search engines to have these features accepted by their crawlers.

If the crawled resource contains text, and is in a format that the indexing processes can handle (e.g. plain text, HTML, XML or PDF), the whole of the crawled content will normally be used for indexing purposes. For the publisher this may be inappropriate.

If, on the other hand, the crawled resource is an object that does not contain text, but is an image or some other non-text object, or is text in an unreadable format, the search engine will not be able to index it directly. If the object is embedded in an HTML page, or the crawler has followed a link to it from an HTML page, the indexing process will use text in the referring page, close to the embedded or linked object, to index it. From the publisher's point of view this, too, may be inappropriate.

In the case of an HTML page, the indexing of that page can be specified to be any of the following:
an extract from the page comprising a single HTML element within the body of the page, specified by the value of its 'class' attribute
an extract from the page comprising a single HTML element within the body of the page, specified by the value of its 'id' attribute
the value of the 'content' attribute on a META tag within the page header, specified by its 'name' attribute
a separate resource specified by its URI.

In the case of any other type of content, the indexing of that resource can be specified to be a separate resource specified by its URI.

It is inadvisable to specify a separate resource to be used to index a crawled resource without checking first that this is acceptable to the search engines concerned, because this is viewed by many search engines as a form of cloaking.

A future version of ACAP is expected to support the embedding of policies in non-text content, in which case it will become possible to specify within the non-text content how it should be indexed.
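The first option above, extracting a single element by its 'class' attribute, might be honoured by a crawler along the lines of the following sketch. It uses Python's standard html.parser; the class name 'abstract' matches Example 5.5 below, but the extractor itself is a minimal illustration that ignores edge cases such as void elements inside the target.

```python
from html.parser import HTMLParser

# Sketch only: collects the text of the first element whose 'class'
# attribute has a given value, as a crawler honouring
# must-use-resource=the-acap:extract:class:abstract might do.

class ClassExtractor(HTMLParser):
    def __init__(self, wanted_class):
        super().__init__()
        self.wanted = wanted_class
        self.depth = 0          # >0 while inside the wanted element
        self.text = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1     # track nesting inside the target
        elif dict(attrs).get("class") == self.wanted:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.text.append(data)

page = '<html><body><p>Full text.</p><div class="abstract">Summary only.</div></body></html>'
parser = ClassExtractor("abstract")
parser.feed(page)
print("".join(parser.text).strip())  # Summary only.
```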

Example 5.5: Permit HTML content to be indexed using only text specified by 'class' attribute value

In the following example the crawler SearchBot1 may crawl and index pages in the directory /articles, but only using an element (assumed to exist in each page) whose 'class' attribute has the value 'abstract'.

##ACAP version=1.1
# Conventional policies
User-agent: *
Disallow: /
# ACAP policies
ACAP-crawler: SearchBot1
# prohibit access to content generally
ACAP-disallow-crawl: /
# allow access to /articles/
ACAP-allow-crawl: /$
ACAP-allow-crawl: /articles/
ACAP-allow-index: /articles/ must-use-resource=the-acap:extract:class:abstract

Example 5.6: Use an embedded META tag to permit crawlers to index a resource using only text specified by 'id' attribute value

The use of 'id' attribute values for specifying text to be used for indexing purposes is most likely to be practical when the policy is embedded in a META tag.

<!-- There is no conventional REP equivalent -->
<!-- ACAP policy -->
<META NAME="SearchBot1"
CONTENT="ACAP allow-index must-use-resource=the-acap:extract:id:i012345">

Example 5.7: Permit HTML content to be indexed using only text in the 'content' value of a specified META tag

In the following example the crawler SearchBot1 may crawl and index pages in the directory /articles, but only using the 'content' value of a META tag with a specified value of its 'name' attribute (assumed to exist in each page header).

##ACAP version=1.1
# Conventional policies
User-agent: *
Disallow: /
# ACAP policies
ACAP-crawler: SearchBot1
# prohibit access to content generally
ACAP-disallow-crawl: /
# allow access to /articles/
ACAP-allow-crawl: /$
ACAP-allow-crawl: /articles/
ACAP-allow-index: /articles/ must-use-resource=the-acap:extract:meta:index-text


2.5.3 Option 5C: Permit presentation of snippets with maximum snippet length restriction

ACAP allows the expression of permission to present snippets to be qualified through the specification of a maximum snippet length. The length can be expressed either as a number of characters or as a number of words.
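A search engine honouring such a qualifier would truncate its generated snippet accordingly. The sketch below assumes, for illustration, a qualifier value shaped like 'max-length=20-words' as in Example 5.8; the parsing and function name are invented.

```python
# Sketch only: enforces a maximum snippet length expressed as a number
# of words or of characters, mirroring a qualifier such as '20-words'.

def truncate_snippet(text, max_length):
    count, unit = max_length.split("-")   # e.g. "20-words"
    count = int(count)
    if unit == "words":
        words = text.split()
        return text if len(words) <= count else " ".join(words[:count])
    if unit == "characters":
        return text[:count]
    raise ValueError("unknown unit: " + unit)

print(truncate_snippet("one two three four", "3-words"))  # one two three
```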

Example 5.8: Permit presentation of snippets with a restriction on maximum length, expressed in robots.txt

In this example, all crawlers are permitted to index and present snippets for crawled content in the directory '/news' but snippets are restricted to 20 words in length.

##ACAP version=1.1
# Conventional policies here...
User-agent: *
Disallow: /
# ACAP policies here...
ACAP-crawler: *
# permit to crawl /news
ACAP-allow-crawl: /$
ACAP-allow-crawl: /news/
# ACAP-allow-index: /news/ # implied, but included for clarity
# ACAP-allow-preserve: /news/ # implied, but included for clarity
# ACAP-allow-present: /news/ # implied, but included for clarity
ACAP-allow-present-snippet: /news/ max-length=20-words

Example 5.9: Use an embedded META tag to permit crawlers to present a resource snippet with restricted maximum length

If varying policies are to be applied to different resources, it may be easier to present this in the Robots META tags format, the maximum length of snippet being set as appropriate for each case.

<!-- There is no conventional REP equivalent -->
<!-- ACAP policy -->
<META name="robots"
content="ACAP allow-present-snippet max-length=20-words">

2.5.4 Option 5D: Permit presentation of snippets, but specify text to be used

ACAP can be used to specify a particular resource to be used as a snippet in search results. This may be specified in a number of ways, but the simplest way of doing so is shown here.


Warning! Some of the ways in which this feature may be used could be treated by search engines as 'cloaking' – an attempt to deceive a search engine as to the true nature of the content of a page – which could cause a website to be blacklisted, so this feature should be used with caution. In some cases, a special arrangement may need to be made with search engines to have these features accepted by their crawlers.

Example 5.10: Use an embedded META tag to permit crawlers to present a resource snippet containing specified text

<!-- There is no conventional REP equivalent -->
<!-- ACAP policy -->
<META name="snippet" content="This is the snippet text to be presented">
<META name="robots"
content="ACAP allow-present-snippet
must-use-resource=the-acap:extract:meta:snippet">

2.5.5 Option 5E: Permit presentation of content but prohibit specified modifications

There can be many reasons why publishers do not want the form, layout or context of their content changed or manipulated. ACAP enables expression of policies that permit presentation of content but without modification to:
data format
typographic style / layout
text translation
text annotation, for example with end-user ratings or 'tags'
any modification, including all the above.

Example 5.11: Permit presentation but prohibit changes to format (PDF to HTML)

ACAP enables publishers to specify that no modification is permitted to the data format of the content to be presented to the end-user. A common example of this is where PDF content is converted to HTML, disrupting layout and, in the worst case, rendering the transformed content meaningless. Another example would be to prohibit image format transformations.

In this example, permission is given for the preserved copy of resources to be presented but no changes to the format are permitted.

##ACAP version=1.1
# Conventional policies here...
User-agent: *
Disallow: /
# ACAP policies here...
ACAP-crawler: *
# permit to crawl content generally
ACAP-allow-crawl: /
# ACAP-allow-index: / # implied, but included for clarity
# ACAP-allow-preserve: / # implied, but included for clarity
# ACAP-allow-present: / # implied, but included for clarity
ACAP-allow-present-currentcopy: / prohibited-modification=format

Example 5.11 Use an embedded META tag to permit crawlers to present a

resource but prohibit transformation of the format (e.g. to PDF or plain text)

In this example the prohibition of format change is embedded in the HTML page.

<!-- ACAP policies -->

<META name="robots"
content="ACAP allow-present-currentcopy prohibited-modification=format">

Example 5.12: Permit presentation but prohibit change of layout

Layout and presentation of content is generally integral to a publisher's brand. ACAP enables publishers to prohibit any changes to typographic layout when content is presented to the end-user.

Publishers should bear in mind that it is unreasonable to expect a search engine or other aggregator to maintain the layout of content if it is difficult or impossible for this to be done using a copy of the content stored in the aggregator's cache. Publishers are advised to make it as easy as possible for a requested restriction on layout changes to be complied with.

In this example, all search engines have permission to index and to present HTML content on the website (by filename pattern), but if presenting a cached copy, this should be presented without any modification to typographic style or layout.

##ACAP version=1.1

# Conventional policies here...

User-agent: *

Disallow: /

# ACAP policies here...

ACAP-crawler: *

ACAP-disallow-crawl: /

# permit to crawl HTML content only

ACAP-allow-crawl: /$

ACAP-allow-crawl: *.htm$

ACAP-allow-crawl: *.html$

ACAP Technical Framework

Guide to implementation of ACAP Version 1.1 Issue 1

Page 31 2008-10-09

ACAP-allow-present-currentcopy: *.htm$ prohibited-modification=style

ACAP-allow-present-currentcopy: *.html$ prohibited-modification=style
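The path patterns in this example use `*` as a wildcard and a trailing `$` to anchor the match at the end of the path, as in widely deployed REP extensions. A minimal illustrative matcher — a sketch under those assumptions, not a normative part of ACAP — could look like this:

```python
import re

def acap_pattern_matches(pattern, path):
    """Match an ACAP/REP-style path pattern against a URL path.

    `*` matches any sequence of characters; a trailing `$` anchors the
    match at the end of the path; otherwise the pattern is a prefix match.
    Illustrative sketch only.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape literal characters, turning each * into the regex .*
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        return re.fullmatch(regex, path) is not None
    return re.match(regex, path) is not None

assert acap_pattern_matches("*.html$", "/sport/news.html")
assert not acap_pattern_matches("*.html$", "/sport/news.html?x=1")
assert acap_pattern_matches("/sport/", "/sport/news.html")  # prefix match
assert acap_pattern_matches("/$", "/")  # site root only
```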

Example 5.13: Permit presentation but prohibit translation of resources

In this example, presentation of translated versions of content is not permitted.

##ACAP version=1.1

# Conventional policies here...

User-agent: *

Disallow: /

# ACAP policies here...

ACAP-crawler: *

# permit to crawl content generally

ACAP-allow-crawl: /

# ACAP-allow-index: / # implied, but included for clarity

# ACAP-allow-preserve: / # implied, but included for clarity

ACAP-allow-present: / prohibited-modification=translation

Example 5.14: Permit presentation but only without any end-user annotations

In certain situations, publishers may wish to prevent their content from being presented by an aggregator with end-user ratings or tags: for example, peer-reviewed scientific, technical or medical content. In the example, aggregators are prohibited from presenting content with any associated end-user ratings or tags.

##ACAP version=1.1

# Conventional policies here...

User-agent: *

Disallow: /

# ACAP policies here...

ACAP-crawler: *

# permit to crawl content generally

ACAP-allow-crawl: /

# ACAP-allow-index: / # implied, but included for clarity

# ACAP-allow-preserve: / # implied, but included for clarity

ACAP-allow-present: / prohibited-modification=annotation

2.5.6 Option 5F: Permit presentation of content but require or prohibit specified presentation 'contexts'

The context in which content is presented is important: what surrounds the content or is beside it when viewed by the end-user in their browser. Search engines define the context in which search results are presented and in many cases they are either unable or unwilling to allow the publisher to restrict how they do this.

There are some limited circumstances in which a publisher might reasonably request a restriction on the context of presentation, and these all relate to the presentation of the content in full.

ACAP enables a publisher to express a policy that restricts the context of presentation in the following limited cases:

when presenting a preserved copy of the crawled content, it may not be presented in a frame added by the aggregator

when presenting the original crawled content, it must be presented in the frame as intended by the content owner.
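The two cases above can be sketched as a simple compliance check. This is illustrative only: the dictionary field names mirror the `prohibited-context` and `required-context` modifiers used in Examples 5.15 and 5.16, but the function and the `framing` vocabulary are hypothetical, not defined by ACAP:

```python
def framing_complies(policy, copy_kind, framing):
    """Check the two ACAP framing contexts described above.

    policy:    dict of modifiers from the matching ACAP permission line
    copy_kind: "currentcopy" (preserved/cached copy) or "original"
    framing:   "user-frame", "original-frame", or "none" (hypothetical values)
    """
    # Case 1: a preserved copy may not be shown in a frame the aggregator adds
    if copy_kind == "currentcopy" and \
            policy.get("prohibited-context") == "within-user-frame":
        return framing != "user-frame"
    # Case 2: the original content must be shown in the publisher's own frame
    if copy_kind == "original" and \
            policy.get("required-context") == "within-original-frame":
        return framing == "original-frame"
    return True  # no context restriction applies

assert not framing_complies(
    {"prohibited-context": "within-user-frame"}, "currentcopy", "user-frame")
assert framing_complies(
    {"required-context": "within-original-frame"}, "original", "original-frame")
```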

Example 5.15: Permit presentation of cached copies but prohibit their presentation in a frame

In this example cached copies of any web pages may be presented, but must not be presented in a frame added by the aggregator.

##ACAP version=1.1

# Conventional policies here...

User-agent: *

Disallow: /

# ACAP policies here...

ACAP-crawler: *

# permit to crawl content generally

ACAP-allow-crawl: /

# ACAP-allow-index: / # implied, but included for clarity

# ACAP-allow-preserve: / # implied, but included for clarity

# ACAP-allow-present: / # implied, but included for clarity

ACAP-allow-present-currentcopy: / prohibited-context=within-user-frame

Example 5.16: Permit presentation of the original content but require presentation in the original frame

In this example the original web pages may be presented, but must be presented in the frame in which the publisher would present them.

##ACAP version=1.1

# Conventional policies here...

User-agent: *

Disallow: /


# ACAP policies here...

ACAP-crawler: *

# permit to crawl content generally

ACAP-allow-crawl: /

# ACAP-allow-index: / # implied, but included for clarity

# ACAP-allow-preserve: / # implied, but included for clarity

# ACAP-allow-present: / # implied, but included for clarity

ACAP-allow-present-original: / required-context=within-original-frame

3 Glossary

The following glossary is based upon the ACAP Dictionary.

Allow: An indicator in ACAP expressions that a particular usage is permitted. The conventional Robots Exclusion Protocol does not include this term, but it is a widely-recognised term in established extensions to REP and its use is extended further by ACAP.

Content item: A piece of content on a publisher's website, at any level of granularity which it is possible to reference in an ACAP policy statement. In robots.txt only whole resources may be associated with a policy, but in META tags fragments of resources may be associated with a policy.

Content item index entries: The index entries (see below) required to make a content item searchable in accordance with a service's indexing practices.

Content item link: A link to a content item on the publisher's website.

Content item retrieved copy: A copy of a content item that has been retrieved by a crawler for processing. NB a retrieved copy may or may not be stored after it has been processed.

Content item snippet: A short indication of the content of a content item and, where applicable, its relevance to an end-user search, generated by a service from content item index entries or from a content item preserved copy using proprietary algorithms.

Content item extract: An extract taken verbatim from the preserved copy of a content item. The order of content in an extract may not be modified.

Content item thumbnail: A thumbnail image of a content item webpage, either generated by a service from a content item preserved copy or supplied by the publisher.

Content property: A property of a content item, encoded as part of the item or in associated metadata.

Crawl (usage): The action of a crawler, to retrieve content items in a systematic way. A basic ACAP usage verb.

Crawler: Software controlled by a service operator that systematically retrieves content items from the web in order to take copies of some or all the resources on a server for indexing and other purposes.

Disallow: An indicator in ACAP expressions that a particular usage is not permitted. The conventional Robots Exclusion Protocol uses this term, but its use is extended by ACAP.

Follow (usage): The action of a crawler, which uses links contained within a content item that it has crawled to request further resources from the same server or from other servers. A basic ACAP usage verb.

Index (usage): The action of a crawler, or software that uses content retrieved by a crawler, to create index entries for a content item, for use in a search service or for other purposes. A basic ACAP usage verb.

Index entry: An entry in an index that comprises (a) a look-up string derived from a content item retrieved copy, (b) a link to the content item, and (c) optionally, positional information to be used only for the purpose of increasing the effectiveness of searching and relevance assessment, e.g. by enabling the proximity of words in text to be taken into account.

Other (usage): The action of software controlled by a search engine, or other aggregator, to use a content item in ways other than those covered by explicitly permitted usages.

Present (usage): The action of software controlled by a search engine, or other aggregator, that uses content retrieved by a crawler to deliver a content item, however it might be represented (e.g. by snippet, thumbnail, link, cached copy), for presentation by a web browser. A basic ACAP usage verb.

Preserve (usage): The action of a crawler, or software that uses content retrieved by a crawler, to store a persistent copy of a content item, preserving as far as possible the arrangement and style of presentation of the original. A persistent copy is a copy that is stored for purposes other than those (essentially transient purposes) that are associated with crawling and indexing the content item, and is referred to as a 'preserved copy'. A 'cached' copy is a preserved copy, even if it is overwritten or removed when the same content item is crawled on a subsequent occasion. A basic ACAP usage verb.

Process: Any process that is operated by or for the operator of a crawler.

Publisher: A person or organisation that places content on the web.

Publisher content item indexing resource: A resource from which content item index entries must be generated, either based upon a content item extract or supplied by its publisher.

Publisher content item summary: A descriptive summary of a content item supplied by its publisher.

Publisher snippet: A resource specified by the publisher that must be used for presentation instead of a snippet that would be generated by a process controlled by the crawler operator.

Publisher thumbnail: A resource specified by the publisher that must be used for presentation instead of a thumbnail that would be generated by a process controlled by the crawler operator.

Service: A service, such as a search service, that provides organised access to content that has been harvested from the web. A service is normally delivered from one or more specific web addresses, or portals.

4 References

4. ACAP Technical Framework – Communicating access and usage policies to crawlers using extensions to the Robots Exclusion Protocol – Part 1: Extension of the robots.txt file format. Current version available at http://www.the-acap.org/download.php?ACAP-TF-CrawlerCommunication-Part1-V1.1.pdf.

5. ACAP Technical Framework – Communicating access and usage policies to crawlers using extensions to the Robots Exclusion Protocol – Part 2: Extension of the Robots META Tag format and other techniques for embedding permissions in HTML content. Current version available at http://www.the-acap.org/download.php?ACAP-TF-CrawlerCommunication-Part2-V1.1.pdf.

6. ACAP Dictionary of Access and Usage Terminology. Current version available at http://www.the-acap.org/download.php?ACAP-TF-Dictionary-V1.1.pdf.

7. Robots Exclusion Protocol: an informal specification, based upon an original June 1994 'consensus' of robot authors and others, can be found on the web at http://www.robotstxt.org/, including guidance on both the robots.txt format and the use of Robots META Tags. A number of extensions have been proposed by major search engine operators and others, and some of these extensions are in widespread use.

8. HTML 4.01 Specification – W3C Recommendation 24 December 1999. Current version available at http://www.w3.org/TR/html401/.


Annex – Sample content from robots.txt and HTML files used in testing

A.1 Introduction

The following are examples of robots.txt file content and HTML META tags created by one of the publishers that participated in test implementations of ACAP Version 1.0 during 2007. They are shown here to provide a more realistic impression of the way that ACAP might be used in practice.

A.2 De Persgroep samples

De Persgroep (www.persgroep.be) are a major Flemish news group. They tested ACAP for communicating access and usage policies to crawlers visiting their public news website.

A.2.1 Content of the robots.txt file on test website acap.persgroep.be

##ACAP version=0.2 #pilot version

# for http://acap.persgroep.be

# author: gert francois - de persgroep

# with minor modifications for use in the ACAP Implementation Guide

# current REP instructions

User-agent: *

Disallow: /

ACAP-ignore-conventional-records

# ACAP instructions for all crawlers


ACAP-crawler: *

ACAP-allow-crawl: /

ACAP-allow-index: /

ACAP-disallow-crawl: /site/ # not allowed to crawl the non-content directories

ACAP-allow-preserve: / time-limit=100-days # not allowed to store/preserve content longer than 100 days

ACAP-disallow-present-currentcopy: / # not allowed to display preserved (cached) copies

# ACAP instructions for ExaBotAcap12 - the test crawler implemented by Exalead.com

ACAP-crawler: ExaBotAcap12

ACAP-usage-purpose: http://acap2.exalead.com # only allowed to use content in service http://acap2.exalead.com

ACAP-disallow-index: /nieuws/ # not allowed to index the news section

ACAP-allow-index: /sport/ # allowed to index the sports section

ACAP-allow-present-snippet: /sport/ max-length=150-chars # limit the snippet of sport to 150 chars
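The `max-length=150-chars` modifier in the last line limits the length of the snippet presented for the sports section. As an illustrative sketch of how a complying service might apply such a limit — the helper below is hypothetical, not defined by ACAP:

```python
def truncate_snippet(text, max_chars=150):
    """Truncate a snippet to at most max_chars characters, breaking at a
    word boundary where possible. Illustrative sketch only."""
    if len(text) <= max_chars:
        return text
    cut = text.rfind(" ", 0, max_chars)  # last space within the limit
    if cut == -1:
        cut = max_chars  # no space found: hard cut
    return text[:cut].rstrip()

snippet = truncate_snippet("word " * 100, max_chars=150)
assert len(snippet) <= 150
assert truncate_snippet("short snippet") == "short snippet"
```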

A.2.2 META tags used in HTML test content on website acap.persgroep.be

The following sample shows only those META tags that contain ACAP policy information. They are extracted from the header of an HTML page on De Persgroep's test website, and have been modified slightly for inclusion in this Implementation Guide.

<meta name="snippet-exalead-acap" content="Op de vandaag gepubliceerde nieuwe ATP-rankings vallen er slechts hier en daar veranderingen te bespeuren" />

<meta name="ExaBotAcap12" content="ACAP allow-present-snippet must-use-resource=the-acap:extract:meta:snippet-exalead-acap"/>

(The Dutch snippet text translates roughly as: "On the new ATP rankings published today, only a few changes here and there can be seen.")