Crawl Operators’ Workshop

Roger G. Coram

www.bl.uk

Topics

• ExternalGeoLocationDecideRule

• Sheets

– IpAddressSetDecideRule

ExternalGeoLocationDecideRule

• Legal Deposit legislation passed in April 2013.

• The Legal Deposit Libraries (Non-Print Works) Regulations 2013:

– 18 (1) “…a work published on line shall be treated as published in the United Kingdom if:

• “(b) it is made available to the public by a person and any of that person’s activities relating to the creation or the publication of the work take place within the United Kingdom.”

Geolocation

• ExternalGeoLocationDecideRule requires:

– A list of ISO 3166-1 country-codes to be included in the crawl

• GB, FR, DE, etc.

– An implementation of ExternalGeoLookupInterface.
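The two requirements above combine into a simple scope check, sketched here in Python rather than Heritrix's Java. This is a minimal illustration, not the rule's actual code: `GeoLookup`, `in_geo_scope` and the toy table are hypothetical names, and the addresses are reserved documentation addresses with invented country assignments.

```python
from typing import Callable, Iterable, Optional

# Hypothetical stand-in for an ExternalGeoLookupInterface implementation:
# maps an IP address string to an ISO 3166-1 alpha-2 code, or None if unknown.
GeoLookup = Callable[[str], Optional[str]]


def in_geo_scope(ip: str, lookup: GeoLookup, country_codes: Iterable[str]) -> bool:
    """Return True when the IP resolves to one of the configured countries."""
    country = lookup(ip)
    if country is None:
        return False  # unknown location: this check alone does not accept it
    return country.upper() in {code.upper() for code in country_codes}


# Toy lookup table (assumption): documentation-only addresses, invented countries.
toy_table = {"192.0.2.10": "GB", "198.51.100.7": "FR"}
toy_lookup = toy_table.get

uk_only = ["GB"]
print(in_geo_scope("192.0.2.10", toy_lookup, uk_only))
print(in_geo_scope("198.51.100.7", toy_lookup, uk_only))
```

Keeping the lookup behind a small interface means the country list and the database can change independently.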

ExternalGeoLookupInterface

• Our implementation is based on MaxMind’s GeoLite2 database.

• Freely available under ‘Creative Commons Attribution-ShareAlike 3.0 Unported License’.

• Only ~30MB; can be held in memory.
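To illustrate what an in-memory lookup amounts to, here is a minimal Python sketch using only the standard library. It assumes a hand-written network table; `NETWORKS` and `lookup_country` are hypothetical names, the ranges are reserved documentation blocks, and the country assignments are invented. A production lookup would query the GeoLite2 database instead.

```python
import ipaddress
from typing import Optional

# Hypothetical network-to-country table held entirely in memory.
# Ranges are documentation blocks; the country codes are made up.
NETWORKS = [
    (ipaddress.ip_network("192.0.2.0/24"), "GB"),
    (ipaddress.ip_network("198.51.100.0/24"), "FR"),
    (ipaddress.ip_network("203.0.113.0/24"), "DE"),
]


def lookup_country(ip: str) -> Optional[str]:
    """Return the country code of the first network containing ip, else None."""
    address = ipaddress.ip_address(ip)
    for network, country in NETWORKS:
        if address in network:
            return country
    return None


print(lookup_country("192.0.2.55"))
print(lookup_country("10.0.0.1"))
```

The linear scan is only for readability; with the whole table in memory, no disk access is needed per lookup, which is the point of the slide.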

crawler-beans.cxml

Configuration example:

<!-- GEO-LOOKUP: specifying location of external database. -->
<bean id="externalGeoLookup" class="uk.bl.wap.modules.deciderules.ExternalGeoLookup">
  <property name="database" value="/dev/shm/geoip-city.mmdb"/>
</bean>

<!-- ... ACCEPT those in the UK... -->
<bean id="externalGeoLookupRule" class="org.archive.crawler.deciderules.ExternalGeoLocationDecideRule">
  <property name="lookup">
    <ref bean="externalGeoLookup"/>
  </property>
  <property name="countryCodes">
    <list>
      <value>GB</value>
    </list>
  </property>
</bean>

Results

• Short test crawl (1,000,000 seeds) produced:

– 89,500,755 URLs in total.

– 26,072 non-UK URLs which would not otherwise have been in scope.

• 137 distinct hosts.
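For a sense of scale, the figures above can be turned into proportions. This short Python calculation uses only the numbers on this slide; the variable names are illustrative.

```python
total_urls = 89_500_755   # URLs in the test crawl
geo_only_urls = 26_072    # non-UK URLs brought into scope by geolocation
geo_only_hosts = 137      # distinct hosts those URLs came from

share_percent = 100 * geo_only_urls / total_urls
urls_per_host = geo_only_urls / geo_only_hosts

print(f"{share_percent:.3f}% of all URLs")
print(f"{urls_per_host:.1f} URLs per host on average")
```

That works out to roughly 0.029% of all URLs, or about 190 URLs per host.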

IP-based Sheets

“Hi,

“I'm a senior system administrator for Webfusion / 123-reg.

“We're currently experiencing lots of requests from crawler1.bl.uk to sites hosted on 81.21.76.62, this is part of our Parking platform, which links into Yahoo to allow customers to park domains and earn money.”

• Large number of hosts on a single machine.

• Need a way to reduce the load on a specific IP address.

Sheets

• “Sheets provide the ability to replace default settings on a per domain basis.”

– Allow you to change any value on any named bean for a specific set of URLs.

• Actually quite flexible:

– SurtPrefixesSheetAssociation

• Applied by matching SURT prefixes.

– DecideRuledSheetAssociation

• Applied when a series of DecideRules accepts the URI.

– IpAddressSetDecideRule
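The overlay idea can be sketched in a few lines of Python: start from the global settings and let each matching association overwrite individual values. This is a conceptual model only, not Heritrix's implementation; the sheet name `gentle`, the default values and the example host are all hypothetical.

```python
from typing import Any, Callable, Dict, List, Tuple

Settings = Dict[str, Any]

# Global settings keyed by "bean.property" path. Values are placeholders.
DEFAULTS: Settings = {"disposition.delayFactor": 5.0, "disposition.minDelayMs": 3000}

# Named sheets, each overriding a subset of the global settings.
SHEETS: Dict[str, Settings] = {
    "gentle": {"disposition.delayFactor": 8.0},
}

# Each association pairs a predicate over the URI with the sheets it applies.
Association = Tuple[Callable[[str], bool], List[str]]
ASSOCIATIONS: List[Association] = [
    (lambda uri: uri.startswith("http://parked.example/"), ["gentle"]),
]


def effective_settings(uri: str) -> Settings:
    """Overlay every matching sheet on top of a copy of the defaults."""
    settings = dict(DEFAULTS)
    for matches, sheet_names in ASSOCIATIONS:
        if matches(uri):
            for name in sheet_names:
                settings.update(SHEETS[name])
    return settings


print(effective_settings("http://parked.example/page"))
print(effective_settings("http://other.example/"))
```

Only the overridden keys change; everything else falls through to the defaults.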

1. crawler-beans.cxml

Configuration example:

<bean id="extraPolite" class="org.archive.spring.Sheet">
  <property name="map">
    <map>
      <entry key="disposition.delayFactor" value="8.0"/>
      <entry key="disposition.minDelayMs" value="10000"/>
      <entry key="disposition.maxDelayMs" value="60000"/>
      <entry key="disposition.respectCrawlDelayUpToSeconds" value="60"/>
    </map>
  </property>
</bean>

<bean id="crawlLimited" class="org.archive.spring.Sheet">
  <property name="map">
    <map>
      <entry key="quotaEnforcer.serverMaxFetchResponses" value="25"/>
    </map>
  </property>
</bean>
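To make the effect of the extraPolite values concrete, here is a simplified Python model of how such politeness settings are commonly understood to interact: wait a multiple of the last fetch time, clamp it between the minimum and maximum, and honour a robots.txt Crawl-delay up to the configured cap. The function name and parameters are hypothetical and the logic is an assumption for illustration, not Heritrix's actual code.

```python
from typing import Optional


def politeness_delay_ms(
    fetch_duration_ms: float,
    delay_factor: float = 8.0,
    min_delay_ms: int = 10_000,
    max_delay_ms: int = 60_000,
    respect_crawl_delay_up_to_s: int = 60,
    robots_crawl_delay_s: Optional[float] = None,
) -> int:
    """Simplified model of the wait before the next request to the same server."""
    delay = delay_factor * fetch_duration_ms
    delay = max(delay, min_delay_ms)  # never wait less than the minimum
    delay = min(delay, max_delay_ms)  # never wait more than the maximum
    if robots_crawl_delay_s is not None:
        # Honour robots.txt Crawl-delay, but only up to the configured cap.
        capped_ms = min(robots_crawl_delay_s, respect_crawl_delay_up_to_s) * 1000
        delay = max(delay, capped_ms)
    return int(delay)


for fetch_ms in (500, 3_000, 10_000):
    print(fetch_ms, politeness_delay_ms(fetch_ms))
```

Under this model a 500 ms fetch is followed by the 10 s minimum, a 3 s fetch by 24 s, and a 10 s fetch by the 60 s maximum.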

2. crawler-beans.cxml

Configuration example:

<bean class="org.archive.crawler.spring.DecideRuledSheetAssociation">
  <property name="rules">
    <bean class="org.archive.modules.deciderules.IpAddressSetDecideRule">
      <property name="ipAddresses">
        <set>
          <value>81.21.76.62</value>
        </set>
      </property>
      <property name="decision" value="ACCEPT"/>
    </bean>
  </property>
  <property name="targetSheetNames">
    <list>
      <value>extraPolite</value>
      <value>crawlLimited</value>
    </list>
  </property>
</bean>
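The association in this configuration reduces to a set-membership test on the server's IP address. The Python sketch below shows that idea only; `sheets_for_ip` is a hypothetical helper, and the IP address and sheet names are taken from the example.

```python
import ipaddress
from typing import List

# Taken from the example: one IP address mapped to two sheets.
THROTTLED_IPS = {ipaddress.ip_address("81.21.76.62")}
TARGET_SHEETS = ["extraPolite", "crawlLimited"]


def sheets_for_ip(server_ip: str) -> List[str]:
    """Return the sheet names to overlay for a URI served from server_ip."""
    if ipaddress.ip_address(server_ip) in THROTTLED_IPS:
        return list(TARGET_SHEETS)
    return []


print(sheets_for_ip("81.21.76.62"))
print(sheets_for_ip("192.0.2.1"))
```

Because every parked domain on that machine resolves to the same address, one entry covers all of its hostnames without listing them.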

Thank you

GitHub: https://github.com/ukwa/bl-heritrix-modules

MaxMind: http://dev.maxmind.com/geoip/geoip2/geolite2/