Wikipedia Cloud Search Webinar

Searching Wikipedia with Amazon CloudSearch

description

A webinar presented by Search Technologies' Chief Architect, Paul Nelson, on cloud search and a Wikipedia use case, given in conjunction with Amazon CloudSearch. Search Technologies provides implementation and consulting services for Amazon CloudSearch. For further information, see http://www.searchtechnologies.com/amazon-cloudsearch-services.html and http://www.searchtechnologies.com/

Transcript of Wikipedia Cloud Search Webinar

Page 1: Wikipedia Cloud Search Webinar

Searching Wikipedia with Amazon CloudSearch

Page 2: Wikipedia Cloud Search Webinar

Agenda

• Project Background
• High-level Architecture
• Summary & Observations

Page 3: Wikipedia Cloud Search Webinar

Project Background

• Amazon contracted with Search Technologies to help with beta-testing, prior to the launch of Amazon CloudSearch

• Decision to use Wikipedia as a convenient data set for testing purposes

Page 4: Wikipedia Cloud Search Webinar

High-level Architecture

Page 5: Wikipedia Cloud Search Webinar

Indexing

• Wikipedia provides content in a series of large XML files
• Amazon CloudSearch ingests XML in a specified form
• Various content processing tasks to perform:
  • Splitting into individual documents
  • Date normalization
  • Metadata extraction & mapping
  • Cleanup, etc.
• We used Aspire for these tasks
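
The processing tasks listed above can be sketched in a few lines of Python. This is not the Aspire pipeline used in the project, only a minimal illustration under stated assumptions: the dump is taken to follow the MediaWiki export layout (page, title, id, revision, timestamp, text), the output is modelled on the CloudSearch XML batch form (add elements with id, version and lang attributes), and the field names "title", "updated" and "body" plus the "wiki_" id prefix are hypothetical. Dates are normalized to epoch seconds on the assumption that the target field is an unsigned integer.

```python
import io
import xml.etree.ElementTree as ET
from datetime import datetime, timezone


def local(tag):
    """Strip any XML namespace, e.g. '{ns}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Split the dump: yield one dict per <page>, streaming the input."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local(elem.tag) != "page":
            continue
        fields = {}
        for node in elem.iter():  # document order, so the page <id> comes first
            name = local(node.tag)
            if name in ("title", "id", "timestamp", "text") and name not in fields:
                fields[name] = (node.text or "").strip()
        yield fields
        elem.clear()  # release the page subtree once it has been consumed


def normalize_date(timestamp):
    """Date normalization: ISO 8601 UTC timestamp -> epoch seconds."""
    parsed = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ")
    return int(parsed.replace(tzinfo=timezone.utc).timestamp())


def to_batch(pages, version=1):
    """Metadata mapping: build one XML batch (field names are assumptions)."""
    batch = ET.Element("batch")
    for page in pages:
        add = ET.SubElement(batch, "add", id="wiki_" + page["id"],
                            version=str(version), lang="en")
        mapped = (("title", page["title"]),
                  ("updated", str(normalize_date(page["timestamp"]))),
                  ("body", page["text"]))
        for name, value in mapped:
            ET.SubElement(add, "field", name=name).text = value
    return ET.tostring(batch, encoding="unicode")


# Made-up sample input, not real Wikipedia data.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.6/">
  <page><title>Example Page</title><id>42</id>
    <revision><id>7</id><timestamp>2012-04-01T10:00:00Z</timestamp>
      <text>Some article text.</text></revision>
  </page>
</mediawiki>"""

if __name__ == "__main__":
    print(to_batch(iter_pages(io.BytesIO(SAMPLE))))
```

Streaming with iterparse keeps memory use flat regardless of dump size, which matters when the input is a series of very large files.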

Page 6: Wikipedia Cloud Search Webinar

Aspire in Brief

• Based on Apache Felix / OSGi
• Thread-safe, multi-threaded, distributable
• Any number of pipelines, conditional branching
• Plug-in components individually testable & upgradable
• In use with FAST ESP, FS4SP, Solr, Amazon CloudSearch, GSA
• Tested with Elasticsearch and SP 2013

Page 7: Wikipedia Cloud Search Webinar

XML Input

Page 8: Wikipedia Cloud Search Webinar

Indexing

• Streaming Wikipedia dump files directly into CloudSearch
• 500 docs/second achieved without much effort
• Using 4 x XL instances of CloudSearch
• 1 x XL EC2 instance for Aspire
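
To put the 500 docs/second figure in context, the sketch below groups serialized documents into upload batches and estimates indexing time at that rate. The slides do not describe how batches were formed; the 5 MB default for max_bytes is an assumption about the service's batch size limit, and the document count in the usage line is purely hypothetical.

```python
def batch_by_size(docs, max_bytes=5 * 1024 * 1024):
    """Group serialized documents so each batch stays within max_bytes.

    A single document larger than max_bytes is still emitted, alone,
    in its own batch; rejecting it is left to the caller.
    """
    batch, size = [], 0
    for doc in docs:
        doc_size = len(doc.encode("utf-8"))
        if batch and size + doc_size > max_bytes:
            yield batch
            batch, size = [], 0
        batch.append(doc)
        size += doc_size
    if batch:
        yield batch


def hours_to_index(doc_count, docs_per_second=500):
    """Estimated wall-clock hours at a sustained indexing rate."""
    return doc_count / docs_per_second / 3600


if __name__ == "__main__":
    # Hypothetical corpus size, chosen only to make the arithmetic easy.
    print(hours_to_index(1_800_000))
```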

Page 9: Wikipedia Cloud Search Webinar

Searching

• Amazon CloudSearch provides a RESTful/XML interface for search purposes
• For the Wikipedia project, we needed a UI
• Chose to use Twigkit
• Wrote a Java API for CloudSearch
• The Java API is freely downloadable (with source) at http://www.searchtechnologies.com/java-api-amazon-cloudsearch.html
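
As an illustration of what a RESTful search request looks like, here is a minimal Python sketch that only builds the request URL. It is not the Java API mentioned above. The endpoint host is hypothetical, and the parameter names (q, return-fields, size, start, results-type) and the 2011-02-01 version string are assumptions about the CloudSearch search API of that period; check them against the service documentation before relying on them.

```python
from urllib.parse import urlencode


def build_search_url(endpoint, query, fields=("title",), size=10, start=0,
                     api_version="2011-02-01"):
    """Build a search URL; parameter names are assumptions (see above)."""
    params = {
        "q": query,
        "return-fields": ",".join(fields),
        "size": size,
        "start": start,
        "results-type": "xml",  # ask for an XML response
    }
    return f"http://{endpoint}/{api_version}/search?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical search endpoint for a hypothetical domain.
    host = "search-example-domain.us-east-1.cloudsearch.amazonaws.com"
    print(build_search_url(host, "alan turing", fields=("title", "updated")))
```

Keeping URL construction separate from the HTTP call makes the request easy to inspect and test without touching the network.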

Page 10: Wikipedia Cloud Search Webinar

Searching

• Supports navigators and relevancy customization
• E.g. a “PageRank” style link analysis was performed
• Limits set high: e.g. retrieve 500,000 results in a single list, delivered in just a few seconds
• Very useful for analysis applications
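
The slides do not say how the “PageRank” style link analysis was computed. As a generic illustration only, the sketch below runs the textbook power-iteration form of PageRank over a small made-up link graph. The function name, damping factor and iteration count are assumptions, not details from the project.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration.

    links maps each page to the list of pages it links to.
    Pages with no outgoing links spread their rank evenly over all pages.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1 - damping) / n + damping * dangling / n
                    for p in pages}
        for source, targets in links.items():
            if targets:
                share = damping * rank[source] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    # Made-up link graph: page -> pages it links to.
    graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
    for page, score in sorted(pagerank(graph).items()):
        print(page, round(score, 4))
```

A score like this could be scaled to an integer and stored as a document field for use in relevancy customization; whether the project did so is not stated in the slides.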

• So, what does it look like?

Page 12: Wikipedia Cloud Search Webinar

wikipedia.searchtechnologies.com

Page 13: Wikipedia Cloud Search Webinar

Summary & Observations

• A capable and scalable “raw” engine
• XML in, RESTful/XML out
• Easy to set up – much the same as an EC2 instance
• Elastic scalability

Page 14: Wikipedia Cloud Search Webinar

Summary & Observations

• Cost effective
• From $75 per month, including management / maintenance
• Extremely convenient
• Switch on / off at leisure
• Promotes experimentation & agility
