Web Search by the people, for the people · 2014. 4. 8. · Retrieval, Indexing, Storage and Search...

26
Web Search by the people, for the people Michael Christen, [email protected], http://yacy.net RMLL 2011 Rencontres Mondiales du Logiciel Libre http://2011.rmll.info Topics What is a decentralized search engine? and why would you use that Architecture details about the YaCy technology Integration of YaCy in your web pages and services 1

Transcript of Web Search by the people, for the people · 2014. 4. 8. · Retrieval, Indexing, Storage and Search...

Page 1: Web Search by the people, for the people · 2014. 4. 8. · Retrieval, Indexing, Storage and Search Components A Search Engine Core ce se Crawler ng Text Analysis words Double Link

Web Search by the people, for the peopleMichael Christen, [email protected], http://yacy.net

RMLL 2011Rencontres Mondiales du Logiciel Librehttp://2011.rmll.info

Topics

What is a decentralized search engine?and why would you use that

Architecturedetails about the YaCy technology

Integrationof YaCy in your web pages and services

1

Page 2: Web Search by the people, for the people · 2014. 4. 8. · Retrieval, Indexing, Storage and Search Components A Search Engine Core ce se Crawler ng Text Analysis words Double Link

We do not want centralization(of search engines)

we want:freedom of information

anonymity when doing web search

2

Page 3: Web Search by the people, for the people · 2014. 4. 8. · Retrieval, Indexing, Storage and Search Components A Search Engine Core ce se Crawler ng Text Analysis words Double Link

Retrieval, Indexing, Storage and Search Components

A Search Engine CoreSe

arch

Inte

rfac

eD

atabaseIndexingCr

awle

r

Text Analysis

words

Double LinkCheck

Stop wordsCheck

ReverseWord Index

@

URL Crawl Stack

links

URL ReferencesWordYaCy has an

integrated NoSQL Database. The

database stores a Reverse Word

Index, Metadata and the source

documents.

Depth = 0 Start-URL

Depth = 1

Depth = 2

ranking,verification,visualization

filtering,parsing

Peer-to-Peer Network API

3

Page 4: Web Search by the people, for the people · 2014. 4. 8. · Retrieval, Indexing, Storage and Search Components A Search Engine Core ce se Crawler ng Text Analysis words Double Link

SearchEngine

SearchEngine

Search Engine

SearchEngine

SearchEngine

SearchEngine

SearchEngine

SearchEngine

SearchEngine

SearchEngine

SearchEngine

SearchEngine

SearchEngine

SearchEngine

SearchEngine

horizontal scaling: more documentsvert

ical

sca

ling:

mor

e qu

erie

s pe

r se

cond Search Engine Cluster

Architecture of Large-Scale Search

4

Page 5: Web Search by the people, for the people · 2014. 4. 8. · Retrieval, Indexing, Storage and Search Components A Search Engine Core ce se Crawler ng Text Analysis words Double Link

SearchEngine

SearchEngine

Search Engine

SearchEngine

SearchEngine

SearchEngine

SearchEngine

SearchEngine

SearchEngine

SearchEngine

SearchEngine

SearchEngine

SearchEngine

SearchEngine

SearchEngine

Large Search Engine in a Data Center

Construction of a Large-Scale Search Engine

5

Page 6: Web Search by the people, for the people · 2014. 4. 8. · Retrieval, Indexing, Storage and Search Components A Search Engine Core ce se Crawler ng Text Analysis words Double Link

SearchEngine

SearchEngine

Search Engine

SearchEngine

SearchEngine

SearchEngine

SearchEngine

SearchEngine

SearchEngine

SearchEngine

SearchEngine

SearchEngine

SearchEngine

SearchEngine

SearchEngine

The Large-Scale Search Engine in your Home!

6

Page 7: Web Search by the people, for the people · 2014. 4. 8. · Retrieval, Indexing, Storage and Search Components A Search Engine Core ce se Crawler ng Text Analysis words Double Link

Peer Peer Peer Peer Peer

Peer Peer Peer Peer Peer

Peer Peer Peer Peer Peer

YaCy connects search peers with a peer-to-peer protocol

7

Page 8: Web Search by the people, for the people · 2014. 4. 8. · Retrieval, Indexing, Storage and Search Components A Search Engine Core ce se Crawler ng Text Analysis words Double Link

DHT-Store DHT-Read

Peer

Peer

PeerPeer

Peer

Peer

Peer

Peer

Peer

Peer

Peer

PeerPeer

Peer

Peer

Peer

The YaCy Search Network: Fully Decentralized!

YaCy peers store index fragments according to a ‘folded‘ ordering on word-hashes and url-hashes in a distributed hash table (DHT). The index is distributed redundantly to save the index when some peers are

not available. The redundancy also helps to increase search performance.

Crawling, Indexing&

DistributionSearching in the DHTDHT

8

Page 9: Web Search by the people, for the people · 2014. 4. 8. · Retrieval, Indexing, Storage and Search Components A Search Engine Core ce se Crawler ng Text Analysis words Double Link

PeerAppliance

Peer

Peer

Peer

Peer

PeerPeer

Peer

Peer

Decentralized Search non-Cloud Search(keep your secrets)

Community & Personal Use of Search Engines

9

Page 10: Web Search by the people, for the people · 2014. 4. 8. · Retrieval, Indexing, Storage and Search Components A Search Engine Core ce se Crawler ng Text Analysis words Double Link

Michael Christenhttp://yacy.net

YaCy, Web Search by the people, for the people@ RMLL - Rencontres Mondiales du Logiciel Libre - http://2011.rmll.info

Usage granted by

BugtrackerCodeDiscussions Wiki

Search Engine

Your Project

Appliance

Productivity #1/5: Project Search Engine

10

Page 11: Web Search by the people, for the people · 2014. 4. 8. · Retrieval, Indexing, Storage and Search Components A Search Engine Core ce se Crawler ng Text Analysis words Double Link

Michael Christenhttp://yacy.net

YaCy, Web Search by the people, for the people@ RMLL - Rencontres Mondiales du Logiciel Libre - http://2011.rmll.info

Usage granted by

Search Engine

Productivity #2/5: Keep Secrets!

Enterprise Environment

The Internet

BugtrackerCodeDiscussions Wiki

Appliance

11

Page 12: Web Search by the people, for the people · 2014. 4. 8. · Retrieval, Indexing, Storage and Search Components A Search Engine Core ce se Crawler ng Text Analysis words Double Link

Michael Christenhttp://yacy.net

YaCy, Web Search by the people, for the people@ RMLL - Rencontres Mondiales du Logiciel Libre - http://2011.rmll.info

Usage granted by

Productivity #3/5: Personal Relevance

that‘s what lucene has

similar to G**gle PR

in YaCy, you can combine many

weighted attributes

12

Page 13: Web Search by the people, for the people · 2014. 4. 8. · Retrieval, Indexing, Storage and Search Components A Search Engine Core ce se Crawler ng Text Analysis words Double Link

Michael Christenhttp://yacy.net

YaCy, Web Search by the people, for the people@ RMLL - Rencontres Mondiales du Logiciel Libre - http://2011.rmll.info

Usage granted by

Productivity #4/5: Download Helper

13

Page 14: Web Search by the people, for the people · 2014. 4. 8. · Retrieval, Indexing, Storage and Search Components A Search Engine Core ce se Crawler ng Text Analysis words Double Link

Michael Christenhttp://yacy.net

YaCy, Web Search by the people, for the people@ RMLL - Rencontres Mondiales du Logiciel Libre - http://2011.rmll.info

Usage granted by

Productivity #5/5: Become Independent

Data Search User

Free Software

Data under Creative Commons License

Open Access Repositories

as it is today: PROPRIETARY & CENTRALISED:

it traces you & data can be censored, blocked,

removed, spammed

User needs proprietary & centralised software to discover free content

is this what we want?

14

Page 15: Web Search by the people, for the people · 2014. 4. 8. · Retrieval, Indexing, Storage and Search Components A Search Engine Core ce se Crawler ng Text Analysis words Double Link

Michael Christenhttp://yacy.net

YaCy, Web Search by the people, for the people@ RMLL - Rencontres Mondiales du Logiciel Libre - http://2011.rmll.info

Usage granted by

Impact of running your own search engine:

1. connect tools and people in projectsfree software projects need free search

2. keep secretssearch tracks can reveal industrial research targets

3. your personal relevancecreate a ranking method for your personal needs

4. do more with searchfor example file sharing and downloading

5. support freedomfree information cannot be free without free search

Productivity: Summary

15

Page 16: Web Search by the people, for the people · 2014. 4. 8. · Retrieval, Indexing, Storage and Search Components A Search Engine Core ce se Crawler ng Text Analysis words Double Link

Michael Christenhttp://yacy.net

YaCy, Web Search by the people, for the people@ RMLL - Rencontres Mondiales du Logiciel Libre - http://2011.rmll.info

Usage granted by

http://sciencenet.fzk.de

300 million documents

,Sciencenet‘: Search Engine for scientific content in the Karlsruhe Institute of Technology:

34 computers running YaCy in it‘s own network

Examples #1/2: Search Cluster in a Data Center

16

Page 17: Web Search by the people, for the people · 2014. 4. 8. · Retrieval, Indexing, Storage and Search Components A Search Engine Core ce se Crawler ng Text Analysis words Double Link

Michael Christenhttp://yacy.net

YaCy, Web Search by the people, for the people@ RMLL - Rencontres Mondiales du Logiciel Libre - http://2011.rmll.info

Usage granted by

Search Engine @Home

> 1 Billion Documents

Examples #2/2: Decentralised Search for Everyone

People run they own YaCy search peer at home and create independent search for everyone

17

Page 18: Web Search by the people, for the people · 2014. 4. 8. · Retrieval, Indexing, Storage and Search Components A Search Engine Core ce se Crawler ng Text Analysis words Double Link

Michael Christenhttp://yacy.net

YaCy, Web Search by the people, for the people@ RMLL - Rencontres Mondiales du Logiciel Libre - http://2011.rmll.info

Usage granted by

The ,freeworld‘ YaCy Search Engine Network

DHT-StoreDHT-Read

Juniorbehind firewall or router

Seniorhas open server port

Principalpublishes seed-lists

Peer Types:

Architecture #1/4: The Search Engine Network

18

Page 19: Web Search by the people, for the people · 2014. 4. 8. · Retrieval, Indexing, Storage and Search Components A Search Engine Core ce se Crawler ng Text Analysis words Double Link

Michael Christenhttp://yacy.net

YaCy, Web Search by the people, for the people@ RMLL - Rencontres Mondiales du Logiciel Libre - http://2011.rmll.info

Usage granted by

SRU

FacetsFile Types, Protocols,

Domains, Authors

every link is verifiedbefore it is displayed: the content is loaded,

parsed and used for a search snippet generation

Opensearch (search results with RSS), JSON, AJAX toolsAPIssearch widget, ready-to-use code snippets to embed search everywhereTools

Standards

Architecture #2/4: Snippets & Link Verification

19

Page 20: Web Search by the people, for the people · 2014. 4. 8. · Retrieval, Indexing, Storage and Search Components A Search Engine Core ce se Crawler ng Text Analysis words Double Link

Michael Christenhttp://yacy.net

YaCy, Web Search by the people, for the people@ RMLL - Rencontres Mondiales du Logiciel Libre - http://2011.rmll.info

Usage granted by

Crawlerwith target host balancing

target hosts(domain name)

round-robin access

robots.txt, latency and minimum access time 0.5s

loader

Architecture #3/4: Data Aquisition

OAI-PMH Loaderload opac records from libraries

Import FilesDublin Core Files

Wikimedia Dump

Scan Sourcesin a specific network

Scan IP Range

Discover Services

Availablility Mngt.

SMBFTP

Indexer

ParserHTML, XHTML, RSS, RDF, XHTML+RDFa, FOAF, vCard, Flash, PDF, PS, Word, Excel, Visio, Powerpoint,

OpenOffice, RTF, csv, gzip, zip, tar, rar, bzip2, 7zip, images(EXIF), Dublin Core XML, torrent files

many file formats Dublin Core

RSS Feeds

20

Page 21: Web Search by the people, for the people · 2014. 4. 8. · Retrieval, Indexing, Storage and Search Components A Search Engine Core ce se Crawler ng Text Analysis words Double Link

Michael Christenhttp://yacy.net

YaCy, Web Search by the people, for the people@ RMLL - Rencontres Mondiales du Logiciel Libre - http://2011.rmll.info

Usage granted by

Data Visualization

Architecture #4/4: Production / MonitoringNetwork Animation

Connections, Queues, Database

Scheduler

21

Page 22: Web Search by the people, for the people · 2014. 4. 8. · Retrieval, Indexing, Storage and Search Components A Search Engine Core ce se Crawler ng Text Analysis words Double Link

Michael Christenhttp://yacy.net

YaCy, Web Search by the people, for the people@ RMLL - Rencontres Mondiales du Logiciel Libre - http://2011.rmll.info

Usage granted by

<iframe name="target2" src="http://141.52.175.43:8080/yacysearch.html?display=2&resource=local" width="100%" height="180" frameborder="0" scrolling="auto" id="target2"</iframe>

<form method="get" accept-charset="UTF-8" action="http://141.52.175.43:8080/yacysearch.html"> <div> <div>MySearch</div> <input type="text" name="query" value="" maxlength="80" /> <input type="hidden" name="verify" value="true" /> <input type="hidden" name="maximumRecords" value="10" /> <input type="hidden" name="meanCount" value="5" /> <input type="hidden" name="resource" value="local" /> <input type="hidden" name="urlmaskfilter" value=".*" /> <input type="hidden" name="prefermaskfilter" value="" /> <input type="hidden" name="display" value="2" /> <input type="hidden" name="nav" value="all" /> <input type="submit" name="Enter" value="Search" /> </div></form>

How to integrate a YaCy Search Portal:Just copy-paste the code snippet to your web page source code.

Code Snippet Example #1: a search window in an iframe

Code Snippet Example #2: a search box (points to new page)Code Snippet #2 looks like:

The YaCy administration interface offers more code snippets. An example from/ConfigSearchBox.htmllooks like:

your YaCy peer provides help pages with code snippets for an easy integration!

Integration #1/3: Search Interface Integration

22

Page 23: Web Search by the people, for the people · 2014. 4. 8. · Retrieval, Indexing, Storage and Search Components A Search Engine Core ce se Crawler ng Text Analysis words Double Link

Michael Christenhttp://yacy.net

YaCy, Web Search by the people, for the people@ RMLL - Rencontres Mondiales du Logiciel Libre - http://2011.rmll.info

Usage granted by

> curl http://localhost:8080/yacysearch.rss?query=foaf&maximumRecords=10<?xml version="1.0" encoding="UTF-8"?><?xml-stylesheet type='text/xsl' href='/yacysearch.xsl' version='1.0'?><rss version="2.0" xmlns:yacy="http://www.yacy.net/" xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/"<!-- very short example --><item> <title>Friend of a Friend (FOAF) project</title> <link>http://www.foaf-project.org/</link> <pubDate>Fri, 23 May 2008 02:00:00 +0200</pubDate></item><item> <title>FOAF - Wikipedia</title> <link>http://de.wikipedia.org/wiki/FOAF</link> <pubDate>Tue, 08 Jan 2008 01:00:00 +0100</pubDate></item><item> <link>http://microformats.org/wiki/xfn-to-foaf</link> <pubDate>Fri, 09 May 2008 02:00:00 +0200</pubDate></item></rss>

Standards:The YaCy-internal Dublin Core Metadata Format fits into the RSS format for search result data in Opensearch standard very well.

If wanted, also JSON can be used as export format.

How to get Opensearch/JSON Search Results:• do a normal web search in YaCy• replace the ‘html‘ extension of

the result page URL with ‘rss‘• for json, replace the ‘html‘

extension with ‘json‘

SRU Standard for Queries: http://www.loc.gov/standards/sru/specs/search-retrieve.htmlOpensearch Standard: http://www.opensearch.org

Integration #2/3: External Index Retrieval

23

Page 24: Web Search by the people, for the people · 2014. 4. 8. · Retrieval, Indexing, Storage and Search Components A Search Engine Core ce se Crawler ng Text Analysis words Double Link

Michael Christenhttp://yacy.net

YaCy, Web Search by the people, for the people@ RMLL - Rencontres Mondiales du Logiciel Libre - http://2011.rmll.info

Usage granted by

<?xml version="1.0" encoding="utf-8"?><!-- YaCy surrogate using dublin core notion --><surrogates xmlns:dc="http://purl.org/dc/elements/1.1/">

<record> <dc:title><![CDATA[Alan Smithee]]></dc:title> <dc:identifier>http://de.wikipedia.org/wiki/Alan_Smithee</dc:identifier> <dc:description> <![CDATA['''Alan Smithee''' ist ein Anagramm von „The Alias Men“.]]> </dc:description> <dc:language>de</dc:language> <dc:date>2009-04-14T00:00:00Z</dc:date> <!-- date is in ISO 8601 --> </record> </surrogates>

Standards:YaCy can import standard Dublin Core Metadata XML files as input for indexing

How to import Dublin Core Files:just place the xml files into a hand-over directory at DATA/SURROGATES/in/

The Dublin Core XML File Standard:http://dublincore.org/documents/dc-xml-guidelines/

Integration #3/3: External Index Feeding

24

Page 25: Web Search by the people, for the people · 2014. 4. 8. · Retrieval, Indexing, Storage and Search Components A Search Engine Core ce se Crawler ng Text Analysis words Double Link

Michael Christenhttp://yacy.net

YaCy, Web Search by the people, for the people@ RMLL - Rencontres Mondiales du Logiciel Libre - http://2011.rmll.info

Usage granted by

Where is a (demo) Search Portal?There is no one-for-all demo portal for YaCy!

YaCy is about decentralized search and offering a central point for everyone would ruin the idea!

Decentralized Searchin your browser:

http://peer-search.net

- JavaScript Code is loaded into your browser- your browser loads a list of YaCy peers- when you search, your browser contacts some

of the YaCy peers and combines the search results from these peers; like a meta-search.

Peer Roulette,search on a random peer:

http://www.yacyweb.de/peers.htm

- yacyweb generates a list of YaCy peers

- when you click on a link you get the web interface of the peer directly

- when you search on that peer the content may be restricted to the rules of the peer owner

The best demo: run your own peer!

25

Page 26: Web Search by the people, for the people · 2014. 4. 8. · Retrieval, Indexing, Storage and Search Components A Search Engine Core ce se Crawler ng Text Analysis words Double Link

Thank You!

Download- download YaCy from http://yacy.net

Please Help!- the french interface translation and wiki pages- run a peer- become a developer

French Support Forum- we don‘t have that (yet). Please start a french forum!

26