
Page 1: Marc Nause, marc.nause@audioattack.de Michael Christen, …yacy.net/material/YaCy_Hackerspace_Ghent_Whitespace... · 2014. 4. 8. · Retrieval, Indexing, Storage and Search Components

YaCy Workshop
Marc Nause, marc.nause@audioattack.de
Michael Christen, [email protected], http://yacy.net

Hackerspace Ghent, 27.08.2011 @ http://0x20.be/YaCy_Workshop

Topics

What is a decentralized search engine?
Architecture of search engines, scaling, decentralization

Live Demo
set-up, run a web crawl, decentralized stuff, using the API

Hands-On
make your own search index; crawl the internet, the intranet and file shares

+Demo!

Page 2:

We do not want centralization (of search engines)

we want:
freedom of information
anonymity when doing web search

Page 3:

A Search Engine Core
Retrieval, Indexing, Storage and Search Components

[Diagram labels: Crawler, URL Crawl Stack (Start-URL at Depth = 0, then Depth = 1, Depth = 2), links, Double Link Check; Indexing (filtering, parsing), Text Analysis, words, Stop words Check; Database, Reverse Word Index (Word, URL References); Search Interface (ranking, verification, visualization); Peer-to-Peer Network; API]

YaCy has an integrated NoSQL database. The database stores a Reverse Word Index, metadata and the source documents.
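
To make the Reverse Word Index on this slide concrete, here is a minimal sketch in Python. It is not YaCy's implementation: the stop-word list, the tokenizer and the function names are assumptions chosen for illustration, and the index lives in memory rather than in a database.

```python
from collections import defaultdict
import re

# Hypothetical stop-word list; a real list would be much longer.
STOP_WORDS = {"a", "an", "and", "is", "of", "the", "to"}


def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_reverse_word_index(documents):
    """Map each non-stop word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in documents.items():
        for word in tokenize(text):
            if word not in STOP_WORDS:  # stop-words check
                index[word].add(url)
    return index


def search(index, query):
    """Return the URLs that contain every non-stop word of the query."""
    words = [w for w in tokenize(query) if w not in STOP_WORDS]
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


documents = {
    "http://example.org/a": "The crawler follows links to a depth of two",
    "http://example.org/b": "The indexer parses documents and stores words",
}
index = build_reverse_word_index(documents)
print(sorted(search(index, "the crawler")))
```

The index maps a word to URL references, so a query only touches the entries for its own words instead of scanning every document.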

Page 4:

Architecture of Large-Scale Search

[Diagram: a grid of 15 Search Engine nodes, labeled "Search Engine Cluster"]
horizontal scaling: more documents
vertical scaling: more queries per second
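
One common way to read the two scaling axes on this slide is as a grid in which columns hold different parts of the document set and rows hold copies of it. The sketch below follows that reading; it is an assumption, not something the slide spells out, and the class and method names are hypothetical.

```python
import zlib


class SearchCluster:
    """A grid of nodes: columns are document shards, rows are replicas."""

    def __init__(self, shards, replicas):
        # nodes[replica][shard] is a dict mapping word -> set of document ids
        self.nodes = [[{} for _ in range(shards)] for _ in range(replicas)]
        self.shards = shards
        self.replicas = replicas
        self._next_replica = 0

    def add_document(self, doc_id, words):
        """Store the document in one shard column, copied to every replica row."""
        shard = zlib.crc32(doc_id.encode("utf-8")) % self.shards
        for row in self.nodes:
            for word in words:
                row[shard].setdefault(word, set()).add(doc_id)

    def search(self, word):
        """Ask every shard of one replica row, rotating rows between queries."""
        row = self.nodes[self._next_replica]
        self._next_replica = (self._next_replica + 1) % self.replicas
        hits = set()
        for shard_index in row:
            hits |= shard_index.get(word, set())
        return hits
```

Adding columns lets the cluster hold more documents; adding rows lets more queries run at the same time, because each query only occupies one row.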

Page 5:

Construction of a Large-Scale Search Engine

[Diagram: a grid of 15 Search Engine nodes, labeled "Large Search Engine in a Data Center"]

Page 6:

The Large-Scale Search Engine in your Home!

[Diagram: a grid of 15 Search Engine nodes]

Page 7:

[Diagram: a grid of 15 Peers]

YaCy connects search peers with a peer-to-peer protocol

Page 8:

The YaCy Search Network: Fully Decentralized!

[Diagram labels: 16 Peers; DHT-Store; DHT-Read; Crawling, Indexing & Distribution; Searching in the DHT; DHT]

YaCy peers store index fragments according to a 'folded' ordering on word-hashes and url-hashes in a distributed hash table (DHT). The index is distributed redundantly so that it is preserved when some peers are not available. The redundancy also helps to increase search performance.
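
The following sketch illustrates redundant storage in a distributed hash table. It uses a plain hash ring with SHA-1 positions, which is a swapped-in technique and not YaCy's 'folded' ordering; the peer names, the `redundancy` parameter and the method names are assumptions.

```python
import hashlib
from bisect import bisect_left


def ring_position(key):
    """Map a string to an integer position on the hash ring."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class HashRing:
    def __init__(self, peer_names, redundancy=3):
        self.redundancy = redundancy
        self.ring = sorted((ring_position(name), name) for name in peer_names)
        self.positions = [position for position, _ in self.ring]

    def store_targets(self, word):
        """DHT-Store: the peers that follow the word's position on the ring."""
        start = bisect_left(self.positions, ring_position(word))
        count = min(self.redundancy, len(self.ring))
        return [self.ring[(start + step) % len(self.ring)][1] for step in range(count)]

    def read_targets(self, word, online):
        """DHT-Read: the storing peers that are currently online."""
        return [peer for peer in self.store_targets(word) if peer in online]
```

Because every word is stored on several peers, a search can still find its index fragment when one of them is offline, and it can spread requests over the copies that are online.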

Page 9:

Community & Personal Use of Search Engines

[Diagram labels: 9 Peers; Appliance; Decentralized Search; non-Cloud Search (keep your secrets)]

Page 10:

Marc Nause, Michael Christen, http://yacy.net

YaCy Workshop @ Hackerspace Ghent #whitespace - http://0x20.be/YaCy_Workshop

Usage granted by

Productivity #1/5: Project Search Engine

[Diagram labels: Your Project (Bugtracker, Code, Discussions, Wiki), each with Integrated Search; Search Engine; Appliance]

Page 11:

Productivity #2/5: Keep Secrets!

[Diagram labels: Enterprise Environment; The Internet; Bugtracker, Code, Discussions, Wiki; Search Engine; Appliance]

Page 12:

Productivity #3/5: Personal Relevance

that's what Lucene has

similar to G**gle PR

in YaCy, you can combine many weighted attributes
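
A minimal sketch of what combining weighted attributes can look like. The attribute names and weights below are hypothetical and do not reflect YaCy's actual ranking attributes; the point is only that a personal ranking is a weighted sum whose weights the user controls.

```python
# Hypothetical attribute weights; the names are illustrative only.
WEIGHTS = {
    "term_in_title": 4.0,
    "term_in_url": 2.0,
    "term_frequency": 1.0,
    "recency": 0.5,
}


def score(attributes, weights=WEIGHTS):
    """Weighted sum of attribute values, each expected in the range 0..1."""
    return sum(weights[name] * attributes.get(name, 0.0) for name in weights)


def rank(results, weights=WEIGHTS):
    """Sort (url, attributes) pairs by descending score."""
    return sorted(results, key=lambda item: score(item[1], weights), reverse=True)
```

Passing a different `weights` dictionary to `rank` changes the order of the same result set, which is the sense in which the relevance becomes personal.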

Page 13:

Productivity #4/5: Download Helper

Page 14:

Productivity #5/5: Become Independent

Data Search User

Free Software

Data under Creative Commons License

Open Access Repositories

as it is today: PROPRIETARY & CENTRALISED: it traces you & data can be censored, blocked, removed, spammed

User needs proprietary & centralised software to discover free content

is this what we want?

Page 15:

Productivity: Summary

Impact of running your own search engine:

1. connect tools and people in projects: free software projects need free search

2. keep secrets: search tracks can reveal industrial research targets

3. your personal relevance: create a ranking method for your personal needs

4. do more with search: for example file sharing and downloading

5. support freedom: free information cannot be free without free search

Page 16:

Examples #1/2: Search Cluster in a Data Center

'Sciencenet': search engine for scientific content at the Karlsruhe Institute of Technology: http://sciencenet.fzk.de

300 million documents

34 computers running YaCy in its own network

Page 17:

Search Engine @Home

> 1 Billion Documents

Examples #2/2: Decentralised Search for Everyone

People run their own YaCy search peers at home and create independent search for everyone

Page 18:

Architecture #1/4: The Search Engine Network

The 'freeworld' YaCy Search Engine Network

[Diagram labels: DHT-Store, DHT-Read]

Peer Types:
Junior: behind firewall or router
Senior: has open server port
Principal: publishes seed-lists

Page 19:

Architecture #2/4: Snippets & Link Verification

Every link is verified before it is displayed: the content is loaded, parsed and used for search snippet generation.

Facets: File Types, Protocols, Domains, Authors
Standards: SRU
APIs: Opensearch (search results with RSS), JSON, AJAX tools
Tools: search widget, ready-to-use code snippets to embed search everywhere
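
The verification step on this slide (load the content, check it, build a snippet) can be sketched as follows. This is an illustration under assumptions, not YaCy's code: `load` is a hypothetical callable that returns the page text or `None` on failure, and the snippet is simply a window of text around the first matching query word.

```python
def make_snippet(text, query, width=60):
    """Return text around the first query word found, or None if none occurs."""
    lowered = text.lower()
    for word in query.lower().split():
        position = lowered.find(word)
        if position >= 0:
            start = max(0, position - width // 2)
            end = min(len(text), position + len(word) + width // 2)
            return text[start:end].strip()
    return None


def verify_results(candidates, query, load):
    """Keep only the URLs whose freshly loaded content still matches the query."""
    verified = []
    for url in candidates:
        text = load(url)  # hypothetical loader; returns None when loading fails
        if text is None:
            continue
        snippet = make_snippet(text, query)
        if snippet is not None:
            verified.append((url, snippet))
    return verified
```

Results whose page can no longer be loaded, or whose content no longer contains the query, drop out before display, so every displayed link comes with a snippet taken from its current content.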

Page 20:

Architecture #3/4: Data Acquisition

Crawler with target host balancing
round-robin access to target hosts (domain name)
robots.txt, latency and minimum access time 0.5s
loader

OAI-PMH Loader: load OPAC records from libraries
Import Files: Dublin Core Files, Wikimedia Dump
RSS Feeds
Scan Sources in a specific network: Scan IP Range, Discover Services, Availability Mngt., SMB, FTP

Indexer
Parser (many file formats): HTML, XHTML, RSS, RDF, XHTML+RDFa, FOAF, vCard, Flash, PDF, PS, Word, Excel, Visio, Powerpoint, OpenOffice, RTF, csv, gzip, zip, tar, rar, bzip2, 7zip, images (EXIF), Dublin Core XML, torrent files
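
A small sketch of target host balancing with round-robin access and a minimum access time per host. The 0.5 s value comes from the slide; the class and method names are hypothetical, and robots.txt handling and latency measurement are left out.

```python
from collections import OrderedDict, deque
from urllib.parse import urlparse


class HostBalancer:
    """Queue URLs per host and hand them out round-robin with a per-host delay."""

    def __init__(self, min_delay=0.5):
        self.min_delay = min_delay
        self.queues = OrderedDict()  # host -> deque of URLs waiting for that host
        self.last_access = {}        # host -> time of the last fetch

    def push(self, url):
        host = urlparse(url).hostname
        self.queues.setdefault(host, deque()).append(url)

    def pop(self, now):
        """Return the next URL whose host may be accessed at time `now`, or None."""
        for host in list(self.queues):
            if now - self.last_access.get(host, float("-inf")) < self.min_delay:
                continue  # this host was accessed too recently
            url = self.queues[host].popleft()
            self.last_access[host] = now
            if self.queues[host]:
                self.queues.move_to_end(host)  # round-robin: go to the back
            else:
                del self.queues[host]
            return url
        return None
```

The caller passes the current time (for example from `time.monotonic()`), which keeps the class easy to test and leaves the waiting strategy to the crawler loop.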

Page 21:

Architecture #4/4: Production / Monitoring

Network Animation
Data Visualization
Connections, Queues, Database
Scheduler

Page 22:

Integration #1/4: Search Interface Integration

How to integrate a YaCy Search Portal: just copy-paste the code snippet to your web page source code.

Code Snippet Example #1: a search window in an iframe

    <iframe name="target2" id="target2"
            src="http://141.52.175.43:8080/yacysearch.html?display=2&resource=local"
            width="100%" height="180" frameborder="0" scrolling="auto"></iframe>

Code Snippet Example #2: a search box (points to new page)

The YaCy administration interface offers more code snippets. An example from /ConfigSearchBox.html looks like:

    <form method="get" accept-charset="UTF-8" action="http://141.52.175.43:8080/yacysearch.html">
      <div>
        <div>MySearch</div>
        <input type="text" name="query" value="" maxlength="80" />
        <input type="hidden" name="verify" value="true" />
        <input type="hidden" name="maximumRecords" value="10" />
        <input type="hidden" name="meanCount" value="5" />
        <input type="hidden" name="resource" value="local" />
        <input type="hidden" name="urlmaskfilter" value=".*" />
        <input type="hidden" name="prefermaskfilter" value="" />
        <input type="hidden" name="display" value="2" />
        <input type="hidden" name="nav" value="all" />
        <input type="submit" name="Enter" value="Search" />
      </div>
    </form>

Your YaCy peer provides help pages with code snippets for an easy integration!

Page 23:

Integration #2/4: External Index Retrieval

Standards: The YaCy-internal Dublin Core Metadata Format fits very well into the RSS format for search result data in the Opensearch standard. If wanted, JSON can also be used as export format.

How to get Opensearch/JSON Search Results:
• do a normal web search in YaCy
• replace the 'html' extension of the result page URL with 'rss'
• for JSON, replace the 'html' extension with 'json'

SRU Standard for Queries: http://www.loc.gov/standards/sru/specs/search-retrieve.html
Opensearch Standard: http://www.opensearch.org

    > curl "http://localhost:8080/yacysearch.rss?query=foaf&maximumRecords=10"
    <?xml version="1.0" encoding="UTF-8"?>
    <?xml-stylesheet type='text/xsl' href='/yacysearch.xsl' version='1.0'?>
    <rss version="2.0" xmlns:yacy="http://www.yacy.net/" xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">
    <!-- very short example -->
    <item>
      <title>Friend of a Friend (FOAF) project</title>
      <link>http://www.foaf-project.org/</link>
      <pubDate>Fri, 23 May 2008 02:00:00 +0200</pubDate>
    </item>
    <item>
      <title>FOAF - Wikipedia</title>
      <link>http://de.wikipedia.org/wiki/FOAF</link>
      <pubDate>Tue, 08 Jan 2008 01:00:00 +0100</pubDate>
    </item>
    <item>
      <link>http://microformats.org/wiki/xfn-to-foaf</link>
      <pubDate>Fri, 09 May 2008 02:00:00 +0200</pubDate>
    </item>
    </rss>
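
As a sketch of how a client could consume these results, the Python below builds the search URL from the slide and extracts title and link from each `<item>`. The function names are hypothetical, the sample document is a shortened stand-in rather than real peer output, and fetching over HTTP is left out so the example runs without a peer.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode


def search_url(base, query, maximum_records=10, fmt="rss"):
    """Build a search URL; `fmt` replaces the 'html' extension of the result page."""
    params = urlencode({"query": query, "maximumRecords": maximum_records})
    return f"{base}/yacysearch.{fmt}?{params}"


def parse_items(rss_text):
    """Return (title, link) pairs for every <item>; a missing title becomes None."""
    root = ET.fromstring(rss_text)
    return [(item.findtext("title"), item.findtext("link")) for item in root.iter("item")]


# Shortened stand-in for a search response, used only for illustration.
SAMPLE = (
    '<rss version="2.0">'
    "<item><title>FOAF - Wikipedia</title>"
    "<link>http://de.wikipedia.org/wiki/FOAF</link></item>"
    "<item><link>http://microformats.org/wiki/xfn-to-foaf</link></item>"
    "</rss>"
)

print(search_url("http://localhost:8080", "foaf"))
for title, link in parse_items(SAMPLE):
    print(title, link)
```

Using `root.iter("item")` finds the items whether or not they are wrapped in a `<channel>` element.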

Page 24:

Integration #3/4: Federated Index + feed to Solr

Solr Sharding

Page 25:

Integration #4/4: External Index Feeding

Standards: YaCy can import standard Dublin Core Metadata XML files as input for indexing.

How to import Dublin Core Files: just place the XML files into a hand-over directory at DATA/SURROGATES/in/

The Dublin Core XML File Standard: http://dublincore.org/documents/dc-xml-guidelines/

    <?xml version="1.0" encoding="utf-8"?>
    <!-- YaCy surrogate using dublin core notion -->
    <surrogates xmlns:dc="http://purl.org/dc/elements/1.1/">
      <record>
        <dc:title><![CDATA[Alan Smithee]]></dc:title>
        <dc:identifier>http://de.wikipedia.org/wiki/Alan_Smithee</dc:identifier>
        <dc:description>
          <![CDATA['''Alan Smithee''' ist ein Anagramm von „The Alias Men“.]]>
        </dc:description>
        <dc:language>de</dc:language>
        <dc:date>2009-04-14T00:00:00Z</dc:date> <!-- date is in ISO 8601 -->
      </record>
    </surrogates>
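
A sketch of generating such a surrogate file from Python with the standard library. The function names and the file name are hypothetical; the element names and the DATA/SURROGATES/in/ directory are taken from the slide. ElementTree escapes special characters instead of writing CDATA sections, which is equivalent for an XML parser.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

DC = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC)

FIELDS = ("title", "identifier", "description", "language", "date")


def surrogate_xml(records):
    """Serialize dicts with Dublin Core fields into a <surrogates> document."""
    root = ET.Element("surrogates")
    for record in records:
        node = ET.SubElement(root, "record")
        for field in FIELDS:
            if field in record:
                ET.SubElement(node, f"{{{DC}}}{field}").text = record[field]
    return ET.tostring(root, encoding="unicode")


def write_surrogate(records, directory, name="example.xml"):
    """Write the document into the hand-over directory and return its path."""
    path = Path(directory) / name
    path.parent.mkdir(parents=True, exist_ok=True)
    header = '<?xml version="1.0" encoding="utf-8"?>\n'
    path.write_text(header + surrogate_xml(records), encoding="utf-8")
    return path
```

A call such as `write_surrogate(records, "DATA/SURROGATES/in")`, run from the YaCy installation directory, would place the file where the slide says it should be handed over.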

Page 26:

Join In!

Download
- download YaCy from http://yacy.net

Help!
- interface translation and wiki pages
- run a peer
- become a developer
- talk to us at http://forum.yacy.de