YaCy Workshop
Marc Nause, Michael Christen
http://yacy.net
Hackerspace Ghent, 27.08.2011 @ http://0x20.be/YaCy_Workshop
Topics
What is a decentralized search engine? Architecture of search engines, scaling, decentralization
Live Demo: set-up, run a web crawl, decentralized stuff, using the API
Hands-On: make your own search index; crawl the internet, the intranet and file shares
+ Demo!
We do not want centralization (of search engines)
We want:
freedom of information
anonymity when doing web search
A Search Engine Core: Retrieval, Indexing, Storage and Search Components
[Diagram labels: Search Interface; Database; Indexing; Crawler; Text Analysis; words; links; Double Link Check; Stop words Check; Reverse Word Index (Word, URL References); URL Crawl Stack (Depth = 0 Start-URL, Depth = 1, Depth = 2); ranking, verification, visualization; filtering, parsing; Peer-to-Peer Network API]
YaCy has an integrated NoSQL Database. The database stores a Reverse Word Index, Metadata and the source documents.
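The reverse word index named on this slide maps each word to the URLs that contain it. The following is a minimal sketch of that idea in Python, not YaCy's actual storage format; the function names and the tiny stop-word list are assumptions made for illustration.

```python
import re
from collections import defaultdict

# Illustrative stop-word list (an assumption, not YaCy's real list).
STOP_WORDS = {"a", "an", "and", "the", "of", "is", "in", "to"}


def tokenize(text):
    """Lower-case the text, split it into words and drop stop words."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower())
            if w not in STOP_WORDS]


def build_reverse_index(documents):
    """Map every word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in documents.items():
        for word in tokenize(text):
            index[word].add(url)
    return index


def search(index, query):
    """Return the URLs that contain all (non-stop) query words."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result
```

A query is answered by intersecting the URL sets of its words, which is why the stop-word check happens before indexing: very common words would otherwise produce huge URL sets with little value.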
Architecture of Large-Scale Search
[Diagram: a Search Engine Cluster drawn as a grid of search engines. Horizontal scaling: more documents. Vertical scaling: more queries per second.]
Construction of a Large-Scale Search Engine
[Diagram: grid of search engines. Label: Large Search Engine in a Data Center]
The Large-Scale Search Engine in your Home!
[Diagram: the grid of search engines becomes a grid of peers]
YaCy connects search peers with a peer-to-peer protocol.
The YaCy Search Network: Fully Decentralized!
[Diagram: a network of peers with DHT-Store and DHT-Read connections]
YaCy peers store index fragments according to a 'folded' ordering on word-hashes and url-hashes in a distributed hash table (DHT). The index is distributed redundantly so that it is preserved when some peers are not available. The redundancy also helps to increase search performance.
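The idea of storing index fragments redundantly by hash can be sketched as follows. This is a generic hash-ring placement, not YaCy's 'folded' ordering or its real hash function; the use of SHA-256, the peer names and the default `redundancy` of 3 are assumptions chosen for illustration.

```python
import hashlib
from bisect import bisect_left


def ring_position(key: str) -> int:
    """Map a word or a peer name to a position on a 64-bit ring."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def target_peers(word, peers, redundancy=3):
    """Pick the peers that store the index fragment for `word`.

    The fragment goes to the first peer at or after the word's ring
    position and to the following peers, so that `redundancy` copies exist.
    """
    ring = sorted((ring_position(p), p) for p in peers)
    positions = [pos for pos, _ in ring]
    start = bisect_left(positions, ring_position(word))
    count = min(redundancy, len(ring))
    return [ring[(start + i) % len(ring)][1] for i in range(count)]
```

Because every peer computes the same positions, a searching peer can work out which peers to ask for a word without a central directory, and any one of the redundant copies can answer.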
Community & Personal Use of Search Engines
[Diagram labels: Crawling, Indexing & Distribution; Searching in the DHT; DHT; Peers; Appliance]
Decentralized Search
non-Cloud Search (keep your secrets)
Marc Nause, Michael Christen, http://yacy.net
YaCy Workshop @ Hackerspace Ghent #whitespace - http://0x20.be/YaCy_Workshop
Usage granted by
Productivity #1/5: Project Search Engine
[Diagram labels: Your Project; Bugtracker, Code, Discussions, Wiki, each with Integrated Search; Search Engine; Appliance]
Productivity #2/5: Keep Secrets!
[Diagram labels: Enterprise Environment; The Internet; Bugtracker, Code, Discussions, Wiki; Search Engine; Appliance]
Productivity #3/5: Personal Relevance
[Screenshot annotations:]
that's what Lucene has
similar to G**gle PR
in YaCy, you can combine many weighted attributes
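Combining many weighted attributes into one ranking can be pictured as a weighted sum. The sketch below is only an illustration of that principle: the attribute names (`term_frequency`, `link_count`, `recency`) and the data layout are hypothetical and do not reflect YaCy's ranking configuration.

```python
def score(doc, weights):
    """Weighted sum of a document's attribute values.

    Attributes without a weight contribute nothing to the score.
    """
    return sum(weights.get(name, 0.0) * value
               for name, value in doc["attributes"].items())


def rank(docs, weights):
    """Order documents by descending score under the given weights."""
    return sorted(docs, key=lambda d: score(d, weights), reverse=True)
```

Changing the weights changes the order of the same result set, which is what makes a personal ranking profile possible.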
Productivity #4/5: Download Helper
Productivity #5/5: Become Independent
[Diagram: Data, Search, User]
Free Software
Data under Creative Commons License
Open Access Repositories
As it is today: PROPRIETARY & CENTRALISED: it traces you & data can be censored, blocked, removed, spammed
User needs proprietary & centralised software to discover free content
Is this what we want?
Productivity: Summary
Impact of running your own search engine:
1. connect tools and people in projects: free software projects need free search
2. keep secrets: search tracks can reveal industrial research targets
3. your personal relevance: create a ranking method for your personal needs
4. do more with search: for example file sharing and downloading
5. support freedom: free information cannot be free without free search
Examples #1/2: Search Cluster in a Data Center
'Sciencenet': Search Engine for scientific content at the Karlsruhe Institute of Technology
http://sciencenet.fzk.de
300 million documents
34 computers running YaCy in its own network
Examples #2/2: Decentralised Search for Everyone
Search Engine @Home
> 1 Billion Documents
People run their own YaCy search peer at home and create independent search for everyone
Architecture #1/4: The Search Engine Network
The 'freeworld' YaCy Search Engine Network
[Diagram: peers with DHT-Store and DHT-Read connections]
Peer Types:
Junior: behind firewall or router
Senior: has open server port
Principal: publishes seed-lists
Architecture #2/4: Snippets & Link Verification
Every link is verified before it is displayed: the content is loaded, parsed and used for search snippet generation.
Facets: File Types, Protocols, Domains, Authors
Standards: SRU
APIs: Opensearch (search results with RSS), JSON, AJAX tools
Tools: search widget, ready-to-use code snippets to embed search everywhere
Architecture #3/4: Data Acquisition
Crawler with target host balancing: round-robin access over target hosts (domain name); respects robots.txt, latency and a minimum access time of 0.5 s
Loader
OAI-PMH Loader: load OPAC records from libraries
Import Files: Dublin Core Files, Wikimedia Dump
Scan Sources in a specific network: Scan IP Range, Discover Services, Availability Management, SMB, FTP
RSS Feeds
Indexer
Parser (many file formats, Dublin Core): HTML, XHTML, RSS, RDF, XHTML+RDFa, FOAF, vCard, Flash, PDF, PS, Word, Excel, Visio, Powerpoint, OpenOffice, RTF, csv, gzip, zip, tar, rar, bzip2, 7zip, images (EXIF), Dublin Core XML, torrent files
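Target host balancing with round-robin access and a minimum access time per host can be sketched like this. It is a simplified illustration, not YaCy's crawler: the class name `HostBalancer` and its methods are hypothetical, and robots.txt handling and latency measurement are left out.

```python
from collections import OrderedDict, deque
from urllib.parse import urlsplit


class HostBalancer:
    """Round-robin URL queue with a minimum delay between hits per host."""

    def __init__(self, min_delay=0.5):
        self.min_delay = min_delay      # seconds between two accesses to one host
        self.queues = OrderedDict()     # host -> deque of pending URLs
        self.last_access = {}           # host -> time of the last fetch

    def push(self, url):
        """Queue a URL under its target host."""
        host = urlsplit(url).hostname or ""
        self.queues.setdefault(host, deque()).append(url)

    def pop(self, now):
        """Return the next URL whose host may be accessed at time `now`.

        Returns None if every waiting host was accessed too recently
        or if nothing is queued.
        """
        for host in list(self.queues):
            last = self.last_access.get(host, float("-inf"))
            if now - last >= self.min_delay:
                queue = self.queues.pop(host)
                url = queue.popleft()
                self.last_access[host] = now
                if queue:
                    # Re-insert at the end so other hosts get their turn first.
                    self.queues[host] = queue
                return url
        return None
```

The caller passes the current time (for example from `time.monotonic()`), which keeps the class easy to test and makes the politeness rule explicit: a host is skipped until 0.5 seconds have passed since its last access.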
Architecture #4/4: Production / Monitoring
Data Visualization
Network Animation
Connections, Queues, Database
Scheduler
Integration #1/4: Search Interface Integration
How to integrate a YaCy Search Portal: just copy-paste the code snippet into your web page source code. Your YaCy peer provides help pages with code snippets for easy integration!
Code Snippet Example #1: a search window in an iframe
<iframe name="target2" src="http://141.52.175.43:8080/yacysearch.html?display=2&resource=local" width="100%" height="180" frameborder="0" scrolling="auto" id="target2"></iframe>
Code Snippet Example #2: a search box (points to new page)
The YaCy administration interface offers more code snippets. An example from /ConfigSearchBox.html looks like:
<form method="get" accept-charset="UTF-8" action="http://141.52.175.43:8080/yacysearch.html">
  <div>
    <div>MySearch</div>
    <input type="text" name="query" value="" maxlength="80" />
    <input type="hidden" name="verify" value="true" />
    <input type="hidden" name="maximumRecords" value="10" />
    <input type="hidden" name="meanCount" value="5" />
    <input type="hidden" name="resource" value="local" />
    <input type="hidden" name="urlmaskfilter" value=".*" />
    <input type="hidden" name="prefermaskfilter" value="" />
    <input type="hidden" name="display" value="2" />
    <input type="hidden" name="nav" value="all" />
    <input type="submit" name="Enter" value="Search" />
  </div>
</form>
Integration #2/4: External Index Retrieval
How to get Opensearch/JSON Search Results:
• do a normal web search in YaCy
• replace the 'html' extension of the result page URL with 'rss'
• for JSON, replace the 'html' extension with 'json'
> curl "http://localhost:8080/yacysearch.rss?query=foaf&maximumRecords=10"
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type='text/xsl' href='/yacysearch.xsl' version='1.0'?>
<rss version="2.0" xmlns:yacy="http://www.yacy.net/" xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">
<!-- very short example -->
<item>
  <title>Friend of a Friend (FOAF) project</title>
  <link>http://www.foaf-project.org/</link>
  <pubDate>Fri, 23 May 2008 02:00:00 +0200</pubDate>
</item>
<item>
  <title>FOAF - Wikipedia</title>
  <link>http://de.wikipedia.org/wiki/FOAF</link>
  <pubDate>Tue, 08 Jan 2008 01:00:00 +0100</pubDate>
</item>
<item>
  <link>http://microformats.org/wiki/xfn-to-foaf</link>
  <pubDate>Fri, 09 May 2008 02:00:00 +0200</pubDate>
</item>
</rss>
Standards: The YaCy-internal Dublin Core Metadata Format fits very well into the RSS format that the Opensearch standard uses for search result data. JSON can also be used as an export format.
SRU Standard for Queries: http://www.loc.gov/standards/sru/specs/search-retrieve.html
Opensearch Standard: http://www.opensearch.org
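The retrieval steps on this slide can be sketched in Python using only the standard library. The endpoint name and the `query` and `maximumRecords` parameters come from the curl example; the helper names (`search_url`, `switch_format`, `parse_items`) are hypothetical, and the parser only reads the three item fields shown in the example output.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode, urlsplit, urlunsplit


def search_url(base, query, maximum_records=10, fmt="rss"):
    """Build a search URL such as <base>/yacysearch.rss?query=..."""
    params = urlencode({"query": query, "maximumRecords": maximum_records})
    return f"{base}/yacysearch.{fmt}?{params}"


def switch_format(result_url, fmt):
    """Replace the 'html' extension of a result page URL with 'rss' or 'json'."""
    parts = urlsplit(result_url)
    if not parts.path.endswith(".html"):
        raise ValueError("expected a result page URL ending in .html")
    new_path = parts.path[:-len("html")] + fmt
    return urlunsplit(parts._replace(path=new_path))


def parse_items(rss_text):
    """Extract title, link and pubDate from every <item> of an RSS result."""
    root = ET.fromstring(rss_text)
    return [
        {
            "title": item.findtext("title"),   # None if the item has no title
            "link": item.findtext("link"),
            "pubDate": item.findtext("pubDate"),
        }
        for item in root.iter("item")
    ]
```

With a peer running locally, the result text could be fetched with `urllib.request.urlopen(search_url("http://localhost:8080", "foaf")).read()` and passed to `parse_items`.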
Integration #3/4: Federated Index + feed to Solr
Solr Sharding
Integration #4/4: External Index Feeding
Standards: YaCy can import standard Dublin Core Metadata XML files as input for indexing.
How to import Dublin Core Files: just place the XML files into a hand-over directory at DATA/SURROGATES/in/
The Dublin Core XML File Standard: http://dublincore.org/documents/dc-xml-guidelines/
<?xml version="1.0" encoding="utf-8"?>
<!-- YaCy surrogate using dublin core notion -->
<surrogates xmlns:dc="http://purl.org/dc/elements/1.1/">
  <record>
    <dc:title><![CDATA[Alan Smithee]]></dc:title>
    <dc:identifier>http://de.wikipedia.org/wiki/Alan_Smithee</dc:identifier>
    <dc:description>
      <![CDATA['''Alan Smithee''' ist ein Anagramm von „The Alias Men“.]]>
    </dc:description>
    <dc:language>de</dc:language>
    <dc:date>2009-04-14T00:00:00Z</dc:date> <!-- date is in ISO 8601 -->
  </record>
</surrogates>
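A surrogate file with this structure could be generated with Python's standard library, for example to feed records from another system. This is a sketch under assumptions: the function names and the default file name `example.xml` are hypothetical, and ElementTree escapes text instead of using CDATA sections, which is equivalent XML.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

DC = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC)  # serialize the namespace with the 'dc' prefix

# Dublin Core fields used in the example record on the slide.
FIELDS = ("title", "identifier", "description", "language", "date")


def build_surrogates(records):
    """Build a <surrogates> tree with one <record> per input dict."""
    root = ET.Element("surrogates")
    for rec in records:
        node = ET.SubElement(root, "record")
        for field in FIELDS:
            if field in rec:
                ET.SubElement(node, f"{{{DC}}}{field}").text = rec[field]
    return ET.ElementTree(root)


def write_surrogates(records, directory, filename="example.xml"):
    """Write the records as one XML file into the hand-over directory."""
    target = Path(directory) / filename
    target.parent.mkdir(parents=True, exist_ok=True)
    build_surrogates(records).write(str(target), encoding="utf-8",
                                    xml_declaration=True)
    return target
```

Calling `write_surrogates(records, "DATA/SURROGATES/in")` from the YaCy installation directory would place the file where the slide says imports are picked up.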
Join In!
Download:
- download YaCy from http://yacy.net
Help!
- interface translation and wiki pages
- run a peer
- become a developer
- talk to us at http://forum.yacy.de