Inside Searching

Searching: Needles and Haystacks

Searching for stuff Why it's important How it's done Technical difficulties Social difficulties

Search and Hugh

Involved in this since about 1982, mainly EEC projects especially Celex

I currently work for Vienna U on a multimedia database for stored manuscripts etc.

Computing industry since about 1974

[email protected]

As a Human Activity

Looking for keys Remembering names and birthdays Looking up in a book And [the subject of this] making tools for the

intertubes Getting a clue, from the above...

How it's Done: 1

Health warning: this explanation is simplified! Let's take Google How does it find one zillion documents/images

with 'lolcat' in them, within a few seconds?

How it's Done:2

It did it already A key concept: IndexingAnother key concept: Inverted index [see

wikipedia]: https://en.wikipedia.org/wiki/Inverted_index

- lolcat in document x at position y- highlighted cat in document x at position y [?]

Why do this at all?

Since Google, Bing, Yahoo already did it... Lots of interesting technical pieces Self education Fun and profit, do it 'better' [?] Internal search engines, intranet search

engines Domain specific engines School or research projects

Parts of the Search Engine

- Spider or Harvester [?]- Parser/Indexer- Index Storage- Retrieval [the bit of Google that we see!]I'm going to go through these in order...

Spider or Harvester

Go and get a load of stuff from the webThink of it as a programmatic super-surfer, use

curl: http://curl.haxx.se/ to try, for example Actually there are tons of ready-made ones:http://search.cpan.org/~johnd/WWW-Crawler-

Lite-0.005/lib/WWW/Crawler/Lite.pm Be polite, user-agent name, robots.txt,

throttling etc. [?] Can you think of some of the problems?

http://curl.haxx.se/

Spidering/Harvesting/Darkweb

Spidering: start at top, follow linksDirectory based: index a load of things in a given

directoryHarvesting: academic harvester interfaces,

domain specific, I do this at presentRobots.txt: courtesy, the interactive web and the

darkwebProblems

Parser/Indexer

Now the fun begins! Breaking down the stuff you get into tokens <b>lolcat</b> is indexed as 'lolcat', for

example Some tagging is preserved as meta-

information, <h1>Cat</h1> for example Parsing document types text, html, pdf etc Swish-e: http://www.swish-e.org/ is a small

scale harvester/parser, for example Some problems/opportunities

http://www.swish-e.org/

Index

What does it look like?T[0] = "it is what it is"

T[1] = "what is it"

T[2] = "it is a banana"

we have the following inverted file index (where the integers in the set notation brackets refer to the indexes (or keys) of the text symbols, T[0], T[1] etc.):

"a": {2}

"banana": {2}

"is": {0, 1, 2}

"it": {0, 1, 2}

"what": {0, 1}

Problems: stop words !!!

Storage

This used to be easy, now lots of options Sparse data, some entries 'lolcats' have

millions of entries, pyx [look it up] won't have many

It's a 'lot' of data, google came about by misspelling googolplex:

Relationals are fairly unsuitable Nosql and ready-mades:

http://solr-vs-elasticsearch.com/ for example

http://solr-vs-elasticsearch.com/

Retrieval

Here you get results of all this work Simple, one field, one work Booleans and implied booleans [lots of works

anded together] Relevant results, this is the main thing and

links back to the storage and parsing Some problems, multilingual, non-latin,

synonyms

Technical Difficulties

Finding all the documents Documents that change, appear or disappear Making the index and looking at it [it's big!] Non-latin/accented scripts for latin speakers:

appauvrissement Can you think of others?

Technical Difficulties:2

Looking for André Looking for 中国 [what's that incidentally?] Looking for cat [furry] and cat [computer

command] Speed of index refresh [days] Storage and computation Semantic search

Social Difficulties

Right to be forgotten Security services and data mining Privacy and doxing, see visual tagging too Linking the unlinked Automatic visual tagging [facebook] Automatic geolocation [most smartphones] Any more?

Conclusions

It's a central human activity It's a vital activity for the web Very simple central idea, but lots of evolution

possible There's a societal debate to go with the

technical evolution

Thanks!

Thanks for listening and questions!

Inside Searching

Internet

Transcript of Inside Searching