Inside Searching

18
Searching: Needles and Haystacks Searching for stuff Why it's important How it's done Technical difficulties Social difficulties

Transcript of Inside Searching

Page 1: Inside Searching

Searching: Needles and Haystacks

Searching for stuff Why it's important How it's done Technical difficulties Social difficulties

Page 2: Inside Searching

Search and Hugh

Involved in this since about 1982, mainly EEC projects especially Celex

I currently work for Vienna U on a multimedia database for stored manuscripts etc.

Computing industry since about 1974

[email protected]

Page 3: Inside Searching

As a Human Activity

Looking for keys Remembering names and birthdays Looking up in a book And [the subject of this] making tools for the

intertubes Getting a clue, from the above...

Page 4: Inside Searching

How it's Done: 1

Health warning: this explanation is simplified! Let's take Google How does it find one zillion documents/images

with 'lolcat' in them, within a few seconds?

Page 5: Inside Searching

How it's Done:2

It did it already A key concept: IndexingAnother key concept: Inverted index [see

wikipedia]: https://en.wikipedia.org/wiki/Inverted_index

- lolcat in document x at position y- highlighted cat in document x at position y [?]

Page 6: Inside Searching

Why do this at all?

Since Google, Bing, Yahoo already did it... Lots of interesting technical pieces Self education Fun and profit, do it 'better' [?] Internal search engines, intranet search

engines Domain specific engines School or research projects

Page 7: Inside Searching

Parts of the Search Engine

- Spider or Harvester [?]- Parser/Indexer- Index Storage- Retrieval [the bit of Google that we see!]I'm going to go through these in order...

Page 8: Inside Searching

Spider or Harvester

Go and get a load of stuff from the webThink of it as a programmatic super-surfer, use

curl: http://curl.haxx.se/ to try, for example Actually there are tons of ready-made ones:http://search.cpan.org/~johnd/WWW-Crawler-

Lite-0.005/lib/WWW/Crawler/Lite.pm Be polite, user-agent name, robots.txt,

throttling etc. [?] Can you think of some of the problems?

Page 9: Inside Searching

Spidering/Harvesting/Darkweb

Spidering: start at top, follow linksDirectory based: index a load of things in a given

directoryHarvesting: academic harvester interfaces,

domain specific, I do this at presentRobots.txt: courtesy, the interactive web and the

darkwebProblems

Page 10: Inside Searching

Parser/Indexer

Now the fun begins! Breaking down the stuff you get into tokens <b>lolcat</b> is indexed as 'lolcat', for

example Some tagging is preserved as meta-

information, <h1>Cat</h1> for example Parsing document types text, html, pdf etc Swish-e: http://www.swish-e.org/ is a small

scale harvester/parser, for example Some problems/opportunities

Page 11: Inside Searching

Index

What does it look like?T[0] = "it is what it is"

T[1] = "what is it"

T[2] = "it is a banana"

we have the following inverted file index (where the integers in the set notation brackets refer to the indexes (or keys) of the text symbols, T[0], T[1] etc.):

"a": {2}

"banana": {2}

"is": {0, 1, 2}

"it": {0, 1, 2}

"what": {0, 1}

Problems: stop words !!!

Page 12: Inside Searching

Storage

This used to be easy, now lots of options Sparse data, some entries 'lolcats' have

millions of entries, pyx [look it up] won't have many

It's a 'lot' of data, google came about by misspelling googolplex:

Relationals are fairly unsuitable Nosql and ready-mades:

http://solr-vs-elasticsearch.com/ for example

Page 13: Inside Searching

Retrieval

Here you get results of all this work Simple, one field, one work Booleans and implied booleans [lots of works

anded together] Relevant results, this is the main thing and

links back to the storage and parsing Some problems, multilingual, non-latin,

synonyms

Page 14: Inside Searching

Technical Difficulties

Finding all the documents Documents that change, appear or disappear Making the index and looking at it [it's big!] Non-latin/accented scripts for latin speakers:

appauvrissement Can you think of others?

Page 15: Inside Searching

Technical Difficulties:2

Looking for André Looking for 中国 [what's that incidentally?] Looking for cat [furry] and cat [computer

command] Speed of index refresh [days] Storage and computation Semantic search

Page 16: Inside Searching

Social Difficulties

Right to be forgotten Security services and data mining Privacy and doxing, see visual tagging too Linking the unlinked Automatic visual tagging [facebook] Automatic geolocation [most smartphones] Any more?

Page 17: Inside Searching

Conclusions

It's a central human activity It's a vital activity for the web Very simple central idea, but lots of evolution

possible There's a societal debate to go with the

technical evolution

Page 18: Inside Searching

Thanks!

Thanks for listening and questions!