Inside Searching
-
Upload
hugh-barnard -
Category
Internet
-
view
204 -
download
0
Transcript of Inside Searching
![Page 1: Inside Searching](https://reader036.fdocuments.net/reader036/viewer/2022083022/58ab02591a28abd35e8b5fbf/html5/thumbnails/1.jpg)
Searching: Needles and Haystacks
Searching for stuff Why it's important How it's done Technical difficulties Social difficulties
![Page 2: Inside Searching](https://reader036.fdocuments.net/reader036/viewer/2022083022/58ab02591a28abd35e8b5fbf/html5/thumbnails/2.jpg)
Search and Hugh
Involved in this since about 1982, mainly EEC projects especially Celex
I currently work for Vienna U on a multimedia database for stored manuscripts etc.
Computing industry since about 1974
![Page 3: Inside Searching](https://reader036.fdocuments.net/reader036/viewer/2022083022/58ab02591a28abd35e8b5fbf/html5/thumbnails/3.jpg)
As a Human Activity
Looking for keys Remembering names and birthdays Looking up in a book And [the subject of this] making tools for the
intertubes Getting a clue, from the above...
![Page 4: Inside Searching](https://reader036.fdocuments.net/reader036/viewer/2022083022/58ab02591a28abd35e8b5fbf/html5/thumbnails/4.jpg)
How it's Done: 1
Health warning: this explanation is simplified! Let's take Google How does it find one zillion documents/images
with 'lolcat' in them, within a few seconds?
![Page 5: Inside Searching](https://reader036.fdocuments.net/reader036/viewer/2022083022/58ab02591a28abd35e8b5fbf/html5/thumbnails/5.jpg)
How it's Done:2
It did it already A key concept: IndexingAnother key concept: Inverted index [see
wikipedia]: https://en.wikipedia.org/wiki/Inverted_index
- lolcat in document x at position y- highlighted cat in document x at position y [?]
![Page 6: Inside Searching](https://reader036.fdocuments.net/reader036/viewer/2022083022/58ab02591a28abd35e8b5fbf/html5/thumbnails/6.jpg)
Why do this at all?
Since Google, Bing, Yahoo already did it... Lots of interesting technical pieces Self education Fun and profit, do it 'better' [?] Internal search engines, intranet search
engines Domain specific engines School or research projects
![Page 7: Inside Searching](https://reader036.fdocuments.net/reader036/viewer/2022083022/58ab02591a28abd35e8b5fbf/html5/thumbnails/7.jpg)
Parts of the Search Engine
- Spider or Harvester [?]- Parser/Indexer- Index Storage- Retrieval [the bit of Google that we see!]I'm going to go through these in order...
![Page 8: Inside Searching](https://reader036.fdocuments.net/reader036/viewer/2022083022/58ab02591a28abd35e8b5fbf/html5/thumbnails/8.jpg)
Spider or Harvester
Go and get a load of stuff from the webThink of it as a programmatic super-surfer, use
curl: http://curl.haxx.se/ to try, for example Actually there are tons of ready-made ones:http://search.cpan.org/~johnd/WWW-Crawler-
Lite-0.005/lib/WWW/Crawler/Lite.pm Be polite, user-agent name, robots.txt,
throttling etc. [?] Can you think of some of the problems?
![Page 9: Inside Searching](https://reader036.fdocuments.net/reader036/viewer/2022083022/58ab02591a28abd35e8b5fbf/html5/thumbnails/9.jpg)
Spidering/Harvesting/Darkweb
Spidering: start at top, follow linksDirectory based: index a load of things in a given
directoryHarvesting: academic harvester interfaces,
domain specific, I do this at presentRobots.txt: courtesy, the interactive web and the
darkwebProblems
![Page 10: Inside Searching](https://reader036.fdocuments.net/reader036/viewer/2022083022/58ab02591a28abd35e8b5fbf/html5/thumbnails/10.jpg)
Parser/Indexer
Now the fun begins! Breaking down the stuff you get into tokens <b>lolcat</b> is indexed as 'lolcat', for
example Some tagging is preserved as meta-
information, <h1>Cat</h1> for example Parsing document types text, html, pdf etc Swish-e: http://www.swish-e.org/ is a small
scale harvester/parser, for example Some problems/opportunities
![Page 11: Inside Searching](https://reader036.fdocuments.net/reader036/viewer/2022083022/58ab02591a28abd35e8b5fbf/html5/thumbnails/11.jpg)
Index
What does it look like?T[0] = "it is what it is"
T[1] = "what is it"
T[2] = "it is a banana"
we have the following inverted file index (where the integers in the set notation brackets refer to the indexes (or keys) of the text symbols, T[0], T[1] etc.):
"a": {2}
"banana": {2}
"is": {0, 1, 2}
"it": {0, 1, 2}
"what": {0, 1}
Problems: stop words !!!
![Page 12: Inside Searching](https://reader036.fdocuments.net/reader036/viewer/2022083022/58ab02591a28abd35e8b5fbf/html5/thumbnails/12.jpg)
Storage
This used to be easy, now lots of options Sparse data, some entries 'lolcats' have
millions of entries, pyx [look it up] won't have many
It's a 'lot' of data, google came about by misspelling googolplex:
Relationals are fairly unsuitable Nosql and ready-mades:
http://solr-vs-elasticsearch.com/ for example
![Page 13: Inside Searching](https://reader036.fdocuments.net/reader036/viewer/2022083022/58ab02591a28abd35e8b5fbf/html5/thumbnails/13.jpg)
Retrieval
Here you get results of all this work Simple, one field, one work Booleans and implied booleans [lots of works
anded together] Relevant results, this is the main thing and
links back to the storage and parsing Some problems, multilingual, non-latin,
synonyms
![Page 14: Inside Searching](https://reader036.fdocuments.net/reader036/viewer/2022083022/58ab02591a28abd35e8b5fbf/html5/thumbnails/14.jpg)
Technical Difficulties
Finding all the documents Documents that change, appear or disappear Making the index and looking at it [it's big!] Non-latin/accented scripts for latin speakers:
appauvrissement Can you think of others?
![Page 15: Inside Searching](https://reader036.fdocuments.net/reader036/viewer/2022083022/58ab02591a28abd35e8b5fbf/html5/thumbnails/15.jpg)
Technical Difficulties:2
Looking for André Looking for 中国 [what's that incidentally?] Looking for cat [furry] and cat [computer
command] Speed of index refresh [days] Storage and computation Semantic search
![Page 16: Inside Searching](https://reader036.fdocuments.net/reader036/viewer/2022083022/58ab02591a28abd35e8b5fbf/html5/thumbnails/16.jpg)
Social Difficulties
Right to be forgotten Security services and data mining Privacy and doxing, see visual tagging too Linking the unlinked Automatic visual tagging [facebook] Automatic geolocation [most smartphones] Any more?
![Page 17: Inside Searching](https://reader036.fdocuments.net/reader036/viewer/2022083022/58ab02591a28abd35e8b5fbf/html5/thumbnails/17.jpg)
Conclusions
It's a central human activity It's a vital activity for the web Very simple central idea, but lots of evolution
possible There's a societal debate to go with the
technical evolution
![Page 18: Inside Searching](https://reader036.fdocuments.net/reader036/viewer/2022083022/58ab02591a28abd35e8b5fbf/html5/thumbnails/18.jpg)
Thanks!
Thanks for listening and questions!