Information Retrieval
A Short Introduction to Apache Lucene
Mindaugas Žakšauskas
Vilnius Kaunas Java User Group
assert this != null
I will talk about
● The problem domain
● Information retrieval theory
● Lucene
● Solr (if time permits)
● ?
Who are we?
Chopping it into pieces (tokenization)
Document #1: "The quick brown fox jumps over the lazy dog."
Word       the    quick   brown   fox
Documents  1      1       1       1
Offsets    1[0]   1[4]    1[10]   1[16]
Chopping it into pieces (tokenization) #2
Document #2: "I saw a brown fox yesterday. It ran away quickly."
Word       the    quick   brown         fox
Documents  1      1       1, 2          1, 2
Offsets    1[0]   1[4]    1[10], 2[8]   1[16], 2[14]
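The table above is a small inverted index: each term maps to the documents that contain it and the character offset where it starts. A minimal sketch of how such a structure could be built with the plain JDK follows. The class and method names (`InvertedIndexSketch`, `add`, `describe`) are hypothetical, and the regex tokenizer is an assumption for illustration; this is not how Lucene stores its postings.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of an inverted index with character offsets.
class InvertedIndexSketch {
    // term -> postings; each posting is {docId, startOffset}
    private final Map<String, List<int[]>> postings = new TreeMap<>();

    // Splits the text into word tokens and records where each one starts.
    public void add(int docId, String text) {
        Matcher m = Pattern.compile("\\w+").matcher(text);
        while (m.find()) {
            String term = m.group().toLowerCase(Locale.ROOT);
            postings.computeIfAbsent(term, k -> new ArrayList<>())
                    .add(new int[] {docId, m.start()});
        }
    }

    // Renders the postings of a term in the slide's notation, e.g. "1[10], 2[8]".
    public String describe(String term) {
        StringBuilder sb = new StringBuilder();
        for (int[] p : postings.getOrDefault(term, Collections.emptyList())) {
            if (sb.length() > 0) {
                sb.append(", ");
            }
            sb.append(p[0]).append('[').append(p[1]).append(']');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(1, "The quick brown fox jumps over the lazy dog.");
        index.add(2, "I saw a brown fox yesterday. It ran away quickly.");
        for (String term : new String[] {"quick", "brown", "fox"}) {
            System.out.println(term + " -> " + index.describe(term));
        }
    }
}
```

Note that "quickly" in document #2 is indexed as its own term here, so a search for "quick" only finds document #1. Closing that gap is the job of the analysis steps on the following slides.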
Stopwords
English:
- articles: a, the
Lithuanian:
- prepositions: į, nuo
- interjections: oi!
- ?
Stemming, lowercasing
Quick| ⇒ quick
quick|ly ⇒ quick
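Stopword removal, lowercasing, and stemming are usually chained into one analysis step that runs on every token before it reaches the index. The sketch below shows that chain with the plain JDK. The stopword list and the one-rule "stemmer" (strip a trailing "ly") are deliberately toy assumptions, and the names `AnalyzerSketch`, `stem`, and `analyze` are hypothetical. Real stemmers such as Porter's apply many more rules.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch of an analysis chain: tokenize, lowercase, drop stopwords, stem.
class AnalyzerSketch {
    // Illustrative stopword list; real lists are much longer.
    static final Set<String> STOPWORDS = new HashSet<>(Arrays.asList("a", "the"));

    // Toy stemmer (assumption): strips a trailing "ly" from longer words only.
    public static String stem(String token) {
        if (token.length() > 4 && token.endsWith("ly")) {
            return token.substring(0, token.length() - 2);
        }
        return token;
    }

    public static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        // Split on anything that is neither a letter nor a digit.
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            if (raw.isEmpty()) {
                continue;
            }
            String token = raw.toLowerCase(Locale.ROOT);
            if (STOPWORDS.contains(token)) {
                continue;
            }
            terms.add(stem(token));
        }
        return terms;
    }

    public static void main(String[] args) {
        // prints [quick, fox, ran, away, quick]
        System.out.println(analyze("The Quick fox ran away quickly."));
    }
}
```

Both "Quick" and "quickly" end up as the term "quick", so a query for either form can match both documents. The same chain has to be applied to the query text, otherwise the query terms will not line up with the indexed terms.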
Apache Lucene
● ACID (+2 phase commit)
● NoSQL (seriously!)
● Concurrency
● Java (.NET, Python, Ruby)
● Community
● Widely used
Not the Lucene of tsarist times!
Indexing speed, Lucene v4
Synonyms
● Water - H2O
● Requires a dedicated dictionary (SynonymMap)
quick brown fox ⇒ "quick", "fast", "brown", "fox"
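The expansion above can be sketched as a dictionary lookup applied to each token. The code below uses a plain `Map` in place of Lucene's `SynonymMap`, so every name in it (`SynonymExpansionSketch`, `expand`) is hypothetical and it only illustrates the idea. In Lucene a synonym is emitted at the same position as the original token, which this flat list does not model.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of synonym expansion with a plain dictionary.
class SynonymExpansionSketch {
    // Emits every token followed by its synonyms, if the dictionary has any.
    public static List<String> expand(List<String> tokens, Map<String, List<String>> synonyms) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            out.add(token);
            out.addAll(synonyms.getOrDefault(token, Collections.emptyList()));
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, List<String>> synonyms = new HashMap<>();
        synonyms.put("quick", Arrays.asList("fast"));
        // prints [quick, fast, brown, fox]
        System.out.println(expand(Arrays.asList("quick", "brown", "fox"), synonyms));
    }
}
```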
Phonetic encoding
http://en.wikipedia.org/wiki/Metaphone
0BFHJKLMNPRSTWXY
Stephen Smith ⇒ STFN SM0
Boosting (indexing, query)
Queries
● field: foo bar
● field: +foo -bar
● field: "foo bar"
● field: +"foo bar" AND blah
● field: f?o bar*
● field: foo~ bar~0.8
● date_field: [2000 TO 2001]
● field: (foo AND bar) OR bob
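In this syntax a `+` marks a term as required, a `-` marks it as prohibited, and a bare term is optional. A rough sketch of how those three kinds of clauses could be evaluated against term-to-document sets follows. It uses only the JDK, all names (`BooleanQuerySketch`, `search`, `demoIndex`) are hypothetical, and it ignores scoring entirely; in Lucene the optional terms also influence the ranking of the results.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of required (+), optional, and prohibited (-) clauses.
class BooleanQuerySketch {
    private static Set<Integer> docs(Map<String, Set<Integer>> index, String term) {
        return index.getOrDefault(term, Collections.emptySet());
    }

    public static Set<Integer> search(Map<String, Set<Integer>> index,
                                      List<String> must, List<String> should, List<String> mustNot) {
        Set<Integer> result = new TreeSet<>();
        if (!must.isEmpty()) {
            // Every required term has to be present: intersect the sets.
            result.addAll(docs(index, must.get(0)));
            for (String term : must) {
                result.retainAll(docs(index, term));
            }
        } else {
            // No required terms: a document needs at least one optional term.
            for (String term : should) {
                result.addAll(docs(index, term));
            }
        }
        // Prohibited terms remove documents from the result.
        for (String term : mustNot) {
            result.removeAll(docs(index, term));
        }
        return result;
    }

    // Tiny made-up index: term -> ids of the documents containing it.
    public static Map<String, Set<Integer>> demoIndex() {
        Map<String, Set<Integer>> index = new HashMap<>();
        index.put("foo", new HashSet<>(Arrays.asList(1, 2, 3)));
        index.put("bar", new HashSet<>(Arrays.asList(2, 4)));
        return index;
    }

    public static void main(String[] args) {
        List<String> none = Collections.emptyList();
        // "+foo -bar" prints [1, 3]
        System.out.println(search(demoIndex(), Arrays.asList("foo"), none, Arrays.asList("bar")));
        // "foo bar" prints [1, 2, 3, 4]
        System.out.println(search(demoIndex(), none, Arrays.asList("foo", "bar"), none));
    }
}
```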
http://searchhub.org/dev/2011/12/28/why-not-and-or-and-not/
Result scoring formula
org.apache.lucene.search.similarities.TFIDFSimilarity
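The core of TF-IDF scoring is that a term counts for more the more often it occurs in a document and the rarer it is across the index. The sketch below computes a single term's contribution as tf · idf², assuming the default choices of Lucene 4's `DefaultSimilarity` (tf = √freq, idf = 1 + ln(numDocs / (docFreq + 1))). It leaves out the other factors of the full formula (coord, queryNorm, boosts, and the length norm), and the class `TfIdfSketch` and its methods are hypothetical names, not Lucene API.

```java
// Hypothetical sketch of the tf and idf factors of TF-IDF scoring.
class TfIdfSketch {
    // Term frequency factor: grows with the number of occurrences, but sublinearly.
    public static double tf(int freq) {
        return Math.sqrt(freq);
    }

    // Inverse document frequency: rarer terms get a larger weight.
    public static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    // Contribution of one query term to a document's score (other factors omitted).
    public static double termScore(int freq, int docFreq, int numDocs) {
        double idf = idf(docFreq, numDocs);
        return tf(freq) * idf * idf;
    }

    public static void main(String[] args) {
        int numDocs = 1000;
        // A term found in 1 document outweighs one found in 500 documents.
        System.out.println("rare term:   " + termScore(1, 1, numDocs));
        System.out.println("common term: " + termScore(1, 500, numDocs));
    }
}
```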
Finite state transducer
mop, moth, pop, star, stop, top
10 million Wikipedia index - 69Mb