An Introduction to NLP4L (Scala by the Bay / Big Data Scala 2015)
An Introduction to NLP4L
Natural Language Processing tool for Apache Lucene
Koji Sekiguchi @kojisays Founder & CEO, RONDHUIT
Agenda
• What’s NLP4L?
• How NLP improves search experience
• Count number of words in Lucene index
• Application: Transliteration
• Future Plans
What’s NLP4L?
• GOAL
  • Improve Lucene users’ search experience
• FEATURES
  • Use of Lucene index as a Corpus Database
  • Lucene API Front-end written in Scala
  • NLP4L provides
    • Preprocessors for existing ML tools
    • ML algorithms and Applications (e.g. Transliteration)
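What’s Lucene?
Lucene is a high-performance, full-featured text search engine library written entirely in Java.
Documents:
1: Alice ate an apple.
2: Mike likes an orange.
3: An apple is red.
Indexing builds the (inverted) index, which maps each term to the IDs of the documents that contain it:
alice 1
an 1, 2, 3
apple 1, 3
ate 1
is 3
likes 2
mike 2
orange 2
red 3
Searching for “apple” looks the term up in the index and finds documents 1 and 3.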
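The indexing and searching steps on this slide can be sketched in a few lines of plain Scala. This is only an illustration of the inverted-index idea, not Lucene's or NLP4L's API; the `tokenize` helper and its regular expression are assumptions standing in for a real analyzer.

```scala
// Plain-Scala sketch of an inverted index; NOT Lucene's or NLP4L's API.
val docs = Map(
  1 -> "Alice ate an apple.",
  2 -> "Mike likes an orange.",
  3 -> "An apple is red."
)

// Hypothetical tokenizer: lowercase, then split on anything that is not a letter.
def tokenize(text: String): Seq[String] =
  text.toLowerCase.split("[^a-z]+").filter(_.nonEmpty).toSeq

// Indexing: map every term to the sorted IDs of the documents containing it.
val invertedIndex: Map[String, Seq[Int]] =
  docs.toSeq
    .flatMap { case (id, text) => tokenize(text).distinct.map(term => term -> id) }
    .groupBy(_._1)
    .map { case (term, pairs) => term -> pairs.map(_._2).sorted }

// Searching: a single term lookup instead of a scan over all documents.
def search(term: String): Seq[Int] =
  invertedIndex.getOrElse(term.toLowerCase, Seq.empty)

val appleDocs = search("apple") // Seq(1, 3)
```

The lookup returns the same document IDs as the "apple 1, 3" row of the index on the slide.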
How NLP improves search experience
Evaluation Measures
[Figure: Venn diagram of two overlapping sets, “target” (the documents the user wants) and “result” (the documents the search returns)]
• positive: inside the result set; negative: outside the result set
• tp (true positive): in both target and result
• fp (false positive): in result but not in target
• fn (false negative): in target but not in result
• tn (true negative): in neither target nor result
precision = tp / (tp + fp)
recall = tp / (tp + fn)
Recall, Precision
The two measures usually trade off: enlarging the result set tends to raise recall and lower precision, while shrinking it tends to do the reverse.
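As a worked example of the two formulas, here is a minimal sketch in plain Scala. The document IDs in `target`, `result`, and `all` are made up for illustration and do not come from the talk.

```scala
// Hypothetical document IDs, chosen only for illustration.
val target = Set(1, 2, 3, 4)   // documents the user actually wants
val result = Set(3, 4, 5)      // documents the search returned
val all    = (1 to 10).toSet   // the whole collection

val tp = (target & result).size          // wanted and returned
val fp = (result -- target).size         // returned but not wanted
val fn = (target -- result).size         // wanted but not returned
val tn = (all -- target -- result).size  // neither wanted nor returned

val precision = tp.toDouble / (tp + fp)  // 2 / 3
val recall    = tp.toDouble / (tp + fn)  // 2 / 4 = 0.5
```

Adding documents 1 and 2 to `result` would raise recall to 1.0 without changing fp, whereas adding unwanted documents would only lower precision.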
Solution
• recall: n-gram, synonym dictionary, etc. (e.g. Transliteration)
• precision: facet (filter query) (e.g. Named Entity Extraction)
• recall, precision: Ranking Tuning
Gradual precision improvement
[Figure: the “result” set shrinks toward the “target” set with each step]
• q=watch
• filter by “Gender=Men’s”
• filter by “Price=100-150”
Structured Documents
ID | product | price | gender
1 | CURREN New Men’s Date Stainless Steel Military Sport Quartz Wrist Watch | 8.92 | Men’s
2 | Suiksilver The Gamer Watch | 87.99 | Men’s
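Filters such as “Gender=Men’s” and “Price=100-150” only work when documents carry structured fields. The following plain-Scala sketch (not Lucene, Solr, or NLP4L code) shows how each filter narrows the result of q=watch; the `Product` class and the catalogue entries are hypothetical.

```scala
// Hypothetical catalogue; names and prices are made up for illustration.
case class Product(id: Int, name: String, price: Double, gender: String)

val products = Seq(
  Product(1, "Sport Quartz Wrist Watch", 120.00, "Men's"),
  Product(2, "Classic Watch", 87.99, "Men's"),
  Product(3, "Slim Watch", 130.00, "Women's"),
  Product(4, "Watch Strap", 15.00, "Men's")
)

// q=watch: a keyword match on the name field
val hits = products.filter(_.name.toLowerCase.contains("watch"))

// Each filter on a structured field shrinks the result set further
val mens     = hits.filter(_.gender == "Men's")
val midPrice = mens.filter(p => p.price >= 100 && p.price <= 150)

val sizes = Seq(hits, mens, midPrice).map(_.size) // Seq(4, 3, 1)
```

If the user wanted a men's watch in that price range, the false positives (the strap, the cheaper watch, the women's watch) drop out step by step while the true positive stays.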
Unstructured Documents
ID | article
1 | David Cameron says he has a mandate to pursue EU reform following the Conservatives' general election victory. The Prime Minister will be hoping his majority government will give him extra leverage in Brussels.
2 | He wants to renegotiate the terms of the UK's membership ahead of a referendum by the end of 2017. He has said he will campaign for Britain to remain in the EU if he gets the reforms he wants.
Make them Structured
ID | article | person | org | loc
1 | David Cameron says he has a mandate to pursue EU reform following the Conservatives' general election victory. The Prime Minister will be hoping his majority government will give him extra leverage in Brussels. | David Cameron | EU | Brussels
2 | He wants to renegotiate the terms of the UK's membership ahead of a referendum by the end of 2017. He has said he will campaign for Britain to remain in the EU if he gets the reforms he wants. | - | EU | UK, Britain
NEE [1] extracts interesting words.
[1] Named Entity Extraction
Manual Tagging using brat
Count number of words in Lucene index
A small Corpus
val CORPUS = Array(
  "Alice ate an apple.",
  "Mike likes an orange.",
  "An apple is red."
)
index_simple.scala

// text data
val CORPUS = Array(
  "Alice ate an apple.",
  "Mike likes an orange.",
  "An apple is red."
)

// Lucene index directory
val index = "/tmp/index-simple"

// schema definition
def schema(): Schema = {
  val builder = AnalyzerBuilder()
  builder.withTokenizer("standard")
  builder.addTokenFilter("lowercase")
  val analyzer = builder.build
  val fieldTypes = Map(
    "text" -> FieldType(analyzer, true, true, true, true)
  )
  val analyzerDefault = analyzer
  Schema(analyzerDefault, fieldTypes)
}

// create Lucene document
def doc(text: String): Document = {
  Document(Set(
    Field("text", text)
  ))
}

// open a writer
val writer = IWriter(index, schema)

// write documents
CORPUS.foreach(text => writer.write(doc(text)))

// close writer
writer.close
For the code snippets used in this talk, see: https://github.com/NLP4L/meetups/tree/master/20150818
Getting word counts
Index contents (term, followed by the IDs of the documents that contain it):
alice 1
an 1, 2, 3
apple 1, 3
ate 1
is 3
likes 2
mike 2
orange 2
red 3

getting_word_counts.scala

val reader = RawReader(index)

// total number of term occurrences in the field
reader.sumTotalTermFreq("text")      // -> 12

// number of unique terms in the field
reader.field("text").get.terms.size  // -> 9

// number of occurrences of the term "an"
reader.totalTermFreq("text", "an")   // -> 3

reader.close
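To see where the numbers 12, 9, and 3 come from, the same counts can be reproduced without an index. This is a plain-Scala sketch, not NLP4L code; the `tokenize` helper is an assumption that approximates the standard tokenizer plus lowercase filter for this tiny corpus.

```scala
val corpus = Seq("Alice ate an apple.", "Mike likes an orange.", "An apple is red.")

// Hypothetical stand-in for the standard tokenizer + lowercase filter
def tokenize(text: String): Seq[String] =
  text.toLowerCase.split("[^a-z]+").filter(_.nonEmpty).toSeq

val tokens = corpus.flatMap(tokenize)

// term -> number of occurrences in the whole corpus
val termFreq: Map[String, Int] =
  tokens.groupBy(identity).map { case (term, occ) => term -> occ.size }

val totalTokens = tokens.size     // 12, cf. sumTotalTermFreq("text")
val uniqueTerms = termFreq.size   // 9,  cf. terms.size
val countAn     = termFreq("an")  // 3,  cf. totalTermFreq("text", "an")
```

The advantage of reading these counts from a Lucene index instead is that they are already stored there, so no pass over the raw text is needed.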
Getting top terms

getting_word_counts.scala

val reader = RawReader(index)

reader.topTermsByDocFreq("text")
reader.topTermsByTotalTermFreq("text")
// ->
// (term, docFreq, totalTermFreq)
// (an,3,3)
// (apple,2,2)
// (likes,1,1)
// (is,1,1)
// (orange,1,1)
// (mike,1,1)
// (ate,1,1)
// (red,1,1)
// (alice,1,1)

reader.close
What’s ShingleFilter?
• ShingleFilter = Word n-gram TokenFilter
“Lucene is a popular software”
→ WhitespaceTokenizer → Lucene/is/a/popular/software
→ ShingleFilter (N=2) → Lucene is/is a/a popular/popular software
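The effect of the two stages can be imitated with `sliding` from the Scala collections library. This is only a sketch of word n-gram generation, not Lucene's ShingleFilter; the `shingles` function is a hypothetical name.

```scala
// Word n-grams via sliding windows; a sketch, not Lucene's ShingleFilter itself.
def shingles(text: String, n: Int = 2): Seq[String] =
  text.split("\\s+").filter(_.nonEmpty).toSeq // whitespace tokenization
    .sliding(n)                                // consecutive windows of n tokens
    .filter(_.size == n)                       // drop the short window of a too-short text
    .map(_.mkString(" "))
    .toList

val bigrams = shingles("Lucene is a popular software")
// Seq("Lucene is", "is a", "a popular", "popular software")
```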
Language Model
• An LM represents the fluency of language
• The N-gram model is the most widely used LM
• Calculation example for 2-gram
language_model.scala (1/2)

val index = "/tmp/index-lm"

val CORPUS = Array(
  "Alice ate an apple.",
  "Mike likes an orange.",
  "An apple is red."
)

// schema definition: "word" holds single words, "word2g" holds word 2-grams (shingles)
def schema(): Schema = {
  val builder = AnalyzerBuilder()
  builder.withTokenizer("standard")
  builder.addTokenFilter("lowercase")
  val analyzer = builder.build
  builder.addTokenFilter("shingle", "minShingleSize", "2", "maxShingleSize", "2", "outputUnigrams", "false")
  val analyzer2g = builder.build
  val fieldTypes = Map(
    "word" -> FieldType(analyzer, true, true, true, true),
    "word2g" -> FieldType(analyzer2g, true, true, true, true)
  )
  val analyzerDefault = analyzer
  Schema(analyzerDefault, fieldTypes)
}

// create a language model index
val writer = IWriter(index, schema())

def addDocument(doc: String): Unit = {
  writer.write(Document(Set(
    Field("word", doc),
    Field("word2g", doc)
  )))
}

CORPUS.foreach(addDocument(_))

writer.close()
language_model.scala (2/2)

val reader = RawReader(index)

// P(apple|an) = C(an apple) / C(an)
val count_an_apple = reader.totalTermFreq("word2g", "an apple")
val count_an = reader.totalTermFreq("word", "an")
val prob_apple_an = count_an_apple.toFloat / count_an.toFloat

// P(orange|an) = C(an orange) / C(an)
val count_an_orange = reader.totalTermFreq("word2g", "an orange")
val prob_orange_an = count_an_orange.toFloat / count_an.toFloat

reader.close
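The 2-gram estimate P(word | prev) = C(prev word) / C(prev) can be checked by hand on this corpus: "an apple" occurs twice, "an orange" once, and "an" three times. The sketch below does the same arithmetic in plain Scala without Lucene; `tokenize`, `counts`, and `prob` are hypothetical helper names, and the tokenizer is an assumption that approximates the analyzer used in the talk.

```scala
val corpus = Seq("Alice ate an apple.", "Mike likes an orange.", "An apple is red.")

// Hypothetical stand-in for the standard tokenizer + lowercase filter
def tokenize(text: String): Seq[String] =
  text.toLowerCase.split("[^a-z]+").filter(_.nonEmpty).toSeq

// Count the occurrences of each item in a sequence
def counts(items: Seq[String]): Map[String, Int] =
  items.groupBy(identity).map { case (k, v) => k -> v.size }

val sentences = corpus.map(tokenize)
val unigramCount = counts(sentences.flatten)
// Bigrams are built per sentence so that none spans a sentence boundary
val bigramCount = counts(
  sentences.flatMap(_.sliding(2).filter(_.size == 2).map(_.mkString(" ")).toList))

// Maximum-likelihood estimate; assumes `prev` occurs in the corpus
def prob(word: String, prev: String): Double =
  bigramCount.getOrElse(s"$prev $word", 0).toDouble / unigramCount(prev)

val probAppleAn  = prob("apple", "an")  // 2 / 3
val probOrangeAn = prob("orange", "an") // 1 / 3
```

No smoothing is applied here, so an unseen bigram gets probability 0.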
Part-of-Speech Tagging
Our corpus for training:
Alice/NNP ate/VB an/AT apple/NNP ./.
Mike/NNP likes/VB an/AT orange/NNP ./.
An/AT apple/NNP is/VB red/JJ ./.
Tags:
NNP Proper noun, singular
VB Verb
AT Article
JJ Adjective
. period
Hidden Markov Model
[Figure: HMM with two aligned sequences]
• Series of Words (the observed sequence)
• Series of Part-of-Speech (the hidden state sequence)
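HMM state diagram
Initial state probabilities (tag of the first word of a sentence):
NNP 0.667
AT 0.333
VB 0.0
JJ 0.0
. 0.0
Transition probabilities (labels follow from the training corpus):
NNP → VB 0.6
NNP → . 0.4
VB → AT 0.667
VB → JJ 0.333
AT → NNP 1.0
JJ → . 1.0
Emission probabilities:
NNP: alice 0.2, apple 0.4, mike 0.2, orange 0.2
VB: ate 0.333, is 0.333, likes 0.333
AT: an 1.0
JJ: red 1.0
.: . 1.0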
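The probabilities in the state diagram are relative frequencies counted from the three tagged training sentences. The following plain-Scala sketch reproduces them; it is not NLP4L's HmmModelIndexer, and the helper `relFreq` is a hypothetical name.

```scala
val corpus = Seq(
  "Alice/NNP ate/VB an/AT apple/NNP ./.",
  "Mike/NNP likes/VB an/AT orange/NNP ./.",
  "An/AT apple/NNP is/VB red/JJ ./."
)

// Each sentence becomes a sequence of (word, tag) pairs
val tagged: Seq[Seq[(String, String)]] = corpus.map { line =>
  line.split("\\s+").toSeq.map { pair =>
    val i = pair.lastIndexOf('/')
    (pair.substring(0, i).toLowerCase, pair.substring(i + 1))
  }
}

// Turn (condition, outcome) pairs into relative frequencies per condition
def relFreq(pairs: Seq[(String, String)]): Map[String, Map[String, Double]] =
  pairs.groupBy(_._1).map { case (cond, ps) =>
    cond -> ps.groupBy(_._2).map { case (out, os) => out -> os.size.toDouble / ps.size }
  }

// Initial probabilities: tag of the first word of each sentence
val initial: Map[String, Double] =
  tagged.map(_.head._2).groupBy(identity)
    .map { case (tag, ts) => tag -> ts.size.toDouble / tagged.size }

// Transition probabilities: tag -> next tag, within a sentence
val transition = relFreq(tagged.flatMap(s => s.map(_._2).zip(s.map(_._2).tail)))

// Emission probabilities: tag -> word
val emission = relFreq(tagged.flatten.map { case (word, tag) => (tag, word) })
```

For example, `transition("NNP")` gives VB 0.6 and "." 0.4 because NNP is followed by a verb three times and by a period twice.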
hmm.scala

val index = "/tmp/index-hmm"

// text data (they are tagged!)
val CORPUS = Array(
  "Alice/NNP ate/VB an/AT apple/NNP ./.",
  "Mike/NNP likes/VB an/AT orange/NNP ./.",
  "An/AT apple/NNP is/VB red/JJ ./."
)

// write-open Lucene index
val indexer = HmmModelIndexer(index)

// tagged texts are indexed here
CORPUS.foreach{ text =>
  val pairs = text.split("\\s+")
  val doc = pairs.map{h => h.split("/")}.map{i => (i(0).toLowerCase(), i(1))}
  indexer.addDocument(doc)
}

indexer.close()

// execute part-of-speech tagging on an unknown text
// make an HmmModel from Lucene index
val model = HmmModel(index)
// get HmmTagger from HmmModel
val tagger = HmmTagger(model)

// use HmmTagger to annotate unknown sentence
tagger.tokens("alice likes an apple .")
NLP4L has hmm_postagger.scala in the examples directory. It uses the Brown corpus for HMM training.
Application: Transliteration
Transliteration
Transliteration is the process of transcribing letters or words from one alphabet into another to facilitate comprehension and pronunciation for non-native speakers.
Examples of transliteration from English to Japanese:
computer コンピューター
server サーバー
internet インターネット
mouse マウス
information インフォメーション
It helps improve recall
[Screenshot: search results] You search for the English word “mouse”, and “マウス” (= mouse) in Japanese is found and highlighted as well.
Training data in NLP4L
train_data/alpha_katakana.txt
academy,アカデミー
accent,アクセント
access,アクセス
accident,アクシデント
acrobat,アクロバット
action,アクション
adapter,アダプター
africa,アフリカ
airbus,エアバス
alaska,アラスカ
alcohol,アルコール
allergy,アレルギー
train_data/alpha_katakana_aligned.txt (each Katakana chunk is followed by its aligned alphabet chunk)
アaカcaデdeミーmy
アaクcセceンnトt
アaクcセceスss
アaクcシciデdeンnトt
アaクcロroバッbaトt
アaクcショtioンn
アaダdaプpターter
アaフfリriカca
エaアirバbuスs
アaラlaスsカka
アaルlコーcohoルl
アaレlleルrギーgy
Demo: Transliteration
nlp4l> :load examples/trans_katakana_alpha.scala
Input | Prediction | Right Answer
アルゴリズム | algorism | algorithm
プログラム | program | (OK)
ケミカル | chaemmical | chemical
ダイニング | dining | (OK)
コミッター | committer | (OK)
エントリー | entree | entry
Gathering loan words
[Figure: pipeline, steps ① to ⑥]
• crawl
• gather Katakana-Alphabet string pairs, e.g. (アルゴリズム, algorithm)
• transliterate the Katakana string: “アルゴリズム” → “algorism”
• calculate the edit distance between the transliteration (“algorism”) and the Alphabet string (“algorithm”)
• store the pair of strings in synonyms.txt if the edit distance is small enough
Got 1,800+ records of synonym knowledge from jawiki
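The edit-distance check in this pipeline can be sketched with the classic Levenshtein distance. The talk does not say which edit distance or threshold NLP4L uses, so both the algorithm choice and `maxDistance = 2` are assumptions made for illustration; `isSynonym` is a hypothetical helper name.

```scala
// Levenshtein distance using two rolling rows of the dynamic-programming table
def editDistance(a: String, b: String): Int = {
  var prev = Array.range(0, b.length + 1) // distances from "" to each prefix of b
  for (i <- 1 to a.length) {
    val curr = new Array[Int](b.length + 1)
    curr(0) = i                           // distance from a.take(i) to ""
    for (j <- 1 to b.length) {
      val cost = if (a(i - 1) == b(j - 1)) 0 else 1
      curr(j) = math.min(
        math.min(prev(j) + 1, curr(j - 1) + 1), // deletion, insertion
        prev(j - 1) + cost)                     // substitution or match
    }
    prev = curr
  }
  prev(b.length)
}

// Hypothetical threshold; the talk only says "small enough"
val maxDistance = 2

// Keep a (Katakana, Alphabet) pair if its transliteration is close to the Alphabet string
def isSynonym(transliterated: String, alphabet: String): Boolean =
  editDistance(transliterated, alphabet) <= maxDistance

val d = editDistance("algorism", "algorithm") // 2: substitute s -> t, insert h
```

With this threshold the imperfect prediction "algorism" is still close enough to "algorithm" for the pair (アルゴリズム, algorithm) to be stored.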
Future Plans
NLP4L Framework
• A pluggable framework that improves the search experience (mainly for Lucene-based search systems).
• Reference implementations of plug-ins and corpora are provided.
• Uses NLP/ML technologies to output models, dictionaries and indexes.
• Since NLP/ML are not perfect, an interface that lets users examine the output dictionaries themselves is provided as well.
[Figure: NLP4L Framework architecture]
• Data Source: Corpus (Text data, Lucene index), Query Log, Access Log
• Functions: TermExtractor, Transliteration, NEE, Classification, Document Vectors, Language Detection, Learning to Rank, Personalized Search
• Outputs: Dictionaries (with maintenance), Model files, Tagged Corpus, Document Vectors
• Dictionaries: Suggestion (auto complete), Did you mean?, synonyms.txt, userdic.txt, keyword attachment
• Surrounding systems: Solr, ES, Mahout, Spark
Keyword Attachment
• “Keyword attachment” is a general format that enables the following functions.
  • Learning to Rank
  • Personalized Search
  • Named Entity Extraction
  • Document Classification
[Figure: a keyword is attached to a Lucene doc (↑ Increase boost)]
Learning to Rank
• Program learns, from access log and other sources, that the score of document d for a query q should be larger than the normal score(q,d)
[Figure: Lucene doc d with attached keywords q, q, …]
https://en.wikipedia.org/wiki/Learning_to_rank
Personalized Search
• Program learns, from access log and other sources, that the score of document d for a query q by user u should be larger than the normal score(q,d)
• Since Lucene does not let you specify score(q,d,u), you have to specify score(qu,d) instead.
• Because the number of q-u combinations can be enormous, limit the data to high-order queries or divide fields by user.
[Figure: Lucene doc d1 with attached keywords q1u1, q2u2; Lucene doc d2 with attached keywords q2u1, q1u2]
Join and Code with Us!
Contact us at koji at apache dot org for the details.
Demo or Q & A
Thank you!