An Introduction to NLP4L - Natural Language Processing Tool for Apache Lucene: Presented by Tomoko...

OCTOBER 13-16, 2016 AUSTIN, TX

Transcript of An Introduction to NLP4L - Natural Language Processing Tool for Apache Lucene: Presented by Tomoko...

Page 1: An Introduction to NLP4L - Natural Language Processing Tool for Apache Lucene: Presented by Tomoko Uchida, Rondhuit Co. Ltd.

October 13-16, 2016 • Austin, TX


An Introduction to NLP4L: Natural Language Processing Tool for Apache Lucene

Tomoko Uchida, Consultant, Rondhuit Co. Ltd.


Who am I

• Tomoko Uchida (@moco_beta)

• Luke (Lucene Toolbox) collaborator (2015 ~)

• https://github.com/DmitryKey/luke

• The best-known tool for debugging and learning about Lucene/Solr and Elasticsearch indexes :-)


Agenda

• What’s NLP4L?

• How NLP improves search experience

• Calculate probabilities using ShingleFilter

• Transliteration (Application for HMM)

• NLP4L Framework (coming soon)


What’s NLP4L?

• GOAL

• Improve Lucene users’ search experience

• FEATURES

• Use of Lucene index as a Corpus Database

• Lucene API Front-end written in Scala

• NLP4L provides

• Preprocessors for existing ML tools

• ML algorithms and applications (e.g. Transliteration)


Evaluation Measures

[Figure: diagram of the target set and the result set, with regions tp, fp, fn, and tn]

precision = tp / (tp + fp)

recall = tp / (tp + fn)
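The two measures can be computed directly from the counts in the diagram. The following is a minimal Scala sketch; the object name, the function names, and the sample counts are made up for illustration and are not part of NLP4L.

```scala
// Illustrative only: names and sample counts are assumptions, not NLP4L APIs.
object EvaluationMeasures {
  // precision = tp / (tp + fp): the share of returned documents that are relevant.
  def precision(tp: Int, fp: Int): Double =
    if (tp + fp == 0) 0.0 else tp.toDouble / (tp + fp)

  // recall = tp / (tp + fn): the share of relevant documents that were returned.
  def recall(tp: Int, fn: Int): Double =
    if (tp + fn == 0) 0.0 else tp.toDouble / (tp + fn)
}
```

For example, a result set with 8 relevant and 2 irrelevant documents, where 8 more relevant documents were missed, has a precision of 8/10 and a recall of 8/16.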


Recall, Precision

[Figure: diagram of the target set and the result set, with regions tp, fp, fn, and tn]


Solution

• n-gram, synonym dictionary, etc. (e.g. Transliteration) → recall

• facet (filter query) (e.g. Named Entity Extraction) → precision

• Ranking Tuning → recall, precision

Gradual precision improvement

[Figure: diagram of the target set and the result set at each step]

• q=watch

• filter by “Gender=Men’s”

• filter by “Price=100-150”

Structured Documents

ID | product | price | gender
1 | CURREN New Men’s Date Stainless Steel Military Sport Quartz Wrist Watch | 8.92 | Men’s
2 | Quiksilver The Gamer Watch | 87.99 | Men’s

Unstructured Documents

ID | article
1 | David Cameron says he has a mandate to pursue EU reform following the Conservatives' general election victory. The Prime Minister will be hoping his majority government will give him extra leverage in Brussels.
2 | He wants to renegotiate the terms of the UK's membership ahead of a referendum by the end of 2017. He has said he will campaign for Britain to remain in the EU if he gets the reforms he wants.

Make them Structured

ID | article | person | org | loc
1 | David Cameron says he has a mandate to pursue EU reform following the Conservatives' general election victory. The Prime Minister will be hoping his majority government will give him extra leverage in Brussels. | David Cameron | EU | Brussels
2 | He wants to renegotiate the terms of the UK's membership ahead of a referendum by the end of 2017. He has said he will campaign for Britain to remain in the EU if he gets the reforms he wants. | | EU | UK, Britain

NEE[1] extracts interesting words.

[1] Named Entity Extraction


Language Model

• LM represents the fluency of language

• The N-gram model is the most widely used LM

• Calculation example for 2-gram

P(apple | an) = totalTermFreq("word2g", "an apple") / totalTermFreq("word", "an")
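The 2-gram calculation divides the frequency of the word pair by the frequency of its first word. The sketch below reproduces that ratio with plain Scala collections instead of a Lucene index; the two maps are stand-ins for totalTermFreq lookups on a word field and a shingled (2-gram) field, and the object name, function names, and tiny corpus are assumptions chosen for illustration.

```scala
// A sketch with plain Scala collections; in NLP4L the counts would come from a Lucene index.
object BigramModel {
  // Tiny lowercased corpus, used here only as sample data.
  val corpus: Seq[Seq[String]] = Seq(
    Seq("alice", "ate", "an", "apple"),
    Seq("mike", "likes", "an", "orange"),
    Seq("an", "apple", "is", "red")
  )

  // Stand-in for totalTermFreq("word", w): occurrences of each single word.
  val unigramFreq: Map[String, Int] =
    corpus.flatten.groupBy(identity).map { case (w, ws) => w -> ws.size }

  // Stand-in for totalTermFreq("word2g", "w1 w2"): occurrences of each adjacent word pair.
  val bigramFreq: Map[String, Int] =
    corpus
      .flatMap(_.sliding(2).collect { case Seq(a, b) => s"$a $b" })
      .groupBy(identity)
      .map { case (g, gs) => g -> gs.size }

  // Maximum-likelihood estimate P(w2 | w1) = count(w1 w2) / count(w1).
  def prob(w1: String, w2: String): Double = {
    val denom = unigramFreq.getOrElse(w1, 0)
    if (denom == 0) 0.0 else bigramFreq.getOrElse(s"$w1 $w2", 0).toDouble / denom
  }
}
```

With this sample data, "an" occurs three times and "an apple" twice, so `BigramModel.prob("an", "apple")` evaluates to 2/3. No smoothing is applied, so unseen pairs get probability zero.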

Part-of-Speech Tagging

Our Corpus for training

Alice/NNP ate/VB an/AT apple/NNP ./.
Mike/NNP likes/VB an/AT orange/NNP ./.
An/AT apple/NNP is/VB red/JJ ./.

NNP | Proper noun, singular
VB | Verb
AT | Article
JJ | Adjective
. | period

Hidden Markov Model

[Figure: a series of words and the corresponding series of parts of speech]

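HMM state diagram

[Figure: HMM state diagram estimated from the training corpus]

Initial state probabilities: NNP 0.667, AT 0.333, VB 0.0, JJ 0.0, . 0.0

Transition probabilities: NNP → VB 0.6, NNP → . 0.4, VB → AT 0.667, VB → JJ 0.333, AT → NNP 1.0, JJ → . 1.0

Emission probabilities:

• NNP: alice 0.2, apple 0.4, mike 0.2, orange 0.2

• VB: ate 0.333, is 0.333, likes 0.333

• AT: an 1.0

• JJ: red 1.0

• .: . 1.0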
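The probabilities in the state diagram are relative frequencies counted from the three-sentence tagged corpus. The sketch below shows one way to derive them with plain Scala collections; it is not the NLP4L implementation, and the object and function names are hypothetical.

```scala
// Illustrative sketch: counts HMM parameters from the tagged corpus by relative frequency.
object HmmEstimation {
  // The tagged training corpus from the part-of-speech slide, one sentence per entry.
  val corpus: Seq[String] = Seq(
    "Alice/NNP ate/VB an/AT apple/NNP ./.",
    "Mike/NNP likes/VB an/AT orange/NNP ./.",
    "An/AT apple/NNP is/VB red/JJ ./."
  )

  // Split "word/TAG" at the last slash so that "./." parses as (".", ".").
  def parse(sentence: String): Seq[(String, String)] =
    sentence.split(" ").toSeq.map { tok =>
      val i = tok.lastIndexOf('/')
      (tok.substring(0, i).toLowerCase, tok.substring(i + 1))
    }

  // Turn (condition, outcome) events into relative frequencies per condition.
  def normalize(events: Seq[(String, String)]): Map[String, Map[String, Double]] =
    events.groupBy(_._1).map { case (cond, es) =>
      cond -> es.groupBy(_._2).map { case (out, os) => out -> os.size.toDouble / es.size }
    }

  val tagged: Seq[Seq[(String, String)]] = corpus.map(parse)

  // Initial probabilities: which tag starts a sentence.
  val initial: Map[String, Double] =
    tagged.map(_.head._2).groupBy(identity).map { case (t, ts) =>
      t -> ts.size.toDouble / tagged.size
    }

  // Transition probabilities: tag -> next tag within a sentence.
  val transition: Map[String, Map[String, Double]] =
    normalize(tagged.flatMap(s => s.map(_._2).sliding(2).collect { case Seq(a, b) => (a, b) }))

  // Emission probabilities: tag -> word.
  val emission: Map[String, Map[String, Double]] =
    normalize(tagged.flatten.map { case (word, tag) => (tag, word) })
}
```

Tags that never start a sentence are simply absent from `initial` rather than listed with probability 0.0, and the sentence-final "." tag has no outgoing transitions in this sketch.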


Transliteration

Transliteration is the process of transcribing letters or words from one alphabet into another to facilitate comprehension and pronunciation for non-native speakers.

Transliteration

Examples of transliteration from English to Japanese:

computer | コンピューター
server | サーバー
internet | インターネット
mouse | マウス
information | インフォメーション

It helps improve recall

• You search for the English word “mouse”

• but you get “マウス” (= mouse) highlighted in Japanese

Training data in NLP4L

train_data/alpha_katakana.txt

academy,アカデミー
accent,アクセント
access,アクセス
accident,アクシデント
acrobat,アクロバット
action,アクション
adapter,アダプター
africa,アフリカ
airbus,エアバス
alaska,アラスカ
alcohol,アルコール
allergy,アレルギー

train_data/alpha_katakana_aligned.txt

アaカcaデdeミーmy
アaクcセceンnトt
アaクcセceスss
アaクcシciデdeンnトt
アaクcロroバッbaトt
アaクcショtioンn
アaダdaプpターter
アaフfリriカca
エaアirバbuスs
アaラlaスsカka
アaルlコーcohoルl
アaレlleルrギーgy

Demo: Transliteration

nlp4l> :load examples/trans_katakana_alpha.scala

    import scala.io.Source
    import scala.util.matching.Regex

    val indexer = new HmmModelIndexer(index)
    val file = Source.fromFile("train_data/alpha_katakana_aligned.txt", "UTF-8")

    // a Katakana run, followed by a Latin-letter run, followed by the rest of the line
    val pattern: Regex = """([\u30A0-\u30FF]+)([a-zA-Z]+)(.*)""".r

    // split an aligned line into (katakana, alphabet) pairs
    def align(result: List[(String, String)], str: String): List[(String, String)] = {
      str match {
        case pattern(a, b, c) => align(result :+ ((a, b)), c)
        case _ => result
      }
    }

    // create hmm model index
    file.getLines.foreach { line: String =>
      val doc = align(List.empty[(String, String)], line)
      indexer.addDocument(doc)
    }
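To see what the align step of the demo produces without an NLP4L installation, the function can be run on its own. The sketch below is a standalone variant with no indexer; the object name `AlignDemo` is hypothetical, and the regular expression is written as an ordinary string with escaped backslashes, which should match the same text as the demo's pattern.

```scala
import scala.util.matching.Regex

// Standalone variant of the demo's align function (no HmmModelIndexer involved).
object AlignDemo {
  // A Katakana run, then a Latin-letter run, then the rest of the line.
  val pattern: Regex = "([\\u30A0-\\u30FF]+)([a-zA-Z]+)(.*)".r

  // Peel off one (katakana, alphabet) pair at a time until the line no longer matches.
  @annotation.tailrec
  def align(result: List[(String, String)], str: String): List[(String, String)] =
    str match {
      case pattern(kana, alpha, rest) => align(result :+ ((kana, alpha)), rest)
      case _                          => result
    }

  // One line in the format of train_data/alpha_katakana_aligned.txt ("academy").
  val pairs: List[(String, String)] = align(Nil, "アaカcaデdeミーmy")
}
```

For the "academy" line, `AlignDemo.pairs` is the list of pairs (ア, a), (カ, ca), (デ, de), (ミー, my). Each pair is a Katakana chunk with its aligned Latin chunk, which is the sequence the demo passes to `indexer.addDocument`.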

Demo: Transliteration

Input | Prediction | Right Answer
アルゴリズム | algorism | algorithm
プログラム | program | (OK)
ケミカル | chaemmical | chemical
ダイニング | dining | (OK)
コミッター | committer | (OK)
エントリー | entree | entry

Gathering loan words

[Figure: pipeline for gathering loan words]

• Crawl, gathering Katakana-Alphabet string pairs (e.g. アルゴリズム, algorithm)

• Transliterate the Katakana string (“アルゴリズム” → “algorism”)

• Calculate the edit distance between the transliterated string and the Alphabet string

• Store the pair of strings in synonyms.txt if the edit distance is small enough
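The filtering step compares the transliterated string with the crawled English string by edit distance. The sketch below uses the standard Levenshtein distance; the talk does not say which distance variant or threshold NLP4L uses, so both the algorithm choice and the `maxDistance` default are assumptions, and the names are hypothetical.

```scala
// Illustrative sketch of the "small enough edit distance" check.
object LoanWordFilter {
  // Levenshtein distance with a rolling row (insert, delete, substitute each cost 1).
  def editDistance(a: String, b: String): Int = {
    var prev = Array.range(0, b.length + 1)
    for (i <- 1 to a.length) {
      val curr = new Array[Int](b.length + 1)
      curr(0) = i
      for (j <- 1 to b.length) {
        val cost = if (a(i - 1) == b(j - 1)) 0 else 1
        curr(j) = math.min(math.min(curr(j - 1) + 1, prev(j) + 1), prev(j - 1) + cost)
      }
      prev = curr
    }
    prev(b.length)
  }

  // Keep a crawled pair only if the transliteration is close to the English string.
  // maxDistance is a hypothetical threshold; the talk only says "small enough".
  def accept(transliterated: String, english: String, maxDistance: Int = 2): Boolean =
    editDistance(transliterated, english) <= maxDistance
}
```

For the pair on the slide, "algorism" and "algorithm" are two edits apart (one substitution and one insertion), so the pair would be kept under this assumed threshold.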


NLP4L Framework

• A pluggable framework that improves the search experience (mainly for Lucene-based search systems).

• Reference implementations of plug-ins and corpora are provided.

• Uses NLP/ML technologies to output models, dictionaries, and indexes.

• Since NLP/ML are not perfect, an interface that lets users examine the output dictionaries themselves is provided as well.


[Figure: NLP4L Framework architecture, with the following labels]

• Solr, ES

• Mahout, Spark

• Data Source: Corpus (text data, Lucene index), Query Log, Access Log

• Dictionaries (maintenance): Suggestion (auto complete), Did you mean?, synonyms.txt, userdic.txt, keyword attachment

• Model files, Tagged Corpus, Document Vectors

• TermExtractor, Transliteration, NEE, Classification, Document Vectors, Language Detection

• Learning to Rank, Personalized Search


Example: Keyword Attachment

UI prototype for NLP4L Framework (Lucia): https://github.com/NLP4L/lucia

[Screenshots of the UI; the annotations on each screen are listed below]

• Information about the associated Solr collection (core)

• NLP/ML task (processor) chain, described in HOCON (Human-Optimized Config Object Notation)

• Keywords extracted from the whole set of documents, e.g. Named Entities by OpenNLP

• Information about the associated Solr document (unique key, etc.)

• Keywords extracted from this document

• Solr field name for each keyword

• Check the keywords and remove wrong or inappropriate entries

• Synch (attach) all keywords to the Solr documents (by the partial update command)

• Solr document (before keywords are attached)

• Solr document (after keywords are attached)

• If you delete keywords that have already been attached to Solr documents, they will also be removed from the Solr index the next time the “synch” action is executed.
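A partial update in Solr is a document that names the unique key and gives a modifier such as "set" for the field to change. The sketch below only builds such a JSON payload as a string; it is not how Lucia is implemented, and the unique key name `id`, the field name `keyword_ss` in the usage note, and the object and function names are assumptions.

```scala
// Illustrative sketch: builds the JSON body of a Solr atomic ("partial") update.
object KeywordSync {
  // Minimal JSON string escaping for backslashes and quotes (enough for this sketch).
  private def quote(s: String): String =
    "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\""

  // Replace the keyword field of one document with the given keywords.
  // "id" is assumed to be the unique key; use the key and field of your own schema.
  def atomicUpdate(docId: String, field: String, keywords: Seq[String]): String = {
    val values = keywords.map(quote).mkString("[", ",", "]")
    s"""[{"id":${quote(docId)},${quote(field)}:{"set":$values}}]"""
  }
}
```

Calling `KeywordSync.atomicUpdate("1", "keyword_ss", Seq("David Cameron", "EU"))` yields a one-element JSON array that could be posted to the collection's update handler. Because "set" replaces the whole field value, sending the reviewed keyword list again after an entry is deleted would also drop that keyword from the document.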

Keyword Attachment Application

[Figure: a Lucene doc with an attached keyword; ↑ Increase boost]

• “Keyword attachment” is a general format that enables the following functions.

• Learning to Rank

• Personalized Search

• Named Entity Extraction

• Document Classification

Before Learning to Rank

[Figure: diagram of the target set and the result set, with rank labels 1, 2, 3, … and 50, 100, 500, …]

After Learning to Rank

[Figure: diagram of the target set and the result set, with rank labels 1, 2, 3, … and 50, 100, 500, …]

Learning to Rank

• The program learns, from the access log and other sources, that the score of document d for a query q should be larger than the normal score(q,d)

[Figure: Lucene doc d with attached queries q, q, …]

https://en.wikipedia.org/wiki/Learning_to_rank

Personalized Search

[Figure: two diagrams of the target set and the result set for q=apple, one labeled “computer …” and one labeled “fruit …”, with rank labels 1, 2, 3, … and 50, 100, 500, …]

[Figure: Lucene doc d1 with attached q1u1, q2u2; Lucene doc d2 with attached q2u1, q1u2]

• The program learns, from the access log and other sources, that the score of document d for a query q by user u should be larger than the normal score(q,d)

• Since Lucene does not let you specify score(q,d,u), you have to specify score(qu,d) instead.

• Because the number of q-u combinations can be enormous, limit the data to high-order queries or divide fields by user.
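One way to read score(qu,d) is that each query-user combination becomes a single token attached to the documents that user clicked for that query. The sketch below illustrates that reading with plain Scala collections; the log schema, the underscore-joined token format, and the minimum-count threshold used to limit the number of combinations are all assumptions, not NLP4L behavior.

```scala
// Illustrative sketch: derive combined query-user tokens to attach to documents.
object PersonalizedKeys {
  // One access-log entry: user u issued query q and clicked document d (hypothetical schema).
  final case class Click(user: String, query: String, docId: String)

  // Combine query and user into one token so it can be matched as score(qu, d).
  def key(query: String, user: String): String = s"${query}_$user"

  // For each document, collect the qu tokens to attach, keeping only combinations
  // seen at least minCount times so that the number of tokens stays bounded.
  def keysPerDoc(log: Seq[Click], minCount: Int): Map[String, Set[String]] =
    log
      .groupBy(c => (c.docId, key(c.query, c.user)))
      .toSeq
      .collect { case ((doc, k), hits) if hits.size >= minCount => (doc, k) }
      .groupBy(_._1)
      .map { case (doc, pairs) => doc -> pairs.map(_._2).toSet }
}
```

At query time the same token would be built from the current query and user and searched against the attached field, so that documents clicked before by that user for that query receive a higher score.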

Example: Generating Synonyms (loanwords)

• Execute a job that generates pairs of Katakana and corresponding English words from the corpus

• Adjust the auto-generated pairs (candidate synonyms) via the web UI

• Exported pairs can be used in SynonymFilter

synonyms_loadwords_ja.txt

acacia,アカシア
academy,アカデミー
acatenango,アカテナンゴ
access,アクセス
accident,アクシデント
action,アクション
active,アクティブ
activision,アクティビジョン
acton,アクトン
actor,アクター
……

Join and Code with Us!

https://github.com/NLP4L

Contact us at koji at apache dot org for the details.


Thank you!