An Introduction to NLP4L (Scala by the Bay / Big Data Scala 2015)

An Introduction to NLP4L

Natural Language Processing tool for Apache Lucene

Koji Sekiguchi @kojisays Founder & CEO, RONDHUIT

Agenda

• What’s NLP4L?

• How NLP improves search experience

• Count number of words in Lucene index

• Application: Transliteration

• Future Plans


What’s NLP4L?

• GOAL

• Improve Lucene users’ search experience

• FEATURES

• Use of Lucene index as a Corpus Database

• Lucene API Front-end written in Scala

• NLP4L provides

• Preprocessors for existing ML tools

• Provision of ML algorithms and Applications (e.g. Transliteration)


What’s Lucene?

Documents:

1: Alice ate an apple.
2: Mike likes an orange.
3: An apple is red.

indexing → (inverted) index:

alice   1
an      1, 2, 3
apple   1, 3
ate     1
is      3
likes   2
mike    2
orange  2
red     3

searching: “apple” → looked up in the index → documents 1, 3

Lucene is a high-performance, full-featured text search engine library written entirely in Java.
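As a rough illustration of the inverted index idea, here is a minimal sketch in plain Scala that does not use Lucene at all. The object name `InvertedIndexSketch` and the regex-based `tokenize` helper are hypothetical stand-ins for a real Lucene analyzer, chosen only so that the three example documents produce the same term list as on the slide.

```scala
object InvertedIndexSketch {
  // The three example documents, keyed by document ID.
  val docs: Map[Int, String] = Map(
    1 -> "Alice ate an apple.",
    2 -> "Mike likes an orange.",
    3 -> "An apple is red."
  )

  // Hypothetical stand-in for an analyzer: lowercase, then split on non-letters.
  def tokenize(text: String): List[String] =
    text.toLowerCase.split("[^a-z]+").filter(_.nonEmpty).toList

  // term -> sorted IDs of the documents that contain the term
  val index: Map[String, List[Int]] =
    docs.toList
      .flatMap { case (id, text) => tokenize(text).distinct.map(term => (term, id)) }
      .groupBy(_._1)
      .map { case (term, pairs) => (term, pairs.map(_._2).sorted) }

  // Searching is a lookup in the index, not a scan over every document.
  def search(term: String): List[Int] = index.getOrElse(term.toLowerCase, Nil)
}
```

With this sketch, `InvertedIndexSketch.search("apple")` returns `List(1, 3)` and the index holds nine distinct terms, matching the listing on the slide.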


Evaluation Measures

[Figure: Venn diagram of two overlapping sets, “target” and “result”, with regions tp, fp, fn and tn]

• positive: inside “result”; negative: outside “result”

• tp (true positive): in both “target” and “result”

• fp (false positive): in “result” but not in “target”

• fn (false negative): in “target” but not in “result”

• tn (true negative): in neither “target” nor “result”

precision = tp / (tp + fp)

recall = tp / (tp + fn)
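The two formulas can be tried out on sets of document IDs. The following is a small sketch, not part of NLP4L; the object name `EvaluationSketch` and the convention of returning 0.0 for an empty set are assumptions made for illustration.

```scala
object EvaluationSketch {
  // tp = documents that are both wanted (target) and returned (result)
  private def truePositives(target: Set[Int], result: Set[Int]): Int =
    (target & result).size

  // precision = tp / (tp + fp); tp + fp is the size of the result set
  def precision(target: Set[Int], result: Set[Int]): Double =
    if (result.isEmpty) 0.0 // assumed convention for an empty result
    else truePositives(target, result).toDouble / result.size

  // recall = tp / (tp + fn); tp + fn is the size of the target set
  def recall(target: Set[Int], result: Set[Int]): Double =
    if (target.isEmpty) 0.0 // assumed convention for an empty target
    else truePositives(target, result).toDouble / target.size
}
```

For example, with target `Set(1, 2, 3, 4)` and result `Set(3, 4, 5)` there are two true positives, one false positive and two false negatives, so precision is 2/3 and recall is 0.5.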

Recall, Precision

[Figure: the “target”/“result” diagram with regions tp, fp, fn and tn]

Solution

• n-gram, synonym dictionary, etc. (e.g. Transliteration): raise recall, at some cost in precision

• facet (filter query) (e.g. Named Entity Extraction): raises precision, at some cost in recall

• Ranking Tuning

gradual precision improvement

q=watch

→ filter by “Gender=Men’s”

→ filter by “Price=100-150”


Structured Documents

ID | product | price | gender
1 | CURREN New Men’s Date Stainless Steel Military Sport Quartz Wrist Watch | 8.92 | Men’s
2 | Suiksilver The Gamer Watch | 87.99 | Men’s

Unstructured Documents

ID | article
1 | David Cameron says he has a mandate to pursue EU reform following the Conservatives' general election victory. The Prime Minister will be hoping his majority government will give him extra leverage in Brussels.
2 | He wants to renegotiate the terms of the UK's membership ahead of a referendum by the end of 2017. He has said he will campaign for Britain to remain in the EU if he gets the reforms he wants.

Make them Structured

ID | article | person | org | loc
1 | David Cameron says he has a mandate to pursue EU reform following the Conservatives' general election victory. The Prime Minister will be hoping his majority government will give him extra leverage in Brussels. | David Cameron | EU | Brussels
2 | He wants to renegotiate the terms of the UK's membership ahead of a referendum by the end of 2017. He has said he will campaign for Britain to remain in the EU if he gets the reforms he wants. | | EU | UK, Britain

NEE (Named Entity Extraction) extracts interesting words.

Manual Tagging using brat


A small Corpus

// text data
val CORPUS = Array(
  "Alice ate an apple.",
  "Mike likes an orange.",
  "An apple is red."
)

// Lucene index directory
val index = "/tmp/index-simple"

// schema definition
def schema(): Schema = {
  val builder = AnalyzerBuilder()
  builder.withTokenizer("standard")
  builder.addTokenFilter("lowercase")
  val analyzer = builder.build
  val fieldTypes = Map(
    "text" -> FieldType(analyzer, true, true, true, true)
  )
  val analyzerDefault = analyzer
  Schema(analyzerDefault, fieldTypes)
}

// create Lucene document
def doc(text: String): Document = {
  Document(Set(
    Field("text", text)
  ))
}

// open a writer
val writer = IWriter(index, schema())

// write documents
CORPUS.foreach(text => writer.write(doc(text)))

// close writer
writer.close

index_simple.scala

For the code snippets used in this talk, see: https://github.com/NLP4L/meetups/tree/master/20150818

Getting word counts

alice   1
an      1, 2, 3
apple   1, 3
ate     1
is      3
likes   2
mike    2
orange  2
red     3

val reader = RawReader(index)

reader.sumTotalTermFreq("text")      // -> 12
reader.field("text").get.terms.size  // -> 9
reader.totalTermFreq("text", "an")   // -> 3

reader.close

getting_word_counts.scala
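To see where the three numbers come from, the same counts can be reproduced without an index. This is a sketch in plain Scala, not the NLP4L implementation; `WordCountSketch` and its regex-based `tokenize` are hypothetical and only approximate the "standard" tokenizer plus "lowercase" filter for this tiny corpus.

```scala
object WordCountSketch {
  val CORPUS = Array(
    "Alice ate an apple.",
    "Mike likes an orange.",
    "An apple is red."
  )

  // Assumption: lowercase + split on non-letters is close enough to the
  // analyzer used in index_simple.scala for these three sentences.
  def tokenize(text: String): List[String] =
    text.toLowerCase.split("[^a-z]+").filter(_.nonEmpty).toList

  val tokens: List[String] = CORPUS.toList.flatMap(tokenize)

  // term -> number of occurrences in the whole corpus
  val termFreq: Map[String, Int] =
    tokens.groupBy(identity).map { case (term, occ) => (term, occ.size) }

  val sumTotalTermFreq: Int = tokens.size   // all word occurrences
  val uniqueTerms: Int = termFreq.size      // distinct words
  def totalTermFreq(term: String): Int = termFreq.getOrElse(term, 0)
}
```

Each sentence contributes four tokens, so `sumTotalTermFreq` is 12, there are 9 distinct terms, and `totalTermFreq("an")` is 3, in line with the values in the comments of getting_word_counts.scala.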


Getting top terms

val reader = RawReader(index)

reader.topTermsByDocFreq("text")
reader.topTermsByTotalTermFreq("text")
// ->
// (term, docFreq, totalTermFreq)
// (an,3,3)
// (apple,2,2)
// (likes,1,1)
// (is,1,1)
// (orange,1,1)
// (mike,1,1)
// (ate,1,1)
// (red,1,1)
// (alice,1,1)

reader.close
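The difference between docFreq (how many documents contain a term) and totalTermFreq (how often it occurs overall) can be sketched in plain Scala. This is an illustration only, not how NLP4L or Lucene compute it; `TopTermsSketch` and `tokenize` are hypothetical, and ties are broken alphabetically here, which need not match the order Lucene returns.

```scala
object TopTermsSketch {
  val CORPUS = Array(
    "Alice ate an apple.",
    "Mike likes an orange.",
    "An apple is red."
  )

  // Hypothetical stand-in for the analyzer: lowercase, split on non-letters.
  def tokenize(text: String): List[String] =
    text.toLowerCase.split("[^a-z]+").filter(_.nonEmpty).toList

  val docs: List[List[String]] = CORPUS.toList.map(tokenize)

  // (term, docFreq, totalTermFreq) for every distinct term
  val stats: List[(String, Int, Int)] =
    docs.flatten.distinct.map { term =>
      val docFreq = docs.count(_.contains(term))
      val totalTermFreq = docs.map(_.count(_ == term)).sum
      (term, docFreq, totalTermFreq)
    }

  // Highest docFreq first; alphabetical order among equal docFreq values.
  def topTermsByDocFreq(n: Int): List[(String, Int, Int)] =
    stats.sortBy { case (term, docFreq, _) => (-docFreq, term) }.take(n)
}
```

`TopTermsSketch.topTermsByDocFreq(2)` yields `List((an,3,3), (apple,2,2))`. In this corpus no term repeats within a document, so docFreq and totalTermFreq coincide for every term.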

What’s ShingleFilter?

• ShingleFilter = Word n-gram TokenFilter

“Lucene is a popular software”

→ WhitespaceTokenizer → Lucene/is/a/popular/software

→ ShingleFilter (N=2) → Lucene is/is a/a popular/popular software
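The effect of a word n-gram filter can be imitated with a sliding window over the token list. This is a plain Scala sketch, not Lucene's ShingleFilter itself; `ShingleSketch`, `whitespaceTokenize` and `shingles` are hypothetical names.

```scala
object ShingleSketch {
  // Split on runs of whitespace, as a whitespace tokenizer would.
  def whitespaceTokenize(text: String): List[String] =
    text.trim.split("\\s+").filter(_.nonEmpty).toList

  // Join every window of n consecutive tokens with a single space.
  // Windows shorter than n (input with fewer than n tokens) are dropped.
  def shingles(tokens: List[String], n: Int): List[String] =
    tokens.sliding(n).filter(_.size == n).map(_.mkString(" ")).toList
}
```

`ShingleSketch.shingles(ShingleSketch.whitespaceTokenize("Lucene is a popular software"), 2)` gives `List("Lucene is", "is a", "a popular", "popular software")`, the same four 2-grams as on the slide.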

Language Model

• A language model (LM) represents the fluency of language

• The n-gram model is the most widely used LM

• Calculation example for 2-gram

val index = "/tmp/index-lm"

// schema definition
def schema(): Schema = {
  val builder = AnalyzerBuilder()
  builder.withTokenizer("standard")
  builder.addTokenFilter("lowercase")
  val analyzer = builder.build
  builder.addTokenFilter("shingle", "minShingleSize", "2", "maxShingleSize", "2", "outputUnigrams", "false")
  val analyzer2g = builder.build
  val fieldTypes = Map(
    "word" -> FieldType(analyzer, true, true, true, true),
    "word2g" -> FieldType(analyzer2g, true, true, true, true)
  )
  val analyzerDefault = analyzer
  Schema(analyzerDefault, fieldTypes)
}

// create a language model index
val writer = IWriter(index, schema())

def addDocument(doc: String): Unit = {
  writer.write(Document(Set(
    Field("word", doc),
    Field("word2g", doc)
  )))
}

CORPUS.foreach(addDocument(_))

writer.close()

language_model.scala 1/2


val reader = RawReader(index)

// P(apple|an) = C(an apple) / C(an)
val count_an_apple = reader.totalTermFreq("word2g", "an apple")
val count_an = reader.totalTermFreq("word", "an")
val prob_apple_an = count_an_apple.toFloat / count_an.toFloat

// P(orange|an) = C(an orange) / C(an)
val count_an_orange = reader.totalTermFreq("word2g", "an orange")
val prob_orange_an = count_an_orange.toFloat / count_an.toFloat

reader.close

language_model.scala 2/2
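The 2-gram calculation, P(w | prev) = C(prev w) / C(prev), can be checked by hand on the small corpus. Below is a sketch in plain Scala that counts unigrams and bigrams directly instead of reading them from a Lucene index; `BigramSketch`, `tokenize` and `prob` are hypothetical names, and no smoothing is applied.

```scala
object BigramSketch {
  val CORPUS = Array(
    "Alice ate an apple.",
    "Mike likes an orange.",
    "An apple is red."
  )

  // Hypothetical stand-in for the analyzer: lowercase, split on non-letters.
  def tokenize(text: String): List[String] =
    text.toLowerCase.split("[^a-z]+").filter(_.nonEmpty).toList

  val sentences: List[List[String]] = CORPUS.toList.map(tokenize)

  private def counts(items: List[String]): Map[String, Int] =
    items.groupBy(identity).map { case (item, occ) => (item, occ.size) }

  // C(w): occurrences of each word
  val unigramCount: Map[String, Int] = counts(sentences.flatten)

  // C(prev w): occurrences of each word pair; pairs never cross sentence boundaries
  val bigramCount: Map[String, Int] = counts(
    sentences.flatMap(s => s.sliding(2).filter(_.size == 2).map(_.mkString(" ")).toList)
  )

  // P(word | prev) = C(prev word) / C(prev), or 0.0 if prev was never seen
  def prob(word: String, prev: String): Double = {
    val cPrev = unigramCount.getOrElse(prev, 0)
    if (cPrev == 0) 0.0
    else bigramCount.getOrElse(prev + " " + word, 0).toDouble / cPrev
  }
}
```

Here "an" occurs three times, "an apple" twice and "an orange" once, so `prob("apple", "an")` is 2/3 and `prob("orange", "an")` is 1/3.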

Part-of-Speech Tagging

Alice/NNP ate/VB an/AT apple/NNP ./.
Mike/NNP likes/VB an/AT orange/NNP ./.
An/AT apple/NNP is/VB red/JJ ./.

NNP  Proper noun, singular
VB   Verb
AT   Article
JJ   Adjective
.    period

Hidden Markov Model

[Figure callouts: “Our Corpus for training”, “Series of Words”, “Series of Part-of-Speech”]

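HMM state diagram

Initial probabilities:
NNP 0.667, AT 0.333, VB 0.0, JJ 0.0, . 0.0

Transition probabilities:
NNP → VB 0.6, NNP → . 0.4
VB → AT 0.667, VB → JJ 0.333
AT → NNP 1.0
JJ → . 1.0

Emission probabilities:
NNP: alice 0.2, apple 0.4, mike 0.2, orange 0.2
VB: ate 0.333, is 0.333, likes 0.333
AT: an 1.0
JJ: red 1.0
.: . 1.0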
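The numbers in the state diagram are relative frequencies counted from the three tagged training sentences. The sketch below recomputes them in plain Scala; it is not the NLP4L HmmModelIndexer, and the names `HmmEstimateSketch` and `relFreq` as well as the `"<s>"` start marker are hypothetical.

```scala
object HmmEstimateSketch {
  val CORPUS = Array(
    "Alice/NNP ate/VB an/AT apple/NNP ./.",
    "Mike/NNP likes/VB an/AT orange/NNP ./.",
    "An/AT apple/NNP is/VB red/JJ ./."
  )

  // Each sentence becomes a list of (lowercased word, tag) pairs.
  val sentences: List[List[(String, String)]] = CORPUS.toList.map { s =>
    s.split("\\s+").toList.map { pair =>
      val i = pair.lastIndexOf('/')
      (pair.substring(0, i).toLowerCase, pair.substring(i + 1))
    }
  }

  // Turns (condition, outcome) events into P(outcome | condition).
  def relFreq(events: List[(String, String)]): Map[String, Map[String, Double]] =
    events.groupBy(_._1).map { case (cond, evs) =>
      (cond, evs.groupBy(_._2).map { case (out, xs) => (out, xs.size.toDouble / evs.size) })
    }

  // P(first tag of a sentence)
  val initial: Map[String, Double] =
    relFreq(sentences.map(s => ("<s>", s.head._2)))("<s>")

  // P(next tag | current tag), counted within sentences only
  val transition: Map[String, Map[String, Double]] = relFreq(
    sentences.flatMap(s => s.map(_._2).sliding(2).collect { case Seq(a, b) => (a, b) }.toList)
  )

  // P(word | tag)
  val emission: Map[String, Map[String, Double]] =
    relFreq(sentences.flatten.map { case (word, tag) => (tag, word) })
}
```

Two of the three sentences start with NNP, so `initial("NNP")` is 2/3; three of the five NNP tokens are followed by VB, so `transition("NNP")("VB")` is 0.6; and "apple" accounts for two of the five NNP tokens, so `emission("NNP")("apple")` is 0.4.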

// Lucene index directory
val index = "/tmp/index-hmm"

// text data (they are tagged!)
val CORPUS = Array(
  "Alice/NNP ate/VB an/AT apple/NNP ./.",
  "Mike/NNP likes/VB an/AT orange/NNP ./.",
  "An/AT apple/NNP is/VB red/JJ ./."
)

// write-open Lucene index
val indexer = HmmModelIndexer(index)

// tagged texts are indexed here
CORPUS.foreach { text =>
  val pairs = text.split("\\s+")
  val doc = pairs.map { h => h.split("/") }.map { i => (i(0).toLowerCase(), i(1)) }
  indexer.addDocument(doc)
}

indexer.close()

// execute part-of-speech tagging on an unknown text
// make an HmmModel from Lucene index
val model = HmmModel(index)
// get HmmTagger from HmmModel
val tagger = HmmTagger(model)

// use HmmTagger to annotate unknown sentence
tagger.tokens("alice likes an apple .")

hmm.scala
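How a tagger can pick tags for an unseen sentence from such a model is commonly done with the Viterbi algorithm. The sketch below is a textbook Viterbi decoder in plain Scala, offered as an assumption about the general technique; it is not the source of NLP4L's HmmTagger. The probability tables are copied from the state diagram, and `ViterbiSketch`, `prob` and `tag` are hypothetical names. No smoothing is applied, so unknown words get probability 0.

```scala
object ViterbiSketch {
  val tags = List("NNP", "VB", "AT", "JJ", ".")

  // Values copied from the state diagram (rounded to three digits there).
  val initial: Map[String, Double] = Map("NNP" -> 0.667, "AT" -> 0.333)
  val transition: Map[String, Map[String, Double]] = Map(
    "NNP" -> Map("VB" -> 0.6, "." -> 0.4),
    "VB"  -> Map("AT" -> 0.667, "JJ" -> 0.333),
    "AT"  -> Map("NNP" -> 1.0),
    "JJ"  -> Map("." -> 1.0)
  )
  val emission: Map[String, Map[String, Double]] = Map(
    "NNP" -> Map("alice" -> 0.2, "apple" -> 0.4, "mike" -> 0.2, "orange" -> 0.2),
    "VB"  -> Map("ate" -> 0.333, "is" -> 0.333, "likes" -> 0.333),
    "AT"  -> Map("an" -> 1.0),
    "JJ"  -> Map("red" -> 1.0),
    "."   -> Map("." -> 1.0)
  )

  private def prob(table: Map[String, Map[String, Double]], from: String, to: String): Double =
    table.get(from).flatMap(_.get(to)).getOrElse(0.0)

  // Returns the most probable tag sequence for the given words.
  def tag(words: List[String]): List[String] = words match {
    case Nil => Nil
    case first :: rest =>
      // For each tag t: (probability, reversed path) of the best path ending in t
      val start: Map[String, (Double, List[String])] =
        tags.map(t => (t, (initial.getOrElse(t, 0.0) * prob(emission, t, first), List(t)))).toMap
      val best = rest.foldLeft(start) { (prev, word) =>
        tags.map { t =>
          // Best predecessor s for tag t
          val (p, path) =
            tags.map(s => (prev(s)._1 * prob(transition, s, t), prev(s)._2)).maxBy(_._1)
          (t, (p * prob(emission, t, word), t :: path))
        }.toMap
      }
      best.values.maxBy(_._1)._2.reverse
  }
}
```

For the sentence used in hmm.scala, `ViterbiSketch.tag(List("alice", "likes", "an", "apple", "."))` returns `List("NNP", "VB", "AT", "NNP", ".")`, since that is the only tag path with non-zero probability under these tables.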

NLP4L has hmm_postagger.scala in its examples directory. It uses the Brown corpus for HMM training.


Transliteration

Transliteration is the process of transcribing letters or words from one alphabet into another to facilitate comprehension and pronunciation for non-native speakers.

Examples of transliteration from English to Japanese:

computer → コンピューター
server → サーバー
internet → インターネット
mouse → マウス
information → インフォメーション

It helps improve recall

You search for “mouse” in English, and you also get “マウス” (= mouse) highlighted in Japanese.

Training data in NLP4L

train_data/alpha_katakana.txt:

academy,アカデミー
accent,アクセント
access,アクセス
accident,アクシデント
acrobat,アクロバット
action,アクション
adapter,アダプター
africa,アフリカ
airbus,エアバス
alaska,アラスカ
alcohol,アルコール
allergy,アレルギー

train_data/alpha_katakana_aligned.txt:

アaカcaデdeミーmy
アaクcセceンnトt
アaクcセceスss
アaクcシciデdeンnトt
アaクcロroバッbaトt
アaクcショtioンn
アaダdaプpターter
アaフfリriカca
エaアirバbuスs
アaラlaスsカka
アaルlコーcohoルl
アaレlleルrギーgy

Demo: Transliteration

nlp4l> :load examples/trans_katakana_alpha.scala

Input | Prediction | Right Answer
アルゴリズム | algorism | algorithm
プログラム | program | (OK)
ケミカル | chaemmical | chemical
ダイニング | dining | (OK)
コミッター | committer | (OK)
エントリー | entree | entry

Gathering loan words

[Figure: flow diagram with steps ① to ⑥]

• ① crawl

• gathering Katakana-Alphabet string pairs, e.g. “アルゴリズム, algorithm”

• Transliteration: “アルゴリズム” → “algorism”

• calculate edit distance between the transliterated string and the crawled Alphabet string

• store pair of strings in synonyms.txt if edit distance is small enough

Got 1,800+ records of synonym knowledge from jawiki
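The "small enough edit distance" check can be sketched with the classic Levenshtein distance. This is an assumption about the kind of distance meant, not the code used in NLP4L; `EditDistanceSketch`, `levenshtein`, `isSynonymCandidate` and the default threshold of 2 are hypothetical.

```scala
object EditDistanceSketch {
  // Levenshtein distance by dynamic programming, keeping one previous row.
  def levenshtein(a: String, b: String): Int = {
    var prev: Array[Int] = Array.range(0, b.length + 1)
    for (i <- 1 to a.length) {
      val cur = new Array[Int](b.length + 1)
      cur(0) = i
      for (j <- 1 to b.length) {
        val cost = if (a.charAt(i - 1) == b.charAt(j - 1)) 0 else 1
        cur(j) = math.min(
          math.min(prev(j) + 1, cur(j - 1) + 1), // deletion, insertion
          prev(j - 1) + cost                     // substitution or match
        )
      }
      prev = cur
    }
    prev(b.length)
  }

  // Hypothetical rule: keep the pair when the predicted spelling is within
  // maxDistance edits of the crawled spelling.
  def isSynonymCandidate(predicted: String, crawled: String, maxDistance: Int = 2): Boolean =
    levenshtein(predicted, crawled) <= maxDistance
}
```

`EditDistanceSketch.levenshtein("algorism", "algorithm")` is 2 (replace "s" with "t", insert "h"), so under the assumed threshold the pair “アルゴリズム, algorithm” would be kept even though the transliteration is not an exact match.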






NLP4L Framework

• A pluggable framework that improves the search experience (mainly for Lucene-based search systems).

• Reference implementations of plug-ins and corpora are provided.

• Uses NLP/ML technologies to output models, dictionaries and indexes.

• Since NLP/ML are not perfect, an interface that lets users examine the output dictionaries themselves is provided as well.


[Figure: NLP4L framework architecture]

Data Source: Corpus (Text data, Lucene index), Query Log, Access Log

Functions: TermExtractor, Transliteration, NEE, Classification, Document Vectors, Language Detection, Learning to Rank, Personalized Search

Related tools: Mahout, Spark

Outputs: Dictionaries (with maintenance), Model files, Tagged Corpus, Document Vectors

Dictionaries: Suggestion (auto complete), Did you mean?, synonyms.txt, userdic.txt, keyword attachment

Search engines: Solr, ES

Keyword Attachment

• “Keyword attachment” is a general format that enables the following functions.

• Learning to Rank

• Personalized Search

• Named Entity Extraction

• Document Classification

[Figure: Lucene doc → Lucene doc + keyword; ↑ Increase boost]

Learning to Rank

• The program learns, from the access log and other sources, that the score of document d for a query q should be larger than the normal score(q,d).

[Figure: Lucene doc d with attached keywords q, q, …]

https://en.wikipedia.org/wiki/Learning_to_rank

Personalized Search

• The program learns, from the access log and other sources, that the score of document d for a query q by user u should be larger than the normal score(q,d).

• Since Lucene does not let you specify score(q,d,u) directly, specify score(qu,d) instead.

• Because the number of q-u combinations can be enormous, limit the data to high-order queries or divide fields by user.

[Figure: Lucene doc d1 with attached keywords q1u1, q2u2; Lucene doc d2 with attached keywords q2u1, q1u2]

Join and Code with Us!

Contact us at koji at apache dot org for the details.

Demo or Q & A

Thank you!
