An Introduction to NLP4L (Scala by the Bay / Big Data Scala 2015)

Page 1: An Introduction to NLP4L (Scala by the Bay / Big Data Scala 2015)

An Introduction to NLP4L

Natural Language Processing tool for Apache Lucene

Koji Sekiguchi @kojisays Founder & CEO, RONDHUIT

Page 2: An Introduction to NLP4L (Scala by the Bay / Big Data Scala 2015)

Agenda

• What’s NLP4L?

• How NLP improves search experience

• Count number of words in Lucene index

• Application: Transliteration

• Future Plans

Page 4: An Introduction to NLP4L (Scala by the Bay / Big Data Scala 2015)

What’s NLP4L?

Page 5: An Introduction to NLP4L (Scala by the Bay / Big Data Scala 2015)

What’s NLP4L?

• GOAL

  • Improve Lucene users’ search experience

• FEATURES

  • Use of Lucene index as a Corpus Database

  • Lucene API Front-end written in Scala

  • NLP4L provides

    • Preprocessors for existing ML tools

    • ML algorithms and Applications (e.g. Transliteration)

Page 8: An Introduction to NLP4L (Scala by the Bay / Big Data Scala 2015)

What’s Lucene?

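Lucene is a high-performance, full-featured text search engine library written entirely in Java.

Documents:

1: Alice ate an apple.

2: Mike likes an orange.

3: An apple is red.

Indexing the documents produces an (inverted) index, which maps each term to the IDs of the documents that contain it:

alice 1

an 1, 2, 3

apple 1, 3

ate 1

is 3

likes 2

mike 2

orange 2

red 3

Searching for “apple” looks the term up in the index and finds documents 1 and 3.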
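
The inverted index on this slide can be sketched in plain Scala collections. This is only an illustration of the data structure, not how Lucene stores it; `InvertedIndexSketch`, `tokenize`, `build` and `search` are hypothetical names, and the regex tokenizer is a crude stand-in for a Lucene analyzer.

```scala
object InvertedIndexSketch {
  // The three documents from the slide, keyed by document ID.
  val docs: Seq[(Int, String)] = Seq(
    1 -> "Alice ate an apple.",
    2 -> "Mike likes an orange.",
    3 -> "An apple is red."
  )

  // Lowercase and split on non-letters (assumption: ASCII text only).
  def tokenize(text: String): List[String] =
    text.toLowerCase.split("[^a-z]+").filter(_.nonEmpty).toList

  // term -> sorted IDs of the documents that contain the term
  def build(documents: Seq[(Int, String)]): Map[String, Seq[Int]] =
    documents
      .flatMap { case (id, text) => tokenize(text).distinct.map(term => (term, id)) }
      .groupBy(_._1)
      .map { case (term, pairs) => (term, pairs.map(_._2).sorted) }

  // Look a term up in the index; unknown terms match no documents.
  def search(index: Map[String, Seq[Int]], term: String): Seq[Int] =
    index.getOrElse(term.toLowerCase, Seq.empty)
}
```

With this sketch, `InvertedIndexSketch.search(InvertedIndexSketch.build(InvertedIndexSketch.docs), "apple")` yields the document IDs 1 and 3, matching the `apple 1, 3` row of the index.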

Page 10: An Introduction to NLP4L (Scala by the Bay / Big Data Scala 2015)

Evaluation Measures

[Figure: the set of “target” documents (the ones the user wants) overlapping the set of “result” documents (the ones the search returns). Documents in the result are “positive”, documents outside it are “negative”.]

• tp (true positive): in both target and result

• fp (false positive): in result but not in target

• fn (false negative): in target but not in result

• tn (true negative): in neither target nor result

Recall, Precision

precision = tp / (tp + fp)

recall = tp / (tp + fn)
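
The two formulas can be checked with a small sketch that treats the target and the result as sets of document IDs. `EvalMeasures` and `Counts` are hypothetical names introduced here, and returning 0.0 for an empty denominator is an assumption rather than something the slides specify.

```scala
object EvalMeasures {
  final case class Counts(tp: Int, fp: Int, fn: Int)

  // Compare what the user wanted (target) with what the search returned (result).
  def counts[A](target: Set[A], result: Set[A]): Counts =
    Counts(
      tp = (target intersect result).size, // wanted and returned
      fp = (result diff target).size,      // returned but not wanted
      fn = (target diff result).size       // wanted but not returned
    )

  // precision = tp / (tp + fp)
  def precision(c: Counts): Double =
    if (c.tp + c.fp == 0) 0.0 else c.tp.toDouble / (c.tp + c.fp)

  // recall = tp / (tp + fn)
  def recall(c: Counts): Double =
    if (c.tp + c.fn == 0) 0.0 else c.tp.toDouble / (c.tp + c.fn)
}
```

For example, with target `Set(1, 2, 3, 4)` and result `Set(3, 4, 5)`, the counts are tp = 2, fp = 1, fn = 2, so precision is 2/3 and recall is 1/2.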

Page 21: An Introduction to NLP4L (Scala by the Bay / Big Data Scala 2015)

Solution

• To improve recall: n-gram, synonym dictionary, etc.

  • e.g. Transliteration

• To improve precision: facet (filter query)

  • e.g. Named Entity Extraction

• Ranking Tuning

Page 25: An Introduction to NLP4L (Scala by the Bay / Big Data Scala 2015)

Gradual precision improvement

[Figure: the “target” and “result” sets, redrawn after each step.]

1. q=watch

2. filter by “Gender=Men’s”

3. filter by “Price=100-150”

Page 28: An Introduction to NLP4L (Scala by the Bay / Big Data Scala 2015)

Structured Documents

ID | product | price | gender

1 | CURREN New Men’s Date Stainless Steel Military Sport Quartz Wrist Watch | 8.92 | Men’s

2 | Suiksilver The Gamer Watch | 87.99 | Men’s
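
Structured fields are what make the gradual filtering of the previous slides possible. The sketch below mimics it with plain Scala collections rather than Lucene filter queries; `FacetFilterSketch`, `Product` and the helper names are hypothetical, and the third row is invented test data that is not on the slide.

```scala
object FacetFilterSketch {
  final case class Product(id: Int, name: String, price: Double, gender: String)

  val Mens = "Men’s"

  val products: Seq[Product] = Seq(
    Product(1, "CURREN New Men’s Date Stainless Steel Military Sport Quartz Wrist Watch", 8.92, Mens),
    Product(2, "Suiksilver The Gamer Watch", 87.99, Mens),
    Product(3, "Hypothetical Dress Watch", 120.00, "Women’s") // invented row, for illustration only
  )

  // Keyword search: a naive substring match standing in for q=watch.
  def search(q: String): Seq[Product] =
    products.filter(_.name.toLowerCase.contains(q.toLowerCase))

  // Each filter narrows the previous result, like a filter query on a facet.
  def byGender(gender: String)(ps: Seq[Product]): Seq[Product] =
    ps.filter(_.gender == gender)

  def byPrice(min: Double, max: Double)(ps: Seq[Product]): Seq[Product] =
    ps.filter(p => p.price >= min && p.price <= max)
}
```

Each call shrinks the result set: `search("watch")` returns all three rows, `byGender(Mens)` keeps rows 1 and 2, and a further `byPrice(50, 100)` keeps only row 2.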

Page 29: An Introduction to NLP4L (Scala by the Bay / Big Data Scala 2015)

Unstructured Documents

ID | article

1 | David Cameron says he has a mandate to pursue EU reform following the Conservatives' general election victory. The Prime Minister will be hoping his majority government will give him extra leverage in Brussels.

2 | He wants to renegotiate the terms of the UK's membership ahead of a referendum by the end of 2017. He has said he will campaign for Britain to remain in the EU if he gets the reforms he wants.

Page 30: An Introduction to NLP4L (Scala by the Bay / Big Data Scala 2015)

Make them Structured

ID | article | person | org | loc

1 | David Cameron says he has a mandate to pursue EU reform following the Conservatives' general election victory. The Prime Minister will be hoping his majority government will give him extra leverage in Brussels. | David Cameron | EU | Brussels

2 | He wants to renegotiate the terms of the UK's membership ahead of a referendum by the end of 2017. He has said he will campaign for Britain to remain in the EU if he gets the reforms he wants. | (none) | EU | UK, Britain

NEE (Named Entity Extraction) extracts interesting words.

Page 31: An Introduction to NLP4L (Scala by the Bay / Big Data Scala 2015)

Manual Tagging using brat

Page 33: An Introduction to NLP4L (Scala by the Bay / Big Data Scala 2015)

A small Corpus

val CORPUS = Array(
  "Alice ate an apple.",
  "Mike likes an orange.",
  "An apple is red."
)

Page 34: An Introduction to NLP4L (Scala by the Bay / Big Data Scala 2015)

index_simple.scala

// text data
val CORPUS = Array(
  "Alice ate an apple.",
  "Mike likes an orange.",
  "An apple is red."
)

// Lucene index directory
val index = "/tmp/index-simple"

// schema definition
def schema(): Schema = {
  val builder = AnalyzerBuilder()
  builder.withTokenizer("standard")
  builder.addTokenFilter("lowercase")
  val analyzer = builder.build
  val fieldTypes = Map(
    "text" -> FieldType(analyzer, true, true, true, true)
  )
  val analyzerDefault = analyzer
  Schema(analyzerDefault, fieldTypes)
}

// create Lucene document
def doc(text: String): Document = {
  Document(Set(
    Field("text", text)
  ))
}

// open a writer
val writer = IWriter(index, schema)

// write documents
CORPUS.foreach(text => writer.write(doc(text)))

// close writer
writer.close

For the code snippets used in my talk, please look at: https://github.com/NLP4L/meetups/tree/master/20150818

Page 43: An Introduction to NLP4L (Scala by the Bay / Big Data Scala 2015)

Getting word counts

alice 1

an 1, 2, 3

apple 1, 3

ate 1

is 3

likes 2

mike 2

orange 2

red 3

getting_word_counts.scala

val reader = RawReader(index)

reader.sumTotalTermFreq("text")        // -> 12
reader.field("text").get.terms.size    // -> 9
reader.totalTermFreq("text", "an")     // -> 3

reader.close
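
To see where 12, 9 and 3 come from, the same three numbers can be recomputed without Lucene. This is a plain-collections sketch, not the NLP4L API; `WordCountSketch` and its members are hypothetical names that only mirror the reader calls above, and `tokenize` approximates the standard tokenizer plus lowercase filter.

```scala
object WordCountSketch {
  val corpus: List[String] = List(
    "Alice ate an apple.",
    "Mike likes an orange.",
    "An apple is red."
  )

  // Lowercase and split on non-letters (assumption: ASCII text only).
  def tokenize(text: String): List[String] =
    text.toLowerCase.split("[^a-z]+").filter(_.nonEmpty).toList

  // Every term occurrence in the corpus, in order.
  val tokens: List[String] = corpus.flatMap(tokenize)

  // Total number of term occurrences.
  val sumTotalTermFreq: Int = tokens.size

  // Number of distinct terms.
  val uniqueTerms: Int = tokens.distinct.size

  // Number of occurrences of one term.
  def totalTermFreq(term: String): Int = tokens.count(_ == term)
}
```

Each sentence contributes four tokens, so the total is 12; lowercasing merges "An" and "an", which leaves 9 distinct terms, and "an" occurs once per sentence.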

Page 47: An Introduction to NLP4L (Scala by the Bay / Big Data Scala 2015)

Getting top terms

getting_word_counts.scala

val reader = RawReader(index)

reader.topTermsByDocFreq("text")
reader.topTermsByTotalTermFreq("text")
// ->
// (term, docFreq, totalTermFreq)
// (an,3,3)
// (apple,2,2)
// (likes,1,1)
// (is,1,1)
// (orange,1,1)
// (mike,1,1)
// (ate,1,1)
// (red,1,1)
// (alice,1,1)

reader.close
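
The (term, docFreq, totalTermFreq) triples can also be derived by hand. The sketch below is a plain-Scala illustration, not NLP4L's implementation; `TopTermsSketch` is a hypothetical name, and breaking ties alphabetically is an assumption (the slide's output orders the frequency-1 terms differently).

```scala
object TopTermsSketch {
  val corpus: Seq[String] = Seq(
    "Alice ate an apple.",
    "Mike likes an orange.",
    "An apple is red."
  )

  def tokenize(text: String): List[String] =
    text.toLowerCase.split("[^a-z]+").filter(_.nonEmpty).toList

  // Returns the top n (term, docFreq, totalTermFreq) triples,
  // highest docFreq first, ties broken alphabetically.
  def topTermsByDocFreq(docs: Seq[String], n: Int): List[(String, Int, Int)] = {
    val tokenized = docs.map(tokenize)
    tokenized.flatten.distinct.map { term =>
      val docFreq = tokenized.count(_.contains(term))          // documents containing the term
      val totalTermFreq = tokenized.map(_.count(_ == term)).sum // occurrences across all documents
      (term, docFreq, totalTermFreq)
    }.sortBy { case (term, docFreq, _) => (-docFreq, term) }.take(n).toList
  }
}
```

On the small corpus, docFreq and totalTermFreq coincide because no term appears twice in the same sentence.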

Page 48: An Introduction to NLP4L (Scala by the Bay / Big Data Scala 2015)

What’s ShingleFilter?

• ShingleFilter = Word n-gram TokenFilter

Input: “Lucene is a popular software”

After WhitespaceTokenizer: Lucene/is/a/popular/software

After ShingleFilter (N=2): Lucene is/is a/a popular/popular software
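
The effect of a word n-gram filter can be reproduced with `sliding` from the Scala standard library. This is a sketch of the idea only; `ShingleSketch.shingles` is a hypothetical helper, not Lucene's ShingleFilter, and it emits only n-grams of exactly size n.

```scala
object ShingleSketch {
  // Join every run of n consecutive tokens into one shingle.
  // sliding(n) would return one short group for inputs shorter than n,
  // so those inputs are mapped to an empty result explicitly.
  def shingles(tokens: Seq[String], n: Int): List[String] =
    if (n <= 0 || tokens.size < n) Nil
    else tokens.sliding(n).map(_.mkString(" ")).toList
}
```

Applied to the whitespace-tokenized sentence from the slide with n = 2, it produces "Lucene is", "is a", "a popular" and "popular software".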

Page 49: An Introduction to NLP4L (Scala by the Bay / Big Data Scala 2015)

Language Model

• LM represents the fluency of language

• The n-gram model is the most widely used LM

• Calculation example for 2-gram
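
A sketch of the 2-gram calculation, written to agree with the counts used in language_model.scala. It assumes maximum-likelihood estimates without smoothing; the symbols $w_i$ (the i-th word, with $w_0$ a sentence-start marker) and $C(\cdot)$ (a count in the corpus) are notation introduced here.

```latex
\[
P(w_1, \dots, w_n) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-1}),
\qquad
P(w_i \mid w_{i-1}) = \frac{C(w_{i-1}\, w_i)}{C(w_{i-1})}
\]
```

For the three-sentence corpus used in this talk, $C(\text{an apple}) = 2$ and $C(\text{an}) = 3$, so $P(\text{apple} \mid \text{an}) = 2/3$.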

Page 50: An Introduction to NLP4L (Scala by the Bay / Big Data Scala 2015)

language_model.scala (1/2)

val index = "/tmp/index-lm"

val CORPUS = Array(
  "Alice ate an apple.",
  "Mike likes an orange.",
  "An apple is red."
)

// schema definition
def schema(): Schema = {
  val builder = AnalyzerBuilder()
  builder.withTokenizer("standard")
  builder.addTokenFilter("lowercase")
  val analyzer = builder.build
  builder.addTokenFilter("shingle", "minShingleSize", "2", "maxShingleSize", "2", "outputUnigrams", "false")
  val analyzer2g = builder.build
  val fieldTypes = Map(
    "word" -> FieldType(analyzer, true, true, true, true),
    "word2g" -> FieldType(analyzer2g, true, true, true, true)
  )
  val analyzerDefault = analyzer
  Schema(analyzerDefault, fieldTypes)
}

// create a language model index
val writer = IWriter(index, schema())

def addDocument(doc: String): Unit = {
  writer.write(Document(Set(
    Field("word", doc),
    Field("word2g", doc)
  )))
}

CORPUS.foreach(addDocument(_))

writer.close()

Page 52: An Introduction to NLP4L (Scala by the Bay / Big Data Scala 2015)

language_model.scala (2/2)

val reader = RawReader(index)

// P(apple|an) = C(an apple) / C(an)
val count_an_apple = reader.totalTermFreq("word2g", "an apple")
val count_an = reader.totalTermFreq("word", "an")
val prob_apple_an = count_an_apple.toFloat / count_an.toFloat

// P(orange|an) = C(an orange) / C(an)
val count_an_orange = reader.totalTermFreq("word2g", "an orange")
val prob_orange_an = count_an_orange.toFloat / count_an.toFloat

reader.close
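
The same two probabilities can be computed without an index, which makes the expected values easy to verify. This is a plain-Scala sketch and not the NLP4L API; `BigramLmSketch`, `unigramCount`, `bigramCount` and `prob` are hypothetical names, and bigrams are counted within each sentence only, as the shingle field does per document.

```scala
object BigramLmSketch {
  val corpus: List[String] = List(
    "Alice ate an apple.",
    "Mike likes an orange.",
    "An apple is red."
  )

  def tokenize(text: String): List[String] =
    text.toLowerCase.split("[^a-z]+").filter(_.nonEmpty).toList

  val sentences: List[List[String]] = corpus.map(tokenize)

  // C(w): occurrences of a single word.
  def unigramCount(word: String): Int =
    sentences.map(_.count(_ == word)).sum

  // C(first second): occurrences of two adjacent words within a sentence.
  def bigramCount(first: String, second: String): Int =
    sentences.map(s => s.sliding(2).count(_ == List(first, second))).sum

  // P(word | given) = C(given word) / C(given), or 0.0 if `given` never occurs.
  def prob(word: String, given: String): Double = {
    val denominator = unigramCount(given)
    if (denominator == 0) 0.0 else bigramCount(given, word).toDouble / denominator
  }
}
```

"an apple" occurs twice and "an orange" once, while "an" occurs three times, so P(apple|an) is 2/3 and P(orange|an) is 1/3.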

Page 53: An Introduction to NLP4L (Scala by the Bay / Big Data Scala 2015)

Part-of-Speech Tagging

Our Corpus for training

Alice/NNP ate/VB an/AT apple/NNP ./.

Mike/NNP likes/VB an/AT orange/NNP ./.

An/AT apple/NNP is/VB red/JJ ./.

NNP | Proper noun, singular

VB | Verb

AT | Article

JJ | Adjective

. | period

Page 54: An Introduction to NLP4L (Scala by the Bay / Big Data Scala 2015)

Hidden Markov Model

[Figure: a series of words (the observed sequence) aligned with a series of parts of speech (the hidden state sequence).]

Page 59: An Introduction to NLP4L (Scala by the Bay / Big Data Scala 2015)

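HMM state diagram

Initial state probabilities:

NNP 0.667, AT 0.333, VB 0.0, JJ 0.0, . 0.0

Transition probabilities:

NNP → VB 0.6, NNP → . 0.4

VB → AT 0.667, VB → JJ 0.333

AT → NNP 1.0

JJ → . 1.0

Emission probabilities:

NNP: alice 0.2, apple 0.4, mike 0.2, orange 0.2

VB: ate 0.333, is 0.333, likes 0.333

AT: an 1.0

JJ: red 1.0

.: . 1.0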
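
The numbers in the state diagram are relative frequencies counted from the tagged training corpus. The sketch below recomputes them in plain Scala; it is an illustration of the counting only, not how NLP4L's HmmModelIndexer works, and `HmmEstimateSketch` and its members are hypothetical names.

```scala
object HmmEstimateSketch {
  val corpus: List[String] = List(
    "Alice/NNP ate/VB an/AT apple/NNP ./.",
    "Mike/NNP likes/VB an/AT orange/NNP ./.",
    "An/AT apple/NNP is/VB red/JJ ./."
  )

  // Each sentence as a list of (lowercased word, tag) pairs.
  val sentences: List[List[(String, String)]] = corpus.map { sentence =>
    sentence.split("\\s+").toList.map { pair =>
      val parts = pair.split("/")
      (parts(0).toLowerCase, parts(1))
    }
  }

  // Turn a list of outcomes into relative frequencies.
  def normalize[K](items: Seq[K]): Map[K, Double] =
    items.groupBy(identity).map { case (k, v) => (k, v.size.toDouble / items.size) }

  // P(tag of the first word in a sentence)
  val initial: Map[String, Double] = normalize(sentences.map(_.head._2))

  // P(next tag | tag), from adjacent tag pairs within each sentence
  val transition: Map[String, Map[String, Double]] = {
    val pairs = sentences.flatMap { s =>
      val tags = s.map(_._2)
      tags.zip(tags.drop(1))
    }
    pairs.groupBy(_._1).map { case (from, ps) => (from, normalize(ps.map(_._2))) }
  }

  // P(word | tag)
  val emission: Map[String, Map[String, Double]] =
    sentences.flatten.groupBy(_._2).map { case (tag, ps) => (tag, normalize(ps.map(_._1))) }
}
```

For instance, two of the three sentences start with NNP, three of the five NNP tokens are followed by VB, and two of the five NNP tokens are "apple", which gives 0.667, 0.6 and 0.4 respectively.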

Page 60: An Introduction to NLP4L (Scala by the Bay / Big Data Scala 2015)

hmm.scala

val index = "/tmp/index-hmm"

// text data (they are tagged!)
val CORPUS = Array(
  "Alice/NNP ate/VB an/AT apple/NNP ./.",
  "Mike/NNP likes/VB an/AT orange/NNP ./.",
  "An/AT apple/NNP is/VB red/JJ ./."
)

// write-open Lucene index
val indexer = HmmModelIndexer(index)

// tagged texts are indexed here
CORPUS.foreach{ text =>
  val pairs = text.split("\\s+")
  val doc = pairs.map{h => h.split("/")}.map{i => (i(0).toLowerCase(), i(1))}
  indexer.addDocument(doc)
}

indexer.close()

// execute part-of-speech tagging on an unknown text
// make an HmmModel from Lucene index
val model = HmmModel(index)
// get HmmTagger from HmmModel
val tagger = HmmTagger(model)

// use HmmTagger to annotate unknown sentence
tagger.tokens("alice likes an apple .")

NLP4L has hmm_postagger.scala in the examples directory. It uses the Brown corpus for HMM training.
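
An HMM tagger typically decodes the most probable tag sequence with the Viterbi algorithm. The sketch below applies a textbook Viterbi decoder to the probabilities from the state diagram slide; it is not NLP4L's HmmTagger, and `ViterbiSketch` with its hard-coded tables is a hypothetical illustration that has no smoothing for unseen words.

```scala
object ViterbiSketch {
  val tags: List[String] = List("NNP", "VB", "AT", "JJ", ".")

  // Probabilities copied from the HMM state diagram; missing entries mean 0.0.
  val initial: Map[String, Double] = Map("NNP" -> 0.667, "AT" -> 0.333)
  val transition: Map[String, Map[String, Double]] = Map(
    "NNP" -> Map("VB" -> 0.6, "." -> 0.4),
    "VB"  -> Map("AT" -> 0.667, "JJ" -> 0.333),
    "AT"  -> Map("NNP" -> 1.0),
    "JJ"  -> Map("." -> 1.0)
  )
  val emission: Map[String, Map[String, Double]] = Map(
    "NNP" -> Map("alice" -> 0.2, "apple" -> 0.4, "mike" -> 0.2, "orange" -> 0.2),
    "VB"  -> Map("ate" -> 0.333, "is" -> 0.333, "likes" -> 0.333),
    "AT"  -> Map("an" -> 1.0),
    "JJ"  -> Map("red" -> 1.0),
    "."   -> Map("." -> 1.0)
  )

  private def p(table: Map[String, Map[String, Double]], a: String, b: String): Double =
    table.get(a).flatMap(_.get(b)).getOrElse(0.0)

  // Most probable tag sequence for the given words.
  def tag(words: List[String]): List[String] = words match {
    case Nil => Nil
    case first :: rest =>
      // For each tag: (probability, reversed path) of the best path ending in that tag.
      val start: Map[String, (Double, List[String])] =
        tags.map(t => (t, (initial.getOrElse(t, 0.0) * p(emission, t, first), List(t)))).toMap
      val last = rest.foldLeft(start) { (best, word) =>
        tags.map { t =>
          // Pick the best predecessor tag s for the current tag t.
          val (prevProb, prevPath) =
            tags.map(s => (best(s)._1 * p(transition, s, t), best(s)._2)).maxBy(_._1)
          (t, (prevProb * p(emission, t, word), t :: prevPath))
        }.toMap
      }
      last.values.maxBy(_._1)._2.reverse
  }
}
```

For the sentence on the slide, `ViterbiSketch.tag(List("alice", "likes", "an", "apple", "."))` follows the only path with non-zero probability, NNP VB AT NNP ".".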

Page 69: An Introduction to NLP4L (Scala by the Bay / Big Data Scala 2015)

Transliteration

Transliteration is a process of transcribing letters or words from one alphabet to another one to facilitate comprehension and pronunciation for non-native speakers.

Examples of transliteration from English to Japanese:

computer | コンピューター

server | サーバー

internet | インターネット

mouse | マウス

information | インフォメーション

Page 70: An Introduction to NLP4L (Scala by the Bay / Big Data Scala 2015)

It helps improve recall

You search for the English word “mouse”, but you get “マウス” (= mouse) highlighted in Japanese.

Page 72: An Introduction to NLP4L (Scala by the Bay / Big Data Scala 2015)

Training data in NLP4L

train_data/alpha_katakana.txt

academy,アカデミー
accent,アクセント
access,アクセス
accident,アクシデント
acrobat,アクロバット
action,アクション
adapter,アダプター
africa,アフリカ
airbus,エアバス
alaska,アラスカ
alcohol,アルコール
allergy,アレルギー

train_data/alpha_katakana_aligned.txt

アaカcaデdeミーmy
アaクcセceンnトt
アaクcセceスss
アaクcシciデdeンnトt
アaクcロroバッbaトt
アaクcショtioンn
アaダdaプpターter
アaフfリriカca
エaアirバbuスs
アaラlaスsカka
アaルlコーcohoルl
アaレlleルrギーgy
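
Each aligned entry interleaves Katakana chunks with the alphabet chunks they correspond to. The sketch below splits an entry into (Katakana, alphabet) pairs, assuming the format is exactly as rendered on the slide (Katakana run followed by lowercase ASCII run, no delimiter); the actual file may use separators that were lost in the transcript. `AlignedPairSketch` is a hypothetical name.

```scala
object AlignedPairSketch {
  // One Katakana run (Unicode block U+30A0 to U+30FF, which includes the
  // long-vowel mark and small kana) followed by one run of lowercase letters.
  private val Chunk = "([\\u30A0-\\u30FF]+)([a-z]+)".r

  // Split an aligned entry into (katakana, alphabet) chunk pairs.
  def parse(entry: String): List[(String, String)] =
    Chunk.findAllMatchIn(entry).map(m => (m.group(1), m.group(2))).toList
}
```

Joining the alphabet halves of the chunks recovers the English word, and joining the Katakana halves recovers the loan word, which is what makes the pairs usable as chunk-level training data.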

Page 73: An Introduction to NLP4L (Scala by the Bay / Big Data Scala 2015)

Demo: Transliteration

nlp4l> :load examples/trans_katakana_alpha.scala

Input | Prediction | Right Answer

アルゴリズム | algorism | algorithm

プログラム | program | (OK)

ケミカル | chaemmical | chemical

ダイニング | dining | (OK)

コミッター | committer | (OK)

エントリー | entree | entry

Page 74: An Introduction to NLP4L (Scala by the Bay / Big Data Scala 2015)

Gathering loan words

[Figure: pipeline for gathering loan words.]

① Crawl.

② Gather Katakana-Alphabet string pairs, e.g. アルゴリズム, algorithm.

③ Transliterate the Katakana string: “アルゴリズム” → “algorism”.

④ Calculate the edit distance between the transliteration and the Alphabet string.

⑤ Store the pair of strings in synonyms.txt if the edit distance is small enough.

Got 1,800+ records of synonym knowledge from jawiki.
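
The edit-distance check in the pipeline can be sketched with the classic Levenshtein dynamic program. The slides do not say which distance or threshold NLP4L uses, so both the choice of Levenshtein distance and the default `maxDistance = 2` are assumptions; `EditDistanceSketch` and `isSynonymCandidate` are hypothetical names.

```scala
object EditDistanceSketch {
  // Levenshtein distance: minimum number of single-character insertions,
  // deletions and substitutions needed to turn a into b.
  def levenshtein(a: String, b: String): Int = {
    // prev(j) holds the distance between the processed prefix of a and the first j chars of b.
    var prev: Array[Int] = (0 to b.length).toArray
    for (i <- 1 to a.length) {
      val curr = new Array[Int](b.length + 1)
      curr(0) = i
      for (j <- 1 to b.length) {
        val cost = if (a(i - 1) == b(j - 1)) 0 else 1
        curr(j) = math.min(
          math.min(curr(j - 1) + 1, prev(j) + 1), // insertion, deletion
          prev(j - 1) + cost                      // substitution or match
        )
      }
      prev = curr
    }
    prev(b.length)
  }

  // Keep a Katakana-Alphabet pair when the transliteration is close to the alphabet string.
  def isSynonymCandidate(predicted: String, alphabet: String, maxDistance: Int = 2): Boolean =
    levenshtein(predicted, alphabet) <= maxDistance
}
```

The predicted "algorism" is two edits away from "algorithm" (substitute s with t, insert h), so under the assumed threshold the pair アルゴリズム, algorithm would be stored.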

Page 77: An Introduction to NLP4L (Scala by the Bay / Big Data Scala 2015)

NLP4L Framework

• A pluggable framework that improves search experience (mainly for Lucene-based search systems).

• Reference implementations of plug-ins and corpora are provided.

• Uses NLP/ML technologies to output models, dictionaries and indexes.

• Since NLP/ML are not perfect, an interface that enables users to examine the output dictionaries themselves is provided as well.

Page 81: An Introduction to NLP4L (Scala by the Bay / Big Data Scala 2015)

[Figure: NLP4L Framework architecture diagram. Labels in the diagram:]

Solr, ES, Mahout, Spark

Data Source
・Corpus (Text data, Lucene index)
・Query Log
・Access Log

Dictionaries
・Suggestion (auto complete)
・Did you mean?
・synonyms.txt
・userdic.txt
・keyword attachment

maintenance

Model files, Tagged Corpus, Document Vectors

・TermExtractor
・Transliteration
・NEE
・Classification
・Document Vectors
・Language Detection
・Learning to Rank
・Personalized Search

Page 82: An Introduction to NLP4L (Scala by the Bay / Big Data Scala 2015)

Keyword Attachment

• “Keyword attachment” is a general format that enables the following functions.

  • Learning to Rank

  • Personalized Search

  • Named Entity Extraction

  • Document Classification

[Figure: a keyword is attached to a Lucene doc (↑ Increase boost).]

Page 83: An Introduction to NLP4L (Scala by the Bay / Big Data Scala 2015)

Learning to Rank

• Program learns, from access log and other sources, that the score of document d for a query q should be larger than the normal score(q,d)

[Figure: Lucene doc d with the attached keywords q, q, …]

https://en.wikipedia.org/wiki/Learning_to_rank

Page 84: An Introduction to NLP4L (Scala by the Bay / Big Data Scala 2015)

Personalized Search

• Program learns, from access log and other sources, that the score of document d for a query q by user u should be larger than the normal score(q,d)

• Since Lucene does not let you specify score(q,d,u), you have to specify score(qu,d) instead.

• Because the number of q-u combinations can be enormous, limit the data to high-order queries or divide fields depending on the user.

[Figure: Lucene doc d1 with the attached keywords q1u1, q2u2; Lucene doc d2 with the attached keywords q2u1, q1u2.]
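
One way to picture keyword attachment is as a map from each document to the keywords learned for it. The sketch below derives such keywords from click records; the `Click` record shape, the idea that a click means "this document should score higher", and the plain concatenation of q and u are all assumptions made for illustration, since the slides only say the program learns from the access log and other sources.

```scala
object KeywordAttachmentSketch {
  // Hypothetical access-log record: user u issued query q and clicked document d.
  final case class Click(user: String, query: String, docId: String)

  // Learning to Rank: attach each query q to the documents clicked for it.
  def queryKeywords(log: Seq[Click]): Map[String, Set[String]] =
    log.groupBy(_.docId).map { case (d, clicks) => (d, clicks.map(_.query).toSet) }

  // Personalized Search: attach the combined key "qu", so that a query for
  // the combined term can stand in for score(q,d,u).
  def queryUserKeywords(log: Seq[Click]): Map[String, Set[String]] =
    log.groupBy(_.docId).map { case (d, clicks) =>
      (d, clicks.map(c => c.query + c.user).toSet)
    }
}
```

With a log in which u1 clicked d1 for q1 and d2 for q2, while u2 clicked d1 for q2 and d2 for q1, `queryUserKeywords` attaches q1u1 and q2u2 to d1 and q2u1 and q1u2 to d2, as in the figure.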

Page 85: An Introduction to NLP4L (Scala by the Bay / Big Data Scala 2015)

Join and Code with Us!

Contact us at koji at apache dot org for the details.

Page 86: An Introduction to NLP4L (Scala by the Bay / Big Data Scala 2015)

Demo or Q & A

Thank you!
