Japanese Linguistics in Lucene and Solr

Post on 26-Jun-2015


Presented by Christian Moen, Founder and CEO, Atilika Inc. See the conference video: http://www.lucidimagination.com/devzone/events/conferences/lucene-revolution-2012. This talk gives an introduction to searching Japanese text and an overview of the new Japanese search features available out-of-the-box in Lucene and Solr. Atilika developed a new Japanese morphological analyzer (Kuromoji) in 2010 when they couldn't find an easy-to-use, high-quality morphological analyzer in Java that was good for both search and other Japanese NLP tasks. Kuromoji was built with the goal of donating it to the Apache Software Foundation in order to make Japanese work well in both Lucene and Solr, and it is now a standard part of these software packages.

Transcript of Japanese Linguistics in Lucene and Solr

Japanese linguistics in Apache Lucene™ and Apache Solr™

Christian Moen
christian@atilika.com

May 9th, 2012

About me

• MSc. in computer science, University of Oslo, Norway
• Worked with search at FAST (now Microsoft) for 10 years
  • 5 years in R&D building the FAST Enterprise Search Platform in Oslo, Norway
  • 5 years in Services doing solution delivery, sales, etc. in Tokyo, Japan
• Founded アティリカ株式会社 (Atilika Inc.) in 2009
  • We help companies innovate using search technologies and good ideas
  • We know information retrieval, natural language processing and big data
  • We are based in Tokyo, but we have clients everywhere
• Newbie Lucene & Solr committer
  • Mostly been working on Japanese language support (Kuromoji) so far
• Please write me at christian@atilika.com or cm@apache.org

Today’s topics


• Japanese 101 - ordering beer and toasting

• Japanese language processing

• Japanese features in Lucene/Solr


Japanese 101

ビールください
bi-ru kudasai

A beer, please

ありがとうございます!
arigatō gozaimasu!

Thank you very much!

乾杯!
kanpai!

Cheers!

JR新宿駅の近くにビールを飲みに行こうか?
JR Shinjuku eki no chikaku ni bi-ru o nomi ni ikō ka?

Shall we go for a beer near JR Shinjuku station?

JR新宿駅の近くにビールを飲みに行こうか?

Romaji - ローマ字 | Latin characters (26+) | Used for proper nouns, etc.
Katakana - カタカナ | Phonetic script (~50) | Typically used for loan words
Kanji - 漢字 | Chinese characters (50,000+) | Used for stems & proper nouns
Hiragana - ひらがな | Phonetic script (~50) | Used for inflections & particles

JR新宿駅の近くにビールを飲みに行こうか?

What are the words in this sentence?
Words are implicit in Japanese - there is no white space that separates them.

How do we index this for search, then?
We need to segment text into tokens first.

Two major approaches for segmentation!

1. n-gramming
2. morphological analysis (statistical approach)


n-gramming (n=2)
JR新宿駅の近くにビールを飲みに行こうか?
Shall we go for a beer near JR Shinjuku station?

n=2: JR R新 新宿 宿駅 駅の の近 近く ...
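The bigram build-up above can be sketched in a few lines of plain Java. This is a standalone illustration of the idea behind Lucene's CJK bigram analysis, not its actual implementation; it ignores token attributes and surrogate pairs:

```java
import java.util.ArrayList;
import java.util.List;

public class Bigrams {
    // Slide a two-character window one position at a time,
    // emitting JR, R新, 新宿, ... just like the slides do for n=2.
    static List<String> bigrams(String text) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + 2 <= text.length(); i++) {
            grams.add(text.substring(i, i + 2));
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(bigrams("JR新宿駅の近く"));
        // [JR, R新, 新宿, 宿駅, 駅の, の近, 近く]
    }
}
```

Note how every bigram except 新宿 and 近く crosses a word boundary - exactly the problem discussed next.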


Problems with n-gramming
JR R新 新宿 宿駅 駅の の近 近く ...
(宿駅 means ‘post town’, ‘relay station’ or ‘stage’ - a change of semantics!)

• Does not preserve meaning well and often changes semantics
  • Impacts on ranking - search precision (many false positives)
• Also generates many terms per document or query
  • Impacts on index size and performance
• Still sometimes appropriate for certain search applications
  • Compliance, e-commerce with special product names, ...


Morphological analysis
JR新宿駅の近くにビールを飲みに行こうか?

JR | 新宿 | 駅 | の | 近く | に | ビール | を | 飲み | に | 行こ | う | か | ?
Shall we go for a beer near JR Shinjuku station?

• Tokens reflect what a Japanese speaker considers as words
• Machine-learned statistical approach
  • Conditional Random Fields (CRFs) decoded using Viterbi
  • Also does part-of-speech tagging, readings for kanji, etc.
• Several statistical models available with high accuracy (F > 0.97)
  • Models/dictionaries are available as IPADIC, UniDic, ...
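The dictionary-plus-Viterbi idea can be sketched in plain Java. The lexicon and costs below are invented for illustration - a real analyzer like Kuromoji uses a large trained model (IPADIC/UniDic) with per-word and part-of-speech transition costs, and a proper unknown-word model:

```java
import java.util.Arrays;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;

public class ToySegmenter {
    // Toy lexicon with word costs (lower = more likely); entirely made up.
    static final Map<String, Integer> DICT = Map.of(
            "新宿", 10, "駅", 10, "の", 5, "近く", 10, "に", 5,
            "ビール", 10, "を", 5, "飲み", 10, "行こう", 10, "か", 5);
    static final int UNKNOWN_COST = 100; // heavy penalty for unknown single chars

    // Viterbi-style dynamic program: best[i] is the cheapest segmentation of
    // the first i characters; prev[i] remembers where the last word starts,
    // so the winning path can be walked backwards.
    static List<String> segment(String text) {
        int n = text.length();
        int[] best = new int[n + 1];
        int[] prev = new int[n + 1];
        Arrays.fill(best, Integer.MAX_VALUE);
        best[0] = 0;
        for (int i = 0; i < n; i++) {
            if (best[i] == Integer.MAX_VALUE) continue; // unreachable position
            for (int j = i + 1; j <= n; j++) {
                String word = text.substring(i, j);
                Integer cost = DICT.get(word);
                if (cost == null) {
                    if (j - i != 1) continue;   // only single-char unknown words
                    cost = UNKNOWN_COST;
                }
                if (best[i] + cost < best[j]) {
                    best[j] = best[i] + cost;
                    prev[j] = i;
                }
            }
        }
        LinkedList<String> tokens = new LinkedList<>();
        for (int j = n; j > 0; j = prev[j]) {
            tokens.addFirst(text.substring(prev[j], j));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(segment("新宿駅の近くにビールを飲みに行こうか"));
        // [新宿, 駅, の, 近く, に, ビール, を, 飲み, に, 行こう, か]
    }
}
```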

How does this actually work?

Demo

Japanese support in Lucene and Solr


Japanese in Lucene/Solr

New feature in Lucene/Solr 3.6!
Available out-of-the-box!
Easy to use with reasonable defaults!
Provides sophisticated Japanese linguistics!
Customisable!


How do we use it?

Use JapaneseAnalyzer!


Use field type “text_ja” in example schema.xml


Demo
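In Solr this amounts to a one-line field definition in schema.xml pointing at the shipped type (the field name `body` here is a made-up example):

```xml
<!-- schema.xml: index a Japanese text field with the shipped text_ja type -->
<field name="body" type="text_ja" indexed="true" stored="true"/>
```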


Feature summary / text_ja analyzer chain

JapaneseTokenizer
• Segments Japanese text into tokens with very high accuracy
• Token attributes for part-of-speech, base form, readings, etc.
• Compound segmentation with compound synonyms
• Segmentation is customisable using user dictionaries

JapaneseBaseFormFilter
• Adjective and verb lemmatisation (by reduction)

JapanesePartOfSpeechStopFilter
• Stop-words removal based on part-of-speech tags
• See example/solr/conf/lang/stoptags_ja.txt

CJKWidthFilter
• Character width normalisation (fast Unicode NFKC subset)

StopFilter
• Stop-words removal
• See example/solr/conf/lang/stopwords_ja.txt

JapaneseKatakanaStemFilter
• Normalises common katakana spelling variations

LowerCaseFilter
• Lowercases
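Wired together in Solr, the chain above corresponds to a text_ja field type along these lines. This is a sketch modelled on the 3.6 example schema.xml; check the shipped file for the exact attributes and defaults:

```xml
<fieldType name="text_ja" class="solr.TextField" positionIncrementGap="100"
           autoGeneratePhraseQueries="false">
  <analyzer>
    <tokenizer class="solr.JapaneseTokenizerFactory" mode="search"/>
    <filter class="solr.JapaneseBaseFormFilterFactory"/>
    <filter class="solr.JapanesePartOfSpeechStopFilterFactory" tags="lang/stoptags_ja.txt"/>
    <filter class="solr.CJKWidthFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ja.txt"/>
    <filter class="solr.JapaneseKatakanaStemFilterFactory" minimumLength="4"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```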

Feature details


Compound nouns
How do we deal with compound nouns?

Japanese | English
関西国際空港 | Kansai International Airport
シニアソフトウェアエンジニア | Senior Software Engineer

These are one word in Japanese, so searching for 空港 (airport) doesn’t match!

We need to segment the compounds, too.


Compound segmentation

関西国際空港 (Kansai International Airport) → 関西 (Kansai) + 国際 (International) + 空港 (Airport)
シニアソフトウェアエンジニア (Senior Software Engineer) → シニア (Senior) + ソフトウェア (Software) + エンジニア (Engineer)

We are using a heuristic to implement this!


Compound synonym tokens

Position 1 | Position 2 | Position 3
関西 | 国際 | 空港
関西国際空港 (synonym spanning positions 1-3)

• Segment the compound into its parts
  • Good for recall - we can also search and match 空港 (airport)
• We keep the compound itself as a synonym
  • Good for precision with an exact hit because of IDF
• Approach benefits both precision and recall for overall good ranking
• JapaneseTokenizer actually returns a graph of tokens
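The token graph can be pictured as plain position/length pairs. The record below is just an illustration of the data shape; in Lucene the same information travels as position-increment and position-length attributes on the token stream:

```java
import java.util.List;

public class CompoundSynonymTokens {
    // position: which slot the token starts in; positionLength: how many
    // slots it spans. The compound synonym overlaps its three parts.
    record Token(String term, int position, int positionLength) {}

    static final List<Token> TOKENS = List.of(
            new Token("関西", 1, 1),
            new Token("関西国際空港", 1, 3), // synonym spanning positions 1-3
            new Token("国際", 2, 1),
            new Token("空港", 3, 1));

    public static void main(String[] args) {
        for (Token t : TOKENS) {
            System.out.printf("%s pos=%d len=%d%n",
                    t.term(), t.position(), t.positionLength());
        }
    }
}
```

A phrase query for 関西国際空港 gets an exact single-term hit, while a query for 空港 still matches the decompounded part at position 3.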

Character width normalisation
How do we deal with character widths?

Half-width・半角 | Full-width・全角
Lucene | Ｌｕｃｅｎｅ
ｶﾀｶﾅ | カタカナ
123 | １２３

Input text: Ｌｕｃｅｎｅ ｶﾀｶﾅ １２３
CJKWidthFilter: Lucene カタカナ 123
(half-width, full-width, half-width)

Use CJKWidthFilter to normalise them (Unicode NFKC subset)!
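The JDK can demonstrate the underlying mapping: Unicode NFKC folds full-width Latin letters and digits to half-width and half-width katakana to full-width. CJKWidthFilter implements a fast subset of this folding, so treat the full NFKC call below as an approximation of what the filter does:

```java
import java.text.Normalizer;

public class WidthNormalize {
    public static void main(String[] args) {
        // Full-width Latin, half-width katakana, full-width digits
        String input = "Ｌｕｃｅｎｅ ｶﾀｶﾅ １２３";
        String normalized = Normalizer.normalize(input, Normalizer.Form.NFKC);
        System.out.println(normalized); // Lucene カタカナ 123
    }
}
```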

Katakana end-vowel stemming

A common spelling variation in katakana is an end long-vowel sound.

English | Japanese spelling variations
manager | マネージャー・マネージャ・マネジャー

Input text: コピー マネージャー マネージャ マネジャー
JapaneseKatakanaStemFilter: コピー マネージャ マネージャ マネジャ
(copy, manager, manager, “manager”)

We use JapaneseKatakanaStemFilter to normalise/stem the end long-vowel for long terms!
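The heuristic can be sketched as follows. The length threshold of 4 mirrors the filter's default minimum length (which is why short terms like コピー are left alone); take the exact boundary condition here as an approximation of the real filter:

```java
public class KatakanaStem {
    static final char PROLONGED_SOUND_MARK = 'ー'; // U+30FC
    static final int MINIMUM_LENGTH = 4; // short terms are left untouched

    // Drop a trailing long-vowel mark from sufficiently long katakana terms,
    // so マネージャー and マネージャ index to the same term.
    static String stem(String term) {
        int len = term.length();
        if (len >= MINIMUM_LENGTH && term.charAt(len - 1) == PROLONGED_SOUND_MARK) {
            return term.substring(0, len - 1);
        }
        return term;
    }

    public static void main(String[] args) {
        System.out.println(stem("マネージャー")); // マネージャ
        System.out.println(stem("コピー"));       // コピー (below minimum length)
    }
}
```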


Lemmatisation
Japanese adjectives and verbs are highly inflected, how do we deal with that?

買う - dictionary form (kau, ‘to buy’)

Inflected forms (not exhaustive):
買いなさい 買いなさるな 買いましたら 買いましたり 買いまして 買いましょう 買います 買いますまい 買いませば 買いません 買いませんで 買いませんでした
買える 買おう 買った 買ったら 買ったり 買って 買わせない 買わせます 買わせません 買わせられない 買わせられます 買わせられません
買いませんでしたら 買いませんでしたり 買いませんなら 買うだろう 買うでしょう 買うな 買うまい 買え 買えない 買えば 買えます 買えません
買わせられる 買わせる 買わない 買わないだろう 買わないで 買わないでしょう 買わなかった 買わなかったら 買わなかったり 買わなければ 買われない 買われます

Use JapaneseBaseFormFilter to normalise inflected adjectives and verbs to dictionary form (lemmatisation by reduction)!
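The effect can be pictured as a lookup from surface form to dictionary form. In Kuromoji the base form is a dictionary attribute on each token, not a hard-coded table; this toy map just wires a few of the slide's inflected forms back to 買う for illustration:

```java
import java.util.Map;

public class BaseFormLookup {
    // Invented toy table: a handful of inflected surface forms from the
    // slide, each reducing to the dictionary form 買う (kau, 'to buy').
    static final Map<String, String> BASE_FORMS = Map.of(
            "買った", "買う",
            "買わない", "買う",
            "買えば", "買う",
            "買え", "買う");

    // Return the dictionary form if known, otherwise the surface unchanged.
    static String baseForm(String surface) {
        return BASE_FORMS.getOrDefault(surface, surface);
    }

    public static void main(String[] args) {
        System.out.println(baseForm("買った")); // 買う
    }
}
```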

User dictionaries

• Own dictionaries can be used for ad hoc segmentation, i.e. to override the default model
• File format is simple and there’s no need to assign weights, etc. before using them
• Example custom dictionary:

# Custom segmentation and POS entry for long entries
関西国際空港,関西 国際 空港,カンサイ コクサイ クウコウ,カスタム名詞

# Custom reading and POS for former sumo wrestler Asashoryu
朝青龍,朝青龍,アサショウリュウ,カスタム人名
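In Solr, a user dictionary is hooked up through the tokenizer factory's userDictionary attribute (the file path below is an example location):

```xml
<tokenizer class="solr.JapaneseTokenizerFactory" mode="search"
           userDictionary="lang/userdict_ja.txt"/>
```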

Japanese focus in 4.0

• Improvements in JapaneseTokenizer
  • Improved search mode for katakana compounds
  • Improved unknown word segmentation
  • Some performance improvements
• CharFilters for various character normalisations
  • Dates and numbers
  • Repetition marks (odoriji)
• Japanese spell-checker
  • Robert and Koji almost got this into 3.6, but it got postponed because of API changes being necessary

Acknowledgements

Robert Muir - Thanks for the heavy lifting integrating Kuromoji into Lucene, always reviewing my patches quickly, and friendly help
Michael McCandless - Thanks for streaming Viterbi and synonym compounds!
Uwe Schindler - Thanks for performance improvements + being the policeman
Simon Willnauer - Thanks for doing the Kuromoji code donation process so well
Gaute Lambertsen & Gerry Hocks - Thanks for presentation feedback and being great colleagues

Q & A

ありがとうございました!

Thank you very much!

arigatō gozaimashita!