Open-source Hebrew search


Itamar Syn-Hershko

SIGTRS Meetup

22/7/2010, Jerusalem

Introduction

• The requirement to control masses of information

• Manual tagging / categorization is no longer an option

• Scanning text?

• Using an inverted index: faster, flexible, relevance

• Measuring a TR engine: relevance, precision, recall

• The perfect search engine is language dependent

• The perfect Hebrew search engine

• Introducing: HebMorph
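Precision and recall, two of the measures listed above, can be computed per query from the set of retrieved documents and the set of documents judged relevant. The sketch below uses the standard definitions; the document IDs are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)                      # relevant documents that were found
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical document IDs, for illustration only.
precision, recall = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5])
print(precision, recall)
```

Here two of the four retrieved documents are relevant (precision 0.5), and two of the three relevant documents were found (recall 2/3).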

Open-Source Hebrew Search: Introduction

How do search engines work?

• Inverted index

• Normalizations: Porter stemmer, s-stemmer, Soundex etc.

• Stemming, so that “looking”, “looked” and “looker” all reduce to “look”, and a search for “book” also returns “books”.
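As a rough illustration of the points above, here is a minimal inverted index with a deliberately crude plural-stripping normalizer. It is a simplified stand-in for an s-stemmer, not the Porter algorithm, and the function names and sample documents are invented for this sketch.

```python
from collections import defaultdict

def s_stem(word):
    """Crude normalizer: lower-case and drop one trailing plural 's'."""
    word = word.lower()
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def build_index(docs):
    """Map each normalized term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[s_stem(token.strip('.,;:!?"'))].add(doc_id)
    return index

def search(index, query):
    """Normalize the query the same way the documents were normalized."""
    return sorted(index.get(s_stem(query), set()))

# Hypothetical documents, for illustration only.
docs = {1: "I read two books.", 2: "This book is new.", 3: "Nothing here"}
index = build_index(docs)
print(search(index, "book"))   # [1, 2]
```

The key design point is that the same normalization runs at index time and at query time, so “book” and “books” meet at the same index entry.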


The Challenge


Token Ambiguity

• With Niqqud, Hebrew is no different from any non-Semitic language

• Niqqud-less spelling gives almost any word more than one possible reading

English: Look, Luke; Wine, Whine; Stack, Stuck.

Hebrew: שֵׁנִי, שָׁנִי, שְׁנֵי, שִׁנֵּי, שֹׁנִי

Niqqud-less spelling: שני, שני, שני, שני, שני …
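The collapse from several pointed forms to one unpointed string can be shown by stripping the Niqqud programmatically. This sketch relies only on Python's standard `unicodedata` module; the two vocalized forms are illustrative choices, written with escapes so the marks are explicit.

```python
import unicodedata

def strip_niqqud(text):
    """Remove Hebrew points and other combining marks from the text."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Illustrative vocalized forms of the same three letters (shin, nun, yod):
second = "\u05E9\u05C1\u05B5\u05E0\u05B4\u05D9"    # sheni, "second"
crimson = "\u05E9\u05C1\u05B8\u05E0\u05B4\u05D9"   # shani, "crimson"
bare = "\u05E9\u05E0\u05D9"                        # the Niqqud-less spelling

print(strip_niqqud(second) == strip_niqqud(crimson) == bare)   # True
```

Note that this removes every combining mark, so it would also strip accents from Latin text; a Hebrew-only variant would filter on the Hebrew point range instead.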

Open-Source Hebrew Search: The Challenge

Particle Separation

• Hebrew words attach particles (prefixes and suffixes) for context

• Without removing suffixes, relevant words might be skipped (for example: חבלה)

• Without removing prefixes, relevant words will not be looked up at all

• Ambiguity makes affix removal impossible in many cases

בית -> הבית, בבית, שבבית, לבית, והבית...

הרכבת -> רותי פספסה את הרכבת ("Ruthie missed the train") / הרכבת המוצר מסובכת להפליא ("assembling the product is remarkably complicated")

כלבי -> ?

שבתו -> ?
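To make the prefix problem concrete, the sketch below enumerates every way a word could be split into a run of one-letter particles plus a remainder. It is a naive illustration, not HebMorph's method: the particle set, the `min_stem` cutoff and the function name are assumptions, and deciding which split is right still needs a lexicon or a disambiguator.

```python
PREFIX_LETTERS = set("משהוכלב")  # common one-letter Hebrew particles (assumed set)

def prefix_candidates(word, min_stem=2):
    """Return every (prefix, remainder) split whose prefix uses only particle letters."""
    candidates = [("", word)]                 # the word may carry no prefix at all
    for i in range(1, len(word) - min_stem + 1):
        if word[i - 1] not in PREFIX_LETTERS:
            break                             # a non-particle letter ends the prefix run
        candidates.append((word[:i], word[i:]))
    return candidates

print(prefix_candidates("שבבית"))
print(prefix_candidates("הרכבת"))
```

For שבבית the candidates include the wanted remainder בית alongside several wrong ones. For הרכבת both readings survive: the whole word (the "assembling" sense) and ה + רכבת ("the train").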


Spelling Rules?

• There is no commonly agreed set of rules for Niqqud-less spelling like the one that exists for diacriticized Hebrew

• Even the agreed spellings are not always widely used

• Did you know the correct spelling for “mother” is “אימא”?

• The same word can be spelled differently by different writers, or even by the same writer

שירות / שרות / שיירות

דוגמא / דוגמה


!(Spelling Rules)

• Most debates are over spelling of nouns and loanwords, which have the greatest value in IR

• An extra layer of ambiguity, where each author or user can choose the spelling they like

אחשורוש or אחשוורוש? (Ahasuerus)

שבדיה or שוודיה? (Sweden)

טורקיה or תורכיה? (Turkey)

פריס or פריז? Or maybe פאריז? (Paris)
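One naive way to tolerate some of these variants is to index a normalized key that ignores vav and yod when they appear inside a word. The sketch below is an assumption made for illustration, not a description of how HebMorph handles spelling differences. It over-merges unrelated words and does not cover variants such as דוגמא / דוגמה or שבדיה / שוודיה.

```python
import re

def tolerant_key(word):
    """Crude spelling-tolerant key: ignore vav/yod used as vowel letters."""
    word = re.sub(r"ו+", "ו", word)            # collapse doubled vav
    word = re.sub(r"י+", "י", word)            # collapse doubled yod
    if len(word) <= 2:
        return word
    inner = re.sub(r"[וי]", "", word[1:-1])    # drop vav/yod inside the word
    return word[0] + inner + word[-1]

variants = ["שירות", "שרות", "שיירות"]
print({tolerant_key(v) for v in variants})     # a single shared key
```

The price of such a key is precision: any two words that differ only in internal vav or yod are conflated, which is why tolerance rules need to be chosen carefully.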


Noise Reduction

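• Stop word ambiguity

אשר ("which", or the name Asher), כדי ("in order to", or "jugs of"), אף ("even", or "nose")...

• Stop words as collocations

על ידי ("by"), אי פעם ("ever"), אף על פי ("even though"), שום דבר ("nothing")...

• Collocations where the meaning of a single word is changed

פי התהום ("the brink of the abyss")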


Tokenization Challenges

• Hebrew acronyms use the double-quote character (Gershayim), which most tokenizers treat as punctuation

• The same goes for Geresh, which is used for abbreviations

• Geresh is also used for חצ"ץ ג"ז

• … and ambiguity again: אינצ'
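A tokenizer can keep these marks when they sit inside or at the end of a Hebrew word and still drop them when they act as ordinary quotation marks. The regular expression below is a minimal sketch under that assumption. It is not HebMorph's tokenizer, and it does not resolve the ambiguous cases mentioned above.

```python
import re

HEB = "\u05D0-\u05EA"          # Hebrew letters alef..tav, including final forms
MARKS = "\"'\u05F3\u05F4"      # ASCII quote and apostrophe, Hebrew Geresh and Gershayim

# Letters, optionally joined by a mark that is followed by more letters,
# optionally ending with a Geresh (as in abbreviations).
TOKEN = re.compile(f"[{HEB}]+(?:[{MARKS}][{HEB}]+)*['\u05F3]?")

def tokenize(text):
    """Return Hebrew tokens, keeping acronym and abbreviation marks."""
    return TOKEN.findall(text)

print(tokenize('צה"ל קנה ג\'יפ וכו\''))
print(tokenize('אמר "שלום"'))
```

The first call keeps the acronym, the loanword and the abbreviation whole; the second drops the surrounding quotation marks because they are not followed by more letters.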


Common Texts

• Various dialects may present out-of-vocabulary (OOV) cases or change a word's meaning (חמרא, חמר), and hence require different handling

• Each corpus might hold more than one dialect

• Even partial Niqqud can help disambiguation

• Niqqud-less spelling is the most common nowadays


Ways of Resolution


What to Index?

• Deciding on an “indexing unit” is the cornerstone of any well-performing search engine

• For Hebrew we have:

– The original term (possibly with wildcards?)

– Hebrew triliteral root

– Lemma (דלתותינו -> דלת, "our doors" -> "door")

– Pseudo-lemma, stem

• Considerations

Open-Source Hebrew Search: Ways of Resolution

Hebrew NLP Methods

• To analyze a Hebrew word, NLP tools have to be used:

– Dictionary-based approach

– Algorithmic approach

• Comparison criteria include:

– Morphological precision (handling of 4-5 letter roots, broken plurals, assimilation, etc.)

– Handling of loanwords, names and slang

– Toleration of spelling differences

– Disambiguation (error rate, POS, ranking)


Dictionary vs Algorithm

• Dictionaries are easier to build and maintain, but they need much more ongoing attention and coverage testing

• Easier to support non-exact matches with an algorithm

• Prerequisites and dependencies

• Hand-crafted dictionaries with morphological information, and corpus-generated dictionaries with statistical data


Lemma Disambiguation

• In order to index a correct lemma, a good disambiguation process needs to be used

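• POS tools, grammatical or statistical, are the only reliable way to correctly eliminate false positives

• Even with such tools, ambiguity may exist:

"המראה של מטוסים ריקים [...]" ("the takeoff of empty planes" or "the sight of empty planes")

"ראש הממשלה בבון" ("the prime minister in Bonn" or "the prime minister is a baboon")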


NLP-based Hebrew Text Retrieval

• Filter lemmas based on their rank, morphological characteristics or statistical data

• OOV cases can be saved as-is, have affixes removed from them, or be compared to a list of known words (e.g. names and addresses)

• Removal of stop and noise words

• Term expansion (Soundex, synonyms)

• Save lemma to index (multiple lemmas at the same position)
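The last step, storing several lemmas at one position, can be pictured as a token stream in which every candidate lemma of a word carries the same position number. In Lucene this corresponds to emitting the extra lemmas with a position increment of 0. The lemma table and stop list below are hypothetical stand-ins for a real morphological analyzer.

```python
# Hypothetical analyzer output: surface form -> candidate lemmas.
LEMMAS = {
    "הרכבת": ["רכבת", "הרכבה"],   # "the train" or "assembly of"
    "המוצר": ["מוצר"],
}
STOP_WORDS = {"את", "של"}          # assumed stop list, for illustration only

def analyze(text):
    """Return (position, term) pairs; all lemmas of one word share a position."""
    tokens = []
    position = 0
    for word in text.split():
        if word in STOP_WORDS:
            continue               # in this sketch a stop word does not consume a position
        for lemma in LEMMAS.get(word, [word]):   # OOV words are kept as-is
            tokens.append((position, lemma))
        position += 1
    return tokens

print(analyze("הרכבת המוצר מסובכת"))
```

Because both lemmas of the first word sit at position 0, a phrase query can match either reading without the index having to commit to one of them.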


Other Text Retrieval Methods

• Is morphological analysis necessary?

• Available methods:

– Light-stemming

– Word truncation

– N-grams

– Skipgrams

– (Sub-types)

• Require no extra overhead

• Favorable, even when not superior

• Disadvantages: larger index size, slower searches (for some)
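Three of the methods listed above need no linguistic knowledge at all and fit in a few lines. The sketch below shows character n-grams, word truncation and one simple skipgram variant; the function names and default lengths are choices made for this example.

```python
def char_ngrams(word, n=4):
    """All overlapping character n-grams; words of length n or less are returned whole."""
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]

def truncate(word, length=4):
    """Word truncation: keep only the first `length` characters."""
    return word[:length]

def skipgrams(word, skip=1):
    """Character pairs that skip over `skip` characters between the two letters."""
    return [word[i] + word[i + skip + 1] for i in range(len(word) - skip - 1)]

print(char_ngrams("והבית"))   # ['והבי', 'הבית']
print(char_ngrams("הבית"))    # ['הבית']
print(truncate("searching"))  # 'sear'
print(skipgrams("look"))      # ['lo', 'ok']
```

The prefixed form והבית shares the 4-gram הבית with the unprefixed word, so the two can match without any prefix analysis. The cost is the one named above: every word produces several index terms.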


… Applied on Semitic Languages

• Research has shown 4-grams and light stemmers (“light-10”) to work better than morphological lemmatizers for Arabic

• Apparently, good relevance can be achieved without ‘knowing’ the language

• Computers vs Humans

• Lemmatization and disambiguation processes do make mistakes

• Contextual processing can fail for short queries, producing incorrect searches


The Best Retrieval Method for Hebrew Texts

• Arabic and Hebrew share many morphological phenomena

• … but they do differ

• Without trying, we can never know

• Where HebMorph comes in


HebMorph’s Approach


HebMorph… is a free, open-source effort for making Hebrew properly searchable by various IR software libraries, while maintaining decent recall, precision and relevance in retrievals.

• 2 goals

• Development is done with Lucene (why?)

• MorphAnalyzer, Hebrew.SimpleAnalyzer (+ duality)

• OpenRelevance

Open-Source Hebrew Search: HebMorph’s Approach

Indexing Flow Chart

Searching Wikipedia with BzReader and HebMorph


Source available from http://github.com/synhershko/BzReader

The Road Ahead

• A better tokenizer

• MorphAnalyzer:

– Hspell improvements (coverage, lemma probabilities, prefix probabilities)

– Toleration guidelines

– Smarter OOV handling

– Better stop word handling

• Hebrew judgments for OpenRelevance with Orev

• Comparing various approaches to Hebrew IR

• Wide availability (Java port underway!)

• Other uses (NLP, OCR, you name it)


Join Us!

• The more people join, the more feedback we get, and the better we become.

• Our mailing list: https://lists.sourceforge.net/lists/listinfo/hebmorph-thinktank

• Code repository (released under GPLv2): http://github.com/synhershko/HebMorph

• Activity updates: http://www.code972.com/blog/hebmorph/

#HebMorph on Twitter


Thank you!

