Open-source Hebrew search
Itamar Syn-Hershko
SIGTRS Meetup
22/7/2010, Jerusalem
Introduction
• The requirement to control masses of information
• Manual tagging / categorization is no longer an option
• Scanning text?
• Using an inverted index: faster, flexible, relevance
• Measuring a TR engine: relevance, precision, recall
• The perfect search engine is language dependent
• The perfect Hebrew search engine
• Introducing: HebMorph
Open-Source Hebrew Search: Introduction
How do search engines work?
• Inverted index
• Normalizations: Porter stemmer, s-stemmer, Soundex etc.
• Stemming, so that “looking”, “looked” and “looker” all reduce to “look”, and a search for “book” also returns “books”.
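The interplay of an inverted index and a normalization step can be sketched in a few lines. This is a minimal illustration, assuming a very naive s-stemmer and whitespace-free English tokens; the function names (`s_stem`, `build_index`, `search`) and the sample documents are made up for the example and do not come from any particular library.

```python
import re
from collections import defaultdict


def s_stem(token):
    """Naive s-stemmer: strip one trailing 's' from longer tokens."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def tokenize(text):
    """Lowercase the text and return its runs of ASCII letters."""
    return re.findall(r"[a-z]+", text.lower())


def build_index(docs):
    """Map every stemmed term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[s_stem(token)].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents containing every (stemmed) query term."""
    postings = [index.get(s_stem(t), set()) for t in tokenize(query)]
    return set.intersection(*postings) if postings else set()


# Made-up sample documents.
DOCS = {
    1: "Two books on search engines",
    2: "A book about Hebrew",
    3: "Look at the index",
}
INDEX = build_index(DOCS)
```

Because both the documents and the query pass through the same stemmer, `search(INDEX, "book")` finds document 1 (which only contains "books") as well as document 2.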
The Challenge
Token Ambiguity
• With Niqqud, Hebrew is no different from any non-Semitic language
• Niqqud-less spelling yields more than one possible meaning to almost any given word
English: Look, Luke; Wine, Whine; Stack, Stuck.
Hebrew: שֵׁנִי, שָׁנִי, שְׁנֵי, שִׁנֵּי, שֹׁנִי
Niqqud-less spelling: שני, שני, שני, שני, שני …
Open-Source Hebrew Search: The Challenge
Particle Separation
• Hebrew words use particles (prefixes and suffixes) for context
• Without removing suffixes, relevant words might be skipped (for example: חבלה)
• Without removing prefixes, relevant words will not be looked up at all
• Ambiguity makes affix removal impossible in many cases
בית -> הבית, בבית, שבבית, לבית, והבית...
רותי פספסה את הרכבת (Ruti missed the train) -> רכבת
הרכבת המוצר מסובכת להפליא (Assembling the product is remarkably complicated) -> הרכבת?
כלבי -> ?
שבתו -> ?
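Why prefix removal is ambiguous can be shown with a small sketch that peels prefix letters off a word and checks the results against a word list. Everything here is an assumption made for illustration: the prefix letter set is a simplified subset, `LEXICON` is a hypothetical three-word toy dictionary, and the function names are invented.

```python
# Illustrative subset of single-letter Hebrew prefixes (an assumption, not a full grammar).
PREFIX_LETTERS = set("ובכלמשה")

# Hypothetical toy lexicon; a real system would consult a full dictionary.
LEXICON = {"בית", "רכבת", "הרכבת"}


def prefix_candidates(word, max_prefix_len=3, min_stem_len=2):
    """Return the word itself plus each form left after peeling leading prefix letters."""
    candidates = [word]
    for i in range(1, max_prefix_len + 1):
        # Stop when too little of the word remains or the next letter is not a prefix letter.
        if len(word) - i < min_stem_len or word[i - 1] not in PREFIX_LETTERS:
            break
        candidates.append(word[i:])
    return candidates


def known_readings(word):
    """Keep only the candidates that the toy lexicon recognizes."""
    return [c for c in prefix_candidates(word) if c in LEXICON]
```

With this toy lexicon, `known_readings("שבבית")` leaves only "בית", while `known_readings("הרכבת")` keeps both "הרכבת" and "רכבת": without context there is no way to tell which reading was meant.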
Spelling Rules?
• There is no common agreement on rules for Niqqud-less spelling, unlike the agreed rules that exist for diacriticized Hebrew
• Even agreed-upon spellings are not always widely used
• Did you know the correct spelling for “mother” is “אימא”?
• The same word can be spelled differently by different writers, or even by the same writer
שירות / שרות / שיירות
דוגמא / דוגמה
!(Spelling Rules)
• Most debates are over spelling of nouns and loanwords, which have the greatest value in IR
• An extra layer of ambiguity, where each author or user can choose the spelling they prefer
אחשורוש או אחשוורוש?
שבדיה או שוודיה?
טורקיה או תורכיה?
פריס או פריז? או אולי פאריז?
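One crude way to tolerate some of these variations is to compare words by a normalized key instead of their exact spelling. The sketch below is a deliberately naive heuristic invented for illustration, not a method taken from the talk or from HebMorph: it drops interior vav and yod and treats a final א like a final ה. It conflates many unrelated words and does not cover consonant swaps such as פריס / פריז.

```python
# Vav and yod: the letters most often added or dropped in Niqqud-less spelling.
MATRES = "וי"


def spelling_key(word):
    """Collapse a word to a crude key that ignores interior vav/yod."""
    if len(word) <= 2:
        return word
    interior = "".join(ch for ch in word[1:-1] if ch not in MATRES)
    # Treat a final alef like a final he (e.g. the two spellings of "example").
    last = "ה" if word[-1] == "א" else word[-1]
    return word[0] + interior + last
```

Under this key, שירות, שרות and שיירות all collapse to the same string, as do דוגמא and דוגמה, while פריס and פריז remain distinct.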
Noise Reduction
• Stop-word ambiguity
אשר, כדי, אף...
• Stop words as collocations
על ידי, אי פעם, אף על פי, שום דבר...
• Collocations in which the meaning of a single word changes
פי התהום
Tokenization Challenges
• Hebrew acronyms use the double-quote character (Gershayim), which most tokenizers treat as punctuation
• The same goes for Geresh, which is used for abbreviations
• Geresh is also used with the letters חצ"ץ ג"ז
• … and ambiguity again: אינצ'
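A tokenizer that keeps Gershayim and Geresh inside Hebrew tokens can be sketched with a regular expression. This is a minimal sketch under stated assumptions, not HebMorph's tokenizer: the pattern and the `tokenize` name are invented here, and only Hebrew letters, ASCII letters and digits are handled.

```python
import re

# A Hebrew token: letters, optionally joined by an internal quote mark
# (ASCII " or ', or the dedicated Gershayim/Geresh characters), with an
# optional trailing Geresh. ASCII alphanumeric runs are kept as a fallback.
TOKEN = re.compile(r"""[א-ת]+(?:["״'׳][א-ת]+)*['׳]?|[A-Za-z0-9]+""")


def tokenize(text):
    """Return tokens, keeping acronym and abbreviation marks inside words."""
    return TOKEN.findall(text)
```

With this pattern the acronym צה"ל and the loanword ג'יפ each stay a single token, and the abbreviation מפרופ' keeps its trailing Geresh. The ambiguity the slide mentions remains: a word wrapped in single quotes, such as 'שלום', also comes out with a trailing quote attached, because the pattern cannot tell a closing quotation mark from a Geresh.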
Common Texts
• Various dialects may present out-of-vocabulary (OOV) cases, or change a word's meaning (חמרא, חמר), and hence require different handling
• Each corpus might hold more than one dialect
• Even partial Niqqud can help disambiguation
• Niqqud-less spelling is the most common nowadays
Ways of Resolution
What to Index?
• Deciding on an “indexing unit” is the cornerstone of any well-performing search engine
• For Hebrew we have:
  – The original term (possibly with wildcards?)
  – Hebrew triliteral root
  – Lemma (דלתותינו ← דלת)
  – Pseudo-lemma, stem
• Considerations
Open-Source Hebrew Search: Ways of Resolution
Hebrew NLP Methods
• To analyze a Hebrew word, NLP tools have to be used:
  – Dictionary-based approach
  – Algorithmic approach
• Comparison criteria include:
  – Morphological precision (handling of 4-5 letter roots, broken plurals, assimilation, etc.)
  – Handling of loanwords, names and slang
  – Toleration of spelling differences
  – Disambiguation (error rate, POS, ranking)
Dictionary vs Algorithm
• Dictionaries are easier to build and maintain, but they need much more on-going attention and coverage tests
• Easier to support non-exact matches with an algorithm
• Prerequisites and dependencies
• Hand-crafted dictionaries with morphological information, and corpus-generated dictionaries with statistical data
Lemma Disambiguation
• In order to index a correct lemma, a good disambiguation process needs to be used
• POS tools, grammatical or statistical, are the only reliable way to correctly eliminate false positives
• Even with such tools, ambiguity may exist:
"המראה של מטוסים ריקים [...]"
"ראש הממשלה בבון"
NLP-based Hebrew Text Retrieval
• Filter lemmas based on their rank, morphological characteristics or statistical data
• OOV cases can be saved as-is, have affixes removed from them, or be compared to a list of known words (e.g. names and addresses)
• Removal of stop and noise words
• Term expansion (soundex, synonyms)
• Save lemma to index (multiple lemmas at the same position)
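Storing multiple lemmas at the same position can be sketched as a positional index in which an ambiguous token contributes several postings that share one position. This is a simplified illustration, not HebMorph's code: `ANALYSES` is a hypothetical stand-in for a morphological analyzer's output, and `index_document` is an invented helper. In Lucene terms, the idea corresponds to emitting extra tokens with a position increment of zero.

```python
from collections import defaultdict

# Hypothetical analyzer output: surface form -> candidate lemmas.
ANALYSES = {
    "הרכבת": ["רכבת", "הרכבה"],  # "the train" or "the assembly of"
    "המוצר": ["מוצר"],
}


def index_document(doc_id, tokens, index):
    """Add every candidate lemma of each token at that token's position."""
    for position, token in enumerate(tokens):
        # Tokens the analyzer does not know (OOV) are stored as-is.
        for lemma in ANALYSES.get(token, [token]):
            index[lemma].append((doc_id, position))


index = defaultdict(list)
index_document(1, ["הרכבת", "המוצר", "מסובכת"], index)
```

After indexing, both רכבת and הרכבה point at position 0 of document 1, so a query for either lemma finds the document and phrase queries still see consistent positions.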
Other Text Retrieval Methods
• Is morphological analysis necessary?
• Available methods:
  – Light-stemming
  – Word truncation
  – N-grams
  – Skipgrams
  – (Sub-types)
• Require no extra overhead
• Favorable, even when not superior
• Disadvantages: larger index size, slower searches (for some)
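Two of the listed methods, word truncation and character n-grams, need no linguistic knowledge at all and can be sketched as follows. The function names and default parameters are assumptions chosen for the example, and the Dice coefficient is used here only as one common way to compare n-gram sets.

```python
def truncate(word, length=4):
    """Word truncation: index only the first `length` characters."""
    return word[:length]


def char_ngrams(word, n=3):
    """Return all overlapping character n-grams (the whole word if it is short)."""
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def ngram_similarity(a, b, n=3):
    """Dice coefficient over the two words' n-gram sets."""
    grams_a, grams_b = set(char_ngrams(a, n)), set(char_ngrams(b, n))
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))
```

For example, מחשבים yields the trigrams מחש, חשב, שבי and בים, two of which it shares with מחשב, so the two forms match partially without any stemming. The extra terms per word are also why n-gram indexes tend to be larger.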
… Applied to Semitic Languages
• Research has shown 4-grams and light stemmers (“light-10”) to work better than morphological lemmatizers for Arabic
• Apparently, good relevance can be achieved without ‘knowing’ the language
• Computers vs Humans
• Lemmatization and disambiguation processes do make mistakes
• Contextual processing can fail for short queries, producing incorrect searches
The Best Retrieval Method for Hebrew Texts
• Arabic and Hebrew share many morphological phenomena
• … but they do differ
• Without trying, we can never know
• Where HebMorph comes in
HebMorph’s Approach
HebMorph…
… is a free, open-source effort to make Hebrew properly searchable by various IR software libraries, while maintaining decent recall, precision and relevance in retrievals.
• 2 goals
• Development is done with Lucene (why?)
• MorphAnalyzer, Hebrew.SimpleAnalyzer (+ duality)
• OpenRelevance
Open-Source Hebrew Search: HebMorph’s Approach
Indexing Flow Chart
Searching Wikipedia with BzReader and HebMorph
Source available from http://github.com/synhershko/BzReader
The Road Ahead
• A better tokenizer
• MorphAnalyzer:
  – Hspell improvements (coverage, lemma probabilities, prefix probabilities)
  – Toleration guidelines
  – Smarter OOV handling
  – Better stop-word handling
• Hebrew judgments for OpenRelevance with Orev
• Comparing various approaches to Hebrew IR
• Wide availability (Java port underway!)
• Other uses (NLP, OCR, you name it)
Join Us!
• The more people join, the more feedback we get, and the better we become.
• Our mailing list: https://lists.sourceforge.net/lists/listinfo/hebmorph-thinktank
• Code repository (released under GPLv2): http://github.com/synhershko/HebMorph
• Activity updates: http://www.code972.com/blog/hebmorph/
#HebMorph on Twitter
Thank you!