Post on 27-Apr-2015
Myanmar Search Engine
Nyi Lynn Seck, EC (MCPA)
Search Engine Evolution
● 1st generation (use only “on page” data)
– Text data, word frequency, language
● 2nd generation (use off-page, web-specific data)
– Link (or connectivity) analysis
– Click-through data (what people click)
– Anchor text (how people refer to this page)
● 3rd generation (answer “the need behind the query”)
– Semantic analysis: what is this about?
– Focus on user need, rather than on query
– Context determination
Text Mining Research Area
● Information Retrieval (IR)
– Search Engines
– Classification
– Recommendation
● Information Extraction (IE)
– Screen scraping
– Product Information (e.g. price) scraping
● Information Understanding
– Natural Language Processing (NLP)
– Question Answering
– Concept Extraction from Newsgroups
– Visualization
– Summarization
● Cross-Lingual Text Mining
● Trend Detection
– Outlier Detection
Classical Indexing
Indexing
– Keyword Indexing
– Subject Indexing (Classification)
– Collocate subjects
– Define & assign a code (Call Number) to each document
Tokenization
Tokenization is the process of breaking a stream of text into words, phrases, symbols, or other meaningful elements called tokens.
Assign unique ID to each word & keep in a lexicon
Remove Stop/Noise words before/after tokenization
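The three steps above can be sketched for English text as follows. This is a minimal illustration, assuming tokens can be found with a regular expression; the `STOP_WORDS` set and the function names are hypothetical and not taken from the slides. Myanmar text is written without spaces between words, so it would need a word segmenter in place of the regular expression.

```python
import re

# Tiny illustrative stop-word list (an assumption, not the deck's list).
STOP_WORDS = {"a", "about", "above", "after", "all", "the", "is", "in"}

def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]

def build_lexicon(tokens, lexicon=None):
    """Assign a unique integer ID to each new word and return the lexicon."""
    lexicon = {} if lexicon is None else lexicon
    for tok in tokens:
        if tok not in lexicon:
            lexicon[tok] = len(lexicon)
    return lexicon

tokens = remove_stop_words(tokenize("The crawler is all about the index"))
lexicon = build_lexicon(tokens)
```

Here stop words are removed after tokenization and before IDs are assigned, so the lexicon only holds content words.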
Stemming, Lemmatization
Stemming is the process of reducing inflected (or sometimes derived) words to their stem, base or root form, generally a written word form.
Lemmatization is the process of reducing an inflected spelling to its lexical root or lemma form. The lemma form is the base form or head word form you would find in a dictionary. The combination of the lemma form with its word class (noun, verb, etc.) is called the lexeme.
ကစ
ကစကြကင� ကစစရ အကစကစပြ
ကစ
ကစေ�နသည� ကစလ�မမ���ည�ကစခ�သည�
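The difference between the two definitions can be sketched with English words. This is only an illustration: the suffix list and the lemma table below are hypothetical, and a real stemmer or lemmatizer (for English or for Myanmar) would need far richer rules and a proper dictionary.

```python
# Hypothetical suffix list and lemma table, for illustration only.
SUFFIXES = ("ing", "ed", "s")
LEMMAS = {"went": "go", "better": "good", "mice": "mouse"}

def stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def lemmatize(word):
    """Look up irregular forms in a dictionary, else fall back to stemming."""
    return LEMMAS.get(word, stem(word))
```

Rule-based stemming handles regular forms such as "playing", while the dictionary lookup is what lets lemmatization map an irregular form such as "went" to "go".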
Inverted Index
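An inverted index maps each term to the documents that contain it, so a query can be answered without scanning every document. The sketch below is a minimal in-memory version, assuming whitespace-separated terms; the sample documents and function names are made up for illustration.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def search(index, query):
    """Return IDs of documents containing every query term (AND semantics)."""
    postings = [set(index.get(t, [])) for t in query.lower().split()]
    return sorted(set.intersection(*postings)) if postings else []

docs = {
    1: "myanmar search engine",
    2: "search engine evolution",
    3: "myanmar syllable structure",
}
index = build_inverted_index(docs)
```

With these documents, `search(index, "myanmar engine")` intersects the posting lists for "myanmar" and "engine" and returns only document 1.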
Formula & Algorithm?
The weight of a term that occurs in all documents
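The slide does not name a formula. Assuming the classic tf-idf weighting, where the weight of a term in a document is its term frequency multiplied by log(N / df), a term that occurs in all N documents gets idf = log(N / N) = 0 and therefore a weight of zero. A small sketch under that assumption (the corpus and function name are illustrative):

```python
import math

def tf_idf(term, doc_tokens, corpus):
    """Classic tf-idf: raw term frequency times log(N / document frequency)."""
    tf = doc_tokens.count(term)
    df = sum(1 for doc in corpus if term in doc)
    if df == 0:
        return 0.0
    return tf * math.log(len(corpus) / df)

corpus = [
    ["search", "engine"],
    ["search", "index"],
    ["search", "crawler"],
]
weight_common = tf_idf("search", corpus[0], corpus)  # term in every document
weight_rare = tf_idf("engine", corpus[0], corpus)    # term in one document
```

Under this weighting, terms that appear everywhere carry no discriminating power, which is also the intuition behind removing stop words.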
Stop Words (English)
a, able, about, above, abroad, according, accordingly, across, actually, adj, after, afterwards, again, against, ago, ahead, ain't, all, allow, allows, almost, alone
Which stop words will be used in a Myanmar search engine?
NGram သ သသသသသ သ သသသသသသသသသသသသ သသသသ သသသသသေ�မမတယဉမ��သတ�ထ ေေ�မမင�န�င�န �လ��အ ညည�
ေ�မမတေ�တယဉ �ယဉမ��မမ�သသတ�တ�ထထမမင�ေ�မမင�န�င�န�င�န �ရနလန%�ည&လ��အ�အ ညည�
|ေ�မမ||ေ�တ||ယဉ �||မမ�||သ||တ�||ထ||ေ�မမင�||န�င�||ရန �||လ��||အ�||သည�|
ေ�မမတယဉ �ေ�တယဉမ��ယဉမ��သမမ�သတ�သတ�ထတ�ထမမင�ထမမင�န�င�ေ�မမင�န�င�န �န�င�နလန%�ည&ရနလန%�ည&အ�လ��အ ညည�
ေ�မမတယဉမ��ေ�တယဉမ��သယဉမ��သတ�မမ�သတ�ထသတ�ထမမင�တ�ထမမင�န�င�ထမမင�န�င�န �ေ�မမင�န�င�နလန%�ည&န�င�နလန%�ည&အ�ရနလန%�ည&အ ညည�
2 Gram |ေ�မမတ||ယဉမ��||သတ�||ေ�မမင�န�င�||ရနလန%�ည&||လ��အ�||အ ညည�|
3 Gram |ေ�မမတယဉ �||သတ�ထ||ေ�မမင�န�င�န �||လ��အ ညည�|
4 Gram |ေ�မမတယဉမ��|
ေ�မမတယဉမ��သေ�တယဉမ��သတ�ယဉမ��သတ�ထမမ�သတ�ထမမင�သတ�ထမမင�န�င�တ�ထမမင�န�င�န �ထမမင�န�င�နလန%�ည&ေ�မမင�န�င�နလန%�ည&အ�န�င�နလန%�ည&အ ညည�
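N-gram generation over a sequence of syllables can be sketched as below. The slides appear to use both overlapping windows and fixed-size groups, so both are shown; the placeholder syllables `s1` to `s5` and the function names are assumptions standing in for real Myanmar syllables.

```python
def ngrams(units, n):
    """Return every overlapping window of n consecutive units."""
    return [tuple(units[i:i + n]) for i in range(len(units) - n + 1)]

def chunks(units, n):
    """Return consecutive, non-overlapping groups of n units.
    The last group may be shorter than n."""
    return [tuple(units[i:i + n]) for i in range(0, len(units), n)]

syllables = ["s1", "s2", "s3", "s4", "s5"]
bigrams = ngrams(syllables, 2)
pairs = chunks(syllables, 2)
```

For a sequence of k units, `ngrams` yields k - n + 1 windows, and an empty list when n is larger than k.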
Myanmar Word Segmentation using Syllable level Longest Matching: Hla Hla Htay
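The general idea of syllable-level longest matching can be sketched as a greedy left-to-right scan. This is not the cited author's implementation: the romanized placeholder syllables, the tiny lexicon, and the `max_len` parameter are all assumptions made for illustration.

```python
def longest_match_segment(syllables, lexicon, max_len=4):
    """Greedy segmentation: at each position take the longest run of
    syllables that forms a lexicon entry; fall back to a single syllable."""
    words, i = [], 0
    while i < len(syllables):
        for size in range(min(max_len, len(syllables) - i), 0, -1):
            candidate = "".join(syllables[i:i + size])
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words

# Romanized placeholders standing in for Myanmar syllables.
syllables = ["myan", "mar", "sar", "kyi", "taik"]
lexicon = {"myanmar", "sar", "sarkyitaik"}
words = longest_match_segment(syllables, lexicon)
```

Because the scan always prefers the longest entry, "sarkyitaik" is chosen over the shorter entry "sar" that starts at the same position.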
Simple Myanmar Syllable Structure
C = Consonant, M = Medial, V = Vowel, K = Killer, D = Diacritic
C
C+M, C+M+V, C+M+V+K, C+M+V+K+D, C+M+V+D, C+M+K, C+M+K+D, C+M+D
C+V, C+V+K, C+V+K+D, C+V+D
C+K, C+K+D
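The patterns listed above, reading C, M, V, K and D as consonant, medial, vowel, killer and diacritic, can be checked with a simple lookup. The sketch assumes each character of a candidate syllable has already been tagged with one of those five letters; the tagging step and the function name are hypothetical.

```python
# The 15 patterns from the slide, written without the "+" separators.
SYLLABLE_PATTERNS = {
    "C",
    "CM", "CMV", "CMVK", "CMVKD", "CMVD", "CMK", "CMKD", "CMD",
    "CV", "CVK", "CVKD", "CVD",
    "CK", "CKD",
}

def is_valid_syllable(categories):
    """categories: sequence of one-letter tags, e.g. ["C", "M", "V"]."""
    return "".join(categories) in SYLLABLE_PATTERNS
```

A segmenter could use such a check to decide where one syllable ends and the next begins, since every listed pattern starts with a consonant.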
Language Specific Search Engine: Basic Architecture
– WWW
– Language specific crawler
– Crawler
– Language Identification
– Page repository
– Parser
– Indexer
– Corpus/Lexicon
– Query engine
– Ranking engine
– query / results
Pann Yu Mon, Management and Information System Engineering Department, Nagaoka University of Technology, Japan
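One simple way to approximate the language identification step is to measure how much of a page's text falls in the Myanmar Unicode block (U+1000 to U+109F). This is a rough sketch and not the method used in the cited work: it assumes pages are Unicode-encoded, and the 0.5 threshold and function names are hypothetical.

```python
def myanmar_ratio(text):
    """Fraction of non-whitespace characters in the Myanmar block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    in_block = sum(1 for c in chars if "\u1000" <= c <= "\u109f")
    return in_block / len(chars)

def is_myanmar_page(text, threshold=0.5):
    """Keep a page only if enough of its text is Myanmar script."""
    return myanmar_ratio(text) >= threshold
```

A language specific crawler could call `is_myanmar_page` on each fetched page and store only the pages that pass.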
Crawling Coverage
Crawling Parameters
Seed URLs: 35
Level of depth: 6
Crawling time: 2 weeks
CPU: 2.40 GHz
Memory: 1 GB
Connection: 100 Mbit per second
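How seed URLs and a depth limit bound a crawl can be sketched as a breadth-first traversal. The example runs on a small in-memory link graph instead of the real web; `get_links`, the toy graph, and the choice to count seeds as depth 0 are assumptions.

```python
from collections import deque

def crawl(seeds, get_links, max_depth=6):
    """Breadth-first crawl from seeds, following links up to max_depth hops."""
    seeds = list(dict.fromkeys(seeds))  # drop duplicate seeds, keep order
    seen = set(seeds)
    queue = deque((url, 0) for url in seeds)
    order = []
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth == max_depth:
            continue  # do not follow links beyond the depth limit
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return order

# Toy link graph standing in for real pages.
GRAPH = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["e"], "e": []}
visited = crawl(["a"], lambda url: GRAPH.get(url, []), max_depth=2)
```

With `max_depth=2`, page "e" is never reached because it sits three hops from the seed.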
Domains        Number of Pages Collected
.mm                  3,555  [  1.1%]
.com               276,554  [ 83.2%]
Other gTLDs         52,245  [ 15.7%]
Total              332,354  [100.0%]
10th July 2008