The Use of yig-cha and chos-kyi-rnam-grangs in Computing Lexical Cohesion for Tibetan Topic Boundary Detection
Paul G. Hackett
Columbia University
August 18, 2010
Tibetan IT Panel, IATS-12, Vancouver, BC
Introduction
A simple Tibetan IR system requires segmentation (n-gram, POS tagging, dictionary substring matching, etc.)
For finer-grained indexing, detection of large-scale structure and (sub-)topics is needed
Previous Research
In a previous paper (Hackett, 2000), we reported on automatic techniques developed for Tibetan for:
Word Segmentation,
Part-Of-Speech tagging, and
Sentence boundary detection
Exploiting Existing Features
Large-scale Structures:
Explicit & Recurring Text Titles
Chapter Boundaries
Topical Outlines (sa bcad)
Chapter Boundary Detection
Case Example:
'Gro-lung-pa's Bstan-rim-chen-mo.
Full title from Title Page:
bde bar gshegs pa'i bstan pa rin po che la 'jug pa'i lam gyi rim pa rnam par bshad pa bzhugs so
Chapter Boundary Detection
Combine Title “Key” Syllables:
bde | gshegs | bstan | rin | po | che | 'jug | lam | rim | rnam | bshad
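The slides do not say how the key syllables are selected from the full title. One way to reproduce the list is a small stop list of particles plus the closing formula "bzhugs so". The sketch below assumes exactly that; `STOP_SYLLABLES` and `title_key_syllables` are hypothetical names, not part of the original system.

```python
# Hypothetical stop list: grammatical particles and the closing formula
# "bzhugs so" that are dropped when reducing the full title to its keys.
STOP_SYLLABLES = {"bar", "pa'i", "pa", "la", "gyi", "par", "bzhugs", "so"}


def title_key_syllables(title):
    """Return the title's syllables in order, minus stop syllables and repeats."""
    keys = []
    for syllable in title.split():
        if syllable not in STOP_SYLLABLES and syllable not in keys:
            keys.append(syllable)
    return keys


# Full title as given on the title-page slide.
title = ("bde bar gshegs pa'i bstan pa rin po che la 'jug pa'i "
         "lam gyi rim pa rnam par bshad pa bzhugs so")
print(" | ".join(title_key_syllables(title)))
```

For this one title the stop list yields the same eleven syllables shown on the slide; a different title would probably need a longer list.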
Chapter Boundary Detection
With Chapter Colophon “Flags”:
de | ste | te | le’u | las
Chapter Boundary Detection
... and Ordinal Numbers:
((nyer|nyi shu|((sum|bzhi|lnga|drug|bdun|brgyad|dgu|brgya) (bcu|cu )?))?
(((rtsa|so|zhe|nga|re|don|gya|go) )?))?
(dang po|(((gcig|gnyis|gsum|bzhi|lnga|drug|bdun|brgyad|dgu|bcu|tham) )+(pa)?))
Chapter Boundary Detection
Yields Automatic Colophon Identification:
TITLE + FLAG + ORDINAL
bstan pa la 'jug pa'i rim pa rnam par bshad pa las dge ba'i bshes gnyen bsten pa la 'jug pa ste le'u dang po'o
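The TITLE + FLAG + ORDINAL combination can be sketched as a simple sentence-level test. The key syllables, flags and ordinal pattern below are copied from the slides. The function name, the whitespace syllabification, the `min_keys` threshold, and joining the three pattern lines with no separator are all assumptions made for illustration.

```python
import re

# Key syllables and flags as listed on the slides.
TITLE_KEYS = {"bde", "gshegs", "bstan", "rin", "po", "che",
              "'jug", "lam", "rim", "rnam", "bshad"}
FLAGS = {"de", "ste", "te", "le'u", "las"}

# The three pattern lines from the slide, concatenated with no separator
# (an assumption about how the slide's line breaks should be read).
ORDINAL = re.compile(
    r"((nyer|nyi shu|((sum|bzhi|lnga|drug|bdun|brgyad|dgu|brgya) (bcu|cu )?))?"
    r"(((rtsa|so|zhe|nga|re|don|gya|go) )?))?"
    r"(dang po|(((gcig|gnyis|gsum|bzhi|lnga|drug|bdun|brgyad|dgu|bcu|tham) )+(pa)?))"
)


def is_chapter_colophon(sentence, min_keys=3):
    """Return True if the sentence shows TITLE + FLAG + ORDINAL evidence.

    min_keys is a hypothetical threshold: how many distinct title key
    syllables must occur among the whitespace-separated syllables.
    """
    syllables = sentence.split()
    key_hits = TITLE_KEYS.intersection(syllables)
    has_flag = any(s in FLAGS for s in syllables)
    has_ordinal = ORDINAL.search(sentence) is not None
    return len(key_hits) >= min_keys and has_flag and has_ordinal


colophon = ("bstan pa la 'jug pa'i rim pa rnam par bshad pa las dge ba'i "
            "bshes gnyen bsten pa la 'jug pa ste le'u dang po'o")
title_page = ("bde bar gshegs pa'i bstan pa rin po che la 'jug pa'i "
              "lam gyi rim pa rnam par bshad pa bzhugs so")
print(is_chapter_colophon(colophon), is_chapter_colophon(title_page))
```

Requiring all three signals keeps the title page itself from being flagged: it contains many key syllables but no flag and no ordinal.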
Automatic Tagging of Large-scale Structures...
</SENTENCE>
</CHAPTER>
<CHAPTER_COLOPHON>
<SENTENCE Struct="S(NP(S,N),C247,VP(V5,SEC6),NP(S,N),NP(S,S,S,N),C5,NP(S,NEC6,S,N),NP(S,N),C247,VP(V5,N),RSP,N,NP(S,NUM)ETP)">
<PHRASE>bstan pa la 'jug pa'i rim pa rnam par bshad pa las</PHRASE>
<PHRASE>dge ba'i bshes gnyen bsten pa la 'jug pa ste le'u dang po'o</PHRASE>
</SENTENCE>
</CHAPTER_COLOPHON>
<CHAPTER n="2">
<SENTENCE Struct="...
Approaches to Topic Boundary Detection
Previous research explored three approaches:
Statistical Methods
Conceptual Hierarchies
Exploiting lexical resources
Lexical Cohesion Method
Kozima (1993) put forth a method for calculating the Lexical Cohesion Profile (LCP) of English-language texts by:
Building a weighted co-occurrence database of words from the Longman Dictionary of Contemporary English
Performing a co-occurrence analysis over the text using a sliding Hanning window
LCP Method for Tibetan
No resource comparable to the Longman Dictionary exists (the Tshig-mdzod-chen-mo is too uneven)
There are, however, two highly specialized genres of literature:
Chos-kyi-rnam-grangs (“Enumerations of Phenomena”)
Yig-cha (“Monastic Textbooks”)
Chos-kyi-rnam-grangs
Sample Entry:
‘dus byas kyi mtshan nyid bzhi:
skye ba’i mtshan nyid rga ba’i mtshan nyid gnas pa’i mtshan nyid mi rtag pa’i mtshan nyid do
Chos-kyi-rnam-grangs (stemmed & segmented)
Sample Entry:
‘dus_byas kyi mtshan_nyid bzhi:
skye_ba mtshan_nyid rga_ba mtshan_nyid gnas_pa mtshan_nyid mi_rtag_pa mtshan_nyid do
Yig-cha
Sample Entry:
yid dpyod:
rang gi ‘jug yul gyi gtso bor gyur pa’i chos la ‘tha‘ gcig tu zhen kyang bcad don ma thob pa’i rig pa
Yig-cha (stemmed & segmented)
Sample Entry:
yid_dpyod:
rang_gi_‘jug_yul gyi gtso_bo gyur_pa chos la ‘tha‘_gcig tu zhen kyang bcad_don ma thob_pa rig_pa
Calculate TFIDF
Term Frequency (TF) per entry, times Inverse Document Frequency (IDF) over the entire lexicon:
tfidf = log(1 + log(tf)) * log(N / df)
Weighted & Normalized TFIDF
For example:
yid_dpyod (0.222390700)
rang_gi_‘jug_yul (0.166793025)
‘gyur_pa (0.008339651)
chos (0.011119535)
‘tha‘_gcig (0.166793025)
etc ...
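A minimal sketch of the weighting step, using the two sample entries from the slides as a toy lexicon (apostrophes normalized to ASCII). Note that the formula as written on the slide gives a weight of zero whenever tf = 1, because log(1 + log 1) = 0. For that reason the sketch also includes the common sublinear variant (1 + log(tf)) * log(N / df). Which form the original system used is not stated, and scaling each entry to sum to 1 is an assumption about what "normalized" means here.

```python
import math
from collections import Counter


def tfidf_as_on_slide(tf, df, n_entries):
    """log(1 + log(tf)) * log(N / df), as written on the slide."""
    return math.log(1 + math.log(tf)) * math.log(n_entries / df)


def tfidf_sublinear(tf, df, n_entries):
    """Common sublinear variant (1 + log(tf)) * log(N / df); an assumption."""
    return (1 + math.log(tf)) * math.log(n_entries / df)


def normalized_weights(lexicon, weight_fn=tfidf_sublinear):
    """Map each headword to {term: weight}, scaled so each entry sums to 1."""
    n_entries = len(lexicon)
    df = Counter()
    for terms in lexicon.values():
        df.update(set(terms))  # document frequency: entries containing the term
    result = {}
    for headword, terms in lexicon.items():
        raw = {t: weight_fn(c, df[t], n_entries)
               for t, c in Counter(terms).items()}
        total = sum(raw.values())
        result[headword] = {t: (w / total if total else 0.0)
                            for t, w in raw.items()}
    return result


# Toy lexicon: the two stemmed and segmented sample entries from the slides.
lexicon = {
    "yid_dpyod": ("rang_gi_'jug_yul gyi gtso_bo gyur_pa chos la 'tha'_gcig tu "
                  "zhen kyang bcad_don ma thob_pa rig_pa").split(),
    "'dus_byas kyi mtshan_nyid bzhi": ("skye_ba mtshan_nyid rga_ba mtshan_nyid "
                                       "gnas_pa mtshan_nyid mi_rtag_pa "
                                       "mtshan_nyid do").split(),
}
weights = normalized_weights(lexicon)
```

With only two entries every term has df = 1, so the weights here reflect term frequency alone; the IDF factor only becomes informative over a full lexicon.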
Hanning Weights
Rectangular Window:
For window width N: w(n) = 1 for 0 < n ≤ N; else w(n) = 0
Hanning Window:
For window width N: w(n) = 1 - cos(2πn/(N - 1))
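The two windows, and one way they could feed a cohesion profile, can be sketched as follows. The Hanning weights read the slide's denominator as N - 1 and index n from 0 to N - 1; both are assumptions. `cohesion_profile` is a simplified stand-in that averages pairwise similarity under the window weights. It is not Kozima's procedure and not necessarily what the original system computed, and `similarity` is a hypothetical lookup into a word-pair weight table.

```python
import math


def rectangular_window(width):
    """w(n) = 1 for every position inside the window."""
    return [1.0] * width


def hanning_window(width):
    """w(n) = 1 - cos(2*pi*n / (N - 1)) for n = 0 .. N - 1.

    The textbook Hann window carries an extra factor of 0.5, which only
    rescales the weights.
    """
    if width < 2:
        return [1.0] * width
    return [1.0 - math.cos(2.0 * math.pi * n / (width - 1))
            for n in range(width)]


def cohesion_profile(words, similarity, width, window=hanning_window):
    """Window-weighted mean pairwise similarity at each window position."""
    weights = window(width)
    profile = []
    for start in range(len(words) - width + 1):
        chunk = words[start:start + width]
        num = den = 0.0
        for i in range(width):
            for j in range(i + 1, width):
                pair_weight = weights[i] * weights[j]
                num += pair_weight * similarity(chunk[i], chunk[j])
                den += pair_weight
        profile.append(num / den if den else 0.0)
    return profile


# Toy text with one topic shift in the middle; identical words count as
# fully similar, different words as unrelated.
words = ["a"] * 6 + ["b"] * 6
profile = cohesion_profile(words, lambda x, y: 1.0 if x == y else 0.0, width=6)
boundary = profile.index(min(profile))
```

In this toy case the profile dips lowest at the window that straddles the shift, which is the kind of valley a boundary detector would look for.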
Evaluation Metric: Known Item Retrieval
Identify e-texts that have varied and rich vocabulary with known topic boundaries. Two test candidates, one canonical and one non-canonical:
• Śāntideva’s Bodhicaryāvatāra (10 chapters; 26,887 syll.; 18,129 words)
• Tsong-kha-pa’s Legs-bshad-snying-po (no chap. boundaries; 69,176 syll.; 42,956 words)
LCP for Śāntideva: Chos-kyi-rnam-grangs
LCP for Śāntideva: Yig-cha definitions
LCP for Prajñākaramati: Yig-cha definitions
LCP for Tsong-kha-pa: Chos-kyi-rnam-grangs
LCP for Tsong-kha-pa: Yig-cha definitions (three plots)
Analysis
Immediate Observations:
1. Topic boundary detection is feasible
2. Chapter boundaries are best, and easily, captured by non-CL methods
3. Chos-kyi-rnam-grangs fail, likely because they are “unnatural” lists
Applications
1. Fine-grained indexing of texts based on individual sub-topics
2. Topic identification can be deployed for translation-equivalent disambiguation
3. Content analysis and automatic topic-outline generation are easily done
Future Work
Expand the lexical cohesion database with additional / alternate definitions
Add domain tags to lexical pairs
Incorporate domain tags in XML-tagged documents for gisting/translation
fin.