Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches
![Page 1: Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches](https://reader035.fdocuments.net/reader035/viewer/2022081507/56815b37550346895dc90a12/html5/thumbnails/1.jpg)
Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches
黃瀚萱2008
![Page 2: Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches](https://reader035.fdocuments.net/reader035/viewer/2022081507/56815b37550346895dc90a12/html5/thumbnails/2.jpg)
Outline Motivation and Goals Related Work System Design
HMMs & CRFs Experiments Conclusion
2/38
![Page 3: Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches](https://reader035.fdocuments.net/reader035/viewer/2022081507/56815b37550346895dc90a12/html5/thumbnails/3.jpg)
3/38
![Page 4: Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches](https://reader035.fdocuments.net/reader035/viewer/2022081507/56815b37550346895dc90a12/html5/thumbnails/4.jpg)
4/38
![Page 5: Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches](https://reader035.fdocuments.net/reader035/viewer/2022081507/56815b37550346895dc90a12/html5/thumbnails/5.jpg)
CCSS: Classical Chinese Sentence Segmentation
Almost all pre-20th-century Chinese is written without any punctuation marks.
Nothing separates words from words, phrases from phrases, or sentences from sentences.
Explicit boundaries of sentences and clauses are lacking.
Readers have to identify these boundaries manually while reading.
5/38
![Page 6: Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches](https://reader035.fdocuments.net/reader035/viewer/2022081507/56815b37550346895dc90a12/html5/thumbnails/6.jpg)
Example of CCSS from Zhuangzi
Unsegmented:
北冥有魚其名為鯤鯤之大不知其幾千里也化而為鳥其名為鵬鵬之背不知其幾千里也怒而飛其翼若垂天之雲是鳥也海運則將徙於南冥南冥者天池也
Segmented:
北冥有魚.其名為鯤.鯤之大.不知其幾千里也.化而為鳥.其名為鵬.鵬之背.不知其幾千里也.怒而飛.其翼若垂天之雲.是鳥也.海運則將徙於南冥.南冥者.天池也.
6/38
![Page 7: Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches](https://reader035.fdocuments.net/reader035/viewer/2022081507/56815b37550346895dc90a12/html5/thumbnails/7.jpg)
Challenges
CCSS is not a trivial problem; it is inherently ambiguous. The same text can be segmented in more than one way:
道 / 可道 / 非常道 / 名 / 可名 / 非常名.
道可道 / 非常道 / 名可名 / 非常名.
It is difficult to construct a set of rules or practical procedures for CCSS.
Readers perform CCSS instinctively. They rely on their experience and sense of the language rather than on a systematic procedure.
7/38
![Page 8: Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches](https://reader035.fdocuments.net/reader035/viewer/2022081507/56815b37550346895dc90a12/html5/thumbnails/8.jpg)
Automated CCSS
Innumerable Classical Chinese documents from centuries of Chinese history remain to be segmented.
To aid in processing these documents, an automated CCSS system is proposed.
It should allow segmentation tasks to be completed quickly and accurately.
8/38
![Page 9: Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches](https://reader035.fdocuments.net/reader035/viewer/2022081507/56815b37550346895dc90a12/html5/thumbnails/9.jpg)
Research Goals
Evaluation metrics
Datasets: training data and benchmarking
Statistical segmenters
9/38
![Page 10: Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches](https://reader035.fdocuments.net/reader035/viewer/2022081507/56815b37550346895dc90a12/html5/thumbnails/10.jpg)
Outline Motivation and Goals Related Work System Design
HMMs & CRFs Experiments Conclusion
10/38
![Page 11: Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches](https://reader035.fdocuments.net/reader035/viewer/2022081507/56815b37550346895dc90a12/html5/thumbnails/11.jpg)
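Related Areas
[Figure: CCSS and its related areas: NLP, machine learning, linguistics, classifiers, chunking, tagging, SBD (sentence boundary detection), and CWS (Chinese word segmentation)]
11/38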
![Page 12: Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches](https://reader035.fdocuments.net/reader035/viewer/2022081507/56815b37550346895dc90a12/html5/thumbnails/12.jpg)
Useful Chinese Features
Chinese characters: 則 and 而 usually appear at the head of a sentence; 也 and 矣 usually appear at the tail.
Phonology: 反切 (fanqie spellings), 平仄 (level and oblique tones), 擬音 (reconstructed pronunciations).
POS: verbs, nouns, adjectives, adverbs, etc.
Antithesis and couplets: 道 / 可道 / 非常道 / 名 / 可名 / 非常名
12/38
![Page 13: Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches](https://reader035.fdocuments.net/reader035/viewer/2022081507/56815b37550346895dc90a12/html5/thumbnails/13.jpg)
Sentence Boundary Detection
Distinguish periods used as end-of-sentence indicators from other usages:
Parts of abbreviations, e.g. Dr. Wang.
Decimal points, e.g. 1.618.
Ellipses, e.g. “I don’t know…”
Metrics for segmentation: F-measure and NIST-SU error rate.
SBD in speech.
13/38
![Page 14: Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches](https://reader035.fdocuments.net/reader035/viewer/2022081507/56815b37550346895dc90a12/html5/thumbnails/14.jpg)
Chinese Word Segmentation
Identify the boundaries of the words in a given text.
A non-trivial problem because of ambiguity:
日 文章 魚 怎麼 說
日文 章魚 怎麼 說
Words can be handled with a dictionary.
Segmentation by character tagging [Xue, 2003]:
日 文 章 魚 怎 麼 說
LL RR LL RR LL RR LR
14/38
![Page 15: Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches](https://reader035.fdocuments.net/reader035/viewer/2022081507/56815b37550346895dc90a12/html5/thumbnails/15.jpg)
Part-of-Speech Tagging
Tag the words of a sentence with their word classes.
The[AT] representative[NN] put[VBD] chairs[NNS] on[IN] the[AT] table[NN].
Tagging can also assign information other than word class, e.g. position-of-character tagging.
Classical Chinese POS tagging [Huang et al., 2002] focused on sentence-segmented text.
15/38
![Page 16: Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches](https://reader035.fdocuments.net/reader035/viewer/2022081507/56815b37550346895dc90a12/html5/thumbnails/16.jpg)
Outline Motivation and Goals Related Work System Design
HMMs & CRFs Experiments Conclusion
16/38
![Page 17: Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches](https://reader035.fdocuments.net/reader035/viewer/2022081507/56815b37550346895dc90a12/html5/thumbnails/17.jpg)
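CCSS Framework
[Figure: the dataset is split into training data and test data; training produces the segmentation model; testing the model on the test data gives the testing outcome, which the evaluation metrics turn into a performance measurement]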
17/38
![Page 18: Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches](https://reader035.fdocuments.net/reader035/viewer/2022081507/56815b37550346895dc90a12/html5/thumbnails/18.jpg)
Sequential Data
Transform the sentence segmentation task into a character labeling task.
Tag each character with one of four position-of-character tags:
Left boundary (LL)
Middle character (MM)
Right boundary (RR)
Single-character clause (LR)
北冥有魚 / 其名為鯤 / 鯤之大 / 不知其幾千里也
18/38
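The transformation above can be sketched in a few lines of Python. This is a minimal illustration, not the original system's code: the function names `clauses_to_tags` and `tags_to_clauses` are hypothetical, and the input is assumed to be a list of already segmented clauses.

```python
def clauses_to_tags(clauses):
    """Label every character of every clause with LL, MM, RR, or LR."""
    pairs = []
    for clause in clauses:
        if len(clause) == 1:
            # A clause made of a single character gets the LR tag.
            pairs.append((clause, "LR"))
            continue
        last = len(clause) - 1
        for i, ch in enumerate(clause):
            if i == 0:
                tag = "LL"   # left boundary
            elif i == last:
                tag = "RR"   # right boundary
            else:
                tag = "MM"   # middle character
            pairs.append((ch, tag))
    return pairs


def tags_to_clauses(pairs):
    """Rebuild clauses by closing a clause after every RR or LR tag."""
    clauses, current = [], ""
    for ch, tag in pairs:
        current += ch
        if tag in ("RR", "LR"):
            clauses.append(current)
            current = ""
    if current:
        # Keep trailing characters if the tag sequence ends mid-clause.
        clauses.append(current)
    return clauses
```

Because the two functions are inverses on well-formed input, a tagger's output can be turned back into segmented text with `tags_to_clauses`.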
![Page 19: Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches](https://reader035.fdocuments.net/reader035/viewer/2022081507/56815b37550346895dc90a12/html5/thumbnails/19.jpg)
Markov Chain for CCSS
[Figure: Markov chain over the states Start, LL, MM, RR, LR, and Finish, annotated with the example characters 北, 冥, 有, and 魚]
19/38
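The edges of the chain did not survive in the transcript, so the transition table below is an assumption inferred from the meaning of the four tags (for example, LL must be followed by MM or RR, and a clause can only end on RR or LR). The names `ALLOWED` and `is_valid_path` are hypothetical.

```python
# Assumed transitions, derived from the tag definitions rather than the slide.
ALLOWED = {
    "START": {"LL", "LR"},
    "LL": {"MM", "RR"},
    "MM": {"MM", "RR"},
    "RR": {"LL", "LR", "FINISH"},
    "LR": {"LL", "LR", "FINISH"},
}


def is_valid_path(tags):
    """Return True if the tag sequence walks from START to FINISH legally."""
    state = "START"
    for tag in list(tags) + ["FINISH"]:
        if tag not in ALLOWED.get(state, set()):
            return False
        state = tag
    return True
```

A check like this can be used to reject ill-formed tag sequences, such as one that opens a clause with LL and never closes it.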
![Page 20: Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches](https://reader035.fdocuments.net/reader035/viewer/2022081507/56815b37550346895dc90a12/html5/thumbnails/20.jpg)
Sequence Labeling Models Hidden Markov Models Maximum Entropy Conditional Random Fields [Lafferty, 2001]
With Averaged Perceptron [Collins, 2002] Large margin methods
Support Vector Machine AdaBoost
20/38
![Page 21: Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches](https://reader035.fdocuments.net/reader035/viewer/2022081507/56815b37550346895dc90a12/html5/thumbnails/21.jpg)
Conditional Random Fields
The model: P( “LL, MM, MM, RR” | “北冥有魚” )

$$P(y \mid x) = \frac{1}{Z_x} \exp\left( \sum_{t=1}^{T} \sum_{k} \lambda_k f_k(y_{t-1}, y_t, x, t) \right)$$

Tagging x with the Viterbi algorithm:

$$y^* = \arg\max_y P(y \mid x)$$

λ is estimated by the averaged perceptron algorithm.
21/38
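Since Z_x does not depend on y, the argmax can be found by maximizing the summed scores alone. The sketch below shows Viterbi decoding over the four tags under that observation. It is an illustration, not the original implementation: the `score` callback is a hypothetical stand-in for the weighted feature sum at position t, and it receives `None` as the previous tag at the first position.

```python
TAGS = ["LL", "MM", "RR", "LR"]


def viterbi(chars, score):
    """Return the highest-scoring tag sequence for chars.

    score(prev_tag, tag, chars, t) is assumed to return the sum of
    lambda_k * f_k(prev_tag, tag, chars, t) over all features k.
    """
    if not chars:
        return []
    # best[t][tag] = best total score of any path ending in tag at position t
    best = [{tag: score(None, tag, chars, 0) for tag in TAGS}]
    back = []
    for t in range(1, len(chars)):
        column, pointers = {}, {}
        for tag in TAGS:
            prev = max(TAGS, key=lambda p: best[-1][p] + score(p, tag, chars, t))
            column[tag] = best[-1][prev] + score(prev, tag, chars, t)
            pointers[tag] = prev
        best.append(column)
        back.append(pointers)
    # Follow the back-pointers from the best final tag.
    tag = max(TAGS, key=lambda p: best[-1][p])
    path = [tag]
    for pointers in reversed(back):
        tag = pointers[tag]
        path.append(tag)
    return path[::-1]
```

The running time is linear in the number of characters and quadratic in the number of tags, which is small here because there are only four tags.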
![Page 22: Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches](https://reader035.fdocuments.net/reader035/viewer/2022081507/56815b37550346895dc90a12/html5/thumbnails/22.jpg)
Datasets
Focused on the corpora of the Pre-Qin and Han Dynasties (先秦兩漢):
The foundation of later Chinese.
Simpler syntax.
Shorter sentences.
Words are largely composed of a single character.
Qing Palace Memorials (奏摺)
22/38
![Page 23: Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches](https://reader035.fdocuments.net/reader035/viewer/2022081507/56815b37550346895dc90a12/html5/thumbnails/23.jpg)
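Dataset Statistics

| Dataset | Paragraphs | Characters | Distinct characters | Clauses |
|---|---|---|---|---|
| 論語 | 500 | 15982 | 1368 | 4015 |
| 孟子 | 260 | 35392 | 1916 | 7351 |
| 莊子 | 1128 | 65165 | 2936 | 12574 |
| 春秋左傳 | 3381 | 195983 | 3238 | 47281 |
| 春秋公羊傳 | 1804 | 44352 | 1638 | 11151 |
| 春秋穀梁傳 | 1801 | 40711 | 1585 | 10946 |
| 史記 | 4778 | 503890 | 4788 | 99792 |
| 上古漢語混合 (ancient Chinese mix) | 1250 | 97476 | 3489 | 20573 |
| 清代奏摺 (Qing palace memorials) | 1000 | 111739 | 3147 | 15521 |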
23/38
![Page 24: Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches](https://reader035.fdocuments.net/reader035/viewer/2022081507/56815b37550346895dc90a12/html5/thumbnails/24.jpg)
Evaluation Metrics
Specificity (1 – fallout): the proportion of true non-boundary positions that are correctly left unsegmented (the true negative rate).
F-measure (F1): the harmonic mean of precision and recall.
NIST-SU error rate: the ratio of wrongly segmented boundaries to reference boundaries. It can exceed 100% when the segmentation is badly wrong.
24/38
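The three metrics can be computed from two sets of boundary positions. The sketch below is an illustration under stated assumptions: "wrongly segmented boundaries" is taken to mean false alarms plus misses, boundaries are represented as integer positions, and `n_positions` (the number of candidate boundary positions) and the function name `boundary_metrics` are hypothetical.

```python
def boundary_metrics(reference, predicted, n_positions):
    """Compute specificity, F1, and NIST-SU error rate for one segmentation."""
    ref, hyp = set(reference), set(predicted)
    tp = len(ref & hyp)                 # boundaries found correctly
    fp = len(hyp - ref)                 # false alarms
    fn = len(ref - hyp)                 # misses
    tn = n_positions - tp - fp - fn     # non-boundaries left alone
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    specificity = tn / (tn + fp) if tn + fp else 0.0
    # Assumed definition: (false alarms + misses) / reference boundaries.
    nist_su = (fp + fn) / len(ref) if ref else 0.0
    return {"specificity": specificity, "f1": f1, "nist_su": nist_su}
```

With reference boundaries {3, 7}, predicted boundaries {3, 5, 6, 8}, and 10 candidate positions, there are 3 false alarms and 1 miss against 2 reference boundaries, so the NIST-SU error rate is 200%, which shows how the rate can exceed 100%.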
![Page 25: Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches](https://reader035.fdocuments.net/reader035/viewer/2022081507/56815b37550346895dc90a12/html5/thumbnails/25.jpg)
Outline Motivation and Goals Related Work System Design
HMMs & CRFs Experiments Conclusion
25/38
![Page 26: Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches](https://reader035.fdocuments.net/reader035/viewer/2022081507/56815b37550346895dc90a12/html5/thumbnails/26.jpg)
Experiment Design
Experiment 1: Evaluate the performance of HMMs and CRFs with 10-fold cross-validation.
Experiment 2: Find the best training data from ancient Chinese. Train the system on one dataset and test it on the others.
Experiment 3: Cross-era evaluation. Train the system on the data from the Qing and test it on the data from the Pre-Qin and Han Dynasties, and vice versa. Can the ancient inform the recent (以古鑑近)? Can the recent inform the ancient (以近鑑古)?
26/38
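For Experiment 1, a 10-fold split can be sketched as follows. The slides do not say how the folds were formed, so splitting at the paragraph level without shuffling is an assumption, and the name `k_fold_splits` is hypothetical.

```python
def k_fold_splits(paragraphs, k=10):
    """Yield (train, test) pairs; each paragraph is tested exactly once."""
    # Deal the paragraphs into k folds in round-robin order.
    folds = [paragraphs[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        yield train, test
```

Each model would be trained on `train` and scored on `test`, and the per-fold scores averaged to obtain one figure per dataset.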
![Page 27: Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches](https://reader035.fdocuments.net/reader035/viewer/2022081507/56815b37550346895dc90a12/html5/thumbnails/27.jpg)
Result: Experiment 1

| Dataset | HMMs Specificity | HMMs F1 | HMMs NIST-SU | CRFs Specificity | CRFs F1 | CRFs NIST-SU |
|---|---|---|---|---|---|---|
| 論語 | 93.01% | 73.84% | 49.84% | 93.23% | 78.52% | 42.63% |
| 孟子 | 94.49% | 70.86% | 54.35% | 91.95% | 75.08% | 52.90% |
| 莊子 | 93.72% | 70.48% | 57.25% | 94.13% | 76.37% | 48.20% |
| 春秋左傳 | 94.57% | 83.55% | 33.70% | 95.16% | 88.25% | 25.60% |
| 春秋公羊傳 | 95.78% | 88.52% | 24.22% | 97.83% | 93.60% | 13.53% |
| 春秋穀梁傳 | 95.37% | 86.92% | 27.10% | 97.23% | 92.12% | 16.19% |
| 史記 | 91.68% | 60.87% | 75.60% | 92.02% | 72.69% | 58.02% |
| 清代奏摺 | 96.70% | 73.19% | 50.68% | 98.68% | 78.54% | 35.24% |
| 上古漢語混合 | 93.00% | 69.30% | 59.05% | 90.56% | 73.44% | 58.93% |
| Overall | 94.26% | 75.28% | 47.98% | 94.53% | 80.96% | 39.03% |

Evaluation with 10-fold cross-validation. CRFs trained with 5 generations of the averaged perceptron and 100K feature functions.
27/38
![Page 28: Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches](https://reader035.fdocuments.net/reader035/viewer/2022081507/56815b37550346895dc90a12/html5/thumbnails/28.jpg)
Result: Experiment 2

| Training Data | HMMs Specificity | HMMs F1 | HMMs NIST-SU | CRFs Specificity | CRFs F1 | CRFs NIST-SU |
|---|---|---|---|---|---|---|
| 論語 | 90.36% | 57.61% | 80.96% | 86.80% | 58.93% | 88.16% |
| 孟子 | 92.57% | 59.10% | 72.99% | 89.12% | 63.90% | 75.12% |
| 莊子 | 92.77% | 60.68% | 70.50% | 91.00% | 65.08% | 68.97% |
| 春秋左傳 | 90.32% | 63.57% | 73.77% | 87.37% | 67.00% | 75.93% |
| 史記 | 93.56% | 68.86% | 57.97% | 93.23% | 74.11% | 51.19% |
| Overall | 91.92% | 61.96% | 71.24% | 89.50% | 65.80% | 71.87% |

CRFs trained with 5 generations of the averaged perceptron and 100K feature functions.
28/38
![Page 29: Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches](https://reader035.fdocuments.net/reader035/viewer/2022081507/56815b37550346895dc90a12/html5/thumbnails/29.jpg)
Experiment 3a: 以古鑑近 (using the ancient to judge the recent)
[Figure: HMMs/CRFs are trained on 論語, 孟子, 莊子, 左傳, and 史記 (training data) and validated on 清代奏摺 (test data)]
29/38
![Page 30: Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches](https://reader035.fdocuments.net/reader035/viewer/2022081507/56815b37550346895dc90a12/html5/thumbnails/30.jpg)
Result: Experiment 3a

| Training Data | HMMs Specificity | HMMs F1 | HMMs NIST-SU | CRFs Specificity | CRFs F1 | CRFs NIST-SU |
|---|---|---|---|---|---|---|
| 論語 | 90.40% | 41.05% | 123.42% | 79.17% | 37.26% | 195.26% |
| 孟子 | 91.20% | 42.00% | 119.78% | 79.03% | 37.22% | 194.35% |
| 莊子 | 90.70% | 42.50% | 120.34% | 83.13% | 40.75% | 166.21% |
| 春秋左傳 | 84.31% | 34.36% | 168.87% | 76.16% | 37.66% | 209.51% |
| 史記 | 88.00% | 40.35% | 138.59% | 83.24% | 41.82% | 165.20% |
| 上古漢語混合 | 87.06% | 38.85% | 146.95% | 80.88% | 38.77% | 182.38% |
| Overall | 88.61% | 39.85% | 136.33% | 80.27% | 38.91% | 185.49% |
30/38
![Page 31: Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches](https://reader035.fdocuments.net/reader035/viewer/2022081507/56815b37550346895dc90a12/html5/thumbnails/31.jpg)
Experiment 3b: 以近鑑古 (using the recent to judge the ancient)
[Figure: HMMs/CRFs are trained on 清代奏摺 (training data) and validated on 論語, 孟子, 莊子, 左傳, and 史記 (test data)]
31/38
![Page 32: Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches](https://reader035.fdocuments.net/reader035/viewer/2022081507/56815b37550346895dc90a12/html5/thumbnails/32.jpg)
Result: Experiment 3b

| Test Data | HMMs Specificity | HMMs F1 | HMMs NIST-SU | CRFs Specificity | CRFs F1 | CRFs NIST-SU |
|---|---|---|---|---|---|---|
| 論語 | 91.32% | 47.15% | 87.23% | 94.67% | 43.12% | 83.79% |
| 孟子 | 91.55% | 46.95% | 90.74% | 95.01% | 42.13% | 86.50% |
| 莊子 | 91.31% | 48.35% | 91.24% | 94.63% | 46.52% | 83.46% |
| 春秋左傳 | 95.16% | 50.41% | 73.65% | 97.73% | 49.01% | 68.96% |
| 史記 | 93.18% | 37.73% | 97.01% | 96.49% | 34.35% | 88.89% |
| 上古漢語混合 | 92.57% | 46.59% | 87.16% | 95.76% | 42.72% | 82.50% |
| Overall | 92.52% | 46.20% | 87.84% | 95.72% | 42.98% | 82.35% |
32/38
![Page 33: Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches](https://reader035.fdocuments.net/reader035/viewer/2022081507/56815b37550346895dc90a12/html5/thumbnails/33.jpg)
Outline Motivation and Goals Related Work System Design
HMMs & CRFs Experiments Conclusion
33/38
![Page 34: Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches](https://reader035.fdocuments.net/reader035/viewer/2022081507/56815b37550346895dc90a12/html5/thumbnails/34.jpg)
Overall
Built an automated CCSS system.
Completed three tasks while developing the system:
A set of evaluation metrics.
A set of datasets.
Two segmentation models: HMMs and CRFs.
34/38
![Page 35: Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches](https://reader035.fdocuments.net/reader035/viewer/2022081507/56815b37550346895dc90a12/html5/thumbnails/35.jpg)
Datasets
Evaluated some classics from the 5th century BCE to the 19th century, including 論語, 孟子, 莊子, 左傳, and 史記.
The system maintains its performance on test data that differs from the training data, provided the eras in which the two were written are not too far apart.
史記 is the best dataset for training.
35/38
![Page 36: Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches](https://reader035.fdocuments.net/reader035/viewer/2022081507/56815b37550346895dc90a12/html5/thumbnails/36.jpg)
Segmentation Models
Overall performance

| Model | Specificity | F-measure | NIST-SU error rate |
|---|---|---|---|
| HMMs | 94.26% | 75.28% | 47.98% |
| CRFs | 94.53% | 80.96% | 39.03% |

Model comparison

| Model | Correctness | Training time | Run time | Sensitivity to training data |
|---|---|---|---|---|
| HMMs | Average | Fast | Fast | Insignificant |
| CRFs | Better | Slow | Fast | Sensitive |
36/38
![Page 37: Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches](https://reader035.fdocuments.net/reader035/viewer/2022081507/56815b37550346895dc90a12/html5/thumbnails/37.jpg)
Future Work
Apply more Chinese features: phonology, POS, and antithesis.
Integrate pre-defined rules: names, places, dates, numbers.
Mix several datasets to obtain a more general, robust dataset.
37/38
![Page 38: Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches](https://reader035.fdocuments.net/reader035/viewer/2022081507/56815b37550346895dc90a12/html5/thumbnails/38.jpg)
THE END
38/38
![Page 39: Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches](https://reader035.fdocuments.net/reader035/viewer/2022081507/56815b37550346895dc90a12/html5/thumbnails/39.jpg)
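Conditional Random Fields
The model: P( “LL, MM, MM, RR” | “北冥有魚” )

$$P(y \mid x) = \frac{1}{Z_x} \exp\left( \sum_{t=1}^{T} \sum_{k} \lambda_k f_k(y_{t-1}, y_t, x, t) \right)$$

Feature functions: f(y_{t-1}, y_t, x, t)
f(MM, RR, ‘曰’) = 1: ‘曰’ appears at the end of a sentence.
f(RR, ‘孟’) = 0: ‘孟’ never appears at the end of a sentence.
λ_k determines the importance of f_k; the λ_k are learned from the data.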
39/38
![Page 40: Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches](https://reader035.fdocuments.net/reader035/viewer/2022081507/56815b37550346895dc90a12/html5/thumbnails/40.jpg)
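Conditional Random Fields (cont.)
Labeling x:

$$y^* = \arg\max_y P(y \mid x)$$

Implemented with the Viterbi algorithm.
Parameter estimation (determining the values of λ):
Convergence is guaranteed, but there is no closed-form solution, so the optimum must be approached iteratively.
Traditional numerical methods such as GIS, IIS, and L-BFGS are computationally expensive, slow, and hard to implement.
The averaged perceptron is used in place of these traditional numerical methods.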
40/38
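The averaged perceptron idea can be sketched as below. This is a generic, simplified version in the spirit of [Collins, 2002], not the original system's code: `decode` and `featurize` are hypothetical callbacks supplied by the caller (for instance a Viterbi decoder and a feature extractor), and the averaging is done naively by summing the weights after every example.

```python
from collections import Counter, defaultdict


def averaged_perceptron(examples, decode, featurize, epochs=5):
    """Train feature weights and return their average over all steps.

    examples:  list of (x, gold_tags) pairs
    decode:    decode(x, weights) -> predicted tag list (assumed callback)
    featurize: featurize(x, tags) -> list of hashable features (assumed)
    """
    weights = defaultdict(float)
    totals = defaultdict(float)
    steps = 0
    for _ in range(epochs):
        for x, gold in examples:
            pred = decode(x, weights)
            if pred != gold:
                # Reward features of the gold labeling, penalize the prediction's.
                delta = Counter(featurize(x, gold))
                delta.subtract(featurize(x, pred))
                for feat, change in delta.items():
                    weights[feat] += change
            steps += 1
            # Accumulate the current weights so they can be averaged at the end.
            for feat, value in weights.items():
                totals[feat] += value
    if steps == 0:
        return {}
    return {feat: total / steps for feat, total in totals.items()}
```

Averaging damps the influence of late, noisy updates. Production implementations usually track timestamps per feature instead of re-summing every weight at every step, which is much faster but gives the same result.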
![Page 41: Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches](https://reader035.fdocuments.net/reader035/viewer/2022081507/56815b37550346895dc90a12/html5/thumbnails/41.jpg)
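Feature Templates

| Template | Example |
|---|---|
| y_{i-1}, y_i | LL, MM |
| y_i, w_i | 北冥有魚其名為鯤 |
| w_{i-2}, y_i | 北冥有魚其名為鯤 |
| w_{i-1}, y_i | 北冥有魚其名為鯤 |
| w_{i+1}, y_i | 北冥有魚其名為鯤 |
| w_{i+2}, y_i | 北冥有魚其名為鯤 |
| w_{i-2}, w_{i-1}, y_i | 北冥有魚其名為鯤 |
| w_{i-1}, w_i, y_i | 北冥有魚其名為鯤 |
| w_i, w_{i+1}, y_i | 北冥有魚其名為鯤 |
| w_{i+1}, w_{i+2}, y_i | 北冥有魚其名為鯤 |

41/38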
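The ten templates can be instantiated at a position i roughly as follows. This is a sketch only: the template labels, the `<PAD>` and `<START>` placeholders for positions outside the text, and the function names are assumptions, since the slides do not say how the edges of the text were handled.

```python
def char_at(chars, i, pad="<PAD>"):
    """Return the character at i, or a padding symbol outside the text."""
    return chars[i] if 0 <= i < len(chars) else pad


def template_features(chars, tags, i):
    """Instantiate the ten feature templates at position i."""
    def w(offset):
        return char_at(chars, i + offset)

    y = tags[i]
    y_prev = tags[i - 1] if i > 0 else "<START>"
    return [
        ("y-1,y", y_prev, y),
        ("w0,y", w(0), y),
        ("w-2,y", w(-2), y),
        ("w-1,y", w(-1), y),
        ("w+1,y", w(1), y),
        ("w+2,y", w(2), y),
        ("w-2,w-1,y", w(-2), w(-1), y),
        ("w-1,w0,y", w(-1), w(0), y),
        ("w0,w+1,y", w(0), w(1), y),
        ("w+1,w+2,y", w(1), w(2), y),
    ]
```

Each returned tuple identifies one indicator feature f_k; its weight λ_k would be looked up in the trained model when scoring a candidate tag.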
![Page 42: Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches](https://reader035.fdocuments.net/reader035/viewer/2022081507/56815b37550346895dc90a12/html5/thumbnails/42.jpg)
Raw Results (Better Case)
42/38
齊宣王問曰.文王之囿.方七十里.有諸.孟子對曰.於傳有之.曰.若是其大乎.曰.民猶以為小也.曰.寡人之囿.方四十里.民猶以為大.何也.曰.文王之囿.方七十里.芻蕘者往焉.雉兔者往焉.與民同之.民以為小.不亦宜乎.臣始至於境.問國之大禁.然後敢入.臣聞郊關之內.有囿方四十里.殺其麋鹿者.如殺人之罪.則是方四十里為阱於國中.民以為大.不亦宜乎.
齊宣王問曰.文王之囿方七十里.有諸孟子對曰.於傳有之.曰.若是其大乎.曰.民猶以為小也.曰.寡人之囿.方四十里.民猶以為大.何也.曰.文王之囿方七十里.芻蕘者往焉.雉兔者往焉.與民同之.民以為小.不亦宜乎.臣始至於境.問國之大禁.然後敢入.臣聞郊關之內.有囿方四十里.殺其麋鹿者.如殺人之罪.則是方四十里.為阱於國中.民以為大.不亦宜乎.
Training Data: 史記, Test Data: 孟子
![Page 43: Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches](https://reader035.fdocuments.net/reader035/viewer/2022081507/56815b37550346895dc90a12/html5/thumbnails/43.jpg)
Raw Results (Worse Case)
43/38
孟子曰.三代之得天下也.以仁.其失天下也以不仁.國之所以廢興存亡者亦然.天子不仁.不保四海.諸侯不仁.不保社稷.卿大夫不仁.不保宗廟.士庶人不仁.不保四體.今惡死亡而樂不仁.是由惡醉而強酒.
孟子曰.三代之.得天下也.以仁其失天下也.以不仁國之所以廢興.存亡者.亦然.天子不仁.不保四海.諸侯不仁.不保社.稷卿大夫.不仁不保.宗廟士.庶人不仁.不保四體.今惡死亡.而樂不仁.是由惡醉.而強酒.
Training Data: 史記, Test Data: 孟子
![Page 44: Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches](https://reader035.fdocuments.net/reader035/viewer/2022081507/56815b37550346895dc90a12/html5/thumbnails/44.jpg)
44/38
![Page 45: Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches](https://reader035.fdocuments.net/reader035/viewer/2022081507/56815b37550346895dc90a12/html5/thumbnails/45.jpg)
45/38
![Page 46: Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches](https://reader035.fdocuments.net/reader035/viewer/2022081507/56815b37550346895dc90a12/html5/thumbnails/46.jpg)
46/38
![Page 47: Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches](https://reader035.fdocuments.net/reader035/viewer/2022081507/56815b37550346895dc90a12/html5/thumbnails/47.jpg)
Evaluation Measures
[Figure: diagram of the tp, fn, fp, and tn regions]
47/38
![Page 48: Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches](https://reader035.fdocuments.net/reader035/viewer/2022081507/56815b37550346895dc90a12/html5/thumbnails/48.jpg)
Statistical Machine Learning
Learner (Model)
Training Data
input output
48/38
![Page 49: Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches](https://reader035.fdocuments.net/reader035/viewer/2022081507/56815b37550346895dc90a12/html5/thumbnails/49.jpg)
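Conclusion: Evaluation Metrics
Main reference metrics:
Specificity
F-measure (F1): the harmonic mean of recall and precision
NIST-SU error rate
ROC curves: evaluate recall and specificity at the same time, and compare multiple segmentation results.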
49/38
![Page 50: Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches](https://reader035.fdocuments.net/reader035/viewer/2022081507/56815b37550346895dc90a12/html5/thumbnails/50.jpg)
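Statistical Segmentation System Design
Statistical approach: the segmentation system can be viewed as a machine learner.
Instead of relying on manually pre-defined rules, it adjusts and learns from a large amount of training data.
Requirements for the training data:
Generality
Sufficient quantity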
50/38
![Page 51: Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches](https://reader035.fdocuments.net/reader035/viewer/2022081507/56815b37550346895dc90a12/html5/thumbnails/51.jpg)
51/38
![Page 52: Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches](https://reader035.fdocuments.net/reader035/viewer/2022081507/56815b37550346895dc90a12/html5/thumbnails/52.jpg)
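Research on Classical Chinese Grammar and Sentence Segmentation
Exegetics (訓詁學): the study of function words (虛字)
《爾雅》: the chapters 釋詁, 釋言, and 釋訓
馬建忠, 《馬氏文通》 (late Qing): established a grammar of Classical Chinese modeled on Western grammar.
楊樹達, 《詞詮》 (early Republic of China): classified the function words into categories.
楊樹達, 《古書句讀釋例》: explored the causes of misreading.
No digitized versions of these works are available at present.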
52/38