
GeoName: a system for back-transliterating pinyin place names

Kui-Lam Kwok & Qiang Deng

Computer Science Dept., Queens College, City University of New York

email: [email protected]
email: [email protected]

Or: issues involving cross-language referencing of a Chinese place by name

Content:

1. Back-transliteration problem

2. GeoName system - a proposed approach

3. Evaluation

4. Observation/conclusion

Transliteration:

• ‘alphabet mismatch’ when expressing Chinese (place) names in English texts

• names represented by PRC Pinyin code, e.g. Beijing, Shenzhen

Back-Transliteration:

given the Pinyin code,

what are the original Chinese characters?

Back-Transliteration:

Why are Chinese characters needed?

• remove ambiguity of referenced Pinyin place

• reconcile names in English & Chinese texts

• may assist alignment in E/C parallel texts

• necessary for E-C Cross Language IR (when translating English queries containing Pinyin place, person, organization names)

4 Possible Ambiguities in English–Chinese cross language place name references

Ambiguity #3: Back-transliteration --> which character string is correct?

e.g.
• China’s capital in Chinese - 北京
• PRC Pinyin (1 char, 1 syllable) - 北 --> bei, 京 --> jing
• map back from Pinyin to characters –
  bei --> { 北, 贝, 被, 背, 碑, 杯, 备, 鐾, … } (total 23)
  jing --> { 京, 景, 井, 静, 敬, 竞, 精, 荆, … } (total 20)
• ambiguous candidates: 北井, 贝京, 贝荆, … 北京: which one?
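The candidate explosion can be illustrated with a short Python sketch. It is not the GeoName code: the table `PINYIN_TO_CHARS` and the function `candidates` are hypothetical names, and the table holds only the eight characters per syllable that are listed above rather than the full sets of 23 and 20.

```python
from itertools import product

# Toy pinyin -> character table (assumption: only the characters shown
# on the slide; the full table has 23 entries for bei and 20 for jing).
PINYIN_TO_CHARS = {
    "bei": ["北", "贝", "被", "背", "碑", "杯", "备", "鐾"],
    "jing": ["京", "景", "井", "静", "敬", "竞", "精", "荆"],
}

def candidates(syllables):
    """Return every character string obtainable from the syllable sequence."""
    options = [PINYIN_TO_CHARS[s] for s in syllables]
    return ["".join(chars) for chars in product(*options)]

cands = candidates(["bei", "jing"])
print(len(cands))        # number of candidates from this toy table
print("北京" in cands)    # the correct name is among them, but so are many others
```

With the full character sets the same product would give 23 x 20 = 460 candidates for a two-syllable name.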

Ambiguity #4: Name Reference --> same name, different places

Suppose result of back-transliteration is:

beijing --> 贝荆, then which 贝荆? (longitude, latitude)

Ambiguity #1: E/C Pinyin Systems --> which Pinyin system was used?

e.g. ‘Hong Kong’ in characters - 香港

PRC Pinyin: 香 -> xiang, 港 -> gang
Wade-Giles: 香 -> hsiang, 港 -> kang
Hong Kong: 香 -> hong, 港 -> kong …

‘hong kong’ back-transliterated using PRC Pinyin:

hong -> { 红洪鸿宏虹弘泓闳烘项黉哄 … } (19)
kong -> { 孔空恐崆控箜倥 } (7)

The original ‘香港’ is NOT one of these 7x19 combinations!

Ambiguity #2: Syllable Segmentation --> which segmentation is correct?

e.g. 秦皇岛 - possible pinyin writing styles:

• Qin Huang Dao
• QinHuangDao
• Qinhuangdao <-- most common, used in NYT

--> how many syllables?
Qin huang dao      3 char
Qin huang da o     4 char
Qin hu ang dao     4 char
Qin hu ang da o    5 char
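The segmentation ambiguity can be reproduced with a small recursive search. The sketch below is an illustration only: `SYLLABLES` is a hypothetical, deliberately tiny inventory containing just the syllables from the example, and `segmentations` is an assumed helper name, not part of GeoName.

```python
# Hypothetical syllable inventory: only the syllables from the slide's
# example. A real system would load a full Pinyin syllable table.
SYLLABLES = {"qin", "huang", "hu", "ang", "dao", "da", "o"}

def segmentations(name, syllables=SYLLABLES):
    """Return all ways to split `name` into valid syllables, fewest first."""
    name = name.lower()
    results = []

    def extend(start, parts):
        if start == len(name):
            results.append(list(parts))
            return
        for end in range(start + 1, len(name) + 1):
            piece = name[start:end]
            if piece in syllables:
                parts.append(piece)
                extend(end, parts)
                parts.pop()

    extend(0, [])
    # Stable sort: segmentations with the fewest syllables come first.
    return sorted(results, key=len)

for seg in segmentations("Qinhuangdao"):
    print(len(seg), " ".join(seg))
```

With this inventory the search finds the same four segmentations listed above, with the 3-syllable reading first.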

Summarize: given a Pinyin geographic name

1. Pinyin system -- which?

2. segmentation -- how many syllables?

3. back-transliterate -- which candidate character string?

4. resolve same name, different places.

GeoName:

a system for back-transliterating Pinyin place names

GeoName: E-C cross language place reference

1. which Pinyin system?
   -- user chooses; or allow both PY & WG

2. how many segmented syllables?
   -- fewest syllables ranked first

3. back-transliterate: which candidate?
   -- a) bi-list; b) confirm by web/Chinese place list; c) rank candidates by frequency

4. resolve same name, different places
   -- not considered

GeoName –

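Given an English Pinyin place name E = e1 e2 .. en (n syllables), there are many possible Chinese character string candidates:

\[
\begin{aligned}
C^{*} = c_1 c_2 \ldots c_n &= \arg\max_{C} P(C \mid E) \\
      &= \arg\max_{C} P(E \mid C)\,P(C) \\
      &\approx \arg\max_{C} P(C),
\end{aligned}
\]

by assuming

\[
\begin{aligned}
P(E \mid C) &\approx \prod_{i} P(e_i \mid C)   && \text{i.e. } e_i, e_k \text{ independent} \\
            &\approx \prod_{i} P(e_i \mid c_i) && \text{i.e. } e_i, c_k \text{ independent} \\
            &\approx 1                         && \text{i.e. all } c_i \text{ map to unique } e_i
\end{aligned}
\]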

GeoName –

P(C) = language model of Chinese place names
<obtain training data by processing TREC, NTCIR Chinese collections using BBN IdentiFinder: approximately 80K unique place names>

Use P(C) to sort candidates; fewest syllables ranked earlier
<bigram model P(c2|c1)P(c3|c2).. not too good>
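For reference, a character bigram model of the kind mentioned above (and reported as "not too good") might look like the following sketch. Everything here is an assumption made for illustration: the training list is a toy stand-in for the ~80K extracted names, the start-character term and the add-one smoothing are not stated on the slide, and `train_bigram` and `log_p` are hypothetical names.

```python
import math
from collections import Counter

def train_bigram(names):
    """Count start characters, left-context characters and bigrams."""
    starts, contexts, bigrams = Counter(), Counter(), Counter()
    for name in names:
        starts[name[0]] += 1
        for a, b in zip(name, name[1:]):
            contexts[a] += 1          # a seen as the left half of a bigram
            bigrams[(a, b)] += 1
    vocab = {ch for name in names for ch in name}
    return starts, contexts, bigrams, len(vocab) + 1, len(names)

def log_p(cand, model):
    """log P(C) = log P(c1) + sum of log P(c_{i+1} | c_i), add-one smoothed."""
    starts, contexts, bigrams, v, n = model
    lp = math.log((starts[cand[0]] + 1) / (n + v))
    for a, b in zip(cand, cand[1:]):
        lp += math.log((bigrams[(a, b)] + 1) / (contexts[a] + v))
    return lp

# Toy training data (assumption), standing in for the extracted name list.
model = train_bigram(["北京", "南京", "北海", "秦皇岛", "公主岭"])
ranked = sorted(["贝荆", "北井", "北京"], key=lambda c: log_p(c, model), reverse=True)
print(ranked[0])
```

Candidates are then sorted by this score; an unseen string such as 贝荆 gets only the smoothing mass.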

GeoName –

A heuristic weighting formula based on whole string, bigram and character frequencies:

g(C) = a1*log [f(C)+a1] + a2*log [f(cicj)+a2] + a3*log [f(ci)+a3],

- factor ignored if f(.) = 0; a1>a2>a3

- a1*log [f(C)+a1] => a string seen before is probably correct
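A minimal Python sketch of this weighting formula follows. Several details are assumptions, since the slide does not give them: the coefficient values (only a1>a2>a3 is stated), summing the bigram term over every adjacent character pair and the character term over every character, and the toy frequency tables. The function name `g` mirrors the slide; the table names are hypothetical.

```python
import math

# Hypothetical coefficients; the slide only requires a1 > a2 > a3.
A1, A2, A3 = 3.0, 2.0, 1.0

def g(cand, f_string, f_bigram, f_char):
    """Heuristic weight from whole-string, bigram and character frequencies.

    A factor is skipped when its frequency is 0, as stated on the slide.
    Assumption: bigram and character terms are summed over all positions.
    """
    score = 0.0
    f = f_string.get(cand, 0)
    if f > 0:
        score += A1 * math.log(f + A1)
    for a, b in zip(cand, cand[1:]):
        f = f_bigram.get(a + b, 0)
        if f > 0:
            score += A2 * math.log(f + A2)
    for ch in cand:
        f = f_char.get(ch, 0)
        if f > 0:
            score += A3 * math.log(f + A3)
    return score

# Toy frequency tables (assumption), as if counted from a place name list.
f_string = {"北京": 50}
f_bigram = {"北京": 50}
f_char = {"北": 80, "京": 60, "贝": 5}

for cand in ["北京", "贝京", "贝荆"]:
    print(cand, round(g(cand, f_string, f_bigram, f_char), 3))
```

Because the whole-string term carries the largest coefficient, a string that was seen before in the training data outranks strings supported only by character frequencies.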

Evaluation

Use the frequency formula only, on 162 Pinyin city names from a bilingual map (no bilingual pair list was employed)

[Figure: GeoName Evaluation - Frequency Formula (back-transliterating 162 Pinyin geographic names). x-axis: Rank (1 to 10, >10); y-axis: Cumulative # Correct at Rank (60 to 180). Data labels shown: 48%, 70%, 74%, 82%.]
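The measure plotted here, cumulative number correct at each rank, can be computed as in the sketch below. The rank list is made-up toy data, not the 162-name test set, and `cumulative_correct` is a hypothetical helper name.

```python
def cumulative_correct(ranks, max_rank=10):
    """For each cutoff k, count test names whose correct answer is at rank <= k.

    `ranks` holds the rank of the correct character string for each test
    name, or None when it was not produced at all.
    Returns a list of (k, cumulative count, fraction of all names).
    """
    total = len(ranks)
    rows = []
    for k in range(1, max_rank + 1):
        hits = sum(1 for r in ranks if r is not None and r <= k)
        rows.append((k, hits, hits / total))
    return rows

# Toy ranks (assumption): 8 test names, one never found, one beyond rank 10.
toy_ranks = [1, 1, 2, 5, None, 11, 3, 1]
rows = cumulative_correct(toy_ranks)
for k, hits, frac in rows:
    print(k, hits, f"{frac:.0%}")
```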

Examples of Correct Names ranked #1
Daqiu (大丘), Wanbi (湾碧), Gongzhuling (公主岭), ..

Examples of Failed Names
• Non-Pinyin:
  Qarqi (察尔齐), Yengisar (阳霞), Jorra (觉拉), Dongkar (洞嘎), ..
• mainly longer names:
  Tuolu (驮芦), Fenglingguan (枫岭关), Qingguandu (清官渡),
  Dating (大亭), Shasonggang (杉松岗), Denglonghe (灯笼河), ..

GeoName – further improvement

Hypothesis: prefer candidate strings that have been seen before as location names

confirm candidates on:

1. a bilingual list (~4K) – tag: 100
   ftp://ftpserver.ciesin.columbia.edu/pub/data/China/CITAS/gb_code/

2. Chinese monolingual place name list (~80K+4K) – tag: 010

3. web data via Google search – tag: 001
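The three-digit tags read naturally as bit flags, one per confirmation source. The sketch below assumes that interpretation; the constant names, the `tag` function, and the three toy sets (which stand in for the bilingual list, the monolingual list and a web search result) are all hypothetical.

```python
# One bit per confirmation source, matching the tag digits above.
BI_LIST, MONO_LIST, WEB = 0b100, 0b010, 0b001

def tag(cand, bi_list, mono_list, web_hits):
    """Combine the confirmations for one candidate into a 3-digit tag."""
    t = 0
    if cand in bi_list:
        t |= BI_LIST
    if cand in mono_list:
        t |= MONO_LIST
    if cand in web_hits:      # stand-in for a web search confirmation
        t |= WEB
    return format(t, "03b")

# Toy lookup sets (assumption).
bi_list = {"孝义市"}
mono_list = {"孝义市", "吕梁区"}
web_hits = {"汊沽港"}

for cand in ["孝义市", "吕梁区", "汊沽港", "查古港"]:
    print(cand, tag(cand, bi_list, mono_list, web_hits))
```

A candidate confirmed by no source keeps tag 000 and is ranked by frequency alone.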

GeoName – flowchart

1. Pinyin place name input; user indicates PRC or WG system.
2. Pinyin segmentation; map to all possible GB character strings. tag 000
3. Bilingual table (4k) lookup. tag 100
4. Merge GB candidates.
5. Monolingual name list (84k) confirmation. tag 110, 010
6. WWW confirmation. tag 101, 001
7. Evaluate weight g(C); rank according to: (1) tag, (2) name character length, (3) g(C).

Other tags shown in the flowchart: tag 111, 011
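The final ranking step (tag, then name character length, then g(C)) can be sketched as a three-part sort key. The sort directions are assumptions: larger tag first (read as a binary number), shorter names first (in line with "fewest syllables ranked first"), and larger g(C) first. The three candidates and their scores are taken from the Chagugang example in this deck; `rank_candidates` is a hypothetical name.

```python
def rank_candidates(cands):
    """Sort (tag, g_score, chars) tuples by tag, then length, then g(C).

    Assumed directions: higher tag first, fewer characters first,
    higher g(C) first.
    """
    return sorted(
        cands,
        key=lambda c: (-int(c[0], 2), len(c[2]), -c[1]),
    )

cands = [
    ("000", 15.68423107, "查古港"),
    ("000", 9.24647942, "诧古港"),
    ("001", 1.38629436, "汊沽港"),
]

for t, score, chars in rank_candidates(cands):
    print(t, score, chars)
```

Under these assumptions the web-confirmed 汊沽港 is ranked first despite its low g(C), because the tag is compared before the weight.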

GeoName – Evaluation

Evaluate system result using:

tag=000, rank by g(C)
tag=001, web confirmation + g(C)
tag=010, mono-list confirmation + g(C)
tag=111, bi-list + all above

[Figure: GeoName Evaluation - Various Methods (back-transliterating 162 Pinyin geographic names). x-axis: Rank (1 to 10, >10); y-axis: Cumulative # Correct at Rank (60 to 180). Series: freq only (000), freq+web (001), freq+mono.list (010), all (111). Data labels shown: 48%, 70%, 74%, 82%, 72%, 83%, 86%, 79%.]

Example of back-transliteration: web & no-web

Tag = 111 (with web confirmation)

Chagugang
001  1.38629436   汊沽港
000  15.68423107  查古港
000  9.24647942   诧古港
000  9.24647942   岔古港
000  8.55333224   锸古港
000  8.55333224   槎古港
000  8.55333224   楂古港
000  8.55333224   汊古港
000  8.55333224   嚓古港
000  8.55333224   刹古港

Tag = 110 (without web confirmation)

Chagugang
000  15.68423107  查古港
000  9.24647942   诧古港
000  9.24647942   岔古港
000  8.55333224   锸古港
000  8.55333224   槎古港
000  8.55333224   楂古港
000  8.55333224   汊古港
000  8.55333224   嚓古港
000  8.55333224   刹古港
000  8.55333224   差古港

Examples:

Luliangqu (district/region)
010  40.02587171  吕梁区
000  9.24647942   吕梁瞿
000  9.24647942   吕梁衢
000  9.24647942   吕梁渠
000  9.24647942   吕梁曲
000  9.24647942   陆良瞿
000  9.24647942   陆良衢
000  9.24647942   陆良渠
000  9.24647942   陆良曲
000  9.24647942   陆良区

Xiaoyishi (city)
110  40.18588115  孝义市
000  9.24647942   孝尾市
000  9.24647942   萧尾市
000  8.55333224   箫尾市
000  8.55333224   筱尾市
000  8.55333224   骁尾市
000  8.55333224   潇尾市
000  8.55333224   崤尾市
000  8.55333224   哓尾市
000  8.55333224   效尾市

Yimaxiang (village)
000  15.68423107  义马乡
000  9.24647942   义马缃
000  9.24647942   义马巷
000  9.24647942   义马祥
000  9.24647942   义马湘
000  9.24647942   义马襄
000  9.24647942   义马香
000  9.24647942   伊玛缃
000  9.24647942   伊玛巷
000  9.24647942   伊玛祥

Mengnanzhuang (place)
000  14.95494484  蒙南庄
000  8.51719319   懵南庄
000  8.51719319   孟南庄
000  8.51719319   盟南庄
000  8.51719319   萌南庄
000  7.82404601   虻南庄
000  7.82404601   勐南庄
000  7.82404601   梦南庄
000  7.82404601   猛南庄
000  7.82404601   锰南庄

Conclusion:

• reasonable back-transliteration results for map cities
• longer names (>2 char), more error
• non-pinyin names, does not work

Future Work:

• increase training data
• improve ranking function
• direct translation (not just confirmation) using web
• better/more realistic evaluation

If interested:

can demonstrate GeoName (needs Linux re-boot)

Try GeoName at:

http://post.cs.qc.edu/spell2gb/ (needs Chinese character display)

feedback appreciated

Thank You!