
Information Retrieval and Recommendation Techniques: Midterm Report

Ch9. Thesaurus Construction

Advisor: Prof. 黃三益
Group: Group 3
Members: 周桂穗, 孫繡紋, 莊士民
Date: 19 Nov. 2002


Outline

1. Introduction
   1) Definition of a thesaurus
   2) Thesaurus structure
   3) INSPEC thesaurus
   4) Thesaurus cross-reference entries explained

2. Features of thesauri
   1) Coordination level
   2) Term relationships
   3) Number of entries for each term
   4) Specificity of vocabulary
   5) Control on term frequency of class members
   6) Normalization of vocabulary

3. Thesaurus Construction
   1) Manual Thesaurus Construction
   2) Automatic Thesaurus Construction
      • From a collection of document items
      • By merging existing thesauri
      • User-generated thesaurus
      3.2.1 Thesaurus Construction from Text
            • Construction of vocabulary
      3.2.2 Merging existing thesauri

4. Conclusion


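1.1 Definition of a Thesaurus

In the context of information storage and retrieval, a thesaurus collects the words or terms that can represent knowledge concepts and arranges them in a specific structure. This vocabulary controls synonyms, distinguishes homographs, and shows the hierarchical and semantic relationships among related terms, so that indexers analyzing material and readers searching for it can choose consistent, controlled terms. In short, it provides standardized terminology for information storage and retrieval.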


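1.2 Thesaurus Structure

1. The vocabulary of a thesaurus is divided into headings and cross-reference entries.

2. Headings are terms approved for use and are called descriptors. Cross-reference entries are terms that may not be used; they are called non-descriptors or use references, and correspond to the "see" references used in library bibliographic processing.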


1.3 INSPEC thesaurus

1. This thesaurus is designed for the INSPEC domain, which covers physics, electrical engineering, electronics, as well as computers and control.

2. The thesaurus is logically organized as a set of hierarchies.

3. It includes an alphabetical listing of thesaural terms.

4. Each hierarchy is built from a root term representing a high-level concept in the domain.


A short extract from the 1979 INSPEC thesaurus

Cesium
    USE      caesium
Computer-aided instruction
    see also education
    UF       teaching machines
    BT       educational computing
    TT       computer applications
    RT       education
             teaching
    CC       C7810C
    FC       c7810cf


1.4 Thesaurus cross-reference entries explained

USE (replaced by) points from a non-preferred term to the descriptor that should be used instead.

The "see also" link leads to cross-referenced thesaural terms.

NT (narrower term) suggests a more specific thesaural term.

BT (broader term) provides a more general thesaural term.

TT (top term) is the broadest term, the one at the top of the BT chain.

RT signifies a related term.

UF (used for) indicates the chosen form from a set of alternatives.

CC: classification codes.
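The reference codes map naturally onto a small lookup structure. The following is a minimal sketch, assuming a plain Python dictionary keyed by lowercase term; the layout and the helper name `preferred_term` are illustrative assumptions, not part of INSPEC. The two entries are taken from the 1979 INSPEC extract.

```python
# Hypothetical in-memory layout for thesaurus entries (an assumption for illustration).
THESAURUS = {
    "cesium": {"USE": "caesium"},
    "computer-aided instruction": {
        "see also": ["education"],
        "UF": ["teaching machines"],
        "BT": ["educational computing"],
        "TT": ["computer applications"],
        "RT": ["education", "teaching"],
        "CC": ["C7810C"],
    },
}


def preferred_term(term, thesaurus=THESAURUS):
    """Follow a USE reference to the descriptor; otherwise return the term itself."""
    key = term.lower()
    entry = thesaurus.get(key, {})
    return entry.get("USE", key)
```

Calling `preferred_term("Cesium")` follows the USE reference, while a descriptor such as "computer-aided instruction" is returned unchanged.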


2. Features of thesauri

1) Coordination level

2) Term relationships

3) Number of entries for each term

4) Specificity of vocabulary

5) Control on term frequency of class members

6) Normalization of vocabulary


2.1 Coordination level

1) Coordination refers to the construction of phrases from individual terms.

2) Two coordination options : pre-coordination and post-coordination.

3) A precoordinated thesaurus can contain phrases. The advantage is that the vocabulary is very precise.

4) The disadvantage is that the searcher has to be aware of the phrase construction rules employed.

5) Precoordination is more common in manually constructed thesauri.


7) A postcoordinated thesaurus does not allow phrases. Instead, phrases are constructed while searching.

8) The advantage is that the user need not worry about the exact ordering of the words in a phrase.

9) The disadvantage is that search precision may fall.

10) Automatic thesaurus construction usually implies postcoordination.


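Coordination level

Pre-coordination: the descriptor concept is treated as a single compound noun, for example: diesel locomotive

Post-coordination: the concept is expressed by combining two or more existing terms, for example: diesel engines AND locomotive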
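The difference between the two options can be sketched with a toy inverted index. This is a minimal illustration, assuming made-up document ids and the hypothetical helper names `precoordinated_search` and `postcoordinated_search`; it is not taken from any particular retrieval system.

```python
# Toy inverted index: term -> set of document ids (made-up data).
INDEX = {
    "diesel engines": {1, 2, 5},
    "locomotive": {2, 3, 5},
    "diesel locomotive": {2},
}


def precoordinated_search(phrase, index=INDEX):
    """Look the phrase up as a single thesaurus entry."""
    return index.get(phrase, set())


def postcoordinated_search(terms, index=INDEX):
    """Combine individual terms with Boolean AND at search time."""
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()
```

In this toy data the post-coordinated query also returns document 5, which mentions both terms without being about diesel locomotives. That is the kind of precision loss associated with post-coordination.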


2.2 Term Relationships

Three categories of term relationships:

(a) Equivalence relationships

(b) Hierarchical relationships

(c) Nonhierarchical relationships


2.2a) Equivalence relationships

Equivalence relations include both synonymy and quasi-synonymy.

For example: "genetics" and "heredity"; "harshness" and "tenderness".


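Equivalence relationships

Synonyms: when one concept can be expressed by more than one term, the thesaurus usually selects the more widely used or more current term as the descriptor and makes the others cross-reference entries, for example:

storage batteries UF secondary batteries

secondary batteries USE storage batteries

UF (Used For): the descriptor replaces the listed term. USE: the term is replaced by the listed descriptor.

Quasi-synonyms: two antonyms can sometimes represent two sides of a single concept. One is chosen as the descriptor and the other becomes a cross-reference entry, for example stability UF instability, with the corresponding cross-reference entry instability USE stability.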


2.2b) Hierarchical relationships

A typical example of a hierarchical relation is genus-species, such as "dog" and "German shepherd."


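2.2b) Hierarchical relationships

For terms that stand in a hierarchical relationship, a thesaurus usually uses two reference symbols, BT and NT.

BT points to the broader term one level above a given term, for example: oak tree BT tree

NT points to the narrower term one level below a given term, for example: tree NT oak tree

BT and NT are a reciprocal pair of reference symbols.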
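Because BT and NT mirror each other, one set of links can be derived from the other. The sketch below assumes BT links are stored as a mapping from a term to its list of broader terms; the function name `narrower_from_broader` and the "pine tree" entry are made up for illustration.

```python
from collections import defaultdict


def narrower_from_broader(bt):
    """Derive NT links by inverting a mapping of term -> list of broader terms."""
    nt = defaultdict(list)
    for term, broader_terms in bt.items():
        for broader in broader_terms:
            nt[broader].append(term)
    return dict(nt)


# "oak tree BT tree" comes from the slide; "pine tree" is an invented second entry.
BT = {"oak tree": ["tree"], "pine tree": ["tree"]}
NT = narrower_from_broader(BT)
```

Storing only one direction and deriving the other keeps the two sets of links from drifting out of sync.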


2.2c) Nonhierarchical relationships

Nonhierarchical relationships also identify conceptually related terms. There are many examples, including thing and part, such as "bus" and "seat", and thing and attribute, such as "rose" and "fragrance".


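Related-term relationships

These indicate an association between approved descriptors and are generally linked with the RT reference symbol, for example:

• A thing as a whole and its parts: windows RT houses
• A thing and the process applied to it: skates RT skating
• A thing and its application: railway construction RT railway
• A thing and its property: seawater RT corrosion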


Wang, Vandendorpe, and Evens (1985) provide an alternative classification of term relationships consisting of:

(1) parts-wholes

(2) collocation relations

(3) paradigmatic relations

(4) taxonomy and synonymy

(5) antonymy relations


(1) Parts-wholes

Parts and wholes include examples such as set-element and count-mass.


(2) Collocation relations

Collocation relates words that frequently co-occur in the same phrase or sentence.


(3) Paradigmatic relations

Paradigmatic relations relate words that have the same semantic core like “moon” and “lunar” and are somewhat similar to Aitchison and Gilchrist’s quasi-synonymy relationship.


(4) Taxonomy and synonymy

Taxonomy and synonymy are self-explanatory and refer to the classical relations between terms.


(5) Antonymy relations


2.3 Number of entries for each term

1. It is in general preferable to have a single entry for each thesaurus term. However, this is seldom achieved due to the presence of homographs (words with multiple meanings).

2. In a manually constructed thesaurus such as INSPEC, this problem is resolved by the use of parenthetical qualifiers, as in the pair of homographs bonds (chemical) and bonds (adhesive).

3. However, this is hard to achieve automatically.


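Homographs

When several terms are spelled identically but have different meanings, a qualifier in parentheses is added to tell them apart, for example Mercury (metal) and Mercury (planet). The parenthetical qualifier is itself part of the descriptor.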


2.4 Specificity of vocabulary

1. Specificity is a function of the precision associated with the component terms.

2. A highly specific vocabulary is able to express the subject in great depth and detail. This promotes precision in retrieval.

3. The disadvantage is that the size of the vocabulary grows. Also, specific terms tend to change more rapidly than general terms.

4. Therefore, such vocabularies tend to require more regular maintenance.

5. High specificity implies a high coordination level, and the user has to be more concerned with the rules for phrase construction.


2.5 Control on term frequency of class members

1. Salton and McGill have stated that in order to maintain a good match between documents and queries, it is necessary to ensure that terms included in the same thesaurus class have roughly equal frequencies.

2. The total frequency in each class should also be roughly similar.

3. These constraints are imposed to ensure that the probability of a match between a query and a document is the same across classes.

4. Terms within the same class should be equally specific, and the specificity across classes should also be the same.


2.6 Normalization of vocabulary

1. Terms are best expressed as nouns.

2. Noun phrases should avoid acronyms unless they are widely known.

3. Adjectives may be used.

4. There are other rules governing issues such as the singularity of terms, the ordering of terms within phrases, spelling, capitalization, transliteration, abbreviations, initials, acronyms, and punctuation.

5. The advantage is that variant forms are mapped into base expressions, thereby bringing consistency to the vocabulary.

6. The disadvantage is that, in order to be used effectively, the user has to be well aware of the normalization rules used.


3.1 Manual Thesaurus Construction

• Define the boundaries of the subject area
   – Identify central subject areas and peripheral ones
   – Partition the domain into divisions or subareas

• Identify desired characteristics

• Collect terms for each subarea
   – Sources: indexes, encyclopedias, handbooks, textbooks, journals, abstracts, catalogs, existing thesauri or vocabulary systems
   – Including: subject experts and potential users


3.1 Manual Thesaurus Construction (continued)

• Analyze each term for its related vocabulary
   – Including synonyms, broader and narrower terms, definitions and scope notes

• Organize terms and relationships into a hierarchical structure

• Review and refine for consistency

• Invert the structured thesaurus to produce an alphabetical arrangement of entries

• Test the thesaurus


3.1 Manual Thesaurus Construction (continued)

Conclusion:

– Involve a group of individuals and a variety of resources

– Need to be maintained to ensure viability and effectiveness

– Reflect any changes in the terminology of the area

=>An art and a science


3.2 Automatic Thesaurus Construction

From document collections

– Use a collection of documents as the source for thesaurus construction

– Apply statistical procedures to identify important terms as well as relationships

– Use computationally simpler methods to identify the more important semantic knowledge


3.2 Automatic Thesaurus Construction (continued)

Merge existing thesauri

– Merge two or more thesauri into a single unit

– Merger should not violate the integrity of any component thesaurus

– e.g. augment MeSH from SNOMED


3.2 Automatic Thesaurus Construction (continued)

User-generated thesaurus

– Uses term relationships found in search strategies

– Captures knowledge from users' searches

– e.g. TEGEN (Thesaurus Generating system)
   • The types of Boolean operators between terms
   • The type of query modification
   • User feedback included


3.2.1 Thesaurus Construction from Texts

Process

– 1 Construction of vocabulary
   • Normalization
   • Selection of terms
   • Phrase construction

– 2 Similarity computations
   • Identify the statistical associations between terms

– 3 Organization of vocabulary
   • Organize the selected vocabulary into a hierarchy


1 Construction of Vocabulary

Objective: identify the most informative terms (words, phrases)

– Identify an appropriate document collection, which should be sizable and representative of the subject area

– Determine the required specificity

– Normalize the vocabulary
   • Eliminate trivial words using a stoplist
   • Stem the vocabulary
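The normalization step can be sketched as follows. This is a rough illustration only: the stoplist is a tiny made-up sample, and `naive_stem` is a deliberately crude suffix stripper standing in for a real stemming algorithm; both names are hypothetical.

```python
import re

STOPLIST = {"the", "of", "and", "a", "in", "to", "is", "are"}  # toy stoplist (assumption)
SUFFIXES = ("ing", "es", "s")  # naive suffix list, not a real stemming algorithm


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize(text):
    """Lowercase, tokenize, drop stoplist words, and stem what remains."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOPLIST]
```

For instance, `normalize("The indexing of documents and terms")` yields `["index", "document", "term"]`. A production system would substitute an established stemmer and a domain-appropriate stoplist.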


Selection by Frequency of Occurrence

– Each term is placed in one frequency category: high, medium, or low

– Medium: best for indexing and abstracting

– Low: minimal impact on retrieval

– High: too general; negatively impacts search precision
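A minimal sketch of frequency-based selection follows. The cut-offs `low_max` and `high_min` are arbitrary illustrative values (the slides do not give thresholds), and the function name and sample documents are made up.

```python
from collections import Counter


def frequency_categories(documents, low_max=1, high_min=4):
    """Label each term 'low', 'medium', or 'high' by its total occurrence count.

    low_max and high_min are assumed cut-offs chosen only for this example.
    """
    counts = Counter(term for doc in documents for term in doc)
    labels = {}
    for term, n in counts.items():
        if n <= low_max:
            labels[term] = "low"
        elif n >= high_min:
            labels[term] = "high"
        else:
            labels[term] = "medium"
    return labels


docs = [
    ["system", "thesaurus", "index"],
    ["system", "thesaurus", "retrieval"],
    ["system", "index"],
    ["system", "caesium"],
]
labels = frequency_categories(docs)
# Medium-frequency terms are the candidates kept for the vocabulary.
candidates = sorted(t for t, label in labels.items() if label == "medium")
```

Here "system" lands in the high category, "retrieval" and "caesium" in the low one, and only "index" and "thesaurus" remain as candidates.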


Selection by Discrimination Value (DV)

– DV measures the degree to which a term is able to discriminate or distinguish between the documents

– The more discriminating a term, the higher its value as an index term

– Use a similarity function to compute the average inter-document similarity in the collection

– DV(k) = (Average similarity without k) - (Average similarity with k)


Selection by Discrimination Value (DV) (continued)

– Good discriminators are those that decrease the average similarity by their presence, so their DV is positive

– Poor discriminators have negative DV

– Neutral discriminators have no effect on average similarity

– Terms that are positive discriminators can be included in the vocabulary and the rest rejected
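The DV definition can be sketched directly in code. This example assumes cosine similarity over raw term counts and averages over all document pairs; both are choices made here for illustration, since the slides say only "some similarity function". The sample documents are made up.

```python
import math
from itertools import combinations


def cosine(d1, d2):
    """Cosine similarity between two term-frequency dictionaries."""
    dot = sum(w * d2.get(t, 0) for t, w in d1.items())
    n1 = math.sqrt(sum(w * w for w in d1.values()))
    n2 = math.sqrt(sum(w * w for w in d2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0


def average_similarity(docs):
    """Mean similarity over all pairs of documents."""
    pairs = list(combinations(docs, 2))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0


def discrimination_value(term, docs):
    """DV(k) = (average similarity without k) - (average similarity with k)."""
    without = [{t: w for t, w in d.items() if t != term} for d in docs]
    return average_similarity(without) - average_similarity(docs)


docs = [
    {"system": 1, "thesaurus": 1},
    {"system": 1, "retrieval": 1},
    {"system": 1, "index": 1},
]
```

In this toy collection "system" occurs in every document and makes them look alike, so its DV is negative; "thesaurus" occurs in one document only and gets a positive DV. Computing all pairwise similarities is quadratic in the number of documents, so this direct form only suits small collections.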


Selection by Poisson Method

– The Poisson distribution is a discrete random distribution that can be used to model a variety of random phenomena

– Trivial words follow a single Poisson distribution

– The distribution of nontrivial words deviates significantly from a Poisson distribution
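One simple way to check for such a deviation, assumed here purely for illustration and not necessarily the test the slides have in mind, is to compare the number of documents that actually contain a term with the number a single Poisson distribution would predict. The function name `poisson_document_gap` is hypothetical.

```python
import math


def poisson_document_gap(term_counts):
    """term_counts: occurrences of one term in each document of the collection.

    Returns (observed, expected) numbers of documents containing the term, where
    expected assumes the occurrences follow a single Poisson distribution.
    """
    n_docs = len(term_counts)
    lam = sum(term_counts) / n_docs           # mean occurrences per document
    expected = n_docs * (1 - math.exp(-lam))  # P(count >= 1) = 1 - e^(-lambda)
    observed = sum(1 for c in term_counts if c > 0)
    return observed, expected


# A term whose 8 occurrences are all concentrated in one of 10 documents.
observed, expected = poisson_document_gap([8, 0, 0, 0, 0, 0, 0, 0, 0, 0])
```

A large gap between the two numbers, as in this clustered example, suggests the term is not distributed like a trivial word.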


Phrase Construction

Phrase construction decreases the frequency of high-frequency terms and increases their value for retrieval.

Salton and McGill procedure: a statistical alternative to syntactic and/or semantic methods for identifying and constructing phrases

– The component words of a phrase should occur frequently in a common context

– The component words should represent broad concepts, and their frequency of occurrence should be sufficiently high


Phrase Construction (continued)

– Criteria

1. Compute pairwise co-occurrence for high-frequency words

2. If this co-occurrence is lower than a threshold, do not consider the pair any further

3. For pairs that qualify, compute the cohesion value using either of the following:

   $\mathrm{cohesion}(t_i, t_j) = \dfrac{\text{co-occurrence-frequency}}{\sqrt{\mathrm{frequency}(t_i) \cdot \mathrm{frequency}(t_j)}}$

   $\mathrm{cohesion}(t_i, t_j) = \text{size-factor} \cdot \dfrac{\text{co-occurrence-frequency}}{\text{total-frequency}(t_i) \cdot \text{total-frequency}(t_j)}$

4. If cohesion is above a second threshold, retain the phrase as a valid vocabulary phrase
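The four steps can be sketched as follows, assuming the pairwise co-occurrence counts of step 1 have already been computed. The function names, the sample counts, and both thresholds are made up for illustration; `select_phrases` uses the square-root form of cohesion.

```python
import math


def cohesion_sqrt(cooccurrence, freq_i, freq_j):
    """co-occurrence-frequency / sqrt(frequency(ti) * frequency(tj))."""
    if freq_i == 0 or freq_j == 0:
        return 0.0
    return cooccurrence / math.sqrt(freq_i * freq_j)


def cohesion_size_factor(cooccurrence, total_i, total_j, size_factor):
    """size-factor * co-occurrence-frequency / (total-frequency(ti) * total-frequency(tj))."""
    if total_i == 0 or total_j == 0:
        return 0.0
    return size_factor * cooccurrence / (total_i * total_j)


def select_phrases(pair_counts, term_freq, min_cooccurrence, min_cohesion):
    """Steps 2-4: filter pairs by co-occurrence, then keep those with high cohesion."""
    phrases = []
    for (ti, tj), co in pair_counts.items():
        if co < min_cooccurrence:
            continue  # step 2: too rare together, drop the pair
        if cohesion_sqrt(co, term_freq[ti], term_freq[tj]) > min_cohesion:
            phrases.append((ti, tj))  # step 4: retain as a vocabulary phrase
    return phrases


pair_counts = {("information", "retrieval"): 8, ("information", "system"): 2}
term_freq = {"information": 16, "retrieval": 4, "system": 25}
phrases = select_phrases(pair_counts, term_freq, min_cooccurrence=3, min_cohesion=0.5)
```

With these invented counts only ("information", "retrieval") survives both thresholds.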


Choueka Procedure

The Choueka procedure identifies collocational expressions, that is, phrases whose meaning cannot be derived in a simple way from that of the component words (e.g. artificial intelligence).

1. Select the range of length allowed for each collocational expression

2. Build a list of all potential expressions of the prescribed lengths from the collection that have a minimum frequency

3. Delete sequences that begin or end with a trivial word


Choueka Procedure (continued)

4. Delete expressions that contain high-frequency nontrivial words

5. Given an expression such as a b c d, evaluate any potential subexpressions for relevance. Discard any that are not sufficiently relevant

6. Try to merge small expressions into larger and more meaningful ones
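Steps 1 to 3 of the procedure lend themselves to a short sketch; steps 4 to 6 depend on relevance judgments the slides do not specify, so they are left out. The trivial-word list, the default parameters, and the function name `candidate_expressions` are assumptions for illustration.

```python
from collections import Counter

TRIVIAL = {"the", "of", "a", "and", "in", "is"}  # toy list of trivial words (assumption)


def candidate_expressions(tokens, min_len=2, max_len=3, min_freq=2):
    """Steps 1-3: count word sequences of the allowed lengths, keep the frequent
    ones, and drop sequences that begin or end with a trivial word."""
    counts = Counter()
    for n in range(min_len, max_len + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return {
        seq: c
        for seq, c in counts.items()
        if c >= min_freq and seq[0] not in TRIVIAL and seq[-1] not in TRIVIAL
    }


text = "the field of artificial intelligence is broad and artificial intelligence is old"
candidates = candidate_expressions(text.split())
```

On this sample sentence the only surviving candidate is ("artificial", "intelligence"); sequences such as ("intelligence", "is") are frequent enough but end with a trivial word.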


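2 Similarity Computations Between Terms

The goal is to determine the statistical similarity between pairs of terms.

Dice: $\mathrm{dice} = \dfrac{\mathit{common}}{l_1 + l_2}$; if $l_1 = 0$ or $l_2 = 0$, return 0

Cosine: $\mathrm{cosine} = \dfrac{\mathit{common}}{\sqrt{l_1 \, l_2}}$; if $l_1 = 0$ or $l_2 = 0$, return 0

$l_1$: number of documents associated with term 1

$l_2$: number of documents associated with term 2

$\mathit{common}$: number of documents the two terms have in common

The Dice coefficient is more commonly written with a factor of 2 in the numerator, which only rescales the values to the range 0 to 1.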
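A minimal sketch of the two measures follows, assuming each term is represented by the set of documents it occurs in, so that l1 and l2 are the set sizes and common is the size of the intersection. The Dice form used is common / (l1 + l2); multiply by 2 for the more widely used variant. The function names and sample sets are made up.

```python
import math


def dice(docs1, docs2):
    """common / (l1 + l2); returns 0 if either term has no documents."""
    l1, l2 = len(docs1), len(docs2)
    if l1 == 0 or l2 == 0:
        return 0.0
    return len(docs1 & docs2) / (l1 + l2)


def cosine(docs1, docs2):
    """common / sqrt(l1 * l2); returns 0 if either term has no documents."""
    l1, l2 = len(docs1), len(docs2)
    if l1 == 0 or l2 == 0:
        return 0.0
    return len(docs1 & docs2) / math.sqrt(l1 * l2)


term1_docs = {1, 2, 3, 4}
term2_docs = {3, 4, 5, 6}
```

For these two sets, which share two of their four documents, `dice` gives 0.25 and `cosine` gives 0.5.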


3 Organization of Vocabulary

Two assumptions:

– High-frequency words have broad meaning

– If the density functions of two terms, p and q (of varying frequencies), have the same shape, then the two words have similar meaning

Given these two assumptions, if p is the term with the higher frequency, then q becomes a child of p.


• Identify a set of frequency ranges.

• Group the vocabulary terms into different classes based on their frequencies.

• The highest frequency class is assigned level 0, the next level 1, and so on.

• Parent-child links are determined.

• Create "dummy" terms.
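The steps can be sketched as below. The slides do not say how parent-child links are chosen or when dummy terms are created, so this example assumes two rules: each term is linked to its most similar term one level up (by cosine over document sets), and a parent left without children receives a dummy copy of itself on the next level. All names, boundaries, and data are hypothetical.

```python
import math


def cosine(docs1, docs2):
    """common / sqrt(l1 * l2) over sets of document ids."""
    if not docs1 or not docs2:
        return 0.0
    return len(docs1 & docs2) / math.sqrt(len(docs1) * len(docs2))


def assign_levels(term_docs, boundaries):
    """boundaries: descending minimum document frequencies, one per level;
    terms below the last boundary fall into one final level."""
    levels = [[] for _ in range(len(boundaries) + 1)]
    for term, docs in term_docs.items():
        level = next((i for i, b in enumerate(boundaries) if len(docs) >= b),
                     len(boundaries))
        levels[level].append(term)
    return levels


def link_levels(levels, term_docs):
    """Assumed rules: child -> most similar term one level up; childless
    parents get a '(dummy)' copy of themselves as their child."""
    docs = dict(term_docs)
    levels = [list(level) for level in levels]
    parent_of = {}
    for i in range(1, len(levels)):
        for term in list(levels[i]):
            if levels[i - 1]:
                parent_of[term] = max(
                    levels[i - 1], key=lambda p: cosine(docs[p], docs[term]))
        for parent in levels[i - 1]:
            if parent not in parent_of.values():
                dummy = parent + " (dummy)"
                docs[dummy] = docs[parent]
                levels[i].append(dummy)
                parent_of[dummy] = parent
    return parent_of


term_docs = {
    "system": {1, 2, 3, 4},
    "network": {5, 6, 7, 8},
    "thesaurus": {1, 2},
    "index": {2},
}
levels = assign_levels(term_docs, boundaries=[4, 2])
parent_of = link_levels(levels, term_docs)
```

In this toy data "thesaurus" becomes a child of "system" and "index" a child of "thesaurus", while "network", which has no similar lower-frequency term, is carried down as a dummy.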


Merging existing thesauri

Simple-merge
– Two terms in different hierarchies are merged if they are identical.

Complex-merge
– Any two terms in different hierarchies are merged if they have 'similar' parents and children.
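Simple-merge can be sketched in a few lines, assuming each hierarchy is stored as a mapping from a term to the set of its child terms. That representation and the function name `simple_merge` are assumptions for illustration. Complex-merge is not shown because it needs a definition of 'similar' that the slides leave open.

```python
def simple_merge(h1, h2):
    """Merge two hierarchies, each a mapping of term -> set of child terms.

    Identical terms are merged by taking the union of their children;
    the input hierarchies are left unmodified.
    """
    merged = {term: set(children) for term, children in h1.items()}
    for term, children in h2.items():
        merged.setdefault(term, set()).update(children)
    return merged


h1 = {"tree": {"oak tree"}}
h2 = {"tree": {"pine tree"}, "plant": {"tree"}}
merged = simple_merge(h1, h2)
```

After the merge, "tree" carries the children from both hierarchies and sits under "plant", and neither source hierarchy has been altered.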


Conclusion

This chapter began with an introduction to thesauri.

Two major automatic thesaurus construction methods have been detailed.

A few issues related to thesauri have not been considered here:
– Evaluation of thesauri
– Maintenance of thesauri
– How to automate the usage of thesauri
