Mining the Web for Information Organization J. H. Wang Academia Sinica.


Mining the Web for Information Organization

J. H. Wang, Academia Sinica


Outline

• Introduction
• Web Mining
• Cross-Language Web Search
• Other Applications


Introduction

• Huge amount of Web data
– Rich and dynamic resources of human knowledge
– Multimedia
– Scalability

How to organize Web data into useful information?


Number of Web Pages

The world's largest search engine?

[Chart: Billions of Textual Documents Indexed, December 1995-September 2003. KEY: GG=Google, ATW=AllTheWeb, INK=Inktomi, TMA=Teoma, AV=AltaVista.]

Source: Search Engine Watch (Nov. 2004)

Search Engine   Reported Size            Page Depth
Google          8.1 billion              101K
MSN             5.0 billion              150K
Yahoo           4.2 billion (estimate)   500K
Ask Jeeves      2.5 billion              101K+


Web Users and Pages (7 years ago)

Area         Users   Web Pages   Time
World-wide   150M    800M        7/99
China        4M      2.5~3M      7/99
Taiwan       4M      3M          7/99

Challenge of Scalability!

Total Users: 800M
Chinese Users: 110M

Including 87M (CN), 4.9M (HK), 11.6M (TW), 2.9M (MY), 2.14M (SG), and others.

Source: Global Reach, 2004


Web Mining

• Data Mining
• Text Mining
• Web Mining Technologies


Data Mining

• Data Mining (Knowledge Discovery in Databases) is a process of nontrivial extraction of implicit, previously unknown, and potentially useful information (such as knowledge rules, constraints, regularities) from data in databases [G. Piatetsky-Shapiro and W. J. Frawley]


Text Mining

• Text Mining is the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources [Marti Hearst]


Web Mining

• Web Mining is the use of data mining techniques to automatically discover and extract information from Web documents and services [O. Etzioni]


Comparison

• Data mining tries to find interesting (non-trivial, implicit, previously unknown, potentially useful) patterns from large databases

• In text mining, the patterns are extracted from natural language texts rather than from structured databases of facts

• Web mining discovers and extracts information from Web documents and services


Web Mining Technologies

• Web content mining
• Web structure mining
• Web usage mining


Web Content Mining

• Unstructured documents
– Free texts such as news articles
• Semi-structured documents
– HTML structures and hyperlink information
– Intra-document structure
• Applications: text categorization, text clustering, information extraction, computational linguistics, …


Web Structure Mining

• The structure of the hyperlinks within the Web
– Inter-document structure
– HITS, PageRank
• Social network and citation analysis
• Applications: to calculate the quality rank or relevancy of each Web page, Web page categorization, …
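
HITS and PageRank are only named on this slide. As a rough illustration, the Python sketch below runs a basic power-iteration PageRank on a toy link graph. The damping factor of 0.85, the iteration count, the uniform treatment of dangling pages, and the graph itself are assumptions of this sketch, not values from the talk.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {page: [pages it links to]}."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                # Split this page's rank evenly over its out-links.
                share = damping * rank[p] / len(targets)
                for q in targets:
                    new_rank[q] += share
            else:
                # Dangling page (no out-links): spread its rank over all pages.
                for q in pages:
                    new_rank[q] += damping * rank[p] / n
        rank = new_rank
    return rank


# Hypothetical three-page Web: a links to b and c, b links to c, c links to a.
toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(toy_web)
for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))
```

Page c collects links from both a and b, so it ends up ranked above b, which is only linked from a.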


Web Usage Mining

• Techniques that could predict user behavior while the user interacts with the Web
– To map the usage data of the Web server into relational tables
– To use the log data directly
• Applications: learning a user profile (personalization) vs. learning user navigation patterns


Related Fields of Research

• IR (Information Retrieval)
• IE (Information Extraction)
• ML (Machine Learning)


LiveTrans: Cross-language Web Search

• LiveTrans: http://livetrans.iis.sinica.edu.tw/lt.html


Examples


More Examples


Cross Language Information Retrieval (CLIR)

• A technology enabling users to query in one language and retrieve relevant documents written or indexed in another language


Cross Language Web Search

• A technology enabling users to query in one language and retrieve relevant Web pages written or indexed in another language


Why “Cross-Language”?

• Source: Global Reach (global-reach.biz/globstats)


Top Ten Languages Used in the Web (Number of Internet Users by Language)

Source: Internet World Stats (Sep. 20, 2006)

Language                  % of all Internet Users   Internet Users by Language   Internet Penetration by Language   Internet Growth for Language (2000-2006)   World Population 2006 Estimate for the Language
English                   29.7 %                    322,600,837                  28.7 %                             135.2 %                                    1,125,664,397
Chinese                   13.3 %                    144,301,513                  10.8 %                             346.7 %                                    1,340,767,863
Japanese                  7.9 %                     86,300,000                   67.2 %                             83.3 %                                     128,389,000
Spanish                   7.5 %                     81,729,671                   18.7 %                             231.1 %                                    437,502,257
German                    5.4 %                     58,854,682                   61.3 %                             113.2 %                                    95,982,043
French                    4.6 %                     49,660,498                   13.0 %                             307.1 %                                    381,193,149
Portuguese                3.1 %                     34,064,760                   14.8 %                             349.6 %                                    230,846,275
Korean                    3.1 %                     32,372,000                   45.8 %                             78.0 %                                     73,945,860
Italian                   2.7 %                     28,870,000                   48.8 %                             118.7 %                                    59,115,261
Russian                   2.2 %                     23,700,000                   16.5 %                             664.5 %                                    143,682,757
TOP TEN LANGUAGES         79.5 %                    863,981,961                  21.5 %                             166.7 %                                    4,017,088,863
Rest of World Languages   20.5 %                    222,268,942                  9.0 %                              500.0 %                                    2,482,608,197
WORLD TOTAL               100.0 %                   1,086,250,903                16.7 %                             200.9 %                                    6,499,697,060

More and more non-English users!


Web Content by Language

Source: http://www.netz-tipp.de/languages.html (2002)

[Chart of Web Content, 2002. x-axis: Language (English, German, French, Japanese, Spanish, Chinese, Italian, Dutch, Russian, Korean, Portuguese); y-axis: Millions of Web pages (0 to 1200)]

More and more non-English pages


Number of Chinese Web Pages

866,000,000 pages

Scalability Problem!


Challenge of Cross-Language Web Search

• Existing CLIR systems mostly rely on bilingual dictionaries and dictionary lookup

• Translations of 81% of the search terms could not be obtained from common English-Chinese translation dictionaries
– Examples: 中央處理器 (CPU), 電子商務 (E-commerce), 個人數位助理 (PDA), 雅虎 (Yahoo), 太空總署 (NASA), 星際大戰 (Star Wars), 非典型肺炎 (SARS), …


Challenge

• Existing CLIR systems mostly rely on bilingual dictionaries and dictionary lookup

• Translations of 81% of the search requests could not be obtained from common English-Chinese translation dictionaries

• How to find effective translations automatically for query terms not included in a dictionary ?


CLIR

• Conventional approach to query translation
– Parallel documents as the corpus
– Assume long queries
• Problems of CLIR in Web search
– No corpus for cross-lingual training
– Short queries
• “Out-of-dictionary” terms
– Ex: proper nouns, new terminologies, …

English Terminologies                    Chinese Translation
mechanical strain                        機械應變
viscous damping                          黏滯阻尼
Richard Feynman                          費曼
Hypoplastic Left Heart Syndrome          左心發育不全症候群
NII Japan                                國立情報學研究所
SARS                                     嚴重急性呼吸道症候群
Extracorporeal Shock Wave Lithotripsy    震波碎石
Davinci                                  達文西


Translation Lexicon Construction for CLIR

• To use the Web as the corpus for query translation
– Web mining techniques
  • Anchor-text-based [ACM TOIS ‘04, ACM TALIP ‘02]
  • Search-result-based [JCDL ‘04]
• To extract terms from real document collections as possible queries
– Term extraction method [SIGIR ‘97]


Web Mining Approach to Term Translation Extraction

[Diagram: the source query "Academia Sinica" is sent to the LiveTrans Engine, which mines anchor texts and search results from the Web and returns the target translations 中央研究院 / 中研院]


Search-Result Page – National Palace Museum vs. 故宮博物院

• Mixed-language characteristic in Chinese pages
• How to extract translation candidates?
• Which candidates to choose?

Noises


Coverage Rate of Top-Ranked Search-Result Pages

• 95% of popular Web queries’ translations can be found in top 30-40 result pages

• About 70% of random queries were covered

• Many relevant translations can also be found


Anchor-Text Set -- Yahoo vs. 雅虎

• Anchor text (link text)
– The descriptive text of a link on a Web page
• Anchor-text set
– A set of anchor texts pointing to the same page (URL)
– Multilingual translations
  − Yahoo/雅虎/야후
  − America/美国/アメリカ
• Anchor-text-set corpus
– A collection of anchor-text sets

[Diagram: Web pages from Taiwan, China, Japan, and Korea link to http://www.yahoo.com with anchor texts such as "Yahoo Search Engine", "美国雅虎", "雅虎搜尋引擎", "Yahoo! America", "야후 -USA", and "アメリカの Yahoo!"]
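
The anchor-text-set idea above can be sketched in a few lines of Python: group anchor texts by the URL they point to, then treat texts that share a set as translation candidates of each other. The function names and the sample links are hypothetical, and the raw co-occurrence count used here is a simplification rather than the scoring model of the cited anchor-text papers.

```python
from collections import Counter, defaultdict


def build_anchor_text_sets(links):
    """Group anchor texts by target URL.

    `links` is an iterable of (anchor_text, target_url) pairs, such as a
    crawler might collect. Returns {url: set of anchor texts}.
    """
    sets = defaultdict(set)
    for text, url in links:
        text = text.strip()
        if text:
            sets[url].add(text)
    return dict(sets)


def translation_candidates(corpus, term):
    """Count how often other anchor texts share a set with `term`."""
    counts = Counter()
    for texts in corpus.values():
        if term in texts:
            counts.update(t for t in texts if t != term)
    return counts


# Hypothetical crawl output: four links to the same page.
links = [
    ("Yahoo", "http://www.yahoo.com"),
    ("雅虎", "http://www.yahoo.com"),
    ("야후", "http://www.yahoo.com"),
    ("Yahoo", "http://www.yahoo.com"),
]
corpus = build_anchor_text_sets(links)
print(translation_candidates(corpus, "Yahoo"))
```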


Problems

• How to extract translation candidates with correct lexical boundary?
– Term extraction
  • From the search-result pages
  • From the document collections
– Bilingual lexicon construction
• How to choose relevant candidates?
– Term translation


Term Translation Extraction from Different Resources

[Diagram: Source Query (National Palace Museum) → Search Engine → Search-Result Pages; Web Spider → Anchor-Text Corpus; both resources feed Term Extraction → Similarity Estimation → Target Translation (國立故宮博物院, 故宮, 故宮博物院)]


Term Extraction

• Problem

[Diagram: text segments extracted from DL documents]
– Correctly segmented: 國立故宮博物院, 故宮博物院, 故宮
– Incorrect text segments: 立故宮, 故宮博, 宮博物, 宮博


Three Approaches to Term Extraction

• NLP (Linguistic) approach
– Named entity recognition
• Extraction pattern/template approach
– Wrapper generation/induction
• Statistical approach
– Class-based language model
– PAT-tree-based


PAT-tree-based Term Extraction

• SCP (Symmetric Conditional Probability)
– Cohesion holding the words together
– Low frequency n-grams tend to be discarded
• CD (Context Dependency)
– Dependence on the left- or right- adjacent word/character
– Low frequency n-grams can be extracted
• SCPCD: a combination of the two


Association Measure

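$$\mathrm{SCP}(w_1 \ldots w_n) = \frac{p(w_1 \ldots w_n)^2}{\frac{1}{n-1}\sum_{i=1}^{n-1} p(w_1 \ldots w_i)\, p(w_{i+1} \ldots w_n)} = \frac{\mathrm{freq}(w_1 \ldots w_n)^2}{\frac{1}{n-1}\sum_{i=1}^{n-1} \mathrm{freq}(w_1 \ldots w_i)\, \mathrm{freq}(w_{i+1} \ldots w_n)}$$

$$\mathrm{CD}(w_1 \ldots w_n) = \frac{\mathrm{LC}(w_1 \ldots w_n)\, \mathrm{RC}(w_1 \ldots w_n)}{\mathrm{freq}(w_1 \ldots w_n)^2}$$

$$\mathrm{SCPCD}(w_1 \ldots w_n) = \mathrm{SCP}(w_1 \ldots w_n) \times \mathrm{CD}(w_1 \ldots w_n) = \frac{\mathrm{LC}(w_1 \ldots w_n)\, \mathrm{RC}(w_1 \ldots w_n)}{\frac{1}{n-1}\sum_{i=1}^{n-1} \mathrm{freq}(w_1 \ldots w_i)\, \mathrm{freq}(w_{i+1} \ldots w_n)}$$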
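
To make the SCPCD measure concrete, here is a minimal Python sketch that scores character n-grams in a tiny corpus. It assumes that LC and RC are the numbers of distinct left- and right-adjacent characters of an n-gram, which the slides do not spell out, and it counts n-grams with plain dictionaries instead of a PAT tree. The sample sentences are invented for illustration.

```python
from collections import defaultdict


def ngram_stats(texts, max_n):
    """Collect frequency and left/right neighbour sets for all n-grams up to max_n."""
    freq = defaultdict(int)
    left = defaultdict(set)
    right = defaultdict(set)
    for text in texts:
        for i in range(len(text)):
            for n in range(1, max_n + 1):
                if i + n > len(text):
                    break
                gram = text[i:i + n]
                freq[gram] += 1
                if i > 0:
                    left[gram].add(text[i - 1])
                if i + n < len(text):
                    right[gram].add(text[i + n])
    return freq, left, right


def scpcd(gram, freq, left, right):
    """SCPCD = LC * RC / average over split points of freq(prefix) * freq(suffix)."""
    n = len(gram)
    if n < 2:
        raise ValueError("SCPCD needs an n-gram of length 2 or more")
    avg = sum(freq[gram[:i]] * freq[gram[i:]] for i in range(1, n)) / (n - 1)
    if avg == 0:
        return 0.0  # n-gram (or one of its parts) never seen
    # Assumption: LC/RC are counts of distinct adjacent characters.
    return len(left[gram]) * len(right[gram]) / avg


# Invented sentences in which 故宮博物院 appears in three different contexts.
texts = ["到故宮博物院參觀", "在故宮博物院旁", "去故宮博物院玩"]
freq, left, right = ngram_stats(texts, max_n=5)
full_term = scpcd("故宮博物院", freq, left, right)
fragment = scpcd("宮博物", freq, left, right)
print(full_term, fragment)
```

The complete term has varied neighbours on both sides and therefore scores higher than the fragment 宮博物, which is always preceded by 故 and followed by 院.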


Term Extraction Performance

Association Measure   Precision   Recall   Avg. R-P
CD                    68.1%       5.9%     37.0%
SCP                   62.6%       63.3%    63.0%
SCPCD                 79.3%       78.2%    78.7%

Table 1. The obtained extraction accuracy including precision, recall, and average recall-precision of auto-extracted translation candidates using different methods.


Speed Performance

Table 2. The obtained average speed performance of different term extraction methods.

Term Extraction Method            Time for Preprocessing   Time for Extraction
LocalMaxs (Web Queries)           0.87 s                   0.99 s
PATtree+LocalMaxs (Web Queries)   2.30 s                   0.61 s
LocalMaxs (1,367 docs)            63.47 s                  4,851.67 s
PATtree+LocalMaxs (1,367 docs)    840.90 s                 71.24 s
LocalMaxs (5,357 docs)            47,247.55 s              350,495.65 s
PATtree+LocalMaxs (5,357 docs)    11,086.67 s              759.32 s


Term Translation

• Problem

[Diagram: Similarity links a source query to translation candidates and relevant terms. Terms shown: 故宮博物院, 故宮, 繪畫, 書法, 陶瓷, 玉器, 瓷器, 刺繡, porcelain]


Similarity Estimation

How to decide the ranking?

1) S, Ti: frequently co-occur in the same pages
– Not necessarily true for synonyms and antonyms
2) S, Ti: have similar co-occurring context terms in the search-result pages

[Diagram: the query S (National Palace Museum) is compared with the candidates T1, T2, …, Tn (國立故宮博物院, 故宮, 故宮博物院)]


Chi-Square Test

• Chi-Square Test: a statistical method for co-occurrence analysis [Gale & Church ‘91]

$$S_{\chi^2}(s,t) = \frac{N \times (a \times d - b \times c)^2}{(a+b) \times (a+c) \times (b+d) \times (c+d)} \qquad (3)$$

a: # of pages containing both terms s and t

b: # of pages containing term s but not t

c: # of pages containing term t but not s

d: # of pages containing neither term s nor t

N: the total number of pages, i.e., N = a+b+c+d

Available from search engine
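
As a worked illustration of equation (3), the Python sketch below derives the four contingency counts from page-hit counts and computes the chi-square score. The helper names, the hit counts, and the total index size are made up for the example; a real system would obtain the counts from Boolean queries against a search engine.

```python
def counts_from_hits(hits_s, hits_t, hits_both, total_pages):
    """Turn page-hit counts into the contingency counts a, b, c, d."""
    a = hits_both                  # pages with both s and t
    b = hits_s - hits_both         # pages with s but not t
    c = hits_t - hits_both         # pages with t but not s
    d = total_pages - a - b - c    # pages with neither
    return a, b, c, d


def chi_square_score(a, b, c, d):
    """Equation (3): N * (a*d - b*c)^2 / ((a+b)(a+c)(b+d)(c+d))."""
    n = a + b + c + d
    denom = (a + b) * (a + c) * (b + d) * (c + d)
    if denom == 0:
        return 0.0  # degenerate table: one term occurs on all pages or none
    return n * (a * d - b * c) ** 2 / denom


# Hypothetical hit counts: "s" -> 30 pages, "t" -> 20, "s AND t" -> 10, index of 100.
a, b, c, d = counts_from_hits(30, 20, 10, 100)
print(a, b, c, d, round(chi_square_score(a, b, c, d), 3))
```

Two terms that always appear together get a high score, while statistically independent terms score 0.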


Example Boolean Query for Chi-Square Test


Context Vector Analysis

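• Context Vector Analysis
– Co-occurring context terms as feature vectors
– TF-IDF weighting
• Similarity measure
– Cosine measure

$$w_{t_i} = \frac{f(t_i, d)}{\max_j f(t_j, d)} \times \log\left(\frac{N}{n}\right) \qquad (4)$$

$$S_{cv}(s,t) = \frac{\sum_{i=1}^{m} w_{s_i} \times w_{t_i}}{\sqrt{\sum_{i=1}^{m} (w_{s_i})^2 \times \sum_{i=1}^{m} (w_{t_i})^2}} \qquad (5)$$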
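
The context-vector method can be sketched as follows in Python. The sketch reads equation (4) as: term frequency in the search-result text, normalized by the largest frequency, times log(N/n), with N the total number of pages and n the number of pages containing the term. That reading, the function names, and all counts below are assumptions made for illustration.

```python
import math


def tfidf_vector(term_counts, doc_freq, total_pages):
    """Weight each context term as (f / max f) * log(N / n), as in equation (4)."""
    if not term_counts:
        return {}
    max_f = max(term_counts.values())
    return {
        term: (f / max_f) * math.log(total_pages / doc_freq[term])
        for term, f in term_counts.items()
    }


def cosine(u, v):
    """Cosine measure of two sparse vectors stored as dicts, as in equation (5)."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical context-term counts taken from search-result snippets.
N = 1000
doc_freq = {"museum": 50, "palace": 40, "taipei": 100, "weather": 200}
s_vec = tfidf_vector({"museum": 4, "palace": 2, "taipei": 1}, doc_freq, N)
t_vec = tfidf_vector({"museum": 3, "palace": 3, "taipei": 1}, doc_freq, N)
u_vec = tfidf_vector({"weather": 5}, doc_freq, N)
print(round(cosine(s_vec, t_vec), 3), cosine(s_vec, u_vec))
```

A candidate whose snippets share context terms with the source query scores close to 1, and a candidate with no shared context terms scores 0.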


The Combined Method

• To take advantage of both methods
– Chi-Square Test: co-occurrence
– Context Vector Analysis: similar context

$$S_{all}(s,t) = \sum_{m} \frac{\alpha_m}{R_m(s,t)} \qquad (6)$$

α_m: weight assigned to method m

R_m(s,t): ranking of the score of candidate t for query s under method m
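
A minimal Python sketch of the rank-based combination follows. Each method's scores are converted to ranks (1 = best), and a candidate's combined score is the sum of weight/rank over the methods. The per-method weights are an assumption of this sketch (set them all to 1 for an unweighted sum), and the scores are invented.

```python
def ranks(scores):
    """Map each candidate to its rank (1 = highest score) under one method."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {cand: i + 1 for i, cand in enumerate(ordered)}


def combined_score(score_tables, weights):
    """Sum weight / rank over all methods for every candidate."""
    rank_tables = [ranks(table) for table in score_tables]
    candidates = set().union(*score_tables)
    return {
        cand: sum(w / r[cand] for w, r in zip(weights, rank_tables) if cand in r)
        for cand in candidates
    }


# Invented scores for three candidates under the two methods.
chi_scores = {"故宮博物院": 9.0, "故宮": 5.0, "瓷器": 1.0}
cv_scores = {"故宮博物院": 0.8, "故宮": 0.9, "瓷器": 0.1}
combined = combined_score([chi_scores, cv_scores], weights=[1.0, 1.0])
for cand, score in sorted(combined.items(), key=lambda kv: -kv[1]):
    print(cand, round(score, 3))
```

Working on ranks instead of raw scores sidesteps the fact that chi-square values and cosine values live on very different scales.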


Experiments

• Web Search
– Chinese search engine logs
  • Dreamer Log: 228,566 unique terms, during a period of 3 months in 1998
  • GAIS Log: 114,182 unique queries during a period of two weeks in 1999
• Digital Library
– STICNET Database
  • 33,797 scientific documents in 86 categories, between 1983 and 1997


Experiments on Web Search

• Test sets
– Popular-query set: 430 popular query terms
  • Type Dic: 36%
  • Type OOV: 64%
– Random-query set
  • Randomly selected 200 Chinese queries from the top 20,000 queries in Dreamer Log
– Proper names & technical terms
  • 50 scientists’ names & 50 disease names
– Common terms
  • Randomly selected 100 common nouns and 100 common verbs


Popular Chinese Query Set

Table 3. Coverage and inclusion rates for popular Chinese queries using different methods.

Method     Query Type   Top-1   Top-3   Top-5   Coverage
CV         Dic          56.4%   70.5%   74.4%   80.1%
CV         OOV          56.2%   66.1%   69.3%   85.0%
CV         All          56.3%   67.7%   71.2%   83.3%
χ2         Dic          40.4%   61.5%   67.9%   80.1%
χ2         OOV          54.7%   65.0%   68.2%   85.0%
χ2         All          49.5%   63.7%   68.1%   83.3%
Combined   Dic          57.7%   71.2%   75.0%   80.1%
Combined   OOV          56.6%   67.9%   70.9%   85.0%
Combined   All          57.2%   68.6%   72.8%   83.3%
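
The slides report top-n inclusion rates and coverage without defining them. The Python sketch below assumes the usual reading: the top-n inclusion rate is the fraction of queries with at least one correct translation among the first n ranked candidates, and coverage is the fraction with a correct translation anywhere in the candidate list. The function names and the sample data are hypothetical.

```python
def top_n_inclusion_rate(ranked_candidates, gold, n=None):
    """Fraction of queries with a correct translation in the top n candidates.

    ranked_candidates: {query: [candidates, best first]}
    gold: {query: set of acceptable translations}
    n=None checks the whole candidate list (the coverage rate in this sketch).
    """
    if not ranked_candidates:
        return 0.0
    hits = 0
    for query, candidates in ranked_candidates.items():
        top = candidates if n is None else candidates[:n]
        if gold.get(query, set()) & set(top):
            hits += 1
    return hits / len(ranked_candidates)


# Hypothetical system output and reference translations for three queries.
ranked = {
    "National Palace Museum": ["故宮", "故宮博物院", "瓷器"],
    "Yahoo": ["搜尋", "網站", "雅虎"],
    "NASA": ["火箭"],
}
gold = {
    "National Palace Museum": {"國立故宮博物院", "故宮博物院", "故宮"},
    "Yahoo": {"雅虎"},
    "NASA": {"太空總署"},
}
top1 = top_n_inclusion_rate(ranked, gold, n=1)
top3 = top_n_inclusion_rate(ranked, gold, n=3)
coverage = top_n_inclusion_rate(ranked, gold)
print(top1, top3, coverage)
```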


Popular English Query Set

Table 4. Coverage and top-n inclusion rates for popular English query set using different methods.

Method     Top-1   Top-3   Top-5   Coverage
CV         50.9%   60.1%   60.8%   80.9%
χ2         44.6%   56.1%   59.2%   80.9%
Combined   51.8%   60.7%   62.2%   80.9%


Random Query Set

Table 5. Coverage and top-n inclusion rates for the random-query set using different methods.

Method     Top-1   Top-3   Top-5   Coverage
CV         25.5%   45.5%   50.5%   60.5%
χ2         26.0%   44.5%   50.5%   60.5%
Combined   29.5%   49.5%   56.5%   60.5%


Proper Names and Technical Terms

Table 6. Top-n inclusion rates for proper names and technical terms using the combined method.

Query Type       Top-1   Top-3   Top-5
Scientist Name   40.0%   52.0%   60.0%
Disease Name     44.0%   60.0%   70.0%


Common Nouns and Verbs

Table 8. Top-n inclusion rates for common nouns and verbs using the combined approach.

Query Type Top-1 Top-3 Top-5

100 Common Nouns 23.0% 33.0% 43.0%

100 Common Verbs 6.0% 8.0% 10.0%

• Our methods are less reliable for common terms


Summary of Different Methods

• χ2 test
– Fast
– More suitable for high-frequency terms
• CV
– Slow (for feature extraction)
– Applicable to low-frequency terms
• Combined
– Slow
– Both high-frequency & low-frequency terms


Experiments on Digital Libraries

• Cross-language search for STICNET Database
– 33,797 scientific documents in 86 categories, between 1983 and 1997
– 410,557 English-Chinese bilingual key terms
• Challenges:
– Various categories in specific domains
– Hard to find translations on the Web


Example Cross-Lingual Queries in STICNET Database


STICNET Database Search Result


Translation of Auto-Extracted Unknown Terms

Table 9. The top-n inclusion rates of translations for auto-extracted useful unknown terms.

Query Type                                               Top-1   Top-3   Top-5
Auto-extracted useful terms in Information Engineering   33.3%   37.5%   50.0%
Auto-extracted useful terms in Medicine                  34.6%   46.2%   50.0%

• The feasibility of translating auto-extracted unknown terms has been shown


Some Examples of the Auto-Extracted Translations

English Terminologies Chinese Translation

mechanical strain 機械應變

viscous damping 黏滯阻尼

Extracorporeal Shock Wave Lithotripsy 震波碎石

Galilei, Galileo 伽利略 / 伽里略 / 加利略

Legionnaires' Disease 退伍軍人症


Other Applications

• Text Classification
• Query Clustering
• Search Result Clustering
• Concept Search


LiveClassifier

A system that creates classifiers through Web mining

[WWW 2004]


LiveClassifier

Users create topic hierarchies and define classes/keywords


LiveClassifier

Web

Auto-extracted training data; No manually-labeled data provided

Exploiting the structure information inherent for training


LiveClassifier

People

Place

Subjects

Sub-subjects


LiveClassifier

Classifying documents

Into classes


LiveClassifier

Classifying short texts

Into classes


LiveClassifier


LiveClassifier


Term Clustering


HAC-based Binary Clustering


Min-Max Partition


Query Clustering

[Dendrogram: hierarchical clustering of query terms, cut into five clusters]

• 武俠, 金庸, 武俠小說, 黃易, 作家
• 補帖, 大補帖, 泡麵, dbt
• eva, 長榮航空 (EVA airline), 長榮 (EVA), 航空公司 (airline), 航空 (airway), 華航 (China airline), 中華航空 (China airline)
• 占卜, 塔羅牌, 算命, 紫微斗數, 命理, 姓名學, 心理測驗, 星座, 愛情
• 勞委會, 職訓局, 就業, 青輔會, 自傳, 徵才, 人力資源, 104 人力銀行, 人力銀行, 找工作, 履歷表, 求職, 求才


Thesaurus Construction from Query Log

• Query logs provide representative terms for DL usage
• Taxonomy generation from query logs
– Query clustering
– Query categorization
– Document categorization

[Diagram: Query Logs → High-freq terms → Taxonomy Generation (Query Clustering); Low-freq terms → Query Term Categorization; Relevant documents → Document Term Categorization]


Search Result Clustering

• Why search result clustering?
• Why is SRC different from document clustering?
– In assessment of algorithm’s quality
– Precision, recall vs. user-oriented, subjective assessment


Example of Search Result Clustering

Query: NTU?

• National Taiwan University
• NTU Hospital
• Nanyang Technological University, Singapore


Example Clustering Search Engines

• Vivisimo.com
– Clusty.com
• KillerInfo.com
• InfoNetWare.com
• SnakeT (Snippet Aggregation for Knowledge ExTraction): http://roquefort.unipi.it/
– A hierarchical clustering engine for snippets
• Mooter.com
• …


Example on Vivisimo


Vivisimo (cont.)


Clusty.com


InfoNetWare.com


Concept Search

• Conventional search
– Keyword search for “researcher” and “AI” and “Taiwan”
• Concept-level search
– [Diagram: an interesting document containing “professor”, “NTU”, and “neural network” is matched to the concepts researcher, AI, and Taiwan]


Further Reading

• Jenq-Haur Wang, Jei-Wen Teng, Wen-Hsiang Lu, and Lee-Feng Chien, "Exploiting the Web as the Multilingual Corpus for Unknown Query Translation," Journal of the American Society for Information Science and Technology (JASIST), Vol. 57, No. 5, pp. 660-670, Special Issue on Multilingual Information Systems, Mar. 2006. (SCI, SSCI)

• Jenq-Haur Wang, Jei-Wen Teng, Pu-Jen Cheng, Wen-Hsiang Lu, and Lee-Feng Chien, "Translating Unknown Cross-Lingual Queries in Digital Libraries Using a Web-based Approach," Proceedings of the 4th ACM/IEEE Joint Conference on Digital Libraries (JCDL 2004), pp. 108-116.

• Wen-Hsiang Lu, Lee-Feng Chien, and Hsi-Jian Lee, "Anchor Text Mining for Translation of Web Queries: A Transitive Translation Approach," ACM Transactions on Information Systems, Vol. 22, No. 2, pp. 242-269, 2004. (SCI)

• Chien-Chung Huang, Shui-Lung Chuang, and Lee-Feng Chien, "LiveClassifier: Creating Hierarchical Text Classifiers through Web Corpora," Proceedings of WWW 2004, pp. 184-192.

• Wen-Hsiang Lu, Lee-Feng Chien, and Hsi-Jian Lee, "Translation of Web Queries Using Anchor Text Mining," ACM Transactions on Asian Language Information Processing, pp. 159-172, 2002.

• Lee-Feng Chien, T.-I. Huang, and M-C. Chien, "PAT-tree-based Keyword Extraction for Chinese Information Retrieval," Proceedings of SIGIR 1997, pp. 50-58.


Thanks for Your Attention!

• Any questions or comments?
– [email protected]