TERM IMPACT-BASED WEB PAGE RANKING

School of Electrical Engineering and Computer Science

Falah Al-akashi and Diana Inkpen

2014


• Background
• Challenges
• Goal
• Contributions
• Our Approach
• Query Processing
• Evaluation
• Conclusion


• TREC 2009-2012 Web track.
• One billion pages (10 languages).
• 25 TB uncompressed.
• 5 TB compressed.
• 500,000,000 pages (A).
• 50,000,000 pages (B).
• Submitted systems: Microsoft Research, Yahoo Group, Google team, University of Glasgow, University of Waterloo, University of Ottawa, University of Delaware, University of California, University of Maryland, University of Twente, Carnegie Mellon University, University of Melbourne, University of Amsterdam, York University, University of Otago, University of Massachusetts, Queensland University group, Chinese Academic group, Hungarian Academic, Centrum Wiskunde, University Dublin, University London, SIFT Project, etc.


• The huge growth of the Internet from 1995 until now (millions).
• Lack of clear topic boundaries in most Web documents.
• Lack of clear topic boundaries in most user queries.
• Many of the relevant topics are available as subtopics of, or are semantically similar to, other topics in the same documents.
• Search results cannot satisfy all users' points of view.
• Spam documents have an impact on web search engines.
• Home page and entity finding queries require extra effort and different algorithms than regular search.


• Web IR (WIR) is different from classical IR because the environment is dynamic and highly diverse: information is often added, updated, or becomes unavailable.

• The Web keeps growing and becoming more complex; similarly, queries are becoming more complex, too.

• Some sites have no credibility in their content. A few popular sites, e.g., Wikipedia, provide connectivity and engagement between popular sites in a social manner.

• Wikipedia seeks to create a summary of all human knowledge in the form of an online encyclopedia.

• Wikipedia intends only to convey knowledge that is already established, recognized, and rarely changed.

• Content in Wikipedia is subject to the copyright laws of the United States.

• Wikipedia is the sixth-most-popular website worldwide according to “Alexa Internet”, receiving more than 2.7 billion U.S. page views every month.


• Improving retrieval effectiveness on Web data.

• Exploiting the query structure.

• Adapting an index structure that is capable of retrieving results for different types of queries.

• We proposed a novel kind of (centralized) index structure that exploits the human knowledge accumulated and integrated in Wikipedia for indexing Web content.

• We proved the importance of term impact for document weighting over other document measures (e.g., tf, tf-idf).

• We proposed alternative ways of query normalization and expansion using Wikipedia.


• We proposed a collection of phrasal indexing algorithms that are suitable for queries of any length and any type.

• We showed the correlation between the topics available in different Wikipedia articles.

• We proposed a novel search model that adapts the global server locally on one computer.

• We proposed a search model that is able to index and answer queries quickly.


Our Index Structure


• Using home pages (from Wikipedia external links).
• Using other relevant pages (from Wikipedia external references).
• Using the connectivity between documents for query expansion.
• Finding related topics for queries that are difficult to index, e.g., “to be or not to be that is the question”.


• 10% of the English repository (B) (~5 million documents).
• ~50% of documents share the same content but are titled differently.
• ~50% of documents are full articles, while the others are short definitions.
• Our indexing removed the short articles (using a threshold) and grouped similar and long articles by:
  1- Using CRC16.
  2- Using common tags.
  3- Using term impact => for retrieving initial results.
• Titles, terms, external links, and other related text, such as anchors, are indexed in the Wikipedia index class.
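The filtering and grouping step above can be sketched roughly as follows. This is only an illustration under assumptions: the slides do not say which CRC16 variant or which length threshold was used, so the sketch relies on Python's `binascii.crc_hqx` (CRC-CCITT) and a hypothetical `min_words` cut-off, and the sample article text is made up.

```python
import binascii
import re
from collections import defaultdict


def crc16(text: str) -> int:
    """CRC-16 (CCITT, via binascii.crc_hqx) of lowercased, whitespace-normalized text."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return binascii.crc_hqx(normalized.encode("utf-8"), 0)


def group_articles(articles: dict[str, str], min_words: int = 5) -> dict[int, list[str]]:
    """Drop short articles, then group the remaining titles by content checksum."""
    groups: dict[int, list[str]] = defaultdict(list)
    for title, body in articles.items():
        if len(body.split()) < min_words:  # hypothetical threshold for "short definitions"
            continue
        groups[crc16(body)].append(title)
    return dict(groups)


# Made-up sample data: two titles share one body, the third is a short definition.
articles = {
    "Lipomatosis": "Same article text shared by two differently titled pages.",
    "Fatty-Tumor": "Same article   text shared by two differently titled pages.",
    "Stub": "Short definition.",
}
groups = group_articles(articles)
```

A 16-bit checksum collides often on a collection of millions of documents, so a real system would presumably compare the grouped candidates further; the sketch does not do that.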


1- Using Domain Name: Indexing all terms that are available in the main domains.
   1- Single word, e.g. diana
      main domain: www.diana.com, www.diana.gov, www.diana.org, www.diana.edu, www.diana.net, www.diana.(country code)
      subdomain: diana.???.com, diana.???.gov, diana.???.org, diana.???.edu, diana.???.(country code)
   3- Two words or more, e.g. princess diana
      www.princessdiana.??, www.princess-diana, princess.diana.??, diana.princess.??
   > All terms in the titles that referred to the domains above have been indexed.

2- Using Wikipedia External Links.
   @"((http://.*?)\[\]home page)
   @"((.*?)\[\]www.*?")
   @"((http://.*?)\[\].*?official.*?web.*?site)
   @"((http://.*?)\[\].*?official.*?site)
   @"((http://.*?)\[\].*?" + query terms+ @".*?\sweb.*?site)
   @"((.*?)\[\].*?website)
   @"((http://www.*?)\[\]" + query terms+ @".*?\sweb.*?site)
   @"((http://.*?)\[\]link)
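The domain-name rule can be illustrated with a small sketch. It assumes that "??" and "(country code)" mean one of the listed generic TLDs or any two-letter code, and that "???" stands for a single host label; the function names `host_patterns` and `is_candidate_home_page` are hypothetical and not taken from the system.

```python
import re
from urllib.parse import urlparse

GENERIC_TLDS = ("com", "gov", "org", "edu", "net")
# One of the generic TLDs from the slide, or any two-letter country code (assumption).
_TLD = r"(?:%s|[a-z]{2})" % "|".join(GENERIC_TLDS)


def host_patterns(query: str) -> list[re.Pattern]:
    """Build host-name patterns for a query, following the slide's examples."""
    words = [re.escape(w) for w in query.lower().split()]
    if not words:
        return []
    if len(words) == 1:
        w = words[0]
        stems = [
            rf"www\.{w}",          # main domain: www.diana.com
            rf"{w}\.[a-z0-9-]+",   # subdomain:   diana.???.com
        ]
    else:
        stems = [
            r"www\." + "".join(words),     # www.princessdiana.??
            r"www\." + "-".join(words),    # www.princess-diana.??
            r"\.".join(words),             # princess.diana.??
            r"\.".join(reversed(words)),   # diana.princess.??
        ]
    return [re.compile(rf"^{stem}\.{_TLD}$") for stem in stems]


def is_candidate_home_page(url: str, query: str) -> bool:
    """True if the URL's host name matches one of the query's patterns."""
    host = (urlparse(url).hostname or "").lower()
    return any(p.match(host) for p in host_patterns(query))
```

For example, `is_candidate_home_page("http://www.princess-diana.org/", "princess diana")` matches the hyphenated pattern, while a URL that only mentions the term in its path does not match.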


1- The abbreviations or combined terms used in the URLs can be recognized from the keywords in the titles.

2- Segmenting the titles of documents into phrases at the words "or", "and", "at", "in", "on", "by", "with", "from", or "for", or at the punctuation characters ":", "|", "(", ")", "-", ",", or "&".

3- Measuring the impact of phrases in their document's content.

4- Phrases with a high impact score have been used for building and naming the index nodes; otherwise they were discarded (threshold).

5- The impact of each phrase is computed using the cosine similarity between two vectors: the first vector is the extracted phrase, and the second vector is the document content.
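A minimal sketch of steps 2 to 5, assuming plain term-frequency vectors and a made-up threshold of 0.2 (the slides give neither the weighting scheme nor the threshold value). The helper names are hypothetical.

```python
import math
import re
from collections import Counter

# Delimiters listed on the slide: connecting words and punctuation characters.
_DELIMS = re.compile(r"\b(?:or|and|at|in|on|by|with|from|for)\b|[:|()\-,&]", re.IGNORECASE)
_TOKEN = re.compile(r"[a-z0-9]+")


def segment_title(title: str) -> list[str]:
    """Split a title into lowercase phrases at the delimiters."""
    parts = (p.strip().lower() for p in _DELIMS.split(title))
    return [p for p in parts if p]


def term_vector(text: str) -> Counter:
    """Raw term-frequency vector (an assumption; other weightings are possible)."""
    return Counter(_TOKEN.findall(text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def high_impact_phrases(title: str, content: str, threshold: float = 0.2) -> dict[str, float]:
    """Score each title phrase against the content; keep those at or above the threshold."""
    doc = term_vector(content)
    scored = {p: cosine(term_vector(p), doc) for p in segment_title(title)}
    return {p: s for p, s in scored.items() if s >= threshold}


title = "Diana Inkpen - University of Ottawa | Home"
content = ("Diana Inkpen is a professor at the University of Ottawa. "
           "Diana Inkpen works on natural language processing.")
kept = high_impact_phrases(title, content)
```

In this example the phrase "home" never occurs in the content, so its impact is zero and it is discarded, while the other two title phrases survive as candidate index node names.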


1- Terms in the URLs and titles are referred to as the document's keywords.

2- At least one term from each user query is shared with the keywords above, whereas the other terms are available in the document's content.

• site, [Impact,t1,f1;t2,f2;…;tn,fn]
• uottawa, [Impact,t1,f1;t2,f2;…;tn,fn]
• diana, [Impact,t1,f1;t2,f2;…;tn,fn]
• inkpen, [Impact,t1,f1;t2,f2;…;tn,fn]
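One possible in-memory reading of these entries is sketched below. It assumes that each `t_i, f_i` pair is a content term and its frequency, and that every entry also carries a document identifier; neither is stated on the slide, and the names `Posting` and `lookup` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Posting:
    doc_id: str                  # assumed: each entry points at one document
    impact: float                # impact of the keyword in that document
    term_freqs: dict[str, int]   # assumed reading of t1,f1;t2,f2;...;tn,fn


KeywordIndex = dict[str, list[Posting]]


def lookup(index: KeywordIndex, query: str) -> list[str]:
    """Return documents where one query term is a keyword and the rest occur in the content."""
    terms = query.lower().split()
    best: dict[str, float] = {}
    for keyword in terms:
        others = [t for t in terms if t != keyword]
        for posting in index.get(keyword, []):
            if all(t in posting.term_freqs for t in others):
                best[posting.doc_id] = max(best.get(posting.doc_id, 0.0), posting.impact)
    return sorted(best, key=lambda doc_id: best[doc_id], reverse=True)


# Made-up postings for illustration only.
index: KeywordIndex = {
    "inkpen": [Posting("doc1", 0.8, {"diana": 3, "uottawa": 1})],
    "diana": [Posting("doc1", 0.6, {"inkpen": 3}), Posting("doc2", 0.9, {"princess": 5})],
}
```

With this data, `lookup(index, "diana inkpen")` returns only `doc1`, because `doc2` has the keyword "diana" but no "inkpen" in its content terms.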


1- Not all documents hold important keywords in the URLs and/or titles.

2- Some documents hold keywords only in the content (as subtopics, and sometimes each topic is different from the others).

3- Some documents hold primitive phrases (appearing once in the content).

4- This class of index uses a collection of strings: queries from the Million Query track and titles from Wikipedia.

5- The system scanned the content of our Web collection looking for the list of strings above.

6- The captured strings from each document were ranked according to their impact in each document's content.

7- Topical phrase validation and weighting (based on impact in each document's content and idf in documents classified in the same topic).
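Steps 5 to 7 could look roughly like the sketch below. The slides do not say how impact and idf are combined, so the product used here is an assumption, as are the smoothed idf formula, the cosine-based impact stand-in, and all function names.

```python
import math
import re
from collections import Counter


def _vec(text: str) -> Counter:
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def impact(phrase: str, doc: str) -> float:
    """Cosine similarity between phrase and document term-frequency vectors."""
    a, b = _vec(phrase), _vec(doc)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def contains(phrase: str, doc: str) -> bool:
    """Whole-phrase, case-insensitive match (the scanning step)."""
    return re.search(r"\b" + re.escape(phrase.lower()) + r"\b", doc.lower()) is not None


def weight_phrases(phrases: list[str], topic_docs: list[str]) -> list[dict[str, float]]:
    """For each document of one topic, weight captured phrases by impact * idf (assumed)."""
    n = len(topic_docs)
    df = {p: sum(contains(p, d) for d in topic_docs) for p in phrases}
    weighted = []
    for doc in topic_docs:
        scores = {}
        for p in phrases:
            if contains(p, doc):
                idf = math.log(1 + n / df[p])  # smoothed so the weight stays positive
                scores[p] = impact(p, doc) * idf
        weighted.append(dict(sorted(scores.items(), key=lambda kv: kv[1], reverse=True)))
    return weighted


# Made-up strings and documents for illustration.
phrases = ["university of phoenix", "online degree", "seating chart"]
docs = [
    "The University of Phoenix offers an online degree.",
    "University of Phoenix stadium seating chart.",
]
weights = weight_phrases(phrases, docs)
```

In this toy topic, "university of phoenix" appears in both documents, so its idf is lower and the rarer phrase "online degree" ranks first for the first document.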


1- QE is important to make results more diverse.
2- QE is necessary if the first result list is short.
3- QE is used only with the diversity topics/queries.
4- The terms used for expanding the original query terms are extracted from Wikipedia articles (connectivity).
5- QE works best if the query matches a Wikipedia topic literally and the article is long.

Using Shared-Links.
Using Titling Variation Aspect.
1- Lipomatosis
2- Fatty-Tumor
3- Lipomatous-Neoplasm
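The titling-variation idea can be sketched as a lookup from one title to its sibling titles. This is a simplified illustration that assumes the variant groups have already been collected from Wikipedia; the functions `build_variant_map` and `expand_query` are hypothetical names, and the only data used is the three-title example from the slide.

```python
def normalize(title: str) -> str:
    """Lowercase, treat hyphens as spaces, and collapse whitespace."""
    return " ".join(title.lower().replace("-", " ").split())


def build_variant_map(groups: list[list[str]]) -> dict[str, set[str]]:
    """Map every normalized title to the other titles in its variant group."""
    variants: dict[str, set[str]] = {}
    for group in groups:
        names = {normalize(t) for t in group}
        for name in names:
            variants[name] = names - {name}
    return variants


def expand_query(query: str, variants: dict[str, set[str]]) -> list[str]:
    """Expand only when the query matches a known title literally (after normalization)."""
    q = normalize(query)
    return [q] + sorted(variants.get(q, ()))


variant_map = build_variant_map([["Lipomatosis", "Fatty-Tumor", "Lipomatous-Neoplasm"]])
expanded = expand_query("Fatty tumor", variant_map)
```

A query that matches no title is returned unchanged, which mirrors the point that expansion works best on literal matches.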


1. Home pages (".com", ".gov", ".org", ".edu", ".net", etc.).
2. Wikipedia results whose titles matched the query literally.
3. Site preferences ("about.com", "answers.com", etc.).
4. Top ten results that ranked high, regardless of the type of site.
5. Other Wikipedia results that ranked high based on their content.
6. Other results.

Example

http://www.phoenix.edu/
http://phoenix.edu/
http://axia.phoenix.edu/
http://en.wikipedia.org/wiki/en:University_of_Phoenix
http://en.wikipedia.org/wiki/University_of_Phoenix
http://phoenix.about.com/library/blseatingcardinals.htm
http://wiki.answers.com/q/what_are_some_......_the_university_of_phoenix
http://business.phoenix.edu/
http://technology.phoenix.edu/
http://military.phoenix.edu/
http://artsandsciences.phoenix.edu/
http://education.phoenix.edu/

Home Pages (for adhoc task)

User Preferences (for diversity task)

Other Pages (for adhoc and diversity tasks)
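The six-level ordering can be sketched as a tiered sort. The sketch assumes that home-page detection has already happened upstream and arrives as an `is_home_page` flag, and that each result carries a numeric score; the `Result` class, the flag, and the titles and scores in the example are hypothetical.

```python
from dataclasses import dataclass
from urllib.parse import urlparse

PREFERRED_SITES = ("about.com", "answers.com")


@dataclass
class Result:
    url: str
    title: str
    score: float
    is_home_page: bool = False  # assumed to be decided by an earlier step


def _host(url: str) -> str:
    return (urlparse(url).hostname or "").lower()


def _on_site(host: str, site: str) -> bool:
    return host == site or host.endswith("." + site)


def order_results(results: list[Result], query: str, top_n: int = 10) -> list[Result]:
    """Sort by the six priority tiers, then by descending score within a tier."""
    q = query.strip().lower()
    tiers: dict[int, int] = {}
    rest: list[Result] = []
    for r in results:
        host = _host(r.url)
        if r.is_home_page:
            tiers[id(r)] = 1
        elif _on_site(host, "wikipedia.org") and r.title.strip().lower() == q:
            tiers[id(r)] = 2
        elif any(_on_site(host, s) for s in PREFERRED_SITES):
            tiers[id(r)] = 3
        else:
            rest.append(r)
    rest.sort(key=lambda r: r.score, reverse=True)
    for rank, r in enumerate(rest):
        if rank < top_n:
            tiers[id(r)] = 4  # top-ranked results, regardless of site type
        elif _on_site(_host(r.url), "wikipedia.org"):
            tiers[id(r)] = 5
        else:
            tiers[id(r)] = 6
    return sorted(results, key=lambda r: (tiers[id(r)], -r.score))


results = [
    Result("http://business.phoenix.edu/", "Business", 0.9),
    Result("http://phoenix.about.com/library/blseatingcardinals.htm", "Seating", 0.5),
    Result("http://en.wikipedia.org/wiki/University_of_Phoenix", "University of Phoenix", 0.7),
    Result("http://www.phoenix.edu/", "University of Phoenix", 0.6, is_home_page=True),
]
ordered = [r.url for r in order_results(results, "university of phoenix")]
```

Note that the highest-scoring page here still ends up last, because the tier takes precedence over the score.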


• The relevance judgment file is built by professional assessors and includes the relevant results for each query.

• The best results were selected from the best data set (A and B).

• If results are available in subset B but not in the relevance judgments, this means the corresponding results in set A are more relevant.

• The degree of relevance of each result is based on the users' point of view.


• The index classes work cooperatively.

• Eliminating one class from the index does not necessarily affect the final precision, because the same results may be retrieved from other classes.

• Eliminating one class from the index may increase the overall precision for a set of queries but not for a specific query (that is why we used all classes).

• Wikipedia has more impact than the other classes.

• The impact of each class depends on the type of query.


• Fast indexing and retrieval method.

• Efficient method for all types of queries.

• Centralized index (one-server system).

• Wikipedia is a suitable resource for home page finding, web indexing, and query expansion.

• Each query must pass through all index classes during the search; then the type of query must be determined.

• The ordering (distribution) of documents in the final list depends not only on document weights but also on the type of query (navigational, informational, transactional).

• Dataset subset B (50 million pages) is enough for training and testing a Web search engine to retrieve the relevant documents.


• Testing our system with more queries.
• Displaying the results in an efficient way, since our system is centralized.
• Using resources other than Wikipedia and ALEXA.
• Indexing real-time data from social resources such as Twitter and Facebook.
• Using a GUI for displaying our results instead of plain and simple text.


Questions? Email to [email protected] or [email protected]

Demo: http://site.uottawa.ca/~falak081