Web Mining Research: A Survey
Authors: Raymond Kosala & Hendrik Blockeel
Presenter: Ryan Patterson
April 23rd 2014 CS332 Data Mining
Outline
• Introduction
• Web Mining
• Web Content Mining
• Web Structure Mining
• Web Usage Mining
• Review
• Exam Questions
Introduction
“The Web is huge, diverse, and dynamic . . . we are currently drowning in information and facing information overload.”
Web users encounter problems:
• Finding relevant information
• Creating new knowledge out of the information available on the Web
• Personalization of the information
• Learning about consumers or individual users
Web Mining
“Web mining is the use of data mining techniques to automatically discover and extract information from Web documents and services.”
Web mining subtasks:
1. Resource finding
2. Information selection and pre-processing
3. Generalization
4. Analysis
Web Mining
Information Retrieval & Information Extraction
• Information Retrieval (IR)
o the automatic retrieval of all relevant documents while at the same time retrieving as few of the non-relevant as possible
• Information Extraction (IE)
o transforming a collection of documents into information that is more readily digested and analyzed
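The two goals in the IR definition above (retrieve all relevant documents, retrieve few non-relevant ones) are commonly measured as recall and precision. The slides do not name these measures, so the sketch below is only an illustration; the function name and document ids are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a collection of retrieved document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical document ids, for illustration only.
retrieved_docs = {"d1", "d2", "d3", "d4"}
relevant_docs = {"d2", "d4", "d5"}

p, r = precision_recall(retrieved_docs, relevant_docs)
print(f"precision={p:.2f} recall={r:.2f}")
```

High recall corresponds to "all relevant documents"; high precision corresponds to "as few of the non-relevant as possible".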
Live demo
Web Content Mining
Information Retrieval View
Unstructured Documents
• Most approaches use a “bag of words” representation to generate document features
o ignores the sequence in which the words occur
• Document features can be reduced with selection algorithms
o e.g. information gain
• Possible alternative document feature representations:
o word positions in the document
o phrases/terms (e.g. “annual interest rate”)
Semi-Structured Documents
• Utilize additional structural information gleaned from the document
o HTML markup (intra-document structure)
o HTML links (inter-document structure)
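A minimal sketch of the bag-of-words representation and of feature selection by information gain, assuming whitespace tokenization and a tiny labeled corpus. The corpus, labels, and function names are invented for illustration and do not come from the survey.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Map a document to word counts; word order is discarded."""
    return Counter(text.lower().split())


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(docs, labels, word):
    """Reduction in label entropy from splitting on the presence of `word`."""
    with_word = [lab for doc, lab in zip(docs, labels) if word in bag_of_words(doc)]
    without = [lab for doc, lab in zip(docs, labels) if word not in bag_of_words(doc)]
    n = len(labels)
    remainder = sum(len(part) / n * entropy(part) for part in (with_word, without) if part)
    return entropy(labels) - remainder


# Toy corpus and labels, invented for illustration.
docs = [
    "annual interest rate rises",
    "interest rate cut announced",
    "team wins the final match",
    "match ends in a draw",
]
labels = ["finance", "finance", "sports", "sports"]

# Word order does not matter in a bag of words.
print(bag_of_words("rate interest annual") == bag_of_words("annual interest rate"))

# A word that separates the classes scores higher than one that does not.
for word in ("interest", "the"):
    print(word, round(information_gain(docs, labels, word), 3))
```

Keeping only the highest-scoring words is one way to reduce the feature set before learning.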
Web content mining, IR unstructured documents
Web content mining, IR semi structured documents
Web Content Mining
Database View
“the Database view tries . . . to transform a Web site to become a database so that . . . querying on the Web become[s] possible.”
• Uses the Object Exchange Model (OEM)
o represents semi-structured data by a labeled graph
• Database view algorithms typically start from manually selected Web sites
o site-specific parsers
• Database view algorithms produce:
o document-level schemas or DataGuides (a structural summary of semi-structured data)
o frequent substructures (sub-schemas)
o multi-layered databases, in which each layer is obtained by generalizations on lower layers
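A rough sketch of the idea behind a structural summary: treat semi-structured data as an edge-labeled graph and list each distinct label path once. This is a simplification that assumes tree-shaped data held in nested dicts and lists; it is not the DataGuide construction itself, and the sample data and the name `label_paths` are hypothetical.

```python
def label_paths(node, prefix=()):
    """Collect every distinct label path reachable from `node`.

    `node` is a nested dict/list standing in for an edge-labeled graph:
    dict keys are edge labels, lists hold several children under one label.
    Assumes the data is acyclic (a tree), which real OEM graphs need not be.
    """
    paths = set()
    if isinstance(node, dict):
        for label, child in node.items():
            path = prefix + (label,)
            paths.add(path)
            paths |= label_paths(child, path)
    elif isinstance(node, list):
        for child in node:
            paths |= label_paths(child, prefix)
    return paths


# Hypothetical semi-structured data; the two restaurants differ in shape.
site = {
    "restaurant": [
        {"name": "Chef Chu", "address": {"city": "Los Altos"}},
        {"name": "Saigon", "phone": "555-0100"},
    ]
}

summary = sorted(label_paths(site))
for path in summary:
    print(".".join(path))
```

The summary lists `restaurant.name` once even though two objects carry it, which is what makes it usable as a schema-like guide for querying.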
Web content mining, Database view
Web Structure Mining
“. . . we are interested in the structure of the hyperlinks within the Web itself”
• Inspired by the study of social networks and citation analysis
o based on incoming & outgoing links we could discover specific types of pages (such as hubs, authorities, etc.)
• Some algorithms calculate the quality/relevancy of each Web page
o e.g. PageRank
• Others measure the completeness of a Web site
o measuring frequency of local links on the same server
o interpreting the nature of hierarchy of hyperlinks on one domain
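As an illustration of link-based page scoring, here is a small power-iteration sketch in the style of PageRank. The damping factor of 0.85, the iteration count, the handling of pages without outgoing links, and the four-page link graph are all assumptions made for this example rather than details from the slides.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {page: [pages it links to]}."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share of rank.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly among its outgoing links.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Page with no outgoing links: spread its rank over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical four-page site.
links = {
    "home": ["about", "products"],
    "about": ["home"],
    "products": ["home", "about"],
    "orphan": ["home"],
}

ranks = pagerank(links)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{page}: {score:.3f}")
```

Pages with many incoming links from well-ranked pages score highest; a page that nothing links to keeps only the baseline share.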
Web Usage Mining
“. . . focuses on techniques that could predict user behavior while the user interacts with the Web.”
• Web usage is mined by parsing Web server logs
o mapped into relational tables → data mining techniques applied
o log data utilized directly
• Server log accuracy decreases when users connect through proxy servers, or when users or their ISPs cache Web data
• Two applications:
o personalized - user profile or user modeling in adaptive interfaces
o impersonalized - learning user navigation patterns
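The "mapped into relational tables" route can be sketched as follows, assuming the server writes Common Log Format. The log lines, table name, and column names are made up for illustration.

```python
import re
import sqlite3

# Common Log Format: host ident user [time] "request" status bytes
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)


def parse_line(line):
    """Return (host, time, method, path, status), or None if the line does not match."""
    m = LOG_PATTERN.match(line)
    if m is None:
        return None
    return (m["host"], m["time"], m["method"], m["path"], int(m["status"]))


# Made-up log lines for illustration; the last one is deliberately malformed.
sample_log = [
    '10.0.0.1 - - [23/Apr/2014:10:00:00 -0400] "GET /index.html HTTP/1.1" 200 1043',
    '10.0.0.1 - - [23/Apr/2014:10:00:05 -0400] "GET /products.html HTTP/1.1" 200 2210',
    '10.0.0.2 - - [23/Apr/2014:10:01:00 -0400] "GET /index.html HTTP/1.1" 200 1043',
    'not a log line',
]

# Load the parsed rows into an in-memory relational table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE hits (host TEXT, time TEXT, method TEXT, path TEXT, status INTEGER)"
)
rows = [r for r in map(parse_line, sample_log) if r is not None]
conn.executemany("INSERT INTO hits VALUES (?, ?, ?, ?, ?)", rows)

# A simple usage query: how often was each page requested?
page_counts = conn.execute(
    "SELECT path, COUNT(*) FROM hits GROUP BY path ORDER BY COUNT(*) DESC, path"
).fetchall()
print(page_counts)
```

Note that the `host` column only identifies the connecting machine. Behind a proxy, one host may stand for many users, and requests answered from a cache never reach the log at all, which is the accuracy problem noted above.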
Review
• Web mining
o 4 subtasks
o IR & IE
• Web content mining
o primarily intra-page analysis
o IR view vs DB view
• Web structure mining
o primarily inter-page analysis
• Web usage mining
o primarily analysis of server activity logs
Web mining categories
| | Web Content Mining: IR View | Web Content Mining: DB View | Web Structure Mining | Web Usage Mining |
| --- | --- | --- | --- | --- |
| View of Data | Unstructured; Semi structured | Semi structured; Web site as DB | Links structure | Interactivity |
| Main Data | Text documents; Hypertext documents | Hypertext documents | Links structure | Server logs; Browser logs |
| Representation | Bag of words, n-grams; Terms, phrases; Concepts of ontology; Relational | Edge-labeled graph (OEM); Relational | Graph | Relational table; Graphs |
| Method | TFIDF and variants; Machine learning; Statistical (incl. NLP) | Proprietary algorithms; ILP; (modified) association rules | Proprietary algorithms | Machine learning; Statistical; (modified) association rules |
| Application Categories | Categorization; Clustering; Finding extraction rules; Finding patterns in text; User modeling | Finding frequent sub-structures; Web site schema discovery | Categorization; Clustering | Site construction, adaptation, and management; Marketing; User modeling |
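The "TFIDF and variants" entry in the table can be illustrated with one simple variant: raw term frequency multiplied by log(N / document frequency). Many other weightings exist, and the choice here, the whitespace tokenization, and the toy documents are assumptions for this sketch.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    Weight = raw term count in the document * log(N / number of documents
    containing the term). This is one common variant among many.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Toy documents, invented for illustration.
docs = [
    "web mining survey",
    "web usage mining logs",
    "web structure links",
]

weights = tf_idf(docs)
for term, weight in sorted(weights[0].items(), key=lambda item: -item[1]):
    print(f"{term}: {weight:.3f}")
```

A term that appears in every document ("web" here) gets weight zero, while terms concentrated in few documents are weighted up.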
Exam Question 1
Q: Of the following Web mining paradigms:
• Information Retrieval
• Information Extraction
Which does a traditional Web search engine (google.com, bing.com, etc.) attempt to accomplish? Briefly support your answer.
A: Information Retrieval. The search engine attempts to provide a list of documents ranked by their relevancy to the search query.
Exam Question 2
Q: State one common problem hampering accurate Web usage mining. Briefly support your answer.
A:
• Users connecting to a Web site through a proxy server,
• Users (or their ISPs) utilizing Web data caching,
will result in decreased server log accuracy. Accurate server logs are required for accurate Web usage mining.
Exam Question 3
Q: What is the phrase associated with the most popular method for Web content mining algorithms to generate document features from unstructured documents?
A: “Bag of words” representation.