A RESEARCH SUPPORT SYSTEM FRAMEWORK FOR WEB DATA MINING Jin Xu, Yingping Huang, Gregory Madey...
A RESEARCH SUPPORT SYSTEM FRAMEWORK FOR WEB DATA MINING
Jin Xu, Yingping Huang, Gregory Madey
Department of Computer Science and Engineering
University of Notre Dame
Notre Dame, IN 46556
WSS’03: WI/IAT 2003 Workshop on Applications, Products of Web-based Support Systems
October 13, 2003, Halifax
This research was partially supported by NSF, CISE/IIS-Digital Society and Technology
OUTLINE
  INTRODUCTION
  FRAMEWORK OVERVIEW
  INFORMATION RETRIEVAL
  DATA MINING TECHNIQUES
  CASE
  CONCLUSIONS & FUTURE WORK
INTRODUCTION
  World Wide Web
    Abundant information
    Important resource for research
  Web Data Features
    Semi-structured
    Heterogeneous
    Dynamic
  A Research Support System for Web Data Mining
FRAMEWORK
  [Diagram: Web -> Source Identification -> Content Selection -> Information Retrieval -> Data Mining -> Discovery]
INFORMATION RETRIEVAL
  Searching Tools
    Directory
    Search engine
  Web Crawler
    URL access method
    Web page parser
      Table extractor
      Link extractor: absolute links/relative links
      Word extractor
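The link and word extractors listed above can be sketched with Python's standard library. This is only an illustration under stated assumptions: the class name `LinkAndWordExtractor`, the helper `extract`, and the word pattern are hypothetical, and the slides do not describe the actual parser at this level of detail.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkAndWordExtractor(HTMLParser):
    """Collect hyperlinks (as absolute URLs) and words from text nodes."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url  # URL of the page being parsed
        self.links = []
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # urljoin returns absolute links unchanged and resolves
                # relative links against the page URL.
                self.links.append(urljoin(self.base_url, href))

    def handle_data(self, data):
        # Assumed tokenization: lowercase runs of letters and digits.
        self.words.extend(re.findall(r"[A-Za-z0-9]+", data.lower()))


def extract(base_url, html):
    """Return (links, words) found in one HTML page."""
    parser = LinkAndWordExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links, parser.words
```

Resolving every link to an absolute URL at extraction time means a crawler queue built on top of this would not need to remember which page each link came from.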
DATA MINING FUNCTIONS
  Association Rules
    Find interesting association or correlation relationships among data items
  Classification
    Predict classes
    Two steps: build model, apply model
  Clustering
    Find natural groups of data
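The slide does not say how an association is judged "interesting". A common choice, assumed here, is to score a rule by its support and confidence. The sketch below computes both over a list of item sets; the function names and the set-based representation are illustrative, not taken from the talk.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    hits = sum(1 for t in transactions if items <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) from the transactions."""
    joint = support(transactions, set(antecedent) | set(consequent))
    base = support(transactions, antecedent)
    # A rule whose antecedent never occurs gets confidence 0 by convention here.
    return joint / base if base else 0.0
```

For four hypothetical records `{"downloads", "cvs"}`, `{"downloads", "cvs", "bugs"}`, `{"downloads"}` and `{"bugs"}`, the item set `{"downloads", "cvs"}` has support 0.5, and the rule "downloads implies cvs" has confidence 2/3.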
OPEN SOURCE SOFTWARE
  Open Source Software (OSS)
    Apache, Perl, Linux
    Developed by part-time contributors
  SourceForge Developer Site
    Sponsored by VA Software
    Largest OSS development site
      70,000 projects
      90,000 developers
      700,000 users
DATA COLLECTION
  Data sources
    Statistics, forums
  Project statistics
    9 fields: project ID, lifespan, rank, page views, downloads, bugs, support, patches and CVS
  Developer statistics
    Project ID and developer ID
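One possible in-memory shape for the two data sets above is sketched below. The nine field names follow the slide, but the integer types, the class name `ProjectStats`, and the helper `developers_by_project` are assumptions; the slides do not give units or storage details.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ProjectStats:
    # The nine fields named on the slide; all types are assumed,
    # and the unit of `lifespan` is not stated in the source.
    project_id: int
    lifespan: int
    rank: int
    page_views: int
    downloads: int
    bugs: int
    support: int
    patches: int
    cvs: int


def developers_by_project(pairs):
    """Group (project_id, developer_id) pairs into project -> set of developers."""
    grouped = defaultdict(set)
    for project_id, developer_id in pairs:
        grouped[project_id].add(developer_id)
    return dict(grouped)
```

Keeping the developer statistics as plain pairs and grouping them on demand also makes it easy to build the reverse mapping (developer to projects) from the same input.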
DATA COLLECTION (Cont.)
  Web Crawler
    Perl and CPAN
    LWP: fetch pages
    HTML parser: parse pages
    HTML::TableExtract: extract information
    Link extractor: extract links
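The crawler described above was written in Perl. As a rough analogue of the table-extraction step only, and not the authors' code, the following Python sketch pulls cell text out of HTML table rows using the standard library. The names `TableExtractor` and `extract_rows` are hypothetical.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the cell text of every table row as a list of strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None  # drop any cell left unclosed


def extract_rows(html):
    """Return all table rows in `html` as lists of stripped cell strings."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.rows
```

Fetching the page itself, the role LWP plays in the original crawler, could be handled separately with `urllib.request.urlopen`; it is left out so the sketch runs without network access.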
DATA MINING
  Association Rules
    "all tracks", "downloads" and "CVS" are associated
  Classification
    Predict "downloads"
    Naïve Bayes: build time 30 sec, accuracy 9%
    Adaptive Bayes Network: build time 20 min, accuracy 63%
  Clustering
    K-means: user-specified number of clusters
    O-cluster: automatically detects the number of clusters
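To make the k-means entry concrete, here is a minimal sketch of the textbook algorithm in plain Python, with the number of clusters `k` supplied by the caller as the slide describes. The seeding rule (first `k` points) and the fixed iteration count are simplifying assumptions, and this is not the implementation used in the case study. O-cluster is not shown.

```python
def kmeans(points, k, iterations=20):
    """Plain k-means on a list of equal-length numeric tuples."""
    # Assumed seeding: take the first k points as initial centroids.
    centroids = [tuple(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid
        # (squared Euclidean distance).
        assignment = [
            min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, assignment
```

Calling `kmeans(points, 2)` on two well-separated groups of points returns two centroids near the group means and a cluster index for each input point. Choosing `k` remains the caller's responsibility, which is the contrast the slide draws with O-cluster.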
CONCLUSIONS
  Conclusions
    Build a framework
    Describe procedures
    Discuss techniques
    Provide a case study
  Future Work
    Exploratory study
    Implement all stages