About onlineextrems concept
Onlineextrems
Overview
Onlineextrems.com
Platform overview
- A single unified platform for all content types (consolidate to reduce development and maintenance costs)
- Flexible system which can support any new content type
- High automation (cut configuration costs)
- Real-time coverage, or as close as possible, for each content type
- Improved data quality using validation rules
- Was implemented this year
April 10, 2023, Onlineextrems.com
Supporting all the content types
- Message boards
- Blogs and micro blogs (Myspace, Blogger, LiveJournal...)
- Blog comments
- Social networks: Facebook, LinkedIn, Xing
- Author profiles
- Product reviews
- Usenet: mailing lists, groups
- Traditional media: CNN, Reuters
Consolidating the content systems
Data mining systems:
- Message boards
- Blogs
- Social networking sites
- Author profiles system
- Usenet + Newsgroups system
Some of our challenges
- Dynamic nature of the web
- Supporting many different types of content
- Automatically “understanding” millions of sites with different structures
  - Over 8000 message boards
  - Over 95 million blogs
- Supporting data in different languages
- Data quality
Data mining process
What are the important aspects of the data mining?
- Managing the order in which we crawl pages
  - Efficiency (e.g. not entering posts where the number of comments hasn’t changed)
  - Next page (we need to follow it to get more comments)
- Extracting relevant data out of everything on the page
- Separating the data into posts (or comments)
- Transforming specific data into the desired format
  - Handling dates in differing formats
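Three of the aspects above (skipping unchanged posts, following the next page, and handling dates in differing formats) can be sketched in a few lines. This is only an illustration in Python, not the platform's implementation: the function names, the page dictionary layout (`comments`, `next_url`), and the list of date layouts are all assumptions made for the example.

```python
from datetime import datetime

# Hypothetical date layouts; a real system would likely configure these per
# site, since a string like 10/04/2023 is ambiguous without knowing the source.
DATE_FORMATS = (
    "%Y-%m-%d %H:%M",
    "%d/%m/%Y %H:%M",
    "%d.%m.%Y %H:%M:%S",
)


def posts_to_fetch(listing, known_counts):
    """Return URLs of posts whose comment count differs from the last crawl.

    listing: iterable of (url, comment_count) pairs read from an index page.
    known_counts: dict mapping url -> comment count seen on the previous crawl.
    Posts never seen before are always returned.
    """
    return [url for url, count in listing if known_counts.get(url) != count]


def collect_comments(fetch_page, first_url, max_pages=50):
    """Follow "next page" links and gather the comments from every page.

    fetch_page is a hypothetical callable returning a dict with a "comments"
    list and an optional "next_url".
    """
    comments = []
    url = first_url
    pages_seen = 0
    while url is not None and pages_seen < max_pages:
        page = fetch_page(url)
        comments.extend(page["comments"])
        url = page.get("next_url")
        pages_seen += 1
    return comments


def normalize_date(text):
    """Try each known layout; return an ISO 8601 string, or None if none fit."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).isoformat()
        except ValueError:
            continue
    return None
```

Returning `None` for an unrecognized date, rather than guessing, leaves the decision to a later validation step.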
Data mining technologies
- Jelly: simple XML workflow engine
- HttpClient: fetcher
- Rome: feed parser
- Velocity: output template engine
- JMX + JConsole: managing the system
Flows
- Built from steps, which are the building blocks
- Allows adding support for new content types without writing code
- The implementation is based on Apache Jelly, which allows executing XML files
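To make the idea of a flow concrete, here is a minimal sketch of an XML file driving a sequence of reusable steps. It is written in Python and does not use Apache Jelly or reproduce its tag syntax; the step names (`split`, `strip`, `drop-empty`), the `STEPS` registry, and the `<flow>` element are hypothetical, chosen only to show how a new flow can be assembled from existing blocks without new code.

```python
import xml.etree.ElementTree as ET

# Hypothetical step registry: XML tag name -> function(items, **attributes).
STEPS = {}


def step(name):
    """Decorator that registers a function as the step for one tag name."""
    def register(func):
        STEPS[name] = func
        return func
    return register


@step("split")
def split_step(items, sep=","):
    return [part for item in items for part in item.split(sep)]


@step("strip")
def strip_step(items):
    return [item.strip() for item in items]


@step("drop-empty")
def drop_empty_step(items):
    return [item for item in items if item]


def run_flow(flow_xml, items):
    """Run each child element of the root, in document order, as one step.

    Attributes on the element are passed to the step as keyword arguments.
    """
    for element in ET.fromstring(flow_xml.strip()):
        items = STEPS[element.tag](items, **element.attrib)
    return items


# A flow is just data: reordering or adding steps needs no code change.
FLOW = """
<flow>
  <split sep=";"/>
  <strip/>
  <drop-empty/>
</flow>
"""

print(run_flow(FLOW, ["a; b;;c "]))  # prints ['a', 'b', 'c']
```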
XML parser
- Parses the data from simple XML files into the common in-memory “items” structure
- For now supports only elements, not attributes
- Used for Twitter
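A rough sketch of what an elements-only parser of this kind might look like, in Python. The shape of an "item" (a plain dictionary per record) and the element names in the sample are assumptions for illustration; they are not taken from the platform or from any real Twitter format.

```python
import xml.etree.ElementTree as ET


def parse_items(xml_text):
    """Turn each child of the root element into one item (a dict).

    Only element text is read. Attributes are ignored, mirroring the
    elements-only limitation described above.
    """
    root = ET.fromstring(xml_text.strip())
    items = []
    for record in root:
        item = {field.tag: (field.text or "").strip() for field in record}
        items.append(item)
    return items


# Hypothetical input; the id attributes are deliberately dropped.
SAMPLE = """
<statuses>
  <status id="1">
    <text>First post</text>
    <created_at>2023-04-10</created_at>
  </status>
  <status id="2">
    <text>Second post</text>
    <created_at>2023-04-11</created_at>
  </status>
</statuses>
"""

for item in parse_items(SAMPLE):
    print(item)
```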
HTML parser
- Applies XSLT transformations to HTML pages
- Extracts the data into the common in-memory “items” structure
- Uses “Tag Soup” library to read HTML as if it were XML
- Faster and more robust than the current XML conversion method
- Used for Author Profiles
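The sketch below illustrates only the general goal: pulling fields out of HTML that is not well-formed. It swaps in a different technique from the one on the slide. Python's standard library has no XSLT engine, so instead of Tag Soup plus XSLT it uses the tolerant `html.parser` module and selects elements by their `class` attribute. The class names (`name`, `location`) and the function name are hypothetical.

```python
from html.parser import HTMLParser


class ProfileExtractor(HTMLParser):
    """Collect the text that follows a tag whose class names a wanted field."""

    def __init__(self, fields):
        super().__init__()
        self.fields = set(fields)
        self.item = {}
        self._current = None  # field currently being read, if any

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if css_class in self.fields:
            self._current = css_class
            self.item[css_class] = ""
        else:
            # Any other opening tag ends the field being read.
            self._current = None

    def handle_endtag(self, tag):
        self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self.item[self._current] += data


def extract_profile(html, fields):
    """Return a dict of field -> text, tolerating unclosed or stray tags."""
    parser = ProfileExtractor(fields)
    parser.feed(html)
    parser.close()
    return {key: value.strip() for key, value in parser.item.items()}


# Malformed on purpose: <span> and <p> are never closed.
PAGE = '<div><span class="name">Alice<p class="location"> Berlin </div><br>'
print(extract_profile(PAGE, ["name", "location"]))
```

An XSLT-based approach keeps the extraction rules in a stylesheet rather than in code, which is closer to what the slide describes.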
XML Output
- Output in XML files
- Configurable output format using a template file
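Template-driven output can be sketched as follows. This Python example uses the standard library's `string.Template` as a stand-in and does not reproduce Velocity's syntax; the template text, the `<posts>` wrapper, and the field names are hypothetical.

```python
from string import Template
from xml.sax.saxutils import escape

# Hypothetical template; in practice it would be loaded from a template file,
# so the output format can change without touching the code.
ITEM_TEMPLATE = Template(
    "  <post>\n"
    "    <author>$author</author>\n"
    "    <body>$body</body>\n"
    "  </post>"
)


def render_items(items, template=ITEM_TEMPLATE):
    """Render items as one XML document, escaping every field value."""
    rendered = [
        template.substitute({key: escape(str(value)) for key, value in item.items()})
        for item in items
    ]
    return "<posts>\n" + "\n".join(rendered) + "\n</posts>"


print(render_items([{"author": "A & B", "body": "x < y"}]))
```

Escaping values before substitution keeps the output well-formed even when the mined text contains `&` or `<`.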
Sample Work
Thank You
Connect and share with us… www.onlineextrems.com