About onlineextrems concept


Page 1: About onlineextrems concept

Onlineextrems

Overview

Onlineextrems.com

Page 2: About onlineextrems concept

Platform overview

A single unified platform for all content types
  (consolidated to reduce development and maintenance costs)
A flexible system that can support any new content type
High automation (cuts configuration costs)
Real-time coverage, or as close to it as possible, for each content type
Improved data quality using validation rules
Implemented this year

April 10, 2023, Onlineextrems.com

Page 3: About onlineextrems concept

Supporting all the content types

Message boards
Blogs and microblogs (Myspace, Blogger, LiveJournal...)
Blog comments
Social networks: Facebook, LinkedIn, Xing
Author profiles
Product reviews
Usenet: mailing lists, groups
Traditional media: CNN, Reuters

Page 4: About onlineextrems concept

Consolidating the content systems

Data mining systems
  Message boards
  Blogs
  Social networking sites
  Author profiles system
  Usenet + newsgroups system

Page 5: About onlineextrems concept

Some of our challenges

Dynamic nature of the web
Supporting many different types of content
Automatically “understanding” millions of sites with different structures
  Over 8000 message boards
  Over 95 million blogs
Supporting data in different languages
Data quality

Page 6: About onlineextrems concept

Data mining process

What are the important aspects of the data mining?
Managing the order in which we crawl pages
  Efficiency (e.g. not entering posts where the number of comments hasn’t changed)
  Next page (we need to follow it to get more comments)
Extracting relevant data out of everything on the page
Separating the data into posts (or comments)
Transforming specific data into the desired format
  Handling dates in differing formats
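
Two of the points above, skipping posts whose comment count has not changed and handling dates in differing formats, can be sketched in a few lines. This is only an illustration in Python, not the platform's actual code: the function names, the list of date formats and the in-memory dictionary of previously seen counts are all assumptions made for the example.

```python
from datetime import datetime

# Hypothetical list of formats; a real crawler would likely configure these per site.
DATE_FORMATS = [
    "%Y-%m-%d %H:%M:%S",   # 2023-04-10 08:30:00
    "%d/%m/%Y %H:%M",      # 10/04/2023 08:30 (day first)
    "%B %d, %Y",           # April 10, 2023
    "%d %b %Y",            # 10 Apr 2023
]


def parse_date(text):
    """Try each known format in turn; return None when nothing matches."""
    cleaned = text.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt)
        except ValueError:
            continue
    return None


def needs_refetch(post_url, comment_count, seen_counts):
    """Return True only when the comment count differs from the last crawl.

    seen_counts is a plain dict (post URL -> last known comment count) and is
    updated in place, so the next call compares against the new value.
    """
    previous = seen_counts.get(post_url)
    seen_counts[post_url] = comment_count
    return previous != comment_count
```

A crawler built this way would call needs_refetch on the listing page before opening a post, and run every extracted date string through parse_date so that downstream code only ever sees one date representation.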

Page 7: About onlineextrems concept

Data mining technologies

Jelly: simple XML workflow engine
HttpClient: fetcher
Rome: feed parser
Velocity: output template engine
JMX + JConsole: managing the system

Page 8: About onlineextrems concept

Flows

Built from steps, which are the building blocks
Allows adding support for new content types without writing code
The implementation is based on Apache Jelly, which allows executing XML files
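
The idea of a flow assembled from reusable steps and described entirely in XML can be illustrated with a toy interpreter. The sketch below is written in Python and does not use Apache Jelly or reproduce its tag syntax. The step names (set, upper, append), the registry and the sample flow are invented for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical step registry: maps an XML tag name to a Python function.
STEPS = {}


def step(name):
    """Decorator that registers a function as the handler for one XML tag."""
    def register(func):
        STEPS[name] = func
        return func
    return register


@step("set")
def set_value(context, attrs):
    context[attrs["var"]] = attrs["value"]


@step("upper")
def upper_value(context, attrs):
    context[attrs["var"]] = context[attrs["var"]].upper()


@step("append")
def append_value(context, attrs):
    context.setdefault(attrs["to"], []).append(context[attrs["var"]])


def run_flow(flow_xml, context=None):
    """Execute each child element of the flow, in document order."""
    context = {} if context is None else context
    for element in ET.fromstring(flow_xml.strip()):
        STEPS[element.tag](context, element.attrib)
    return context


# A flow is just data: supporting a new content type means writing a new
# XML file that combines existing steps, not new Python code.
FLOW = """
<flow>
  <set var="title" value="hello board"/>
  <upper var="title"/>
  <append var="title" to="items"/>
</flow>
"""
```

Calling run_flow(FLOW) runs the three steps in order and returns the final context dictionary.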

Page 9: About onlineextrems concept

XML parser

Parses the data from simple XML files into the common in-memory “items” structure
For now, supports only elements, not attributes
Used for Twitter
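
A minimal sketch of such a parser, assuming the "items" structure is a list of dictionaries and that each record sits in an element named item (both are assumptions, since the slides do not define the structure). It is written in Python with the standard library. As on the slide, only child elements are read and attributes are ignored.

```python
import xml.etree.ElementTree as ET


def parse_items(xml_text, item_tag="item"):
    """Turn each <item> element into a dict of child tag -> text."""
    root = ET.fromstring(xml_text.strip())
    items = []
    for node in root.iter(item_tag):
        # Only child elements are read; attributes are deliberately ignored.
        items.append({child.tag: (child.text or "").strip() for child in node})
    return items


# Hypothetical input; the tag names are placeholders, not a real feed format.
SAMPLE = """
<feed>
  <item id="1"><author>alice</author><text>first post</text></item>
  <item id="2"><author>bob</author><text>second post</text></item>
</feed>
"""
```

Here parse_items(SAMPLE) yields one dictionary per item with the keys author and text. The id attribute does not appear in the result.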

Page 10: About onlineextrems concept

HTML parser

Applies XSLT transformations to HTML pages
Extracts the data into the common in-memory “items” structure
Uses the “Tag Soup” library to read HTML as if it were XML
Faster and more robust than the current XML conversion method
Used for Author Profiles
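
The general pattern, reading sloppy HTML leniently and pulling selected fields into items, can be sketched as follows. Note the substitution: this example uses Python's standard html.parser in place of Tag Soup and a hand-written extraction rule in place of XSLT, so it shows the idea rather than the technique the slide describes. The class names profile, name and location are hypothetical.

```python
from html.parser import HTMLParser


class ItemExtractor(HTMLParser):
    """Collect text from elements whose class names a field, grouped into items."""

    def __init__(self, item_class="profile", fields=("name", "location")):
        super().__init__()
        self.item_class = item_class
        self.fields = set(fields)
        self.items = []
        self.current_field = None

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if css_class == self.item_class:
            self.items.append({})            # a new item starts here
        elif css_class in self.fields and self.items:
            self.current_field = css_class   # next text chunk belongs to this field

    def handle_data(self, data):
        if self.current_field and data.strip():
            self.items[-1][self.current_field] = data.strip()
            self.current_field = None


def extract_items(html_text):
    parser = ItemExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.items


# Deliberately malformed markup: several tags are never closed.
MESSY_HTML = """
<div class="profile"><span class="name">Alice<br><span class="location">Berlin
<div class="profile"><span class="name">Bob</span><p class="location">Paris
"""
```

The parser never raises on the unclosed tags in MESSY_HTML. That tolerance is the property the slide attributes to reading HTML "as if it were XML".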

Page 11: About onlineextrems concept

XML output

Output in XML files
Configurable output format using a template file
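
Template-driven output can be illustrated with Python's standard string.Template standing in for Velocity. The template text, the field names author and body, and the root element name are assumptions for this sketch. In the system described, the template would be read from a file.

```python
from string import Template
from xml.sax.saxutils import escape

# Hypothetical template; changing the output format means editing only this text.
ITEM_TEMPLATE = Template(
    "<post><author>$author</author><body>$body</body></post>"
)


def render_items(items, template=ITEM_TEMPLATE, root="posts"):
    """Render each item through the template and wrap the result in a root element."""
    rendered = [
        # Escape values so characters such as < and & cannot break the XML.
        template.substitute({key: escape(str(value)) for key, value in item.items()})
        for item in items
    ]
    return "<{0}>{1}</{0}>".format(root, "".join(rendered))
```

Because the layout lives in the template rather than in the code, a different output schema only requires a different template string.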

Page 12: About onlineextrems concept

Sample Work

Page 13: About onlineextrems concept

Sample Work

Page 14: About onlineextrems concept

Thank You

Connect and share with us…
www.onlineextrems.com
