Web Crawling, Analysis and Archiving. PhD Presentation


Transcript of Web Crawling, Analysis and Archiving. PhD Presentation

Page 1: Web Crawling, Analysis and Archiving. PhD Presentation

Web Crawling, Analysis and Archiving
PHD DEFENSE: VANGELIS BANOS

DEPARTMENT OF INFORMATICS, ARISTOTLE UNIVERSITY OF THESSALONIKI, OCTOBER 2015

COMMITTEE MEMBERS: Yannis Manolopoulos, Apostolos Papadopoulos, Dimitrios Katsaros, Athena Vakali, Anastasios Gounaris, Georgios Evangelidis, Sarantos Kapidakis.

Page 2: Web Crawling, Analysis and Archiving. PhD Presentation


Problem definition: The web is disappearing

Page 3: Web Crawling, Analysis and Archiving. PhD Presentation


Web Archiving
• Web archiving is the process of collecting portions of the Web to ensure the information is preserved in an archive for researchers, historians, and the public.
• Many important organisations have been working on web archiving since 1996.

Page 4: Web Crawling, Analysis and Archiving. PhD Presentation


Our Contributions
We focus on Web Crawling, Analysis and Archiving.
1. New metrics and systems to assess the potential of archiving websites,
2. New algorithms and systems to improve web crawling efficiency and performance,
3. New approaches and systems to archive weblogs,
4. New algorithms focused on weblog data extraction.

◦ Publications:
• 4 scientific journals (1 still under review),
• 7 international conference proceedings,
• 1 book chapter.

Page 5: Web Crawling, Analysis and Archiving. PhD Presentation


Presentation Structure
1. An Innovative Method to Evaluate Website Archivability,
2. Near-duplicate and Cycle Detection in Webgraphs towards Optimised Web Crawling,
3. The BlogForever Platform: An Integrated Approach to Preserve Weblogs,
4. A Scalable Approach to Harvest Modern Weblogs,
5. Conclusions and Future Work.

Page 6: Web Crawling, Analysis and Archiving. PhD Presentation


1. An Innovative Method to Evaluate Website Archivability
Problem description
• Not all websites can be archived correctly.
• Web bots face difficulties in harvesting websites (technical problems, low performance, invalid code, blocking of web crawlers).
• After web harvesting, archive administrators manually review the content.
• Web crawling is automated while Quality Assurance (QA) is manual.
Our contributions
1. The Credible Live Evaluation of Archive Readiness Plus (CLEAR+) method to evaluate Website Archivability.
2. The ArchiveReady.com system, which is the reference implementation of the method.
3. Evaluation and observations regarding the Archivability of 12 prominent Web Content Management Systems (CMS).


Page 7: Web Crawling, Analysis and Archiving. PhD Presentation


CLEAR+: A Credible Live Method to Evaluate Website Archivability
• Website Archivability (WA) captures the core aspects of a website crucial in diagnosing whether it has the potential to be archived with completeness and accuracy.
  o Not to be confused with website reliability, availability, security, etc.
• CLEAR+: A method to produce a credible on-the-fly measurement of Website Archivability by:
  o Imitating web bots to crawl a website.
  o Evaluating captured information such as file encoding and errors.
  o Evaluating compliance with standards, formats and metadata.
  o Calculating a WA Score (0 – 100%).

Page 8: Web Crawling, Analysis and Archiving. PhD Presentation


CLEAR+ Archivability Facets and Website Attributes
• FA: Accessibility
• FC: Cohesion
• FM: Metadata
• FST: Standards Compliance

Page 9: Web Crawling, Analysis and Archiving. PhD Presentation


CLEAR+ Method Summary
1. Perform specific evaluations on Website Attributes.
2. Each evaluation has the following attributes:
   1. Belongs to one or more WA Facets.
   2. Has low, medium, or high Significance (different weight).
   3. Has a score range from 0 – 100%.
3. The score of each Facet is the weighted average of all evaluations’ scores.
4. The final Website Archivability is the average of all Facets’ scores.
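To make the aggregation concrete, the following is a minimal Python sketch of the CLEAR+ scoring scheme described above. The example evaluations, facet names and the low/medium/high weight values are illustrative assumptions, not the exact configuration used by ArchiveReady.com.

    # Sketch of the CLEAR+ aggregation: each evaluation belongs to one or more
    # facets, carries a significance weight and a 0-100% score; a facet score is
    # the weighted average of its evaluations, and Website Archivability (WA) is
    # the plain average of the facet scores.
    WEIGHTS = {"low": 1, "medium": 2, "high": 3}  # assumed weight values

    # Hypothetical evaluation results: (facets, significance, score in %)
    evaluations = [
        ({"accessibility"}, "high", 0),              # e.g. no sitemap.xml
        ({"accessibility"}, "high", 95),             # e.g. 21 valid, 1 invalid link
        ({"metadata", "standards"}, "medium", 100),  # e.g. HTTP Content-Type found
    ]

    def facet_score(facet, evals):
        """Weighted average of the scores of all evaluations in one facet."""
        relevant = [(WEIGHTS[sig], score) for facets, sig, score in evals if facet in facets]
        total_weight = sum(w for w, _ in relevant)
        return sum(w * s for w, s in relevant) / total_weight if total_weight else 0.0

    def website_archivability(evals):
        """WA is the plain average of the per-facet scores."""
        facets = {f for fs, _, _ in evals for f in fs}
        return sum(facet_score(f, evals) for f in facets) / len(facets)

    print(round(website_archivability(evaluations), 1))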

Page 10: Web Crawling, Analysis and Archiving. PhD Presentation


Accessibility Facet
Facet: FA (Accessibility)
Evaluation | Rating | Significance
No sitemap.xml | 0% | High
21 valid and 1 invalid link | 95% | High
2 inline JavaScript files | 0% | High
HTTP Caching Headers | 100% | Medium
Average response time 30ms, very fast | 100% | High
Not using proprietary formats (e.g. Flash or QuickTime) | 100% | High
Total: 63%
ADBIS 2015 Website Accessibility Evaluation, 1st Sept 2015

Page 11: Web Crawling, Analysis and Archiving. PhD Presentation


Cohesion Facet
• If the files constituting a single website are dispersed across different web locations, acquisition and ingest are likely to suffer if one or more of those locations fail.
• 3rd-party resources increase website volatility.
Facet: FC (Cohesion)
Evaluation | Rating | Significance
6 local and no external scripts | 100% | Medium
9 local and no external images | 100% | Medium
2 local and no external CSS | 100% | Medium
Total: 100%
ADBIS 2015 Website Accessibility Evaluation, 1st Sept 2015

Page 12: Web Crawling, Analysis and Archiving. PhD Presentation


Metadata Facet
• Adequate metadata are a big concern for digital curation.
• The lack of metadata impairs the archive’s ability to manage, organise, retrieve and interact with content effectively.
Facet: FM (Metadata)
Evaluation | Rating | Significance
HTTP Content type | 100% | Medium
HTTP Caching headers | 100% | Medium
Total: 100%
ADBIS 2015 Website Accessibility Evaluation, 1st Sept 2015

Page 13: Web Crawling, Analysis and Archiving. PhD Presentation


Standards Compliance Facet
Facet: FST (Standards Compliance)
Evaluation | Rating | Significance
2 invalid CSS files | 0% | Medium
Invalid HTML file | 0% | Medium
No HTTP Content transfer encoding | 50% | Medium
HTTP Content type found | 100% | Medium
HTTP Caching headers found | 100% | Medium
9 images found and validated with JHOVE | 100% | Medium
Not using proprietary formats (e.g. Flash or QuickTime) | 100% | High
Total: 74%
ADBIS 2015 Website Accessibility Evaluation, 1st Sept 2015

Page 14: Web Crawling, Analysis and Archiving. PhD Presentation


ADBIS’2015 Website Archivability Evaluation
• Web application implementing CLEAR+
• Web interface and REST API
• Developed using Python, MySQL, Redis, PhantomJS, Nginx, Linux.

Page 15: Web Crawling, Analysis and Archiving. PhD Presentation


Experimentation with Assorted Datasets
• D1: National libraries, D2: Top 200 universities,
• D3: Government organizations, D4: Random spam websites from Alexa.

Page 16: Web Crawling, Analysis and Archiving. PhD Presentation


Evaluation by experts
• Experts evaluate how well a website is archived in the Internet Archive and assign a score.
• We evaluate the WA Score using ArchiveReady.com.
• Pearson’s Correlation Coefficient for WA, WA Facets and experts’ score.
• Correlation: 0.516
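For reference, a correlation like the one reported above can be reproduced for any pair of score lists with SciPy; the numbers below are made-up placeholders, not the experimental data.

    from scipy.stats import pearsonr

    # Hypothetical paired observations: WA scores computed by ArchiveReady.com and
    # expert scores for how well the same websites are archived in the Internet Archive.
    wa_scores = [72, 85, 64, 90, 78, 55, 81]
    expert_scores = [70, 80, 60, 95, 65, 50, 85]

    r, p_value = pearsonr(wa_scores, expert_scores)
    print(f"Pearson's r = {r:.3f} (p = {p_value:.3f})")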

Page 17: Web Crawling, Analysis and Archiving. PhD Presentation


WA Variance in the Same Website

Page 18: Web Crawling, Analysis and Archiving. PhD Presentation


Web Content Management Systems Archivability
• Aim: Identify strengths and weaknesses of different web CMS regarding their WA.
• Corpus: 5,821 random WCMS samples from the Alexa top 1M websites. Systems:
  o Blogger, DataLife Engine, DotNetNuke, Drupal, Joomla, Mediawiki, MovableType, Plone, PrestaShop, Typo3, vBulletin, Wordpress.
• Evaluation using the ArchiveReady.com API.
• Results saved in MySQL and analysed.

Page 19: Web Crawling, Analysis and Archiving. PhD Presentation


WCMS Accessibility Variations

Page 20: Web Crawling, Analysis and Archiving. PhD Presentation


WCMS Standards Compliance Variations

Page 21: Web Crawling, Analysis and Archiving. PhD Presentation


WCMS Metadata Results

Page 22: Web Crawling, Analysis and Archiving. PhD Presentation


WCMS Archivability Results Summary

Page 23: Web Crawling, Analysis and Archiving. PhD Presentation


Website Archivability Impact
• The Deutsches Literaturarchiv Marbach has been using the ArchiveReady API in its web archiving workflow since early 2014.
• Stanford University Libraries Web Archiving Resources recommends using the CLEAR method and ArchiveReady.
• The University of South Australia is using ArchiveReady in its Digital Preservation Course (INFS 5082).
• Invited to present at the Library of Congress, National Digital Information Infrastructure & Preservation, Web Archiving, 2015, and at the Internet Archive Web Archiving meeting (University of Innsbruck, 2013).
• Many contacts and users from: University of Newcastle, University of Manchester, Columbia University, Stanford University, University of Michigan Bentley Historical Library, Old Dominion University.
• 120 unique daily visitors, 80,000+ evaluations at http://archiveready.com/.


Page 24: Web Crawling, Analysis and Archiving. PhD Presentation


Presentation Structure
1. An Innovative Method to Evaluate Website Archivability,
2. Near-duplicate and Cycle Detection in Webgraphs towards Optimised Web Crawling,
3. The BlogForever Platform: An Integrated Approach to Preserve Weblogs,
4. A Scalable Approach to Harvest Modern Weblogs,
5. Conclusions and Future Work.

Page 25: Web Crawling, Analysis and Archiving. PhD Presentation


2. Near-duplicate and Cycle Detection in Webgraphs towards Optimised Web Crawling
Problem description
• Web bots capture a lot of duplicate and near-duplicate data.
  o There are methods to detect and remove duplicate data after crawling.
  o There are few methods to remove near-duplicate data in web archives.
• Web bots fall into web spider traps, webpages that cause infinite loops. There is no automated solution to detect them.
Our Contributions
1. A set of methods to detect duplicate and near-duplicate webpages in real time during web crawling.
2. A set of methods to detect web spider traps using webgraphs in real time during web crawling.
3. The WebGraph-It.com system, a web platform which implements the proposed methods.

Page 26: Web Crawling, Analysis and Archiving. PhD Presentation


Key Concepts

• Unique Webpage Identifier?
• Webpage similarity metric?
• Web crawling modeled as a graph?

Page 27: Web Crawling, Analysis and Archiving. PhD Presentation


Key Concepts: Unique Webpage Identifier
• URI is not always optimal as a unique webpage identifier.
  o http://edition.cnn.com/videos - http://edition.cnn.com/videos#some-point
  o http://edition.cnn.com/videos?v1=1&v2=2 - http://edition.cnn.com/videos?v2=2&v1=1
• Sort-friendly URI Reordering Transform (SURT) URI Conversion.
  o URI: scheme://user@domain.tld:port/path?query#fragment
  o SURT: scheme://(tld,domain,:port@user)/path?query
  o URI: http://edition.cnn.com/tech -> SURT: com,cnn,edition/tech
• SURT encoding is lossy. A SURT is not always reversible to the original URI.
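A rough Python sketch of the transform shown in the example above (reverse the host labels, keep path and query, drop scheme, user info, port and fragment). This is a simplification for illustration; production web archiving tools implement many more canonicalisation rules.

    from urllib.parse import urlsplit

    def to_surt(uri: str) -> str:
        """Simplified SURT: reverse host labels, keep path and query,
        drop scheme, user info, port and fragment (illustrative only)."""
        parts = urlsplit(uri)
        host = ",".join(reversed(parts.hostname.split("."))) if parts.hostname else ""
        surt = host + (parts.path or "/")
        if parts.query:
            surt += "?" + parts.query
        return surt

    print(to_surt("http://edition.cnn.com/tech"))  # -> com,cnn,edition/tech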

Page 28: Web Crawling, Analysis and Archiving. PhD Presentation


Key Concepts: Unique Webpage Identifier Similarity
• Near-duplicate URIs/SURTs may have duplicate content.
  o http://vbanos.gr/page?show-greater=10 - http://vbanos.gr/page?show-greater=11
  o http://vbanos.gr/blog/tag/cakephp/ - http://vbanos.gr/blog/tag/php/
• We use the Sorensen-Dice coefficient similarity to search for near-duplicate webpage identifiers with a 95% similarity threshold.
  o Low sensitivity to word ordering,
  o Low sensitivity to length variations,
  o Runs in linear time.
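A minimal sketch of the Sorensen-Dice check on two webpage identifiers, using character bigrams; the 95% threshold is the one stated above, while the bigram tokenisation and the set-based simplification are assumptions made for brevity.

    def dice_similarity(a: str, b: str) -> float:
        """Sorensen-Dice coefficient over character bigrams (set-based, linear time)."""
        bigrams = lambda s: {s[i:i + 2] for i in range(len(s) - 1)}
        x, y = bigrams(a), bigrams(b)
        if not x and not y:
            return 1.0
        return 2 * len(x & y) / (len(x) + len(y))

    u1 = "http://vbanos.gr/page?show-greater=10"
    u2 = "http://vbanos.gr/page?show-greater=11"
    # Values close to or above the 0.95 threshold flag near-duplicate identifiers.
    print(round(dice_similarity(u1, u2), 3))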

Page 29: Web Crawling, Analysis and Archiving. PhD Presentation


Key Concepts: Unique Webpage Identifier Similarity

Page 30: Web Crawling, Analysis and Archiving. PhD Presentation


Key Concepts: Webpage content similarity
• Content similarity:
  • Exact duplicate webpages
  • Near-duplicate webpages (ads, dates, counters may change)
• We use the simhash algorithm (Charikar) to calculate bit signatures from each webpage.
• 96 bit webpage signature.
• Near-duplicate webpages have very few different bits.
• Fast to compare the similarity of two webpages.
• Efficient storage (save only the signature, keep it in memory).
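The following is a hedged sketch of a Charikar-style simhash with a 96-bit signature, as mentioned above. The tokenisation (whitespace split) and the MD5-derived token hash are assumptions for illustration, not the exact implementation used in the thesis.

    import hashlib

    BITS = 96  # signature length mentioned on the slide

    def token_hash(token: str) -> int:
        """A BITS-bit hash per token (MD5 truncated; an illustrative choice)."""
        digest = hashlib.md5(token.encode("utf-8")).digest()
        return int.from_bytes(digest, "big") >> (128 - BITS)

    def simhash(text: str) -> int:
        """Charikar-style simhash: per-bit vote over all token hashes."""
        counts = [0] * BITS
        for token in text.split():
            h = token_hash(token)
            for i in range(BITS):
                counts[i] += 1 if (h >> i) & 1 else -1
        return sum(1 << i for i, c in enumerate(counts) if c > 0)

    def hamming_distance(a: int, b: int) -> int:
        """Near-duplicate pages yield signatures differing in only a few bits."""
        return bin(a ^ b).count("1")

    s1 = simhash("breaking news story about web archiving published today")
    s2 = simhash("breaking news story about web archiving published yesterday")
    print(hamming_distance(s1, s2))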

Page 31: Web Crawling, Analysis and Archiving. PhD Presentation


Key Concepts: Webpage content similarity

Page 32: Web Crawling, Analysis and Archiving. PhD Presentation


Key concepts: Webgraph cycle detection
Step 1: A new node F is added to the webgraph.
Step 2: Get nearby nodes (dist=3) and check for duplicates / near-duplicates.
Step 3: Cycle detection using DFS (dist=3).
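A minimal sketch of the bounded-distance check outlined in the three steps above: when a new node is added, collect nearby nodes within a small distance and run a depth-limited DFS to see whether following its edges can lead back to it. The graph representation and method names are assumptions for illustration, not the WebGraph-It.com implementation.

    from collections import defaultdict

    class WebGraph:
        def __init__(self):
            self.edges = defaultdict(set)  # SURT -> set of outgoing SURTs

        def add_edge(self, src, dst):
            self.edges[src].add(dst)

        def nearby_nodes(self, start, dist=3):
            """Step 2: nodes reachable from `start` within `dist` hops,
            candidates for the duplicate / near-duplicate comparison."""
            frontier, seen = {start}, {start}
            for _ in range(dist):
                frontier = {n for u in frontier for n in self.edges[u]} - seen
                seen |= frontier
            return seen - {start}

        def has_cycle_from(self, start, dist=3, path=None):
            """Step 3: depth-limited DFS that reports a cycle returning to `start`."""
            path = path or [start]
            if dist == 0:
                return False
            for nxt in self.edges[path[-1]]:
                if nxt == start:
                    return True
                if nxt not in path and self.has_cycle_from(start, dist - 1, path + [nxt]):
                    return True
            return False

    g = WebGraph()
    for src, dst in [("A", "B"), ("B", "C"), ("C", "A")]:
        g.add_edge(src, dst)
    print(g.has_cycle_from("A"))  # True: A -> B -> C -> A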

Page 33: Web Crawling, Analysis and Archiving. PhD Presentation


Web Crawling Algorithms

Page 34: Web Crawling, Analysis and Archiving. PhD Presentation


WebGraph-It.com System
• Web application implementing all presented algorithms. API available.
• Built using Python, PhantomJS, Redis, MariaDB, Linux.
• Easy to expand and create new web crawling algorithms as plugins.

Page 35: Web Crawling, Analysis and Archiving. PhD Presentation


Evaluation
1. Dataset: 100 random websites from Alexa top 1M.
2. Crawl with all 8 algorithms (C1-C8) using the WebGraph-It system.
3. Record metrics for each web crawl.
4. Analyse the results and compare with the base web crawl.

Page 36: Web Crawling, Analysis and Archiving. PhD Presentation


Indicative results for a single website

Page 37: Web Crawling, Analysis and Archiving. PhD Presentation


Results


Page 38: Web Crawling, Analysis and Archiving. PhD Presentation


Evaluation conclusions
• Best method is D8: Cycle detection with content similarity.
• 17.1% faster than the base crawl.
• 60% of base crawl webpages captured.
• 98.3% results completeness.
• Always use SURT instead of URL as a unique webpage identifier.
• Use URL/SURT similarity AND content similarity together.
• Using URL/SURT similarity alone yields incomplete results.


Page 39: Web Crawling, Analysis and Archiving. PhD Presentation


Presentation Structure
1. An Innovative Method to Evaluate Website Archivability,
2. Near-duplicate and Cycle Detection in Webgraphs towards Optimised Web Crawling,
3. The BlogForever Platform: An Integrated Approach to Preserve Weblogs,
4. A Scalable Approach to Harvest Modern Weblogs,
5. Conclusions and Future Work.

Page 40: Web Crawling, Analysis and Archiving. PhD Presentation


3. The BlogForever Platform: An Integrated Approach to Preserve Weblogs
Problem description
Current web archiving tools have issues with weblog archiving:
• Scheduling (fixed intervals vs. archiving when new content is available),
• Content selection (archive everything instead of archiving the updated content only),
• Ignoring weblog features (rich set of information entities, structured content, RSS, tags, etc.)
Our contributions
1. A survey of the technical characteristics of weblogs.
2. Methods to improve weblog harvesting, archiving and management.
3. Methods to integrate weblog archives with existing archive technologies.
4. The BlogForever platform: A system to support harvesting, ingestion, management and reuse of weblogs.

Page 41: Web Crawling, Analysis and Archiving. PhD Presentation


Technical survey of the blogosphere
• Dataset: 259,930 blogs
• Evaluate the use of:
  o Blog platforms,
  o Web standards (HTTP headers, HTML markup, etc.),
  o XML feeds,
  o Image formats,
  o JavaScript frameworks,
  o Semantic markup (Microformats, XFN, OpenGraph, etc.)

Page 42: Web Crawling, Analysis and Archiving. PhD Presentation


Indicative survey results: Blog platforms

Page 43: Web Crawling, Analysis and Archiving. PhD Presentation


Indicative survey results: Image and feed types

Page 44: Web Crawling, Analysis and Archiving. PhD Presentation


[Figure: BlogForever Conceptual Data Model, Version 0.6]
The diagram centres on the Blog entity: a Blog has Entries (Posts and Pages), each Entry has Content, Comments and Authors, and a Blog also has Feeds, Layout and Snapshot views. Around this core, the model groups related entities for Categorised Content, Community, Web Feeds, External Widgets, Network and Linked Data, Blog Context, Semantics, Spam Detection, Ranking, Category and Similarity, Crawling Info, and Standard and Ontology Mapping.

Page 45: Web Crawling, Analysis and Archiving. PhD Presentation


The BlogForever platform
[Architecture diagram] The platform consists of two main components:
• Blog crawlers (Harvesting): real-time monitoring, HTML data extraction engine, spam filtering, web services extraction engine. Input: unstructured information, web services and blog APIs. Output: the original data and XML metadata.
• Blog digital repository (Preserving, Managing and reusing): digital preservation and QA, collections curation, public access APIs, a web interface to browse, search and export, and personalised services, exposed via web services and a web interface.

Page 46: Web Crawling, Analysis and Archiving. PhD Presentation


Page 47: Web Crawling, Analysis and Archiving. PhD Presentation


The BlogForever platform

Page 48: Web Crawling, Analysis and Archiving. PhD Presentation


Evaluation using external testers

Page 49: Web Crawling, Analysis and Archiving. PhD Presentation


Presentation Structure
1. An Innovative Method to Evaluate Website Archivability,
2. Near-duplicate and Cycle Detection in Webgraphs towards Optimised Web Crawling,
3. The BlogForever Platform: An Integrated Approach to Preserve Weblogs,
4. A Scalable Approach to Harvest Modern Weblogs,
5. Conclusions and Future Work.

Page 50: Web Crawling, Analysis and Archiving. PhD Presentation


4. A scalable approach to harvest modern weblogs
Problem description
• Inefficient weblog harvesting with generic solutions.
• Unpredictable publishing rate of weblogs.
Our contributions

1. A new algorithm to build extraction rules from blog web feeds with linear time complexity,

2. Applications of the algorithm to extract authors, publication dates and comments,

3. A new web crawler architecture and system capable of extracting blog articles, authors, publication dates and comments.

Page 51: Web Crawling, Analysis and Archiving. PhD Presentation


Motivation & Method Overview
• Extracting metadata and content from HTML is hard because web standards usage is low: 95% of websites do not pass HTML validation.
• Focusing on blogs, we observed that:
  1. Blogs provide XML feeds: standardized views of their latest ~10 posts.
  2. We have to access more posts than the ones referenced in web feeds.
  3. Posts of the same blog share a similar HTML structure.
• Content Extraction Method Overview
  1. Use blog XML feeds and referenced HTML pages as training data to build extraction rules.
  2. For each XML element (Title, Author, Description, Publication date, etc.) create the relevant HTML extraction rule.
  3. Use the defined extraction rules to process all blog pages.

Page 52: Web Crawling, Analysis and Archiving. PhD Presentation


Locate all RSS-referenced elements in the HTML page

Page 53: Web Crawling, Analysis and Archiving. PhD Presentation


Generic procedure to build extraction rules

Page 54: Web Crawling, Analysis and Archiving. PhD Presentation


Extraction rules and string similarity
• Rules are XPath queries.
• For each rule, we compute the score based on string similarity.
• The choice of ScoreFunction greatly influences the running time and precision of the extraction process.
• Why we chose the Sorensen-Dice coefficient similarity:
  1. Low sensitivity to word ordering and length variations,
  2. Runs in linear time.
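To make the rule-building step concrete, here is a hedged sketch using lxml: enumerate the elements of a post's HTML page, score each element's text against the value taken from the blog's web feed with the Sorensen-Dice measure, and keep the XPath of the best-scoring element as the extraction rule. The helper names and the simplified scoring are assumptions, not the exact BlogForever crawler implementation.

    from lxml import html

    def dice(a: str, b: str) -> float:
        """Sorensen-Dice coefficient over character bigrams."""
        grams = lambda s: {s[i:i + 2] for i in range(len(s) - 1)}
        x, y = grams(a), grams(b)
        return 2 * len(x & y) / (len(x) + len(y)) if (x or y) else 1.0

    def best_extraction_rule(page_html: str, feed_value: str):
        """Return (xpath, score) of the element whose text best matches the feed value."""
        tree = html.fromstring(page_html)
        best_xpath, best_score = "", 0.0
        for element in tree.iter():
            text = " ".join(element.text_content().split())
            if not text:
                continue
            score = dice(text, feed_value)
            if score > best_score:
                best_xpath, best_score = tree.getroottree().getpath(element), score
        return best_xpath, best_score

    page = "<html><body><div id='page'><header><h1>volumelaser.eim.gr</h1></header></div></body></html>"
    print(best_extraction_rule(page, "volumelaser.eim.gr"))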

Page 55: Web Crawling, Analysis and Archiving. PhD Presentation


Example: blog post title best extraction rule
• Find the RSS blog post title “volumelaser.eim.gr” in the HTML page http://vbanos.gr/blog/2014/03/09/volumelaser-eim-gr-2/
• The best extraction rule for the blog post title is: /body/div[@id=“page”]/header/h1

XPath | HTML Element Value | Similarity Score
/body/div[@id=“page”]/header/h1 | volumelaser.eim.gr | 100%
/body/div[@id=“page”]/div[@class=“entry-code”]/p/a | http://volumelaser.eim.gr/ | 80%
/head/title | “volumelaser.eim.gr | Βαγγέλης Μπάνος” | 66%
... | ... | ...

Page 56: Web Crawling, Analysis and Archiving. PhD Presentation


Variations for authors, dates, comments
• Authors, dates and comments are special cases as they appear many times throughout a post.
• To resolve this issue, we implement an extra component in the Score function:
  o For authors: an HTML tree distance between the evaluated node and the post content node.
  o For dates: we check the alternative formats of each date in addition to the HTML tree distance between the evaluated node and the post content node.
  o Example: “1970-01-01” == “January 1 1970” (see the sketch below).
  o For comments: we use the special comment RSS feed.
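A small sketch of the date-equivalence check mentioned above: try a handful of alternative formats and compare the parsed values. The format list is an illustrative assumption; the actual crawler may recognise more formats.

    from datetime import datetime

    # Assumed list of alternative date formats to try.
    DATE_FORMATS = ["%Y-%m-%d", "%B %d %Y", "%d/%m/%Y", "%b %d, %Y"]

    def parse_date(value: str):
        """Return a date if the string matches one of the known formats, else None."""
        for fmt in DATE_FORMATS:
            try:
                return datetime.strptime(value.strip(), fmt).date()
            except ValueError:
                continue
        return None

    # "1970-01-01" and "January 1 1970" normalise to the same date.
    print(parse_date("1970-01-01") == parse_date("January 1 1970"))  # True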

Page 57: Web Crawling, Analysis and Archiving. PhD Presentation


System pipeline of operations:
1. Render HTML and JavaScript,
2. Extract content,
3. Extract comments,
4. Download multimedia files,
5. Propagate resulting records to the back-end.
Interesting areas:
◦ Blog post page identification,
◦ Handling blogs with a large number of pages,
◦ JavaScript rendering,
◦ Scalability.

Page 58: Web Crawling, Analysis and Archiving. PhD Presentation


Evaluation
• Extract articles and titles from web pages and compare extraction success rate and running time.
• Comparison against three open-source projects:
  o Readability (JavaScript), Boilerpipe (Java), Goose (Scala).
• Dataset: 2,300 blog posts from 230 blogs, obtained from Spinn3r.

Page 59: Web Crawling, Analysis and Archiving. PhD Presentation


5. Conclusions
• We proposed tangible ways to improve web crawling, web archiving and blog archiving with new algorithms and systems.

• The Credible Live Evaluation of Archive Readiness Plus (CLEAR+) method to evaluate Website Archivability.

• Methods to improve web crawling via detecting duplicates, near-duplicates and web spider traps on the fly.

• A new approach to harvest, manage, preserve and reuse weblogs.

• A new scalable algorithm to harvest modern weblogs.

Page 60: Web Crawling, Analysis and Archiving. PhD Presentation


Publications
Publications in scientific journals:

1. Banos V., Manolopoulos Y.: “Near-duplicate and Cycle Detection in Webgraphs towards Optimised Web Crawling”, ACM Transactions on the Web Journal, submitted, 2015.

2. Banos V., Manolopoulos Y.: “A Quantitative Approach to Evaluate Website Archivability Using the CLEAR+ Method”, International Journal on Digital Libraries, 2015.

3. Banos V., Blanvillain O., Kasioumis N., Manolopoulos Y.: “A Scalable Approach to Harvest Modern Weblogs”, International Journal of AI Tools, Vol.24, No.2, 2015.

4. Kasioumis N., Banos V., Kalb H.: “Towards Building a Blog Preservation Platform”, World Wide Web Journal, Special Issue on Social Media Preservation and Applications, Springer, 2013.

Publications in international conference proceedings:

5. Banos V., Manolopoulos Y.: “Web Content Management Systems Archivability”, Proceedings 19th East-European Conference on Advances in Databases & Information Systems (ADBIS), Springer Verlag, LNCS Vol.9282, Poitiers, France, 2015.

6. Blanvillain O., Banos V., Kasioumis N.: “BlogForever Crawler: Techniques and Algorithms to Harvest Modern Weblogs”, Proceedings 4th International Conference on Web Intelligence, Mining & Semantics (WIMS), ACM Press, Thessaloniki, Greece, 2014.

Page 61: Web Crawling, Analysis and Archiving. PhD Presentation


Publications (continued)
7. Banos V., Kim Y., Ross S., Manolopoulos Y.: “CLEAR: a Credible Method to Evaluate Website Archivability”, Proceedings 10th International Conference on Preservation of Digital Objects (iPRES), Lisbon, Portugal, 2013.
8. Kalb H., Lazaridou P., Banos V., Kasioumis N., Trier M.: “BlogForever: From Web Archiving to Blog Archiving”, Proceedings ‘Informatik Angepast an Mensch, Organisation und Umwelt‘ (INFORMATIK), Koblenz, Germany, 2013.
9. Stepanyan K., Gkotsis G., Banos V., Cristea A., Joy M.: “A Hybrid Approach for Spotting, Disambiguating and Annotating Places in User-Generated Text”, Proceedings 22nd International Conference on World Wide Web (WWW), Rio de Janeiro, Brazil, 2013.
10. Banos V., Baltas N., Manolopoulos Y.: “Trends in Blog Preservation”, Proceedings 14th International Conference on Enterprise Information Systems (ICEIS), Vol.1, pp.13-22, Wroclaw, Poland, 2012.
11. Banos V., Stepanyan K., Manolopoulos Y., Joy M., Cristea A.: “Technological Foundations of the Current Blogosphere”, Proceedings 2nd International Conference on Web Intelligence, Mining & Semantics (WIMS), ACM Press, Craiova, Romania, 2012.
Book chapters:
12. Banos V., Baltas N., Manolopoulos Y.: “Blog Preservation: Current Challenges and a New Paradigm”, chapter 3 in Enterprise Information Systems XIII, by Cordeiro J., Maciaszek L. and Filipe J. (eds.), Springer LNBIP Vol.141, pp.29–51, 2013.

Page 62: Web Crawling, Analysis and Archiving. PhD Presentation


Future Work
1. Website Archivability
   1. Augment the CLEAR+ method with new metrics.
   2. Disseminate to wider audiences (e.g. web developers).
   3. Integrate with web archiving systems.
   4. Improve http://archiveready.com/
2. Web crawling duplicate and near-duplicate detection
   1. Develop new algorithm variants.
   2. Integrate into open source web crawlers.
   3. Provide support services to web crawling operations.
   4. Improve http://webgraph-it.com/
3. BlogForever platform
   1. Automate content curation processes.
   2. Improve entity detection in archived content.
   3. Support more types of weblogs.
   4. http://webternity.eu/

Page 63: Web Crawling, Analysis and Archiving. PhD Presentation

Web Crawling, Analysis and Archiving
PHD DEFENSE: VANGELIS BANOS

DEPARTMENT OF INFORMATICS, ARISTOTLE UNIVERSITY OF THESSALONIKI, OCTOBER 2015

THANK YOU!