Analysis of DOM Structures for Site-Level Template Extraction (PSI 2015). Joint work done in collaboration with Julián Alarte, Josep Silva, Salvador Tamarit

Contents

Motivation

• Content Extraction and Block Detection

• Template Extraction

A Technique for Template Extraction

• State of the art

• The DOM tree

• Template extraction based on DOM

Experiments

• Firefox plugin online DEMO

Conclusions and Future Work

Motivation

[Figure labels: Information Retrieval, Web Mining, Template Detection, Content Extraction, Block Detection]

Motivation

What is content extraction?

Content extraction is the process of determining which parts of a webpage contain the main textual content, ignoring additional context such as menus, status bars, advertisements, sponsored information, etc.

What is block detection?

Block detection is the discipline that tries to isolate every information block in a webpage.

Motivation

[Figure annotations: "The date is different"; "The title is different"]

Motivation

Why is template extraction useful?

• Component reuse. Web developers can automatically extract components from a webpage.

• Enhancing indexers and text analyzers, which perform better when they only process relevant information.

• It has been measured that almost 40-50% of the components of a webpage represent the template.

• Extraction of the main content of a webpage so that it can be suitably displayed on a small device such as a PDA or a mobile phone.

• Extraction of the relevant content to make the webpage more accessible for visually impaired or blind users.

The Technique

What is a webpage?

The Technique

State of the Art

Three main ways to solve the problem:

• Using the textual information of the webpage (i.e., the HTML code)

• Using the rendered image of the webpage in the browser

• Using the DOM tree of the webpage

Approaches that use the textual information of the webpage:

• Densitometric features: counting characters and tags

• Statistics on terms: some terms are common in templates
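As a rough illustration of what "counting characters and tags" can look like, the sketch below computes, for each line of HTML, the number of text characters, the number of tags, and their ratio. The function name `density_per_line`, the regular expression, and the per-line granularity are assumptions made for this example; the slides do not describe a concrete formula.

```python
import re

# Assumption: a tag is anything between "<" and ">"; good enough for a sketch,
# not a full HTML tokenizer.
TAG = re.compile(r"<[^>]*>")


def density_per_line(html: str):
    """Return (line, text_chars, tag_count, ratio) for each non-empty line."""
    rows = []
    for line in html.splitlines():
        if not line.strip():
            continue
        tag_count = len(TAG.findall(line))
        text_chars = len(TAG.sub("", line).strip())
        # Avoid division by zero for lines that contain no tags at all.
        ratio = text_chars / max(tag_count, 1)
        rows.append((line.strip(), text_chars, tag_count, ratio))
    return rows


if __name__ == "__main__":
    sample = (
        '<ul><li><a href="/">Home</a></li></ul>\n'
        "<p>Some long text here.</p>"
    )
    for line, chars, tags, ratio in density_per_line(sample):
        print(f"{chars:3d} chars, {tags} tags, ratio {ratio:.2f}: {line}")
```

Lines dominated by markup (such as a menu) tend to get a low ratio, while lines of running text get a high one, which is the intuition behind densitometric features.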

The Technique

State of the Art

Approaches that use the rendered image of the webpage in the browser:

• Position of elements: lateral menus, main content centered and visible

• Less studied: rendering webpages is computationally expensive

The Technique

State of the Art

Approaches that use the DOM tree of the webpage:

• Analysis of the DOM structure: difficulty in analysing DIV-based structures

• Comparing several webpages: search for common structures

The Technique

Limitations of current approaches

• Some are based on the assumption that the webpage has a particular structure (e.g., based on table markup tags) [10].

• Some assume that the main content text is continuous [11].

• Some assume that the system knows a priori the format of the webpage [10].

• Some need to (randomly) load many webpages (several dozens) to compare them.

• Some assume that the whole website to which the webpage belongs is based on the use of some template that is repeated.

Example markup shown on the slide:

    <h2>Directory</h2>
    <div class="vcard">
      <span class="fn">Vicente Ramos</span>
      <div class="org">Software Development</div>
      <div class="adr">
        <div class="street-address">Atmosphere 118</div>
        <span class="locality">La Piedad, México</span>
        <span class="postal-code">59300</span>
      </div>
      <div class="tel">+52 352 52 68499</div>
      <h4>His Company</h4>
      <a class="url" href="page2.html">Company Page</a>
    </div>

The Technique

Limitations of current approaches

The main problem of these approaches is a big loss of generality.

They require knowing or parsing the webpages in advance, or they require the webpage to have a particular structure.

This is very inconvenient because modern webpages are mainly based on <div> tags, which need not be hierarchically organized (as in the table-based design).

Moreover, many webpages are nowadays generated automatically and dynamically, so it is often impossible to analyze the webpages a priori.

The Technique

Other approaches are able to work:

+ Online (i.e., with any webpage)

+ In real-time (i.e., without the need to preprocess the webpages or know their structure)

The Technique

The Document Object Model (DOM)

An API that provides programmers with a standard set of objects for the representation of HTML and XML documents.

Given a webpage, its associated DOM structure can be produced fully automatically, and vice versa.

The DOM structure of a given webpage is a tree where all the elements of the webpage (including scripts and CSS styles) are represented hierarchically.

[Figure: example DOM tree with nodes Body, Div, H1, Table, Image, and Text]

The Technique

The Document Object Model (DOM)

Nodes in the DOM tree can be of two types: tag nodes and text nodes.

Tag nodes represent the HTML tags of an HTML document, and they contain all the information associated with the tags (e.g., their attributes).

Text nodes are always leaves in the DOM tree because they cannot contain other nodes.

I want to know more!

http://www.w3.org/DOM/
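To make the two node types concrete, here is a minimal sketch that builds a simplified DOM-like tree with Python's standard `html.parser` module. The `Node` and `TreeBuilder` classes, the `build_tree` helper, and the `#document` root label are names invented for this example; a browser's real DOM is richer (comments, error recovery, scripts), so this is only an approximation.

```python
from html.parser import HTMLParser

# Elements that never have children or a closing tag in HTML.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    def __init__(self, kind, value, attrs=None):
        self.kind = kind                 # "tag" or "text"
        self.value = value               # tag name, or the text itself
        self.attrs = dict(attrs or [])   # attributes (tag nodes only)
        self.children = []               # always empty for text nodes


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("tag", "#document")
        self.stack = [self.root]         # currently open tag nodes

    def handle_starttag(self, tag, attrs):
        node = Node("tag", tag, attrs)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest open element with this name, if there is one.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].value == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        # Text nodes are attached as leaves and never pushed on the stack.
        if data.strip():
            self.stack[-1].children.append(Node("text", data.strip()))


def build_tree(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def dump(node, depth=0):
    print("  " * depth + f"{node.kind}: {node.value} {node.attrs or ''}")
    for child in node.children:
        dump(child, depth + 1)


if __name__ == "__main__":
    dump(build_tree('<body><h1 class="t">Title</h1>'
                    '<div><img src="a.png"><p>Hello</p></div></body>'))
```

Because text nodes are never pushed onto the stack of open elements, they can never receive children, which mirrors the statement that text nodes are always leaves.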

The Technique

Our method for template extraction in a nutshell:

1. Identify a set of webpages in the website topology. Select those nodes that belong to the menu. Use a complete subdigraph.

2. Solve conflicts between those webpages that implement different templates, by establishing a voting system between the webpages.

3. The template is the intersection between the initial webpage and the DOM trees in the subdigraph. The intersection is computed with an Equal Top-Down Mapping between the DOM trees.

The three steps can be done with a linear cost with respect to the size of the DOM trees.

The Technique

1. Identify a set of webpages in the website topology. Select those nodes that belong to the menu. Use a complete subdigraph.

[Figure labels: Menu, Submenu, Domain A, Domain B, Domain C]

[Figure labels: Hyperlink distance, DOM distance]
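One way to read "use a complete subdigraph" is: starting from the initial webpage, collect pages such that every pair of them links to each other, as menu entries typically do. The sketch below does this greedily over a link graph. The `links` dictionary, the function names, the alphabetical candidate order, and the optional `limit` are all assumptions for illustration; the slides do not specify how candidates are ordered (the figure labels suggest hyperlink distance and DOM distance play a role), and a greedy pass does not guarantee the largest possible set.

```python
def mutually_linked(links, a, b):
    """True if page a links to page b and page b links back to a."""
    return b in links.get(a, set()) and a in links.get(b, set())


def greedy_complete_subdigraph(links, key_page, limit=None):
    """Greedily grow a set of pages, containing key_page, in which every
    pair of pages is mutually linked (a complete subdigraph)."""
    chosen = [key_page]
    # Assumption: candidates are visited in alphabetical order.
    for candidate in sorted(links.get(key_page, set())):
        if candidate == key_page:
            continue
        if all(mutually_linked(links, candidate, page) for page in chosen):
            chosen.append(candidate)
            if limit is not None and len(chosen) >= limit:
                break
    return chosen


if __name__ == "__main__":
    # Hypothetical site: "home", "about" and "news" form the menu,
    # while "post1" only links back to "home".
    links = {
        "home": {"about", "news", "post1"},
        "about": {"home", "news"},
        "news": {"home", "about", "post1"},
        "post1": {"home"},
    }
    print(greedy_complete_subdigraph(links, "home"))
```

In this hypothetical site, "post1" is rejected because it is not mutually linked with "about", so only the three menu pages are kept.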

The Technique

2. Solve conflicts between those webpages that implement different templates, by establishing a voting system between the webpages.

The Technique

3. The template is the intersection between the initial webpage and the DOM trees in the subdigraph. The intersection is computed with an Equal Top-Down Mapping between the DOM trees.

[Figure: five example DOM trees, labelled P1 to P5, each with nodes Body, Div, H1, Table, Image, and Text]

The Technique

Mapping:

[Figure: two DOM trees (nodes HTML, Body, Div, Table, P) with a mapping between their nodes]

Top-Down Mapping:

[Figure: the same two DOM trees with a top-down mapping between their nodes]

Equal Top-Down Mapping:

[Figure: the same two DOM trees with an equal top-down mapping between their nodes]
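The sketch below shows one possible reading of an equal top-down mapping and of the template as an intersection. It assumes that two nodes are "equal" when their labels match, that a node can only be mapped if its parent is mapped (top-down), and that children are paired position by position; the slides show the mappings only as figures, so the pairing rule, the `Node` class, and the function names are assumptions rather than the authors' definition. Nodes of the initial webpage are identified by their path of child indexes.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple

Path = Tuple[int, ...]


@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)


def mapped_paths(a: Node, b: Node, path: Path = ()) -> Set[Path]:
    """Paths of the nodes of `a` that are mapped to an equal node of `b`.

    Top-down: children are only visited when their parents are mapped.
    Equal: a pair is mapped only if both labels are the same.
    Each node of `a` is visited at most once, so the cost is linear."""
    if a.label != b.label:
        return set()
    paths = {path}
    # Assumption: children are paired by position.
    for i, (child_a, child_b) in enumerate(zip(a.children, b.children)):
        paths |= mapped_paths(child_a, child_b, path + (i,))
    return paths


def template_paths(key_page: Node, others: List[Node]) -> Set[Path]:
    """Nodes of the key page that are mapped in every other page."""
    paths = mapped_paths(key_page, key_page)  # all nodes of the key page
    for other in others:
        paths &= mapped_paths(key_page, other)
    return paths


if __name__ == "__main__":
    key = Node("body", [Node("div", [Node("table")]),
                        Node("h1"),
                        Node("table", [Node("text")])])
    other = Node("body", [Node("div", [Node("p")]),
                          Node("h1"),
                          Node("table", [Node("text")])])
    print(sorted(template_paths(key, [other])))
```

Computing each mapping against the original key page, and intersecting the resulting path sets afterwards, keeps child positions stable across pages; the result is still closed under parents, so it is itself a subtree of the key page.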

Experiments

Benchmarks: online heterogeneous webpages

• Domains with different layouts and page structures

• Company websites, news articles, forums, etc.

The final evaluation set was randomly selected.

We determined the actual template of each webpage by downloading it and manually selecting the template.

The DOM tree of the selected elements was then produced and used later for comparison in the evaluation.

The F1 metric is computed as (2*P*R)/(P+R), where P is the precision and R is the recall.
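The F1 formula above can be sketched as follows, assuming the extracted template and the manually selected template are each represented as a set of node identifiers. That representation, and the function name `precision_recall_f1`, are assumptions for this example; the slides only give the formula.

```python
def precision_recall_f1(retrieved, relevant):
    """Precision, recall and F1 for two collections of node identifiers.

    retrieved: nodes the technique marked as template.
    relevant:  nodes manually labelled as template (the gold standard)."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = (2 * precision * recall) / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical node identifiers, not data from the experiments.
    p, r, f1 = precision_recall_f1({1, 2, 3, 4}, {2, 3, 4, 5, 6})
    print(f"P={p:.2f} R={r:.2f} F1={f1:.2f}")
```

With these made-up sets, three of the four retrieved nodes are correct (P = 0.75) and three of the five gold nodes are found (R = 0.60).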

Conclusions and future work

Conclusions:

• New technique proposed for template extraction:

1. It does not make assumptions about the particular structure of webpages.

2. It only needs to process a single webpage (no templates, no other webpages of the same website are needed).

3. No preprocessing stages are needed. The technique can work online.

4. It is fully language independent (it can work with pages written in English, German, etc.).

5. The particular text formatting of the webpage does not influence the performance of the technique.

Conclusions and future work

Future Work:

1. Consider that a website can implement several templates across its webpages:

• Extend the benchmark suite by labelling all templates.

• A new technique to detect all templates of a website.

2. Combine template extraction with content extraction:

• Firstly, apply template extraction to remove the template, and

• Secondly, look for the main content on the remaining webpage.

Thank You