Extracting Structured Data from Web Pages

By Arsun ARTEL, Özgün ÖZIŞIKYILMAZ 05.11.2003

Instructor: Prof. Taflan Gündem

Presentation Outline

• Motivation

• Example Pages

• Model & Problem Formulation

• Approach in Detail

– General

– Underlying Terminology

– Modules and their operations

• Experimental Results

• Conclusion

What is next?

• Conclusion

Motivation

• Extracting structured data from web pages is useful, since it enables us to pose complex queries over the data.

• This paper focuses on the problem of automatically extracting structured data from a collection of pages.

• There are many web sites that contain a large collection of “structured” pages.

Example Pages

• In the real world there are many examples of structured web pages.

– Amazon web site, eBay web site, etc.

• Two examples from www.amazon.com

– My System

– An Eternal Golden Braid

Example Pages (My System: 21st Century Edition)

Example Pages (An Eternal Golden Braid)

Underlying Problems

• Complex Schema: The “schema” of the information encoded in the web pages could be very complex, with arbitrary levels of nesting. For instance, each book page can contain a set of authors, with each author having a set of addresses, and so on.

• Template vs. Data: Syntactically, there is nothing that distinguishes the text that is part of the template and the text that is part of the data.

How is a page created with a template?

x extracted from the database

Basic Type, Tuples and Sets

• Basic Type: Basic unit of text

• Tuple: Ordered list of types, <T1,T2,…,Tn>

• Set: {T1}

< C Programming Language, {< Brian, Kernighan >, < Dennis, Ritchie >}, $30.00 >

Schema and Instance

< C Programming Language, {< Brian, Kernighan >, < Dennis, Ritchie >}, $30.00 >
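The type constructors and the book instance above can be sketched in Python. This is only an illustration under stated assumptions: the names `BASIC`, `TupleType`, `SetType`, and `conforms` are hypothetical (they do not come from the presentation), and set values are modeled as Python lists for simplicity.

```python
from dataclasses import dataclass
from typing import Tuple, Union

# Hypothetical encoding of the three type constructors (names are illustrative).
BASIC = "basic"  # the basic type: a basic unit of text


@dataclass(frozen=True)
class TupleType:
    fields: Tuple["Type", ...]  # ordered list of types <T1, ..., Tn>


@dataclass(frozen=True)
class SetType:
    element: "Type"  # set type {T1}


Type = Union[str, TupleType, SetType]


def conforms(value, schema) -> bool:
    """Return True if `value` is an instance of `schema`.

    Basic values are Python strings, tuple values are Python tuples,
    and set values are Python lists (their order carries no meaning).
    """
    if schema == BASIC:
        return isinstance(value, str)
    if isinstance(schema, TupleType):
        return (
            isinstance(value, tuple)
            and len(value) == len(schema.fields)
            and all(conforms(v, t) for v, t in zip(value, schema.fields))
        )
    if isinstance(schema, SetType):
        return isinstance(value, list) and all(
            conforms(v, schema.element) for v in value
        )
    return False


# Schema of the book example: <title, {<first name, last name>}, price>.
author = TupleType((BASIC, BASIC))
book_schema = TupleType((BASIC, SetType(author), BASIC))
book = (
    "C Programming Language",
    [("Brian", "Kernighan"), ("Dennis", "Ritchie")],
    "$30.00",
)
print(conforms(book, book_schema))
```

The check is purely structural: it confirms that the nesting of tuples and sets in the value matches the schema, which is the relationship between schema and instance that the slide illustrates.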

Template Definition

• Own example:

• Schema: S = <β, {β}, β>, where β denotes the basic type

• Template: TS = <A * B {*}E C * D>

• A = ‘Title:’, B = ‘Presented by:’, C = ‘Cost:’, D = ‘ ’, E = ‘and’

• Instance of TS: Title: Extracting Structured Data Presented by: Arsun and Özgün Cost: 1hr
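How the template TS turns a value into page text can be sketched as follows. This is a minimal illustration, not the encoding procedure from the paper: the function name `encode` is hypothetical, and joining the pieces with single spaces is an assumption made so that the output matches the instance shown above.

```python
# Template strings from the example above.
A, B, C, D, E = "Title:", "Presented by:", "Cost:", " ", "and"


def encode(title, presenters, cost):
    """Encode the value <title, {presenters}, cost> with TS = <A * B {*}E C * D>.

    Assumption: template strings and data values are separated by one space.
    """
    # Elements of the set are separated by the template string E.
    presenter_text = f" {E} ".join(presenters)
    parts = [A, title, B, presenter_text, C, cost]
    return " ".join(parts) + D


page = encode("Extracting Structured Data", ["Arsun", "Özgün"], "1hr")
print(page)
```

Each `*` in the template corresponds to one argument of `encode`, and `{*}E` corresponds to the list of presenters joined by `E`.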

Template Encoding (T1,x1)

General Description of EXALG

Multiple Pages

Set of Reviewers

Correct Solution for those pages

Some Terminology (1)

• The occurrence-vector of a token t is defined as the vector <f1,f2,…,fn>, where fi is the number of occurrences of t in the ith page.

• An equivalence class is a maximal set of tokens having the same occurrence-vector.

• A token is said to have a unique role if all occurrences of the token in the pages are generated by a single template-token.
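The definitions of occurrence-vector and equivalence class above can be sketched in a few lines of Python. The sketch assumes that tokens are obtained by splitting on whitespace, which is a simplification, and the function names and sample pages are hypothetical.

```python
from collections import Counter, defaultdict


def occurrence_vectors(pages):
    """Map each token to its occurrence-vector <f1, ..., fn> over the pages."""
    counts = [Counter(page.split()) for page in pages]
    tokens = set().union(*counts)
    # A Counter returns 0 for tokens that are absent from a page.
    return {t: tuple(c[t] for c in counts) for t in tokens}


def equivalence_classes(pages):
    """Group together all tokens that share the same occurrence-vector."""
    classes = defaultdict(set)
    for token, vector in occurrence_vectors(pages).items():
        classes[vector].add(token)
    return dict(classes)


# Two tiny hypothetical pages generated from the same template.
pages = [
    "Title: A Presented by: X Cost: 1",
    "Title: B Presented by: Y and Z Cost: 2",
]
classes = equivalence_classes(pages)
for vector, tokens in sorted(classes.items()):
    print(vector, sorted(tokens))
```

In this toy input the template words share the vector (1, 1), while data values fall into classes that occur in only one page.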

Some Terminology (2)

<1,1,1,1>

<1,2,1,0>

No unique role

• For real pages, an equivalence class of large size and support is usually valid. The size of an equivalence class is the number of tokens in it, and the support of a token is the number of pages in which the token occurs.

• Example of an invalid equivalence class:

– {Data, Mining, Jeff, 2, Jane, 6} has occurrence-vector <0, 1, 0, 0>

• The equivalence classes with large size and support are called LFEQs (for Large and Frequent EQuivalence class). LFEQs are rarely formed by “chance”.

• Threshold for size and support is set by the user (SizeThres, SupThres).
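Selecting LFEQs with the two thresholds can be sketched as below. The function name `find_lfeqs`, the sample token sets, and the use of `>=` for both comparisons are assumptions made for illustration; the occurrence-vector <0, 1, 0, 0> is the invalid class from the example above.

```python
def find_lfeqs(classes, size_thres, sup_thres):
    """Keep equivalence classes whose size and support reach the thresholds.

    `classes` maps an occurrence-vector (tuple of ints) to its set of tokens.
    """
    lfeqs = {}
    for vector, tokens in classes.items():
        size = len(tokens)  # number of tokens in the class
        support = sum(1 for f in vector if f > 0)  # pages containing the tokens
        if size >= size_thres and support >= sup_thres:
            lfeqs[vector] = tokens
    return lfeqs


# Hypothetical equivalence classes over four pages.
classes = {
    (1, 1, 1, 1): {"<html>", "<body>", "Title:", "Cost:"},
    (0, 1, 0, 0): {"Data", "Mining", "Jeff", "2", "Jane", "6"},
    (1, 2, 1, 0): {"Reviewer"},
}
lfeqs = find_lfeqs(classes, size_thres=3, sup_thres=3)
print(lfeqs)
```

The class with vector <0, 1, 0, 0> is large but occurs in a single page, so it is dropped because its support is too low.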

Some Terminology (5)

• Valid equivalence class properties: Ordering and Nesting

• Back to own example:

• Template: TS = <A * B {*}E C * D>

• A = ‘Title:’, B = ‘Presented by:’, C = ‘Cost:’, D = ‘ ’, E = ‘and’

• Ordered: A > B > C > D

• Nesting: B > E > C

Important Observations

• In practice, two page-tokens with different occurrence-paths have different roles. The occurrence-path of a page-token is its path from the root in the parse tree produced by an HTML parser.

• Two page-tokens having same occurrence paths, but with different neighbours also have different roles
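The first observation can be illustrated with Python's standard `html.parser` module. This sketch assumes that the occurrence-path of a text token is the chain of open tags around it. The class `PathCollector` and the sample HTML are hypothetical, and tag tokens themselves are not recorded.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class PathCollector(HTMLParser):
    """Record each text word together with the stack of open tags around it."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, root first
        self.tokens = []  # (word, occurrence-path) pairs

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore unmatched closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        path = "/".join(self.stack)
        for word in data.split():
            self.tokens.append((word, path))


def occurrence_paths(html):
    collector = PathCollector()
    collector.feed(html)
    collector.close()
    return collector.tokens


html = "<html><body><h1>Name</h1><p><b>Name</b>: Jane</p></body></html>"
tokens = occurrence_paths(html)
for word, path in tokens:
    print(word, path)
```

The word "Name" occurs twice with different paths, so under the first observation the two occurrences would be treated as having different roles.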

Explanation of observations

Modules and their operations

[Figure: EXALG modules]

• Input: input pages

• ECGM (Equivalence Class Generation Module)

– DiffForm: Differentiate Roles Using Format

– FindEq: Find Equivalence Classes

– HandInv: Handle Invalid Equivalence Classes

– DiffEq: Differentiate Roles Using Eq Class

• Analysis Module

– ConstTemp: Construct Template

– ExVal: Extract Value

• Output: Template, Schema, Values

Constructing Template (1)

• The extraction algorithm determines the positions between consecutive tokens of an equivalence class that are non-empty.

• A position between two consecutive tokens is empty if the two tokens always occur contiguously, and non-empty, otherwise.

Constructing Template (2)

• The tokens connected by empty positions belong to the template.

• In the non-empty positions, there are either basic types (strings extracted from the database) or a more complex type.

• This unknown type can be determined by inspecting the input pages.
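The test for empty and non-empty positions described above can be sketched as follows. It is a simplified illustration: the function name `classify_positions` and the sample pages are hypothetical, and every token of the equivalence class is assumed to occur exactly once in each page.

```python
def classify_positions(pages, eq_tokens):
    """Label the position between consecutive tokens of an equivalence class.

    `pages` is a list of token lists; `eq_tokens` lists the class tokens in
    page order. Assumes each class token occurs exactly once per page.
    """
    labels = []
    for left, right in zip(eq_tokens, eq_tokens[1:]):
        # The position is empty only if the tokens are adjacent in every page.
        always_adjacent = all(
            page.index(right) - page.index(left) == 1 for page in pages
        )
        labels.append((left, right, "empty" if always_adjacent else "non-empty"))
    return labels


# Two hypothetical pages, already split into tokens.
pages = [
    "<html> <body> Title: Web Mining Cost: 30 </body> </html>".split(),
    "<html> <body> Title: Databases Cost: 45 </body> </html>".split(),
]
eq_tokens = ["<html>", "<body>", "Title:", "Cost:", "</body>", "</html>"]
labels = classify_positions(pages, eq_tokens)
for left, right, kind in labels:
    print(left, right, kind)
```

Tokens joined by empty positions form contiguous pieces of the template, while the non-empty positions (after `Title:` and after `Cost:` here) are where data values appear.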

Constructing Template (3)

Experimental Results (1)

• EXALG is compared with RoadRunner; note, however, that RoadRunner makes simplifying assumptions.

• The first 6 web pages are obtained from the RoadRunner site.

• The last three web pages have more complex structure.

Experimental Results (2)

Concluding Remarks

• EXALG first discovers the unknown template that generated the pages and uses the discovered template to extract the data from the input pages.

• Besides achieving very good results, EXALG does not fail completely: it still extracts some data even when some of its assumptions are not met by the input collection.

• No human intervention – automatically getting template and data

Future Work

• Automatically locate collections of pages that are structured

• Check whether it is feasible to generate a large database from these pages

Questions & Answers
