Automating the Extraction of Domain Specific Information from the Web

Post on 05-Jan-2016

Automating the Extraction of Domain Specific Information from the Web: A Case Study for the Genealogical Domain. Troy Walker, Thesis Proposal, January 2004. Research funded by NSF.

Transcript of Automating the Extraction of Domain Specific Information from the Web

1

Automating the Extraction of Domain Specific Information from the Web
A Case Study for the Genealogical Domain

Troy Walker
Thesis Proposal
January 2004

Research funded by NSF

2

Genealogical Information on the Web

Hundreds of thousands of sites
  Some professional (Ancestry.com, Familysearch.org)
  Mostly hobbyist (203,200 indexed by Cyndislist.com)
Search engines
  "Walker genealogy" on Google: 199,000 results
  1 page/minute = 5 months to go through
Why not enlist the help of a computer?
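The five-month figure can be checked with a little arithmetic. The sketch below assumes nonstop reading and a 30-day month; both are assumptions, since the slide does not say how the estimate was made.

```python
# Rough check of the reading-time estimate from the slide.
RESULTS = 199_000          # Google hits for "Walker genealogy" (from the slide)
PAGES_PER_MINUTE = 1       # reading rate (from the slide)
DAYS_PER_MONTH = 30        # assumption: a 30-day month

minutes = RESULTS / PAGES_PER_MINUTE
days = minutes / (60 * 24)             # assumption: reading around the clock
months = days / DAYS_PER_MONTH

print(f"{days:.0f} days, about {months:.1f} months")
```

Under these assumptions the total is roughly 138 days, which rounds up to the five months quoted.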

3

Problems

No standard way of presenting data
  Text formatted with HTML tags
  Tables
  Forms to access information
Sites have differing schemas

4

Proposed Solution

Based on Ontos and other work done by the BYU Data Extraction Group (DEG)
Able to extract from:
  Single-Record or Multiple-Record Documents
  Tables
  Forms
Scalable and robust to changes in pages
Easily adaptable to other domains

5

Text

6

Tables

7

Forms

8

Forms

9

System Overview

[Diagram components: URL List, URL Selector, User Query, Document Retriever and Structure Recognizer, Form Engine, Table Engine, Single- or Multiple-Record Engine, Data Constrainer, Result Filter, Result Presenter, Ontology]

10

User Query

Generated from ontology
Generated once per application domain
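A minimal sketch of generating a query form from an ontology might look like the following. The `ObjectSet` class, the four genealogy concepts, and the HTML layout are all hypothetical; the slides do not show how the form is actually built.

```python
from dataclasses import dataclass


@dataclass
class ObjectSet:
    """One concept in a hypothetical, much-simplified ontology."""
    name: str
    value_type: str  # "text" or "date"; illustrative only


# Hypothetical genealogy ontology; a real extraction ontology is richer.
GENEALOGY = [
    ObjectSet("Name", "text"),
    ObjectSet("Birth Date", "date"),
    ObjectSet("Death Date", "date"),
    ObjectSet("Burial Location", "text"),
]


def build_query_form(ontology):
    """Return one HTML form with an input field per object set."""
    rows = []
    for obj in ontology:
        field = obj.name.lower().replace(" ", "_")
        rows.append(
            f'<label>{obj.name} <input name="{field}" type="{obj.value_type}"></label>'
        )
    return "<form>\n" + "\n".join(rows) + "\n</form>"


form_html = build_query_form(GENEALOGY)
print(form_html)
```

Because the form depends only on the ontology, it would need to be produced once per application domain, as the slide states.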

11

User Query

12

URL List and URL Selector

Contains Genealogy URLs
Search each URL: too much time
Select likely URLs
Distribute document processing using DOGMA
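The idea of distributing document processing can be sketched as below. DOGMA's interface is not shown in the slides, so Python's standard thread pool stands in for it here, and `process_document` is a hypothetical placeholder that does no network access.

```python
from concurrent.futures import ThreadPoolExecutor


def process_document(url):
    """Placeholder for retrieve-and-extract work on one URL (hypothetical)."""
    return url, len(url)  # stand-in result so the sketch runs offline


def process_all(urls, workers=4):
    """Fan the selected URLs out to a pool of workers and collect results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(process_document, urls))


selected = [
    "http://www.cs.utk.edu/~dwalker/genealogy/LISTS/Walker.html",
    "http://homepages.rootsweb.com/~gemmell/walkdesc.htm",
]
results = process_all(selected)
```

Each URL is an independent unit of work, which is what makes this step easy to spread across machines.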

13

URL List and Document Retriever

URL Filter

http://www.ancestry.com/search/main.htm?lfl=adv

http://userdb.rootsweb.com/deaths/cgi-bin/deaths.cgi

Death Date > 1880

http://www.camcomp.com/users/jwalker/johngene/johngenes.htm

Name: Bates, Boyle, Damon, Eliot, … Walker, Woodsworth

http://www.rootsweb.com/~gaupson/cedarcem.htm

Burial Location: Thomaston, GA

http://www.cs.utk.edu/~dwalker/genealogy/LISTS/Adams.html

Name: Adams

http://www.cs.utk.edu/~dwalker/genealogy/LISTS/Walker.html

Name: Walker

http://www.cs.utk.edu/~dwalker/genealogy/LISTS/Warley.html

Name: Warley

http://homepages.rootsweb.com/~gemmell/walkdesc.htm

Name: Walker

http://www.smartnouveau.com/jbplace/Kemp/f0000425.html

Name: Anderson, Burt, Summers, Walker
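One way to read the list above is that each URL carries the attribute values known to appear on that page, so a query can rule pages out before they are retrieved. The sketch below works under that assumption; the dictionary layout and the `select_urls` function are illustrative, not the proposal's design.

```python
# URL list entries copied from the slide; the dictionary layout is an assumption.
URL_LIST = {
    "http://www.cs.utk.edu/~dwalker/genealogy/LISTS/Adams.html": {"Name": {"Adams"}},
    "http://www.cs.utk.edu/~dwalker/genealogy/LISTS/Walker.html": {"Name": {"Walker"}},
    "http://www.cs.utk.edu/~dwalker/genealogy/LISTS/Warley.html": {"Name": {"Warley"}},
    "http://www.smartnouveau.com/jbplace/Kemp/f0000425.html": {
        "Name": {"Anderson", "Burt", "Summers", "Walker"}
    },
    "http://www.rootsweb.com/~gaupson/cedarcem.htm": {
        "Burial Location": {"Thomaston, GA"}
    },
}


def select_urls(url_list, query):
    """Keep a URL unless one of its recorded attributes rules the query out."""
    selected = []
    for url, known in url_list.items():
        compatible = all(
            query[attr] in values
            for attr, values in known.items()
            if attr in query
        )
        if compatible:
            selected.append(url)
    return selected


likely = select_urls(URL_LIST, {"Name": "Walker"})
```

For the query `Name = Walker`, the Adams and Warley pages are skipped. The cemetery page is kept because nothing recorded about it contradicts the query.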

14

Document Structure Recognizer

Requests analysis from each Data Extraction Engine
Selects appropriate method
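The two steps above can be sketched as a scoring contest between engines. The class names mirror the diagram, but the `analyze` method and its scoring heuristics are assumptions made for illustration.

```python
import re


class TableEngine:
    name = "table"

    def analyze(self, html):
        """Score by number of table rows; a crude, assumed heuristic."""
        return len(re.findall(r"<tr\b", html, re.I))


class FormEngine:
    name = "form"

    def analyze(self, html):
        """Score forms heavily, since a form usually dominates its page."""
        return 10 * len(re.findall(r"<form\b", html, re.I))


class TextEngine:
    name = "text"

    def analyze(self, html):
        return 1  # fallback: plain text extraction can always be attempted


def recognize(html, engines):
    """Ask every engine for a score and return the highest-scoring one."""
    return max(engines, key=lambda engine: engine.analyze(html))


ENGINES = [TextEngine(), TableEngine(), FormEngine()]
page = "<table><tr><td>John Walker</td></tr><tr><td>1880</td></tr></table>"
chosen = recognize(page, ENGINES)
```

Keeping the scoring inside each engine means a new engine can be added without changing the recognizer.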

15

Data Extraction Engines

Text
  Improved record-separation
  Ability to handle single-record pages
Table
Forms
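To make record separation concrete, here is a deliberately crude stand-in: split the page on whichever candidate tag occurs most often, and treat a page with no candidate tags as a single record. This is not the DEG's record-boundary method; the candidate tag list and the function are assumptions.

```python
import re
from collections import Counter

CANDIDATE_TAGS = ("hr", "p", "br", "li")  # assumed candidates, not the DEG's list


def split_records(html):
    """Split on whichever candidate tag occurs most often in the page."""
    tags = re.findall(r"<\s*(\w+)[^>]*>", html)  # opening tags only
    counts = Counter(t.lower() for t in tags if t.lower() in CANDIDATE_TAGS)
    if not counts:
        return [html.strip()]  # treat the page as a single record
    separator = counts.most_common(1)[0][0]
    parts = re.split(rf"<\s*{separator}\b[^>]*>", html, flags=re.I)
    return [part.strip() for part in parts if part.strip()]


page = "John Walker, b. 1850<hr>Mary Walker, b. 1852<hr>Ann Walker, b. 1855"
records = split_records(page)
```

The single-record fallback is the part that corresponds to the second bullet under Text.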

16

Data Constrainer

Selects attribute/value pairs
Fits data to ontology
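Fitting data to the ontology can be pictured as enforcing per-attribute limits on candidate attribute/value pairs. In this sketch the `MAX_VALUES` table and the rule that earlier candidates win are both assumptions.

```python
# Hypothetical constraints: attribute -> maximum values allowed per record.
MAX_VALUES = {"Name": 1, "Birth Date": 1, "Death Date": 1, "Burial Location": 1}


def constrain(pairs, max_values=MAX_VALUES):
    """Keep known attributes only, up to each attribute's allowed count."""
    record = {}
    for attribute, value in pairs:
        limit = max_values.get(attribute)
        if limit is None:
            continue  # attribute is not part of the ontology
        kept = record.setdefault(attribute, [])
        if len(kept) < limit:
            kept.append(value)  # earlier candidates win (assumption)
    return record


candidates = [
    ("Name", "John Walker"),
    ("Death Date", "1885"),
    ("Death Date", "1958"),    # a second, conflicting candidate
    ("Page Counter", "0042"),  # not part of the ontology
]
record = constrain(candidates)
```

The conflicting death date and the unknown attribute are both dropped, leaving a record that satisfies the assumed constraints.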

17

Result Filter

Fits data to query
Returns to central Result Presenter
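Fitting data to the query amounts to keeping only the extracted records that satisfy every condition the user entered. The sketch below assumes conditions are `(attribute, operator, value)` triples and that a record missing an attribute fails the test; neither detail comes from the slides.

```python
import operator

OPS = {">": operator.gt, "<": operator.lt, "=": operator.eq}


def matches(record, query):
    """True when the record satisfies every (attribute, op, value) condition."""
    for attribute, op, value in query:
        if attribute not in record:
            return False  # assumption: missing data fails the condition
        if not OPS[op](record[attribute], value):
            return False
    return True


# Made-up records for illustration.
records = [
    {"Name": "Walker", "Death Year": 1885},
    {"Name": "Walker", "Death Year": 1870},
    {"Name": "Adams", "Death Year": 1901},
]
query = [("Name", "=", "Walker"), ("Death Year", ">", 1880)]
hits = [r for r in records if matches(r, query)]
```

Only the first record survives, and that is what would be passed on to the Result Presenter.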

18

Result Presenter

Creates XML Schema from Ontology
Presents results to user
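Deriving an XML Schema from an ontology could be sketched as follows, using Python's standard `xml.etree.ElementTree`. The `ONTOLOGY` dictionary and the choice to map every attribute to an optional `xs:string` element are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"
ET.register_namespace("xs", XS)

# Hypothetical ontology: record name plus attribute -> XML Schema type.
ONTOLOGY = {
    "Person": {
        "Name": "xs:string",
        "BirthDate": "xs:string",
        "DeathDate": "xs:string",
    }
}


def ontology_to_xsd(ontology):
    """Emit one element with a sequence of optional children per record type."""
    schema = ET.Element(f"{{{XS}}}schema")
    for record_name, attributes in ontology.items():
        element = ET.SubElement(schema, f"{{{XS}}}element", name=record_name)
        complex_type = ET.SubElement(element, f"{{{XS}}}complexType")
        sequence = ET.SubElement(complex_type, f"{{{XS}}}sequence")
        for attr_name, xsd_type in attributes.items():
            ET.SubElement(
                sequence,
                f"{{{XS}}}element",
                name=attr_name,
                type=xsd_type,
                minOccurs="0",
            )
    return ET.tostring(schema, encoding="unicode")


xsd = ontology_to_xsd(ONTOLOGY)
print(xsd)
```

Marking every child as optional reflects that a web page rarely supplies a value for every attribute.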

19

Result Presenter

20

Evaluation

Scalability
  Query on large URL list
  Experiment on number of PCs
Precision and recall
  Recall difficult to determine
  Query on small URL list
Adaptability
  Car ontology
  Small URL list
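Precision and recall follow their standard definitions: precision is the share of extracted facts that are correct, and recall is the share of correct facts that were extracted. Recall needs the full set of correct facts, which is presumably why it is measured on a small URL list. The data below is made up for illustration and is not a result from the proposal.

```python
def precision_recall(extracted, relevant):
    """Compute precision and recall for extracted vs. correct facts."""
    extracted, relevant = set(extracted), set(relevant)
    correct = extracted & relevant
    precision = len(correct) / len(extracted) if extracted else 0.0
    recall = len(correct) / len(relevant) if relevant else 0.0
    return precision, recall


# Made-up example data, not results from the proposal.
extracted = {("Name", "Walker"), ("Death Date", "1885"), ("Name", "Warley")}
relevant = {
    ("Name", "Walker"),
    ("Death Date", "1885"),
    ("Birth Date", "1850"),
    ("Burial Location", "Thomaston, GA"),
}
p, r = precision_recall(extracted, relevant)
print(f"precision={p:.2f} recall={r:.2f}")
```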

21

Conclusion

Integrates, builds on previous DEG work
Extracts from:
  Single- or Multiple-Record Documents
  Tables
  Forms
Scalable
  Only searches probable pages
  Distributed with DOGMA
Robust to changes in pages
Ontology based: easily adapted to other domains