Automating the Extraction of Domain Specific Information from the Web

Automating the Extraction of Domain Specific Information from the Web: A Case Study for the Genealogical Domain
Troy Walker, Thesis Proposal, January 2004
Research funded by NSF

Transcript of Automating the Extraction of Domain Specific Information from the Web

Page 1: Automating the Extraction of Domain Specific Information from the Web

Automating the Extraction of Domain Specific Information from the Web
A Case Study for the Genealogical Domain

Troy Walker
Thesis Proposal
January 2004

Research funded by NSF

Page 2: Automating the Extraction of Domain Specific Information from the Web

Genealogical Information on the Web

Hundreds of thousands of sites
  Some professional (Ancestry.com, Familysearch.org)
  Mostly hobbyist (203,200 indexed by Cyndislist.com)
Search engines
  "Walker genealogy" on Google: 199,000 results
  1 page/minute = 5 months to go through

Why not enlist the help of a computer?
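The five-month figure on this slide can be checked with a few lines of arithmetic. The sketch below assumes nonstop reading (24-hour days) and 30-day months; neither assumption is stated on the slide, and the function name is illustrative.

```python
# Rough reading-time estimate for the Google result count quoted on the slide.
RESULTS = 199_000        # "Walker genealogy" results, from the slide
MINUTES_PER_PAGE = 1     # reading rate, from the slide


def months_to_read(pages: int, minutes_per_page: float = MINUTES_PER_PAGE) -> float:
    """Months of nonstop reading, assuming 24-hour days and 30-day months."""
    minutes = pages * minutes_per_page
    days = minutes / (60 * 24)
    return days / 30


if __name__ == "__main__":
    print(f"{months_to_read(RESULTS):.1f} months")
```

Under these assumptions the estimate comes to roughly 4.6 months, which the slide rounds to 5.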

Page 3: Automating the Extraction of Domain Specific Information from the Web

Problems

No standard way of presenting data
  Text formatted with HTML tags
  Tables
  Forms to access information
Sites have differing schemas

Page 4: Automating the Extraction of Domain Specific Information from the Web

Proposed Solution

Based on Ontos and other work done by the BYU Data Extraction Group (DEG)
Able to extract from:
  Single-Record or Multiple-Record Documents
  Tables
  Forms
Scalable and robust to changes in pages
Easily adaptable to other domains

Page 5: Automating the Extraction of Domain Specific Information from the Web

Text

Page 6: Automating the Extraction of Domain Specific Information from the Web

Tables

Page 7: Automating the Extraction of Domain Specific Information from the Web

Forms

Page 8: Automating the Extraction of Domain Specific Information from the Web

Forms

Page 9: Automating the Extraction of Domain Specific Information from the Web

System Overview

[Diagram components: URL List, URL Selector, User Query, Document Retriever and Structure Recognizer, Form Engine, Table Engine, Single- or Multiple-Record Engine, Data Constrainer, Result Filter, Result Presenter, Ontology]

Page 10: Automating the Extraction of Domain Specific Information from the Web

User Query

Generated from ontology
Generated once per application domain

Page 11: Automating the Extraction of Domain Specific Information from the Web

User Query

Page 12: Automating the Extraction of Domain Specific Information from the Web

URL List and URL Selector

Contains Genealogy URLs
Search each URL: too much time
Select likely URLs
Distribute document processing using DOGMA

Page 13: Automating the Extraction of Domain Specific Information from the Web

URL List and Document Retriever

URL
    Filter

http://www.ancestry.com/search/main.htm?lfl=adv
http://userdb.rootsweb.com/deaths/cgi-bin/deaths.cgi
    Death Date > 1880
http://www.camcomp.com/users/jwalker/johngene/johngenes.htm
    Name: Bates, Boyle, Damon, Eliot, … Walker, Woodsworth
http://www.rootsweb.com/~gaupson/cedarcem.htm
    Burial Location: Thomaston, GA
http://www.cs.utk.edu/~dwalker/genealogy/LISTS/Adams.html
    Name: Adams
http://www.cs.utk.edu/~dwalker/genealogy/LISTS/Walker.html
    Name: Walker
http://www.cs.utk.edu/~dwalker/genealogy/LISTS/Warley.html
    Name: Warley
http://homepages.rootsweb.com/~gemmell/walkdesc.htm
    Name: Walker
http://www.smartnouveau.com/jbplace/Kemp/f0000425.html
    Name: Anderson, Burt, Summers, Walker
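One way to picture how filters like these could drive URL selection is sketched below. It is a minimal illustration, not the proposal's implementation: `UrlEntry`, `is_likely`, and `select_urls` are hypothetical names, and reading each filter as a description of what the page covers (for example, "Death Date > 1880" meaning the site only holds deaths after 1880) is an assumption.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class UrlEntry:
    """One URL-list row: a URL plus whatever filter values are recorded for it."""
    url: str
    names: FrozenSet[str] = frozenset()   # surnames covered; empty means unknown
    death_after: Optional[int] = None     # "Death Date > 1880" is stored as 1880


def is_likely(entry: UrlEntry, surname: str, death_year: Optional[int] = None) -> bool:
    """Keep a URL unless one of its recorded filters rules the query out."""
    if entry.names and surname not in entry.names:
        return False
    if (entry.death_after is not None and death_year is not None
            and death_year <= entry.death_after):
        return False
    return True


def select_urls(entries: List[UrlEntry], surname: str,
                death_year: Optional[int] = None) -> List[str]:
    """Return the URLs worth retrieving for this query, in list order."""
    return [e.url for e in entries if is_likely(e, surname, death_year)]


# A few rows taken from the slide; rows without a filter are always kept.
URL_LIST = [
    UrlEntry("http://www.ancestry.com/search/main.htm?lfl=adv"),
    UrlEntry("http://userdb.rootsweb.com/deaths/cgi-bin/deaths.cgi", death_after=1880),
    UrlEntry("http://www.cs.utk.edu/~dwalker/genealogy/LISTS/Adams.html",
             names=frozenset({"Adams"})),
    UrlEntry("http://www.cs.utk.edu/~dwalker/genealogy/LISTS/Walker.html",
             names=frozenset({"Walker"})),
]

if __name__ == "__main__":
    for url in select_urls(URL_LIST, "Walker", death_year=1850):
        print(url)
```

A URL with no recorded filter is kept here, on the reasoning that a search form might return anything; a stricter selector could choose to skip such entries instead.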

Page 14: Automating the Extraction of Domain Specific Information from the Web

Document Structure Recognizer

Requests analysis from each Data Extraction Engine
Selects appropriate method
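The "ask every engine, then pick one" step might look something like the sketch below. The slides do not say what form each engine's analysis takes; modeling it as a confidence score between 0 and 1, and the crude HTML heuristics used to compute it, are assumptions made only for illustration.

```python
from typing import Callable, Dict

# Each engine is modeled as a function that inspects the page and returns a
# confidence in [0, 1] that it can handle it. These heuristics are placeholders.


def form_score(html: str) -> float:
    """High confidence when the page contains an HTML form."""
    return 1.0 if "<form" in html.lower() else 0.0


def table_score(html: str) -> float:
    """Confidence grows with the number of table rows, capped below the form score."""
    rows = html.lower().count("<tr")
    return min(rows / 10, 0.9)


def text_score(html: str) -> float:
    """Fallback: the single- or multiple-record text engine can always try."""
    return 0.3


ENGINES: Dict[str, Callable[[str], float]] = {
    "form": form_score,
    "table": table_score,
    "text": text_score,
}


def recognize(html: str, engines: Dict[str, Callable[[str], float]] = ENGINES) -> str:
    """Request an analysis from every engine and select the most confident one."""
    scores = {name: score(html) for name, score in engines.items()}
    return max(scores, key=scores.get)


if __name__ == "__main__":
    page = "<table>" + "<tr><td>John Walker</td></tr>" * 5 + "</table>"
    print(recognize(page))
```

Keeping the engines behind a common scoring interface means a new engine can be added to the dictionary without changing the recognizer itself.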

Page 15: Automating the Extraction of Domain Specific Information from the Web

Data Extraction Engines

Text
  Improved record-separation
  Ability to handle single-record pages
Table
Forms

Page 16: Automating the Extraction of Domain Specific Information from the Web

Data Constrainer

Selects attribute/value pairs
Fits data to ontology
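A rough sketch of fitting attribute/value pairs to an ontology follows. The slides give no detail on how the constraints are expressed, so the ontology here is reduced to a hypothetical table of maximum value counts per attribute, and `constrain` is an illustrative name.

```python
from typing import Dict, List, Tuple

# Hypothetical ontology fragment: attribute name -> maximum values per record.
ONTOLOGY_MAX: Dict[str, int] = {
    "Name": 1,
    "BirthDate": 1,
    "DeathDate": 1,
    "Child": 20,
}


def constrain(pairs: List[Tuple[str, str]],
              max_values: Dict[str, int] = ONTOLOGY_MAX) -> Dict[str, List[str]]:
    """Keep only pairs the ontology allows, in the order they were extracted."""
    record: Dict[str, List[str]] = {}
    for attribute, value in pairs:
        limit = max_values.get(attribute)
        if limit is None:
            continue  # attribute is not part of the ontology
        kept = record.setdefault(attribute, [])
        if len(kept) < limit and value not in kept:
            kept.append(value)  # within the cardinality limit and not a duplicate
    return record


if __name__ == "__main__":
    candidates = [
        ("Name", "John Walker"),
        ("Name", "J. Walker"),   # second name exceeds the limit of 1
        ("Hobby", "chess"),      # not in the ontology
        ("Child", "Ann"),
        ("Child", "Tom"),
    ]
    print(constrain(candidates))
```

Taking the first value that fits is the simplest possible policy; a real constrainer would presumably weigh competing candidates rather than rely on extraction order.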

Page 17: Automating the Extraction of Domain Specific Information from the Web

Result Filter

Fits data to query
Returns to central Result Presenter
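Fitting extracted data to the query can be pictured as a predicate check over each record. In the sketch below, representing a query as one test function per attribute is an assumption, as are the names `matches` and `filter_results` and the sample records.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]
Query = Dict[str, Callable[[str], bool]]


def matches(record: Record, query: Query) -> bool:
    """True when every queried attribute is present and passes its test."""
    return all(attr in record and test(record[attr]) for attr, test in query.items())


def filter_results(records: List[Record], query: Query) -> List[Record]:
    """Keep only the records that satisfy the whole query."""
    return [r for r in records if matches(r, query)]


if __name__ == "__main__":
    # Invented sample records, for illustration only.
    records = [
        {"Name": "John Walker", "DeathYear": "1902"},
        {"Name": "Mary Adams", "DeathYear": "1875"},
        {"Name": "Ann Walker"},  # no death year extracted
    ]
    query: Query = {
        "Name": lambda v: "Walker" in v,
        "DeathYear": lambda v: int(v) > 1880,
    }
    print(filter_results(records, query))
```

Records missing a queried attribute are dropped here; returning them as partial matches would be an equally defensible choice.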

Page 18: Automating the Extraction of Domain Specific Information from the Web

Result Presenter

Creates XML Schema from Ontology
Presents results to user
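To make the schema-generation step concrete, the sketch below builds a small XML Schema with Python's standard `xml.etree.ElementTree`. The ontology is simplified to a hypothetical mapping from attribute name to maximum occurrence count, every value is typed as a string, and `ontology_to_xsd` is an illustrative name; the actual mapping used in the proposal is not described on the slides.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Union

XS = "http://www.w3.org/2001/XMLSchema"


def ontology_to_xsd(record_name: str, attributes: Dict[str, Union[int, str]]) -> str:
    """Emit an XML Schema with one record element and one child per attribute."""
    ET.register_namespace("xs", XS)
    schema = ET.Element(f"{{{XS}}}schema")
    record = ET.SubElement(schema, f"{{{XS}}}element", name=record_name)
    complex_type = ET.SubElement(record, f"{{{XS}}}complexType")
    sequence = ET.SubElement(complex_type, f"{{{XS}}}sequence")
    for attr_name, max_occurs in attributes.items():
        ET.SubElement(
            sequence,
            f"{{{XS}}}element",
            name=attr_name,
            type="xs:string",           # simplification: all values are strings
            minOccurs="0",              # extraction may not find every attribute
            maxOccurs=str(max_occurs),  # an integer or "unbounded"
        )
    return ET.tostring(schema, encoding="unicode")


if __name__ == "__main__":
    print(ontology_to_xsd("Person", {"Name": 1, "BirthDate": 1, "Child": "unbounded"}))
```

Because the schema is derived from the ontology alone, swapping in a different ontology (such as the car ontology mentioned under Evaluation) would produce a different schema without code changes.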

Page 19: Automating the Extraction of Domain Specific Information from the Web

Result Presenter

Page 20: Automating the Extraction of Domain Specific Information from the Web

Evaluation

Scalability
  Query on large URL list
  Experiment on number of PCs
Precision and recall
  Recall difficult to determine
  Query on small URL list
Adaptability
  Car ontology
  Small URL list
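For the precision and recall measurement, the standard definitions can be computed as in the sketch below. Recall needs the full set of relevant items, which on a small URL list can be compiled by hand; that is a likely reason for the small-list query, though the slide does not say so. The function name and the sample data are illustrative.

```python
from typing import Set, Tuple


def precision_recall(extracted: Set[str], relevant: Set[str]) -> Tuple[float, float]:
    """Precision = correct / extracted; recall = correct / relevant."""
    correct = len(extracted & relevant)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(relevant) if relevant else 0.0
    return precision, recall


if __name__ == "__main__":
    # Invented identifiers standing in for extracted and hand-labeled records.
    extracted = {"rec1", "rec2", "rec3", "rec4"}
    relevant = {"rec1", "rec2", "rec5"}
    p, r = precision_recall(extracted, relevant)
    print(f"precision={p:.2f} recall={r:.2f}")
```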

Page 21: Automating the Extraction of Domain Specific Information from the Web

Conclusion

Integrates, builds on previous DEG work
Extracts from:
  Single- or Multiple-Record Documents
  Tables
  Forms
Scalable
  Only searches probable pages
  Distributed with DOGMA
Robust to changes in pages
Ontology based: easily adapted to other domains