Automating the Extraction of Domain Specific Information from the Web
1
Automating the Extraction of Domain Specific Information from the Web
A Case Study for the Genealogical Domain
Troy Walker
Thesis Proposal
January 2004
Research funded by NSF
2
Genealogical Information on the Web
Hundreds of thousands of sites
  Some professional (Ancestry.com, Familysearch.org)
  Mostly hobbyist (203,200 indexed by Cyndislist.com)
Search engines
  "Walker genealogy" on Google: 199,000 results
  1 page/minute = 5 months to go through
Why not enlist the help of a computer?
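The five-month figure can be checked with a quick calculation. This sketch assumes reading around the clock and a 30-day month; the slide does not say how the estimate was made, so both are assumptions.

```python
RESULTS = 199_000        # Google results for "Walker genealogy" (from the slide)
MINUTES_PER_PAGE = 1     # reading rate stated on the slide

total_minutes = RESULTS * MINUTES_PER_PAGE
days = total_minutes / (60 * 24)   # assumption: reading 24 hours a day
months = days / 30                 # assumption: 30-day months

print(f"{days:.0f} days, about {months:.1f} months")
```

Under these assumptions the result is a little over four and a half months, which the slide rounds to five.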
3
Problems
No standard way of presenting data
  Text formatted with HTML tags
  Tables
  Forms to access information
Sites have differing schemas
4
Proposed Solution
Based on Ontos and other work done by the BYU Data Extraction Group (DEG)
Able to extract from:
  Single-Record or Multiple-Record Documents
  Tables
  Forms
Scalable and robust to changes in pages
Easily adaptable to other domains
5
Text
6
Tables
7
Forms
8
Forms
9
System Overview
[Diagram components: URL List, URL Selector, User Query, Document Retriever and Structure Recognizer, Form Engine, Table Engine, Single- or Multiple-Record Engine, Data Constrainer, Ontology, Result Filter, Result Presenter]
10
User Query
Generated from ontology
Generated once per application domain
11
User Query
12
URL List and URL Selector
Contains genealogy URLs
Search each URL: too much time
Select likely URLs
Distribute document processing using DOGMA
13
URL List and Document Retriever
URL Filter
http://www.ancestry.com/search/main.htm?lfl=adv
http://userdb.rootsweb.com/deaths/cgi-bin/deaths.cgi
Death Date > 1880
http://www.camcomp.com/users/jwalker/johngene/johngenes.htm
Name: Bates, Boyle, Damon, Eliot, … Walker, Woodsworth
http://www.rootsweb.com/~gaupson/cedarcem.htm
Burial Location: Thomaston, GA
http://www.cs.utk.edu/~dwalker/genealogy/LISTS/Adams.html
Name: Adams
http://www.cs.utk.edu/~dwalker/genealogy/LISTS/Walker.html
Name: Walker
http://www.cs.utk.edu/~dwalker/genealogy/LISTS/Warley.html
Name: Warley
http://homepages.rootsweb.com/~gemmell/walkdesc.htm
Name: Walker
http://www.smartnouveau.com/jbplace/Kemp/f0000425.html
Name: Anderson, Burt, Summers, Walker
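A minimal sketch of how a URL selector might use a list like the one above. The `UrlEntry` structure and the `select_urls` function are hypothetical, since the slides do not describe the actual implementation. The sketch also assumes that an entry with no recorded name constraint, such as a search form, is always worth visiting.

```python
from dataclasses import dataclass, field

@dataclass
class UrlEntry:
    url: str
    # Surnames recorded for the page; an empty set means no name constraint is known.
    names: set = field(default_factory=set)

def select_urls(entries, surname):
    """Return URLs whose recorded names include the surname, plus unconstrained URLs."""
    wanted = surname.lower()
    selected = []
    for entry in entries:
        known = {name.lower() for name in entry.names}
        if not known or wanted in known:
            selected.append(entry.url)
    return selected

# A few entries taken from the slide.
URL_LIST = [
    UrlEntry("http://www.ancestry.com/search/main.htm?lfl=adv"),
    UrlEntry("http://www.cs.utk.edu/~dwalker/genealogy/LISTS/Adams.html", {"Adams"}),
    UrlEntry("http://www.cs.utk.edu/~dwalker/genealogy/LISTS/Walker.html", {"Walker"}),
    UrlEntry("http://www.smartnouveau.com/jbplace/Kemp/f0000425.html",
             {"Anderson", "Burt", "Summers", "Walker"}),
]

likely = select_urls(URL_LIST, "Walker")
print(likely)
```

For a "Walker" query this skips the Adams page and keeps the other three, which is the point of selecting likely URLs instead of searching every one.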
14
Document Structure Recognizer
Requests analysis from each Data Extraction Engine
Selects appropriate method
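One way to picture this step is as each engine scoring the page and the recognizer picking the highest score. The scoring heuristics and function names below are invented for illustration; the slides do not say how each engine analyzes a document.

```python
def form_score(html):
    # Hypothetical heuristic: a page containing a <form> tag suits the form engine.
    return 1.0 if "<form" in html.lower() else 0.0

def table_score(html):
    # Hypothetical heuristic: more table rows suggest tabular data, capped at 0.9.
    rows = html.lower().count("<tr")
    return min(rows / 10, 0.9)

def text_score(html):
    # Hypothetical fallback: the single- or multiple-record engine can always try.
    return 0.1

ENGINES = {
    "form": form_score,
    "table": table_score,
    "text": text_score,
}

def choose_engine(html):
    """Ask every engine for its analysis and return the name of the best scorer."""
    scores = {name: score(html) for name, score in ENGINES.items()}
    return max(scores, key=scores.get)

print(choose_engine("<form action='/search'><input name='surname'></form>"))
print(choose_engine("<table>" + "<tr><td>Walker</td></tr>" * 8 + "</table>"))
print(choose_engine("<p>John Walker was born in 1850.</p>"))
```

Keeping the engines in a dictionary means a new extraction method can be added without changing the selection logic.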
15
Data Extraction Engines
Text
  Improved record separation
  Ability to handle single-record pages
Table
Forms
16
Data Constrainer
Selects attribute/value pairs
Fits data to ontology
17
Result Filter
Fits data to query
Returns to central Result Presenter
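Fitting data to the query can be sketched as keeping only the extracted records that satisfy every constraint the user gave. The record fields, the query format, and the sample records here are made up for illustration and are not taken from the proposal.

```python
def matches(record, query):
    """True when the record satisfies every attribute constraint in the query."""
    return all(test(record.get(attr)) for attr, test in query.items())

def filter_results(records, query):
    return [record for record in records if matches(record, query)]

# Hypothetical query: surname Walker, died after 1880.
QUERY = {
    "surname": lambda value: value == "Walker",
    "death_year": lambda value: value is not None and value > 1880,
}

# Made-up records standing in for Data Constrainer output.
RECORDS = [
    {"surname": "Walker", "death_year": 1902},
    {"surname": "Walker", "death_year": 1871},
    {"surname": "Adams", "death_year": 1910},
    {"surname": "Walker"},  # death year was not extracted
]

kept = filter_results(RECORDS, QUERY)
print(kept)
```

This sketch drops records with a missing attribute. A real filter might instead keep them as possible matches, which is a design choice the slides leave open.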
18
Result Presenter
Creates XML Schema from Ontology
Presents results to user
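As a rough illustration of deriving an XML Schema from an ontology, the sketch below turns a much-simplified, hypothetical ontology (one object set with a list of attribute names) into schema elements. The real ontology format and the mapping rules are not given in the slides, so this shows only the general idea.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"
ET.register_namespace("xs", XS)

# Hypothetical, much-simplified ontology: one object set and its attributes.
ONTOLOGY = {"Person": ["Name", "BirthDate", "DeathDate", "BurialLocation"]}

def ontology_to_schema(ontology):
    """Build an xs:schema with one complex element per object set."""
    schema = ET.Element(f"{{{XS}}}schema")
    for object_set, attributes in ontology.items():
        element = ET.SubElement(schema, f"{{{XS}}}element", name=object_set)
        complex_type = ET.SubElement(element, f"{{{XS}}}complexType")
        sequence = ET.SubElement(complex_type, f"{{{XS}}}sequence")
        for attribute in attributes:
            # Every attribute is treated as an optional string in this sketch.
            ET.SubElement(sequence, f"{{{XS}}}element",
                          name=attribute, type="xs:string", minOccurs="0")
    return schema

schema = ontology_to_schema(ONTOLOGY)
print(ET.tostring(schema, encoding="unicode"))
```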
19
Result Presenter
20
Evaluation
Scalability
  Query on large URL list
  Experiment on number of PCs
Precision and recall
  Recall difficult to determine
  Query on small URL list
Adaptability
  Car ontology
  Small URL list
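For reference, precision and recall can be computed as below. Recall needs the full set of relevant records, which is presumably why the slide pairs it with a small URL list that can be checked by hand. The function name and the sample identifiers are illustrative only.

```python
def precision_recall(extracted, relevant):
    """Precision = correct / extracted; recall = correct / relevant."""
    extracted, relevant = set(extracted), set(relevant)
    correct = extracted & relevant
    precision = len(correct) / len(extracted) if extracted else 0.0
    recall = len(correct) / len(relevant) if relevant else 0.0
    return precision, recall

# Illustrative record identifiers: 4 extracted, 3 truly relevant, 2 in common.
p, r = precision_recall({"a", "b", "c", "d"}, {"a", "b", "e"})
print(f"precision={p:.2f} recall={r:.2f}")
```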
21
Conclusion
Integrates, builds on previous DEG work
Extracts from:
  Single- or Multiple-Record Documents
  Tables
  Forms
Scalable
  Only searches probable pages
  Distributed with DOGMA
Robust to changes in pages
Ontology based: easily adapted to other domains