A Fully Automated Object Extraction System for the World Wide Web a paper by David Buttler, Ling Liu...
A Fully Automated Object Extraction System for the World Wide Web
a paper by David Buttler, Ling Liu and Calton Pu, Georgia Tech
Why’d they do it?
  Identifying object regions and boundaries has been done manually and with some automation, mostly relying on syntactic knowledge (i.e., HTML).
  Embley, Jiang & Ng (Hmmm… must be some famous scientists in Germany) developed a pretty sweet heuristics-based automatic object extraction system, which we want to copy, minus the ontology heuristic, and maybe throw in a few ideas of our own.
Omini (not the book after Jarom)
  Fully-automated extraction
    Parses a page into a tree structure
    Locates smallest subtree with all objects
      Reduces possibilities for next step
    Finds correct object separator tags
  Contributions to IE
    A few algorithms for subtree extraction and object extraction
    Most of the other stuff is already known
Some Terms & Definitions
  Well-Formed Web Document
    No brackets besides tags
    ALL tags are paired (even <br>, <hr>, etc.)
    Attribute values in a tag are in quotes
    Nested tags do not overlap
Well-Formed Doc → Tag Tree
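A minimal sketch of turning a well-formed document into a tag tree, using Python's standard html.parser. The names Node, TagTreeBuilder and build_tag_tree are made up for illustration and are not from the paper; the sketch only enforces the "paired, non-overlapping tags" rules from the definition above.

```python
from html.parser import HTMLParser


class Node:
    """One tag in the tree (hypothetical helper type, not from the paper)."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""


class TagTreeBuilder(HTMLParser):
    """Builds a tree of Node objects; rejects unpaired or overlapping tags."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        # In a well-formed document the closing tag always matches the
        # innermost open tag; anything else means overlap or a stray close.
        if self.current.tag != tag:
            raise ValueError(f"unexpected </{tag}> inside <{self.current.tag}>")
        self.current = self.current.parent

    def handle_data(self, data):
        self.current.text += data


def build_tag_tree(html):
    builder = TagTreeBuilder()
    builder.feed(html)
    builder.close()
    if builder.current is not builder.root:
        raise ValueError(f"<{builder.current.tag}> was never closed")
    return builder.root


tree = build_tag_tree("<ul><li>a</li><li>b</li></ul>")
print([child.tag for child in tree.children[0].children])
```

Real pages are rarely well-formed, so a normalization pass would have to run before something like this.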
System Architecture
Phase 2, Part A: Subtree Extraction
  3 heuristics used to find the minimal subtree containing all objects of interest
    Fanout
    Content Size
    Tag Count
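To make the three measurements concrete, here is a rough sketch that computes fanout, content size and tag count for every node of a small tag tree. The Node type and the rank_subtrees helper are assumptions for illustration; the slides do not say how Omini weighs or combines the three rankings, so this only shows the raw numbers.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical tag-tree node: a tag name, its own text, its children."""
    tag: str
    text: str = ""
    children: list = field(default_factory=list)


def fanout(node):
    """Number of direct children."""
    return len(node.children)


def content_size(node):
    """Total characters of text in the subtree rooted at node."""
    return len(node.text) + sum(content_size(c) for c in node.children)


def tag_count(node):
    """Total number of tags in the subtree rooted at node (including itself)."""
    return 1 + sum(tag_count(c) for c in node.children)


def walk(node):
    """Yield every node in the subtree, parents before children."""
    yield node
    for child in node.children:
        yield from walk(child)


def rank_subtrees(root, metric):
    """Candidate subtrees ordered from highest to lowest score on one metric."""
    return sorted(walk(root), key=metric, reverse=True)


table = Node("table", children=[Node("tr", "row one"),
                                Node("tr", "row two"),
                                Node("tr", "row three")])
page = Node("body", children=[Node("h1", "Title"), table])

for node in rank_subtrees(page, fanout)[:2]:
    print(node.tag, fanout(node), content_size(node), tag_count(node))
```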
Phase 2, Part B: Object Separator Extraction
  Combination of 5 Heuristics
    SD (Standard Deviation) & RP (Repeating Pattern) are taken from BYU.
    SB (Sibling tag), PP (Partial Path) are new.
    IPS (Identifiable Path Separator) is an extension of BYU’s IT (Identifiable Tag).
Phase 2, Part B Continued: Object Separator Heuristics
  SD: Standard deviation of the distance between consecutive occurrences of a candidate tag. (Objects are usually the same size.)
  RP: Absolute value of the difference between how often a pair of tags appears together and how often each appears alone. (A pattern of tags usually means just one thing.)
  IPS: Ranks tags according to a table of common object separators.
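A small sketch of the SD idea: measure the gaps between consecutive occurrences of each candidate tag and prefer the tag whose gaps vary least. The input format (tag name plus character offset) and the rule of skipping tags with fewer than two gaps are assumptions made for this example, not details from the paper.

```python
from collections import defaultdict
from statistics import pstdev


def sd_scores(occurrences):
    """Map each candidate tag to the standard deviation of its gaps.

    occurrences: iterable of (tag, offset) pairs, where offset is the
    character position at which the tag starts (assumed input format).
    """
    positions = defaultdict(list)
    for tag, offset in occurrences:
        positions[tag].append(offset)

    scores = {}
    for tag, offsets in positions.items():
        offsets.sort()
        gaps = [b - a for a, b in zip(offsets, offsets[1:])]
        # With fewer than two gaps the deviation is trivially zero, which
        # would unfairly favor rare tags, so such tags are skipped here.
        if len(gaps) >= 2:
            scores[tag] = pstdev(gaps)
    return scores


def rank_by_sd(occurrences):
    """Candidate tags, most regular spacing first."""
    scores = sd_scores(occurrences)
    return sorted(scores, key=scores.get)


sample = [("tr", 0), ("b", 10), ("tr", 100), ("b", 150),
          ("tr", 200), ("b", 205), ("tr", 300)]
print(rank_by_sd(sample))
```

In the sample, `tr` recurs every 100 characters while `b` is scattered, so `tr` ranks first.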
Phase 2, Part B Continued: Object Separator Heuristics
  SB: Pairs of tags that are immediate siblings among the children of the minimal subtree (i.e., <p><a>…</a><b>…</b><c>…</c></p>). (# object separators should = # objects)
  PP: Counts occurrences of the same path of tags from a node. (Multiple instances of an object should have the same object structure.)
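The path counting behind PP can be sketched as follows. The tree is written as nested (tag, children) tuples, and count_paths is a hypothetical helper; how Omini turns these counts into a separator ranking is not spelled out in the slides, so the sketch stops at the counts.

```python
from collections import Counter


def count_paths(node, prefix=()):
    """Count how often each tag path occurs below (and including) node.

    node is a (tag, children) tuple; a path is a tuple of tag names from
    the starting node down to some descendant.
    """
    tag, children = node
    path = prefix + (tag,)
    counts = Counter({path: 1})
    for child in children:
        counts.update(count_paths(child, path))
    return counts


subtree = ("table", [
    ("tr", [("td", []), ("td", [])]),
    ("tr", [("td", []), ("td", [])]),
    ("caption", []),
])

for path, n in count_paths(subtree).most_common():
    print(".".join(path), n)
```

Paths that repeat many times (here table.tr and table.tr.td) hint at the repeated object structure; one-off paths such as table.caption do not.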
Phase 2, Part B Continued: Object Separator Heuristics
  Combining Heuristics
    Probability that tag <tr> is an object separator if 3 heuristics say 78%, 63% and 85%: 99%
      0.78 + 0.63 + 0.85 - 0.78*0.63 - 0.78*0.85 - 0.63*0.85 + 0.78*0.63*0.85 = 0.988 ≈ 99%
    Combination of all 5 heuristics is best.
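The arithmetic on this slide is the inclusion-exclusion formula for "at least one of several independent events", which collapses to one minus the product of the complements. A short sketch, assuming the heuristics' estimates are treated as independent probabilities between 0 and 1 (the function name combine is made up here):

```python
from math import prod


def combine(probabilities):
    """Probability that at least one independent heuristic is right.

    Equivalent to the inclusion-exclusion expansion, but valid for any
    number of heuristics: 1 - (1 - p1) * (1 - p2) * ... * (1 - pn).
    """
    for p in probabilities:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability out of range: {p}")
    return 1.0 - prod(1.0 - p for p in probabilities)


print(round(combine([0.78, 0.63, 0.85]), 3))  # 0.988, i.e. about 99%
```

Each extra heuristic can only push the combined value up, which is why a tag backed by three middling scores still lands near 99%.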
Phase 3: Object Extraction
  Candidate Object Construction
    Uses Object Separator Tag from Phase 2
  Object Extraction Refinement
    Removes objects that may not be of the same structure, or that are too big or too small
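A rough sketch of the two steps, under loud assumptions: the children of the minimal subtree are given as (tag, text) pairs, "same structure" is approximated as "starts with the separator tag", and "too big or too small" is judged against the median size with a made-up tolerance parameter. None of these specifics come from the paper.

```python
from statistics import median


def construct_candidates(children, separator):
    """Cut the child sequence into candidate objects at each separator tag."""
    objects, current = [], []
    for tag, text in children:
        if tag == separator and current:
            objects.append(current)
            current = []
        current.append((tag, text))
    if current:
        objects.append(current)
    return objects


def refine(objects, separator, tolerance=0.5):
    """Drop candidates that look structurally different or oddly sized.

    tolerance is a hypothetical knob: a candidate survives if its text
    size is within tolerance * median of the median candidate size.
    """
    structured = [obj for obj in objects if obj[0][0] == separator]
    if not structured:
        return []
    sizes = [sum(len(text) for _, text in obj) for obj in structured]
    typical = median(sizes)
    return [obj for obj, size in zip(structured, sizes)
            if abs(size - typical) <= tolerance * typical]


children = [("h1", "Results"), ("hr", ""), ("p", "first item text"),
            ("hr", ""), ("p", "second item txt"), ("hr", ""), ("p", "x")]
candidates = construct_candidates(children, "hr")
print(len(candidates), len(refine(candidates, "hr")))
```

Here the page header (no leading separator) and the one-character stub are filtered out, leaving the two similarly sized items.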
Results
  Ran Omini on 1,500 pages across 25 sites
  Using the combination of all 5 heuristics:
    94% of Object Separators picked correctly
    100% Precision and 98% Recall
  vs BYU
    Omini as good if not better in all tests
    Over 5 websites in March 2000:
      BYU: 59% success rate
      Omini: 93% success rate
Criticism of BYU System
  IT (Identifiable Tag) vs IPS (Identifiable Path Separator):
    IPS changes tag table based on the node at which the minimal subtree is anchored.
  PP (Partial Path) vs HC (Highest Count):
    By itself, HC not very successful
    In combination with other heuristics, HC can actually make the total accuracy worse!
    PP just like HC on some websites
  Ontology approach uses human intervention; if goal is fully automated, this won’t do.