Machine Learning in DIADEM (Andrey Kravchenko)

Machine Learning in DIADEM Reading Course Presentation

Andrey Kravchenko 20th of January, 2010

Current area of research: Real estate page classification

vs

Current area of research: Input and output page distinction

Current area of research: Page element classification

The Reading List: Papers not included in this presentation

- “An interactive clustering-based approach to integrating source query interfaces on the Deep Web”
  - This paper is concerned with input forms.

- “Automatic wrapper induction from hidden-web sources with domain knowledge”
  - Only a part of the paper deals with the output pages. Their methodology for processing the output pages is based on gazetteers and is thus closer to linguistics than to ML.

- “Web scale extraction of structured data”
  - Deals with the whole Web.

- “An adaptive information extraction system based on wrapper induction with POS tagging”
  - The labels are of very low granularity (e.g. work_name, work_location) and of a linguistic nature. The comparison is done against linguistic systems such as Rapier (another excluded paper on the reading list), GATE-SVM, etc. Introducing POS tagging provides only a 5% gain in accuracy, and only for some target slots in one corpus, with no gain for the other two.

The Reading List: Papers not included in this presentation

- “Learning (k,l)-contextual tree languages for information extraction from Web pages”
  - The paper deals with learning an extraction language rather than extraction itself.

- “Bottom-up relational learning of pattern matching rules for information extraction”
  - Deals with textual documents only.

- “Learning rules to pre-process Web data for automatic integration”
  - Relies on the web data extraction and alignment phases performed by the VIPER system, which are not described in the paper. I wasn’t able to detect any ML involved in the rule learning stage. No clear description of practical results. Low-level granularity of labels.

- “Learning rules for information extraction”
  - Is not HTML/DOM specific.

The Reading List: Papers included in this presentation

#1 “Web page classification: features and algorithms” (2007)
#2 “Web page element classification based on visual features”
#3 “Stylistic and lexical co-training for Web-block classification”
#4 “Can we learn a template-independent wrapper for news article extraction from a single training site?”
#5 “Efficient record-level wrapper induction”
#6 “Towards combining Web classification and Web Information Extraction: a case study”

Web page classification: features and algorithms
X. Qi and B. Davison (Lehigh University, 2007)

- The paper distinguishes between four types of classification;
- They also distinguish between subject classification, functional classification, sentiment classification, and other types of classification;
- The paper distinguishes between on-page features and the features of the neighbours;
- On-page features:
  - Textual analysis: bag of words vs n-gram;
  - Visual analysis: the multigraph approach.
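To make the bag-of-words vs n-gram contrast concrete, here is a minimal sketch. The whitespace tokeniser, the function names and the sample text are assumptions made for illustration and do not come from the paper.

```python
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it on whitespace (a deliberately naive tokeniser)."""
    return text.lower().split()


def bag_of_words(text):
    """Count each token independently, discarding word order."""
    return Counter(tokenize(text))


def ngrams(text, n=2):
    """Count contiguous runs of n tokens, which keeps local word order."""
    tokens = tokenize(text)
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


# Hypothetical page text, chosen only to show the difference.
page_text = "two bedroom flat for sale two bedroom house for rent"
bow = bag_of_words(page_text)
bigrams = ngrams(page_text, n=2)
```

The bag of words only records that "for" occurs twice, whereas the bigram counts keep "for sale" and "for rent" apart.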

Paper # 1




- When using the features of neighbouring pages, the authors distinguish between the weak assumption and the strong assumption;
- They also distinguish between different types of neighbours: parents/children, grandparents/grandchildren and siblings/spouses;
- It appears that siblings are the most important neighbours;
- Various features are used for different types of neighbouring pages;
- Algorithm survey: dimension reduction and relational learning approaches.
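The neighbour types above can be read off a link graph. The sketch below assumes the usual link-graph reading of the terms: parents link to the page, children are linked from it, siblings share a parent, and spouses share a child. The graph, page names and function name are hypothetical, and grandparents/grandchildren are left out for brevity.

```python
def neighbours(links, page):
    """Return the parent, child, sibling and spouse sets of `page`.

    `links` maps each page to the set of pages it links to.
    """
    children = set(links.get(page, set()))
    parents = {p for p, targets in links.items() if page in targets}
    # Siblings share at least one parent with the page.
    siblings = {c for p in parents for c in links.get(p, set())} - {page}
    # Spouses share at least one child with the page.
    spouses = {p for p, targets in links.items() if targets & children} - {page}
    return {
        "parents": parents,
        "children": children,
        "siblings": siblings,
        "spouses": spouses,
    }


# Hypothetical four-page site: a hub page linking to two pages that share a child.
links = {
    "hub": {"a", "b"},
    "a": {"detail"},
    "b": {"detail"},
    "detail": set(),
}
result = neighbours(links, "a")
```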


Web page element classification based on visual features
R. Burget and I. Rudolfova (Brno University, 2009)

- Problem: Classification of elements from a web page based on its visual rendering;
- Assumptions: A tagged corpus, DOM tree, CSSBox layout;
- Approach: Page segmentation followed by block classification performed via Weka’s J48 decision tree classifier;
- Features: Font features, spatial features, text features, colour features;
- Evaluation: News domain. Average F1 measure on coarse-grained labels, low F1 measure on fine-grained labels.
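The F1 measure quoted in the evaluations is the harmonic mean of precision and recall. A minimal sketch follows; the counts in the example are made up for illustration and are not results from any of the papers.

```python
def f1_score(true_positives, false_positives, false_negatives):
    """Harmonic mean of precision and recall; returns 0.0 when there are no true positives."""
    if true_positives == 0:
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)


# Illustrative counts only: 8 correct labels, 2 spurious, 4 missed.
score = f1_score(true_positives=8, false_positives=2, false_negatives=4)
```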

Paper # 2


- The approach of this paper is split into two phases:
  - Page segmentation;
  - Page element classification;
- Page segmentation is done in four phases:
  - Page rendering;
  - Detecting basic visual areas;
  - Text line detection;
  - Block detection;
- As a result of page segmentation we obtain a tree of areas.



- The actual page element classification is performed for each area via Weka’s J48 decision tree classifier based on the following set of features:
  - Font features {fontsize, weight};
  - Spatial features {aabove, abelow, aleft, aright};
  - Text features {tdigits, tlower, tupper, tspaces, tlength};
  - Colour features {contrast}.
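As an illustration of the text features, the sketch below computes them for the text of a single area. The feature names come from the slide, but their definitions here are assumptions: each ratio is taken relative to the text length and tlength is the raw character count. The paper may define them differently.

```python
def text_features(text):
    """Compute simple character-level features for the text of one visual area.

    Assumed definitions: each ratio is the share of characters of that kind,
    and tlength is the number of characters.
    """
    length = len(text)

    def ratio(predicate):
        # Share of characters satisfying the predicate; 0.0 for empty text.
        return sum(1 for ch in text if predicate(ch)) / length if length else 0.0

    return {
        "tdigits": ratio(str.isdigit),
        "tlower": ratio(str.islower),
        "tupper": ratio(str.isupper),
        "tspaces": ratio(str.isspace),
        "tlength": length,
    }


# Hypothetical area text from a property page.
features = text_features("Price: 250000 GBP")
```

A feature vector of this kind, joined with the font, spatial and colour features, is what a decision tree learner such as J48 would take as input.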



The set of labels

Results (the testing pages come from a different source than the training pages)


Stylistic and Lexical Co-training for Web Block Classification
C. Lee et al (National University of Singapore, 2004)

- Problem: Classification of elements from a web page based on both stylistic and lexical features;
- Assumptions: A tagged corpus, DOM tree, CSSBox layout;
- Approach: Web block division followed by co-training with BoosTexter, an ensemble learning method in which each weak learner is a decision stump;
- Features: Lexical and stylistic;
- Evaluation: News domain. Average F1 measure on coarse-grained labels, low F1 measure on fine-grained labels.

Paper # 3


- The authors aim to combine two different classifiers with distinct sets of features (lexical and stylistic);
- They’ve created a PARser for Content Extraction and Layout Structure (PARCELS);
- Web page division: the authors differentiate between structural tags and content tags.



- The authors distinguish between labels of different levels of granularity. They define 17 tags for labelling;
- Stylistic features:
  - Linear structure: paragraph (<p>), header (<h1>-<h6>) and rule tags (<hr>);
  - Table structure: cell flow, neighbouring cells’ data, the position of table cells;
  - XHTML/CSS structure: height, width, z-index;
  - Font features: colour, weight, family, size, hyperlink features;
  - Images: size, number of images within a block;
- Lexical features:
  - Low-level features: count and vocabulary of the words present in the text block;
  - High-level features: POS-tags, mailto-links, image-links, text-links, total-links;
- BoosTexter is used for co-training. It is an ensemble learning method in which each weak learner is a decision stump.
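To show what boosting over decision stumps looks like, here is a minimal sketch of plain binary AdaBoost with threshold stumps. It is not BoosTexter itself, which handles multi-class and multi-label text data, and it does not include the co-training loop. The toy blocks, feature choice and function names are assumptions for illustration.

```python
import math


def train_stump(X, y, weights):
    """Find the single-feature threshold test with the lowest weighted error."""
    best = None
    for feature in range(len(X[0])):
        for threshold in sorted({row[feature] for row in X}):
            for polarity in (1, -1):
                predictions = [polarity if row[feature] >= threshold else -polarity
                               for row in X]
                error = sum(w for w, p, label in zip(weights, predictions, y)
                            if p != label)
                if best is None or error < best[0]:
                    best = (error, feature, threshold, polarity)
    return best


def stump_predict(stump, row):
    """Apply one stump: a threshold test on a single feature."""
    _, feature, threshold, polarity = stump
    return polarity if row[feature] >= threshold else -polarity


def adaboost(X, y, rounds=5):
    """Train a weighted ensemble of stumps; labels must be +1 or -1."""
    weights = [1.0 / len(X)] * len(X)
    ensemble = []
    for _ in range(rounds):
        stump = train_stump(X, y, weights)
        error = min(max(stump[0], 1e-10), 1 - 1e-10)  # keep the logarithm finite
        alpha = 0.5 * math.log((1 - error) / error)
        ensemble.append((alpha, stump))
        # Raise the weight of misclassified examples, then renormalise.
        weights = [w * math.exp(-alpha * label * stump_predict(stump, row))
                   for w, row, label in zip(weights, X, y)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, row):
    """Weighted vote of all stumps in the ensemble."""
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1


# Toy blocks described by (font size, link count); +1 = title, -1 = navigation.
X = [(24, 0), (22, 1), (20, 0), (10, 8), (11, 6), (9, 7)]
y = [1, 1, 1, -1, -1, -1]
model = adaboost(X, y, rounds=3)
```

In a co-training setup, one such ensemble would be trained on the stylistic features and another on the lexical features, each labelling examples for the other.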


Can we learn a template-independent wrapper for news article extraction from a single training site?
J. Wang et al (2009, Zhejiang University, MS Research)

- Problem: Classification of the titles and bodies of news articles taken from web pages belonging to the news domain;
- Assumptions: A tagged corpus, DOM tree, CSSBox layout;
- Approach: SVM; the decision function gets converted to a posterior probability;
- Features: Different sets of features for body and title extraction. Features are divided into content and spatial features;
- Evaluation: Overall 99% extraction accuracy.

Paper # 4


- The aim of the paper is to efficiently extract and then combine the titles and bodies of news articles;
- The main problem is dealing with the various kinds of noise around the titles.



- News body extraction:
  - Content features: FormattingElementsNum and FormattedContentLen;
  - Spatial features: normalised RectLeft, RectTop, RectWidth and RectHeight;
  - News body extraction heuristics: TopInScreen(T) and BigEnough(T);
- News title extraction:
  - Content features: FontSize, EndWithFullStop, WordNum;
  - Spatial features: RectLeft, RectTop, RectWidth, RectHeight, Overlap, Distance, Flat;
  - News title extraction heuristics: WholeInScreen(T), NoAnchorText(T), NotCategoryName(T);
- An SVM approach is chosen for classification. The decision function gets converted to a posterior probability.
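One common way to turn an SVM decision value into a posterior probability is Platt's sigmoid fit. The slide does not say which conversion the paper uses, so the sketch below is an assumption. The parameters a and b would normally be fitted on held-out data; the values and node names here are placeholders.

```python
import math


def platt_probability(decision_value, a=-1.0, b=0.0):
    """Map an SVM decision value f to P(y = 1 | f) = 1 / (1 + exp(a * f + b))."""
    return 1.0 / (1.0 + math.exp(a * decision_value + b))


# Hypothetical decision values for three candidate DOM nodes.
candidates = {"node_1": 2.0, "node_2": 0.0, "node_3": -1.5}
probabilities = {name: platt_probability(f) for name, f in candidates.items()}
best = max(probabilities, key=probabilities.get)
```

Working with probabilities rather than raw margins makes it possible to compare candidates and pick the most likely title or body node on a page.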


Testing results on the large-scale experiment

Extraction results


Efficient record-level wrapper induction
S. Zheng et al (Pennsylvania State University, 2009)

- Problem: Efficient extraction of records from Web pages and classification of their elements;
- Assumptions: A tagged corpus, DOM tree;
- Approach: Alignment of the DOM subtree and the possible wrappers;
- Features: None;
- Evaluation: Four different domains (online shops, user reviews, digital libraries, search results). Seven detail page datasets and eleven list page datasets. A 99% F1 value.

Paper # 5


- The paper is concerned with extracting records and their respective attributes;
- The key distinction from other approaches is record-level extraction as opposed to page-level extraction;
- The authors propose a novel broom structure for this task;
- The broom structure has a head and a stick;
- One of the main issues is crossing records.



- The general architecture of the system involves training and testing phases.



- The authors claim to achieve remarkable extraction accuracy and a significant boost in running time performance.


Towards combining Web classification and Web Information Extraction: a case study
P. Luo et al (HP Labs China, 2009)

- Problem: Combination of web page classification based on their relevance to a specific domain with the extraction of its specific elements, using both forward and backward dependencies;
- Assumptions: A tagged corpus, DOM tree;
- Approach: Conditional Random Fields (CRFs);
- Features: Course terms and heuristics for course homepage detection; format, position and content features for course title extraction;
- Evaluation: OfCourse system for online course information extraction. 90% F1 value for course page classification, 83% F1 value for course title extraction.

Paper # 6

- The authors propose a method that utilises both forward and backward dependencies between Web classification and information extraction;
- The authors use a unified graphical CRF model for joint and simultaneous optimisation of these two steps;
- This methodology has been used for building the OfCourse online search engine;
- In their results for OfCourse the authors claim that their model significantly outperforms the two baseline methods;
- Drawbacks: they only deal with DOM leaf nodes as classification variables for the information extraction phase.


Lessons learnt from the Reading Course

#1 “Web page classification: features and algorithms” by X. Qi and B. Davison (2007): the importance of the features of neighbouring pages;

#2 “Web page element classification based on visual features” by R. Burget and I. Rudolfova (2009): a broad set of visual features (font features, spatial features, text features and colour features);

#3 “Stylistic and Lexical Co-training for Web Block Classification” by C. Lee et al (2004): a useful web block division algorithm, and the possibility of co-training on the same corpus using two distinct sets of features;

Lessons learnt from the Reading Course

#4 “Can we learn a template-independent wrapper for news article extraction from a single training site?” by J. Wang et al (2009): a distinctive set of features for news title extraction, many of which can be used for property title extraction in DIADEM;

#5 “Efficient record-level wrapper induction” by S. Zheng et al (2009): a new record-level approach to extraction. Performs much better and faster than the page-level approaches. Can be useful for DIADEM extraction in the record-heavy domains;

#6 “Towards combining Web classification and Web Information Extraction: a case study” by P. Luo et al (2009): the backward dependency between these two tasks can work as well. Thus it is worthwhile to experiment with coupling them.

General lessons learnt

- Most of the papers are recent or very recent (2004-2009);
- Features play a much more important role than algorithms;
- Initial page segmentation into blocks can help with subsequent determination of relevant DOM subtrees;
- All features can be broadly divided into content features and visual features;
- News domain is a very popular one (3 out of 5 reviewed systems). No mention of real estate in any of the papers.

Summary of the Reading Course and its relevance to DIADEM

- The six proposed papers are of relevance to all three areas of my current research:
  - Real estate page classification;
  - Output/Input page distinction;
  - Property page elements’ classification;
- The most obvious synergy is with Omer’s NLP work, although overlaps with Cheng’s and Xiaonan’s work are also possible;
- I plan to use a subset of the features presented in these papers in the classification of the elements of output pages and subsequent real estate page classification.

Thank you for your attention!
