Machine Learning in DIADEM (Andrey Kravchenko)


Transcript of Machine Learning in DIADEM (Andrey Kravchenko)

Page 1: Machine Learning in DIADEM (Andrey Kravchenko)

Machine Learning in DIADEM: Reading Course Presentation

Andrey Kravchenko, 20th of January, 2010

Page 2: Machine Learning in DIADEM (Andrey Kravchenko)

Current area of research: real estate page classification


Page 3: Machine Learning in DIADEM (Andrey Kravchenko)

Current area of research: input and output page distinction

Page 4: Machine Learning in DIADEM (Andrey Kravchenko)

Current area of research: page element classification

Page 5: Machine Learning in DIADEM (Andrey Kravchenko)

The Reading List: papers not included in this presentation

- "An interactive clustering-based approach to integrating source query interfaces on the Deep Web"
  - This paper is concerned with input forms.

- "Automatic wrapper induction from hidden-web sources with domain knowledge"
  - Only a part of the paper deals with the output pages. Their methodology for processing the output pages is based on gazetteers and is thus closer to linguistics than to ML.

- "Web scale extraction of structured data"
  - Deals with the whole Web.

- "An adaptive information extraction system based on wrapper induction with POS tagging"
  - The labels are of very low granularity (e.g. work_name, work_location) and of a linguistic nature. The comparison is done against linguistic systems such as Rapier (another excluded paper on the reading list), GATE-SVM, etc. Introducing POS tagging provides only a 5% gain in accuracy, and only for some target slots for one corpus, with no gain for the other two corpora.

Page 6: Machine Learning in DIADEM (Andrey Kravchenko)

The Reading List: papers not included in this presentation

- "Learning (k,l)-contextual tree languages for information extraction from Web pages"
  - The paper deals with learning an extraction language rather than with extraction itself.

- "Bottom-up relational learning of pattern matching rules for information extraction"
  - Deals with textual documents only.

- "Learning rules to pre-process Web data for automatic integration"
  - Relies on web data extraction and alignment phases performed by the VIPER system that are not described in the paper. I was not able to detect any ML involved in the rule-learning stage. There is no clear description of practical results, and the labels are of low-level granularity.

- "Learning rules for information extraction"
  - Not HTML/DOM-specific.

Page 7: Machine Learning in DIADEM (Andrey Kravchenko)

The Reading List: papers included in this presentation

#1 "Web page classification: features and algorithms" (2007)
#2 "Web page element classification based on visual features"
#3 "Stylistic and lexical co-training for Web-block classification"
#4 "Can we learn a template-independent wrapper for news article extraction from a single training site?"
#5 "Efficient record-level wrapper induction"
#6 "Towards combining Web classification and Web Information Extraction: a case study"

Page 8: Machine Learning in DIADEM (Andrey Kravchenko)

"Web page classification: features and algorithms", X. Qi and B. Davison (Lehigh University, 2007)

- The paper distinguishes between four types of classification: subject classification, functional classification, sentiment classification, and other types of classification;

- It also distinguishes between on-page features and the features of neighbouring pages;

- On-page features:
  - Textual analysis: bag of words vs n-grams (a minimal sketch follows below);
  - Visual analysis: the multigraph approach.
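To make the bag-of-words vs n-gram contrast concrete, here is a minimal sketch using scikit-learn's CountVectorizer on two invented page snippets. The library choice and the example strings are assumptions made for illustration; this is not the representation used by Qi and Davison.

```python
# Illustrative only: bag-of-words vs word n-gram features for page text,
# using scikit-learn's CountVectorizer on two invented snippets.
from sklearn.feature_extraction.text import CountVectorizer

pages = [
    "three bedroom flat to rent in central oxford",
    "search results for houses and flats for sale",
]

# Bag of words: every distinct token becomes one feature, order is ignored.
bow = CountVectorizer(ngram_range=(1, 1))
print("bag-of-words features:", len(bow.fit(pages).get_feature_names_out()))

# Unigrams + bigrams: some word order is preserved, at the cost of a larger,
# sparser feature space.
ngrams = CountVectorizer(ngram_range=(1, 2))
print("1-2-gram features:", len(ngrams.fit(pages).get_feature_names_out()))
```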

 

Paper #1

Page 9: Machine Learning in DIADEM (Andrey Kravchenko)

"Web page classification: features and algorithms", X. Qi and B. Davison (Lehigh University, 2007)

Paper #1

Page 10: Machine Learning in DIADEM (Andrey Kravchenko)

"Web page classification: features and algorithms", X. Qi and B. Davison (Lehigh University, 2007)

- When using the features of neighbouring pages, the authors distinguish between the weak assumption and the strong assumption;

- They also distinguish between different types of neighbours: parents/children, grandparents/grandchildren and siblings/spouses;

- It appears that siblings are the most important neighbours;
- Various features are used for the different types of neighbouring pages;

- Algorithm survey: dimension reduction and relational learning approaches.

Paper #1

Page 11: Machine Learning in DIADEM (Andrey Kravchenko)

"Web page element classification based on visual features", R. Burget and I. Rudolfova (Brno University, 2009)

- Problem: classification of web page elements based on their visual rendering;

- Assumptions: a tagged corpus, DOM tree, CSSBox layout;
- Approach: page segmentation followed by block classification performed via Weka's J48 decision tree classifier;

- Features: font features, spatial features, text features, colour features;

- Evaluation: news domain; average F1 measure on coarse-grained labels, low F1 measure on fine-grained labels.

Paper #2

Page 12: Machine Learning in DIADEM (Andrey Kravchenko)

"Web page element classification based on visual features", R. Burget and I. Rudolfova (Brno University, 2009)

- The approach of this paper is split into two phases:
  - Page segmentation;
  - Page element classification;

- Page segmentation is done in four steps:
  - Page rendering;
  - Detecting basic visual areas;
  - Text line detection;
  - Block detection;

- As a result of page segmentation we obtain a tree of areas.

Paper #2

Page 13: Machine Learning in DIADEM (Andrey Kravchenko)

"Web page element classification based on visual features", R. Burget and I. Rudolfova (Brno University, 2009)

- The actual page element classification is performed for each area via Weka's J48 decision tree classifier, based on the following set of features (a minimal sketch follows below):
  - Font features {fontsize, weight};
  - Spatial features {aabove, abelow, aleft, aright};
  - Text features {tdigits, tlower, tupper, tspaces, tlength};
  - Colour features {contrast}.
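As a rough illustration of this classification step, the sketch below trains a decision tree on hand-made feature vectors laid out in the order listed above. The paper uses Weka's J48 (a C4.5-style learner); scikit-learn's DecisionTreeClassifier is used here only as a convenient stand-in, and all numbers and labels are invented.

```python
# Illustrative only: decision-tree classification of segmented page areas.
# The paper uses Weka's J48; DecisionTreeClassifier is a stand-in here.
from sklearn.tree import DecisionTreeClassifier

# One row per area, feature order as on the slide:
# [fontsize, weight, aabove, abelow, aleft, aright,
#  tdigits, tlower, tupper, tspaces, tlength, contrast]
X = [
    [22.0, 1.0, 0, 3, 0, 1, 0.00, 0.55, 0.45, 0.10, 40, 0.95],   # headline-like
    [11.0, 0.0, 4, 5, 1, 1, 0.02, 0.78, 0.04, 0.18, 320, 0.40],  # paragraph-like
    [10.0, 0.0, 6, 0, 1, 0, 0.60, 0.30, 0.10, 0.05, 12, 0.30],   # date-like
]
y = ["title", "text", "date"]

clf = DecisionTreeClassifier(random_state=0).fit(X, y)
print(clf.predict([[20.0, 1.0, 1, 2, 0, 1, 0.0, 0.5, 0.5, 0.1, 35, 0.9]]))
```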

 

Paper #2

Page 14: Machine Learning in DIADEM (Andrey Kravchenko)

"Web page element classification based on visual features", R. Burget and I. Rudolfova (Brno University, 2009)

The set of labels and the results (the testing pages come from a different source than the training pages).

Paper #2

Page 15: Machine Learning in DIADEM (Andrey Kravchenko)

"Stylistic and Lexical Co-training for Web Block Classification", C. Lee et al. (National University of Singapore, 2004)

- Problem: classification of web page elements based on both stylistic and lexical features;

- Assumptions: a tagged corpus, DOM tree, CSSBox layout;
- Approach: web block division followed by co-training with BoosTexter, an ensemble learning method with a decision stump corresponding to a single weak learner;

- Features: lexical and stylistic;
- Evaluation: news domain; average F1 measure on coarse-grained labels, low F1 measure on fine-grained labels.

Paper #3

Page 16: Machine Learning in DIADEM (Andrey Kravchenko)

"Stylistic and Lexical Co-training for Web Block Classification", C. Lee et al. (National University of Singapore, 2004)

- The authors aim to combine two different classifiers with distinctive sets of features (lexical and stylistic);

- They have created a PARser for Content Extraction and Layout Structure (PARCELS);

- Web page division: the authors differentiate between structural tags and content tags.

Paper #3

Page 17: Machine Learning in DIADEM (Andrey Kravchenko)

"Stylistic and Lexical Co-training for Web Block Classification", C. Lee et al. (National University of Singapore, 2004)

Paper #3

Page 18: Machine Learning in DIADEM (Andrey Kravchenko)

"Stylistic and Lexical Co-training for Web Block Classification", C. Lee et al. (National University of Singapore, 2004)

- The authors distinguish between labels of different levels of granularity. They define 17 tags for labelling;

- Stylistic features:
  - Linear structure: paragraph (<p>), header (<h1>-<h6>) and rule (<hr>) tags;
  - Table structure: cell flow, neighbouring cells' data, the position of table cells;
  - XHTML/CSS structure: height, width, z-index;
  - Font features: colour, weight, family, size, hyperlink features;
  - Images: size, number of images within a block;

- Lexical features:
  - Low-level features: count and vocabulary of the words present in the text block;
  - High-level features: POS tags, mailto links, image links, text links, total links;

- BoosTexter is used for co-training. It is an ensemble learning method with a decision stump corresponding to a single weak learner (a minimal co-training sketch follows below).
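The sketch below illustrates the general shape of a co-training loop over the two feature views; it is a simplification made for illustration, not PARCELS. Since BoosTexter is not readily available in Python, scikit-learn's AdaBoostClassifier (whose default weak learner is a depth-1 decision stump) stands in for it, and the data is random toy data.

```python
# Illustrative co-training loop over a stylistic view and a lexical view.
# AdaBoost with its default depth-1 stumps stands in for BoosTexter.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

def co_train(X_sty, X_lex, y, labelled, unlabelled, rounds=5, per_round=2):
    """y is only trusted on `labelled` indices; pseudo-labels fill the rest."""
    y = np.array(y)
    for _ in range(rounds):
        for X_view in (X_sty, X_lex):
            if not unlabelled:
                break
            model = AdaBoostClassifier().fit(X_view[labelled], y[labelled])
            # Label the unlabelled blocks this view is most confident about,
            # then hand them over as extra training data for the other view.
            conf = model.predict_proba(X_view[unlabelled]).max(axis=1)
            picked = [unlabelled[i] for i in np.argsort(conf)[-per_round:]]
            y[picked] = model.predict(X_view[picked])
            labelled = labelled + picked
            unlabelled = [i for i in unlabelled if i not in picked]
    return AdaBoostClassifier().fit(np.hstack([X_sty, X_lex])[labelled], y[labelled])

# Tiny invented example: 6 blocks, 2 labelled, 4 unlabelled.
rng = np.random.default_rng(0)
X_sty, X_lex = rng.random((6, 4)), rng.random((6, 5))
clf = co_train(X_sty, X_lex, ["title", "text", "", "", "", ""], [0, 1], [2, 3, 4, 5])
print(clf.predict(np.hstack([X_sty, X_lex])[:2]))
```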

Paper #3

Page 19: Machine Learning in DIADEM (Andrey Kravchenko)

"Stylistic and Lexical Co-training for Web Block Classification", C. Lee et al. (National University of Singapore, 2004)

Paper #3

Page 20: Machine Learning in DIADEM (Andrey Kravchenko)

Paper #3

Page 21: Machine Learning in DIADEM (Andrey Kravchenko)

"Can we learn a template-independent wrapper for news article extraction from a single training site?", J. Wang et al. (Zhejiang University and Microsoft Research, 2009)

- Problem: classification of the titles and bodies of news articles taken from web pages belonging to the news domain;

- Assumptions: a tagged corpus, DOM tree, CSSBox layout;
- Approach: SVM; the decision function is converted to a posterior probability;

- Features: different sets of features for body and title extraction, divided into content and spatial features;

- Evaluation: overall 99% extraction accuracy.

Paper #4

Page 22: Machine Learning in DIADEM (Andrey Kravchenko)

"Can we learn a template-independent wrapper for news article extraction from a single training site?", J. Wang et al. (Zhejiang University and Microsoft Research, 2009)

- The aim of the paper is to efficiently extract and then combine the titles and bodies of news articles;

- The main problem is dealing with the various kinds of noise around the titles.

Paper #4

Page 23: Machine Learning in DIADEM (Andrey Kravchenko)

"Can we learn a template-independent wrapper for news article extraction from a single training site?", J. Wang et al. (Zhejiang University and Microsoft Research, 2009)

- News body extraction:
  - Content features: FormattingElementsNum and FormattedContentLen;
  - Spatial features: normalised RectLeft, RectTop, RectWidth and RectHeight;
  - News body extraction heuristics: TopInScreen(T) and BigEnough(T);

- News title extraction:
  - Content features: FontSize, EndWithFullStop, WordNum;
  - Spatial features: RectLeft, RectTop, RectWidth, RectHeight, Overlap, Distance, Flat;
  - News title extraction heuristics: WholeInScreen(T), NoAnchorText(T), NotCategoryName(T);

- An SVM approach is chosen for classification. The decision function is converted to a posterior probability (a minimal sketch follows below).
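The fragment below shows the general idea of turning raw SVM decision values into posterior probabilities. scikit-learn's SVC with probability=True applies Platt scaling internally, which matches the spirit of this step; the toy feature vectors (a few of the title features listed above) and labels are invented and do not come from the paper.

```python
# Illustrative only: SVM scores turned into posterior probabilities via
# Platt scaling (SVC's probability=True). Toy title-candidate data.
from sklearn.svm import SVC

# [FontSize, EndWithFullStop, WordNum, RectTop] per candidate block.
X = [
    [24, 0, 7, 80], [22, 0, 9, 120], [26, 0, 6, 100], [23, 0, 8, 90],        # titles
    [12, 1, 160, 400], [11, 1, 220, 550], [10, 0, 3, 20], [12, 1, 140, 480],  # non-titles
]
y = [1, 1, 1, 1, 0, 0, 0, 0]

svm = SVC(kernel="linear", probability=True, random_state=0).fit(X, y)
candidate = [[21, 0, 8, 110]]
print("margin:", svm.decision_function(candidate))      # raw decision value
print("posterior:", svm.predict_proba(candidate))       # calibrated probability
```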

Paper #4

Page 24: Machine Learning in DIADEM (Andrey Kravchenko)

"Can we learn a template-independent wrapper for news article extraction from a single training site?", J. Wang et al. (Zhejiang University and Microsoft Research, 2009)

Testing results on the large-scale experiment; extraction results.

Paper #4

Page 25: Machine Learning in DIADEM (Andrey Kravchenko)

"Efficient record-level wrapper induction", S. Zheng et al. (Pennsylvania State University, 2009)

- Problem: efficient extraction of records from web pages and classification of their elements;

- Assumptions: a tagged corpus, DOM tree;
- Approach: alignment of the DOM subtree with the possible wrappers;

- Features: none;
- Evaluation: four different domains (online shops, user reviews, digital libraries, search results); seven detail-page datasets and eleven list-page datasets; a 99% F1 value.

Paper #5

Page 26: Machine Learning in DIADEM (Andrey Kravchenko)

"Efficient record-level wrapper induction", S. Zheng et al. (Pennsylvania State University, 2009)

- The paper is concerned with extracting records and their respective attributes;

- The key distinction from other approaches is record-level extraction as opposed to page-level extraction;

- The authors propose a novel broom structure for this task;
- The broom structure has a head and a stick (an illustrative sketch follows below);
- One of the main issues is crossing records.
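The sketch below is one plausible, illustrative reading of the broom idea, not the paper's algorithm: a "stick" as the tag path from the page root down to a record region, and a "head" as the repeated record subtrees fanning out below it. All names and the simple repetition heuristic are assumptions made for illustration.

```python
# Illustrative sketch only: a "broom" pairs a stick (path to a record region)
# with a head (the repeated record subtrees below it).
from dataclasses import dataclass

@dataclass
class Node:
    tag: str
    children: list

@dataclass
class Broom:
    stick: list   # tag path from the page root to the record region
    head: list    # child subtrees assumed to be individual records

def candidate_brooms(node, path=()):
    """Yield nodes whose children repeat the same tag, a crude record signal."""
    path = path + (node.tag,)
    tags = [c.tag for c in node.children]
    if len(tags) >= 2 and len(set(tags)) == 1:
        yield Broom(stick=list(path), head=node.children)
    for child in node.children:
        yield from candidate_brooms(child, path)

page = Node("html", [Node("body", [
    Node("ul", [Node("li", []), Node("li", []), Node("li", [])]),
])])
for broom in candidate_brooms(page):
    print(broom.stick, len(broom.head), "records")
```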

Paper #5

Page 27: Machine Learning in DIADEM (Andrey Kravchenko)

"Efficient record-level wrapper induction", S. Zheng et al. (Pennsylvania State University, 2009)

Paper #5

Page 28: Machine Learning in DIADEM (Andrey Kravchenko)

"Efficient record-level wrapper induction", S. Zheng et al. (Pennsylvania State University, 2009)

- The general architecture of the system involves training and testing phases.

Paper #5

Page 29: Machine Learning in DIADEM (Andrey Kravchenko)

"Efficient record-level wrapper induction", S. Zheng et al. (Pennsylvania State University, 2009)

- The authors claim to achieve remarkable extraction accuracy and a significant boost in running-time performance.

Paper #5

Page 30: Machine Learning in DIADEM (Andrey Kravchenko)

"Towards combining Web classification and Web Information Extraction: a case study", P. Luo et al. (HP Labs China, 2009)

- Problem: combining the classification of web pages by their relevance to a specific domain with the extraction of specific page elements, using both forward and backward dependencies;

- Assumptions: a tagged corpus, DOM tree;
- Approach: Conditional Random Fields (CRFs);
- Features: course terms and heuristics for course homepage detection; format, position and content features for course title extraction;

- Evaluation: the OfCourse system for online course information extraction; 90% F1 value for course page classification, 83% F1 value for course title extraction.

Paper #6

Page 31: Machine Learning in DIADEM (Andrey Kravchenko)

"Towards combining Web classification and Web Information Extraction: a case study", P. Luo et al. (HP Labs China, 2009)

- The authors propose a method that utilises both forward and backward dependencies between Web classification and information extraction;

- The authors use a unified graphical CRF model for joint and simultaneous optimisation of these two steps (a minimal sketch follows below);

- This methodology has been used to build the OfCourse online course search engine;

- In their results for OfCourse, the authors claim that their model significantly outperforms the two baseline methods;

- Drawback: they only deal with DOM leaf nodes as classification variables in the information extraction phase.
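As a rough sketch of how a CRF could couple the two tasks, the fragment below labels a sequence of DOM leaf nodes and injects a page-level course-page score into every node's features, so the page classification decision can influence extraction. It relies on the third-party sklearn_crfsuite package as a generic linear-chain CRF; the feature names, toy page and score are invented, and this is not the authors' unified model.

```python
# Illustrative only: a linear-chain CRF over DOM leaf nodes, with a page-level
# course-page score injected into each node's features (sklearn_crfsuite is
# assumed to be installed; this is not the paper's unified graphical model).
import sklearn_crfsuite

def node_features(node, page_course_score):
    return {
        "text_lower": node["text"].lower(),
        "font_size": node["font_size"],
        "is_bold": node["bold"],
        "page_course_score": page_course_score,  # link to page classification
    }

pages = [
    [{"text": "CS 101: Introduction to Machine Learning", "font_size": 20, "bold": True},
     {"text": "Lectures: Mondays and Wednesdays", "font_size": 12, "bold": False},
     {"text": "Contact the lecturer", "font_size": 12, "bold": False}],
]
labels = [["course_title", "other", "other"]]
page_scores = [0.9]  # assumed output of a separate course-page classifier

X = [[node_features(n, s) for n in page] for page, s in zip(pages, page_scores)]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, labels)
print(crf.predict(X))
```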

Paper #6

Page 32: Machine Learning in DIADEM (Andrey Kravchenko)
Page 33: Machine Learning in DIADEM (Andrey Kravchenko)

Lessons learnt from the Reading Course

#1 "Web page classification: features and algorithms" by X. Qi and B. Davison (2007): the importance of the features of neighbouring pages;

#2 "Web page element classification based on visual features" by R. Burget and I. Rudolfova (2009): a broad set of visual features (font features, spatial features, text features and colour features);

#3 "Stylistic and Lexical Co-training for Web Block Classification" by C. Lee et al. (2004): a useful web block division algorithm, and the possibility of co-training on the same corpus using two distinctive sets of features;

Page 34: Machine Learning in DIADEM (Andrey Kravchenko)

Lessons learnt from the Reading Course

#4 "Can we learn a template-independent wrapper for news article extraction from a single training site?" by J. Wang et al. (2009): a distinctive set of features for news title extraction, many of which can be used for property title extraction in DIADEM;

#5 "Efficient record-level wrapper induction" by S. Zheng et al. (2009): a new record-level approach to extraction that performs much better and faster than page-level approaches, and can be useful for DIADEM extraction in record-heavy domains;

#6 "Towards combining Web classification and Web Information Extraction: a case study" by P. Luo et al. (2009): the backward dependency between these two tasks can work as well, so it is worthwhile to experiment with their mutual tie-up.

Page 35: Machine Learning in DIADEM (Andrey Kravchenko)

General lessons learnt

- Most of the papers are recent or very recent (2004-2009);
- Features play a much more important role than algorithms;
- Initial page segmentation into blocks can help with the subsequent determination of relevant DOM subtrees;

- All features can be broadly divided into content features and visual features;

- The news domain is a very popular one (3 out of 5 reviewed systems); there is no mention of real estate in any of the papers.

Page 36: Machine Learning in DIADEM (Andrey Kravchenko)

Summary of the Reading Course and its relevance to DIADEM

- The six presented papers are relevant to all three areas of my current research:
  - Real estate page classification;
  - Output/input page distinction;
  - Classification of property page elements;

- The most obvious synergy is with Omer's NLP work, although cross-sections with Cheng's and Xiaonan's work are also possible;

- I plan to use a subset of the features presented in these papers for the classification of the elements of output pages and the subsequent real estate page classification.

Page 37: Machine Learning in DIADEM (Andrey Kravchenko)

Thank you for your attention!