Result Page Analysis (Cheng Wang)
Cheng Wang
- A list of results decorated with:
  - Side bars
  - Branding banners
  - Advertisements
  - Merchant information
  - Search forms
  - Navigation parts
- Data area identification
- Record segmentation
- Data alignment
- Visual information
  - ViDE, VIPER
- Ontology
  - ODE
- HTML page based
  - FiVaTech
- Regular expression
  - EXALG, DELA
- Weifeng Su, Jiying Wang, Frederick H. Lochovsky. 2009.
- Step 1: Domain ontology construction
  - Query interface
  - Query result pages
- Step 2: Data extraction using the ontology
  - Identify the data area
  - Segment records
  - Align data values
- Multiple query result pages
  - PADE
1. Match query interface elements to data values.
   - title="%orientalism%"
2. Search for voluntary labels in table headers.
3. Search for voluntary labels encoded together with data values.
   - ISBN No: 0814756654
   - ISBN No: 0789204592
4. Data value formats
   - 18/09/2008 : 20080918
   - 03/18/98 : 19980318
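Steps 3 and 4 above can be illustrated with a minimal Python sketch. It is not the ODE implementation: the function names, the label regular expression, and the list of candidate date layouts are all assumptions made for illustration. It splits a value that carries its own label ("ISBN No: ...") and rewrites the two date layouts shown above into the compact YYYYMMDD form.

```python
import re
from datetime import datetime

# Candidate input layouts, tried in order (an assumption of this sketch).
# Day-first is tried before month-first, so an ambiguous value such as
# 03/04/2008 would be read as 3 April 2008.
DATE_FORMATS = ("%d/%m/%Y", "%m/%d/%y")

# Hypothetical pattern for "Label: value" strings; a label must start with a letter.
LABELED_VALUE = re.compile(r"^\s*([A-Za-z][A-Za-z .]*?)\s*:\s*(\S.*?)\s*$")


def normalize_date(text):
    """Return the date as YYYYMMDD, or None if no candidate layout matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).strftime("%Y%m%d")
        except ValueError:
            continue
    return None


def split_label(text):
    """Split 'Label: value' into (label, value); label is None if absent."""
    match = LABELED_VALUE.match(text)
    if match:
        return match.group(1), match.group(2)
    return None, text.strip()


print(split_label("ISBN No: 0814756654"))
print(normalize_date("18/09/2008"), normalize_date("03/18/98"))
```

Trying layouts in a fixed order is the simplest choice; a real system would more likely infer the layout from all values in a column rather than from one value at a time.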
1. Value-level matching
   - Data value similarity
2. Label-level matching
   - Label co-occurrence
3. Label-value matching
   - Check assigned label
   - Assign a suitable label for columns
   - Matching conflict resolution
1. Matching is unique → create attribute
2. Matching is 1:1 → alias
   - Category : Subject
3. Matching is 1:n → n+1 attributes
   - Author: {Last Name, First Name}
4. Matching is n:m → n:1 + 1:m
- One result page → one data area
- Maximum Entropy Model
  - Maximum correlation subtree identification
- 1 result
- Several results (CABABABAD)
  - Find continuous repeated patterns
  - Visual gap
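Finding a continuous repeated pattern in a symbol string such as CABABABAD can be sketched as a brute-force search for the tandem repeat that covers the most symbols. This is an illustrative sketch, not the segmentation algorithm of any of the systems named here; the function name and the reading of each letter as one subtree or tag symbol are assumptions.

```python
def find_tandem_repeat(seq, min_repeats=2):
    """Return (unit, start, repeats) for the back-to-back repetition that
    covers the most symbols, or None if nothing repeats.

    Ties are resolved in favour of the shorter unit and the earlier start.
    """
    best = None  # (covered_length, unit, start, repeats)
    n = len(seq)
    for period in range(1, n // min_repeats + 1):
        for start in range(0, n - period * min_repeats + 1):
            unit = seq[start:start + period]
            repeats = 1
            pos = start + period
            # Count how many copies of the unit follow immediately.
            while seq[pos:pos + period] == unit:
                repeats += 1
                pos += period
            covered = repeats * period
            if repeats >= min_repeats and (best is None or covered > best[0]):
                best = (covered, unit, start, repeats)
    return None if best is None else best[1:]


print(find_tandem_repeat("CABABABAD"))
```

For CABABABAD the units AB (from index 1) and BA (from index 2) both repeat three times, so sequence information alone cannot say where a record starts. The sketch simply keeps the first candidate; a cue such as the visual gap listed above is one way to choose between them.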
- Each data value is assigned a label
  - Maximum Entropy Model
  - Match with ontology
- Label → column
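Once every data value carries a label, mapping labels to columns amounts to building a table with one column per label. The sketch below assumes each record is a list of (label, value) pairs with at most one value per label; the function name and the sample records are hypothetical.

```python
def align_by_label(records):
    """Turn labeled records into (columns, rows), one column per label.

    Columns appear in order of first occurrence; a record that lacks a
    label gets None in that column.
    """
    columns = []
    for record in records:
        for label, _value in record:
            if label not in columns:
                columns.append(label)
    rows = []
    for record in records:
        by_label = dict(record)  # assumes one value per label in a record
        rows.append([by_label.get(label) for label in columns])
    return columns, rows


# Hypothetical labeled records, as they might come out of the labeling step.
records = [
    [("Title", "Orientalism"), ("ISBN", "0814756654")],
    [("Title", "Another book"), ("Price", "12.00"), ("ISBN", "0789204592")],
]
columns, rows = align_by_label(records)
print(columns)
for row in rows:
    print(row)
```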
- Wei Liu, Xiaofeng Meng and Weiyi Meng. 2009.
- ViDRE: Data Record Extractor
- ViDIE: Data Item Extractor
- New measure: revision
1. Build a visual block tree
2. Extract data records
   - Noise block filtering
   - Block clustering
   - Block regrouping
3. Partition data records into data items and align them
- Mandatory data items
- Optional data items
- Static data items
- Simple one-pass clustering algorithm
  - Take the first block from the list and use it to form a cluster.
  - For each remaining block, compute its similarity to the existing clusters.
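A one-pass clustering loop of this kind can be sketched as follows. The slide stops at computing similarities, so the decision rule used here (join the most similar cluster if its average similarity reaches a threshold, otherwise start a new cluster) is the usual one-pass convention and an assumption, as are the function names, the threshold value, and the toy similarity function.

```python
def one_pass_cluster(blocks, similarity, threshold=0.8):
    """Cluster blocks in a single pass over the list.

    The first block forms the first cluster. Each later block joins the
    cluster it is most similar to on average, provided that similarity
    reaches the threshold; otherwise it starts a new cluster.
    """
    clusters = []
    for block in blocks:
        best_cluster, best_score = None, threshold
        for cluster in clusters:
            score = sum(similarity(block, member) for member in cluster) / len(cluster)
            if score >= best_score:
                best_cluster, best_score = cluster, score
        if best_cluster is None:
            clusters.append([block])
        else:
            best_cluster.append(block)
    return clusters


# Toy blocks described by (kind, font size); purely illustrative features.
def same_appearance(a, b):
    return 1.0 if a == b else 0.0


blocks = [("img", 0), ("text", 12), ("img", 0), ("text", 12), ("text", 9)]
clusters = one_pass_cluster(blocks, same_appearance)
print(clusters)
```

Because each block is examined once, the result depends on the order of the input list; that is the price of the single pass.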
- ViDE assumes:
  1. Blocks in the same cluster all come from different data records.
  2. The cluster with the maximum number n of blocks may contain the mandatory values of the data records.
- Step 1: Rearrange the blocks in each cluster.
- Step 2: Use a cluster with n blocks as the seed. Initialize n groups, each containing one seed block.
- Step 3: For every block (in all clusters), determine which group it belongs to.
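The three steps can be sketched in Python under strong simplifying assumptions: each block is a dictionary with a hypothetical `top` coordinate, "rearranging" means sorting by that coordinate, the seed is the largest cluster, and a block belongs to the group whose seed block is vertically nearest. The slides do not state how group membership is decided, so that last rule is only a stand-in.

```python
def regroup(clusters):
    """Regroup clustered blocks into one group per data record."""
    # Step 1: order the blocks of every cluster by vertical position.
    ordered = [sorted(cluster, key=lambda b: b["top"]) for cluster in clusters]

    # Step 2: the largest cluster (n blocks) seeds n groups.
    seed = max(ordered, key=len)
    groups = [[seed_block] for seed_block in seed]

    # Step 3: attach every non-seed block to the group with the nearest seed.
    for cluster in ordered:
        if cluster is seed:
            continue
        for block in cluster:
            nearest = min(
                range(len(seed)),
                key=lambda i: abs(seed[i]["top"] - block["top"]),
            )
            groups[nearest].append(block)
    return groups


# Hypothetical blocks: three titles (seed cluster) and two prices.
titles = [{"id": "t1", "top": 10}, {"id": "t2", "top": 110}, {"id": "t3", "top": 210}]
prices = [{"id": "p3", "top": 240}, {"id": "p1", "top": 40}]
groups = regroup([titles, prices])
print([[b["id"] for b in g] for g in groups])
```

In this toy input the second record has no price block, which mirrors the case of an optional data item: its group contains only the seed block.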
- WDBt: total number of web databases processed
- WDBc: number of web databases whose precision and recall are both 100%
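The slides give the two counts but not the formula that combines them. Assuming that revision is the share of web databases that are not extracted perfectly, that is (WDBt - WDBc) / WDBt, it can be computed as below. The function name and the example counts are hypothetical and are not results from the paper.

```python
def revision(wdb_total, wdb_perfect):
    """Share of web databases whose extraction still needs manual revision.

    Assumed definition: (WDBt - WDBc) / WDBt.
    """
    if wdb_total <= 0:
        raise ValueError("wdb_total must be positive")
    if not 0 <= wdb_perfect <= wdb_total:
        raise ValueError("wdb_perfect must lie between 0 and wdb_total")
    return (wdb_total - wdb_perfect) / wdb_total


# Hypothetical counts: 100 databases processed, 85 extracted perfectly.
print(revision(100, 85))
```

Under this reading a lower value is better, in contrast to precision and recall.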
[Figure: tree diagram with Root at the top, Data Area (LCA) below it, and Record nodes divided by Separator nodes beneath the data area]
- Real-estate domain
- 60 agents' websites
  - MRP: 95.0%
  - ERP: 90.0%
[Figure: tree diagram with Root, Data Area, and three records (Record 1, Record 2, Record 3), each split into Part A and Part B]
- DIADEM 0.1:
  - Construct a real-estate result page ontology
  - Ontological record segmentation
    - (More features)
  - Data labeling and data alignment
- After:
  - Add visual information