![Page 1: ANHAI DOAN ALON HALEVY ZACHARY IVES Chapter 15: Data Integration on the Web PRINCIPLES OF DATA INTEGRATION.](https://reader036.fdocuments.net/reader036/viewer/2022062421/56649da15503460f94a8ddec/html5/thumbnails/1.jpg)
ANHAI DOAN ALON HALEVY ZACHARY IVES
Chapter 15: Data Integration on the Web
PRINCIPLES OF
DATA INTEGRATION
![Page 2: ANHAI DOAN ALON HALEVY ZACHARY IVES Chapter 15: Data Integration on the Web PRINCIPLES OF DATA INTEGRATION.](https://reader036.fdocuments.net/reader036/viewer/2022062421/56649da15503460f94a8ddec/html5/thumbnails/2.jpg)
Outline
Introduction, opportunities and challenges with Web data
The Deep Web
  Vertical search
  Surfacing the Deep Web
Creating topical portals
Lightweight data management on the Web
  Discovery of data sets
  Extracting data from Web pages
  Combining multiple data sets
  Re-using others’ work
![Page 3: ANHAI DOAN ALON HALEVY ZACHARY IVES Chapter 15: Data Integration on the Web PRINCIPLES OF DATA INTEGRATION.](https://reader036.fdocuments.net/reader036/viewer/2022062421/56649da15503460f94a8ddec/html5/thumbnails/3.jpg)
Broad Range of Data on the Web
![Page 4: ANHAI DOAN ALON HALEVY ZACHARY IVES Chapter 15: Data Integration on the Web PRINCIPLES OF DATA INTEGRATION.](https://reader036.fdocuments.net/reader036/viewer/2022062421/56649da15503460f94a8ddec/html5/thumbnails/4.jpg)
Key Characteristics
Scale and heterogeneity
  Data is about everything!
  Overlapping sources, varying levels of quality.
  Multiple formats (tables, lists, cards, etc.)
Data is laid out for visual appeal
  Extracting the data is very tricky!
  Semantics of the data are rarely specified and need to be inferred from text and other clues.
![Page 5: ANHAI DOAN ALON HALEVY ZACHARY IVES Chapter 15: Data Integration on the Web PRINCIPLES OF DATA INTEGRATION.](https://reader036.fdocuments.net/reader036/viewer/2022062421/56649da15503460f94a8ddec/html5/thumbnails/5.jpg)
Different Forms of Structured Data on the Web
![Page 6: ANHAI DOAN ALON HALEVY ZACHARY IVES Chapter 15: Data Integration on the Web PRINCIPLES OF DATA INTEGRATION.](https://reader036.fdocuments.net/reader036/viewer/2022062421/56649da15503460f94a8ddec/html5/thumbnails/6.jpg)
Tables: hundreds of millions of good ones
![Page 7: ANHAI DOAN ALON HALEVY ZACHARY IVES Chapter 15: Data Integration on the Web PRINCIPLES OF DATA INTEGRATION.](https://reader036.fdocuments.net/reader036/viewer/2022062421/56649da15503460f94a8ddec/html5/thumbnails/7.jpg)
Databases Behind Forms: The Deep/Invisible Web
Examples: store locations, used cars, radio stations, patents, recipes
Tens of millions of high-quality forms
![Page 8: ANHAI DOAN ALON HALEVY ZACHARY IVES Chapter 15: Data Integration on the Web PRINCIPLES OF DATA INTEGRATION.](https://reader036.fdocuments.net/reader036/viewer/2022062421/56649da15503460f94a8ddec/html5/thumbnails/8.jpg)
HTML Lists
Every list item is a row in a table, but figuring out cell boundaries is very tricky.
![Page 9: ANHAI DOAN ALON HALEVY ZACHARY IVES Chapter 15: Data Integration on the Web PRINCIPLES OF DATA INTEGRATION.](https://reader036.fdocuments.net/reader036/viewer/2022062421/56649da15503460f94a8ddec/html5/thumbnails/9.jpg)
Structured data embedded more loosely in pages. Extraction is very tricky!
![Page 10: ANHAI DOAN ALON HALEVY ZACHARY IVES Chapter 15: Data Integration on the Web PRINCIPLES OF DATA INTEGRATION.](https://reader036.fdocuments.net/reader036/viewer/2022062421/56649da15503460f94a8ddec/html5/thumbnails/10.jpg)
What Can we do with Structured Web Data?
Integrate:
  Imagine integrating your data with any data on the Web!
  Insights come when independently developed data sets come together (of course, you can also get garbage that way, so you need to be careful).
Improve web search
  Find tables & lists when they’re relevant to queries
  Answer fact-seeking queries with facts rather than links to Web pages.
  Aggregate: answer “total GDP of 10 largest countries” by putting together facts from multiple pages
![Page 11: ANHAI DOAN ALON HALEVY ZACHARY IVES Chapter 15: Data Integration on the Web PRINCIPLES OF DATA INTEGRATION.](https://reader036.fdocuments.net/reader036/viewer/2022062421/56649da15503460f94a8ddec/html5/thumbnails/11.jpg)
Bigger Vision: create an ecosystem of structured data on the Web
Discover via search
Extract from Web sources
Manage, analyze, visualize, integrate, create compelling stories
Publish back to the Web
![Page 12: ANHAI DOAN ALON HALEVY ZACHARY IVES Chapter 15: Data Integration on the Web PRINCIPLES OF DATA INTEGRATION.](https://reader036.fdocuments.net/reader036/viewer/2022062421/56649da15503460f94a8ddec/html5/thumbnails/12.jpg)
![Page 13: ANHAI DOAN ALON HALEVY ZACHARY IVES Chapter 15: Data Integration on the Web PRINCIPLES OF DATA INTEGRATION.](https://reader036.fdocuments.net/reader036/viewer/2022062421/56649da15503460f94a8ddec/html5/thumbnails/13.jpg)
What is the Deep Web?
Content hidden behind HTML forms, not accessible to search engines.
![Page 14: ANHAI DOAN ALON HALEVY ZACHARY IVES Chapter 15: Data Integration on the Web PRINCIPLES OF DATA INTEGRATION.](https://reader036.fdocuments.net/reader036/viewer/2022062421/56649da15503460f94a8ddec/html5/thumbnails/14.jpg)
The Deep Web
The collection of databases that are accessed by users entering values into HTML forms.
Search-engine crawlers cannot fill in the forms, so the content is invisible to the search engine.
The work on the Deep Web illustrates many of the challenges of managing Web data.
![Page 15: ANHAI DOAN ALON HALEVY ZACHARY IVES Chapter 15: Data Integration on the Web PRINCIPLES OF DATA INTEGRATION.](https://reader036.fdocuments.net/reader036/viewer/2022062421/56649da15503460f94a8ddec/html5/thumbnails/15.jpg)
Two Approaches to the Deep Web
Build a vertical search engine:
  Apply all the data integration techniques we’ve learned so far to a set of data sources such as job sites, airplane reservations, etc.
  The approach is applicable to domains that have thousands of form sites.
Surface the content:
  Try to guess good queries to pose to the forms. Insert the resulting HTML pages into the Web index.
  The approach covers the long tail of content on the Web.
![Page 16: ANHAI DOAN ALON HALEVY ZACHARY IVES Chapter 15: Data Integration on the Web PRINCIPLES OF DATA INTEGRATION.](https://reader036.fdocuments.net/reader036/viewer/2022062421/56649da15503460f94a8ddec/html5/thumbnails/16.jpg)
Approach #1: Vertical Search: Data Integration
![Page 17: ANHAI DOAN ALON HALEVY ZACHARY IVES Chapter 15: Data Integration on the Web PRINCIPLES OF DATA INTEGRATION.](https://reader036.fdocuments.net/reader036/viewer/2022062421/56649da15503460f94a8ddec/html5/thumbnails/17.jpg)
Vertical Search as Data Integration
Mediated schema: the properties of the domain that need to be exposed to the user
  If you include too many attributes in the mediated schema, you may not be able to query them on many sources.
Source descriptions: relatively simple. Sources are often distinguished by their geographical coverage.
Wrappers:
  Parsing the answers from the resulting HTML is the tricky part.
  Alternate approach: don’t parse the answers. Just show the user the returned web pages.
![Page 18: ANHAI DOAN ALON HALEVY ZACHARY IVES Chapter 15: Data Integration on the Web PRINCIPLES OF DATA INTEGRATION.](https://reader036.fdocuments.net/reader036/viewer/2022062421/56649da15503460f94a8ddec/html5/thumbnails/18.jpg)
Deep Web: the Long Tail
Examples: tree search, Amish quilts, parking tickets in India, horses
![Page 19: ANHAI DOAN ALON HALEVY ZACHARY IVES Chapter 15: Data Integration on the Web PRINCIPLES OF DATA INTEGRATION.](https://reader036.fdocuments.net/reader036/viewer/2022062421/56649da15503460f94a8ddec/html5/thumbnails/19.jpg)
The Surfacing Approach
Crawl & indexing time
  Pre-compute interesting form submissions
  Insert resulting pages into the Web index
Query time: nothing!
  Deep web URLs in the index are like any other URL
Advantages
  Reuse existing search engine infrastructure
  Reduced load on target web sites – users click only on what they deem relevant.
Approach taken at Google for the long tail.
![Page 20: ANHAI DOAN ALON HALEVY ZACHARY IVES Chapter 15: Data Integration on the Web PRINCIPLES OF DATA INTEGRATION.](https://reader036.fdocuments.net/reader036/viewer/2022062421/56649da15503460f94a8ddec/html5/thumbnails/20.jpg)
Surfacing Challenges
1. Predicting the correct input combinations
  Generating all possible URLs is wasteful and unnecessary
  Cars.com has ~500K listings, but 250M possible queries
2. Predicting the appropriate values for text inputs
  Valid input values are required for retrieving data
  Ingredients in recipes.com and zipcodes in borderstores.com
3. Don’t do anything bad!
4. Coverage of the crawl: don’t try to cover sites in their entirety, it’s not necessary.
  1. Once you get part of the content, there will be links to the rest
  2. It’s enough to have part of the content in the index to send it relevant traffic.
![Page 21: ANHAI DOAN ALON HALEVY ZACHARY IVES Chapter 15: Data Integration on the Web PRINCIPLES OF DATA INTEGRATION.](https://reader036.fdocuments.net/reader036/viewer/2022062421/56649da15503460f94a8ddec/html5/thumbnails/21.jpg)
Form Processing 101
GET and POST: types of HTML forms
  Only GETs can be surfaced
<form action="http://www.borders.com/locator" method="GET">
  <select name="store"><option …/>… </select>
  …
  <input name="zip" type="text"/>
  <input name="search" type="submit" value="Go"/>
  <input name="site" type="hidden" value="homepage"/>
</form>
On submit, the browser requests the URL:
http://www.borders.com/locator?store=All&city=&state=&zip=94043&within=25&search=Go&site=homepage
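A minimal sketch of why GET forms can be surfaced: the submission is just a URL whose query string lists the input names and values. The helper name `surfaced_url` is hypothetical; the input names and values are copied from the example URL on this slide.

```python
from urllib.parse import urlencode

def surfaced_url(action, inputs):
    """Build the URL a browser would request when a GET form is submitted."""
    return action + "?" + urlencode(inputs)

# Input names and values taken from the example URL on the slide.
inputs = [
    ("store", "All"), ("city", ""), ("state", ""), ("zip", "94043"),
    ("within", "25"), ("search", "Go"), ("site", "homepage"),
]
url = surfaced_url("http://www.borders.com/locator", inputs)
print(url)
```

A POST form sends the same name/value pairs in the request body, so there is no URL that could be stored in a Web index.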
![Page 22: ANHAI DOAN ALON HALEVY ZACHARY IVES Chapter 15: Data Integration on the Web PRINCIPLES OF DATA INTEGRATION.](https://reader036.fdocuments.net/reader036/viewer/2022062421/56649da15503460f94a8ddec/html5/thumbnails/22.jpg)
Google's Deep-Web Crawl (VLDB 2008)
Predicting Input Combinations
  Forms can have multiple inputs
  Generating all possible URLs is wasteful! … and unnecessary!
Goal: minimize URLs while maximizing retrieval!
Other considerations
  Generated URLs must be good candidates for index
  Only need URLs sufficient to drive traffic
  Only need URLs sufficient to seed the web crawler
Solution: discover only informative input combinations.
![Page 23: ANHAI DOAN ALON HALEVY ZACHARY IVES Chapter 15: Data Integration on the Web PRINCIPLES OF DATA INTEGRATION.](https://reader036.fdocuments.net/reader036/viewer/2022062421/56649da15503460f94a8ddec/html5/thumbnails/23.jpg)
Informative Form Fields
http://jobs.shrm.org/search?state=All&kw=&type=All
http://jobs.shrm.org/search?state=AL&kw=&type=All
http://jobs.shrm.org/search?state=AK&kw=&type=All
…
http://jobs.shrm.org/search?state=WV&kw=&type=All
Result pages different: informative

http://jobs.shrm.org/search?state=All&kw=&type=ALL
http://jobs.shrm.org/search?state=All&kw=&type=ANY
http://jobs.shrm.org/search?state=All&kw=&type=EXACT
Result pages similar: un-informative
Varying the state results in qualitatively different content, and hence it is an informative field.
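One way to make "qualitatively different content" testable is to compare crude signatures of the result pages. The sketch below is an assumption for illustration, not the method used by the crawler: the signature (a hash of the tag-stripped text), the 0.5 threshold, and the toy pages are all made up.

```python
import hashlib
import re

def page_signature(html):
    """Hash of the page text with tags removed and whitespace normalized (an assumed, crude signature)."""
    text = re.sub(r"<[^>]+>", " ", html)
    text = " ".join(text.lower().split())
    return hashlib.sha1(text.encode("utf-8")).hexdigest()

def is_informative(result_pages, threshold=0.5):
    """A field is treated as informative if enough of its result pages are distinct."""
    if not result_pages:
        return False
    signatures = {page_signature(page) for page in result_pages}
    return len(signatures) / len(result_pages) >= threshold

# Toy result pages: varying `state` changes the content, varying `type` does not.
state_pages = ["<p>Jobs in AL</p>", "<p>Jobs in AK</p>", "<p>Jobs in WV</p>"]
type_pages = ["<p>All jobs</p>", "<p>All jobs</p>", "<p>All  jobs</p>"]

print(is_informative(state_pages), is_informative(type_pages))
```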
![Page 24: ANHAI DOAN ALON HALEVY ZACHARY IVES Chapter 15: Data Integration on the Web PRINCIPLES OF DATA INTEGRATION.](https://reader036.fdocuments.net/reader036/viewer/2022062421/56649da15503460f94a8ddec/html5/thumbnails/24.jpg)
Computing Informative Field Combinations
Informative field combinations can be computed bottom up:
  Begin with single fields and find which ones are informative.
  For every informative combination, try to add another field and check if the resulting combination is still informative.
In practice, we rarely need combinations of more than 3 fields.
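The bottom-up search can be sketched as follows. The function name, the `max_size` cap of 3 (taken from the remark above), and the oracle `toy_oracle` are assumptions; in a real crawler the oracle would submit the form and inspect the result pages.

```python
def informative_combinations(fields, is_informative, max_size=3):
    """Bottom-up search: only combinations that are already informative get extended."""
    current = [frozenset([f]) for f in fields if is_informative(frozenset([f]))]
    found = list(current)
    for _ in range(max_size - 1):
        # Extend each informative combination by one field not yet in it.
        candidates = {combo | {f} for combo in current for f in fields if f not in combo}
        current = [c for c in sorted(candidates, key=sorted) if is_informative(c)]
        found.extend(current)
    return found

# Hypothetical oracle: any combination that varies `type` is un-informative.
def toy_oracle(combo):
    return "type" not in combo

found = informative_combinations(["state", "kw", "type"], toy_oracle)
print([sorted(c) for c in found])
```

Because un-informative combinations are never extended, most of the exponential space of field subsets is never probed.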
![Page 25: ANHAI DOAN ALON HALEVY ZACHARY IVES Chapter 15: Data Integration on the Web PRINCIPLES OF DATA INTEGRATION.](https://reader036.fdocuments.net/reader036/viewer/2022062421/56649da15503460f94a8ddec/html5/thumbnails/25.jpg)
Challenge 2: Generic and Typed Text boxes
Generic Search Boxes
  Accept any keywords
  Challenge: selecting the most appropriate values
Typed Text Boxes
  Only values belonging to specific types, e.g., zipcodes
  Challenge: selecting the type of the input
![Page 26: ANHAI DOAN ALON HALEVY ZACHARY IVES Chapter 15: Data Integration on the Web PRINCIPLES OF DATA INTEGRATION.](https://reader036.fdocuments.net/reader036/viewer/2022062421/56649da15503460f94a8ddec/html5/thumbnails/26.jpg)
Example: www.wipo.int
![Page 27: ANHAI DOAN ALON HALEVY ZACHARY IVES Chapter 15: Data Integration on the Web PRINCIPLES OF DATA INTEGRATION.](https://reader036.fdocuments.net/reader036/viewer/2022062421/56649da15503460f94a8ddec/html5/thumbnails/27.jpg)
Input values for Generic Search
Iterative Probing for search boxes
  Select an initial list of candidate keywords
  Download pages based on current set of keywords
  Extract more candidate keywords from result pages
  Refine the current set of keywords
  Repeat until no more new candidate keywords
  Prune list of candidate keywords
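A rough sketch of the probing loop, under stated assumptions: `fetch_results` is a hypothetical callable standing in for "submit the keyword to the search box and download the result pages", and pruning is simplified to keeping the most frequent new words per round.

```python
import re
from collections import Counter

def iterative_probing(seed_keywords, fetch_results, max_rounds=5, per_round=10):
    """Grow a keyword set by mining the result pages of earlier probes."""
    keywords = set(seed_keywords)
    frontier = list(seed_keywords)
    for _ in range(max_rounds):
        if not frontier:
            break  # no new candidate keywords
        counts = Counter()
        for kw in frontier:
            for page in fetch_results(kw):
                counts.update(re.findall(r"[a-z]{4,}", page.lower()))
        # Prune: keep only the most frequent words not seen before.
        new = [w for w, _ in counts.most_common() if w not in keywords][:per_round]
        keywords.update(new)
        frontier = new
    return keywords

# Toy stand-in for a form-backed database.
corpus = ["protein antibody assay", "antibody pyrazole compound", "metalworking lathe"]

def toy_fetch(keyword):
    return [doc for doc in corpus if keyword in doc.split()]

result = iterative_probing(["protein"], toy_fetch)
print(sorted(result))
```

In the toy corpus the third document is never reached from the seed, which is the usual reason to start from a broad initial keyword list.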
![Page 28: ANHAI DOAN ALON HALEVY ZACHARY IVES Chapter 15: Data Integration on the Web PRINCIPLES OF DATA INTEGRATION.](https://reader036.fdocuments.net/reader036/viewer/2022062421/56649da15503460f94a8ddec/html5/thumbnails/28.jpg)
Example: www.wipo.int
Metalworking, Protein, Antibody, Pyrazole, Immobilizer, Vasoconstriction, Phosphinates, Nosepiece, Sandbridge, Viscosity, Carboxydiphenylsulphide, Ozonizer, …
![Page 29: ANHAI DOAN ALON HALEVY ZACHARY IVES Chapter 15: Data Integration on the Web PRINCIPLES OF DATA INTEGRATION.](https://reader036.fdocuments.net/reader036/viewer/2022062421/56649da15503460f94a8ddec/html5/thumbnails/29.jpg)
![Page 30: ANHAI DOAN ALON HALEVY ZACHARY IVES Chapter 15: Data Integration on the Web PRINCIPLES OF DATA INTEGRATION.](https://reader036.fdocuments.net/reader036/viewer/2022062421/56649da15503460f94a8ddec/html5/thumbnails/30.jpg)
Topical Portals
An integrated view of a topic:
  E.g., all info about database researchers, all info about coffees and their growing regions.
Topical portals find different aspects of the same objects on different sources
  E.g., publications of a person may come from one source, while their job affiliations may come from another
In contrast, vertical search integrates similar objects from multiple sources
  E.g., job listings, apartments for rent, …
![Page 31: ANHAI DOAN ALON HALEVY ZACHARY IVES Chapter 15: Data Integration on the Web PRINCIPLES OF DATA INTEGRATION.](https://reader036.fdocuments.net/reader036/viewer/2022062421/56649da15503460f94a8ddec/html5/thumbnails/31.jpg)
Topical Portal Example: Integrated Page for an Entity
![Page 32: ANHAI DOAN ALON HALEVY ZACHARY IVES Chapter 15: Data Integration on the Web PRINCIPLES OF DATA INTEGRATION.](https://reader036.fdocuments.net/reader036/viewer/2022062421/56649da15503460f94a8ddec/html5/thumbnails/32.jpg)
Building a Topical Portal
Approach #1:
  Perform a focused crawl of the Web to find pages on the topic
    Use word signatures as a method for determining the topic of a page.
  Use information extraction techniques to get the data out of the pages.
  Perform reference resolution and schema matching to create a cleaner set of data.
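A word signature can be approximated by a bag of topic words, with a page judged on-topic when its own bag of words is close enough to it. This is only a sketch: the cosine measure, the 0.2 threshold, and the example texts are assumptions, not something the slides specify.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def on_topic(page_text, signature, threshold=0.2):
    """Decide whether a crawled page matches the topic's word signature."""
    return cosine(bag_of_words(page_text), signature) >= threshold

# Hypothetical signature for a portal about database researchers.
signature = bag_of_words("database query schema integration sigmod vldb research")

page_a = "Her research covers schema matching and data integration, published at VLDB"
page_b = "Best espresso beans and coffee growing regions"
print(on_topic(page_a, signature), on_topic(page_b, signature))
```

A focused crawler would follow links only from pages that pass this test.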
![Page 33: ANHAI DOAN ALON HALEVY ZACHARY IVES Chapter 15: Data Integration on the Web PRINCIPLES OF DATA INTEGRATION.](https://reader036.fdocuments.net/reader036/viewer/2022062421/56649da15503460f94a8ddec/html5/thumbnails/33.jpg)
Creating a Topical Portal
Approach #2:
  Start with a set of well known sites in the domain
  Create an initial schema for the domain (the properties you’re interested in modeling)
  Create extractors for pages on the known sites
    Note: extractors will be more accurate because they were created for the sites themselves
  Result: a good basis of entities and relationships to build on.
  Extend the initial data set:
    Follow references from the initial set of chosen pages
    Use collaboration (of people in the community) to find additional data and to correct extractions.
![Page 34: ANHAI DOAN ALON HALEVY ZACHARY IVES Chapter 15: Data Integration on the Web PRINCIPLES OF DATA INTEGRATION.](https://reader036.fdocuments.net/reader036/viewer/2022062421/56649da15503460f94a8ddec/html5/thumbnails/34.jpg)
![Page 35: ANHAI DOAN ALON HALEVY ZACHARY IVES Chapter 15: Data Integration on the Web PRINCIPLES OF DATA INTEGRATION.](https://reader036.fdocuments.net/reader036/viewer/2022062421/56649da15503460f94a8ddec/html5/thumbnails/35.jpg)
Lightweight Combination of Web Data
With such a vast collection of data, we would like to enable easy data integration.
  Imagine a school student combining her data about bird species with a country population table found on the Web
  A journalist creating a news story with data about riots in the UK and needing to combine it with demographic data
  …
Many data integration tasks are transient: the result will be used for a short period of time only
  Hence, creating the integrated data must be easy.
  Creating a mediated schema and mappings is too tedious.
![Page 36: ANHAI DOAN ALON HALEVY ZACHARY IVES Chapter 15: Data Integration on the Web PRINCIPLES OF DATA INTEGRATION.](https://reader036.fdocuments.net/reader036/viewer/2022062421/56649da15503460f94a8ddec/html5/thumbnails/36.jpg)
Challenges to Data Integration on the Web
Discovering data on the Web (search engines are optimized for documents, not tables or lists)
Extracting the data from the Web pages into a form that can be processed
Combining multiple data sets
Unique opportunities on the Web: re-use work of others!
![Page 37: ANHAI DOAN ALON HALEVY ZACHARY IVES Chapter 15: Data Integration on the Web PRINCIPLES OF DATA INTEGRATION.](https://reader036.fdocuments.net/reader036/viewer/2022062421/56649da15503460f94a8ddec/html5/thumbnails/37.jpg)
Not a great result!
![Page 38: ANHAI DOAN ALON HALEVY ZACHARY IVES Chapter 15: Data Integration on the Web PRINCIPLES OF DATA INTEGRATION.](https://reader036.fdocuments.net/reader036/viewer/2022062421/56649da15503460f94a8ddec/html5/thumbnails/38.jpg)
But the data does exist out there!
![Page 39: ANHAI DOAN ALON HALEVY ZACHARY IVES Chapter 15: Data Integration on the Web PRINCIPLES OF DATA INTEGRATION.](https://reader036.fdocuments.net/reader036/viewer/2022062421/56649da15503460f94a8ddec/html5/thumbnails/39.jpg)
Discovering Data on the Web
Search engines are optimized for documents
  E.g., proximity of terms matters in ranking. In tables, the schema applies to all rows. “zambia” is far from “population” in a document containing population data, but should be considered close.
  No special attention is given to schema rows (if they can be detected) or columns closer to the left of the table (that are often the “subject” of the table).
Tables with high quality data look like ones that are used for formatting.
  Over 99% of the HTML tables on the Web are not high quality data tables!
![Page 40: ANHAI DOAN ALON HALEVY ZACHARY IVES Chapter 15: Data Integration on the Web PRINCIPLES OF DATA INTEGRATION.](https://reader036.fdocuments.net/reader036/viewer/2022062421/56649da15503460f94a8ddec/html5/thumbnails/40.jpg)
Challenges to Discovering the Semantics of Structured Data on the Web
![Page 41: ANHAI DOAN ALON HALEVY ZACHARY IVES Chapter 15: Data Integration on the Web PRINCIPLES OF DATA INTEGRATION.](https://reader036.fdocuments.net/reader036/viewer/2022062421/56649da15503460f94a8ddec/html5/thumbnails/41.jpg)
Semantics Embedded in Surrounding Text
Topic of table is in the text, and the token “2006” is crucial to understanding the data.
![Page 42: ANHAI DOAN ALON HALEVY ZACHARY IVES Chapter 15: Data Integration on the Web PRINCIPLES OF DATA INTEGRATION.](https://reader036.fdocuments.net/reader036/viewer/2022062421/56649da15503460f94a8ddec/html5/thumbnails/42.jpg)
No schema, but the table is perfectly understandable by people.
![Page 43: ANHAI DOAN ALON HALEVY ZACHARY IVES Chapter 15: Data Integration on the Web PRINCIPLES OF DATA INTEGRATION.](https://reader036.fdocuments.net/reader036/viewer/2022062421/56649da15503460f94a8ddec/html5/thumbnails/43.jpg)
Structured Data can be Plain Complicated!
![Page 44: ANHAI DOAN ALON HALEVY ZACHARY IVES Chapter 15: Data Integration on the Web PRINCIPLES OF DATA INTEGRATION.](https://reader036.fdocuments.net/reader036/viewer/2022062421/56649da15503460f94a8ddec/html5/thumbnails/44.jpg)
HTML Tables used for Formatting
![Page 45: ANHAI DOAN ALON HALEVY ZACHARY IVES Chapter 15: Data Integration on the Web PRINCIPLES OF DATA INTEGRATION.](https://reader036.fdocuments.net/reader036/viewer/2022062421/56649da15503460f94a8ddec/html5/thumbnails/45.jpg)
“Vertical” Tables: one tuple of a bigger table
![Page 46: ANHAI DOAN ALON HALEVY ZACHARY IVES Chapter 15: Data Integration on the Web PRINCIPLES OF DATA INTEGRATION.](https://reader036.fdocuments.net/reader036/viewer/2022062421/56649da15503460f94a8ddec/html5/thumbnails/46.jpg)
Can’t Use Domain Knowledge: Data is about Everything
Examples: tree search, Amish quilts, parking tickets in India, horses
![Page 47: ANHAI DOAN ALON HALEVY ZACHARY IVES Chapter 15: Data Integration on the Web PRINCIPLES OF DATA INTEGRATION.](https://reader036.fdocuments.net/reader036/viewer/2022062421/56649da15503460f94a8ddec/html5/thumbnails/47.jpg)
Search by Tweaking Traditional Document Search
Consider new cues in ranking:
  Hits on left column
  Hits on schema (where there is one)
  Number of rows, columns
  Hits on table body
  Size of table relative to page
But we can do better: try to recover the underlying semantics of the data.
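The ranking cues listed on this slide could be combined into a single score roughly as follows. The weights, the function names, and the two toy tables (with placeholder cell values) are invented for illustration; a real ranker would learn the weights.

```python
import math

# Hypothetical weights: left-column and schema hits count more than body hits.
DEFAULT_WEIGHTS = {"left": 3.0, "schema": 2.0, "body": 1.0, "shape": 0.5, "page": 1.0}

def count_hits(terms, cells):
    cells_lower = [c.lower() for c in cells]
    return sum(1 for t in terms for c in cells_lower if t in c)

def table_score(query, header, rows, page_fraction, weights=DEFAULT_WEIGHTS):
    """Score a table for a keyword query using table-specific cues."""
    terms = query.lower().split()
    left = [r[0] for r in rows if r]
    body = [c for r in rows for c in r[1:]]
    n_cols = max((len(r) for r in rows), default=0)
    return (weights["left"] * count_hits(terms, left)
            + weights["schema"] * count_hits(terms, header)
            + weights["body"] * count_hits(terms, body)
            + weights["shape"] * math.log1p(len(rows) * n_cols)
            + weights["page"] * page_fraction)

# A data table versus a formatting table that merely mentions the query word.
score_data = table_score("zambia population", ["Country", "Population"],
                         [["Zambia", "value 1"], ["Kenya", "value 2"]], 0.3)
score_layout = table_score("zambia population", ["", ""],
                           [["Home", "About"], ["Contact", "Zambia tours"]], 0.9)
print(score_data > score_layout)
```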
![Page 48: ANHAI DOAN ALON HALEVY ZACHARY IVES Chapter 15: Data Integration on the Web PRINCIPLES OF DATA INTEGRATION.](https://reader036.fdocuments.net/reader036/viewer/2022062421/56649da15503460f94a8ddec/html5/thumbnails/48.jpg)
Recovering Table Semantics: cells on the Web are mentioned in Web text
If we see these patterns enough times, we can infer that Green Ash is a North American species
![Page 49: ANHAI DOAN ALON HALEVY ZACHARY IVES Chapter 15: Data Integration on the Web PRINCIPLES OF DATA INTEGRATION.](https://reader036.fdocuments.net/reader036/viewer/2022062421/56649da15503460f94a8ddec/html5/thumbnails/49.jpg)
Recovering Table Semantics: cells on the Web are mentioned in Web text
If we infer that a large fraction of the left column are North American tree species, we can infer that the table is about these tree species, which is not mentioned on the page!
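The two inference steps above can be sketched with a single textual pattern. This assumes the patterns have the form "<class> such as <instances>", which is a guess at what the slide images show; the sentences, thresholds, and function names are made up for the example.

```python
import re
from collections import Counter

def extract_isa(sentence):
    """Return (instance, class) pairs from '<class> such as a, b and c'."""
    text = sentence.lower().rstrip(".")
    if " such as " not in text:
        return []
    label, _, rest = text.partition(" such as ")
    instances = re.split(r",\s*(?:and\s+)?|\s+and\s+", rest)
    return [(inst.strip(), label.strip()) for inst in instances if inst.strip()]

def build_isa_index(sentences):
    index = {}
    for sentence in sentences:
        for inst, label in extract_isa(sentence):
            index.setdefault(inst, Counter())[label] += 1
    return index

def column_label(cells, index, min_support=2, min_fraction=0.5):
    """Label a column if enough of its cells were seen with the same class often enough."""
    votes = Counter()
    for cell in cells:
        for label, n in index.get(cell.lower(), {}).items():
            if n >= min_support:
                votes[label] += 1
    if not votes:
        return None
    label, covered = votes.most_common(1)[0]
    return label if covered / len(cells) >= min_fraction else None

# Invented Web sentences mentioning the table's cell values.
sentences = [
    "North American tree species such as green ash and red maple.",
    "North American tree species such as green ash, white oak, and red maple",
    "North American tree species such as white oak and red maple",
]
index = build_isa_index(sentences)
label = column_label(["Green Ash", "Red Maple", "White Oak", "Unknown Tree"], index)
print(label)
```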
![Page 50: ANHAI DOAN ALON HALEVY ZACHARY IVES Chapter 15: Data Integration on the Web PRINCIPLES OF DATA INTEGRATION.](https://reader036.fdocuments.net/reader036/viewer/2022062421/56649da15503460f94a8ddec/html5/thumbnails/50.jpg)
Extracting Data from the Page
In the case of tables, it’s fairly easy
  Main challenge: decide if there is a row with attribute names
Lists are tricky: punctuation and formatting do not always provide the right cues for partitioning a list element into cells.
Structured data in cards: in general, it’s an information extraction problem.
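For the table case, one simple heuristic for "is there a row with attribute names" is a type mismatch between the first row and the rows below it. This is an assumed heuristic for illustration only; the cell values are toy data.

```python
def looks_numeric(cell):
    try:
        float(cell.replace(",", ""))
        return True
    except ValueError:
        return False

def has_header_row(rows):
    """Guess a header row: the first row has no numbers, but some column below it is all numbers."""
    if len(rows) < 2:
        return False
    first, body = rows[0], rows[1:]
    if any(looks_numeric(cell) for cell in first):
        return False
    for col in range(len(first)):
        values = [row[col] for row in body if col < len(row)]
        if values and all(looks_numeric(v) for v in values):
            return True
    return False

with_header = [["Country", "Population"], ["A", "1,000"], ["B", "2,500"]]
without_header = [["A", "1,000"], ["B", "2,500"]]
print(has_header_row(with_header), has_header_row(without_header))
```

The heuristic says nothing about tables whose columns are all text, which is one reason the slide calls this the main challenge.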
![Page 51: ANHAI DOAN ALON HALEVY ZACHARY IVES Chapter 15: Data Integration on the Web PRINCIPLES OF DATA INTEGRATION.](https://reader036.fdocuments.net/reader036/viewer/2022062421/56649da15503460f94a8ddec/html5/thumbnails/51.jpg)
Structured Data in Cards
![Page 52: ANHAI DOAN ALON HALEVY ZACHARY IVES Chapter 15: Data Integration on the Web PRINCIPLES OF DATA INTEGRATION.](https://reader036.fdocuments.net/reader036/viewer/2022062421/56649da15503460f94a8ddec/html5/thumbnails/52.jpg)
Copy & Paste Approach: Extraction by Demonstration
Using previous slide as example.
  Start by copying “Four Barrel” into a column of a spreadsheet.
  System tries to generalize and suggest other café names: Sightglass, Blue Bottle, Ritual.
  Next, the user copies the address of Four Barrel into the next column of the spreadsheet
  System generalizes…
  Etc.
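One possible generalization rule is "everything at the same position in the page structure as the copied value". The sketch below assumes the cards share tag and class structure; the markup, class names, and placeholder addresses are invented, and real systems use richer generalization.

```python
from html.parser import HTMLParser

class PathCollector(HTMLParser):
    """Record each text node together with the tag/class path leading to it."""
    def __init__(self):
        super().__init__()
        self.stack = []
        self.nodes = []  # list of (path, text)

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        self.stack.append(f"{tag}.{cls}" if cls else tag)

    def handle_endtag(self, tag):
        if self.stack:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.nodes.append(("/".join(self.stack), text))

def generalize(html, example):
    """Return every text node that sits at the same path as the user's example."""
    collector = PathCollector()
    collector.feed(html)
    collector.close()
    paths = [path for path, text in collector.nodes if text == example]
    if not paths:
        return []
    return [text for path, text in collector.nodes if path == paths[0]]

# Hypothetical card markup; addresses are placeholders.
PAGE = """
<div class="card"><h2 class="name">Four Barrel</h2><p class="addr">address 1</p></div>
<div class="card"><h2 class="name">Sightglass</h2><p class="addr">address 2</p></div>
<div class="card"><h2 class="name">Blue Bottle</h2><p class="addr">address 3</p></div>
<div class="card"><h2 class="name">Ritual</h2><p class="addr">address 4</p></div>
"""
names = generalize(PAGE, "Four Barrel")
addresses = generalize(PAGE, "address 1")
print(names)
```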
![Page 53: ANHAI DOAN ALON HALEVY ZACHARY IVES Chapter 15: Data Integration on the Web PRINCIPLES OF DATA INTEGRATION.](https://reader036.fdocuments.net/reader036/viewer/2022062421/56649da15503460f94a8ddec/html5/thumbnails/53.jpg)
Combining Multiple Data Sets
First, find related data sets. Depending on the context, you may be looking for:
  Data sets to join with (add new columns)
  Data sets to union with (add new rows)
Specifying the join:
  Again, by demonstration. Drag and drop a cell from one table into another.
Reference reconciliation is a big challenge:
  Use reference data such as Freebase?
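A join that adds new columns, with a very light form of reference reconciliation, might look like this. The `aliases` dictionary is a hypothetical stand-in for reference data; the tables and column names are invented.

```python
def normalize(name, aliases):
    key = " ".join(name.lower().split())
    return aliases.get(key, key)

def join_tables(left, right, left_key, right_key, aliases=None):
    """Join two lists of row dicts on reconciled key values, adding the right table's columns."""
    aliases = aliases or {}
    index = {normalize(row[right_key], aliases): row for row in right}
    joined = []
    for row in left:
        match = index.get(normalize(row[left_key], aliases))
        if match is not None:
            extra = {k: v for k, v in match.items() if k != right_key}
            joined.append({**row, **extra})
    return joined

left = [
    {"country": "UK", "note": "row 1"},
    {"country": " zambia", "note": "row 2"},
    {"country": "Atlantis", "note": "row 3"},
]
right = [
    {"name": "United Kingdom", "continent": "Europe"},
    {"name": "Zambia", "continent": "Africa"},
]
joined = join_tables(left, right, "country", "name", aliases={"uk": "united kingdom"})
print(joined)
```

Without the alias entry the "UK" row would silently drop out of the result, which is the reconciliation problem in miniature.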
![Page 54: ANHAI DOAN ALON HALEVY ZACHARY IVES Chapter 15: Data Integration on the Web PRINCIPLES OF DATA INTEGRATION.](https://reader036.fdocuments.net/reader036/viewer/2022062421/56649da15503460f94a8ddec/html5/thumbnails/54.jpg)
Re-Using Work of Others
Most good data sets will get extracted more than once:
  Re-use the work done by other extractors
Data cleaning can be a collaborative effort
Data sets that get integrated often are probably high quality – leverage that signal
With 200M tables on the Web, you can mine their schemas to find attribute synonyms and common schematic patterns.
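Mining schemas for attribute synonyms can be sketched with one assumed heuristic: two attributes are synonym candidates if they never appear in the same schema but appear alongside the same other attributes. The function name, the `min_shared` cutoff, and the toy schemas are illustrative only.

```python
from collections import defaultdict
from itertools import combinations

def synonym_candidates(schemas, min_shared=2):
    """Pairs of attributes that never co-occur but share enough context attributes."""
    context = defaultdict(set)
    cooccur = set()
    for schema in schemas:
        attrs = set(schema)
        for a in attrs:
            context[a] |= attrs - {a}
        for a, b in combinations(sorted(attrs), 2):
            cooccur.add((a, b))
    result = []
    for a, b in combinations(sorted(context), 2):
        if (a, b) in cooccur:
            continue  # attributes in the same table are unlikely to be synonyms
        shared = context[a] & context[b]
        if len(shared) >= min_shared:
            result.append((a, b, len(shared)))
    return sorted(result, key=lambda t: -t[2])

schemas = [
    ["name", "phone", "address", "email"],
    ["name", "telephone", "address", "email"],
    ["make", "model", "price"],
]
candidates = synonym_candidates(schemas)
print(candidates)
```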
![Page 55: ANHAI DOAN ALON HALEVY ZACHARY IVES Chapter 15: Data Integration on the Web PRINCIPLES OF DATA INTEGRATION.](https://reader036.fdocuments.net/reader036/viewer/2022062421/56649da15503460f94a8ddec/html5/thumbnails/55.jpg)
Summary of Chapter 15
Structured data on the Web is an incredible collection of data
  More is coming because organizations and governments are being encouraged to publish data
Data comes with little or no semantics
  Huge challenge when you try to make sense of it
Key emphasis: create data management tools that anyone can use
  Data is no longer just for database experts!