Topology of the Web of Data (wifo5-03.informatik.uni-mannheim.de/bizer/pub/Bizer...)
2nd Workshop on Linked Web Data Management (LWDM2012)
3rd Workshop on Business intelligencE and the WEB (BEWEB2012)
March 30th, 2012, Berlin, Germany
Topology of the Web of Data
Prof. Dr. Christian Bizer
Freie Universität Berlin
Germany
Christian Bizer: Topology of the Web of Data (04/30/2012)
Slide from 2007: What does the Web offer us today?
[Figure: diagram with the labels DB and HTML]
Slide from 2007: What do we actually want?
Use the Web like a single, global database
More and more Websites publish Structured Data
Microformats
RDFa
Linked Data
Microdata
Research Prototypes: VisiNav
Research Prototypes: SigMa
Industry Uptake 2011: Schema.org
asks site owners to embed data to enrich search results
200+ types: event, organization, person, place, product, review
Encoding: Microdata (or alternatively a subset of RDFa)
Usage of Schema.org Data
Data snippets within search results
Answer to a fact query
Google's Knowledge Graph*
describes more than 200 million entities, such as places, people, products …
consists of commercial third-party data and Web data
will increasingly be used by Google to answer queries
* Wall Street Journal: Google Gives Search a Refresh, 03/14/2012
Outline: Topology of the Web of Data
1. Embedded Data in HTML
Microformats
RDFa
Microdata
WebDataCommons.org
2. Linked Data
Sharing the data integration effort
The Web of Linked Data
3. Conclusions
Opportunities
Challenges
Microformats
Small data islands within HTML pages
Microformats effort dates back to 2003
Small set of fixed formats
hCard: people, companies, organizations, and places
XFN: relationships between people
hCalendar: calendaring and events
hListing: small ads, classifieds
hReview: reviews of products, businesses, events
Shortcoming of Microformats: cannot represent arbitrary kinds of data
indexed by Google and Yahoo since 2009
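To make the "small data islands" idea concrete, here is a minimal sketch, assuming only Python's standard library, that pulls hCard properties out of an HTML snippet. The class names `vcard`, `fn`, `org` and `locality` belong to the hCard format; the parser class, the chosen property subset, the helper `extract_hcards` and the sample markup are assumptions made for this illustration, and nested cards are not handled.

```python
from html.parser import HTMLParser

# Elements that never get a closing tag; skipped so the stack stays balanced.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class HCardParser(HTMLParser):
    """Collect the text of elements whose class names an hCard property."""

    # Illustrative subset of hCard property class names (an assumption).
    PROPERTIES = {"fn", "org", "locality"}

    def __init__(self):
        super().__init__()
        self.cards = []   # one dict per element with class "vcard"
        self._open = []   # hCard property (or None) for each open element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if "vcard" in classes:
            self.cards.append({})
        prop = next((c for c in classes if c in self.PROPERTIES), None)
        self._open.append(prop)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self._open:
            self._open.pop()

    def handle_data(self, data):
        prop = self._open[-1] if self._open else None
        if prop and self.cards and data.strip():
            self.cards[-1][prop] = data.strip()


def extract_hcards(html):
    """Return a list of {property: text} dicts, one per hCard found."""
    parser = HCardParser()
    parser.feed(html)
    parser.close()
    return parser.cards


sample = (
    '<div class="vcard"><span class="fn">Christian Bizer</span>, '
    '<span class="org">Freie Universität Berlin</span></div>'
)
print(extract_hcards(sample))
```

The fixed `PROPERTIES` set mirrors the shortcoming named on the slide: only properties a format defines in advance can be recognized.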
RDFa
serialization format for embedding RDF data into HTML pages
proposed in 2004, W3C Recommendation in 2008
can be used together with any vocabulary
can assign URIs as global primary keys to entities
Open Graph Protocol
allows site owners to determine how entities are displayed inside Facebook
relies on RDFa for encoding data in HTML pages
available since April 2010
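A minimal sketch of how a consumer might read Open Graph Protocol data, assuming the usual encoding as `<meta property="og:..." content="...">` elements in the page head. The parser class, the helper name `extract_open_graph` and the sample page are assumptions made for this illustration; only Python's standard library is used.

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collect og:* properties from <meta property=... content=...> tags."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        prop = attributes.get("property") or ""
        if prop.startswith("og:") and "content" in attributes:
            self.properties[prop] = attributes["content"]


def extract_open_graph(html):
    """Return a dict mapping og:* property names to their content values."""
    parser = OpenGraphParser()
    parser.feed(html)
    parser.close()
    return parser.properties


# Hypothetical page head used only for illustration.
sample = """
<html><head>
  <meta property="og:title" content="Example Movie" />
  <meta property="og:type" content="video.movie" />
  <meta name="description" content="not an Open Graph property" />
</head></html>
"""
print(extract_open_graph(sample))
```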
Microdata
alternative technique for embedding structured data
proposed in 2009 by WHATWG as part of HTML5 work
tries to be simpler than RDFa (5 new attributes instead of 8)
W3C currently tries to reconcile the two alternative proposals
Schema.org initially chose Microdata as preferred serialization
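As an illustration of the Microdata attributes, the following minimal sketch reads `itemscope`, `itemtype` and `itemprop` from an HTML snippet with Python's standard library. The parser class, the helper `extract_items` and the sample markup are assumptions for this example; nested items, `itemid`, `itemref` and values carried in attributes (for example on `<meta>`) are deliberately ignored.

```python
from html.parser import HTMLParser

# Elements that never get a closing tag; skipped so the stack stays balanced.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class MicrodataParser(HTMLParser):
    """Collect flat Microdata items as {"type": ..., "properties": {...}}."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._open = []   # itemprop name (or None) for each open element

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if "itemscope" in attributes:
            # A new item starts; itemtype names its class in a vocabulary.
            self.items.append({"type": attributes.get("itemtype"), "properties": {}})
        if tag not in VOID_TAGS:
            self._open.append(attributes.get("itemprop"))

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self._open:
            self._open.pop()

    def handle_data(self, data):
        name = self._open[-1] if self._open else None
        if name and self.items and data.strip():
            self.items[-1]["properties"][name] = data.strip()


def extract_items(html):
    """Return the list of items found in the given HTML string."""
    parser = MicrodataParser()
    parser.feed(html)
    parser.close()
    return parser.items


sample = (
    '<div itemscope itemtype="http://schema.org/Person">'
    '<span itemprop="name">Richard Cyganiak</span></div>'
)
print(extract_items(sample))
```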
Microformat, Microdata, RDFa Deployment
Common Crawl
WebDataCommons.org
extracts all Microformat, Microdata, RDFa data from the Common Crawl and provides the extracted data for download
Two extraction runs
2009/2010 CC Corpus: 2.5 billion HTML pages (28.9 Terabyte compressed)
Feb 2012 CC Corpus: 1.4 billion HTML pages (20.9 Terabyte compressed)
used 100 machines on Amazon EC2
approx. 3000 machine hours (spot instances of type c1.xlarge), 550 EUR
Joint project of [partner logos not preserved]
HTML Pages containing structured Data
1.4 billion HTML pages parsed (Common Crawl, Feb 2012)
188 million pages contained Microformat, Microdata, RDFa
13% of the HTML pages contain structured data
Size of extracted data set: 3.2 billion RDF quads
Breakdown by Format (Feb 2012)
| Format | URLs |
|---|---|
| html-rdfa | 67,901,246 |
| html-microdata | 26,929,865 |
| html-mf-geo | 2,491,933 |
| html-mf-hcalendar | 1,506,379 |
| html-mf-hcard | 61,360,686 |
| html-mf-hlisting | 197,027 |
| html-mf-hresume | 20,762 |
| html-mf-hreview | 1,971,870 |
| html-mf-species | 14,033 |
| html-mf-hrecipe | 422,289 |
| html-mf-xfn | 26,004,925 |
| Sum | 188,821,015 |
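The figures in this breakdown can be cross-checked with a few lines of arithmetic. The sketch below copies the per-format counts from the slide and relates them to the 1.4 billion parsed pages mentioned earlier in the talk; the variable names are assumptions made for this illustration.

```python
# URLs per format, copied from the Feb 2012 breakdown.
urls_by_format = {
    "html-rdfa": 67_901_246,
    "html-microdata": 26_929_865,
    "html-mf-geo": 2_491_933,
    "html-mf-hcalendar": 1_506_379,
    "html-mf-hcard": 61_360_686,
    "html-mf-hlisting": 197_027,
    "html-mf-hresume": 20_762,
    "html-mf-hreview": 1_971_870,
    "html-mf-species": 14_033,
    "html-mf-hrecipe": 422_289,
    "html-mf-xfn": 26_004_925,
}

# "1.4 billion HTML pages parsed" (rounded figure from the talk).
PAGES_PARSED = 1_400_000_000

total = sum(urls_by_format.values())
share = total / PAGES_PARSED
microformat_urls = sum(
    count for name, count in urls_by_format.items() if name.startswith("html-mf-")
)

print(f"URLs with structured data: {total:,}")
print(f"Share of parsed pages:     {share:.1%}")
print(f"Microformat URLs:          {microformat_urls:,}")
```

The total reproduces the Sum row, and the share agrees with the roughly 13% quoted in the talk; the Microformat subtotal shows that, taken together, the fixed formats still outnumber RDFa in this crawl.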
Percentage of all crawled URLs
Percentage of all crawled URLs / Yahoo Crawl
Size of the crawl: approximately 10 billion HTML pages
RDFa Topics (2012)
Sample size: 49,370,729 RDFa instances from the Common Crawl
150 classes and 400 properties with 1000+ instances
Top Classes
gd = Google's Rich Snippet Vocabulary
RDFa Properties (2012)
400 properties with 1000+ instances
Top Properties
ogp = Facebook's Open Graph Protocol
Yahoo Crawl (2011)
12 billion pages, with 431 million pages containing RDFa
Microdata Topics (2012)
Sample size: 90,526,013 entities from the Common Crawl
182 classes and 690 properties with 1000+ instances
Top Classes
datavoc = Google's Rich Snippet Vocabulary
schema = Schema.org
Instances per Class
Conclusion: Embedded Data in HTML
RDFa and Microdata grow, but Microformats are still present
A rather small set of vocabularies is used
The content and the vocabularies are very focused towards the major consumers (Google, Yahoo, Bing, Facebook)
Providing structured data has become an SEO topic
The data structures used are rather simplistic (mostly atomic entities)
Alternative Approach: Linked Data
Extend the Web with a single global data space
1. by using RDF to publish structured data on the Web
2. by setting links between data items within different data sources.
[Figure: data sources A to E, each publishing RDF, connected by RDF links]
Entities must be identified with HTTP URIs
[Figure: RDF graph with the following statements]
pd:cygri rdf:type foaf:Person
pd:cygri foaf:name "Richard Cyganiak"
pd:cygri foaf:based_near dbpedia:Berlin
HTTP URIs take the role of global primary keys.
pd:cygri = http://richard.cyganiak.de/foaf.rdf#cygri
dbpedia:Berlin = http://dbpedia.org/resource/Berlin
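The prefixed names on this slide are shorthand for full HTTP URIs. A minimal sketch of that expansion follows, assuming the `pd` and `dbpedia` prefixes implied by the two URIs given on the slide plus the standard `foaf` and `rdf` namespace URIs; the helper names `expand` and `to_ntriples` and the string-based literal check are assumptions made for this illustration.

```python
# Prefix table: pd and dbpedia follow from the URIs shown on the slide,
# foaf and rdf are the standard namespace URIs of those vocabularies.
PREFIXES = {
    "pd": "http://richard.cyganiak.de/foaf.rdf#",
    "dbpedia": "http://dbpedia.org/resource/",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
}

# The three statements from the slide; the quoted object is a literal.
triples = [
    ("pd:cygri", "rdf:type", "foaf:Person"),
    ("pd:cygri", "foaf:name", '"Richard Cyganiak"'),
    ("pd:cygri", "foaf:based_near", "dbpedia:Berlin"),
]


def expand(name):
    """Expand a prefixed name such as 'pd:cygri' into a full HTTP URI."""
    prefix, _, local = name.partition(":")
    return PREFIXES[prefix] + local


def to_ntriples(subject, predicate, obj):
    """Render one statement in an N-Triples-like line with full URIs."""
    rendered = obj if obj.startswith('"') else f"<{expand(obj)}>"
    return f"<{expand(subject)}> <{expand(predicate)}> {rendered} ."


for triple in triples:
    print(to_ntriples(*triple))
```

Because every subject and non-literal object expands to a globally unique URI, statements from different sources about the same entity can be merged without a shared database key.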
URIs can be looked up on the Web
[Figure: RDF graph after looking up dbpedia:Berlin, with the following statements]
pd:cygri rdf:type foaf:Person
pd:cygri foaf:name "Richard Cyganiak"
pd:cygri foaf:based_near dbpedia:Berlin
dbpedia:Berlin dp:population 3,405,259
dbpedia:Berlin skos:subject dp:Cities_in_Germany
By following RDF links, applications can
navigate the global data graph
discover new data sources
The Dataspace Vision
Alternative to classic data integration systems in order to cope with the growing number of data sources.
Properties of dataspaces
no upfront investment into a global schema
rely on pay-as-you-go data integration
give best effort answers to queries
Franklin, M., Halevy, A., and Maier, D.: From Databases to Dataspaces: A new Abstraction for Information Management, SIGMOD Rec. 2005.
Madhavan, J., et al.: Web-scale Data Integration: You Can Only Afford to Pay As You Go, CIDR 2007
Linked Data relies on the Pay-as-You-Go Idea
for Identity Management
for Schema/Vocabulary Management
Providing Integration Hints
by publishing Identity Links on the Web
Identity Link:
<http://www4.wiwiss.fu-berlin.de/is-group/resource/persons/Person4>
owl:sameAs
<http://dblp.l3s.de/d2r/resource/authors/Christian_Bizer> .
you publish links pointing at other data sources.
somebody else publishes links pointing at your data source.
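One way a consumer could use such identity links is to group all URIs that are transitively connected by owl:sameAs into one cluster per real-world entity. The following is a minimal sketch using a union-find structure; this technique is chosen here for illustration and is not prescribed by the talk. The third URI (`http://example.org/people/cbizer`) and the function name are hypothetical.

```python
def same_as_clusters(links):
    """Group URIs connected by owl:sameAs links into sets (union-find)."""
    parent = {}

    def find(uri):
        parent.setdefault(uri, uri)
        while parent[uri] != uri:
            parent[uri] = parent[parent[uri]]  # path halving
            uri = parent[uri]
        return uri

    for left, right in links:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_right] = root_left

    clusters = {}
    for uri in parent:
        clusters.setdefault(find(uri), set()).add(uri)
    return list(clusters.values())


FU_BERLIN = "http://www4.wiwiss.fu-berlin.de/is-group/resource/persons/Person4"
DBLP = "http://dblp.l3s.de/d2r/resource/authors/Christian_Bizer"
THIRD_PARTY = "http://example.org/people/cbizer"  # hypothetical third-party URI

links = [
    (FU_BERLIN, DBLP),    # link published by the data source itself (from the slide)
    (THIRD_PARTY, DBLP),  # link published by somebody else (assumed)
]
for cluster in same_as_clusters(links):
    print(sorted(cluster))
```

Links published independently by the source and by a third party end up in the same cluster, which is the sense in which the integration effort is shared.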
Effort Distribution between Publisher and Consumer
[Figure: effort distribution between publisher and consumer]
Consumer data mines identity links
Publishers or third parties provide identity links
Providing Integration Hints
by publishing Vocabulary Links on the Web
Vocabulary Link:
<http://xmlns.com/foaf/0.1/Person>
owl:equivalentClass
<http://dbpedia.org/ontology/Person> .
Terms for expressing Correspondences
owl:equivalentClass, owl:equivalentProperty
rdfs:subClassOf, rdfs:subPropertyOf
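A consumer can apply such vocabulary links to translate data into the vocabulary its application expects. The sketch below handles only owl:equivalentClass and treats it as symmetric; the function names, the notion of a "preferred namespace" and the sample instance data are assumptions made for this illustration, and rdfs:subClassOf would need a directed treatment that is not shown.

```python
# Vocabulary link from the slide: foaf:Person owl:equivalentClass dbpedia Person.
EQUIVALENT_CLASS_LINKS = [
    ("http://xmlns.com/foaf/0.1/Person", "http://dbpedia.org/ontology/Person"),
]


def build_translation(links, preferred_namespace):
    """Map each class to its equivalent inside the preferred namespace."""
    table = {}
    for left, right in links:
        # owl:equivalentClass is symmetric, so translate in whichever
        # direction ends in the preferred namespace.
        if right.startswith(preferred_namespace):
            table[left] = right
        if left.startswith(preferred_namespace):
            table[right] = left
    return table


def translate_types(instances, table):
    """Rewrite the class of each instance; unknown classes stay unchanged."""
    return {uri: table.get(cls, cls) for uri, cls in instances.items()}


# Hypothetical consumer data: instance URI -> class URI.
instances = {
    "http://richard.cyganiak.de/foaf.rdf#cygri": "http://xmlns.com/foaf/0.1/Person",
}
table = build_translation(EQUIVALENT_CLASS_LINKS, "http://dbpedia.org/ontology/")
print(translate_types(instances, table))
```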
Effort Distribution between Publisher and Consumer
[Figure: effort distribution between publisher and consumer]
Consumer defines or data mines mappings
Publisher reuses vocabularies
Publisher or third party publishes mappings
Somebody-Pays-As-You-Go
The overall data integration effort is split between the data publisher, the data consumer and third parties.
Data Publisher
publishes data as RDF
sets identity links
reuses terms or publishes mappings
Third Parties
set identity links pointing at your data
publish mappings to the Web
Data Consumer
has to do the rest
using record linkage and schema matching techniques
[Figure: fixed overall data integration effort, divided into publisher's effort, third party effort and consumer's effort]
W3C Linking Open Data Project
Grassroots community effort to
publish existing open license datasets as Linked Data on the Web
interlink things between different data sources.
LOD Datasets on the Web: May 2007
Over 500 million RDF triples
Around 120,000 RDF links between data sources
LOD Datasets on the Web: September 2008
LOD Datasets on the Web: September 2010
LOD Datasets on the Web: November 2011
31.6 billion RDF triples
503 million RDF links
Distribution by Topical Domain (Nov 2011)
| Domain | Data Sets | Triples | Percent | RDF Links | Percent |
|---|---|---|---|---|---|
| Media | 25 | 1,841,852,061 | 5.82 % | 50,440,705 | 10.01 % |
| Geographic | 31 | 6,145,532,484 | 19.43 % | 35,812,328 | 7.11 % |
| Government | 49 | 13,315,009,400 | 42.09 % | 19,343,519 | 3.84 % |
| Publications | 87 | 2,950,720,693 | 9.33 % | 139,925,218 | 27.76 % |
| Cross-domain | 41 | 4,184,635,715 | 13.23 % | 63,183,065 | 12.54 % |
| Life sciences | 41 | 3,036,336,004 | 9.60 % | 191,844,090 | 38.06 % |
| User content | 20 | 134,127,413 | 0.42 % | 3,449,143 | 0.68 % |
| SUM | 295 | 31,634,213,770 | | 503,998,829 | |

Source: State of the LOD Cloud, http://www4.wiwiss.fu-berlin.de/lodcloud/state/
Vocabulary Usage (Nov 2011)
Only proprietary vocabularies: 104 (35.25 %) of the 295 sources
Terms from non-proprietary vocabularies: 191 (64.75 %) of the 295 sources
Common Vocabularies
dc 92 (31.19 %)
foaf 81 (27.46 %)
skos 58 (19.66 %)
geo 25 (8.47 %)
akt 17 (5.76 %)
bibo 14 (4.75 %)
mo 13 (4.41 %)
vcard 10 (3.39 %)
sioc 10 (3.39 %)
cc 8 (2.71 %)
Source: State of the LOD Cloud, http://www4.wiwiss.fu-berlin.de/lodcloud/state/
Deployment of Vocabulary Links
Source: Linked Open Vocabularies, http://labs.mondeca.com/dataset/lov
Uptake in the Government Domain
The EU is starting to publish Linked Data (LOD2, LATC)
Various other national efforts
W3C eGovernment Interest Group
Uptake in the Libraries Community
Institutions publishing Linked Data
Library of Congress (subject headings)
German National Library (PND dataset and subject headings)
Swedish National Library (Libris catalog)
Hungarian National Library (OPAC and Digital Library)
Europeana Digital Library just released data about 4 million artifacts
Goals:
1. Integrate library catalogs on a global scale.
2. Interconnect resources between repositories (by topic, by location, by historical period, by ...).
W3C Library Linked Data Incubator Group
Conclusion: Web of Linked Data
Compared to Microformats, Microdata, RDFa
number of data providers is significantly lower
wider range of topics covered
wider range of common and proprietary vocabularies used
more complex data structures
emphasis on setting RDF Links between sources
Conclusion: Topology of the Web of Data
3. Opportunities and Challenges
The Web of Data provides equal Opportunities
Everybody can crawl the data.
different from alternative approaches like Google Base
like Facebook
like Google Fusion Tables
just as on the classic Web
The haystack is there, so let's look for the needle!
Search Engines turn into Answering Engines
Global Data Mining
Challenges
Applications hate heterogeneity and low quality data!
[Figure captions: "The wild wild west" and "My little world"]
Things that require more work
1. More research on data space profiling is needed.
What is in the data space and how does the content change over time?
2. More research on data quality assessment and spam detection is needed.
3. More research on learning mappings and identity resolution heuristics within the Web context is needed.
Identity links make it easier to learn vocabulary links.
Vocabulary links make it easier to learn identity links.
4. More research on pay-as-you-go data integration is needed.
How do human, community and machine contributions play together over time?
Hands-on: How to play around with the data?
Download the Billion Triples Challenge Dataset
2 billion triples (20 GB gzipped)
crawled from the public Web of Linked Data in May/June 2011
http://challenge.semanticweb.org/
Download the Web Data Commons Dump
3 billion triples (49 GB gzipped)
RDFa, Microdata, Microformat data crawled February 2012
http://www.webdatacommons.org/
Download the Sindice Dump
12 billion triples (164 GB gzipped, ~1.16 TB uncompressed)
Linked Data, RDFa, Microdata, Microformat data crawled 2009-2011
http://data.sindice.com/trec2011/download.html
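A first step when playing with one of these dumps is a quick profile of which properties occur. The following is a minimal sketch, assuming the downloaded file is gzipped N-Triples or N-Quads with one statement per line; the actual file formats and names of the dumps should be checked on the download pages. The function name and the `limit` parameter are assumptions made for this illustration, and the file is streamed so it never has to fit in memory.

```python
import gzip
from collections import Counter


def count_predicates(path, limit=None):
    """Count predicate URIs in a gzipped N-Triples/N-Quads file.

    Reads line by line; `limit` optionally stops after that many lines,
    which is handy for a first look at a multi-gigabyte dump.
    """
    counts = Counter()
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        for number, line in enumerate(handle):
            if limit is not None and number >= limit:
                break
            # Subject and predicate contain no whitespace, so the predicate
            # is the second whitespace-separated token of a statement.
            parts = line.split(None, 2)
            if len(parts) == 3 and parts[1].startswith("<"):
                counts[parts[1].strip("<>")] += 1
    return counts
```

A call such as `count_predicates("dump.nq.gz", limit=1_000_000).most_common(20)`, with `dump.nq.gz` standing in for whichever file was downloaded, would list the twenty most frequent properties in the first million lines.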
Thanks!
References
Statistics on HTML-embedded data: http://webdatacommons.org
Statistics on Linked Data: http://www4.wiwiss.fu-berlin.de/lodcloud/state/
Textbook: Tom Heath, Christian Bizer: Linked Data: Evolving the Web into a Global Data Space. http://linkeddatabook.com/