DBpedia: Knowledge Extraction from Wikipedia


03/01/18


Wikipedia

DBpedia Tutorial, 09.02.2015, http://dbpedia.org

Wikipedia articles:
– 4.7 million articles; 780 article additions per day
– are highly topical
– contain only a few errors, which can easily be revised
– often cover very specific content

→ Wikipedia is the knowledge compendium of humanity.

Benefits of using Linked Data

Consumer view:
- link data from any other place on the web
- discover more related data while consuming data
- reuse parts of the data
- reuse existing tools and libraries
- combine data safely with other data
- query data over different repositories

Publisher view:
- make your data discoverable
- increase the value of your data (by linking it)
- have fine-grained control over the data items and optimise their access
- design data to fit your domain knowledge

What's DBpedia?

– DBpedia is a community effort to extract structured information from Wikipedia and to make this information available on the Web.

– DBpedia allows you to ask sophisticated queries against Wikipedia, and to link other data sets on the Web to Wikipedia data.

– The DBpedia project was started in 2006
– has been a key factor for the success of the Linked Open Data initiative
– serves as an interlinking hub for other data sets
– provides a testbed serving real data spanning various domains
– in more than 120 language editions

Where is Wikipedia information useful?

"Which films starred John Cleese without any other members of Monty Python?"

"What do Dublin and Leipzig have in common?"

"Which software products are developed by an organisation founded in California?"

"Which populated places in Germany are below sea level?"

Where is Wikipedia information useful?

● as terminology and concept repository and fact source for Entity Linking and Disambiguation:

The series follows the adventures of a space-faring crew onboard the starship USS Enterprise (NCC-1701-D), the fifth Federation vessel to bear the name and registry and the seventh starship by that name.

The Enterprise is commanded by Captain Jean-Luc Picard and is staffed by first officer Commander William Riker, operations manager Data, security chief Tasha Yar, ship's counselor Deanna Troi, chief medical officer Dr. Beverly Crusher, conn officer Lieutenant Geordi La Forge, and junior officer Lieutenant Worf.

⇒ no company, no aircraft carrier, no satellite
⇒ correlate the mentions and the concept starship
⇒ Star Trek rank, contemporary or past military or law enforcement

Why search engines aren't always enough

"Which films starred John Cleese without any other members of Monty Python?"


What is needed to do better?

● ontological representation of entities and facts

"An ontology is a specification of a conceptualization." (Gruber, 1993)

⇒ formal description of concepts and relationships

● well-defined taxonomy of entity types
● assertions about entities and their relations

A British Comedy is a kind of Comedy.
A Comedy is a kind of Film.
A British Comedy is a kind of Film.
Clockwise is a British Comedy.
John Cleese stars in Clockwise.
John Cleese stars in a Film.

● thoroughly specified, machine-actionable, but flexible formalism for representation
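
The chain of "is a kind of" statements on this slide can be followed mechanically. The sketch below is only an illustration in Python: the dictionary layout and the helper names (`superclasses`, `types_of`, `stars_in_a`) are assumptions made here, not how DBpedia or any RDF reasoner stores or evaluates its ontology.

```python
# Toy taxonomy copied from the slide. The data structures are hypothetical
# and chosen for readability only.
SUBCLASS_OF = {
    "BritishComedy": "Comedy",
    "Comedy": "Film",
}
INSTANCE_OF = {"Clockwise": "BritishComedy"}
STARS_IN = [("John Cleese", "Clockwise")]


def superclasses(cls):
    """Return cls followed by all of its ancestors, nearest first."""
    chain = [cls]
    while chain[-1] in SUBCLASS_OF:
        chain.append(SUBCLASS_OF[chain[-1]])
    return chain


def types_of(entity):
    """All classes an entity belongs to, including inherited ones."""
    return superclasses(INSTANCE_OF[entity])


def stars_in_a(actor, cls):
    """True if the actor stars in at least one entity of class cls."""
    return any(a == actor and cls in types_of(work) for a, work in STARS_IN)


print(types_of("Clockwise"))
print(stars_in_a("John Cleese", "Film"))
```

The derived fact "John Cleese stars in a Film" is never stored; it follows from the asserted facts plus the subclass chain, which is the point of having a well-defined taxonomy.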

DBpedia - motivation and use cases

an RDF view of structured Wikipedia information enables:

● sophisticated queries
⇒ cross-referencing facts of entities
⇒ filtering of entities based on their types and fact assertions

● combining facts from Wikipedia with machine-actionable knowledge from other structured datasets (Geodata, Yellowpages, WordNet, ...)

Another take on Question Answering

"Which films starred John Cleese without any other members of Monty Python?"

DBpedia - contents and datasets

● Wikipedia article ⇔ DBpedia resource
http://en.wikipedia.org/wiki/Monty_Python ⇔ http://dbpedia.org/resource/Monty_Python

● mapping-based types and facts governed by the DBpedia Ontology
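
The article-to-resource correspondence shown for Monty_Python can be written as a small helper. This is a minimal sketch for simple English titles only; the function name is made up here, and it assumes the title can be copied unchanged, which may not hold for titles containing special characters.

```python
from urllib.parse import urlsplit

DBPEDIA_RESOURCE = "http://dbpedia.org/resource/"


def wikipedia_to_dbpedia(article_url):
    """Map an English Wikipedia article URL to a DBpedia resource URI."""
    parts = urlsplit(article_url)
    if parts.netloc != "en.wikipedia.org" or not parts.path.startswith("/wiki/"):
        raise ValueError("not an English Wikipedia article URL: " + article_url)
    # Assumption: the article title carries over to the resource URI as-is.
    title = parts.path[len("/wiki/"):]
    return DBPEDIA_RESOURCE + title


print(wikipedia_to_dbpedia("http://en.wikipedia.org/wiki/Monty_Python"))
```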

● 4.58 million entities and 583 million triples (English DBpedia 2014)
  – 131.2 million fact assertions (derived from infoboxes)
  – 168.5 million triples representing Wikipedia structure
  – 57.1 million links to external datasets

● DBpedia resources are categorised in several manners:
  ● by Wikipedia categories (represented in SKOS)
  ● by YAGO classification
  ● by links to WordNet Synsets
  ● by assignment of classes from the DBpedia ontology

● Provenance meta-data
⇒ From which part of which Wikipedia page was a triple derived?

Mappings Wiki

a community effort to:
– develop an ontology schema
– provide mappings from Wikipedia infobox properties to this ontology

→ creating an alignment between Wikipedia and DBpedia
→ eliminating name variations in properties and classes
→ big boost for precision

DBpedia Ontology

cross-domain ontology
– maintained and extended by the community in the DBpedia Mappings Wiki
– manually created based on the most commonly used infoboxes
– currently covers 685 classes which form a subsumption hierarchy and are described by 2,795 different properties
– subsumption hierarchy with a maximal depth of 5

DBpedia Ontology Extract

Wikipedia articles

– Wikipedia articles consist mostly of free text
– also comprise various types of structured information
– including: infobox templates, categorisation information, images, geo-coordinates, links to external web pages, disambiguation pages, redirects between pages, other language links

Structure in Wikipedia

– Title
– Abstract
– Infoboxes
– Geo-coordinates
– Categories
– Images
– Article outline
– Links
  » other language versions
  » other Wikipedia pages
  » to the Web
  » Redirects
  » Disambiguations

Infobox encoding

{{Infobox Korean settlement
| title = Busan Metropolitan City
| img = Busan.jpg
| imgcaption = A view of the [[Geumjeong]] district in Busan
| hangul = 부산 광역시
...
| area_km2 = 763.46
| pop = 3635389
| popyear = 2006
| mayor = Hur Nam-sik
| divs = 15 wards (Gu), 1 county (Gun)
| region = [[Yeongnam]]
| dialect = [[Gyeongsang]]
}}

dbp:Busan dbp:title "Busan Metropolitan City"
dbp:Busan dbp:hangul "부산 광역시"@Hang
dbp:Busan dbp:area_km2 "763.46"^^xsd:float
dbp:Busan dbp:pop "3635389"^^xsd:int
dbp:Busan dbp:region dbp:Yeongnam
dbp:Busan dbp:dialect dbp:Gyeongsang
...
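
The step from infobox fields to triples can be sketched in a few lines of Python. This is a simplified illustration, not the DBpedia extraction framework: the regular expressions, the datatype guessing and the helper names are assumptions made for this example, and only a subset of the Busan fields is used.

```python
import re

INFOBOX = """{{Infobox Korean settlement
| title = Busan Metropolitan City
| area_km2 = 763.46
| pop = 3635389
| region = [[Yeongnam]]
| dialect = [[Gyeongsang]]
}}"""


def encode_value(raw):
    """Guess an RDF-like term for a raw infobox value (heuristic only)."""
    link = re.fullmatch(r"\[\[(.+?)\]\]", raw)
    if link:
        # A wiki link becomes a reference to another resource.
        return "dbp:" + link.group(1).replace(" ", "_")
    if re.fullmatch(r"-?\d+", raw):
        return '"%s"^^xsd:int' % raw
    if re.fullmatch(r"-?\d+\.\d+", raw):
        return '"%s"^^xsd:float' % raw
    return '"%s"' % raw


def infobox_to_triples(subject, wikitext):
    """Turn every '| key = value' line into a (subject, property, value) triple."""
    triples = []
    for line in wikitext.splitlines():
        match = re.match(r"\|\s*(\w+)\s*=\s*(.+)", line)
        if match:
            key, raw = match.group(1), match.group(2).strip()
            triples.append((subject, "dbp:" + key, encode_value(raw)))
    return triples


for triple in infobox_to_triples("dbp:Busan", INFOBOX):
    print(" ".join(triple))
```

Each infobox key is reused verbatim as the property name, so differently named keys in other infoboxes produce different properties.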

Heterogeneity in infoboxes

Björk (Musician)
  Occupation = Musician, Actor
  Born = 21.12.1965, Reykjavík

Brown (Prime Minister)
  office = Prime Minister of the UK
  birth_date = 20.4.1951
  birth_place = Govan

Romero (Actor)
  occupation = Actor, Editor
  birthdate = 4.2.1940
  birthplace = New York
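
The three infoboxes name the same facts differently (Born, birth_date, birthdate). The sketch below shows how a mapping table can remove such name variations, in the spirit of the Mappings Wiki. The table is hypothetical and written for this example; it is not an excerpt from the actual mappings, and the dbo: target names are used illustratively.

```python
# Hypothetical mapping: lower-cased infobox key -> ontology property.
PROPERTY_MAPPING = {
    "occupation": "dbo:occupation",
    "office": "dbo:office",
    "birth_date": "dbo:birthDate",
    "birthdate": "dbo:birthDate",
    "birth_place": "dbo:birthPlace",
    "birthplace": "dbo:birthPlace",
}


def normalise(infobox):
    """Rewrite raw infobox keys to a single set of ontology properties."""
    facts = {}
    for key, value in infobox.items():
        key = key.lower()
        if key == "born":
            # One infobox field carrying two facts: "date, place".
            date, place = [part.strip() for part in value.split(",", 1)]
            facts["dbo:birthDate"] = date
            facts["dbo:birthPlace"] = place
        elif key in PROPERTY_MAPPING:
            facts[PROPERTY_MAPPING[key]] = value
    return facts


bjork = {"Occupation": "Musician, Actor", "Born": "21.12.1965, Reykjavík"}
brown = {"office": "Prime Minister of the UK", "birth_date": "20.4.1951",
         "birth_place": "Govan"}
romero = {"occupation": "Actor, Editor", "birthdate": "4.2.1940",
          "birthplace": "New York"}

for person in (bjork, brown, romero):
    print(normalise(person))
```

After normalisation, a single query over `dbo:birthPlace` reaches all three people, which is not possible while each infobox keeps its own key.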

DBpedia Extraction Framework

DIEF - DBpedia Information Extraction Framework
– extracts structured information from Wikipedia and turns it into a rich knowledge base
– Mapping-Based Infobox Extraction, Raw Infobox Extraction, Feature Extraction, Statistical Extraction
– hosted on GitHub
– written in Scala & Java

Accessing DBpedia - Browsing

● official DBpedia mirror http://dbpedia.org
⇒ runs on Virtuoso
⇒ faceted search with Virtuoso Facets

Accessing DBpedia - SPARQL

● official SPARQL endpoint http://dbpedia.org/sparql
⇒ subject to a fair use policy (limited query runtime)
⇒ query with any SPARQL compliant tool or API

Querying RDF with SPARQL

● SPARQL Protocol and RDF Query Language
⇒ graph patterns as set of triples (with variables)
⇒ successful matches of graph patterns generate bindings in (sub-)query solutions

● different result types for queries
SELECT ⇒ bindings, ASK ⇒ true/false, CONSTRUCT ⇒ new graph

● combinators and modifiers for basic graph patterns
⇒ UNION, FILTER, MINUS, FILTER (NOT) EXISTS

● result set modifiers
LIMIT, OFFSET, DISTINCT, ORDER BY

● numerous functions and operators for resource and literal values

● many additions in the 1.1 revision: grouping & aggregates, regular property path expressions, sub-queries
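
How triple patterns with variables produce bindings can be shown without a SPARQL engine. The following is a naive sketch in plain Python, assuming a tiny hand-written graph; the predicate names (`type`, `starring`) and the functions are invented for this illustration and do not reflect how a real triple store evaluates queries.

```python
# Tiny in-memory graph; illustrative data only.
TRIPLES = [
    ("Clockwise", "type", "BritishComedy"),
    ("Clockwise", "starring", "John_Cleese"),
    ("John_Cleese", "type", "Person"),
]


def is_var(term):
    """Variables are written with a leading question mark, as in SPARQL."""
    return term.startswith("?")


def match_pattern(pattern, triple, binding):
    """Extend binding so that pattern equals triple; return None on failure."""
    result = dict(binding)
    for p_term, t_term in zip(pattern, triple):
        if is_var(p_term):
            # Bind the variable, or check it against an earlier binding.
            if result.setdefault(p_term, t_term) != t_term:
                return None
        elif p_term != t_term:
            return None
    return result


def solve(patterns, triples):
    """Return all bindings that satisfy every triple pattern (a join)."""
    solutions = [{}]
    for pattern in patterns:
        next_solutions = []
        for binding in solutions:
            for triple in triples:
                extended = match_pattern(pattern, triple, binding)
                if extended is not None:
                    next_solutions.append(extended)
        solutions = next_solutions
    return solutions


query = [("?film", "type", "BritishComedy"), ("?film", "starring", "?actor")]
print(solve(query, TRIPLES))
```

Because `?film` appears in both patterns, the second pattern only matches triples whose subject agrees with the value bound by the first, which is what joins the two patterns into one solution.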

SPARQL Query Examples

People who were born in Berlin before 1900

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX : <http://dbpedia.org/resource/>

SELECT ?name ?birth ?death ?person
WHERE {
  ?person foaf:name ?name .
  ?person dbo:deathDate ?death .
  ?person dbo:birthPlace :Berlin .
  ?person dbo:birthDate ?birth .
  FILTER (?birth < "1900-01-01"^^xsd:date) .
}
ORDER BY ?name

SPARQL Tooling

● FlintSparqlEditor: JavaScript SPARQL editor
  ● syntax highlighting, code assistance
  ● auto-completion for properties and classes (for small datasets)

● Protégé: full-fledged ontology editor
  ● good to get an overview of ontologies backing datasets
  ● two SPARQL plug-ins (one supporting entailment)

● curl or your favourite simple REST API
  ● allows for simple testing of queries from any text editor with SPARQL syntax support (e.g. Emacs, Vim, Sublime Text)

$ curl -H 'Accept: application/json' --data-urlencode "query=$(cat query.sparql)" http://dbpedia.org/sparql
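
The same request can be built from Python's standard library. This is a sketch under the assumption that the endpoint accepts a form-encoded POST with a `query` field, as the curl call above sends; the helper names are invented here, and `run_query` needs network access and is therefore only defined, not called.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

ENDPOINT = "http://dbpedia.org/sparql"


def build_request(query):
    """Build a POST request carrying the query as a form field."""
    body = urlencode({"query": query}).encode("ascii")
    return Request(ENDPOINT, data=body, headers={"Accept": "application/json"})


def run_query(query, timeout=30):
    """Send the query and decode the JSON response (needs network access)."""
    with urlopen(build_request(query), timeout=timeout) as response:
        return json.load(response)


# Inspect the request without sending it.
request = build_request("ASK { ?s ?p ?o }")
print(request.get_method(), request.full_url)
print(request.data.decode("ascii"))
```

Separating `build_request` from `run_query` makes it possible to check the encoded query locally before spending any of the endpoint's limited query runtime.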

Further Reading: Website

landing page: http://dbpedia.org/About

overview of datasets (also info on localized datasets): http://wiki.dbpedia.org/Datasets

DBpedia data access overview: http://wiki.dbpedia.org/OnlineAccess

Further Reading: Browsing

DBpedia VAD: http://dbpedia.org/page/DBpedia

DBpedia Facets: http://dbpedia.org/fct/

new DBpedia frontend:
http://de.dbpedia.org/page/DBpedia (get an impression of the German DBpedia version)
https://github.com/lukovnikov/ldviewer (source code)

Context platform:
http://context.aksw.org/app/hub.php?corpus=6&action=facets (online demo to browse LOD2 Blog)
http://context.aksw.org/app/ (project home)