Apachecon 2011 stanbol_ogrisel

11/7/11 Apache Stanbol (Incubating) and the Web of Data Olivier Grisel, Nuxeo [email protected], 2011-11-11

Olivier Grisel, R&D engineer at Nuxeo, presents CMS integration and semantic components of the Apache Stanbol project at ApacheCon 2011.

Transcript of Apachecon 2011 stanbol_ogrisel

Page 1: Apachecon 2011 stanbol_ogrisel

11/7/11

Apache Stanbol (Incubating) and the Web of Data

Olivier Grisel, [email protected], 2011-11-11

Page 2: Apachecon 2011 stanbol_ogrisel

My Background

Olivier Grisel - R&D Engineer

Nuxeo: Open Source ECM

European project: IKS

Stuff I do:
o Machine Learning
o Natural Language Processing
o All things data

Page 3: Apachecon 2011 stanbol_ogrisel

Agenda

The Web of Data: what, why, how?

CMS integration demo

Semantic Components in Stanbol

Building models for Stanbol

Page 4: Apachecon 2011 stanbol_ogrisel

The Web of Data

What, Why, How?

Page 6: Apachecon 2011 stanbol_ogrisel

“To a computer, then, the web is a flat, boring world devoid of meaning”

Tim Berners-Lee, http://www.w3.org/Talks/WWW94Tim/

Page 7: Apachecon 2011 stanbol_ogrisel

“This is a pity, as in fact documents on the web describe real objects and imaginary concepts, and give particular relationships between them”

Tim Berners-Lee, http://www.w3.org/Talks/WWW94Tim/

Page 8: Apachecon 2011 stanbol_ogrisel

“The Semantic Web is not a separate Web but an extension of the current one, in which information is given well-defined meaning, better enabling computers and people to work in cooperation.”

Tim Berners-Lee, http://www.w3.org/Talks/WWW94Tim/

Page 9: Apachecon 2011 stanbol_ogrisel

“Adding semantics to the web involves two things: allowing documents which have information in machine-readable forms, and allowing links to be created with relationship values.”

Tim Berners-Lee, http://www.w3.org/Talks/WWW94Tim/

Page 10: Apachecon 2011 stanbol_ogrisel

The Web of Data – What?

Shared description of the real world

o Structured with vocabularies
o Decentralized
o Scoped by namespaces
o Linked

Page 11: Apachecon 2011 stanbol_ogrisel

The Web of Data – Why?

• Strings are ambiguous
  o New York / The Big Apple / NYC
  o Washington (Person, State, City, Sports Team...)

• Structured context helps humans
  o Who is this guy?
  o Where is this city?

• Conceptual frame helps machines
  o Explicit user intent decoding
  o Smarter indexing / search?

Page 12: Apachecon 2011 stanbol_ogrisel

Decoding User Intents

Page 13: Apachecon 2011 stanbol_ogrisel

Decoding User Intents

• Next Generation User Interfaces
  o Siri - conversational interface
  o IBM DeepQA: Watson for Health Care

• Tell Google about your stuff
  o Publish structured descriptions of your products
  o "3 bedrooms flat near Montmartre"

• Useful for non-public data as well
  o Intranet query: "ApacheCon slides"
  o Intranet query: "Xerox invoices"
  o Intranet query: "Xerox salesperson email"

Page 14: Apachecon 2011 stanbol_ogrisel

The Web of Data - How?

• RDF / Triple Stores / SPARQL
  o Graph stores with dynamic schemas
  o Strong interoperability

• JSON-LD
  o Upgrade your JSON with scoped vocabularies
  o Web / Mobile / JS developer friendly

• RDFa + schema.org & rNews
  o Publish annotations in structured markup
  o Vocabulary understood by Search Engines

Page 15: Apachecon 2011 stanbol_ogrisel

HTML example

<p>
  My name is Manu Sporny and you can give me a ring via
  1-800-555-0155.
  <img src="http://manu.sporny.org/images/manu.png" />
  I have a <a href="http://manu.sporny.org/">blog</a>.
</p>

Page 16: Apachecon 2011 stanbol_ogrisel

RDFa example

<p vocab="http://schema.org/"
   prefix="foaf: http://xmlns.com/foaf/0.1/"
   about="#manu" typeof="Person">
  My name is <span property="name">Manu Sporny</span>
  and you can give me a ring via
  <span property="telephone">1-800-555-0155</span>.
  <img rel="image"
       src="http://manu.sporny.org/images/manu.png" />
  I have a <a rel="foaf:weblog"
              href="http://manu.sporny.org/">blog</a>.
</p>

Page 17: Apachecon 2011 stanbol_ogrisel

JSON-LD example
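
As a sketch of what such a description can look like (not the slide's original example), the person from the RDFa slide could be written in JSON-LD roughly as follows. The keywords used here (`@context`, `@vocab`, `@id`, `@type`) follow JSON-LD 1.0, which was finalized after this talk, and the mapping of properties is an assumption that mirrors the RDFa markup.

```python
import json

# Hypothetical JSON-LD rendering of the person described on the RDFa slide.
# The "@context" scopes plain JSON keys to schema.org and FOAF terms.
person = {
    "@context": {
        "@vocab": "http://schema.org/",
        "foaf": "http://xmlns.com/foaf/0.1/",
    },
    "@id": "#manu",
    "@type": "Person",
    "name": "Manu Sporny",
    "telephone": "1-800-555-0155",
    "image": {"@id": "http://manu.sporny.org/images/manu.png"},
    "foaf:weblog": {"@id": "http://manu.sporny.org/"},
}

# Serialize to the JSON text a web or mobile client would consume.
document = json.dumps(person, indent=2)
print(document)
```

The result is still ordinary JSON: a client that ignores `@context` can read `name` and `telephone` directly, while a Linked Data client can expand the keys to full URIs.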

Page 18: Apachecon 2011 stanbol_ogrisel

2007 2008

2009 2010

Page 19: Apachecon 2011 stanbol_ogrisel

2011

Page 20: Apachecon 2011 stanbol_ogrisel

Bridging the Web of Data and my CMS

Page 22: Apachecon 2011 stanbol_ogrisel

Apache Stanbol

• Enhancer
  o Text analysis with Apache OpenNLP / Tika

• EntityHub / ContentHub
  o Linked Data Indexing with Apache Solr
  o Graph Storage with Apache Clerezza / Jena

• Reasoner / Rules
  o Inference with Apache Jena & OWL API

• Components / HTTP Services
  o OSGi with Apache Felix / JAX-RS with Jersey

Page 27: Apachecon 2011 stanbol_ogrisel

RESTful is Beautiful

Page 28: Apachecon 2011 stanbol_ogrisel

Minimalist HTTP Client

curl -X POST -H "Accept: text/turtle" \
     -H "Content-type: text/plain" \
     --data "John Smith was born in London." \
     http://stanbol.demo.nuxeo.com/engines
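
The same request can be sketched with Python's standard library. This is a minimal illustration, assuming the endpoint and headers from the curl command above; the helper name `build_enhance_request` is hypothetical, and the request is only built, not sent, since the demo server may no longer be online.

```python
import urllib.request

# Endpoint taken from the curl example; it may no longer be reachable.
STANBOL_ENGINES_URL = "http://stanbol.demo.nuxeo.com/engines"


def build_enhance_request(text, url=STANBOL_ENGINES_URL):
    """Build (but do not send) the POST request shown in the curl example."""
    return urllib.request.Request(
        url,
        data=text.encode("utf-8"),
        headers={
            "Accept": "text/turtle",       # ask for the RDF result as Turtle
            "Content-Type": "text/plain",  # the body is raw text to analyze
        },
        method="POST",
    )


request = build_enhance_request("John Smith was born in London.")

# To actually call a reachable Stanbol instance:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```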

Page 31: Apachecon 2011 stanbol_ogrisel

[Diagram: Local IT infrastructure (LAN) containing Nuxeo DM (with addon) and Apache Stanbol (Engine 1, Engine 2, Engine 3); data sources: DBpedia, Freebase, Geonames, LDAP; numbered steps 1 to 3]

Page 32: Apachecon 2011 stanbol_ogrisel

Stanbol Enhancer

Chain of Enhancement Engines

Language Detection (Tika)

Named Entity Detection (OpenNLP)

Linked Data dereferencing (Solr)

Refactoring / Translation (Jena)
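
The chaining idea can be illustrated with a toy pipeline in which each engine receives a content item and adds annotations to it. This is only a sketch of the pattern: the function names, the lookup tables and the dictionary-based content item are all hypothetical and do not reflect Stanbol's actual Java / OSGi API.

```python
# Toy stand-ins for the real engines; all tables below are hypothetical.
ENGLISH_STOPWORDS = {"the", "was", "in", "and", "of"}
KNOWN_NAMES = {"John Smith": "Person", "London": "Place"}
ENTITY_INDEX = {"London": "http://dbpedia.org/resource/London"}


def detect_language(item):
    """Guess the language from stopword hits (stand-in for Tika)."""
    words = item["text"].lower().split()
    hits = sum(1 for word in words if word in ENGLISH_STOPWORDS)
    item["language"] = "en" if hits else "unknown"
    return item


def detect_named_entities(item):
    """Find known names by substring lookup (stand-in for OpenNLP)."""
    item["mentions"] = [
        {"text": name, "type": kind}
        for name, kind in KNOWN_NAMES.items()
        if name in item["text"]
    ]
    return item


def link_entities(item):
    """Attach a URI to each mention when the local index knows it."""
    for mention in item["mentions"]:
        mention["uri"] = ENTITY_INDEX.get(mention["text"])
    return item


def run_chain(text, engines):
    """Pass one content item through every engine, in order."""
    item = {"text": text}
    for engine in engines:
        item = engine(item)
    return item


result = run_chain(
    "John Smith was born in London.",
    [detect_language, detect_named_entities, link_entities],
)
```

Because each engine only reads and extends the shared item, engines can be added, removed or reordered without changing the others.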

Page 33: Apachecon 2011 stanbol_ogrisel

Stanbol EntityHub

• Referenced Sites
  o DBpedia
  o Geonames
  o (NY Times, MusicBrainz, ProductDB, UniProt...)

• Fast local offline indices (Solr)
  o Batch indexing utilities for RDF dumps
  o Multilingual fulltext search in labels & descriptions

• Vocabulary mapping / merging

Page 34: Apachecon 2011 stanbol_ogrisel

Stanbol Reasoner

• RDFS / OWL-lite / OWL2

• Consistency checks
  o Cardinality checks: each person has 1 birth date
  o Range constraints: birth dates are valid dates

• Materializing types / properties
  o Types from subclass: Musician > Artist > Person
  o Symmetric property: A worked with B
  o Transitive property: A is located in B

• Query-time expansion / inference?
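
What "materializing" means can be shown with a few lines of plain Python. This is a hand-rolled sketch of the three inferences listed above, not how Jena or the OWL API implement reasoning; the helper names and the Montmartre / Paris / France facts are illustrative assumptions.

```python
# Subclass hierarchy from the slide: Musician > Artist > Person.
SUBCLASS_OF = {"Musician": "Artist", "Artist": "Person"}


def materialize_types(direct_type):
    """Return the direct type followed by all of its superclasses."""
    types = [direct_type]
    while types[-1] in SUBCLASS_OF:
        types.append(SUBCLASS_OF[types[-1]])
    return types


def symmetric_closure(pairs):
    """If (a, b) holds for a symmetric property, (b, a) holds too."""
    return set(pairs) | {(b, a) for a, b in pairs}


def transitive_closure(pairs):
    """Add (a, d) whenever (a, b) and (b, d) hold, until nothing is new."""
    closure = set(pairs)
    while True:
        new_pairs = {
            (a, d)
            for a, b in closure
            for c, d in closure
            if b == c
        } - closure
        if not new_pairs:
            return closure
        closure |= new_pairs


types = materialize_types("Musician")
worked_with = symmetric_closure({("A", "B")})
located_in = transitive_closure(
    {("Montmartre", "Paris"), ("Paris", "France")}
)
```

Storing the inferred facts up front makes queries simple lookups; the alternative raised on the slide, query-time expansion, would compute them on demand instead.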

Page 35: Apachecon 2011 stanbol_ogrisel

Stanbol Rules

Simple Prolog-like language

  uncleRule[
    has(<http://example.org/family.owl#hasParent>, ?x, ?z) .
    has(<http://example.org/family.owl#hasSibling>, ?z, ?y)
    -> has(<http://example.org/family.owl#hasUncle>, ?x, ?y)
  ]

SPARQL CONSTRUCT or SWRL

  PREFIX family: <http://example.org/family.owl#>
  CONSTRUCT { ?x family:hasUncle ?y }
  WHERE { ?x family:hasParent ?z .
          ?z family:hasSibling ?y }
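
To make the rule's effect concrete, here is a small Python sketch that applies the same join over a set of triples. It is not the Stanbol Rules engine; the function name and the sample individuals (alice, bob, charlie) are hypothetical.

```python
FAMILY = "http://example.org/family.owl#"


def apply_uncle_rule(triples):
    """Return the hasUncle triples implied by hasParent + hasSibling."""
    parents = [(s, o) for s, p, o in triples if p == FAMILY + "hasParent"]
    siblings = [(s, o) for s, p, o in triples if p == FAMILY + "hasSibling"]
    # Join on the shared variable ?z: x hasParent z, z hasSibling y.
    return {
        (x, FAMILY + "hasUncle", y)
        for x, z in parents
        for z2, y in siblings
        if z == z2
    }


facts = {
    ("alice", FAMILY + "hasParent", "bob"),
    ("bob", FAMILY + "hasSibling", "charlie"),
}
inferred = apply_uncle_rule(facts)
```

Like the CONSTRUCT query, the function only produces new triples and leaves the input facts untouched.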

Page 36: Apachecon 2011 stanbol_ogrisel

Online Demos

Simple analyzer with small index https://stanbol.demo.nuxeo.com

All services deployed http://dev.iks-project.eu:8081

Page 37: Apachecon 2011 stanbol_ogrisel

Building Stanbol Enhancer models from Wikipedia

with the Apache data tools

Page 38: Apachecon 2011 stanbol_ogrisel

Universal Topic Classification

Use Apache Lucene / Solr MoreLikeThis to perform a truncated nearest neighbors query in the TF-IDF vector space of Wikipedia

Page 39: Apachecon 2011 stanbol_ogrisel

Universal Topic Classification

• Index text of all articles grouped by topic

• Solr MoreLikeThis query on new document

• DBpedia dumps provide:
  o Text summaries for each article
  o “subject” relationships between articles and topics
  o “broader” / “narrower” SKOS hierarchy between topics
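
The principle behind this classifier can be sketched in pure Python: build one TF-IDF vector per topic from its aggregated text, then return the topics closest to a new document. This stand-in uses an explicit cosine similarity instead of Solr's MoreLikeThis, and the three tiny topic texts are made-up toy data, not DBpedia content.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def build_index(topic_texts):
    """Compute a TF-IDF vector per topic from its aggregated text."""
    term_counts = {
        topic: Counter(tokenize(text)) for topic, text in topic_texts.items()
    }
    n_topics = len(term_counts)
    doc_freq = Counter(
        term for counts in term_counts.values() for term in counts
    )
    # Smoothed inverse document frequency.
    idf = {
        term: math.log((1 + n_topics) / (1 + df)) + 1.0
        for term, df in doc_freq.items()
    }
    vectors = {
        topic: {term: tf * idf[term] for term, tf in counts.items()}
        for topic, counts in term_counts.items()
    }
    return vectors, idf


def cosine(u, v):
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def classify(text, vectors, idf, k=2):
    """Return the k topics whose vectors are closest to the text."""
    counts = Counter(tokenize(text))
    query = {term: tf * idf[term] for term, tf in counts.items() if term in idf}
    ranked = sorted(
        vectors, key=lambda topic: cosine(query, vectors[topic]), reverse=True
    )
    return ranked[:k]  # truncation: keep only the nearest neighbors


# Toy data, invented for illustration only.
topics = {
    "Kidney": "kidney renal nephron filtration blood",
    "Pensions": "pension retirement workers benefits age",
    "Astronauts": "astronaut space nasa mission moon",
}
vectors, idf = build_index(topics)
best = classify("nasa space mission delayed", vectors, idf, k=1)
```

A search engine index makes the same query practical at Wikipedia scale, since only topics sharing terms with the document need to be scored.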

Page 40: Apachecon 2011 stanbol_ogrisel

About the Data

• 500k purely technical categories
  “People_with_missing_birth_place”, “Rivers_in_Romania”

• 70k “semantically grounded” categories

• Paths to roots require both “technical” and “grounded” categories

• Scale:
  o 1.2M topic / topic links
  o 30M topic / article links

Page 41: Apachecon 2011 stanbol_ogrisel

Some results (Wikinews)

US children who celebrate Independence Day more likely to become Republicans, says Harvard study

o Fireworks
o Voting theory
o Republican Party (United States)
o Statistics
o Electoral systems

Page 42: Apachecon 2011 stanbol_ogrisel

Some results (Wikinews)

U.S. space agency NASA sues ex-astronaut

o American astronauts
o Aviation halls of fame
o Edwards Air Force Base
o Apollo program
o Exploration of the Moon

Page 43: Apachecon 2011 stanbol_ogrisel

Some results (Wikinews)

Hundreds of thousands of British public sector workers strike over planned pension changes

o Retirement in the United Kingdom
o United Kingdom pensions and benefits
o Pensions in the United Kingdom
o Labor disputes by country
o Labor disputes

Page 44: Apachecon 2011 stanbol_ogrisel

Some results (PLoS One)

Metabolic Programming during Lactation Stimulates Renal Na+ Transport in the Adult Offspring Due to an Early Impact on Local Angiotensin II Pathways

o Renal physiology
o Kidney
o Nephrology
o Hypertension
o Membrane biology

Page 45: Apachecon 2011 stanbol_ogrisel

Wrap Up

• Web of Data
  o brings Structured Context Frame
  o to decode User Intention

• NLP + Entities & Topics indices
  o to automate Content Enrichment
  o to provide Disambiguation

Page 46: Apachecon 2011 stanbol_ogrisel

Resources

Documentation, svn, mailing list:   http://incubator.apache.org/stanbol

IKS project blog:   http://blog.iks-project.eu

Blog posts about Semantic ECM:   http://blogs.nuxeo.com/dev/semantic/

Page 47: Apachecon 2011 stanbol_ogrisel

Thank you for your attention!

Olivier Grisel

[email protected]

https://twitter.com/ogrisel

Page 48: Apachecon 2011 stanbol_ogrisel

Training models for NER from Wikipedia

Extract sentences with link positions in Wikipedia articles

DBpedia to find the type of the target entity (Person, Location, Organization)

Apache Pig scripts to compute the join + format the result as training files for OpenNLP

Apache OpenNLP to build and evaluate the models

Apache Hadoop / Apache Whirr for distributed processing
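
The formatting step can be illustrated in a few lines of Python. The talk's pipeline does this with Apache Pig at scale; this sketch only shows the target format, the `<START:type> ... <END>` markup used by the OpenNLP name finder trainer. The function name is hypothetical, and the sentence is assumed to be whitespace-tokenized already, with non-overlapping link offsets.

```python
def to_opennlp_line(sentence, links):
    """Wrap linked spans in OpenNLP name finder training tags.

    `links` holds (start, end, entity_type) character offsets into a
    whitespace-tokenized sentence; spans are assumed not to overlap.
    """
    parts = []
    cursor = 0
    for start, end, entity_type in sorted(links):
        parts.append(sentence[cursor:start])  # text before the link
        parts.append(f"<START:{entity_type}> {sentence[start:end]} <END>")
        cursor = end
    parts.append(sentence[cursor:])  # text after the last link
    return "".join(parts)


# The entity types would come from DBpedia; here they are given by hand.
line = to_opennlp_line(
    "John Smith was born in London .",
    [(0, 10, "person"), (23, 29, "location")],
)
print(line)
```

One such line per sentence, written to a text file, is the kind of input the OpenNLP trainer reads to build and evaluate a model.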

Page 49: Apachecon 2011 stanbol_ogrisel
Page 50: Apachecon 2011 stanbol_ogrisel
Page 51: Apachecon 2011 stanbol_ogrisel
Page 52: Apachecon 2011 stanbol_ogrisel

52