ESWC SS 2013 - Thursday Keynote Vassilis Christophides: Preserving linked data

71
Preserving Linked Data: Challenges and Opportuni8es Vassilis Christophides University of Crete & FORTHICS Heraklion, Crete

Transcript of ESWC SS 2013 - Thursday Keynote Vassilis Christophides: Preserving linked data

Page 1: ESWC SS 2013 - Thursday Keynote Vassilis Christophides: Preserving linked data

Preserving  Linked  Data:  Challenges  and  Opportuni8es  

Vassilis  Christophides  University  of  Crete  &  FORTH-­‐ICS    

Heraklion,  Crete  

Page 2: ESWC SS 2013 - Thursday Keynote Vassilis Christophides: Preserving linked data

Many Web sites containing unstructured, textual content

Few large Web sites are specialized on specific content types •  Semi-structured/xml

content floating around e-services

A  bit  of  History:  from  Web  1.0  &  2.0  …  

Page 3: ESWC SS 2013 - Thursday Keynote Vassilis Christophides: Preserving linked data

A  bit  of  History:  …to  Web  3.0  

Data as Service (DaaS) Syndicating arbitrarily semi-structured content across Web sites using higher-lever data abstractions (entities)

Data themselves become “infrastructure” •  a valuable asset, on which

science, technology, the economy and society can advance

Page 4: ESWC SS 2013 - Thursday Keynote Vassilis Christophides: Preserving linked data

The  Emerging  Web  of  Data    

}  Global  data  space  connec8ng  data  from  diverse  domains  and  sources    }  Primary  objects:  “things”  (or  descrip8on  of  things)  }  Links  between  “things”  

}  Granularity  of  informa8on:  from  en8re  datasets  to  atomic  data  

Adapted  from  Chris  Bizer,  Richard  Cyganiak,  Tom  Heath,  available  at  h[p://linkeddata.org/guides-­‐and-­‐tutorials  

Web  of  Data

Conceptual  Representation

entity entity entityentity

Typed  Links Typed  Links Typed  Links

Spreadsheets

HTMLXMLRDFa

URIs

URIs

URIs

URIs

URIs URIsrepresent

SemiStructuredTriplesStatistical

represent represent represent

A Web of things in

the world, described by data on

the Web

Page 5: ESWC SS 2013 - Thursday Keynote Vassilis Christophides: Preserving linked data

En88es:  an  Invaluable  Asset  

•  “En88es”  is  what  a  large  part  of  our  knowledge  is  about  

Page 6: ESWC SS 2013 - Thursday Keynote Vassilis Christophides: Preserving linked data

Web  Data  of  Increasing  Standardiza8on  

Not  all  linked  data  is  open  and  not  all  open  data  is  linked!  ★  Available  on  the  web  (whatever  format)  but  with  an  open  license,  to  be  Open  Data  ★★Available  as  machine-­‐readable  structured  data  (e.g.  excel  vs.  image  scan  of  a  table)  ★★★  as  (2)  plus  non-­‐proprietary  format  (e.g.  CSV  instead  of  excel)  ★★★★  as  (3),  plus  using  open  standards  from  W3C  (RDF    and  SPARQL  )  to  iden8fy  things  through  dereferenceable  HTTP  URIs,  to  ensure  effec8ve  access    ★★★★★  as  all  the  above  plus  establishing  links  between  data  of  different  sources      

File  format  

Recommenda0ons  (on  a  scale  of  0-­‐5)  

csv   ★★★  

xls   ★  

pdf   ★  

doc   ★  

xml   ★★★★  

rdf   ★★★★★  

shp   ★★★  

ods   ★★  

8ff   ★  

jpeg   ★  

json   ★★★  

txt   ★  

html   ★★  

Page 7: ESWC SS 2013 - Thursday Keynote Vassilis Christophides: Preserving linked data

The  LOD  Cloud  

Media

Government

Geo

Publications

User-generated

Life sciences

Cross-domain

Page 8: ESWC SS 2013 - Thursday Keynote Vassilis Christophides: Preserving linked data

Basic  Terminology  •  A  dataset  is  a  set  of  RDF  triples  that  are  published,  maintained  or  aggregated  by  a  single  provider  

•  A  linkset  is  a  collec8on  of  RDF  links  between  two  datasets  i.e.  triples  whose  subject  &  object  are  described  in  different  datasets  

•  But  what  do  we  really  know  about  the  produc8on  and  cura8on  processes  of  the  sources  publishing  in  RDF?  

Page 9: ESWC SS 2013 - Thursday Keynote Vassilis Christophides: Preserving linked data

What  is  Digital  Preserva8on  (DP)?  •  Ensure  accessibility  and  usability  of  digital  objects  over  &me  and  across  domains,  and  protect  them  from  media  failure,  physical  loss,  &  hardware/so6ware  obsolescence  

•  Tradi8onally,  the  objec8ve  is  to  preserve  on  the  long  run  the  authen&city  of  a  digital  object  as  originally  recorded  against  any  technological  change  – extensive  annota8on  of  digital  objects  with  informa8on  related  to  their  significant  proper8es    •  content  format  •  context  of  produc&on  •  structural  meta-­‐data  •  current  behavior,  …  

 

Page 10: ESWC SS 2013 - Thursday Keynote Vassilis Christophides: Preserving linked data

Linked  Data  vs  Digital  Objects  •  DP  techniques  proposed  for  memory  ins8tu8ons  and              data  centers  concern  fixed  digital  objects  featuring            mostly  unstructured  data  –  raw  data  sets  held  in  files,  scholarly  data  held  in  papers…  

•  Linked  Data  are  digitally-­‐born  objects  which  – are  graph-­‐structured  op8onally  sa8sfying  integrity  constraints  (expressed  in  higher  logic  formalisms)  

– exhibit  complex  interdependencies  across  sources  as  well  as  varying  data  quality  (curated  knowledge  bases  vs  extracted  from  text  or  Web  2.0  sources)    

– change  without  no8fica8on  at  different  granularity        levels  to  keep  them  fit  for  contemporary  purposes,  and  be  available  for  discovery  and  re-­‐use  in  the  future  

Page 11: ESWC SS 2013 - Thursday Keynote Vassilis Christophides: Preserving linked data

Linked  Data:  Behind  the  scenes!  

Property  names   Property  values  

Page 12: ESWC SS 2013 - Thursday Keynote Vassilis Christophides: Preserving linked data

Different  Descrip8ons  of  the  same  En8ty  URI   dbpedia:Statue_of_Lib

erty  

rdfs:label Statue of Liberty, Freiheitsstatue, …

dbpprop:location

New York City, New York, U.S., dbpedia:Liberty_Island

dbpprop:sculptor dbpedia:Frédéric_Auguste_Bartholdi

dcterms:subject dbpedia_category:1886_sculptures, …

foaf:isPrimaryTopicOf http://en.wikipedia.org/wiki/Statue_of_Liberty

dbpprop:beginningDate 1886-10-28 (xsd:date)

dbpprop:restored 19381984 (xsd:integer)

dbpprop:visitationNum 3200000 (xsd:integer)

dbpprop:visitationYear 2009 (xsd:integer)

h[p://www.w3.org/ns/prov#wasDerivedFrom

h[p://en.wikipedia.org/wiki/Statue_of_Liberty?oldid=494328330

URI   C:m.072p8  

fb:art_form fb:m.06msq (Sculpture)

fb:media fb:m.025rsfk (Copper)

fb:architect fb:m.0jph6 (F. Bartholdi), fb:m.036qb (G. Eiffel), fb:m.02wj4z (R. Hunt)

fb:height_meters 93

fb:opened 1886-10-28

12

URI   yago:Statue_of_Liberty  

skos:prefLabel Statue of Liberty

rdf:type yago:History_museums_in_NY, yago:GeoEntity

yago:hasHeight 46.0248

yago:wasCreatedOnDate 1886-##-##

yago:isLocatedIn

yago:Manhattan, yago:Liberty_Island,

yago:hasWikipediaUrl h[p://en.wikipedia.org/wiki/Statue_of_Liberty

Page 13: ESWC SS 2013 - Thursday Keynote Vassilis Christophides: Preserving linked data

Linked  Datasets  Depend  on  Vocabularies  URI   dbpedia:Statue_of_Lib

erty  

rdfs:label Statue of Liberty, Freiheitsstatue, …

dbpprop:location

New York City, New York, U.S., dbpedia:Liberty_Island

dbpprop:sculptor dbpedia:Frédéric_Auguste_Bartholdi

dcterms:subject dbpedia_category:1886_sculptures, …

foaf:isPrimaryTopicOf http://en.wikipedia.org/wiki/Statue_of_Liberty

dbpprop:beginningDate 1886-10-28 (xsd:date)

dbpprop:restored 19381984 (xsd:integer)

dbpprop:visitationNum 3200000 (xsd:integer)

dbpprop:visitationYear 2009 (xsd:integer)

h[p://www.w3.org/ns/prov#wasDerivedFrom

h[p://en.wikipedia.org/wiki/Statue_of_Liberty?oldid=494328330

URI   C:m.072p8  

fb:art_form fb:m.06msq (Sculpture)

fb:media fb:m.025rsfk (Copper)

fb:architect fb:m.0jph6 (F. Bartholdi), fb:m.036qb (G. Eiffel), fb:m.02wj4z (R. Hunt)

fb:height_meters 93

fb:opened 1886-10-28

13

URI   yago:Statue_of_Liberty  

skos:prefLabel Statue of Liberty

rdf:type yago:History_museums_in_NY, yago:GeoEntity

yago:hasHeight 46.0248

yago:wasCreatedOnDate 1886-##-##

yago:isLocatedIn

yago:Manhattan, yago:Liberty_Island,

yago:hasWikipediaUrl h[p://en.wikipedia.org/wiki/Statue_of_Liberty

Page 14: ESWC SS 2013 - Thursday Keynote Vassilis Christophides: Preserving linked data

Linked  Datasets  Have  Varying  Quality  URI   dbpedia:Statue_of_Lib

erty  

rdfs:label Statue of Liberty, Freiheitsstatue, …

dbpprop:location

New York City, New York, U.S., dbpedia:Liberty_Island

dbpprop:sculptor dbpedia:Frédéric_Auguste_Bartholdi

dcterms:subject dbpedia_category:1886_sculptures, …

foaf:isPrimaryTopicOf http://en.wikipedia.org/wiki/Statue_of_Liberty

dbpprop:beginningDate 1886-10-28 (xsd:date)

dbpprop:restored 19381984 (xsd:integer)

dbpprop:visitationNum 3200000 (xsd:integer)

dbpprop:visitationYear 2009 (xsd:integer)

h[p://www.w3.org/ns/prov#wasDerivedFrom

h[p://en.wikipedia.org/wiki/Statue_of_Liberty?oldid=494328330

URI   C:m.072p8  

fb:art_form fb:m.06msq (Sculpture)

fb:media fb:m.025rsfk (Copper)

fb:architect fb:m.0jph6 (F. Bartholdi), fb:m.036qb (G. Eiffel), fb:m.02wj4z (R. Hunt)

fb:height_meters 93

fb:opened 1886-10-28

14

URI   yago:Statue_of_Liberty  

skos:prefLabel Statue of Liberty

rdf:type yago:History_museums_in_NY, yago:GeoEntity

yago:hasHeight 46.0248

yago:wasCreatedOnDate 1886-##-##

yago:isLocatedIn

yago:Manhattan, yago:Liberty_Island,

yago:hasWikipediaUrl h[p://en.wikipedia.org/wiki/Statue_of_Liberty

Page 15: ESWC SS 2013 - Thursday Keynote Vassilis Christophides: Preserving linked data

Linked  Datasets  Evolve  Over  Time  

15  

Current  version  of  DBpedia   Previous  version  of  DBpedia  URI   dbpedia:Statue_of_Lib

erty  

rdfs:label Statue of Liberty, Freiheitsstatue, …

dbpprop:location

New York City, New York, U.S., dbpedia:Liberty_Island

dbpprop:sculptor dbpedia:Frédéric_Auguste_Bartholdi

dcterms:subject dbpedia_category:1886_sculptures, …

foaf:isPrimaryTopicOf http://en.wikipedia.org/wiki/Statue_of_Liberty

dbpprop:beginningDate 1886-10-28 (xsd:date)

dbpprop:restored 19381984 (xsd:integer)

dbpprop:visitationNum 3200000 (xsd:integer)

dbpprop:visitationYear 2009 (xsd:integer)

h[p://www.w3.org/ns/prov#wasDerivedFrom

h[p://en.wikipedia.org/wiki/Statue_of_Liberty?oldid=494328330

URI   dbpedia:Statue_of_Liberty  

rdfs:label Statue of Liberty, Freiheitsstatue, …

dbpprop:location

New York City, New York, U.S., dbpedia:Liberty_Island

dbpprop:sculptor dbpedia:Frédéric_Auguste_Bartholdi

dcterms:subject dbpedia_category:1886_sculptures, …

foaf:isPrimaryTopicOf http://en.wikipedia.org/wiki/Statue_of_Liberty

dbpprop:built 1886-10-28 (xsd:date)

dbpprop:restored 19381984 (xsd:integer)

dbpprop:hasHeight 151 (xsd:integer)

h[p://www.w3.org/ns/prov#wasDerivedFrom

h[p://en.wikipedia.org/wiki/Statue_of_Liberty?oldid=494328330

Page 16: ESWC SS 2013 - Thursday Keynote Vassilis Christophides: Preserving linked data

DP  Challenges  for  Linked  Data    

–  Incompleteness:  real  world  en8tles  are                          usually  par8ally  described  in  data  sources  

–  Redundancy:  the  same  real  world  en88es                          are  represented  in  mul8ple  data  sources  

–  Inconsistency:  various  forms  of  inter  and                                intra  source  data  conflicts  

–  Incorrectness:  errors  can  be  propagated  from  one  source  to  the  other  due  to  copying  

•  Mastering  the  varying  data  quality  is  a  prerequisite  for  trus8ng  preserved  data  origina8ng  from  various  sources    

•  The  ‘publish-­‐first-­‐refine-­‐later’  philosophy  of  the  Linked  Data  movement,  complemented  by  the  open,  decentralized  nature  of  the  Web  results  in  Data  

Page 17: ESWC SS 2013 - Thursday Keynote Vassilis Christophides: Preserving linked data

DP  Challenges  for  Linked  Data    •  S8ll  data  publishing  and  preserva8on  are  two  largely  separated  

processes  which  are  addressed  by  SW  and  DP  communi8es.    –  Shouldn’t  the  “publishers”  worry  about  preserving  all  their  hard  work?  

–  Shouldn’t  the  “preservers”  be  concerned  about  the  way  they  organize,  link,  and  annotate  their  linked  data?  

•  Need  to  break  down  the  tradi8onal  boundaries  between  the  data  creators  &  publishers  and  the  data  archivists  &  brokers  –  Integrate  cura8on  [when  high  current/ongoing  interest]  and  preserva8on  [when  fall  off  in  interest]  ac8vi8es  

– Distribute  preserva8on  costs  over  the  life-­‐cycle  of  linked  datasets  among  data  stewards  (pay-­‐as-­‐you-­‐go  data  preserva8on)  

Page 18: ESWC SS 2013 - Thursday Keynote Vassilis Christophides: Preserving linked data

Vision  2030:  High-­‐Level  Group  on  Scien8fic  Data  

“Researchers  and  prac88oners  from  any  discipline  are  able    to  find,  access  and  process  the  data  they  need.  They  can  be  confident  in  their  ability  to  use  and  understand  data  and  they  can  evaluate  the  degree  to  which  the  data  can  be  trusted”  •  How  we  support  future  users  in  Trus8ng  Data?  

– By  assessing  their  quality  wr.t  to  the  en88es  they  describe  – By  recording  from  which  sources  did  they  originate  from    – By  understanding  how  their  iden8ty  and  integrity  has  evolved  over  8me  

– By  ensuring  that  they  has  been                                                                                      preserved  properly    

Page 19: ESWC SS 2013 - Thursday Keynote Vassilis Christophides: Preserving linked data

Frame  Linked  Data  Preserva8on  as  a  Research  Problem  

•  How  can  we  iden8fy  that  different  resource  descrip8ons  within  or  across  datasets  refer  to  the  same  real-­‐world  en8ty  (the  en8ty  resolu8on  problem)  to  convey  various  aspects  of  the  quality  of  the  harvested  datasets  (e.g.,  redundancy,  completeness,  freshness)?  

•  How  can  we  record  dependencies  of  datasets  (the  provenance  problem)  and  how  they  can  smoothly  represented  along  with  other  (temporal,  spa8al,  thema8c)  metadata  (the  annota8on  problem)?  

Page 20: ESWC SS 2013 - Thursday Keynote Vassilis Christophides: Preserving linked data

Frame  Linked  Data  Preserva8on  as  a  Research  Problem  

•  How  can  we  monitor  changes  of  third-­‐party  datasets  (the  evolu8on  tracking  problem)  or  how  can  local/remote  data  imperfec8ons  (e.g.,  due  to  change  propaga8on)  can  be  repaired  (the  cura8on  problem)?  

•  How  can  we  appease  what  versions  of  linked  datasets  should  to  be  preserved  for  future  use  (the  mul8-­‐version  archive  consistency  problem)  and  how  we  will  be  able  to  ask  a  query  not  only  about  any  past  state  of  the  dataset  but  also  about  the  evolu8on  of  some  part  of  it  (the  longitudinal  querying  problem)?    

Page 21: ESWC SS 2013 - Thursday Keynote Vassilis Christophides: Preserving linked data

21  

dbpedia:Statue_of_Liberty  

dbpedia:Liberty_Island  

C:m.072p8  

yago:Statue_of_Liberty  

yago:History_museums_in_NY  

yago:  GeoEn8ty  

yago:  Liberty_Island  

1886-­‐##-­‐##  

yago:wasCreated  onDate  

yago:isLocatedIn  rdf:type  

rdf:type  

dbpedia_category:1886_sculptures  

dbpedia:  Frédéric_Auguste

_Bartholdi  

3200000  

u:m.06msq  

u:m.0jph6  

u:architect  

u:art_form  

dcterms:subject  

dbprop:visita8onNum  

dbprop:  loca8on  

dbprop:  sculptor  

Statue  of  Liberty  

skos:  prefLabel  

ER  Example  

En8ty  resolu8on:  The  problem  of  iden8fying  descrip8ons  of  the  same  en8ty  within  one  or  across  mul8ple  data  sources  wrt.  a  match  func8on  

•  No  longer  just  matching  of  en8ty  names    

Page 22: ESWC SS 2013 - Thursday Keynote Vassilis Christophides: Preserving linked data

dbpedia:Statue_of_Liberty  

yago:Statue_of_Liberty  

owl:sameAs  

An  en8ty  resolu8on  is  a  par88on  of  a  set  of  en8ty  descrip8ons,  such  that:  1.   Matching  en8ty  descrip8ons  are  placed  in  the  same  subset  2.   All  the  descrip8ons  of  the  same  subset  match  

C:m.072p8  

yago:  Liberty_Island  

u:m.0jph6  owl:sameAs  

owl:sameAs  

dbpedia:Liberty_Island  

dbpedia:  Frédéric_Auguste_Bartholdi  

yago:  Liberty_Island  

dbpedia:  Frédéric_Auguste_

Bartholdi  u:m.0jph6  

ER  Example  

u:architect  

yago:isLocatedIn  

Persons

Places

Artifacts

=>  Need  to  infer  also  other  kind  of    rela8onships  than  “equivalence”  

Page 23: ESWC SS 2013 - Thursday Keynote Vassilis Christophides: Preserving linked data

What  Makes  ER  Difficult  for  Linked  Data    

•  Linked  Data  are  inherently  semi-­‐structured  –  several  seman8c  types  (see  rdf:type  proper8es  in  Yago)  could  be  simultaneously  employed  resul8ng  to  en8ty  descrip8ons  even  of  the  same  type  (persons,  places,  …)  with  quite  different  structures  

 •  Linked  Data  heavily  rely  on  heterogeneous  vocabularies  

–  DBPedia  3.4:  50,000  proper8es  –  Google  Base:100,000  schemata  and  10,000  en8ty  types  

•  Linked  Data  are  Big  Data  –  The  LOD  cloud  consists  of  32  billion  RDF  triples  (last  update:  2011)  –  DBPedia  3.4:  36.5  million  triples,  2.1  million  en8ty  descrip8ons      –  BTC09:  1.15  billion  triples,  182  million  en8ty  descrip8ons  

=>  Deal  with  loosely  structured  en00es  

=>  Calls  for  efficient  parallel  techniques  

=>  Need  for  cross-­‐domain  techniques  

Page 24: ESWC SS 2013 - Thursday Keynote Vassilis Christophides: Preserving linked data

The  Role  of  Similarity  Func8ons  

Set  of  all  pairs  of  en8ty  

descrip8ons  

Matching  pairs  of  en8ty  

descrip8ons  

Pairs  of  en8ty  descrip8ons  sa8sfying  

a  strict  similarity  func8on      

Page 25: ESWC SS 2013 - Thursday Keynote Vassilis Christophides: Preserving linked data

The  Role  of  Similarity  Func8ons  

Set  of  all  pairs  of  en8ty  

descrip8ons  

Matching  pairs  of  en8ty  

descrip8ons  

Pairs  of  en8ty  descrip8ons  sa8sfying  

a  loose  similarity  func8on      

Page 26: ESWC SS 2013 - Thursday Keynote Vassilis Christophides: Preserving linked data

En8ty  Collec8ons  and  ER  Types  •  2  kinds  of  en8ty  collec8ons  given  as  input  to  an  ER  task:  –  Clean,  which  are  duplicate-­‐free  (e.g.,DBPedia,Freebase,DBLP)  

– Dirty,  which  contain  duplicate  en8ty  descrip8ons  in  themselves  (e.g.,  Google  Scholar,  Citeseer)  

•  An  ER  task  that  receives  as  input  two  en8ty  collec8ons  can  be  of  the  following  types:  –  Clean-­‐Clean  ER:  Given  two  clean,  but  overlapping  en8ty  collec8ons,  iden8fy  the  common  en8ty  descrip8ons  (a.k.a.  the  Record  Linkage  in  databases)  

– Dirty-­‐Clean  ER:  – Dirty-­‐Dirty  ER:  Iden8fy  unique  en8ty  descrip8ons  contained  in  union  of  the  input  en8ty  collec8ons  (a.k.a.  the  Deduplica8on  problem  in  databases)  

•  In  the  Web  of  Data  we  encountering  more  Clean-­‐Clean  ER    

Page 27: ESWC SS 2013 - Thursday Keynote Vassilis Christophides: Preserving linked data

Scaling  ER  to  the  Web  of  Data  •  Blocking  to  reduce  the  number  of  comparisons:  

–  Split  en8ty  descrip8ons  into  blocks  –  Compare  each  descrip8on  to  the  descrip8ons  within  the  same  block  

•  Desiderata  –  Similar  en8ty  descrip8ons  in  the  same  block  – Dissimilar  en8ty  descrip8ons  in  different  blocks  

•  Blocking  approaches  are  dis8nguished  between:  –  Par88oning,  where  each  descrip8on  is  placed  in  exactly  one  block  :  Fewer  comparisons  

– Overlapping,  where  each  descrip8on  can  be  placed  in  more  than  one  block  :  More  iden8fied  matches  

Page 28: ESWC SS 2013 - Thursday Keynote Vassilis Christophides: Preserving linked data

Blocking  techniques  for  Linked  Data  

•  Mul8-­‐rela8onal  and  cross-­‐domain  en8ty  resolu8on  – Token  blocking  – Property  clustering  – Prefix-­‐Infix(-­‐Suffix)  

•  Large-­‐scale  en8ty  resolu8on  – Choose  a  computa8onally  not  expensive  similarity  func8on  

– Process  in  parallel  par88ons  of  the  en8ty  graph  in  Map/Reduce  nodes    

Page 29: ESWC SS 2013 - Thursday Keynote Vassilis Christophides: Preserving linked data

Token  Blocking  [Papadakis  et  al.  2011]  

•  Ignores  (seman8c  or  structured)  types  of  en88es  – String  similarity  of  tokens  of  property  literal  values  

•  Dis8nct  tokens  of  each  property  value  of  each  en8ty  descrip8on  corresponds  to  a  block  – Each  block  contains  all  en88es  with  the  corresponding  token  

•  High  recall  at  the  cost  of  low  precision  and  low  efficiency:  – Most  true  matches  are  placed  in  the  same  block  – Many  non-­‐matches  are  also  placed  in  the  same  block  – The  same  pair  of  descrip8ons  is  contained  in  many  blocks  

Page 30: ESWC SS 2013 - Thursday Keynote Vassilis Christophides: Preserving linked data

String  Similarity  Measures  

Page 31: ESWC SS 2013 - Thursday Keynote Vassilis Christophides: Preserving linked data

Token  Blocking  -­‐  Example  Entity descriptions: e1= {(name, Eiffel Tower), (architect, Sauvestre), (year, 1889),

(location, Paris)}

e2= {(name, Statue of Liberty), (architect, Bartholdi Eiffel), (year, 1886), (located, NY)}

e3= {(about, Lady Liberty), (architect, Eiffel), (location, NY)}

e4= {(about, Eiffel Tower), (architect, Sauvestre), (year, 1889), (located, Paris)}

e5= {(name, White Tower), (year-constructed, 1450), (location, Thessaloniki)}

Generated  blocks:  

The  pair  (e1,  e4)  is  contained  in  5  different  blocks!  

Eiffel  

e1,  e2,  e3,  e4  

Tower  

e1,  e4,  e5  

Statue  

e2  

Liberty  

e2,  e3  

White  

e5  

1889  

e1,  e4  

Bartholdi  

e2  

NY  

e2,  e3  

Sauvestre  

e1,  e4  

Paris  

e1,  e4  

1886  

e2  

1450  

e5  

Lady  

e3  

Thessaloniki  

e5  

Page 32: ESWC SS 2013 - Thursday Keynote Vassilis Christophides: Preserving linked data

Property  clustering  [Papadakis  et  al.  2013]  

•  Assuming  two  duplicate-­‐free  datasets  – Recognize  similarity  of  proper8es  based  on  the  string  similarity  of  their  literal  values  occurring  in  en8ty  descrip8ons    

•  Two  main  blocking  steps:  1. Similar  proper8es  are  placed  together  in  non-­‐overlapping  clusters  

2. Token  blocking  is  performed  on  the  descrip8ons  of  each  cluster  

 

Page 33: ESWC SS 2013 - Thursday Keynote Vassilis Christophides: Preserving linked data

Clustering  En8ty  Proper8es  

1. For  each  property  of  dataset  D1:  – Find  the  most  similar  property  of  dataset  D2  

2. For  each  property  of  dataset  D2:    – Find  the  most  similar  property  of  dataset  D1  

3. Compute  the  transi8ve  closure  of  the  generated  pairs  of  similar  proper8es  

4. Similar  proper8es  form  clusters  5. All  singleton  clusters  are  merged  into  a  common  one  

Page 34: ESWC SS 2013 - Thursday Keynote Vassilis Christophides: Preserving linked data

D1   D2  

e1= {(about, Eiffel Tower), (architect, Sauvestre), (year, 1889), (located, Paris)}

e2= {(about, Statue of Liberty), (architect, Bartholdi Eiffel), (year, 1886), (located, NY)}

e3= {(about, Auguste Bartholdi), (born, 1834)}

e4= {(about, Joan Tower), (born, 1938)}

e5= {(work, Lady Liberty), (artist, Bartholdi), (location, NY)}

e6= {(work, Eiffel Tower), (year-constructed, 1889), (location, Paris)}

e7= {(work, Bartholdi Fountain), (year-constructed,1876), (location, Washington D.C.)}

e1  e2  e3  

e4  

e5  e6  e7  

Clustering  En8ty  Proper8es:  Example  

Page 35: ESWC SS 2013 - Thursday Keynote Vassilis Christophides: Preserving linked data

e1= {(about, Eiffel Tower), (architect, Sauvestre), (year, 1889), (located, Paris)}

e2= {(about, Statue of Liberty), (architect, Bartholdi Eiffel), (year, 1886), (located, NY)}

e3= {(about, Auguste Bartholdi), (born, 1834)}

e4= {(about, Joan Tower), (born, 1938)}

e5= {(work, Lady Liberty), (artist, Bartholdi), (location, NY)}

e6= {(work, Eiffel Tower), (year-constructed, 1889), (location, Paris)}

e7= {(work, Bartholdi Fountain), (year-constructed,1876), (location, Washington D.C.)}

Clustering  En8ty  Proper8es:  Example  

Finding  the  property  of  D2  that  is  the  most  similar  to  the  property  “about”  of  D1:  values  of  about:  {Eiffel,  Tower,  Statue,  Liberty,  Auguste,  Bartholdi,  Juan,  Tower}    compared  to  (with  Jaccard  similarity)  :  values  of  work:  {Lady,  Liberty,  Eiffel,  Tower,  Bartholdi,  Fountain}    à  Jaccard  =  4/10  values  of  ar8st:  {Bartholdi}  à  Jaccard  =  1/8  values  of  loca8on:  {NY,  Paris,  Washington,  D.C.}  à  Jaccard  =  0  values  of  year-­‐constructed:  {1889,  1876}  à  Jaccard  =  0  

Page 36: ESWC SS 2013 - Thursday Keynote Vassilis Christophides: Preserving linked data

e1= {(about, Eiffel Tower), (architect, Sauvestre), (year, 1889), (located, Paris)}

e2= {(about, Statue of Liberty), (architect, Bartholdi Eiffel), (year, 1886), (located, NY)}

e3= {(about, Auguste Bartholdi), (born, 1834)}

e4= {(about, Joan Tower), (born, 1938)}

e5= {(work, Lady Liberty), (artist, Bartholdi), (location, NY)}

e6= {(work, Eiffel Tower), (year-constructed, 1889), (location, Paris)}

e7= {(work, Bartholdi Fountain), (year-constructed,1876), (location, Washington D.C.)}

Clustering  En8ty  Proper8es:  Example  

D1   D2  about  architect  year  born  located  

work  ar8st  year-­‐constructed  loca8on  

Page 37: ESWC SS 2013 - Thursday Keynote Vassilis Christophides: Preserving linked data

Clustering  En8ty  Proper8es:  Example  

•  Similarly  for  the  rest  of  the  proper8es…  

D1   D2  about  architect  year  born  located  

work  ar8st  year-­‐constructed  loca8on  

Page 38: ESWC SS 2013 - Thursday Keynote Vassilis Christophides: Preserving linked data

Clustering  En8ty  Proper8es:  Example  

•  Similarly  for  the  rest  of  the  proper8es…  

D1   D2  about  architect  year  born  located  

work  ar8st  year-­‐constructed  loca8on  

Page 39: ESWC SS 2013 - Thursday Keynote Vassilis Christophides: Preserving linked data

Clustering  En8ty  Proper8es:  Example  

•  Similarly  for  the  rest  of  the  proper8es…  

D1   D2  about  architect  year  born  located  

work  ar8st  year-­‐constructed  loca8on  

Page 40: ESWC SS 2013 - Thursday Keynote Vassilis Christophides: Preserving linked data

Clustering  En8ty  Proper8es:  Example  

•  Similarly  for  the  rest  of  the  proper8es…  

D1   D2  about  architect  year  born  located  

work  ar8st  year-­‐constructed  loca8on  

Page 41: ESWC SS 2013 - Thursday Keynote Vassilis Christophides: Preserving linked data

Clustering  En8ty  Proper8es:  Example  

•  Similarly  for  the  rest  of  the  proper8es…  

D1   D2  about  architect  year  born  located  

work  ar8st  year-­‐constructed  loca8on  

Page 42: ESWC SS 2013 - Thursday Keynote Vassilis Christophides: Preserving linked data

Clustering  En8ty  Proper8es:  Example  

•  Similarly  for  the  rest  of  the  proper8es…  

D1   D2  about  architect  year  born  located  

work  ar8st  year-­‐constructed  loca8on  

Page 43: ESWC SS 2013 - Thursday Keynote Vassilis Christophides: Preserving linked data

Clustering  En8ty  Proper8es:  Example  

•  Similarly  for  the  rest  of  the  proper8es…  

D1   D2  about  architect  year  born  located  

work  ar8st  year-­‐constructed  loca8on  

Page 44: ESWC SS 2013 - Thursday Keynote Vassilis Christophides: Preserving linked data

Clustering  En8ty  Proper8es:  Example  

•  Similarly  for  the  rest  of  the  proper8es…  

D1   D2  about  architect  year  born  located  

work  ar8st  year-­‐constructed  loca8on  

Page 45: ESWC SS 2013 - Thursday Keynote Vassilis Christophides: Preserving linked data

Clustering  En8ty  Proper8es:  Example  

D1   D2  

e1= {(about, Eiffel Tower), (architect, Sauvestre), (year, 1889), (located, Paris)}

e2= {(about, Statue of Liberty), (architect, Bartholdi Eiffel), (year, 1886), (located, NY)}

e3= {(about, Auguste Bartholdi), (born, 1834)}

e4= {(about, Joan Tower), (born, 1938)}

e5= {(work, Lady Liberty), (artist, Bartholdi), (location, NY)}

e6= {(work, Eiffel Tower), (year-constructed, 1889), (location, Paris)}

e7= {(work, Bartholdi Fountain), (year-constructed,1876), (location, Washington D.C.)}

D1   D2  about  architect  year  born  located  

work  ar8st  year-­‐constructed  loca8on  

Page 46: ESWC SS 2013 - Thursday Keynote Vassilis Christophides: Preserving linked data

Clustering  En8ty  Proper8es:  Example  

•  Compute  the  transi8ve  closure  of  the  generated  property  name  pairs  – Connected  proper8es  form  clusters  

     Pairs:  (about,  work),  (work,  about),  (ar8st,  architect),  (architect,  work)  

Transi8ve  closure:  

D1   D2  about  architect  year  born  located  

work  ar8st  year-­‐constructed  loca8on  

about  work  

architect  ar8st  C1  

Page 47: ESWC SS 2013 - Thursday Keynote Vassilis Christophides: Preserving linked data

Clustering  En8ty  Proper8es:  Example  

•  Compute  the  transi8ve  closure  of  the  generated  property  name  pairs  – Connected  proper8es  form  clusters  

     Pairs:  (year,  year-­‐constructed),  (year-­‐constructed,  year),  (year-­‐constructed,  born)  

Transi8ve  closure:  about  work  

architect  ar8st  C1  

D1   D2  about  architect  year  born  located  

work  ar8st  year-­‐constructed  loca8on  

year  year-­‐constructed  

born  C2  

Page 48: ESWC SS 2013 - Thursday Keynote Vassilis Christophides: Preserving linked data

Clustering  En8ty  Proper8es:  Example  

•  Compute  the  transi8ve  closure  of  the  generated  property  name  pairs  – Connected  proper8es  form  clusters  

     Pairs:  (located,  loca8on),  (loca8on,  located)  

Transi8ve  closure:  about  work  

architect  ar8st  C1  

year  year-­‐constructed  

born  C2  

D1   D2  about  architect  year  born  located  

work  ar8st  year-­‐constructed  loca8on  

loca8on  located    

C3  

Page 49: ESWC SS 2013 - Thursday Keynote Vassilis Christophides: Preserving linked data

Clustering  En8ty  Proper8es:  Example  

•  Compute  the  transi8ve  closure  of  the  generated  property  name  pairs  – Connected  proper8es  form  clusters  

     •  Generated  property  clusters:  

D1   D2  about  architect  year  born  located  

work  ar8st  year-­‐constructed  loca8on  

about  work  

architect  ar8st  C1  

year  year-­‐constructed  

born  C2  

loca8on  located    

C3  

Page 50: ESWC SS 2013 - Thursday Keynote Vassilis Christophides: Preserving linked data

Token  Blocking  for  Each  Cluster  

Some  of    the  generated  blocks:  

                             compare  Lady  Liberty  to  Auguste  Bartholdi  

 

C3.NY  

e2,  e5  

C1.Tower  

e1,  e4,  e6  

C1.Bartholdi  

e2,  e3,  e5,  e7  

e1= {(about, Eiffel Tower), (architect, Sauvestre), (year, 1889), (located, Paris)}

e2= {(about, Statue of Liberty), (architect, Bartholdi Eiffel), (year, 1886), (located, NY)}

e3= {(about, Auguste Bartholdi), (born, 1834)}

e4= {(about, Joan Tower), (born, 1938)}

e5= {(work, Lady Liberty), (artist, Bartholdi), (location, NY)}

e6= {(work, Eiffel Tower), (year-constructed, 1889), (location, Paris)}

e7= {(work, Bartholdi Fountain), (year-constructed,1876), (location, Washington D.C.)}

about  work  

architect  ar8st  C1  

year  year-­‐constructed  

born  C2  

loca8on  located    

C3  

Page 51: ESWC SS 2013 - Thursday Keynote Vassilis Christophides: Preserving linked data

Prefix-­‐Infix(-­‐Suffix)  [Papadakis  et  al.  2012]  •  How  we  can  explore  the  seman8cs  of  URIs  to  be[er  match  en8ty  descrip8ons?  – E.g.  66%  of  the  182  million  URIs  of  BTC09  (km.aiu.kit.  edu/projects/btc-­‐2009)  follow  the  scheme:  Prefix-­‐Infix(-­‐Suffix)    • Prefix  describes  the  source,  i.e.  domain,  of  the  URI    •  Infix  is  a  local  iden8fier    •  The  op8onal  Suffix  contains  details  about  the  format,  e.g.  .rdf  and  .nt,  or  a  named  anchor  

•  Token  blocking  on  the  Infixes  appearing  in  the  resource  values  of  proper8es  (as  subject  or  object)      

Page 52: ESWC SS 2013 - Thursday Keynote Vassilis Christophides: Preserving linked data

Prefix-­‐Infix(-­‐Suffix)  [Papadakis  et  al.  2012]  E.g. (Infix-profile):

e1= {(skos:prefLabel, Statue of Liberty), (yago:isLocatedIn, yago:Liberty_Island)}

e2= {(rdfs:label, Statue of Liberty), (dbpprop:location, dbpedia:Liberty_Island)}

e3= {(freeb:official_name, Statue of Liberty), (freeb:containedby, freeb:m.026kp2)}

e4= {(geonames:name, Statue of Liberty), (geonames:nearby, geonames:5124330)}

Generated  blocks:        Note:  The  effec8veness  of  the  approach  relies  on  the  good  naming  prac8ces  of  the  data  publishers  

 

Liberty_Island  

e1,  e2  

m.026kp2  

e3  

5124330  

e4  

GeoNames  

Page 53: ESWC SS 2013 - Thursday Keynote Vassilis Christophides: Preserving linked data

LINDA  [Böhm  et  al.  2012]  

•  Works  on  an  en8ty  graph  constructed  from  RDF  triples  by  considering  the  URIs  appearing  in  their  subject,  predicate  and  object  posi8ons  

•  Matches  are  iden8fied  using  two  kinds  of  similari8es:  – Descrip8ons  are  similar  wrt.  a  string  similarity  of  their  literal  values:  Checked  once  

– Descrip8ons  have  similar  neighbours  in  the  en8ty  graph:  Checked  itera&vely  

Page 54: ESWC SS 2013 - Thursday Keynote Vassilis Christophides: Preserving linked data

LINDA  [Böhm  et  al.  2012]  •  Scalability:  En8ty  graph  par88ons  are  processed  in  parallel  

•  Each  Map/Reduce  node  holds:    – A  par88on  of  the  graph  along  with  the  similari8es  of  the  en8ty  descrip8on  pairs  in  this  par88on  • descrip8on  pairs  are  stored  in  a  priority  queue  in  descending  order  wrt.  their  similarity  

– Fast  merge-­‐join-­‐like  access    •  Effec8veness:  Messages  from  mappers  to  reducers,  only  for  the  pairs  of  descrip8ons    that  need  similarity  re-­‐computa8on  

Page 55: ESWC SS 2013 - Thursday Keynote Vassilis Christophides: Preserving linked data

LINDA  ER  Algorithm  

•  Two  matrices  are  used:  – X  captures  the  iden8fied  matches  (binary  matrix)  – Y  captures  the  pair-­‐wise  similari8es  (real  values)  

•  Ini8aliza8on:  common  neighbors  and  string  similarity  of  literals  

• Updates:  Use  the  iden8fied  matches  of  X  •  Un8l  the  priority  queue  (extracted  from  Y)  becomes  empty:  

– Get  the  pair  (ei,  ej)  with  the  highest  similarity  •  (ei,  ej)  match  by  default  

–  Update  X:  matches  of  ei  are  also  matches  of  ej  

– Update  the  queue  wrt.  the  new  matches    

Page 56: ESWC SS 2013 - Thursday Keynote Vassilis Christophides: Preserving linked data

dbpedia:Statue_of_Liberty  

dbpedia:Liberty_Island  

C:m.072p8  

yago:Statue_of_Liberty  

yago:  Liberty_Island  

yago:isLocatedIn  

dbpedia:  Frédéric_Auguste_Bartholdi  

u:m.0jph6  

u:architect  

dbprop:  loca8on   dbprop:  

sculptor  

Priority Queue:

(dbpedia:Liberty_Island,  yago:Liberty_Island)  

(dbpedia:Statue_of_Liberty,  yago:Liberty_Island)  

(u:m.072p8,  dbpedia:Liberty_Island)  

Page 57: ESWC SS 2013 - Thursday Keynote Vassilis Christophides: Preserving linked data

dbpedia:Statue_of_Liberty  

dbpedia:Liberty_Island  

C:m.072p8  

yago:Statue_of_Liberty  

yago:  Liberty_Island  

yago:isLocatedIn  

dbpedia:  Frédéric_Auguste_Bartholdi  

u:m.0jph6  

u:architect  

dbprop:  loca8on   dbprop:  

sculptor  

Priority Queue:

(dbpedia:Liberty_Island,  yago:Liberty_Island)  

(dbpedia:Statue_of_Liberty,  yago:Liberty_Island)  

(u:m.072p8,  dbpedia:Liberty_Island)  

match  

Page 58: ESWC SS 2013 - Thursday Keynote Vassilis Christophides: Preserving linked data

dbpedia:Statue_of_Liberty  

dbpedia:Liberty_Island  

C:m.072p8  

yago:Statue_of_Liberty  

yago:  Liberty_Island  

yago:isLocatedIn  

dbpedia:  Frédéric_Auguste_Bartholdi  

u:m.0jph6  

u:architect  

dbprop:  loca8on   dbprop:  

sculptor  

Priority Queue:

(dbpedia:Liberty_Island,  yago:Liberty_Island)  

(dbpedia:Statue_of_Liberty,  yago:Liberty_Island)  

(u:m.072p8,  dbpedia:Liberty_Island)  dequeue  these  pairs,    as  each  en8ty  can  be  mapped  at  most  to  one  en8ty  per  data  source  

match  

Page 59: ESWC SS 2013 - Thursday Keynote Vassilis Christophides: Preserving linked data

dbpedia:Statue_of_Liberty  

dbpedia:Liberty_Island  

C:m.072p8  

yago:Statue_of_Liberty  

yago:  Liberty_Island  

yago:isLocatedIn  

dbpedia:  Frédéric_Auguste_Bartholdi  

u:m.0jph6  

u:architect  

dbprop:  loca8on   dbprop:  

sculptor  

Priority Queue:

(dbpedia:Liberty_Island,  yago:Liberty_Island)  

(dbpedia:Statue_of_Liberty,  yago:Statue_of_Liberty)    

enqueue  this  pair,  as    it  is  in  the  context  of  the  newly-­‐found  match  

match  

Page 60: ESWC SS 2013 - Thursday Keynote Vassilis Christophides: Preserving linked data

dbpedia:Statue_of_Liberty  

dbpedia:Liberty_Island  

C:m.072p8  

yago:Statue_of_Liberty  

yago:  Liberty_Island  

yago:isLocatedIn  

dbpedia:  Frédéric_Auguste_Bartholdi  

u:m.0jph6  

u:architect  

dbprop:  loca8on   dbprop:  

sculptor  

Priority Queue:

(dbpedia:Statue_of_Liberty,  yago:Statue_of_Liberty)  

match  match  

Page 61: ESWC SS 2013 - Thursday Keynote Vassilis Christophides: Preserving linked data

LINDA  

Distribute  across  a  cluster  the  input  en8ty  graph  •  A  node  i  holds  a  por8on  Qi  of  the  priority  queue  and  the  respec8ve  part  Gi  of  the  graph    

 

Map  phase  •  Mapper  i  reads  Qi  and  forwards  messages  to  reducers  for  similari8es  re-­‐computa8ons  – Matrix  X  of  iden8fied  matches  is  updated    

 

Reduce  phase  •  Similari8es  re-­‐computa8ons  (Matrix  Y)  •  Updates  on  priority  queues    

Page 62: ESWC SS 2013 - Thursday Keynote Vassilis Christophides: Preserving linked data

Frame  Linked  Data  Preserva8on  as  a  Sustainable  Economic  Ac8vity  

•  Economic  ac8vity:  deliberate  alloca8on  of  resources  –  Cost  of  losing  datasets  

•  Sustainable:  ongoing  resource  alloca8on  over  long  periods  of  8me  –  Involved  data  subjects  

•  Ar8culate  the  problem/provide  recommenda8ons  &  guidelines  –  Economic  and  societal  benefits  

Technical

Social Economic

Page 63: ESWC SS 2013 - Thursday Keynote Vassilis Christophides: Preserving linked data

Sustainability  Condi8ons  

•  Who  benefits  from  use  of  the  preserved  data?  

•  Who  selects  what  data  to  preserve?  

•  Who  owns  the  data?  •  Who  preserves  the  data?  •  Who  pays  both  for  data  and  preserva8on  services?  

•  recogni8on  of  the  benefits  of  preserva8on  by  decision  makers  

•  selec8on  of  datasets  with  long-­‐term  value  

•  incen8ves  for  decision  makers  to  act  in  the  public  interest  or  to  elaborate  new  business  models    

•  appropriate  governance  of  preserva8on  ac8vi8es  

•  ongoing  and  efficient  alloca8on  of  resources  to  preserva8on  

•  8mely  ac8ons  to  ensure  long-­‐term  data  access  and  usability  

Page 64: ESWC SS 2013 - Thursday Keynote Vassilis Christophides: Preserving linked data

Conclusions  •  We  need  new  abstrac8ons  bridging  closer  data  crea8on,  processing,  publica8on  and  processing  – Diachronic  Data:  Data  annotated  with  temporal  and  provenance  informa8on  self-­‐describing  their  evolu8on  history    

– Preserve  (semi-­‐)structured,  interrelated,  evolving  data  by  keeping  them  constantly  accessible  &  reusable  from  an  open  framework  such  as  the  Data  Web  

•  We  need  new  business  models  for  spreading  data  publica8on  and  archiving  costs  among  data  stakeholders  –   Pay-­‐as-­‐you-­‐go  data  preserva8on  as  data  products  are  re-­‐used  through  complex  value  making  chains  (both  memory  ins8tu8ons  and  data  market  places)  

Page 65: ESWC SS 2013 - Thursday Keynote Vassilis Christophides: Preserving linked data

Acknowledgements  

•                                                                     EU  IP  Project  No  601043  – h[p://www.diachron-­‐fp7.eu  

       •                                                                     EU  CSA  Project  No  600663  

– h[p://www.prelida.eu  

Page 66: ESWC SS 2013 - Thursday Keynote Vassilis Christophides: Preserving linked data

Collaborators  

•  Vassilis  E�himiou  (University  of  Crete)  •  Kostas  Stefanidis  (FORTH/ICS)  •  Grigoris  karvounarakis  (LogicBox)  •  Giorgos  Flouris  (FORTH/ICS)  

Page 67: ESWC SS 2013 - Thursday Keynote Vassilis Christophides: Preserving linked data

Ques8ons  

Page 68: ESWC SS 2013 - Thursday Keynote Vassilis Christophides: Preserving linked data

References  •  Bohm,  C.,  de  Melo,  G.,  Naumann,  F.,  Weikum,  G.:  Linda:  distributed  web-­‐of-­‐data-­‐

scale  en8ty  matching.  In  CIKM  2012  •  Geerts,  F.,  Karvounarakis,  G.,  Christophides,  V.  and  Fundulaki  I.:  Algebraic  

structures  for  capturing  the  provenance  of  SPARQL  queries.  In  ICDT  2013.  •  Papadakis,  G.,  Ioannou,  E.,  Niederee,  C.,  Palpanas,  T.,  Nejdl,  W.:  Beyond  100  million  

en88es:  large-­‐scale  blocking-­‐based  resolu8on  for  heterogeneous  data.  In  WSDM  2012.  

•  Papadakis,  G.,  Ioannou,  E.,  Palpanas,  T.,  Niederee,  C.,  Nejdl,  W.:  A  Blocking  Framework  for  En8ty  Resolu8on  in  Highly  Heterogeneous  Informa8on  Spaces.  IEEE  Trans.  Knowl.  Data  Eng.  (2013)  To  appear  

•  Papavasileiou,  V.,  Flouris,  G.,  Fundulaki,  I,.  Kotzinos,  D.,  Christophides,  V.  :High-­‐level  change  detec8on  in  RDF(S)  KBs.  ACM  Trans.  Database  Syst.  38,  1,  April  2013  

•  High-­‐Level  Group  on  Scien8fic  Data.  “Riding  the  Wave:  how  Europe  can  gain  from  the  raising  8de  of  scien8fic  data”  ©  European  Union,  2010  cordis.europa.eu/fp7/ict/e-­‐infrastructure/docs/hlg-­‐sdi-­‐report.pdf  

•  Blue  Ribbon  Task  Force  on  Sustainable  Digital  Preserva8on  and  Access,  Final  report  2010  br�.sdsc.edu/biblio/BRTF_Final_Report.pdf‎  

Page 69: ESWC SS 2013 - Thursday Keynote Vassilis Christophides: Preserving linked data

Data  Science:  A  Mul8disciplinary  Challenge    

Source  www.oraly8cs.com/2012/06/data-­‐science-­‐is-­‐mul8disciplinary.html  

Page 70: ESWC SS 2013 - Thursday Keynote Vassilis Christophides: Preserving linked data

Data  Science  Research  Agenda  Acquisi0on,  Storage,  and  Management  of  “Big  Data”  

Data  representa8on,  storage,  and  retrieval    

New  parallel  data  architectures,  including  clouds    

Data  management  policies,  including  privacy  and  secure  access  

Communica8on  and  storage  devices  with  extreme  capaci8es  

Sustainable  economic  models  for  access  and  preserva0on    

Data  Analy0cs  

Computa8onal,  mathema8cal,  sta8s8cal,  and  algorithmic  techniques  for  modeling  high  dimensional  data  

Learning,  inference,  predic8on,  and  knowledge  discovery  for  large  volumes  of  dynamic  data  sets  

Data  mining  to  enable  automated  hypothesis  genera8on,  event  correla8on,  and  anomaly  detec8on    

Informa8on  infusion  of  mul8ple  data  sources  

Data  Sharing  and  Collabora0on  

Tools  for  distant  data  sharing,  real  8me  visualiza8on,  and  so�ware  reuse  of  complex  data  sets  

Cross  disciplinary  model,  informa8on  and  knowledge  sharing  

Remote  opera8on  and  real  8me  access  to  distant  data  sources  and  instruments    

Source  Big  Data  R&D  Ini8a8ve  Howard  Wactlar    NIST  Big  Data  Mee8ng  June,  2012  

Page 71: ESWC SS 2013 - Thursday Keynote Vassilis Christophides: Preserving linked data

Towards  Data  Accountability