Data Integration in a Big Data Context

Alasdair J G Gray
alasdairjggray.co.uk
@gray_alasdair


Page 2: Data Integration in a Big Data Context

Data Linkage and Querying

2 September 2015 UBDC Seminar

Linking it all together!


Page 3: Data Integration in a Big Data Context

Big Data

Volume Velocity Variety

http://i.kinja-img.com/gawker-media/image/upload/lvzm0afp8kik5dctxiya.jpg

Page 4: Data Integration in a Big Data Context


Purpose: Extracting Value


http://senderocorp.com/images/uploads/bigdata_v9.png

Volume Velocity Variety Veracity

Value

Visualization Analytics

Big Data Technology

Page 5: Data Integration in a Big Data Context

Big Data

Volume Velocity Variety

Page 6: Data Integration in a Big Data Context

Big Data: Volume
More data than you can process
Scalable processing
Relative term
WSN query processing

Volume Velocity Variety

Page 7: Data Integration in a Big Data Context

Big Data: Variety
Many sources of data
Heterogeneous
  Formats
  Models
Reconcile meaning

Volume Velocity Variety

Page 8: Data Integration in a Big Data Context

Big Data: Velocity
Data constantly generated
Real-time processing
Contextualise

Volume Velocity Variety

Page 9: Data Integration in a Big Data Context

RDF: An Integration Dream

http://www.w3.org/TR/rdf11-primer/

Page 10: Data Integration in a Big Data Context


https://www.flickr.com/photos/mobilestreetlife/4179063482

“RDF and OWL do not solve the interoperability problem, they just lay it bare on the table!”
Frank van Harmelen
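One way to read the quote: merging RDF graphs is mechanically trivial, but the merged graph does not reconcile itself. The following is a minimal Python sketch of that point, with triples as plain tuples. All identifiers (`ex1:`, `ex2:`, the sensor and buoy names) are hypothetical and do not come from the talk.

```python
# Triples are (subject, predicate, object) tuples; a graph is a set of them.
# Every identifier below is hypothetical, for illustration only.
source_a = {
    ("ex1:sensor42", "ex1:measures", "ex1:WaveHeight"),
    ("ex1:sensor42", "ex1:locatedIn", "ex1:Solent"),
}
source_b = {
    ("ex2:buoy-42", "ex2:observes", "ex2:wave_height"),
    ("ex2:buoy-42", "ex2:region", "ex2:TheSolent"),
}

# Syntactic integration: merging two graphs is just set union.
merged = source_a | source_b


def subjects(graph):
    """Return the distinct subjects mentioned in a graph."""
    return {s for s, _, _ in graph}


def describe(graph, subject):
    """Return all (predicate, object) pairs known about one subject."""
    return {(p, o) for s, p, o in graph if s == subject}


# The merge succeeded, yet nothing states that ex1:sensor42 and
# ex2:buoy-42 describe the same device: the problem is laid bare, not solved.
shared_subjects = subjects(source_a) & subjects(source_b)

# Only an explicit link, supplied by a person or a matcher, connects them.
linked = merged | {("ex1:sensor42", "skos:exactMatch", "ex2:buoy-42")}
```

The union gives one graph, but `shared_subjects` stays empty until a link is asserted, which is where the rest of the talk (mappings, linksets, lenses) comes in.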

Page 11: Data Integration in a Big Data Context

Solent Use Case
Busy shipping channel
Two major ports
Complex tidal and wave patterns

Page 12: Data Integration in a Big Data Context

Estuarine Flooding
Financial implications
  Damage
  Loss of business
Personal factors
  Emotional impact
Flood prediction
  Locations
  Severity
  Requires correlating
    Sea-state data
    Weather forecasts
    Details of sea defences
Response planning
  Evacuation routes
  Personnel deployment
  …
  Requires more data
    Traffic reports
    Shipping
    …

Image: http://www.metro.co.uk/

Page 13: Data Integration in a Big Data Context

Flood Detection
“Detect overtopping events in the Solent region”
sea-level > sea-defence
• Sea-level: sensors
• Defence heights: databases

Flood defences data (database)

Real-time sensor data

Wave, Wind, Tide
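The overtopping condition (sea-level > sea-defence) amounts to joining a stream of sensor readings against stored defence heights. A minimal Python sketch of that join follows. The location keys, heights, and readings are made up, and a plain dictionary stands in for the flood defences database.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator


@dataclass(frozen=True)
class SeaLevelReading:
    location: str   # hypothetical key shared with the defences table
    timestamp: int  # seconds since an arbitrary start
    level_m: float  # observed sea level in metres


# Stand-in for the flood defences database: defence height per location.
# Values are invented for illustration.
DEFENCE_HEIGHT_M: Dict[str, float] = {"site_a": 4.2, "site_b": 3.8}


def overtopping_events(
    readings: Iterable[SeaLevelReading], defences: Dict[str, float]
) -> Iterator[SeaLevelReading]:
    """Yield each reading whose sea level exceeds the local defence height."""
    for reading in readings:
        height = defences.get(reading.location)
        if height is None:
            continue  # no stored defence data for this sensor: cannot decide
        if reading.level_m > height:
            yield reading


stream = [
    SeaLevelReading("site_a", 0, 3.9),
    SeaLevelReading("site_b", 0, 3.9),
    SeaLevelReading("site_c", 0, 9.9),  # no matching defence record
    SeaLevelReading("site_a", 60, 4.3),
]
events = list(overtopping_events(stream, DEFENCE_HEIGHT_M))
```

Because the function consumes an iterable and yields results as it goes, the same code works over an unbounded sensor feed. The hard part in practice is agreeing on the shared location key, which is the heterogeneity problem the following slides address.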

Page 14: Data Integration in a Big Data Context

Response Planning
“Provide contextual information”
• Web feeds
• Other sources: maps, models
• Real-time merging of datasets

Meteorological forecasts

Other sources: Maps, models, …

Page 15: Data Integration in a Big Data Context

Abstract Problem

[Diagram labels: Integrator; Stored data service; Streaming data service (two); Stored data; Sensor Network (two)]

Page 16: Data Integration in a Big Data Context

Types of Heterogeneity
Data source
Data stream
Query capabilities
Data access
Data semantics

[Diagram labels: Integrator; Stored data service; Streaming data service (two); Stored data; Sensor Network (two)]

Page 17: Data Integration in a Big Data Context

Querying Approach
Use ontologies as common model
Requires:
  Representation of RDF stream
  Expressing continuous queries over an RDF stream
  Establishing mappings between ontology models and data source schemas
  Accessing data sources through queries over ontology model
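The mapping requirement can be pictured as a declarative table from source columns to ontology properties. The sketch below is an assumption-laden illustration, not the mapping language used in the project: the row layout and column names are hypothetical, while the `cd:` terms and the `ssg4e:Obs` subject pattern are taken from the examples elsewhere in the talk.

```python
# Hypothetical relational rows from a stored data source.
rows = [
    {"obs_id": 1, "wind_speed": 34.5, "ts": "2015-09-02T10:00:00Z"},
    {"obs_id": 2, "wind_speed": 20.3, "ts": "2015-09-02T10:01:00Z"},
]

# Declarative mapping: source column -> ontology property.
# Column names are assumptions; the cd: terms appear in the talk's query.
COLUMN_TO_PROPERTY = {
    "wind_speed": "cd:observationResult",
    "ts": "cd:observationResultTime",
}


def row_to_triples(row, subject_prefix="ssg4e:Obs", rdf_class="cd:Observation"):
    """Translate one source row into triples over the ontology model."""
    subject = f"{subject_prefix}{row['obs_id']}"
    triples = [(subject, "rdf:type", rdf_class)]
    for column, prop in COLUMN_TO_PROPERTY.items():
        if column in row:
            triples.append((subject, prop, row[column]))
    return triples


triples = [t for row in rows for t in row_to_triples(row)]
```

Keeping the mapping as data rather than code means a query over the ontology model can, in principle, be rewritten into a query over the source schema by reading the same table in reverse.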


Page 18: Data Integration in a Big Data Context

RDF Stream
Named graph
Continuously updating
Triples annotated with timestamp

STREAM http://www.semsorgrid4env.eu/ccometeo.srdf
...
( <ssg4e:Obs1, rdf:type, cd:Observation>, t_i ),
( <ssg4e:Obs1, cd:observationResult, “34.5”>, t_i ),
( <ssg4e:Obs2, rdf:type, cd:Observation>, t_i+1 ),
( <ssg4e:Obs2, cd:observationResult, “20.3”>, t_i+1 ),
...

[Ontology fragment labels: cd:Observation; cd:observationResult; xsd:double]
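The stream above is a sequence of (triple, timestamp) pairs. A minimal Python sketch of that data structure follows, assuming integer ticks stand in for the timestamps and that each observation contributes exactly the two triples shown on the slide.

```python
from typing import Iterable, Iterator, NamedTuple, Tuple

Triple = Tuple[str, str, str]


class TimestampedTriple(NamedTuple):
    triple: Triple
    timestamp: int  # integer tick, standing in for t_i


def observation_stream(values: Iterable[float]) -> Iterator[TimestampedTriple]:
    """Emit two timestamped triples per observation, one tick per observation."""
    for i, value in enumerate(values, start=1):
        subject = f"ssg4e:Obs{i}"
        yield TimestampedTriple((subject, "rdf:type", "cd:Observation"), i)
        yield TimestampedTriple((subject, "cd:observationResult", str(value)), i)


elements = list(observation_stream([34.5, 20.3]))
```

A generator fits the "continuously updating" property: consumers pull elements as they arrive and never need the whole stream in memory.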

Page 19: Data Integration in a Big Data Context

SPARQLStream

PREFIX cd: <http://www.semsorgrid4env.eu/ontologies/CoastalDefences.owl#>
PREFIX sb: <http://www.w3.org/2009/SSN-XG/Ontologies/SensorBasis.owl#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
RSTREAM SELECT ?windspeed ?windts
FROM STREAM <http://www.semsorgrid4env.eu/ccometeo.srdf> [ NOW - 1 MINUTE TO NOW STEP 5 MINUTES ]
WHERE {
  ?WindObs a cd:Observation;
    cd:observationResult ?windspeed;
    cd:observationResultTime ?windts;
    cd:observedProperty ?windProperty;
    cd:featureOfInterest ?windFeature.
  ?windFeature a cd:Feature;
    cd:locatedInRegion cd:SolentCCO.
  ?windProperty a cd:WindSpeed.
}

[Ontology fragment labels: cd:Observation; cd:observationResult; xsd:double; cd:featureOfInterest; cd:Feature; cd:observedProperty; cd:Property; cd:locatedInRegion; cd:Region]

“Every 5 minutes, give me the wind speed observations over the last minute in the Solent region”
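The window `[ NOW - 1 MINUTE TO NOW STEP 5 MINUTES ]` can be read as: every 300 seconds, evaluate the query over elements from the last 60 seconds. The Python sketch below illustrates that reading only. The half-open interval `(now - range, now]` is an assumption of this sketch, not something the slide specifies, and the readings are invented.

```python
def windows(elements, start, end, range_s=60, step_s=300):
    """Evaluate a time window every `step_s` seconds.

    `elements` is a list of (timestamp_seconds, value) pairs. At each
    evaluation time `now`, the window holds the values whose timestamps
    fall in the interval (now - range_s, now].
    """
    results = []
    now = start
    while now <= end:
        in_window = [v for ts, v in elements if now - range_s < ts <= now]
        results.append((now, in_window))
        now += step_s
    return results


# Hypothetical wind speed readings as (seconds, speed) pairs.
readings = [(250, 30.1), (290, 34.5), (300, 20.3), (310, 25.0), (590, 18.2)]
result = windows(readings, start=300, end=600)
```

With a range shorter than the step, some readings (here the one at 310 seconds) fall between windows and are never reported. That is a consequence of the query as written, and worth keeping in mind when choosing window parameters.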

Page 20: Data Integration in a Big Data Context

Initial Display

Page 21: Data Integration in a Big Data Context

Sensor Data

Page 22: Data Integration in a Big Data Context

Sea-state Forecast Model

Page 23: Data Integration in a Big Data Context

Drug Discovery Use Case

“Let me compare MW, logP and PSA for launched inhibitors of human & mouse oxidoreductases”

Chemical Properties (Chemspider)
Launched drugs (Drugbank)
Human => Mouse (Homologene)
Protein Families (Enzyme)
Bioactivity Data (ChEMBL)
… other info (Uniprot/Entrez etc.)


Page 24: Data Integration in a Big Data Context

Open PHACTS Discovery Platform


Drug Discovery Platform

Apps

Domain API

Interactive responses

Production quality integration platform

Method Calls

Standard Web Technologies

Page 25: Data Integration in a Big Data Context

API Hits

[Chart: number of hits per month, Jan 2013 to June 2015; vertical axis from 0 to 60,000,000. Annotations: public launch of 1.2 API, 1.3 API, 1.4 API, 1.5 API]

Page 26: Data Integration in a Big Data Context

Open PHACTS Data

Page 27: Data Integration in a Big Data Context

Multiple Identities

P12047  X31045  P12047

GB:29384  RS_2353

Are these the same thing?

Andy Law's Third Law
“The number of unique identifiers assigned to an individual is never less than the number of Institutions involved in the study”

http://bioinformatics.roslin.ac.uk/lawslaws/

Page 28: Data Integration in a Big Data Context

Gleevec®: Imatinib Mesylate

Drugbank ChemSpider PubChem

[Figure labels: Imatinib; Mesylate; Imatinib Mesylate; YLMAHDNUQAMNNX-UHFFFAOYSA-N]

Page 29: Data Integration in a Big Data Context

Gleevec®: Imatinib Mesylate


Are these records the same?
It depends upon your task!

Page 30: Data Integration in a Big Data Context


skos:exactMatch(InChI)

Strict Relaxed

Analysing Browsing

Structure Lens

“I need to perform an analysis, give me details of the active compound in Gleevec.”

Page 31: Data Integration in a Big Data Context


skos:closeMatch(Drug Name)

skos:closeMatch(Drug Name)

skos:exactMatch(InChI)

Strict Relaxed

Analysing Browsing

Name Lens


Which targets are known to interact with Gleevec?

Page 32: Data Integration in a Big Data Context

What is a Scientific Lens?
A lens defines a conceptual view over the data
Specifies operational equivalence conditions
Consists of:
  Identifier (URI)
  Title (dct:title)
  Description (dct:description)
  Documentation link (dcat:landingPage)
  Creator (pav:createdBy)
  Timestamp (pav:createdOn)
  Equivalence rules (bdb:linksetJustification)
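A lens can be pictured as a filter over links: each link carries a justification, and the lens lists which justifications it accepts. The Python sketch below is a rough illustration of that idea, not the Open PHACTS implementation. The record identifiers, lens URIs, and the choice of which records are linked by which predicate are all hypothetical, and following accepted links transitively and symmetrically is a simplifying assumption (SKOS itself does not define `skos:closeMatch` as transitive).

```python
from dataclasses import dataclass
from typing import FrozenSet, List, NamedTuple, Set


class Link(NamedTuple):
    source: str
    target: str
    predicate: str      # e.g. skos:exactMatch or skos:closeMatch
    justification: str  # why the link holds, e.g. "InChI" or "Drug Name"


@dataclass(frozen=True)
class Lens:
    uri: str
    title: str
    description: str
    justifications: FrozenSet[str]  # equivalence rules this lens accepts


def equivalents(start: str, links: List[Link], lens: Lens) -> Set[str]:
    """Return every record reachable from `start` via links the lens accepts."""
    seen = {start}
    frontier = [start]
    while frontier:
        node = frontier.pop()
        for link in links:
            if link.justification not in lens.justifications:
                continue
            # Links are treated as symmetric for the purposes of this sketch.
            for a, b in ((link.source, link.target), (link.target, link.source)):
                if a == node and b not in seen:
                    seen.add(b)
                    frontier.append(b)
    return seen


# Hypothetical record identifiers and lens URIs.
links = [
    Link("chemspider:A", "pubchem:A", "skos:exactMatch", "InChI"),
    Link("drugbank:B", "chemspider:A", "skos:closeMatch", "Drug Name"),
]
structure_lens = Lens("ex:lens/structure", "Structure Lens",
                      "Strict, for analysis", frozenset({"InChI"}))
name_lens = Lens("ex:lens/name", "Name Lens",
                 "Relaxed, for browsing", frozenset({"InChI", "Drug Name"}))

strict = equivalents("chemspider:A", links, structure_lens)
relaxed = equivalents("chemspider:A", links, name_lens)
```

The same link data yields two different equivalence sets depending on the lens, which is the sense in which equivalence is operational rather than fixed.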

Page 33: Data Integration in a Big Data Context

Administrative Data Research Network

Administrative Data Service

Page 34: Data Integration in a Big Data Context

ADRC-Scotland

Co-located with Farr Institute, Scottish Government and NHS.

Universities of Aberdeen, Dundee, Edinburgh, Glasgow, Heriot-Watt, St Andrews and Stirling.

Expertise in administrative data and public engagement, linkage, law and relevant computer science techniques.

Provide research support, facilities, training


Page 35: Data Integration in a Big Data Context

Research Focus

http://www.gov.scot/Resource/0044/00442276-390.jpg

Schools, colleges and universities
The criminal and justice system
Social work services
Social welfare
Housing system
Transport system
Health system
Historical administrative data

Page 36: Data Integration in a Big Data Context

Data Matching
Messy data
Probabilistic matches
Schema matching

John Grant, Fisherman

Fiona Sinclair

Ian Grant, Smithy

Born: 1861

Stuart Adam, Wheelwright

Morag Scott

Flora Adam, Seamstress, Born: 1866

Married: 1884

John Grant, Farmer

Fiona Grant

Iain Grant, Born: 1860
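A probabilistic match scores how likely two messy records are to describe the same person, rather than demanding exact equality. The Python sketch below is one simple way to do that, using the standard library's `difflib` for name similarity. The weighting, the year tolerance, and the pairing of "Ian Grant" with the birth year 1861 are assumptions made for illustration; this is not the matching method used by the ADRC.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Similarity of two names in [0, 1], ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def year_similarity(a: int, b: int, tolerance: int = 2) -> float:
    """1.0 for identical years, falling linearly to 0.0 beyond `tolerance`."""
    return max(0.0, 1.0 - abs(a - b) / (tolerance + 1))


def match_score(rec_a: dict, rec_b: dict, name_weight: float = 0.7) -> float:
    """Weighted combination of name and birth-year similarity."""
    return (
        name_weight * name_similarity(rec_a["name"], rec_b["name"])
        + (1 - name_weight) * year_similarity(rec_a["born"], rec_b["born"])
    )


# Records loosely based on the slide; field layout is hypothetical.
ian = {"name": "Ian Grant", "born": 1861}
iain = {"name": "Iain Grant", "born": 1860}
flora = {"name": "Flora Adam", "born": 1866}

score_ian_iain = match_score(ian, iain)
score_ian_flora = match_score(ian, flora)
```

A real linkage pipeline would also compare parents' names and occupations, and would need a threshold chosen against known matches. Here the two Grant records score far higher than the unrelated pair despite differing in both spelling and year.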


Page 37: Data Integration in a Big Data Context

Summary
RDF eases data integration
Working on RDF stream extensions
Data is complex and messy
Requires flexibility in linking
Equivalence depends upon context
Lenses provide support for operational equivalence