D4.1 IoT Data Attributes and Quality of Information
This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 779852
Title: D4.1 IoT Data Attributes and Quality of Information
Document Version: 0.1
Project Number: 779852
Project Acronym: IoTCrawler
Project Title: IoTCrawler
Contractual Delivery Date: 31/01/2019
Actual Delivery Date: 31/01/2018
Deliverable Type*: R
Security Class**: PU
* Type: P - Prototype, R - Report, D - Demonstrator, O - Other
** Security Class: PU - Public, PP - Restricted to other programme participants (including the Commission), RE - Restricted to a group defined by the consortium (including the Commission), CO - Confidential, only for members of the consortium (including the Commission)
Responsible and Editor/Author: Thorben Iggena
Organization: University of Applied Sciences Osnabrück
Contributing WP: WP4
Authors (organizations): Thorben Iggena (UASO), Daniel Kümper (UASO), Ralf Tönjes (UASO), Martin Strohbach (AGT), Pavel Smirnov (AGT), Juan A. Martinez (OdinS), Sahr Thomas Acton (UoS), Roonak Rezvani (UoS), Josiane Parreira (SIE)
Abstract:
This deliverable summarises the work of Tasks T2.3 and T4.1. First, the requirements for an information model for the IoTCrawler framework are presented, based on the scenarios developed in deliverable D2.1. A review of the state of the art in ontology models and quality calculation then prepares the development of an information model capable of fulfilling these requirements. The main aspects considered are general data source annotation, including time and geo information, privacy, and Quality of Information. In addition, it is shown how the information model and the additional quality information can be used by the indexing and crawling mechanisms of the framework.
More details on the calculation of Quality of Information and its metrics are presented, followed by early implementations and demonstrations. Overall, this deliverable builds a common basis for understanding information crawled with the IoTCrawler framework and shows how the data search process in IoTCrawler can be supported by additional Quality of Information, enhancing the user experience and improving the results of machine-initiated data queries.
Keywords:
Information Model, IoTCrawler framework, Annotation, Quality of Information, Quality Measures.
Disclaimer:
The present report reflects only the authors’ view. The European Commission is not responsible for
any use that may be made of the information it contains.
Abbreviations
AA Attribute Authority
ABE Attribute Based Encryption
AEC Architecture, Engineering and Construction
CR-Index Continuous Range Index
DHT Distributed Hash Table
DIM Distributed Index for Multi-dimensional Data
GDPR General Data Protection Regulation
GIS Geographic Information System
IBE Identity Based Encryption
IdM Identity Management Component
IFC Industry Foundation Classes
IoT Internet of Things
IoT-O Internet of Things Ontology
IRM IoT Relationship Model
iSAX Indexable Symbolic Aggregation Approximation
JSON JavaScript Object Notation
MDR Meta Data Repository
OGC Open Geospatial Consortium
PKC Public Key Cryptography
PPO Privacy Preference Ontology
PROV-O Provenance Ontology
PTO Pervasive Trust Ontology
QoI Quality of Information
QoS Quality of Service
RDF Resource Description Framework
SAO Stream Annotation Ontology
SAREF Smart Appliances Reference Ontology
SAX Symbolic Aggregation Approximation
SCF Smart City Framework
SKC Symmetric Key Cryptography
SOSA Sensor, Observation, Sample, Actuator
SSN Semantic Sensor Network ontology
SVD Singular Value Decomposition
Executive Summary
This deliverable focuses on the work done in tasks T2.3 and T4.1 of the IoTCrawler project. The main focus of these tasks is to develop an information model that makes all the information collected by IoTCrawler machine readable and, therefore, ready to be indexed and ranked for user or machine queries, with the aim of finding data sources for specific use cases. The information model has been developed in consideration of the 22 use cases defined in deliverable D2.1 and the additional requirements of privacy and Quality of Information. Hence, this document presents an ontology and metrics to calculate and annotate the Quality of Information for data sources found with IoTCrawler's crawling mechanisms.
By utilising the added Quality of Information for IoT data sources, the ranking mechanism can consider not only the annotation information of a data source but also information about its data quality. The defined metrics make it possible to set the focus of a data query to the IoTCrawler framework. For example, it might be preferable to use an up-to-date data source that provides data in near real time, and therefore has a higher Timeliness, instead of a more accurate one that provides data only several times per day. Used in this way, the Quality of Information component supports the framework by providing additional information that enhances the user experience and enables better services.
Disclaimer
This project has received funding from the European Union’s Horizon 2020 research and
innovation programme under grant agreement No 779852, but this document only reflects
the consortium’s view. The European Commission is not responsible for any use that may be
made of the information it contains.
Table of Contents
1 Introduction
  1.1 Sample Scenarios
  1.2 Overview
2 Requirements for IoTCrawler Information Model
  2.1 Machine Enabled Discovery
  2.2 Indexing
  2.3 Quality Measures for Ranking
  2.4 Security and Privacy Concerns
3 State of the Art
  3.1 Ontology Models
    3.1.1 IoT Ontologies
    3.1.2 Geo and Time Ontologies
    3.1.3 Security and Privacy Ontologies
  3.2 QoI Computation
4 Annotation Model for Discovery
5 Data Representation for Indexing
6 Quality Measures and Analysis for Ranking
  6.1 Quality of Information
    6.1.1 General Approach and QoI Vector
    6.1.2 Completeness
    6.1.3 Timeliness
    6.1.4 Plausibility
    6.1.5 Artificiality
    6.1.6 Concordance
    6.1.7 Quality Ontology
  6.2 Quality of Service
  6.3 Model-based Analysis
    6.3.1 Infrastructure Model
7 Security and Privacy for Enabling Data Access
  7.1 Authentication
  7.2 Authorization
  7.3 Privacy/Secure Group Sharing
  7.4 Security Properties
8 Implementation and Experimental Results
  8.1 Annotation Process
    8.1.1 Data Evaluation
    8.1.2 Geo Access Benchmark
  8.2 Example Quality Calculation and Annotation
  8.3 Example Location Indexing
9 Conclusion
10 References
Table of Figures
Figure 1-1 Smart Home Scenario
Figure 3-1 The SOSA and SSN ontologies and their vertical and horizontal modules (taken from https://www.w3.org/TR/vocab-ssn/. Last accessed: 28/11/2018)
Figure 3-2 CityPulse QoI Ontology
Figure 4-1 An instance of an IoTCrawler Object
Figure 4-2 IoTCrawler Information Model
Figure 6-1 Quality Ontology for IoT Data Sources
Figure 6-2 Infrastructure Model Description of the Utilised OpenStreetMap Database [53]
Figure 6-3 Building floor of the UASO Lab showing connected sensor infrastructure
Figure 6-4 a) Soil type overview of the NIBIS Server [59], b) NDVI (Normalized Difference Vegetation Index) based on combined Sentinel2 bands (B8-B4)/(B8+B4) [60]
Figure 6-5 Combined Infrastructure Model View of an office sensor network floorplan (IndoorGML IGML), building, path and road network (OSM) and topology soil type data
Figure 7-1 Authentication Example
Figure 7-2 Authorization Example
Figure 7-3 Access Control & Ownership Ontology
Figure 7-4 Data Privacy Example
Figure 7-5 Quality of Information Ontology including Security
Figure 7-6 Security Ontology
Figure 8-1 Classes Related and Connected to sosa:Sensor
Figure 8-2 IoTCrawler Information Model Example
Figure 8-3 The output of the clustering algorithm after applying Lag and PCA on real-time air pollution data (left). Notice the different patterns of observation from each cluster (right).
Figure 8-4 Geo access benchmark, Experiment 1, non-indexed data
Figure 8-5 Geo access benchmark, Experiment 1, indexed data
Figure 8-6 Geo access benchmark, Experiment 1, non-indexed vs. indexed data
Figure 8-7 Example of Weather Sensor Data, measuring Pressure, Relative Humidity and Temperature
Figure 8-8 Weather Sensor Data with Integrated QoI Values
Figure 8-9 Geohash example: Precision with a 7-character string
Figure 8-10 Geohash example: Precision with a 9-character string
1 Introduction
The integration of IoT resources, provided by third parties, requires that a search
engine such as IoTCrawler has rich metadata about these resources. In order to
maximize the relevance of search results, it is important that IoTCrawler specifies a
data model that clearly defines a set of data attributes that 1) are of interest to search
requests, and 2) describe the IoT resources sufficiently well to allow for efficient
search and data access. In this deliverable we define such a model including its data
attributes. This model builds the backbone of the IoTCrawler framework, as it defines
the basic data model used by the metadata repository as defined in deliverable D2.2
[48].
The model presented in this deliverable is driven by the analysis of the IoTCrawler
scenarios. In deliverable D2.1 [47] we have extensively described and analysed 22
IoTCrawler scenarios based on which we derived several key challenges that
IoTCrawler needs to address. In contrast to traditional data management systems
and, in particular, to web search engines, crawling and searching IoT resources
involves additional challenges that need to be considered. These include, for
instance, coping with the heterogeneity of data sources, managing the dynamically
changing locations of resources due to their mobility, knowing about their varying
availability, being informed about the different levels of quality of information, and
finally protecting data from unwanted access while ensuring trust between data
sources and consumers.
The data model and its attributes, as defined in this deliverable, holistically
address these challenges, paying particular attention to the attributes necessary
for privacy-enabled data access and for determining the Quality of Information
(QoI). QoI is particularly relevant for selecting and ranking data sources.
1.1 Sample Scenarios
The data model described in the following sections is strongly motivated by the
scenarios described in deliverable D2.1. Among other aspects, the model covers all
the aspects relevant for machine enabled discovery (see Section 2.1), scalability (see
Section 2.2), quality measures for ranking (see Section 2.3) and security and privacy
concerns (see Section 2.4). These aspects are well motivated by all of the 22
scenarios described in D2.1. Here we briefly examine some of them as motivation for
the provided data model. Please note that these examples are selected merely to
illustrate the results presented in this deliverable and do not imply a selection of the
scenarios to be further pursued as driving implementation scenarios. This decision
will be taken in the next few months as part of WP7.
Consider, for instance, the Pulse of Aarhus scenario. This scenario is based on the
idea of crawling all available data sources and correlating sensor data in order to
create metrics and new insights about the city and its citizens, e.g. whether they are
happy or busy, prefer certain locations etc. The data is not under uniform control, e.g.
the data can be provided by the municipality, companies, or individuals.
Consequently, for the discovery of the data, it is important to know about where the
data source is deployed (only Aarhus may be of interest) and what the data type is.
As we expect a high number of data sources in this scenario, another important
aspect is the Quality of Information associated with the data sources. Consider, for
instance, a search request for pollution levels in the city or a certain region of the city.
The city may provide access to high accuracy but “old” pollution levels with low
sensor coverage. There may also be community-driven sensor networks with much
higher coverage and up to date data, but at the cost of lower accuracy. If these QoI
attributes can be captured, they can be used for ranking and an application can make
an informed decision about which source to use.
The Find My Lost Child scenario stresses the requirement to capture dynamic geo-
spatial attributes of data sources. In this scenario, both fixed cameras and mobile
cameras of people in the city are used to identify missing people such as a child. The
data model presented in this deliverable (see Section 4) particularly takes into
account that resources such as mobile phones have changing location attributes.
Mobile IoT resources can provide their location as part of an IoTStream. The indexing
can access this information and update the index appropriately.
Finally, in domestic settings, a Smart Home Crawler, as we demonstrated at the ICT
Event 2018 in Vienna, creates the opportunity to offer innovative Smart Home
applications. However, in such a setting it is paramount that people's privacy is
protected and that people remain in control of with whom and how they share their
data. The dilemma between data richness and the need to control and protect data
flows is illustrated in Figure 1-1.
Figure 1-1 Smart Home Scenario
In Smart Home scenarios both individuals and companies have an interest to provide
and use rich data. Individuals are interested in using innovative services based on their
data, and companies require the data to offer these services. However, individuals do
not want to give away their data in an uncontrolled way. In the same way, companies
need a technical solution to comply with General Data Protection Regulation (GDPR).
IoTCrawler is designed to address this dilemma by 1) providing individuals, or more
generally data providers, full control over their data, 2) offering a technical solution
that assists in compliance with GDPR, and 3) help companies to establish a trust
relation with their data providers. Note that these trust issues are also particularly
important in B2B scenarios, such as Industry 4.0, where industrial customers typically
have strong concerns about providing data related to their manufacturing processes.
This deliverable addresses these issues by preparing the information model and
providing additional privacy and security ontologies. Furthermore, it enables trustful
relationships between data providers by offering quality metrics and analysis for
heterogeneous data sources.
1.2 Overview
The document is structured as follows: the requirements for the IoTCrawler
annotation model are described in Section 2. An overview of the state of the art in
ontological models, together with quality of information (QoI) evaluation
techniques are presented in Section 3. The annotation model providing the basic data
model used for new machine enabled discovery is described in Section 4. The
annotation model that defines the core data model used by the metadata repository
(MDR) is described in D2.2. Additionally, Section 5 describes the data model used
by the index that provides efficient access to the metadata stored in the MDR. This is
comparable to a database index. Section 6 provides a zoom in on the annotation
model detailing the QoI attributes and their calculation. Section 7 details privacy-
enabled data access, illustrating how the annotation model is used to protect data
and control access to it. Finally, in Section 8 we describe three implementation
prototypes that illustrate the annotation process, show how quality is calculated and
annotated, and how the location attributes can be indexed.
2 Requirements for IoTCrawler Information Model
2.1 Machine Enabled Discovery
In the past, search engines were mainly used by human users to search for content
and information. In the newly emerging search model, information is provided
depending on the consumers’ (human user or a machine) context and requirements
(for example, location, time, activity, and profile). Because of its necessity and
relevance, the information access can be initiated without the user’s (human or a
machine) explicit query or instruction (context-aware search). IoTCrawler will develop
enablers for context-aware IoT search, where the requirements of the different
applications will be mapped to the solutions by selecting resources considering
parameters such as security and privacy level, quality, latency, availability, reliability
and continuity. Therefore, one of the main requirements for the IoTCrawler
Information Model is to enable a rich, multi-faceted description of IoT data and
services. The meta-data description should cover both functional (e.g. location,
sensor type) and non-functional (latency, quality) properties.
Moreover, in many scenarios the applications and services make use of higher-level
concepts (e.g. topics and/or events) to access the data and services (e.g. find all the
areas that show “traffic jams”, return the location of areas that currently have “high air
pollution levels”, display all the CCTV cameras that show a “moving object”). These
higher-level concepts are domain dependent and require a common understanding
and description of the key concepts in the domain (i.e. described by a topical
ontology). Enabling the match between higher-level concepts representing user or
machine context, to the lower-level attributes and patterns describing the IoT
resources is another requirement for the IoTCrawler Information Model. In addition,
to represent domain independent meta-data about resources, the model must be
easily extensible to allow the linkage to domain specific information. It must also
support reasoning methods to allow vague or near complete descriptions of the
application needs to be mapped to concrete data and service descriptions.
Also relevant for the discovery is the ability to automatically adapt to changes in both
the context and data. Therefore, the IoTCrawler Information Model must provide
means to represent and easily access dynamic information.
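To make this requirement concrete, the following minimal sketch shows how a resource description combining functional and non-functional properties could be matched against a consumer's requirements. All identifiers, field names and thresholds are hypothetical and are not part of the IoTCrawler model:

```python
# Hypothetical resource descriptions combining functional properties
# (type, location) with non-functional ones (latency, quality).
resources = [
    {"id": "urn:sensor:aarhus:pm25-01", "type": "AirQualitySensor",
     "location": "Aarhus", "latency_ms": 200, "quality": 0.95},
    {"id": "urn:sensor:aarhus:pm25-02", "type": "AirQualitySensor",
     "location": "Aarhus", "latency_ms": 5000, "quality": 0.60},
]

def select(resources, rtype, max_latency_ms, min_quality):
    """Return resources that satisfy both the functional requirement
    (sensor type) and the non-functional ones (latency, quality)."""
    return [r for r in resources
            if r["type"] == rtype
            and r["latency_ms"] <= max_latency_ms
            and r["quality"] >= min_quality]

# A consumer needing fast, high-quality air-quality data matches
# only the first source.
fast = select(resources, "AirQualitySensor", 1000, 0.9)
```

A richer description would, as argued above, also carry security level, availability and domain-specific links; the point of the sketch is only that machine-readable functional and non-functional attributes make such automated selection possible.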
2.2 Indexing
The efficiency of IoT applications heavily depends on the cooperation of IoT devices
and services. Finding the most appropriate devices, services and data to respond to
the requirements of applications is a crucial challenge in dynamic and multi-purpose
IoT environments. For the IoT context-aware search to be successful, it requires that
efficient underlying crawling and indexing mechanisms are in place to find, describe
and register resources. A successful indexing strategy must provide all the relevant
information needed for the discovery task, while being decentralized, lightweight and
able to adapt to the changes in the IoT environments. Moreover, it should support
self-configuration and fault recovery mechanisms. Spatial, temporal and thematic
attributes are essential. In addition, our solution will develop hierarchical mechanisms
to also index other attributes of the data/services such as type, quality and time. This
will enable multi-level indexing of data with multiple attributes. However, this will also
require different indexes constructed based on spatial, temporal and thematic
attributes, to be combined.
One of the main requirements for the IoTCrawler Information Model is that it must
contain all attributes needed by the indexing mechanism. Keeping the index up-to-
date is particularly challenging, as data providers can be transient and/or mobile and
their data-related attributes can change over time. The Information Model must also
be extensible to represent data summaries, which can be used to decide when the
index should be updated, as well as combined with a set of attributes to support
multi-level indexing.
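Spatial attributes are one concrete case of such multi-level indexing: the location indexing prototype later in this deliverable (Section 8.3, Figures 8-9 and 8-10) uses geohashes, whose prefix length controls spatial precision. As an illustrative sketch (not IoTCrawler code), a standard geohash encoder looks as follows:

```python
# Illustrative geohash encoder: interleaves longitude/latitude bits and
# maps each group of 5 bits to a base-32 character. Longer prefixes give
# finer spatial cells, so prefix length can serve as an index level.
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash_encode(lat, lon, precision=7):
    lat_range, lon_range = [-90.0, 90.0], [-180.0, 180.0]
    chars, bit_count, ch, even = [], 0, 0, True
    while len(chars) < precision:
        rng = lon_range if even else lat_range  # even bits refine longitude
        val = lon if even else lat
        mid = (rng[0] + rng[1]) / 2
        if val >= mid:
            ch = (ch << 1) | 1
            rng[0] = mid
        else:
            ch <<= 1
            rng[1] = mid
        even = not even
        bit_count += 1
        if bit_count == 5:                      # emit one base-32 character
            chars.append(BASE32[ch])
            bit_count, ch = 0, 0
    return "".join(chars)

# Canonical example (57.64911 N, 10.40744 E) yields "u4pruyd" at 7 characters.
cell = geohash_encode(57.64911, 10.40744, 7)
```

Two sources whose geohashes share a 7-character prefix lie in the same cell of roughly 150 m, so a hierarchical index can group sources by successively longer prefixes, one prefix length per index level.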
2.3 Quality Measures for Ranking
The amount of data generated all over the world has increased massively in recent
years. Smart cities, social networks, Industry 4.0 and many more have created
dozens of new data sources. Even agriculture is turning more and more into a data-
driven business. For a comprehensive list of scenarios dealing with these data
sources see D2.1 [47].
The availability of all these new sensors, data sources, open data platforms etc. offers
new possibilities for innovative applications and use-cases. The heterogeneity of the
data sources could become an obstacle for building new applications. It is often
nearly impossible to automatically combine these data sources to extract knowledge
because machine readable information is not available. At this point Quality Measures
for annotation could offer opportunities to support the process of finding the right
information.
In IoTCrawler, the ranking mechanisms have to select the right data sources for all
conceivable use cases, starting with the data source or information that is actually
needed (e.g. traffic sensors for routing). To support the ranking mechanisms, the
available data sources need to be rated. If it is known whether a piece of information
is correct, application developers can guarantee a better user experience. The
Quality Measures therefore have to consider not only single data sources but also
combinations of them; combining heterogeneous and independent data sources
offers new possibilities for analysis, but requires more complex mechanisms.
To support the ranking in IoTCrawler, the Quality Measures should consider more
than just the correctness of data sources, even though correctness can be regarded
as one of the most important metrics. Another conceivable metric is a timeliness
rating of the data. This would enable the ranking mechanism to distinguish data
sources that have the same correctness but different timeliness ratings, which can
easily occur if one data source provides data more often than another. In this case,
the ranking has to decide whether more accurate or more current data is needed.
This system can be extended with further metrics to rate quality. To store all possible
measures, an annotation scheme has to be developed that is, on the one hand,
usable by the ranking mechanisms and, on the other hand, easily extensible with
additional measures.
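As an illustration of such a metric, a simple timeliness rating can relate the age of the latest measurement to the source's expected update interval. The linear decay below is one possible choice for a sketch, not the definition used by IoTCrawler (the project's metrics are detailed in Section 6):

```python
from datetime import datetime, timedelta, timezone

def timeliness(last_update, expected_interval, now=None):
    """Illustrative timeliness rating in [0, 1]: 1.0 while the latest
    value is younger than the expected update interval, then decaying
    linearly to 0 once it is a full interval overdue."""
    now = now or datetime.now(timezone.utc)
    age = (now - last_update).total_seconds()
    expected = expected_interval.total_seconds()
    if age <= expected:
        return 1.0
    return max(0.0, 1.0 - (age - expected) / expected)

# A source updating every minute, last seen 90 seconds ago, scores 0.5;
# one seen 30 seconds ago scores 1.0.
now = datetime(2019, 1, 31, tzinfo=timezone.utc)
fresh = timeliness(now - timedelta(seconds=30), timedelta(minutes=1), now)
stale = timeliness(now - timedelta(seconds=90), timedelta(minutes=1), now)
```

With such a rating annotated per source, the ranking can trade a high-timeliness, lower-accuracy source against an accurate but infrequently updated one, as described above.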
2.4 Security and Privacy Concerns
With the rapid development of the Internet of Things (IoT), a variety of IoT
applications contribute to our everyday life. They range from traditional equipment
to general household objects and help make people's lives better. Nowadays, IoT is
widely applied in social life applications such as the smart grid, intelligent
transportation, smart security, or the smart home [1]. Additionally, smart cities will
involve millions of autonomous smart objects around us, monitoring, collecting and
sharing data without us being aware of it [2].
Various application fields require the consideration of security and privacy as a
cornerstone in our architecture and, of course, in our information model at different
levels. From the point of view of the use of IoTCrawler, we have two kinds of entities
worth considering: producers and consumers. Each one requires different and
complementary security measures.
From the point of view of producers, our information model must be aware of security
requirements such as ownership, access control, visibility, and the like, which allow
the producer to specify security properties over the data to be registered in our
IoTCrawler. On the other hand, from the consumer point of view, searches and queries
carried out must be controlled in a secure way, presenting the results matching their
authorization privileges.
Furthermore, privacy must be considered because it allows the concealment of sensitive information, which must be consumed only by legitimate consumers.
All in all, security and privacy are important and have been considered in our information model because of the above implications. The model enables the enforcement of security and privacy policies through selected security mechanisms and technologies.
3 State of the Art
3.1 Ontology Models
3.1.1 IoT Ontologies
The Semantic Sensor Network ontology (SSN) is probably the best-known ontology for describing sensors in terms of capabilities, measurement processes, observations and deployments. Recently, the authors of SSN have proposed SOSA (Sensor, Observation, Sample, and Actuator) to act as a central building block for SSN, but also to be used as a standalone lightweight alternative (see Figure 3-1) [13].
SOSA includes concepts such as sensors, outputs, observation values and features of interest. IoT-Lite [14] is another lightweight ontology for IoT resources, entities and services. It is an instantiation of SSN and describes three concepts: objects, systems or resources, and services. Its lightweight nature allows IoT platforms to be represented and used without excessive processing time when querying the ontology, making it quite suitable for real-time sensor discovery.
IoT-O1 is another ontology in the Internet of Things domain, intended to model knowledge about IoT systems and applications. It comprises different modules, such as a sensing module, an acting module, a service module, a lifecycle module and an energy module, as well as a module specific to IoT.
1 https://www.irit.fr/recherches/MELODI/ontologies/IoT-O.html
Figure 3-1 The SOSA and SSN ontologies and their vertical and horizontal modules (taken from
https://www.w3.org/TR/vocab-ssn/. Last accessed: 28/11/2018)
The Stream Annotation Ontology (SAO) is a lightweight semantic model, which is built
on top of well-known models to represent IoT data streams. It uses concepts from
the Timeline Ontology (e.g. instant and interval) to represent temporal features, and it
also adopts Agent, Entity and Activity concepts from the PROV-O ontology. For
representing events, SAO uses the Event class from Event Ontology and it introduces
the StreamEvent concept, which describes the output of a stream observation. SAO
adopts the Sensor concept from SSN and extends the SSN’s Observation class with
the StreamData concept, which describes a segment or point which is connected to
time (i.e. it represents a stream as a point or segment). One of the main features of
SAO is the StreamAnalysis class, which allows the representation of an IoT stream
which is derived from one or more data streams, following a data analysis process.
SAO allows the representation of both the derived stream and the methods that were used during the analysis process [15].
Smart Appliances REFerence (SAREF) is an ontology which links different smart appliances using different ontologies and standards. The SAREF ontology combines different parts of existing ontologies based on need [16]. OMA is another, domain-specific ontology which, according to its technical specification, defines different types of objects, such as Security, Server, AccessControl, Device, ConnectivityMonitoring, Firmware, Location and ConnectivityStatistics [17].
3.1.2 Geo and Time Ontologies
There are also location ontologies which have been used in IoT sensor and stream ontologies. GeoSPARQL is an ontology which represents geospatial data for the Semantic Web. It can work with systems based on both qualitative spatial reasoning and quantitative spatial computations [18].
The Geo ontology2 is another popular ontology that represents location data in RDF (Resource Description Framework). It does not try to tackle many of the matters covered in the professional GIS (Geographic Information System) world. Instead, the ontology offers just a few basic terms that can be used in RDF (e.g. in RSS 1.0 or FOAF documents) when there is a need to describe latitudes and longitudes. The use of RDF as a carrier for lat (latitude) and long (longitude) simplifies cross-domain data mixing, as well as describing entities positioned on a map (e.g. a sensor, deployment, platform or system).
GeoJSON3 is a geospatial data interchange format based on JavaScript Object Notation (JSON). It defines several types of JSON objects and the way they are combined to represent data about geographic features, their properties, and their spatial extents.
There are also time ontologies, one of which is the Time ontology4. It is an ontology for describing temporal properties in the world and on Web pages. It provides vocabulary for representing information about topological (ordering) relations, duration and temporal position (i.e. date-time information). Time can be expressed using a conventional clock, Unix time, geologic time and other reference systems. Durations can likewise be expressed in different systems, for example the Gregorian calendar.
2 http://www.w3.org/2003/01/geo/wgs84_pos#
3 https://tools.ietf.org/html/rfc7946
4 https://www.w3.org/TR/2017/REC-owl-time-20171019/
3.1.3 Security and Privacy Ontologies
Early work on privacy and security ontologies focused on interactions over the Web, in particular on specifying privacy settings to enable access control. One example is the Privacy Preference Ontology (PPO) [27], a lightweight privacy vocabulary for access control. It enables users to create fine-grained privacy preferences for their data. The vocabulary is designed to restrict any resource to certain attributes which a requester must satisfy in order to gain access. A security ontology was proposed in [29], which models stakeholders and their assets, as well as different properties resulting from security risk analysis tools, such as vulnerabilities, impacts and countermeasures. The authors of [28] analysed trust on the Web and proposed a simple trust ontology in which different types of trust and their properties are formalised. The Pervasive Trust Ontology (PTO) [32] deals with trust computation in pervasive domains and uses different properties among devices to assign trust values. A trust manager is responsible for assigning the initial trust values and defining semantic relations among trust categories. Trust values are updated based on interactions among devices, and security rules, written in SWRL5 [1], are used to decide whether an interaction is allowed.
More recent work builds upon existing models, but with a focus on IoT domains. The work in [30] extends the Smart Appliances REFerence (SAREF) ontology6 with security features, focusing on the smart home domain. The model includes infrastructure, attacks, vulnerabilities and countermeasures for the main components of smart home energy management systems, such as the Smart Meter, Smart Appliances, the Home Gateway, and billing data. The work in [33] also targets security and privacy in IoT-based smart homes. The authors propose an ontology-based security service framework to support security and privacy preservation during interactions/interoperations. The ontology captures the security objective of an interaction/interoperation (e.g. whether the focus is on integrity or confidentiality), digital signatures, security keys, and the encryption algorithms used, among others. Security policies based on the ontology can then be created to govern the interactions/interoperations between service providers and customers in the developed cloud architecture model for IoT-based smart homes.
5 https://www.w3.org/Submission/SWRL/
6 http://ontology.tno.nl/saref/
The SecKit [31] is a model-based security toolkit that supports integrated modelling of IoT system design and runtime viewpoints, allowing an integrated specification of security requirements, threat scenarios, trust relationships, and usage control policies. SecKit integrates approaches for policy refinement, policy enforcement technology at different levels of abstraction with strong guarantees, context-based policy specification, and trust management. Although no formal ontology is given, the system and its trust metamodels can be used as a reference for ontology modelling.
Mozzaquatro et al. [34] have proposed a reference ontology for IoT security covering security concepts of M2M communications. The work is based on a state-of-the-art analysis of information security and the Internet of Things. Their IoTSec ontology features classes such as Assets, Threats, Security Mechanism, and Vulnerability, and it is available online7.
3.2 QoI Computation
Quality of Information is a term originally developed to describe the quality of data held in databases. Strong et al. cite several studies which estimate that faulty data costs billions of dollars [3]. To address these issues, they defined four categories of data quality, each with a list of dimensions. Together with problem patterns to identify possibly faulty data, this built a basis for further rating of Quality of Information. Over the following years, different approaches and frameworks have been developed to rate QoI for different use cases, e.g. [4], [5].
The key aspects when talking about Quality of Information are the categories or metrics used to describe the details of the QoI information added to some data. Five common metrics, used by the authors of [6], are Completeness, Correctness, Concordance, Currency and Plausibility. In addition, the authors provide assessment methods for quality and mappings between the quality dimensions.
An extensive ontology to describe Quality of Information in a machine-readable way has been developed by the CityPulse project. It uses five categories, namely Timeliness, Cost, Accuracy, Communication and Security, each with several sub-metrics. The ontology has been designed to annotate data streams with QoI information and therefore relates to other ontologies.
7 https://github.com/brunomozza/IoTSecurityOntology
Figure 3-2 CityPulse QoI Ontology
Besides the representation and categorisation of QoI, algorithms that fill the metrics with values have to be developed. One major problem, especially with respect to correctness, is the lack of ground truth. This problem is well known in the image processing domain, where an image has to be rated for its quality without any reference [7]. To rate images without an existing ground truth, the work in [8] analyses the sharpness of edges or the noise levels.
Ideas on how to deal with missing ground truth for social media data have been proposed in [9]: spatiotemporal, causality, and outcome evaluation. The proposed concept of spatiotemporal evaluation is similar to the approach of the Valid.IoT framework presented in [10]. That framework not only uses spatiotemporal relations but also considers historic data to evaluate the quality of data sources. Infrastructure models can also be used to calculate the quality measures. This approach is more complex than using simple metrics such as the Euclidean distance, but avoids common errors that can easily occur; e.g., the "real" distance from one street to another can be much longer than the Euclidean distance because of the street network [11], [12].
4 Annotation Model for Discovery
Semantic data modelling is a systematic process of structuring data in order to represent it in a precise, logical manner. The conceptual data model includes semantic information that adds a basic meaning/description to the data and the relationships between them. This approach to data modelling preserves data consistency when the data is updated.
The design of the IoTCrawler ontology is based on a set of principles, always keeping the ontology as lightweight as possible. The ontology extends the IoT-Stream ontology, which provides stream annotation concepts derived from SOSA.
While developing the semantic model we followed several best practices found in the literature. First, we referred to the most widely followed guide for creating ontologies, "Ontology Development 101", published in 2001 [19]. Second, in 2003 the W3C published a list of sample Good Ontologies following specific good practices, which we have also followed. Finally, in 2016, the development of the IoT-Lite ontology led to an extension of these guidelines to cover the scalability of ontologies.
The guidelines of "Ontology Development 101" divide the development of an ontology into seven detailed steps.
In IoTCrawler, step 1, which focuses on determining the domain and scope of the ontology, is the result of several discussions with project partners who have a wealth of knowledge and insight into the development of ontologies. This collective experience allows us to answer the question of who the end users of the ontology will be.
Step 2, which considers reusing existing ontologies, draws on several years of studying the ontologies in the area and selecting the proper ones to be reused, in full or in part.
Steps 3 to 6 deal with enumerating the important terms in the ontology, defining the classes and the class hierarchy, the properties of the classes (slots), and the facets of the slots. These steps are detailed further in this section; step 7, creating instances, is illustrated in Figure 4-1.
Similarly, the "Good Ontologies" list published by the W3C scored the ontologies based on five aspects, and in the development of IoT-Lite [14] the authors published 10 steps for semantic model development. We have followed all of these aspects and steps, but cannot describe them here for reasons of space. Following the previous guidelines, and always keeping in mind that the main principle behind IoTCrawler is to provide a lightweight extension for data streams, we have developed IoT-Stream. The development of IoTCrawler has always followed the linked-data approach, which increases the chance of interoperability by extending popular ontologies. The SSN ontology has had a significant impact on the semantic annotation of IoT elements and is therefore the ontology that IoTCrawler mainly extends, in addition to adopting its revised version, SOSA.
Figure 4-1 An instance of an IoTCrawler Object
Hence, the aim of the IoTCrawler ontology is to provide the elementary concepts needed to process stream data, using and extending SOSA and IoT-Stream, and thus allowing stream annotations. A large percentage of the data retrieved from IoT devices is stored in streams.
For example, sensors measuring relative humidity, traffic or pollution deliver streams of data with timestamps. When researchers and software developers need to process IoT data to obtain meaningful insights, they have to deal with the heterogeneity of the formats and syntax of the data.
Semantics can help to overcome this heterogeneity. However, semantic annotation was originally conceived to give detailed descriptions of the data, at a time when there was no need for (quasi) real-time responses and the amount of data was not huge. Therefore, the few semantic models that cover stream data tend to describe the real world in detail with several concepts. One example of such semantic models is SAO, which has been used successfully for forensic analysis and some (quasi) real-time analysis. However, when dealing with huge amounts of high-granularity data, analytics start to lag.
IoTCrawler intends to reduce the processing time of stream ontologies by annotating the streams with only the elementary concepts needed to process the data. The IoTCrawler model is composed of concepts from popular and commonly used ontologies. IoT-Stream, the main one, is composed of only four main concepts, as can be seen in Figure 4-2.
Figure 4-2 IoTCrawler Information Model
These concepts are IoTStream, StreamObservation, Analytics and Event. The first concept created in the ontology was StreamObservation. This concept has only two temporal data properties, windowStart and windowEnd, and one temporal property, sosa:resultTime, inherited from sosa:Observation. Although the well-known and commonly used Time Ontology provides temporal concepts, linking to these properties for each StreamObservation would be too heavyweight. The observation value itself is captured by sosa:hasSimpleResult. The remaining concepts with other useful information are kept outside this concept.
Using analytics, events can be detectedFrom single or multiple streams; for example, whether it is a windy, snowy or sunny day can be derived from humidity and temperature streams.
One of the main contributions of ontologies is the ability to share and link data from different models. That is why most of the semantic community has adopted the linked-data approach, and hundreds of semantic models are now linked together and can share their annotated data without further conversion efforts.
We have selected only stable and well-known ontologies, generally backed by a strong organisation, such as the W3C or the Open Geospatial Consortium (OGC). In doing so we ensure that the ontologies are dereferenceable and will not disappear in the near future. For example, for the annotation of sensor devices and services we have used the SOSA and IoT-Lite ontologies.
It is common for IoT applications to support filtering, sorting and searching by different parameters such as sensor type, location, quantity kind, etc. Therefore, we have linked the IoTCrawler ontology to several existing ontologies that describe these concepts. For location we have linked to the Geo ontology, and for observation coverage GeoSPARQL and IoT-Lite are employed. For quantity-kind and unit taxonomies we have used the QU ontology.
Taking this approach, IoTStreams link to concepts from other ontologies to capture information about the qoi:Quality of the stream via the providedBy property of the class iot-lite:Service. A stream also links to the sosa:Sensor it was generatedBy, the qu:QuantityKind and qu:Unit that the sosa:Sensor measures with, and the geo:Point where the stream originates. For searching purposes we have therefore centered the ontology around the IoTStream concept, providing direct links from this concept to the information needed to form common searches. With a central concept, queries become lighter, because they need fewer triples to find each aspect of the search.
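To illustrate how such a stream-centred annotation might look in practice, the following sketch serialises a minimal annotation as JSON-LD using only the standard library. The namespace IRIs, the example identifiers, and the exact class/property names (iots:IotStream, iots:generatedBy, geo:location) are assumptions for illustration; the normative IoT-Stream vocabulary may differ.

```python
import json

# Hypothetical sketch of a stream annotation as JSON-LD. The context IRIs and
# identifiers are illustrative assumptions, not the normative vocabulary.
annotation = {
    "@context": {
        "iots": "http://purl.org/iot/ontology/iot-stream#",
        "sosa": "http://www.w3.org/ns/sosa/",
        "geo": "http://www.w3.org/2003/01/geo/wgs84_pos#",
    },
    "@id": "urn:example:stream:temp-001",
    "@type": "iots:IotStream",
    # the stream links directly to the sensor and the location, so a search
    # needs only one hop from the central stream concept
    "iots:generatedBy": {
        "@id": "urn:example:sensor:ds18b20-001",
        "@type": "sosa:Sensor",
    },
    "geo:location": {"geo:lat": 52.2799, "geo:long": 8.0472},
}

doc = json.dumps(annotation, indent=2)
```

Because the sensor, location and quantity information hang directly off the stream node, a query engine can resolve the common search criteria without traversing long property chains.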
5 Data Representation for Indexing
As described in Deliverable D2.2 [48], the IoTCrawler architecture envisions a distributed metadata repository, which compiles information (i.e. metadata) about the different IoT deployments and their services/devices/data. Similar to Web search engines, in which Web pages are indexed to allow fast data retrieval, indexes are needed in the IoTCrawler framework to support search and data discovery. However, most existing solutions for Web indexing are designed around text analysis and the exploitation of links between documents/data resources on the Internet, and are not suitable for large-scale, dynamic IoT data networks [9].
Indexing large volumes of heterogeneous and dynamic IoT data/sources requires distributed, efficient and scalable mechanisms that provide fast access to and retrieval of data in response to user queries. Deciding how often these indexes should be updated and re-arranged while data streams are continuously published is crucial to enable on-line indexing [36]. With on-line indexing, the indexing structures are built incrementally, with the goal of continuously updating the indexes with newly connected resources and the data that becomes available on the network, without re-building the entire indexing structure. Moreover, the majority of IoT data has inherently spatio-temporal aspects that must be considered. Searching for patterns (e.g. temperature above 24 °C in a certain location) or answering complex queries (e.g. get all temperature sensors whose locations are at most 2 km away, or get all sensors that have observed extremely low visibility within the last 20 minutes) often requires a combination of different indexing techniques in order to find relevant results.
As the metadata repository holds information spread across multiple IoT deployments from different domains, the number of different types of services available, and of their properties, is expected to be large. Having indexes for every metadata item in the repository is prohibitively expensive to store and maintain. An index (especially in IoT) must be lightweight, yet effective in supporting the search task. Therefore, the decision on what to index is often application-dependent.
In [36], Fathy et al. performed an extensive survey of the state of the art in indexing, discovery and ranking of IoT data. Their work identified three main types of queries in IoT applications: data queries (find a particular piece of information), resource queries (find a particular device) and higher-level abstractions (find a particular pattern). Moreover, the information needs of each of these query types can be heterogeneous (e.g. queries with both location and resource-type requirements).
Depending on the attributes to be indexed, different indexing structures and
techniques are required. Below we describe some of the indexing methods which we
will consider in IoTCrawler. These solutions will be extended to provide distributed
and adaptive features.
For indexing location attributes, the work in [37] proposes a framework that supports spatial indexing of the geographical values of observation and measurement data collected from sensory devices. The work is based on encoding the locations of measurement and observation data using geohash (a Z-order curve). Geohash is a well-known and widely used geocoding algorithm based on interleaving bits to convert spatial coordinates (longitude and latitude) into a single string. Other works use geohashing in combination with other attributes (e.g. time or sensor type) to provide indexing solutions for multi-dimensional data.
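To make the bit-interleaving idea concrete, the following minimal encoder maps a coordinate pair to a geohash string using the standard base32 alphabet. It is a simplified sketch of the general geohash algorithm, not the implementation used in [37].

```python
# Minimal geohash encoder: interleave longitude and latitude bits (Z-order
# curve) and map each group of 5 bits to the standard base32 alphabet.
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash_encode(lat: float, lon: float, precision: int = 9) -> str:
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    bits, bit_count, even = 0, 0, True  # even bits encode longitude
    hash_chars = []
    while len(hash_chars) < precision:
        if even:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        even = not even
        bit_count += 1
        if bit_count == 5:  # 5 bits per base32 character
            hash_chars.append(BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(hash_chars)
```

The prefix property is what makes geohashes attractive for indexing: nearby locations usually share a common string prefix, so spatial proximity can be approximated with prefix matching.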
Barnaghi et al. [38] make use of geohashing and further annotate sensor data semantically to allow spatio-temporal indexing. While the spatial characteristic of the data is described using a geohash, the temporal feature and the type of the data are encoded using the MD5 digest algorithm. Singular Value Decomposition (SVD) is applied to the geohash vectors to reduce dimensionality before a k-means clustering algorithm distributes the data among repositories and allows data querying. Each data item is represented as a long string: the geohash tag of the location concatenated with the MD5 digest of the time and type values. Answering approximate queries is based on a predictive method that selects the repository that is likely to hold the requested hash string. Then, a string similarity method is used to find the best match for a requested query in the selected repository. In [35] an indexing structure is proposed which is built by first clustering resources based on their spatial features. A tree-like structure is then constructed per cluster, in which each branch represents a type of resource (e.g. temperature or humidity sensors). The indexing mechanism supports an adaptive process for updating the index with minimal cost. However, the approach is limited to a predefined set of resource types and only supports exact-match queries over multi-dimensional attributes (i.e. exact locations and types).
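The long-string representation described above can be sketched as follows; the function name, the field separator, and the concatenation order are our assumptions for illustration, not the exact scheme of [38].

```python
import hashlib

def index_key(geohash_tag: str, timestamp: str, sensor_type: str) -> str:
    """Concatenate the spatial geohash tag with an MD5 digest of the
    temporal and type attributes (illustrative sketch of the hybrid key)."""
    digest = hashlib.md5(f"{timestamp}|{sensor_type}".encode()).hexdigest()
    return geohash_tag + digest

key = index_key("u4pruydqqvj", "2019-01-31T12:00:00Z", "temperature")
```

Items at the same location then share a key prefix, so prefix or string-similarity matching can route a query to a candidate repository before the exact item is looked up.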
A Distributed Index for Multi-dimensional data (DIM) is proposed in [39]. It is a tree-based indexing approach which divides the network into geographical zones. Each zone corresponds to a node in the tree, and each node represents a range of values, such that the tree root represents the entire range. DIM can answer multi-dimensional range queries. However, the routing algorithm used to answer user queries is computationally expensive, which hinders its scalability in large-scale networks.
Distributed Hash Tables (DHTs) are widely used for indexing based on a particular term or attribute. While DHTs only support exact matches for a given key, existing work [40] extends DHTs to support multi-attribute and range queries.
The Continuous Range Index (CR-index) [41] is a tree-based method for indexing observation data based on their type attribute and value ranges. The method constructs a compact indexing scheme in which a collection of observation and measurement data items is grouped into boundary blocks based on their value ranges [min, max] (i.e. interval blocks).
While the CR-index is an example of using attribute values in the indexing structure, it does not consider the temporal characteristics of streaming sensor data. The approach in [42] creates a Bayes-tree index by aggregating hierarchical Gaussian mixtures in the form of a tree to represent the entire data set. The approach supports both incremental learning and anytime classification of newly arriving data streams. The latter allows fast access and search with a good level of accuracy. However, Gaussian mixture models are not suitable for multi-feature indexing, and the number of mixture components needs to be known in advance, which hinders flexibility. For time-series indexing, the most notable works are SAX and its derivatives, which we describe below.
Symbolic Aggregate approximation (SAX) was the first symbolic representation for time series that allows dimensionality reduction and indexing with a lower-bounding distance measure. Much of the utility of SAX has since been subsumed by iSAX (indexable Symbolic Aggregate approximation) [43], a generalisation of SAX that allows indexing and mining of massive datasets. An adaptive iSAX and tree-based indexing mechanism to answer approximate queries has been proposed in [44]. The cost of index construction is shifted from initialisation time to query time by creating and refining indexes while responding to queries. However, like SAX, iSAX strongly assumes that the data follows a Gaussian distribution and uses a z-normalisation in which the magnitude of the data is lost. IoT data does not necessarily follow a Gaussian distribution; the data distribution might change over time due to the nature of the observed phenomenon and/or concept drift. SensorSAX [44] is an enhancement of the SAX approach that adapts the window size based on the spread of the data values (i.e. using a standard-deviation criterion). It converts raw sensor data into a symbolic representation and infers higher-level abstractions (e.g. dark room or warm temperature).
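A minimal SAX transformation (z-normalisation, piecewise aggregate approximation, and symbol mapping via equiprobable breakpoints of the standard normal distribution) can be sketched as follows. An alphabet of size 4 and a series length evenly divisible by the segment count are simplifying assumptions of this sketch.

```python
import bisect
import math

# Breakpoints splitting N(0,1) into four equiprobable regions (alphabet "abcd").
BREAKPOINTS = [-0.6745, 0.0, 0.6745]
ALPHABET = "abcd"

def sax_word(series, n_segments):
    # z-normalise the series (this is where the magnitude of the data is lost)
    mean = sum(series) / len(series)
    std = math.sqrt(sum((x - mean) ** 2 for x in series) / len(series))
    z = [(x - mean) / std if std > 0 else 0.0 for x in series]
    # piecewise aggregate approximation: mean of each equal-length segment
    seg_len = len(z) // n_segments  # assumes len(series) % n_segments == 0
    symbols = []
    for i in range(n_segments):
        seg_mean = sum(z[i * seg_len:(i + 1) * seg_len]) / seg_len
        symbols.append(ALPHABET[bisect.bisect(BREAKPOINTS, seg_mean)])
    return "".join(symbols)
```

A steadily rising series such as [1, 2, ..., 8] with four segments yields the word "abcd", while a flat series maps every segment to the same symbol, which illustrates why the representation discards magnitude and keeps only shape.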
6 Quality Measures and Analysis for Ranking
6.1 Quality of Information
Section 2.3 identified a number of issues and requirements for current and future IoT scenarios. One major requirement is the need for quality analysis of data sources and the data they deliver/produce. This is important because false or misleading information might cause problems when processing and using the information. These problems range from simply misconfigured sensors that deliver wrong information to intentionally provided false information that leads to malfunctioning systems and applications. To address these problems, IoTCrawler integrates quality measures and analysis modules that rate data sources in order to identify the best-fitting sources for the needed information.
The first step before implementing quality analysis modules is to identify quality measures that can be used to rate data sources and their delivered/produced data for Quality of Information. To measure the QoI, IoTCrawler has identified and developed several metrics, which are presented in the following subsections.
6.1.1 General Approach and QoI Vector
The general approach of the quality analysis component of IoTCrawler is to generate a quality vector Q, which can be used by the indexing and ranking components to find the best-fitting data sources for a use case.

Q = ⟨q_cmp, q_tim, q_pla, q_art, q_con⟩

Q contains a list of quality metrics: Completeness q_cmp, Timeliness q_tim, Plausibility q_pla, Artificiality q_art and Concordance q_con. These metrics are described in detail in the following sections.
6.1.2 Completeness
The Completeness metric defines whether a data source message provides all information that was defined in its description. When a new data source is registered, part of its description is the set of fields that should be contained in delivered messages, e.g., temperature, humidity and wind speed for a weather sensor. The Completeness can then be calculated as

q_cmp = 1 − M_miss / M_exp

where M_miss is the number of missing values and M_exp the number of expected values of an incoming dataset. M_exp is extracted from the model description in the IoT Metadata of the data source (IoTCrawler's MDR [48]). Each data source message is rated for Completeness on its own, without including older messages.
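A minimal sketch of this calculation, assuming the expected fields are available from the data source description (the field names are illustrative):

```python
def completeness(message: dict, expected_fields: list) -> float:
    """q_cmp = 1 - M_miss / M_exp for a single data source message."""
    missing = sum(1 for field in expected_fields if message.get(field) is None)
    return 1 - missing / len(expected_fields)

# weather sensor announcing three fields, but one is absent from the message
q_cmp = completeness({"temperature": 21.3, "humidity": 40.0},
                     ["temperature", "humidity", "wind_speed"])
```

Here q_cmp evaluates to 2/3, since one of the three expected values is missing from the message.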
6.1.3 Timeliness
The Timeliness metric rates whether an observation was processed within a defined time frame before being delivered to the framework. Technically, it calculates the difference between the current time and the time stamp of the measured effect. If the difference is outside the defined range (i.e. the observation is too old), the QoI metric Timeliness is lowered. Calculating the Timeliness requires that a timestamp be added to measured values as close to the sensor as possible. If the sensor itself cannot add a timestamp and there is no direct gateway for the sensor that can add one, the calculation of Timeliness is not available. In contrast to common QoS evaluation, this time-related quality metric depends on the availability of the (updated) information rather than on technical transmission times. The Timeliness quality is evaluated against the previously annotated source properties (in the Metadata Repository).
T_freq defines the maximum time interval in ms expected between two measured values; T_freq = 0 denotes a purely event-based measurement transfer. The Frequency is calculated as

T_freq = t(x) − t(x − 1)

where t(x) is the time stamp of a received data source message, which is compared to the time stamp of the previous message. T_freq can then be used to measure the Timeliness of a data source message by comparing it with the announced timing settings in the data source description. In comparison to the Frequency, the Age metric measures how old a data source message is when it arrives in the framework. It is calculated as the difference between the current time stamp and the time the measurement was taken.
T_age = t_now − t(x)
To normalise the Frequency and the Age to the interval [0, 1], the Reward and Punishment algorithm introduced in [46] is modified and integrated. This algorithm takes numerical values such as Age or Frequency and compares them with given upper and lower bounds annotated in the IoT Metadata description. If a measured value is within these bounds, the reward increases; otherwise the value is punished. The reward is calculated as

R_d(t) = α_{W−1}(t − 1) / (W − 1) − (α_{W−1}(t − 1) + α_current(t)) / W

where W is the length of a sliding window over the last inputs and α_{W−1} denotes the number of measurements within the given interval that have been rewarded. α_current ∈ {0, 1} is the current reward or punishment decision, which is 1 for a measurement within the interval and 0 otherwise. The quality metric can then be calculated as

q(t) = |q(t − 1) − 2 · R_d(t)| with q(0) = 1
with 𝑞(𝑡) as the value of the quality metric at time 𝑡 and 𝑞(𝑡 − 1) as the previous
value. By normalising both submetrics of Timeliness with the help of the Reward and
Punishment algorithm, they can be combined to form the numeric value 𝑞𝑡𝑖𝑚 for the
Timeliness within the QoI vector 𝑄.
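The steps above can be sketched in Python. This is a minimal illustration of the reconstructed formulas; all function names (`t_freq`, `t_age`, `reward`, `quality_update`) and the example tolerance are chosen for this sketch and are not taken from the IoTCrawler implementation:

```python
def t_freq(t_curr, t_prev):
    """Measured interval between two data source messages (ms)."""
    return t_curr - t_prev

def t_age(t_now, t_meas):
    """Age of a message: arrival time minus measurement time (ms)."""
    return t_now - t_meas

def reward(decisions, current, W):
    """R_d(t) of the modified Reward and Punishment algorithm [46].

    decisions: 0/1 reward decisions of the sliding window (1 = value was
               within the annotated bounds), newest last.
    current:   0/1 decision for the newest sample (alpha_current).
    """
    alpha_prev = sum(decisions[-(W - 1):])  # rewarded samples in last W-1 slots
    return alpha_prev / (W - 1) - (alpha_prev + current) / W

def quality_update(q_prev, r_d):
    """q(t) = |q(t-1) - 2 * R_d(t)|, with q(0) = 1."""
    return abs(q_prev - 2 * r_d)
```

With a fully rewarded window the reward term vanishes and the quality value stays stable, e.g. `quality_update(1.0, reward([1, 1, 1], 1, W=4))` returns `1.0`.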
6.1.4 Plausibility
The Plausibility metric defines whether received data source information makes
sense with regard to the probabilistic knowledge about what is being measured.
Therefore, physical value ranges (e.g., indoor temperature or vehicle speed) and
sensor specifications are used to calculate the plausibility of a measurement.
The evaluation of plausibility utilises a set of sensor annotations and upper
ontology definitions to determine the expected value range of an incoming
measurement. These measurements are hierarchically evaluated and combined
based on their probability.
Atomic Fail-Plausibility Value                          Definition                                          Value Range
Pfail(DS18B20) = 0                                      Sensor Temperature, Dallas Semiconductors DS18B20   [−55 °C, +100 °C]
Pfail(Indoor Office Temperature) = 0.5                  Indoor Office Temperature                           [14 °C, 25 °C]
Pfail(Individual Room Temperature Environment) = 0.8    Environment Temperature                             Room specific
Table 6-1 Plausibility value ranges in different scenarios
𝑞𝑝𝑙𝑎(𝑣) = ∏ 𝑃𝐴𝑛𝑛𝑜𝑡𝑎𝑡𝑖𝑜𝑛(𝑣) = 𝑃𝐷𝑆18𝐵20(𝑣) ⋅ 𝑃𝐼𝑛𝑑𝑜𝑜𝑟(𝑣) ⋅ 𝑃𝐸𝑛𝑣𝑖𝑟𝑜𝑛𝑚𝑒𝑛𝑡(𝑣) for measurement 𝑣
The range of the Plausibility value is defined between 0 and 1. If the value lies within
the “Value Range” described in Table 6-1, the Plausibility will be 1; otherwise it will
be 0.
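The per-annotation range check and its combination by multiplication can be sketched as follows. This is a minimal illustration assuming the binary (0/1) interpretation described above; the hierarchical probability weighting indicated by the Pfail values in Table 6-1 is deliberately omitted, and the function name `plausibility` is chosen for this example:

```python
def plausibility(value, ranges):
    """q_pla(v): product of per-annotation range checks (sketch).

    ranges: (low, high) bounds collected from the annotation hierarchy,
    e.g. the sensor specification and the indoor-temperature bounds.
    Each factor is 1 if the value lies inside the range, 0 otherwise,
    so a single violated annotation forces the product to 0.
    """
    q = 1.0
    for low, high in ranges:
        q *= 1.0 if low <= value <= high else 0.0
    return q

# Example bounds taken from Table 6-1 (DS18B20 spec, indoor office range):
office = [(-55.0, 100.0), (14.0, 25.0)]
```

For instance, a reading of 21 °C satisfies both annotations, while 30 °C violates the indoor office range and yields a plausibility of 0.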
6.1.5 Artificiality
The Artificiality metric determines the inverse degree of applied sensor fusion
techniques and defines whether a value is a direct measurement of a single sensor,
an aggregated sensor value from multiple sources, or an artificial, spatiotemporally
interpolated value. If the sensor information originates from an individual IoT
hardware sensor and is neither aggregated nor interpolated, we assume 𝑞𝑎𝑟𝑡 = 1. An
unidentified information source, which aggregates information with unidentified
algorithms, will be annotated with 𝑞𝑎𝑟𝑡 = 0. The metric can be individually adapted to
the openness of the connected IoT framework.
The specification of the metric will be published in D5.3 together with the
description of the virtual sensor algorithms.
6.1.6 Concordance
The Concordance metric describes the agreement between the information of a
data source and the information of further independent data sources that report
correlated effects. The Concordance analysis takes any given sensor 𝑥0 and
computes the individual concordances 𝑐(𝑥0, 𝑥𝑖) with a finite set of 𝑛 sensors, where
𝑖 ∈ 𝑁. It can be assumed that 𝑐(𝑥0, 𝑥0) = 1, since it represents the same sensor. As
the concordance of a sensor with itself is not required in a real-world scenario, it can
be established that 𝑥0 ≠ 𝑥𝑖. Hence 𝑐(𝑥0, 𝑥𝑖) will be in the range [0,1). The decision, i.e.,
which pieces of individual information agree with each other (e.g., a slow traffic
event with ∆𝑣 = −12 km/h), is stored in the IoT Relationship Model, see Section 6.3.
The overall concordance 𝑞𝑐𝑜𝑛(𝑥0) at the given sensor location 𝑥0 is then calculated by
the concordance function
𝑞𝑐𝑜𝑛(𝑥0) = ∑ᵢ₌₁ⁿ 𝜆𝑖(𝑥0) ⋅ 𝑐(𝑥0, 𝑥𝑖) = 𝜆1(𝑥0) ⋅ 𝑐(𝑥0, 𝑥1) + 𝜆2(𝑥0) ⋅ 𝑐(𝑥0, 𝑥2) + ⋯ + 𝜆𝑛(𝑥0) ⋅ 𝑐(𝑥0, 𝑥𝑛)
with 𝜆 as a weight function
𝜆𝑖(𝑥0) = 1 / 𝑑(𝑥0, 𝑥𝑖)
and 𝑑(𝑥𝑎, 𝑥𝑏) as a propagation- and infrastructure-based distance function between
the sensor locations 𝑥𝑎 and 𝑥𝑏 of sensors 𝑎 and 𝑏.
To achieve an exponential neglection of samples with a high distance, the 𝑥-th
power of the distance can be applied, based on the derived propagation model:
𝑞𝑐𝑜𝑛(𝑥0) = [∑ᵢ₌₁ⁿ 𝑐(𝑥0, 𝑥𝑖) / 𝑑ˣ(𝑥0, 𝑥𝑖)] / [∑ᵢ₌₁ⁿ 1 / 𝑑ˣ(𝑥0, 𝑥𝑖)]
6.1.7 Quality Ontology
To integrate the QoI measurements into the IoTCrawler information model for further
processing, an ontology has been created. The ontology is designed to hold the
measurement values of the quality analysis and to connect them to the ontologies
used for IoTCrawler’s information model.
In contrast to other QoI ontologies (see Section 3.2), the IoTCrawler ontology is
optimised for QoI value representation. It contains neither static sensor information
nor any QoS values.
Figure 6-1 Quality Ontology for IoT Data Sources
Figure 6-1 shows the Quality Ontology for IoT Data Sources, containing classes for
all QoI metrics, which are integrated as subclasses of Quality. It is available online at
http://w3id.org/iot/qoi as a first draft version and will be updated during the project
lifetime (and kept online after the project).
To integrate the ontology into the IoTCrawler information model, the property
hasQuality is used.
Property: qoi:hasQuality
URI: https://w3id.org/iot/qoi#hasQuality
hasQuality - Connects a quality to a data source.
OWL Type: ObjectProperty
sub-property-of: owl:topObjectProperty
Domain: http://purl.org/iot/ontology/iot-stream#StreamObservation, http://purl.org/iot/ontology/iot-stream#iot-stream
Range: qoi:Quality
Listing 6-1 Quality Ontology hasQuality Property
Listing 6-1 shows the detailed description of the object property hasQuality. Its
domain identifies two possible connections to the information model of IoTCrawler,
namely StreamObservation and iot-stream. The range names the classes that can
be connected to the classes mentioned in the domain. Note that hasQuality can
connect to the class Quality and all of its subclasses (all QoI metrics are subclasses
of Quality).
The integration of values is achieved by additional properties:
hasAbsoluteValue
This datatype property holds absolute values for a QoI metric, e.g. 60 for Age.
hasRatedValue
This datatype property holds QoI values that are rated with the reward and
punishment algorithm. Therefore, all values are between 0 and 1.
hasUnitOfMeasurement
For absolute values, this object property describes the unit of the given value.
It links to a unit within the OM2 ontology8.
Listing 6-2 shows part of an exemplary QoI annotation and gives an idea of what a
QoI annotation will look like.
@prefix qoi: <https://w3id.org/iot/qoi#> .
@prefix om-2: <http://www.ontology-of-units-of-measure.org/resource/om-2/> .
@prefix sosa: <http://www.w3.org/ns/sosa/> .

### https://w3id.org/iot/qoi
<https://w3id.org/iot/qoi> rdf:type owl:NamedIndividual .

### https://w3id.org/iot/qoi#testCompleteness
qoi:testCompleteness rdf:type owl:NamedIndividual , qoi:Completeness ;
    qoi:hasAbsoluteValue 0.8 .

### https://w3id.org/iot/qoi#testFrequency
qoi:testFrequency rdf:type owl:NamedIndividual , qoi:Frequency ;
    qoi:hasUnitOfMeasurement om-2:second-Time ;
    qoi:hasAbsoluteValue 0.7 ;
    qoi:hasRatedValue 0.85 .

### https://w3id.org/iot/qoi#testPlausibility
qoi:testPlausibility rdf:type owl:NamedIndividual , qoi:Plausibility ;
    qoi:hasAbsoluteValue 0.9 .

### https://w3id.org/iot/qoi#testSensor
qoi:testSensor rdf:type owl:NamedIndividual , sosa:Sensor ;
    qoi:hasQuality qoi:testCompleteness , qoi:testFrequency , qoi:testPlausibility .
Listing 6-2 Quality Annotation Example
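Annotations of this shape can also be produced programmatically. The following Python sketch builds a quality annotation as expanded JSON-LD using only the standard library; the helper name `quality_annotation` and the compact dictionary layout are illustrative and not part of the IoTCrawler codebase:

```python
import json

QOI = "https://w3id.org/iot/qoi#"

def quality_annotation(metric, absolute=None, rated=None, unit=None):
    """Build one qoi:* quality node as a JSON-LD dictionary (illustrative)."""
    node = {"@type": QOI + metric}
    if absolute is not None:
        node[QOI + "hasAbsoluteValue"] = absolute
    if rated is not None:
        node[QOI + "hasRatedValue"] = rated
    if unit is not None:
        node[QOI + "hasUnitOfMeasurement"] = {"@id": unit}
    return node

# A sensor carrying two quality annotations, mirroring Listing 6-2:
sensor = {
    "@type": "http://www.w3.org/ns/sosa/Sensor",
    QOI + "hasQuality": [
        quality_annotation("Frequency", absolute=0.7, rated=0.85,
                           unit="http://www.ontology-of-units-of-measure.org"
                                "/resource/om-2/second-Time"),
        quality_annotation("Plausibility", absolute=0.9),
    ],
}
print(json.dumps(sensor, indent=2))
```

The generated structure can then be serialised or handed to a JSON-LD processor alongside the qoi context.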
6.2 Quality of Service
In addition to the presented QoI ontology and its metrics, there is further
information that can be used for the indexing and ranking mechanisms (depending
on the data source). As shown in Section 4, IoTCrawler uses the SOSA and SSN
ontologies for its information model. These ontologies also contain metrics that
describe the quality of a single sensor. For example, they comprise frequency (ssn-
system:Frequency, the smallest possible time between two measurements), precision
(ssn-system:Precision, the closeness between replicated observations for an
unchanged value) and latency (ssn-system:Latency, the time between the command
for an observation and the result being provided). Although these terms sound like
QoI or QoS metrics, they represent a description of the sensor and its properties
and quality. IoTCrawler plans to use complete frameworks, and not only single
sensors, as data sources. Therefore, these metrics might not be sufficient for every
use case.
8 http://www.ontology-of-units-of-measure.org/page/om-2
In order to include typical QoS metrics like bandwidth, jitter, delay, and latency,
additional ontologies have to be included, as we plan to keep the QoI ontology
simple and restricted to QoI only. Although several QoS ontologies have been
proposed in the past, e.g. QoSOnt [26], none of them is available online today. We
will primarily work with the QoI ontology and use the possibilities of SOSA/SSN to
describe the quality of the sensor. If necessary and possible, we will include
additional QoS metrics by extending these ontologies and the information model.
6.3 Model-based Analysis
The availability of appropriate, accurate and trustworthy data sources in
heterogeneous IoT environments is rapidly growing. However, the lack of available,
machine-readable metadata is still an ongoing shortcoming, especially when it
comes to information that describes the relation between various sensors and
actuators. Previous works [10], [12] showed that distinct knowledge of the
infrastructure where IoT devices are deployed enhances the monitoring and
validation of available sources. Simplified metadata annotation with only
latitude/longitude coordinates of IoT devices prevents utilisation of this
infrastructure knowledge. Due to the frequent unavailability of precise sensor data
and a missing ground truth, there is a high need for interpolation of available
information sources. Common approaches operate on plain geometries, whereby
entity relationships (e.g., a sensor belonging to a building) and infrastructural
limitations (e.g., the blocking of light and sound by a large object or the propagation
of traffic jams along streets) are not considered. Furthermore, applied interpolation
methods do not reflect these infrastructural limitations. WP5 will develop
infrastructure-dependent algorithms for monitoring and interpolation, and therefore
needs access to distinctly modelled infrastructure and entity knowledge.
Entity relations of IoT devices (e.g., belonging to a building, room, street or car), as
well as knowledge of the infrastructure (e.g., street map, building plan, soil) on which
physical effects propagate, have to be modelled for a meaningful quality analysis of
available IoT devices. In the following, this section proposes a joint infrastructure
model approach for buildings (IoT and industry), city/street infrastructure (Smart
City) as well as rural/soil-based applications (e.g. agricultural scenarios) that can be
extended to similar infrastructure patterns.
6.3.1 Infrastructure Model
The Infrastructure Model utilised in IoTCrawler stores the physical models that can
be used to determine the relations between IoT data sources and actuators. It
stores, e.g., maps and building plans, or a Geospatial Topology, which can be used
to determine propagation directions. To enable domain-independent utilisation of
IoTCrawler, we utilise the following three infrastructure models, which can be
reused for infrastructure-based monitoring and interpolation approaches. These
models can be transformed from Euclidean space into a topology space and
directed graphs; the process is described in IoTCrawler Deliverable D5.1.
a) Transportation and city infrastructure for the whole world is modelled and
freely available in OpenStreetMap [52]. The OpenStreetMap database is used
to obtain infrastructure information for the movement of vehicles on roads,
pedestrians and trains. Furthermore, rough information on building outlines is
stored. Figure 6-2 shows an example of the detailed information, which is used
to restrict the propagation of vehicle-related movement patterns. It shows the
definition of the road type, e.g., highway=secondary is the definition of a road
category similar to a main road (see [53]), and the definition of junction points,
which connect the road segments, as well as areas like cycleways. The
OpenStreetMap infrastructure model is stored as a directed graph, which is
tagged with a defined set of attributes for every edge and node, e.g., defining
maximum speeds, one-way usage and the road surface.
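Such an edge-attributed directed graph can be sketched with plain dictionaries. The node names and the exact attribute keys below are assumptions modelled after OpenStreetMap conventions, not the IoTCrawler data structures:

```python
# Illustrative edge-attributed directed road graph: each edge carries
# a defined set of attributes (road category, speed limit, one-way flag).
road_graph = {
    ("n1", "n2"): {"highway": "secondary", "maxspeed": 50, "oneway": False},
    ("n2", "n3"): {"highway": "cycleway", "maxspeed": 30, "oneway": True},
}

def reachable(graph, start):
    """Collect all nodes reachable from `start`, honouring one-way edges."""
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for (a, b), attrs in graph.items():
            # A one-way edge may only be traversed in its stored direction.
            directions = ((a, b),) if attrs["oneway"] else ((a, b), (b, a))
            for src, dst in directions:
                if src == node and dst not in seen:
                    seen.add(dst)
                    stack.append(dst)
    return seen
```

In this toy graph, n3 is reachable from n1, but not vice versa, since the n2→n3 segment is one-way; restrictions of this kind are what limit the propagation of vehicle-related movement patterns.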
Figure 6-2 Infrastructure Model Description of the Utilised OpenStreetMap Database [53]
b) Internal building and industry spaces can be modelled with the IndoorGML
standard of the OGC [54], which can be derived [55], [56] from BIM models in the
Industry Foundation Classes (IFC) data format for openBIM. The IFC data model
is intended to describe building and construction industry data. It is an object-
based file format with a data model developed by an international organisation
(buildingSMART9) to facilitate interoperability in the architecture, engineering
and construction (AEC) industry, and it is a commonly used collaboration
format in BIM-based projects. It is essentially the open export standard for
state-of-the-art architecture CAD systems. The transformation between BIM,
CityGML and IndoorGML enables an optimised integration of existing models.
IndoorGML describes indoor spaces like rooms, transitions like doors and
windows, and enables the placement of several sensors/sources in Euclidean
space whilst enabling the transformation to directed graphs for automated
analysis. Figure 6-3 shows an example IndoorGML model of a building floor.
9 http://www.buildingsmart-tech.org/
Figure 6-3 Building floor of the UASO Lab showing connected sensor infrastructure
c) Geospatial soil data is a key driver for agricultural applications. Furthermore,
many applications currently utilise satellite data (e.g., Sentinel, MODIS and
Landsat), in which a Geospatial Topology [57], [58] derived from raster or vector
data is also stored. The topology contains normalised spatial data with
standardised interfaces. Furthermore, it ensures topological integrity by
defining shared borders between areas and preventing overlapping
areas/values, which would lead to invalid models due to multiple values for the
same area. Due to this strict way of storing geospatial data, explicit spatial
relationships can be derived. The geospatial topology model can be extracted
from common .shp, .geojson or GML files that describe the infrastructure, or
can utilise DBMS systems like PostGIS.
Figure 6-4 a) Soil type overview of the NIBIS Server[59], b) NDVI (Normalized Difference Vegetation Index) based on combined Sentinel2 bands (B8-B4)/(B8+B4) [60]
IoTCrawler transforms these three models into an integrated Propagation Model
that is used to support the entity definitions and relationship modelling of
distributed sensor/actuator networks. Figure 6-5 shows a combined view of the
previously integrated infrastructures.
Figure 6-5 Combined Infrastructure Model View of an office sensor network floorplan (IndoorGML), building, path and road network (OSM) and soil type topology data
The IoT Relationship Model (IRM) describes the relationship between individual
sensor entities in the real world and their mutual impact. This relationship can
express that, e.g., sensors in the same room are connected by some physical
relationship (e.g., different temperature sensors placed in a river) or are connected
via an infrastructure like traffic on a road network. It utilises the spatiotemporal
propagation model of a data source to describe the interconnection and
propagation of sensor values based on their relation. However, the model does not
necessarily describe inevitable effects between data streams. If, for example, a
traffic sensor A reports an average car count of 0 vehicles per minute, a data source
B that derives traffic jam reports from different sensor data can be used to verify, or
respectively support, the concordance of the information of sensor A. At the same
time, 0 vehicles per minute does not necessarily indicate a traffic jam; there could
simply be no traffic during the night. The IRMs, as components of the Monitoring
and Virtual Sensor components, utilise the previously described Infrastructure
Models and will be described in IoTCrawler Deliverable D5.1.
The Interpolation Algorithm defines possible algorithms to determine the
spatiotemporal interpolation of sensor values based on the individual IRM and
Infrastructure Model. Simple areal propagation scenarios are solved using
interpolation algorithms like Kriging and IDW. However, the novel approach of this
validation component is to regard restricting infrastructures and use propagation
algorithms like the previously published I-IDW [12]. Furthermore, 3-dimensional gas,
dust and noise propagation can be modelled on the same infrastructure model,
using the digital terrain model or digital surface model (including buildings and
other objects). The individual Interpolation Models will be described in IoTCrawler
Deliverables D5.1 and D5.3, as components of the Monitoring and Virtual Sensor
components.
7 Security and Privacy for Enabling Data Access
Within the framework of this project, we have considered different security and
privacy mechanisms that must be adopted by our IoTCrawler platform, firstly, to
securely integrate information coming from other IoT platforms or systems and,
secondly, to control how information is revealed in the results obtained by searches
of users with different privileges. In Section 3.1.3 we provided the state of the art of
security and privacy ontologies in the field of IoT. In this section we elaborate on
privacy and security aspects such as authentication, authorization and identity
management that will become fundamental properties of our IoTCrawler platform.
Additionally, we propose an extension of the Quality of Information ontology to
include security and privacy aspects.
7.1 Authentication
Authentication ensures that an identity of a subject (user or object) is valid, i.e., that
the subject is indeed who or what it claims to be. It allows binding an identity to a
subject. The authentication can be performed based on something the subject knows
(e.g. password), something the subject possesses (e.g. smart cards, security token) or
something the subject is (e.g. fingerprint or retinal pattern).
Our IoTCrawler platform must integrate an authentication component responsible
for authenticating users and smart objects based on the provided credentials.
These credentials can be in the form of login/password, shared key, or digital
certificate. An example authentication is depicted in Figure 7-1.
Figure 7-1 Authentication Example
Additionally, IoTCrawler can also adopt alternative and more sophisticated ways of
performing authentication, ensuring at the same time privacy and minimal
disclosure of attributes. In any case, this will be carried out by the Identity
Management component (IdM). This way, the IdM will be able to verify anonymous
credentials and then, in case the identity is proven, interact with the authentication
module, which delivers the authentication assertion to be used during a transaction.
7.2 Authorization
The inherent requirements and constraints of IoT environments, as well as the
nature of the potential applications in these scenarios, have brought about a greater
consensus among academia and industry to consider access control as one of the
key aspects to be addressed for full acceptance by all IoT stakeholders. An example
authorization mechanism is shown in Figure 7-2.
Figure 7-2 Authorization Example
The proposed authorization system is based on a combination of different access
control models and authorization techniques, in order to provide a comprehensive
solution for the set of considered scenarios. Specifically, we employ two different
technologies: one is based on the use of access control policies to make authorization
decisions, and the other employs authorization tokens as an access control
mechanism to be used by IoT devices as well as the resources stored in our
IoTCrawler platform.
In this regard, standards such as SAML [20] and XACML [21] could be adopted;
nonetheless, it is desirable to employ access control solutions specifically designed
for IoT scenarios. In this sense, our security components can perform capability-
based access control based on the mechanism presented in [25]. It describes
authorization tokens specified in JSON as well as ECC optimisations to manage
access control on constrained devices.
Figure 7-3 Access Control & Ownership Ontology
As we can see in Figure 7-3, a sensor has a relationship with its owner, who provides
an AccessControlList composed of AccessControlEntries in which the access rights
are defined. Such a representation can help the design and development of new
authorization enablers.
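The ontology's AccessControlList/AccessControlEntry structure can be mirrored by a small in-memory check. The dictionary layout, resource URN and subject names below are illustrative assumptions, not the IoTCrawler enabler implementation:

```python
# Illustrative in-memory form of an AccessControlList: for each sensor,
# a list of AccessControlEntries binding a subject to its granted rights.
acl = {
    "urn:sensor:42": [
        {"subject": "alice", "rights": {"read"}},
        {"subject": "bob", "rights": {"read", "write"}},
    ],
}

def allowed(acl, resource, subject, right):
    """Return True if any AccessControlEntry grants `right` to `subject`."""
    return any(entry["subject"] == subject and right in entry["rights"]
               for entry in acl.get(resource, []))
```

An authorization enabler would evaluate such entries before revealing a data source in search results, e.g. granting bob write access while denying it to alice.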
7.3 Privacy/Secure Group Sharing
Privacy is a very broad security term that can be applied to different entities: (1)
users who prefer to keep their identity secret; (2) communications in which every
exchanged message is secured and encrypted, allowing only the two ends of the
communication to decrypt the information; and (3) the data itself. The latter is the
definition this document addresses within the scope of data modelling. For
instance, consider PayTV services, whose multimedia streams are transmitted in an
encrypted manner to everybody, but only legitimate users can access the
information. Figure 7-4 shows an example for privacy.
Figure 7-4 Data Privacy Example
Because of the usage of resource-constrained devices, Symmetric Key
Cryptography (SKC) has been widely used in IoT, requiring that producer and
consumer share a specific key. Nevertheless, this approach cannot provide a
suitable level of scalability and interoperability in a future with billions of
heterogeneous smart objects. These issues are tackled by Public Key Cryptography
(PKC), which, however, presents significantly higher computing and memory
requirements as well as the need to manage the corresponding certificates. A
feature common to SKC is that PKC allows a producer to encrypt information to be
accessed only by a specific consumer. However, given the pervasive, dynamic and
distributed nature of IoT, it is necessary to consider scenarios in which information
is shared with a group of consumers or a set of unknown receivers that is therefore
not addressable a priori.
In that sense, Identity-Based Encryption (IBE) [22], [23] was designed as a
certificate-free alternative to PKC, in which the identity of an entity is determined
not by a public key but by a string. Consequently, it enables more advanced sharing
schemes, since a data producer can share data with a set of consumers whose
identity is described by a specific string. In this direction, Attribute-Based
Encryption (ABE) [24] represents the generalisation of IBE, in which the identity of
the participants is represented not by a single string but by a set of attributes
related to their identity. Just as with IBE, it does not use certificates; cryptographic
credentials are managed by an entity usually called the Attribute Authority (AA). In
this way, ABE provides a high level of flexibility and expressiveness compared to
previous schemes. In ABE, a piece of information can be made accessible to a set of
entities whose real, possibly unknown, identity is based on a certain set of
attributes.
Based on ABE, in a CP-ABE scheme [24] a ciphertext is encrypted under a policy of
attributes, while the keys of participants are associated with sets of attributes. In
this way, a data producer can exert full control over how the information is
disseminated to other entities, while a consumer's identity can be intuitively
reflected by a certain private key. Moreover, in order to enable the application of
CP-ABE in constrained environments, the scheme can be used in combination with
SKC. Thus, a message would be protected with a symmetric key, which, in turn,
would be encrypted with CP-ABE under a specific policy. In case smart objects
cannot apply CP-ABE directly, the encryption and decryption functionality could be
realised by more powerful devices, such as trustworthy gateways. In addition,
CP-ABE can rely on Identity Management Systems (e.g. anonymous credential
systems) to obtain private keys associated with certain user attributes from a
specific AA, after demonstrating the possession of such attributes in the partial
identity. These private keys can then be used by consumers to decrypt data
disseminated by producers, as long as the consumer satisfies the policy that was
used to encrypt it.
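The access logic of the hybrid pattern can be illustrated with a toy policy check. This sketch is explicitly not cryptography: real CP-ABE enforces the policy inside the ciphertext via boolean attribute formulas, whereas here an AND-only policy is modelled as a plain attribute set, and all attribute names are invented for the example:

```python
def satisfies(policy, attributes):
    """Toy AND-only CP-ABE policy check (illustration only, no crypto).

    policy:     set of attributes a consumer's key must contain
    attributes: attributes embedded in the consumer's private key
    """
    return policy <= attributes

# Hybrid pattern: the payload would be protected with a symmetric key,
# and only that (small) key would be wrapped with CP-ABE under the policy;
# decryption of the symmetric key succeeds only if the policy is satisfied.
policy = {"role:doctor", "org:hospitalA"}
consumer_key_attrs = {"role:doctor", "org:hospitalA", "dept:cardiology"}
```

Here the consumer's key carries a superset of the policy attributes, so the wrapped symmetric key would be recoverable; a key with `role:nurse` instead of `role:doctor` would fail the check.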
7.4 Security Properties
Apart from the use of different enablers applying security technologies such as
authentication, authorization and privacy, our focus in this deliverable is the
representation of the information. In this scope, and in terms of semantic annotation
and enrichment, we have also considered security properties. Following Figure 6-1
regarding Quality of Information, we propose an extension to include the
aforementioned properties.
Figure 7-5 Quality of Information Ontology including Security
In the following figure, we propose the integration of a set of elements comprising
different security properties, grouped in categories related to integrity,
confidentiality or key carrying. The integration of these terms will allow us to add
security to the Quality of Information vocabulary.
Figure 7-6 Security Ontology
8 Implementation and Experimental Results
8.1 Annotation Process
To be able to model IoT data streams using the IoTCrawler ontology, a reference
annotation tool is required. For this, an offline annotator has been developed that
adapts data from the Aarhus Dataset Repository for historical datasets, as well as an
online “on-demand” annotator, which annotates pre-configured dataset sources at
the point of request from a remote client. This tool has been developed using the
Jackson JSON toolkit. The toolkit comprises a set of Java annotations that control
how JSON data is read into objects, or how JSON is generated from the objects.
Using the IoTCrawler ontology, the Jackson annotation was used to map classes,
object properties, annotation properties and data properties. Listing 8-1 shows a
code snippet example of how the Jackson toolkit was used to create the classes
listed in Figure 8-1 along with the object properties in relation to the sosa:Sensor
class.
platformClass = ontModel.getOntResource(SOSA_PREFIX + "Platform").asClass();
systemClass = ontModel.getOntResource(SSN_PREFIX + "System").asClass();
hasLocation = ontModel.getObjectProperty(GEO_PREFIX + "location").asObjectProperty();
hasUnit = ontModel.getObjectProperty(IOT_LITE_PREFIX + "hasUnit").asObjectProperty();
Listing 8-1 Jackson Annotation Code
Figure 8-1 Classes related and Connected to sosa:Sensor
Figure 8-2 IoTCrawler Information Model Example
Figure 8-2 shows an example of the annotated data output that was generated
using the Jackson toolkit. As shown, the data is represented in JSON-LD.
Challenges when annotating data:
It is essential for IoT stream data to have correctly labelled data with appropriate
names (features) and values, as well as a rich description. However, it is not often
the case that well-labelled, rich IoT sensor datasets are available, which
consequently requires additional processing. The IoTCrawler ontology model is
composed of the essential components (classes and properties) required to
annotate IoT data streams along with their metadata.
The quality of the annotated data can have a substantial impact on the
performance, usage and service of the IoTCrawler search engine.
The process of data annotation often suffers from common limitations, in particular
when annotating components with missing metadata. Annotating components with
missing metadata often requires an additional manual process, compared to
extracting the information from a metadata label or description. Manually
annotating missing metadata often affects the quality of the information.
8.1.1 Data Evaluation
We have used probabilistic machine learning techniques (i.e. Gaussian Mixture
Models [49]), pattern creation techniques (i.e. Lagrangian multipliers) and statistical
procedures (i.e. Principal Component Analysis [50]) for crawling and analysing the
semantic descriptions of IoT data and services. Applying these techniques, we have
developed a range of different models to search the content of the time-series data
harvested from the city of Aarhus live open dataset (e.g. live traffic10, air quality and
population, etc.).
These models have been evaluated and applied to two different datasets: synthetic
data and a real-world air pollution dataset. The datasets used for evaluation
purposes have very similar characteristics to the dataset harvested from the City of
Aarhus; this also allows us to compare the dataset to existing, similar real-world
datasets as well as similar approaches.
Table 2 Data Silhouette Coefficients
Data Silhouette Coefficient
Without Noise 0.87
With Noise 0.47
To illustrate the performance of our proposed method for real-world applications,
we selected air quality data from the CityPulse11 project's open dataset. We used the
air quality observations for a period of two months, recorded at 5-minute intervals
(i.e. 12 samples per hour).
The data has two dimensions: nitrogen dioxide (NO2) and particulate matter (PM).
Due to the sample rate, we set the step size to 12 (s = 12), which contains the
observations of one hour. We clustered the data into three different clusters based on the air-quality
10https://portal.opendata.dk/dataset/realtids-trafikdata/resource/b3eeb0ff-c8a8-4824-
99d6-e0a3747c8b0d
11 http://iot.ee.surrey.ac.uk:8080/datasets.html
index for air-pollution assessments (i.e. low risk, medium risk and high risk). See
Figure 8-3 for the clustering result. To evaluate our proposed method, we compared
it with existing solutions. We applied GMM clustering to the raw data, to the data
after applying only the Lagrangian multiplier, and to the data after applying only
PCA.
Figure 8-3 The output of clustering algorithm after applying Lag and PCA on real-time air pollution data (left). Notice the different patterns of observation from each cluster (right).
To provide a numerical assessment, we calculated the Silhouette coefficient and
also the ratio of the average distance between clusters to the average distance
within clusters12. Note that we use the ratio because the Lagrangian transformation
scales the data, which affects the distance measures in the different scenarios.
Therefore, to provide a fair and consistent comparison, we calculate the ratio. The
results are shown in Table 3.
Table 3 Method Silhouette Coefficient Ratios
Method Silhouette Coefficient Ratio
Lag + PCA + GMM 0.69 4.09
Raw + GMM 0.46 2.25
Lag + GMM 0.457 2.25
PCA + GMM 0.395 2.05
12 The higher the distance between and smaller the distance within clusters the better the clustering
performance, so ratio of a high performance clustering should be high [51]
The Silhouette coefficient for our proposed method is 0.69, which shows higher
performance compared with the other methods. The ratio of the average distance
between clusters to the average distance within clusters is also higher for our
proposed method, which means the samples are closer together within each cluster
and well separated from the other clusters.
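The Silhouette coefficient used above can be computed directly from its definition, s(i) = (b(i) − a(i)) / max(a(i), b(i)), where a(i) is the mean intra-cluster distance of sample i and b(i) its mean distance to the nearest other cluster. The following pure-Python sketch illustrates this for small 2-D inputs; it is an illustration of the metric, not the evaluation code used in the project:

```python
from math import dist  # Euclidean distance, Python 3.8+

def silhouette(points, labels):
    """Mean Silhouette coefficient for labelled 2-D samples (sketch)."""
    def mean_dist(p, members):
        return sum(dist(p, q) for q in members) / len(members)

    clusters = {l: [p for p, m in zip(points, labels) if m == l]
                for l in set(labels)}
    scores = []
    for p, l in zip(points, labels):
        own = [q for q in clusters[l] if q != p]
        if not own:                      # singleton cluster: score 0 by convention
            scores.append(0.0)
            continue
        a = mean_dist(p, own)            # mean intra-cluster distance
        b = min(mean_dist(p, clusters[m])  # nearest other cluster
                for m in clusters if m != l)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)
```

Two tight, well-separated clusters yield a coefficient close to 1, while overlapping clusters drive it towards 0, matching the drop from 0.87 to 0.47 observed for the noisy synthetic data in Table 2.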
8.1.2 Geo access benchmark
The following figures show benchmark results for the experiments.
Figure 8-4 Geo access benchmark, Experiment 1 non indexed data
Using the IoTCrawler search platform, geospatial search query experiments were conducted to analyse the performance and time complexity of the geohash search functionality built into the platform.
The geohash ‘u1zr2r2grd45’ (56.15855872, 10.20765351) was used to query and find the closest (neighbouring) car parking areas to the given coordinates. From the given geohash string, the platform was able to find three close geohash neighbours with the precision (a number between 1 and 12 that specifies the number of characters of the resultant symbol, in this case ‘u1zr2’) set to 5.
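For illustration, the standard geohash encoding that produces such strings can be sketched in a few lines of Python (a generic re-implementation of the public geohash algorithm, not the platform's code); truncating the result to 5 characters yields the precision-5 cell ‘u1zr2’ used in the experiment:

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash_encode(lat, lon, precision):
    """Encode latitude/longitude as a geohash of the given length
    by interleaving longitude and latitude interval-halving bits."""
    lat_rng, lon_rng = [-90.0, 90.0], [-180.0, 180.0]
    bits, even = [], True  # a geohash starts with a longitude bit
    while len(bits) < precision * 5:
        rng, val = (lon_rng, lon) if even else (lat_rng, lat)
        mid = (rng[0] + rng[1]) / 2
        if val >= mid:
            bits.append(1)
            rng[0] = mid
        else:
            bits.append(0)
            rng[1] = mid
        even = not even
    # pack each group of 5 bits into one base-32 character
    return "".join(BASE32[int("".join(map(str, bits[i:i + 5])), 2)]
                   for i in range(0, precision * 5, 5))

print(geohash_encode(56.15855872, 10.20765351, 5))   # -> u1zr2
```

Because longer geohashes refine shorter ones, a precision-12 encoding of the same coordinates simply extends this 5-character prefix, which is what makes prefix-based neighbour search possible.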
Figure 8-5 Geo access benchmark, Experiment 1 indexed data
The geohash data values were then indexed within the database to optimise query time and improve overall performance. MongoDB uses a B-tree data structure for indexing, which improves the time complexity from O(n) to O(log n). Over the 10 experiments conducted, the average geospatial query time on the non-indexed data was 0.2223 ms, which improved to an average of 0.0214 ms on the indexed data.
Figure 8-6 Geo access benchmark, Experiment 1 Non indexed vs Index data
As shown in in Figure 8-6, it is clear that the indexed data-set has a faster overall
geospatial query time, resulting to better overall system performance. Comparing the
two experiments results and their time complexities O(n) and O(log n), it is clear that
O(log n) will have better overall performance within a scalable deployed system.
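The benefit of the B-tree index can be imitated with a sorted list and binary search: because all geohashes inside a cell share the cell's string prefix, a prefix query becomes two O(log n) lookups instead of an O(n) scan. The geohashes and parking names below are illustrative, not the platform's actual data or its MongoDB code:

```python
import bisect

# hypothetical parking areas keyed by geohash (illustrative values)
parkings = {
    "u1zr2r2grd45": "parking A",
    "u1zr2qxyz012": "parking B",
    "u1zr2tabc345": "parking C",
    "u1zr9k000000": "parking elsewhere in Aarhus",
}
index = sorted(parkings)  # stands in for the database's B-tree index

def prefix_query(index, prefix):
    """Return all geohashes in the cell `prefix` via two binary searches."""
    lo = bisect.bisect_left(index, prefix)
    hi = bisect.bisect_right(index, prefix + "\uffff")
    return index[lo:hi]

print(prefix_query(index, "u1zr2"))  # the three entries in cell u1zr2
```

Each lookup touches O(log n) entries, mirroring why the indexed experiment outperforms the linear scan over the raw collection.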
8.2 Example Quality Calculation and Annotation
QoI metrics from Section 6 have to be calculated and fused with the metadata for data enrichment. QoI calculations are performed at two layers. Completeness (𝑞𝑐𝑚𝑝), Timeliness (𝑇𝑎𝑔𝑒 and 𝑇𝑓𝑟𝑒𝑞), Plausibility (𝑞𝑝𝑙𝑎) and Concordance (𝑞𝑐𝑜𝑛) have to be computed for each sensor value in the Micro layer [48] and then averaged or statistically aggregated over a whole dataset for semantic enrichment in the Internal processing layer [48]. Artificiality (𝑞𝑎𝑟𝑡) is strictly a trait of a dataset and only has to be computed for semantic enrichment in the Internal processing layer to facilitate ranking of a dataset.
To depict the QoI computation, consider a weather sensor that measures
temperature, relative humidity and pressure. For a particular query, the data sent by
the sensor is shown in Figure 8-7.
Figure 8-7 Example of Weather Sensor Data, measuring Pressure, Relative humidity and Temperature
For the shown data values, QoI metrics of Completeness, Timeliness and Plausibility
will be calculated in the following fashion:
Completeness: Of the three measured quantities, i.e. pressure, relative humidity and temperature, only the pressure value is missing, at two positions. As described in Section 6.1.2, 𝑞𝑐𝑚𝑝 is calculated for each observation. For the observations in rows 0, 2, 3 and 4, 𝑀𝑚𝑖𝑠𝑠 = 0 and hence 𝑞𝑐𝑚𝑝 = 1. However, row 1 has one missing value, so 𝑀𝑚𝑖𝑠𝑠 = 1, giving 𝑞𝑐𝑚𝑝 = 0.67.
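A minimal sketch of this per-observation completeness calculation (None stands in for a missing value; the humidity and temperature readings are illustrative, not the figure's exact values):

```python
def completeness(observation):
    """q_cmp = (M_total - M_miss) / M_total, cf. Section 6.1.2:
    the fraction of expected values actually present."""
    total = len(observation)
    missing = sum(1 for v in observation.values() if v is None)
    return round((total - missing) / total, 2)

row1 = {"pressure": None, "humidity": 61.0, "temperature": 20.4}
print(completeness(row1))   # -> 0.67, one of three values missing
```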
Timeliness: 𝑇𝑓𝑟𝑒𝑞 is first calculated by subtracting the time of the preceding observation from the time of each observation. As the sensors are set to log data at an interval of one hour, we can see that 𝑇𝑓𝑟𝑒𝑞 is within the defined time limit, so the reward and punishment algorithm will reward all the measurements in this case.
The next sub-metric for Timeliness is Age. For 𝑇𝑎𝑔𝑒, the time at which the data value was generated by the sensor is subtracted from the current time. For the given observation data, the calculated 𝑇𝑎𝑔𝑒 values, in ms, are 520, 350, 432, 367 and 503.
After the calculation of 𝑇𝑓𝑟𝑒𝑞 and 𝑇𝑎𝑔𝑒, the reward 𝑅𝑑(𝑡) has to be calculated to support the computation of 𝑞𝑡𝑖𝑚. Note, however, that the reward value characterises the whole dataset, which in turn makes 𝑞𝑡𝑖𝑚 a quality of a dataset rather than of an individual value.
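The 𝑇𝑓𝑟𝑒𝑞 calculation and the per-interval reward decision can be sketched as follows (the one-hour expected interval comes from the scenario above; the 60 s tolerance and the epoch timestamps are assumptions for illustration):

```python
def t_freq(timestamps):
    """Inter-arrival times: each timestamp minus its predecessor (seconds)."""
    return [b - a for a, b in zip(timestamps, timestamps[1:])]

def within_limit(intervals, expected=3600, tolerance=60):
    """Reward/punish decision per interval: True = rewarded."""
    return [abs(i - expected) <= tolerance for i in intervals]

ts = [0, 3600, 7205, 10800, 14410]   # illustrative epoch seconds, ~1 h apart
print(t_freq(ts))                    # -> [3600, 3605, 3595, 3610]
print(within_limit(t_freq(ts)))      # all intervals within limit -> rewarded
```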
Plausibility: For Plausibility, a range for each measured quantity has to be defined. Once the range has been defined, plausibility is computed for each observation. Comparing the logged data to the plausibility range, it can be seen that all values fulfil the criteria except the pressure values in rows 1 and 3. As the value in row 1 is missing and its effect is already captured by 𝑞𝑐𝑚𝑝, it has no effect on 𝑞𝑝𝑙𝑎. However, the pressure value in row 3 is far outside the plausible range and is hence given 𝑞𝑝𝑙𝑎 = 0. Figure 8-8 shows the sensor data integrated with the related QoI values.
Figure 8-8 Weather Sensor data with Integrated QoI values
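The plausibility check above can be sketched as follows (the 950–1050 hPa pressure range and the sample values are assumed for illustration; missing values are skipped since 𝑞𝑐𝑚𝑝 already accounts for them):

```python
def plausibility(value, low, high):
    """q_pla = 1 if the value lies in the plausible range, else 0.
    Missing values return None: they are already penalised by q_cmp."""
    if value is None:
        return None
    return 1 if low <= value <= high else 0

pressures = [1013.2, None, 1009.8, 350.0, 1011.5]  # row 3 is implausible
print([plausibility(p, 950, 1050) for p in pressures])
```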
Concordance (𝑞𝑐𝑜𝑛) is not calculated here either, as it is computed through statistical analysis, which will be performed in WP5.
8.3 Example Location Indexing
As mentioned in Section 5, support for geospatial queries is one of the core IoTCrawler functionalities. The IoTCrawler Ontology links to geo:Point, which can be used to capture absolute coordinates or a relative location. Using the geo:Point concepts, the location of a given system, deployment, platform or IoTStream can be used not only for geospatial queries, but also for geospatial analysis.
This feature is demonstrated in the IoTCrawler geohash demo, which is available online13. Figure 8-9 shows a screenshot of the demo, which currently uses open data from the City of Aarhus.
Figure 8-9 IoTCrawler Geohash demo
Geohash encodes a geographic location into a short string of letters and digits. The precision increases with the length of the string (see Figure 8-10 and Figure 8-11). In the IoTCrawler demo, the latitude and longitude coordinates of the city sensors were encoded using geohash and then indexed using a prefix tree. When a user or application specifies a location or area on the map, a geohash is generated and matched against the index to return the sensors that match the requested location.
13 http://iot-crawler.ee.surrey.ac.uk/
Figure 8-10 Geohash example: precision with a 7-character string
Figure 8-11 Geohash example: precision with a 9-character string
9 Conclusion
This deliverable summarises the work done in Tasks T2.3 and T4.1. After an analysis of the state of the art, we developed an information model capable of fulfilling the requirements for a search engine for the Internet of Things.
The information model is addressed from different directions. The base model is designed to annotate data sources with additional metadata. This metadata is needed by the Indexing and Ranking components of the IoTCrawler framework, which use the crawled information and its metadata to answer user- and machine-initiated queries in the world of IoT. To store the additional metadata and quality information, the IoTCrawler framework will use a distributed metadata repository. This process is described in detail in D2.2.
The use of additional Quality of Information and Security and Privacy descriptions further extends the possibilities of the annotation model. The Quality of Information supports the indexing and ranking components by calculating QoI metrics for data sources. As an example, the concordance metric compares the data provided by one data source with that of other data sources to determine whether the data is correct. As this metric also takes heterogeneous data sources into account, a model-based analysis is introduced. The model-based analysis develops infrastructure models by combining different representations of real-world environments. By combining all these representations, an information and IoT relationship model is created that describes the relations between individual sensor entities and their mutual impacts. This work will be continued and further extended in WP7 of the IoTCrawler project.
This deliverable also addresses security and privacy annotations of data sources and extends the ontologies with additional properties to annotate, for example, the integrity or confidentiality of data sources. In addition, the required authentication and authorization annotations are added to the information model to allow IoTCrawler to consider the privacy and security requirements of users and information providers.
Finally, the implementation and experiment section provides demonstrations and benchmarks that show how the developed information model can be used. It concludes with an initial example of an annotated data source, some quality annotation examples and a demonstration of location indexing.
10 References
[1] Sundmaeker, H., Guillemin, P., Friess, P., & Woelffle, S. (2010). Vision and challenges for realising the internet of things. Cluster of European Research Projects on the Internet of Things —CERP IoT.
[2] Fuqaha, A., Guizani, M., Mohammadi, M., Aledhari, M., & Ayyash, M. (2015). Internet of things: A survey on enabling technologies, protocols, and applications. Communications Surveys & Tutorials, IEEE, 17(4), 2347-2376.
[3] Strong, Diane M., Yang W. Lee, and Richard Y. Wang. "Data quality in context." Communications of the ACM 40.5 (1997): 103-110.
[4] Stvilia, Besiki, et al. "A framework for information quality assessment." Journal of the American society for information science and technology 58.12 (2007): 1720-1733.
[5] Bisdikian, Chatschik, et al. "Building principles for a quality of information specification for sensor information." Information Fusion, 2009. FUSION'09. 12th International Conference on. IEEE, 2009.
[6] Weiskopf, Nicole Gray, and Chunhua Weng. "Methods and dimensions of electronic health record data quality assessment: enabling reuse for clinical research." Journal of the American Medical Informatics Association 20.1 (2013): 144-151.
[7] Li, Xin. "Blind image quality assessment." Image Processing. 2002. Proceedings. 2002 International Conference on. Vol. 1. IEEE, 2002.
[8] Mittal, Anish, Anush Krishna Moorthy, and Alan Conrad Bovik. "No-reference image quality assessment in the spatial domain." IEEE Transactions on Image Processing 21.12 (2012): 4695-4708.
[9] Zafarani, Reza, and Huan Liu. "Evaluation without ground truth in social media research." Communications of the ACM 58.6 (2015): 54-60.
[10] Kuemper, Daniel, et al. "Valid. IoT: a framework for sensor data quality analysis and interpolation." Proceedings of the 9th ACM Multimedia Systems Conference. ACM, 2018.
[11] Kuemper, Daniel, et al. "Monitoring data stream reliability in smart city environments." Internet of Things (WF-IoT), 2016 IEEE 3rd World Forum on. IEEE, 2016.
[12] Kuemper, Daniel, Ralf Toenjes, and Elke Pulvermueller. "An infrastructure-based interpolation and propagation approach for IoT data analytics." Innovations in Clouds, Internet and Networks (ICIN), 2017 20th Conference on. IEEE, 2017.
[13] Janowicz K, Haller A, Cox SJ, Le Phuoc D, Lefrançois M. SOSA: A lightweight ontology for sensors, observations, samples, and actuators. Journal of Web Semantics. 2018 Jul 11.
[14] Bermudez-Edo M, Elsaleh T, Barnaghi P, Taylor K. IoT-Lite: a lightweight semantic model for the Internet of Things. In2016 Intl IEEE Conferences on Ubiquitous Intelligence & Computing, Advanced and Trusted Computing, Scalable Computing and Communications, Cloud and Big Data Computing, Internet of People, and Smart World Congress (UIC/ATC/ScalCom/CBDCom/IoP/SmartWorld) 2016 Jul 18 (pp. 90-97). IEEE.
[15] Kolozali S, Bermudez-Edo M, Puschmann D, Ganz F, Barnaghi P. A knowledge-based approach for real-time iot data stream annotation and processing. InInternet of Things (iThings), 2014 IEEE International Conference on, and Green Computing and Communications (GreenCom), IEEE and Cyber, Physical and Social Computing (CPSCom), IEEE 2014 Sep 1 (pp. 215-222). IEEE.
[16] Daniele L, den Hartog F, Roes J. Created in close interaction with the industry: the smart appliances reference (SAREF) ontology. In International Workshop Formal Ontologies Meet Industries 2015 Aug 5 (pp. 100-112). Springer, Cham.
[17] Ehrig M, Sure Y. Ontology mapping by axioms (OMA). In Biennial Conference on Professional Knowledge Management/Wissensmanagement 2005 Apr 10 (pp. 560-569). Springer, Berlin, Heidelberg.
[18] Battle R, Kolas D. Enabling the geospatial semantic web with parliament and geosparql. Semantic Web. 2012 Jan 1;3(4):355-70.
[19] F. N. Natalya, L. Deborah et al., “Ontology development 101: A guide to creating your first ontology”, Stanford Knowledge Systems Laboratory Technical Report KSL-01-05 and Stanford Medical Informatics Technical Report SMI-2001-0880, 2001.
[20] SAML (Security Assertion Markup Language): http://docs.oasis-open.org/security/saml/v2.0
[21] OASIS Standard. eXtensible Access Control Markup Language (XACML) Version 3.0. January 2013: http://docs.oasis-open.org/xacml/3.0
[22] D. Boneh and M. Franklin, “Identity-based encryption from the weil pairing,” in Advances in Cryptology—CRYPTO 2001. Springer, 2001, pp. 213–229
[23] A. Sahai and B. Waters, “Fuzzy identity-based encryption,” in Advances in Cryptology–EUROCRYPT 2005. Springer, 2005, pp. 457–473
[24] J. Bethencourt, A. Sahai, and B. Waters, “Ciphertext-policy attribute-based encryption,” in Security and Privacy, 2007. SP’07. IEEE Symposium on. IEEE, 2007, pp. 321–334.
[25] J. L. Hernández-Ramos, A. J. Jara, L. Marín, and A. F. Skarmeta, “Dcapbac: Embedding authorization logic into smart things through ecc optimizations,” International Journal of Computer Mathematics, no. just-accepted, pp. 1–22, 2014.
[26] Dobson, Glen, Russell Lock, and Ian Sommerville. "QoSOnt: a QoS ontology for service-centric systems." Software Engineering and Advanced Applications, 2005. 31st EUROMICRO Conference on. IEEE, 2005.
[27] O. Sacco, and A. Passant. LDOW, volume 813 of CEUR Workshop Proceedings, CEUR-WS.org, (2011)
[28] Huang, J., and Fox, M.S., (2006). "An Ontology of Trust - Formal Semantics and Transitivity", Proceedings of the International Conference on Electronic Commerce, Association of Computing Machinery, pp. 259-270.
[29] Bill TSOUMAS and Dimitris GRITZALIS. Towards an Ontology-based Security Management. In Proceedings of the 20th International Conference on Advanced Information Networking and Applications (AINA), pp. 985-992 (2006).
[30] GETINET AYELE ESHETE. Semantic Description of IoT Security for Smart Grid. Master Thesis, University of Agder, June 2017
[31] Neisse R, Steri G, Fovino IN, Baldini G, SecKit: a Model-based Security Toolkit for the Internet of Things. In Computers & Security (2015), doi: 10.1016/j.cose.2015.06.002.
[32] M. Taherian, R. Jalili, and M. Amini, “PTO: A Trust Ontology for Pervasive Environments,” in AINA’08 Workshop proceedings, 2008, pp. 301–306.
[33] M. Tao, J. Zuo, Z. Liu, A. Castiglione, F. Palmieri, Multi-layer cloud architectural model and ontology-based security service framework for IoT-based smart homes. In Future Generation Computer Systems (2016),
[34] Mozzaquatro, Bruno Augusti et al. “Towards a reference ontology for security in the Internet of Things.” 2015 IEEE International Workshop on Measurements & Networking (M&N) (2015): 1-6.
[35] Y. Fathy, P. Barnaghi, S. Enshaeifar, and R. Tafazolli. A Distributed In-network Indexing Mechanism for the Internet of Things. In Internet of Things (WF-IoT), 2016 IEEE 3rd World Forum on. IEEE.
[36] Yasmin Fathy, Payam Barnaghi, and Rahim Tafazolli. Large-Scale Indexing, Discovery, and Ranking for the Internet of Things (IoT). ACM Comput. Surv. 51, 2, Article 29 (March 2018), 53 pages. DOI: https://doi.org/10.1145/3154525
[37] Y. Zhou, S. De, W. Wang, and K. Moessner. Enabling Query of Frequently Updated Data from Mobile Sensing Sources. In Computational Science and Engineering (CSE), 2014 IEEE 17th International Conference on. IEEE, 946–952, 2014.
[38] P. Barnaghi, W. Wang, L. Dong, and C. Wang. A Linked-Data Model for Semantic Sensor Streams. In 2013 IEEE International Conference on Green Computing and Communications (GreenCom) and IEEE Internet of Things (iThings) and IEEE Cyber, Physical and Social Computing(CPSCom). IEEE, 468–475.
[39] X. Li, Y. J. Kim, R. Govindan, and W. Hong. Multi-dimensional range queries in sensor networks. In Proceedings of the 1st International Conference on Embedded Networked Sensor Systems (SenSys). ACM, 63–75, 2003.
[40] F. Paganelli and D. Parlanti. A DHT-based discovery service for the Internet of Things. Journal of Computer Networks and Communications 2012.
[41] S. Wang, D. Maier, and B. C. Ooi. Lightweight indexing of observational data in log-structured storage. Proceedings of the VLDB Endowment 7, 7 (2014), 529–540.
[42] T. Seidl, I. Assent, P. Kranen, R. Krieger, and J. Herrmann. Indexing density models for incremental learning and anytime classification on data streams. In Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology. ACM, 311–322, 2009.
[43] J. Shieh and E. Keogh. iSAX: indexing and mining terabyte sized time series. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 623–631, 2008.
[44] K. Zoumpatianos, S. Idreos, and T. Palpanas. Indexing for interactive exploration of big data series. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, 1555–1566, 2014.
[45] F. Ganz, P. Barnaghi, and F. Carrez. Information abstraction for heterogeneous real world internet data. Sensors Journal, IEEE 13, 10 (2013), 3793–3805.
[46] Hossain, M. Anwar, Pradeep K. Atrey, and Abdulmotaleb El Saddik. "Modeling and assessing quality of information in multisensor multimedia monitoring systems." ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 7.1 (2011): 3.
[47] M. Strohbach et al, “IoTCrawler: D2.1 Requirements and Design Templates for IoT Crawling”, project report, 2018.
[48] A. Skarmeta et al, “IoTCrawler: D2.2 Security and Privacy-Aware IoTCrawler Framework”, project report, 2019.
[49] Christophe Biernacki, Gilles Celeux, and Gérard Govaert, “Assessing a mixture model for clustering with the integrated completed likelihood,” IEEE transactions on pattern analysis and machine intelligence, vol. 22, no. 7, pp. 719–725, 2000.
[50] Hyunjin Yoon, Kiyoung Yang, and Cyrus Shahabi, “Feature subset selection and feature ranking for multivariate time series,” IEEE transactions on knowledge and data engineering, vol. 17, no. 9, pp.1186–1198,2005.
[51] Daniel S Wilks, “Cluster analysis,” in International geophysics, vol. 100, pp. 603–616. Elsevier, 2011.
[52] Haklay, Mordechai, and Patrick Weber. "Openstreetmap: User-generated street maps." Ieee Pervas Comput 7.4 (2008): 12-18.
[53] Marek Kleciak. 2015. Proposed features/area highway/mapping guidelines. (Aug 2015). https://wiki.openstreetmap.org/wiki/Proposed_features/area_highway/mapping_guidelines, last visit: 2019-01-22
[54] Li, Ki-Joune, et al. "OGC IndoorGML: A Standard Approach for Indoor Maps." Geographical and Fingerprinting Data to Create Systems for Indoor Positioning and Indoor/Outdoor Navigation. Academic Press, 2019. 187-207.
Kim, Y. J., H. Y. Kang, and J. Lee. "Development of indoor spatial data model using CityGML ADE." ISPRS-International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences 1.2 (2013): 41-45.
[55] Kim, Joon-Seok, Sung-Jae Yoo, and Ki-Joune Li. "Integrating IndoorGML and CityGML for indoor space." International Symposium on Web and Wireless Geographical Information Systems. Springer, Berlin, Heidelberg, 2014.
[56] Srivastava, Srishti, Nishith Maheshwari, and K. S. Rajan. "TOWARDS GENERATING SEMANTICALLY-RICH INDOORGML DATA FROM ARCHITECTURAL PLANS." International Archives of the Photogrammetry, Remote Sensing & Spatial Information Sciences 42.4 (2018).
[57] "Understanding topology in vector data" (PDF). Department of Land Affairs, Eastern Cape, South Africa. 2009. Retrieved 2011-11-25.
[58] Santilli, S. Topology with PostGIS 2.0. 23rd PostgreSQL Sessions, Paris, 2011.
[59] Nibis Map Server, http://www.lbeg.niedersachsen.de/kartenserver/nibis-kartenserver-72321.html , last visit: 2019-01-22
[60] Sentinel Data Access Overview - Sentinel Online - ESA , https://sentinel.esa.int/web/sentinel/sentinel-data-access , last visit: 2019-01-22