D4.1 IoT Data Attributes and Quality of Information
This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 779852
Title: D4.1 IoT Data Attributes and Quality of Information
Document Version: 0.1
Project Number: 779852
Project Acronym: IoTCrawler
Project Title: IoTCrawler
Contractual Delivery Date: 31/01/2019
Actual Delivery Date: 31/01/2018
Deliverable Type*: R
Security Class**: PU
* Type: P - Prototype, R - Report, D - Demonstrator, O - Other
** Security Class: PU - Public, PP - Restricted to other programme participants (including the Commission), RE - Restricted to a group defined by the consortium (including the Commission), CO - Confidential, only for members of the consortium (including the Commission)
Responsible and Editor/Author: Thorben Iggena
Organization: University of Applied Sciences Osnabrück
Contributing WP: WP4
Authors (organizations): Thorben Iggena (UASO), Daniel Kümper (UASO), Ralf Tönjes (UASO), Martin Strohbach (AGT), Pavel Smirnov (AGT), Juan A. Martinez (OdinS), Sahr Thomas Acton (UoS), Roonak Rezvani (UoS), Josiane Parreira (SIE)
Abstract:
This deliverable summarises the work of Tasks T2.3 and T4.1. First, the requirements for an information model for the IoTCrawler framework are presented, based on the scenarios developed in deliverable D2.1. A review of the state of the art in ontology models and quality calculation then prepares the development of an information model capable of fulfilling these requirements. The main aspects considered are general data source annotation, including time and geo information, privacy, and Quality of Information. In addition, it is shown how the information model and the additional quality information can be used by the indexing and crawling mechanisms of the framework.
More details on the calculation of Quality of Information and its metrics are presented, followed by early implementations and demonstrations. Overall, this deliverable builds a common basis for understanding information crawled with the IoTCrawler framework and shows how the data search process in IoTCrawler can be supported by additional Quality of Information, enhancing the user experience and improving the results of machine-initiated data queries.
Keywords:
Information Model, IoTCrawler framework, Annotation, Quality of Information, Quality Measures.
Disclaimer:
The present report reflects only the authors’ view. The European Commission is not responsible for
any use that may be made of the information it contains.
Abbreviations
AA Attribute Authority
ABE Attribute Based Encryption
AEC Architecture, Engineering and Construction
CR-Index Continuous Range Index
DHT Distributed Hash Table
DIM Distributed Index for Multi-dimensional Data
GDPR General Data Protection Regulation
GIS Geographic Information System
IBE Identity Based Encryption
IdM Identity Management Component
IFC Industry Foundation Classes
IoT Internet of Things
IoT-O Internet of Things Ontology
IRM IoT Relationship Model
iSAX Indexable Symbolic Aggregation Approximation
JSON JavaScript Object Notation
MDR Meta Data Repository
OGC Open Geospatial Consortium
PKC Public Key Cryptography
PPO Privacy Preference Ontology
PROV-O Provenance Ontology
PTO Pervasive Trust Ontology
QoI Quality of Information
QoS Quality of Service
RDF Resource Description Framework
SAO Stream Annotation Ontology
SAREF Smart Appliances Reference Ontology
SAX Symbolic Aggregation Approximation
SCF Smart City Framework
SKC Symmetric Key Cryptography
SOSA Sensor, Observation, Sample, Actuator
SSN Semantic Sensor Network ontology
SVD Singular Value Decomposition
Executive Summary
This deliverable focuses on the work done in tasks T2.3 and T4.1 of the IoTCrawler project. The main focus of these tasks is to develop an information model that makes all the information collected by IoTCrawler machine readable and, therefore, ready to be indexed and ranked for user or machine queries, with the aim of finding data sources for specific use cases. The information model has been developed in consideration of the 22 use cases defined in deliverable D2.1 and the additional requirements of privacy and Quality of Information. Hence, this document presents an ontology and metrics to calculate and annotate the Quality of Information for data sources found with IoTCrawler's crawling mechanisms.
By utilising the added Quality of Information for IoT data sources, the ranking mechanism can consider not only the annotation information of a data source but also information about its data quality. The defined metrics make it possible to set the focus of a data query to the IoTCrawler framework. For example, it might be preferable to use an up-to-date data source that provides data in near real time, and therefore has a higher Timeliness, instead of a more accurate one that provides data only several times per day. Used in this way, the Quality of Information component supports the framework by providing additional information that enhances the user experience and enables better services.
Disclaimer
This project has received funding from the European Union’s Horizon 2020 research and
innovation programme under grant agreement No 779852, but this document only reflects
the consortium’s view. The European Commission is not responsible for any use that may be
made of the information it contains.
Table of Contents
1 Introduction
  1.1 Sample Scenarios
  1.2 Overview
2 Requirements for IoTCrawler Information Model
  2.1 Machine Enabled Discovery
  2.2 Indexing
  2.3 Quality Measures for Ranking
  2.4 Security and Privacy Concerns
3 State of the Art
  3.1 Ontology Models
    3.1.1 IoT Ontologies
    3.1.2 Geo and Time Ontologies
    3.1.3 Security and Privacy Ontologies
  3.2 QoI Computation
4 Annotation Model for Discovery
5 Data Representation for Indexing
6 Quality Measures and Analysis for Ranking
  6.1 Quality of Information
    6.1.1 General Approach and QoI Vector
    6.1.2 Completeness
    6.1.3 Timeliness
    6.1.4 Plausibility
    6.1.5 Artificiality
    6.1.6 Concordance
    6.1.7 Quality Ontology
  6.2 Quality of Service
  6.3 Model-based Analysis
    6.3.1 Infrastructure Model
7 Security and Privacy for Enabling Data Access
  7.1 Authentication
  7.2 Authorization
  7.3 Privacy/Secure Group Sharing
  7.4 Security Properties
8 Implementation and Experimental Results
  8.1 Annotation Process
    8.1.1 Data Evaluation
    8.1.2 Geo Access Benchmark
  8.2 Example Quality Calculation and Annotation
  8.3 Example Location Indexing
9 Conclusion
10 References
Table of Figures
Figure 1-1 Smart Home Scenario
Figure 3-1 The SOSA and SSN ontologies and their vertical and horizontal modules (taken from https://www.w3.org/TR/vocab-ssn/. Last accessed: 28/11/2018)
Figure 3-2 CityPulse QoI Ontology
Figure 4-1 An instance of an IoTCrawler Object
Figure 4-2 IoTCrawler Information Model
Figure 6-1 Quality Ontology for IoT Data Sources
Figure 6-2 Infrastructure Model Description of the Utilised OpenStreetMap Database [53]
Figure 6-3 Building floor of the UASO Lab showing connected sensor infrastructure
Figure 6-4 a) Soil type overview of the NIBIS Server [59], b) NDVI (Normalized Difference Vegetation Index) based on combined Sentinel2 bands (B8-B4)/(B8+B4) [60]
Figure 6-5 Combined Infrastructure Model View of an office sensor network floorplan (IndoorGML IGML), building, path and road network (OSM) and topology soil type data
Figure 7-1 Authentication Example
Figure 7-2 Authorization Example
Figure 7-3 Access Control & Ownership Ontology
Figure 7-4 Data Privacy Example
Figure 7-5 Quality of Information Ontology including Security
Figure 7-6 Security Ontology
Figure 8-1 Classes Related and Connected to sosa:Sensor
Figure 8-2 IoTCrawler Information Model Example
Figure 8-3 The output of the clustering algorithm after applying Lag and PCA on real-time air pollution data (left). Notice the different patterns of observation from each cluster (right).
Figure 8-4 Geo access benchmark, Experiment 1, non-indexed data
Figure 8-5 Geo access benchmark, Experiment 1, indexed data
Figure 8-6 Geo access benchmark, Experiment 1, non-indexed vs. indexed data
Figure 8-7 Example of Weather Sensor Data, measuring Pressure, Relative Humidity and Temperature
Figure 8-8 Weather Sensor Data with Integrated QoI Values
Figure 8-9 Geohash example: Precision with a 7-character string
Figure 8-10 Geohash example: Precision with a 9-character string
1 Introduction
The integration of IoT resources, provided by third parties, requires that a search
engine such as IoTCrawler has rich metadata about these resources. In order to
maximize the relevance of search results, it is important that IoTCrawler specifies a
data model that clearly defines a set of data attributes that 1) are of interest to search
requests, and 2) describe the IoT resources sufficiently well to allow for efficient
search and data access. In this deliverable we define such a model including its data
attributes. This model builds the backbone of the IoTCrawler framework, as it defines
the basic data model used by the metadata repository as defined in deliverable D2.2
[48].
The model presented in this deliverable is driven by the analysis of the IoTCrawler
scenarios. In deliverable D2.1 [47] we have extensively described and analysed 22
IoTCrawler scenarios based on which we derived several key challenges that
IoTCrawler needs to address. In contrast to traditional data management systems
and, in particular, to web search engines, crawling and searching IoT resources
involves additional challenges that need to be considered. These include, for
instance, coping with the heterogeneity of data sources, managing the dynamically
changing locations of resources due to their mobility, knowing about their varying
availability, being informed about the different levels of quality of information, and
finally protecting data from unwanted access while ensuring trust between data
sources and consumers.
The data model and its attributes, as defined in this deliverable, holistically
address these challenges, paying particular attention to the attributes necessary
for privacy-enabled data access and for determining the Quality of Information
(QoI). QoI is particularly relevant for selecting and ranking data sources.
1.1 Sample Scenarios
The data model described in the following sections is strongly motivated by the
scenarios described in deliverable D2.1. Among other aspects, the model covers all
the aspects relevant for machine enabled discovery (see Section 2.1), scalability (see
Section 2.2), quality measures for ranking (see Section 2.3) and security and privacy
concerns (see Section 2.4). These aspects are well motivated by all of the 22
scenarios described in D2.1. Here we briefly examine some of them as motivation for
the provided data model. Please note that these examples are selected merely to
illustrate the results presented in this deliverable and do not imply a selection of the
scenarios to be further pursued as driving implementation scenarios. This decision
will be taken in the next few months as part of WP7.
Consider, for instance, the Pulse of Aarhus scenario. This scenario is based on the
idea of crawling all available data sources and correlating sensor data in order to
create metrics and new insights about the city and its citizens, e.g. whether they are
happy or busy, prefer certain locations etc. The data is not under uniform control, e.g.
the data can be provided by the municipality, companies, or individuals.
Consequently, for the discovery of the data, it is important to know about where the
data source is deployed (only Aarhus may be of interest) and what the data type is.
As we expect a high number of data sources in this scenario, another important
aspect is the Quality of Information associated with the data sources. Consider, for
instance, a search request for pollution levels in the city or a certain region of the city.
The city may provide access to high accuracy but “old” pollution levels with low
sensor coverage. There may also be community-driven sensor networks with much
higher coverage and up to date data, but at the cost of lower accuracy. If these QoI
attributes can be captured, they can be used for ranking and an application can make
an informed decision about which source to use.
The Find My Lost Child scenario stresses the requirement to capture dynamic geo-
spatial attributes of data sources. In this scenario, both fixed cameras and mobile
cameras of people in the city are used to identify missing people such as a child. The
data model presented in this deliverable (see Section 4) particularly takes into
account that resources such as mobile phones have changing location attributes.
Mobile IoT resources can provide their location as part of an IoTStream. The indexing
can access this information and update the index appropriately.
Finally, in domestic settings, a Smart Home Crawler, as we demonstrated at the ICT
Event 2018 in Vienna, creates the opportunity to offer innovative Smart Home
applications. However, in such a setting it is paramount that people's privacy is
protected and that people remain in control of with whom and how they share their
data. The dilemma between data richness and the need to control and protect data
flows is illustrated in Figure 1-1.
Figure 1-1 Smart Home Scenario
In Smart Home scenarios both individuals and companies have an interest to provide
and use rich data. Individuals are interested in using innovative services based on their
data, and companies require the data to offer these services. However, individuals do
not want to give away their data in an uncontrolled way. In the same way, companies
need a technical solution to comply with General Data Protection Regulation (GDPR).
IoTCrawler is designed to address this dilemma by 1) providing individuals, or more
generally data providers, full control over their data, 2) offering a technical solution
that assists in compliance with GDPR, and 3) help companies to establish a trust
relation with their data providers. Note that these trust issues are also particularly
important in B2B scenarios, such as Industry 4.0, where industrial customers typically
have strong concerns about providing data related to their manufacturing processes.
This deliverable addresses these issues by preparing the information model and
providing additional privacy and security ontologies. Furthermore, it enables trustful
relationships between data providers by offering quality metrics and analysis for
heterogeneous data sources.
1.2 Overview
The document is structured as follows: the requirements for the IoTCrawler
annotation model are described in Section 2. An overview of the state of the art in
ontological models, together with quality of information (QoI) evaluation
techniques are presented in Section 3. The annotation model providing the basic data
model used for new machine enabled discovery is described in Section 4. The
annotation model that defines the core data model used by the metadata repository
(MDR) is described in D2.2. Additionally, Section 5 describes the data model used
by the index that provides efficient access to the metadata stored in the MDR. This is
comparable to a database index. Section 6 provides a zoom in on the annotation
model detailing the QoI attributes and their calculation. Section 7 details privacy-
enabled data access, illustrating how the annotation model is used to protect data
and control access to it. Finally, in Section 8 we describe three implementation
prototypes that illustrate the annotation process, show how quality is calculated and
annotated, and how the location attributes can be indexed.
2 Requirements for IoTCrawler Information Model
2.1 Machine Enabled Discovery
In the past, search engines were mainly used by human users to search for content
and information. In the newly emerging search model, information is provided
depending on the consumers’ (human user or a machine) context and requirements
(for example, location, time, activity, and profile). Because of its necessity and
relevance, the information access can be initiated without the user’s (human or a
machine) explicit query or instruction (context-aware search). IoTCrawler will develop
enablers for context-aware IoT search, where the requirements of the different
applications will be mapped to the solutions by selecting resources considering
parameters such as security and privacy level, quality, latency, availability, reliability
and continuity. Therefore, one of the main requirements for the IoTCrawler
Information Model is to enable a rich, multi-faceted description of IoT data and
services. The meta-data description should cover both functional (e.g. location,
sensor type) and non-functional (latency, quality) properties.
Moreover, in many scenarios the applications and services make use of higher-level
concepts (e.g. topics and/or events) to access the data and services (e.g. find all the
areas that show “traffic jams”, return the location of areas that currently have “high air
pollution levels”, display all the CCTV cameras that show a “moving object”). These
higher-level concepts are domain dependent and require a common understanding
and description of the key concepts in the domain (i.e. described by a topical
ontology). Enabling the match between higher-level concepts representing user or
machine context, to the lower-level attributes and patterns describing the IoT
resources is another requirement for the IoTCrawler Information Model. In addition,
to represent domain independent meta-data about resources, the model must be
easily extensible to allow the linkage to domain specific information. It must also
support reasoning methods to allow vague or near complete descriptions of the
application needs to be mapped to concrete data and service descriptions.
Also relevant for the discovery is the ability to automatically adapt to changes in both
the context and data. Therefore, the IoTCrawler Information Model must provide
means to represent and easily access dynamic information.
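To make this requirement concrete, the following minimal sketch shows how a resource description combining functional and non-functional properties could be matched against a consumer's requirements. All identifiers, field names and thresholds are hypothetical and are not part of the IoTCrawler model:

```python
# Hypothetical resource descriptions combining functional properties
# (type, location) with non-functional ones (latency, quality).
resources = [
    {"id": "urn:sensor:aarhus:pm25-01", "type": "AirQualitySensor",
     "location": "Aarhus", "latency_ms": 200, "quality": 0.95},
    {"id": "urn:sensor:aarhus:pm25-02", "type": "AirQualitySensor",
     "location": "Aarhus", "latency_ms": 5000, "quality": 0.60},
]

def select(resources, rtype, max_latency_ms, min_quality):
    """Return resources that satisfy both the functional requirement
    (sensor type) and the non-functional ones (latency, quality)."""
    return [r for r in resources
            if r["type"] == rtype
            and r["latency_ms"] <= max_latency_ms
            and r["quality"] >= min_quality]

# A consumer needing fast, high-quality air-quality data matches
# only the first source.
fast = select(resources, "AirQualitySensor", 1000, 0.9)
```

A richer description would, as argued above, also carry security level, availability and domain-specific links; the point of the sketch is only that machine-readable functional and non-functional attributes make such automated selection possible.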
2.2 Indexing
The efficiency of IoT applications heavily depends on the cooperation of IoT devices
and services. Finding the most appropriate devices, services and data to respond to
the requirements of applications is a crucial challenge in dynamic and multi-purpose
IoT environments. For the IoT context-aware search to be successful, it requires that
efficient underlying crawling and indexing mechanisms are in place to find, describe
and register resources. A successful indexing strategy must provide all the relevant
information needed for the discovery task, while being decentralized, lightweight and
able to adapt to the changes in the IoT environments. Moreover, it should support
self-configuration and fault recovery mechanisms. Spatial, temporal and thematic
attributes are essential. In addition, our solution will develop hierarchical mechanisms
to also index other attributes of the data/services such as type, quality and time. This
will enable multi-level indexing of data with multiple attributes. However, this will also
require different indexes constructed based on spatial, temporal and thematic
attributes, to be combined.
One of the main requirements for the IoTCrawler Information Model is that it must
contain all attributes needed by the indexing mechanism. Keeping the index up-to-
date is particularly challenging, as data providers can be transient and/or mobile and
their data-related attributes can change over time. The Information Model must also
be extensible to represent data summaries, which can be used to decide when the
index should be updated, as well as combined with a set of attributes to support
multi-level indexing.
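Spatial attributes are one concrete case of such multi-level indexing: the location indexing prototype later in this deliverable (Section 8.3, Figures 8-9 and 8-10) uses geohashes, whose prefix length controls spatial precision. As an illustrative sketch (not IoTCrawler code), a standard geohash encoder looks as follows:

```python
# Illustrative geohash encoder: interleaves longitude/latitude bits and
# maps each group of 5 bits to a base-32 character. Longer prefixes give
# finer spatial cells, so prefix length can serve as an index level.
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash_encode(lat, lon, precision=7):
    lat_range, lon_range = [-90.0, 90.0], [-180.0, 180.0]
    chars, bit_count, ch, even = [], 0, 0, True
    while len(chars) < precision:
        rng = lon_range if even else lat_range  # even bits refine longitude
        val = lon if even else lat
        mid = (rng[0] + rng[1]) / 2
        if val >= mid:
            ch = (ch << 1) | 1
            rng[0] = mid
        else:
            ch <<= 1
            rng[1] = mid
        even = not even
        bit_count += 1
        if bit_count == 5:                      # emit one base-32 character
            chars.append(BASE32[ch])
            bit_count, ch = 0, 0
    return "".join(chars)

# Canonical example (57.64911 N, 10.40744 E) yields "u4pruyd" at 7 characters.
cell = geohash_encode(57.64911, 10.40744, 7)
```

Two sources whose geohashes share a 7-character prefix lie in the same cell of roughly 150 m, so a hierarchical index can group sources by successively longer prefixes, one prefix length per index level.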
2.3 Quality Measures for Ranking
The amount of data generated all over the world has increased massively in recent
years. Smart cities, social networks, Industry 4.0 and many more have created
dozens of new data sources. Even agriculture is turning more and more into a data-
driven business. For a comprehensive list of scenarios dealing with these data
sources see D2.1 [47].
The availability of all these new sensors, data sources, open data platforms etc. offers
new possibilities for innovative applications and use-cases. The heterogeneity of the
data sources could become an obstacle for building new applications. It is often
nearly impossible to automatically combine these data sources to extract knowledge
because machine readable information is not available. At this point Quality Measures
for annotation could offer opportunities to support the process of finding the right
information.
In IoTCrawler, the ranking mechanisms have to select the right data sources for all
conceivable use cases, starting with the data source or information that is actually
needed (e.g. traffic sensors for routing). To support the ranking mechanisms, the
available data sources need to be rated. If it is known whether a piece of information
is correct, application developers can guarantee a better user experience. The
Quality Measures therefore have to consider not only single data sources but also
combinations of them; combining heterogeneous and independent data sources
offers new possibilities for analysis, but requires more complex mechanisms.
To support the ranking in IoTCrawler, the Quality Measures should consider more
than just the correctness of data sources, even though correctness can be regarded
as one of the most important metrics. Another conceivable metric is a timeliness
rating of the data. This would enable the ranking mechanism to distinguish data
sources that have the same correctness but different timeliness ratings, which can
easily occur if one data source provides data more often than another. In this case,
the ranking has to decide whether more accurate or more current data is needed.
This system can be extended with further metrics to rate quality. To store all possible
measures, an annotation scheme has to be developed that is, on the one hand,
usable by the ranking mechanisms and, on the other hand, easily extensible with
additional measures.
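As an illustration of such a metric, a simple timeliness rating can relate the age of the latest measurement to the source's expected update interval. The linear decay below is one possible choice for a sketch, not the definition used by IoTCrawler (the project's metrics are detailed in Section 6):

```python
from datetime import datetime, timedelta, timezone

def timeliness(last_update, expected_interval, now=None):
    """Illustrative timeliness rating in [0, 1]: 1.0 while the latest
    value is younger than the expected update interval, then decaying
    linearly to 0 once it is a full interval overdue."""
    now = now or datetime.now(timezone.utc)
    age = (now - last_update).total_seconds()
    expected = expected_interval.total_seconds()
    if age <= expected:
        return 1.0
    return max(0.0, 1.0 - (age - expected) / expected)

# A source updating every minute, last seen 90 seconds ago, scores 0.5;
# one seen 30 seconds ago scores 1.0.
now = datetime(2019, 1, 31, tzinfo=timezone.utc)
fresh = timeliness(now - timedelta(seconds=30), timedelta(minutes=1), now)
stale = timeliness(now - timedelta(seconds=90), timedelta(minutes=1), now)
```

With such a rating annotated per source, the ranking can trade a high-timeliness, lower-accuracy source against an accurate but infrequently updated one, as described above.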
2.4 Security and Privacy Concerns
With the rapid development of the Internet of Things (IoT), a variety of IoT
applications contribute to our everyday life. They range from traditional equipment
to general household objects and help make people's lives better. Nowadays, IoT is
widely applied in social life applications such as the smart grid, intelligent
transportation, smart security, or the smart home [1]. Additionally, smart cities will
involve millions of autonomous smart objects around us, monitoring, collecting and
sharing data without us being aware of it [2].
Various application fields require the consideration of security and privacy as a
cornerstone in our architecture and, of course, in our information model at different
levels. From the point of view of the use of IoTCrawler, we have two kinds of entities
worth considering: producers and consumers. Each one requires different and
complementary security measures.
From the point of view of producers, our information model must be aware of security
requirements such as ownership, access control, visibility, and the like, which allow
the producer to specify security properties over the data to be registered in our
IoTCrawler. On the other hand, from the consumer point of view, searches and queries
carried out must be controlled in a secure way, presenting the results matching their
authorization privileges.
Furthermore, privacy must be considered because it allows the concealment of sensitive information, which must be consumed only by legitimate consumers.
All in all, security and privacy are important and have been considered in our information model because of the above implications. The model enables the enforcement of security and privacy policies through selected security mechanisms and technologies.
3 State of the Art
3.1 Ontology Models
3.1.1 IoT Ontologies
The Semantic Sensor Network ontology (SSN) is probably the best-known ontology for describing sensors in terms of capabilities, measurement processes, observations and deployments. Recently, the authors of SSN have proposed SOSA (Sensor, Observation, Sample, and Actuator) to act as a central building block for SSN, but also to be used as a standalone lightweight alternative (see Figure 3-1) [13].
SOSA includes concepts such as sensors, outputs, observation values and features of interest. IoT-Lite [14] is another lightweight ontology for IoT resources, entities and services. It is an instantiation of SSN and describes three concepts: objects, systems or resources, and services. Its lightweight nature allows IoT platforms to be represented and used without excessive processing time when querying the ontology, making it quite suitable for real-time sensor discovery.
IoT-O1 is another ontology in the Internet of Things domain, intended to model knowledge about IoT systems and applications. It comprises different modules, such as a sensing module, an acting module, a service module, a lifecycle module and an energy module, as well as a module specific to IoT.
1 https://www.irit.fr/recherches/MELODI/ontologies/IoT-O.html
Figure 3-1 The SOSA and SSN ontologies and their vertical and horizontal modules (taken from
https://www.w3.org/TR/vocab-ssn/. Last accessed: 28/11/2018)
The Stream Annotation Ontology (SAO) is a lightweight semantic model, which is built
on top of well-known models to represent IoT data streams. It uses concepts from
the Timeline Ontology (e.g. instant and interval) to represent temporal features, and it
also adopts Agent, Entity and Activity concepts from the PROV-O ontology. For
representing events, SAO uses the Event class from Event Ontology and it introduces
the StreamEvent concept, which describes the output of a stream observation. SAO
adopts the Sensor concept from SSN and extends the SSN’s Observation class with
the StreamData concept, which describes a segment or point which is connected to
time (i.e. it represents a stream as a point or segment). One of the main features of
SAO is the StreamAnalysis class, which allows the representation of an IoT stream
which is derived from one or more data streams, following a data analysis process.
SAO allows the representation of both the derived stream and the methods that were used during the analysis process [15].
Smart Appliances REFerence (SAREF) is an ontology which links different smart appliances using different ontologies and standards. The SAREF ontology combines different parts of existing ontologies based on need [16]. OMA is another, domain-specific ontology which, according to its technical specification, defines different types of objects, such as Security, Server, AccessControl, Device, ConnectivityMonitoring, Firmware, Location and ConnectivityStatistics [17].
3.1.2 Geo and Time Ontologies
There are also location ontologies which have been used in IoT sensor and stream ontologies. GeoSPARQL is an ontology which represents geospatial data for the Semantic Web. It can work with systems based on both qualitative spatial reasoning and quantitative spatial computations [18].
The Geo ontology2 is another popular ontology that represents location data in RDF (Resource Description Framework). It does not try to tackle many of the matters covered in the professional GIS (Geographic Information System) world. Instead, the ontology offers just a few basic terms that can be used in RDF (e.g. in RSS 1.0 or FOAF documents) when there is a need to describe latitudes and longitudes. The use of RDF as a carrier for lat (latitude) and long (longitude) simplifies cross-domain data mixing, as well as describing entities positioned on a map (e.g. a sensor, deployment, platform or system).
GeoJSON3 is a geospatial data interchange format based on JavaScript Object Notation (JSON). It defines several types of JSON objects and the way they are combined to represent data about geographic features, their properties, and their spatial extents.
There are also time ontologies, one of which is the Time ontology4. It is an ontology for describing temporal properties in the world and on Web pages. It provides vocabulary for representing information about topological (ordering) relations, duration and temporal position (i.e. date-time information). Time can be expressed using a conventional clock, Unix time, geologic time and other reference systems. Durations can likewise be expressed in different systems, for example the Gregorian calendar.
2 http://www.w3.org/2003/01/geo/wgs84_pos#
3 https://tools.ietf.org/html/rfc7946
4 https://www.w3.org/TR/2017/REC-owl-time-20171019/
3.1.3 Security and Privacy Ontologies
Early work on privacy and security ontologies focused on interactions over the Web, in particular on specifying privacy settings to enable access control. One example is the Privacy Preference Ontology (PPO) [27], a lightweight privacy vocabulary for access control. It enables users to create fine-grained privacy preferences for their data. The vocabulary is designed to restrict any resource to certain attributes which a requester must satisfy in order to gain access. A security ontology was proposed in [29], which models stakeholders and their assets, as well as different properties resulting from security risk analysis tools, such as vulnerabilities, impacts and countermeasures. The authors of [28] analysed trust on the Web and proposed a simple trust ontology in which different types of trust and their properties are formalised. The Pervasive Trust Ontology (PTO) [32] deals with trust computation in pervasive domains and uses different properties among devices to assign trust values. A trust manager is responsible for assigning the initial trust values and defining semantic relations among trust categories. Trust values are updated based on interactions among devices, and security rules, written in SWRL5 [1], are used to decide whether an interaction is allowed.
More recent work builds upon existing models, but with a focus on IoT domains. The work in [30] extends the Smart Appliances REFerence (SAREF) ontology6 with security features, focusing on the smart home domain. The model includes infrastructure, attacks, vulnerabilities and countermeasures for the main components of smart home energy management systems, such as the Smart Meter, Smart Appliances, the Home Gateway, and billing data. The work in [33] also targets security and privacy in IoT-based smart homes. The authors propose an ontology-based security service framework to support security and privacy preservation during interactions/interoperations. The ontology captures the security objective of an interaction/interoperation (e.g. whether the focus is on integrity or confidentiality), digital signatures, security keys, and the encryption algorithms used, among others. Security policies based on the ontology can then be created to govern the interactions/interoperations between service providers and customers in the developed cloud architecture model for IoT-based smart homes.
5 https://www.w3.org/Submission/SWRL/
6 http://ontology.tno.nl/saref/
The SecKit [31] is a model-based security toolkit that supports integrated modelling of IoT system design and runtime viewpoints, allowing an integrated specification of security requirements, threat scenarios, trust relationships, and usage control policies. SecKit integrates approaches for policy refinement, policy enforcement technology at different levels of abstraction with strong guarantees, context-based policy specification, and trust management. Although no formal ontology is given, the system and its trust metamodels can be used as a reference for ontology modelling.
Mozzaquatro et al. [34] have proposed a reference ontology for IoT security covering security concepts of M2M communications. The work is based on a state-of-the-art analysis of information security and the Internet of Things. Their IoTSec ontology features classes such as Assets, Threats, Security Mechanism, and Vulnerability, and it is available online7.
3.2 QoI Computation
Quality of Information is a term originally developed to describe the quality of data held in databases. Strong et al. cite several studies which estimate that faulty data costs billions of dollars [3]. To address these issues, they defined four categories of data quality, each with a list of dimensions. Together with problem patterns to identify possibly faulty data, this built a basis for further rating of Quality of Information. Over the following years, different approaches and frameworks have been developed to rate QoI for different use cases, e.g. [4], [5].
The key aspects when talking about Quality of Information are the categories or metrics used to describe the details of the QoI information added to some data. Five common metrics, used by the authors of [6], are Completeness, Correctness, Concordance, Currency and Plausibility. In addition, the authors provide assessment methods for quality and mappings between the quality dimensions.
An extensive ontology to describe Quality of Information in a machine-readable way has been developed by the CityPulse project. It uses five categories, namely Timeliness, Cost, Accuracy, Communication and Security, each with several sub-metrics. The ontology has been designed to annotate data streams with QoI information and therefore relates to other ontologies.
7 https://github.com/brunomozza/IoTSecurityOntology
Figure 3-2 CityPulse QoI Ontology
Besides the representation and categorisation of QoI, algorithms that fill the metrics with values have to be developed. One major problem, especially with respect to correctness, is the lack of ground truth. This problem is well known in the image processing domain, where an image has to be rated for its quality without any reference [7]. To rate images without an existing ground truth, the work in [8] analyses the sharpness of edges or the noise levels.
Ideas on how to deal with missing ground truth for social media data have been proposed in [9]: spatiotemporal, causality, and outcome evaluation. The proposed concept of spatiotemporal evaluation is similar to the approach of the Valid.IoT framework presented in [10]. That framework not only uses spatiotemporal relations but also considers historic data to evaluate the quality of data sources. Infrastructure models can also be used to calculate the quality measures. This approach is more complex than using simple metrics such as the Euclidean distance, but avoids common errors that can easily occur; e.g., the "real" distance from one street to another can be much longer than the Euclidean distance because of the street network [11], [12].
4 Annotation Model for Discovery
Semantic data modelling is a systematic process of structuring data in order to represent it in a precise, logical manner. The conceptual data model includes semantic information that adds a basic meaning/description to the data and the relationships between them. This approach to data modelling preserves data consistency when the data is updated.
The design of the IoTCrawler ontology is based on a set of principles, always keeping the ontology as lightweight as possible. The ontology extends the IoT-Stream ontology, which provides stream annotation concepts derived from SOSA.
While developing the semantic model we followed several best practices found in the literature. First, we referred to the most widely followed guide for creating ontologies, "Ontology Development 101", published in 2001 [19]. Second, in 2003 the W3C published a list of sample Good Ontologies following specific good practices, which we have also followed. Finally, in 2016, the development of the IoT-Lite ontology led to an extension of these guidelines to cover the scalability of ontologies.
The guidelines of "Ontology Development 101" divide the development of an ontology into seven detailed steps.
In IoTCrawler, step 1, which focuses on determining the domain and scope of the ontology, is the result of several discussions with project partners who have a wealth of knowledge and insight into the development of ontologies. This collective experience allows us to answer the question of who the end users of the ontology will be.
Step 2, which considers reusing existing ontologies, draws on several years of studying the ontologies in the area and selecting the proper ones to be reused, in full or in part.
Steps 3 to 6 deal with enumerating the important terms in the ontology, defining the classes and the class hierarchy, the properties of the classes (slots), and the facets of the slots. These steps are detailed further in this section; step 7, creating instances, is illustrated in Figure 4-1.
Similarly, the "Good Ontologies" list published by the W3C scored the ontologies based on five aspects, and in the development of IoT-Lite [14] the authors published 10 steps for semantic model development. We have followed all of these aspects and steps, but cannot describe them here for reasons of space. Following the previous guidelines, and always keeping in mind that the main principle behind IoTCrawler is to provide a lightweight extension for data streams, we have developed IoT-Stream. The development of IoTCrawler has always followed the linked-data approach, which increases the chance of interoperability by extending popular ontologies. The SSN ontology has had a significant impact on the semantic annotation of IoT elements and is therefore the ontology that IoTCrawler mainly extends, in addition to adopting its revised version, SOSA.
Figure 4-1 An instance of an IoTCrawler Object
Hence, the aim of the IoTCrawler ontology is to provide the elementary concepts needed to process stream data, using and extending SOSA and IoT-Stream, and thus allowing stream annotations. A large percentage of the data retrieved from IoT devices is stored in streams.
For example, sensors measuring relative humidity, traffic or pollution deliver streams of data with timestamps. When researchers and software developers need to process IoT data to obtain meaningful insights, they have to deal with the heterogeneity of the formats and syntax of the data.
Semantics can help to overcome this heterogeneity. However, semantic annotation was originally conceived to give detailed descriptions of the data, at a time when there was no need for (quasi) real-time responses and the amount of data was not huge. Therefore, the few semantic models that cover stream data tend to describe the real world in detail with several concepts. One example of such semantic models is SAO, which has been used successfully for forensic analysis and some (quasi) real-time analysis. However, when dealing with huge amounts of high-granularity data, analytics start to lag.
IoTCrawler intends to reduce the processing time of stream ontologies by annotating the streams with only the elementary concepts needed to process the data. The IoTCrawler model is composed of concepts from popular and commonly used ontologies. IoT-Stream, the main one, is composed of only four main concepts, as can be seen in Figure 4-2.
Figure 4-2 IoTCrawler Information Model
These concepts are IoTStream, StreamObservation, Analytics and Event. The first concept created in the ontology was StreamObservation. This concept has only two temporal data properties, windowStart and windowEnd, and one temporal property, sosa:resultTime, inherited from sosa:Observation. Although the well-known and commonly used Time Ontology provides temporal concepts, linking to these properties for each StreamObservation would be too heavyweight. The observation value itself is captured by sosa:hasSimpleResult. The remaining concepts with other useful information are kept outside this concept.
Using analytics, events can be detectedFrom single or multiple streams; for example, whether it is a windy, snowy or sunny day can be derived from humidity and temperature streams.
One of the main contributions of ontologies is the ability to share and link data from different models. That is why most of the semantic community has adopted the linked-data approach, and hundreds of semantic models are now linked together and can share their annotated data without further conversion efforts.
We have selected only stable and well-known ontologies, generally backed by a strong organisation, such as the W3C or the Open Geospatial Consortium (OGC). In doing so we ensure that the ontologies are dereferenceable and will not disappear in the near future. For example, for the annotation of sensor devices and services we have used the SOSA and IoT-Lite ontologies.
It is common for IoT applications to support filtering, sorting and searching by different parameters such as sensor type, location, quantity kind, etc. Therefore, we have linked the IoTCrawler ontology to several existing ontologies that describe these concepts. For location we have linked to the Geo ontology, and for observation coverage GeoSPARQL and IoT-Lite are employed. For quantity-kind and unit taxonomies we have used the QU ontology.
Taking this approach, IoTStreams link to concepts from other ontologies to capture information about the qoi:Quality of the stream via the providedBy property of the class iot-lite:Service. A stream also links to the sosa:Sensor it was generatedBy, the qu:QuantityKind and qu:Unit that the sosa:Sensor measures with, and the geo:Point where the stream originates. For searching purposes we have therefore centered the ontology around the IoTStream concept, providing direct links from this concept to the information needed to form common searches. With a central concept, queries become lighter, because they need fewer triples to find each aspect of the search.
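To illustrate how such a stream-centred annotation might look in practice, the following sketch serialises a minimal annotation as JSON-LD using only the standard library. The namespace IRIs, the example identifiers, and the exact class/property names (iots:IotStream, iots:generatedBy, geo:location) are assumptions for illustration; the normative IoT-Stream vocabulary may differ.

```python
import json

# Hypothetical sketch of a stream annotation as JSON-LD. The context IRIs and
# identifiers are illustrative assumptions, not the normative vocabulary.
annotation = {
    "@context": {
        "iots": "http://purl.org/iot/ontology/iot-stream#",
        "sosa": "http://www.w3.org/ns/sosa/",
        "geo": "http://www.w3.org/2003/01/geo/wgs84_pos#",
    },
    "@id": "urn:example:stream:temp-001",
    "@type": "iots:IotStream",
    # the stream links directly to the sensor and the location, so a search
    # needs only one hop from the central stream concept
    "iots:generatedBy": {
        "@id": "urn:example:sensor:ds18b20-001",
        "@type": "sosa:Sensor",
    },
    "geo:location": {"geo:lat": 52.2799, "geo:long": 8.0472},
}

doc = json.dumps(annotation, indent=2)
```

Because the sensor, location and quantity information hang directly off the stream node, a query engine can resolve the common search criteria without traversing long property chains.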
5 Data Representation for Indexing
As described in Deliverable D2.2 [48], the IoTCrawler architecture envisions a distributed metadata repository, which compiles information (i.e. metadata) about the different IoT deployments and their services/devices/data. Similar to Web search engines, in which Web pages are indexed to allow fast data retrieval, indexes are needed in the IoTCrawler framework to support search and data discovery. However, most existing solutions for Web indexing are designed around text analysis and the exploitation of links between documents/data resources on the Internet, and are not suitable for large-scale, dynamic IoT data networks [9].
Indexing large volumes of heterogeneous and dynamic IoT data/sources requires distributed, efficient and scalable mechanisms that provide fast access to and retrieval of data in response to user queries. Deciding how often these indexes should be updated and re-arranged while data streams are continuously published is crucial to enable on-line indexing [36]. With on-line indexing, the indexing structures are built incrementally, with the goal of continuously updating the indexes with newly connected resources and the data that becomes available on the network, without re-building the entire indexing structure. Moreover, the majority of IoT data has inherently spatio-temporal aspects that must be considered. Searching for patterns (e.g. temperature above 24 °C in a certain location) or answering complex queries (e.g. get all temperature sensors whose locations are at most 2 km away, or get all sensors that have observed extremely low visibility within the last 20 minutes) often requires a combination of different indexing techniques in order to find relevant results.
As the metadata repository holds information spread across multiple IoT deployments from different domains, the number of different types of services available, and of their properties, is expected to be large. Having indexes for every metadata item in the repository is prohibitively expensive to store and maintain. An index (especially in IoT) must be lightweight, yet effective in supporting the search task. Therefore, the decision on what to index is often application-dependent.
In [36], Fathy et al. performed an extensive survey of the state of the art in indexing, discovery and ranking of IoT data. Their work identified three main types of queries in IoT applications: data queries (find a particular piece of information), resource queries (find a particular device) and higher-level abstractions (find a particular pattern). Moreover, the information needs of each of these query types can be heterogeneous (e.g. queries with both location and resource-type requirements).
Depending on the attributes to be indexed, different indexing structures and
techniques are required. Below we describe some of the indexing methods which we
will consider in IoTCrawler. These solutions will be extended to provide distributed
and adaptive features.
For indexing location attributes, the work in [37] proposes a framework that supports spatial indexing of the geographical values of observation and measurement data collected from sensory devices. The work is based on encoding the locations of measurement and observation data using geohash (a Z-order curve). Geohash is a well-known and widely used geocoding algorithm based on interleaving bits to convert spatial coordinates (longitude and latitude) into a single string. Other works use geohashing in combination with other attributes (e.g. time or sensor type) to provide indexing solutions for multi-dimensional data.
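To make the bit-interleaving idea concrete, the following minimal encoder maps a coordinate pair to a geohash string using the standard base32 alphabet. It is a simplified sketch of the general geohash algorithm, not the implementation used in [37].

```python
# Minimal geohash encoder: interleave longitude and latitude bits (Z-order
# curve) and map each group of 5 bits to the standard base32 alphabet.
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash_encode(lat: float, lon: float, precision: int = 9) -> str:
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    bits, bit_count, even = 0, 0, True  # even bits encode longitude
    hash_chars = []
    while len(hash_chars) < precision:
        if even:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        even = not even
        bit_count += 1
        if bit_count == 5:  # 5 bits per base32 character
            hash_chars.append(BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(hash_chars)
```

The prefix property is what makes geohashes attractive for indexing: nearby locations usually share a common string prefix, so spatial proximity can be approximated with prefix matching.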
Barnaghi et al. [38] make use of geohashing and further annotate sensor data semantically to allow spatio-temporal indexing. While the spatial characteristic of the data is described using a geohash, the temporal feature and the type of the data are encoded using the MD5 digest algorithm. Singular Value Decomposition (SVD) is applied to the geohash vectors to reduce dimensionality before a k-means clustering algorithm distributes the data among repositories and allows data querying. Each data item is represented as a long string: the geohash tag of the location concatenated with the MD5 digest of the time and type values. Answering approximate queries is based on a predictive method that selects the repository that is likely to hold the requested hash string. Then, a string similarity method is used to find the best match for a requested query in the selected repository. In [35] an indexing structure is proposed which is built by first clustering resources based on their spatial features. A tree-like structure is then constructed per cluster, in which each branch represents a type of resource (e.g. temperature or humidity sensors). The indexing mechanism supports an adaptive process for updating the index with minimal cost. However, the approach is limited to a predefined set of resource types and only supports exact-match queries over multi-dimensional attributes (i.e. exact locations and types).
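The long-string representation described above can be sketched as follows; the function name, the field separator, and the concatenation order are our assumptions for illustration, not the exact scheme of [38].

```python
import hashlib

def index_key(geohash_tag: str, timestamp: str, sensor_type: str) -> str:
    """Concatenate the spatial geohash tag with an MD5 digest of the
    temporal and type attributes (illustrative sketch of the hybrid key)."""
    digest = hashlib.md5(f"{timestamp}|{sensor_type}".encode()).hexdigest()
    return geohash_tag + digest

key = index_key("u4pruydqqvj", "2019-01-31T12:00:00Z", "temperature")
```

Items at the same location then share a key prefix, so prefix or string-similarity matching can route a query to a candidate repository before the exact item is looked up.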
A Distributed Index for Multi-dimensional data (DIM) is proposed in [39]. It is a tree-based indexing approach which divides the network into geographical zones. Each zone corresponds to a node in the tree, and each node represents a range of values, such that the tree root represents the entire range. DIM can answer multi-dimensional range queries. However, the routing algorithm used to answer user queries is computationally expensive, which hinders its scalability in large-scale networks.
Distributed Hash Tables (DHTs) are widely used for indexing based on a particular term or attribute. While DHTs only support exact matches for a given key, existing work [40] extends DHTs to support multi-attribute and range queries.
The Continuous Range Index (CR-index) [41] is a tree-based method for indexing observation data based on their type attribute and value ranges. The method constructs a compact indexing scheme in which a collection of observation and measurement data items is grouped into boundary blocks based on their value ranges [min, max] (i.e. interval blocks).
While the CR-index is an example of using attribute values in the indexing structure, it does not consider the temporal characteristics of streaming sensor data. The approach in [42] creates a Bayes-tree index by aggregating hierarchical Gaussian mixtures in the form of a tree to represent the entire data set. The approach supports both incremental learning and anytime classification of newly arriving data streams. The latter allows fast access and search with a good level of accuracy. However, Gaussian mixture models are not suitable for multi-feature indexing, and the number of mixture components needs to be known in advance, which hinders flexibility. For time-series indexing, the most notable works are SAX and its derivatives, which we describe below.
Symbolic Aggregate approximation (SAX) was the first symbolic representation for time series that allows dimensionality reduction and indexing with a lower-bounding distance measure. Much of the utility of SAX has since been subsumed by iSAX (indexable Symbolic Aggregate approximation) [43], a generalisation of SAX that allows indexing and mining of massive datasets. An adaptive iSAX and tree-based indexing mechanism to answer approximate queries has been proposed in [44]. The cost of index construction is shifted from initialisation time to query time by creating and refining indexes while responding to queries. However, like SAX, iSAX strongly assumes that the data follows a Gaussian distribution and uses a z-normalisation in which the magnitude of the data is lost. IoT data does not necessarily follow a Gaussian distribution; the data distribution might change over time due to the nature of the observed phenomenon and/or concept drift. SensorSAX [44] is an enhancement of the SAX approach that adapts the window size based on the spread of the data values (i.e. using a standard-deviation criterion). It converts raw sensor data into a symbolic representation and infers higher-level abstractions (e.g. dark room or warm temperature).
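A minimal SAX transformation (z-normalisation, piecewise aggregate approximation, and symbol mapping via equiprobable breakpoints of the standard normal distribution) can be sketched as follows. An alphabet of size 4 and a series length evenly divisible by the segment count are simplifying assumptions of this sketch.

```python
import bisect
import math

# Breakpoints splitting N(0,1) into four equiprobable regions (alphabet "abcd").
BREAKPOINTS = [-0.6745, 0.0, 0.6745]
ALPHABET = "abcd"

def sax_word(series, n_segments):
    # z-normalise the series (this is where the magnitude of the data is lost)
    mean = sum(series) / len(series)
    std = math.sqrt(sum((x - mean) ** 2 for x in series) / len(series))
    z = [(x - mean) / std if std > 0 else 0.0 for x in series]
    # piecewise aggregate approximation: mean of each equal-length segment
    seg_len = len(z) // n_segments  # assumes len(series) % n_segments == 0
    symbols = []
    for i in range(n_segments):
        seg_mean = sum(z[i * seg_len:(i + 1) * seg_len]) / seg_len
        symbols.append(ALPHABET[bisect.bisect(BREAKPOINTS, seg_mean)])
    return "".join(symbols)
```

A steadily rising series such as [1, 2, ..., 8] with four segments yields the word "abcd", while a flat series maps every segment to the same symbol, which illustrates why the representation discards magnitude and keeps only shape.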
6 Quality Measures and Analysis for Ranking
6.1 Quality of Information
Section 2.3 identified a number of issues and requirements for current and future IoT scenarios. One major requirement is the need for quality analysis of data sources and the data they deliver/produce. This is important because false or misleading information might cause problems when processing and using the information. These problems range from simply misconfigured sensors that deliver wrong information to intentionally provided false information that leads to malfunctioning systems and applications. To address these problems, IoTCrawler integrates quality measures and analysis modules that rate data sources in order to identify the best-fitting sources for the needed information.
The first step before implementing quality analysis modules is to identify quality measures that can be used to rate data sources and their delivered/produced data for Quality of Information. To measure the QoI, IoTCrawler has identified and developed several metrics, which are presented in the following subsections.
6.1.1 General Approach and QoI Vector
The general approach of the quality analysis component of IoTCrawler is to generate a quality vector Q, which can be used by the indexing and ranking components to find the best-fitting data sources for a use case.

Q = ⟨q_cmp, q_tim, q_pla, q_art, q_con⟩

Q contains a list of quality metrics: Completeness q_cmp, Timeliness q_tim, Plausibility q_pla, Artificiality q_art and Concordance q_con. These metrics are described in detail in the following sections.
6.1.2 Completeness
The Completeness metric defines whether a data source message provides all information that was defined in its description. When a new data source is registered, part of its description is the set of fields that should be contained in delivered messages, e.g., temperature, humidity and wind speed for a weather sensor. The Completeness can then be calculated as

q_cmp = 1 − M_miss / M_exp

where M_miss is the number of missing values and M_exp the number of expected values of an incoming dataset. M_exp is extracted from the model description in the IoT Metadata of the data source (IoTCrawler's MDR [48]). Each data source message is rated for Completeness on its own, without including older messages.
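A minimal sketch of this calculation, assuming the expected fields are available from the data source description (the field names are illustrative):

```python
def completeness(message: dict, expected_fields: list) -> float:
    """q_cmp = 1 - M_miss / M_exp for a single data source message."""
    missing = sum(1 for field in expected_fields if message.get(field) is None)
    return 1 - missing / len(expected_fields)

# weather sensor announcing three fields, but one is absent from the message
q_cmp = completeness({"temperature": 21.3, "humidity": 40.0},
                     ["temperature", "humidity", "wind_speed"])
```

Here q_cmp evaluates to 2/3, since one of the three expected values is missing from the message.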
6.1.3 Timeliness
The Timeliness metric rates whether an observation was processed within a defined time frame before being delivered to the framework. Technically, it calculates the difference between the current time and the time stamp of the measured effect. If the difference is outside the defined range (i.e. the observation is too old), the QoI metric Timeliness is lowered. Calculating the Timeliness requires that a timestamp be added to measured values as close to the sensor as possible. If the sensor itself cannot add a timestamp and there is no direct gateway for the sensor that can add one, the calculation of Timeliness is not available. In contrast to common QoS evaluation, this time-related quality metric depends on the availability of the (updated) information rather than on technical transmission times. The Timeliness quality is evaluated against the previously annotated source properties (in the Metadata Repository).
T_freq defines the maximum time interval in ms expected between two measured values; T_freq = 0 denotes a purely event-based measurement transfer. The Frequency is calculated as

T_freq = t(x) − t(x − 1)

where t(x) is the time stamp of a received data source message, which is compared to the time stamp of the previous message. T_freq can then be used to measure the Timeliness of a data source message by comparing it with the announced timing settings in the data source description. In comparison to the Frequency, the Age metric measures how old a data source message is when it arrives in the framework. It is calculated as the difference between the current time stamp and the time the measurement was taken.
T_age = t_now − t(x)
To normalise the Frequency and the Age to the interval [0, 1], the Reward and Punishment algorithm introduced in [46] is modified and integrated. This algorithm takes numerical values such as Age or Frequency and compares them with given upper and lower bounds annotated in the IoT Metadata description. If a measured value is within these bounds, the reward increases; otherwise the value is punished. The reward is calculated as

R_d(t) = α_{W−1}(t − 1) / (W − 1) − (α_{W−1}(t − 1) + α_current(t)) / W

where W is the length of a sliding window over the last inputs and α_{W−1} denotes the number of measurements within the given interval that have been rewarded. α_current ∈ {0, 1} is the current reward or punishment decision, which is 1 for a measurement within the interval and 0 otherwise. The quality metric can then be calculated as

q(t) = |q(t − 1) − 2 · R_d(t)| with q(0) = 1
with 𝑞(𝑡) as the value of the quality metric at time 𝑡 and 𝑞(𝑡 − 1) as the previous
value. By normalising both submetrics of Timeliness with the help of the Reward and
Punishment algorithm, they can be combined to form the numeric value 𝑞𝑡𝑖𝑚 for the
Timeliness within the QoI vector 𝑄.
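The steps above can be sketched in Python. This is a minimal illustration of the reconstructed formulas; all function names (`t_freq`, `t_age`, `reward`, `quality_update`) and the example tolerance are chosen for this sketch and are not taken from the IoTCrawler implementation:

```python
def t_freq(t_curr, t_prev):
    """Measured interval between two data source messages (ms)."""
    return t_curr - t_prev

def t_age(t_now, t_meas):
    """Age of a message: arrival time minus measurement time (ms)."""
    return t_now - t_meas

def reward(decisions, current, W):
    """R_d(t) of the modified Reward and Punishment algorithm [46].

    decisions: 0/1 reward decisions of the sliding window (1 = value was
               within the annotated bounds), newest last.
    current:   0/1 decision for the newest sample (alpha_current).
    """
    alpha_prev = sum(decisions[-(W - 1):])  # rewarded samples in last W-1 slots
    return alpha_prev / (W - 1) - (alpha_prev + current) / W

def quality_update(q_prev, r_d):
    """q(t) = |q(t-1) - 2 * R_d(t)|, with q(0) = 1."""
    return abs(q_prev - 2 * r_d)
```

With a fully rewarded window the reward term vanishes and the quality value stays stable, e.g. `quality_update(1.0, reward([1, 1, 1], 1, W=4))` returns `1.0`.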
6.1.4 Plausibility
The Plausibility metric defines whether received data source information makes
sense with regard to the probabilistic knowledge about what is being measured.
Therefore, physical value ranges (e.g., indoor temperature or vehicle speed) and
sensor specifications are used to calculate the plausibility of a measurement.
The evaluation of plausibility utilises a set of sensor annotations and upper
ontology definitions to determine the expected value range of an incoming
measurement. These measurements are hierarchically evaluated and combined
based on their probability.
Atomic Fail-Plausibility Value                          Definition                                          Value Range
Pfail(DS18B20) = 0                                      Sensor Temperature, Dallas Semiconductors DS18B20   [−55 °C, +100 °C]
Pfail(Indoor Office Temperature) = 0.5                  Indoor Office Temperature                           [14 °C, 25 °C]
Pfail(Individual Room Temperature Environment) = 0.8    Environment Temperature                             Room specific
Table 6-1 Plausibility value ranges in different scenarios
𝑞𝑝𝑙𝑎(𝑣) = ∏ 𝑃𝐴𝑛𝑛𝑜𝑡𝑎𝑡𝑖𝑜𝑛(𝑣) = 𝑃𝐷𝑆18𝐵20(𝑣) ⋅ 𝑃𝐼𝑛𝑑𝑜𝑜𝑟(𝑣) ⋅ 𝑃𝐸𝑛𝑣𝑖𝑟𝑜𝑛𝑚𝑒𝑛𝑡(𝑣) for measurement 𝑣
The range of the Plausibility value is defined between 0 and 1. If the value lies within
the “Value Range” described in Table 6-1, the Plausibility will be 1; otherwise it will
be 0.
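The per-annotation range check and its combination by multiplication can be sketched as follows. This is a minimal illustration assuming the binary (0/1) interpretation described above; the hierarchical probability weighting indicated by the Pfail values in Table 6-1 is deliberately omitted, and the function name `plausibility` is chosen for this example:

```python
def plausibility(value, ranges):
    """q_pla(v): product of per-annotation range checks (sketch).

    ranges: (low, high) bounds collected from the annotation hierarchy,
    e.g. the sensor specification and the indoor-temperature bounds.
    Each factor is 1 if the value lies inside the range, 0 otherwise,
    so a single violated annotation forces the product to 0.
    """
    q = 1.0
    for low, high in ranges:
        q *= 1.0 if low <= value <= high else 0.0
    return q

# Example bounds taken from Table 6-1 (DS18B20 spec, indoor office range):
office = [(-55.0, 100.0), (14.0, 25.0)]
```

For instance, a reading of 21 °C satisfies both annotations, while 30 °C violates the indoor office range and yields a plausibility of 0.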
6.1.5 Artificiality
The Artificiality metric determines the inverse degree of applied sensor fusion
techniques and defines whether a value is a direct measurement of a single sensor,
an aggregated sensor value from multiple sources, or an artificial, spatiotemporally
interpolated value. If the sensor information originates from an individual IoT
hardware sensor and is neither aggregated nor interpolated, we assume 𝑞𝑎𝑟𝑡 = 1. An
unidentified information source, which aggregates information with unidentified
algorithms, will be annotated with 𝑞𝑎𝑟𝑡 = 0. The metric can be individually adapted to
the openness of the connected IoT framework.
The specification of the metric will be published in D5.3 together with the
description of the virtual sensor algorithms.
6.1.6 Concordance
The Concordance metric describes the agreement between the information of a
data source and the information of further independent data sources that report
correlated effects. The Concordance analysis takes any given sensor 𝑥0 and
computes the individual concordances 𝑐(𝑥0, 𝑥𝑖) with a finite set of 𝑛 sensors, where
𝑖 ∈ 𝑁. It can be assumed that 𝑐(𝑥0, 𝑥0) = 1, since it represents the same sensor. As
the concordance of a sensor with itself is not required in a real-world scenario, it can
be established that 𝑥0 ≠ 𝑥𝑖. Hence 𝑐(𝑥0, 𝑥𝑖) will be in the range [0,1). The decision, i.e.,
which pieces of individual information agree with each other (e.g., a slow traffic
event with ∆𝑣 = −12 km/h), is stored in the IoT Relationship Model, see Section 6.3.
The overall concordance 𝑞𝑐𝑜𝑛(𝑥0) at the given sensor location 𝑥0 is then calculated by
the concordance function
𝑞𝑐𝑜𝑛(𝑥0) = ∑ᵢ₌₁ⁿ 𝜆𝑖(𝑥0) ⋅ 𝑐(𝑥0, 𝑥𝑖) = 𝜆1(𝑥0) ⋅ 𝑐(𝑥0, 𝑥1) + 𝜆2(𝑥0) ⋅ 𝑐(𝑥0, 𝑥2) + ⋯ + 𝜆𝑛(𝑥0) ⋅ 𝑐(𝑥0, 𝑥𝑛)
with 𝜆 as a weight function
𝜆𝑖(𝑥0) = 1 / 𝑑(𝑥0, 𝑥𝑖)
and 𝑑(𝑥𝑎, 𝑥𝑏) as a propagation- and infrastructure-based distance function between
the sensor locations 𝑥𝑎 and 𝑥𝑏 of sensors 𝑎 and 𝑏.
To achieve an exponential neglection of samples with a high distance, the 𝑥-th
power of the distance can be applied, based on the derived propagation model:
𝑞𝑐𝑜𝑛(𝑥0) = [∑ᵢ₌₁ⁿ 𝑐(𝑥0, 𝑥𝑖) / 𝑑ˣ(𝑥0, 𝑥𝑖)] / [∑ᵢ₌₁ⁿ 1 / 𝑑ˣ(𝑥0, 𝑥𝑖)]
6.1.7 Quality Ontology
To integrate the QoI measurements into the IoTCrawler information model for further
processing, an ontology has been created. The ontology is designed to hold the
measurement values of the quality analysis and to connect them to the ontologies
used for IoTCrawler’s information model.
In contrast to other QoI ontologies (see Section 3.2), the IoTCrawler ontology is
optimised for QoI value representation. It contains neither static sensor information
nor any QoS values.
Figure 6-1 Quality Ontology for IoT Data Sources
Figure 6-1 shows the Quality Ontology for IoT Data Sources, containing classes for
all QoI metrics, which are integrated as subclasses of Quality. It is available online at
http://w3id.org/iot/qoi as a first draft version and will be updated during the project
lifetime (and kept online after the project).
To integrate the ontology into the IoTCrawler information model, the property
hasQuality is used.
Property: qoi:hasQuality
URI: https://w3id.org/iot/qoi#hasQuality
hasQuality - Connects a quality to a data source.
OWL Type: ObjectProperty
sub-property-of: owl:topObjectProperty
Domain: http://purl.org/iot/ontology/iot-stream#StreamObservation, http://purl.org/iot/ontology/iot-stream#iot-stream
Range: qoi:Quality
Listing 6-1 Quality Ontology hasQuality Property
Listing 6-1 shows the detailed description of the object property hasQuality. Its
domain identifies two possible connections to the information model of IoTCrawler,
namely StreamObservation and iot-stream. The range names the classes that can
be connected to the classes mentioned in the domain. Note that hasQuality can
connect to the class Quality and all of its subclasses (all QoI metrics are subclasses
of Quality).
The integration of values is achieved by additional properties:
hasAbsoluteValue
This datatype property holds absolute values for a QoI metric, e.g. 60 for Age.
hasRatedValue
This datatype property holds QoI values that are rated with the reward and
punishment algorithm. Therefore, all values are between 0 and 1.
hasUnitOfMeasurement
For absolute values, this object property describes the unit of the given value.
It links to a unit within the OM2 ontology8.
Listing 6-2 shows part of an exemplary QoI annotation and gives an idea of what a
QoI annotation will look like.
@prefix qoi: <https://w3id.org/iot/qoi#> .
@prefix om-2: <http://www.ontology-of-units-of-measure.org/resource/om-2/> .
@prefix sosa: <http://www.w3.org/ns/sosa/> .

### https://w3id.org/iot/qoi
<https://w3id.org/iot/qoi> rdf:type owl:NamedIndividual .

### https://w3id.org/iot/qoi#testCompleteness
qoi:testCompleteness rdf:type owl:NamedIndividual , qoi:Completeness ;
    qoi:hasAbsoluteValue 0.8 .

### https://w3id.org/iot/qoi#testFrequency
qoi:testFrequency rdf:type owl:NamedIndividual , qoi:Frequency ;
    qoi:hasUnitOfMeasurement om-2:second-Time ;
    qoi:hasAbsoluteValue 0.7 ;
    qoi:hasRatedValue 0.85 .

### https://w3id.org/iot/qoi#testPlausibility
qoi:testPlausibility rdf:type owl:NamedIndividual , qoi:Plausibility ;
    qoi:hasAbsoluteValue 0.9 .

### https://w3id.org/iot/qoi#testSensor
qoi:testSensor rdf:type owl:NamedIndividual , sosa:Sensor ;
    qoi:hasQuality qoi:testCompleteness , qoi:testFrequency , qoi:testPlausibility .
Listing 6-2 Quality Annotation Example
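Annotations of this shape can also be produced programmatically. The following Python sketch builds a quality annotation as expanded JSON-LD using only the standard library; the helper name `quality_annotation` and the compact dictionary layout are illustrative and not part of the IoTCrawler codebase:

```python
import json

QOI = "https://w3id.org/iot/qoi#"

def quality_annotation(metric, absolute=None, rated=None, unit=None):
    """Build one qoi:* quality node as a JSON-LD dictionary (illustrative)."""
    node = {"@type": QOI + metric}
    if absolute is not None:
        node[QOI + "hasAbsoluteValue"] = absolute
    if rated is not None:
        node[QOI + "hasRatedValue"] = rated
    if unit is not None:
        node[QOI + "hasUnitOfMeasurement"] = {"@id": unit}
    return node

# A sensor carrying two quality annotations, mirroring Listing 6-2:
sensor = {
    "@type": "http://www.w3.org/ns/sosa/Sensor",
    QOI + "hasQuality": [
        quality_annotation("Frequency", absolute=0.7, rated=0.85,
                           unit="http://www.ontology-of-units-of-measure.org"
                                "/resource/om-2/second-Time"),
        quality_annotation("Plausibility", absolute=0.9),
    ],
}
print(json.dumps(sensor, indent=2))
```

The generated structure can then be serialised or handed to a JSON-LD processor alongside the qoi context.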
6.2 Quality of Service
In addition to the presented QoI ontology and its metrics, there is further
information that can be used for the indexing and ranking mechanisms (depending
on the data source). As shown in Section 4, IoTCrawler uses the SOSA and SSN
ontologies for its information model. These ontologies also contain metrics that
describe the quality of a single sensor. For example, they comprise frequency (ssn-
system:Frequency, the smallest possible time between two measurements), precision
(ssn-system:Precision, the closeness between replicated observations for an
unchanged value) and latency (ssn-system:Latency, the time between the command
for an observation and the result being provided). Although these terms sound like
QoI or QoS metrics, they represent a description of the sensor and its properties
and quality. IoTCrawler plans to use complete frameworks, and not only single
sensors, as data sources. Therefore, these metrics might not be sufficient for every
use case.
8 http://www.ontology-of-units-of-measure.org/page/om-2
In order to include typical QoS metrics like bandwidth, jitter, delay, and latency,
additional ontologies have to be included, as we plan to keep the QoI ontology
simple and restricted to QoI only. Although several QoS ontologies have been
proposed in the past, e.g. QoSOnt [26], none of them is available online today. We
will primarily work with the QoI ontology and use the possibilities of SOSA/SSN to
describe the quality of the sensor. If necessary and possible, we will include
additional QoS metrics by extending these ontologies and the information model.
6.3 Model-based Analysis
The availability of appropriate, accurate and trustworthy data sources in
heterogeneous IoT environments is rapidly growing. However, the lack of available,
machine-readable metadata is still an ongoing shortcoming, especially when it
comes to information that describes the relation between various sensors and
actuators. Previous works [10], [12] showed that distinct knowledge of the
infrastructure where IoT devices are deployed enhances the monitoring and
validation of available sources. Simplified metadata annotation with only
latitude/longitude coordinates of IoT devices prevents utilisation of this
infrastructure knowledge. Due to the frequent unavailability of precise sensor data
and a missing ground truth, there is a high need for interpolation of available
information sources. Common approaches operate on plain geometries, whereby
entity relationships (e.g., a sensor belonging to a building) and infrastructural
limitations (e.g., the blocking of light and sound by a large object or the propagation
of traffic jams along streets) are not considered. Furthermore, applied interpolation
methods do not reflect these infrastructural limitations. WP5 will develop
infrastructure-dependent algorithms for monitoring and interpolation, and therefore
needs access to distinctly modelled infrastructure and entity knowledge.
Entity relations of IoT devices (e.g., belonging to a building, room, street or car), as
well as knowledge of the infrastructure (e.g., street map, building plan, soil) on which
physical effects propagate, have to be modelled for a meaningful quality analysis of
available IoT devices. In the following, this section proposes a joint infrastructure
model approach for buildings (IoT and industry), city/street infrastructure (Smart
City) as well as rural/soil-based applications (e.g. agricultural scenarios) that can be
extended to similar infrastructure patterns.
6.3.1 Infrastructure Model
The Infrastructure Model utilised in IoTCrawler stores the physical models that can
be used to determine the relations between IoT data sources and actuators. It
stores, e.g., maps and building plans, or a Geospatial Topology, which can be used
to determine propagation directions. To enable domain-independent utilisation of
IoTCrawler, we utilise the following three infrastructure models, which can be
reused for infrastructure-based monitoring and interpolation approaches. These
models can be transformed from Euclidean space into a topology space and
directed graphs; the process is described in IoTCrawler Deliverable D5.1.
a) Transportation and city infrastructure for the whole world is modelled and
freely available in OpenStreetMap [52]. The OpenStreetMap database is used
to obtain infrastructure information for the movement of vehicles on roads,
pedestrians and trains. Furthermore, rough information on building outlines is
stored. Figure 6-2 shows an example of the detailed information, which is used
to restrict the propagation of vehicle-related movement patterns. It shows the
definition of the road type, e.g., highway=secondary is the definition of a road
category similar to a main road (see [53]), and the definition of junction points,
which connect the road segments, as well as areas like cycleways. The
OpenStreetMap infrastructure model is stored as a directed graph, which is
tagged with a defined set of attributes for every edge and node, e.g., defining
maximum speeds, one-way usage and the road surface.
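Such an edge-attributed directed graph can be sketched with plain dictionaries. The node names and the exact attribute keys below are assumptions modelled after OpenStreetMap conventions, not the IoTCrawler data structures:

```python
# Illustrative edge-attributed directed road graph: each edge carries
# a defined set of attributes (road category, speed limit, one-way flag).
road_graph = {
    ("n1", "n2"): {"highway": "secondary", "maxspeed": 50, "oneway": False},
    ("n2", "n3"): {"highway": "cycleway", "maxspeed": 30, "oneway": True},
}

def reachable(graph, start):
    """Collect all nodes reachable from `start`, honouring one-way edges."""
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for (a, b), attrs in graph.items():
            # A one-way edge may only be traversed in its stored direction.
            directions = ((a, b),) if attrs["oneway"] else ((a, b), (b, a))
            for src, dst in directions:
                if src == node and dst not in seen:
                    seen.add(dst)
                    stack.append(dst)
    return seen
```

In this toy graph, n3 is reachable from n1, but not vice versa, since the n2→n3 segment is one-way; restrictions of this kind are what limit the propagation of vehicle-related movement patterns.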
Figure 6-2 Infrastructure Model Description of the Utilised OpenStreetMap Database [53]
b) Internal building and industry spaces can be modelled with the IndoorGML
standard of the OGC [54], which can be derived [55], [56] from BIM models in the
Industry Foundation Classes (IFC) data format for openBIM. The IFC data model
is intended to describe building and construction industry data. It is an object-
based file format with a data model developed by an international organisation
(buildingSMART9) to facilitate interoperability in the architecture, engineering
and construction (AEC) industry, and it is a commonly used collaboration
format in BIM-based projects. It is essentially the open export standard for
state-of-the-art architecture CAD systems. The transformation between BIM,
CityGML and IndoorGML enables an optimised integration of existing models.
IndoorGML describes indoor spaces like rooms, transitions like doors and
windows, and enables the placement of several sensors/sources in Euclidean
space whilst enabling the transformation to directed graphs for automated
analysis. Figure 6-3 shows an example IndoorGML model of a building floor.
9 http://www.buildingsmart-tech.org/
Figure 6-3 Building floor of the UASO Lab showing connected sensor infrastructure
c) Geospatial soil data is a key driver for agricultural applications. Furthermore,
many applications currently utilise satellite data (e.g., Sentinel, MODIS and
Landsat), in which a Geospatial Topology [57], [58] derived from raster or vector
data is also stored. The topology contains normalised spatial data with
standardised interfaces. Furthermore, it ensures topological integrity by
defining shared borders between areas and preventing overlapping
areas/values, which would lead to invalid models due to multiple values for the
same area. Due to this strict way of storing geospatial data, explicit spatial
relationships can be derived. The geospatial topology model can be extracted
from common .shp, .geojson or GML files that describe the infrastructure, or
can utilise DBMS systems like PostGIS.
Figure 6-4 a) Soil type overview of the NIBIS Server[59], b) NDVI (Normalized Difference Vegetation Index) based on combined Sentinel2 bands (B8-B4)/(B8+B4) [60]
IoTCrawler transforms these three models into an integrated Propagation Model
that is used to support the entity definitions and relationship modelling of
distributed sensor/actuator networks. Figure 6-5 shows a combined view of the
previously integrated infrastructures.
Figure 6-5 Combined Infrastructure Model View of an office sensor network floorplan (IndoorGML), building, path and road network (OSM) and soil type topology data
The IoT Relationship Model (IRM) describes the relationship between individual
sensor entities in the real world and their mutual impact. This relationship can
express that, e.g., sensors in the same room are connected by some physical
relationship (e.g., different temperature sensors placed in a river) or are connected
via an infrastructure like traffic on a road network. It utilises the spatiotemporal
propagation model of a data source to describe the interconnection and
propagation of sensor values based on their relation. However, the model does not
necessarily describe inevitable effects between data streams. If, for example, a
traffic sensor A reports an average car count of 0 vehicles per minute, a data source
B that derives traffic jam reports from different sensor data can be used to verify, or
respectively support, the concordance of the information of sensor A. At the same
time, 0 vehicles per minute does not necessarily indicate a traffic jam; there could
simply be no traffic during the night. The IRMs, as components of the Monitoring
and Virtual Sensor components, utilise the previously described Infrastructure
Models and will be described in IoTCrawler Deliverable D5.1.
The Interpolation Algorithm defines possible algorithms to determine the
spatiotemporal interpolation of sensor values based on the individual IRM and
Infrastructure Model. Simple areal propagation scenarios are solved using
interpolation algorithms like Kriging and IDW. However, the novel approach of this
validation component is to regard restricting infrastructures and use propagation
algorithms like the previously published I-IDW [12]. Furthermore, 3-dimensional gas,
dust and noise propagation can be modelled on the same infrastructure model,
using the digital terrain model or digital surface model (including buildings and
other objects). The individual Interpolation Models will be described in IoTCrawler
Deliverables D5.1 and D5.3, as components of the Monitoring and Virtual Sensor
components.
7 Security and Privacy for Enabling Data Access
Within the framework of this project, we have considered different security and
privacy mechanisms that must be adopted by our IoTCrawler platform, firstly, to
securely integrate information coming from other IoT platforms or systems and,
secondly, to control how information is revealed in the results obtained by searches
of users with different privileges. In Section 3.1.3 we provided the state of the art of
security and privacy ontologies in the field of IoT. In this section we elaborate on
privacy and security aspects such as authentication, authorization and identity
management that will become fundamental properties of our IoTCrawler platform.
Additionally, we propose an extension of the Quality of Information ontology to
include security and privacy aspects.
7.1 Authentication
Authentication ensures that an identity of a subject (user or object) is valid, i.e., that
the subject is indeed who or what it claims to be. It allows binding an identity to a
subject. The authentication can be performed based on something the subject knows
(e.g. password), something the subject possesses (e.g. smart cards, security token) or
something the subject is (e.g. fingerprint or retinal pattern).
Our IoTCrawler platform must integrate an authentication component responsible
for authenticating users and smart objects based on the provided credentials.
These credentials can be in the form of login/password, shared key, or digital
certificate. An example authentication is depicted in Figure 7-1.
Figure 7-1 Authentication Example
Additionally, IoTCrawler can also adopt alternative and more sophisticated ways of
performing authentication, ensuring at the same time privacy and minimal
disclosure of attributes. In any case, this will be carried out by the Identity
Management component (IdM). This way, the IdM will be able to verify anonymous
credentials and then, in case the identity is proven, interact with the authentication
module, which delivers the authentication assertion to be used during a transaction.
7.2 Authorization
The inherent requirements and constraints of IoT environments, as well as the
nature of the potential applications in these scenarios, have brought about a greater
consensus among academia and industry to consider access control as one of the
key aspects to be addressed for full acceptance by all IoT stakeholders. An example
authorization mechanism is shown in Figure 7-2.
Figure 7-2 Authorization Example
The proposed authorization system is based on a combination of different access
control models and authorization techniques, in order to provide a comprehensive
solution for the set of considered scenarios. Specifically, we employ two different
technologies: one is based on the use of access control policies to make authorization
decisions, and the other employs authorization tokens as an access control
mechanism to be used by IoT devices as well as the resources stored in our
IoTCrawler platform.
In this regard, standards such as SAML [20] and XACML [21] could be adopted;
nonetheless, it is desirable to employ access control solutions specifically designed
for IoT scenarios. In this sense, our security components can perform capability-
based access control based on the mechanism presented in [25]. It describes
authorization tokens specified in JSON as well as ECC optimisations to manage
access control on constrained devices.
Figure 7-3 Access Control & Ownership Ontology
As we can see in Figure 7-3, a sensor has a relationship with its owner, who provides
an AccessControlList composed of AccessControlEntries in which the access rights
are defined. Such a representation can help the design and development of new
authorization enablers.
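The ontology's AccessControlList/AccessControlEntry structure can be mirrored by a small in-memory check. The dictionary layout, resource URN and subject names below are illustrative assumptions, not the IoTCrawler enabler implementation:

```python
# Illustrative in-memory form of an AccessControlList: for each sensor,
# a list of AccessControlEntries binding a subject to its granted rights.
acl = {
    "urn:sensor:42": [
        {"subject": "alice", "rights": {"read"}},
        {"subject": "bob", "rights": {"read", "write"}},
    ],
}

def allowed(acl, resource, subject, right):
    """Return True if any AccessControlEntry grants `right` to `subject`."""
    return any(entry["subject"] == subject and right in entry["rights"]
               for entry in acl.get(resource, []))
```

An authorization enabler would evaluate such entries before revealing a data source in search results, e.g. granting bob write access while denying it to alice.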
7.3 Privacy/Secure Group Sharing
Privacy is a very broad security term that can be applied to different entities: (1)
users who prefer to keep their identity secret; (2) communications in which every
exchanged message is secured and encrypted, allowing only the two ends of the
communication to decrypt the information; and (3) the data itself. The latter is the
definition this document addresses within the scope of data modelling. For
instance, consider PayTV services, whose multimedia streams are transmitted in an
encrypted manner to everybody, but only legitimate users can access the
information. Figure 7-4 shows an example for privacy.
Figure 7-4 Data Privacy Example
Because of the usage of resource-constrained devices, Symmetric Key
Cryptography (SKC) has been widely used in IoT, requiring that producer and
consumer share a specific key. Nevertheless, this approach cannot provide a
suitable level of scalability and interoperability in a future with billions of
heterogeneous smart objects. These issues are tackled by Public Key Cryptography
(PKC), which, however, presents significantly higher computing and memory
requirements as well as the need to manage the corresponding certificates. A
feature common to SKC is that PKC allows a producer to encrypt information to be
accessed only by a specific consumer. However, given the pervasive, dynamic and
distributed nature of IoT, it is necessary to consider scenarios in which information
is shared with a group of consumers or a set of unknown receivers that is therefore
not addressable a priori.
In that sense, Identity-Based Encryption (IBE) [22], [23] was designed as a
certificate-free alternative to PKC, in which the identity of an entity is determined
not by a public key but by a string. Consequently, it enables more advanced sharing
schemes, since a data producer can share data with a set of consumers whose
identity is described by a specific string. In this direction, Attribute-Based
Encryption (ABE) [24] represents the generalisation of IBE, in which the identity of
the participants is represented not by a single string but by a set of attributes
related to their identity. Just as with IBE, it does not use certificates; cryptographic
credentials are managed by an entity usually called the Attribute Authority (AA). In
this way, ABE provides a high level of flexibility and expressiveness compared to
previous schemes. In ABE, a piece of information can be made accessible to a set of
entities whose real, possibly unknown, identity is based on a certain set of
attributes.
Based on ABE, in a CP-ABE scheme [24] a ciphertext is encrypted under a policy of
attributes, while the keys of participants are associated with sets of attributes. In
this way, a data producer can exert full control over how the information is
disseminated to other entities, while a consumer's identity can be intuitively
reflected by a certain private key. Moreover, in order to enable the application of
CP-ABE in constrained environments, the scheme can be used in combination with
SKC. Thus, a message would be protected with a symmetric key, which, in turn,
would be encrypted with CP-ABE under a specific policy. In case smart objects
cannot apply CP-ABE directly, the encryption and decryption functionality could be
realised by more powerful devices, such as trustworthy gateways. In addition,
CP-ABE can rely on Identity Management Systems (e.g. anonymous credential
systems) to obtain private keys associated with certain user attributes from a
specific AA, after demonstrating the possession of such attributes in the partial
identity. These private keys can then be used by consumers to decrypt data
disseminated by producers, as long as the consumer satisfies the policy that was
used to encrypt it.
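The access logic of the hybrid pattern can be illustrated with a toy policy check. This sketch is explicitly not cryptography: real CP-ABE enforces the policy inside the ciphertext via boolean attribute formulas, whereas here an AND-only policy is modelled as a plain attribute set, and all attribute names are invented for the example:

```python
def satisfies(policy, attributes):
    """Toy AND-only CP-ABE policy check (illustration only, no crypto).

    policy:     set of attributes a consumer's key must contain
    attributes: attributes embedded in the consumer's private key
    """
    return policy <= attributes

# Hybrid pattern: the payload would be protected with a symmetric key,
# and only that (small) key would be wrapped with CP-ABE under the policy;
# decryption of the symmetric key succeeds only if the policy is satisfied.
policy = {"role:doctor", "org:hospitalA"}
consumer_key_attrs = {"role:doctor", "org:hospitalA", "dept:cardiology"}
```

Here the consumer's key carries a superset of the policy attributes, so the wrapped symmetric key would be recoverable; a key with `role:nurse` instead of `role:doctor` would fail the check.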
7.4 Security Properties
Apart from the use of different enablers applying security technologies such as
authentication, authorization and privacy, our focus in this deliverable is the
representation of the information. In this scope, and in terms of semantic annotation
and enrichment, we have also considered security properties. Following Figure 6-1
regarding Quality of Information, we propose an extension to include the
aforementioned properties.
Figure 7-5 Quality of Information Ontology including Security
In the following figure, we propose the integration of a set of elements comprising
different security properties, grouped in categories related to integrity,
confidentiality or key carrying. The integration of these terms will allow us to add
security to the Quality of Information vocabulary.
Figure 7-6 Security Ontology
8 Implementation and Experimental Results
8.1 Annotation Process
To be able to model IoT data streams using the IoTCrawler ontology, a reference
annotation tool is required. For this, an offline annotator has been developed that
adapts data from the Aarhus Dataset Repository for historical datasets, as well as an
online “on-demand” annotator, which annotates pre-configured dataset sources at
the point of request from a remote client. This tool has been developed using the
Jackson JSON toolkit. The toolkit comprises a set of Java annotations that control
how JSON data is read into objects, or how JSON is generated from the objects.
Using the IoTCrawler ontology, the Jackson annotation was used to map classes,
object properties, annotation properties and data properties. Listing 8-1 shows a
code snippet example of how the Jackson toolkit was used to create the classes
listed in Figure 8-1 along with the object properties in relation to the sosa:Sensor
class.
platformClass = ontModel.getOntResource(SOSA_PREFIX + "Platform").asClass();
systemClass = ontModel.getOntResource(SSN_PREFIX + "System").asClass();
hasLocation = ontModel.getObjectProperty(GEO_PREFIX + "location").asObjectProperty();
hasUnit = ontModel.getObjectProperty(IOT_LITE_PREFIX + "hasUnit").asObjectProperty();
Listing 8-1 Jackson Annotation Code
Figure 8-1 Classes related and Connected to sosa:Sensor
Figure 8-2 IoTCrawler Information Model Example
Figure 8-2 shows an example of the annotated data output that was generated
using the Jackson toolkit. As shown, the data is represented in JSON-LD.
Challenges when annotating data:
It is essential for IoT stream data to have correctly labelled data with appropriate
names (features) and values, as well as a rich description. However, it is not often
the case that well-labelled, rich IoT sensor datasets are available, which
consequently requires additional processing. The IoTCrawler ontology model is
composed of the essential components (classes and properties) required to
annotate IoT data streams along with their metadata.
The quality of the annotated data can have a substantial impact on the
performance, usage and service of the IoTCrawler search engine.
The process of data annotation often suffers from common limitations, in particular
when annotating components with missing metadata. Annotating components with
missing metadata often requires an additional manual process, compared to
extracting the information from a metadata label or description. Manually
annotating missing metadata often affects the quality of the information.
8.1.1 Data Evaluation
We have used probabilistic machine learning techniques (i.e. Gaussian Mixture
Models [49]), pattern creation techniques (i.e. Lagrangian multipliers) and statistical
procedures (i.e. Principal Component Analysis [50]) for crawling and analysing the
semantic descriptions of IoT data and services. Applying these techniques, we have
developed a range of different models to search the content of the time-series data
harvested from the city of Aarhus live open dataset (e.g. live traffic10, air quality and
population, etc.).
These models have been evaluated and applied to two different datasets: synthetic
data and a real-world air pollution dataset. The datasets used for evaluation
purposes have very similar characteristics to the dataset harvested from the City of
Aarhus; this also allows us to compare the dataset to existing, similar real-world
datasets as well as similar approaches.
Table 2 Data Silhouette Coefficients
Data Silhouette Coefficient
Without Noise 0.87
With Noise 0.47
To illustrate the performance of our proposed method for real-world applications,
we selected air quality data from the CityPulse11 project's open dataset. We used the
air quality observations for a period of two months, recorded at 5-minute intervals
(i.e. 12 samples per hour).
The data has two dimensions: nitrogen dioxide (NO2) and particulate matter (PM).
Due to the sample rate, we set the step size to 12 (s = 12), which contains the
observations of one hour. We clustered the data into three different clusters based on the air-quality
10https://portal.opendata.dk/dataset/realtids-trafikdata/resource/b3eeb0ff-c8a8-4824-
99d6-e0a3747c8b0d
11 http://iot.ee.surrey.ac.uk:8080/datasets.html
index for air-pollution assessments (i.e. low risk, medium risk and high risk). See
Figure 8-3 for the clustering result. To evaluate our proposed method, we compared
it with existing solutions. We applied GMM clustering to the raw data, to the data
after applying only the Lagrangian multiplier, and to the data after applying only
PCA.
Figure 8-3 The output of clustering algorithm after applying Lag and PCA on real-time air pollution data (left). Notice the different patterns of observation from each cluster (right).
To provide a numerical assessment, we calculated the Silhouette coefficient and
also the ratio of the average distance between clusters to the average distance
within clusters12. Note that we use the ratio because the Lagrangian transformation
scales the data, which affects the distance measures in the different scenarios.
Therefore, to provide a fair and consistent comparison, we calculate the ratio. The
results are shown in Table 3.
Table 3 Method Silhouette Coefficient Ratios
Method Silhouette Coefficient Ratio
Lag + PCA + GMM 0.69 4.09
Raw + GMM 0.46 2.25
Lag + GMM 0.457 2.25
PCA + GMM 0.395 2.05
12 The higher the distance between and smaller the distance within clusters the better the clustering
performance, so ratio of a high performance clustering should be high [51]
The Silhouette coefficient for our proposed method is 0.69, which shows higher
performance compared with the other methods. The ratio of the average distance
between clusters to the average distance within clusters is also higher for our
proposed method, which means the samples are closer together within each cluster
and well separated from the other clusters.
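The Silhouette coefficient used above can be computed directly from its definition, s(i) = (b(i) − a(i)) / max(a(i), b(i)), where a(i) is the mean intra-cluster distance of sample i and b(i) its mean distance to the nearest other cluster. The following pure-Python sketch illustrates this for small 2-D inputs; it is an illustration of the metric, not the evaluation code used in the project:

```python
from math import dist  # Euclidean distance, Python 3.8+

def silhouette(points, labels):
    """Mean Silhouette coefficient for labelled 2-D samples (sketch)."""
    def mean_dist(p, members):
        return sum(dist(p, q) for q in members) / len(members)

    clusters = {l: [p for p, m in zip(points, labels) if m == l]
                for l in set(labels)}
    scores = []
    for p, l in zip(points, labels):
        own = [q for q in clusters[l] if q != p]
        if not own:                      # singleton cluster: score 0 by convention
            scores.append(0.0)
            continue
        a = mean_dist(p, own)            # mean intra-cluster distance
        b = min(mean_dist(p, clusters[m])  # nearest other cluster
                for m in clusters if m != l)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)
```

Two tight, well-separated clusters yield a coefficient close to 1, while overlapping clusters drive it towards 0, matching the drop from 0.87 to 0.47 observed for the noisy synthetic data in Table 2.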
8.1.2 Geo access benchmark
The following figures show benchmark results for the experiments.
Figure 8-4 Geo access benchmark, Experiment 1 non indexed data
Using the IoTCrawler search platform, geospatial search query experiments were conducted to analyse the performance and time complexity of the geohash search functionality built into the platform.
The geohash ‘u1zr2r2grd45’ (56.15855872, 10.20765351) was used to query and find the closest (neighbouring) car parking areas to the given coordinates. From the given geohash string, the platform was able to find three close geohash neighbours with the precision (a number between 1 and 12 that specifies the number of characters of the resultant symbol, in this case ‘u1zr2’) set to 5.
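For illustration, the standard geohash encoding that produces such strings can be sketched in a few lines of Python (a generic re-implementation of the public geohash algorithm, not the platform's code); truncating the result to 5 characters yields the precision-5 cell ‘u1zr2’ used in the experiment:

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash_encode(lat, lon, precision):
    """Encode latitude/longitude as a geohash of the given length
    by interleaving longitude and latitude interval-halving bits."""
    lat_rng, lon_rng = [-90.0, 90.0], [-180.0, 180.0]
    bits, even = [], True  # a geohash starts with a longitude bit
    while len(bits) < precision * 5:
        rng, val = (lon_rng, lon) if even else (lat_rng, lat)
        mid = (rng[0] + rng[1]) / 2
        if val >= mid:
            bits.append(1)
            rng[0] = mid
        else:
            bits.append(0)
            rng[1] = mid
        even = not even
    # pack each group of 5 bits into one base-32 character
    return "".join(BASE32[int("".join(map(str, bits[i:i + 5])), 2)]
                   for i in range(0, precision * 5, 5))

print(geohash_encode(56.15855872, 10.20765351, 5))   # -> u1zr2
```

Because longer geohashes refine shorter ones, a precision-12 encoding of the same coordinates simply extends this 5-character prefix, which is what makes prefix-based neighbour search possible.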
Figure 8-5 Geo access benchmark, Experiment 1 indexed data
The geohash data values were then indexed within the database to optimise query time and improve overall performance. MongoDB uses a B-tree data structure for indexing, which improves the time complexity from O(n) to O(log n). Over the 10 experiments conducted, the average geospatial query time on the non-indexed data was 0.2223 ms, which improved to an average of 0.0214 ms on the indexed data.
Figure 8-6 Geo access benchmark, Experiment 1 Non indexed vs Index data
As shown in in Figure 8-6, it is clear that the indexed data-set has a faster overall
geospatial query time, resulting to better overall system performance. Comparing the
two experiments results and their time complexities O(n) and O(log n), it is clear that
O(log n) will have better overall performance within a scalable deployed system.
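The benefit of the B-tree index can be imitated with a sorted list and binary search: because all geohashes inside a cell share the cell's string prefix, a prefix query becomes two O(log n) lookups instead of an O(n) scan. The geohashes and parking names below are illustrative, not the platform's actual data or its MongoDB code:

```python
import bisect

# hypothetical parking areas keyed by geohash (illustrative values)
parkings = {
    "u1zr2r2grd45": "parking A",
    "u1zr2qxyz012": "parking B",
    "u1zr2tabc345": "parking C",
    "u1zr9k000000": "parking elsewhere in Aarhus",
}
index = sorted(parkings)  # stands in for the database's B-tree index

def prefix_query(index, prefix):
    """Return all geohashes in the cell `prefix` via two binary searches."""
    lo = bisect.bisect_left(index, prefix)
    hi = bisect.bisect_right(index, prefix + "\uffff")
    return index[lo:hi]

print(prefix_query(index, "u1zr2"))  # the three entries in cell u1zr2
```

Each lookup touches O(log n) entries, mirroring why the indexed experiment outperforms the linear scan over the raw collection.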
8.2 Example Quality Calculation and Annotation
QoI metrics from Section 6 have to be calculated and fused with the metadata for data enrichment. QoI calculations are performed at two layers. Completeness (𝑞𝑐𝑚𝑝), Timeliness (𝑇𝑎𝑔𝑒 and 𝑇𝑓𝑟𝑒𝑞), Plausibility (𝑞𝑝𝑙𝑎) and Concordance (𝑞𝑐𝑜𝑛) have to be computed for each sensor value in the Micro layer [48] and then averaged or statistically aggregated over a whole dataset for semantic enrichment in the Internal processing layer [48]. Artificiality (𝑞𝑎𝑟𝑡) is strictly a trait of a dataset and only has to be computed for semantic enrichment in the Internal processing layer to facilitate ranking of a dataset.
To depict the QoI computation, consider a weather sensor that measures
temperature, relative humidity and pressure. For a particular query, the data sent by
the sensor is shown in Figure 8-7.
Figure 8-7 Example of Weather Sensor Data, measuring Pressure, Relative humidity and Temperature
For the shown data values, QoI metrics of Completeness, Timeliness and Plausibility
will be calculated in the following fashion:
Completeness: Of the three measured quantities, i.e. pressure, relative humidity and temperature, only the pressure value is missing, at two positions. As described in Section 6.1.2, 𝑞𝑐𝑚𝑝 is calculated for each observation. For the observations in rows 0, 2, 3 and 4, 𝑀𝑚𝑖𝑠𝑠 = 0 and hence 𝑞𝑐𝑚𝑝 = 1. However, row 1 has one missing value, so 𝑀𝑚𝑖𝑠𝑠 = 1, giving 𝑞𝑐𝑚𝑝 = 0.67.
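A minimal sketch of this per-observation completeness calculation (None stands in for a missing value; the humidity and temperature readings are illustrative, not the figure's exact values):

```python
def completeness(observation):
    """q_cmp = (M_total - M_miss) / M_total, cf. Section 6.1.2:
    the fraction of expected values actually present."""
    total = len(observation)
    missing = sum(1 for v in observation.values() if v is None)
    return round((total - missing) / total, 2)

row1 = {"pressure": None, "humidity": 61.0, "temperature": 20.4}
print(completeness(row1))   # -> 0.67, one of three values missing
```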
Timeliness: 𝑇𝑓𝑟𝑒𝑞 is first calculated by subtracting the time of the preceding observation from the time of each observation. As the sensors are set to log data at an interval of one hour, we can see that 𝑇𝑓𝑟𝑒𝑞 is within the defined time limit, so the reward and punishment algorithm will reward all the measurements in this case.
The next sub-metric for Timeliness is Age. For 𝑇𝑎𝑔𝑒, the time at which the data value was generated by the sensor is subtracted from the current time. For the given observation data, the calculated 𝑇𝑎𝑔𝑒 values, in ms, are 520, 350, 432, 367 and 503.
After the calculation of 𝑇𝑓𝑟𝑒𝑞 and 𝑇𝑎𝑔𝑒, the reward 𝑅𝑑(𝑡) has to be calculated to support the computation of 𝑞𝑡𝑖𝑚. Note, however, that the reward value characterises the whole dataset, which in turn makes 𝑞𝑡𝑖𝑚 a quality of a dataset rather than of an individual value.
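The 𝑇𝑓𝑟𝑒𝑞 calculation and the per-interval reward decision can be sketched as follows (the one-hour expected interval comes from the scenario above; the 60 s tolerance and the epoch timestamps are assumptions for illustration):

```python
def t_freq(timestamps):
    """Inter-arrival times: each timestamp minus its predecessor (seconds)."""
    return [b - a for a, b in zip(timestamps, timestamps[1:])]

def within_limit(intervals, expected=3600, tolerance=60):
    """Reward/punish decision per interval: True = rewarded."""
    return [abs(i - expected) <= tolerance for i in intervals]

ts = [0, 3600, 7205, 10800, 14410]   # illustrative epoch seconds, ~1 h apart
print(t_freq(ts))                    # -> [3600, 3605, 3595, 3610]
print(within_limit(t_freq(ts)))      # all intervals within limit -> rewarded
```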
Plausibility: For Plausibility, a range for each measured quantity has to be defined. Once the range has been defined, plausibility is computed for each observation. Comparing the logged data to the plausibility range, it can be seen that all values fulfil the criteria except the pressure values in rows 1 and 3. As the value in row 1 is missing and its effect is already captured by 𝑞𝑐𝑚𝑝, it has no effect on 𝑞𝑝𝑙𝑎. However, the pressure value in row 3 is far outside the plausible range and is hence given 𝑞𝑝𝑙𝑎 = 0. Figure 8-8 shows the sensor data integrated with the related QoI values.
Figure 8-8 Weather Sensor data with Integrated QoI values
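The plausibility check above can be sketched as follows (the 950–1050 hPa pressure range and the sample values are assumed for illustration; missing values are skipped since 𝑞𝑐𝑚𝑝 already accounts for them):

```python
def plausibility(value, low, high):
    """q_pla = 1 if the value lies in the plausible range, else 0.
    Missing values return None: they are already penalised by q_cmp."""
    if value is None:
        return None
    return 1 if low <= value <= high else 0

pressures = [1013.2, None, 1009.8, 350.0, 1011.5]  # row 3 is implausible
print([plausibility(p, 950, 1050) for p in pressures])
```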
Concordance (𝑞𝑐𝑜𝑛) is not calculated here either, as it is computed through statistical analysis, which will be performed in WP5.
8.3 Example Location Indexing
As mentioned in Section 5, support for geospatial queries is one of the core IoTCrawler functionalities. The IoTCrawler Ontology links to geo:Point, which can be used to capture absolute coordinates or a relative location. Using the geo:Point concepts, the location of a given system, deployment, platform or IoTStream can be used not only for geospatial queries, but also for geospatial analysis.
This feature is demonstrated in the IoTCrawler geohash demo, which is available online13. Figure 8-9 shows a screenshot of the demo, which currently uses open data from the City of Aarhus.
Figure 8-9 IoTCrawler Geohash demo
Geohash encodes a geographic location into a short string of letters and digits. The precision increases with the length of the string (see Figure 8-10 and Figure 8-11). In the IoTCrawler demo, the latitude and longitude coordinates of the city sensors were encoded using geohash and then indexed using a prefix tree. When a user or application specifies a location or area on the map, a geohash is generated and matched against the index to return the sensors that match the requested location.
13 http://iot-crawler.ee.surrey.ac.uk/
Figure 8-10 Geohash example: precision with a 7-character string
Figure 8-11 Geohash example: precision with a 9-character string
9 Conclusion
This deliverable summarises the work done in Tasks T2.3 and T4.1. After an analysis of the state of the art, we developed an information model capable of fulfilling the requirements for a search engine for the Internet of Things.
The information model is addressed from different directions. The base model is designed to annotate data sources with additional metadata. This metadata is needed by the Indexing and Ranking components of the IoTCrawler framework, which use the crawled information and its metadata to answer user- and machine-initiated queries in the world of IoT. To store the additional metadata and quality information, the IoTCrawler framework will use a distributed metadata repository. This process is described in detail in D2.2.
The use of additional Quality of Information and Security and Privacy descriptions further extends the possibilities of the annotation model. The Quality of Information supports the indexing and ranking components by calculating QoI metrics for data sources. As an example, the concordance metric compares the data provided by one data source with that of other data sources to determine whether the data is correct. As this metric also takes heterogeneous data sources into account, a model-based analysis is introduced. The model-based analysis develops infrastructure models by combining different representations of real-world environments. By combining all these representations, an information and IoT relationship model is created that describes the relations between individual sensor entities and their mutual impacts. This work will be continued and further extended in WP7 of the IoTCrawler project.
This deliverable also addresses security and privacy annotations of data sources and extends the ontologies with additional properties to annotate, for example, the integrity or confidentiality of data sources. In addition, the required authentication and authorization annotations are added to the information model to allow IoTCrawler to consider the privacy and security requirements of users and information providers.
Finally, the implementation and experiment section provides demonstrations and benchmarks that show how the developed information model can be used. It concludes with an initial example of an annotated data source, some quality annotation examples and a demonstration of location indexing.
10 References
[1] Sundmaeker, H., Guillemin, P., Friess, P., & Woelffle, S. (2010). Vision and challenges for realising the internet of things. Cluster of European Research Projects on the Internet of Things —CERP IoT.
[2] Fuqaha, A., Guizani, M., Mohammadi, M., Aledhari, M., & Ayyash, M. (2015). Internet of things: A survey on enabling technologies, protocols, and applications. Communications Surveys & Tutorials, IEEE, 17(4), 2347-2376.
[3] Strong, Diane M., Yang W. Lee, and Richard Y. Wang. "Data quality in context." Communications of the ACM 40.5 (1997): 103-110.
[4] Stvilia, Besiki, et al. "A framework for information quality assessment." Journal of the American society for information science and technology 58.12 (2007): 1720-1733.
[5] Bisdikian, Chatschik, et al. "Building principles for a quality of information specification for sensor information." Information Fusion, 2009. FUSION'09. 12th International Conference on. IEEE, 2009.
[6] Weiskopf, Nicole Gray, and Chunhua Weng. "Methods and dimensions of electronic health record data quality assessment: enabling reuse for clinical research." Journal of the American Medical Informatics Association 20.1 (2013): 144-151.
[7] Li, Xin. "Blind image quality assessment." Image Processing. 2002. Proceedings. 2002 International Conference on. Vol. 1. IEEE, 2002.
[8] Mittal, Anish, Anush Krishna Moorthy, and Alan Conrad Bovik. "No-reference image quality assessment in the spatial domain." IEEE Transactions on Image Processing 21.12 (2012): 4695-4708.
[9] Zafarani, Reza, and Huan Liu. "Evaluation without ground truth in social media research." Communications of the ACM 58.6 (2015): 54-60.
[10] Kuemper, Daniel, et al. "Valid. IoT: a framework for sensor data quality analysis and interpolation." Proceedings of the 9th ACM Multimedia Systems Conference. ACM, 2018.
[11] Kuemper, Daniel, et al. "Monitoring data stream reliability in smart city environments." Internet of Things (WF-IoT), 2016 IEEE 3rd World Forum on. IEEE, 2016.
[12] Kuemper, Daniel, Ralf Toenjes, and Elke Pulvermueller. "An infrastructure-based interpolation and propagation approach for IoT data analytics." Innovations in Clouds, Internet and Networks (ICIN), 2017 20th Conference on. IEEE, 2017.
[13] Janowicz K, Haller A, Cox SJ, Le Phuoc D, Lefrançois M. SOSA: A lightweight ontology for sensors, observations, samples, and actuators. Journal of Web Semantics. 2018 Jul 11.
[14] Bermudez-Edo M, Elsaleh T, Barnaghi P, Taylor K. IoT-Lite: a lightweight semantic model for the Internet of Things. In2016 Intl IEEE Conferences on Ubiquitous Intelligence & Computing, Advanced and Trusted Computing, Scalable Computing and Communications, Cloud and Big Data Computing, Internet of People, and Smart World Congress (UIC/ATC/ScalCom/CBDCom/IoP/SmartWorld) 2016 Jul 18 (pp. 90-97). IEEE.
[15] Kolozali S, Bermudez-Edo M, Puschmann D, Ganz F, Barnaghi P. A knowledge-based approach for real-time iot data stream annotation and processing. InInternet of Things (iThings), 2014 IEEE International Conference on, and Green Computing and Communications (GreenCom), IEEE and Cyber, Physical and Social Computing (CPSCom), IEEE 2014 Sep 1 (pp. 215-222). IEEE.
[16] Daniele L, den Hartog F, Roes J. Created in close interaction with the industry: the smart appliances reference (SAREF) ontology. In International Workshop Formal Ontologies Meet Industries 2015 Aug 5 (pp. 100-112). Springer, Cham.
[17] Ehrig M, Sure Y. Ontology mapping by axioms (OMA). In Biennial Conference on Professional Knowledge Management/Wissensmanagement 2005 Apr 10 (pp. 560-569). Springer, Berlin, Heidelberg.
[18] Battle R, Kolas D. Enabling the geospatial semantic web with parliament and geosparql. Semantic Web. 2012 Jan 1;3(4):355-70.
[19] F. N. Natalya, L. Deborah et al., “Ontology development 101: A guide to creating your first ontology”, Stanford Knowledge Systems Laboratory Technical Report KSL-01-05 and Stanford Medical Informatics Technical Report SMI-2001-0880, 2001.
[20] SAML (Security Assertion Markup Language): http://docs.oasis-open.org/security/saml/v2.0
[21] OASIS Standard. eXtensible Access Control Markup Language (XACML) Version 3.0. January 2013: http://docs.oasis-open.org/xacml/3.0
[22] D. Boneh and M. Franklin, “Identity-based encryption from the weil pairing,” in Advances in Cryptology—CRYPTO 2001. Springer, 2001, pp. 213–229
[23] A. Sahai and B. Waters, “Fuzzy identity-based encryption,” in Advances in Cryptology–EUROCRYPT 2005. Springer, 2005, pp. 457–473
[24] J. Bethencourt, A. Sahai, and B. Waters, “Ciphertext-policy attribute-based encryption,” in Security and Privacy, 2007. SP’07. IEEE Symposium on. IEEE, 2007, pp. 321–334.
[25] J. L. Hernández-Ramos, A. J. Jara, L. Marín, and A. F. Skarmeta, “Dcapbac: Embedding authorization logic into smart things through ecc optimizations,” International Journal of Computer Mathematics, no. just-accepted, pp. 1–22, 2014.
[26] Dobson, Glen, Russell Lock, and Ian Sommerville. "QoSOnt: a QoS ontology for service-centric systems." Software Engineering and Advanced Applications, 2005. 31st EUROMICRO Conference on. IEEE, 2005.
[27] O. Sacco, and A. Passant. LDOW, volume 813 of CEUR Workshop Proceedings, CEUR-WS.org, (2011)
[28] Huang, J., and Fox, M.S., (2006). "An Ontology of Trust - Formal Semantics and Transitivity", Proceedings of the International Conference on Electronic Commerce, Association of Computing Machinery, pp. 259-270.
[29] Bill TSOUMAS and Dimitris GRITZALIS. Towards an Ontology-based Security Management. In Proceedings of the 20th International Conference on Advanced Information Networking and Applications (AINA), pp. 985-992 (2006).
[30] GETINET AYELE ESHETE. Semantic Description of IoT Security for Smart Grid. Master Thesis, University of Agder, June 2017
[31] Neisse R, Steri G, Fovino IN, Baldini G, SecKit: a Model-based Security Toolkit for the Internet of Things. In Computers & Security (2015), doi: 10.1016/j.cose.2015.06.002.
[32] M. Taherian, R. Jalili, and M. Amini, “PTO: A Trust Ontology for Pervasive Environments,” in AINA’08 Workshop proceedings, 2008, pp. 301–306.
[33] M. Tao, J. Zuo, Z. Liu, A. Castiglione, F. Palmieri, Multi-layer cloud architectural model and ontology-based security service framework for IoT-based smart homes. In Future Generation Computer Systems (2016),
[34] Mozzaquatro, Bruno Augusti et al. “Towards a reference ontology for security in the Internet of Things.” 2015 IEEE International Workshop on Measurements & Networking (M&N) (2015): 1-6.
[35] Y. Fathy, P. Barnaghi, S. Enshaeifar, and R. Tafazolli. A Distributed In-network Indexing Mechanism for the Internet of Things. In Internet of Things (WF-IoT), 2016 IEEE 3rd World Forum on. IEEE.
[36] Yasmin Fathy, Payam Barnaghi, and Rahim Tafazolli. Large-Scale Indexing, Discovery, and Ranking for the Internet of Things (IoT). ACM Comput. Surv. 51, 2, Article 29 (March 2018), 53 pages. DOI: https://doi.org/10.1145/3154525
[37] Y. Zhou, S. De, W. Wang, and K. Moessner. Enabling Query of Frequently Updated Data from Mobile Sensing Sources. In Computational Science and Engineering (CSE), 2014 IEEE 17th International Conference on. IEEE, 946–952, 2014.
[38] P. Barnaghi, W. Wang, L. Dong, and C. Wang. A Linked-Data Model for Semantic Sensor Streams. In 2013 IEEE International Conference on Green Computing and Communications (GreenCom) and IEEE Internet of Things (iThings) and IEEE Cyber, Physical and Social Computing(CPSCom). IEEE, 468–475.
[39] X. Li, Y. J. Kim, R. Govindan, and W. Hong. Multi-dimensional range queries in sensor networks. In Proceedings of the 1st International Conference on Embedded Networked Sensor Systems (SenSys). ACM, 63–75, 2003.
[40] F. Paganelli and D. Parlanti. A DHT-based discovery service for the Internet of Things. Journal of Computer Networks and Communications 2012.
[41] S. Wang, D. Maier, and B. C. Ooi. Lightweight indexing of observational data in log-structured storage. Proceedings of the VLDB Endowment 7, 7 (2014), 529–540.
[42] T. Seidl, I. Assent, P. Kranen, R. Krieger, and J. Herrmann. Indexing density models for incremental learning and anytime classification on data streams. In Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology. ACM, 311–322, 2009.
[43] J. Shieh and E. Keogh. iSAX: indexing and mining terabyte sized time series. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 623–631, 2008.
[44] K. Zoumpatianos, S. Idreos, and T. Palpanas. Indexing for interactive exploration of big data series. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, 1555–1566, 2014.
[45] F. Ganz, P. Barnaghi, and F. Carrez. Information abstraction for heterogeneous real world internet data. Sensors Journal, IEEE 13, 10 (2013), 3793–3805.
[46] Hossain, M. Anwar, Pradeep K. Atrey, and Abdulmotaleb El Saddik. "Modeling and assessing quality of information in multisensor multimedia monitoring systems." ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 7.1 (2011): 3.
[47] M. Strohbach et al, “IoTCrawler: D2.1 Requirements and Design Templates for IoT Crawling”, project report, 2018.
[48] A. Skarmeta et al, “IoTCrawler: D2.2 Security and Privacy-Aware IoTCrawler Framework”, project report, 2019.
[49] Christophe Biernacki, Gilles Celeux, and Gérard Govaert, “Assessing a mixture model for clustering with the integrated completed likelihood,” IEEE transactions on pattern analysis and machine intelligence, vol. 22, no. 7, pp. 719–725, 2000.
[50] Hyunjin Yoon, Kiyoung Yang, and Cyrus Shahabi, “Feature subset selection and feature ranking for multivariate time series,” IEEE transactions on knowledge and data engineering, vol. 17, no. 9, pp.1186–1198,2005.
[51] Daniel S Wilks, “Cluster analysis,” in International geophysics, vol. 100, pp. 603–616. Elsevier, 2011.
[52] Haklay, Mordechai, and Patrick Weber. "Openstreetmap: User-generated street maps." Ieee Pervas Comput 7.4 (2008): 12-18.
[53] Marek Kleciak. 2015. Proposed features/area highway/mapping guidelines. (Aug 2015). https://wiki.openstreetmap.org/wiki/Proposed_features/area_highway/mapping_guidelines, last visit: 2019-01-22
[54] Li, Ki-Joune, et al. "OGC IndoorGML: A Standard Approach for Indoor Maps." Geographical and Fingerprinting Data to Create Systems for Indoor Positioning and Indoor/Outdoor Navigation. Academic Press, 2019. 187-207.
Kim, Y. J., H. Y. Kang, and J. Lee. "Development of indoor spatial data model using CityGML ADE." ISPRS-International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences 1.2 (2013): 41-45.
[55] Kim, Joon-Seok, Sung-Jae Yoo, and Ki-Joune Li. "Integrating IndoorGML and CityGML for indoor space." International Symposium on Web and Wireless Geographical Information Systems. Springer, Berlin, Heidelberg, 2014.
[56] Srivastava, Srishti, Nishith Maheshwari, and K. S. Rajan. "TOWARDS GENERATING SEMANTICALLY-RICH INDOORGML DATA FROM ARCHITECTURAL PLANS." International Archives of the Photogrammetry, Remote Sensing & Spatial Information Sciences 42.4 (2018).
[57] "Understanding topology in vector data" (PDF). Department of Land Affairs, Eastern Cape, South Africa. 2009. Retrieved 2011-11-25.
[58] Santilli, S. Topology with PostGIS 2.0. 23rd PostgreSQL Sessions, Paris, 2011.
[59] Nibis Map Server, http://www.lbeg.niedersachsen.de/kartenserver/nibis-kartenserver-72321.html , last visit: 2019-01-22
[60] Sentinel Data Access Overview - Sentinel Online - ESA , https://sentinel.esa.int/web/sentinel/sentinel-data-access , last visit: 2019-01-22