
Source: uu.diva-portal.org/smash/get/diva2:956406/FULLTEXT01.pdf

IT16048

Examensarbete 30 hp, Juni 2016 (Degree Project, 30 credits, June 2016)

Handling Data Flows of Streaming Internet of Things Data

Yonatan Kebede Serbessa

Masterprogram i datavetenskap / Master Programme in Computer Science


Faculty of Science and Technology (Teknisk-naturvetenskaplig fakultet), UTH Division
Visiting address: Ångströmlaboratoriet, Lägerhyddsvägen 1, Hus 4, Plan 0
Postal address: Box 536, 751 21 Uppsala
Phone: 018 – 471 30 03
Fax: 018 – 471 30 00
Website: http://www.teknat.uu.se/student

Abstract

Handling Data Flows of Streaming Internet of Things Data

Yonatan Kebede Serbessa

Streaming data in various formats is generated at a very fast rate, and this data needs to be processed and analyzed before it becomes useless. Existing technology provides the tools to process this data and extract more meaningful information from it. This thesis has two parts: theoretical and practical. The theoretical part investigates which tools are suitable for stream data flow processing and analysis. It starts by studying one of the main streaming data sources that produce large volumes of data: the Internet of Things. The technologies behind it, common use cases, challenges, and solutions are studied. This is followed by an overview of selected tools, namely Apache NiFi, Apache Spark Streaming, and Apache Storm, covering their key features, main components, and architecture. After the tools are studied, five parameters are selected to review how each tool handles them. This can be useful when choosing a tool given the parameters and the use case at hand. The second part of the thesis involves Twitter data analysis, which is done using Apache NiFi, one of the tools studied. The purpose is to show how NiFi can be used to process data from ingestion through to storage, and how it communicates with external storage, search, and indexing systems.

Printed by: Reprocentralen ITC
IT16048
Examiner: Edith Ngai
Subject reader: Matteo Magnani
Supervisor: Markus Nilsson


Acknowledgment

It is with great honor that I express my gratitude to the Swedish Institute for awarding me the Swedish Institute Study Scholarship for my Master's studies at Uppsala University, Uppsala, Sweden. I would also like to extend my gratitude to my supervisor Markus Nilsson, for giving me the chance to work on this thesis at Granditude AB and for important feedback on this report, and to my reviewer Matteo Magnani from Uppsala University for following my progress throughout. My gratitude also goes to the whole team at Granditude for being supportive and providing a good working environment. Last but not least, I would like to thank my family and friends for their prayers and support. Thank you!


Contents

1 Introduction
  1.1 Problem Formulation and Goal
  1.2 Scope and Method
  1.3 Structure of the Report
  1.4 Literature Review

2 Internet of Things Overview
  2.1 Technologies in IoT
    2.1.1 Radio Frequency Identification (RFID)
    2.1.2 Wireless Sensor Network (WSN)
    2.1.3 TCP/IP (IPv4, IPv6)
    2.1.4 Visualization Component
  2.2 Application Areas
    2.2.1 Smart Home
    2.2.2 Wearable
    2.2.3 Smart City
    2.2.4 IoT in Agriculture - Smart Farming and Animals
    2.2.5 IoT in Health/Connected Health
  2.3 Challenges and Solutions
    2.3.1 Challenges
    2.3.2 Solutions

3 Overview of Tools
  3.1 Apache NiFi History and Overview
    3.1.1 NiFi Architecture
    3.1.2 Key Features
    3.1.3 NiFi UI Components
    3.1.4 NiFi Elements
  3.2 Apache Spark Streaming
    3.2.1 Key Features
    3.2.2 Basic Concepts and Main Operations
    3.2.3 Architecture
  3.3 Apache Storm
    3.3.1 Overview
    3.3.2 Basic Concepts and Architecture
    3.3.3 Architecture
    3.3.4 Features

4 Review and Comparison of the Tools
  4.1 Review
    4.1.1 Apache NiFi
    4.1.2 Spark Streaming
    4.1.3 Apache Storm
  4.2 Differences and Similarities
  4.3 Discussion of the Parameters
  4.4 How Each Tool Handles the Use Case
  4.5 Summary

5 Practical Analysis/Twitter Data Analysis
  5.1 Problem Definition
  5.2 Setup
  5.3 Analysis
    5.3.1 Data Ingestion
    5.3.2 Data Processing
    5.3.3 Data Storage
    5.3.4 Data Indexing & Visualization
    5.3.5 Data Result & Discussion
    5.3.6 Data Analysis in Solr

6 Evaluation

7 Conclusion and Future Work
  7.1 Future Work

References

Appendix - Apache License, 2.0


Chapter 1

Introduction

The number of devices connected to the internet is increasing each year at an alarming rate. According to Cisco, 50 billion devices are expected to be connected to the internet by 2020, and most of these connections will come from Internet of Things (IoT) devices such as wearables, smart home appliances, connected cars, and many more [1][2]. These devices produce large volumes of data at a very fast rate, and this data needs to be processed in real time to gain insight from it. There are different kinds of tools: some are designed to process only one form of data, either static or real-time, while others are designed to process both. This thesis project mainly deals with the handling and processing of real-time data flows, after a thorough study of some selected stream analytics tools.

The thesis project was carried out at Granditude AB [4]. Granditude AB provides advanced data analytics and big data solutions built on open source software to satisfy the needs of its customers. The company mainly uses open source frameworks and projects in the Hadoop ecosystem.

1.1 Problem Formulation and Goal

There are different types of data sources, namely real-time and static data sources. The data produced by real-time sources is fast, continuous, very large, and structured or unstructured. The data from a static source is stored historical data, which is very large and is used to enrich the real-time data. Since real-time data is produced quickly, it has to be processed at the rate at which it is produced, before it perishes. So one problem streaming data faces is that it may not be processed fast enough. The data coming from these two sources needs to be combined, processed, and analyzed to extract meaningful information, which in turn is vital for making better decisions. But this is another problem area for stream data flow processing: when data from the two sources is not combined, due to poor integration of the different sources (static and real-time) or of data coming from different mobile devices, the result is data that is not analyzed properly and not enriched with historical data, and hence poor results. Another problem that makes the handling or processing of streaming data difficult is the inability to adapt to changing conditions in real time, for example when errors occur. There are many tools which mainly process stream data; but studying, understanding,


and using all these platforms as they come is not scalable and is not covered in this work. This project aims to process a flow of streaming data using one tool. To achieve this, an overview of selected tools in this area is first given, and then the tool to be used in the analysis is chosen after a review and discussion of the tools using certain parameters and a use case. This thesis project generally tries to answer questions such as:

• What tools currently exist for data extraction, processing, and analysis? - studying some of the selected tools in this area: architecture, key features, components

• Based on the study, which tool is good for a particular use case?

• Which tool best handles both static and real-time data produced for analysis?

• Which tool makes it easy to change the flow?

The defined use case consists of both real-time and static data to be processed and analyzed. The real-time data is tweets from the Twitter API, and the static data is previously stored tweets from the NoSQL database HBase. The two data sources need to be combined and filtered based on given properties. Based on the filtered result, incorrect data is logged to a separate file, while correct data is stored to HBase. Finally, some of the filtered data is indexed into Solr, which is an enterprise search platform. In this process, we will see what happens to each input source before and after they are combined. What techniques are used to merge and filter, and what priority levels should be given to each source, are also some of the questions answered at this stage. The basis for separating the data into correct and incorrect is also defined.
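The merge-and-route step described above can be sketched in plain Python. This is an illustrative toy, not the thesis's actual NiFi flow: the field names ("text", "lang", "user") and the validity rule are assumptions made for the example.

```python
# Hypothetical sketch of the routing logic: merge tweets from a live stream
# with historical tweets from storage, validate each record against required
# properties, and split them into "correct" and "incorrect" outputs
# (correct -> store, incorrect -> log). Field names are invented here.

REQUIRED_FIELDS = ("text", "lang", "user")  # assumed filter properties

def is_valid(tweet: dict) -> bool:
    """A record counts as 'correct' if every required field is present and non-empty."""
    return all(tweet.get(f) for f in REQUIRED_FIELDS)

def route(live_tweets, historical_tweets):
    """Merge both sources, then split records into correct and incorrect lists."""
    correct, incorrect = [], []
    for tweet in list(live_tweets) + list(historical_tweets):
        (correct if is_valid(tweet) else incorrect).append(tweet)
    return correct, incorrect

good, bad = route(
    [{"text": "hello", "lang": "en", "user": "a"}],  # e.g. from the Twitter API
    [{"text": "", "lang": "en", "user": "b"}],       # e.g. from HBase
)
```

In a real NiFi flow, the same split would be expressed with merge and route processors rather than hand-written code; the sketch only shows the decision logic.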

1.2 Scope and Method

The project is mainly divided into two parts: a theoretical part and a practical/analysis part. In the theoretical part, IoT is studied, since it involves many devices that produce large amounts of data at a fast rate. In addition, the challenges it faces, the solutions that can be applied, and common existing IoT use cases are covered. Next, an overview of selected tools/platforms is given, consisting of a study of their main components, features, and common use cases. The tools are then further reviewed by defining a use case and certain parameters and seeing how each tool handles the parameters defined. Finally, based on the result of the discussion, one tool is selected for the analysis part of the project. The tools were chosen based on the requirement that they should be data processing or streaming tools within the Hadoop framework. Based on this requirement, the tools chosen are:

• Apache Spark Streaming, Version 1.6.0

• Apache Storm, Version 0.9.6

• Apache NiFi, Version 0.6.0, HDF 1.2.0

In the practical part, a particular use case is used to showcase how the analysis is doneusing one of the tools studied.


1.3 Structure of the report

Here the structure of the report is briefly outlined. Chapter 2 gives an overview of the Internet of Things, comprising the technologies that make up IoT and common use cases. The challenges and solutions of IoT are also discussed briefly. Chapter 3 deals with an overview of the selected tools (Apache NiFi, Apache Spark Streaming, Apache Storm). It discusses the key features of each tool, their architecture, and the different components/elements they have. Chapter 4 is a continuation of the previous chapter; it defines certain parameters and a use case to discuss the characteristics of the tools and see how each of them behaves. Finally, based on the discussion, one tool is selected for use in the practical part. Chapter 5 discusses the practical phase of the project, which uses the chosen tool for Twitter data analysis. Chapter 6 discusses the evaluation of the tool with respect to performance. Finally, conclusions and future work are outlined in Chapter 7.

1.4 Literature Review

Many of the papers discuss the technologies involved, common use cases, the challenges IoT is facing, and solutions to them. For example, the giant technology company Ericsson is engaged in the IoT Initiative (IoT-i), with the objective of increasing the benefits and possibilities of IoT and of identifying and proposing solutions to tackle the challenges, with a team comprising both industry and academia [1]. In [2] Miorandi et al. present a survey of technologies, applications, and research challenges for the IoT. The survey also suggests RFID as the basis on which the IoT technology will spread widely. In [3] a Cisco white paper defines IoT as the Internet of Objects that changes everything, considering the different ways it impacts our lives, such as in education, communication, business, science, and government. Different IoT application areas are also discussed in the report "Unlocking the Potential of the Internet of Things" by the McKinsey Global Institute [4], which describes a broad range of potential applications with homes, vehicles, humans, cities, and factories as settings. In [5] the white paper discusses how IoT is being used in health care to improve access to care, increase quality, and reduce the cost of care. Some of the products described include "Massimo Radical-7" for clinical care and the "Sonamba Daily Monitoring" solution for early intervention/prevention, which can be used as wearable devices. Weber approaches IoT from the perspective of an Internet-based global architecture and discusses its significant impact on the privacy and security of all stakeholders involved [6]. Spark Streaming uses Discretized Streams (DStreams), defined by Zaharia et al. in [7] as a stream programming model that is capable of integrating with batch systems and provides consistent and efficient fault recovery. Since Apache NiFi is a new framework/tool for data flow management and processing, papers studying its features, programming models, and so on could not readily be found. The study of the tool was therefore made mostly by referring to its project page [8].
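The DStream idea, chopping a continuous stream into small batches that ordinary batch logic then processes one by one, can be sketched in plain Python. This is a toy illustration of the model, not Spark's actual API: Spark Streaming batches by time interval, while the sketch batches by record count to stay self-contained.

```python
# Toy illustration of the discretized-stream (DStream) model: a continuous
# sequence of records is grouped into fixed-size micro-batches, and a batch
# function is applied to each one in turn.

def discretize(stream, batch_size):
    """Yield successive micro-batches (lists) taken from an iterable stream."""
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final partial batch
        yield batch

def process_batch(batch):
    """Ordinary batch logic applied per micro-batch, here a word count."""
    counts = {}
    for word in batch:
        counts[word] = counts.get(word, 0) + 1
    return counts

results = [process_batch(b) for b in discretize(["a", "b", "a", "c"], 2)]
# results: [{"a": 1, "b": 1}, {"a": 1, "c": 1}]
```

Because each micro-batch is an ordinary batch job, a lost batch can simply be recomputed, which is the source of the consistent fault recovery attributed to DStreams in [7].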


Chapter 2

Internet of Things Overview

The Internet of Things (IoT), as defined by the International Telecommunication Union (ITU) [9], is a global infrastructure for the information society, enabling advanced services by interconnecting things based on existing and evolving interoperable Information and Communication Technology (ICT). The term was first coined by Kevin Ashton in 1999 at the MIT Auto-ID Labs [10]. Internet of Things (IoT), as the name stands, is a combination of two words: Internet and Things [11]. The Internet is a network of networks interconnecting millions of computers globally using a standard communication protocol, TCP/IP. A Thing is any physical or virtual thing that can be identified, distinguished, and given an address, as in [11]. Examples of Things include humans, cars, food, different machines, and electronic devices that can be sensed and connected [11]. So, combined, the Internet of Things refers to a technology that seamlessly interconnects these "Things" anywhere, anytime, using existing and evolving communication technologies and standards, and that is capable of exchanging information, data, and resources between them. The Internet of Things aims at making these things smarter, in a way that lets them obtain information with little or no human intervention. It thereby allows communication Human-to-Human (H2H), Human-to-Things (H2T), and Things-to-Things (T2T), providing a unique identity to each and every object, as described in [11]. In the subsequent subsections, the technologies used in IoT, common use cases, the challenges IoT is currently facing, and their solutions are discussed.

2.1 Technologies in IoT

Different kinds of technologies are used in IoT applications. Basically, they can be categorized as hardware, middleware, and a presentation component [12]. The hardware components include things such as embedded sensors, while the middleware consists of application tools for analysis. The presentation component concerns how the analyzed data is presented to the end user, i.e. visualization on different platforms. Below are some of the main technologies behind IoT implementations.


2.1.1 Radio Frequency Identification (RFID)

RFID is a wireless microchip technology that makes it possible to uniquely identify "Things". It was first developed at the Auto-ID lab at MIT in 1999 [10]. It is an easy, reliable, efficient, and secure technology, and it is cheap compared to other devices. It consists of a reader and one or more tags, which can be active, passive, or semi-passive depending on their computational power and sensing capability [12]. A passive RFID tag does not use a battery, while an active one uses its own battery. RFID has various uses, such as personal identification, distribution management, tracking, patient monitoring, vehicle management, and so on.

2.1.2 Wireless Sensor Network (WSN)

The Wireless Sensor Network is also one of the main technologies used in IoT; it can communicate information remotely in different ways. It has smart sensors with microcontrollers that make it possible to gather, process, analyze, and distribute measurements such as temperature fluctuations, sound, pressure, and heart rates instantly, in real time [11].

2.1.3 TCP/IP (IPv4,IPv6)

TCP/IP is the protocol suite that identifies computers on a network. There are two versions of the IP protocol, namely IPv4 and IPv6. IPv4 is currently the most widely used, but most of its address space has been depleted. For an IoT that interconnects anything, IPv4 is not a good choice because of its small address space. The newer version, IPv6, is a good solution for a future in which everything is connected, because its very large address space can provide an address for, and uniquely identify, almost anything [11]. Even though it is not yet widely used, it is the future for IoT.
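The difference in scale is easy to quantify: IPv4 uses 32-bit addresses and IPv6 uses 128-bit addresses, so the sizes of the two address spaces follow directly from the bit widths.

```python
# IPv4 addresses are 32 bits and IPv6 addresses are 128 bits, so the total
# number of distinct addresses in each space is 2**bits.
ipv4_space = 2 ** 32    # 4,294,967,296 — about 4.3 billion addresses
ipv6_space = 2 ** 128   # about 3.4 * 10**38 addresses

# IPv6 offers 2**96 (roughly 7.9 * 10**28) times as many addresses as IPv4,
# far more than enough for the 50 billion connected devices forecast for 2020.
ratio = ipv6_space // ipv4_space
```

This arithmetic is why the text above calls IPv6 the future for IoT: IPv4's roughly 4.3 billion addresses are already fewer than the number of devices expected to be connected.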

2.1.4 Visualization Component

This is also an important component of IoT, because without good visualization, interaction between the user and the environment is not achievable [12]. It should be noted that when designing any kind of visualization for IoT, the way the analyzed data is presented matters for making better decisions. That means easy-to-understand, user-friendly interfaces need to be designed, using already existing technologies such as the touch screens of smartphones, tablets, and other devices, according to the needs of the end user.


2.2 Application Areas

IoT is a future of technology in which all things are interconnected to exchange data and provide information for the good of society. There are many application areas that have already entered the IoT market, and some that are not widely deployed yet. Examples of common IoT application areas include transportation, which has many domains, such as traffic management, vehicle parking, highway and road construction, and smart vehicles for the public. IoT can also make infrastructure available at reduced cost and resource usage by providing smart metering for water and electricity utilities and smart grid systems. All these applications, and many others, show that IoT is being applied in all kinds of areas for better services, and promise that it will be used even more widely in the future. In the next subsections, selected IoT use cases are discussed briefly.

2.2.1 Smart Home

Smart Home is a technology that enables almost all home appliances used on a daily basis to be connected to the internet or to each other [13]. This helps provide better services and act according to the preferences of the owner. The home appliances may include Heating, Ventilation, and Air Conditioning (HVAC) systems, microwave ovens, lighting systems, refrigerators, garages, smart TVs, and so on. Examples include controlling the temperature of the house and the lighting systems in the rooms, and checking whether the oven is on or off. These things can be deployed in a smart home environment and can also be monitored by voice control from a smartphone (Siri and HomeKit from Apple, for example) [14].

2.2.2 Wearable

This area of IoT is also becoming popular, as more and more wearable devices are manufactured. A wearable is a small mobile electronic device that comes with wireless sensor communication capability to process and gather information [15]. Wearable devices can work by themselves or by being connected to a smartphone via Bluetooth. Examples include smart watches, wrist band sensors, and rings, to mention a few. For example, smart watches, connected via Bluetooth, provide a variety of uses for individuals, such as email notifications and alerts for messages and incoming calls. The other kind of wearable in wide use is the wrist band sensor, which can be applied to interactive exercise and activity tracking (heart beats, pulse rates, etc.) [16]. Examples include the Apple smart watch, the Samsung Gear smart watch, and Google Glass.

2.2.3 Smart City

Smart City is a technology that delivers smart urban services to the general public while maintaining a safer environment and minimizing cost. It aims at using the available resources wisely and effectively to provide better services while reducing operational costs [17]. The different city areas where IoT can be deployed include e-governance, traffic management, parking services, street lighting, and many more [17]. It can


also be used to reduce the pollution arising from traffic congestion in bigger cities, hence playing a vital role in the sustainability of the city.

2.2.4 IoT in Agriculture - Smart Farming and Animals

This IoT application area is promising, especially in countries whose economies mainly depend on agricultural production. It is a technology in which traditional agricultural equipment such as tractors carries smart sensors that measure the temperature and humidity of the soil and the water distribution. It also includes animals on agricultural farms, which are identified using RFIDs [18][19]. It enables animals to be traced and detected in real time when an outbreak of a contagious disease occurs. This technology can also be used for preventive maintenance of the equipment well in advance. It revolutionizes traditional farming and takes it to the next level: using the data generated from the embedded smart sensors to obtain better yields and make better decisions, such as which seeds to plant, the expected crop yields, and water utilization levels. It also enables farmers to deliver their products directly to consumers [19].

2.2.5 IoT in Health/Connected Health

This is one of the most widely used IoT use cases. It is a technology that enables hospitals and patients to be connected remotely. Connected health technology keeps patients connected 24/7, which enables monitoring their health conditions and sending data to the hospital, which in turn helps doctors flexibly control and monitor their patients' well-being. This can be achieved by using smartphones and wearables, in the form of implantables that work remotely in the body or palm, so that these devices transfer the generated data to the doctor's end for further processing, notifying of emergency conditions and tracing symptoms of health threats well in advance [5]. This is vital for both hospitals and patients. For the former, since the ratio of doctors to patients is not evenly distributed, this technology enables doctors to follow more patients from wherever they are, which was not possible before. The other benefit is that, since the data is gathered by the devices, errors are less likely than with data entered by humans, and the data is readily available to doctors, speeding up decision making. For the patients, it is good for emergency cases, and it enables preventive care, especially for elderly people [5].


2.3 Challenges and Solutions

As there are many emerging applications and evolving technologies in the IoT field, its challenges have also grown with these trends. In the following subsections, the major challenges and solutions are discussed.

2.3.1 Challenges

There are many challenges the IoT field is currently facing. Bandwidth and battery problems in small devices, and power disruptions related to the devices and their configurations, are some of the problems the field faces [12][20]. Apart from these, it can be generalized that the major challenges facing IoT are: data security, data control and access, the lack of uniform standards/structures, and the large volume of data produced.

1. Data Security: Data security in terms of IoT is defined as the necessity to ensure the availability and continuity of an application and to avoid potential operational failures and interruptions of internet-connected devices. The threats can come at different levels, such as the device level or the network or system/application level. They also come in a variety of ways, such as arbitrary attacks like Distributed Denial of Service (DDoS) and malicious software [20]. Different devices such as sensors, RFID tags, and cameras, or network services (WSN, Bluetooth), can be vulnerable to such attacks and can in turn be used as botnets [21]. Home appliances such as refrigerators and TVs can also be used as botnets to attack these and similar devices.

2. Data Control & Access/Privacy: It is known that IoT applications produce large volumes of data at a fast rate from different devices, and that these smart devices collect and process personal information [22]. But knowing what risks these devices carry, how the data is produced and used, who owns and controls it, and who has access to it are some of the privacy questions one needs to ask when using the services of these devices. The data produced by these devices obviously raises privacy concerns among users. These concerns usually come in two forms [20]: first, personal information about the individual is collected and identified, and the owner does not know who accesses it or to whom it is known; second, the individual's physical location can be traced and their whereabouts known, hence violating privacy. This shows that privacy is one of the basic challenges in the IoT field, as it is everywhere in the IT field.

3. No Uniform Standards/Structures: IoT comprises different components, such as hardware devices, sensors, and applications. These components are manufactured and developed by different industries. When these components are designed to be used in IoT solutions, they need to exchange data. Problems arise when they try to communicate, because the standard used in one product is not used in another, which creates communication or data exchange problems that may hinder the expansion of IoT products. The problem is not only in the design of devices, but also in the internet protocols used today. Currently working standard protocols for the internet are not compatible with IoT implementations [20], so

8

Page 17: Handling Data Flows of Streaming Internet of Things Datauu.diva-portal.org/smash/get/diva2:956406/FULLTEXT01.pdf• Apache Spark Streaming, Version 1.6.0 • Apache Storm, Version

sometimes ad-hoc protocols from different vendors are being used for example inwireless communications. The absence of uniform standard/structure for differenttechnologies used in IoT is one challenge for the field.

4. Large Volumes of Data Produced : This is another challenge in IoT: the data produced by various sensors and mobile devices is heterogeneous, continuous, very large, and fast. These data need to be processed instantly, before they expire. Managing such data is beyond the capacity of traditional databases. As the number of connected devices is expected to increase in the future, the data produced by these devices will grow exponentially, so good analytics platforms and storage systems are needed.

2.3.2 Solutions

As the challenges of IoT are large, solutions that address these challenges should be developed and put to work, to provide better services that are trusted by all parties, such as users and companies. Some of the solutions include using standard encryption technologies that comply with IoT. Since the devices are mobile, the encryption technologies to be used must be fast and consume little energy, because energy consumption is another problem of IoT devices. Using authentication and authorization schemes to control the access levels at which data can be viewed is another solution that should be considered when designing IoT applications.

Some solutions to the problems discussed above include:

1. Having Uniform Shared Standards/Structures : This helps in that, given standard protocols or structures, vendors follow the same structure, and no problem arises when the different parts developed by different organizations need to be integrated. For example, if hardware and sensor device designers, network service providers, and application developers all follow some standard for IoT, it will greatly reduce the integration and compatibility problems that would otherwise arise [20].

2. Making Strong Privacy Policy for IoT : A strong privacy policy for IoT on how to collect and use individual data, in a way that is transparent to the user, increases the user's trust in the service and makes him/her aware of how the data is used and how to control it. That is, the user should be put at the center of deciding what personal information goes where and how it is used [23].

3. Using Anonymization : Anonymization is a method of modifying personal data so that nothing can be learned about the individual. It does not only include de-identification by removing certain attributes; it must also prevent records from being linked back to individuals, which becomes harder as large volumes of data are produced [24]. Methods such as k-anonymity can be used.

4. Robust storage systems : As the data produced by IoT devices is large in volume, fast and powerful storage mechanisms are needed, such as fault-tolerant NoSQL databases, which can handle very large data volumes, even more than is currently needed.
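The k-anonymity method mentioned under Anonymization (item 3) can be illustrated with a minimal sketch: a table is k-anonymous with respect to a set of quasi-identifiers if every combination of their values occurs at least k times. The record layout and attribute names below are hypothetical, chosen only for illustration.

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every combination of quasi-identifier values
    occurs in at least k records (a basic k-anonymity check)."""
    combos = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in combos.values())

# Hypothetical sensor readings with ages and zip codes generalized
# into ranges/prefixes, as a k-anonymization step might produce.
records = [
    {"age": "20-30", "zip": "751**", "reading": 7.1},
    {"age": "20-30", "zip": "751**", "reading": 6.4},
    {"age": "30-40", "zip": "752**", "reading": 5.9},
    {"age": "30-40", "zip": "752**", "reading": 8.2},
]

print(is_k_anonymous(records, ["age", "zip"], k=2))  # True: each combination appears twice
```

A record with a unique ("age", "zip") combination would make the check fail, which is exactly the situation that allows re-identification by linking.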


Chapter 3

Overview of Tools

In this chapter, three tools that are mainly used in the analysis of streaming data are studied. The tools chosen are Apache NiFi, Apache Spark Streaming, and Apache Storm. Their general overview and features are reviewed, which serves as a basis for the study of their similarities and differences in the next chapter.

3.1 Apache NiFi History and Overview

Apache NiFi, originally named "Niagara Files", was first developed by the National Security Agency (NSA) in the United States in 2006 and was used there for 8 years. It was first developed to automate data flow between systems [25]. In November 2014 it was donated to the Apache Software Foundation (ASF) through the NSA's Technology Transfer Program. In July 2015 it became a Top-Level Project at the ASF, and six releases of NiFi exist at the time this paper is written (the latest being 0.6.0).

Data flow is an automated and managed flow of data between systems: information flows from one system to another, where one can be considered the producer and the other the consumer. These flows of information need to be guaranteed, to make sure that they reach the intended parties at the time needed. But it is clear that data flow between systems nowadays faces many challenges. It is much more of a challenge today than in earlier times, because organizations then did not have very large numbers of systems exchanging information; they had only one or two systems, which were not too complex to integrate and exchange data/information between. Currently, data flow management systems face many challenges in handling different sets of data.

The major problems of data flow are:

• Integration problem: This is a problem because the different systems existing in organizations have different architectures, and even newly built systems may not consider the architectures of the existing ones. Integrating the different systems existing in an organization benefits both the organization and its users. For the company, integrated systems mean that information can flow easily between them, which in turn supports better decision making.


For the users, it means being able to get what they request in a fast and easy way, without knowing exactly where each module or function resides. An integrated system with good data flow provides this to the users efficiently and effectively.

• Priorities of organizations change over time: What was considered of little value at one time may be considered valuable next, and must be taken into account when making decisions. Under such conditions, the data flow system must be robust and fast enough to handle new changes as they occur and adapt to existing ones without affecting other flows.

• Compliance and Security: This is a problem for data flow management systems because whenever organizational policies or business decisions change, there is a possibility that data security will be mistreated while trying to adhere to the new rules or decisions. Systems must therefore always be kept secure for users, whether or not organizational policies or business decisions change, which again enhances data flow management.

NiFi supports running environments ranging from a laptop to many enterprise servers, depending on the size and nature of the data flow involved. It also requires a large or at least sufficient amount of disk space, as it has several repositories (content, flow file, provenance) whose contents are stored on disk. It can run on any machine with a major operating system (Windows, Linux, Unix, Mac OS), and its web interface renders on the latest major browsers such as Internet Explorer, Firefox, and Google Chrome.

3.1.1 NiFi Architecture

NiFi supports both standalone and cluster mode processing. Their features are discussed below.

Standalone Architecture

NiFi requires Java: it runs inside a JVM, and the amount of memory it uses depends on the JVM configuration. It has a web server inside the JVM, where it displays its components in a user-friendly UI. The flow file, content, and provenance repositories are all stored in local storage.


The different parts of the architecture, as shown in Figure 3.1 from [8], are:

• Flow Controller: the main part of the NiFi architecture; it controls thread allocation for the different components.

• Processor: the main building block of NiFi; it is controlled by the Flow Controller.

• Extensions: operate within the JVM and hold the different extension points in NiFi.

• Flow File Repository: where NiFi keeps track of the state of active flow files. It uses Write-Ahead Logging and lives on a specified disk partition.

• Content Repository: holds the actual content of a given flow file; the content is stored in the file system.

• Provenance Repository: holds information about the data: what happened to it, and how and where it moved over some period of time, beginning from its origin. All of this information is indexed, which makes searching easy.

Figure 3.1: NiFi standalone Architecture - source [8]

Cluster Architecture

NiFi can also be used in a cluster, where the NiFi Cluster Manager (NCM) is the master and the other NiFi instances connected to it are the Nodes (slaves). In this model, it is the Nodes that do the actual processing of the data, while the NCM manages and monitors the changes.

Figure 3.2: NiFi Cluster Architecture - source [8]

A NiFi cluster uses a Site-to-Site protocol, which enables it to communicate with other NiFi instances, other clusters, or other systems such as Apache Spark. Figure 3.2 shows that the Nodes communicate only with the NCM and not with each other; the communication between the Nodes and the NCM can be by Unicast or Multicast. When a Node fails, the other Nodes do not automatically pick up its load; rather, it is the NCM that recalculates the load balance and distributes the work to another Node. The other functions of the NCM are to communicate data flow changes to all the Nodes and to receive health information (whether they are working properly) and status information from the Nodes. The Nodes are regularly checked by the master for load balancing, so that they are given flow files to process according to their current load. As many Node instances as needed can be added horizontally to the cluster, as long as the NCM is working and operating.

3.1.2 Key Features

Apache NiFi has many useful features that allow it to provide better flow management mechanisms than other systems; it can be said to have been designed by learning from the drawbacks of those systems. These features can also be considered advantages it has over them.


The points below are some of the main features of NiFi [8][26].

• Flow-specific Quality of Service (QoS) : This comprises guaranteed delivery vs. loss tolerance, and latency vs. throughput. The QoS achieved for a flow relates to how the flow is configured: for high throughput, for low latency, or for loss tolerance. For flows where data loss is unacceptable, NiFi provides guaranteed delivery, achieved by using both the content repository and persistent Write-Ahead Logging (WAL). It keeps track of changes made to a flow file's attributes and to the connection the flow file belongs to [27], writes these changes to the log, and only then writes the contents to the actual disk. This is important for recovery and prevents data loss.

Latency is the time required to process a flow file from beginning to end, while throughput is the amount of work completed in a given time; it also describes how many flow files are processed at once, i.e. micro-batched, within a specified time. Every time a processor finishes processing a flow file, the repository must be updated before the file is sent to the next component, which is expensive and takes time. Since this process is expensive, it is better to do more work per update, i.e. to micro-batch more flow files for processing at once. The drawback is that the next component or processor cannot start until the repository is updated, and it has to wait until those flow files are processed, which introduces latency. NiFi lets the user trade lower latency against higher throughput when configuring a processor in its settings tab, so a suitable balance can be chosen to get the best result for the need at hand.

• Friendly User Interface (command and control): NiFi provides a friendly User Interface (UI) running in a browser, designed using HTML5 and drag-and-drop mechanisms with JavaScript technologies. The UI is useful especially when data flows become complex and managing them from a console would be very tough. NiFi achieves this through easy command-and-control mechanisms that make it possible to change a specific flow file or processor while controlling only the affected parts; the effect is seen in real time, and other flow files or processors are not affected at all.

• Security: One of the concerning issues in other flow management systems is security. NiFi provides security in two forms: system-to-system and user-to-system mechanisms. For the first, it enables encryption and decryption of each of the flows involved, and when communicating with other NiFi instances or other systems it can use encryption protocols such as 2-way SSL. For the second, user-to-system, it provides 2-way SSL authentication and also controls users' access levels through privilege levels such as Read Only, Data Flow Manager (DFM), Provenance, and Admin.

• Dynamic Prioritization: NiFi has a queuing mechanism that retrieves and processes flows according to the specified queue prioritization schemes. Prioritization can be based on size or time, and custom prioritization schemes are also allowed. The need to prioritize queues arises from constraints on bandwidth or other resources, or from how critical an event is. Being able to set priorities according to the properties or needs at hand is helpful, because priorities set at one time may not be good enough at other times and can affect decisions if not set properly; hence NiFi allows dynamic priority setting for different scenarios according to the need.

• Data Provenance: Data provenance is one of the most important features of NiFi, enabling the flow of data to be managed and controlled from beginning to end by automatically recording each performed action. From the data provenance page, the user/DFM can see what happened to the data: where it came from, where it went, what was done with it, and so on. This is useful when problems occur, because it increases traceability and helps track down issues. It also makes it possible to see the lineage, or flow hierarchy, of the data.

• Extensibility: Another feature of NiFi is the extensibility of its various components, such as Processors, Reporting Tasks, Controller Services, and Prioritizers [8]. This is useful because it enables users and organizations to design their own extension points/components and embed them in NiFi to gain better service in their own specializations. One of the most widely used extensible components is the processor: many organizations design their own processors to ingest data into, or egress data from, NiFi. For example, in IoT applications data is produced by different devices in different formats, and these formats need to be processed to gain insight from them; NiFi's extensibility can be used to design processors that ingest these different formats into NiFi, where its built-in processors then process the ingested data according to the need. This makes extensibility one of NiFi's key features.
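The guaranteed-delivery mechanism described under QoS above, where changes are durably logged before being applied, follows the general write-ahead-logging pattern. The sketch below is a minimal plain-Python illustration of that pattern, not NiFi's actual implementation; all class and variable names are hypothetical.

```python
import json
import os
import tempfile

class WriteAheadState:
    """Minimal write-ahead-logging sketch: every update is appended to a
    log file before it is applied to the in-memory state, so the state
    can be recovered by replaying the log after a crash."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.state = {}
        self._replay()

    def _replay(self):
        # Rebuild in-memory state from the durable log, if one exists.
        if os.path.exists(self.log_path):
            with open(self.log_path) as f:
                for line in f:
                    key, value = json.loads(line)
                    self.state[key] = value

    def update(self, key, value):
        # 1) Durably record the intended change first...
        with open(self.log_path, "a") as f:
            f.write(json.dumps([key, value]) + "\n")
            f.flush()
            os.fsync(f.fileno())
        # 2) ...then apply it in memory.
        self.state[key] = value

log = os.path.join(tempfile.mkdtemp(), "wal.log")
s1 = WriteAheadState(log)
s1.update("flowfile-1", "queued")
s1.update("flowfile-1", "processed")

# Simulate a crash: a fresh instance recovers its state from the log.
s2 = WriteAheadState(log)
print(s2.state)  # {'flowfile-1': 'processed'}
```

The essential point is the ordering: because the log write happens before the in-memory update, a crash between the two steps loses nothing that was acknowledged.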

3.1.3 NiFi UI components

NiFi provides visual command and control for creating, managing, and monitoring data flows. After the user starts the application, entering the URL https://<hostname>:8080/nifi in a web browser brings up a blank NiFi canvas the first time. The <hostname> is the name or address of the server that the NiFi instance is running on, and 8080 is the default port number for NiFi. The points below describe the different components of the UI, as shown in Figure 3.3.

• Default URL address: As shown in Figure 3.3, since the machine is running locally, the hostname is "localhost", with the default port number 8080, which can be changed in the "nifi.properties" file in the NiFi directory.

• System Toolbar: NiFi has four system toolbars, namely the Component, Action, Search, and Management toolbars, as shown in Figure 3.3.

– Component: consists of the different components, such as Processors, Input and Output Ports, Process Groups, Remote Process Groups, Funnel, Templates, and Label.


Figure 3.3: NiFi UI canvas

– Action: consists of buttons to perform actions on a particular component. Some of the actions are Enable, Disable, and Start if the process is stopped; Stop if the process is started; Copy to copy the particular component; Group to group different components together; and so on.

– Search: consists of a search field for finding components existing on the canvas.

– Management: consists of buttons used by different users (DFMs, Admins) according to their privilege levels. It includes the Bulletin Board, Summary page, Provenance, and so on.

• Status Bar: In Figure 3.3 above, the Status Bar includes the Status and Component Info areas labeled in the figure. The Status shows the active threads, if threads are in use; the total number of queued flow files between components; the existing clusters and how many nodes are connected; and the timestamp of the last refresh. The Component Info shows how many processors or other components are running, stopped, invalid, disabled, and so on.

• Navigation Pane and Bird's Eye View: The navigation pane NiFi provides enables navigating, zooming in, and zooming out on the components in the canvas. The Bird's Eye View allows the user to view the data flow easily and quickly.
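As a side note on the default URL above, the web host and port are read from the "nifi.properties" file; the relevant entries look roughly like the following (property names as in NiFi 0.x, values illustrative):

```properties
# nifi.properties - web server section (illustrative values)
nifi.web.http.host=localhost
nifi.web.http.port=8080
```

Changing nifi.web.http.port and restarting NiFi changes the port part of the canvas URL accordingly.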

3.1.4 NiFi Elements

NiFi has different elements, some of which are discussed further in the subsections that follow; Figure 3.4 shows the main components that it supports.

Figure 3.4: NiFi main components

1. User Management: NiFi provides a mechanism for user management and for controlling access privileges. It supports user authentication either by client certificates or by a username/password mechanism. Authenticated users use HTTPS to access data flows in a browser. In order to use the username/password mechanism, a login identity provider must be configured in the "nifi.properties" file, and the provider to be used must be indicated, i.e.

• nifi.login.identity.provider.configuration.file

• nifi.security.user.login.identity.provider

Likewise, for controlling access levels, NiFi provides a pluggable authorization mechanism that enables users to access the system and be assigned different roles. For this, the "nifi.properties" file is configured with these two properties:

• nifi.authority.provider.configuration.file - specifies the configuration file for authorization providers

• nifi.security.user.authority.provider - which provider to use from the configured ones

It also provides Roles for controlling authorization; some of the Roles it provides are shown next, and users can have different Roles assigned to them.

• Administrator: configures user accounts and the size of thread pools

• Data Flow Manager (DFM): manipulates the data flow, e.g. designing, ingesting, routing, ...

• Read Only: may only view the data flow, not change it

• Provenance: able to query the provenance repository and view lineage; able to view and download the content of flow files; not able to replay flow files in case of failure or during troubleshooting
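Putting the four properties of this section together, the corresponding "nifi.properties" entries might look roughly as follows; the provider identifiers and file paths here are illustrative, not prescriptive:

```properties
# nifi.properties - security section (illustrative values)
# Login identity provider: configuration file and which provider to use
nifi.login.identity.provider.configuration.file=./conf/login-identity-providers.xml
nifi.security.user.login.identity.provider=ldap-provider
# Authority (authorization) provider: configuration file and which provider to use
nifi.authority.provider.configuration.file=./conf/authority-providers.xml
nifi.security.user.authority.provider=file-provider
```

The referenced XML files then define the providers themselves (for example an LDAP login provider and a file-based authority provider).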

2. Processor: The Processor is the main building block of a NiFi data flow. It is responsible for ingesting data from other systems or NiFi instances, routing it, transforming it, and finally outputting the data to other systems. It is also the main extension point, which organizations can implement to input/output their flow files using NiFi. Figure 3.5 below describes its anatomy:

Figure 3.5: NiFi Processor Anatomy

• Processor Type and Name: As the name implies, the Processor Type specifies the type of the processor used; in this example it is the "PutFile" processor, which is responsible for writing flow files to disk. The name of the processor is shown in bold; by default it takes its type name as its name, but it can be renamed in the settings tab of the processor's configuration page. In this example the name is "Save Matched Tweets", a PutFile-type processor that stores matched tweets to disk.

• Status Indicator: This is the icon at the top-left corner of the processor, showing its current status. Different status indicators are available depending on the validity and state of the processor.

– Running: the processor is running; it has a green play icon.

– Stopped: the processor is currently stopped; it has a red icon.

– Invalid: the processor cannot be started because properties that need to be set are missing. The missing properties can be seen by hovering over the icon, which is a triangle with an exclamation mark inside it.

– Disabled: the processor is disabled and cannot be started until it is enabled.

• Flow Statistics: shows the statistics for the data flow over the past 5 minutes, in the In, Read/Write, Out, and Tasks/Time fields. These show, respectively: the number and total size of flow files ingested into the processor; the total size of flow file content read from and written to disk; the number and total size of flow files transferred to the next processor/component; and the number of tasks this processor performed and the time taken to perform them, all over the past 5 minutes.


3. Input/Output Ports: An Input Port is one of the components of NiFi; it is used for transferring data coming from other components or systems into different Process Groups. An Output Port is used for transferring data from Process Groups to destinations outside a Process Group, or to other components/systems such as Apache Spark.

4. Process Group and Remote Process Group: A Process Group is another NiFi component that logically groups a set of components, which makes maintenance easier; it prompts the user for a unique name and provides a kind of abstraction. A Remote Process Group (RPG) follows the same idea as a Process Group, but is used to connect to another instance of NiFi remotely. It asks for the URL of the remote instance rather than a unique name, so that a connection is created between the RPG and the NiFi instance. It uses the Site-to-Site communication protocol to communicate with remote instances or other systems.

5. Template: A Template is a NiFi component that enables re-use of the components created inside it. Users can create Templates and export them in XML format; a Template can then be imported into other NiFi instances for use. It is thus the feature that makes NiFi data flows reusable.

6. Funnel: A Funnel is a component used to combine different components or processors into one, which makes prioritizing easier. In a data flow with many processors, setting priorities at each processor hurts performance; NiFi instead makes it possible to set priorities, and change them dynamically, at a single point, i.e. in a Funnel.

7. Provenance and Lineage: Data provenance is one of the key features, as well as elements, of NiFi: it keeps very detailed records about each piece of data it ingests. The provenance repository stores everything that happens to the data from beginning to end, such as ingesting, routing, transforming, cloning, etc. This means that everything that passes through NiFi is recorded and indexed, which makes it easier to search, to track down problems and provide solutions, and to monitor the overall data for compliance. There is a Provenance icon in the Management toolbar at the top-right corner of the NiFi UI, and it displays everything that has happened in the data flow. It enables searching and filtering by Component Name, UUID, and Component Type. When the "View Details" icon is clicked, it displays the details of that particular event in 3 tabs, as in Figure 3.6: the Details tab lists the time, type of event, UUID, and so on; the Attributes tab lists all attributes that existed at the time the event occurred, with their previous values; and the Content tab enables downloading or viewing the content. NiFi also makes it possible to see the provenance data for each processor by right-clicking on the processor and choosing Data Provenance. As an example, a Twitter data analysis is used that searches all tweets containing the phrase "InternetofThings" and loads the language, location, text, username, and so on, according to the properties set; Figure 3.6 shows the provenance data for it.

Figure 3.6: NiFi Provenance

On the right side of the provenance page there is an icon for showing lineage, "Show Lineage", which shows a detailed graphical representation of what happened to the data. It enables seeing details and parents, and expanding the particular event that occurred, as needed. It has a slider that makes it possible to see which event was created at what time, and how long it took to create, by dragging the slider. It also enables downloading the lineage graph, as shown in Figure 3.7.

Figure 3.7: NiFi Lineage


3.2 Apache Spark Streaming

Apache Spark is an open-source, fast, and general engine for large-scale data processing [28][29]. It was originally developed at AMPLab at UC Berkeley, California [29], and is currently a top-level Apache project. Spark's core abstraction is called Resilient Distributed Datasets (RDD), an immutable collection of elements. Apache Spark is a main API with different components: Apache Spark SQL, MLlib, GraphX, and Spark Streaming.

• Spark SQL :- a module in the Spark general core API that enables the user to work with traditional structured data [30].

• GraphX :- a Spark API for graphs and graph-related operations [31].

• MLlib :- a Spark API for machine learning, consisting of various kinds of machine learning algorithms [32].

• Spark Streaming :- a Spark API mainly dealing with computations and analysis on live streams of data flowing in at specified time intervals [33].

Apache Spark Streaming is one of the components of the Spark core API; it processes streams of data as micro-batches. It is also possible to use other components from the Spark API, such as MLlib and Spark SQL, together with it for further processing.

3.2.1 Key Features

As one of the components in the Spark API, Spark Streaming shares the main features that Spark provides and adds others on top of them. Some of the main features are listed below.

• Spark Streaming provides a high-level abstraction called Discretized Streams (DStreams), which are built on Resilient Distributed Datasets (RDDs), Spark's main abstraction.

• It makes integration of streaming data with batch processing easy, because it is part of the Spark API.

• It receives data from different sources such as HDFS, Flume, and Kafka; it also enables custom-made receivers.

• It supports different programming languages such as Java, Scala, and Python.

• Fault tolerance: it has "exactly-once" semantics, which ensure that data is not lost and is delivered exactly once, avoiding duplicates, which is also advantageous for data consistency.

• It provides stateful transformations that maintain state even if one of the nodes fails, which is good for fault tolerance.

• Speed: it performs in-memory computations, which have low latency and provide faster processing than computations performed on disk.


3.2.2 Basic Concepts and Main Operations

Basic Concepts

The main programming model for Spark Streaming is its abstraction called Discretized Streams (DStreams). A DStream is a continuous stream of data, internally represented by Resilient Distributed Datasets (RDDs). A DStream can be created by ingesting data streams from different sources such as Kafka, Flume, or Twitter, or by applying transformations to other DStreams. An RDD is Spark's main abstraction: a fault-tolerant collection of elements that can be executed on in parallel [34].

Figure 3.8: Continuous RDDs form DStream - source [33]

Figure 3.8 shows DStreams as a continuous stream of batches of RDDs at a specified time interval; when all these batches of RDDs are combined, they form a DStream. DStreams support different transformations, similar to those of RDDs in the Spark API. These transformations allow the data from input DStreams to be modified; examples of such transformation functions include map, filter, and reduce.
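The batches-of-RDDs model above can be sketched without Spark at all; the plain-Python functions below (hypothetical names, not the Spark API) treat a stream as a sequence of small batches and apply each transformation batch by batch, which is exactly the discretized-stream idea.

```python
# Plain-Python sketch (no Spark required) of the DStream idea:
# a stream is discretized into small batches, and each transformation
# is applied to every batch as it arrives.

def dstream_map(batches, fn):
    for batch in batches:                 # each batch plays the role of an RDD
        yield [fn(x) for x in batch]

def dstream_filter(batches, pred):
    for batch in batches:
        yield [x for x in batch if pred(x)]

# Two micro-batches of hypothetical readings.
batches = [[1, 2, 3], [4, 5, 6]]

doubled = dstream_map(batches, lambda x: x * 2)
kept = dstream_filter(doubled, lambda x: x % 10 != 6)
print(list(kept))  # [[2, 4], [8, 10, 12]]
```

Because the generators are lazy, each batch flows through the whole pipeline as it "arrives", mirroring how transformations are applied per micro-batch in a streaming engine.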

Main Operations

Spark Streaming also provides various kinds of operations on DStreams. The main operations are Transform, Window, Join, and Output operations [33].

• Transform and Join :- The Transform operation allows RDD-to-RDD operations over DStreams, such as joining a data stream with other datasets. Spark Streaming enables different DStreams to be joined with one another: stream-stream joins allow streams of RDDs to be joined with streams of other RDDs, and stream-dataset joins allow streams to be joined with datasets using the transform operation [33].

• Window Operation :- Since the live streams of data coming from various sources are continuous, they cannot be computed as a batch of files, and traditional operations cannot be performed on them. Spark Streaming provides a solution: a Window operation that enables these streams of data to be processed, transformed, and computed within a specified time range over a sliding window. Every Window operation must specify a Window Length and a Sliding Interval to perform its actions over a window [34]. The Window Length is the duration of the total window, while the Sliding Interval is the rate or interval at which the operation is performed. Spark Streaming supports many Window operations, such as window, countByWindow, reduceByKeyAndWindow, and so on.


• Output Operation :- Spark Streaming supports many output operations that ensure the processed streams and data are stored in external storage such as HDFS, file systems, and databases, or even displayed on live dashboards. Print, saveAsTextFiles, saveAsHadoopFiles, and foreachRDD are some of the output operations that Spark Streaming provides.
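The window semantics above can be illustrated without Spark itself. The sketch below (plain Python; the event data, the 4-second window length, and the 2-second sliding interval are made-up values) computes per-window counts the way countByWindow conceptually does over a DStream.

```python
def sliding_window_counts(events, window_length, slide_interval):
    """Count events per window, mimicking countByWindow over a DStream.

    events: (timestamp, value) pairs ordered by timestamp.
    Each window covers [end - window_length, end), and a new window
    ends every slide_interval time units.
    """
    if not events:
        return []
    counts = []
    last_ts = events[-1][0]
    end = slide_interval
    while end <= last_ts + slide_interval:
        start = end - window_length
        n = sum(1 for ts, _ in events if start <= ts < end)
        counts.append((end, n))
        end += slide_interval
    return counts

# Hypothetical events arriving once per second for 10 seconds.
events = [(t, "msg%d" % t) for t in range(10)]
print(sliding_window_counts(events, window_length=4, slide_interval=2))
# [(2, 2), (4, 4), (6, 4), (8, 4), (10, 4)]
```

Each reported pair is (window end time, number of events in the window); because consecutive windows overlap, the same event is counted in several windows, which is why Spark must keep recent batches around for the duration of the window length.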

3.2.3 Architecture

How Spark Streaming operates can be summarized as:

• Receiving the input - the sources can be Kafka, Twitter, log data, etc. - and dividing the streams into small batches

• Spark Engine - processing the received batches of data in Spark's memory

• Outputting batches of processed data to storage systems

The tasks are assigned dynamically to the nodes based on the available resources, which enables fast recovery from failures and better load balancing between the nodes. Its ability to divide the input streams into small batches enables it to process the data in batches, reducing the latency compared to processing the records one by one.
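The receive-process-output steps above can be mimicked in miniature (plain Python, no Spark; the input records and the batch size are invented): incoming records are cut into small batches, each batch is processed, and the results are emitted for storage.

```python
def micro_batch(records, batch_size):
    """Divide an incoming stream of records into small batches."""
    return [records[i:i + batch_size] for i in range(0, len(records), batch_size)]

def process(batch):
    """Stand-in for the Spark engine: here, simply uppercase each record."""
    return [r.upper() for r in batch]

stream = ["kafka", "twitter", "log", "sensor", "tweet"]
outputs = [process(b) for b in micro_batch(stream, batch_size=2)]
print(outputs)
# [['KAFKA', 'TWITTER'], ['LOG', 'SENSOR'], ['TWEET']]
```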

Figure 3.9: Spark Cluster - source [33]

In addition to this, Spark Streaming runs on a Cluster as in Figure 3.9. The main program in a Spark Cluster (also known as the Driver program) has a Spark Context that coordinates the Spark application running on the cluster. The first step is creating a connection to an available Cluster Manager, which allocates resources to individual applications. Once the connection is created, Spark acquires Executors on Worker Nodes, which in turn run the application code in the Cluster. Spark then sends the code to the Executors, which are able to run tasks and keep the data in memory or on disk storage. Finally, the Spark Context sends the tasks to be run. The available cluster managers include Hadoop YARN and Apache Mesos, and Spark can also run in Standalone Mode.


3.3 Apache Storm

The other tool studied in this chapter is Apache Storm. An overview of the tool, its features, its components, and its main use cases will be briefly presented.

3.3.1 Overview

Apache Storm is a distributed, resilient, real-time computation system [35]. It was developed by Nathan Marz and became open source in September 2011 [36]. It works in ways similar to Hadoop, except that Apache Storm is for real-time streaming data while Hadoop is for batch processing.

3.3.2 Basic Concepts and Architecture

In this subsection, the different components and concepts of Storm are discussed and its architecture is presented.

Basic Concepts

• Tuple :- is the primary data structure in Storm: a list of values that supports any data type [37].

• Streams :- is a core abstraction in Storm by which unbounded tuples form a sequence or a stream. A stream can be formed by transforming another stream. It supports primitive types such as longs, strings, and byte arrays, and users can also define custom types by implementing their own serializers.

• Spouts :- are the main entry point of streams into Storm. Different external sources such as Kafka and the Twitter API ingest their data through Spouts. Spouts can be Reliable, where replaying a lost tuple is possible if failures occur, or Unreliable, where replaying is not possible and the data will be lost.

• Bolts :- are where the main processing takes place. A Bolt takes input from Spouts and processes it; the processed tuples are finally emitted to downstream Bolts or stored in databases. The processing includes stream transformation, running functions, aggregating, filtering, joining data, or sending it to databases.

• Topology :- is the main abstraction of Storm. It is a network of Spouts and Bolts connected with stream groupings. Each node of the graph/network represents either a Spout or a Bolt, and the edges represent which Bolts are subscribed to which component, i.e. Spout or Bolt.

In Figure 3.10, the nodes are spouts (S1, S2) and bolts (B1, B2, B3, B4). B1, B2, and B4 are subscribed to streams coming from S1. B4 is additionally subscribed to streams coming from S2. This shows that in a Topology, the tuples are streamed only to the components that subscribe to them.

• Trident :- is an API built on top of Storm and distributed as part of it. It supports “exactly-once” semantics.


Figure 3.10: Storm Topology - source [38]

• Stream Grouping :- Storm has different inbuilt stream groupings and also supports custom-made stream groupings. The main stream groupings are Shuffle Grouping and Fields Grouping. Shuffle Grouping randomly distributes the tuples among the tasks of the Bolt, while Fields Grouping routes tuples with the same value of the specified field to the same task [38].

• Task :- refers to a thread of execution.

• Worker :- executes a subset of all the Tasks in the Topology.
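The groupings listed above amount to routing functions from tuples to task indices. A minimal sketch (plain Python; the task count and the tweet tuples are invented) shows the defining property of Fields Grouping: tuples with the same value of the grouping field always reach the same task.

```python
import random

def shuffle_grouping(tuples, n_tasks, seed=0):
    """Shuffle Grouping: spread tuples over tasks pseudo-randomly."""
    rng = random.Random(seed)
    return [(rng.randrange(n_tasks), t) for t in tuples]

def stable_hash(value):
    # Python's hash() is randomized per process for strings,
    # so use a simple stable hash for reproducible routing.
    return sum(ord(c) for c in str(value))

def fields_grouping(tuples, field, n_tasks):
    """Fields Grouping: route each tuple by a stable hash of one field."""
    return [(stable_hash(t[field]) % n_tasks, t) for t in tuples]

tweets = [{"user": "alice", "text": "hi"},
          {"user": "bob", "text": "hello"},
          {"user": "alice", "text": "again"}]

routed = fields_grouping(tweets, "user", n_tasks=4)
print(routed[0][0] == routed[2][0])   # True: both "alice" tuples share a task
```

This hash-modulo routing is why Fields Grouping is used before stateful Bolts (e.g. per-user counters): all tuples for one key are guaranteed to land on the same task.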

3.3.3 Architecture

Storm supports both local and remote modes of operation. Local mode is mainly useful for developing and testing topologies, while in remote mode topologies are submitted for execution in a cluster [38]. There are two kinds of nodes in a Storm cluster: the Master Node and the Worker Nodes. The Storm architecture has three main components: Nimbus, a daemon that runs on the Master Node; the Supervisor, a daemon running on each Worker Node; and Zookeeper, which mainly handles communication between Nimbus and the Supervisors, as shown in Figure 3.11. Their functionality is summarized in Table 3.1:


Figure 3.11: Storm Cluster -source [38]

Nimbus: assigns tasks to worker nodes; monitors for failures; distributes code among cluster components.

Supervisor: receives the work assigned to its worker node; starts and stops worker processes as required.

Zookeeper: handles the communication between Nimbus and the Supervisors; keeps the state of the topology.

Table 3.1: Storm Architecture Components Functionality

3.3.4 Features

The features of Storm also show its advantages and why it is popular nowadays for stream data processing. Some of the main features include:

• Reliability :- It provides guaranteed message processing by using “exactly-once” semantics from the Trident API or “at-least-once” semantics from core Storm. It also makes sure that specific messages are replayed in case failures occur on those messages.

• Fast and Scalable :- Supports parallel, horizontal addition of machines and scales fast with an increasing number of machines.

• Fault-Tolerant :- Failure in Storm occurs, for example, when a worker dies or when the node itself dies. In the first case, the Supervisor handles the failure by automatically restarting the worker, while in the second case the tasks time out and are assigned to another machine or node.

• Support for many Languages :- Storm uses a Thrift API that makes it possible to support many programming languages such as Scala, Java, and Python.


Chapter 4

Review and Comparison of the Tools

In this chapter, the tools studied in the previous chapter are further reviewed and then compared based on selected parameters. The parameters are not selected based on any particular model but rather derive from the characteristics of the tools. It is important to answer questions like:

• Which tool is preferable if one parameter matters more than the others?

• What would the complexity be if we used a given tool for a given case?

• How does each tool respond to the parameters specified?

The selected parameters include:

(i) Ease of use

(ii) Security

(iii) Reliability

(iv) Queued data / data buffering

(v) Extensibility

4.1 Review

4.1.1 Apache NiFi

• Ease of Use : NiFi’s ease of use comes from its friendly drag-and-drop User Interface, from which the activity and the flows are controlled. If we have more complex data flows with different types, handling them from the command line is very complex and would not provide much useful detail. NiFi solves this issue by allowing all the flows to be designed in a UI, which reduces complexity and allows fast recovery from problems, making maintenance easy. Another feature that makes NiFi easy to use is that its flows can be changed and customized on the fly without affecting other parts of the flow. It also accepts data from a variety of sources in different formats such as FTP, HTTP, XML, JSON, CSV, and different file systems, which also contributes to its ease of use.


• Security : NiFi has inbuilt security and supports different security schemes at both the user and system level. It allows each data flow to be encrypted/decrypted by providing processors dedicated to this. It provides both certificate and username/password authentication mechanisms. It does this via two-way SSL authentication, where a specific user is allowed access if the certificate used is legitimate, through an exchange of acknowledgments between the client/browser and the server. It also has an access-level authorization scheme where users are assigned different Roles. This is important for use cases where security is essential, such as the financial, governmental, and similar sectors.

• Reliability : Reliability of a system is its ability to function properly for its intended purpose without failure. It includes the ability to provide guaranteed delivery of the processes at hand. NiFi is a reliable system and provides this feature by using the Content Repository and the Write-Ahead Log (WAL) mechanism, where the content of the data is stored in log files before it is written to disk. Hence, if a problem occurs, it is possible to recover the data from the log files without affecting the flow.
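The write-ahead pattern described above can be sketched in a few lines of plain Python (the file name and state keys are invented; this is not NiFi's actual repository format): every change is appended to the log before it is applied, so after a crash the state is rebuilt by replaying the log.

```python
import json
import os
import tempfile

class WriteAheadLog:
    """Minimal write-ahead log: record the change, then apply it."""

    def __init__(self, path):
        self.path = path
        self.state = {}

    def put(self, key, value):
        # 1. Append the change to the log and force it to disk first.
        with open(self.path, "a") as log:
            log.write(json.dumps({"k": key, "v": value}) + "\n")
            log.flush()
            os.fsync(log.fileno())
        # 2. Only then apply the change to the in-memory state.
        self.state[key] = value

    def recover(self):
        """Rebuild state by replaying the log after a failure."""
        self.state = {}
        if os.path.exists(self.path):
            with open(self.path) as log:
                for line in log:
                    entry = json.loads(line)
                    self.state[entry["k"]] = entry["v"]
        return self.state

log_path = os.path.join(tempfile.mkdtemp(), "flow.wal")
wal = WriteAheadLog(log_path)
wal.put("flowfile-1", "queued")
wal.put("flowfile-1", "processed")

crashed = WriteAheadLog(log_path)   # fresh instance: in-memory state is gone
print(crashed.recover())            # {'flowfile-1': 'processed'}
```

The fsync before the in-memory update is the essential ordering: if the process dies between the two steps, the log still contains the change and recover() reapplies it.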

• Queued Data/Buffering : A data buffer is a memory area where data is temporarily stored. Queuing of data occurs because the data is not processed at a given time, or because a node failed. This queued data has to be put in some memory as a data buffer. But it takes memory space if the queued data is always kept, so there has to be an efficient way to handle such cases without exhausting resources. In this regard, NiFi buffers queued data efficiently, keeping all queued data in memory. It has a back pressure mechanism where a certain limit on the data to process is specified; if that limit is reached, no more data is accepted until the queued data is processed and memory space is released. By providing these features, NiFi handles queued data in an efficient way.
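The back pressure mechanism described can be sketched with a bounded buffer (plain Python; the threshold and the items are invented): once the configured limit is reached, offers are refused until a consumer drains the queue and frees space.

```python
from collections import deque

class BackPressureQueue:
    """Bounded buffer: producers are refused once the threshold is hit."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.buffer = deque()

    def offer(self, item):
        if len(self.buffer) >= self.threshold:
            return False              # back pressure: caller must wait and retry
        self.buffer.append(item)
        return True

    def drain(self, n):
        """Consumer side: remove up to n items from the buffer."""
        return [self.buffer.popleft() for _ in range(min(n, len(self.buffer)))]

q = BackPressureQueue(threshold=3)
accepted = [q.offer(i) for i in range(5)]
print(accepted)                       # [True, True, True, False, False]
q.drain(2)                            # the consumer frees some space...
print(q.offer(99))                    # ...so new data is accepted again: True
```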

• Extensibility : The extensibility of NiFi serves various uses. It has many extension points, such as Processors, Reporting Tasks, and Controller Services, to mention a few, which users can design according to their needs. Flows can be changed in real-time without affecting other flow files. There is no need to recompile the whole flow: if a new flow file is created or an old one removed, the effect is seen in the UI in real-time without compilation.

4.1.2 Spark Streaming

• Ease of Use : Spark Streaming’s ease of use comes from its core Spark API, which has APIs for different programming languages. It has support for Scala, Java, and Python. This is useful for users who are familiar with the languages mentioned, and it shows that the tool is flexible and addresses more users as the number of languages it supports increases. It also has an interactive shell and supports different APIs.

• Security : Spark supports authentication through Kerberos security and using a Shared Secret [39]. Using Kerberos authentication requires creating a principal and a keytab file and configuring the Spark history server to use Kerberos. Only Spark


running on a YARN cluster supports Kerberos authentication; it is not available in standalone mode. The second type of authentication uses a Shared Secret, where a handshake between Spark and the other system is made to allow communication between them. In order to communicate, both must have the same shared secret key. For this authentication to work, the “spark.authenticate” parameter must be set to true.

• Reliability : Spark Streaming is a reliable, fault-tolerant computation framework where data processing is guaranteed. It uses different mechanisms to address fault tolerance and guaranteed delivery of data, such as exactly-once delivery semantics and Write-Ahead Logging (WAL). Exactly-once semantics is a form of delivery semantics where data is processed exactly one time; it does not allow duplicates to be formed. Failure may occur in two forms: node or executor failure, and driver/main program failure. When a node fails, it is automatically restarted and normal operation continues, because the data blocks in the receivers are replicated. Once the data is ingested into a node, it is guaranteed that it will be processed. When the driver/main program dies, all executors fail and the received blocks are lost. If DStream checkpointing is enabled, it is possible to restart the main program from the last checkpoint, after which all executors are restarted. A DStream checkpoint is a way to specify a fault-tolerant directory, such as one on HDFS, in which to regularly store the status. Failure may also occur while input data is being loaded. When this happens, Spark Streaming recovers some of the data but not all of it. The solution Spark provides to recover all the data is the WAL, where the ingested data is written synchronously to fault-tolerant storage such as HDFS or S3 before being processed. If the data is received correctly, an acknowledgment is sent and the data is then processed. If no acknowledgment is sent, a failure has occurred, so Spark reads the log files and the data is sent again for processing from there. All these methods make Spark a reliable and fault-tolerant processing framework.
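One simple way to obtain exactly-once behaviour on top of a source that replays data after failures is to make the processing idempotent, e.g. by tracking already-processed record ids. The sketch below is plain Python with invented record ids and replay pattern; it illustrates the deduplication idea, not Spark's internal mechanism.

```python
def process_exactly_once(records, seen=None):
    """Apply each record at most once, even if the source replays it."""
    seen = set() if seen is None else seen
    results = []
    for record_id, value in records:
        if record_id in seen:         # duplicate from a replay: skip it
            continue
        seen.add(record_id)
        results.append(value)
    return results

# A simulated failure caused records 2 and 3 to be replayed by the source.
stream = [(1, "a"), (2, "b"), (3, "c"), (2, "b"), (3, "c"), (4, "d")]
print(process_exactly_once(stream))   # ['a', 'b', 'c', 'd']
```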

• Queued Data/Buffering : In real-time data processing, a queue is created whenever the data is not processed within a specified time interval and the processing rate is slower than the rate at which the data is received. This data is queued in a buffer, which keeps growing if the data is not processed or removed. In Spark Streaming too, the data is queued as DStreams in memory and the queue keeps increasing. To overcome this, Spark Streaming provides configuration parameters which help to limit the rate at which data is received and processed. It also uses other methods, such as reducing batch processing times or choosing the right batch size, so that batches can be processed at the rate they are received.

• Extensibility : When new application code exists and needs to replace the old application code, Spark Streaming provides two ways: one is shutting down the existing application gracefully and starting the new one, which resumes processing from the point where the earlier application left off; the other is starting the new application in parallel with the existing one and shutting the old one down later.


4.1.3 Apache Storm

• Ease of Use : Storm’s ease of use comes from its easy-to-use API and its ability to support different programming languages through its Thrift API. Thrift is an Interface Definition Language and communication framework that allows defining new data types that support different programming languages [40].

• Security : Storm in the particular release studied here (0.9.6) does not provide inbuilt security (authentication and authorization). It does not provide encryption of the data over the network by itself. This means that security mainly depends on technologies outside of Storm, such as firewall settings and encryption applied to different parts such as Topologies. The latest releases of Storm support Kerberos authentication by creating keytabs and principals for the daemons.

• Reliability : Storm guarantees full data processing even if any of the connected nodes in the cluster dies or messages are lost. Full data processing means that all the messages in a tuple tree are fully processed within a specified time interval; otherwise the tuple fails. Storm guarantees this by providing at-least-once semantics, which ensures messages are replayed when failures occur and allows duplicates to be formed. It also offers the Trident API for occasions where exactly-once processing of data is needed. There are different points of failure, such as node failure, worker failure, or daemon failure (Nimbus and Supervisor), and each is handled differently to provide a fault-tolerant system. When a worker dies, it is automatically restarted by the Supervisor and nothing is lost. If a node dies, Nimbus assigns its tasks to other machines, because the tasks assigned to that machine time out. If the daemons die, worker processes are not affected and continue when the daemons are restarted.
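The at-least-once behaviour above can be sketched as an ack-and-replay loop (plain Python; the lost acknowledgment is simulated): a tuple whose ack is lost gets replayed by the source, so it may be processed more than once downstream.

```python
def deliver_at_least_once(messages, ack_fails_for):
    """Replay any message whose ack was lost; duplicates may result."""
    delivered = []
    pending = list(messages)
    while pending:
        msg = pending.pop(0)
        delivered.append(msg)           # the bolt processes the tuple
        if msg in ack_fails_for:        # ack lost -> the spout replays it
            ack_fails_for.discard(msg)
            pending.append(msg)
    return delivered

out = deliver_at_least_once(["t1", "t2", "t3"], ack_fails_for={"t2"})
print(out)   # ['t1', 't2', 't3', 't2'] -- t2 was processed twice
```

The duplicate "t2" is exactly what at-least-once semantics permits; eliminating it requires an exactly-once layer such as Trident.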

• Queued Data/Buffering : Storm provides techniques to handle the case where data is over-queued or stays in the buffer too long without being processed. If the incoming data is not processed within the specified time, the buffer fills up with messages and grows too large. This causes the task-processing timeout to be reached, which in turn causes messages to be re-emitted at the spout. Storm therefore provides back pressure, by which a threshold on the number of messages to process is predefined along with other properties. When this threshold is reached, further messages are blocked until those in the queue are processed first.

• Extensibility : New code written after the last deployment needs to be recompiled before it can be incorporated and used.


4.2 Differences and Similarities

This section summarizes the selected parameters and how each tool behaves with respect to them. The intention is to show differences, similarities, and their applicability in certain use cases, as shown in Table 4.1.

                   NiFi                        Spark Streaming             Storm
Web UI             Friendly UI                 Interactive shell; UI       UI for monitoring
                                               for monitoring
Main element       Processors                  DStreams                    Spouts, Bolts, Topologies
Language           Has its own Expression      Java, Scala, Python         Any programming language
                   Language
Reliability/       Content Repository and      Exactly-once semantics,     At-least-once; exactly-once
Fault-tolerance    Write-Ahead Log             immutable RDDs              with Trident
Security           Inbuilt security: SSL,      Shared secret, Kerberos     No authentication (0.9.6);
                   SSH, HTTPS, content         authentication              Kerberos support in later
                   encryption                                              releases
Applicability/     Good for simple             Good for simple and         Good for simple and
Use case           processing                  complex processing          complex processing

Table 4.1: Differences and Similarities of the tools

Simple processing :- includes extracting, splitting, merging, filtering, routing, ETL transformations, etc. of the data.
Complex processing :- includes massive computations, aggregations, window and join operations, machine learning computations, etc.

4.3 Discussion of the parameters

In this section, a discussion of the parameters for each tool is given. The intention is that it can guide a reader in deciding which tool to use given the combination of parameters at hand.

• Ease of use/Usability : Ease of use is one of the main features to consider in every software system. It defines how easy it is to use the system effectively and efficiently. It can come in terms of the programming languages the tool supports, for example by being accessible to a large user base in different languages. It can also come in terms of the way the tool handles complexity, i.e. whether it offers easy techniques such as a UI or relies on scripts and logs from the command line. That said, Spark Streaming supports different programming languages such as Scala, Java, and Python through its language-integrated APIs. It


addresses a large audience of users familiar with these programming languages. Apache Storm even supports any programming language through its use of the Thrift API. This is important because anyone familiar with a certain programming language can do the work without needing to learn a specific language, with only some configuration and technical changes. Apache NiFi, in contrast, can be used with little or no coding and has its own Expression Language, which enables using different functions and regular expressions in different formats. It is easy, but it is a new kind of language: considering the learning curve, since it differs from common programming languages, it takes time to learn and utilize it well. Generally, when ease of use is considered in terms of the languages used, even though Storm and Spark Streaming support many languages and reach a large number of users, NiFi’s easy-to-use character is important to consider when deciding which tool to use.

When considering ease of use in terms of managing complexity, Spark Streaming mainly uses an interactive shell for processing the data and a UI for monitoring the cluster environment, memory usage, information about executors, and so on. But it does not allow controlling and monitoring the data flow from the UI. Storm also uses its UI for monitoring: not the flow of the data, but other information such as memory usage. It basically relies on running scripts and application code. NiFi, on the other hand, has a friendly UI that goes beyond monitoring information about the cluster environment. It enables command and control, where the user can view the flow of files in real-time. It is possible to design the flow and see the effect right away without affecting other flow files. This is important when there are many flow files interacting in complex ways that cannot otherwise be handled in an interactive shell or a command-line interface. Such a capability, a user-friendly UI that makes control of the flow easy, enables seeing problems in real-time and making the right decisions there, whereas using the command line with many scripts will not speed up decision making when problems occur that need to be handled right away. Considering this feature, NiFi provides easy and effective data processing while controlling the flow through a friendly UI.

• Reliability : Reliability is one of the main features to consider in real-time data processing. All three tools handle reliability in various ways. Spark Streaming uses DStreams, which are continuous RDDs, and RDDs are immutable, which makes them fault-tolerant. Spark Streaming provides reliability through exactly-once delivery semantics, which guarantees that the data is delivered exactly one time with no duplication, and it also uses Write-Ahead Logging. Storm provides at-least-once delivery semantics, which guarantees the data is delivered at least one time, so duplicates may exist; it provides exactly-once semantics if the Trident API is used. NiFi provides reliability by using its Content Repository and Write-Ahead Logging, where every action is written to log files before it is written to disk. So choosing a tool is a design choice depending on how the data should be handled: if duplicate data is acceptable in simple processing, then Storm with its at-least-once guarantee is a good choice; if the situation is a banking transaction where cash withdrawals and the like occur, then exactly-once delivery is the one to choose, in Spark or in Storm with Trident.

• Security : Security is also an important feature to consider in real-time processing.


Spark Streaming, through its core Spark API, provides shared-secret authentication, where communication happens after a handshake between the systems. It also supports Kerberos authentication, but only on a YARN cluster. Storm in release 0.9.6 does not provide authentication, and security is handled by an external firewall; releases 0.10.0 and above support Kerberos authentication. NiFi comes with several alternatives regarding security. It provides certificate authentication and username/password authentication. It also provides pluggable authorization, offering Roles for different users, and it allows encryption/decryption of the flow files. So if the use case needs security even within the flow files, for example in a bank environment where transactions have to be highly secured, the flowing contents may need to be encrypted for some users, based on Roles, while others are able to decrypt and see them. In such cases NiFi is a good tool to choose.

• Extensibility : Extensibility can be described in terms of the addition of new features/components or functionality, and also of modifications to the existing system. Regarding the addition of new features/components, Spark Streaming uses the core Spark API’s extensibility, which supports modules such as SQL and MLlib. This allows the Streaming API to use one or more of these modules, which makes it extensible. Storm is designed to be extensible for using external functions such as SQL features; for that it uses its Topologies and other APIs. NiFi is likewise designed so that its main components are extensible: Processors and Reporting Tasks are some of the main extensible elements of NiFi. It is thus possible to design your own Processors capable of achieving your purposes, and also to use NiFi’s existing Processors to modify and transform your data. For use cases where, say, data from sensors in IoT applications needs to be transformed from one form to another for processing, choosing NiFi would be a good fit considering this feature.

Extensibility can also mean the addition of new functionality, in the flows you want to change or the application code you have written. Spark Streaming and Storm follow the same approach: the new application code first has to be tested and then deployed, either in parallel with the existing one or by first shutting down the existing application and starting the new one. This approach does not produce good results when real-time decision making is considered, and it does not allow tracing bugs that occur in real-time; so there is some downtime when something needs to be changed in the application code. In NiFi, adding new functionality can mean adding or removing Processors or other components to or from the flow. Since there is no save-and-deploy step, the effect of what has been added or removed is seen in real-time in the UI. And if problems occur while making that change, they are traceable and can be solved right away. So NiFi is good for use cases where decision making, viewing the flow, and tracing bugs in real-time are vital.


4.4 How each tool handles the use case

In this section, the use case that is going to serve as a benchmark for the practical analysis is defined, and how each tool handles it is briefly depicted in theory. This theoretical study and comparison of the tools is important for choosing the tool to be used for the analysis.

There are two data sources from which data is ingested into the system. One is the real-time Twitter API for receiving tweets; the other is static data, retrieving historic tweets from the NoSQL database HBase. The two data sources are merged and filtered; the incorrect data is then sent to log files on the local system and the correct data to HBase. Finally, some of the filtered data is indexed into Apache Solr.

Figure 4.1: General use case flow

(I) Apache NiFi : NiFi handles this use case in a simple and efficient way through a friendly UI. It comes with processors that handle interactions between NiFi instances and other sources and systems, such as Twitter, HBase, and Solr for this use case. It does this by providing inbuilt processors to get the data, process it (extract, filter), and route it to other processors and downstream systems. It also has processors to write incorrect files to logs in local file systems or HDFS. Each processor used must be configured appropriately to function correctly, which enables monitoring and controlling the flow in real-time. It is also possible to use output ports, which allow sending flow files to external systems such as Apache Spark for further processing if needed.

Figure 4.1 above shows the general flow of the system for the use case. By setting properties for each processor and using the NiFi Expression Language, it is possible to design and query simple ETL transformations, splitting, merging, filtering, and extraction of needed information, which can further be used as input to other systems. NiFi uses the “GetTwitter” processor to receive the tweets from the Twitter API. Specific search terms can be set in the “Terms to Filter On” property of the


Figure 4.2: NiFi use case flow

processor. The “GetHBase” processor is used to retrieve historic tweets from HBase. These two processors send the flow files to a downstream processor which extracts the required fields from the JSON files they produce. For this, the “EvaluateJsonPath” processor is used, which allows defining custom properties that can later be used in making routing decisions. After the data is merged and the required fields are extracted, the “RouteOnAttribute” processor is used, which allows defining custom Boolean rules. These rules define whether the flow is correct or incorrect, and they are used next to route the flow accordingly. According to the rules, if the flow is correct it is sent to HBase for storage, handled by the “PutHBaseJSON” processor, and some of the flow is also sent to Solr for indexing with the “PutSolrContentStream” processor. The incorrect data is sent to the “PutFile” processor, used as a log. This use case is shown using NiFi in Figure 4.2.
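The extract-and-route step can be mimicked without NiFi (plain Python; the tweet fields and the Boolean rule are simplified stand-ins for what the EvaluateJsonPath and RouteOnAttribute processors do): fields are pulled out of the tweet JSON, and a rule decides whether the record goes to storage or to a log.

```python
import json

def extract_fields(raw_json, fields):
    """EvaluateJsonPath stand-in: pull selected attributes out of a tweet."""
    tweet = json.loads(raw_json)
    return {f: tweet.get(f) for f in fields}

def route(record):
    """RouteOnAttribute stand-in: a Boolean rule decides the destination."""
    correct = all(record.get(f) for f in ("text", "lang", "location"))
    return "hbase" if correct else "logfile"

tweets = [
    '{"text": "IoT rocks", "lang": "en", "location": "Uppsala"}',
    '{"text": "", "lang": "en", "location": "Uppsala"}',
]
routes = [route(extract_fields(t, ["text", "lang", "location"])) for t in tweets]
print(routes)   # ['hbase', 'logfile']
```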

(II) Spark Streaming : Apache Spark Streaming is based on DStreams, which are small batches of RDDs over a specified time. Figure 4.1 is used as the general use case diagram. Initially, a Spark Context object is created taking the configuration object. Then an SQL Context object is created taking the Spark Context object as an argument; it is responsible for getting the queries from HBase and storing them in a temporary table, “tmpHBase”. Then a Streaming Context is created taking the Spark Context and the sliding window interval as arguments. Spark Streaming receives input from the Twitter API using the Streaming Context, creating a DStream, “twitterDStream”. Then a Window operation is defined, which takes the sliding interval and window length. Then a series of DStream transformations is performed: splitting the tweets into separate words (“splitDStream”), filtering, which again creates another DStream (“filterDStream”), and mapping of the data, each time creating new DStreams with new transformations. After this, the last transformed DStream is stored as a temporary table, “tmpTwitter”, where it is joined/merged with the previous temporary table “tmpHBase” using the SQL Context created before and is continuously stored using the foreachRDD method. Finally, the “saveAsHadoopFile” or “saveAsNewAPIHadoopDataset” method is used to

36

Page 45: Handling Data Flows of Streaming Internet of Things Datauu.diva-portal.org/smash/get/diva2:956406/FULLTEXT01.pdf• Apache Spark Streaming, Version 1.6.0 • Apache Storm, Version

store the data. The general illustration is shown in Figure 4.3.

Figure 4.3: Spark Streaming use case flow
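To make the chain of transformations above concrete, the following is a plain-Python sketch of the split → filter → map pipeline applied over a sliding window of micro-batches. This is not the Spark API: each "DStream" here is simply a list of batches, the window length is an assumed value, and the names mirror the DStreams named in the text only for illustration.

```python
# Plain-Python illustration of the DStream transformation chain
# (split -> filter -> map over a sliding window). NOT Spark code.

WINDOW_LENGTH = 3   # number of micro-batches per window (assumed value)
TERMS = {"iot", "bigdata", "internetofthings"}

def split_batch(batch):
    """splitDStream: split each tweet text into lowercase words."""
    return [word.lower() for tweet in batch for word in tweet.split()]

def filter_batch(words):
    """filterDStream: keep only the words matching the search terms."""
    return [w for w in words if w in TERMS]

def map_batch(words):
    """mapped DStream: map each word to a (word, 1) pair for counting."""
    return [(w, 1) for w in words]

def windowed_counts(batches):
    """Apply the chain to the last WINDOW_LENGTH micro-batches and
    aggregate the (word, 1) pairs, as foreachRDD would per interval."""
    window = batches[-WINDOW_LENGTH:]
    counts = {}
    for batch in window:
        for word, one in map_batch(filter_batch(split_batch(batch))):
            counts[word] = counts.get(word, 0) + one
    return counts

twitter_dstream = [
    ["IoT sensors everywhere", "BigData rules"],
    ["nothing relevant here"],
    ["InternetOfThings meets BigData"],
]
print(windowed_counts(twitter_dstream))
# {'iot': 1, 'bigdata': 2, 'internetofthings': 1}
```

In actual Spark Streaming these steps would be expressed as transformations on the DStream itself, with the framework handling batching and windowing.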

(III) Apache Storm: Figure 4.1 is referred to as a general use case diagram, and here it is described how Storm handles this use case. It starts by creating a Topology where the Spouts and Bolts used are initialized. Then the Spouts, which are the entry points where data is ingested, are created. Two Spouts are created for receiving data from the two sources, because the sources are different and the data arrives in different forms: real-time data from the Twitter API and historic data from HBase, hence “twitterSpout” and “hbaseSpout”. In order to load data from HBase, an HBase connection first has to be created for use in Storm. The data is then sent to the Bolt that is subscribed to these Spouts, where merging of the data is handled, “mergeBolt”. Another Bolt, subscribed to the first one, is created for processing the data (filtering and checking whether the text, language, and location fields are empty), “processBolt”. If one of the fields is empty, the data is incorrect and is sent to the Bolt that writes to a log file in the local file system, “incorrectBolt”. If the fields are not empty, the data is sent to the storage systems, “persistBolt”. This use case is illustrated using Storm in Figure 4.4.


Figure 4.4: Storm use case flow

4.5 Summary

To summarize, in this chapter a thorough review of the tools was done based on selected parameters, and their differences and similarities were shown in a summary table to illustrate how each of the tools handles those parameters. Then the parameters were discussed to show which tool is good for a particular use case. In the last part, how each tool handles the use case was briefly described. All this is important for seeing the advantages and disadvantages of using one tool over another given parameters such as ease of use and integration with external systems such as HBase and Solr, which are used in this case. It is also important for making decisions on the tools to use in the practical analysis, serving as a basis after studying and comparing the tools.

Considering the use case defined, which involves ingestion of data, processing (merging, extraction, filtering, and routing), and finally persisting to storage systems with further analysis after indexing in Solr, the use case is not so complex: it does not require machine learning techniques or heavy computations and aggregations. In this regard, all of these tools are suitable for handling the use case defined. But NiFi has advantages over the others in that it provides an advanced web UI that enables designing, monitoring, and controlling the flow in real time as the tweets keep flowing into the system. It also provides data provenance, i.e., the origin of the data and what happened to each piece of data in real time. It also has an advantage in integration with other systems such as HBase, Solr, Spark, and Storm. This is important if the data processed in NiFi needs to be transferred to other systems, such as Spark, for further advanced analysis. It uses an inbuilt mechanism called the “Site-to-Site” protocol for this purpose.

So generally, NiFi is the kind of tool to use for such use cases, where it makes processing of data easy, reliable, fast, and efficient, and the data can again be transferred to other systems. The tool that is going to be used for the analysis is therefore NiFi, and further analysis is also made in Solr.


Chapter 5

Practical Analysis / Twitter Data Analysis

This chapter covers the practical analysis of the project. In the previous chapters, different tools that are popular in processing real-time data were studied. In this chapter, a practical analysis of Twitter data is made using Apache NiFi, one of the tools studied in the previous chapters. The next sections introduce the problem definition and formulation, the questions it tries to answer, the setup for the analysis, and the main analysis parts.

5.1 Problem definition

The use case chosen for this purpose is analysis of Twitter data in real time. This use case is chosen to demonstrate how NiFi communicates with other data sources and systems such as Twitter, HBase, and Solr through its inbuilt processors. The use case is summarized as:

1. Source A - real-time tweets from Twitter

2. Source B - static data from HBase

3. Combine sources and extract required fields

4. Filter out incorrect data to log file - define a rule to filter correct/incorrect data

5. Write all data to an HBase table

6. Write some of the tweets to a Solr index - based on the rule defined

This use case tries to answer questions such as: how NiFi can be used with other systems; how the rules are formulated to extract, filter, and route flow files to their respective downstream connections; and what the top languages, the locations of the tweeters, the top tweeters, and so on are. Apache NiFi is the main tool used for processing and analysis of the Twitter data, from ingesting the tweets to extracting useful fields, filtering, and making routing decisions based on the properties defined. After the analysis, the data is persisted to HBase, and some tweets are also indexed using Apache Solr. Finally, Banana, a visualization tool working with Solr, is used to visualize the analysis in real time.

5.2 Setup

The analysis is made both on a Windows 10 machine with 8 GB of memory, running Apache NiFi 0.6.0 in local mode, Apache Solr 5.5.0 in standard mode locally, and Banana 1.6 for visualizations, and on an Amazon Web Services (AWS) cluster with a CentOS machine running HDP 2.4 and HDF 1.2 respectively. HDP 2.4 is the Hortonworks Data Platform consisting of the major Hadoop components, whereas HDF 1.2 is the Hortonworks Data Flow platform powered by Apache NiFi. HBase is part of HDP 2.4, and Apache Solr is installed separately in standard mode.

5.3 Analysis

In this section, the steps followed for the analysis of the tweet data are given. As in any other data processing framework, the process starts with data ingestion into the NiFi system. Then data processing is done, and finally the data is persisted to a storage database. Figure 5.1 shows this graphically.

Figure 5.1: Data Analysis flow

The data is ingested from two sources, i.e., a real-time and a static source. The real-time data comes from Twitter, while the static source is historic tweets that were stored initially. Once these data are ingested into the flow, processing of the data continues, which includes extracting the required fields, since the tweets come with a whole lot of fields but only some of them are interesting to consider. Then filtering and defining custom properties is done for making routing decisions. After all these steps, the data is persisted to the storage systems, and some of the data is also sent for indexing. The static data from the storage system is also sent back as an input source to the Data Ingestion step.


• Prerequisite: In order to use the needed processors, some prerequisites should be properly configured. For example, to use the GetHBase and PutHBaseJSON processors, the HBase_1_1_2_ClientService has to be configured in advance. The prerequisites are found in Controller Services. The Controller Service is one of the important functions that NiFi provides. It bundles a whole set of services to be configured once and used repeatedly. Once the services are configured and set, NiFi allows using them repeatedly for many clients in the same instance without further configuration. It has many services, such as the DBCPConnectionPool, which, once set, can be used for many database connections. It also has the HBase_1_1_2_ClientService, where the configuration files are specified and which, once set, can be used by many clients reading and writing data in HBase. In this project, the HBase_1_1_2_ClientService is used, and the paths to the “hbase-site.xml” and “core-site.xml” files are specified for it to work properly. After this, the “GetHBase” and “PutHBaseJSON” processors can be used repeatedly in this instance, because the common configuration is handled by the Controller Services.

5.3.1 Data Ingestion

The tweets are fetched from the Twitter API and then loaded into the NiFi flow through the “GetTwitter” processor. This processor has mandatory configuration properties that need to be set before starting the flow. The properties are shown in Table 5.1.

Property                               Description
Twitter Endpoint                       Specifies the Sample Endpoint or the Filter Endpoint. The Filter Endpoint has to be specified if terms to search for are given; otherwise, to get all the public tweets, the Sample Endpoint is used.
Consumer Key and Consumer Secret       Provided by the Twitter API when creating the application.
Access Token and Access Token Secret   Provided by the Twitter API when creating the application.

Table 5.1: Mandatory properties for the “GetTwitter” processor

It also has other properties, such as “Terms to Filter On”, where terms to filter on can be specified. For this project, it was decided to filter based on the terms “IoT, InternetofThings, and BigData”. Once these properties are properly set, the processor is ready to start fetching data from the API. The other input data is from the NoSQL database HBase, where historic data is initially stored. For ingesting this static data from HBase, the “GetHBase” processor is used, an inbuilt processor for reading historic data from HBase. It uses the HBase_1_1_2_ClientService, which is set once and used many times by different HBase clients. The other mandatory property specifies the table name. In this project, the table name is “Twitter”.


5.3.2 Data Processing

This step has different parts, starting from extracting the required fields to filtering, separating the correct data from the incorrect based on the rules defined, and finally routing the data based on the rules set.

• Extraction of Required Fields: As Twitter data is unstructured, consists of a variety of types, and has many fields, it is not interesting to keep all of these fields for analysis. So this step is important to extract only the required fields for further analysis. The Twitter data comes in JSON format, and NiFi provides an inbuilt processor called “EvaluateJSONPath”, which is used to extract the required fields from the JSON data by allowing custom properties to be defined. The property names defined here are used when making routing decisions or in other processors. The fields that are of interest and used for extracting the tweets are shown in Table 5.2.

Property Name        Twitter JSON field
twitter.id           $.id
twitter.user         $.user.name
twitter.handle       $.user.screen_name
twitter.createdAt    $.created_at
twitter.text         $.text
twitter.timestamp    $.timestamp_ms
twitter.hashtags     $.entities.hashtags[0].text (gets only the first hashtag)
twitter.mentions     $.entities.user_mentions[0].name (gets only the first mention)
twitter.lang         $.lang
twitter.location     $.user.location

Table 5.2: Custom properties for extracting tweets
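To illustrate what these JSON-path extractions do, the following plain-Python sketch resolves the same paths on a minimal tweet-like dictionary. This is not how NiFi evaluates them; the sample tweet content is invented, but the field names follow the Twitter JSON schema used in the table above.

```python
import json

# Minimal tweet-like JSON, shaped after the fields in Table 5.2.
raw = json.dumps({
    "id": 1, "text": "IoT meets BigData", "lang": "en",
    "created_at": "Mon Jun 06 12:00:00 +0000 2016",
    "timestamp_ms": "1465214400000",
    "user": {"name": "Alice", "screen_name": "alice", "location": "Uppsala"},
    "entities": {"hashtags": [{"text": "IoT"}], "user_mentions": []},
})

tweet = json.loads(raw)

# The $.a.b[0].c paths from Table 5.2, resolved with ordinary indexing.
extracted = {
    "twitter.id": tweet["id"],
    "twitter.user": tweet["user"]["name"],
    "twitter.handle": tweet["user"]["screen_name"],
    "twitter.createdAt": tweet["created_at"],
    "twitter.text": tweet["text"],
    "twitter.timestamp": tweet["timestamp_ms"],
    "twitter.lang": tweet["lang"],
    "twitter.location": tweet["user"]["location"],
    # first hashtag / mention only, as in the table; None when absent
    "twitter.hashtags": (tweet["entities"]["hashtags"] or [{}])[0].get("text"),
    "twitter.mentions": (tweet["entities"]["user_mentions"] or [{}])[0].get("name"),
}
print(extracted["twitter.hashtags"])   # "IoT"
print(extracted["twitter.mentions"])   # None
```

In NiFi these values become flow file attributes, which is what makes them available to the routing rules described next.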

• Filtering and Routing of Data: All the tweets coming in at this step were already filtered by the terms set in the previous steps (“IoT, InternetofThings, BigData”). Even though they come with the right terms, it is good to note that a whole lot of the fields are empty, which makes no sense to consider. So further filtering is needed to remove tweets with empty fields in order to make better routing decisions. Filtering is done in the Correct/Incorrect data context in the “RouteOnAttribute” processor provided by NiFi. This processor allows defining custom rules based on which routing decisions are made. The custom rules that decide whether a piece of data is Correct or Incorrect are:

(I) The Text, Hashtags, Mentions, Language, and Location fields extracted must not be empty.

(II) The tweets are routed to different downstream systems based on the rule set: English and non-English tweets.


The rules set use the fields extracted in the previous step by the “EvaluateJSONPath” processor and are given names, which are used as Connections for routing the data to the different downstream systems. There is also an inbuilt relationship that routes data when the custom rules are not satisfied. The rules are: route the non-empty tweets that are English or non-English; the empty tweets that do not satisfy the custom rules are sent to the “Unmatched” relationship.

The above rules in the NiFi Expression Language are:

Rule 1:
${twitter.text:isEmpty():not():and(${twitter.location:isEmpty():not()}):and(${twitter.hashtags:isEmpty():not()}):and(${twitter.lang:isEmpty():not()}):and(${twitter.mentions:isEmpty():not()})}

Rule 2:
English     - ${twitter.lang:equals("en")}
Non-English - ${twitter.lang:equals("en"):not()}

Combining the above rules gives:

English: Rule 1 + Rule 2 =
${twitter.text:isEmpty():not():and(${twitter.location:isEmpty():not()}):and(${twitter.lang:equals("en")})}

Non-English: Rule 1 + Rule 2 =
${twitter.text:isEmpty():not():and(${twitter.location:isEmpty():not()}):and(${twitter.lang:equals("en"):not()})}

“Unmatched” - one or more of the fields is empty.

(A) Correct Data: Correct data in this context is data that satisfies the rules set above, that is, either English or non-English tweets whose fields are not empty. Here the correct data is routed based on its type to different destinations. If a tweet is non-English, it is routed to Solr for further analysis; this is to show the different languages in which tweets are made, the locations, the top tweeters, and so on. All tweets that are correct (i.e., both non-empty English and non-English) are persisted to HBase.

(B) Incorrect Data: Incorrect data is data with one or more of its fields empty. It is sent to log files through the “Unmatched” relationship. The “LogAttribute” processor is used here to simply log those flow files that do not satisfy the custom rules.
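The routing rules above translate directly into Boolean checks. A minimal plain-Python equivalent (illustrative only, not NiFi Expression Language; the relationship names are the ones that appear in the provenance data later) could look like:

```python
# Illustrative plain-Python version of the routing rules (not NiFi EL).
# A flow file is modeled as a dict of the attributes extracted earlier.

REQUIRED = ("twitter.text", "twitter.location", "twitter.hashtags",
            "twitter.lang", "twitter.mentions")

def route(attrs):
    """Return the relationship a flow file would be routed to:
    'tweetsEnglish', 'tweetsNonEnglish', or 'unmatched'."""
    # Rule 1: none of the required fields may be empty
    if any(not attrs.get(field) for field in REQUIRED):
        return "unmatched"
    # Rule 2: split by language
    return "tweetsEnglish" if attrs["twitter.lang"] == "en" else "tweetsNonEnglish"

print(route({"twitter.text": "IoT!", "twitter.location": "Uppsala",
             "twitter.hashtags": "IoT", "twitter.lang": "en",
             "twitter.mentions": "someone"}))        # tweetsEnglish
print(route({"twitter.text": "IoT!", "twitter.location": "",
             "twitter.hashtags": "IoT", "twitter.lang": "fr",
             "twitter.mentions": "someone"}))        # unmatched
```

In NiFi, each non-"unmatched" result corresponds to a named rule on the “RouteOnAttribute” processor, and each name becomes an outgoing connection.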

5.3.3 Data Storage

Twitter produces unstructured data consisting of various formats, such as videos, images, normal text, and media files, and it is produced at high speed in large volumes. Once the data is processed, it has to be persisted into storage systems. Traditional Relational Database Management Systems (RDBMS) cannot handle such data because of the volume and variety of the data produced by such social media. So a NoSQL database has to be used that is capable of handling large volumes of unstructured data in an efficient way. In this regard, the NoSQL database HBase is chosen for this project. After the data is processed in the previous steps, the correct and incorrect data are differentiated according to the rules set and routed accordingly.

• Correct data: All processed correct data is persisted in Apache HBase. In order to store the data in HBase, NiFi provides an inbuilt processor called “PutHBaseJSON”, which writes JSON data to the HBase database. It has mandatory fields to be set before it is ready for use. The first property that needs to be set is the HBase_1_1_2_ClientService property, discussed in the section above. For this project, the table name and column family are “Twitter” and “tweets” respectively, created with the command:

hbase(main):002:0> create 'Twitter', 'tweets'
0 row(s) in 1.3130 seconds
=> Hbase::Table - Twitter

After the table is created, the other mandatory properties for this specific HBase client are specified in NiFi: “Twitter” as the table name, “tweets” as the column family, and “Id” as the Row Identifier Field Name, which comes from the tweet's JSON id field. After these properties are specified in the processor, it is ready to start.

5.3.4 Data Indexing & Visualization

In this step, some of the processed data is sent to Apache Solr for indexing and searching. The rules/properties that were set are used here to send the data to Solr. In this regard, all the non-English tweets that were processed are sent to Solr for indexing. There is no compelling rule for choosing only these tweets; they are selected as an example to show how NiFi can be integrated with Apache Solr, and also to show the language and location distributions, top tweets, and so on in Solr. To achieve this, the “PutSolrContentStream” processor is used. This processor has mandatory properties that need to be set, such as the Solr Type and the Solr Location. The Solr Type specifies either Standard mode or Cloud mode.

For this project, the Standard Solr mode is used. The Solr Location specifies the location where the Solr server is installed, which is “http://52.30.209.198:8983/solr/twitter”; “twitter” is the core where all the tweets are stored and indexed. The processor also allows defining custom properties that map the JSON document to Solr document fields, which are later used in Solr as attributes for further analysis. The properties defined are shown in Table 5.3.


Property Name   Solr field
f.1             id:/id
f.2             twitter_text_t:/text
f.3             twitter_username_s:/user/name
f.4             twitter_created_at_s:/created_at
f.5             twitter_timestamp_ms_tl:/timestamp_ms
f.6             twitter_screenname_s:/user/screen_name
f.7             twitter_location_s:/user/location
f.8             twitter_lang_t:/lang
f.9             twitter_tag_ss:/entities/hashtags/text
f.10            twitter_mentions_ss:/entities/user_mentions/name
f.11            twitter_source_s:/source

Table 5.3: Custom properties for indexing tweets

The data that is processed and indexed this way has to be presented to the user in a visualization to help make better decisions. It is also important to see which parameters are the most important to watch, so that the user is aware of what is happening. In this project, the indexed data is further fed into a visualization tool called “Banana” to showcase the different properties of the tweets in a dashboard in real time. In the search field, terms such as IoT are searched, and the hits for the specific search terms, the top tweeters, and the languages and locations are visualized using bar charts, histograms, and other components. After all the processors are properly configured and ready to start, the NiFi UI looks like Figure 5.2.

Figure 5.2: Overall NiFi Twitter Data Flow


5.3.5 Data Result & Discussion

After the flow is allowed to run, a stream of tweets starts flowing in real time. Figure 5.3a shows an example of non-English tweets that are extracted with the text, language, and location fields all non-empty according to the rule, and that were routed accordingly to their respective downstream connections.

(a) Non English Tweets from Provenance data

(b) English Tweets from Provenance data

Figure 5.3: Both English and Non English Tweets from Provenance data

The attribute names are the fields defined to extract the tweets from the Twitter API in the “EvaluateJSONPath” processor. The figure shows the date the tweet was created, the language, location, and text, and also the username and screen name, which are shaded for privacy purposes. The “RouteOnAttribute.Route” field shows that it has the rule “tweetsNonEnglish”, which is defined to route tweets that are not in English to the downstream connections. Figure 5.3b shows the same information for a tweet whose “RouteOnAttribute.Route” field shows the rule “tweetsEnglish”, which routes English tweets to the next processor.

• Statistics: NiFi also allows viewing statistics for each processor in terms of different parameters. The parameters include Average Task Duration, Bytes Read in the last 5 minutes, Bytes Written in the last 5 minutes, Flow Files Out in the last 5 minutes, and so on. Figure 5.4a below shows the Average Task Duration for the “PutSolrContentStream” processor.

(a) Average Task Duration status for Indexing

(b) Status for flow files out from the processor

Figure 5.4: Statistics data

The left side of Figure 5.4a shows the name and type of the processor as well as the start and end times between which the average task duration is shown. The last point shows the Min/Max/Mean time it takes to process the flows or send them to Apache Solr. From the graph, the highest average task duration [00:00:00.076] is between 21:10 and 21:15. The other peak is a little less than [00:00:00.040], around 22:30. Figure 5.4b shows the statistics for the “GetTwitter” processor, which is responsible for getting the tweets from the Twitter API. It receives the tweets and sends them out to the next processor or downstream connection for further processing. These statistics show the flow files transferred out in the last 5 minutes. The left side shows the start and end times and also the Min/Max/Mean of the number of files transferred. It is shown that the maximum number of flow files transferred is 12, with the peak around 11:25.

5.3.6 Data Analysis in Solr

Further analysis is also made in Apache Solr, and its results are visualized using Banana. The analysis consists of searching for specific terms in the tweets and returning the hits for each term, the languages, and the locations of the tweeters.

• Specific terms in a tweet: This analysis includes defining the terms to search for in Solr and the number of hits returned for terms such as “iot, bigdata, internetofthings”, only in non-English tweets, these being the terms used for filtering in NiFi in the previous sections. The total number of indexed tweets is 789, and from these, the search for the individual terms internetofthings, bigdata, and iot returns 7, 25, and 30 hits respectively. The search for all the terms together returns 52.

The query used is:
http://52.18.85.201:8983/solr/twitter/select?q=-twitter_lang_t:en+and+twitter_text_t:internetofthings+or+twitter_text_t:bigdata+or+twitter_text_t:iot&wt=json&indent=true

Figure 5.5: Filter for specific terms, “iot,bigdata,internetofthings”

Figure 5.5 also shows that the filtering is done with the query “-twitter_lang_t:en”, which loads only non-English tweets.
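For reference, such a Solr select URL can be assembled with Python's standard library. The host, core name, and field names below are the ones from the setup described above; note that this sketch writes the Boolean operators in uppercase with explicit parentheses, which is what Solr's standard (Lucene) query parser expects.

```python
from urllib.parse import urlencode

# Build a Solr select URL with properly encoded parameters.
# Host/core and field names come from the thesis setup ("twitter" core).
base = "http://52.18.85.201:8983/solr/twitter/select"

params = {
    # exclude English tweets, then match any of the three terms
    "q": "-twitter_lang_t:en AND (twitter_text_t:internetofthings"
         " OR twitter_text_t:bigdata OR twitter_text_t:iot)",
    "wt": "json",
    "indent": "true",
}

url = base + "?" + urlencode(params)
print(url)
```

Encoding the query string this way avoids problems with reserved characters such as “:” and spaces in the q parameter.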

• The language distribution: In this analysis, the different languages in which the tweets were made are given. When the terms iot, internetofthings, and bigdata are searched for specific languages, the search returns 24, 10, 4, and 2 hits for French, Spanish, German, and Japanese respectively.


Figure 5.6: Language distribution

The query used for non-English languages:
http://52.18.85.201:8983/solr/twitter/select?q=-twitter_lang_t:en+and+twitter_text_t:internetofthings+or+twitter_text_t:bigdata+or+twitter_text_t:iot&wt=json&indent=true

Figure 5.6 shows the language distribution in a pie chart, where 52% is French, 22% Spanish, and so on. It also shows a world map shaded light blue in the parts where tweets were made. Next to it, the top tweeters for the specific search terms used above are shown.

• The locations of the tweeters: This part of the analysis relates to the locations of the tweeters, to show which parts of the globe are tweeting about certain terms such as iot, bigdata, and internetofthings. The locations may sometimes be at country level, such as Sweden or Canada, or they may be particular locations with no country level.

Figure 5.7: Location distribution

Figure 5.7 above shows the different locations and their occurrences when the above terms are searched from the Banana user interface. It also shows the hashtags in a TagCloud panel and, next to it, the top mentions for the terms searched above in a pie chart. This visualization helps the user search for specific terms across all the indexed tweets and draw conclusions from the different properties/characteristics displayed in the dashboard.


Chapter 6

Evaluation

This chapter discusses the performance evaluation and how the designed data flow can be optimized. NiFi's performance is affected by many factors, such as the type and number of processors used, which affects the system resources (CPU, RAM, ...); whether the flow is allowed to run without constraints, producing large queues (i.e., whether a back pressure mechanism is applied); and whether clustering is used.

• The type and number of processors used: The type of processor used determines how many resources are allocated to a particular processor, and this differs from processor to processor, because some processors are resource-intensive and require more by default. The number of processors used also has an impact on performance, because every processor needs a thread to be allocated to it by the Flow Controller in order to function properly. In this regard, grouping processors with the same functionality helps to minimize the number of threads used and makes them available for other processes.

Figure 6.1: Same Processors used repeatedly


Figure 6.1 shows that the same processors are used twice to extract and route data once it is ingested from the “GetTwitter” and “GetHBase” processors. This works against the performance of NiFi, because it means utilizing more threads for the same processors. So grouping processors with the same functionality together is a good design choice and enhances performance. Figure 6.1 is therefore condensed by grouping the same processors together so that they use only single threads for their execution, as shown in Figure 6.2.

Figure 6.2: Same processors used once for performance gain

Figure 6.2 above shows that the data from the “GetTwitter” and “GetHBase” processors is all sent to one processor, i.e., “Extract Fields - EvaluateJSONPath”, and onwards in the same manner.

• Back pressure mechanism: If the flow files are allowed to flow continuously without any constraint, this has a big impact on the performance of NiFi and the system as a whole. The impact could be due to the repositories not being properly updated, or a full disk because of the large amount of data flowing in. To solve this problem, NiFi provides back pressure mechanisms to be deployed on connections to downstream systems, by setting a certain threshold on the number of flow files to be processed or on the size of the files, so that data is allowed to flow until this threshold is reached, as shown in Figure 6.3.


Figure 6.3: Setting back pressure for the connection

The downstream processor then only accepts flows up to the threshold; once the queue drops below the threshold again, it processes the backlog from the queue. This prevents NiFi and the processor itself from being overwhelmed. The back pressure can be set on every connection created between the processors, or only once by using another processor called “ControlRate”; all downstream processors then get the flows at the rate set in this processor. In the Twitter analysis, it is possible either to set back pressure on every connection or to use the “ControlRate” processor. Figure 6.4 shows the “ControlRate” processor being used to set the threshold to 100 KB. This means it passes data through at up to 100 KB, and if more flow files come in, they are queued; when the rate drops below the threshold, the backlog from the queues is processed.

Figure 6.4: Using the “ControlRate” processor to control the rate of flow to downstream processors
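The queue-threshold behaviour described above can be sketched with a bounded queue in plain Python. This is illustrative only (NiFi's actual implementation differs, and the threshold of 5 flow files is an assumed value): the point is that the producer is pushed back once the connection's backlog reaches the threshold, instead of flooding the downstream processor.

```python
from queue import Queue, Full

# Illustrative back-pressure sketch: a bounded queue models the connection
# threshold; the producer is throttled (here: counted as back-pressured)
# when the downstream consumer has not drained the backlog yet.
connection = Queue(maxsize=5)   # threshold: at most 5 queued flow files
back_pressured = 0

def produce(flowfile):
    """Try to enqueue; signal back pressure when the threshold is reached."""
    global back_pressured
    try:
        connection.put_nowait(flowfile)
    except Full:
        back_pressured += 1   # source is held back instead of flooding NiFi

def consume_backlog():
    """Downstream processor drains the queued backlog."""
    drained = []
    while not connection.empty():
        drained.append(connection.get_nowait())
    return drained

for i in range(8):              # 8 flow files arrive, threshold is 5
    produce(f"tweet-{i}")

backlog = consume_backlog()
print(back_pressured)           # 3 flow files were back-pressured
print(len(backlog))             # 5 queued flow files drained
```

A real NiFi connection signals back pressure to the upstream processor so that it stops being scheduled, rather than dropping or counting flow files as this sketch does.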


• Clustering: If more data flows into the system than the available system resources can handle, adding resources through clustering can be used for performance gain. NiFi works in a master/slave architecture, where the master checks the load on every other node (slave) to assign the work. After calculating the load balancing for each of the nodes, it assigns the work to the respective node. Nodes with more resources can be added as needed to distribute the work across the cluster, which in turn results in a performance gain. So clustering is also one way of solving performance problems.


Chapter 7

Conclusion and Future Work

This thesis project investigates the handling of streaming data. It is divided into two parts: theoretical and practical. The theoretical part studies IoT and gives an overview of tools such as Apache NiFi, Apache Spark Streaming and Apache Storm. The project then defines the parameters Ease of Use, Security, Reliability, Queued Data/Buffering and Extensibility to review the behaviour of the tools studied. This approach makes the results useful as a guide for future work as well as for choosing one of the tools.

From the study of the tools, it is found that Apache NiFi is a data processing tool with features such as a user-friendly web UI, built-in security, fault tolerance, provenance and lineage, extensibility, clustering and more. It is highly suitable for IoT applications because its extensibility allows designing custom processors capable of ingesting data into NiFi in the required formats. It is also found that Spark Streaming is a fast processing framework because it uses in-memory computation and divides the incoming data into small batches, which reduces latency and speeds up the computation. Apache Storm, in contrast, does not break the data down into chunks but computes records as they arrive. Both Spark Streaming and Apache Storm can be used for simple data processing such as ETL operations as well as for more complex computations requiring machine learning algorithms (such as those in MLlib), heavy computation and aggregations, while NiFi is used for simple data processing such as ETL, routing, data mediation and similar operations.

Finally, in the practical part, Apache NiFi is used to process Twitter data and examine tweets matching certain terms such as “iot, bigdata, internetofthings”. Further analysis is made of the number of hits for these terms and of the location and language distribution of the tweets in Apache Solr, with the results visualized in the Banana framework. This shows that the tool chosen for the practical analysis, Apache NiFi, is suitable for such use cases and can be used efficiently for data processing and analysis. It also shows that NiFi can easily be integrated with external systems such as Apache HBase, Apache Solr and others.
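The contrast drawn above between Spark Streaming's micro-batching and Storm's record-at-a-time model can be illustrated with a small sketch. This is plain Python, not Spark or Storm code; the function names and the batch size are illustrative only.

```python
# Conceptual sketch (not Spark/Storm code) of the two processing models
# discussed above: micro-batching groups incoming records into small
# batches before each computation (Spark Streaming style), while the
# per-record model processes every record on arrival (Storm style).
from typing import Iterable, Iterator, List


def micro_batches(stream: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Group the incoming stream into small batches before processing."""
    batch: List[str] = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final, possibly partial, batch
        yield batch


def per_record(stream: Iterable[str]) -> Iterator[str]:
    """Apply a per-tuple transformation to each record as it arrives."""
    for record in stream:
        yield record.upper()


if __name__ == "__main__":
    tweets = ["iot", "bigdata", "internetofthings", "spark", "storm"]
    print(list(micro_batches(tweets, 2)))
    print(list(per_record(tweets)))
```

The micro-batch variant trades a small amount of latency (records wait for their batch to fill) for the efficiency of computing over groups, which mirrors the trade-off between the two frameworks described above.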

7.1 Future Work

In the theoretical part of the thesis, the work can be extended in various ways. It can be extended to include an overview of more frameworks in the stream data processing area. The study of IoT can be further extended to cover the challenges and solutions in more detail. More parameters can be added to compare and contrast the tools. And since the project mainly focuses on stream data processing and analysis, which the tools studied comply with, it could be extended further to include an overview of storage systems and search or indexing platforms.

The practical part, i.e. the Twitter data analysis, could also be extended with more features. Only some of the tweet fields were extracted and analyzed, but the flow can be extended to include other fields as well. Different routing rules than the ones used can be set, routing the tweets with more rules. One interesting extension point would be sending the processed data from NiFi to external systems such as Apache Spark for more complex computations on the tweets. And since the Twitter analysis is done as a benchmark to show how NiFi can be used for such cases, this work can further be extended to process data from other types of sources, such as geospatial, sensor or other IoT data.
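As a rough illustration of the term-based routing mentioned here, the following plain-Python sketch mimics the kind of rule the NiFi flow applies with its routing processors. The tracked terms and relationship names are illustrative assumptions, not NiFi's actual processor configuration.

```python
# Conceptual sketch of a term-based routing rule (plain Python, not NiFi
# configuration): a tweet containing any of the tracked terms is routed
# to the "matched" relationship, everything else to "unmatched".
TRACKED_TERMS = {"iot", "bigdata", "internetofthings"}


def route(tweet_text: str) -> str:
    """Return the name of the relationship the tweet would be routed to."""
    # Normalize words: strip leading/trailing '#' from hashtags, lowercase.
    words = {w.strip("#").lower() for w in tweet_text.split()}
    return "matched" if words & TRACKED_TERMS else "unmatched"


if __name__ == "__main__":
    print(route("Excited about #IoT and #BigData"))  # matched
    print(route("Nice weather today"))               # unmatched
```

In NiFi itself the same effect is achieved declaratively by configuring a routing processor, so extending the rules means editing the flow rather than writing code.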

List of Figures

3.1 NiFi standalone Architecture - source [8] . . . . . . 13
3.2 NiFi Cluster Architecture - source [8] . . . . . . 14
3.3 NiFi UI canvas . . . . . . 17
3.4 NiFi main components . . . . . . 18
3.5 NiFi Processor Anatomy . . . . . . 19
3.6 NiFi Provenance . . . . . . 21
3.7 NiFi Lineage . . . . . . 21
3.8 Continuous RDDs form DStream - source [33] . . . . . . 23
3.9 Spark Cluster - source [33] . . . . . . 24
3.10 Storm Topology - source [38] . . . . . . 26
3.11 Storm Cluster - source [38] . . . . . . 27

4.1 General use case flow . . . . . . 35
4.2 NiFi use case flow . . . . . . 36
4.3 Spark Streaming use case flow . . . . . . 37
4.4 Storm use case flow . . . . . . 38

5.1 Data Analysis flow . . . . . . 40
5.2 Overall NiFi Twitter Data Flow . . . . . . 45
5.3 Both English and Non-English Tweets from Provenance data . . . . . . 46
5.4 Statistics data . . . . . . 47
5.5 Filter for specific terms, “iot,bigdata,internetofthings” . . . . . . 48
5.6 Language distribution . . . . . . 49
5.7 Location distribution . . . . . . 49

6.1 Same Processors used repeatedly . . . . . . 50
6.2 Same processors used once for performance gain . . . . . . 51
6.3 Setting back pressure for the connection . . . . . . 52
6.4 Using “ControlRate” processor to control the rate of flow to downstream processors . . . . . . 52

List of Tables

3.1 Storm Architecture Components Functionality . . . . . . . . . . . . . . . 27

4.1 Differences and Similarities of the tools . . . . . . 32

5.1 Mandatory properties for “GetTwitter” Processor . . . . . . 41
5.2 Custom properties for extracting tweets . . . . . . 42
5.3 Custom properties for indexing tweets . . . . . . 45

Acronyms & Abbreviations

The acronyms used in this report are outlined in the table below.

Acronym   Description

ASF       Apache Software Foundation
API       Application Program Interface
CPU       Central Processing Unit
CSV       Comma Separated Value
DStreams  Discretized Streams
DDoS      Distributed Denial of Service
ETL       Extract Transform Load
FTP       File Transfer Protocol
HDFS      Hadoop Distributed File System
HVAC      Heating Ventilation Air Conditioning
HDF       Hortonworks Data Flow
HDP       Hortonworks Data Platform
H2H       Human-to-Human
H2T       Human-to-Things
HTML      Hyper Text Markup Language
ICT       Information Communication Technology
ITU       International Telecommunication Union
IoT       Internet of Things
JSON      JavaScript Object Notation
JVM       Java Virtual Machine
MIT       Massachusetts Institute of Technology
NSA       National Security Agency
NCM       NiFi Cluster Manager
OS        Operating System
QoS       Quality of Service
RFID      Radio Frequency Identification
RPG       Remote Process Group
RDD       Resilient Distributed Dataset
SSL       Secure Sockets Layer
S3        Simple Storage Service
T2T       Things-to-Things
TLP       Top Level Project
TCP/IP    Transmission Control Protocol/Internet Protocol
URL       Uniform Resource Locator
UI        User Interface
WSN       Wireless Sensor Network
WAL       Write Ahead Logging
XML       Extensible Markup Language

Bibliography

[1] “Ericsson IoT”. url: http://www.ericsson.com/thecompany/our_publications/books/internet-of-things (visited on 02/11/2016).

[2] D. Miorandi et al. “Internet of things: vision, applications and research challenges”. In: Ad Hoc Networks vol. 10, no. 7 (2012), pp. 1497–1516.

[3] Dave Evans. “The Internet of Things: How the Next Evolution of the Internet is Changing Everything (white paper)”. Tech. rep. April 2011.

[4] James Manyika et al. “The Internet of Things: Mapping the Value Beyond the Hype”. In: McKinsey Global Institute (June 2015), p. 3.

[5] David Neiwolny. “How the Internet of Things is Revolutionizing Healthcare (white paper)”. Tech. rep. October 2013.

[6] R. Weber. “Internet of Things: New security and privacy challenges”. In: Computer Law and Security Review vol. 26, no. 1 (2010), pp. 23–30.

[7] M. Zaharia et al. “Discretized Streams: An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters”. In: (2012).

[8] “NiFi Overview”. url: https://nifi.apache.org/docs.html (visited on 02/02/2016).

[9] “Series Y: Global Information Infrastructure, Internet Protocol Aspects and Next-Generation Networks, Next Generation Networks – Frameworks and functional architecture models (white paper)”. Tech. rep. 2012.

[10] K. Ashton. “That “Internet of Things” Thing”. In: RFID journal (2009).

[11] Somayya Madakam, R. Ramaswamy, and Siddharth Tripathi. “Internet of Things (IoT): A Literature Review”. In: Journal of Computer and Communications vol. 3 (2015), pp. 164–173.

[12] Jayavardhana Gubbi et al. “Internet of Things (IoT): A Vision, Architectural Elements, and Future Directions”. In: ().

[13] Sean Dieter et al. “Towards Implementation of IoT for environmental condition monitoring in homes”. In: IEEE Sensors Journal vol. 13, no. 10 (Oct 2013).

[14] “Apple HomeKit”. url: http://www.apple.com/ios/homekit/ (visited on 02/13/2016).

[15] Pedro Castillejo et al. “An Internet of Things Approach for Managing Smart Services Provided by Wearable Devices”. In: International Journal of Distributed Sensor Networks (2013).

[16] Melanie Swan. “Sensor mania! The IoT, wearable computing, objective metrics and quantified self 2.0”. In: Journal of Sensor and Actuator Networks (2012).

[17] Andrea Zanella et al. “Internet of Things for smart cities”. In: IEEE Internet of Things Journal vol. 1, no. 1 (2014), pp. 22–31.

[18] Ji chun Zhao et al. “The study and application of the IoT technology in Agriculture”. In: (2010).

[19] “IoT in Agriculture Case Study, Thingworx”. url: http://www.thingworx.com/Markets/Smart-Agriculture (visited on 02/06/2016).

[20] Debasis Bandyopadhyay and Jaydip Sen. “Internet of Things - Applications and Challenges in Technology and Standardization”. In: (2011).

[21] Krushang Soner and Hardik Upadhyay. “A survey: DDoS Attack on Internet of Things”. In: International Journal of Engineering Research and Development vol. 10, no. 11 (Nov 2014), pp. 58–63.

[22] J. H. Ziegeldorf, O. Garcia Morchon, and K. Wehrle. “Privacy in the Internet of Things: threats and challenges”. In: Security and Communication Networks vol. 7, no. 12 (2014), pp. 2728–2741.

[23] Bugra Gedik and Ling Liu. “Protecting Location Privacy with Personalized K-Anonymity: Architecture and Algorithms”. In: IEEE Transactions on Mobile Computing vol. 7, no. 1 (2008).

[24] “Privacy by Design in Big Data”. In: (Dec 2015).

[25] “NSA NiFi”. url: https://www.nsa.gov/public_info/press_room/2014/nifi_announcement.html (visited on 03/12/2016).

[26] “NiFi Key Features”. url: https://docs.hortonworks.com/HDPDocuments/HDF1/HDF-1.1.0/bk_Overview/content/high-level-overview-of-key-nifi-features.html (visited on 03/13/2016).

[27] “Apache NiFi wiki”. url: https://cwiki.apache.org/confluence/display/NIFI/Apache+NiFi (visited on 03/18/2016).

[28] “Spark Overview”. url: http://www.spark.apache.org (visited on 03/20/2016).

[29] “Spark AMPLab”. url: https://amplab.cs.berkeley.edu/projects/spark-lightning-fast-cluster-computing/ (visited on 03/20/2016).

[30] “Spark SQL Module”. url: http://www.spark.apache.org/sql/ (visited on 03/20/2016).

[31] “Spark GraphX Module”. url: http://www.spark.apache.org/graphx/ (visited on 03/20/2016).

[32] “Spark Machine Learning Module”. url: http://www.spark.apache.org/mllib/ (visited on 03/20/2016).

[33] “Spark Streaming Module”. url: http://www.spark.apache.org/streaming/ (visited on 03/20/2016).

[34] “Spark Programming Guide”. url: http://spark.apache.org/docs/1.6.0/programming-guide.html (visited on 04/02/2016).

[35] “Apache Storm”. url: http://storm.apache.org/index.html (visited on 04/02/2016).

[36] “Apache Storm history”. url: http://nathanmarz.com/blog/history-of-apache-storm-and-lessons-learned.html (visited on 04/05/2016).

[37] “Storm Feature”. url: http://hortonworks.com/hadoop/storm/ (visited on 04/02/2016).

[38] “Storm Tutorial”. url: http://storm.apache.org/releases/0.9.6/ (visited on 04/02/2016).

[39] “Spark Security”. url: http://spark.apache.org/docs/1.6.0/security.html (visited on 05/25/2016).

[40] “Storm Thrift API”. url: http://thrift.apache.org/docs/features (visited on 05/28/2016).

Appendix - Apache License, 2.0.

The material content of this thesis project is licensed under the Apache License, 2.0.

You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0
