
Proceedings of the ICICIS 2016 Conference

Gaborone, Botswana, May 18-20, 2016

Edited by

Oduronke Eyitayo

George Anderson

Department of Computer Science

University of Botswana


ICICIS 2016: Proceedings of the 1st International Conference on The Internet, Cyber Security and Information Systems, Grand Palm Hotel, Gaborone, Botswana

May 18-20, 2016

Jointly organised by:

Department of Computer Science, University of Botswana and Department of Applied Information Systems, University of Johannesburg, South Africa

International Programme Committee Chairs:
Audrey N. Masizana, PhD, University of Botswana
Barnabas Gatsheni, PhD, University of Johannesburg, South Africa

General Chairs:
Ezekiel Uzor Okike, PhD, University of Botswana
Kennedy Njenga, PhD, University of Johannesburg, South Africa

Review Committee

O. T. Eyitayo, PhD (Chair), University of Botswana
G. Anderson, PhD (Co-Chair), University of Botswana
S. D. Asare, University of Botswana
S. Browne, PhD, National University of Ireland Galway, Ireland
D. Garg, University of Botswana
G. Malema, PhD, University of Botswana
T. M. Mogotlhwane, PhD, University of Botswana
G. Mosweunyane, PhD, University of Botswana
P. Motlogelwa, University of Botswana
T. Motshegwa, PhD, University of Botswana
K. Njenga, PhD, University of Johannesburg, South Africa
T. Seipone, University of Botswana
Q. Sello, University of Botswana
E. Thuma, PhD, University of Botswana
M. van den Berg, PhD, University of Johannesburg, South Africa

Sponsors

United States of America – Botswana Embassy
Ministry of Transport & Communications
Botswana Innovation Hub
Botswana Fibre Networks
Botswana Communications Regulatory Authority
Bit Brands Digital Agency

ISBN 978-99968-0-430-4

Copyright © 2016 Department of Computer Science, University of Botswana

Published by:
Department of Computer Science
University of Botswana
Private Bag UB 00704
Gaborone, Botswana


Table of Contents

Information Systems Track

Modeling and Simulation of a Hybrid Mobile Target Tracking System for Livestock

Obakeng Maphane, Oduetse Matsebe, Molaletsa Namoshe ........................................................... 1

A Distributed Computational MapReduce Algorithm for Big Data Electronic Health Records

Sreekanth Rallapalli, Radhika Kidambi, Suryakanthi Tangirala ....................................................... 11

A Collaborative Tool for MPhil/PhD Student Dissertation Workflow

Bigani Sehurutshi, Oduronke T. Eyitayo ............................................................................................... 22

Evaluating the Effect of Privacy Preserving Record Linkage on Student Exam Record Data Matching

George Anderson, Tsholofetso Taukobong, Audrey Masizana ............................................................. 35

Ontological Perspectives in Information System, Information Security and Computer Attack Incidents (CERTS/CIRTS)

Ezekiel Uzor Okike, Tshiamo Motshegwa, Molly Nkamogelang Kgobathe ......................................... 46

Cybersecurity Track

Big Data Forensics As A Service

Oteng Tabona, Andrew Blyth ................................................................................................................ 61

Information Security Policy Violation: The Triad of Internal Threat Agent Behaviors

Maureen van den Bergh, Kennedy Njenga ........................................................................................... 69

Challenges in Password Usability - Users Perspective

Tiroyamodimo Mogotlhwane, Kagiso Ndlovu....................................................................................... 82

Enhancing the Least Significant Bit (LSB) Algorithm for Steganography

Oluwaseyi Osunade, Ganiyu Idris Adeniyi ............................................................................................. 90

A Security Model for Mitigating Multifunction Network Printers Vulnerabilities

Jean-Pierre Kabeya Lukusa.................................................................................................................. 103


Proceedings of the 1st International Conference on the Internet, Cyber Security, and Information Systems (ICICIS), Gaborone, 18-20 May 2016

Copyright © Department of Computer Science, University of Botswana, 2016

IC1012

Modeling and Simulation of a Hybrid Mobile Target Tracking System for Livestock

Obakeng Maphane, Oduetse Matsebe, Molaletsa Namoshe

Department of Mechanical & Energy Engineering, College of Engineering & Technology
Botswana International University of Science and Technology, Private Bag 16, Palapye, Botswana

[email protected]; [email protected]; [email protected]

ABSTRACT

Wireless Sensor Networks (WSNs) have enjoyed widespread application in the Internet of Things (IoT), especially indoors; research has therefore expanded their use to outdoor applications. They excel at covering large areas, using wireless nodes attached to sensors and relaying the collected data to a sink node. The Global System for Mobile communications (GSM)/General Packet Radio Service (GPRS) network is one of the oldest, and fastest growing, telecommunications technologies. It covers wide ranges; unfortunately, the signal deteriorates with distance from Base Station Towers (BSTs), and it is expensive to use for tracking. This paper presents the concept and simulation of a high-level solution to livestock tracking using a hybrid mobile target tracking system, in which GSM and WSN are combined for the purpose of livestock tracking, leveraging the scalability of WSNs and the coverage range of GSM. The proposed system combines strategically located static nodes and mobile nodes, embedded within a WSN and handshaking with the GSM/GPRS network through BSTs on a timely basis. The collected data is relayed hierarchically to a back-end database, from mobile node through static nodes to BST and database. The hybrid system paves the way for research in livestock behaviour, virtual fencing, Foot and Mouth Disease (FMD) monitoring, GSM-to-WSN connection, WSN optimization and applications of GSM technology for tracking or in the IoT. The simulation results presented show that this system is viable.

Keywords: Wireless Sensor Networks, Foot and Mouth Disease, GSM technology, Internet of Things, Solar Power, Livestock Tracking and Management.

1 INTRODUCTION

1.1 Motivation

Traditional methods of livestock management have proved costly, slow and, in many cases, ineffective; consequently, they are a major contributor to the decline in the nation's beef industry, quite apart from market restrictions. Electronic agriculture, on the other hand, has taken off in strides. Many mobile and web-based applications are being developed to improve the industry; they have leveraged the speed of technology and the coverage of the internet to promote and revive agriculture. Electronic agriculture is the fusion of electronics into the management and development of tools to improve agricultural practices.


1.2 Background

a) Livestock management

Ever since humans domesticated buffalo for meat and milk, they have managed them by housing them in kraals and monitoring them even during grazing periods. For a long time this was the only way of monitoring livestock, and rightly so, because it was the main source of livelihood; over time, however, urbanization has forced people to balance livestock management with paid jobs, especially in African countries such as Botswana. This became even more costly with the introduction of requirements by foreign markets to trace livestock movements, which farmers at the time could only estimate. Technology was then introduced to help with livestock management. One example is the ingestible Radio Frequency Identification (RFID) bolus, used for identification and historical records. The bolus, inserted in the animal's gut, contains an RFID tag; it is read for identification by veterinary officers during routine vaccinations and assessments, before travel documentation is issued for transportation to abattoirs. This provided unique identification for livestock, gave owners some degree of security against stock theft (Moreki et al., 2012), and gave a limited account of the animal's history based on when and where it was scanned; however, it had its challenges. The insertion process required trained officers, and the Ministry of Agriculture's (MoA) Department of Veterinary Services (DVS) had a limited number of them. It was costly for the Ministry to deploy and track livestock; above all, it left gaps in the traced movement of the animal, because the officers only knew where the animal had been scanned, not where it had been since it was tagged (Moreki et al., 2012; Sunday Standard Reporter, 2012). Recently, research into applications of WSNs, GPS and, on occasion, GSM/GPRS networks has taken off in agriculture.

b) GSM Network

Telephone networks for routing calls date back to Alexander Graham Bell's first telephone; GSM, developed in the late 20th century, is one of the fastest developing technologies, having moved from a switching network to the data era in a remarkably short period of time. Based on the architecture of the network, it is possible to track a user to the nearest cell tower, giving a rough estimate of the user's location, because the Base Station Towers (BSTs) are stationary and in known locations. After the introduction of mobile devices this became more challenging, but it is still possible to some degree. The GSM network provides the ability to track devices over large ranges, in the magnitude of kilometres (Behzad et al., 2014), but it is prone to network loss and hence loss of the target device (Ficek et al., 2013). Network traffic is another challenge in using GSM for tracking, because it introduces bottlenecks into the system. Finally, the system is expensive to install and run due to the size and cost of the equipment.

c) Wireless Sensor Network

WSN adoption has grown exponentially, and its increased application in automated home security sparked huge interest in WSNs. The silicon boom encouraged the growth of miniature microchips and semiconductors, contributing to the growth of WSN applications: areas that were previously hard to access due to equipment size are now reachable, and sensor node power consumption has been reduced significantly. WSNs were initially used in research to track wildlife movements, deep-sea animal movements, underground mine conditions and veldt fires. They have recently been introduced to agriculture for monitoring farming areas and tracking livestock (Nagl et al., 2003; Huircan et al., 2010; Raizman et al., 2013). WSNs are widely scalable, and they provide short- to medium-range coverage of the deployment area. The challenge with their application is limited power, which reduces their life span (Nagl et al., 2003). WSNs have been deployed for tracking in two ways. In the first, static sensors monitor an area in a pre-defined network structure; they are programmed to relay any data from the sensors through the network to the base station. The second method is the ad-hoc network: a mobile WSN in which the tracked targets, tagged with sensors, constitute the nodes of a dynamic mesh network.

1.3 Related Works

Research on tracking mobile targets has been around for quite some time; it took off after the development of the GPS system, built initially by the US military to improve troop location and target zoning. Chakole et al. (2013) developed an application of the hybrid system for tracking and monitoring vehicles in order to improve service delivery during accidents; the vehicle information is relayed to a database and displayed via a Graphical User Interface (GUI). Behzad et al. (2014) developed a similar hybrid system for monitoring and tracking vehicles; their focus was on the security of the vehicles and on providing extra features over current vehicle tracking systems. The authors developed a low-cost tracker that also monitors the status of the vehicle while parked; through a hidden button for system activation, the system warns the user of movements and turns off the ignition if any suspicious motion is detected. It also gives the user the ability to control the ignition system through text messages in case the vehicle is stolen.

Other applications of this hybrid network are in smart home monitoring, as outlined in Xu et al. (2010) and Liu (2014); both developed systems that use a ZigBee WSN connected to GPRS and linked to a database. Xu et al. (2010) focused on improving a Dijkstra routing algorithm and testing it on their system to find the shortest path to relay data, while Liu (2014) focused on hardware design and analysis for a low-cost, low-power, fast-rate ZigBee network for smart home technology.

Ficek et al. (2013) reviewed tracking in mobile networks; the authors present a short message service (SMS) based active tracking system for obtaining the position of mobile user terminals through the pre-existing GSM network. The system can be deployed in an academic environment using off-the-shelf components; it can also work cross-platform and across borders for roaming customers.

2 PROPOSED HYBRID SYSTEM DESIGN

The system is developed from off-the-shelf components. Mobile asset tracking using wireless sensor networks usually employs a mesh or star topology, with either clustered or dynamic routing. To expand the coverage distance, GSM/GPRS is combined with the WSN; the system is composed mainly of the following components:


2.1 Hardware Design

a) Mobile Node (ear tag)

The mobile tags are attached to the animals' ears, encased in a plastic cover. These mobile nodes will frequently receive the target's GPS coordinates, append the RFID tag identification from the current Livestock Identification and Tracking System (LITS) and the owner's details into a packet, and transfer it to the static nodes upon connection. The mobile node is composed of a GPS receiver, an ARM microcontroller, a GSM module, a hybrid power system (coin battery plus thin-film solar power system) and a memory card.
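The packet assembly described above can be sketched as follows. This is an illustrative sketch only: the field names, JSON encoding and example identifiers are assumptions, not the authors' wire format.

```python
import json

def build_packet(gps_fix, rfid_tag, owner_id):
    """Assemble a mobile-node packet: a GPS fix plus the existing LITS
    RFID identification and owner details, ready to hand to a static node.
    Field names and JSON encoding are illustrative assumptions."""
    lat, lon, ts = gps_fix
    return json.dumps({
        "rfid": rfid_tag,   # existing LITS ear-tag/bolus number
        "owner": owner_id,  # owner's details (here reduced to an ID)
        "lat": lat,
        "lon": lon,
        "ts": ts,           # timestamp of the GPS fix
    })

# Hypothetical example: a fix near Palapye with made-up identifiers.
packet = build_packet((-22.545, 27.125, 1463558400), "BW-LITS-0001", "FARMER-42")
```

On connection, the static node would simply forward such packets over GSM/GPRS without inspecting them, which keeps the gateway logic independent of the packet contents.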

b) Static Nodes (gateway)

These parts of the WSN are strategically placed around the BSTs; they act as WSN gateways, sinks or base stations for the mobile nodes. They have stronger coverage: they carry external antennae and a much larger solar panel power system, and their location and distance from the BST are fixed. The static nodes comprise a larger hybrid power module (battery plus solar power system), a microcontroller, a memory card and a GSM module.

c) GPS module & GSM/GPRS module

For mobile nodes, the Sim968 combined GPS/GSM module (SIM Tech, 2007) is used to receive GPS triangulation coordinates and to communicate with the static nodes. For static nodes, since their location is known and stationary, there is no need for a GPS receiver; only a Sim900DS GSM module is used (SIM Tech, 2007). The two modules are portable and equipped with a powerful single-processor ARM926EJ-S core. The Sim900DS has dual-SIM capability, and both are quad-band modules, which allows more than one frequency in a network to be used to implement the system, improving the chances of transmission and reducing connection delays; furthermore, multiple bands can be used to relay data.

d) Electronic RFID tag

These are currently in circulation; the system will adopt the current system's tags to reduce costs, to ease adoption, and to keep to current protocols when the system comes into use. The tags need not be read separately, since they are encased by the tag upgrade; all that is required is to install the casing with the existing RFID number stored at installation. The proposed system is designed for farmers and will reduce costs for government; however, both parties can choose what works for them without compromising the other.

e) Hybrid Powering Module

These modules are mainly for the WSN. Research has shown that power management is a major challenge for WSNs (Nagl et al., 2003); therefore, in order not to lose focus on the main task of real-time tracking, the authors adopt a hybrid power system (i.e. chemical batteries plus solar power) to ensure a longer node life span (Huircan et al., 2010). Thin-film solar cells layered on the surface of the ear tag casing will collect and convert solar radiation into power for the mobile nodes; for the static nodes, small-to-medium solar panels will be used, combined with larger batteries.
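A back-of-the-envelope budget illustrates why the hybrid supply matters: harvested solar energy must cover the node's daily consumption for the battery not to drain. All figures below are hypothetical placeholders for illustration, not measurements from the proposed hardware.

```python
def daily_energy_budget(tx_mw, tx_s_per_day, idle_mw, panel_mw, sun_hours):
    """Return (consumed_mWh, harvested_mWh) for one day of node operation.
    tx_mw: power draw while transmitting (mW); tx_s_per_day: seconds spent
    transmitting per day; idle_mw: draw the rest of the day; panel_mw:
    solar cell output in full sun; sun_hours: hours of sun per day.
    All parameter values used below are hypothetical."""
    seconds_per_day = 24 * 3600
    consumed_mwh = (tx_mw * tx_s_per_day
                    + idle_mw * (seconds_per_day - tx_s_per_day)) / 3600
    harvested_mwh = panel_mw * sun_hours
    return consumed_mwh, harvested_mwh

# Example: 500 mW during 120 s of GSM transmission per day, 1 mW idle,
# a 50 mW thin-film cell, and 8 h of sun.
used, gained = daily_energy_budget(500, 120, 1.0, 50, 8)
```

Under these assumed figures the panel harvests several times the daily consumption, with the battery bridging nights and cloudy days; a real design would repeat this calculation with measured module currents.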


2.2 Software Design

a) Nodes Software

Each component in the network has a specific functionality that defines how the components combine to form the network. Figure 1 (Appendix A) shows the individual node flow diagrams, outlining the separate nodes and their data flow; these nodes, combined, form the network (WSN) software of the proposed system. The flow diagrams showing the functions, decision trees and data flow within the system are outlined in Figure 1: mobile node, static node and BST node, from left to right respectively. The envisioned overall pseudo code of the software algorithm for the hybrid system, focusing on the networking part, is outlined in Algorithm 1; it is not detailed, the variables are named for ease of understanding, and they are subject to change over time. The pseudo code demonstrates the proposed approach to data structuring and transfer, from livestock nodes to the abstracted database. The research is not focused on optimizing routing algorithms or database structure, but on the main task of tracking and storing historical data of livestock movement in real time.

b) Pseudo code for routing algorithm

Algorithm 1: Summarized algorithm of the system
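Since the pseudo code figure is not reproduced here, the mobile-node → static-node → BST relay described in the text can be sketched as below. The classes, the range test and the single-round structure are illustrative assumptions, not the authors' Algorithm 1.

```python
import math

class MobileNode:
    """Ear-tag node: holds its position, its current packet, and local
    storage (memory card) for packets it cannot deliver."""
    def __init__(self, pos, packet):
        self.pos, self.packet, self.stored = pos, packet, []

class StaticNode:
    """Gateway node near a BST: buffers packets received from mobile nodes."""
    def __init__(self, pos, radius):
        self.pos, self.radius, self.buffer = pos, radius, []

def in_range(mobile, gateway):
    # Simple Euclidean range check; real radio coverage is more complex.
    return math.dist(mobile.pos, gateway.pos) <= gateway.radius

def relay_round(mobiles, statics, database):
    """One relay round: each mobile node hands its packet to the first
    in-range static node; gateways then forward buffered packets onward
    (in the proposed system, via the GSM/GPRS uplink through the BST)."""
    for m in mobiles:
        gw = next((s for s in statics if in_range(m, s)), None)
        if gw is not None:
            gw.buffer.append(m.packet)
        else:
            m.stored.append(m.packet)  # keep on memory card until in range
    for s in statics:
        database.extend(s.buffer)      # BST uplink to the back-end database
        s.buffer.clear()
```

For example, with one gateway at the origin with a 10-unit radius, a mobile node at (3, 4) delivers its packet in one round, while a node at (30, 40) stores it locally until it drifts back into coverage.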

c) User Interface

The system will have multiple user interfaces: cell phone messaging, an Android application for mobile devices, and a web application showing the position of livestock overlaid onto Google Maps. These interfaces will provide farmers with alert messages and visual tracking of their livestock. GSM messaging is specifically for less technologically savvy and older users, who may still find it difficult to use smartphones; it also provides access in areas where 3G+ coverage is not available. The web application gives a much larger viewing area and more access for users who have internet access on their devices, e.g. laptops and tablets.

3 EXPERIMENTAL SETUP: MODELING AND SIMULATION

Modeling and simulation of the proposed hybrid system are performed with Network Simulator 2 (NS-2), using Tcl scripts generated by the NSG2 Java applet to model the WSN scenarios. This provides insight into the applicability of the WSN part of the system; however, it has the limitations outlined below:

- The simulator does not give information on connection loss.
- It does not recover lost nodes or show how they communicate once they have gone beyond coverage.
- The node movement is preplanned using waypoint configuration, not random as livestock movements would be in reality.
- Configuration via the applet is cluttered and can get confusing, with all the crossing lines; see Figure 3.
- The distance and direction of the nodes are not displayed as they move, which would be a good indicator of tracking.
- The simulator does not explicitly indicate or generate tracking errors.

The main configuration, in which a BST sits at the centre and is accessed only by the surrounding static nodes, while mobile nodes have access only to static nodes, simulates properly and gives hope that it will reduce connection costs and improve coverage area. Figure 3 (Appendix C) shows a snapshot of the scenario setup, with node 0 representing the BST, nodes 1-3 representing static nodes, and the remaining nodes being the mobile nodes, i.e. the livestock.

Figure 4 (Appendix D) shows a snapshot of the simulation output as the system runs; the rings around the nodes depict a connection and the arrows show data transfer between the nodes. The simulation can be played in both forward and reverse, in case one misses something during review; it also allows the running speed to be changed.

4 RESULTS AND DISCUSSION

The simulations show movement and communication between the nodes as they move through the network, indicating the path the data packets take from node to sink through the rings forming around the transmitting and receiving nodes. This indicates successful transmission and reception of data during motion and, at the same time, outlines the dynamic mesh network of the system. When a node goes out of range it stops transmitting; even when it drifts back into range it does not show any communication. This indicates loss of connection and, by extension, of the node or livestock. Figure 4 (Appendix D) shows the nodes and their connection rings during a running simulation, but it shows neither the coordinates of the nodes nor the cardinal points from which one could determine their direction of travel. This may be due to limitations of the software, but such information is necessary for interpreting the viability of the system. The planned movements in the simulation do not depict well the erratic movements of livestock, but they give an idea of how the system will respond to mobile nodes.

5 CONCLUSION

This paper has presented the concept and simulation of a high-level solution to livestock tracking using a hybrid mobile target tracking system. Simulation results show that the system is viable, and that mobile node tracking using the combination of networks is achievable. The hybrid system provides both identification conforming to current standards and real-time, 24/7 livestock tracking (no gaps in animal movements). One limitation of the simulator is that it does not quantify tracking errors; fortunately, tracking livestock does not require the extreme precision in location that, say, missile guidance does. Consequently, the approximate location provided by the system would be very helpful in reducing the costs of tracking livestock and keeping records of their movements. The challenges identified (direction of nodes, coordinates of nodes and loss of nodes) will be worked on in future to improve the accuracy of the system. An algorithm to optimize the location and tracking of livestock and to improve power management will be incorporated into this research in future development. Also in future, simulations of the individual components will be presented, as well as findings on changes made before deployment, with detailed descriptions. A technique to quantify tracking errors will also be developed.

REFERENCES

Behzad, M., Sana, A., Khan, M. A., Walayat, Z., Qasim, U., Khan, Z. A., & Javaid, N. (2014). Design and Development of a Low Cost Ubiquitous Tracking System. Procedia Computer Science, 34, 220-227.

Chakole, S. S., Kapur, V. R., & Suryawanshi, Y. A. (2013, April). ARM Hardware Platform for Vehicular Monitoring and Tracking. In Communication Systems and Network Technologies (CSNT), 2013 International Conference on (pp. 757-761). IEEE.

Ficek, M., Pop, T., & Kencl, L. (2013). Active tracking in mobile networks: An in-depth view. Computer Networks, 57(9), 1936-1954.

Huircán, J. I., Muñoz, C., Young, H., Von Dossow, L., Bustos, J., Vivallo, G., & Toneatti, M. (2010). ZigBee-based wireless sensor network localization for cattle monitoring in grazing fields. Computers and Electronics in Agriculture, 74(2), 258-264.

Liu, Z. Y. (2014). Hardware design of smart home system based on ZigBee wireless sensor network. AASRI Procedia, 8, 75-81.

Long Distance Post. (1996). History of GSM and More. Belmont: LDpost.

Moreki, J. C., Ndubo, N. S., Ditshupo, T., & Ntesang, J. B. (2012). Cattle Identification and Traceability in Botswana. Journal of Animal Science Advances, 2(12), 925-933.

Nagl, L., Schmitz, R., Warren, S., Hildreth, T. S., Erickson, H., & Andresen, D. (2003, September). Wearable sensor system for wireless state-of-health determination in cattle. In Proceedings of the 25th Annual International Conference of the IEEE EMBS, Cancun, Mexico (pp. 3012-3015).

Raizman, E. A., Rasmussen, H. B., King, L. E., Ihwagi, F. W., & Douglas-Hamilton, I. (2013). Feasibility study on the spatial and temporal movement of Samburu's cattle and wildlife in Kenya using GPS radio-tracking, remote sensing and GIS. Preventive Veterinary Medicine, 111(1), 76-80.

SIM Com. (n.d.). GSM GPRS Modules. SIM Com.

SIM Tech. (2007). SIM968 Combo Module. SIM Com.

Sunday Standard Reporter. (2012, May 12). Electronic ear tags to replace bolus. Retrieved January 27, 2016, from http://www.sundaystandard.info/electronic-ear-tags-replace-bolus

Xu, M., Ma, L., Xia, F., Yuan, T., Qian, J., & Shao, M. (2010, October). Design and implementation of a wireless sensor network for smart homes. In Ubiquitous Intelligence & Computing and 7th International Conference on Autonomic & Trusted Computing (UIC/ATC), 2010 (pp. 239-243). IEEE.


APPENDIX

A: Node flow diagram

Figure 1: Node flow diagrams

B: Pseudo code algorithm for the system

Figure 2: Summarized algorithm of the system


C: Simulation setup using NSG2

Figure 3: WSN scenario setup in NSG2

D: Simulation output

Figure 4: Simulation Output


IC1013

A Distributed Computational MapReduce Algorithm for Big Data Electronic Health Records

Sreekanth Rallapalli
Network & Infrastructure Management, Faculty of Computing
Botho University, Gaborone, Botswana

[email protected]

Radhika Kidambi
Department of Computer Science
AIMS Institutions, Bangalore, India

[email protected]

Suryakanthi Tangirala
Department of Accounting & Finance, Faculty of Business
University of Botswana, Gaborone, Botswana

[email protected]

ABSTRACT

Recent advances in technology and architecture have led to Big Data analysis. Big data capability can be offered to small and mid-size organizations through cloud computing. Big data can be processed with MapReduce, a programming technique for big data processing; this requires network-attached storage and parallel processing. Designing an efficient algorithm for processing big data on the cloud is a challenging task. Health care generates huge amounts of data in the form of Electronic Health Records (EHR), and these data have to be processed on the cloud to minimize processing cost. An efficient, scalable MapReduce algorithm is required to process the large volumes of EHR data generated from various sources. Cloud computing lets organizations process big data without having to buy or maintain their own cluster or data centre. Divide-and-conquer and branch-and-bound algorithms proposed by researchers confirm the effectiveness and scalability of MapReduce algorithms. Recent research confirms that, to provide the massive computational and storage resources demanded by big data at reasonable power costs, we must rely on parallel and distributed computation. In this paper we study the existing algorithms to process EHR using cloud computing as the platform. We then propose ESPD-CIMAC, an efficient, scalable, parallel and distributed computational MapReduce algorithm for cloud computing to process EHR using Hadoop clusters.

Key words: Big Data; Cloud Computing; EMR; Hadoop Clusters; MapReduce.

1 INTRODUCTION

A large amount of medical data, popularly known as Big Data, is being generated by hospital systems, clinical systems, and other medical devices (Manyika et al., 2011). Big data is currently managed and analysed using database management systems, but traditional database management systems (Evans & Hutley, 2010) do not have the capability to handle Big Data: they expect structured data, whereas Big Data is either unstructured or semi-structured.


Proceedings of the 1st International Conference on the Internet, Cyber Security, and Information Systems (ICICIS), Gaborone, 18-20 May 2016

Copyright © Department of Computer Science, University of Botswana, 2016 12

As the data is vastly scalable, an efficient, scalable algorithm is required to process it. While building scalable solutions, network bottlenecks and the low performance of hardware nodes have to be taken into consideration (Wang & Liu, 2011). The medical history of a patient stored in a digital format is referred to as an Electronic Health Record (EHR). Healthcare needs scalable and distributable solutions on the cloud. Cloud computing is a promising technology which provides a shared pool of configurable computing resources that can be managed with minimal management effort (Mell & Grance, 2010).

Various frameworks have been developed for big data computing (Ekanayake et al., 2010; Howe et al., 2010; Mihaylov et al., 2012; Low et al., 2012; Ewen et al., 2012; Zhang et al., 2012). MapReduce (The Apache Software Foundation, 2013) provides efficient methods to build scalable and distributable solutions for healthcare data. The most important issue is to efficiently move the big data to the cloud. Cost-minimizing data migration solutions using various online algorithms (Zhang et al., 2013) give the cloud the flexibility to choose a data centre for effective data processing. A MapReduce framework on a single cluster may not be suitable for distributed data and resources (Condie et al., 2010); distributed MapReduce architectures are preferred when data is aggregated from various data sources and from different computing nodes.

A cloud MapReduce architecture is used to process healthcare records. Rack servers connected to a top-of-rack switch, which uplinks to other switches of similar bandwidth, form a cluster (Zhou et al., 2009). In this paper we propose ESPD-CIMAC, an efficient and scalable MapReduce algorithm for processing EHR using Hadoop clusters. The paper is organized as follows. Section 2 covers preliminaries concerning Hadoop clusters. Section 3 relates to EHR. Section 4 relates to MapReduce and healthcare issues. In Section 5, SOA-based cloud computing features are studied. Section 6 reviews the literature on MapReduce algorithms. In Section 7 we propose an efficient, scalable, iterative MapReduce algorithm for cloud computing to process Big Data EHR using Hadoop clusters. In Section 8 experimental results are presented. Section 9 provides the conclusion.

2 HADOOP CLUSTERS

Hadoop clusters are built from rack servers connected to a top-of-rack switch. The uplinks of the rack switch are connected to another set of switches with equal bandwidth. This forms a cluster in a network. We can set up this cluster on the cloud so that the cluster's workflow produces the required results from the large data sets. In our case we load the EHR data into the cluster and search for a query.

The workflow of the cluster is as follows: the Hadoop Distributed File System (HDFS) writes the loaded data into the cluster; the data is analysed with MapReduce algorithms; HDFS writes the results and saves them into the cluster; HDFS then reads the results back from the cluster. If, from large EHR data sets, we need to find how many patients were diagnosed with heart disease, this can be analysed and processed very quickly using Hadoop. Hadoop divides the huge data sets into smaller chunks, processes them across multiple machines, and thus produces the result quickly. The typical architecture of a Hadoop cluster is shown in Figure 1.
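As an illustration of this workflow, the following pure-Python sketch simulates the map, shuffle, and reduce steps a Hadoop job would run to count heart-disease diagnoses. The record fields and diagnosis values are hypothetical examples, not a real EHR schema:

```python
# Toy simulation of MapReduce counting diagnoses in EHR records.
# In a real Hadoop job, map_phase and reduce_phase would run in
# parallel across the cluster; here they run in one process.
from collections import defaultdict

def map_phase(record):
    # Mapper: emit (diagnosis, 1) for each record.
    yield (record["diagnosis"], 1)

def reduce_phase(key, values):
    # Reducer: sum the counts for one diagnosis.
    return (key, sum(values))

def run_mapreduce(records):
    # Shuffle: group intermediate pairs by key before reducing.
    groups = defaultdict(list)
    for record in records:
        for key, value in map_phase(record):
            groups[key].append(value)
    return dict(reduce_phase(k, v) for k, v in groups.items())

records = [
    {"patient_id": 1, "diagnosis": "heart disease"},
    {"patient_id": 2, "diagnosis": "diabetes"},
    {"patient_id": 3, "diagnosis": "heart disease"},
]
counts = run_mapreduce(records)
print(counts["heart disease"])  # 2
```

Because each mapper call touches only one record and each reducer call only one key, the same logic scales out to many machines once the shuffle is handled by the framework.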


Figure 1. Hadoop Cluster Architecture

For faster parallel processing, the huge data set has to be loaded into the Hadoop cluster. The client breaks the data into smaller chunks, and the chunks are sent to different machines for processing, where each chunk is processed in parallel. To prevent data loss and poor network performance, the Hadoop administrator manually defines the rack number of each slave data node in the cluster.

EHR data loaded into the cluster is divided into data chunks like File A, File B, File C, and so on. In this section we present how these data chunks are loaded into HDFS. The client consults the Name node and then writes a block of data to one data node. That data node then replicates the block to other nodes as decided by the Name node. The same repeats for the next set of blocks until all the chunks of data have been completely processed. Writing of files to HDFS is shown in Figure 2: large Electronic Health Record data is initially sent to the client as data chunks. The replication factor for the blocks is set to 3 by default. Hadoop writes the data efficiently across its nodes and thereby keeps the data safe: if for any reason a node fails, the data is still available on other nodes.
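The effect of a replication factor of 3 can be shown with a toy model. This is only a sketch of the idea, not HDFS's actual rack-aware placement policy, and the node and file names are made up:

```python
# Toy model of block replication: each block is placed on 3 distinct
# nodes, so losing any single node leaves every block readable.
import itertools

REPLICATION_FACTOR = 3

def place_blocks(blocks, nodes):
    """Assign each block to REPLICATION_FACTOR distinct nodes, round-robin."""
    placement = {}
    node_cycle = itertools.cycle(nodes)
    for block in blocks:
        placement[block] = {next(node_cycle) for _ in range(REPLICATION_FACTOR)}
    return placement

def readable_after_failure(placement, failed_node):
    """A block is still readable if any surviving node holds a replica."""
    return all(replicas - {failed_node} for replicas in placement.values())

nodes = ["node1", "node2", "node3", "node4"]
placement = place_blocks(["File A", "File B", "File C"], nodes)
print(readable_after_failure(placement, "node1"))  # True
```

With three replicas, two simultaneous node failures can also be tolerated; the real Name node additionally spreads replicas across racks so that a whole-rack failure does not take out all copies of a block.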

Figure 2: EHR files writing to HDFS

3 ELECTRONIC HEALTH RECORDS (EHR)

An EHR contains the complete data relevant to a patient: medical history, demographics, problems, medications, clinical observations, signs and symptoms, immunization reports, radiology reports, laboratory data, billing information, personal data, and patient progress data. With EHR, storage and retrieval of data is efficient, and patient care can be enhanced by EHR systems (Jamoom et al., 2012). The various components of EHR are shown in Figure 3.


Figure 3. EHR Components

A clinical document contains various kinds of information (Electronic Health Records Overview, 2006). These documents are critical for patient care and validate the patient care provided. An XML-based standard electronic document that defines the structure of such documents has been developed by HL7 International (Dolin et al., 2006). The benefits of EHR include minimized billing errors (Wang et al., 2003), reduced costs (Johnston et al., 2004; Menachemi & Brooks, 2006), accuracy in diagnosis (Jamoom et al., 2012), and many more. Despite the many benefits of EHR there are certain barriers (Rahman & Reddy, 2015) to its adoption. EHR will become inevitable for patient care in hospitals and various medical practices.

4 BIG DATA, MAPREDUCE AND HEALTHCARE

A voluminous amount of healthcare data relating to patients, in many different formats, is being generated by healthcare organizations (Hu et al., 2014). Healthcare data sources need to follow the standards set by the healthcare industry. Big data helps organizations reduce costs and improve the quality of patient healthcare. Big data technologies have helped develop various applications which are successfully applied in many fields of the health sciences. Various health information systems have been proposed by researchers (Fernandez-Luque, 2009; Duan et al., 2011; Hoens et al., 2013; Wiesner & Pfeifer, 2014). Healthcare is one of the fields that needs scalable and distributable solutions to efficiently solve its data processing problems and distribute the data in a secure way. There are many problems in healthcare data which need to be addressed, such as finding patients with common symptoms, analysing laboratory records, treatments given to various patients based on reports, patient responsiveness to prescribed drugs, and so on. As 80% of healthcare data (Miliard, 2011) is unstructured, applications are required in which this type of data can be processed. MapReduce has emerged as one of the solutions to these healthcare issues (Bhatotia et al., 2011; Mazur et al., 2011; Yan et al., 2012). Big data analytics can be applied to EHR data to predict risk among patients on various health-related issues. Analytics will also help to diagnose patients with certain diseases at early stages.

Programming with data-oriented tools such as SQL and various statistical languages is required for big data analysis in healthcare. Big data analytics in healthcare is needed to transform data into knowledge. Statistical, contextual, quantitative, predictive, and cognitive models can be developed to derive information from huge data sets. The latest big data technologies are able to collect large data sets from millions of patients, identify clusters and correlations, and analyze the data in a short time using statistical machine learning and modeling techniques (Baah et al., 2006). By implementing statistical modeling or machine learning techniques it is possible to tell whether a patient is likely to fall sick again using a valid range of data sets. The McKinsey Global Institute estimated that the potential value from Big Data in healthcare could be $300 billion a year (Kayyali et al., 2013). A big data analytics platform for healthcare is suggested in Figure 4.


Figure 4 Big Data analytics on Healthcare

5 CLOUD COMPUTING

In the healthcare ecosystem, a cloud computing environment can provide various benefits to all components (Parakala & Udhas, 2011). As healthcare data is available in both structured and unstructured forms, the database used to store it should be capable of processing unstructured data; a NoSQL database can be used efficiently to handle such data. Data integration can be based on Hadoop MapReduce (Apache Hadoop, 2012). The basic requirements of cloud Infrastructure as a Service are security, scalability, and multi-tenancy. It should also be location independent and provide on-demand virtual networks. Figure 5 shows the basic requirements of cloud IaaS.

Figure 5: Cloud infrastructure (Source: http://www.slideshare.net/bradhedlund/architecting-data-center-networks-in-the-era-of-big-data-and-cloud-13033773)

A large number of read and write requests are generated while processing big data on the cloud, with thousands of entities such as application servers accessing the data. To avoid failures and keep the service running, the read and write load often needs to be balanced, and a large number of servers need to be kept ready for distribution. Cloud computing makes it possible to have large-scale, on-demand infrastructure which can provide resources for different workloads. Big data can be offered as a service on the cloud (Low et al., 2012). For parallel analysis, the data on the cloud needs to be partitioned, distributed, and configured, and then loaded into memory. Hadoop can be deployed on the cloud to perform massive data processing; a Hadoop cloud platform for data processing can be built once an efficient algorithm for the Hadoop database is designed. Cloud computing provides opportunities for growth even though there are barriers to various big data services (Moretti et al., 2008). Table 1 illustrates the growth opportunities for cloud computing with respect to various barriers.

We use parallel computing to process large data sets on the cloud. MapReduce is a parallel programming model which is supported by capacity-on-demand clouds (Gunarathne et al., 2010). For a large collection of data stored in the cloud, MapReduce can compute the inverted index in parallel. A reliable SOA infrastructure is required for an SOA suite to integrate the healthcare records. The applications can constantly share information to leverage essential business processes for healthcare integration. Cloud computing ensures scalable infrastructure: hardware, software, and any healthcare applications. SOA ensures that the software is delivered as a service to all healthcare providers. By implementing cloud SOA healthcare (Rallapalli & Gondkar, 2016), organizations can minimize the security risk involved in exchanging information. Assume that each node i in the cloud stores the EHR records ri,1, ri,2, ri,3, ri,4, ri,5, … and that each EHR record contains patient laboratory information pi,1, pi,2, pi,3, …. To retrieve information on patients with similar lab information we use an inverted index, which lists:

{w1: r1,1, r1,2, r1,3, …}
{w2: r2,1, r2,2, r2,3, …}
{w3: r3,1, r3,2, r3,3, …}
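A minimal sketch of how such an inverted index could be built with the map/reduce pattern is given below; the record identifiers and lab terms are hypothetical examples:

```python
# Building an inverted index (lab term -> records containing it) in
# map/reduce style. In Hadoop, the grouping would be done by the
# shuffle phase; here a dictionary stands in for it.
from collections import defaultdict

def map_record(record_id, lab_terms):
    # Mapper: emit (term, record_id) for every lab term in the record.
    for term in lab_terms:
        yield (term, record_id)

def build_inverted_index(records):
    # Shuffle + reduce: group record IDs under each lab term.
    index = defaultdict(list)
    for record_id, lab_terms in records.items():
        for term, rid in map_record(record_id, lab_terms):
            index[term].append(rid)
    return dict(index)

records = {
    "r1,1": ["glucose", "hba1c"],
    "r2,1": ["glucose"],
    "r2,2": ["hba1c"],
}
index = build_inverted_index(records)
print(sorted(index["glucose"]))  # ['r1,1', 'r2,1']
```

Looking up a term in the finished index then returns all records sharing that lab observation, which is exactly the "patients with similar lab information" query described above.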

To analyze terabytes and petabytes of data in a limited amount of time and to perform statistical analysis, we need hundreds of servers, and the data must be distributed across the servers so it can be analyzed in parallel. The suggested Big Data Electronic Health Record processing on the cloud is shown in Figure 6.

Figure 6: Big Data EHR processing on cloud


Barrier: Opportunity

Availability: Multiple cloud providers
Data lock-in: Standardized APIs; hybrid cloud computing
Data confidentiality: Encryption, firewalls, VLANs
Data transfer bottleneck: Higher-bandwidth switches; transporting disks
Performance unpredictability: Improved virtual machine support
Scalable storage: Scalable stores available
Bugs in large distributed systems: Debuggers for VMs
Scaling quickly: Auto-scaling
Software licensing: Pay-for-use licenses

Table 1: Barriers for big data and cloud computing opportunities

6 LITERATURE REVIEW OF UNCERTAIN DATA ALGORITHMS

Healthcare data is largely unstructured because a large number of images are stored in databases, so this section focuses on reviewing the algorithms required for uncertain data. The main challenge with uncertain data is modelling it and integrating it with various applications. Working models for uncertain data were proposed in (Sarma et al., 2006; Aggarwal, 2010). The second challenge is data management and the processing applications needed to manage uncertain data. To analyse large data sets, a traditional data clustering algorithm has been proposed (Bu et al., 2010). To understand the performance of the application, the MapReduce programmer needs to write complex code. For data clustering algorithms, classification and regression trees can be applied to improve performance.

7 EFFICIENT, SCALABLE, PARALLEL AND DISTRIBUTED COMPUTATIONAL MAPREDUCE ALGORITHM FOR CLOUD COMPUTING

A MapReduce program has two functions, named Map() and Reduce(). Their general signatures are:

map (s1, t1) -> [<s2, t2>]
reduce (s2, {t2}) -> [<s3, t3>]

A MapReduce system like Hadoop reads the input data, performs the computation, writes the results to the Hadoop Distributed File System, and creates chunks of blocks which run across a cluster of machines. In a MapReduce program the system runs a process called the Job Tracker on the master node to monitor job progress, and a set of processes called Task Trackers on the worker nodes to perform the actual Map and Reduce tasks.
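A concrete, if simplified, instance of these signatures is the classic word count. The sketch below runs the map, shuffle/sort, and reduce phases in memory rather than on a cluster:

```python
# map(s1, t1) -> [(s2, t2)] and reduce(s2, [t2]) -> [(s3, t3)],
# instantiated as word count over lines of text.
from itertools import groupby
from operator import itemgetter

def map_fn(s1, t1):
    # (s1, t1) = (record offset, line of text); emit (word, 1) pairs.
    return [(word, 1) for word in t1.split()]

def reduce_fn(s2, t2_values):
    # (s2, [t2]) = (word, list of counts); emit one (word, total) pair.
    return [(s2, sum(t2_values))]

def run_job(lines):
    intermediate = []
    for offset, line in enumerate(lines):
        intermediate.extend(map_fn(offset, line))
    intermediate.sort(key=itemgetter(0))  # the shuffle/sort phase
    output = []
    for key, group in groupby(intermediate, key=itemgetter(0)):
        output.extend(reduce_fn(key, [v for _, v in group]))
    return dict(output)

print(run_job(["fever cough", "fever"]))  # {'cough': 1, 'fever': 2}
```

In a real Hadoop job, map_fn and reduce_fn would be the bodies of Mapper and Reducer classes, with the Job Tracker scheduling them onto Task Trackers and the framework performing the sort and grouping between the two phases.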

Let us first discuss the various iterative models proposed to improve MapReduce processing. Twister, HaLoop, and iMapReduce, proposed earlier, focus on reducing job startup costs and caching structured data. To provide an efficient, scalable, distributed computational iterative algorithm for cloud computing, we need to separate structure and data in the Application Program Interface. The iterative map model allows algorithms like k-means and PageRank to be expressed as iterative functions. The separation of structure and state data can be achieved by enhancing the map function with the structure as key-value pairs in incremental MapReduce.
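The separation of loop-invariant structure data from loop-variant state data can be sketched with a PageRank-style iteration. The graph and damping factor below are made-up examples, not part of the proposed algorithm:

```python
# Loop-invariant structure data (adjacency lists) is built once and
# could be cached across iterations, as Twister/HaLoop/iMapReduce do;
# only the loop-variant state data (rank values) changes per round.
structure = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}  # cached
state = {v: 1.0 / len(structure) for v in structure}   # updated each round

def iterate(structure, state, damping=0.85):
    # One map/reduce round: each vertex spreads its rank to its neighbours.
    contrib = {v: 0.0 for v in structure}
    for v, neighbours in structure.items():
        share = state[v] / len(neighbours)
        for n in neighbours:
            contrib[n] += share
    base = (1 - damping) / len(structure)
    return {v: base + damping * contrib[v] for v in structure}

for _ in range(20):
    state = iterate(structure, state)
print(round(sum(state.values()), 6))  # ranks still sum to ~1.0
```

Because the structure dictionary never changes, a framework that caches it locally avoids re-reading and re-shuffling it on every iteration, which is the main cost the iterative MapReduce variants eliminate.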


The algorithm below reads a chunk of EHR data. The structure and state key-value pairs for iterative algorithms vary by structure key, value, state key, and value; relations like one-to-one or many-to-one may exist between the state key and the structure key. An iterative algorithm requires two important data sets: the loop-invariant structure data and the loop-variant state data, both needed for efficient and scalable computation. SOA-based cloud computing exchanges information using loosely coupled software components. Iterative algorithms are generally used for ranking the data in the clusters. Consider j, k as vertex numbers. In this algorithm we combine a clustering algorithm with a generalized iterated matrix-vector algorithm.

Algorithm 1: Parallel computing algorithm

Input: data in List<vertex numbers>
Output: incrementally computed results of the iterative algorithm

1. Begin with n Hadoop clusters, each containing chunks of EHR data, and number the clusters 1 through n.
2. In the Map phase, input the vertex numbers and output each vertex, so that the Reduce phase produces the output by summing over all neighbours.
3. Compute the cluster distance using k-means for MapReduce.
4. Algorithms like generalized iterated matrix-vector multiplication in MapReduce can be implemented for the data sets.
5. Run a sequence of jobs J1, J2, J3, … which incrementally produce the results of the iterative algorithm.
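Step 3 (k-means in MapReduce) can be sketched as a map step that assigns each point to its nearest centroid and a reduce step that recomputes the centroids. The one-dimensional points and initial centroids below are made-up values, not EHR data:

```python
# One k-means iteration in map/reduce style: map emits
# (nearest-centroid index, point); reduce averages each group.
from collections import defaultdict

def map_assign(point, centroids):
    # Mapper: emit (index of nearest centroid, point).
    nearest = min(range(len(centroids)), key=lambda i: abs(point - centroids[i]))
    return (nearest, point)

def reduce_recompute(assignments):
    # Reducer: new centroid = mean of the points assigned to it.
    groups = defaultdict(list)
    for idx, point in assignments:
        groups[idx].append(point)
    return {idx: sum(pts) / len(pts) for idx, pts in groups.items()}

points = [1.0, 3.0, 8.0, 10.0]
centroids = [0.0, 10.0]
assignments = [map_assign(p, centroids) for p in points]
new_centroids = reduce_recompute(assignments)
print(new_centroids)  # {0: 2.0, 1: 9.0}
```

Repeating this map/reduce round until the centroids stop moving gives the full k-means loop; each round is one MapReduce job in the sequence J1, J2, J3, … of step 5.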

8 EXPERIMENTAL RESULTS

The experimental setup consists of four nodes sharing a LAN with a managed switch. One node is used as the master, which supervises the data and the flow of control over all other nodes in the Hadoop cluster. All nodes run on Intel Core 2 Duo processors, use the Ubuntu Linux operating system, and have Java JDK 7 installed. Apache Hadoop, available as open source from the project website, was installed on each node following the installation guides. The experimental setup is shown in Figure 7.

Figure 7 Experimental setup of 4 nodes


For experimental purposes we consider a public database of electronic health records containing in total 100,000 patients, 361,760 admissions, and 107,535,387 lab observations; the total size of the file was 1.4 GB. The experiments were run on Amazon EC2. To process this EHR data, algorithms like Apriori can be implemented for incremental one-step processing in iterative MapReduce. The standard MapReduce computation takes 800 seconds, while our proposed algorithm takes only 100 seconds.

9 CONCLUSION

In this paper we have described how an iterative, distributed, computational MapReduce algorithm is more efficient than other algorithms for bulk data processing. By implementing this algorithm on SOA-based cloud computing, we can significantly reduce the runtime compared with the general MapReduce algorithm.

REFERENCES

Aggarwal, C. C. (Ed.). (2010). Managing and mining uncertain data (Vol. 35). Springer Science &

Business Media.

Apache Hadoop (2012). Retrieved from: http://hadoop.apache.org.

Baah, G. K., Gray, A., & Harrold, M. J. (2006, November). On-line anomaly detection of deployed software: a statistical machine learning approach. In Proceedings of the 3rd international workshop on Software quality assurance (pp. 70-77). ACM.

Bhatotia, P., Wieder, A., Rodrigues, R., Acar, U. A., & Pasquin, R. (2011, October). Incoop:

MapReduce for incremental computations. In Proceedings of the 2nd ACM Symposium on

Cloud Computing (p. 7). ACM.

Bu, Y., Howe, B., Balazinska, M., & Ernst, M. D. (2010). HaLoop: efficient iterative data processing on

large clusters. Proceedings of the VLDB Endowment, 3(1-2), 285-296.

Condie, T., Conway, N., Alvaro, P., Hellerstein, J. M., Elmeleegy, K., & Sears, R. (2010, April).

MapReduce Online. In NSDI (Vol. 10, No. 4, p. 20).

Dolin, R. H., Alschuler, L., Boyer, S., Beebe, C., Behlen, F. M., Biron, P. V., & Shabo, A. (2006). HL7

clinical document architecture, release 2. Journal of the American Medical Informatics

Association, 13(1), 30-39.

Duan, L., Street, W. N., & Xu, E. (2011). Healthcare information systems: data mining methods in the creation of a clinical recommender system. Enterprise Information Systems, 5(2), 169-181.

Electronic Health Records Overview (2006) National Institute of Health. National Center for Research

Resources. MITRE Center for Enterprise Modernization, Mclean Virginia.

Ekanayake, J., Li, H., Zhang, B., Gunarathne, T., Bae, S. H., Qiu, J., & Fox, G. (2010, June). Twister: a

runtime for iterative mapreduce. In Proceedings of the 19th ACM International Symposium

on High Performance Distributed Computing (pp. 810-818). ACM.

Evans, D., & Hutley, R. (2010). The Explosion of Data. White Paper.

Ewen, S., Tzoumas, K., Kaufmann, M., & Markl, V. (2012). Spinning fast iterative data

flows. Proceedings of the VLDB Endowment, 5(11), 1268-1279.


Fernandez-Luque, L., Karlsen, R., & Vognild, L. K. (2009, August). Challenges and opportunities of

using recommender systems for personalized health education. In MIE (pp. 903-907).

Gunarathne, T., Wu, T. L., Qiu, J., & Fox, G. (2010, June). Cloud computing paradigms for pleasingly

parallel biomedical applications. In Proceedings of the 19th ACM International Symposium

on High Performance Distributed Computing (pp. 460-469). ACM.

Hoens, T. R., Blanton, M., Steele, A., & Chawla, N. V. (2013). Reliable medical recommendation

systems with patient privacy. ACM Transactions on Intelligent Systems and Technology

(TIST), 4(4), 67.

Hu, H., Wen, Y., Chua, T. S., & Li, X. (2014). Toward scalable systems for big data analytics: a

technology tutorial. Access, IEEE, 2, 652-687.

Jamoom, E., Beatty, P., Bercovitz, A., Woodwell, D., Palso, K., & Rechtsteiner, E. (2012). Physician

adoption of electronic health record systems: United States, 2011. NCHS data brief, (98), 1-8.

Jamoom, E., Patel, V., King, J., & Furukawa, M. (2012, August). National perceptions of EHR adoption:

Barriers, impacts, and federal policies. In National conference on health statistics.

Johnston, D., Pan, E., & Walker, J. (2004). The value of CPOE in ambulatory settings. J Healthc Inf

Manag, 18(1), 5-8.

Kayyali, B., Knott, D., & Van Kuiken, S. (2013). The big-data revolution in US health care: Accelerating value and innovation. McKinsey & Company, 1-13.

Li, B., Mazur, E., Diao, Y., McGregor, A., & Shenoy, P. (2011, June). A platform for scalable one-pass analytics using MapReduce. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of data (pp. 985-996). ACM.

Low, Y., Bickson, D., Gonzalez, J., Guestrin, C., Kyrola, A., & Hellerstein, J. M. (2012). Distributed

GraphLab: a framework for machine learning and data mining in the cloud. Proceedings of

the VLDB Endowment, 5(8), 716-727.

Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C., & Byers, A. H. (2011). Big data:

The next frontier for innovation, competition, and productivity.

Mell, P., & Grance, T. (2010). The NIST definition of cloud computing. Communications of the ACM, 53(6), 50.

Menachemi, N., & Brooks, R. G. (2006). Reviewing the benefits and costs of electronic health records

and associated patient safety technologies. Journal of medical systems, 30(3), 159-168.

Mihaylov, S. R., Ives, Z. G., & Guha, S. (2012). REX: recursive, delta-based data-centric

computation. Proceedings of the VLDB Endowment, 5(11), 1280-1291.

Miliard, M. (2011) IBM Unveils New Watson-Based Analytics. Healthcare IT News. Retrieved from:

http://www.healthcareitnews.com/news/ibm-unveils-new-watson-based-analytics-

capabilities.

Moretti, C., Bulosan, J., Thain, D., & Flynn, P. J. (2008, April). All-pairs: An abstraction for data-

intensive cloud computing. In Parallel and Distributed Processing, 2008. IPDPS 2008. IEEE

International Symposium on (pp. 1-11). IEEE.

Parakala, K., & Udhas, P. (2011). The Cloud: Changing the Business Ecosystem. KPMG India study.


Rahman, R., & Reddy, C. K. (2015). Electronic health records: a survey. Healthcare Data Analytics, 36,

21.

Rallapalli, S., & Gondkar, R. R. (2016). A Study on Cloud Based SOA Suite for Electronic Healthcare

Records Integration. In Proceedings of 3rd International Conference on Advanced Computing,

Networking and Informatics (pp. 143-150). Springer India.

Sarma, A. D., Benjelloun, O., Halevy, A., & Widom, J. (2006, April). Working models for uncertain

data. In Data Engineering, 2006. ICDE'06. Proceedings of the 22nd International Conference

on (pp. 7-7). IEEE.

The Apache Software Foundation (2013) Hadoop MapReduce Tutorial. Retrieved from:

https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html.

Wang, F., & Liu, J. (2011). Networked wireless sensor data collection: issues, challenges, and

approaches. Communications Surveys & Tutorials, IEEE, 13(4), 673-687.

Wang, S. J., Middleton, B., Prosser, L. A., Bardon, C. G., Spurr, C. D., Carchidi, P. J., ... & Kuperman, G.

J. (2003). A cost-benefit analysis of electronic medical records in primary care. The American

journal of medicine, 114(5), 397-403.

Wiesner, M., & Pfeifer, D. (2014). Health recommender systems: concepts, requirements, technical

basics and challenges. International journal of environmental research and public

health, 11(3), 2580-2607.

Zhang, Y., Gao, Q., Gao, L., & Wang, C. (2012). iMapReduce: A distributed computing framework for iterative computation. Journal of Grid Computing, 10(1), 47-68.

Zhang, L., Wu, C., Li, Z., Guo, C., Chen, M., & Lau, F. (2013). Moving big data to the cloud: an online

cost-minimizing approach. Selected Areas in Communications, IEEE Journal on, 31(12), 2710-

2721.

Zhou, Y., Cheng, H., & Yu, J. X. (2009). Graph clustering based on structural/attribute

similarities. Proceedings of the VLDB Endowment, 2(1), 718-729.

Yan, C., Yang, X., Yu, Z., Li, M., & Li, X. (2012, June). Incmr: Incremental data processing based on

mapreduce. In Cloud Computing (CLOUD), 2012 IEEE 5th International Conference on (pp.

534-541). IEEE.


IC1014

A Collaborative Tool for MPhil/PhD Student Dissertation Workflow

Bigani Sehurutshi, Oduronke T. Eyitayo

Department of Computer Science University of Botswana Gaborone, Botswana

[email protected]; [email protected]

ABSTRACT

Over the years, experience at the University of Botswana (UB) has shown that workflow is a major problem: current workflow methods do not provide adequate support for MPhil/PhD students' dissertations at the University, and there are many bottlenecks in the current system. There is a need to mitigate the current limitations and bottlenecks by improving operational efficiency and effectiveness through better, more efficient redesigned workflows. In this paper, we propose a model to improve the process by firstly analysing and assessing the current processes, then identifying opportunities for improvement and potential solutions to the challenges experienced by users when doing their projects, and lastly simulating improved workflow scenarios and evaluating the solution. An evaluation was carried out to find out whether the new system was beneficial. The testing revealed that the prototype student research management system was regarded as easy to use and very useful, which makes the prototype a clear improvement on the current manual system. Benefits attained from the workflow improvements include increased student satisfaction, speedy throughput with efficient workflows, and maximised asset utilization by removing constraints related to people and process inefficiencies.

Key words: Workflow Reengineering, Process Modelling, Usability Testing, Prototype Design, Processmaker

1 INTRODUCTION

The University of Botswana (UB) has a student enrolment of about sixteen thousand, distributed amongst seven faculties: Business, Education, Engineering and Technology, Humanities, Science, Health Sciences, and Social Sciences. Each faculty is composed of departments, which offer programs from diploma and bachelor's degrees through to master's and doctoral degrees. In the 2015/2016 academic session, the University has 1704 Masters/MPhil students and 96 PhD students.

A student dissertation in our context refers to research in which scholars extend their knowledge to make contributions to their respective areas. Research projects are made up of hundreds of processes. Yan et al. (2012) emphasised the need to improve the quality of thesis supervision and instruction. Apart from the supervisory approach, one major problem that has not received much attention is the workflow of a student’s dissertation. Although many academic staff have invented their own paper-based guidelines and pro-forma documents, based on their experience and best practices, to ease and control the supervision process between all involved parties, they have not been supported by a central online collaborative system that can help them


Proceedings of the 1st International Conference on the Internet, Cyber Security, and Information Systems (ICICIS), Gaborone, 18-20 May 2016

Copyright © Department of Computer Science, University of Botswana, 2016 23

to easily monitor and control the whole workflow (from identifying a project idea to final assessment) and support smooth data handover between all involved parties. The current research processes are inefficient.

One problem encountered over the years is that students do not graduate in the year they were supposed to because there is no follow-up on reports from internal and external examiners. There is no clear way of tracking progress between the examiners and the School of Graduate Studies, and reports dispatched through a courier sometimes never reach the examiner. A single system that monitors workflow can therefore help with some of the problems identified above. Support for dissertations is currently handled manually. The main objective of this work is to study the current processes and to model a better workflow and a prototype for the whole process of student project management, from the registration of topics to the final submission of the dissertation.

The specific objectives of the research were:

- To assess the current state of operations for student research project management in the University;
- To model a more efficient workflow; and
- To design, develop and evaluate a prototype system based on the workflow.

2 LITERATURE REVIEW

2.1 Modelling a Workflow

To study and understand processes, one has to construct models using a particular modelling technique. It is important to identify the use or purpose of a model: in order to choose the right technique, the modeller must know the purpose of the model, since different techniques suit different purposes. There are many process modelling techniques; the most widely used are flowcharts, data flow diagrams, Petri nets and workflows.

A workflow is the automation of a business process, in whole or in part, during which documents, information and tasks are passed from one participant to another; participants may be people or automated processes (WFMC Documentation, 1996). A process is a number of tasks that need to be carried out, together with a set of conditions that define the order of the tasks. Workflow management involves managing the flow of work such that the work is done at the right time by the proper persons. Workflow management systems aim to help business goals to be achieved with high efficiency by sequencing work activities and invoking the appropriate human or information resources associated with those activities (WFMC Documentation, 1996). They also ensure the integration of people and programs.
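The definition above (a set of tasks plus conditions that define their order, executed by people or automated processes) can be sketched as a small data structure. This is an illustrative sketch only, not part of any workflow product; the task names and participants are hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    name: str
    participant: str  # a person or an automated process
    done: bool = False

@dataclass
class Workflow:
    tasks: list
    # conditions defining the order: task name -> tasks that must finish first
    preconditions: dict = field(default_factory=dict)

    def ready(self):
        """Names of tasks not yet done whose preconditions are all satisfied."""
        finished = {t.name for t in self.tasks if t.done}
        return [t.name for t in self.tasks
                if not t.done
                and all(p in finished for p in self.preconditions.get(t.name, []))]

    def complete(self, name):
        for t in self.tasks:
            if t.name == name:
                t.done = True

# Hypothetical fragment of a dissertation workflow
wf = Workflow(
    tasks=[Task("submit topic", "student"),
           Task("approve topic", "supervisor"),
           Task("register topic", "coordinator")],
    preconditions={"approve topic": ["submit topic"],
                   "register topic": ["approve topic"]},
)
print(wf.ready())            # ['submit topic']
wf.complete("submit topic")
print(wf.ready())            # ['approve topic']
```

Calling `ready()` repeatedly and completing the tasks it returns executes the workflow in an order that respects the conditions.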

2.2 Theses Management Systems

Edinburgh Napier University (Romdhani et al., 2011) proposed an integrated and collaborative online supervision system for final-year and dissertation projects. The idea was initiated in order to provide high-quality supervisory processes and an effective supervisory relationship. Romdhani et al. (2011) suggested that the supervisory process needs to be supported by a central electronic system to record, monitor and revisit the supervision process and to enhance the student’s learning. They further said that, from the student’s perspective, having in place a unique electronic supervision


system alongside the traditional face-to-face and paper-based supervision methods can ensure guaranteed assessment and the smooth transfer of data and reports between all parties. The system minimised administration overheads and gave better control of project progression and monitoring (Romdhani et al., 2011).

In a study by Yan et al. (2012), a web-based system was designed to support the master’s degree thesis research process and knowledge sharing. The study identified the main steps of the research process and then presented an instructional model based on an analysis of practical thesis research workflow. An audit of one hundred Chinese universities found that universities differ in thesis time management and process organisation. The study also highlighted that most theses follow the generic steps of topic selection, thesis writing, oral examination and evaluation of excellent theses. The authors proposed a process in which thesis research combines problem-based learning with a thesis management process (Yan et al., 2012). A web-based supporting system called THEOL was designed according to the instructional model for the master’s degree thesis. The system features three key modules: research process support, research group management and knowledge sharing, with functions that support the whole thesis research process, multi-supervision from teachers, and rich resource sharing throughout the process.

2.3 Methodology

The study is composed of five phases. The first phase modelled the current manual system of managing research in faculties at the University of Botswana. The second phase modelled the manual system, from the user requirements gathered in phase one, as workflows; flowcharts were used to model the steps. The third phase improved the workflows to eliminate the human-dependent processes that introduced inefficiency. The purpose of the fourth phase was to use the formulated workflows to develop a prototype research management system; ProcessMaker software was used to design the prototype. The final phase evaluated the prototype’s interface design through heuristic evaluation, perceived ease of use and perceived usefulness.

3 CURRENT STATE OF OPERATIONS FOR STUDENT RESEARCH PROJECT MANAGEMENT

The major processes involved in the MPhil and PhD programmes are admission, proposal defence, submission of the title and abstract of the thesis, submission of the thesis for examination, entry into the examination, appointment of examiners and the board of examiners, the oral examination, and results.

With all the processes executed manually, there is considerable inefficiency throughout. Delays are caused by processes that depend on human beings who do not follow procedures. Most processes are controlled by the coordinator; automating the current processes as they stand would mean the coordinator still controls them, making the system only semi-automated. Due to a lack of monitoring and communication, scheduled tasks are postponed when students do not show up to see their supervisors. Some do not even submit their milestones, so it is difficult to know the status of their projects. Other stakeholders may forget deadlines, or forget that the dissertation is with them. Examiners are given a month to return their reports, but in some cases reports


arrive after five months. There is no clear communication between the examiners and the School of Graduate Studies. Reports dispatched through a courier may not even reach the examiner, and follow-up is sometimes not done. A proposed re-engineered workflow is shown in Figure 2.

3.1 Improved Workflows

ProcessMaker contains two main components: a design environment and a run-time engine. The design environment includes tools to map processes, define business rules, create dynamic forms, and add input and output documents. A web-based application was created, so a client uses a browser to access services from the server. This approach relieves the developer of the responsibility of installing the application on every end user’s computer; it also means that changes to the application logic and database happen in one place (on the server) and do not affect end users’ machines.

For this project, the open-source ProcessMaker was customised to meet the needs of the student project management system. The system requires the processes to be created before their interfaces. In ProcessMaker, a process is a collection of tasks with inputs that create outputs of value to the students doing research and to end users within the University of Botswana.

Alongside the major processes, child processes were created to ease the pressure on them. It is recommended to break large processes into separate master and child processes, to reduce the complexity of the process map and to give sub-processes room to handle exceptional situations and activities; the functionality of a useful process can then be hooked into another process. The sub-processes include topic submission (under the topic registration process), meeting reports (under project writing), progress report, and oral examination. Sub-processes are divided into synchronous and asynchronous.

A synchronous sub-process pauses the master process at the point where the sub-process is invoked, and the master resumes where it stopped once the sub-process completes. An asynchronous sub-process does not pause the master process; the two have no dependency on each other. All the sub-processes in this system are synchronous.
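The difference between the two kinds of sub-process can be illustrated with a small sketch (plain Python, not ProcessMaker code; the sub-process names are hypothetical):

```python
import threading

log = []

def sub_process(name):
    # stands in for a child process such as "progress report"
    log.append(f"sub-process {name} finished")

def run_synchronous():
    # synchronous: the master pauses at the point of invocation
    # and resumes only after the child completes
    log.append("master: waiting for sub-process")
    sub_process("progress report")   # a plain call blocks the master
    log.append("master: resumed")

def run_asynchronous():
    # asynchronous: the child runs alongside the master,
    # with no dependency between the two
    child = threading.Thread(target=sub_process, args=("meeting report",))
    log.append("master: continuing without waiting")
    child.start()
    child.join()   # joined here only so the demo finishes cleanly

run_synchronous()
run_asynchronous()
print(log)
```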

Figures 3, 4 and 5 show a few of the processes and child processes.


Figure 1: PhD Dissertation current workflow

Figure 2: PhD re-engineered workflow


Table 1 shows a comparison of the original workflow and the improved system.

Table 1: Comparison of the original workflow and the improved system

Features            | Current system                  | Improved system
Prefill of data     | Not available                   | Data filled using forms
Document storage    | Files, hard copies              | Electronic databases
Document sharing    | Not available                   | Electronic sharing, enhanced retrieval of documents
Reminders to users  | Sent by coordinators            | Emails in the system remind users of pending tasks
Assignment of tasks | Coordinator assigns users tasks | Electronic assignments
Task deadlines      | Done by the administrators      | Processes have a validity period
Reports             | No reporting                    | Done by management staff members

Figure 3: Progress Report sub process

Figure 4: Student Examination Workflow


Figure 5: Oral Examination workflow

4 PROTOTYPE DEVELOPMENT

The various processes were then designed. Figures 6, 7, 8 and 9 show some examples of the forms used in the development. The dissertation examination starts with students indicating that they are ready to submit: they fill in the date and make a request to the coordinator, as shown in Figure 6. The coordinator then opens the submission for the students who made a request, and each student uploads and sends the work as specified in the form shown in Figure 7.

Other processes include:

- The student’s submission is routed to the supervisor, who indicates whether he or she approves the submission and adds any comments for the coordinator.
- The coordinator sends the submission to the School of Graduate Studies (SGS) when satisfied with it.

Figure 6: Student Examination Request Form


Figure 7: Student Thesis submission Form

- The SGS accepts the submission and records the number of hard copies received from the student.
- The dissertations are logged as they are sent to the examiners, together with the dates on which the reports are expected.
- The report can also be submitted online, as shown in Figure 8.
- When a report arrives, the SGS notifies the coordinator of the report from the examiners and indicates the duration.
- The coordinator prepares a brief report for the student, omitting confidential information, and informs the student that the reports are ready for collection, as shown in Figure 9.

Figure 8: Examiner’s report form


- The student receives the report and sends the corrected version back to the supervisor.
- The supervisor indicates the approval status of the corrected version, confirming that all the necessary corrections have been taken into account; this is routed through the internal examiner.
- The internal examiner indicates whether he or she is satisfied with the corrections and records the student’s final result. The SGS then notifies the coordinator and the student of the outcome, and the post-examination submissions are made.

Figure 9: Detailed Examiner’s report

5 PROTOTYPE EVALUATIONS

Three types of evaluation were used to test the prototype system: usability and heuristic evaluation, perceived ease of use, and perceived usefulness. A system can only be said to be effective and efficient if it meets usability criteria for specific types of users carrying out specific tasks (Agarwal, 2002). Usability is associated with positive effects, including error reduction, enhanced accuracy and positive user attitudes (Agarwal, 2002); a system that passes usability testing can therefore be considered effective and efficient. In the last phase of the study, the prototype system was evaluated using heuristic evaluation and end-user evaluation. Heuristic evaluation is considered a practical, inexpensive method of identifying usability problems and assisting in the refinement of system design (Laurie et al., 2002). Thirteen usability evaluators from the Department of Computer Science and the Department of Library and Information Studies evaluated the prototype using heuristic evaluation. Eight usability factors were applicable to the study (Table 2). Only two usability factors, user control and freedom, and consistency and standards, violated the heuristics by attracting more negative than positive responses, but their overall severity scores fell between “no usability problem” and “minor usability problem”. The problems included complex wording, no cancel buttons, no default values in the data fields, and no undo buttons, among others. The overall severity ratings ranged across “no usability problem”, “cosmetic problem” and “minor usability problem”, which suggests that the prototype was very usable and hence efficient and effective. Most of the problems raised in the comments were resolved.


Table 2: Summary of heuristic evaluation results

Usability factor                                       | Positive (Y) | Negative (N) | Not applicable (NA)
Visibility of the system status                        | 64 | 1  | 0
Match between system and real world                    | 31 | 8  | 0
User control and freedom                               | 25 | 38 | 3
Consistency and standards                              | 23 | 39 | 5
Error prevention                                       | 36 | 16 | 0
Recognition and recall                                 | 54 | 15 | 9
Aesthetic and minimal design                           | 26 | 13 | 1
Help users recognise, diagnose and recover from errors | 21 | 5  | 0
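The counts in Table 2 can be summarised mechanically. The sketch below recomputes, from the tabulated responses, which heuristics drew more negative than positive responses:

```python
# (positive, negative, not-applicable) counts copied from Table 2
results = {
    "Visibility of the system status": (64, 1, 0),
    "Match between system and real world": (31, 8, 0),
    "User control and freedom": (25, 38, 3),
    "Consistency and standards": (23, 39, 5),
    "Error prevention": (36, 16, 0),
    "Recognition and recall": (54, 15, 9),
    "Aesthetic and minimal design": (26, 13, 1),
    "Help users recognise, diagnose and recover from errors": (21, 5, 0),
}

# share of positive answers among the applicable (Y/N) responses
positive_rate = {h: pos / (pos + neg) for h, (pos, neg, _) in results.items()}

# heuristics violated, i.e. scoring more negative than positive responses
violated = [h for h, (pos, neg, _) in results.items() if neg > pos]
print(violated)   # ['User control and freedom', 'Consistency and standards']
```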

The problems listed by the evaluators are shown in Table 3. The evaluators also recommended the following:

- Specifying the labels and button names
- Titles for pop-up messages
- Training users

Six questions on the perceived ease of use of the prototype were added as part of the evaluation. The same thirteen evaluators who did the usability and heuristic evaluation responded to the six questions; their responses, which ranged between “strongly agree” and “strongly disagree”, are summarised in Table 4.

Table 3: Usability problems and design solutions

Usability problem                                                              | Design solution
Ambiguous words such as “area of expertise”, “cluster”                         | Combo-box used to list the areas
Difference between userID and StudentId unclear                                | userID changed to username
Date format not clear in date field                                            | Date format fixed
No cancel buttons on forms, only submit buttons                                | Cancel buttons created
No clear title for the form where the coordinator receives student submissions | Title created
No undo and redo buttons                                                       | Solution not implemented
No option of using the keyboard instead of the mouse                           | Solution not implemented
No default values for data fields                                              | Default values added
No dots used to indicate length                                                | Solution not implemented

In addition to the heuristic evaluation, the evaluators were given six questions on the perceived ease of use and perceived usefulness of the prototype. On the whole, the evaluators perceived the prototype system as easy to use and as a useful tool for the management of research tasks.



Table 4: Percentage summary of perceived ease of use

Statement                                                   | SA   | A    | N   | DA | SDA | Total % (N)
I find the system easy to use.                              | 46.2 | 46.2 | 7.7 | 0  | 0   | 100 (13)
Learning to operate the system is easy for me.              | 15.4 | 76.9 | 7.7 | 0  | 0   | 100 (13)
I find it easy for the system to do what I want it to do.   | 23.1 | 46.2 | 31  | 0  | 0   | 100 (13)
The system is flexible to interact with.                    | 23.1 | 53.8 | 23  | 0  | 0   | 100 (13)
I can easily remember how to perform tasks.                 | 23.1 | 61.5 | 15  | 0  | 0   | 100 (13)
My interaction with the system is clear and understandable. | 15.4 | 46.2 | 39  | 0  | 0   | 100 (13)

SA – Strongly Agree; A – Agree; N – Neutral; DA – Disagree; SDA – Strongly Disagree

It can be seen from Table 4 that users found the system easy to use and flexible, and found it easy to remember how to perform tasks: agreement on these items ranged between 77% and 92%. However, lower agreement, between 61% and 69%, was obtained for the system doing what users want and for clarity of interaction. It is likely that these ratings reflect the usability problems, which have since been attended to.
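The agreement figures quoted above are the sums of the “strongly agree” and “agree” shares per item; they can be recomputed from Table 4 (the short item labels below are paraphrases of the questionnaire statements):

```python
# (SA, A, N, DA, SDA) percentages copied from Table 4
ease_of_use = {
    "easy to use": (46.2, 46.2, 7.7, 0, 0),
    "easy to learn": (15.4, 76.9, 7.7, 0, 0),
    "does what I want": (23.1, 46.2, 31, 0, 0),
    "flexible to interact with": (23.1, 53.8, 23, 0, 0),
    "easy to remember tasks": (23.1, 61.5, 15, 0, 0),
    "clear and understandable": (15.4, 46.2, 39, 0, 0),
}

# agreement = strongly agree + agree, rounded to one decimal place
agreement = {item: round(sa + a, 1) for item, (sa, a, *_) in ease_of_use.items()}
print(agreement)
# e.g. 92.4 for "easy to use" but only 61.6 for "clear and understandable"
```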

5.1 Perceived Potential Usefulness

Perceived potential usefulness was measured using six items on a 5-point scale ranging from strongly disagree to strongly agree (1 = strongly disagree, 2 = disagree, 3 = neutral, 4 = agree, 5 = strongly agree). Twenty participants from different departments completed the questionnaires after being walked through the online prototype. Of the 20 participants, 85% were from the Faculty of Science and 15% from other faculties; within the Faculty of Science, 71% of the participants were from the Department of Computer Science. Of the participants, 20% were graduate students and 75% were undergraduates. Sampling was based mainly on convenience: those who were willing and available to evaluate the system. As shown in Table 5, 80% or more of the evaluators responded with either “agree” or “strongly agree” to every item related to perceived usefulness. Overall, all the evaluators felt the system would be useful; none disagreed or strongly disagreed. In terms of reducing delays, all participants agreed, which supports the claim that the system will indeed reduce delays and improve efficiency.


Table 5: Percentage summary of perceived potential usefulness

Statement                                                                           | SA | A  | N  | DA | SDA | Total % (N)
The system would allow me to complete my tasks more quickly                         | 25 | 55 | 20 | 0  | 0   | 100 (20)
Using the system would increase the effectiveness of performing tasks               | 40 | 45 | 15 | 0  | 0   | 100 (20)
Using the system would give me more time for issues other than administrative tasks | 55 | 35 | 10 | 0  | 0   | 100 (20)
Using the system would give me more visibility over my tasks                        | 45 | 40 | 15 | 0  | 0   | 100 (20)
Using the system would reduce delays for the same amount of effort                  | 45 | 55 | 0  | 0  | 0   | 100 (20)
I would find the system useful in the process of my research work                   | 40 | 45 | 15 | 0  | 0   | 100 (20)

Participants perceived the prototype system as very useful, and the prototype appeared to meet participants’ need for a research project management system. The evaluators’ narrative comments also support the potential usefulness of the system.

6 CONCLUSION

Student research is a capstone of the study process at the University of Botswana. This study examined the current setup and proposed a better workflow system to improve the process, which led to the development of a prototype student project management system for the University of Botswana. Five phases were followed in developing the prototype. The first phase gathered information about the current state of student research projects. The second phase used the information gathered in the first phase to model the current processes. The third phase improved the processes designed in the second phase by providing a new workflow. The last two phases were prototype development and evaluation; their results showed that the concept was viable. As a whole, the work demonstrated a viable research project management system that met the users’ expectations. The prototype will be useful for the effective monitoring, supervision and management of students’ research projects; in future developments of the institution, student research workflow should therefore be incorporated into its systems.

There are several causes of delay, both human and systemic, but this research focuses mainly on those caused by administrative inefficiencies, which can readily be addressed with proper process flows and reminders.


REFERENCES

Abdelgader, F. M., Dawood, O. O., & Mustafa, M. M. (2013). Comparison of the workflow management systems. The International Arab Conference on Information Technology (pp. 1-5). Khartoum: theIRED.

Abiddin, Z. N., Ismail, A., & Ismail, A. (2011). Effective supervisory approach in enhancing postgraduate research studies. International Journal of Humanities and Social Sciences, 1(2), 206-217.

Agarwal, R., & Venkatesh, V. (2002). Assessing a firm's web presence: A heuristic evaluation procedure for the measurement of usability. Information Systems Research, 13(2), 168-186.

Aguilar-Savén, R. S. (2004). Business process modelling: Review and framework. International Journal of Production Economics, 90, 129-149.

Bakar, M. A., Jailani, N., Shukur, Z., & Yarim, N. F. (2011). Final year supervision management system as a tool for monitoring computer science projects. Procedia Social and Behavioral Sciences, 273-281.

Davis, F. D. (1989). Perceived usefulness, perceived ease of use, and user acceptance of information technology. MIS Quarterly, 13(3), 319-340.

Laurie, K. R. (2002). Structured heuristic evaluation of online documentation. IEEE Professional Communication Society.

Nielsen, J. (1994). Usability Inspection Methods. New York: John Wiley & Sons.

Romdhani, I., Tawse, M., & Habibullah, S. (2011). Student project performance management system for effective final year and dissertation projects supervision. London International Conference on Education, London.

Sommerville, I. (2011). Software Engineering. Boston: Addison-Wesley.

Tahir, M. I., Ghani, A. N., Atek, E. S., & Manaf, Z. (2012). Effective supervision from research students' perspective. International Journal of Education, 4(2).

WFMC Documentation. (1996). The Workflow Reference Model. Retrieved January 2014, from http://www.aiai.ed.ac.uk/project/wfmc/ARCHIVE/DOCS/glossary/glossary.html

Yan, Y., Han, X., Yang, Y., & Zhou, Q. (2012). On the design of an advanced web-based system for supporting thesis research process and knowledge sharing. Journal of Educational Technology Development and Exchange, 5(2), 111-124.


IC1015

Evaluating the Effect of Privacy Preserving Record Linkage on Student

Exam Record Data Matching

George Anderson, Tsholofetso Taukobong, Audrey Masizana

Department of Computer Science University of Botswana Gaborone, Botswana

[email protected]; [email protected]; [email protected]

ABSTRACT

Data matching identifies which record pairs from two different databases represent the same entities. The data matching process improves data quality, enriches data and allows analysis that would otherwise be impossible from one individual database. While there is a need for data matching, issues of preserving privacy and maintaining confidentiality need to be adequately addressed. Numerous research studies have been carried out with different approaches proposed to address these issues, resulting in a relatively new research area termed Privacy Preserving Record Linkage (PPRL). The broad approach is to transform record data using some sort of one-way function into an encoded representation, which is then used for matching. In this paper, we study the impact of such privacy preserving data matching on the quality of data matching, as determined by standard metrics, when applied to a university data set comprising student exam records and student registration records. Our results demonstrate that the quality of data matching does not suffer, while the benefits of privacy are maintained.

Key words: Privacy Preserving Record Linkage, Data Matching, Bloom Filters, Computational University Administration.

1 INTRODUCTION

Record linkage (also known as data matching) involves identifying records that correspond to the same entities across several databases (Christen, 2010). Entities could be patients, customers, people being counted in a census, and so on. The process usually involves linking records using a set of common fields. What makes data matching a challenging field of its own is that the records in question might not have usable unique identifiers (matching primary keys), due to errors. Other fields therefore have to be used, such as surnames; however, surnames might contain errors, such as typographical mistakes, and some fields, such as a post office box number, might appear in one database but not in the other. Data matching has been used to solve problems in a variety of domains (Christen, 2010): in national censuses, where data quality can be improved by matching records across censuses carried out at different points in time, and where a richer source of data can be built by integrating census databases with other databases, such as crime; in health, where records across hospital, clinic, ambulance and mortuary databases can be matched in order to give a richer understanding of a patient's health over her lifetime; in national security, where, for example, terrorists have to be identified by taking advantage of the online and financial records they leave behind as they carry out their activities; and in bibliographic databases, such as Google Scholar, which must match bibliographic records in different formats so that papers are attributed to the correct authors and information such as citation counts is accurate.


Proceedings of the 1st International Conference on the Internet, Cyber Security, and Information Systems (ICICIS), Gaborone, 18-20 May 2016

Copyright © Department of Computer Science, University of Botswana, 2016 36

Standard data matching techniques assume that all data required for matching is available in a

readable, unencoded/unencrypted, form. When data matching is done internally in an organization,

and all data belongs to that organization, employees involved are made aware of all regulations and

policies concerning handling of data. However, when data from two or more organizations has to be

matched, or matching is outsourced, then issues of privacy become a concern. A good example is

when health records from two hospitals are matched by a third party: if data privacy is

not well addressed, it may emerge that a well-known politician has a terminal medical condition, and this

information might be made public against proper legal procedure.

In order to address such issues, Privacy Preserving Record Linkage (also known as Privacy Preserving

Data Matching) works by encoding or encrypting databases, before data matching is carried out,

either by one of the two participating organizations, or by a third party (Schnell et al., 2009; Christen,

2010).

This research contributes to this field by applying a Privacy Preserving Record Linkage technique to

match student registration against examination records and evaluating its impact.

Research has been carried out in the University of Botswana to evaluate the potential for data

matching in processing student exam results (Anderson et al., 2013). This involved matching student

entities between student registration lists, which do not contain errors, and student exam records

which contain errors introduced by students when completing their exam forms. Various metrics,

such as precision, recall and pairs quality, were used to evaluate the approach. Results of the study

demonstrated great potential for application of data matching in this domain.

The current study takes the next step, to evaluate the impact of privacy preserving data matching on

the same problem. Incorporating privacy preserving data matching into our data matching system

would enable the data matching exercise to be overseen by people who are not at the same level of

responsibility as lecturers, such as teaching assistants or student assistants. In this paper, we

describe our study, including the experiments conducted, and our results.

The rest of this paper is organized as follows. Section 2 gives a background on data matching. Section

3 discusses our problem environment. Section 4 describes the privacy-preserving approach we used.

Section 5 describes our experiments and discusses our results. Section 6 concludes.

2 BACKGROUND

While there is need for data matching from separate databases in order to improve data quality,

enrich data and allow analysis that would otherwise be impossible from one individual database,

issues of preserving privacy and maintaining confidentiality need to be adequately addressed.

Numerous studies have been carried out, with different approaches proposed to address these issues,

resulting in the research area termed Privacy Preserving Record Linkage (Christen, 2010).

Privacy Preserving Record Linkage (PPRL) endeavors to present a way in which two or more different

organizations can perform record linkage without revealing any other information to either party

besides the matched records. For example, two businesses identifying if they have common

customers without revealing customer identity or any other confidential knowledge coming from the

matched data to either party (Christen, 2010).

Figure 1 illustrates the record linkage process under privacy preserving context. With data

preprocessing done independently by the two database owners, it is very important that they agree


about the approaches to use as well as the common attributes to be used for linkage.

Figure 1. Record linkage process under privacy preserving context.

Adapted from (Vatsalan et al., 2013a).

Vatsalan et al. (2013b) describe a taxonomy of PPRL techniques, given in Figure 2, which categorizes

them into five main areas. These are described in detail below.

Privacy aspects

i. Number of parties: This can be a two-party protocol involving just the two database owners,

or a three-party protocol in which a third party, known as a linkage unit, performs the linkage.

ii. Adversary model: Two adversary models generally used in cryptography are employed

here. The first is Honest-But-Curious (HBC) behaviour, where the two database owners

follow the protocol but also try to learn about each other's data. The

second model is Malicious behaviour, whereby the database owners may deviate from the protocol

arbitrarily.

Privacy techniques: A number of privacy techniques are used in PPRL, such as Secure

Multi-Party Computation (SMC), phonetic encoding, and Bloom filters.

Linkage techniques: The techniques that a linkage unit uses during the different steps of the PPRL

process will determine computation requirements and the quality of matched data results.

i. Indexing: There is need to apply indexing or blocking algorithms to reduce

computational complexity during comparisons (Vatsalan et al., 2013a; Anderson et al.,

2013).


ii. Comparisons and Matching: Matching can be Exact, which yields a similarity of

1 for an exact match and 0 for a non-match. Alternatively,

matching can be Approximate, which takes into consideration partial similarities with

values between 0 and 1. Various approximate string comparison functions

have been used, e.g. edit distance and q-grams (common substrings). Durham et al.

(2011) detail a comparison of the different PPRL string comparison techniques.

iii. Classification: Many different techniques are used for classification, including threshold-

based, rule-based and machine learning based approaches.

Figure 2. Taxonomy of PPRL techniques adapted from (Vatsalan et al., 2013b).

Theoretical analysis: This looks at estimated measures for aspects such as scalability to large

databases, quality of linkage results (accuracy, precision, recall, f-measure), and privacy

vulnerabilities of the various PPRL methods employed (Vatsalan et al., 2013a; Vatsalan et al., 2013b).

i. Scalability to large databases: This is measured in terms of the computational effort and

communication costs of the overall PPRL process, normally using 'big O'

notation.

ii. Quality of linkage: defined in terms of the tolerance of the linkage technique to data

errors and discrepancies, whether the matching is field-based or record-based, as well

as the data types involved.

iii. Privacy vulnerabilities: The vulnerability of a PPRL technique is

measured by looking at the privacy attacks it is susceptible to. For instance, Bloom filter

based techniques have been shown to be at risk of cryptanalysis attacks, whereby an

intruder can map the individual encoded values back to their original values

(Niedermeyer et al., 2014). Other privacy attacks include dictionary attacks, frequency attacks,

and composition attacks, discussed in detail in (Vatsalan et al., 2015).

Evaluation: Evaluation is based on the same three aspects of scalability, linkage quality, and privacy.

Scalability measures based on platform and infrastructure include runtime, memory space and

communication size; while those based on the number of generated record pairs are reduction ratio,

pair completeness and pair quality (Vatsalan et al., 2013a). These measures are useful in evaluating

the performance of the linkage algorithm in terms of efficiency and effectiveness. Quality of linkage


is generally evaluated in terms of accuracy measures such as precision, recall and F-measure, as record pair

classification is a highly imbalanced classification problem.

Practical aspects: This looks at three aspects: the implementation techniques used to

prototype and implement the solution, the datasets, and the application area. With privacy issues making it

difficult to acquire real data containing personal information, synthetically produced datasets

are usually used (Randall et al., 2014). The application area aspect considers whether certain

techniques target specific application areas or are developed generally without any

target area.

Our work focuses on privacy techniques (q-gram and hash function encoding), linkage techniques

(comparison and matching), theoretical analysis (we consider the quality of linkage using precision

and recall), and evaluation.

Various studies (Vatsalan et al., 2014; Randall et al., 2014; Vatsalan & Christen, 2014) have evaluated

the scalability, linkage quality and privacy issues of Bloom filters in PPRL by applying the technique to

large real-world datasets. Randall et al. (2014) used datasets of hospital admissions data

from two Australian hospitals, with about 7 million records from one hospital linked with 20 million

from the other. The researchers compared unencrypted linkage using bigrams on personal identifiers

against encrypted linkage using trigrams on Bloom filters. The results showed high linkage quality

for both, with almost no difference in quality between the two linkages (encrypted and unencrypted

using Bloom filters) demonstrating that it is possible to get high linkage quality even under privacy

preserving context.

Evaluating any PPRL technique normally entails assessing its performance in addressing the three

main challenges of record linkage, that is scalability, linkage quality and privacy issues. Of the three,

privacy is the most difficult to evaluate. Privacy evaluation entails calculating the probability of an

attack; in other words, the risk of an adversary correctly identifying original human data using a

publicly available dataset such as a telephone directory (Vatsalan et al., 2013b; Vatsalan et al., 2014).

Other studies (Niedermeyer et al., 2014; Vatsalan et al., 2014) evaluated the privacy issues associated

with Bloom filter based linkage by simulating a cryptanalysis, which is the privacy attack the Bloom

filter based approach is most susceptible to (Kuzu et al., 2011). Through these privacy evaluations

and other performance measures, the method can be better understood and its security enhanced,

as it has been compared with other PPRL techniques and seen to give better results despite its

limitations (Vatsalan et al., 2014; Randall et al., 2014; Schnell et al., 2013; Vatsalan & Christen, 2014;

Karakasidis & Verykios, 2011).

Our work is different from these, because to the best of our knowledge, no one has evaluated PPRL

using Bloom filters on our data set.

3 DESCRIPTION OF ENVIRONMENT

The research data comes in the form of real world datasets. The datasets are from the University of

Botswana Computer Science ICT121 course offering (Computing Skills Fundamentals I) exam results

for the Faculty of Education group from 2007 to 2011. The exams are administered with special

answer forms that are filled by shading using HB pencils (Anderson et al., 2013). Students use the


answer forms to provide their answers as well as their student details such as student ID, surname,

initials, program of study etc. The answer sheets are then graded by scanning them through an OMR

(Optical Mark Recognition) scanner which, together with the scanner software, creates CSV (Comma

Separated Value) data files which may be used as they are or converted to spreadsheet files. These

data files normally contain errors, as some students shade their details incorrectly. Common

mistakes, such as swapping of student ID digits or leaving spaces, and how the scanner reacts to them,

are discussed in (Anderson et al., 2013). The exam data records are matched with student

registration records from University of Botswana Academic Student Administration System (ASAS), in

order to ensure every student gets the right mark.

These datasets contain confidential student details and are considered highly sensitive, hence the

need to apply a privacy preserving record linkage technique for their linkage. A total of 4116 student

records were used.

4 PRIVACY PRESERVING RECORD LINKAGE APPROACH

The privacy preserving approach we adopted for our experiments is to use Bloom filters to encode

our database records. We chose Bloom filters because they are known to work well for a wide

variety of data set types (Christen, 2010) and are easy to implement, therefore giving us a good

point of reference.

The Bloom filter, conceived by Burton Howard Bloom in 1970, is a data structure for efficiently checking set

membership (Niedermeyer et al., 2014; Bloom, 1970). It is a single array of l bits, initially all

set to zero. In order to store a specified set S = {s1, ..., sn} of elements in a

Bloom filter, k independent hash functions h1, ..., hk are defined such that each hash function maps

onto the domain between 0 and l-1, and all bits having indices hj(si) for 1 <= j <= k are set to 1. If a bit had

been set to 1 before, it retains the 1. Set membership of element x to set S is checked by mapping

element x with the same hash functions such that if the bit indices h1(x), ..., hk(x) in the Bloom filter

are all 1, then x is believed to be a member of S. However, there is a chance of a false positive,

which comes about when the 1-bits at indices h1(x), ..., hk(x) result from

different elements si. On the other hand, if at least one of the bits turns out to be 0, then x is definitely not

a member of the set S. Bloom filters can also be used to determine an approximate match between

two sets (Schnell et al., 2009; Niedermeyer et al., 2014).
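The insert and membership-test mechanism described above can be sketched in a few lines of Python (an illustrative sketch only: salted SHA-1 digests from the standard library stand in for the k independent hash functions, and the parameter values are arbitrary):

```python
import hashlib

def bit_indices(element, l, k):
    # Derive k bit indices in [0, l-1]; salted SHA-1 digests stand in for
    # the k independent hash functions h1, ..., hk of the text.
    return [int(hashlib.sha1(f"{j}:{element}".encode()).hexdigest(), 16) % l
            for j in range(k)]

def add(bloom, element, k):
    # Set the bits at indices h1(element), ..., hk(element); a bit that is
    # already 1 retains the 1.
    for i in bit_indices(element, len(bloom), k):
        bloom[i] = 1

def maybe_member(bloom, element, k):
    # All k bits set: element is possibly in the set (false positives occur
    # when the 1-bits result from different stored elements). Any bit
    # clear: element is definitely not in the set.
    return all(bloom[i] for i in bit_indices(element, len(bloom), k))

bloom = [0] * 100                  # l = 100 bits, initially all zero
for s in ["mor", "ora", "rap"]:    # store the set S
    add(bloom, s, k=3)

print(maybe_member(bloom, "mor", k=3))   # True
```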

An approach for privacy preserving record linkage using Bloom filters was proposed by

Schnell et al. (2009). Their approach is to split identifier strings into q-grams, that is sub-strings of

length q, use hash functions (for example MD5 or SHA-1) that only the two database owners know

to map the q-grams into the Bloom filters, and then send the Bloom filters to the linkage unit. Dice

coefficients are used to generate a similarity score between two records. These make use of the

number of 1-bits in the Bloom filters for comparison and matching (Schnell et al., 2009). With q = 3,

the trigram set of the name “morapedi” is “mor”, “ora”, “rap”, “ape”, “ped”, “edi”.
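The q-gram splitting can be sketched as a one-line helper (padding the string with leading and trailing blanks, used in some PPRL variants, is omitted here):

```python
def qgrams(s, q=3):
    # Overlapping substrings of length q.
    return [s[i:i + q] for i in range(len(s) - q + 1)]

print(qgrams("morapedi"))     # ['mor', 'ora', 'rap', 'ape', 'ped', 'edi']
print(qgrams("paula", q=2))   # ['pa', 'au', 'ul', 'la']
```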

A simple example is illustrated as follows. Two names (paula, paul), split into bigrams, are mapped to

two Bloom filters (F1, F2) of 14 bits each using two hash functions (k = 2). Let a be the number

of 1-bits in F1, b the number of 1-bits in F2, and h the number of 1-bits common to both

Bloom filters. Table 1 shows the hashing values for the two strings.


Table 1: Bloom filter hashing example

String 2-gram F1 hash bit number F2 hash bit number

paula pa 0 6

au 2 7

ul 6 11

la 8 13

paul pa 0 6

au 2 7

ul 6 11

Table 2: Resulting Bloom filters for the example in Table 1

paula 1 0 1 0 0 0 1 1 1 0 0 1 0 1

paul 1 0 1 0 0 0 1 1 0 0 0 1 0 0

For “paula”, a = 7; for “paul”, b = 5. The two strings have h = 5 common 1-bits between them. Hence

the Dice coefficient of F1 and F2 is 2h / (a + b) = (2 × 5) / (7 + 5) = 10/12,

giving 0.83 as the approximate similarity of the two strings.
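The computation can be reproduced directly from the bit patterns in Table 2; a minimal sketch:

```python
def dice(f1, f2):
    # Dice coefficient D = 2h / (a + b), where a and b are the numbers of
    # 1-bits in each filter and h is the number of 1-bits common to both.
    a, b = sum(f1), sum(f2)
    h = sum(x & y for x, y in zip(f1, f2))
    return 2 * h / (a + b)

paula = [1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1]   # row 1 of Table 2
paul  = [1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0]   # row 2 of Table 2
print(round(dice(paula, paul), 2))   # 0.83
```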

Research by Schnell et al. (2013) suggests using longer Bloom filters of about 500 to 1000 bits and

many more than just two hash functions for effective comparisons and more efficient secure

encodings. Variations of Bloom filter based PPRL have been used in a number of applications, such as

legal, health, and computer networking applications (Schnell et al., 2009;

Schnell, 2013; Niedermeyer et al., 2014).
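Putting the pieces together, the q-gram-to-Bloom-filter encoding described in this section can be sketched as follows (salted SHA-1 digests from the Python standard library stand in for the keyed hash functions; the 1000-bit length follows the recommendation above, and k = 10 is an arbitrary choice):

```python
import hashlib

def encode(field, m=1000, k=10, q=2):
    # Map each q-gram of a field into an m-bit Bloom filter using k hash
    # functions; only these encodings, never the raw strings, would be
    # sent to the linkage unit.
    bloom = [0] * m
    for i in range(len(field) - q + 1):
        gram = field[i:i + q]
        for seed in range(k):
            digest = hashlib.sha1(f"{seed}:{gram}".encode()).hexdigest()
            bloom[int(digest, 16) % m] = 1
    return bloom

f1, f2 = encode("paula"), encode("paul")
h = sum(x & y for x, y in zip(f1, f2))        # common 1-bits
print(round(2 * h / (sum(f1) + sum(f2)), 2))  # Dice similarity of encodings
```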

5 EXPERIMENTS AND RESULTS

Experiments were conducted using a 4116-record data set, each record representing a student’s

exam record. Another data set represented student registration records. For a scenario where

privacy preservation was not used, each record pair, one from each database, was compared using a

Levenshtein edit distance similarity score. We implemented our system using Python and the Python

Levenshtein library (Haapala, 2015) was used for the similarity score (the ratio function in the library

was used). The similarity score ranges from 0 to 1 (a real number). Two fields were used: ID number

and surname. The scores from the two corresponding fields were added. For each threshold value,

the resulting precision and recall were computed. The threshold varied from 0.0 to 1.98 in steps of

0.02, so 100 threshold values were used. This range arises because the

combined similarity score for the two fields ranges from 0.0 to 2.0.
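The baseline comparison can be sketched as follows (the study used the ratio function of the python-Levenshtein library; to keep this sketch self-contained, a plain normalized edit-distance similarity stands in, so exact scores differ slightly from the library's; the two records shown are hypothetical):

```python
def lev(s, t):
    # Classic Levenshtein edit distance via dynamic programming.
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (cs != ct)))    # substitution
        prev = cur
    return prev[-1]

def sim(s, t):
    # Normalized similarity in [0, 1].
    return 1.0 if not s and not t else 1 - lev(s, t) / max(len(s), len(t))

def combined_score(a, b):
    # Sum of per-field similarities for ID number and surname: range [0, 2].
    return sim(a["id"], b["id"]) + sim(a["surname"], b["surname"])

# hypothetical pair: the exam form carries one shading error in the surname
exam = {"id": "200912345", "surname": "morapedi"}
reg  = {"id": "200912345", "surname": "morapedl"}
print(combined_score(exam, reg))   # 1.875, a match at any threshold <= 1.86
```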

For privacy preservation, Bloom filters were used to encode the records in both databases. Bloom

filters had a structure of 1000 bits, since such a long length serves to reduce the number of false

positives, was shown to work well in the literature, and was used in experiments evaluating Bloom

filters by Schnell et al. (2009). To hash the strings into the Bloom filters, the mmh3 hash function in

the Python MurmurHash3 library (Appleby, 2016) was used. The numbers of hash functions used

were 10, 30, and 60, making for three experiments, in order to evaluate performance for a varying

number of hash functions, which, together with long Bloom filters, serves to reduce the number of

false positives. Each student record comprised two strings, each of which was broken up into 2-

grams, and each 2-gram was hashed into a 1000-bit Bloom filter using k MurmurHash3 hash functions. A

false positive (FP) is a record pairing which is identified as a match, but is actually a non-match. A


true positive (TP) is a record pairing which is identified as a match and is actually a match. A true

negative (TN) is a record pairing identified as a non-match and is actually a non-match. A false

negative (FN) is a record pairing identified as a non-match but is actually a match. Precision is the

fraction of record pairs identified as matches that are true positives, i.e. TP/(TP+FP). Recall is the

fraction of actual matches identified as true positives, i.e. TP/(TP+FN) (Manning et al., 2008). Precision

and recall are calculated for each threshold.
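The per-threshold evaluation can be sketched as follows (the scores and labels below are a toy illustration, not the study's data):

```python
def precision_recall(scores, labels, threshold):
    # scores: combined similarity per candidate record pair (range [0, 2]);
    # labels: True where the pair is an actual match.
    tp = sum(1 for s, m in zip(scores, labels) if s >= threshold and m)
    fp = sum(1 for s, m in zip(scores, labels) if s >= threshold and not m)
    fn = sum(1 for s, m in zip(scores, labels) if s < threshold and m)
    precision = tp / (tp + fp) if tp + fp else 1.0   # TP / (TP + FP)
    recall = tp / (tp + fn) if tp + fn else 1.0      # TP / (TP + FN)
    return precision, recall

scores = [1.95, 1.60, 1.20, 0.40]      # toy combined similarity scores
labels = [True, True, False, False]    # ground-truth match status
print(precision_recall(scores, labels, threshold=1.5))   # (1.0, 1.0)
print(precision_recall(scores, labels, threshold=1.0))   # (0.666..., 1.0)
```

Sweeping the threshold over its range and recording these two values at each step yields a precision-recall curve.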

Figures 3 to 5 show results of experiments for the various configurations. The red curve is the

precision-recall curve for the non-privacy preserving experiment and the blue curves are for the

privacy-preserving configuration using Bloom filters. We use visual inspection to

compare the two curves. This approach was used in the literature to evaluate Bloom filters

(Schnell et al., 2009). A visual inspection shows that the Bloom filter configuration performs almost

the same as the non-privacy preserving configuration. In the figures, m represents the Bloom filter

length (number of bits) while k represents the number of hash functions used for each q-gram.

Therefore there is negligible negative impact on the quality of record linkage.

6 CONCLUSION

There are laws and regulations guiding the use of data that contains sensitive information such as

student records. Only with the assurance of use of privacy preserving record linkage can institutions

freely allow the use of their student data for data matching. Privacy preserving record linkage on

student data can be used for research such as determining student preparedness for tertiary

education in general or for specific tertiary programs for secondary leaving students by linking their

first year tertiary results with their secondary results. Alternatively, linkage can be used to

determine the degree to which a student is likely to succeed in a specific course, for

example a programming course, by linking results from other courses taken. Such linkages can help in

understanding failure rates in certain courses or programs without exposure of details of the failing

students. Privacy preservation also helps mitigate bias as matched records can be evaluated without

knowledge of who the actual individual students are.

We have demonstrated that Privacy Preserving Data Matching has negligible effect on performance

of data matching using our data set by conducting extensive experiments and using a visual analysis.

Future work will involve a numerical analysis using a metric such as Area Under Precision Recall

Curve. Future work will also involve development of a protocol for privacy-preserving record linkage

in our context. This will detail how many data-handling entities are required, which ones do the

hashing, and which ones do the matching.


Figure 3: Precision-Recall Curves for Non-Privacy Preservation (Red) and Privacy Preservation

(Blue) With 1000 bit Bloom Filters and 30 hash functions.

Figure 4: Precision-Recall Curves for Non-Privacy Preservation (Red) and Privacy Preservation

(Blue) With 1000 bit Bloom Filters and 10 hash functions.

Figure 5: Precision-Recall Curves for Non-Privacy Preservation (Red) and Privacy Preservation

(Blue) With 1000 bit Bloom Filters and 60 hash functions.

ACKNOWLEDGMENTS

The authors would like to thank the anonymous reviewers for their useful comments.


REFERENCES

Anderson, G., Masizana, A.N., & Mpoeleng, D. (2013). An Exact and Inexact Approach for Saving

Time and Preventing Errors in Processing of Student Exam Results at the University of

Botswana. International Journal on Information Technology (IREIT), 1(3), 179-185.

Appleby, A. (2016). MurmurHash3. Retrieved from:

https://github.com/aappleby/smhasher/wiki/MurmurHash3.

Bloom, B. (1970). Space/time trade-offs in hash coding with allowable errors. Communications of the

ACM 13(7), 422–426.

Christen, P. (2012). Data Matching, Berlin: Springer-Verlag.

Durham, E., Xue, Y., Kantarcioglu, M., & Malin, B. (2011). Quantifying the correctness, computational

complexity, and security of privacy-preserving string comparators for record linkage.

Information Fusion, 13(4), 245-259.

Haapala, A. (2015). Python Levenshtein. Retrieved from: http://github.com/ztane/python-

Levenshtein.

Karakasidis, A., & Verykios, V.S. (2011) Secure blocking + secure matching = secure record linkage.

Journal of Computing Science and Engineering, 5(3), 223–35.

Kuzu, M., Kantarcioglu, M., Durham, E., & Malin, B. (2011, July). A constraint satisfaction

cryptanalysis of bloom filters in private record linkage. In Privacy Enhancing Technologies

(pp. 226-245). Springer Berlin Heidelberg.

Manning, C.D., Raghavan, P., & Schutze, H. (2008). An Introduction to Information Retrieval,

Cambridge University Press.

Niedermeyer, F., Steinmetzer, S., Kroll, M., & Schnell, R. (2014). Cryptanalysis of basic bloom filters

used for privacy preserving record linkage. Journal of Privacy and Confidentiality, 6(2), 3.

Randall, S. M., Ferrante, A. M., Boyd, J. H., Bauer, J. K., & Semmens, J. B. (2014). Privacy-preserving

record linkage on large real world datasets. Journal of Biomedical Informatics, 50, 205-212.


Schnell, R., Bachteler, T., & Reiher, J. (2009). Privacy-preserving record linkage using Bloom filters.

BMC Medical Informatics and Decision making, 9(1), 41.

Vatsalan, D., Christen, P., & Verykios, V. (2013) Tutorial on Techniques for Scalable Privacy

Preserving Record Linkage, Presented at the 22nd ACM International Conference on

Information and Knowledge Management (CIKM 2013), San Francisco, October 2013.

Retrieved from: https://cs.anu.edu.au/people/Peter.Christen/cikm2013pprl-tutorial/cikm-

2013-pprl-tutorial-slides.pdf

Vatsalan, D., Christen, P., & Verykios, V. S. (2013). A taxonomy of privacy-preserving record linkage

techniques. Information Systems, 38(6), 946-969.

Vatsalan, D., Christen, P., O'Keefe, C. M., & Verykios, V. S. (2014). An evaluation framework for

privacy-preserving record linkage. Journal of Privacy and Confidentiality, 6(1), 35-75.


Vatsalan, D., & Christen, P. (2014). Scalable privacy-preserving record linkage for multiple databases.

In Proceedings of the 23rd ACM International Conference on Conference on Information and

Knowledge Management (pp. 1795-1798). ACM.


IC1016

Ontological Perspectives in Information System, Information Security and

Computer Attack Incidents (CERTS/CIRTS)

Ezekiel Uzor Okike, Tshiamo Motshegwa, Molly Nkamogelang Kgobathe

Department of Computer Science Faculty of Science

University of Botswana Gaborone, Botswana

[email protected]; [email protected]; [email protected]

ABSTRACT

Ontological methodologies are used in almost every field of study, including philosophy, medicine, science and engineering based disciplines. This paper is motivated by the need to address pertinent issues and adopt necessary and useful ontology based research approaches in Information systems. The paper aims to discuss ontology by defining its uses, types, methodologies and applications, especially in Information Systems and Information security. The paper discusses techniques, applications and use of ontologies in computer science, multiagent systems and particularly in information systems from three perspectives, namely Information systems (IS) research methods, Formal Specification and Information Security. The paper concludes with the view that these three perspectives are still needed in the Information Systems (IS) research and education agenda. IS research methods do accommodate surveys, case studies, and experiments as in other disciplines; however, the researcher must appropriately demonstrate the need for and usefulness of the chosen research method in IS research. On the Information security side, the paper is of the view that the application of ontological approaches and models could assist in the development of information system security tools for sharing cyber-attack incident data and information.

Keywords: Ontologies, Information Systems, Information Security, Formal Specification, Multiagent Systems and CERTs/CIRTs

1 INTRODUCTION

Knowledge-Based Systems (KBS) are computer programs that reason using a knowledge base to

solve complex problems (Hayes-Roth, Waterman, & Lenat, 1983). The development of effective

Knowledge Management Systems (KMS) has become a critical issue in applied domains (Wu,

2005). As a result, the need to adopt ontological approaches in Artificial Intelligence,

Computer science and Information systems has been widely discussed (Chandrasekaran, 1999;

Pereira & Santos, 2009; Raskin, 2001). As explained by Chau (2007), ontology is concerned with

the detailed description of the architecture, the development and the implementation of the

systems prototype using both forward chaining and backward chaining during the inference process.

In order to realize the objective of semantic match for knowledge search, ontology may be divided

into information ontology and domain ontology. Enterprises with ontology-based knowledge

management applications focus on Knowledge Processes and Knowledge Meta Processes (Staab,

2003).


This paper is concerned with research methods in computing science, and especially in Information

systems. The paper also proposes an ontological semantic approach for sharing information among

Computer Emergency Response Teams (West-Brown, Stikvoort, Kossakowski, Killcrece, & Ruefle,

1998) by developing what we coin the CERTS Ontology or Computer Incident Response Teams (CIRTS) Ontology.

1.1 The Statement of the Problem

The main challenges in Computing and Information Systems research lie in defining a research problem and in selecting an appropriate research methodology. Upcoming researchers (especially graduate students) in the computing and Information Systems domains are often unable to apply ontological research methods and formal models because they lack a grasp of ontology both as a basis for research and as a formalism.

1.2 Study Objective

The objective of this paper is to examine ontological research methods and ontology applications in

computing, Information systems, and Information security and to propose an ontological semantic

approach for sharing information among Computer Emergency Response Teams (CERTS), also

referred to as Computer Security Incident Response Teams (CIRTS). This is done in order to

demonstrate the usefulness of ontological research methods, and formal ontological models in

Information systems and Information security.

1.3 Methodology

The approach adopted in this paper is a review of the literature on ontology, Information Systems and information security, undertaken to obtain the general perceptions of researchers with respect to ontological methods and perspectives in computing and Information Systems. Against this background we then propose an ontology for CERTS.

The rest of this paper is organised as follows. Section 2 presents a formal definition of ontology, together with its uses, types, methods and techniques. Section 3 discusses ontological applications with respect to the choice of research methods, Information Systems development, information security and multiagent systems. Section 4 proposes an ontology for the CERTS domain. Section 5 concludes with the view that ontological research methods enable the definition of research problems and the selection of appropriate research methods in Information Systems, including the use of surveys, case studies and experiments.

2 ONTOLOGIES

2.1 What is Ontology?

An ontology is a formal, explicit specification of a shared conceptualization, used to encourage standardization of the terms for representing knowledge about a domain (Kang Ye et al., 2009). Ontology describes the logical structure of a domain, its concepts and the relations between them (Silvonen, 2002). Ontology has also been referred to as a frank technology to represent knowledge (Abburu & Babu, 2013).
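As a toy illustration of these definitions (the concepts and predicates below are our own, not drawn from the cited works), such a specification of a domain's concepts and the relations between them can be written down as subject-predicate-object triples:

```python
# A minimal, hand-rolled ontology: concepts and the relations between them,
# stored as (subject, predicate, object) triples.
triples = {
    ("Student", "is_a", "Person"),
    ("Lecturer", "is_a", "Person"),
    ("Student", "enrolled_in", "Course"),
    ("Lecturer", "teaches", "Course"),
}

def related(subject, predicate):
    """Return every object linked to `subject` by `predicate`."""
    return {o for s, p, o in triples if s == subject and p == predicate}

print(related("Student", "is_a"))       # {'Person'}
print(related("Lecturer", "teaches"))   # {'Course'}
```

Knowledge-representation languages such as RDF and OWL standardize exactly this kind of triple-based description of a shared conceptualization.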

2.2 Uses of Ontology

Fensel et al. (2001) state that the rapid growth of online information on intranets and the

Web has led to information overload. There needs to be some automatic meaning-directed or

semantic information processing of online documents. As a solution, an Ontology Knowledge Base

provides innovative tools for semantic information processing and thus for much more selective,

faster, and meaningful user access.

Ontology was adopted by early Artificial Intelligence (AI) researchers, who recognised its applicability from work in mathematical logic and argued that AI researchers could create new ontologies as computational models that enable certain kinds of automated reasoning (Gruber, 2009).

Furthermore, Gruber (1992) offers the following definition of ontology: "a specification of a representational vocabulary for a shared domain of discourse — definitions of classes, relations, functions, and other objects". Ontologies are also seen as defining a common vocabulary in which shared knowledge is represented, and they are widely used to support the sharing and reuse of formally represented knowledge in AI systems (Gruber, 1992).

Kabilan (2007) and Gruber (2009) also list the following as frequent uses of ontologies:

1. To share common understanding of the structure of information among people or software

agents.

2. To enable reuse of domain knowledge.

3. To make domain assumptions explicit.

4. To separate domain knowledge from the operational knowledge.

5. To analyse domain knowledge.

2.3 Types of Ontologies

Ontologies can be classified based on their scope or domain granularity, on the taxonomy construction direction, or on the type of data sources, as shown in Figure 1 (Roussey et al., 2011).

1. Domain Granularity

Figure 1: Ontology Categories (Zilli et al., 2009)

Zilli et al. (2009) noted that a top-level ontology contains concepts on which there is general agreement or stable standards; a domain ontology contains concepts that define the main focus of interest in the domain; a task ontology deals with sub-concepts that are needed to solve problems in the main domain; and an application ontology deals with concepts that exercise the

fastest rate of exchanging data.

2. Taxonomy construction direction

Several approaches could be followed to build the concept taxonomy. One could either use the

bottom-up approach, the top-down approach or the middle-out approach (Roussey et al., 2011).

Bottom-up approach: defines the most specific concepts first and then generalises towards the most general concepts.

Top-down approach: defines the most general concepts first and then specialises towards the most specific concepts in order to build the ontology.

Middle-out approach: defines the concepts in the central (core) area first and then works outwards towards the more general and/or more specific concepts to build the ontology.
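The top-down and bottom-up directions can be contrasted in a short sketch (the Vehicle taxonomy below is purely illustrative):

```python
# Top-down: start from the most general concept and refine it step by step.
top_down = {"Vehicle": ["Car", "Bicycle"]}          # refine the root...
top_down["Car"] = ["ElectricCar", "PetrolCar"]      # ...then refine further

# Bottom-up: start from the most specific concepts and generalise them.
bottom_up = {"Car": ["ElectricCar", "PetrolCar"]}   # group specifics first...
bottom_up["Vehicle"] = ["Car", "Bicycle"]           # ...then generalise again

def leaves(taxonomy, root):
    """The most specific concepts reachable under `root`."""
    children = taxonomy.get(root, [])
    if not children:
        return [root]
    result = []
    for child in children:
        result.extend(leaves(taxonomy, child))
    return result

# Both directions end in the same taxonomy, built in opposite orders.
print(leaves(top_down, "Vehicle"))   # ['ElectricCar', 'PetrolCar', 'Bicycle']
```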

3. Type of sources

Ontologies can also be described according to the sources used to acquire the knowledge (Roussey et al., 2011). The knowledge could be based on:

Text: Unstructured data given to a computer system for processing.

Thesaurus: forming concepts from words and linguistic relations to build the ontology.

Relational databases: structured, accurate data stores from which ontologies are built.

UML diagrams: using formally described UML classes to define the concepts of the ontology.

2.4 Ontology Engineering/Methodologies

By 1995 there were no standard methodologies for building ontologies, nor many research publications in this area (Uschold & Gruninger, 1996). The same authors therefore proposed a methodology, the Enterprise Ontology Modelling Process, which has the following phases.

1. Identify purpose and scope: deals with the main reason why the ontology is being built.

2. Building the ontology: segmented into three steps:

   i) Ontology capture: identifying the key concepts and relationships in the domain of interest.

   ii) Ontology coding: representing the captured knowledge in a formal ontology language.

   iii) Integrating existing ontologies: incorporating existing ontologies into the capture and coding processes, with the logic of how the ontology will be used.

3. Evaluation: giving a technical judgement on the ontology.

4. Documentation: stating the guidelines for each purpose.
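As a rough sketch of how these phases chain together (the function bodies and the toy university domain are our own illustrative stand-ins, not part of the cited methodology):

```python
# Illustrative skeleton of the Enterprise Ontology Modelling Process phases.
def identify_purpose_and_scope():
    return {"purpose": "share a student-record vocabulary", "scope": "university"}

def capture(scope):
    # 2(i) Identify key concepts and relationships in the domain of interest.
    return [("Student", "enrolled_in", "Course"), ("Lecturer", "teaches", "Course")]

def code(captured):
    # 2(ii) Represent the knowledge in a (toy) formal language: a set of triples.
    return set(captured)

def integrate(coded, existing=frozenset()):
    # 2(iii) Incorporate existing ontologies into the result.
    return coded | set(existing)

def evaluate(ontology):
    # 3. A stand-in technical judgement: every entry must be a full triple.
    return all(len(triple) == 3 for triple in ontology)

scope = identify_purpose_and_scope()
ontology = integrate(code(capture(scope)))
print(evaluate(ontology))  # True
```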

Over the years it became recognised that the ontology development process can be carried out in many ways, following the IEEE standard for the Software Development Life Cycle Process (Ahmad et al., 2012). Two approaches used for building an ontology are:

1. building an ontology from scratch, or

2. building ontologies from existing ontologies or from different data sources (Giannopoulou, 2008).

Many researchers have proposed methodologies addressing ontology development.

Therefore, when designing an ontology knowledge base, one could choose to follow any one of the methodologies shown in Figure 2.

Figure 2: Phases of methodologies for building ontologies (Ahmad et al., 2012)

2.5 Ontology Techniques

The focus of modern information systems is moving from “data processing” towards “concept

processing”, meaning that the basic unit of processing is less and less an atomic piece of data and is

becoming more a meaningful concept which carries an explanation and exists in a context with other

concepts (Janez Brank et al, 2005).

The first key characteristic of the standardization of ontologies is the development of ontology mark-up formats and associated standards over time, which also reflects the evolving demands for semantic mark-up (Nordmann, 2009). Nordmann (2009) further explains that the largest distributed collection of data currently is the Internet, which is processed by many different applications. This involves static data, as well as web services interacting with each other and with different data sources to build new services. Figure 3 below shows the history of the languages and technologies that have been used to define ontologies over the years.

Figure 3: History of ontology related technologies (Nordmann, 2009)

2.6 Ontology Matching

Euzenat and Shvaiko (2007) define ontology matching as providing a common conceptual basis for organizing classifications, one that can be used to compare (logically) different existing ontology matching systems as well as to design new ones, taking advantage of state-of-the-art solutions.

2.7 Ontology Mapping

Godugula (2008) explains ontology mapping as a technique that has become quite useful for matching semantics between ontologies or schemas that were designed independently of each other. Ontology mapping is done by analysing various properties of ontologies, such as syntax, semantics and structure, in order to deduce alternate semantics that may apply to other ontologies, and thereby create a mapping (Godugula, 2008).
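A minimal sketch of one syntactic mapping technique, scoring candidate correspondences by normalised label similarity (the two concept lists and the 0.5 acceptance threshold are illustrative assumptions; real systems also analyse semantics and structure, as noted above):

```python
# Score candidate correspondences between two independently designed
# ontologies by normalised label similarity.
from difflib import SequenceMatcher

left_concepts = ["Student", "Lecturer", "Module"]
right_concepts = ["student_record", "teacher", "course_module"]

def similarity(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

mappings = []
for left in left_concepts:
    best = max(right_concepts, key=lambda right: similarity(left, right))
    score = similarity(left, best)
    if score > 0.5:              # arbitrary acceptance threshold
        mappings.append((left, best, round(score, 2)))

print(mappings)
```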

3 ONTOLOGY APPLICATIONS

According to Viinikkala (2004) and Roussey et al. (2011), the term ontology has become popular especially in information systems domains such as knowledge engineering, natural language processing, cooperative information systems, intelligent information integration, web technologies, database design and knowledge management. Viinikkala (2004) further noted that one strongly pursued goal in the information systems ontology domain is that of establishing methods for automatically generating ontologies; automation requires a higher degree of accuracy in the description of procedures, and ontology is a mechanism that helps to achieve this. Therefore an information system ontology needs to be designed for at least one specific or

practical application. Kabilan (2007) justifies ontology as a

software artefact or formal language designed with a specific set of uses and computational

environments in mind.

3.1 Ontology in Information Systems

We propose the use of ontologies in Information systems from three perspectives, namely:

(i) Information systems research

(ii) Information systems development (Information systems engineering)

(iii) Information systems security

3.1.1 Ontologies in Information Systems Domain Research

In this regard we consider the choice of research methods and the use of formal specifications as crucial in IS research.

(i) Choice of research methods

Consider the ontological research method shown in Figure 4 below. When considering research philosophy or methodology, ontological perspectives come to bear, since ontology is also “the science or study of being” and deals with the nature of reality (Blaikie, 1993). In simple language, and from a philosophical point of view, ontology is a system of belief that reflects an individual’s interpretation of what constitutes a fact and what does not. On this view, a researcher

should decide if the entities being studied are objective or subjective. Objectivism and subjectivism

have been identified as two important aspects of ontology. According to Saunders, Lewis, and Thornhill (2009), objectivism “portrays the position that social entities exist in reality external to social actors concerned with their existence”. Bryman (2003) further adds that objectivism “is an ontological position that asserts that social phenomena and their meanings have an existence that is independent of social actors”.

With regard to subjective research, subjectivism (the constructive interpretation of research results) holds that social phenomena are created from the perceptions and consequent actions of social actors. Bryman (2003) formally defines constructive interpretation (constructionism) as “an ontological position which asserts that social phenomena and their meanings are continually being accomplished by social actors”.

Our argument at this point is that Information Systems research should also follow ontological research methods, applied with care. The impact of ontology on the choice of research methods is shown in Figure 4 below. From the bottom, the researcher has to think of an appropriate research method and decide whether it needs a quantitative or qualitative approach, or both; then decide on the research strategy (experimental, case study, deduction, induction); then decide on the approach depending on the strategy (empirical, interpretivist); and finally follow an ontological research approach. Put plainly, information systems research permits the use of surveys, case studies and experiments, depending on the research design and approach (Gable, 1994; Choudrie & Dwivedi, 2005; Walsham & Sahay, 2006; Glasow, 2005).

Figure 4. Impact of ontology on the choice of research methods

(ii) The use of formal specifications

Ontologies are explicit formal specifications of terms in the domain and relations among them

(Gruber, 1993). Using formal specifications in IS research guarantees the ability to prove the reliability and workability of our theories. The theoretical framework of an IS research project may be established as a provable system during specification. Indeed, the concepts of domains and relations are mathematical models whose applications play vital roles in IS research. In the relational model the basic concepts are relations, Cartesian products, attributes, keys, domains and N-uplets (tuples).

Mathematically, let A and B be two sets.

1. The Cartesian product of A and B, written A × B, is the set of all ordered pairs (x, y) such that x ∈ A and y ∈ B.

2. Any subset R of A × B defines a relation on A and B; that is, a relation R is a set of ordered pairs (x, y) with x ∈ A and y ∈ B. The domain of a relation R is the set {x : (x, y) ∈ R, for some y}.

3. The range of a relation R is the set {y : (x, y) ∈ R, for some x}.
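These set-theoretic definitions translate directly into code; the short sketch below (ours, for illustration) checks them on small concrete sets:

```python
# Checking the definitions of Cartesian product, relation, domain and range.
A = {1, 2}
B = {"x", "y"}

# (1) The Cartesian product A x B: all ordered pairs (a, b) with a in A, b in B.
product = {(a, b) for a in A for b in B}

# A relation R is any subset of A x B.
R = {(1, "x"), (2, "y")}
assert R <= product

# (2) The domain of R: every first coordinate appearing in R.
domain = {a for (a, b) in R}
# (3) The range of R: every second coordinate appearing in R.
rng = {b for (a, b) in R}

print(sorted(domain), sorted(rng))  # [1, 2] ['x', 'y']
```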

The relational model brings these concepts into information systems as relations, domains, attributes, keys and N-uplets. Consider, for example, the STUDENT table shown in Figure 5 below. In this table, domains represent columns or attributes, while N-uplets represent rows or records.

FirstName LastName StudentID Level of study Department

Abel Richard 20130123 200 Information Systems

Luke Anderson 20130145 200 Computer Science

Mary Paledi 20130167 200 Information Systems

Figure 5. STUDENT table

The application of these definitions in IS development can be seen in the development of logical schemas and tables, functional dependency principles and the normalization of tables in database systems, as well as in the development of data models for specific applications. The need for proper modelling of data objects, entities and their relations cannot be over-emphasized in an IS project, as the success of the design depends largely on the data models. Improperly modelled systems compromise system functionality. The foundations of IS models are therefore firmly based in ontology. Moreover, the foundations of computing paradigms such as Object-Oriented Analysis and Design (OOAD) and their tools are deeply rooted in ontology. Further aspects of formal ontology in Information Systems are discussed in (Guarino, 1998).

3.1.2 Ontology in Information system development (Information Systems Engineering)

Ontology, as a theory of domains, presents a highly structured system of concepts covering processes, objects and attributes together with all of their complex relations.

In Information Systems development, the software development process involves at least four steps:

1. Systems analysis (Feasibility study, Requirements Engineering, Data Modeling),

2. Systems design (interface designs, input/output, files and database design –

tables/schemas/relations/normalization),

3. System implementation (programming, testing, deployment- installation/system

conversion/documentation),

4. System maintenance, and evaluation.

Ontology finds useful applications in software development processes, especially in systems analysis and data modelling. For example, DiLeo, Jacobs, and DeLoach (2002) situate ontology building in the systems analysis phase when incorporating ontologies into their extension of the MaSE methodology (DeLoach & Kumar, 2005).

As presented in (Raskin, 2001), an ontology may divide the root concept ALL into EVENT, OBJECT and PROPERTY; EVENT into MENTAL-EVENT, PHYSICAL-EVENT and SOCIAL-EVENT; OBJECT into INTANGIBLE-OBJECT, MENTAL-OBJECT, PHYSICAL-OBJECT and SOCIAL-OBJECT; and PROPERTY into ATTRIBUTE, ONTOLOGY-SLOT and RELATION, as shown in Figure 7 below. This approach is utilized in Object-Oriented Analysis and Design (OOAD) (Booch, 1993) with all its modelling tools, such as the Unified Modeling Language (UML), and facilities including use cases, activity diagrams, use case descriptions, functional models and structural models (classes, class diagrams, object diagrams, patterns, associations, attributes and operations) (Tegarden, Dennis, & Wixom, 2013).

Figure 6: Ontology building as part of system analysis (DiLeo, Jacobs, & DeLoach, 2002)

Figure 7. ALL Tree hierarchy illustrating design principle of Ontology (Classification)

Adapted from (Raskin, 2001)

ALL
  EVENT: MENTAL-EVENT, PHYSICAL-EVENT, SOCIAL-EVENT
  OBJECT: INTANGIBLE-OBJECT, MENTAL-OBJECT, PHYSICAL-OBJECT, SOCIAL-OBJECT
  PROPERTY: ATTRIBUTE, ONTOLOGY-SLOT, RELATION
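The classification in Figure 7 can be captured as a child-to-parent map with a simple subsumption query (our sketch, not an implementation from the cited work):

```python
# The ALL tree as a child -> parent map.
parent = {
    "EVENT": "ALL", "OBJECT": "ALL", "PROPERTY": "ALL",
    "MENTAL-EVENT": "EVENT", "PHYSICAL-EVENT": "EVENT", "SOCIAL-EVENT": "EVENT",
    "INTANGIBLE-OBJECT": "OBJECT", "MENTAL-OBJECT": "OBJECT",
    "PHYSICAL-OBJECT": "OBJECT", "SOCIAL-OBJECT": "OBJECT",
    "ATTRIBUTE": "PROPERTY", "ONTOLOGY-SLOT": "PROPERTY", "RELATION": "PROPERTY",
}

def is_a(concept, ancestor):
    """True when `ancestor` subsumes `concept` in the hierarchy."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = parent.get(concept)   # walk up towards the root
    return False

print(is_a("SOCIAL-EVENT", "EVENT"), is_a("ATTRIBUTE", "OBJECT"))  # True False
```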

3.1.3 Ontology in Information system security

Raskin (2001) discusses ontological needs in information security from two perspectives, namely:

(i) Inclusion of natural language data sources as an integral part of the overall data sources in information security applications. Natural language processing (NLP) is used extensively today in writing system administration logs, in information hiding, and in scanning documents to detect possible intellectual property breaches.

(ii) Formal specification of the information security community's know-how to support routine and time-efficient measures for preventing and counteracting attacks on computer systems. Sophisticated algorithms are also used as encryption tools in information security. Security in information systems can be enforced at different levels. In fact, the basic concepts of the relational model (relations, domains, attributes, keys and N-uplets) allow for appropriate security checks at the design level, such as domain integrity, referential integrity and N-uplet checks, all of which are derived from formal specifications produced during design. (Note that in a relational table, domains represent columns or attributes, while N-uplets represent rows or records.)
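A minimal sketch of such design-level checks (the STUDENT-style attributes, value rules and department list are illustrative assumptions of ours):

```python
# Design-level integrity checks derived from the relational model.
# Domain integrity: each attribute value must belong to its declared domain.
# Referential integrity: a foreign key must reference an existing key.
attribute_domains = {
    "StudentID": lambda v: isinstance(v, int) and v > 0,
    "Level": lambda v: v in {100, 200, 300, 400},
}
departments = {"Information Systems", "Computer Science"}  # referenced keys

def domain_integrity(row):
    return all(check(row[attr]) for attr, check in attribute_domains.items())

def referential_integrity(row):
    return row["Department"] in departments

good = {"StudentID": 20130123, "Level": 200, "Department": "Information Systems"}
bad = {"StudentID": 20130123, "Level": 999, "Department": "Physics"}

print(domain_integrity(good), referential_integrity(good))  # True True
print(domain_integrity(bad), referential_integrity(bad))    # False False
```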

Furthermore, Pereira and Santos (2009) suggested that ontologies contribute to unifying the terminologies

involved in classification and storage of security data, promote exchange of security information,

support browsing and search of semantic contents, promote interoperability for facilitation of

knowledge management and configuration, and provide support for construction of models or

theories of specific domains.

As security is critical in information systems, approaches that guarantee provable and reliable tools, such as those found in ontology, should be emphasized. Hence the adequate dissemination of essential ontological knowledge among researchers, information systems engineers and industry practitioners is a sine qua non for providing security for vital data and information.

3.1.4 Ontologies in Multiagent Systems

From an Artificial Intelligence perspective, agents are communicative, intelligent, rational and possibly intentional entities. From the computing perspective, they are autonomous, asynchronous, communicative, distributed and possibly mobile processes (Pitt & Mamdani, 1991). Multiagent systems (Wooldridge, 2002) are modular distributed systems with decentralized data. Agents in a multiagent system have incomplete information or capabilities and have to interact, using agent communication languages (Labrou, Finin, & Peng, 1999) and interaction protocols (Huget & Koning, 2003), to further their goals. There are numerous agent-oriented software engineering (AOSE) methodologies; the most common are MaSE (DeLoach & Kumar, 2005), GAIA (Wooldridge, Jennings, & Zambonelli, 2005), PROMETHEUS (Padgham & Winikoff, 2005) and TROPOS (Bresciani, Giorgini, Giunchiglia, Mylopoulos, & Perini, 2004).

Ontologies have since been integrated successfully into some of these methodologies and are used extensively in the development of multiagent systems. For example, Tran and Low (2008) introduce MOBMAS, a methodology for ontology-based multi-agent system development. MOBMAS is claimed to be the first methodology that explicitly identified and implemented the various ways in which ontologies can be used in the MAS development process and integrated into the MAS model definitions.

There have also been extensions of multiagent systems methodologies, such as MaSE (DiLeo, Jacobs, & DeLoach, 2002), to use ontologies for information domain specification as part of the systems analysis phase. Agent development frameworks for building multiagent systems, such as JADE (Bellifemine & Giovanni, 2007), also offer tools for developing these ontologies.
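The role a shared ontology plays in agent communication can be sketched as follows (the agent names and vocabulary are illustrative; real frameworks such as JADE provide far richer content languages):

```python
# Agents that may only exchange messages whose terms appear in a shared
# ontology vocabulary, so both sides interpret the message alike.
shared_vocabulary = {"offer", "accept", "reject", "price"}

class Agent:
    def __init__(self, name):
        self.name = name
        self.inbox = []

    def send(self, recipient, *terms):
        # Reject any term outside the shared conceptualization.
        unknown = set(terms) - shared_vocabulary
        if unknown:
            raise ValueError(f"terms outside the shared ontology: {unknown}")
        recipient.inbox.append((self.name, terms))

buyer, seller = Agent("buyer"), Agent("seller")
buyer.send(seller, "offer", "price")
print(seller.inbox)  # [('buyer', ('offer', 'price'))]
```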

4 ONTOLOGY IN CERTS: A PROPOSAL FOR ONTOLOGIES IN PROTECTION STRUCTURES (COMPUTER EMERGENCY RESPONSE TEAM APPLICATIONS)

This paper proposes the development of an ontology for representing and sharing incidents between CERTS. CERTS are protection structures for critical infrastructure and services (West-Brown, Stikvoort, Kossakowski, Killcrece, & Ruefle, 1998).

There is a proliferation of critical systems in utility services such as power distribution, telecommunications and water. Critical systems are also used to drive most economic activity, for example in banks and financial institutions, stock markets, tax systems and others.

Furthermore, it is widely known and accepted that business in general is now increasingly done online through e-commerce, as are other services through eServices.

Increasingly also, most governments have developed eGovernment strategies with strategic aims such as improving service and content delivery through the development of a core set of critical infrastructures. Governments are investing heavily in undersea cables, national telecommunications infrastructure, backbone networks and increased bandwidth and connectivity.

These developments, coupled with the diffusion of devices, deregulation and falling bandwidth prices, have led to an increase in the roll-out and consumption of government public service offerings and information. Technology is also employed as a leveller for social equality, homogenising quality of service across societies in areas like health, education and agriculture through, for example, eHealth and telemedicine, tele-education and distance learning.

Governments are also engaged in open government initiatives and are opening up datasets and

statistics to facilitate business and innovation.

As a result, in response to this proliferation of critical systems, infrastructure and services, most countries, as a matter of strategic security imperative and as part of frameworks for critical information infrastructure protection, are now developing or have developed Computer Emergency Response Teams (CERTs) to handle cybersecurity incidents.

Against this backdrop, there is a need to develop information systems that allow a common representation of knowledge and the sharing of information between CERTS and between countries. Ontologies can be used for these purposes.

Such systems, if based on multiagent systems, could also provide mechanisms for sharing ontologies and for ontology negotiation. This paper is part of our ongoing research in this direction.
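A first cut at the proposed CERTS ontology might be sketched as follows; every class, relation and instance name here is an illustrative placeholder of ours, not a finished design:

```python
# Illustrative sketch of a minimal CERTS incident ontology: classes, relations,
# and an incident instance that two teams could exchange in a shared vocabulary.
classes = {"Incident", "Attack", "Asset", "CERT"}
relations = {
    ("Incident", "caused_by", "Attack"),
    ("Incident", "affects", "Asset"),
    ("Incident", "reported_by", "CERT"),
}

incident = {                      # an instance conforming to the ontology
    "type": "Incident",
    "caused_by": {"type": "Attack", "kind": "phishing"},
    "affects": {"type": "Asset", "name": "mail-gateway"},
    "reported_by": {"type": "CERT", "name": "CERT-BW"},
}

def conforms(instance):
    """Every relation used by the instance must be declared in the ontology."""
    for rel, value in instance.items():
        if rel == "type":
            continue
        if (instance["type"], rel, value["type"]) not in relations:
            return False
    return True

print(conforms(incident))  # True
```

A receiving team could run the same conformance check before accepting an incident report, which is the shared-vocabulary guarantee the proposal aims for.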

5 CONCLUSION

In concluding this paper we suggest that ontological methods are useful in Information Systems research, Information Systems development and Information Systems security. The Information

Systems field (IS) makes extensive use of ontological models in terms of formalism (formal

specifications), theories, software development process, tools, metrics and several applications.

Ontological research approaches provide good guidance in defining research problems as well as in selecting appropriate research methods. Therefore, the need to explore the potential of ontology in Information Systems cannot be over-emphasized. Surveys, case studies and experiments are all admissible in IS research; however, the researcher must appropriately demonstrate the need for and usefulness of the chosen research method. On the information security side, we suggest that the application of ontological approaches and models could assist in the development of reliable information security tools to withstand cyber-attacks on data and information. We propose in this paper an application of ontology in CERTS/CIRTS.

REFERENCES

Zilli, A., et al. (2009). Semantic Knowledge Management: An Ontology-Based Framework. New York: Information Science Reference.

Bellifemine, F. L., & Giovanni, C. G. (2007). Developing Multi-Agent Systems with JADE. Wiley.

Blaikie, N. (1993). Approaches to Social Enquiry. Cambridge: Polity Press.

Booch, G. (1993). Object-Oriented Analysis and Design with Applications. Addison-Wesley

Professional.

Bresciani, P., Giorgini, P., Giunchiglia, F., Mylopoulos, J., & Perini, A. (2004). TROPOS:an agent

oriented software development methodology. Journal of Autonomous Agents and Multi-

Agent Systems, 8(3), 203–236.

Bryman, A., & Bell, E. (2003). Business Research Methods. Oxford: Oxford University Press.

Roussey, C., et al. (2011). An Introduction to Ontologies and Ontology Engineering. In Ontologies in Urban Development Projects (p. 241). London: Springer-Verlag.

Chandrasekaran, B., Josephson, J. R., & Benjamins, V. R. (1999). What are ontologies, and why do we need them? IEEE Intelligent Systems, 14(1), 20-26.

Chau, K. (2007). An ontology-based knowledge management system for flow and water quality

modeling. Advances in Engineering Software 38, 172–181.

DeLoach, S., & Kumar, M. (2005). Multi-agent systems engineering: an overview and case study. In B. Henderson-Sellers & P. Giorgini (Eds.), Agent-Oriented Methodologies (pp. 236-276). IDEA Group Publishing.

Fensel, D., et al. (2001). On-To-Knowledge: Ontology-Based Tools for Knowledge Management. Free University Amsterdam (VUA), Mathematics and Informatics, De Boelelaan 1081a, NL-1081 HV Amsterdam, The Netherlands. Retrieved September 9, 2015, from http://www.cs.vu.nl/~frankh/postscript/eBeW00.pdf

DiLeo, J., Jacobs, T., & DeLoach, S. A. (2002). Integrating Ontologies into Multiagent Systems

Engineering. AOIS '02, Agent-Oriented Information Systems, Proceedings of the Fourth

International Bi-Conference Workshop on Agent-Oriented Information (AOIS-2002 at

AAMAS02). Bologna, Italy: Springer. Retrieved from http://SunSITE.Informatik.RWTH-

Aachen.DE/Publications/CEUR-WS/Vol-59/1DiLeo.pdf

Euzenat J and Shvaiko P. (2007). Ontology Matching. USA: Springer Verlag.


Proceedings of the 1st International Conference on the Internet, Cyber Security, and Information Systems (ICICIS), Gaborone, 18-20 May 2016

Copyright © Department of Computer Science, University of Botswana, 2016 59





ICI002

Big Data Forensics As A Service

Oteng Tabona, Andrew Blyth

Information Security Research Group, University of South Wales

Pontypridd, United Kingdom [email protected]; [email protected]

ABSTRACT

The endless human reliance on computers, the proliferation of relatively inexpensive computing devices, and the continuous advancement of technology have given rise to Big Data. Today, digital forensic investigations involve not only a PC but numerous digital devices, which together hold huge amounts of data. These developments have affected digital forensics, because current forensic tools fall short in many ways when dealing with Big Data. In this paper we discuss the Big Data challenges, outline some requirements for a Big Data forensic tool and then design a Big Data forensic framework. A few experiments carried out on the platform show that the framework can be used as a forensic tool.

Key words: Big Data Forensics, Digital Forensics as a Service, Forensic cloud, Digital Forensics, Hadoop

1 INTRODUCTION

The cost of digital devices has dropped drastically over the years, making them affordable to most people. Digital devices are becoming people's lifelines: many people find it hard to spend a day without one. They depend on them for various purposes such as communication, entertainment, access to information and health applications. Each digital device is designed for a specific purpose, so consumers normally own a number of them for different use cases. For example, a person might own a smartphone, tablet, laptop, desktop, game console and TV, all for different purposes.

Digital devices have become a very important part of crime investigations. However, the proliferation in their usage has increased the amount of data that is acquired for digital forensic investigations. Records from the Regional Computer Forensic Laboratory (RCFL, 2013) indicate that the size of evidence increased by over 500% over the seven years from 2006. The collected evidence data is huge in size and heterogeneous, and needs new, efficient algorithms to analyse it. We classify this data as Big Data. Big Data was described by Laney (2012) as high-volume, high-velocity and high-variety data. Big Data has affected digital forensics in a number of ways that are discussed in the following paragraphs.

Traditional forensic tools are typically based on a single workstation. The massive amount of

evidence collected for investigations nowadays cannot be examined on a single workstation because


of storage and processing power demands. In addition, the number of devices that can be seized per investigation has increased. This is a concern because current forensic tools are not designed to investigate multiple devices at the same time. The opportunity to carry out cross-drive analysis and find correlations between these devices is therefore missed.

Furthermore, various data sources generate data in different formats. Unstructured data constitutes 80% of the data produced today (IBM, 2010). However, existing tools cannot efficiently analyse it, because doing so needs more processing power than a single workstation can provide. The lack of unstructured data support means that this data is often examined manually. Manual examination can be costly in time and resources and increases the likelihood of errors.

Existing forensic tools also lack the capability to share case output, which is needed to help identify organised crimes that span multiple law enforcement regions. Without such a capability, criminal organisations that appear in different regions will not be discovered.

1.1 Contribution

This paper presents a digital forensic platform to address the Big Data challenges in forensics. The Hadoop framework is used to store and process the evidence collection. We also adopt an 'as a service' model so that our platform works in the cloud. A cloud environment provides further scalability beyond what Hadoop offers and thus helps in investigations involving huge amounts of forensic evidence. The framework also encompasses an intelligence sharing framework, which offers the capability to search previous case evidence from other law enforcement agencies (LEAs).

The rest of this paper is structured as follows: this first section has outlined the Big Data challenges in digital forensics. Section 2 discusses the requirements of a Big Data forensic tool. Section 3 reviews research related to this work. Section 4 shows the architecture of the proposed framework. A few experiments carried out on the framework are presented in Section 5. The paper is concluded in Section 6.

2 REQUIREMENTS

In this section we detail the requirements of our Big Data forensic framework and explain how these requirements will address the Big Data challenges discussed in Section 1.

2.1 Big Data technology

Big Data requires a new generation of technologies and architectures, designed to efficiently extract value from large volumes of heterogeneous data by allowing high-velocity discovery and/or analysis (IDC, 2011). Traditional software and techniques cannot efficiently analyse such huge volumes of data. Recently there has been a rise in the number of Big Data solutions, which offer features such as scalability, reliability, and availability. The solution considered for this study is Hadoop, an open-source, distributed, batch-processing and fault-tolerant system that is capable of storing and analysing massive amounts of data (White, 2011).

Hadoop provides the high level of scalability that is needed for Big Data processing, allowing the addition of computing nodes whenever demand increases. Data in Hadoop can be processed


with MapReduce, which processes data in parallel, thereby achieving scalability in data-intensive analysis (Dean & Ghemawat, 2008). Another fundamental feature of MapReduce is that code is sent to the nodes that hold the data instead of transferring the data to the computing nodes. This approach reduces the chance of a network bottleneck because the code is much smaller than the data.
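The map/shuffle/reduce flow just described can be illustrated with a miniature word-count simulation in plain Python. This stands in for the Hadoop API rather than using it; on a real cluster each node would run the map phase over its own local data blocks, and the shuffle would move pairs across the network.

```python
from collections import defaultdict

def map_phase(record):
    # Map: emit one (word, 1) pair per word in the record.
    for word in record.split():
        yield word.lower(), 1

def reduce_phase(key, values):
    # Reduce: combine all values that share a key.
    return key, sum(values)

def mapreduce(records):
    # Shuffle: group intermediate (key, value) pairs by key before reducing.
    groups = defaultdict(list)
    for record in records:  # on a cluster, each node maps its local blocks
        for key, value in map_phase(record):
            groups[key].append(value)
    return dict(reduce_phase(k, v) for k, v in groups.items())

counts = mapreduce(["to be or not to be", "be quick"])
print(counts["be"])  # -> 3
```

The same two-function structure (map emits pairs, reduce aggregates per key) underlies the forensic analyses described later in the paper.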

Another benefit of the Hadoop framework is that it is fault tolerant: data is normally replicated to multiple data nodes (three by default), so processing can continue and no evidence is lost even when individual nodes fail.

2.2 Cross-Drive analysis

Cross-drive analysis gives an investigator the opportunity to search across multiple evidence sources to find interesting patterns. This analysis makes it possible to build an exhaustive timeline, which gives a detailed picture of past events.
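Conceptually, building such a timeline is a sorted merge of per-device event streams on their timestamps. A minimal sketch follows; the devices, timestamps and events are invented for illustration:

```python
import heapq

# Hypothetical per-device event lists, each already sorted by timestamp.
phone  = [(1001, "phone", "SMS sent"), (1007, "phone", "photo taken")]
laptop = [(1003, "laptop", "USB drive mounted"), (1005, "laptop", "file deleted")]

# heapq.merge lazily merges the sorted streams into one ordered timeline,
# so events from different drives interleave in time order.
timeline = list(heapq.merge(phone, laptop))
for timestamp, device, event in timeline:
    print(timestamp, device, event)
```

The merged view is what lets an investigator see, for instance, that a file was deleted on the laptop shortly after a message was sent from the phone.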

2.3 Collaboration

Digital forensic investigators are experiencing a complex and wide range of problems that need people with different expertise to work together and share know-how. Collaboration gives investigators the opportunity to learn from each other, leading to improved strategies in problem solving. This feature will also reduce the duplication of processes that would affect the efficiency of the investigation.

Collaboration addresses the current practice whereby each investigator is assigned an evidence source, and result sets are only grouped to find connections after the individual investigations are complete (Baar, Beek & Eijk, 2014). The problem with this procedure is that results can easily be missed. With the proposed approach, results will be aggregated automatically.

The collaboration platform also encourages specialisation. Examiners who specialise in image analysis can concentrate on that area alone, while those who specialise in textual analysis can focus on text examination only. The collaboration feature will also allow trusted investigators in any location to be given access to the platform and assist with examinations.

2.4 Intelligence sharing

The current forensic set-up makes it hard to find correlations between cases. The intelligence sharing feature allows investigators to search old cases for links; in doing so, examiners can discover criminal gangs.

2.5 Knowledge sharing

The technological landscape is evolving at a very high rate; for example, every year new smartphones are released with new features. These features often need to be studied to find suitable ways to discover evidence. The knowledge sharing framework is designed to allow individuals to share investigative techniques and strategies. When investigators are faced with a new challenge, they can access the framework to see if there is any shared information pertaining to the issue at hand.


3 LITERATURE REVIEW

The idea of using a cloud platform to address the Big Data challenge in digital forensics is relatively new. A few research papers concentrate on providing digital forensics from a cloud environment to tackle this challenge. Some of the papers of interest are discussed below.

Roussev et al. (2009) presented an MPI MapReduce (MMR) model to deal with large forensic collections. MMR provides linear scaling for both CPU-intensive processing and indexing. Miller et al. (2014) designed Forensicloud, an architecture for cloud-based digital forensic analysis. The Forensicloud framework encompasses existing forensic tools which run in a cloud environment, providing faster processing capabilities and also collaborative functions (Miller et al., 2014). Roussev and Richard (2004) proposed a lightweight distributed framework to improve investigation turnaround time, emphasising that single workstations cannot cope with the performance demands. They also suggested that it is not possible to upgrade the performance of these tools, as they have already reached their limit.

Baar, Beek and Eijk (2014) presented a Digital Forensics as a Service (DFaaS) model, which is used by the Netherlands Forensic Institute (NFI). The authors report that the DFaaS model has significantly reduced case backlogs and improved the efficiency of investigations by improving the traditional investigation process. Some of the key changes include freeing up investigators' time by assigning administrative roles to administrators, sharing information between the investigator, analyst and detective, and archiving acquired knowledge for future use.

Federici (2013) designed a conceptual digital forensic framework called AlmaNebula, which leverages the power and storage capacity of private or community clouds to process digital evidence. Federici foresees cloud computing platforms as a solution for existing tools, which cannot scale to meet both storage and processing power demands. The Sleuth Kit Hadoop framework (Carrier, 2012) is a prototype project that incorporates The Sleuth Kit into a Hadoop cluster to speed up digital forensic investigation time. However, the development of this platform seems to have stopped, and more work needs to be done to complete it. Raghavan, Clark and Mohay (2009) made a case for merging different evidence sources, highlighting that by integrating multiple sources investigators will be able to reconstruct a precise image of the past. They presented an architecture for integrating evidence information from different sources irrespective of the logical type of its contents.

4 ARCHITECTURE

The architecture of the proposed Big Data forensic framework is shown in Figure 1. The architecture is divided into three blocks: storage, processing and applications. The storage block is used to store the evidence, the processing block analyses the evidence, and the application block holds a number of applications that are applied to the evidence. The framework is implemented in a cloud environment to increase the overall scalability of the system.


4.1 Storage

Digital forensic images will be acquired using existing imaging tools such as FTK Imager. The acquired images are inserted into the framework and stored in HDFS (Hadoop Distributed File System) by an Ingester application. After the insertion, the Ingester initialises a File System Parser (FSP) program. The FSP reads the image and populates an HBase table for the case being investigated.

The main requirement of the storage system is to maintain the integrity of the evidence. Hadoop was created with the notion of write once, read many times: the assumption is that once a file is written it will not be modified. This requirement is monitored throughout the investigation of the evidence.

Figure 1 Big Data forensic framework

4.2 Processing

The MapReduce framework is used to process evidence in HDFS and HBase. Client applications are written to implement the Map and Reduce code. MapReduce operates by sending code to the nodes that hold the data. The Map function processes the data in parallel, and the Map output is then combined by the Reduce function. The resulting Reduce output can then be processed by client applications (such as timeline visualisation) to discover suspicious acts.

4.3 Applications

The application block features client applications. The early development of the framework will include a few applications; additional programs can be added later to improve the functionality of the framework. For example, the initial development of this framework only incorporates widely used file systems, and parsers for the remaining file systems can be written and added to the framework later.

Several examiners are allowed to investigate the same case together. Their role is to run client

applications such as timeline and network analysis. The output from these analyses can be further

interpreted using various techniques such as visualisation.

The Intelligence Sharing application is a specialised feature that identifies objects in a case. These objects can include credit card numbers, email addresses, phone numbers, people's names and so on. The identified objects are represented in a format such as XML and stored in an external


database that is accessible to other investigators using the same framework. The Intelligence Sharing feature is also used to identify links between the case being investigated and previous cases; such links are missed with current techniques because no comparable framework exists.
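As a rough sketch of such an application, the following extracts two object types with illustrative (not production-grade) regular expressions and serialises them to XML. The patterns, case ID and sample text are all hypothetical; a real deployment would use validated patterns for credit cards, names and so on.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical patterns for two object types (illustrative only).
PATTERNS = {
    "email": r"[\w.+-]+@[\w-]+\.[\w.-]+",
    "phone": r"\+?\d{3}[- ]?\d{3}[- ]?\d{4}",
}

def extract_objects(case_id, text):
    # Represent each identified object as an XML element tied to its case,
    # ready to be stored in an external, shared database.
    root = ET.Element("case", id=case_id)
    for kind, pattern in PATTERNS.items():
        for match in re.findall(pattern, text):
            ET.SubElement(root, "object", type=kind).text = match
    return ET.tostring(root, encoding="unicode")

record_xml = extract_objects("CASE-042", "Contact suspect@example.com or 555-123-4567.")
print(record_xml)
```

Matching the same extracted objects across cases is then a straightforward database lookup.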

4.4 The framework as a Big Data forensic tool

In this section we review how this framework addresses the Big Data challenges in digital forensics.

The cloud environment together with Hadoop offers high scalability to accommodate large forensic collections. This makes it possible to store evidence from multiple sources together. With the evidence stored in a common area, it is practical to perform cross-drive analysis to find any existing connections.

In addition to the large storage, the framework also provides processing power beyond that of a single workstation. Vast computing power allows us to automate most processes; current tools still perform some crucial analyses manually and are thus inefficient.

Furthermore, both the storage and processing dimensions of the proposed framework can handle unstructured data. Existing forensic tools struggle to deal with unstructured data, despite it being the most common data type.

Additional features such as collaboration and intelligence sharing are incorporated to facilitate investigations. Collaboration allows a pool of investigators to come together and share know-how on a case. Intelligence sharing gives more insight into the case being investigated: a link between it and other cases can trigger leads that are currently impossible to discover.

5 EXPERIMENTS

In this section, we evaluate whether Hadoop can be used for digital forensics. In our experiment we gathered the Enron dataset (Cohen, 2015) and inserted it into an empty 8 GB FAT32 USB drive. The USB device was then imaged using FTK Imager. The resulting image was ingested into HDFS and its md5sum hash noted. A FAT32 file system parser was then initiated to recover the files and insert them into HBase. We then carried out a few analyses (network analysis and word count), which are discussed below. The total number of files inserted into the HBase table was 88,278.
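The integrity check implied by noting the md5sum can be sketched as a before/after hash comparison. This is illustrative only (a stand-in for the md5sum utility); the chunked read is so that arbitrarily large image files need not fit in memory.

```python
import hashlib

def md5_of(path, chunk_size=1 << 20):
    # Hash the image file in 1 MB chunks to keep memory use constant.
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Hypothetical usage: record the hash at ingest time, re-hash after analysis.
# assert md5_of("usb.img") == recorded_hash, "evidence was modified"
```

If the two digests match, the evidence image was not altered by the analysis.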

5.1 Network analysis

Network analysis was used to identify communication relationships and social circles. To achieve this we wrote MapReduce code: the Map code identified the sender of an email and all the recipients of that email, and the Reduce code mapped each sender to all the recipients who had received emails from them. The Reduce output was then processed to produce an output that can be visualised using the vis.js library (Almende B.V., 2016).
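The Map and Reduce steps described above can be sketched in plain Python standing in for Hadoop. The addresses below are invented, and the node/edge dictionaries mirror the shape a visualisation library such as vis.js consumes.

```python
from collections import defaultdict

# Hypothetical parsed emails: (sender, recipients).
emails = [
    ("alice@example.com", ["bob@example.com", "carol@example.com"]),
    ("alice@example.com", ["bob@example.com"]),
    ("dave@example.com",  ["carol@example.com"]),
]

def map_email(sender, recipients):
    # Map: emit one (sender, recipient) pair per recipient.
    for recipient in recipients:
        yield sender, recipient

def build_graph(parsed_emails):
    # Reduce: collect, per sender, everyone who received mail from them.
    graph = defaultdict(set)
    for sender, recipients in parsed_emails:
        for s, r in map_email(sender, recipients):
            graph[s].add(r)
    # Nodes and edges in the shape a graph library expects.
    nodes = sorted(set(graph) | {r for rs in graph.values() for r in rs})
    edges = [{"from": s, "to": r} for s in graph for r in sorted(graph[s])]
    return nodes, edges

nodes, edges = build_graph(emails)
```

Senders with many outgoing edges then stand out immediately in the rendered graph.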

The visualised output was analysed and a number of relationships could be recognised (see Figure 2). For example, [email protected], [email protected] and [email protected] were responsible for sending email to many individuals. On the other hand, [email protected],


[email protected] and [email protected] received emails from one or more

individuals who were responsible for sending many mails ([email protected],

[email protected] and [email protected]). In an investigation the identified group of

individuals will be examined further for more information.

Figure 2 Network analysis

5.2 Word count

A word count analysis was performed on the data to determine word frequencies, using a MapReduce program we wrote to compute them. The following 20 words (see Table 1) appeared in more emails than any others:

Pat, Bill, Baughman, Don, Jr., Mr., Mrs., Andrea, Call, Friend, Janice, Laddie, Lalena, Marc, Mary, Matlock, NEIGHBOUR, Patsy, Randy, Reagan

Table 1 Most frequent words

In an investigation, word count can be used to generate keywords. These keywords can then be used to search for files that contain them.
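Generating keywords from word frequencies can be sketched with a counter and a stop-word list; the stop-list, sample documents and cut-off below are illustrative only.

```python
from collections import Counter

STOP_WORDS = {"the", "and", "to", "of", "a", "in"}  # illustrative stop-list

def keywords(texts, top_n=5):
    # Count words across all documents, skip common words, keep the top N.
    counts = Counter(
        w for text in texts for w in text.lower().split() if w not in STOP_WORDS
    )
    return [word for word, _ in counts.most_common(top_n)]

docs = ["call Pat and Bill", "Pat called the friend", "Bill will call Pat"]
print(keywords(docs, top_n=3))
```

The resulting keyword list would then seed the file search described above.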

In this paper we carried out network analysis to identify social structures in the Enron data. We also did a word count to find word frequencies in the emails. More analyses can be implemented and run on the evidence. However, the main objective of this experiment was to evaluate whether Hadoop can be used for digital forensics. To verify this we calculated the hash value of the final image (after analysis) and compared it with the original hash value. The two hash values match, which shows that Hadoop can be used for digital forensics. The experiment does not present a Big Data case; it was important to evaluate the feasibility of using Hadoop for forensics before experimenting with a Big Data case, which subsequent experiments will present.

6 CONCLUSION

The Big Data explosion has significantly affected digital forensics, as current tools cannot cope with the demand to analyse this data. In this paper we have identified the current Big Data challenges in digital forensics. We further identified essential requirements for a Big Data forensic tool, and then identified Hadoop as a suitable technology. We carried out a few experiments to first show that Hadoop can be used for digital forensics. The output from the experiments shows that the integrity of the


evidence is preserved in Hadoop, which makes it suitable for forensics.

The implementation of the platform is still ongoing. The next stage is to implement more file system parsers so that the platform can hold evidence from a variety of data sources. When this stage is complete, a Big Data case will be designed and further experiments carried out.

ACKNOWLEDGMENTS

The authors would like to thank the Botswana International University of Science and Technology (BIUST) for their support.

REFERENCES

Almende B.V. (2016). Vis.js. Retrieved from http://visjs.org/#

Baar, R., Beek, H., & Eijk, E. (2014). Digital forensics as a service: A game changer. Digital Investigation, 11, S54-S62. Retrieved from http://www.sciencedirect.com/science/article/pii/S1742287614000127

Carrier, B. (2012). Sleuth Kit Hadoop Framework. Retrieved from http://www.sleuthkit.org/tsk_hadoop/

Cohen, W., W. (2015). Enron Email Dataset. Retrieved from http://www.cs.cmu.edu/~enron/

Dean, J., & Ghemawat, S. (2008). MapReduce: Simplified data processing on large clusters. Communications of the ACM, 51, 107-113. doi:10.1145/1327452.1327492

Federici, C. (2013). AlmaNebula: A computer forensics framework for the cloud. Procedia Computer Science, 19, 139-146. Retrieved from http://www.sciencedirect.com/science/article/pii/S1877050913006315

IBM. (2010). The enterprise answer for managing unstructured data. Retrieved from https://www-304.ibm.com/events/idr/idrevents/detail.action?meid=6320

IDC. (2011). Extracting value from chaos. Retrieved from https://www.emc.com/collateral/analyst-reports/idc-extracting-value-from-chaos-ar.pdf

Laney, D. (2012). The importance of 'big data': A definition. Gartner.

Miller, C., Glendowne, D., Dampier, D., & Blaylock, K. (2014). Forensicloud: An architecture for digital forensic analysis in the cloud. Journal of Cyber Security, 3, 231–262.

Raghavan, S., Clark, A. J., & Mohay, G. M. (2009). FIA: An open forensic integration architecture for composing digital evidence. Second International Conference on e-Forensics: Forensics in Telecommunications, Information and Multimedia, 83-94. Retrieved from http://eprints.qut.edu.au/28073/

Regional Computer Forensics Laboratory (2013). The RCFL Program's annual report for Fiscal Year 2012. Retrived from https://www.rcfl.gov/downloads/documents/2012-rcfl-national-report/view

Roussev, V., & Richard III, G. G. (2004). Breaking the performance wall: The case for distributed digital forensics. Proceedings of the 2004 Digital Forensics Research Workshop (DFRWS 2004), , 1-16.

Roussev, V., Wang, L., Richard, G., & Marziale, L. (2009). A cloud computing platform for large-scale forensic computing. Advances in Digital Forensics V: Fifth IFIP WG 11.9 International Conference on Digital Forensics, 306, 201-214.

White, T. (2011). Hadoop:The Definitive Guide. O’Reilly Media, Inc.


Proceedings of the 1st International Conference on the Internet, Cyber Security, and Information Systems (ICICIS), Gaborone, 18-20 May 2016

Copyright © Department of Computer Science, University of Botswana, 2016 69

IC1007

Information Security Policy Violation:

The Triad of Internal Threat Agent Behaviors

Maureen van den Bergh, Kennedy Njenga

Department of Applied Information Systems

University of Johannesburg, Johannesburg, South Africa

[email protected]; [email protected]

ABSTRACT

Behavioral information security studies that primarily pursue the classification of information security policy violation behaviors have received little attention in the literature. A growing number of researchers and security reports have advocated research into this area. This paper endeavours to address this gap by conceptualizing the triad of internal threat agent behaviors via a thematic analysis of the literature. The triad of internal threat agent behaviors represents three classes of security behaviors: misbehavior, non-malicious deviant behavior, and malicious deviant behavior. This distinction could improve the effectiveness of corrective actions in mitigating the risks associated with information security policy violations.

Key words: information security policy, misbehavior, non-malicious deviant behavior, malicious deviant behavior, internal threat agent

1 INTRODUCTION

Information Systems (IS) security remains a top priority and a challenge for information security (InfoSec) managers (D'Arcy, Hovav, & Galletta, 2009; K. H. Guo, Yuan, Archer, & Connelly, 2011; Johnston, Warkentin, & Siponen, 2015; Loch, Carr, & Warkentin, 1992). With the financial losses from security breaches ever increasing and threats constantly evolving (The Global State of Information Security® Survey, 2015), IS security violations by employees continue to represent a problem that creates tremendous risks and costs for organizations (Vance, Lowry, & Egget, 2015). Information security managers and organizational executives struggle to enforce policies designed to protect information and information assets from intentional or unintentional security violations (D'Arcy et al., 2009; Johnston et al., 2015).

Because employees are the cause of many IS security incidents (Ernst & Young's Global Information Security Survey, 2014; The Global State of Information Security® Survey, 2015), they are a major risk to IS security and therefore an organization's first priority when it comes to IS security (Ernst & Young's Global Information Security Survey, 2014). Internal organizational employees who either cause, or contribute to, security incidents are referred to as "Internal Threat Agents (ITAs)" (Verizon Data Breach Investigations Report, 2012). The impact of these security incidents could be significant, because insiders are more likely to steal sensitive data of a non-financial nature


or intellectual property (The Global State of Information Security® Survey, 2015; US State of Cybercrime Survey, 2014; Verizon Data Breach Investigations Report, 2012). Also, because of their privileged insider position, ITAs could avoid discovery due to their intimate knowledge of organizational security efforts (Verizon Data Breach Investigations Report, 2012; Whitman & Mattord, 2012).

In the research stream of behavioral InfoSec studies (Crossler et al., 2013; S. Furnell & Clarke, 2012), the purpose of these studies is generally to improve our understanding of ITA security behaviors (D'Arcy et al., 2009; Herath & Rao, 2009; Workman, Bommer, & Straub, 2008), to mitigate the risks associated with ITA security behaviors (Colwill, 2009; Safa et al., 2015; Safa, Von Solms, & Furnell, 2016), and to change ITA security behaviors (Beautement & Sasse, 2009; Steven Furnell, Papadaki, & Thomson, 2009; Johnston et al., 2015).

Although these behavioral InfoSec studies provide multiple insights into ITA behavior, the concept of differentiating ITA security behaviors has recently emerged, along with the importance of applying the right kind of corrective actions to different classes of security behaviors (Crossler et al., 2013; Verizon Data Breach Investigations Report, 2012).

Although the classification of ITA security behaviors is not the primary focus of the following studies, they do refer to ITAs and their behavioral intent. Intent is described with terminology such as accidental (Loch et al., 1992; Vroom & von Solms, 2004), passive (Willison & Warkentin, 2013), and unintentional (Crossler et al., 2013; Verizon Data Breach Investigations Report, 2012), with the opposites intentional (Crossler et al., 2013; Loch et al., 1992; Vroom & von Solms, 2004; Willison & Warkentin, 2013), deliberate (Verizon Data Breach Investigations Report, 2012), and knowing (K. H. Guo et al., 2011).

A study by Stanton, Stam, Mastrangelo, and Jolton (2005) proposed and tested a taxonomy that includes six types of security behaviors. These behaviors are differentiated according to intentionality and technical expertise; intentionality in turn is described as malicious, neutral, or beneficial.

In the literature, not much attention has been given to behavioral InfoSec studies that primarily pursue the classification of ITA security behaviors. This paper endeavours to address this gap by conceptualizing the triad of ITA behaviors via a thematic analysis of the literature. We propose three classes of behaviors: misbehavior (MB), non-malicious deviant behavior (NDB), and malicious deviant behavior (MDB). MB is defined as unintentional, non-malicious information security policy violations; NDB as intentional, non-malicious information security policy violations; and MDB as intentional, malicious information security policy violations.

Applying the right kind of corrective actions to a specific behavior is important (Crossler et al., 2013). For example, trying to address malicious deviant behavior with awareness and training is inappropriate, because this type of behavior is intentional and harmfully motivated, whereas using awareness and training to address misbehavior is appropriate.

In pursuance of the above goal, this paper is divided into five sections. The first section introduces the triad of ITA behaviors. The second section discusses an emergent perspective on ITA behavior. The third section reflects on who ITAs are and the risks they pose to IS security. The fourth section theorizes MB, NDB, and MDB, and presents the triad of ITA behaviors. The fifth section discusses the application of the right kind of corrective actions towards MB, NDB, and MDB. The paper closes with a discussion and conclusion.

2 AN EMERGENT PERSPECTIVE ON ITA BEHAVIOR

An article titled "Future directions for behavioral information security research" by Crossler et al.

(2013) and the Verizon Data Breach Investigations Report (2012) both propose future research to separate insider misbehavior from deviant behavior. In its latest survey, the Computer Security Institute (2011) asked respondents to differentiate between non-malicious and malicious insider actions (it did not do so in previous surveys). Separating behaviors may improve the success, and sometimes the applicability, of corrective actions towards misbehavior and deviant behavior (Crossler et al., 2013).

While the classification of ITA security behaviors is not the primary aim of InfoSec studies, some refer to ITAs and their behavioral intent in sections of their studies. For example, in their pursuit to determine the threats to IS, Loch et al. (1992) included the human perpetrator's accidental and intentional intent as part of their threat taxonomy. In the organizational context, Willison and Warkentin (2013) focused on a holistic approach to insider computer abuse and considered the thought processes of human perpetrators preceding deterrence. As part of their investigation they extended Loch et al.'s (1992) threat taxonomy, focussing on the human perpetrator. They agreed with Loch et al.'s characterization of behavior as intentional, but differed on the term "accidental", replacing it with "passive". They then expanded the taxonomy to passive non-volitional noncompliance, volitional but not malicious noncompliance, and intentional malicious computer abuse.

Vroom and von Solms (2004) explored the role of auditing in organizational InfoSec. They singled out the human factor involved in organizational asset security, and how difficult it would be to audit the behavior of these employees. They concur with Loch et al. (1992) and also refer to the ITA's intent as accidental or intentional. Vroom and von Solms (2004) add to the granularity of intent by referring to the accidental non-malicious and the intentional malicious employee.

The Verizon Data Breach Investigations Report (2012) clearly states its classification of ITA behaviors as unintentional, inappropriate but not malicious, and deliberately malicious. It uses a similar classification of "unintentional" to Crossler et al. (2013), but the similarity ends there: while the Verizon Data Breach Investigations Report (2012) uses the classification "deliberate", Crossler et al. (2013) rather agree with Loch et al. (1992), Vroom and von Solms (2004), and Willison and Warkentin (2013) on a classification of "intentional".

One study, K. H. Guo et al. (2011), theorised and tested a model of intentional violation, but only focused on non-malicious security violations (NMSV). A study by Stanton et al. (2005) produced a taxonomy of six types of security behaviors, differentiated according to intentionality and technical expertise. In their two-factor taxonomy, Stanton et al. (2005) classified three intentions: malicious, neutral, and beneficial.


The studies mentioned above use a variety of terminologies to describe ITA security behaviors. Despite this, through their agreement and disagreement on behavior terminologies, the studies cited above also seem to inadvertently suggest a classification and differentiation of ITA security behaviors.

Table 1 shows the results of the thematic analysis of the literature described above.

Table 1 Thematic Analysis of Literature

Author/s | Year | Source | Perpetrator | Intent | Behavior
Loch et al. | 1992 | Internal | Human | Accidental; Intentional | -
Willison and Warkentin | 2013 | Internal | Human | Passive non-volitional; Volitional but not malicious; Intentional malicious | Non-compliance; Non-compliance; Computer abuse
Verizon Data Breach Investigations Report | 2012 | Internal | Human | Unintentionally; Inappropriately but not maliciously; Deliberately maliciously | -
Guo et al. | 2011 | Internal | End-users | Knowingly violate | Non-malicious security violations
Crossler et al. | 2013 | - | Insider | Intentional; Unintentional | Deviant; Misbehavior
Stanton et al. | 2005 | - | Information technology users | Intentionally malicious; Intentionally beneficial; Neutral | -
Vroom & von Solms | 2004 | - | Employees | Accidental; Intentional | Non-malicious; Malicious

The security behavior of ITAs is key to an organization's efforts to decrease policy violations. It is central to an organization's security efforts that ITAs behave and act responsibly with regard to the information security policy (ISP). The next section explores the above.

3 INTERNAL THREAT AGENTS

The threats to IS security come from internal and external sources, and from perpetrators both human and non-human (Loch et al., 1992). Of these, the internal human threat, also referred to as the "Internal Threat Agent (ITA)", is the cause of many IS security incidents (Ernst & Young's Global Information Security Survey, 2014; The Global State of Information Security® Survey, 2015).


ITAs are the internal organizational employees who are either the source of a security incident or contribute to such an incident. ITAs include principal management, employees, contractors, and interns (Verizon Data Breach Investigations Report, 2012). Principal management are persons whose primary duty is managing a department, unit, and/or sub-division; employees are persons who do not manage other employees; and contractors are persons or firms that undertake a contract to provide materials or labour to perform a service or do a job. Lastly, interns are students or trainees who work in organizations either to obtain work experience or to fulfil some requirement of their studies.

ITAs are a significant risk to IS security, with insider theft being one of the top causes of data breaches (Ernst & Young's Global Information Security Survey, 2014). "Careless or unaware employees" are a vulnerability that increases organizational risk exposure, and therefore an organization's first priority when it comes to IS security (Ernst & Young's Global Information Security Survey, 2014).

Surveys such as the CSI/FBI Computer Crime and Security Survey (2010), The Global State of Information Security® Survey (2015), the US State of Cybercrime Survey (2014), and the Vormetric Data Security Report (2015) state that insider agents contribute a much smaller share of the overall percentage of incidents than external agents, but this number is not indicative of the demise of insider misconduct. External agents may launch a single sting attack against hundreds of victims (industrialised attacks), whereas internal agents have a much smaller number of potential targets (Verizon Data Breach Investigations Report, 2012). Also, because of their privileged insider position, internal threats could avoid discovery due to their intimate knowledge of organizational security efforts (Verizon Data Breach Investigations Report, 2012; Whitman & Mattord, 2012).

Nonetheless, compared to external agents, the possible impact of insider incidents on an organization is significant. The reason is the higher probability of insiders stealing sensitive data of a non-financial nature or acquiring intellectual property (The Global State of Information Security® Survey, 2015; US State of Cybercrime Survey, 2014; Verizon Data Breach Investigations Report, 2012). Insider crimes could have a greater financial impact than outsider crimes. Often security incidents result in increased data loss, as a result of compromised employee and customer records, or the unintentional release of sensitive information, especially via the Internet. Other threats are caused by peer-to-peer file sharing, or abuse of system access privileges by others as a result of carelessly written-down or easy-to-guess passwords (K. H. Guo et al., 2011; The Global State of Information Security® Survey, 2015).

IS security remains a high priority for InfoSec managers (Loch et al., 1992) and a major challenge (K. H. Guo et al., 2011), because financial losses from security breaches are ever increasing amid constantly evolving threats (The Global State of Information Security® Survey, 2015). A study by Garg, Curtis, and Halper (2003) on the economic impact of security breaches concluded that the financial loss to an organization is much higher than initially reported by other self-reporting organizational surveys: more in the range of 0.5% to 1.0% of annual sales. 8.7% of respondents to The Global State of Information Security® Survey (2015) report that, compared to the 2014 survey, financial losses are up by 15% in 2015, and 32% of respondents to this survey indicate that insider crimes are more costly than outsider crimes.

Data breaches are also a major challenge, with insiders causing damage by unintentionally exposing sensitive information, such as compromising confidential, customer, and employee records (US State of Cybercrime Survey, 2014). The Global State of Information Security® Survey (2015) also reports that security incidents result in increased data loss, especially caused by compromised employee and customer records. "In our heavily networked world, organizations across the globe are under attack 24/7/365" (Computer Security Institute, 2011).

Surveys such as the CSI/FBI Computer Crime and Security Survey (2010), The Global State of Information Security® Survey (2015), the US State of Cybercrime Survey (2014), and the Vormetric Data Security Report (2015) report increased insider security incidents, while 62% of respondents to the Insider Threat Spotlight Report (2015) indicate an increase in the frequency of insider threats over the last 12 months. Next follows a discussion of the classification of ITA behavior in terms of MB, NDB, and MDB.

4 THEORIZING INTERNAL THREAT AGENT BEHAVIOR

4.1 Classification of Behaviors

The terminologies from the thematic analysis of the literature, presented in Table 1 in Section 2, are homogenized across their numerous similarities to classify ISP violation behaviors. This study then classifies ISP violation behaviors with the source as internal, the agents as human, their intent as either unintentional or intentional, and the three categories as misbehavior, non-malicious deviant behavior, and malicious deviant behavior. The human threat agent is differentiated as management, employees, contractors, and interns (Verizon Data Breach Investigations Report, 2012). Figure 1 illustrates the above.

Figure 1 Classification of ITA Behaviors
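The classification just described reduces to a small decision over two attributes of a violation: whether it was intentional, and whether it was malicious. As an illustrative sketch only (the enum and function names below are ours, not part of the paper's model), the logic can be expressed as:

```python
from enum import Enum

class BehaviorClass(Enum):
    MB = "misbehavior"                      # unintentional, non-malicious
    NDB = "non-malicious deviant behavior"  # intentional, non-malicious
    MDB = "malicious deviant behavior"      # intentional, malicious

def classify_violation(intentional: bool, malicious: bool) -> BehaviorClass:
    """Map an ISP violation by an internal human agent onto the triad."""
    if not intentional:
        if malicious:
            # The triad defines no unintentional-malicious class:
            # malice is taken to imply an intentional violation.
            raise ValueError("malicious intent implies an intentional violation")
        return BehaviorClass.MB
    return BehaviorClass.MDB if malicious else BehaviorClass.NDB
```

For instance, a knowing but non-hateful violation, `classify_violation(True, False)`, falls into NDB, while an unknowing one falls into MB.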


4.2 Misbehavior

We define MB as unintentional, non-malicious information security policy violations. This means that while misbehaving insiders participate in the behavior, the violation of policy is unintentional and without knowledge of the violation. However, despite the unintentional violation by the misbehaving ITA, an ISP violation has still taken place, and therefore, just like NDB and MDB, MB requires a response. Even though misbehaving ITAs do not purposefully choose to violate policies, the result may well put organizational information and information assets at risk or still cause possible damage to IS security. For example, indirect consequences of MB include creating weaknesses that could allow hackers to infect internal systems with viruses or spyware, or allow them to bypass the firewall and access confidential data (Crossler et al., 2013).

MB typically includes human error, ignorance, uninformed violations, accidental data entry, forgetful oversights, inadvertent data breaches, and unintentional actions (Ken H. Guo, 2013; Siponen & Vance, 2010).

4.3 Non-malicious Deviant Behavior

We define NDB as intentional, non-malicious information security policy violations. NDB is behavior engaged in by ITAs who knowingly violate ISPs, but without hateful intent. While non-malicious deviant ITAs do not really want to cause loss of operations or security breaches, the result may be exactly what the insider did not intend in the first place, causing possible damage anyway.

NDB includes behaviors such as failing to perform or delaying backups, accessing websites unrelated to work requirements using corporate computers, clicking phishing email links, opening possibly insecure attachments, choosing passwords that are not up to standard, not changing passwords regularly, password sharing, failing to log off when leaving the computer, not shredding sensitive information, and mistakenly uploading confidential data onto unsecure servers or the web (Aytes & Connolly, 2004; Crossler et al., 2013; D'Arcy et al., 2009; Warkentin & Willison, 2009).

4.4 Malicious Deviant Behavior

We define MDB as intentional, malicious information security policy violations. MDB is behavior engaged in by ITAs who also knowingly violate ISPs; whereas NDB does not contain hateful intent, MDB does. Malicious deviant ITAs intend damage and/or security breaches. Their intent is malevolent, with possible goals of putting the organization at risk, degrading or disrupting services to customers, and causing corporate failure. For example, deviant behavior may be directly responsible for loss of profits, credibility, or competitive advantage (Crossler et al., 2013).

MDB includes behaviors such as espionage, sabotage, embezzlement, identity theft, intellectual property theft, stealing sensitive information, malicious data breaches, data corruption, data theft, data destruction, and fraud (Aytes & Connolly, 2004; Crossler et al., 2013; K. H. Guo et al., 2011; Vance et al., 2015; Verizon Data Breach Investigations Report, 2012; Warkentin & Willison, 2009; Willison & Warkentin, 2013).


4.5 Triad of ITA Behaviors

The triad of ITA behaviors represents the interrelated yet independent nature of MB, NDB, and MDB. All three violate ISPs, whatever their intent. NDB is interrelated with both MB and MDB: NDB and MB are non-malicious in their behavioral intent, while NDB and MDB both intentionally violate policies. Each class of ITA security behavior also represents an independent unit in the overall phenomenon of ITA security behaviors.

Figure 2 Triad of ITA Behaviors

5 CORRECTIVE ACTIONS

Despite the fact that organisations are implementing technical controls (Aytes & Connolly, 2004;

Ayuso, Gasca, & Lefevre, 2012; Choo, 2011; Hansen, Lowry, Meservy, & McDonald, 2007; Zafar &

Clark, 2009), ISPs (Pfleeger & Caputo, 2012; Siponen, Mahmood, & Pahnila, 2009; Siponen,

Mahmood, & Pahnila, 2014; Whitman & Mattord, 2012), compliance approaches (Siponen et al.,

2014; Whitman & Mattord, 2012), and countermeasures (D'Arcy et al., 2009; Herath & Rao, 2009;

Straub, 1990) such as awareness, education, training, software programmes, penalties and

pressures, employees seldom comply with policies (Siponen et al., 2014).

Some IS security studies also attempted to understand the security behavior of individuals via

concepts such as (Crossler et al., 2013): neutralization (Siponen & Vance, 2010), disgruntlement

(Willison & Warkentin, 2013), shame and moral beliefs (Siponen, Vance, & Willison, 2012), self-

control (Hu, Xu, Dinev, & Ling, 2011), accountability (Siponen et al., 2012; Vance et al., 2015), fear

(Johnston et al., 2015), organizational culture (Hu, Dinev, Hart, & Cooke, 2012), and rational choice

(Aytes & Connolly, 2004; Bulgurcu, Cavusoglu, & Benbasat, 2010).

A number of IS security studies have investigated the effectiveness of deterrence on deviant behavior (D'Arcy et al., 2009; Herath & Rao, 2009; Hu et al., 2011; Straub, 1990; Straub & Welke, 1998). For example, countermeasures that include deterrent administrative procedures and pre-emptive security software resulted in lowered computer abuse (Straub, 1990), and when users are aware of security countermeasures, this influences the perceived certainty and severity of organizational IS misuse sanctions, which leads to reduced IS misuse intention (D'Arcy et al., 2009).

The common denominator of current approaches to mitigating ISP violations is that the studies were conducted using survey samples that did not differentiate the security behavior of ITAs. It may no longer be sufficient to have a "one-size-fits-all" approach to addressing security behavior within organizations, as different security behaviors require different corrective actions.

To reduce the number of incidents caused by ITAs, organizations ought to address MB, NDB, and MDB by applying the correct type of corrective action or actions to each class of behavior. For example, attempting to address MDB with compliance approaches (Siponen et al., 2014; Whitman & Mattord, 2012) and countermeasures (D'Arcy et al., 2009; Herath & Rao, 2009; Straub, 1990) such as awareness, education, and training is inappropriate, because this type of behavior intends damage and/or security breaches, and the threat agents know well enough that they are violating the ISP. Awareness campaigns, educational programs, and training sessions are appropriate for MB, because they address ignorance of the ISP and its content. Although the NDB agent knowingly violates ISPs, the behavior is without hateful intent; awareness campaigns, educational programs, and training sessions aimed at explaining the risks and consequences associated with ISPs are therefore appropriate for this class of security behavior.

Organizational culture positively influences employees' attitudes towards compliance with ISPs (Hu et al., 2012), and increasing accountability can reduce access policy violation intentions (Vance et al., 2015). Accountability (Siponen et al., 2012; Vance et al., 2015) and organizational culture (Hu et al., 2012) are thus well suited to addressing MB and NDB. Although misbehaving ITAs are not intentional in their violation of policy, having a positive attitude towards compliance may influence them to seek information and attend training programs, while the non-malicious deviant ITA may rethink their deviant behavior in a culture that advocates secure practices and accountability.

Applying corrective actions aimed at a specific class of behavior could improve the effectiveness of corrective actions in mitigating the risks associated with ISP violations.
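The argument in this section amounts to a lookup from behavior class to suitable corrective actions. The sketch below is our own illustration of that lookup: the action sets for MB and NDB follow the section's discussion directly, while listing deterrence and sanctions for MDB reflects the deterrence literature the section cites rather than an explicit prescription of the paper.

```python
# Illustrative mapping from behavior class to the corrective actions
# discussed in this section; the exact assignment is our interpretation.
CORRECTIVE_ACTIONS = {
    "MB":  {"awareness", "education", "training",
            "accountability", "organizational culture"},
    "NDB": {"awareness", "education", "training",
            "accountability", "organizational culture"},
    # Awareness and training are deemed inappropriate for MDB, since the
    # agent already knows the ISP is being violated and intends harm.
    "MDB": {"deterrence", "sanctions"},
}

def actions_for(behavior_class: str) -> set[str]:
    """Look up corrective actions for one class, rather than applying a
    one-size-fits-all response to all ISP violations."""
    return CORRECTIVE_ACTIONS[behavior_class]
```

A security function could use such a table to select, say, `actions_for("NDB")` when an intentional but non-malicious violation is identified, instead of a blanket response.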

6 DISCUSSION

IS security remains a high priority for information security executives (Loch et al., 1992). It is key to an organization's security efforts that ITAs behave and act responsibly with regard to the information security policy (ISP). Organizations need to be able to clearly differentiate the security behaviors of their ITAs, and then apply appropriate corrective actions to the different behaviors.

To this end, we classified ITA security behaviors as misbehavior, non-malicious deviant behavior, and malicious deviant behavior. The implications of this differentiation are key to improving the effectiveness of applied corrective actions. Organizations need to understand that it may no longer be sufficient to have a "one-size-fits-all" approach to addressing ITA security behaviors and mitigating the risks associated with ISP violations. Research by Crossler et al. (2013) and reports such as the Verizon Data Breach Investigations Report (2012) and the Computer Security Institute (2011) survey support this viewpoint.

While technical controls (Aytes & Connolly, 2004; Ayuso et al., 2012; Choo, 2011; Hansen et al., 2007; Zafar & Clark, 2009), ISPs (Pfleeger & Caputo, 2012; Siponen et al., 2009; Siponen et al., 2014; Whitman & Mattord, 2012), compliance approaches (Siponen et al., 2014; Whitman & Mattord, 2012), and countermeasures (D'Arcy et al., 2009; Herath & Rao, 2009; Straub, 1990) have all been suggested to reduce ISP violations, applying the right kind of corrective actions to a specific behavior is important (Crossler et al., 2013).

Organizations should accept that security incidents caused by ITAs cannot be eliminated, but the number of incidents can be reduced by addressing the related ITA behaviors. Organizations should consider not addressing ISP violation behaviors as a collective, but rather apply corrective actions to MB, NDB, and MDB individually. Organizations that understand their employees and their security behaviors could address ISP violations more effectively.

MB, as defined by this study, covers the unintentional, non-malicious ISP-violating ITA. The consequences of their ignorant or uninformed violations (Guo, 2013; Siponen & Vance, 2010) are to indirectly create weaknesses that could allow hackers to infect internal systems with viruses or spyware, or to bypass the firewall and access confidential data (Crossler et al., 2013). Countermeasures such as awareness, education, and training programs (D'Arcy et al., 2009; Herath & Rao, 2009; Straub, 1990) are appropriate for MB because they address ignorance of the ISP and its content. NDB, as defined by this study, covers the intentional, non-malicious ISP-violating ITA. Although the NDB agent knowingly violates ISPs, the behavior is without hateful intent; therefore awareness campaigns, educational programs, and training sessions aimed at explaining the risks and consequences associated with ISP violations are also appropriate for this class of security behavior.

Both NDB and MDB agents knowingly violate ISPs. While deterrence was found to be effective in reducing deviant behavior (D'Arcy et al., 2009; Straub, 1990), some IS security studies that attempted to understand the security behavior of individuals deduced that employees use techniques of neutralization (Siponen & Vance, 2010) to moderate their violations. NDB does not involve hateful intent, so shame and moral beliefs (Siponen et al., 2012), accountability (Siponen et al., 2012; Vance et al., 2015), fear (Johnston et al., 2015), and organizational culture (Hu et al., 2012) could address it. MDB agents, whose behavior does contain hateful intent, may, because of disgruntlement (Willison & Warkentin, 2013), continue with their malevolent behavior, and may be directly responsible for loss of profits, credibility, or competitive advantage (Crossler et al., 2013).

7 CONCLUSION

We established the potential to more effectively apply countermeasures and corrective actions

towards ITA security behaviors, and therefore to decrease the number of ISP violation incidents.

While behavioral InfoSec studies aim to improve our understanding of, mitigate the risks associated with, and change ITA security behaviors, they should consider applying countermeasures and corrective actions not to survey samples as a collective, but to categorized survey samples. Thus, firstly


categorize participants according to their behavioral class of MB, NDB, and MDB. But, as mentioned by Crossler et al. (2013), such a strategy might prove difficult; this would be the subject of future research. Empirical research could support the conceptualization of the triad of ITA behaviors, and test the applicability and effectiveness of countermeasures and corrective actions.

REFERENCES

Aytes, K., & Connolly, T. (2004). Computer Security and Risky Computing Practices: A Rational Choice Perspective. Journal of Organizational and End User Computing, 16(3), 22-40.

Ayuso, P. N., Gasca, R. M., & Lefevre, L. (2012). FT-FW: A cluster-based fault-tolerant architecture for stateful firewalls. Computers & Security, 31(4), 524-539.

Beautement, A., & Sasse, A. (2009). The economics of user effort in information security. Computer Fraud & Security, 2009(10), 8-12. doi: http://dx.doi.org/10.1016/S1361-3723(09)70127-7

Bulgurcu, B., Cavusoglu, H., & Benbasat, I. (2010). Information Security Policy Compliance: An empirical study of Rationality-based beliefs and information security awareness. MIS Quarterly, 34(3), 523-548.

Choo, K. K. R. (2011). The cyber threat landscape: Challenges and future research directions. Computers and Security, 30(8), 719-731. doi: 10.1016/j.cose.2011.08.004

Colwill, C. (2009). Human factors in information security: The insider threat – Who can you trust these days? Information Security Technical Report, 14(4), 186-196. doi: http://dx.doi.org/10.1016/j.istr.2010.04.004

Computer Security Institute. (2010). 2010/2011 Computer Crime and Security Survey. from http://gatton.uky.edu/FACULTY/PAYNE/ACC324/CSISurvey2010.pdf

Computer Security Institute. (2011). 2010/2011 Computer Crime and Security Survey. from http://gatton.uky.edu/FACULTY/PAYNE/ACC324/CSISurvey2010.pdf

Crossler, R. E., Johnston, A. C., Lowry, P. B., Hu, Q., Warkentin, M., & Baskerville, R. (2013). Future directions for behavioral information security research. Computers & Security, 32(1), 90-101.

D'Arcy, J., Hovav, A., & Galletta, D. (2009). User Awareness of Security Countermeasures and Its Impact on Information Systems Misuse: A Deterrence Approach. Information Systems Research, 20(1), 79-98.

Ernst & Young's Global Information Security Survey. (2014). EY’s Global Information Security Survey 2014. from http://www.ey.com/Publication/vwLUAssets/EY-global-information-security-survey-2014/$FILE/EY-global-information-security-survey-2014.pdf

Furnell, S., & Clarke, N. (2012). Power to the people? The evolving recognition of human aspects of security. Computers & Security, 31(8), 983-988.

Furnell, S., Papadaki, M., & Thomson, K.-L. (2009). Scare tactics – A viable weapon in the security war? Computer Fraud & Security, 2009(12), 6-10. doi: http://dx.doi.org/10.1016/S1361-3723(09)70151-4

Garg, A., Curtis, J., & Halper, H. (2003). Quantifying the financial impact of IT security breaches. Information Management & Computer Security, 11(2), 74-83.

Guo, K. H. (2013). Security-related behavior in using information systems in the workplace: A review and synthesis. Computers & Security, 32, 242-251. doi: http://dx.doi.org/10.1016/j.cose.2012.10.003


Guo, K. H., Yuan, Y., Archer, N. P., & Connelly, C. E. (2011). Understanding Nonmalicious Security Violations in the Workplace: A Composite Behavior Model. Journal of Management Information Systems, 28(2), 203-236.

Hansen, J. V., Lowry, P. B., Meservy, R. D., & McDonald, D. M. (2007). Genetic programming for prevention of cyberterrorism through dynamic and evolving intrusion detection. Decision Support Systems, 43(4), 1362-1374. doi: 10.1016/j.dss.2006.04.004

Herath, T., & Rao, H. R. (2009). Encouraging information security behaviors in organizations: Role of penalties, pressures and perceived effectiveness. Decision Support Systems, 47, 154-165. doi: 10.1016/j.dss.2009.02.005

Hu, Q., Dinev, T., Hart, P., & Cooke, D. (2012). Managing Employee Compliance with Information Security Policies: The Critical Role of Top Management and Organizational Culture. Decision Sciences, 43(4), 615-659.

Hu, Q., Xu, Z., Dinev, T., & Ling, H. (2011). Does Deterrence Work in Reducing Information Security Policy Abuse by Employees? Communications of the ACM, 54(6), 54-60.

Insider Threat Spotlight Report. (2015).

Johnston, A. C., Warkentin, M., & Siponen, M. (2015). An Enhanced Fear Appeal Rhetorical Framework: Leveraging Threats to the Human Asset Through Sanctioned Rhetoric. MIS Quarterly, 39(1), 113-134.

Loch, K. D., Carr, H. H., & Warkentin, M. E. (1992). Threats to Information Systems: Today's Reality, Yesterday's Understanding. MIS Quarterly, June, 173-186.

Pfleeger, S. L., & Caputo, D. D. (2012). Leveraging behavioral science to mitigate cyber security risk. Computers & Security, 31(4), 597-611.

Safa, N. S., Sookhak, M., Von Solms, R., Furnell, S., Ghani, N. A., & Herawan, T. (2015). Information security conscious care behaviour formation in organizations. Computers & Security, 53, 65-78. doi: http://dx.doi.org/10.1016/j.cose.2015.05.012

Safa, N. S., Von Solms, R., & Furnell, S. (2016). Information security policy compliance model in organizations. Computers & Security, 56, 70-82. doi: http://dx.doi.org/10.1016/j.cose.2015.10.006

Siponen, M., Mahmood, M. A., & Pahnila, S. (2009). Are Employees Putting Your Company At Risk By Not Following Information Security Policies? Communications of the ACM, 52(12), 145-147. doi: 10.1145/1610252.1610289

Siponen, M., Mahmood, M. A., & Pahnila, S. (2014). Employees’ adherence to information security policies: An exploratory field study. Information & Management, 51, 217-224.

Siponen, M., & Vance, A. (2010). Neutralization: New insights into the problem of employee information systems security policy violations. MIS Quarterly, 34(3), 487-502.

Siponen, M., Vance, A., & Willison, R. (2012). New insights into the problem of software piracy: The effects of neutralization, shame, and moral beliefs. Information & Management, 49(7/8), 334-341. doi: 10.1016/j.im.2012.06.004

Stanton, J. M., Stam, K. R., Mastrangelo, P., & Jolton, J. (2005). Analysis of end user security behaviors. Journal of Computers and Security, 24, 124-133.

Straub, D. W. (1990). Effective IS Security: An Empirical Study. Information Systems Research, 1(3), 255-276. doi: http://dx.doi.org/10.1287/isre.1.3.255

Straub, D. W., & Welke, R. J. (1998). Coping with Systems Risk: Security Planning Models for Management Decision-Making. MIS Quarterly, 22(4), 441-469.


The Global State of Information Security® Survey. (2015). The Global State of Information Security® Survey. from http://www.pwc.com/gx/en/consulting-services/information-security-survey/download.jhtml and http://www.pwc.com/gx/en/consulting-services/information-security-survey/key-findings.jhtml

US State of Cybercrime Survey. (2014). 2014 US State of Cybercrime Survey. from http://resources.sei.cmu.edu/library/asset-view.cfm?assetID=298318

Vance, A., Lowry, P. B., & Egget, D. (2015). Increasing Accountability through User-Interface Design Artifacts: A New Approach to Addressing the Problem of Access-Policy Violations. MIS Quarterly, 39(2), 345.

Verizon Data Breach Investigations Report. (2012). from http://www.verizonenterprise.com/resources/reports/rp_data-breach-investigations-report-2012-ebk_en_xg.pdf

Vormetric Data Security Report. (2015). 2015 Vormetric Insider Threat Report: Trends and Future Directions in Data Security GLOBAL EDITION.

Vroom, C., & von Solms, B. (2004). Towards information security behavioural compliance. Computers & Security, 23(3), 191-198.

Warkentin, M., & Willison, R. (2009). Behavioral and policy issues in information systems security: the insider threat. European Journal of Information Systems, 18, 101-105. doi: 10.1057/ejis.2009.12

Whitman, M. E., & Mattord, H. J. (2012). Principles of Information Security, International Edition (Fourth ed.): Course Technology CENGAGE Learning.

Willison, R., & Warkentin, M. (2013). Beyond deterrence: An expanded view of employee computer abuse. MIS Quarterly, 37(1), 1-20.

Workman, M., Bommer, W. H., & Straub, D. W. (2008). Security lapses and the omission of information security measures: A threat control model and empirical test. Computers in Human Behavior, 24(6), 2799-2816.

Zafar, H., & Clark, J. G. (2009). Current State of Information Security Research In IS. Communications of the Association for Information Systems, 24, 571-596.


IC1009

Challenges in Password Usability - Users' Perspective

Tiroyamodimo Mogotlhwane, Kagiso Ndlovu

Department of Computer Science, University of Botswana, Gaborone, Botswana

[email protected]; [email protected]

ABSTRACT

One of the most popular authentication methods in computing is the use of passwords. A password is a user credential used to verify that the person entering a computer system is the one authorized to do so. As the number of applications that an individual needs access to increases, so does the number of passwords they are expected to have. Increasingly, users must hold passwords for many applications such as ATMs, online banking, mobile banking, and different web applications. This brings challenges, because it becomes very difficult, if not impossible, to memorize so many passwords, coupled with the fact that, ideally, a password must not be written down anywhere. Some of the methods people use to remember their many passwords are writing them down, using the same password for many applications, and storing them in mobile devices. Web-based applications have now been developed to help with password management, not least because, ideally, each application requires its own password. The challenges that users have in maintaining passwords, and possible solutions according to the literature, are discussed. At present, the password principle is based mainly on an individual remembering the password. Recent research has shown that a human being's ability to remember is influenced by how frequently the remembered item is used; hence, frequently used passwords are the ones likely to be remembered easily. A theoretical model is proposed to help address problems associated with password maintenance. The model calls for an inner application that can be embedded before supplied-password authentication takes place. The increase in online application systems requires high security to maintain the quality of data, and it brings new challenges to the use of passwords as a means of protecting personal information.

Key words: date-time-stamp; authentication; password; challenges; management

1 INTRODUCTION

The use of computer-based information systems has significantly increased. Many organizations have several information systems driving their businesses. Such systems are at times linked with external stakeholders such as customers, clients, and suppliers, and they store organizational data. There is a need to protect this data by controlling access to such systems. In the past, data was stored in paper folders or steel cabinets kept in a secured physical location, often called a strong room, to which only certain individuals had access. These strong rooms were locked, and not everyone had access to their keys. One of the commonly used methods


of controlling access to digital data, devices, and systems is the use of passwords. There is an increase in applications that require a user to remember multiple passwords (Sharma, Sharma, & Dave, 2015). With each application requiring its own password, it is a challenge for a user to remember them all.

2 AUTHENTICATION AND SECURITY

2.1 Authentication

Authentication in computing is a process by which a user of a system provides some information pertaining to them, which is then matched with credentials stored by an operating system or server (Rause, 2015). A user is assigned a login name and password as the credentials to use in a human-computer interaction environment. The login name is displayed as the user enters the authentication details, while the password is masked from display. A user is expected to memorize the password so that no one else will know it; even if another user knows the login name, the password is known only to its rightful owner.

Computer-based information systems require that the information and data stored in them be secure and protected from those who are not supposed to access them. Security is required to deny unauthorized persons or agents access to a resource, so that only authorized people can access it. There are other methods of authenticating users to a computer system, but the password is the most common one as it is not expensive to implement (Wenstrom, 2002; Taneski, Brumen, & Hericko, 2014). Although popular, authentication using login names and passwords is not an effective security measure. In most cases, the reliability of the password mechanism depends on the individual using it, and human errors are common in the use of information systems. In 2012, IEEE unintentionally left about 100,000 passwords publicly exposed. It was discovered that even IEEE members used passwords that are easy to guess, as the most common passwords were "123456", "ieee2012", and "12345678" (Mills, 2012; Symantec, 2013).

2.2 Password Management

Due to increase in security threat on information stored in computerized information systems,

organizations have developed policies relating to use of passwords. The aim of a password policy is

to make it difficult for those who might want to guess the password to enable them to gain access.

The biggest problem with authentication using login name and password is that if anyone knows

those credentials they can log in the system. The computer systems know the credentials and not

the person whom the credentials are given to. Cyber criminals are always on the lookout to steal

these login details along with other personal information such as credit card numbers. Identity theft

is a global problem that is on the increase. Cyber criminals write applications which they can use to

perform identity theft. It is estimated that there are about 130million application systems used to

perform identity theft today. The figure was about 1million in 2007 ( Anderson, 2013). The demand

for such applications is high because theft of personal information can be used to imitate the real

person and perform illegal acts ( Saunders & Zucker, 1999). Personal information like credit card

details, email, and address are targeted by cyber criminals for the purpose of committing crime

pretending to be the original person. Selling credit card numbers is a big business as recently after an

exposure of a website that was selling such details stolen from many people across the world (Dean,

2016). Like other cyber related crimes it is very difficult to investigate crimes like online identity theft

as it requires interstate cooperation.


Password policies are designed to make it difficult for cyber criminals to access passwords and get hold of them. Many organizations have developed password policies to protect customer information and other business data kept within their systems. The things a password policy tries to enforce, following for example the Microsoft Developer Network (MSDN, 2015), include:

Password complexity

Password expiration

Password complexity may specify what can or cannot be a password, the minimum and maximum number of characters, and the special characters that each password must have. Password strength can be improved by making the password as long and as complex as possible.

Password expiration defines the lifespan of a password, the idle time after which a logged-on but inactive user is locked, and the lockout duration.

If a password is not carefully selected, it can be the weakest point that criminals can easily use to access the system. A password that is not easy to obtain is said to be strong. To make passwords strong, a password policy must enforce the following (MSDN, 2015):

Have more than 8 characters

Include numbers, characters, and symbols

It must not be a name

It is regularly changed

It is not written down anywhere

It is memorized
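Rules like these can be checked mechanically at registration time. The sketch below is a hypothetical validator of our own (the MSDN source gives no code); it covers only the length and character-mix rules, since "not a name", regular changes, and memorization are policy matters that a single function cannot enforce.

```python
import re

def is_strong(password: str) -> bool:
    """Check the mechanical policy rules listed above (hypothetical sketch)."""
    if len(password) <= 8:                        # more than 8 characters
        return False
    if not re.search(r"[0-9]", password):         # at least one number
        return False
    if not re.search(r"[A-Za-z]", password):      # at least one letter
        return False
    if not re.search(r"[^A-Za-z0-9]", password):  # at least one symbol
        return False
    return True

print(is_strong("Summer"))          # False: too short, no digit or symbol
print(is_strong("Tr!angle42rock"))  # True: long, with letters, a digit, a symbol
```

A real policy engine would add a dictionary check against names and common words, and track password age for expiration.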

2.3 Password Challenges

The increase number of different applications that a single user need to use has significantly

increased. Each system requires its own password and may even have different password policy.

One system may require 8 character passwords while another system may require 6. For example an

average user needs to have a password for their bank accounts, credit card, official email account,

personal email account etc. Each of these should have its own password. To a user it is very difficult

to remember all of them. A common solution is to have a similar password for all the applications. It

has been shown that in some instances 75% of social network username and password were similar

to users email accounts (MSDN, 2015). Another research has shown that over 50% of Americans use

the same password for various online applications (SecurityWeek, 2010). The main problem with

using the same password for different systems is that once a hacker has managed to get that

password, they will be able to access several systems where that password is used. It is like having

the same key for your house, car, office etc.

Another common approach that people use to deal with challenges of multiple passwords across

different devices is to write each system/device password down. Under this approach a strong

password is established for each application. The security of this approach lies on the security of

where the password is written on. For example if it is in a diary, diaries are often left in desks or if

stored in a mobile phone, mobile phones get lost or are stolen. Writing a password down on paper is

the next most common method that people use to manage their passwords, the most popular being

remembering ( Wlasuk , 2012).


A password is not supposed to be shared. In the banking industry, any entry into your banking details with correct credentials is a valid transaction; a customer who shares a password with a friend cannot hold the bank accountable if the friend performs unauthorized transactions in the account. One of the conditions that Standard Bank recommends to its customers for protecting their password is not to disclose it to anyone (Cobb, 2012). Password sharing still remains a big problem for many organizations today (Imprivata, 2014).

It is not only information systems that require authentication; mobile devices and cloud-based systems require it too. Increasingly, employees bring their own devices to the workplace and use them to access corporate data. This increases the complexity of managing and controlling access to corporate data and systems.

3 PASSWORD AUTHENTICATION PROCESS

There are numerous password authentication protocols (PAP) currently in use, ranging from simple to more complex. The simplest protocol is one where a user is assigned a login name and an initial unique (individual) password. The login name is publicly available but the password is not. A user is usually expected to change the initial password to something that only the user knows and can remember. The login name and password are then stored for future reference during login sessions. Authentication is performed when a user wants to use the system or device: the user is asked to provide their credentials, which are compared with what is stored. If they match, access is allowed; if not, access is denied.

Figure 1. Password Authentication Protocol (The TCP/IP Guide, 2005)
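The compare-with-stored-credentials step can be sketched as below. This is an illustrative assumption rather than the literal protocol of Figure 1: PAP itself transmits the password in the clear, and here we additionally assume the verifier stores a salted PBKDF2 hash instead of the plaintext password, as is common practice; all names are our own.

```python
import hashlib
import hmac
import os

_store = {}  # hypothetical credential store: login -> (salt, derived key)

def register(login: str, password: str) -> None:
    salt = os.urandom(16)
    key = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
    _store[login] = (salt, key)

def authenticate(login: str, password: str) -> bool:
    """Compare the supplied credentials with what is stored."""
    if login not in _store:
        return False
    salt, key = _store[login]
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
    return hmac.compare_digest(candidate, key)  # constant-time comparison

register("alice", "s3cret!")
print(authenticate("alice", "s3cret!"))  # True: credentials match
print(authenticate("alice", "wrong"))    # False: access denied
```

Hashing at the verifier limits the damage of a leaked credential store, but does nothing about the password crossing the wire, which motivates the proposal in Section 4.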


4 PROPOSED SOLUTION

Password still remains the cheapest method of authentication. There is a need to continue to

explore innovative methods through which this method can be made much more secure to meet the

current challenge of increase in cyber-crime such as password fishing. This paper proposes

improvement of PPP by incorporating date time stamp in the validation process.

The date time stamp (DTS) is the numerical combination of the current year, month, day, and time: if the numerical values of these are extracted and combined into a single number, that number will be unique, especially when time is captured down to the smallest units. For example:

Year: 2015

Month: 04

Day: 17

Time (hour:minute:second): 12:53:55

The above can be converted to a date time stamp of 20150417125355. Used as the date time stamp, this numeric value can never be repeated in the future. The strength of the value can be improved by measuring the seconds value more precisely.
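Assuming an ordinary system clock, the DTS above is a direct formatting of the current date and time:

```python
from datetime import datetime

def date_time_stamp(now: datetime) -> str:
    # YYYYMMDDhhmmss, as in the worked example; unique down to
    # one-second resolution on a single, forward-moving clock.
    return now.strftime("%Y%m%d%H%M%S")

print(date_time_stamp(datetime(2015, 4, 17, 12, 53, 55)))  # 20150417125355
```

Finer resolution (for example appending microseconds with %f) strengthens uniqueness, as the text suggests.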

The proposed system performs authentication by combining a user's password with the date time stamp value. Upon login, the DTS is captured and combined with the user's password. An operation, preferably a mathematical one, is performed to do the combination and produce a unique value, which is sent to the responder. The DTS is also sent along for validation at the responder.

At the responder, the same operation is performed using the stored password and the received DTS. The value computed at the responder is then compared with the one received to perform authentication: if the values are the same the user is validated, otherwise access is denied.

Since an operation is performed on the password, combining it with the unique DTS value, the password does not have to be very strong; it can be something easy to remember. The strength of this method depends on the operation performed on the password and DTS, and relies on that mathematical operation being very difficult to guess.
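The paper leaves the combining operation unspecified. As one hypothetical instantiation, the sketch below uses HMAC-SHA256 keyed with the password: the initiator sends the DTS together with the combined value (never the bare password), and the responder repeats the operation with its stored copy of the password and compares the results.

```python
import hashlib
import hmac
from datetime import datetime

def combine(password: str, dts: str) -> str:
    # One possible hard-to-guess operation (our assumption, not the paper's).
    return hmac.new(password.encode(), dts.encode(), hashlib.sha256).hexdigest()

def initiator(password: str) -> tuple:
    dts = datetime.now().strftime("%Y%m%d%H%M%S")
    return dts, combine(password, dts)        # send DTS plus combined value

def responder(stored_password: str, message: tuple) -> bool:
    dts, received = message
    expected = combine(stored_password, dts)  # same operation on stored password
    return hmac.compare_digest(expected, received)

msg = initiator("easy2remember")
print(responder("easy2remember", msg))  # True: user validated
print(responder("otherpassword", msg))  # False: access denied
```

Even so, an eavesdropper who captures both the DTS and the combined value can brute-force a weak password offline, so the claim that the password "does not have to be very strong" holds only if the operation itself is kept secret or otherwise keyed.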


Figure 2. Additional functionality to PAP

5 CONCLUSIONS

The increase in cyber-crime is a global problem, made even more difficult by the proliferation of web-based applications; the internet does not recognize territorial boundaries. Increasingly, people are challenged by the need to protect their data and identity online. Though there are advanced methods of providing online security, authentication using passwords remains the most popular. The use of passwords for authentication has numerous challenges, and there is a need to strengthen password methodologies to provide a secure platform for using modern devices and other information systems. Mobile devices and remote access to information systems need to be provided with secure environments.

The proposed solution is still a theoretical concept that needs to be developed. For example, the nature of the operation to be performed needs to be defined, as do the tools required to capture the necessary data and perform the operation. Global time differences also have to be considered, to decide which time zone the security authentication will use, as the concept is explored further. In this method it is assumed that the complexity or strength of the password is enhanced by making the operation complex. The benefit of this approach is that it will reduce the need for people to remember multiple passwords or change their passwords regularly. More work still needs to be done to validate this theoretical concept.

[Figure 2 flowchart: Initiator — capture password and DTS, perform operation, produce result; Responder — receive password and DTS, perform a similar operation to the initiator, compare results.]


REFERENCES

Anderson, C. J. (2013, April 14). Identity theft growing, costly to victims. Retrieved April 10, 2015, from USA Today: http://www.usatoday.com/story/money/personalfinance/2013/04/14/identity-theft-growing/2082179/

Cobb, S. (2012, December 4). Password handling: challenges, costs, and current behavior (now with infographic). Retrieved April 15, 2015, from welivesecurity: http://www.welivesecurity.com/2012/12/04/password-handling-challenges-costs-current-behavior-infographic/

Dean, J. (2016, February 19). Website selling stolen credit cards is shut down. Retrieved April 25, 2016, from The Times (London): http://www.lexisnexis.com/lnacui2api/delivery/rwBibiographicDelegate.do

Imprivata. (2014). Eliminate password sharing with single sign on technology. Retrieved April 14, 2015, from imprivata: http://www.imprivata.com/password_sharing

Mills, E. (2012, September 25). Researcher says 100,000 passwords exposed on IEEE site. Retrieved March 17, 2016, from CNET: http://www.cnet.com/news/researcher-says-100000-passwords-exposed-on-ieee-site/

MSDN. (2015). Password policy. Retrieved April 12, 2015, from Microsoft Developer Network: https://msdn.microsoft.com/en-us/library/ms161959.aspx

Profis, S. (2014, November 25). The guide to password security (and why you should care). Retrieved April 17, 2015, from CNET: http://www.cnet.com/how-to/the-guide-to-password-security-and-why-you-should-care/

Rause, M. (2015). Authentication. Retrieved April 5, 2015, from TechTarget: http://searchsecurity.techtarget.com/definition/authentication

Saunders, K. M., & Zucker, B. (1999). Counteracting Identity Fraud in the Information Age: The Identity Theft and Assumption Deterrence Act. International Review of Law, Computers & Technology, 13(2), 183-192.

SecurityWeek. (2010, August 16). Study Reveals 75 Percent of Individuals Use Same Password for Social Networking and Email. Retrieved April 13, 2015, from Security Week: http://www.securityweek.com/study-reveals-75-percent-individuals-use-same-password-social-networking-and-email

Sharma, A., Sharma, S., & Dave, M. (2015). Identity and access management - a comprehensive study (pp. 1481-1485). Noida: IEEE Xplore.

Standard Chartered Bank. (2014). Take control of your online security. Retrieved April 14, 2015, from Standard Chartered: https://www.sc.com/en/online-banking/security/how-to-protect-yourself.html

Symantec. (2013). Reaping the Benefits of Strong, Smarter User Authentication. Retrieved March 18, 2016, from Symantec: http://www.symantec.com/content/en/us/enterprise/white_papers/b-reaping-the-benefits-en-us.pdf

Taneski, V., Brumen, B., & Hericko, M. (2014). The Effect of Educating Users on Passwords: a Preliminary Study. ACM Transactions on Applied Perception, 2, 1-8.

Page 94: Proceedings of the ICICIS 2016 Conference Gaborone ... · Proceedings of the ICICIS 2016 Conference Gaborone, Botswana, May 18-20, 2016 Edited by ... Jean-Pierre Kabeya Lukusa ...

Proceedings of the 1st International Conference on the Internet, Cyber Security, and Information Systems (ICICIS), Gaborone, 18-20 May 2016

Copyright © Department of Computer Science, University of Botswana, 2016 89

The TCP/IP Guide. (2005, September 20). PPP Authentication Protocols: Password Authentication

Protocol (PAP) and Challenge Handshake Authentication Protocol (CHAP) . Retrieved April 15,

2015, from The TCP/IP Guide:

http://www.tcpipguide.com/free/t_PPPAuthenticationProtocolsPasswordAuthenticationPr-

2.htm#Figure_29

UCSC. (2015, March). UCSC Password Strength and Security Standards. Retrieved April 16, 2015, from

University of Carlifornia: http://its.ucsc.edu/policies/password.html

USA TODAY. (2013, April 14). Identity theft growing, costly to victims. Retrieved April 2016, from USA

TODAY: 25

Wenstrom, M. (2002, February 22). Examining Cisco AAA Security Technology. Retrieved April 12,

2015, from CISCO: http://www.ciscopress.com/articles/article.asp?p=25471&seqNum=3

Wlasuk , A. (2012, February 10). Password Purgatory - Are we Ever Going to Get Passwords Right?

Retrieved April 14, 2015, from Security Week: http://www.securityweek.com/password-

purgatory-are-we-ever-going-get-passwords-right


IC1010

Enhancing the Least Significant Bit (LSB) Algorithm for Steganography

Oluwaseyi Osunade, Ganiyu Idris Adeniyi

Department of Computer Science University of Ibadan

Ibadan, Nigeria

[email protected]

ABSTRACT

Various steganography algorithms have been proposed and implemented for hiding the existence of data in a cover object, ranging from algorithms that work in the transform domain to those that work in the spatial domain, such as Least Significant Bit (LSB), which uses the three colours (red, green and blue) present in an image. Since every pixel of an image is made up of these three colours, this project proposes a new algorithm that uses only two of them (green and blue) to hide data. The proposed algorithm successfully hides the data in the green and blue components of an image with no significant change in the resulting colours. The experimental results show the effectiveness of the proposed algorithm and that it strikes a balance between security and image quality. It should be noted that this work considers only images as the cover object; other forms of cover object are not considered here. It should also be noted that the algorithm was tested hiding data of 8 bytes to 1024 bytes in two images of different sizes, which had no effect on its effectiveness.

Keywords: Steganography, least significant bit, colour, data, algorithm

1 INTRODUCTION

Due to continuously changing global technology trends, data is constantly moving from one host to another across networks and the Internet, and the security of this data is therefore highly important.

It is generally accepted that data security can be achieved using encryption and steganography. In cryptography, data is transformed into another form before transmission in order to hide its content from unauthorized users. Steganography, on the other hand, hides the existence of the data in a cover object such as text, image, audio/video or protocol rather than transforming the data itself, thereby making people unaware that communication is taking place.


Steganography will continue to play a vital role in protecting data across hosts because of its unsuspicious methodology. Various steganography algorithms have been proposed and implemented, but most do not hide the data effectively: usually a slight distortion in the image used to hide the data gives it away. There is therefore a need for an algorithm that introduces only the slightest distortion, which is what led to the newly proposed algorithm.

1.1 An Overview of Ancient Steganography

Steganography can be traced back to ancient times. Early attempts made use of chemicals and even human bodies to convey information. Modern steganography has moved beyond physical bodies and chemicals, but in principle it remains the same as its ancient counterpart. Some of the records are outlined below:

Herodotus (484 BC – 425 BC) was one of the earliest Greek historians. His great work, The Histories, is the story of the war between the huge Persian Empire and the much smaller Greek city-states. Herodotus recounts the story of Histiaeus, who wanted to encourage Aristagoras of Miletus to revolt against the Persian king. In order to convey his plan securely, Histiaeus shaved the head of his messenger, wrote the message on his scalp, and then waited for the hair to regrow. The messenger, apparently carrying nothing contentious, could travel freely. Arriving at his destination, he shaved his head and showed it to the recipient.

Pliny the Elder (23 AD – 79 AD) explained how the milk of the thithymallus plant dried to transparency when applied to paper but darkened to brown when subsequently heated, thus recording one of the earliest recipes for invisible ink. The ancient Chinese wrote notes on small pieces of silk that they then wadded into little balls and coated in wax, to be swallowed by a messenger and retrieved at the messenger's gastrointestinal convenience.

Giovanni Battista Porta (1535 – 1615) described how to conceal a message within a hard-boiled egg by writing on the shell with a solution of an ounce of alum and a pint of vinegar. The solution penetrates the porous shell, leaving no visible trace, but the message is stained onto the surface of the hardened egg albumen, so it can be read when the shell is removed.

2 RELATED WORKS

Steganography is the art and science of hiding messages in such a way that no one apart from the intended recipient knows of the existence of the message (Divya & Ram, 2012). The term 'hiding' refers to the process of making the information imperceptible or keeping its existence secret.

Steganography is derived from two Greek words: 'steganos', which literally means 'covered', and 'graphy', meaning 'writing', i.e. covered writing. It refers to the science of 'invisible' communication, hiding secret information in various file formats. A large variety of steganographic techniques exist; some are more complex than others, but all have their respective strong and weak points (Lokeswara et al., 2011). Different applications have different requirements of the steganography technique to be used.

Hiding data is the process of embedding information into digital content without causing perceptual degradation. Three well-known techniques can be used for data hiding: watermarking, steganography and cryptography. Steganography, 'covered writing' in Greek, involves any process that hides data or information within other data (Rosziati & Teoh, 2011).

The main advantage of steganography over the other well-known techniques is its simple security mechanism: the steganographic message is integrated invisibly and covered inside other harmless sources.

Steganography can be considered a branch of cryptography that tries to hide messages within others, avoiding the perception that there is any message at all. To apply steganographic techniques, cover files of any kind can be used, although image, sound and video files are the most used today. Similarly, the information to hide can be text, images, video, sound, etc. There are two trends in implementing steganography algorithms: methods that work in the spatial domain (altering the desired characteristics of the file itself) and methods that work in the transform domain (performing a series of changes to the cover image before hiding the information) (Juan & Jesus, 2009).

Research has shown that methods that work in the spatial domain are simpler and faster to implement than those that work in the transform domain, which are more robust in terms of resistance to attacks.

In the spatial domain, the message or data to be transferred is embedded directly into the image used as the cover object, whereas in the transform domain, as the name implies, the image is first transformed before the data or message is embedded into it.

Image steganography can be implemented in the transform domain or the spatial domain, using any of these three methods:

Non-filtering: embeds the data into the cover object starting from the first pixel of the image used as the cover object.

Randomized: both the sender and receiver of the image use a password, called the stego-key, as the seed for a pseudo-random number generator, which creates a sequence used as the index for accessing the image pixels.

Filtering: the algorithm filters the cover image using a default filter and hides information in the areas that obtain a better rating (Roque, Juan & Jesus, 2009).

2.1 Steganography Algorithms

Most algorithms that work in the spatial domain use the Least Significant Bit (LSB) method or one of its derivatives for information hiding, i.e., hiding one bit of information in the least significant bit of each colour of a pixel. However, this method cannot withstand some types of statistical analysis (such as RS or Sample Pairs). The problem stems from the fact that modifying all three colours of a pixel produces a major distortion in the resulting colour. This distortion is not visible to the human eye, but it is detectable by statistical analysis (Juan & Jesus, 2009).

Since spatial-domain methods are simpler and faster to implement than transform-domain methods, which trade that simplicity for greater robustness against attacks, this project focuses on the Least Significant Bit (LSB) algorithm and its derivative, the Selected Least Significant Bit (SLSB) algorithm.

2.1.1 Least Significant Bit Algorithm

In the Least Significant Bit algorithm, both the data and the image used as the cover object are converted from their pixel format to binary, and the least significant bit of each colour of the image is substituted with a bit of the data to be transferred so as to reflect the message that needs to be hidden (Lokeswara et al., 2011).

For instance, suppose the data 'AID' with the following properties is to be stored in the first 8 pixels of a 200 by 400 pixel image with 24 bits per pixel.

Table 1: Three letters with their ASCII values and corresponding binary values

LETTER   ASCII VALUE   BINARY VALUE
A        065           01000001
I        073           01001001
D        068           01000100

To hide 'AID', with the binary code 01000001 01001001 01000100, using the Least Significant Bit algorithm, the least significant bit of each colour that makes up a pixel is replaced with one bit of the data.

On average, about half of the least significant bits are actually changed. Since there are 256 possible intensities of each primary colour, changing the LSB of a pixel results in only a small change in the intensity of a colour. These changes cannot be perceived by the human eye, so the data is successfully hidden.
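The substitution step described above can be illustrated with a short sketch (in Python for brevity; the authors' system is written in Java, and the cover byte values below are hypothetical). It hides the 24 bits of 'AID' in the least significant bits of 24 colour bytes:

```python
def text_to_bits(text):
    # 8 ASCII bits per character, most significant bit first
    return [int(b) for ch in text for b in format(ord(ch), "08b")]

def embed_lsb(colour_bytes, bits):
    # Replace the least significant bit of each colour byte with one data bit
    stego = list(colour_bytes)
    for i, bit in enumerate(bits):
        stego[i] = (stego[i] & 0b11111110) | bit
    return stego

bits = text_to_bits("AID")        # 24 bits fill 8 pixels x 3 colours
cover = [200, 13, 57] * 8         # hypothetical R, G, B byte stream
stego = embed_lsb(cover, bits)
recovered = [byte & 1 for byte in stego]
```

Each byte changes by at most 1 of 256 intensity levels, which is why the substitution is imperceptible to the human eye.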

Figure 1: Least Significant Bit method, adapted from Roque, Juan & Jesus (2009). [Block diagram: Cover Image → Colour Selection → Pixel Filtering → LSB Matching → Bit Replacement → File Compression → Steganographic Image, with the Data to Hide as input.]

Page 99: Proceedings of the ICICIS 2016 Conference Gaborone ... · Proceedings of the ICICIS 2016 Conference Gaborone, Botswana, May 18-20, 2016 Edited by ... Jean-Pierre Kabeya Lukusa ...

Proceedings of the 1st International Conference on the Internet, Cyber Security, and Information Systems (ICICIS), Gaborone, 18-20 May 2016

Copyright © Department of Computer Science, University of Botswana, 2016 94

2.1.2 Selected Least Significant Bit Algorithm

In the Selected Least Significant Bit algorithm, both the data and the image used as the cover object are converted from pixel format to binary. The least significant bit of one colour (blue) of each pixel is substituted with a bit of the data to be transferred, reflecting the message that needs to be hidden; only the least significant bit of one colour per pixel is changed by the data bits (Juan & Jesus, 2009).

Only one-third (1/3) of the least significant bits of the image are used. Hiding data with the Selected Least Significant Bit method therefore takes more pixels than the plain Least Significant Bit method, since only the last colour (blue) of each pixel is replaced. As a result, the human eye cannot perceive the changes, so the data is hidden successfully and remains inconspicuous.
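A minimal sketch of this per-pixel substitution (Python for brevity; the pixel values are hypothetical) makes it clear that only the blue component is ever touched:

```python
def embed_slsb(pixels, bits):
    # Hide one data bit per pixel, in the LSB of the blue channel only
    stego = [list(p) for p in pixels]
    for i, bit in enumerate(bits):
        stego[i][2] = (stego[i][2] & 0b11111110) | bit
    return [tuple(p) for p in stego]

bits = [0, 1, 0, 0, 0, 0, 0, 1]   # the letter 'A'
pixels = [(10, 20, 30)] * 8       # hypothetical (R, G, B) pixels
stego = embed_slsb(pixels, bits)
```

One bit per pixel means the capacity is a third of plain LSB's, which is the trade-off described above.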

3 PROPOSED ALGORITHM FOR NEW SELECTED LEAST SIGNIFICANT BIT

This technique proposes a new steganography algorithm based on selecting the least significant bits of two colours (green and blue) in each pixel. Images in a computer system are represented as arrays of values representing the intensities of the three colours R (red), G (green) and B (blue); each pixel is a combination of these three components.

In this scheme, bits of the last two components (green and blue) of each pixel of the image are replaced with data bits. The blue colour is selected on the basis of research by Hecht (2006), which shows that the visual perception of intensely blue objects is less distinct than the perception of red and green objects. Green is chosen in combination with blue because it gives more room for the length of the data to be embedded.

3.1 Proposed Procedure for Embedding Phase

To embed data into an image, the following procedure is performed.

Step 1: Extract all the pixels of the image and store them in an array called Pixel-array.

Step 2: Extract all the characters of the given text file and store them in an array called Character-array.

Step 3: Extract all the characters of the stego-key and store them in an array called Key-array.

Step 4: Choose the first pixel, pick characters from Key-array, and place them in the first and second components of the pixel. If there are more characters in Key-array, place the rest in the components of the next pixels.

Step 5: Place a terminating symbol to indicate the end of the key.


Step 6: Place characters of Character-array in the first and second components (blue and green channels) of the next pixels, replacing them.

Step 7: Repeat Step 6 until all the characters have been embedded.

Step 8: Again place a terminating symbol to indicate the end of the data.

Step 9: The resulting image hides all the input characters.
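Reading steps 4 to 8 literally (whole character bytes placed into the green and blue components), the embedding phase can be sketched as follows. This is an illustrative Python sketch, not the authors' Java implementation; the terminator value 0 and the pixel values are assumptions, since the paper does not fix a terminating symbol:

```python
TERM = 0  # assumed terminating symbol; the paper does not specify a value

def embed_new_slsb(pixels, key, message):
    # Steps 4-8: key bytes, terminator, message bytes, terminator,
    # written into the green and blue components of successive pixels
    payload = [ord(c) for c in key] + [TERM] + [ord(c) for c in message] + [TERM]
    stego = [list(p) for p in pixels]
    for i, value in enumerate(payload):
        pixel, channel = divmod(i, 2)      # two usable components per pixel
        stego[pixel][1 + channel] = value  # index 1 = green, 2 = blue
    return [tuple(p) for p in stego]

cover = [(120, 130, 140)] * 8              # hypothetical (R, G, B) pixels
stego = embed_new_slsb(cover, "k", "Hi")
```

Using both components doubles the per-pixel capacity compared with the blue-only SLSB scheme, which is the "more room" argument made above.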

Figure 2: Proposed Selected Least Significant Bit method

3.2 Proposed Procedure for Extraction Phase

To extract data from the stego-image, the following procedure is performed.

Step 1: Consider three arrays: Character-array, Key-array and Pixel-array.

Step 2: Extract all the pixels of the given image and store them in the array called Pixel-array.

Step 3: Start scanning pixels from the first pixel, extract key characters from the first and second (blue and green) components of the pixels, and place them in Key-array. Repeat Step 3 up to the terminating symbol; then follow Step 4.

Step 4: If the extracted key matches the key entered by the receiver, continue; otherwise terminate the program, displaying the message "Key is not correct".

[Figure 2 block diagram: Cover Image → Colour Selection → Pixel Filtering → SLSB Matching → Bit Replacement → File Compression → Steganographic Image, with the Data to Hide as input.]


Step 5: If the key is valid, continue scanning the next pixels, extract secret message characters from the (blue and green) components of the next pixels, and place them in Character-array. Repeat Step 5 up to the terminating symbol; then follow Step 6.

Step 6: Extract the secret message from Character-array.
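The extraction phase can be sketched in the same way (an illustrative Python sketch under the same assumptions as before: whole character bytes in the green and blue components, a terminator value of 0, and hypothetical stego pixel values):

```python
TERM = 0  # assumed terminating symbol, matching the embedding side

def extract_new_slsb(pixels, entered_key):
    # Steps 2-6: read the green/blue components, split on the terminator,
    # verify the key, then recover the secret message
    stream = []
    for r, g, b in pixels:
        stream.extend([g, b])
    end_of_key = stream.index(TERM)
    key = "".join(chr(v) for v in stream[:end_of_key])
    if key != entered_key:
        return None  # "Key is not correct"
    rest = stream[end_of_key + 1:]
    return "".join(chr(v) for v in rest[:rest.index(TERM)])

# hypothetical stego pixels carrying key "ke" and message "hi"
stego = [(9, 107, 101), (9, 0, 104), (9, 105, 0)]
message = extract_new_slsb(stego, "ke")
```

Note that a wrong key aborts extraction before any message byte is read, which is the access control the stego-key provides.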

Figure 3: Steganography Mechanism Receiver

3.3 Interface Design

The user interface is the means of communication between the user and the system, i.e., it enables the user to access the system. It is important that this communication is as meaningful and friendly as possible.

Based on the proposed algorithm, we developed a simple graphical user interface in Java (using the NetBeans and Eclipse IDEs), since the system is implemented in the Java programming language. The interface is very simple to use and has the following buttons:

ENCODE: when clicked, opens a text box where the user is asked to input the data to be hidden in the cover object.

ENCODE NOW: when clicked, opens a dialog box for the user to browse for the preferred cover object (image).

DECODE: when clicked, opens a dialog box for the user to browse for the stego-image that has the data embedded in the cover object (image).

[Figure 3 block diagram: the Stego Image File undergoes Pixel-to-Binary Conversion and SLSB decoding, with Binary-to-ASCII Conversion recovering the Text and Binary-to-Pixel Conversion recovering the Cover Image File.]


DECODE NOW: when clicked, decodes the stego-image.

EXIT: terminates the application.

RESULTS

Histograms are very useful tools for analyzing and comparing significant changes in the frequency of appearance of the colours of the cover image and the steganographic images, giving a quick summary of the tonal range present in a given image. A histogram plots the tones in the image from black (on the left) to white (on the right): a histogram with many dark pixels is skewed to the left, and one with many lighter tones is skewed to the right.

For efficient analysis and comparison, two different images were used, and a detailed analysis of four components of each image was carried out: brightness, red, green and blue.
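The histogram comparison can also be done programmatically. The sketch below (Python, with hypothetical pixel data) counts the intensities of one colour channel and measures how far a stego histogram drifts from the cover's:

```python
from collections import Counter

def channel_histogram(pixels, channel):
    # Frequency of each 0-255 intensity value in one colour channel
    return Counter(p[channel] for p in pixels)

def histogram_shift(cover, stego, channel):
    # Total absolute change in bin counts between the two histograms
    h_cover = channel_histogram(cover, channel)
    h_stego = channel_histogram(stego, channel)
    return sum(abs(h_cover[v] - h_stego[v]) for v in set(h_cover) | set(h_stego))

cover = [(10, 20, 30)] * 3 + [(10, 20, 31)]                   # hypothetical pixels
stego = [(10, 20, 31)] + [(10, 20, 30)] * 2 + [(10, 20, 31)]  # one blue LSB flipped
```

A small shift confined to the embedded channels, with untouched channels unchanged, is the behaviour the histogram figures below are meant to show.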

Figure 4: Original Image

Figure 5: Original Image (Luminosity) Figure 6: Original Image (Blue)


Figure 7: Original Image (Green) Figure 8: Original Image (Red)

Stego Image using LSB Stego Image using SLSB Stego Image using NEW SLSB

Figure 9: Stego Images

LSB Image SLSB Image NEW SLSB Image

Figure 10: Luminosity channel of Stego Images


LSB Image SLSB Image NEW SLSB image

Figure 11: Red channel of Stego Images

LSB Image SLSB Image NEW SLSB image

Figure 12: Green channel of Stego Images


LSB Image SLSB Image NEW SLSB Image

Figure 13: Blue channel of Stego Images

As can be seen from the experimental results (histogram analysis) above, the algorithms were tested using an image. The results show that all the algorithms successfully hide the data, with no visible difference in the resulting colour frequencies or sizes of the images.

Table 2: Image size before and after encoding using the different algorithms

No   Original Image (Size)   Hidden Data File (Size)   LSB Algorithm (Size)   SLSB Algorithm (Size)   NEW SLSB Algorithm (Size)
1    11.97 kb                32 bytes                  111 kb                 111 kb                  111 kb
2    5.97 kb                 32 bytes                  77.4 kb                77.4 kb                 77.4 kb

4 CONCLUSION

The results of the experiments performed show the effectiveness of the proposed algorithm. The experimental results show that the algorithm strikes a balance between the Least Significant Bit (LSB) algorithm and the Selected Least Significant Bit (SLSB) algorithm, achieving a balance between security and image quality. There is no loss whatsoever of the hidden data, and the new method retains the quality of the image.

This research work considers only images as the cover object; other forms of cover object are not considered here. The algorithm only hides data between 8 bytes and 1024 bytes. Future work will examine how to use the algorithm with other forms of cover object, i.e. text and video, and how to hide data of larger sizes.

REFERENCES

Arvind K. and Kim P. (2010). Steganography: A Data Hiding Technique. International Journal of Computer Applications, 9(7), November 2010. ISSN 0975-8887.

Chen P. and Wu W. (2009). A modified side match scheme for image steganography. International Journal of Applied Science & Engineering, 7, 53-60.

Divya S. S. and Ram M. (2012). Hiding text in audio using multiple LSB steganography and providing security using cryptography. International Journal of Scientific & Technology Research, 1(6), July 2012.

El-Emam N. (2007). Hiding a large amount of data with high security using steganography algorithm. Journal of Computer Science, 3, 223-232.

Fridrich J., Du R. and Meng L. (2000). Steganalysis of LSB Encoding in Color Images. Proc. IEEE Int'l Conf. Multimedia and Expo, CD-ROM, IEEE Press, Piscataway, N.J.

Gandharba S. and Saroj K. L. (2012). A Technique for Secret Communication Using a New Block Cipher with Dynamic Steganography. International Journal of Security and Its Applications, 6(2), April 2012.

Hecht, E. (2006). Optics. Delhi, India: Pearson Education.

John M. and Manimurugan S. (2012). A Survey on Various Encryption Techniques. International Journal of Soft Computing and Engineering (IJSCE), 2(1), March 2012. ISSN 2231-2307.

Juan J. and Jesus M. (2009). SLSB: Improving the Steganographic Algorithm LSB. Universidad Nacional de Educación a Distancia, Spain.

Lokeswara V., Subramanyam A. and Chenna P. (2011). Implementation of LSB Steganography and its Evaluation for Various File Formats. International Journal of Advanced Networking and Applications, 2(5), 868-872.

Lou D., Liu J. and Tso H. (2008). Evolution of information-hiding technology. In H. Nemati (Ed.), Premier Reference Source - Information Security and Ethics: Concepts, Methodologies, Tools and Applications, New York: Information Science Reference, pp. 438-450.

Mauro B., Franco B., Vito C. and Alessandro P. (1999). A DCT-domain system for robust image watermarking. Dipartimento di Ingegneria Elettronica, Università di Firenze, via di S. Marta 3, 50139 Firenze, Italy.

Morkel T., Eloff J. and Olivier M. (2005). An overview of image steganography. Information and Computer Security Architecture (ICSA) Research Group, Department of Computer Science, University of Pretoria, 0002, Pretoria, South Africa.

Rosziati I. and Teoh S. (2011). Steganography Algorithm to Hide Secret Message inside an Image. Computer Technology and Application, 2, 102-108. Faculty of Computer Science and Information Technology, University Tun Hussein Onn Malaysia (UTHM), Batu Pahat 86400, Johor, Malaysia.

Roque, Juan J. and Jesús M. M. (2009). SLSB: Improving the Steganographic Algorithm LSB. WOSIS.

Stefan K. and Fabien A. (2000). Information Hiding Techniques for Steganography and Digital Watermarking. Boston: Artech House, pp. 43-82.

Thomas A. (2005). Implementing Steganographic Algorithms: An Analysis and Comparison of Data Saturation.

Vijay K. and Vishal S. (2005). A steganography algorithm for hiding image in image by improved LSB substitution by minimizing detection. Journal of Theoretical and Applied Information Technology.

Wu P. and Tsai W. (2003). A steganographic method for images by pixel-value differencing. Pattern Recognition Letters, 24, 1613-1626.


IC1011

A Security Model for Mitigating Multifunction Network Printers

Vulnerabilities

Jean-Pierre Kabeya Lukusa

Department of Network and Infrastructure Management Botho University

Gaborone, Botswana [email protected]

ABSTRACT

With the ability to incorporate a wide range of functions, network printers have become not only one of the most essential tools in today's businesses but also one of the most neglected components in network security defenses. An efficient network security architecture design therefore necessitates the integration of key security implementations by means of formal security models, conceived with security policies that take into consideration the security liabilities of multifunction network printers (MNPs). This paper presents a novel approach aimed at enforcing policy-constrained security mechanisms using a multilevel printer security architecture. The proposed security model ensures discretionary access control (DAC) and a secure flow of information to and from entities connected to the network, to provide a trusted computing base (TCB). Access to the printer by subjects is controlled by means of security clearance matrices that can then be applied to security classes under which network resources can be grouped. Lastly, a validation of the model is presented using simple set-theoretic concepts to assess the resilience of the implemented security defense model.

Keywords: Printer security, access control matrices, security architecture design, information flow

control, trusted computing base (TCB)

1. INTRODUCTION

1.1. Background

The modern printer has over the years evolved into an embedded device capable of incorporating a wide range of functionalities that go beyond what a conventional printer would be thought capable of doing. This unique ability to incorporate multiple functions into a single unit has earned it the acronym Multi-Function Printer (MFP). For the sake of generality, the term Multifunction Network Printer (MNP) is adopted in this text to better represent it as a network-integratable embedded1 device. MNPs have, in spite of the commendable efforts made by the "go-green" (Bansal, 2000; Di Giuli, 2014) corporations2, managed to become one of the most essential tools in today's businesses and homes (Infotrends, 2011). Due to this increasing market demand for a multipurpose printer and the need to incorporate a wide range of functionalities into a compact

1 An object containing a special-purpose computing system.

2 Encourage conservation of paper by advocating printing only when absolutely necessary (a.k.a. Green Computing).


unit, most manufacturers have opted to integrate disk drives in their printer designs to record and store latent and/or residual data, thus effectively turning even the most secure printers into a dormant security liability.

Nowadays, a typical MNP is capable of printing, scanning, copying, or faxing documents from both electronic and hard sources. These documents would more often than not contain potentially sensitive information that, if not properly secured, may fall into the wrong hands (Forbes, 2013). Therefore, identified security flaws in the mechanisms preventing unauthorized access to files residing within these printers, together with illegal flows of information, heighten the level of vulnerability of inter-process communications and thus potentially compromise privacy and integrity across the network (Gonsalves, 2013; Vail, 2003).

1.2. Problem of Interest

In highlighting the problem of interest, this paper takes cognizance of the ISO/IEC 154083 standard (Chen, 2015). It is thus worth clarifying that the focus of the paper is not on the internal or architectural security design flaws (Cui, 2013; Forbes, 2013) in MNPs, which are conventionally addressed by the aforementioned standard, but rather on potential security loopholes born from complexities inherent to MNPs. These loopholes can be roughly grouped into risks linked with (i) control security, (ii) data security, and (iii) network security. Risk, in this sense, can then be viewed as a quantity that grows with the magnitude of the threats and vulnerabilities to which assets are subject4 (Bishop, 2012; Pfleeger, 2011), as expressed by the following formula:

Risk = Threats × Vulnerabilities × Asset value (1)
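Equation (1) relates risk to threats, vulnerabilities, and asset value. The following is a minimal sketch assuming the common multiplicative form found in introductory security texts; the 0–1 likelihood scales and the example figures are illustrative assumptions, not values from the paper.

```python
# Sketch of the multiplicative risk relationship: risk grows with threat
# likelihood, vulnerability severity, and asset value. Scales and figures
# below are illustrative assumptions.

def risk(threat: float, vulnerability: float, asset_value: float) -> float:
    """Risk = Threat x Vulnerability x Asset value."""
    for factor in (threat, vulnerability):
        if not 0.0 <= factor <= 1.0:
            raise ValueError("threat and vulnerability are likelihoods in [0, 1]")
    return threat * vulnerability * asset_value

# An unpatched MNP (high vulnerability) holding sensitive scans (high asset value):
print(risk(threat=0.6, vulnerability=0.9, asset_value=10_000))  # ≈ 5400.0
```

Reducing any one factor (patching the vulnerability, lowering the asset's exposure) lowers the product, which is the intuition behind the mitigations discussed later.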

1.3. Focus of this Paper

In order to best describe the proposed security architecture, a formal mathematical model (Landwehr, 1981) is presented to demonstrate its potential implementation. This paper primarily focuses on: (i) devising a multilevel printer security5 mechanism for controlling access by subjects with different security clearances; (ii) safeguarding the privacy and integrity of data stored on the printers; (iii) providing audit trails for all transient inter-process communications; and (iv) providing protection against printer denial of service. The goal is to attain optimum security without compromising the balance between protection and usability (Vail, 2003).

2. IDENTIFICATION OF TRUSTED COMPUTING BASE6 FUNCTIONS IN MNPs

2.1. Definition of Terms Used

A Subject – is an active network resource capable of exchanging data or control information with an MNP.

A Network User – is a person authorised to use a given network.

A User Identifier – is a unique character string used to identify a given network user.

3 The Common Criteria for Information Technology Security Evaluation.

4 An active entity (i.e. user/user process) that interacts with an MNP.

5 Multilevel security deals with the protection of information to which different security level clearance classifications have been ascribed.

6 A set of hardware, software, or firmware factory-implemented protection mechanisms within a given MNP that are responsible for enforcing security policies (Bishop, 2003).


A Security Class – is a security attribute that can be assigned to all network resources to which a sensitivity level can be ascribed (e.g. ADMIN, POWER-USER, DOMAIN USER, etc.). It provides a basis for determining access from subject(s)-to-MNP(s). This allows us to define the set of security classes SC as a bounded7 lattice (SC, ≤) of sensitivity levels, such that any two classes in SC have both a least upper bound (join) and a greatest lower bound (meet). This is important as it defines the set of permissible information flows/transactions between subject(s) and MNP.

A Classification – is a designation attached to an MNP used for a given security class that reflects its relative value and vulnerability levels as a network asset.

An I/O Interface – is a point of transit for data/control located on an MNP. Each I/O interface belongs to a given classification.

An Operation – is a unit function that can be assigned to a given MNP and performed by an authenticated subject. These include, but are not limited to, the following:
a. Print – reproducing text and/or images from digital to hard copy.
b. Scan – capturing images from hard copy onto a digital format.
c. Fax – transmitting or receiving an electronic copy of a document.
d. Email – an electronic transmission or reception of a document.

A Reference Monitor (RM) – is used to mediate all information flows/transactions to a given MNP by subject(s).

A Reference Validation Mechanism (RVM) – is used to represent an implementation of the RM concept.
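The definitions above can be sketched in code: a bounded lattice of security classes (using the paper's examples ADMIN, POWER-USER, DOMAIN USER as a total order, the simplest bounded lattice) and a reference-monitor check over it. The dominance rule used here (a subject's clearance must dominate the classification of the targeted I/O interface) is an illustrative assumption, not the paper's stated policy.

```python
# Sketch: a bounded, totally ordered lattice of sensitivity levels and a
# reference-monitor style check over it. In a total order, join = max and
# meet = min, so the lattice is trivially bounded (footnote 7).

LEVELS = ["DOMAIN-USER", "POWER-USER", "ADMIN"]  # bottom ... top
RANK = {name: i for i, name in enumerate(LEVELS)}

def join(a: str, b: str) -> str:      # least upper bound
    return a if RANK[a] >= RANK[b] else b

def meet(a: str, b: str) -> str:      # greatest lower bound
    return a if RANK[a] <= RANK[b] else b

def rm_permits(subject_clearance: str, interface_class: str) -> bool:
    """Reference monitor check (assumed rule): allow access only if the
    subject's clearance dominates the I/O interface's classification."""
    return RANK[subject_clearance] >= RANK[interface_class]

assert join("DOMAIN-USER", "ADMIN") == "ADMIN"
assert meet("POWER-USER", "ADMIN") == "POWER-USER"
assert rm_permits("ADMIN", "POWER-USER")
assert not rm_permits("DOMAIN-USER", "POWER-USER")
```

A reference validation mechanism would wrap every MNP operation in a call like `rm_permits` so that no information flow bypasses the check.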

2.2. Security Requirements for Connecting Trusted and Untrusted Subject(s) to MNP(s)

This section presents some scenarios aimed at highlighting specific security requirements for connecting trusted (t) or untrusted (u) subject(s) to MNP(s). Before proceeding further, it is important to point out that creating a perfectly secured network is an ultimate, albeit unachievable, goal, as there would always be some element of risk, however minute (Suh-Lee, 2015). For instance, a trusted subject or MNP may be defined as one that is deemed to have provided sufficient credible evidence that it meets a finite set of security requirements (Bishop, 2003), thus also making it in a way trustworthy. Consequently, a subject or MNP deemed 'trusted' would remain in that state provided that the link between the 'credible evidence' and the finite set of 'security requirements' is maintained. For this reason, 'trust' within the context of security should never be thought of as a given.

Let X = {x_t, x_u} represent the set of security states for a given MNP and Y = {y_t, y_u} represent those of its connecting subject(s), where x_t and y_t represent trusted security states and x_u and y_u untrusted security states. Figure 1 illustrates, using a Hasse diagram, the partially ordered sets that can result from the interaction between given elements of X and those of Y. From the diagram, the following four possible cases can be deduced, with the assumption that all involved MNPs have a resident TCB:

i. Threat T1: {x_t, y_t} – A trusted subject connecting to a trusted MNP.
a. Requirement T1-A: the subject is permitted to begin a session only if a unique user identifier8 has been supplied to the MNP and if it has been successfully validated and authenticated by the MNP.
b. Requirement T1-B: an active subject must belong to a classification that allows or prohibits access to both operations and/or I/O interfaces provided by the MNP.

7 An ordered set with both join (i.e. least upper bound) and meet (i.e. greatest lower bound) semi-lattices.

8 May be presented as a user id and password, biometric inputs such as a finger vein scan, a smart card, etc.


c. Requirement T1-C: an inactive session, albeit authenticated, must have a finite expiration period.
d. Requirement T1-D: a valid subject identifier must have a finite print/copy quota in any given session.
e. Requirement T1-E: mandatory security policies and flow control functions must be implemented on both the MNP and its subjects.

{x_t, x_u, y_t, y_u}

{x_t, y_t}  {x_t, y_u}  {x_u, y_t}  {x_u, y_u}

{x_t}  {x_u}  {y_t}  {y_u}

∅

Figure 1: Hasse diagram of the partial order derived from X and Y

The security analysis mapping for T1 can be defined as: T1 → {T1-A, T1-B, T1-C, T1-D, T1-E}. The set {T1-A, T1-B, T1-C, T1-D, T1-E} is known as the security target reference mapping for threat T1.

ii. Threat T2: {x_t, y_u} – An untrusted subject connecting to a trusted MNP.
a. Requirement T2-A: a valid justification of the security analysis mapping for T1 must be presented in accordance with ITSEC (Commission of the European Communities, 1991).
b. Requirement T2-B: the classification of each untrusted subject must fall within the range of sensitivity levels for which the MNP is trusted.
c. Requirement T2-C: the MNP must provide an audit trail (i.e. maintain an event log) for all past actions performed on behalf of subject(s).
d. Requirement T2-D: an appropriate user classification with a matching sensitivity level must be defined for all subjects connecting to an MNP during non-business hours.
e. Requirement T2-E: appropriate network-level security must be enforced to ensure that discretionary access control9 as well as port and protocol access control is observed at all network layers.

The security analysis mapping for T2 can be defined as: T2 → {T1, T2-A, T2-B, T2-C, T2-D, T2-E}.

iii. Threat T3: {x_u, y_u} – An untrusted subject connecting to an untrusted MNP.
a. Assumption T3-A: due to the lack of discretionary access control, a trust association with subject(s) cannot be reciprocated by the MNP, and vice-versa.
b. Assumption T3-B: as a result of T3-A, no security policy may be effected. This also means that the presence of a TCB on the MNP is symbolic (i.e. not functional).

9 Also known as identity-based access control (IBAC) – granting restricted access to subjects on the basis of their identity and/or the groups to which they belong.


The security analysis mapping for T3 can be defined as: T3 → {T3-A, T3-B}.

iv. Threat T4: {x_u, y_t} – A trusted subject connecting to an untrusted MNP.
a. Requirement T4-A: a valid justification of the security analysis for T2 must be validated in accordance with ITSEC (Commission of the European Communities, 1991).
b. Requirement T4-B: upon initiation of an active session, subjects need to secure their transactions using either data protection mechanisms10 or printer job locking11.
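The four threat cases and their security target reference mappings can be sketched as data with a small resolver that expands nested references (T2's mapping includes all of T1's requirements, and so on). The mappings for T1–T3 follow those stated in the text; T2-E is included on the assumption that its omission from the printed mapping was inadvertent, and the mapping for T4 is an assumption inferred from Requirement T4-A by analogy with T2.

```python
# Sketch: security target reference mappings as data, with a resolver that
# recursively flattens threat references. T4's mapping is an assumption.

MAPPINGS = {
    "T1": {"T1-A", "T1-B", "T1-C", "T1-D", "T1-E"},
    "T2": {"T1", "T2-A", "T2-B", "T2-C", "T2-D", "T2-E"},
    "T3": {"T3-A", "T3-B"},
    "T4": {"T2", "T4-A", "T4-B"},  # assumed, by analogy with T2's reference to T1
}

def resolve(threat: str) -> set:
    """Flatten a mapping by recursively expanding references to other threats."""
    out = set()
    for item in MAPPINGS[threat]:
        out |= resolve(item) if item in MAPPINGS else {item}
    return out

assert "T1-C" in resolve("T2")   # inherited via the T1 reference
assert "T2-B" in resolve("T4")   # inherited via the (assumed) T2 reference
```

Keeping the mappings as data makes it straightforward to audit which concrete requirements apply to any given connection scenario.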

2.3. MNP Security Mis-configurations: A Review of Possible Problem Areas

In order to best address possible problem areas inherent to MNPs, it is necessary to look at security control management in terms of its access (Dohi, 2012), information flow (Denning, 1975; Stoughton, 1981), and cryptographic (Kahate, 2013) controls. These are briefly discussed in the following sections as potential security sore spots leading to network threats and/or vulnerabilities in MNPs.

i. Devising Generic MNP-Based Security Mechanisms for Controlling Access by Subjects
When inspecting access control vulnerability areas, one needs to describe them in terms of configurable authentication12, authorization13, and accountability14 features. On a generic MNP, for instance, these can be controlled by enabling, amongst others, features such as discretionary copy/print/scan/fax account tracking, subject authentication for both remote and local access, auto log-off of idle processes, function restrictions, event log history, printer driver user data encryption, non-business-hours user account tracking, etc.
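The discretionary controls above can be sketched as an access matrix keyed by (subject, operation), as suggested by the abstract's security clearance matrices. The subject names and the quota field are illustrative assumptions; the quota check mirrors Requirement T1-D.

```python
# Sketch: discretionary access control for MNP operations via an access
# matrix. Subjects, operations, and quotas below are illustrative.

ACCESS_MATRIX = {
    ("alice", "print"): {"allowed": True,  "quota": 50},
    ("alice", "scan"):  {"allowed": True,  "quota": 20},
    ("bob",   "print"): {"allowed": True,  "quota": 10},
    ("bob",   "fax"):   {"allowed": False, "quota": 0},
}

def authorize(subject: str, operation: str, pages: int) -> bool:
    """Allow an operation only if the matrix grants it and quota remains."""
    entry = ACCESS_MATRIX.get((subject, operation))
    if entry is None or not entry["allowed"]:
        return False                     # default deny
    if pages > entry["quota"]:
        return False                     # finite quota per session (T1-D)
    entry["quota"] -= pages              # account tracking
    return True

assert authorize("alice", "print", 30)
assert not authorize("alice", "print", 30)   # only 20 pages remain
assert not authorize("bob", "fax", 1)
assert not authorize("carol", "copy", 1)     # unknown subject: default deny
```

The default-deny behaviour for unknown (subject, operation) pairs is the property that distinguishes an enforced matrix from a mere feature checklist.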

ii. Safeguarding Data Privacy and Integrity
While it is equally important to ensure the confidentiality of data by taking simple measures such as not leaving personal documents lying in the MNP's output tray, the emphasis in this section is on the implementation of appropriate data security policies to safeguard latent and/or residual data stored on MNPs' resident drives. Amongst others, these can be controlled by enabling features such as disk-drive password protection, hard-disk data encryption, hard-disk data overwriting, temporary data deletion, timed data auto-deletion, etc.

iii. Providing Audit Trails for Transient Inter-process Communications
Provision of audit trails for transient inter-process communications on MNPs is often realized through the integration of reference monitor functions on the MNP. Control is achieved here by enabling features such as IP address filtering, port and protocol access control, SSL15/TLS16 encryption, IPSec support for secured session tunneling, IEEE 802.1x support, NDS17 authentication, etc.
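An audit trail of the kind required by T2-C can be sketched as an append-only event log written by the reference monitor on every mediated action. The event fields below are an illustrative assumption.

```python
# Sketch: an append-only audit trail for mediated MNP actions (T2-C).
# Event fields are illustrative; real MNPs define their own log schemas.

import json
import time

AUDIT_LOG = []  # in practice: write-once storage on or off the MNP

def audit(subject: str, operation: str, interface: str, allowed: bool) -> None:
    """Record every action performed (or refused) on behalf of a subject."""
    AUDIT_LOG.append(json.dumps({
        "ts": time.time(),          # event timestamp
        "subject": subject,
        "operation": operation,
        "interface": interface,
        "allowed": allowed,
    }))

audit("alice", "scan", "usb", True)
audit("mallory", "telnet", "port-23", False)   # refusals are logged too
assert len(AUDIT_LOG) == 2
assert json.loads(AUDIT_LOG[1])["allowed"] is False
```

Logging refusals as well as successes is what makes the trail useful for reconstructing probe and attack activity after the fact.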

iv. Protection Against Printer Denial of Service
Most MNP denial-of-service attacks discussed in the literature (Ormazabal, 2014a; Ormazabal, 2014b; Ormazabal, 2015) can be grouped into two broad categories. The first is often achieved by gaining unlawful access to the printer via unsecured ports (such as HTTP or Telnet) with the intent of

10 Includes hard-disk password protection, disk data encryption, hard-disk data overwrite, latent data auto-deletion, etc.

11 Ensuring that jobs from an authenticated subject are put on hold until a matching identifier is provided physically at the machine.

12 Confirming a subject's identity.

13 Determining what the subject can do.

14 Associating a subject with its actions.

15 Secure Socket Layer.

16 Transport Layer Security.

17 Novell Directory Services.


damaging or unlawfully restricting access to services or functionalities that were otherwise provisioned for authenticated subjects. The second type is achieved by flooding known printer interfaces/ports (such as port 9100) with random data with the intent of exhausting the printer's resources and thus effectively preventing it from provisioning any further service. These are often circumvented by simply observing the basic access control measures discussed earlier and frequently applying printer firmware patches.
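The two mitigations just described can be sketched as code: closing unsecured management ports via an allow-list, and throttling the raw print port (9100) with a token bucket to blunt flooding. The port set and rate parameters are illustrative assumptions.

```python
# Sketch: (i) port allow-listing against unlawful access, and (ii) a
# token-bucket rate limiter on port 9100 against resource-exhaustion
# floods. Parameters are illustrative assumptions.

ALLOWED_PORTS = {9100, 443}          # raw printing + HTTPS admin only

class TokenBucket:
    """Admit at most `rate` jobs/second with bursts up to `burst`."""
    def __init__(self, rate: float, burst: float):
        self.rate, self.burst = rate, burst
        self.tokens, self.last = burst, 0.0

    def admit(self, now: float) -> bool:
        # Refill tokens for the elapsed interval, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False                  # drop: likely a flood

def accept(port: int, bucket: TokenBucket, now: float) -> bool:
    return port in ALLOWED_PORTS and bucket.admit(now)

bucket = TokenBucket(rate=2.0, burst=2.0)
assert not accept(23, bucket, 0.0)                  # Telnet closed
assert accept(9100, bucket, 0.0)                    # first job admitted
assert accept(9100, bucket, 0.1)
assert not accept(9100, bucket, 0.2)                # flood throttled
```

Neither measure replaces firmware patching; they simply shrink the attack surface that an unpatched flaw can be reached through.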

3. MEASURING SECURITY AS A RESULT OF THE INTERACTION BETWEEN SUBJECTS AND MNP

3.1. Preamble

This section presents an adaptation of the method for quantifying security risk presented by Suh-Lee and Jo (2015). Before proceeding to the calculations, there is a need to define the following terms:

i. A Danger Zone – is a network segment where trusted members belonging to the set H_t attached to the segment have frequent interactions with an untrusted host h_u belonging to the set H_u. The set of nodes belonging to a given danger zone is defined as:

DZ = H_t ∪ {h_u} (2)

ii. A Zone Proximity Value P(H) – is an integer value that indicates the proximity of a trusted member H of H_t to the untrusted member h_u of H_u. The smaller the value of P(H), the closer the member is to the untrusted node, and therefore the higher the risk in the zone (Suh-Lee, 2015).

iii. The Proximity Value of a node H from h_u – is defined as the number of hops separating H from the untrusted node:

P(H) = hops(h_u, H) (3)

iv. The Proximity-adjusted Vulnerability Score of a host H (Suh-Lee, 2015) – is defined as:

S(H) = Σ_i v_i′ (4)

where v_i is a given vulnerability (CVSS18 base score) found in the host and

v_i′ = v_i ⁄ P(H) (5)

v. The Relative Cumulative Risk (RCR) of the vulnerability – is defined as the sum of the proximity-adjusted vulnerability scores over all danger zones containing H:

RCR(H) = Σ_z S_z(H) (6)
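A sketch of these definitions in code, assuming the proximity-adjusted score sums each vulnerability's CVSS base score divided by the host's hop-distance proximity P(H), and that the RCR sums the per-zone scores. The CVSS inputs below are hypothetical, not the paper's data.

```python
# Sketch: proximity-adjusted vulnerability score and relative cumulative
# risk, under the stated assumptions. CVSS inputs are hypothetical.

def proximity_adjusted_score(cvss_scores, proximity: int) -> float:
    """Sum of v_i / P(H) over the vulnerabilities found on host H."""
    if proximity < 1:
        raise ValueError("proximity must be at least one hop")
    return sum(v / proximity for v in cvss_scores)

def relative_cumulative_risk(per_zone_proximities, cvss_scores) -> float:
    """Sum the proximity-adjusted score over every danger zone
    containing the host."""
    return sum(proximity_adjusted_score(cvss_scores, p)
               for p in per_zone_proximities)

# A DMZ printer one hop from the Internet vs. an internal one five hops away:
vulns = [7.5, 5.0]                                 # hypothetical CVSS base scores
exposed = relative_cumulative_risk([1], vulns)     # 12.5
shielded = relative_cumulative_risk([5], vulns)    # 2.5
assert exposed > shielded
```

The same vulnerability set thus yields a much higher risk figure for the exposed host, which is the qualitative behaviour the test network in the next section exhibits.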

3.2. Evaluating the Relative Cumulative Risk of the Interaction between Members of the set DZ

To demonstrate the effect of the interactions between subjects and MNPs located in various sections of the network, and to estimate their relative risk, an emulation of a typical deployment environment is

18 The Common Vulnerability Scoring System v3 (CVSS): a measurable vulnerability severity score ranging from 0 to 10, with 7.0 to 10.0 being the highest, often used to prioritize responses and resources according to threat.


used as represented in Figure 2.

Under the assumption that the internal network is adequately policed and fulfils the security requirements stated in the previous section, the three multifunction network printers (MNPs) have been placed in three different segments of the network. From Figure 2, the following two danger zones (DZ1 and DZ2) can be identified.

The first zone is directly positioned behind inbound connections originating from the Internet via the ISP-supplied gateway router, inbound through FW1 and onto the DMZ segment containing the employees' e-mail server (EMS), the public hall printer (MNP2), and the webserver (WS), which also has a backend connection to the database server (DS).

[Figure 2 depicts the test network: the Internet enters through the gateway router (GW) and the external UTM/FW (FW1) into the DMZ hosting the e-mail server (EMS), web server (WS), and main hall printer (MNP2). Behind the internal FW (FW2) and the proxy/email filter (PRX) sit Subnet 1, with the user workstations (US1, US2), the system admin workstation (AUS3), and the sales printer (MNP3), and Subnet 2, with the DB server (DS), AD server (AD), file server (FS), management server (MS), and management printer (MNP1).]

Figure 2: The Test Network Diagram

Excluding all non-printing end resources, our zone definition and proximity value assignment would be:
DZ1: {Internet: 0, GW: 0, FW1: 1, MNP2: 1, FW2: 1}.

The second zone comprises subnets 1 and 2. The first subnet has outbound connectivity to the Internet through the proxy server via FW1. Similar to the above representation of zone 1, our zone definition and proximity assignment in this case is:
DZ2: {Internet: 0, GW: 0, FW1: 1, FW2: 3, MNP3: 4, MNP1: 5}.


Determining proximity values for the printers relative to the danger zones using (3) and (5) generates the following proximity map:

DZ1 – Internet: 0, GW: 0, FW1: 1, MNP2: 1, FW2: 1
DZ2 – Internet: 0, GW: 0, FW1: 1, FW2: 3, MNP3: 4, MNP1: 5

Figure 3: Proximity Map for the Test Network

The proximity-adjusted vulnerability scores calculated for MNP1, MNP2, and MNP3 yield their relative cumulative risk values. From this calculation, we can conclude that MNP2 has the highest risk rank of the three printers (RCR = 45.49), followed by MNP3 (RCR = 28.73). MNP1 is found to be the least at-risk resource (RCR = 2.33).

This conclusively demonstrates how exposure to different vulnerability levels can elevate the relative risk rankings of resources that may otherwise be assumed to have been properly secured. The presented method (Suh-Lee, 2015) therefore establishes that while the accurate configuration of printers is important, it is equally important to remedy existing network vulnerabilities to ensure that risks are kept at their lowest.

4. CONCLUSION
This paper presented a multilevel network printer security architecture that relies on robust policy-constrained security mechanisms for discretionary control of both trusted and untrusted entities by means of the TCB. The developed model further demonstrates the need for a secured and trustworthy network environment, since the nature of such an environment tends to measurably



reduce the risk ranking ascribed to a given MNP, regardless of how well it is thought to have enforced access and information flow control mechanisms. It is hoped that by focusing on accurate configuration, good use of discretionary security policy implementations, and the strategic placement of MNPs, one can greatly improve both the security and trustworthiness of MNPs while still maintaining the balance between protection and usability.

ACKNOWLEDGEMENTS
I would like to thank Botho University for sponsoring this paper's presentation costs. I would also like to thank the colleagues and friends who have taken time to review it and provide much-needed feedback.

REFERENCES

Bansal, P., & Roth, K. (2000). "Why companies go green: A model of ecological responsiveness." Academy of Management Journal 43 (4): 717-736.

Bishop, M. (2012). "An Overview of Computer Security." In Computer Security: Art and Science, 3-24. Capetown: Addison-Wesley.

Bishop, M. (2003). "Assurance." In Computer Security: Art and Science, 475-544. Capetown: Addison-Wesley.

Chen, H., Bao, D., Goto, Y., & Cheng, J. (2015). "A Supporting Environment for IT System Security Evaluation Based on ISO/IEC 15408 and ISO/IEC 18045." Computer Science and its Applications: 1359-1366.

Commission of the European Communities. (1991). Information Technology Security Evaluation Criteria. Brussels: Commission of the European Communities.

Cui, A., Costello, M., & Stolfo, S. J. (2013). "When Firmware Modifications Attack: A Case Study of Embedded Exploitation." NDSS.

Denning, D. E. R. (1975). "Secure Information Flow in Computer Systems." Ph.D. Dissertation, Purdue University.

Di Giuli, A., & Kostovetsky, L. (2014). "Are red or blue companies more likely to go green? Politics and corporate social responsibility." Journal of Financial Economics 111 (1): 158-180.

Dohi, M. (2012). Printing system, information processing apparatus, printing apparatus, print management method, and storage medium. Washington, DC: U.S. Patent 8,161,297. April 17.

Forbes. (2013). "The Hidden IT Security Threat: Multifunction Printers." February 7. Accessed December 26, 2015. http://www.forbes.com/sites/ciocentral/2013/02/07/the-hidden-it-security-treat-multifunction-printers/?sf9393024=1.

Gonsalves, A. (2013). "Printers Join Fray in Network Vulnerability Landscape." CSO Online. January 29. Accessed December 26, 2015. http://www.csoonline.com/article/2132861/access-control/printers-join-fray-in-network-vulnerability-landscape.html.

Grubb, B. (2013). "Security Fears over Exposure of Web-accessible Printers." January 29. Accessed December 26, 2015. http://www.theage.com.au/it-pro/security-it/security-fears-over-exposure-of-webaccessible-printers-20130129-2dhxo.html.

Infotrends. (2011). "Placements of Printers & MFP Devices Grew In U.S. and Western Europe Despite Challenging Economy." May 24. Accessed December 26, 2015. http://www.infotrends.com/public/content/press/2011/05.24.2011c.html.

Kahate, A. (2013). Cryptography and Network Security. New Delhi: Tata McGraw-Hill Education.

Landwehr, C. E. (1981). "Formal models for computer security." ACM Computing Surveys 13 (3): 247-275.

Ormazabal, G. S., & Schulzrinne, H. G. (2014b). Denial of service detection and prevention using dialog level filtering. Washington, DC: U.S. Patent 8,719,926. May 6.

Ormazabal, G. S., & Schulzrinne, H. G. (2014a). Malicious user agent detection and denial of service (DoS) detection and prevention using fingerprinting. Washington, DC: U.S. Patent 8,689,328. April 1.

Ormazabal, G. S., Schulzrinne, H. G., Yardeni, E., & Patnaik, S. B. (2015). Prevention of denial of service (DoS) attacks on session initiation protocol (SIP)-based systems using return routability check filtering. Washington, DC: U.S. Patent 8,966,619. February 24.

Pfleeger, C. P., & Pfleeger, S. L. (2011). "Administering Security." In Security in Computing, 524-545. Boston: Prentice Hall Professional Technical Reference.

Savage, C., Petro, C., & Goldsmith, S. (2015). System for Providing Session-based Network Privacy, Private, Persistent Storage, and Discretionary Access Control for Sharing Private Data. Washington, DC: U.S. Patent 20,150,333,917. November 19.

Stoughton, A. (1981). "Access Flow: A protection model which integrates access control and information flow." IEEE Symposium on Security and Privacy: 9-9.

Suh-Lee, C., & Jo, J. (2015). "Quantifying security risk by measuring network risk conditions." 2015 IEEE/ACIS 14th International Conference, Las Vegas: 9-14.

Vail, V. T. (2003). "Printer Insecurity: Is it Really an Issue?" SANS Institute InfoSec Reading Room, May 28: 1-12.
