
Received: 24 December 2016 Revised: 26 July 2018 Accepted: 3 August 2018

DOI: 10.1002/cpe.4967

RESEARCH ARTICLE

A general-purpose distributed computing Java middleware

André Luís Barroso Almeida 1,2, Leonardo de Souza Cimino 1, José Estevão Eugênio de Resende 1, Lucas Henrique Moreira Silva 1, Samuel Queiroz Souza Rocha 1, Guilherme Aparecido Gregorio 1, Gustavo Silva Paiva 1, Saul Delabrida 1, Haroldo Gambini Santos 1, Marco Antonio Moreira de Carvalho 1, Andre Luiz Lins Aquino 3, Joubert de Castro Lima 1

1 DECOM, Universidade Federal de Ouro Preto, Ouro Preto, Brazil
2 Control and Automation Department (CODAAUT), Instituto Federal de Minas Gerais, Ouro Preto, Brazil
3 IC, Universidade Federal de Alagoas, Maceió, Brazil

Correspondence
André Luís Barroso Almeida, Control and Automation Department, Instituto Federal de Minas Gerais, 35400-000 Ouro Preto MG, Brazil.
Email: [email protected]

Summary

The middleware solutions for General-Purpose Distributed Computing (GPDC) have distinct requirements, such as task scheduling, processing/storage fault tolerance, code portability for parallel or distributed environments, simple deployment (including over grid or multi-cluster environments), collaborative development, low code refactoring, native support for distributed data structures, asynchronous task execution, and support for distributed global variables. These solutions do not integrate these requirements into a single deployment with a unique API exposing most of these requirements to users. The consequence is the utilization of several solutions with their particularities, thus requiring different user skills. Besides that, the users have to solve the integration and all heterogeneity issues. To reduce this integration gap, in this paper, we present Java Cá&Lá (JCL), a distributed-shared-memory and task-oriented lightweight middleware for the Java community that separates business logic from distribution issues during the development process and incorporates several requirements that were presented separately in the GPDC middleware literature over the last few decades. JCL allows building distributed or parallel applications with only a few portable API calls, thus reducing the integration problems. Finally, it also runs on different platforms, including small single-board computers. This work compares and contrasts JCL with other Java middleware systems and reports experimental evaluations of JCL applications in several distinct scenarios.

KEYWORDS

distributed computing, distributed shared memory, Java, middleware, parallel computing, task-oriented

1 INTRODUCTION

We live in a world where large amounts of data are stored and processed every day.1 According to the last International Data Corporation (IDC) report, the amount of data stored reached 4.5 trillion gigabytes in 2013. This number is expected to grow by a factor of 10, exceeding 40 trillion gigabytes by 2020.2 Despite significant increases in the performances of today's computers, problems that are intractable via sequential computing approaches still exist.3 Big data,4 Internet of Things (IoT),5 and elastic cloud services6 are promising technologies for this new, decentralized, dynamic, and communication-intensive society.

General-Purpose Distributed Computing (GPDC) is based on the principle of concurrency. This approach can achieve considerable speed improvements, but the development process becomes considerably more complicated when we introduce concurrency. Therefore, the development of GPDC applications without middleware or frameworks that function as intermediate software layers is practically impossible.7 Middleware is everywhere and most likely will remain everywhere for a long time because it helps reduce the complexity of application development.7 The challenging issue is to provide sufficient generalized high-level mechanisms using middleware to support general-purpose computing and the rapid development of distributed and parallel applications.



TABLE 1 Fundamental requirements for GPDC middleware solutions

1. Code Refactoring
2. Simple Deployment
3. Collaborative Development
4. Parallel/Distributed Portability
5. General-Purpose Computing
6. Performance
7. Distributed Storage
8. Task Scheduling
9. Multi-cluster/Grid Support
10. Task Cost
11. Processing/Storage Fault Tolerance
12. Scalability

Among various programming languages for middleware systems, the interest in Java for GPDC is enormous.8 This interest stems from numerous features, including built-in networking, multi-threading support, platform independence, reflection, type safety, security, and a vast developer community.9

Despite significant improvements in Java middleware solutions for GPDC over the past decades, few studies in the literature focus on implementing many of the requirements considered fundamental for GPDC development. The consequence is the utilization of several solutions with their particularities, thus requiring different user skills. Besides that, the users have to solve the integration and all heterogeneity issues. Therefore, to fill this gap and integrate most of these GPDC requirements, we introduce a middleware called Java Cá&Lá (herein, JCL),* which fulfills most of these requirements.

The main contributions of this paper, compared with our previous work,10 are the following.

1. A new JCL version that implements new requirements;

2. Three JCL optimization applications that can be used to evaluate its scheduling technique;

3. A detailed performance evaluation of JCL compared to Java Remote Method Invocation (RMI) (both synchronous and asynchronous) and the

Apache Ignite middleware solution, demonstrating the performance and scalability of JCL;

4. An evaluation of the JCL-Super-Peer component, in which we use multiple clusters; and

5. A discussion of the state-of-the-art in Java GPDC middleware, which can be useful for future technical and research investigations.

The rest of this paper is organized as follows. Section 2 discusses the important requirements for the design and development of a modern GPDC middleware solution. Section 3 presents works that influenced the design and development of JCL, highlighting their benefits and limitations with respect to the GPDC requirements discussed in Section 2. Section 4 details how our JCL middleware implements most of the requirements presented in Section 2. Section 5 presents our experimental evaluation and discusses the results. Section 6 concludes our work and suggests future improvements to JCL.

2 GPDC MIDDLEWARE REQUIREMENTS

Table 1 lists the fundamental requirements for GPDC development. The results of a comparative study, described in Section 3, show that JCL meets

more of these requirements than any other related work. Therefore, it is highly useful for GPDC development. Comparative tests against standards

and market leaders, presented in Section 5, reinforce our view of JCL as a promising GPDC middleware alternative for the Java community.

The design and development processes of a middleware system include implementation, architectural and conceptual requirements. This section

details these requirements, completing the list presented in the work of Almeida et al,10 in which they explain the requirements of code refactoring,

simple deployment, collaborative development, and parallel/distributed portability. We detail the remaining requirements as follows.

General-Purpose Computing: Middleware systems support programming models such as Distributed Shared Memory (DSM), message-passing, and

task-oriented or event-based models.11 The DSM programming model12 uses a global address space to store variables for an entire cluster but

does not consider method execution over the same cluster. The task-oriented programming model13 is an alternative for that purpose, as a task can

encapsulate one or more asynchronous executions of any class member, including its methods. Event-based and message-passing programming

models support other key programming abstractions, such as messages and events. These programming models are supported by middleware

solutions to enable the development of general-purpose computing.

Middleware systems such as Hazelcast,14 JBoss,15 Gridgain,16 Apache Ignite,17 and JCL10 can be not only adopted for general-purpose computing

but also designed for a specific purpose, eg, gaming, mobile computing, or real-time computing.18-20

Performance: The ability of a solution to provide processing, storage, and communication services can be evaluated using benchmarks or via comparative experiments. Both approaches can provide results that reveal the solution's performance. Thus, following the best design practices when developing GPDC applications is essential. Therefore, improvements in storage services, such as caching and pre-fetching, in processing, such as pipelines and schedulers, or in communication, such as buffering and compression, are typically implemented in middleware systems.

*A version of Java Cá&Lá is available for download from http://www.javacaela.org.


Many works have investigated the performance of cloud computing systems, such as Amazon EC2 in the work of Mehrotra et al21 and the Amazon web service cloud in the work of Jackson et al.22 Some High-Performance Computing (HPC) Java platforms were investigated and comparatively evaluated in the work of Taboada et al.8 Massive parallel computing architectures such as Field Programmable Gate Array (FPGA) and Graphical Processing Unit (GPU) cards represent promising alternatives concerning performance. Thus, many solutions have attempted to simplify the development efforts required for such approaches. For instance, Karantasis and Polychronopoulos23 presented an extension of the Pleiad middleware,24 used by Java developers to work with a local GPU abstraction over several nodes with between one and four GPU boards each.

Distributed Storage: Many middleware systems implement user-typed object storage, but few of them implement distributed data structures as

part of a unified Application Programming Interface (API).14-16 Developers typically implement distributed storage using a specific framework or

middleware such as HBase,25 Cassandra,26 Apache Pig,27 ScyllaDB,28 or MongoDB.29 Often, third-party distributed storage solutions are focused on

transactional aspects, ie, database atomicity, consistency, isolation, and durability (ACID) demands, designed for applications with specific needs.

In contrast, our focus is on global variables, including data structures, adopted in every codebase. JCL and a few others extend standard Java

collection APIs such as Map, Set, and List, which have been present in Java since its beginning. Thus, small code refactoring is necessary when we

replace sequential global variables with distributed variables.

Task Scheduling: In most cases, we model GPDC applications as single instruction multiple data (SIMD) solutions; consequently, the workload

depends on data partitioning.30 We can model other problems as pipeline solutions, in which each pipe step executes a different set of instructions

or a method. Therefore, pipeline steps typically have different workloads (Multiple Instruction Single Data (MISD)). Unfortunately, both modeling

alternatives introduce a load balancing problem, as highlighted in the work of Boneti et al.30

To reduce the load balancing problems, we adopt scheduling algorithms,31 the goal of which is to reduce the workload difference by moving part

of the load from overloaded machines or cores to those that are underutilized.32 Some middleware systems, such as the Java Parallel Processing

Framework (JPPF)33 and Gridgain,16 implement various scheduling techniques; others, such as Java RMI34 and Message Passing Interface (MPI),35

delegate scheduling issues to developers. To achieve better performance, implementation of a dynamic scheduling strategy may be necessary.

In such cases, middleware can typically implement more complex algorithms that can reconfigure the scheduling strategy at runtime36 using

predictive models based on processor, memory, and network historical usage.37

Multi-cluster/Grid Support: The concept of a super-peer is a well-established option for enabling a multi-cluster/grid environment. A super-peer is a

node in a peer-to-peer network that operates both as a server for a set of clients and as an equal in a network of super-peers.38 Such architectures

capitalize on heterogeneous capabilities (eg, bandwidth and processing power) across peers, but they also enable sub-networks whose IP addresses are invalid outside the sub-network to be interconnected in a grid.

The super-peer concept also introduces the possibility of creating multiple logical clusters, where we organize each cluster according to the developer's needs. Therefore, super-peers can extend network infrastructure advantages. For instance, using three super-peers in a smart building, a developer can build one cluster for the garden, another for the swimming pool, and a third for the garage, all of which share the same network infrastructure. We can also arrange the super-peers in a hierarchical topology. Thus, nested clusters are feasible where, eg, a garden cluster might contain a swimming pool cluster. Surveys such as the works of Lua et al39 and SalemAlzboon et al40 report the benefits of super-peers, but the Java GPDC middleware systems found in the literature do not consider this concept.

Task Cost: Middleware systems such as JPPF33 and Hazelcast14 monitor the health of every cluster member concerning RAM, disk, and CPU usage.

Dashboards have been implemented to visualize cluster health, but for capacity planning, the collection of each task's storage and processing

requirements in each cluster member is fundamental.

These task costs are essential for building scheduling algorithms or supervisory systems and have been adopted to delineate capacity planning

strategies for decentralized systems. A high standard deviation in a cluster's queue time can indicate that the cluster has insufficient cores and

that new members need to be connected. Conversely, a low standard deviation can guarantee energy savings. Unfortunately, no related work has

implemented this detailed task cost model.

Processing/Storage Fault Tolerance: Fault tolerance in distributed computing is an important requirement for preventing data loss and corruption as

well as resistance to malfunctioning applications. Middleware systems, such as Hazelcast,14 Gridgain,16 and Oracle Coherence,41 implement fault

tolerance for storage services by maintaining multiple copies of distributed global variables. For processing fault tolerance, JPPF33 resubmits tasks

when timeouts occur. A common way to detect a fault is to send a message periodically to the node. Basanta-Val and García-Valls42 used this

strategy. However, none of the existing Java GPDC middleware systems consider Byzantine faults,43 in which a process not only fails to respond

but also produces incorrect results due to a variety of reasons.

Scalability: Scalability is crucial in the design of parallel and distributed systems44 because not all operations or applications require the same

middleware improvements for GPDC. In their book The Art of Scalability, Abbott and Fisher45 defined the three types of scalability as a scale cube.

All GPDC middleware solutions attempt to simplify one of the following scaling axes for their developers: (i) X-axis scaling is essentially traditional

horizontal scaling, which distributes the total load across a given number of nodes; (ii) Y-axis scaling refers to extracting and distributing services,

ie, functional decomposition, and is a design approach reflected in service-oriented and microservice architectures; and (iii) Z-axis scaling refers

to data partitioning and involves distributing data among many nodes or blocks of data to improve performance.

HPC middleware solutions have constrained and rigorous demands concerning scalability and performance requirements; this way, they try to reduce, for instance, communication latency. Java has native support for the Sockets Direct Protocol (SDP), but sometimes it is not sufficient, particularly when running network-bound applications. Furthermore, SDP is no longer supported by the OpenFabrics Alliance (OFA). For these reasons, HPC middleware solutions always implement their network support on top of low-level libraries for low-latency communications, eg, Verbs/MXM for InfiniBand, GNI for Cray Aries, and usNIC for Cisco. Currently, JCL does not integrate with such libraries. Thus, in this paper, we avoid referring to JCL as an HPC middleware solution. Instead, we adopt the acronym GPDC to describe JCL.

3 RELATED WORK

The goal of this section is to evaluate the main Java GPDC middleware solutions based on the requirements previously defined in Table 1. Solutions classified as cloud middleware and frameworks are not considered here, as their goal is different from that of JCL and its counterparts. More precisely, cloud middleware and frameworks offer virtual machines and transparently execute virtual services on top of them (eg, network, file system, operating system, relational database, and NoSQL database). Solutions classified as middleware for big data, time-critical big data,46 and stream processing47 are also not considered here because they focus on data analytics and non-volatile storage rather than coding general-purpose applications.

This section considers both academic and commercial Java GPDC middleware solutions and highlights their limitations/improvements. Middleware systems that are very similar to JCL are described in detail, while others are only briefly mentioned in Tables 2 and 3.

Jessica54 improves the original Java virtual machine (JVM) by enabling a distributed shared space for ordinary Java objects and threads. Therefore,

Jessica enables thread migration. Java developers who are familiar with Java thread programming can quickly develop applications using Jessica,

and legacy Java thread applications can use Jessica transparently. JCL and the remaining related works run on top of a standard single-machine JVM; thus,

we mention Jessica due to its high level of transparency (eg, it requires no new instructions to develop an asynchronous distributed thread-based

application). Although Jessica supports distributed shared objects, it does not implement distributed data structures in the same way that JCL and

many of its counterparts do.

RAFDA49 is a reflective middleware that "permits arbitrary objects in an application to be dynamically exposed for remote access, allowing applications write without concern with distribution."49 RAFDA objects are exposed as Web services to provide distributed access to ordinary Java classes. Applications access RAFDA functionalities using infrastructure objects called RAFDA runtime (RRT). Each RRT provides two interfaces to application programmers, ie, one for local RRT access and the other for remote RRT access. With this approach, RAFDA introduces dependencies, and consequently, requires code refactoring. An RRT supports peer-to-peer communication; therefore, it is possible to execute a task in a specific cluster node. However, when developers need to submit several tasks to more than one remote RRT, they must implement a scheduler from scratch. RAFDA has no portable parallel/distributed versions.

TABLE 2 Requirements of JCL and its counterparts - Part 1

Tool                   Fault Tolerant   Refactoring Required   Simple Deploy   Collaborative   Portable Code
JCL10                  No               No                     Yes             Yes             Yes
Infinispan48           Yes              Low                    No              Yes             Yes
JPPF33                 Yes              No                     Yes             No              No
Hazelcast14            Yes              Low                    No              Yes             No
Oracle Coherence41     Yes              Medium                 Yes             Yes             NF1
RAFDA49                No               No                     Yes             Yes             No
PJ50                   No               Yes                    Yes             No              Yes
FlexRMI51              No               Medium                 No              No              No
RMI34                  No               Medium                 No              No              No
Gridgain16             Yes              Low                    No              Yes             No
ICE52                  Yes              High                   No              No              No
MPJ Express53          No               Medium                 No              No              Yes
Jessica54              NF1              No                     Yes             No              Yes
ProActive55            Yes              Medium                 NF1             No              No
FastMPJ56              No               Medium                 No              No              Yes
P2P-MPI57              Yes              High                   No              No              No
KaRMI58                No               NF1                    No              No              NF1
RMIX59                 No               NF1                    No              No              No
Open MPI60             Yes              High                   No              No              No
MPJava61               No               Medium                 No              No              No
Apache Ignite17        Yes              Low                    Yes             Yes             Yes

1 - NF: Not found


TABLE 3 Requirements of JCL and its counterparts - Part 2

Tool                   Task Cost   Super-Peer   Distributed Data Structures   Scheduler   Support Available
JCL10                  Yes         Yes          Yes                           Yes         Yes
Infinispan48           NF1         NF1          Yes                           Yes         NF1
JPPF33                 No          Yes          No                            Yes         Yes
Hazelcast14            NF1         NF1          Yes                           Yes         Yes
Oracle Coherence41     NF1         NF1          Yes                           Yes         NF1
RAFDA49                No          Yes          No                            No          No
PJ50                   No          No           No                            Yes         Yes
FlexRMI51              No          No           No                            No          No
RMI34                  No          No           No                            Yes         No
Gridgain16             No          Yes          Yes                           Yes         Yes
ICE52                  No          NF1          No                            Yes         NF1
MPJ Express53          No          No           No                            Yes         Yes
Jessica54              No          No           No                            Yes         NF1
ProActive55            NF1         Yes          No                            Yes         Yes
FastMPJ56              No          No           No                            Yes         Yes
P2P-MPI57              No          No           No                            Yes         Yes
KaRMI58                No          No           No                            No          NF1
RMIX59                 No          No           No                            No          No
Open MPI60             No          No           No                            Yes         Yes
MPJava61               No          No           No                            No          No
Apache Ignite17        NF1         No           Yes                           Yes         Yes

1 - NF: Not found


At the beginning of the 2000s, the FlexRMI51 middleware was developed to implement asynchronous remote method invocation using the

standard Java RMI API. As stated in Taveira,51 “FlexRMI is a hybrid model that allows both asynchronous or synchronous remote method

invocations.” Basically, FlexRMI alters the Java RMI stub and skeleton compilers to achieve high transparency. Similar to the standard RMI,

FlexRMI does not include a multi-core parallel version to achieve GPDC portability. Furthermore, it requires at least “java.rmi.Remote” and

“java.rmi.server.UnicastRemoteObject” extensions to produce an RMI application. Because FlexRMI does not implement a dynamic class

loading feature, all classes and interfaces must be stored in nodes before an RMI (and also a FlexRMI) application can be initiated, making deployment

a time-consuming process.

JPPF is an open source grid computing framework62 that simplifies the process of parallelizing applications that demand high processing power,

allowing developers to focus on their core software development.33 It implements a dynamic class loading feature for a cluster node but does not

support collaborative development, ie, different JPPF applications do not share methods and variables among them. JPPF includes four predefined

but customizable scheduler algorithms. Other requirements, such as fault tolerance and the possibility of interconnecting different networks via super-peers, make JPPF one of the most complete solutions in the literature.

ProActive55 “is an RMI-based middleware for parallel, multi-threaded, and distributed computing focused on grid applications.”9 In general, ProAc-

tive's use of RMI as its default transport layer adds significant overhead, but it is fault tolerant and includes transparent mobility and security

implementations.

The Parallel Java (PJ)50 solution implements several high-level programming abstractions, including ParallelRegion (in which code is executed in parallel), ParallelTeam (in which a group of threads executes a ParallelRegion), and ParallelForLoop (in which work parallelization is conducted among threads), which support an easy task-oriented programming model. Moreover, PJ is designed for hybrid shared/distributed memory systems such as multi-core clusters. It adopts a message-passing programming model for these clusters. Consequently, it eliminates the transparency of multi-core shared-memory access in multi-computer environments due to communication particularities. The middleware implements the concept of a tuple space18 but not in a distributed manner. It includes an API for GPU devices that offers Cuda63 transparent services.

Infinispan, developed by JBoss/RedHat,48 is a popular open source distributed in-memory <key, value> pair data store solution64 that enables accessing a cluster in two ways, ie, (i) via an API available in a Java library and (ii) via several protocols, such as HotRod, REST, Memcached, and WebSockets,65 making Infinispan a language-independent solution. In addition to storage services, the middleware can execute tasks remotely and asynchronously; however, developers must implement the Runnable or Callable Java interfaces. Furthermore, these tasks must be registered in the JVM classpath of each cluster node, which can delay the deployment process.

Hazelcast14 is a well-established middleware in industry. It offers the concepts of functions, locks, and semaphores. According to Veentjer et al,14 "Hazelcast provides a distributed lock implementation and makes it possible to create a critical section within a cluster of JVMs; so only a single thread from one of the JVMs in the cluster is allowed to acquire that lock." In addition to an API for asynchronous remote tasks, Hazelcast includes a simple API for storing objects in a computer grid. Unfortunately, it does not separate business logic from distribution issues; therefore, code refactorings are mandatory. It has a manual scheduling alternative for processing and storage services; thus, developers can select the cluster node on which to store data or run an algorithm. The middleware does not implement a dynamic class loading feature; therefore, it is necessary to manually deploy each developer class in each member of a cluster, making deployment a time-consuming activity. Hazelcast's List, Set, and Queue data structures are fault tolerant but not distributed, ie, only the Map data structure is both distributed and fault tolerant.

Oracle Coherence is an in-memory data grid commercial middleware that offers database caching, HTTP session management, grid agent invocation, and distributed queries.41 It provides an API for all services and includes an agent deployment mechanism. Thus, it also has a dynamic class loading feature, but the individual agents must implement the EntryProcessor interface. Consequently, code refactoring is mandatory. Single-board computers with Linux support, such as Raspberry Pi, Omega2, and Cubieboard, can be adopted for general-purpose computing, but Oracle typically does not design GPDC products for small platforms.

Apache Ignite is an open source in-memory data fabric written in Java that provides native support for other programming languages, such as dotNet, C++, and PHP.17 Supported by the Apache Software Foundation,66 Apache Ignite is integrated with Spark and Hadoop.67 Similar to JCL, Apache Ignite implements both synchronous and asynchronous remote tasks, utilizes a distributed <key, value> pair storage, and is simple to deploy. It supports SQL queries as an alternative approach for retrieving data.

Taboada et al8 presented a Java HPC survey that cataloged middleware systems and libraries classified as shared memory, socket-based, RMI, and

message-passing solutions. The middleware systems in the study were tested in two shared-memory environments and on two InfiniBand multi-core

clusters using the NAS Parallel Benchmarks (NPB)68 benchmark. The results demonstrated that the Java language reached performance levels

similar to those of natively compiled languages.

Programming GPU clusters via a DSM abstraction offered by a middleware layer is a promising solution for some specific problems, eg, SIMD

problems. An extension of the Pleiad middleware24 was implemented in the work of Karantasis and Polychronopoulos,23 enabling Java developers

to work with a local GPU abstraction over several nodes equipped with between one and four GPU boards each.

4 JCL ARCHITECTURE

This section details the architecture of the proposed JCL middleware and how it implements most of the requirements listed in Table 1. There are two

versions of JCL, ie, a multi-computer or cluster version and a multi-core version. The multi-computer version stores objects and launches tasks to

invoke methods over a cluster or multiple clusters, and all communication takes place over Ethernet protocols (TCP and UDP, precisely) or InfiniBand, as Java is portable across both.69 Unfortunately, native Java support for InfiniBand occurs via SDP technology, which is still much slower than other low-level native libraries. Thus, Java has no native HPC support and requires third-party libraries such as the Direct Storage and Networking Interface (DiSNI).70

The multi-computer version has a hybrid distributed architecture, ie, it adopts a client-server behavior to provide location and registration ser-

vices, but it also adopts a peer-to-peer (P2P) architectural style to provide processing and storage services. In contrast, the multi-core version,

also present in the multi-computer version, turns the JCL-User component into a local JCL-Host component without the overhead of network

communications. All objects and tasks are stored and executed locally in a user's machine. All JCL applications are portable across both versions.

The architecture of the JCL distributed version includes four main components, ie, JCL-User, JCL-Server, JCL-Super-Peer, and JCL-Host,

while the parallel version includes only two, ie, JCL-User and JCL-Host. JCL implements the Java Map interface to run over clusters

(JCL-HashMap) and includes safe <key, value> pair locking for concurrent accesses.

The JCL-User component is designed to expose the middleware services via a unique API, and it provides an important phase of the scheduling strategy and automatic version selection based on user configurations. The JCL-Server component is designed to manage the cluster and is responsible for receiving the information from each JCL-Host and distributing it to all registered JCL-User components, enabling them to establish P2P communication with each JCL-Host. The JCL-Host component stores objects and invokes registered methods. It also stores JCL-HashMap <key, value> pairs. Finally, it solves the second phase of the JCL scheduling solution. The JCL-Super-Peer component is responsible both for managing a cluster under its control, operating like a JCL-Server, and for creating tunnels in conjunction with the JCL-Server through which data and commands are passed to JCL-Hosts in networks with invalid IPs. The JCL-Server component works as a coordinator; thus, it must be visible to other components such as JCL-Hosts, JCL-Users, and JCL-Super-Peers.

In the following sections, we present how JCL addresses solutions to most of the requirements presented in Table 1, specifically deployment,

refactoring, scheduling, distributed storage, collaboration, portability, multi-cluster/grid support, and task cost tracing.


FIGURE 1 JCL multi-computer deployment view

FIGURE 2 JCL multi-computer deployment view

4.1 Simple deployment

Deployment is a time-consuming process in most middleware systems. In some cases, the system must be rebooted to deploy a new application. JCL

adopts a simple deployment process based on both the reflection capabilities of Java and the adoption of discovery services.

The JCL simple deployment process is illustrated in Figures 1 to 3. Only one JCL-Server exists for each JCL deployment, and it must be deployed first, as it registers and manages the remaining components (Figure 1A). Furthermore, at least one JCL-Host component must be deployed after the JCL-Server to guarantee that other JCL components will be registered correctly; this component is deployed in the same network as that of the JCL-Server. JCL supports one or many JCL-Hosts per cluster, as shown in Figure 1B.

Steps three and four of Figure 2 are optional, ie, we require them when interconnecting different data networks or when creating logical clusters according to specific needs, such as a group of JCL-Hosts to support machine learning services or to collect sensing data from a smart building's garden. JCL-Super-Peer component deployment takes place in a network gateway (Figure 2) or in the same network as that of the JCL-Server to create logical groups of JCL-Hosts. Several JCL-Hosts can be deployed after a JCL-Super-Peer deployment, as shown in Figure 2. Several JCL-Super-Peers are feasible in a JCL multi-cluster or grid environment. Besides, nested JCL-Super-Peers can be deployed to produce a hierarchical network topology. The hierarchical topology in Figure 2 shows a JCL-Super-Peer inside a network managed by another JCL-Super-Peer. This network topology can be useful in many scenarios; for instance, a house cluster may contain a garden cluster, and the garden cluster may contain a swimming pool cluster. For this situation, three JCL-Super-Peers could be interconnected to form a hierarchical or tree topology.

FIGURE 3 JCL multi-computer deployment view

In Figure 3, several JCL-User components are deployed on different types of machines (desktops and laptops). We assume that each machine is

running a different User Application (UA).

Following the deployment steps, multiple JCL-Users can run their applications, sharing JCL abstractions (registered modules, maps, and global variables) without cluster reboots. When updating a previously registered module, JCL requires only a new registration API call to perform all new registrations in the cluster. Thus, the execution of the middleware does not need to be stopped, even in live update scenarios. A selective registration approach exists, in which only the JCL-Hosts that will execute a UA must register it before its first execution. This approach avoids having to register modules in the entire cluster each time.

The JCL discovery services implementation provides another useful advantage, as a JCL-Host can become a JCL-Server if it is deployed first, ie, if a JCL-Host tries to find a JCL-Server and fails, it becomes both a JCL-Server and a JCL-Host. When a JCL-Server is deployed later, it assumes control of the cluster, removing that responsibility from the JCL-Host that was previously functioning as a server. Furthermore, JCL components find each other in a network; thus, the manual configuration of each component, which is a time-consuming activity that is impractical in dynamic cluster deployments, can be avoided.

4.2 Code refactoring

One goal of JCL is to separate business logic from distribution issues. Typically, existing middleware solutions force their users to implement several

interfaces to guarantee distributed storage or asynchronous distributed tasks. In JCL, the developers do not need to implement such interfaces because JCL adopts Java reflection to avoid code refactorings of existing and well-tested methods, variables, components, or algorithms.

To explain this feature, we use the ubiquitous “Hello World” application. Figure 4 depicts a class with a sequential method named print that

represents part of the business logic of a UA. In this example, the method prints the sentence “Hello World!”; however, the concept works the same

for any other demand.

Note that JCL does not automatically partition the user business logic into several distributed tasks. Therefore, the print method is never automatically partitioned by JCL; print is only allocated and executed by JCL in a deployment. Figure 5 illustrates how JCL achieves distribution for an existing sequential code. At line four, the user obtains an instance of JCL, and at line five, the class "HelloWorld" is registered; subsequently, it is visible to the entire JCL cluster. At line six, JCL starts one task per JCL-Host in the cluster to enable execution of the print method of the registered class. The JCL executeAll method requires the class nickname ("Hello"), the method to be executed ("print"), and the arguments to the method (or null if no arguments are required).

FIGURE 4 Business logic: Hello World


FIGURE 5 Distribution logic: Hello World
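The code of Figures 4 and 5 is not reproduced in this transcript. The following is a minimal sketch of what the business logic and the JCL distribution calls described above might look like. The factory and registration signatures are assumptions for illustration; only the method names register and executeAll, the nickname "Hello", and the separation into a plain class plus a small distribution class come from the text.

// Business logic (as in Figure 4): a plain sequential class with no JCL dependency.
class HelloWorld {
    public void print() {
        System.out.println("Hello World!");
    }
}

// Distribution logic (as in Figure 5): register the class under a nickname and start one
// task per JCL-Host. Factory and register signatures are hypothetical.
class JCLHelloWorld {
    public static void main(String[] args) {
        JCL_facade jcl = JCL_FacadeImpl.getInstance();   // hypothetical: "the user obtains an instance of JCL"
        jcl.register(HelloWorld.class, "Hello");         // hypothetical signature: class plus nickname
        jcl.executeAll("Hello", "print", null);          // nickname, method name, arguments (null = none)
    }
}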

FIGURE 6 Distribution logic: complex Hello World

Another way to work with JCL is to execute distributed tasks remotely. For example, we can use JCL to start other JCL tasks, as shown in Figure 6.

The main application sends 10 parallel tasks that represent JCL distributed tasks, as shown in Figure 5. In Figure 6, the UA requests a JCL instance

at line five, and it registers a Jar file containing the “JCLHelloWorld” (Figure 5) and “HelloWorld” (Figure 4) classes at line seven. Finally, at line ten,

the JCL instance launches 10 tasks by calling its execute API method.

The executeAll and execute methods are asynchronous; therefore, the UA can execute other code while waiting for the results. When we call these methods, the JCL kernel assigns a thread, named "worker," to handle the task's execution in a specific JCL-Host. The JCL-Host initially starts as many "worker" threads as the number of cores available; it adds any remaining tasks to a "worker" queue to wait for execution. This strategy can generate deadlocks because a running task can be waiting for a task that is not running (ie, in the queue). To avoid this problem, the JCL-Host starts a new task whenever all the running tasks have been in the wait state for a given time. This strategy causes more thread yields, and consequently, more context switches, but it avoids deadlocks because the JVM always guarantees that a CPU is available for all started threads. This strategy also reduces the number of tasks in the "worker" queue, reducing the possibility of moving tasks from overloaded JCL-Hosts to less-loaded ones.
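The deadlock-avoidance idea above can be sketched as follows. This is not JCL's internal code; it is only an illustration, with all names assumed, of starting an extra worker when every current worker has been blocked in a wait state and tasks are still queued.

import java.util.List;
import java.util.concurrent.*;

// Illustrative sketch: workers pull tasks from a queue; a monitor starts one more worker
// whenever all current workers are in the WAITING state while tasks remain queued.
class WorkerPoolSketch {
    private final BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
    private final List<Thread> workers = new CopyOnWriteArrayList<>();

    void start() {
        int cores = Runtime.getRuntime().availableProcessors();
        for (int i = 0; i < cores; i++) workers.add(spawnWorker());
        ScheduledExecutorService monitor = Executors.newSingleThreadScheduledExecutor();
        monitor.scheduleAtFixedRate(() -> {
            boolean allWaiting = workers.stream()
                    .allMatch(t -> t.getState() == Thread.State.WAITING);
            if (allWaiting && !queue.isEmpty()) workers.add(spawnWorker()); // break the deadlock
        }, 100, 100, TimeUnit.MILLISECONDS);
    }

    private Thread spawnWorker() {
        Thread t = new Thread(() -> {
            try {
                while (true) queue.take().run();   // block until a task is available, then run it
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        t.start();
        return t;
    }

    void submit(Runnable task) { queue.add(task); }
}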

4.3 Scheduling

JCL adopts different strategies for scheduling processing and storage services in a cluster. More precisely, the scheduling of tasks, global variables,

the JCL-HashMap, and their <key, value> pairs occurs in two ways.

JCL adopts a two-phase distributed solution to schedule a task that invokes a method or a group of methods. In the first phase, the JCL-User

component quickly dispatches a task to a JCL-Host using a circular list of JCL-Hosts. The list of JCL-Hosts is obtained from the JCL-Server,

which is responsible for notifying the cluster members when changes occur. After selecting a JCL-Host, the JCL-User determines the number

of tasks per JCL-Host using its information regarding the number of available cores provided by the JCL-Server. The JCL-User can group the

tasks into chunks before submitting them if the UAs explicitly call the jcl.executeAll API service or if the users configure a property file for that

purpose.

The number of chunked tasks needed to invoke methods remotely and asynchronously is always proportional to the number of cores available

in each JCL-Host; thus, the JCL-Host is designed to work with different nodes in a cluster. JCL implements a watchdog to handle UAs whose number of method invocation calls is not proportional to the chunk size. A watchdog is a thread that wakes up every 100 milliseconds. At each run,

it flushes the chunk regardless of the number of processing calls on it.
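A sketch of the chunk watchdog described above: a periodic thread that flushes whatever calls have accumulated in the current chunk, regardless of how full it is. The buffer and dispatch mechanics are assumptions; only the 100 millisecond period comes from the text.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Illustrative watchdog: tasks are buffered into a chunk and normally flushed when the chunk
// is full; the watchdog flushes a partially filled chunk every 100 ms so that submissions are
// never held back indefinitely. All names are assumed for illustration.
class ChunkWatchdogSketch<T> {
    private final List<T> chunk = new ArrayList<>();
    private final int chunkSize;
    private final Consumer<List<T>> dispatcher;

    ChunkWatchdogSketch(int chunkSize, Consumer<List<T>> dispatcher) {
        this.chunkSize = chunkSize;
        this.dispatcher = dispatcher;
        ScheduledExecutorService watchdog = Executors.newSingleThreadScheduledExecutor();
        watchdog.scheduleAtFixedRate(this::flush, 100, 100, TimeUnit.MILLISECONDS);
    }

    synchronized void add(T task) {
        chunk.add(task);
        if (chunk.size() >= chunkSize) flush();    // normal path: full chunks go out immediately
    }

    synchronized void flush() {
        if (chunk.isEmpty()) return;
        dispatcher.accept(new ArrayList<>(chunk)); // hand the current chunk to the dispatcher
        chunk.clear();
    }
}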

In the second scheduling phase, the JCL-Host components collaborate with each other to better balance the JCL cluster workload. After executing its last task, ie, when its last "worker" thread finishes execution, each JCL-Host attempts to obtain and execute a new task from other threads in the JCL cluster. Each time, it obtains only one task from a "worker" queue to avoid new redistributions in the cluster. This collaborative behavior mitigates problems caused by the circular list scheduler, implemented in the JCL-User. Therefore, even non-deterministic heuristics can be scheduled efficiently in JCL, requiring few task replacements and dramatically reducing the runtimes, as demonstrated by the experiments.


When using a circular list scheduling technique, a scenario in which one JCL-Host receives most of the CPU-bound tasks can occur. In JCL, the second phase redistributes these tasks with all the other JCL-Host "worker" threads. A JCL-Host that addresses a task from another JCL-Host must notify the JCL-User component to update its control data, as it contains the task metadata for each task, including the JCL-Host that addresses it. However, to avoid architectural bottlenecks, the JCL-Server component is not notified after JCL-Host scheduler decisions.

Patel et al71 classified this load balancing technique as a neighbor-based approach because it is a dynamic load balancing technique in which nodes transfer tasks among their neighbors. Consequently, after some iterations, the whole system is balanced without having to introduce a global coordinator. As mentioned earlier, when a deadlock is detected in a JCL-Host (ie, a task remains in the wait state for a long time), the JCL-Host initiates new queued tasks to eliminate the existing deadlock; consequently, the number of tasks in the "worker" queue is reduced, which also reduces task replacement.

To schedule global variables, maps, and map <key, value> pairs, the JCL-User component calculates

F = Remainder(|hash(vn)| / nh),     (1)

where hash(vn) is the hashcode of the global variable name, nh is the number of JCL-Hosts, and F is the remainder of the division corresponding

to the node position. Equation 1 is used to determine the JCL-Host in which they will be stored and to perform a fair distribution. JCL adopts the

default Java hash code for strings and primitive types, but user-typed objects require a hashcode implementation.

Experiments with incremental global variable names such as “p_ij” or “p_i,” where i and j are incremented for each variable and p is any prefix,

showed that F achieves an almost uniform distribution over a cluster in several scenarios with different variable name combinations. However, F

does not guarantee a uniform distribution in all scenarios. For this reason, the JCL-User component introduces a delta (d) property that normally

ranges from 0 to 10% of nh. The delta property relaxes the result of function F, enabling two or more alternative JCL-Hosts to store a global

variable.

One drawback introduced by d is that JCL must check (2 ∗ d) + 1 nodes to search for a stored object, ie, if d is equal to 2, JCL must check five

nodes (two before and two after the JCL-Host identified by function F in the logical ring). Therefore, JCL performs parallel checks to reduce this

overhead. Experiments demonstrated that the extra communications introduced by the parallel checks are compensated for when compared with

only two sequential checks, which is possible when a check of the first node does not yield the desired object.
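A sketch of how the host index of Equation (1) and the delta-relaxed candidate set can be computed is shown below. Math.floorMod stands in for the absolute value plus remainder in the equation (it avoids the Integer.MIN_VALUE corner case of Math.abs); the class and method names are assumptions for illustration.

import java.util.ArrayList;
import java.util.List;

// Illustration of Equation (1) and the delta (d) relaxation from Section 4.3 (names assumed).
final class PlacementSketch {
    // F = remainder of |hash(variableName)| divided by the number of hosts.
    static int hostIndex(String variableName, int numberOfHosts) {
        return Math.floorMod(variableName.hashCode(), numberOfHosts);
    }

    // With delta d, the object may live on any of the (2 * d) + 1 hosts around F in the logical
    // ring, so a lookup checks these candidates (in parallel, in JCL's case).
    static List<Integer> candidateHosts(String variableName, int numberOfHosts, int delta) {
        int f = hostIndex(variableName, numberOfHosts);
        List<Integer> candidates = new ArrayList<>();
        for (int offset = -delta; offset <= delta; offset++) {
            candidates.add(Math.floorMod(f + offset, numberOfHosts));
        }
        return candidates;
    }
}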

Equation 1 also introduces a problem for the scheduler when a new JCL-Host enters or exits the cluster. In this scenario, the location of the

previously stored global variable changes. To avoid storage replacements, which can become a time-consuming activity, the JCL-Server and the

JCL-User components maintain all previous cluster sizes after the first object instantiation in JCL (a global variable or a JCL-HashMap instantiation, to be precise). Thus, F can be applied to each cluster size, increasing the number of parallel checks in the network and reducing their benefits

but avoiding the need to replace large objects very often. Typically, this compensatory strategy is sufficient. However, users can opt to replace all

objects after all cluster changes by modifying a property file.

4.4 Distributed hash map

JCL has a distributed implementation of the Java Map interface, which allows users to adopt a data structure familiar to the Java community and

requires minimal refactorings of existing Java code. In general, users replace a Map sub-type object (eg, tree-map or hash map) with a JCL-HashMap. This replacement turns storage that was previously local into distributed storage over a cluster of multi-core nodes.

Internally, when the UA stores or requests a <key, value> pair of a map object using the put(key, value) or get(key) methods, respectively, the

object key hash is calculated, and the location of the object is acquired using the function F described in Section 4.3, which returns the JCL-Host

in which the value is stored.

Each JCL-HashMap object has a single identifier that was provided by the UA at its creation. Therefore, any JCL UA can gain access to

a JCL-HashMap previously created in the cluster. Multiple JCL-HashMaps can have identical keys in the same cluster; however, different

JCL-HashMaps should have distinct identifiers to avoid overlaps. To efficiently implement some of the Map interface methods, such as clear(),

containsKey(Object key) and containsValue(Object value), a list of all the key hashcodes is cached in a single JCL-Host.

To traverse the map items, the JCL-HashMap provides a new iterator implementation that initially identifies and gathers all the keys of a map

belonging to a single JCL-Host in bins to optimize data transfer. Then, it sends the first bin to the JCL-User component, and after reading 40% of

the<key, value>pairs already obtained by theJCL-User, the next bin is submitted until there are no more bins. The value of 40% was chosen empir-

ically after conducting numerous experiments with various types of objects for the<key, value>pairs. Bin pre-fetching is of fundamental importance

to guarantee that large maps can be traversed without stopping due to communication between the JCL-User and the JCL-Host components.

This iterator strategy is very efficient, but it does not guarantee sorted traversals according to a key order. In summary, the JCL-HashMap is an

unsorted distributed map implementation similar to many market leaders, such as Hazelcast, Gridgain, Infinispan and Apache Ignite.

Distributed mutual exclusion is also implemented at the level of individual keys, ie, the UAs can call the getLock(key) method, which guarantees safe and exclusive access to the value that represents the key. While one UA thread is manipulating the value, another thread cannot write to the object. It calls the putUnlock(key, value) method to unlock the object and allow access to other JCL cluster threads. The put(key, value) method of the JCL-HashMap is always thread-safe; however, the get(key) method returns a value without blocking.


FIGURE 7 User application one in machine one

FIGURE 8 User application two in machine two

4.5 Collaborative development

JCL UAs can share compiled modules, global variables, and maps without explicit references. Thus, a UA A1 in a node can access an object instantiated

by another application, A2, using only its nickname. By introducing this requirement, UAs worldwide can share algorithms, data structures, and the

computational power of multiple clusters.

To exemplify the collaborative behavior of JCL, consider one UA starting a JCL-HashMap named “Test” in line one of Figure 7 and storing two

<key, value> pairs in lines two and three, respectively. UA two can recover the JCL-HashMap named “Test” in line one and print the values of keys

“1” and “2” in lines two and three of Figure 8. It can also input other values, as illustrated in line four. The UA can also lock an entry of an existing

map and update its value, as demonstrated in lines five, six, and seven of Figure 8. The execution of registered methods as tasks and instantiation of

global variables in JCL clusters follow the same collaborative idea.
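Figures 7 and 8 are not reproduced in this transcript; a minimal sketch of the behaviour they illustrate follows. The JCL_HashMap type and the way it is obtained by name are assumptions; only the method names put, get, getLock, and putUnlock, and the map nickname "Test", come from the text.

// Hypothetical JCL_HashMap<K, V> implementing java.util.Map plus the getLock/putUnlock
// methods described in Sections 4.4 and 4.5.
class CollaborativeMapSketch {
    // User application one (as in Figure 7): create the distributed map "Test" and store two pairs.
    static void userApplicationOne() {
        JCL_HashMap<String, String> test = new JCL_HashMap<>("Test"); // hypothetical constructor taking the map name
        test.put("1", "first value");
        test.put("2", "second value");
    }

    // User application two (as in Figure 8), possibly on another machine: recover the map by its
    // name, read and add values, then update one entry under the distributed <key, value> lock.
    static void userApplicationTwo() {
        JCL_HashMap<String, String> test = new JCL_HashMap<>("Test"); // same nickname, same distributed map
        System.out.println(test.get("1"));
        System.out.println(test.get("2"));
        test.put("3", "third value");
        String value = test.getLock("1");            // exclusive access; assumed to return the current value
        test.putUnlock("1", value + " (updated)");   // write back and release the lock
    }
}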

4.6 Parallel/distributed portability

JCL was built in Java and is portable to any JVM that meets Oracle specifications. Thus, JCL can run over not only massive multi-core nodes but also

clusters composed of heterogeneous nodes. A JCL cluster can also be composed of single-board computers compatible with the Oracle JVM. In this

way, similar to the solution of Congosto et al,72 JCL can run on Raspberry Pi, Galileo, Cubieboard, and many other small devices.

In addition to JVM portability, JCL introduces the concept of GPDC portability, ie, any JCL UA can adopt either the multi-core or multi-computer

version without changes. The options for instantiating a JCL-HashMap, as well as those for invoking methods and instantiating/storing objects, are

fully compatible with both versions. To achieve this requirement, a single component with a single API access to both JCL versions is mandatory, for

which the JCL-User component is responsible. Users select which JCL version to launch via a property file or an API by calling static methods to

obtain parallel or distributed versions. In the literature, JCL is the only option that has such GPDC portability.

4.7 Multi-cluster/grid support

JCL adopts the super-peer concept by introducing the JCL-Super-Peer component to create the network interface and add the capability to partition a cluster into logical groups. It has two internal components, ie, the first behaves as a JCL-Server component (referred to as a JCL-Super-Peer-Server) for a given network, while the second behaves as a JCL-Host component (referred to as a JCL-Super-Peer-Host) for a network in which a JCL-Server or other JCL-Hosts and JCL-Super-Peers are deployed.

The JCL-Super-Peer-Server component receives requests from JCL-Hosts or from other JCL-Super-Peers. The

JCL-Super-Peer-Server for a particular JCL cluster stores all the information pertinent to its domain. When a JCL-Super-Peer-Host

receives a storage request, it redirects the request to the JCL-Super-Peer-Server component, which calculates function F by considering only the nodes

under its control. We locate an object in a multi-cluster/grid environment via only two F calculations, ie, the first executes in the JCL-User to

determine which JCL-Host to select; then, if the selected JCL-Host is a JCL-Super-Peer, a second calculation of F is performed to determine

where it stores the object. The same idea is adopted by JCL to store <key, value> map pairs over multiple clusters.

When a JCL-Super-Peer-Host receives a request to launch a task that invokes a method or a group of methods, it selects a JCL-Host from its domain to perform the launch. Thus, the JCL-Super-Peer adopts the same two-phase scheduling mechanism to find a JCL-Host. The second phase of the JCL task scheduling technique does not migrate tasks between clusters administered by different JCL-Super-Peers; instead, the collaboration occurs only among JCL-Hosts in the same cluster.

One of the significant challenges of the JCL-Super-Peer is to provide interconnections between different networks transparently, ie,

without any additional configuration. To accomplish this, the JCL-Super-Peer establishes a set (defined by the user) of connections with the

JCL-Server or other JCL-Super-Peers. Such connections form tunnels for transmitting data and commands between networks. Thus, JCL


supports sub-networks with invalid IPs, which, as discussed in the work of Perera et al,73 are quite common in IoT, without requiring any extra

JCL-Super-Peer configuration.

4.8 Task cost

During task handling, JCL collects all information regarding the time spent to invoke class methods. In this case, the UAs need only the Java Future

object obtained from a submitted task, as illustrated in Figure 5. After retrieving the task result, users can obtain the collected time data using the

getTaskTimes(ticket) API method. As mentioned earlier, a Future object represents the result of an asynchronous computation in Java. JCL

is the only option from the Java middleware literature that provides the task cost as a service through its API.

JCL returns a list containing either six or eight time values. The six values compose the timeline of a task that has not changed its JCL-Host due to phase two of the scheduler, in which the JCL-Hosts cooperate. JCL records each time value at one of the following steps during task execution:

1. Immediately before the JCL-User component sends the task to the destination JCL-Host;

2. When the task reaches the JCL-Host;

3. When task execution begins;

4. When task execution ends;

5. When the result leaves the JCL-Host; and

6. When the result arrives at the JCL-User component.

By analyzing the timeline of these six time values, a variety of elapsed times can be calculated, such as the network time, queue time, or time spent in a JCL-Host. All six time values are available via the JCL API, and we calculate the derived times as follows:

Total time = timeline(6) − timeline(1)

Queue time = timeline(3) − timeline(2)

Execution time = timeline(4) − timeline(3)

Result retrieval time = timeline(5) − timeline(4)

Time during which a result remains in the JCL-Host = timeline(5) − timeline(2)

Network time = ((timeline(6) − timeline(1)) − (timeline(5) − timeline(2))).
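Assuming getTaskTimes(ticket) returns these six timestamps as a list of longs (the element type is an assumption), the derived values above can be computed as in the following sketch.

    import java.util.List;

    public class TaskCostSketch {
        // t holds the six-value timeline of a task that stayed on its original JCL-Host;
        // indices are zero-based here, whereas the formulas above use one-based positions.
        static void printDerivedTimes(List<Long> t) {
            long total     = t.get(5) - t.get(0);   // timeline(6) - timeline(1)
            long queue     = t.get(2) - t.get(1);   // timeline(3) - timeline(2)
            long execution = t.get(3) - t.get(2);   // timeline(4) - timeline(3)
            long result    = t.get(4) - t.get(3);   // timeline(5) - timeline(4)
            long inHost    = t.get(4) - t.get(1);   // timeline(5) - timeline(2)
            long network   = total - inHost;        // last equation above
            System.out.printf("total=%d queue=%d exec=%d result=%d host=%d net=%d%n",
                    total, queue, execution, result, inHost, network);
        }
    }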

The list of time values returned by JCL can also contain eight values. The two additional time values in a task timeline are due to the collaborative scheduler. In this case, a new JCL-Host receives a task that was moved for execution. Under task replacement, we record the eight time values (all available through the API) as follows:

1. Immediately before the JCL-User component sends the task to the destination JCL-Host;

2. When the task reaches the first JCL-Host;

3. When the task leaves the first JCL-Host;

4. When the task reaches the second JCL-Host;

5. When task execution begins;

6. When task execution ends;

7. When the result leaves the second JCL-Host; and

8. When the result arrives at the JCL-User component.

The analysis of the timeline when a second JCL-Host exists is slightly different, as the elapsed time in the first JCL-Host is added. The equations

used to determine the different time values are as follows:

Total time = timeline(8) − timeline(1)

Queue time = (timeline(3) − timeline(2)) + (timeline(5) − timeline(4))

Execution time = timeline(6) − timeline(5)

Result retrieval time = timeline(7) − timeline(6)

Time during which the task remains in the first JCL-Host = timeline(3) − timeline(2)

Time during which the task remains in the second JCL-Host = timeline(7) − timeline(4)

Network time = ((timeline(8) − timeline(1)) − (timeline(3) − timeline(2)) − (timeline(7) − timeline(4))).

The overhead introduced to acquire all these time values is approximately 320 microseconds; therefore, it is advantageous to use this service

in most GPDC scenarios. Specific API methods for collecting the service time, queue time, network time and all other times exist, thus avoiding


the need to calculate them from scratch. Specifically, these methods are getTaskTotalTime, getTaskQueueTime, getTaskExecutionTime, getTaskResultTime, getTaskTimeHost1, getTaskTimeHost2, and getTaskNetworkTime in the JCL API. The memory consumption of

each task can also be obtained from JCL via calls to getTaskMemory.

5 EXPERIMENTAL EVALUATIONS

The objective of this section is to evaluate the JCL middleware in several distinct scenarios. Both performance and scalability requirements are

addressed in this section. The experiments did not involve a GPDC benchmark; instead, we adopted ideas similar to those presented in related works, such

as RAFDA,49 ProActive,55 Hazelcast,14 Gridgain,16 and Infinispan.48 By analyzing all the related works, which are illustrated in Tables 2 and 3, we can

conclude that 25% adopted a benchmark to evaluate their solutions and that 75% did not; moreover, all of them produced remarkable and conclusive

results. JCL implementations for benchmarks such as the NPB68 are planned as future work, as detailed in the conclusions section. Many graphs in this section use the number of Hosts, which indicates an increase or decrease in the number of JCL-Hosts. There is only one JCL-Host per machine in all experiments.

In the first scenario, the JCL task execution and global variable storage were evaluated against the RMI synchronous and asynchronous versions

and the Apache Ignite solution. The evaluation considered two metrics: the speedup when they call different task API methods, and the throughput when

they call global variable API methods. The speedup considered a sequential version as the baseline and not the single-thread parallel version to avoid

multi-core overheads. We divide the first scenario into six tests, ie, (i) the execution of four types of tasks (void, sorting, CPU-bound, and a task with

a user-typed argument), (ii) the instantiation of global variables, (iii) the instantiation of map variables, (iv) the iterator cost over a distributed map,

(v) the storage of items in a map, and (vi) the map values retrieval cost. We adopt the JCL multi-computer version for the first experimental scenario,

and all the tested middleware solutions used Ethernet protocols.

In the second scenario, we compare the JCL multi-core speedup with a version using the Java thread implementation provided by Oracle. In the third scenario, we evaluate the JCL-Super-Peer component overhead and discuss the results. In the fourth scenario, we tested JCL by executing an optimization solver, with the goal of finding promising input data for specific optimal results. In the fifth scenario, experiments with a non-deterministic

solver for a real-world combinatorial problem were conducted to evaluate how efficiently JCL schedules non-deterministic tasks.

In the fourth and fifth scenarios, the COIN-OR branch-and-cut (CBC)74 solver was used. CBC is an open-source C++ tool for solving com-

binatorial optimization problems modeled as mixed-integer linear programs. The CBC combines a cutting-plane method with a

branch-and-bound algorithm75; thus, it is suitable for solving a large number of integer programming problems.75

The sixth scenario evaluates JCL while scheduling tasks to invoke a solver method proposed by Paiva and Carvalho76 for the Minimization of Tool

Switches Problem (MTSP), which is an NP-hard problem.77 Currently, this method is the state of the art for solving the MTSP,

and it works as follows. First, it generates an initial solution using a new constructive heuristic based on a graph search. This initial solution is then improved by an implementation of the traditional Iterated Local Search78 metaheuristic, “which consists of repeatedly applying local search methods and randomly modifying a solution until it reaches a stopping condition.”76 The termination condition adopted here was

200 iterations of the metaheuristic method.
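The quoted loop corresponds to the textbook iterated local search skeleton sketched below; this is a generic outline under the 200-iteration stopping condition, not the actual MTSP implementation of Paiva and Carvalho.76

    public class IteratedLocalSearchSkeleton {
        interface Solution { double cost(); }
        interface Heuristics {
            Solution constructInitial();          // constructive heuristic (graph-search based)
            Solution localSearch(Solution s);     // improve until a local optimum
            Solution perturb(Solution s);         // random modification of the incumbent
        }

        static Solution solve(Heuristics h, int maxIterations) {
            Solution best = h.localSearch(h.constructInitial());
            for (int it = 0; it < maxIterations; it++) {      // 200 iterations in the paper
                Solution candidate = h.localSearch(h.perturb(best));
                if (candidate.cost() < best.cost()) {
                    best = candidate;                          // accept only improving solutions
                }
            }
            return best;
        }
    }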

These scenarios emphasize the applicability of JCL. Specific scenarios, such as HPC integrated with IoT applications79 or optimization in industrial infrastructure,80 are feasible to implement with Java Cá&Lá; however, some application-specific modifications to the JCL engine are required.

5.1 Speedup and throughput experiments

This set of experiments was conducted using the JCL multi-computer version, Java RMI from the Oracle java.rmi package, and Apache Ignite version 1.9.0 with its default properties. The experiments were run on a desktop cluster composed of 15 nodes. All machines were equipped with

Intel(R) Core(TM) i5-2500 3.3 GHz processors (4 physical cores) and 4 GB of DDR3 1333 MHz RAM. The operating system was Ubuntu 16.04

LTS 64-bit kernel 4.4, and all experiments fit into the RAM. We repeat each experiment five times. We calculate the average times and standard

deviations. Each task had a unique method invocation. The goal of these experiments was to demonstrate JCL scalability and to compare its results against a standard (RMI) and a market leader (Apache Ignite). Furthermore, we also evaluate how uniform the function F (presented in Equation (1)) can be when both

incremented global variable names and random names are adopted.

In the first round of experiments, we measured the speedup of JCL asynchronous and RMI asynchronous and synchronous remote method invo-

cations (Figure 9) when the number of nodes increased (precisely, clusters with 5, 10, and 15 nodes). For each test, we fixed the number of remote

method invocations to 10³ executions. The experiments comprised four different methods, ie, (i) a method without an argument (Figure 9A); (ii) a

method with an integer argument, where the task is the generation and sorting of one million integers (Figure 9B); (iii) a method that takes an array

of strings and two integer values as arguments and executes the Levenshtein distance algorithm, Fibonacci series, and prime number algorithms

(Figure 9C); and (iv) a void method that uses a book as an argument, where a book is a user-typed object comprising authors, editors, editions, pages,

and year attributes. The book constructor forms a list of objects to produce the references, simulating a crawler activity that obtains references

from the Web (Figure 9D). As mentioned before, each task encapsulates a single method call, so we also measured task speedups. The speedup is


FIGURE 9 Task execution experiments

calculated as s = seq∕par, ie, the speedup s is the ratio of the sequential runtime seq to the runtime par of the multi-computer or cluster version. Ideally, the speedup grows linearly as the number of cluster nodes increases, but the observed curves are normally logarithmic.

The results demonstrated that, for non-CPU-bound tasks (Figure 9A), RMI is far more efficient than JCL. This result occurred because RMI required less data transmission through the network, whereas JCL only buffered the messages and partitioned them into bins; JCL transmits identical messages, while RMI optimizes them. However, even RMI's speedup decreased as the number of nodes increased, as Figure 9A illustrates, because of the network overhead. The JCL and RMI asynchronous speedups increased when both the processing load and the cluster size increased, and their results were quite similar (Figures 9B and 9C), as expected. Finally, JCL had better speedup results when complex arguments

were considered (Figure 9D). This result occurred because the serialization method used by JCL is faster than that of RMI.

The behaviors of the curves in Figure 9A and 9D indicate that network time dominated the processing time. Therefore, two options exist, ie,

(i) submit the tasks to the cluster in coarser groups to increase the processing per task and (ii) change the Ethernet protocols to an HPC protocol, such as the InfiniBand computer-networking communications standard.

In the second round of experiments (Figure 10), we fixed the number of instantiated global variables to 10³. The JCL method instantiateGlobalVar was compared with the closest Apache Ignite representation of a JCL global variable, ie, the Atomic type, specifically IgniteAtomicReference. First, we evaluated the instantiation of the abovementioned book class in two distinct ways, ie, (i) we create the object instance in a JCL-User and send it to a JCL-Host node or an Apache Ignite node synchronously; (ii) we create the instance directly

in the JCL-Host synchronously. The results, illustrated in Figure 10A, demonstrate that JCL achieved better throughput than did Apache Ignite.

Moreover, its local instantiation mechanism was more efficient than its remote instantiation mechanism. The remote alternative transmits both the class constructor arguments and their dependencies through the network, which reduces the throughput.

Figure 10B shows the JCL throughput achieved while instantiating global variables asynchronously. When we compare the asynchronous results

with the synchronous ones, the behavior reversed itself, ie, the greater throughput occurred when the variables were instantiated in the JCL-Host. This behavior is expected because the cost of creating the book object is not the responsibility of the JCL-User but rather falls to the JCL-Host, which

adopts a parallel solution. The Apache Ignite Atomic type is, by definition, synchronous, ie, the object is created locally and then sent to the cluster.

The most similar behavior is the JCL synchronous global variable creation in a JCL-User, which outperformed Apache Ignite by 1000%.

In Figure 10C, the getValue(Object key) method was used to recover a previously instantiated book variable. JCL was evaluated with dif-

ferent delta values (discussed in Section 4.3); we adopted IgniteAtomicReference for the Apache Ignite solution. Although additional requests

introduced by a non-zero delta are parallel, this process introduces overhead. Consequently, when the delta value doubles, the throughput is

reduced by 50%. When we compare the results to those of Apache Ignite, JCL was 1270% faster when delta was equal to zero, 500% faster when

delta was equal to one, and 280% faster when delta was equal to two. This enormous difference between JCL and Apache Ignite is probably related

to IgniteAtomicReference, which, unlike JCL, adds atomic operations to the variable.
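A sketch of the two instantiation styles and the retrieval call follows; instantiateGlobalVar and getValue are the method names cited in the text, but the signatures, the facade interface, and the Book constructor are assumptions made for illustration.

    public class GlobalVarSketch {
        // Placeholder facade; the signatures below are assumptions.
        interface JCL {
            void instantiateGlobalVar(Object key, Object value);                    // local creation
            void instantiateGlobalVar(Object key, String className, Object[] args); // remote creation
            Object getValue(Object key);
        }

        static void example(JCL jcl) {
            // (i) Local instantiation: the Book is built in the JCL-User and shipped to a host.
            Book book = new Book("authors", "editors", 3, 250, 2018);
            jcl.instantiateGlobalVar("book-1", book);

            // (ii) Remote instantiation: only the class name and constructor arguments cross the
            // network, and the selected JCL-Host builds the object itself.
            jcl.instantiateGlobalVar("book-2", "Book",
                    new Object[]{"authors", "editors", 3, 250, 2018});

            // Retrieval, as measured in Figure 10C.
            Book copy = (Book) jcl.getValue("book-1");
        }

        // Simplified stand-in for the user-typed book class used in the experiments.
        static class Book {
            Book(String authors, String editors, int editions, int pages, int year) { }
        }
    }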


FIGURE 10 Global variable experiments

In Figure 10D, a Crawler object was used to simulate a scan of 1000 Web pages. The Crawler object stored the visited pages, the pages to visit

and the pages where it found the keyword. This scenario exemplifies a situation in which the cost of remote variable instantiation is compensated

for, as the cost of sending an object via the network is higher than the cost of sending its constructor parameters and its dependencies. Individually,

JCL outperformed Apache Ignite by 900% and 2400% when the Crawler object was created locally and remotely, respectively.

In the third round of experiments, we tested the JCL and Apache Ignite distributed map implementations by inserting 10³ book instances into them. The Put, PutAll (Figure 11A and 11B), Get, and iterator (Figure 11C and 11D) methods were evaluated. The Put and PutAll methods have a huge throughput difference, explained in part by the optimization in which <key, value> pairs with identical F + d results are buffered to reduce network communication. JCL was faster than Apache Ignite for both methods. Specifically, it was 30% faster for the Put method and 3100% faster for the PutAll method.

Another huge throughput difference occurred when we retrieved a value individually versus when we obtained it via an iterator. The iterator

method optimizes <key, value> retrievals by requesting chunks of data stored in a JCL-Host. In contrast, the Get method requests one <key, value> pair each time. Therefore, it is more efficient to adopt the iterator method instead of the Get method to traverse a distributed map. In general, Apache Ignite was 333% faster than JCL when retrieving data from a cluster using the iterator method and 220% faster when using the Get method of the distributed Map. In summary, inserting data into the map was faster with JCL, while retrieving data was faster with Apache Ignite, probably because Ignite adopts caching, data compression, and optimized algorithms not included in the design of JCL.
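The measured gap essentially reflects per-pair versus chunked network interactions, as the sketch below illustrates; the JCL-HashMap is assumed here to follow the java.util.Map contract (put, putAll, entrySet), which may differ from the exact JCL method names.

    import java.util.HashMap;
    import java.util.Map;

    public class BulkVersusSingleSketch {
        static void fill(Map<Integer, Object> distributedMap, Object[] books) {
            // One network interaction per pair: the slower pattern of Figure 11A.
            for (int i = 0; i < books.length; i++) {
                distributedMap.put(i, books[i]);
            }

            // Batched insertion: pairs with the same destination host travel together (Figure 11B).
            Map<Integer, Object> batch = new HashMap<>();
            for (int i = 0; i < books.length; i++) {
                batch.put(i, books[i]);
            }
            distributedMap.putAll(batch);

            // Traversal: the iterator fetches chunks per JCL-Host instead of one get(...) per key.
            for (Map.Entry<Integer, Object> e : distributedMap.entrySet()) {
                process(e.getValue());
            }
        }

        static void process(Object book) { /* application logic */ }
    }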

Finally, in the fourth round of experiments, we evaluated the uniformity of the function F (presented in Equation (1)) for different delta values, d.

The experiment instantiated 10³ variable names with different prefixes and auto-incremented suffixes, ie, variable names such as “p_i”

and “p_ij,” where p is a prefix and i and j are auto-incremented values. We also tested the distribution of F + d for an arbitrary bag of words from a

Christian bible to verify the performance of the JCL data partition. The results are illustrated in Figures 12 and 13, where Δ is the delta size of the

equation, ie, F + d, used to schedule the global variables. In general, JCL achieved an almost uniform distribution when using a delta value between

zero and two.

In Figure 12, where the suffixes are auto-incremented, the delta size did not interfere with the uniformity of the distribution. This result occurred

because, very often, the hash values follow the auto-incremented names. Thus, the function F selected JCL-Hosts in a uniformly distributed manner. When the keys were arbitrary, the distribution became more uniform as delta increased (see Figure 13). Therefore, even if a developer decides to adopt arbitrary names in the code, JCL can achieve an almost fair data partition, although partition uniformity decreases as delta approaches zero, as illustrated in Figure 13. Several project decisions, such as variable naming conventions, can be made to improve the uniformity of the variable distribution over the JCL-Hosts, which

increases the access throughput. These results are very similar to the results presented in the work of Almeida et al10 because the distribution of


FIGURE 11 Distributed map experiments

FIGURE 12 Variable names with auto-increment

function F changes only when the key hash changes or if the number of JCL-Hosts changes. In both cases, the number of JCL keys and the number of JCL-Hosts were the same; therefore, identical results were expected even when running over different clusters.

5.2 Multi-core speedup experiments

We evaluated the JCL multi-core version against a Java thread implementation provided by Oracle, using the ExecutorService from the java.util.concurrent package. The experiment used an Intel Core i7-4790 3.6 GHz processor with 8 logical cores (hyper-threading technology) and 16 GB of RAM. We repeat each experiment five times and calculate the average speedup and standard deviations. In the first test,


FIGURE 13 Bag of words

FIGURE 14 Task execution experiments

we implemented a CPU-bound task composed of existing Java sequential algorithms for calculating the Levenshtein distance, Fibonacci series,

and prime numbers. We calculated the JCL and Java thread speedups for 10³ executions; the results demonstrated similar speedups. JCL achieved

a speedup of 1.98, 3.81, 5.22, and 6.52, while Oracle Java threads achieved a speedup of 1.95, 3.82, 5.22, and 6.54 (Figure 14A). These results

demonstrate that the overhead added by JCL is almost zero for CPU-bound tasks. In the second test, we evaluate a void method with no argument and no processing in the task, simulating an extreme scenario that highlights the parallelism transparency overhead. We fixed the number of method invocations to 10⁶ executions. Figure 14B illustrates the void method with no business logic in the task, managed by both the JCL and Oracle solutions. As expected, the overhead added by JCL was significant when compared to Java threads, precisely 37.4 times slower. This number serves as a threshold between the parallel alternatives, ie, it indicates when users can adopt the JCL multi-core version and its facilities, including the parallel/distributed portability detailed in Section 4.6. The main reason is that JCL internally uses Java threads and, consequently, many services of the concurrent package. In both cases, the speedup is less than one, indicating that the sequential version is the faster alternative when the task is not CPU bound.
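For reference, the Oracle baseline in this comparison follows the standard ExecutorService pattern sketched below; the CPU-bound body is only a stand-in for the Levenshtein/Fibonacci/prime composite used in the experiment, not the actual test code.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class ExecutorBaseline {
        public static void main(String[] args) throws Exception {
            int nThreads = Runtime.getRuntime().availableProcessors();
            ExecutorService pool = Executors.newFixedThreadPool(nThreads);
            Callable<Long> task = ExecutorBaseline::cpuBoundTask;

            List<Future<Long>> results = new ArrayList<>();
            for (int i = 0; i < 1_000; i++) {            // 10^3 executions, as in the first test
                results.add(pool.submit(task));
            }
            for (Future<Long> f : results) {
                f.get();                                  // wait for every task to finish
            }
            pool.shutdown();
        }

        // Stand-in CPU-bound work (prime counting) replacing the composite used in the paper.
        static long cpuBoundTask() {
            long count = 0;
            for (long i = 2; i < 200_000; i++) {
                boolean prime = true;
                for (long j = 2; j * j <= i; j++) {
                    if (i % j == 0) { prime = false; break; }
                }
                if (prime) count++;
            }
            return count;
        }
    }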

5.3 JCL-Super-Peer component overhead

We evaluated the overhead of the JCL-Super-Peer component using two different applications, ie, (i) a communication-intensive sorting appli-

cation and (ii) a heuristic optimization application. We tested both applications in the same environment with 20 nodes, all of which were equipped with

Intel(R) Core(TM) i5-2500, 3.3 GHz processors (4 physical cores) and 4 GB of DDR3 1333 MHz RAM. The operating system was Ubuntu 14.04.1 LTS

64-bit with a 3.13.0-39-generic kernel, and all experiments fit into the RAM. For the heuristic experiments, the CBC solver was adopted; 77 tasks

were randomly selected among the 4221 non-deterministic tasks presented in Section 5.4. For the sorting experiments, we submit 50000 tasks. The

sorting tasks are described in Section 5.1.


TABLE 4 Number of JCL-Hosts

Network                               Round 1   Round 2   Round 3   Round 4   Round 5
Super-peer network (192.168.0.0/24)         0         5        10        15        20
Server network (10.10.10.0/24)             20        15        10         5         0

FIGURE 15 JCL-Super-Peer overhead in managing JCL-Hosts

We tested both applications in two scenarios. First, we configured the 20 nodes in two different networks, one with a JCL-Server and another

with both JCL-Server and JCL-Super-Peer components. The number of JCL-Hosts in the JCL-Super-Peer network was increased by

5 for each round of tests, varying from 0 to 20. In contrast, the number of JCL-Hosts in the JCL-Server network was decreased by 5 per round

of tests, varying from 20 to 0. Table 4 illustrates both networks in terms of the number of JCL-Hosts.

We repeat each round five times, and we calculate the average and standard deviations. The results illustrate that, when the tasks are CPU bound,

no significant overhead exists for various numbers of JCL-Hosts in each network (Figure 15B). When the tasks are communication intensive and

not CPU bound, the JCL-Super-Peer introduces some overhead to manage the different numbers of JCL-Hosts, as shown in Figure 15A. If we

consider the total time of the first round as a baseline, where the JCL-Server managed all 20 JCL-Hosts, the second round was 74% slower than

the first, the third was 66% slower than the second, the fourth was 54% slower than the third, and finally, the fifth was 63% slower than the fourth.

In general, there was an increase of 10% when moving five JCL-Hosts from the JCL-Server network to the JCL-Super-Peer network. The

most substantial overhead occurred in the second round of the sorting application because a new JCL-Super-Peer was created in the second

round. In the case of the sorting application, the overhead was unacceptable. Therefore, one alternative is to group the 50000 tasks into, eg, smaller

groups of 5000 tasks each. Such an alternative would involve less communication, and consequently, less overhead. Besides, HPC protocols could

be used instead of Ethernet protocols for sorting applications. CBC and the experimental sorting results represent two extreme scenarios, rein-

forcing that JCL can introduce significant overhead when tasks are not CPU bound, and the cluster operates on ordinary commodity PCs with NIC

Ethernet cards.

In the second scenario, we evaluated both applications in the case where new JCL-Super-Peers were introduced, varying from 0 to

4 JCL-Super-Peers. We execute five rounds, yielding five network configurations with different numbers of JCL-Super-Peers (zero,

one, two, three, and four JCL-Super-Peers). In the fifth round, for instance, five networks and four JCL-Super-Peers existed, and each

JCL-Super-Peer component managed five JCL-Hosts. The JCL-Server network had no JCL-Hosts in this round, as Table 5 illustrates.

We repeat each round five times, similar to the first scenario. The results illustrate, as in the first scenario, that the overhead increases as the

communication costs exceed the task-processing costs (Figure 16). To calculate the overhead of the JCL-Super-Peer, we consider the first round (with zero JCL-Super-Peers) as the baseline, ie, we measured the remaining rounds and compared them with it. We obtain the following results: round 2 was 74% slower, round 3 was 154% slower, round 4 was 236% slower, and round 5 was 294% slower. As expected, the impact of adding new JCL-Super-Peers was

higher than that of adding new JCL-Hosts. For the sorting application, the same recommendations as those discussed previously apply.

The JCL-Super-Peers create tunnels with the JCL-Server; therefore, each communication between JCL-User and JCL-Host components must be intercepted by the JCL-Server and redirected by a JCL-Super-Peer component. The JCL-Super-Peer enables JCL to communicate with sub-networks with invalid IPs, a common scenario in multi-cluster topologies in academic institutions, businesses, and people's houses.

However, the overhead can be enormous or unrealistic if we do not consider the granularities of the tasks and the supported HPC communication

technologies.


TABLE 5 JCL-Super-Peer overhead under different multi-cluster topologies

Network                                 Round 1   Round 2   Round 3   Round 4   Round 5
Server network (10.10.10.0/24)               20        15        10         5         0
Super-peer network 1 (192.168.0.0/24)         0         5         5         5         5
Super-peer network 2 (192.168.1.0/24)         0         0         5         5         5
Super-peer network 3 (192.168.2.0/24)         0         0         0         5         5
Super-peer network 4 (192.168.3.0/24)         0         0         0         0         5

FIGURE 16 JCL-Super-Peer overhead in topology 2

5.4 CBC solver experiments

In this section, the goal is to evaluate JCL scheduling for both deterministic and non-deterministic optimization problems. The CBC solver is exe-

cuted several times to calibrate the input parameters for specific optimal results; thus, the goal is to find the best parameters that achieve optimal

solutions in shorter runtimes or by opening fewer branches. Note that these CBC executions, carried out to calibrate parameters, are themselves a

new combinatorial optimization problem.

We divide the experiments into two rounds. In both rounds, the developed application was the same, ie, to evaluate 21 sets of parameters with

each one of the 201 instances, generating a total of 4221 tasks (21 × 201). Some tasks can require more time than is acceptable on a specific cluster node. Thus, a maximum execution limit was added to each task to address this issue. This limit was calculated using the Benchmark ITC3-Linux-x86-64† on each JCL-Host. The benchmark collects node characteristics to stipulate the limit; a time limit of one hour, normalized by the standard CPU used in the benchmark, was stipulated.

The CBC executed by JCL is illustrated in Figure 17, where all the 4221 tasks were submitted (lines 11–30), then their results are retrieved from

the cluster using a synchronization barrier (line 36), and finally, JCL mounts the output for all the CBC executions (lines 37–40). The CBC is C++ code, so the Java class “CBCLoad” and its method “ExecCBC” are responsible for creating a process from the CBC executable file within the JVM. In summary, this represents a simple way to use JCL, yet one widely required by users to solve non-trivial problems such as combinatorial ones, whose non-deterministic behaviors impose hard scheduling challenges. In the CBC experiments, the Benchmark ITC3-Linux-x86-64 is executed by JCL on each JCL-Host.
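A condensed sketch of this structure follows; the register and execute calls and their signatures are assumptions made for illustration, while the class “CBCLoad,” its method “ExecCBC,” and the use of Java Future objects are taken from the text.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.Future;

    public class CbcCalibrationSketch {
        // Placeholder facade; these signatures are assumptions, not the literal JCL API.
        interface JCL {
            void register(Class<?> taskClass, String nickname);
            Future<String> execute(String className, String methodName, Object... args);
        }

        static void calibrate(JCL jcl) throws Exception {
            jcl.register(CBCLoad.class, "CBCLoad");

            List<Future<String>> tickets = new ArrayList<>();
            for (int p = 0; p < 21; p++) {                // 21 parameter sets
                for (int i = 0; i < 201; i++) {           // 201 instances -> 4221 tasks
                    tickets.add(jcl.execute("CBCLoad", "ExecCBC", p, i));
                }
            }
            for (Future<String> ticket : tickets) {       // synchronization barrier
                appendToReport(ticket.get());
            }
        }

        static void appendToReport(String cbcOutput) { /* mount the consolidated output */ }
        static class CBCLoad { /* wrapper that starts the native CBC process */ }
    }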

The test environment was composed of 20 nodes, all of which were equipped with Intel(R) Core(TM) i5-2500, 3.3 GHz processors (4 physical

cores), and 4 GB of DDR3 1333 MHz RAM. The operating system was Ubuntu 14.04.1 LTS 64-bit with the 3.13.0-39-generic kernel, and all exper-

iments could fit into the RAM. We calculate the average time from five executions, ie, we submitted the 4221 tasks five times and calculated the

average and standard deviation. We evaluate the middleware in terms of runtime in seconds.

In the first round of experiments, the 4221 tasks were submitted to the cluster five times with phase two of the JCL scheduler enabled and five

times with phase two disabled. The different time values for each task were collected using the getTaskTimes(ticket) method discussed in

† Benchmark_ITC3-Linux-x86-64 is available for download at https://www.utwente.nl/ctit/hstt/itc2011/benchmarking/.


FIGURE 17 Executing CBC multiple times in a JCL cluster

Section 4.8. This deterministic execution of the CBC was repeated five times, and Figure 18A to 18D, respectively, illustrate the total time, queue

time, execution time, and network time results. The total time (Figure 18A) decreased by 50% when the collaborative scheduler behavior was turned

on. Figure 18B illustrates that the queue time was slightly greater in tests 1, 2, 3, and 5, and significantly greater in test 4 when the scheduler was

disabled. This result occurred because, with scheduler phase two disabled, the tasks inevitably spent more time queued while waiting for the

CPU. However, the runtimes measured in the five rounds were similar, as the tasks were deterministic (Figure 18C). The network time (Figure 18D)

demonstrates that the overhead introduced by the data exchange between JCL-Hosts was minimal; therefore, the UA was not communication

intensive.

In Figure 19, the total time was reduced by a factor of approximately 2 to 3.5 when phase two of the scheduler was enabled. The behavior when

phase two was disabled was more erratic and was higher in the fifth round of Figure 19A. The queue time was also erratic. Consequently, when the queue time increased, phase two of the scheduler had to work harder to balance the system workload. The number of tasks moved to other JCL-Hosts varied from 250 to 350. Thus, we can argue that a few scheduling interventions (4 to 8% of the tasks) accelerated the application by approximately 350%. Because only a few tasks have substantial runtime differences, their average execution times were similar (Figure 19C). The network time was also erratic and depended on the number of scheduler interventions needed to move tasks.

The sequential time needed to execute the 4221 tasks was calculated by summing up the total times, resulting in 82 days. The average execution

time of the same set of tasks under JCL was 25 hours when the phase two scheduler was enabled. Therefore, the speedup achieved by adopting JCL is approximately 78×, which is compatible with the 80 cores of the cluster used in the experiments.

5.5 Experiments with a solution for MTSP

In the previous section, the scheduling workload was small in terms of the percentage of the total tasks that JCL rescheduled: it moved less than 10% of the CBC tasks within the cluster. To increase the number of task replacements, we evaluated an optimization algorithm to solve a real-world combinatorial problem with an application in the industrial production context. The chosen problem was the MTSP, for which Paiva and Carvalho76 recently proposed the method selected here.

We conduct the tests for two purposes, ie, (i) to accelerate the iterated local search executions for multiple MTSP instances, similar to the exper-

iments conducted on the CBC and (ii) to parallelize the local search algorithm executions to improve the accuracy of the MTSP solutions. However,

in this work, the focus is not on discussing accuracy improvements but on the quality of the JCL scheduler.

In the experiments, 240 instances were used, ie, 160 proposed by Crama et al77 and 80 proposed by Catanzaro et al.81 The JCL cluster version was

used to distribute the sequential algorithm, written in C++, over 20 nodes and 80 cores, identical to the CBC environment. The iterated local search

has a parallel local search phase, ie, it simultaneously explores multiple different ranges of 30% of the search space. Thus, we use the multi-core


FIGURE 18 Task execution experiments

FIGURE 19 Non-deterministic task execution experiments

version of JCL; it was responsible for managing 4 to 16 local search algorithm tasks concurrently. The parallelism of the local search algorithm

presents a scenario in which a C++ application invokes JCL, ie, the opposite direction to JCL scheduling the iterated local search over the cluster. Thus, the

JCL iterated local search application confirms the simplicity of interoperating JCL with C++ in two complementary ways. These MTSP experiments

also highlighted the alternative of using both JCL versions (multi-core and multi-computer) together to avoid unnecessary network communication.

Figure 20 illustrates the first step of the MTSP JCL application, which is very similar to the CBC version presented in Figure 17. Basically, at line 9, the application iterates over the 240 MTSP instances one at a time to guarantee that only a few tasks per JCL-Host are running initially, since these tasks will adopt the JCL multi-core version to perform iterated local searches locally and in parallel, avoiding costly distributed communication. At line 16, JCL is called to create a task that encapsulates the Java class “toolsS,” invoking its “execTS” method. This method, illustrated in Figure 21, creates


FIGURE 20 Executing MTSP multiple times in a JCL cluster

FIGURE 21 The ExecTS method to start the C++ MTSP code

FIGURE 22 The C++ MTSP sequential code initializing a Java object to communicate with JCL multi-core version

a process on each JCL-Host to handle the MTSP C++ code. Next, the C++ code, illustrated in Figure 22, calls a Java class named “JCLInterface” (Line 9), precisely its method “CreateTasks,” to start a JCL multi-core version that performs iterated local searches in parallel, which are also implemented as a compiled C++ module. This MTSP JCL application therefore has two Java-to-C++ integrations, one to call the MTSP code and the other to call the iterated local search, and one C++-to-Java integration, where the MTSP code calls the JCL API of the multi-core version. In this way, it can be considered the most complex JCL application built to date.
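The sketch below shows how such a wrapper might start the native solver from the JVM using the standard java.lang.ProcessBuilder; the argument list and paths are illustrative and do not reproduce the actual “ExecCBC” or “execTS” code.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    public class NativeSolverWrapper {
        // Illustrative version of the wrapper methods described in the text: it starts the
        // compiled C++ solver as an operating-system process and returns its textual output.
        public String run(String executablePath, String instanceFile, String timeLimitSeconds)
                throws Exception {
            Process p = new ProcessBuilder(executablePath, instanceFile, timeLimitSeconds)
                    .redirectErrorStream(true)
                    .start();
            StringBuilder output = new StringBuilder();
            try (BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
                String line;
                while ((line = r.readLine()) != null) {
                    output.append(line).append('\n');
                }
            }
            p.waitFor();
            return output.toString();
        }
    }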

Figure 23 illustrates the final steps of the MTSP application, where the method “CreateTasks” registers a wrapper Java class named “IteratedLocalSearch,” responsible for starting a process to run the C++ code for an iterated local search (Line 10). Note that the method “CreateTasks” starts from 4 to 16 local tasks (lines 9 to 16) using the multi-core JCL version (line 2), and each of these tasks encapsulates an “IteratedLocalSearch” object; thus, there are from 4 to 16 concurrent C++ iterated local searches per distributed Java/C++ MTSP task. Owing to its similarity with the “toolsS” class (Figure 21), the “IteratedLocalSearch” Java wrapper class and its C++ counterpart are omitted.

The results of the MTSP application demonstrated that approximately 20% of the 240 tasks submitted to the JCL cluster were replaced, which

implies an increase of more than 100% compared to the CBC experiments. Each value in Figure 24 represents the average times of an execution

round, ie, the execution of all 240 instances and their parallel local search algorithm executions, representing more than 240 × 4 tasks for each

execution round. Regarding total time, when the phase two scheduler was active, the runtime reduction was approximately 20% to 30% on average


FIGURE 23 The JCLInterface Java class responsible for starting parallel Java/C++ iterated local searches

FIGURE 24 Iterated local search applied to the 240-instance MTSP experiment

compared to that when the JCL phase two scheduler was disabled. Compared to the CBC, the application of the iterated local search to the MTSP

problem achieved fewer benefits from parallelization, as its tasks are less CPU bound. Therefore, operating the cluster with Ethernet protocols tends to reduce the parallelization benefits.

As shown in Figure 24B and 24D, the runtimes were similar. Thus, the more than 20% task replacement sometimes affected neither the net-

work nor the queue times. The reason is that the iterated local search tasks are small and their initial arguments are also small, which

minimizes the drawbacks of network transfers. Figure 24C illustrates the non-determinism, where the runtime with the scheduler enabled was

sometimes worse than that with the scheduler disabled. The tiny variation caused by the task replacements resulted in an improvement of between

20% and 30%.


FIGURE 25 Iterated local search applied to the MTSP execution rounds (1 to 64 tasks)

FIGURE 26 Iterated local search applied to the MTSP execution rounds (96 to 160 tasks)

Page 25: A general-purpose distributed computing Java middleware › haroldo › papers › Almeida2019.pdfdistributed computing, distributed shared memory, Java, middleware, parallel computing,

ALMEIDA ET AL. 25 of 28

A second alternative for parallelizing the iterated local search applied to the MTSP was considered. For this purpose, we selected one of the largest of the 240 available instances, and the local search algorithm method was parallelized via the JCL multi-computer version, running over

80 cores. The local search algorithm method operated concurrently; therefore, we launched between 1 and 160 concurrent tasks in each iteration.

The scheduler was always enabled in these experiments, and each value in Figures 25 and 26 represents the average time needed to complete

between 1 and 160 tasks in each execution round. Figure 25 shows JCL being tested with 1, 32, and 64 tasks performing the local search algorithm

method in parallel; Figure 26 shows JCL being tested while managing 96, 128, and 160 parallel tasks. More tasks than the 80 cores available in the

cluster were executed.

As shown in Figure 25A, the parallelization introduced nearly 80% overhead, which we observe in the runtime difference between one task and, for instance, 32 tasks. However, if we consider that the parallel executions performed 32 or 64 times more local search algorithm method calls, then the improvement is evident. Figures 25C and 26C reinforce the non-deterministic behavior of the iterated local search when applied

to the MTSP. In Figure 26A, the total times do not increase, which can be explained by the timely scheduler interventions. On average, 40% of the

overhead was introduced into the total times when comparing rounds with 32 or 64 tasks to rounds with 96 or 128 tasks and 80% when compar-

ing rounds with 32 or 64 tasks to rounds with 160 tasks. Nonetheless, the JCL scheduler was useful, considering that the number of local search

algorithm executions increased by more than 300%. Figure 26B illustrates a queue time outlier that occurred when there were 160 tasks to be sched-

uled; this demonstrates a bottleneck, as the results obtained with fewer than 128 tasks in parallel did not cause retention in the cluster. The outlier value indicates that, even when JCL replaced nearly 20% of the 160 tasks, the reduction of the queue times is sometimes insufficient, as they reflect hardware saturation when handling so many concurrent tasks.

6 CONCLUSIONS

In this work, we have presented a novel middleware solution using two programming models for GPDC development, ie, a DSM model and a

task-oriented model. This work represents an extension of our previous JCL paper.10 Specifically, it includes the following: (i) a new JCL version that

implements new requirements; (ii) three JCL optimization applications, used to evaluate its scheduling technique; (iii) a detailed performance eval-

uation of JCL against RMI (synchronous and asynchronous) and Apache Ignite middleware solutions to demonstrate the performance and scalability

of JCL; (iv) an evaluation of the JCL-Super-Peer component using multiple clusters; and (v) a discussion on the state-of-the-art of Java GPDC

middleware, which can be useful for future technical and research investigations.

The experiments demonstrated that JCL is a promising solution for GPDC. A detailed experimental evaluation was conducted using the open

source CBC mixed-integer programming solver. The CBC results demonstrated that, while the JCL scheduler introduced few task replacements,

it reduced the runtime by 50% to 350%. Compared to a sequential execution of CBC, JCL reduced the execution of 4221 tasks from 82 days to

25 hours. The speedup experiments demonstrated that JCL scales well as the cluster size increases and reaffirmed that either the processing must compensate for the introduced data network delays or HPC communication options must be used. The last set of experiments evaluated a solution for

the MTSP problem. The results demonstrated that JCL scales well even when it must replace many non-deterministic tasks. The comparison of JCL

to the RMI and Apache Ignite alternatives highlighted the strengths and drawbacks of JCL.

JCL still needs many improvements. It should be fault tolerant concerning both storage and processing services. The architecture of JCL has to be rebuilt to eliminate the JCL-Server component or at least duplicate it. In the current architecture, the JCL-Server introduces a Single Point Of Failure (SPOF) in the cluster and a bottleneck when it starts a significant number of JCL-Users simultaneously. New configurations for the JCL-Super-Peer must be added to JCL, as the JCL-Super-Peer does not always involve only invalid IPs. Many cases exist where a JCL-Super-Peer might be installed in a node with two Network Interface Cards, making the JCL-Super-Peer accessible to both JCL-Server and JCL-User components, ie, it has Internet access and also manages JCL-Hosts from a second data network composed of nodes with invalid IPs. Thus, we have to implement a new strategy for interconnecting the JCL-Super-Peer component with different networks. The current method relies on two network routers, ie, one for the JCL-Server network and another for the JCL-Super-Peer networks. This concept causes significant network overhead when the application is communication intensive, but it transparently enables the connection of two networks; therefore, JCL integrates networks with invalid IPs. The JCL-Super-Peer must avoid message decoding and re-transmit all messages directly to their destinations without opening them. We can introduce optimizations to avoid submitting the same API calls multiple times, eg, several JCL execute API calls with identical method names and arguments should be replaced by a single call per JCL-Host, with JCL grouping them to optimize the transmissions through the data network.

GPU execution abstractions, where location and copies are transparent to developers, would be very useful in JCL. New options for the first phase

of the JCL scheduler should be incorporated to improve the initial task allocation. For example, JCL using Yarn82 for resource management and task

scheduling should be investigated and compared against the native JCL alternatives. A new storage strategy, which does not allocate objects based

on names but instead partitions all objects into byte-arrays of, for instance, 256 KBytes each, thereby allowing many JCL-Hosts to store a unique

variable or a unique <key, value> pair of a map, should be developed. IoT requirements such as interoperability, context awareness, privacy, and security must be part of JCL. A cross-platform JCL-Host, including platforms without JVMs, platforms with JVMs that are not compatible with JSR 901

(the Java Language Specification), or platforms without operating systems, is necessary to apply JCL to IoT. We can implement a supervisor appli-

cation for visually managing JCL cluster resources (maps, global variables, tasks, sensors, devices, and more). Low-level communication libraries,


eg, Verbs/MXM for InfiniBand, GNI for Cray Aries, and usNIC for Cisco should be used by JCL internally to support HPC communication-intensive

applications. Users should be able to select their preferred communication protocol from many HPC alternatives and not only TCP, which is the only option in the current JCL and is not designed for low-latency requirements. We can conduct experiments using consolidated benchmarks such as the NPB.68 Finally,

we plan to support real-time and stream processing applications.

ORCID

André Luís Barroso Almeida http://orcid.org/0000-0002-9722-0426

Gustavo Silva Paiva http://orcid.org/0000-0002-5728-9373

REFERENCES

1. Han J, Kamber M, Pei J. Data Mining: Concepts and Techniques. Third ed. Burlington, MA: Morgan Kaufmann Publishers Inc; 2011.

2. Turner V, Gantz JF, Reinsel D, Minton S. The Digital Universe of Opportunities: Rich Data and the Increasing Value of the Internet of Things. Framingham, MA: IDC Analyze the Future. https://www.emc.com/leadership/digital-universe/2014iview/index.htm. Accessed June 13, 2018.
3. Kaminsky A. Big CPU, Big Data: Solving the World's Toughest Computational Problems with Parallel Computing. First ed. Scotts Valley, CA: CreateSpace Independent Publishing Platform; 2016.

4. McAfee A, Brynjolfsson E. Big data: the management revolution. Harv Bus Rev. 2012;90(10):60-66.

5. Perera C, Liu CH, Jayawardena S, Chen M. A survey on Internet of Things from industrial market perspective. IEEE Access. 2014;2:1660-1679.

6. Zhang Q, Cheng L, Boutaba R. Cloud computing: state-of-the-art and research challenges. J Internet Serv Appl. 2010;1(1):7-18.

7. Al-Jaroodi J, Mohamed N. Middleware is STILL everywhere!!! Concurrency Computat Pract Exper. 2012;24(16):1919-1926.

8. Taboada GL, Ramos S, Expósito RR, Touriño J, Doallo R. Java in the high performance computing arena: research, practice and experience. Sci Comput Program. 2013;78(5):425-444.
9. Taboada GL, Touriño J, Doallo R. Java for high performance computing: assessment of current research and practice. In: Proceedings of the 7th International Conference on Principles and Practice of Programming in Java; 2009; Calgary, Canada.
10. Almeida ALB, Silva SED, Nazaré Jr AC, De Castro Lima J. JCL: a high performance computing Java middleware. In: Proceedings of the 18th International Conference on Enterprise Information Systems; 2016; Rome, Italy.

11. Ghosh S. Distributed Systems: An Algorithmic Approach. Boca Raton, FL: CRC Press; 2014.

12. Protic J, Tomasevic M, Milutinovic V. Distributed shared memory: concepts and systems. IEEE Parallel Distributed Technol Syst Appl. 1996;4(2):63-71.

13. Shahrivari S, Sharifi M. Task-oriented programming: a suitable programming model for multicore and distributed systems. Paper presented at: 10th International Symposium on Parallel and Distributed Computing; 2011; Cluj Napoca, Romania.

14. Veentjer P. Mastering Hazelcast. Palo Alto, CA: Hazelcast; 2013.

15. Watson RT, Wynn D, Boudreau M-C. JBOSS: the evolution of professional open source software. MIS Q Exec. 2005;4(3):329-341.

16. A GridGain Systems In-Memory Computing White Paper. Foster City, CA: GridGain. http://go.gridgain.com/rs/491-TWR-806/images/Apache_Ignite_White_Paper.pdf?aliId=402. Accessed December 15, 2015.

17. Bhuiyan SA, Zheludkov M, Isachenko T. High Performance In-Memory Computing With Apache Ignite: Building Low Latency, Near Real-Time Application. Victoria, Canada: Leanpub; 2017.
18. Murphy AL, Picco GP, Roman G-C. LIME: a coordination model and middleware supporting mobility of hosts and agents. ACM Trans Softw Eng Methodol. 2006;15(3):279-328.
19. Gokhale A, Balasubramanian K, Krishna AS, et al. Model driven middleware: a new paradigm for developing distributed real-time and embedded systems. Sci Comput Program. 2008;73(1):39-58.
20. Tariq MA, Koldehofe B, Bhowmik S, Rothermel K. PLEROMA: a SDN-based high performance publish/subscribe middleware. In: Proceedings of the 15th International Middleware Conference; 2014; Bordeaux, France.
21. Mehrotra P, Djomehri J, Heistand S, et al. Performance evaluation of Amazon EC2 for NASA HPC applications. In: Proceedings of the 3rd Workshop on Scientific Cloud Computing; 2012; Delft, The Netherlands.
22. Jackson KR, Ramakrishnan L, Muriki K, et al. Performance analysis of high performance computing applications on the Amazon web services cloud. Paper presented at: IEEE Second International Conference on Cloud Computing Technology and Science; 2010; Indianapolis, IN.
23. Karantasis KI, Polychronopoulos ED. Programming GPU clusters with shared memory abstraction in software. Paper presented at: 19th International Euromicro Conference on Parallel, Distributed and Network-Based Processing; 2011; Ayia Napa, Cyprus.
24. Karantasis KI, Polychronopoulos ED. Pleiad: a cross-environment middleware providing efficient multithreading on clusters. In: Proceedings of the 6th ACM Conference on Computing Frontiers; 2009; Ischia, Italy.

25. George L. HBase: The Definitive Guide. Sebastopol, CA: O'Reilly Media Inc; 2011.

26. Apache Cassandra. http://cassandra.apache.org/. Accessed December 20, 2015.

27. Gates A, Dai D. Programming Pig. Sebastopol, CA: O'Reilly Media Inc; 2016.

28. Scylla Is Next Generation NoSQL. Palo Alto, CA: ScyllaDB. http://www.scylladb.com/. Accessed December 15, 2015.

29. Chodorow K. MongoDB: The Definitive Guide. Sebastopol, CA: O'Reilly Media Inc; 2013.

30. Boneti C, Gioiosa R, Cazorla FJ, Valero M. A dynamic scheduler for balancing HPC applications. In: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing; 2008; Austin, TX.


31. Murata Y, Inaba T, Takizawa H, Kobayashi H. A distributed and cooperative load balancing mechanism for large-scale P2P systems. Paper presented at: International Symposium on Applications and the Internet Workshops; 2006; Phoenix, AZ.

32. Balasangameshwara J, Raju N. A hybrid policy for fault tolerant load balancing in grid computing environments. J Netw Comput Appl. 2012;35(1):412-422.

33. Java Parallel Processing Framework. The open source grid computing solution. http://www.jppf.org/. Accessed December 15, 2015.

34. Pitt E, McNiff K. Java.Rmi: The Remote Method Invocation Guide. London, UK: Pearson Education; 2001.

35. MPI: A Message-Passing Interface Standard, Version 2.2 specification. Message Passing Interface Forum; 2009.

36. García-Valls M, Basanta-Val P. Comparative analysis of two different middleware approaches for reconfiguration of distributed real-time systems. J Syst Archit. 2014;60(2):221-233.

37. Basanta-Val P, García-Valls M. Resource management policies for real-time Java remote invocations. J Parallel Distributed Comput. 2014;74(1):1930-1944.

38. Yang B, Garcia-Molina H. Designing a super-peer network. In: Proceedings of the 19th International Conference on Data Engineering; 2003; Bangalore, India.
39. Lua EK, Crowcroft J, Pias M, Sharma R, Lim S. A survey and comparison of peer-to-peer overlay network schemes. IEEE Commun Surv Tutor. 2005;7(2):72-93.
40. SalemAlzboon M, Arif S, Mahmuddin M, Dakkak O. Peer to peer resource discovery mechanisms in grid computing: a critical review. Paper presented at: 4th International Conference on Internet Applications, Protocols and Services; 2015; Putrajaya, Malaysia.

41. Seovic A, Falco M, Peralta P. Oracle Coherence 3.5. Birmingham, UK: Packt Publishing Ltd; 2010.

42. Basanta-Val P, García-Valls M. A simple distributed garbage collector for distributed real-time Java. J Supercomput. 2014;70(3):1588-1616.

43. Castro M, Liskov B. Practical Byzantine fault tolerance. In: Proceedings of the Third USENIX Symposium on Operating Systems Design and Implementation; 1999; New Orleans, LA.
44. Chen Y, Sun X-H. STAS: a scalability testing and analysis system. Paper presented at: IEEE International Conference on Cluster Computing; 2006; Barcelona, Spain.
45. Abbott ML, Fisher MT. The Art of Scalability: Scalable Web Architecture, Processes, and Organizations for the Modern Enterprise. Boston, MA: Pearson Education; 2009.

46. Basanta-Val P, Audsley NC, Wellings AJ, Gray I, Fernández-García N. Architecting time-critical big-data systems. IEEE Trans Big Data. 2016;2(4):310-324.

47. Basanta-Val P, Fernández-García N, Sánchez-Fernández L, Arias-Fisteus J. Patterns for distributed real-time stream processing. IEEE Trans Parallel Distributed Syst. 2017;28(11):3243-3257.

48. Marchioni F, Surtani M. Infinispan Data Grid Platform. Birmingham, UK: Packt Publishing Ltd; 2012.

49. Walker SM, Dearle A, Norcross SJ, Kirby GNC, McCarthy AJ. RAFDA: A Policy-Aware Middleware Supporting the Flexible Separation of Application Logic from Distribution. Technical Report CS/06/2. Laurinburg, NC: University of St Andrews; 2003.

50. Kaminsky A. Parallel Java: a unified API for shared memory and cluster parallel programming in 100% Java. Paper presented at: IEEE International Parallel and Distributed Processing Symposium; 2007; Rome, Italy.
51. Taveira WF, de Oliveira Valente MT, da Silva Bigonha MA, da Silva Bigonha R. Asynchronous remote method invocation in Java. J Univers Comput Sci. 2003;9(8):761-775.

52. Henning M, Spruiell M. Distributed Programming With Ice. Revision 3.2.1 ed. Jupiter, FL: ZeroC Inc; 2007.

53. Shafi A, Carpenter B, Baker M. Nested parallelism for multi-core HPC systems using Java. J Parallel Distributed Comput. 2009;69(6):532-545.

54. Zhu W, Wang C-L, Lau FCM. JESSICA2: a distributed Java virtual machine with transparent thread migration support. In: Proceedings of the IEEEInternational Conference on Cluster Computing; 2002; Chicago, IL.

55. Baduel L, Baude F, Caromel D. Object-oriented SPMD. Paper presented at: Fifth IEEE International Symposium on Cluster Computing and the Grid; 2005; Cardiff, UK.

56. Expósito RR, Ramos S, Taboada GL, Touriño J, Doallo R. FastMPJ: a scalable and efficient Java message-passing library. Clust Comput. 2014;17(3):1031-1050.

57. Genaud S, Rattanapoka C. P2P-MPI: a peer-to-peer framework for robust execution of message passing parallel programs on grids. J Grid Comput. 2007;5(1):27-42.

58. Philippsen M, Haumacher B, Nester C. More efficient serialization and RMI for Java. Concurr Computat Pract Exp. 1999;12(7):495-518.

59. Kurzyniec D, Wrzosek T, Sunderam V, Slominski A. RMIX: a multiprotocol RMI framework for Java. In: Proceedings of the International Parallel and Distributed Processing Symposium; 2003; Nice, France.

60. Gabriel E, Fagg GE, Bosilca G, et al. Open MPI: goals, concept, and design of a next generation MPI implementation. In: Recent Advances in Parallel Virtual Machine and Message Passing Interface: 11th European PVM/MPI Users' Group Meeting, Budapest, Hungary, September 19-22, 2004. Proceedings. Berlin, Germany: Springer-Verlag Berlin Heidelberg; 2004.

61. Pugh W, Spacco J. MPJava: high-performance message passing in Java using java.nio. In: Languages and Compilers for Parallel Computing: 16th International Workshop, LCPC 2003, College Station, TX, USA, October 2-4, 2003. Revised Papers. Berlin, Germany: Springer-Verlag Berlin Heidelberg; 2003.

62. Xiong J, Wang J, Xu J. Research of distributed parallel information retrieval based on JPPF. Paper presented at: International Conference of Information Science and Management Engineering; 2010; Xi'an, China.

63. Nvidia Corporation. Compute Unified Device Architecture Programming Guide. 2008.

64. Di Sanzo P, Quaglia F, Ciciani B, et al. A flexible framework for accurate simulation of cloud in-memory data stores. Simul Model Pract Theory. 2015;58:219-238.

65. Wang V, Salim F, Moskovits P. The WebSocket API. In: The Definitive Guide to HTML5 WebSocket. Berkeley, CA: Apress; 2013:13-32.

66. Apache Software Foundation. Welcome to Apache Software Foundation. http://apache.org/. Accessed April 20, 2017.

67. Apache Ignite. What is Ignite. https://apacheignite.readme.io/docs/. Accessed April 20, 2017.

68. Bailey DH, Barszcz E, Barton JT, et al. The NAS parallel benchmarks. Int J High Perform Comput Appl. 1991;5(3):63-73.


69. The Java Tutorials. Lesson: Understanding the Sockets Direct Protocol. https://docs.oracle.com/javase/tutorial/sdp/sockets/. Accessed April 28, 2017.

70. IBM Code. Direct Storage and Networking Interface (DiSNI). https://developer.ibm.com/code/open/projects/direct-storage-and-networking-interface-disni/. Accessed May 5, 2018.

71. Patel DK, Tripathy D, Tripathy CR. Survey of load balancing techniques for grid. J Netw Comput Appl. 2016;65:103-119.

72. Congosto M, Basanta-Val P, Sanchez-Fernandez L. T-Hoarder: a framework to process Twitter data streams. J Netw Comput Appl. 2017;83:28-39.

73. Perera C, Jayaraman PP, Zaslavsky A, Christen P, Georgakopoulos D. MOSDEN: an Internet of Things middleware for resource constrained mobile devices. Paper presented at: 47th Hawaii International Conference on System Sciences; 2014; Waikoloa, HI.

74. Forrest J, Lougee-Heimer R. CBC user guide. In: Tutorials in Operations Research: Emerging Theory, Methods, and Applications. Catonsville, MD: INFORMS; 2005.

75. Mitchell JE. Branch-and-cut algorithms for combinatorial optimization problems. In: Handbook of Applied Optimization. Vol 1. Oxford, UK: Oxford University Press; 2002:65-77.

76. Paiva GS, Carvalho MAM. Improved heuristic algorithms for the job sequencing and tool switching problem. Comput Oper Res. 2017;88:208-219.

77. Crama Y, Kolen AWJ, Oerlemans AG, Spieksma FCR. Minimizing the number of tool switches on a flexible machine. Int J Flex Manuf Syst. 1994;6(1):33-54.

78. Lourenço HR, Martin OC, Stützle T. Iterated local search. In: Handbook of Metaheuristics. New York, NY: Springer Science+Business Media; 2003:320-353.

79. de Souza Cimino L, de Resende JEE, Silva LHM, et al. IoT and HPC integration: revision and perspectives. Paper presented at: VII Brazilian Symposium on Computing Systems Engineering; 2017; Curitiba, Brazil.

80. Basanta-Val P. An efficient industrial big-data engine. IEEE Trans Ind Inform. 2018;14(4):1361-1369.

81. Catanzaro D, Gouveia L, Labbé M. Improved integer linear programming formulations for the job sequencing and tool switching problem. Eur J Oper Res. 2015;244(3):766-777.

82. Vavilapalli VK, Murthy AC, Douglas C, et al. Apache Hadoop YARN: yet another resource negotiator. In: Proceedings of the 4th Annual Symposium on Cloud Computing; 2013; Santa Clara, CA.

How to cite this article: Almeida ALB, Cimino LS, de Resende JEE, et al. A general-purpose distributed computing Java middleware. Concurrency Computat Pract Exper. 2019;31:e4967. https://doi.org/10.1002/cpe.4967