
PDR.02.01.02 Compute platform: Software stack developments and considerations

Document number: SKA-TEL-SDP-0000020
Context: SKA.SDP.COMP
Revision: 1.0
Authors: D. Christie, P. Crosby, A. Ensor, N. Erdody, P. C. Broekema, B. A. do Lago, M. Mahmoud, R. O’Brien, R. Rocha, T. Stevenson, A. St John, J. Taylor, Y. Zhu, A. Mika
Release Date: 2015-02-09
Document Classification: Unrestricted
Status: Draft


Name: Chris Broekema
Designation: COMP Team Lead
Affiliation: ASTRON
Signature & Date: P. C. Broekema, 2015-02-09 ([email protected])

Name: Paul Alexander
Designation: Project Lead
Affiliation: University of Cambridge
Signature & Date: Paul Alexander, 2015-02-09 ([email protected])

Version: 1.0
Date of Issue: 2015-02-09
Prepared by: P. C. Broekema

ORGANISATION DETAILS
Name: Science Data Processor Consortium


Table of Contents

List of Figures
Introduction
References
    Applicable Documents
    Reference Documents
Operating systems
    Linux
    Exascale operating systems
Middleware
    Messaging Middleware
        Message Passing Interface
        IBM Infosphere Streams
        ZeroC Internet Communications Engine
        ZeroMQ
    Cloud software
        OpenStack
    Containerisation
Application development environment and software development kit
Schedulers
    Schedulers used in pre-cursor telescopes
    Adaptive Computing Moab HPC Suite
    Simple Linux Utility for Resource Management
    Conclusions
References


List of Figures

Figure 1: The Argo Enclave concept. From [RD-03].


Introduction

This document is intended as supporting material for the Software Stack chapter of the

Compute platform sub-element design document [AD-01]. It explores some of the current

software trends and their applicability to the SDP. Currently available software products are

discussed together with their potential development in the coming years and how the various

developments can most efficiently be exploited. Emphasis is placed on the lower levels of the

software stack, i.e. operating systems, middleware and such.

This is by no means intended as an exhaustive discussion of the software that we will use in the SDP once it is built. Instead, we consider some of the possible classes of software we may explore on the road to CDR. It is likely that many of the mentioned software products will cease to exist and be replaced by better and more capable products before SKA1 becomes available. We therefore intend to gain experience with best-of-breed software products over the next couple of years, to be able to better specify requirements for SKA1 software products.


References

Applicable Documents

The following documents are applicable to the extent stated herein. In the event of conflict

between the contents of the applicable documents and this document, the applicable

documents shall take precedence.

Reference Number Reference

AD-01 SKA-TEL-SDP-0000018 P. C. Broekema: Compute platform element sub-system design

Reference Documents

The following documents are referenced in this document. In the event of conflict between the

contents of the referenced documents and this document, this document shall take

precedence.

Reference Number Reference

RD-01 SKA CSP Element Software Development Plan - SKA-TEL.CSP.SE-NZA-SDP-001

RD-02 An Updated Performance Comparison of Virtual Machines and Linux Containers -- Wes Felter, et al

RD-03 http://www.mcs.anl.gov/project/argo-exascale-operating-system

RD-04 http://slurm.schedmd.com/


Operating systems

Linux

Linux is undisputedly the best match for a large and distributed project such as the SKA.

Adoption, especially in the last ten years, has been outstanding, and the library and tool

availability now covers the full computing spectrum, including desktop activities, video and audio

processing, software development and data analysis. In the science community there are good

examples of its usage. One notable case is Scientific Linux (SL) [1], an effort put together by Fermilab, CERN, and various other labs and universities around the world to have a common install base for the various experimenters, reducing duplication of effort. Based on packages targeted at RedHat-based distributions, it adds science-specific libraries and tools and

validates the whole set for usage in a scientific environment. A key aspect of Scientific Linux is

the long life cycle of all releases (5+5 years) which allows services to be operated within a

stable environment for long periods of time.

While successful for many years now, this approach has the drawback of requiring significant effort to maintain a separate distribution: infrastructure for package build and storage, manpower for keeping up with upstream updates, and end-user support. An alternative approach

has been promoted recently by the Unified Middleware Distribution (UMD) of the European Grid

Infrastructure (EGI) [2]. While initially providing its own repositories, this project has increasingly

promoted the adoption of the community guidelines from the Fedora based distributions, with

the ultimate goal of getting all its packages available in the public Fedora/EPEL repositories.

Benefits are clear:

● Community reviews prior to the inclusion of any package in the distribution, resulting in

much improved package quality;

● Re-use of the existing package code repositories, review and workflow tools (in this

case, Koji and Bodhi);

● Re-use of the existing repository infrastructure, with mirrors available all around the

world.

Our recommendation is thus to take a similar approach to that of the EGI project, though aiming

at Debian/Ubuntu as the target distribution. Popularity is one of the main reasons (as the SDP survey also pointed out [3]), but there are others:

● Availability of science- and math-related packages [4]

● Massive user and package maintainer community [5]

● Resiliency, as shown by its long existence [6] and the fact that it has no single organisation leading the effort; it remains a purely distributed effort

● Clear and frequent release cycles (every six months) and the provision of long-term

support (stable releases, five year support guaranteed)

● Multiple package states (unstable, testing, stable)

All the benefits mentioned previously regarding EGI's use of Fedora remain true for a solution targeting Debian/Ubuntu, such as the extensive set of guidelines [7] that ensures package


quality. Debian/Ubuntu is not commonly used for HPC systems as of today, which might

however change for OpenPOWER- and future ARM-based systems.

We realise there are cases where packages will not be accepted in any of these distributions -

such as partial use of proprietary code, dependencies on libraries not available in the

distribution, etc. These should remain the exception, and can be handled by either keeping a

custom repository or (ideally) maintaining a project-specific Personal Package Archive (PPA)

[8], which does not enforce such strict policies, but provides the rest of the build and repository

infrastructure.

Exascale operating systems

As part of the various projects towards exascale computing, several exascale operating system

designs are being considered. The main motivations are:

● The desire to scale to extreme system sizes with minimal OS overhead;

● The need for a more predictable and deterministic performance profile;

● More control over the compute resources, without additional overhead.

One of these projects, Argo [RD-03], is led by Argonne National Laboratory, which has collaborated

with the radio-astronomical community before. Where current state-of-the-art HPC operating

systems are basically large numbers of node operating systems wired together, Argo introduces

a new key concept, the enclave. This hierarchical concept sits between the system and the

node, much like the SDP Compute Island does, and provides its own form of management

system.

Figure 1: The Argo Enclave concept. From [RD-03].


While the requirements for SKA1 are different from conventional HPC applications, the

hierarchical concept of an enclave matches well with the SDP Compute Island concept.

Although our current baseline assumption is that we will run some Linux flavour on the SDP,

these exascale operating system developments are extremely interesting and may offer a way

to significantly ease the operational challenges by reducing the number of manageable

components. More research is required to estimate the suitability of such operating systems for

the SKA.

Middleware

Messaging Middleware

Message Passing Interface

The Message Passing Interface (MPI, http://www.mpi-forum.org/) is the de facto standard

providing a diverse message-passing interface and library for programming parallel machines

with distributed memory. Whilst MPI is a framework that provides an API for building distributed applications, it does not provide any direct ability to deploy and manage distributed applications.

Point-to-point communication is the fundamental MPI mechanism, using send and receive

operations that can either be synchronous or asynchronous. Messages are sent in an envelope

that specifies information that can be used by the receiver for selecting a particular message. A

variety of serialised data types are offered for transmission as well as the ability to define more

complex derived serialisable types.

MPI extends simple send and receive operations to allow more complex collective messages

that involve a group or groups of processes. This type of collective messaging can also be

performed synchronously or asynchronously.
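
As a brief illustration of these mechanisms, the sketch below uses mpi4py (the Python bindings for MPI) to perform a tagged point-to-point exchange and a collective reduction. This is only a sketch: mpi4py and a working MPI implementation are assumed to be installed, and the payload is purely illustrative.

# Minimal mpi4py sketch; run with e.g.: mpirun -np 4 python mpi_sketch.py
from mpi4py import MPI

comm = MPI.COMM_WORLD          # default communicator containing all processes
rank = comm.Get_rank()         # this process's rank within the communicator
size = comm.Get_size()         # total number of processes

# Point-to-point: rank 0 sends a Python object to rank 1; the tag is part of
# the message envelope that the receiver can use to select a message.
if rank == 0 and size > 1:
    comm.send({"payload": 42}, dest=1, tag=11)
elif rank == 1:
    data = comm.recv(source=0, tag=11)
    print(f"rank 1 received {data}")

# Collective: every rank contributes its rank number and the sum is
# reduced onto rank 0 (a collective operation over the whole group).
total = comm.reduce(rank, op=MPI.SUM, root=0)
if rank == 0:
    print(f"sum of ranks = {total}")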

Beyond collective messaging, MPI offers further advanced message passing mechanisms such

as process groups and topologies. Groups represent an ordered collection of processes, where

each has a rank and a low-level name for inter-process communication. A topology is an

optional feature that can be assigned to a group's communication contexts and provides a

naming mechanism for the processes within a group. A topology can assist the runtime system

in mapping processes onto hardware.

To aid more sophisticated middleware mechanisms such as scheduling, fault-tolerance, load

balancing and adaptivity, MPI offers environmental management and inquiry routines as well as

process creation and management interfaces. MPI offers routines for getting and setting various

parameters that relate to the implementation and the execution environment. For example, features such as version inquiries, memory management, error handling, timing and synchronisation, as well as start-up parameters, can be queried and set where applicable.

Despite MPI's primary function being concerned with communication rather than process or

resource management, some decisions can be facilitated during execution via a set of interfaces

that enable the creation and management of processes.


IBM Infosphere Streams

Stream processing applications can be developed using IBM Infosphere Streams (Streams,

http://www-03.ibm.com/software/products/en/infosphere-streams) using its Stream Processing

Language (SPL). SPL allows the composition of a stream processing application by its

respective flow graph, with vertices representing the processes and edges representing the data

streams between processes. The SPL application is then compiled resulting in a set of

processing elements.

The Streams runtime environment is usually deployed on a Linux cluster that may comprise different computing and networking hardware architectures. The Streams runtime is designed to be hardware agnostic; however, the operating system must be the same across the entire cluster. Each physical node executes and manages a processing element container.

The Streams application manager and scheduler deploy and start the application processing

elements on the containers. The Streams resource manager and application manager then

monitor and manage the execution of the processing elements.

Streams provides an Eclipse plugin for the development of stream-processing applications.

Application development using SPL allows direct incorporation of Java or C/C++ code. Support

for Python is also available via a C++ layer.

For radio astronomy data processing in particular, the potential of Streams has been shown in applications for correlation, RFI mitigation and imaging. Streams is a commercial IBM product; however, it is freely available for academic use under the IBM academic initiative policies.

ZeroC Internet Communications Engine

The Internet Communications Engine (ICE, https://www.zeroc.com/) is an open source client-

server based distributed computing environment that supports a wide range of programming

languages (e.g. C/C++, Java, Python, Objective-C and .NET) allowing systems to utilise more

than one language. ICE also supports a variety of operating systems including Linux.

ICE provides sophisticated remote procedure call (RPC) mechanisms including

synchronous/asynchronous calls, bidirectional connections, control of threads and resource

allocation. It also provides fault tolerance and load balancing features for improved performance

and scalability.
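
To illustrate the client side of this RPC model, the sketch below uses ICE's Python language mapping. The Demo module and Printer interface stand in for a hypothetical Slice definition compiled with slice2py, and the endpoint is an assumption, not an SDP design choice.

# Minimal ICE client sketch (Python mapping). Assumes a Slice definition such as
#   module Demo { interface Printer { void printString(string s); }; };
# compiled with slice2py into the Demo module used below, and a server
# listening on the (hypothetical) endpoint given in the proxy string.
import sys
import Ice
import Demo  # hypothetical slice2py-generated module

communicator = Ice.initialize(sys.argv)
try:
    # Obtain an untyped proxy from a stringified endpoint, then narrow it
    # to the typed Printer proxy with a checked cast (itself a remote call).
    base = communicator.stringToProxy("SimplePrinter:tcp -h sdp-node01 -p 10000")
    printer = Demo.PrinterPrx.checkedCast(base)
    if not printer:
        raise RuntimeError("proxy does not refer to a Printer object")
    # A synchronous remote procedure call on the server-side object.
    printer.printString("Hello from the SDP prototype")
finally:
    communicator.destroy()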

In terms of grid computing, ICE provides a service known as IceGrid. This service is an

intermediary layer that decouples clients from servers to help promote replication, load

balancing and fault tolerance. IceGrid facilitates application deployment onto resources to assist

with improving utilisation and provides administrative tools enabling the management of

deployed applications.


ICE has been suggested for use as the main enterprise bus middleware technology for ASKAP.

Essentially its role in ASKAP computing is to coordinate the telescope's main components.

ZeroMQ

For non-enterprise or less-critical application messaging, a brokerless messaging system might

be used to reduce potential bottlenecks, latency and network communication associated with

enterprise-grade brokered messaging systems. The ZeroMQ (ØMQ, http://zeromq.org/) lightweight messaging kernel is a library with a C API that extends the familiar concept of TCP socket interfaces with features for supporting asynchronous message queues and subscription-based message filtering over a range of network protocols. Although the core ØMQ API is C, bindings are available for many languages, including C++, C#, Java, and Python.

Unlike TCP sockets, ØMQ sockets can use various transport protocols: besides TCP it supports pgm (Pragmatic General Multicast) and its encapsulated variant epgm, ipc for inter-process communication, and an in-process transport for threads sharing a ØMQ context. A ØMQ socket allows an arbitrary number of incoming and outgoing connections, and network connections can come and go dynamically, with ØMQ automatically reconnecting. Similarly to TCP, one side (nominally the server) binds to a socket and the others (the clients) connect to it, although ØMQ allows clients to connect before the server has bound the socket. All I/O for a socket is handled asynchronously by a pool of background ØMQ threads.

ØMQ supports various messaging patterns including request-reply, publish/subscribe, pipeline,

and exclusive pair.
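
As a minimal illustration of the socket model, the sketch below uses pyzmq (the Python ØMQ bindings) to exchange a request and reply over a loopback TCP endpoint; the port number and message contents are illustrative assumptions.

# Minimal request-reply sketch using pyzmq.
import zmq

context = zmq.Context()

# "Server" side: bind a REP socket. With ØMQ the bind/connect order does
# not matter; clients may connect before the server has bound.
server = context.socket(zmq.REP)
server.bind("tcp://127.0.0.1:5555")

# "Client" side: connect a REQ socket (normally in another process or thread).
client = context.socket(zmq.REQ)
client.connect("tcp://127.0.0.1:5555")

client.send(b"status?")          # queued asynchronously by ØMQ's I/O threads
request = server.recv()          # blocks until the request arrives
server.send(b"pipeline idle")
print(client.recv())             # b'pipeline idle'

client.close()
server.close()
context.term()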

Cloud software

"Cloud computing" is a generic term that describes the ability to automatically and immediately provision compute capability without the need to invest capital in hardware or software. This is made possible by the development of technologies that virtualise the provisioning of servers, networks and other infrastructure. Similar virtualisation has taken place on a software level such that operating systems, middleware, databases, development tools and end-user tools can all also be automatically and immediately provisioned.

Amazon, Google, Microsoft and many smaller businesses all have their own versions of cloud computing. The market disruption they have caused is due to their ability to charge consumers of cloud capability on a pay-as-you-go basis. This allows consumers to pay only for what they use, which can lead to cost savings and different models of undertaking compute tasks that range from tiny to very large. Systems can be architected to scale up and down with temporal usage, as opposed to being fixed to peak usage criteria.

OpenStack

The free and open source community has risen to the challenge of cloud computing by largely coalescing around a suite of inter-operating projects that fall under the umbrella of the OpenStack Foundation. These developments are effectively defining open standards for cloud computing. OpenStack is also creating the opportunity for many new players to enter the field of offering cloud services that were once the domain of a handful of large-scale suppliers. It also now makes commercial sense for organisations with high internal compute requirements to build


their own “private” clouds. Past efforts have tried projects such as OpenNebula or pure KVM, but lately, OpenStack [9] is showing better traction.

The OpenStack project started as a combined NASA and Rackspace development. It gained popularity very quickly and now has many large scientific institutions (e.g., CERN) among its users [10]. The project is supported by hundreds of companies, with contributions made by thousands of developers, making it one of the fastest-growing open source projects to date.

OpenStack aims to provide all the services required to run a private cloud while remaining agnostic in terms of the actual virtualization solution being used underneath. It shares with Linux a lot of the points that have made it successful: community effort, led by a foundation (there’s no single entity driving its features); fully open source, allowing easy adoption of new features; large user base with frequent summits for discussions covering operational procedures, best practices and development priorities.

There is a considerable amount of interest in using OpenStack in the general HPC community for the following reasons (see also [RD-01]):

● HPC "as a service" is gaining momentum in cloud environments.
● OpenStack provides a potential common framework for the integration of HPC middleware components such as provisioning, management and control. Basing these elements on a common framework may aid development and provide consistency.
● Elements such as containerisation will allow DevOps on production resources, thus providing test facilities at the scale of production-level operation.

Based on the arguments above, OpenStack appears to be suitable as the basis for all the SDP middleware services providing computing, including the following (a minimal usage sketch follows these lists):

● glance for image management, which will be typically but not necessarily limited to a

Debian/Ubuntu image with all the required software installed,

● cinder for block-based data access, the backend of all virtual machine instances. It supports LVM, Ceph, GlusterFS and many other storage solutions, leaving the choice of technology open to specific deployments,

● nova for management of computing resources, including scheduling and allocation.

In addition, this solution provides:

● multi-tenant support, including quota management based on groups of users,

● all the required packaging, especially well supported when used in conjunction with the

Debian/Ubuntu distribution mentioned above,

● potential to explore new usages of the computing infrastructure, which go beyond what

is typically done in scientific domains. Offerings of tools like R or other data analysis

services directly exposing their interfaces, with no need to even deal with the details of

setting up virtual machines, are extremely promising and should be explored.

● providing a common platform for engineers to develop and test software across the globe.
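
As a minimal illustration of how these services are driven programmatically, the sketch below uses the openstacksdk Python client to look up a Glance image and boot a Nova instance from it. The cloud, image, flavour and network names are assumptions, not SDP decisions.

# Illustrative openstacksdk sketch: boot a server from a pre-built image.
import openstack

# Credentials and region are read from a clouds.yaml entry named "sdp-dev" here.
conn = openstack.connect(cloud="sdp-dev")

# Glance: look up a pre-built image (e.g. a Debian/Ubuntu image with the
# required software installed).
image = conn.compute.find_image("ubuntu-sdp-pipeline")

# Nova: pick a flavour and boot an instance from that image.
flavor = conn.compute.find_flavor("m1.large")
server = conn.compute.create_server(
    name="sdp-test-node",
    image_id=image.id,
    flavor_id=flavor.id,
    networks=[{"uuid": conn.network.find_network("private").id}],
)
server = conn.compute.wait_for_server(server)
print(server.status)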

Apart from these, the range of services and agility of the new players (including large research institutes such as CERN) will allow the SKA to find a community of practitioners with whom it can share developments and co-develop new technologies. The use of OpenStack-based cloud


capability would also allow the SKA project to both insource and outsource compute tasks, based on a common platform and the availability of suitable compute capability.

Apart from the existing features, there are also some interesting ongoing developments relating to OpenStack, including the scheduling and orchestration of compute tasks and the use of components such as Glance [7] and Glint [8], the ability to use tiny execution containers, such as ZeroVM [1], associated with storage objects and streams to pre-process data prior to it being shipped, and the use of extremely efficient operating systems such as CoreOS [2]. It is worth mentioning that, in the framework of CSP and SDP, AUT, Catalyst and OpenParallel have successfully implemented an OpenStack instantiation on GPUs. Another point worth considering is an ongoing effort to provide Cloud Federation, led by CERN and Rackspace. Following the success of its Grid infrastructure, CERN wants to achieve the same using the new cloud-based interfaces.

Containerisation

Containerisation is an operating system level of virtualisation, where a single operating system

instance is shared by multiple isolated user space instances. In Linux isolation is maintained by

means of kernel namespaces, which add an identifier to system calls. In contrast, in hardware

virtualisation the physical hardware is abstracted from the user and a hypervisor presents a

virtual hardware interface to the software. Operating system level virtualisation is characterised

by:

● very low overhead

● direct access to the hardware (if allowed by the host operating system)

● much smaller images compared to hardware virtualized images.

While the cost model for cloud services is fundamentally based around the virtual private server (VPS), containers make it possible to deploy multiple VPS-like instances without incurring the costs of multiple VPSs, yet retaining the characteristics and security of individual VPSs. A container is secure: effectively a confined user space (chroot jail). Strict guest separation is achieved by means of Linux kernel namespaces, and a container has the characteristics of an independent network stack and resource control/quotas. Systems like Docker provide convenient mechanisms for connecting resources from one container to another, creating the necessary linkages for services that are jailed within a container to talk to each other; this includes network communication and filesystem access. These are the core characteristics that enable the likes of Juju [5] and Solum [6] to 'plug' components in containers together to create services. There are technical aspects of providing optimisations for:

● networking - VPSs will now be like mini VLANs as they run multiple hosts
● security - there may be considerations for specialist OSs (like CoreOS) that are stripped back to be essentially boxes that hold containers, with a lot of the standard OS processes removed
● compute - comparative research has shown that the compute overhead of containerisation is minimal compared to virtual machines [RD-02]


● storage - container-style resource sharing will be yet another abstraction layer above real storage, and the picture gets blurred when the host OS shares its storage with a container, or containers share storage between themselves. There are bound to be performance considerations if the trend is towards running processes like database storage within these environments

There are also management aspects:
● There will be a drive towards having abstracted interfaces for managing containers
● Performance monitoring tools that operate at the container level, container cluster level (within a VPS and across VPSs), and VPS level

Service design aspects:
● Services need to be designed to take advantage of cloud "as a Service" principles (backed by cloud storage, message queues, RDBMSs and so on, with horizontal scaling of web processes and multi-tenant concepts built in)
● It is likely that containerisation will require design pattern changes to take advantage of the paradigm.

With containerisation, each service is a cluster of individual processes (one distinct process per container). Processes should not assume that they speak directly to the underlying OS (e.g., for logging or temporary file storage). Containers can make standalone software viable as a service instead of having to re-engineer it for multi-tenanting.
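
As a minimal illustration of this one-process-per-container model, the sketch below uses the Docker SDK for Python to run a single service container with an explicit resource quota and a published port; the image name, port mapping and limits are illustrative assumptions.

# Minimal sketch using the Docker SDK for Python (docker-py).
import docker

client = docker.from_env()  # talks to the local Docker daemon

# Run a single-process service in its own container, with an explicit
# memory quota and a port published on the host's network stack.
container = client.containers.run(
    "redis:latest",              # hypothetical service image
    name="sdp-demo-cache",
    detach=True,
    mem_limit="512m",            # per-container resource control/quota
    ports={"6379/tcp": 6379},    # map container port to host port
)

print(container.status)          # e.g. 'created' or 'running'
print(container.logs()[:200])    # container-level log access

container.stop()
container.remove()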

Application development environment and software development kit

Using OpenStack to provide a development and test architecture for SKA engineers would ensure a consistency of approach that will otherwise be hard to realise. It would also allow engineers to define and provision their own tool sets and at the same time offer these to other engineers should the need arise. Some benefits that OpenStack might provide include the following (an illustrative sketch follows the list):

● Easy ability to build images with different standard software offerings, from which participants can easily spin up virtual machines (VMs) - simplifying system builds.

● Well-defined APIs so VMs can be created quickly on demand, then removed once the required computations have been completed - reducing cost.

● Can be hosted on their own hardware, providing full control over the hardware, or instead be hosted by a 3rd party, limiting the capital expenditure for the SKA participants.

● Data storage implementation can be tailored to the SKA requirements, these requirements may vary from raw data collection to the output of science experiments.

● OpenStack Cells or Regions can be used to have OpenStack clusters dotted around the world, but with one API, and one authentication database used by all of them. This allows limiting the data transfer of large data sets to within countries for example.

● Support by major HPC solution providers like Cray or IBM.
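
As an illustrative sketch of the on-demand lifecycle described above, the following uses the openstacksdk cloud layer to boot a worker VM, use it, and delete it once the required computations have completed; the cloud, image and flavour names are assumptions.

# Illustrative "on demand" lifecycle sketch using openstacksdk.
import openstack

conn = openstack.connect(cloud="sdp-dev")

server = conn.create_server(
    "sdp-build-worker",
    image="ubuntu-sdp-toolchain",   # hypothetical pre-built developer image
    flavor="m1.medium",
    wait=True,                       # block until the VM is ACTIVE
    auto_ip=True,
)

try:
    # ... run builds or tests against server.public_v4 here ...
    print("worker ready at", server.public_v4)
finally:
    # Remove the VM once the required computations have completed.
    conn.delete_server(server.id, wait=True)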


Schedulers

Schedulers used in pre-cursor telescopes

The SKA pre-cursors ASKAP (the Australian Square Kilometre Array Pathfinder), LOFAR (the Low Frequency Array, in the Netherlands), and MeerKAT (a mid-frequency array of parabolic reflectors in South Africa) have implemented various scheduling systems.

ASKAP uses a commercial scheduling software system named PBSPro. PBSPro is considered functionally stable, with good control of the computing hardware. It was chosen due to the familiarity of iVEC (who maintain the ASKAP computing platform) with it. The major drawback is the high licence cost, which rises rapidly with the number of computing nodes [9].

In LOFAR, another commercial scheduling software system named LST is being used. LST

provides a plethora of resource scheduling functions under the conditions of both online and

offline data processing [10]. The online pipelines perform the observations and data recording

tasks (including correlation and beam forming). The off-line (or post-processing) pipelines take care

of the raw data reduction and calibration/imaging. The two types of pipelines have different

scheduling constraints. In the case of the on-line pipelines, the scheduling software needs to

deal with the LOFAR hardware constraints and resource sharing (e.g., storage and LOFAR

station availability). It also needs to fulfil astronomical constraints such as target visibility and

LST scheduling. The off-line post-processing pipelines only need to be scheduled so that

science priorities are being met and the processing cluster is being used efficiently. The

drawback of this system from the viewpoint of the SKA is the lack of easy portability due to the

very different scheduling constraints for online and offline pipelines.

MeerKAT uses open-source scheduling software named Celery [11]. There seems to be no commercial support for this tool, although it is still being actively developed by the open-source community.

Adaptive Computing Moab HPC Suite

Moab (http://www.adaptivecomputing.com/) is a cluster management middleware that offers

workflow management facets such as schedule optimisation, resource monitoring and load

balancing. Moab is a commercial product. It is currently being used for job distribution on the 145-node low-voltage Intel Clovertown processor cluster known as "The Green Machine" at Swinburne University's Centre for Astrophysics and Supercomputing (CAS).

Simple Linux Utility for Resource Management

The Simple Linux Utility for Resource Management (SLURM, http://slurm.schedmd.com/) [RD-

04] is a modular open source resource manager designed for Linux clusters of all sizes. It is

currently the de-facto standard for cluster resource management in HPC. It allocates access to

resources and provides a framework to start, execute and monitor user jobs. The modular

nature of this scheduler may allow the scheduler behaviour to be modified to suit the SDP


requirements. The configurable aspects include multiple plugins for scheduling, accounting, prioritisation and topology-aware scheduling, among others.

A conventional HPC scheduler tries to minimise the time to answer while optimising the use of the system for as many users as possible. The SDP scheduler has far fewer users, possibly only a

single one: the instrument. Additionally, the SDP scheduler needs to provide Local Monitoring

and Control with a runtime estimate for a particular scheduled pipeline well in advance, based

on a priori information that may be obtained from reference runs.

The possibly heterogeneous nature of the SDP Compute Island may complicate the issue even

further. While the scheduling task itself is probably less complicated in SDP than it is in a normal

HPC facility, making an existing tool a very attractive option, the additional tasks will require

significant effort to develop.
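
As a sketch of how such a priori runtime estimates could be layered on top of an existing resource manager, the hypothetical wrapper below derives a wall-time limit from reference runs and submits a job through SLURM's standard sbatch command; the reference-run data, scaling model and script names are assumptions, not part of SLURM itself.

# Hypothetical sketch: estimate a pipeline's runtime from reference runs and
# pass it to SLURM as the job's wall-time limit via the standard sbatch CLI.
import json
import subprocess

def estimate_runtime_minutes(pipeline: str, n_nodes: int) -> int:
    """Scale a measured reference runtime to the requested node count,
    assuming (for illustration only) near-linear strong scaling."""
    with open("reference_runs.json") as f:
        ref = json.load(f)[pipeline]          # e.g. {"nodes": 4, "minutes": 120}
    est = ref["minutes"] * ref["nodes"] / n_nodes
    return int(est * 1.2) + 1                 # 20% safety margin for LMC reporting

def submit(pipeline: str, script: str, n_nodes: int) -> str:
    minutes = estimate_runtime_minutes(pipeline, n_nodes)
    cmd = [
        "sbatch",
        f"--job-name={pipeline}",
        f"--nodes={n_nodes}",
        f"--time={minutes}",                   # SLURM interprets a bare integer as minutes
        script,
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return out.stdout.strip()                  # e.g. "Submitted batch job 12345"

if __name__ == "__main__":
    print(submit("continuum-imaging", "imaging_pipeline.sh", n_nodes=32))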

Among the pros of this project are:

● Open source project with very active users and developers community

● Scalable to thousands of nodes and hundreds of thousands of jobs per hour

● Compatibility with different MPI implementations

● Light-weight daemons

● Good and extensive documentation

● There is the option to have commercial support through SchedMD

Cons:

● Lacks more advanced scheduling policies, such as SLA-based or multi-cluster scheduling.

Conclusions

In summary, all of the aforementioned scheduling systems are well established and integrated

into existing management platform software. However, none of these systems comply with the

unique requirements of the SDP, i.e. the ability to estimate the required resources and runtime

beforehand, based on a-priori knowledge. As such, the best way forward is probably to rely on

basic policies available in existing scheduling software to design an SDP-specific scheduler that

supports both deadline- and resource-aware scheduling.

References

[1] ZeroVM - A "process specific" hypervisor. See also "Using ZeroVM and Swift to Build a Compute Enabled Storage Platform", https://www.openstack.org/summit/openstack-summit-atlanta-2014/session-videos/presentation/using-zerovm-and-swift-to-build-a-compute-enabled-storage-platform

[2] CoreOS - Very efficient Linux distro for deployment of large numbers of servers https://coreos.com/


[3] LXC - Linux containers https://linuxcontainers.org/

[4] Docker - https://www.docker.com/ and see "Docker CEO: Why it just got easier to run multi-container apps" http://www.zdnet.com/docker-ceo-why-it-just-got-easier-to-run-multi-container-apps-7000036356/

[5] Juju - Canonical's Ubuntu based orchestration product https://juju.ubuntu.com/

[6] Solum - as per Juju but designed natively for OpenStack http://solum.io/

[7] Glance - the image service for OpenStack that allows common images to be defined, discovered and deployed automatically.

[8] Glint - Image distribution in a multi-cloud environment. Leveraging Glance, this project from Victoria University, Vancouver allows images and associated jobs to be scheduled across multiple cloud platforms. http://www.internetnews.com/blog/skerner/openstack-glance-gets-new-glint-for-replicated-and-distributed-images.html

[9] B. Nitzberg, J.M. Schopf, and J.P. Jones, PBS Pro: Grid computing and scheduling

attributes, in Grid resource management. 2004, Springer. p. 183-190.

[10] M. de Vos, A.W. Gunst, and R. Nijboer, The LOFAR telescope: System architecture and

signal processing. Proceedings of the IEEE, 2009. 97(8): p. 1431-1437.

[11] Celery Team, Celery-the distributed task queue, 2011.
