VC3: Virtual Clusters for Community Computation
Douglas Thain, University of Notre Dame
Rob Gardner, University of Chicago
John Hover, Brookhaven National Lab
You have developed a large-scale workload that runs successfully on a university cluster.
Now you want to migrate and expand that application to national-scale infrastructure (and allow others to easily access and run similar workloads).
[Figure: three classes of target platform: Traditional HPC Facility, Distributed HTC Facility, Commercial Cloud]
IceCube Simulation DAG
[Figure: workflow DAG. A signal generator and a background generator (CPU) feed parallel photon propagator tasks (GPU); their output fans out to parallel detector and filter stages (CPU), followed by a final cleanup step.]
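To make the shape of the workflow concrete, the DAG can be written as a dependency table. The following Python sketch is purely illustrative: stage names follow the figure, the real fan-out widths are larger, and the level-by-level loop stands in for a real workflow engine.

    # Minimal sketch of the simulation DAG as a dependency table;
    # fan-out widths in the real workflow are larger than shown here.
    DAG = {
        "signal_generator": [],
        "background_generator": [],
        "photon_propagator": ["signal_generator", "background_generator"],  # GPU stage
        "detector": ["photon_propagator"],
        "filter": ["detector"],
        "cleanup": ["filter"],
    }

    def runnable(done):
        """Stages whose dependencies are all finished and that have not yet run."""
        return [s for s, deps in DAG.items()
                if s not in done and all(d in done for d in deps)]

    # Example: execute the DAG level by level.
    done = set()
    while len(done) < len(DAG):
        level = runnable(done)
        print("run in parallel:", level)
        done.update(level)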
CMS Data Analysis w/Lobster
Anna Woodard, Matthias Wolf, et al., Scaling Data Intensive Physics Applications to 10k Cores on Non-Dedicated Clusters with Lobster, IEEE Conference on Cluster Computing, September, 2015.
[Figure: Lobster architecture. The Lobster master application submits and waits on tasks through the Work Queue master library; tasks flow through a layer of foremen to banks of 16-core workers acquired from multiple paid allocations ($$$), with local files and programs (A, B, C) staged to the workers.]
The Perils of Workload Migration
• Dynamic resource configuration and scaling:
– # nodes, cores/node, RAM/core, disk, GPUs
• OS expectations:
– Ubuntu, Cray, Red Hat, Debian, etc.
• Software dependencies:
– Scripting languages, installed libraries, supporting tools…
• Online service dependencies:
– Batch systems, databases, web proxies, …
• Network accessibility:
– Addressability, incoming/outgoing, port ranges, protocols…
• Storage configuration:
– Local, global, temporary, permanent, home/project/tmp…
Can we make HPC more like cloud?
• User cluster specification (sketched below):
– 50-200 nodes of 24 cores and 64GB RAM/node
– 150GB local disk per node
– 100TB shared storage space
– 10Gb outgoing public internet access for data
– CMS software 8.1.3 and python 2.7
– Running Condor or Spark or Makeflow . . .
• Of course, we cannot unilaterally change other computing sites!
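For concreteness, such a request might be captured in a machine-readable form roughly like the following sketch. The field names here are illustrative assumptions, not an actual VC3 schema.

    # Illustrative sketch of a user cluster specification;
    # field names are hypothetical, not the actual VC3 schema.
    cluster_spec = {
        "nodes": {"min": 50, "max": 200},
        "cores_per_node": 24,
        "ram_per_node_gb": 64,
        "local_disk_gb": 150,
        "shared_storage_tb": 100,
        "outgoing_network_gbps": 10,
        "software": ["cms-8.1.3", "python-2.7"],
        "middleware": "condor",   # or "spark", "makeflow", ...
    }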
So, that means containers and VMs? Not necessarily.
VMs and containers are great, and we will use them where needed, but:
1) Not all sites deploy them.
2) We want to use native hardware (and software) whenever possible.
Concept: Virtual Cluster
• 200 nodes of 24 cores and 64GB RAM/node
• 150GB local disk per node
• 100TB shared storage space
• 10Gb outgoing public internet access for data
• CMS software 8.1.3 and python 2.7
[Figure: a Virtual Cluster Service accepts the specification and launches a Virtual Cluster Factory at each site; the factories deploy services that together form the virtual cluster.]
Project Status and Structure
• Just getting started; funding began June 2016.
• First milestone, for the PI meeting today:
– VC across three sites at UC/ND runs IceCube.
Notre Dame: Douglas Thain, Ben Tovar (CSE); Kevin Lannon, Michael Hildreth, Kenyi Hurtado (CMS); Paul Brenner (CRC)
University of Chicago: Robert Gardner, Lincoln Bryant, Benedikt Riedel
Brookhaven National Lab: John Hover, Jose Caballero
VC3 Architecture
"Create a virtual cluster!"
[Figure: the user issues this request through the User Portal, which drives a VC3 Service Instance. Guided by the cluster spec, a site catalog, and a software catalog, the VC3 Pilot Factory submits pilots into the batch systems of several resource providers; the pilots launch middleware (MW) nodes under a middleware scheduler, and the end user accesses the VC head node.]
Teardown is Critical!
"Destroy my virtual cluster!"
[Figure, before and after: the user issues this request through the User Portal; the VC3 Service Instance withdraws the pilots and middleware nodes from the resource providers' batch systems, leaving only the pilot factory, cluster spec, middleware scheduler, and the software and site catalogs.]
Inherent Challenges
• Portal -> Service Instance
– Reliability, specification, collaboration, discoverability, lifecycle management.
• Cluster Factory
– Configuration, impedance matching, response to outages, right-sizing to workload, authentication, cost management.
• Environment Construction
– Specification complexity and portability, detection of existing environments, environment sharing, resource consumption.
• Performance Management
– Want small easy, big possible. Matching HW capability to middleware deployment. Environment compatible with manycore, GPU, FPGA.
• Site Management
– Work with the site owners, not against them. Collect relevant configuration data. Make VC deployment transparent to sites.
Changing Technology Landscape
• Resource Management Systems
– Condor, PBS, SLURM, Cobalt, UGE, Mesos, ???
• User Interests in Middleware
– Workflows, GlideinWMS, PanDA, Hadoop, Spark, ???
• Software Deployment Technologies
– VMs -> LXC -> Docker -> Singularity -> ???
– CVMFS, tarballs, NixOS, Spack, ???
• Access to Resources
– Old way: SSH + keys. New way: two-factor auth.
• Our approach:
– Pick a place to stand, but keep specific technologies at arm's length and be prepared to change.
Prototype Implementation
• Portal -> Service Instance
– (under construction)
• Pilot Job Factory
– AutoPyFactory (APF) from BNL
– SSH/BOSCO to connect to resource providers
• Pilot Job and Environment Deployment
– Local software install via tarballs + PATH. (Groundwerk)
– Access CVMFS via FUSE or Parrot, whichever is available.
• User-Visible Middleware
– Condor batch system (user-level "glide-in")
• Application
– IceCube data analysis
Key Idea:
Specify requirements in the abstract. Deliver requirements by matching or creating, or both.*
* (This only works if you can characterize requirements very accurately.)
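A minimal sketch of the match-or-create decision, assuming a simple set of site capabilities and a recipe catalog; all names here are illustrative, not the VC3 implementation.

    # Illustrative match-or-create logic; names are hypothetical, not VC3's API.
    RECIPES = {"python-2.7.12": "Python-2.7.12.tgz"}   # requirement -> build recipe

    def install(source):
        """Placeholder for a user-space build (untar, configure, make)."""
        return "/home/user/vc3/" + source

    def deliver(requirement, site_provides):
        """Satisfy a requirement by matching the site's stack, or creating it."""
        if requirement in site_provides:
            return ("matched", requirement)       # use what is already there
        if requirement not in RECIPES:
            # The footnote's caveat: this only works when requirements are
            # characterized accurately enough to map onto a recipe.
            raise RuntimeError("no recipe for " + requirement)
        return ("created", install(RECIPES[requirement]))

    # Example: one site has a native match, another needs a build.
    print(deliver("python-2.7.12", {"python-2.7.12", "gcc-4.9"}))
    print(deliver("python-2.7.12", {"python-2.6.6"}))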
Tour of First Milestone Prototype:
• Application (IceCube Simulation)
• Environment Creation (VC3-Pilot)
• Cluster Factory (AutoPyFactory)
IceCube Software and Jobs
• Experiment-specific software stack:
– Dependencies not typical for a particle physics experiment: Boost, HDF5, SuiteSparse, cfitsio, etc.
– Distributed mostly through the CVMFS global filesystem now; tarballs still used in edge cases; containers are an issue.
– Moving to shipping a C++11-compliant environment (own compiler, etc.)
• Heavily invested in GPU accelerators.
• Average job: 2-4 GB RAM, 10 GB disk, 2 hour wall time.
• Tail-end job: 6+ GB RAM, 100 GB disk, 10s to 100s of hours.
• Need to record all details about a job, forever: job configuration, where it ran, resource usage, efficiency, etc.
CVMFS Global Filesystem
[Figure: a HEP task accesses /cvmfs through Parrot or FUSE; the CVMFS driver fetches metadata and data by HTTP GET through a hierarchy of squid proxies from a www server backed by content-addressable storage (CAS), caching objects locally. Repositories are built into the CAS; the CMS software repository alone is 967 GB and 31M files.]
http://cernvm.cern.ch/portal/filesystem
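The content-addressable scheme is what makes aggressive proxy caching safe: every object is named by the hash of its contents, so a cached copy can never be stale. A toy sketch of the idea follows; the path layout and hash choice are illustrative assumptions, not CVMFS's exact wire format.

    # Toy sketch of content-addressable retrieval; path layout and hash
    # choice are illustrative assumptions, not CVMFS's exact wire format.
    import hashlib
    import urllib.request

    def fetch_object(base_url, digest, cache={}):
        """Fetch an immutable object by its content hash, via an HTTP proxy."""
        if digest in cache:                        # local CAS cache hit
            return cache[digest]
        url = "%s/data/%s/%s" % (base_url, digest[:2], digest[2:])
        data = urllib.request.urlopen(url).read()  # honors the http_proxy env var
        assert hashlib.sha1(data).hexdigest() == digest, "corrupt object"
        cache[digest] = data                       # safe to cache forever
        return data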
CVMFS + HPC Challenges
• Need disk local to the node (ideal) or site (ok) for local cache management. (Project: RAM $$$)
• Need FUSE (ideal) to mount the FS; otherwise use Parrot (ok) for user-level interception.
• Must have a local HTTP proxy; otherwise CVMFS becomes a denial-of-service attack.
• Site operators dislike blocking CPU for data.
• CVMFS itself has dependencies to install!
Jakob Blomer, Predrag Buncic, Rene Meusel, Gerardo Ganis, Igor Sfiligoi and Douglas Thain, The Evolution of Global Scale Filesystems for Scientific Software Distribution, IEEE/AIP Computing in Science and Engineering, 17(6), pages 61-71, December, 2015.
Delivering Dependencies with VC3-Pilot
vc3-pilot --require python 2.7.12 icecube-sim input.dat
• Query the current environment.
• Install missing pieces (recursively) in /home.
• Run the program with a modified PATH.
[Figure: three resource providers natively offer Python 2.6, Python 2.7, and Python 3.0; on each, a pilot runs the task against Python 2.7, using the native installation where it matches and installing its own copy where it does not.]
"python":[ { "version":"v2.7.12", "versioncmd":"python --version", "versionreg":"Python ([0-9.]*).*", "sources":[ { "type":"tarball", "files":[ "Python-2.7.12.tgz" ], "recipe":[ "./configure --prefix=${VC3_PREFIX} --libdir=${VC3_PREFIX}/lib", "make", "make install", "ln -s ${VC3_PREFIX}/bin/pydoc{,2}" ] } ], "environment_variables":[ { "name":"PATH", "value":"bin" },
Recipes Define Environments
data dependencies
setup instructions
environment setup
app definition
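A minimal sketch of how a pilot might act on such a recipe, first probing for a native version and falling back to a user-space build; the function names and control flow are illustrative, not the vc3-pilot source.

    # Illustrative sketch of recipe-driven delivery; not the vc3-pilot source.
    import os, re, subprocess

    def have_native(recipe):
        """Probe the environment: run versioncmd, compare against version."""
        try:
            p = subprocess.run(recipe["versioncmd"].split(),
                               capture_output=True, text=True)
        except OSError:
            return False                  # tool not installed at all
        # 'python --version' prints to stderr on Python 2, stdout on Python 3.
        m = re.match(recipe["versionreg"], (p.stdout + p.stderr).strip())
        return bool(m) and "v" + m.group(1) == recipe["version"]

    def deliver(recipe, prefix):
        if have_native(recipe):
            return                        # match: use the site's installation
        env = dict(os.environ, VC3_PREFIX=prefix)
        for step in recipe["sources"][0]["recipe"]:
            # Create: run each build step (tarball download/unpack omitted).
            subprocess.run(step, shell=True, check=True, env=env)
        # Run the application with a modified PATH, as the slide describes.
        os.environ["PATH"] = os.path.join(prefix, "bin") + ":" + os.environ["PATH"]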
CVMFS Deployment via VC3-Pilot
vc3-pilot --require cvmfs icecube-sim input.dat
• Search for existing services.
• Download dependent software.
• Deploy using Parrot (user-level VM) if necessary.
[Figure: on two resource providers, /cvmfs is already mounted via FUSE and the task uses it directly; on the third, the pilot interposes Parrot to provide /cvmfs at user level.]
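The fallback logic might look roughly like this sketch, under the assumption that the pilot simply probes the mount point and wraps the command with parrot_run from CCTools when FUSE is unavailable; this is not the actual vc3-pilot code.

    # Sketch of the FUSE-or-Parrot fallback; not the actual vc3-pilot code.
    import os, subprocess, sys

    def run_with_cvmfs(command):
        if os.path.isdir("/cvmfs/cms.cern.ch"):
            # An existing FUSE mount: run the task directly against it.
            return subprocess.call(command)
        # No mount available: interpose Parrot (from CCTools) so that
        # accesses to /cvmfs are trapped and served at user level.
        return subprocess.call(["parrot_run"] + command)

    if __name__ == "__main__":
        sys.exit(run_with_cvmfs(sys.argv[1:]))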
CVMFS Deployment via VC3-Pilot
Set up the software environment required for scientific applications:

% stat /cvmfs/cms.cern.ch
stat: cannot stat '/cvmfs/cms.cern.ch': No such file or directory
% ./vc3-pilot --require cvmfs -- stat /cvmfs/cms.cern.ch
  File: '/cvmfs/cms.cern.ch'
  Size: 4096    Blocks: 9    IO Block: 65336 ...
[Screenshots: the IceCube demo's dependencies as reported by the pilot, first on a host that already has CVMFS, then on a host that does not.]
The MAKER Genomics Pipeline
http://www.yandell-lab.org/software/maker.html
vc3-pilot --require maker maker -BIO
A custom Docker container on Jetstream took weeks to assemble by hand. Converted to vc3-pilot, it was successfully ported to Stampede in a single automated install.
AutoPyFactory from BNL
• Primary concern is intelligently, efficiently, and deterministically scaling overlay submission to the WMS workload, based on policy:
– How many pilots to submit, combining info from multiple sources?
• Chainable scheduler logic plugins allow "algorithms via config file".
• Single-process, multi-threaded, no-database, object-oriented Python daemon, resulting in high reliability/stability.
• "Everything is a plugin" architecture makes new usage easy/safe.
• Leverages developer effort, infrastructure, scalability, resource targets, authorization mechanisms, and the common interface (everything is a job) of the HTCondor project, all of which would need to be custom-coded without Condor.
– The Condor-G interface submits any executable, with job resource requirements (memory, disk, core count, walltime, etc.) if specified by the WMS queue.
– Scalability and speed allow rapid submission.
Submission Policies
Current APF supports a "demand-driven" policy: the logic is driven by how much idle work is waiting. A separate APF queue handles each demand level, e.g.:

[low-demand]
sched.ready.offset = 0
sched.maxtorun.maximum = 1000

[medium-demand]
sched.ready.offset = 1000
sched.maxtorun.maximum = 500
sched.scale.factor = 0.10

[high-demand]
sched.ready.offset = 6000
sched.maxtorun.maximum = 100
sched.scale.factor = 0.01
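One plausible reading of how such chained scheduler plugins combine is sketched below: each plugin transforms a tentative submission count in turn. This is a hedged illustration; APF's actual plugin semantics may differ.

    # Hedged sketch of chained demand-driven scheduling; APF's actual
    # plugin semantics may differ.
    def pilots_to_submit(ready, offset, scale, maxtorun, running):
        """Turn idle ('ready') WMS work into a pilot submission count."""
        n = ready - offset              # Ready plugin: demand above this queue's offset
        n = int(n * scale)              # Scale plugin: submit a fraction of the demand
        n = min(n, maxtorun - running)  # MaxToRun plugin: cap the total in flight
        return max(n, 0)

    # Example: 8000 idle jobs seen by the high-demand queue configured above.
    print(pilots_to_submit(ready=8000, offset=6000, scale=0.01,
                           maxtorun=100, running=40))   # -> 20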
Putting it All Together
• VC created by AutoPyFactory on:
– UC ATLAS Tier-3 running Condor + FUSE
– UC OSG Testbed running PBS w/o FUSE
– ND CRC cluster running SGE w/o FUSE
• Payload:
– VC3-Pilot deploys dependencies, mounts CVMFS.
– IceCube data analysis task.
• Truth in advertising:
– GPU detection/configuration
– Web proxy discovery
Demo Time
• Deploy CVMFS on the fly: vc3-pilot --require cvmfs
  https://asciinema.org/a/40j5dnd6m67yog3y4qa4tw957
• Deploy MAKER on the fly: vc3-pilot --require maker-ecoli-example-01
  https://asciinema.org/a/4qzmcrpmrzssxen6s1knkgw86
• VC jobs running at UC via "glide in":
  http://asciinema.org/a/7a9ku2k4z3ujtnr1v4cjo6mq3
• VC jobs running at ND via "hobble in":
  https://asciinema.org/a/84798
Much more to do…
• User Portal and Dynamic Service Instances
• Scale and Dynamic Behavior
– New problems with each order of magnitude.
– Manage 10K+ cores on demand!
• Fitting into the Ecosystem
– Work with sysadmins to reconcile user flexibility with respect for local configuration and policy.
• Deployment
– Per-site configuration, rather than per-job.
– Better exploit existing packages / tools?
– How to discover / deploy new services?
• Applications
– Starting with LHC: CMS, ATLAS; HEP: IceCube.
– Show generality with bio: MAKER, AWE, LifeMapper.
http://virtualclusters.org