COSC 6397
Big Data Analytics
Grids, Clouds and Volunteer Systems
Edgar Gabriel
Spring 2014
Grid Computing
• Definition 1: Infrastructure that provides dependable, consistent, pervasive, and inexpensive access to high-end computational capabilities (1998)
• Definition 2: A system that coordinates resources not subject to centralized control, using open, general-purpose protocols to deliver nontrivial Quality of Service (2002)
Slides in this section based on a talk by Anda Iamnitchi:
http://www.csee.usf.edu/~anda/CIS6930-S11/notes/grids-and-clouds.ppt
An Example: The Globus Toolkit
– Initially developed at Argonne National Lab/University of Chicago and ISI/University of Southern California
How It Started
While the Globus developers helped build and integrate a diverse range of distributed applications, the same problems kept showing up over and over again.
– Too hard to keep track of authentication data (ID/password) across institutions
– Too hard to monitor system and application status across institutions
– Too many ways to submit jobs
– Too many ways to store & access files and data
– Too many ways to keep track of data
– Too easy to leave “dangling” resources lying around (robustness)
Forget Homogeneity!
• Trying to force homogeneity on users is futile. Everyone has their own preferences, sometimes even dogma.
• The Internet provides the model…
Building a Grid (in Practice)
• Building a Grid system or application is currently an exercise in software integration.
– Define user requirements
– Derive system requirements or features
– Survey existing components
– Identify useful components
– Develop components to fit into the gaps
– Integrate the system
– Deploy and test the system
– Maintain the system during its operation
• This should be done iteratively, with many loops and eddies in the flow.
How it Really Happens
[Diagram: a Grid deployment assembled from many components, e.g. web browser, web portal, compute servers, data catalog, database services, data viewer tool, simulation tool, chat tool, certificate authority, credential repository, registration service, cameras, telepresence monitor. The layering:]
• Users work with client applications
• Application services organize VOs & enable access to other services
• Collective services aggregate &/or virtualize resources
• Resources implement standard access & management interfaces
What Is the Globus Toolkit?
• The Globus Toolkit is a collection of solutions to problems that frequently come up when trying to build collaborative distributed applications.
• Not turnkey solutions, but building blocks and tools for application developers and system integrators.
– Some components (e.g., file transfer) go farther than others (e.g., remote job submission) toward end-user relevance.
• To date, the Toolkit has focused on simplifying heterogeneity for application developers.
• The goal has been to capitalize on and encourage use of existing standards (IETF, W3C, OASIS, GGF).
– The Toolkit also includes reference implementations of new/proposed standards in these organizations.
How To Use the Globus Toolkit
• By itself, the Toolkit has surprisingly limited end-user value.
– There’s very little user interface material there.
– You can’t just give it to end users (scientists, engineers, marketing specialists) and tell them to do something useful!
• The Globus Toolkit is useful to application developers and system integrators.
– You’ll need to have a specific application or system in mind.
– You’ll need to have the right expertise.
– You’ll need to set up prerequisite hardware/software.
– You’ll need to have a plan.
Globus Toolkit Components
Components grouped by functional area (Security, Data Management, Execution Management, Information Services, Common Runtime) and by interface style, tagged with the toolkit generation (GT2/GT3/GT4) that introduced them:
• Non-WS components:
– GT2: Pre-WS Authentication Authorization (Security); GridFTP (Data); Grid Resource Allocation Mgmt (Pre-WS GRAM) (Execution); Monitoring & Discovery System (MDS2) (Information); C Common Libraries (Runtime)
– GT3: Replica Location Service (Data); XIO (Runtime)
– GT4: Credential Management (Security)
• Web Services components:
– GT3: WS Authentication Authorization, Community Authorization Service (Security); Reliable File Transfer, OGSA-DAI [Tech Preview] (Data); Grid Resource Allocation Mgmt (WS GRAM) (Execution); Monitoring & Discovery System (MDS4) (Information); Java WS Core (Runtime)
– GT4: Delegation Service (Security); Community Scheduler Framework [contribution] (Execution); C WS Core, Python WS Core [contribution] (Runtime)
From Grids to Cloud Computing
• Logical steps:
– Make the grids public
– Provide much simpler interfaces (and more limited control)
– Charge for usage of resources
• Instead of relying on implicit incentives from science collaborations
• Ideally, at a “pay-as-you-go” rate
• In reality: a different history
– Cloud computing as utility computing (1966 paper)
• However, the promise of cloud computing finds a great user base in science grids due to:
– Intense computations
– Huge amounts of storage needed
• Much of the Grid research community is now working on clouds
– How much of that is mere rebranding is worth understanding
Why volunteer computing?
● 2006: 1 billion PCs, 55% privately owned
● If 100M people participate:
– 100 PetaFLOPS of computing, 1 Exabyte (10^18 bytes) of storage
● Consumer products drive technology
– GPUs (NVIDIA, Sony Cell)
[Chart: where the world’s computers live: academic, business, and home PCs]
Slides in this section based on a lecture by David Anderson:
http://www.cs.berkeley.edu/~demmel/cs267_Spr06/Lectures/Lecture11/lecture_11_VolunteerComputing.ppt
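A quick back-of-the-envelope check of these numbers; a minimal sketch in Python, assuming roughly 1 GFLOPS of sustained compute and 10 GB of spare disk per PC (plausible mid-2000s figures implied by, but not stated on, the slide):

# Back-of-the-envelope volunteer-computing capacity estimate.
PER_PC_FLOPS = 1e9        # assumed ~1 GFLOPS sustained per volunteer PC
PER_PC_DISK_BYTES = 10e9  # assumed ~10 GB of donated disk per PC
participants = 100e6      # 100 million volunteers, one PC each

total_flops = participants * PER_PC_FLOPS
total_disk = participants * PER_PC_DISK_BYTES
print(f"compute: {total_flops / 1e15:.0f} PetaFLOPS")  # -> 100 PetaFLOPS
print(f"storage: {total_disk / 1e18:.0f} Exabyte")     # -> 1 Exabyte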
Volunteer computing history
[Timeline, 1995–2006: GIMPS and distributed.net (mid-1990s); SETI@home and folding@home (around 1999–2000); a wave of commercial projects (~2000); followed by climateprediction.net, the BOINC platform, Predictor@home, LHC@home, BURP, Einstein@home, Rosetta@home, PrimeGrid, …]
Scientific computing paradigms
[Chart: the paradigms ordered along two axes. Control over the resources: supercomputers most, then cluster computing, then grid computing, with volunteer computing least. “Bang per buck” runs in the reverse order: volunteer computing most, supercomputers least.]
BOINC
[Diagram: BOINC mediates between projects and volunteers: projects (SETI, physics, climate, biomedical) on one side, volunteers (Joe, Alice, Jens) on the other; each volunteer may attach to any set of projects.]
Participation in >1 project
• Better short-term resource utilization
– communicate/compute in parallel
– match applications to resources
• Better long-term resource utilization
– project A works while project B thinks
[Diagram: a project’s computing needs over time alternate between “think” phases (little work available) and “work” phases; while project A thinks, a volunteer host can work for project B.]
Server performance
How many clients can a project support?
Server limits
• Single server (2X Xeon, 100 Mbps disk)
– 8.8 million tasks/day
– 4.4 PetaFLOPS (if each task takes 12 hrs on a 1 GFLOPS CPU)
– CPU is the bottleneck (2.5% disk utilization)
– 8.2 Mbps of network traffic (if each request/reply totals 10 KB)
• Multiple servers (1 MySQL, 2 for others)
– 23.6 million tasks/day
– MySQL CPU is bottleneck
– 21.9 Mbps network
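The FLOPS and bandwidth figures follow directly from the task rate; a minimal sketch of the arithmetic in Python, using the 12-hour task length and ~10 KB message size from the slide’s parentheticals:

# Deriving the single-server throughput figures from 8.8 million tasks/day.
TASKS_PER_DAY = 8.8e6
SECONDS_PER_DAY = 86400

# Each task: 12 hours of computing on a 1 GFLOPS CPU (slide's assumption).
flops_per_task = 12 * 3600 * 1e9
sustained = TASKS_PER_DAY * flops_per_task / SECONDS_PER_DAY
print(f"{sustained / 1e15:.1f} PetaFLOPS")  # -> 4.4 PetaFLOPS

# Each task costs one ~10 KB scheduler request/reply (slide's assumption).
bits_per_task = 10e3 * 8
bandwidth = TASKS_PER_DAY * bits_per_task / SECONDS_PER_DAY
print(f"{bandwidth / 1e6:.1f} Mbps")        # -> ~8.1 Mbps (slide: 8.2)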
Credit / Credit Display
[Screenshots: examples of BOINC credit statistics and credit-display pages]
Credit system goals
• Retain participants
– fair between users, across projects
– understandable
– cheat-resistant
• Maximize utility to projects
– hardware upgrades
– assignment of projects to computers
Credit system
• Computation credit
– benchmark-based
– application benchmarks
– application operation counting
– cheat-resistance: redundancy (see the sketch below)
• Other resources
– network, disk storage, RAM
• Other behaviors
– recruitment
– other participation
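Redundancy resists cheating because each workunit is computed by several hosts, each of which claims credit, and the project grants a robust aggregate, so a single inflated claim cannot raise anyone’s grant. A minimal sketch of one such policy in Python (the median rule is illustrative, not BOINC’s exact algorithm):

import statistics

def granted_credit(claims: list[float]) -> float:
    """Grant credit for a redundantly computed workunit.

    Each replica host claims credit (benchmark score * CPU time).
    Taking the median discards outliers, so one host exaggerating
    its benchmark cannot inflate the grant. Illustrative policy only.
    """
    if not claims:
        raise ValueError("no claims for this workunit")
    return statistics.median(claims)

# Three replicas of one workunit; one host claims far too much.
print(granted_credit([42.0, 45.1, 900.0]))  # -> 45.1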
Goals of BOINC
• > 100 projects, some churn
• Handle big data better
– BitTorrent integration
– Use GPUs and other resources
– DAGs
• Participation
– 10-100 million
– multiple projects per participant
What is Condor?
• A full-featured batch queue system.
• Condor is a specialized workload management system for compute-intensive jobs.
• Condor provides a job queuing mechanism, scheduling policy, priority scheme, resource monitoring, and resource management.
• Users submit their serial or parallel jobs to Condor
• Condor places them into a queue, chooses when and where to run the jobs based upon a policy, carefully monitors their progress, and ultimately informs the user upon completion.
• Can be used to manage a cluster of dedicated compute nodes
• In addition, unique mechanisms enable Condor to effectively harvest wasted CPU power from otherwise idle desktop workstations.
• Condor does not require a shared file system across machines - if no shared file system is available, Condor can transfer the job's data files on behalf of the user
• Condor can be used to seamlessly combine all of an organization's computational power into one resource.
Slides in this section based on a talk by J. Knudstrup:
http://www.eso.org/projects/dfs/dfs-shared/web/rei/rei-condor-intro.ppt
History of Condor
• Hosted at the University of Wisconsin, USA.
• A preliminary version of the Condor resource management system was implemented in 1986; the Condor project itself started in 1988.
• Directed by Professor M. Livny.
• Originally focused on the problem of load balancing in a distributed system.
• Later shifted its attention to distributively owned computing environments, where owners have full control over the resources they own.
Status
• ~17 years of development.
• Condor team consists of ~30 people.
• Available on many platforms.
• Basic installation and usage very easy.
• Contracted + free support.
• Used in research environments and by industry.
• Sponsored by various major IT companies and organizations (IBM, Intel, Microsoft, NASA, …).
Architecture
• Coordinated by a Central Manager node.
• No central DBMS.
• Condor provides a set of daemons defining the roles of each node in the pool.
• Daemons:
– condor_master: Basic coordination on each node.
– condor_collector: Collects system information. Only on the Central Manager.
– condor_negotiator: Assigns jobs to machines. Only on the Central Manager.
– condor_startd: Executes jobs.
– condor_schedd: Handles job submission.
Condor Pool - Example
[Diagram of a pool, showing ClassAd communication pathways and spawned processes:]
• Central Manager: runs master, collector, negotiator, schedd, and startd.
• Submit-only node: runs master and schedd.
• Execute-only nodes: run master and startd.
• Regular nodes: run master, schedd, and startd.
Personal Condor vs. Condor Pool
• Condor Pool:
– Collection of several nodes coordinated by one, Central Manager.
• Personal Condor:
– Condor on one workstation, no root access required, no system administrator intervention needed.
– Benefits of a ‘pool’ with only one node (same as for a full pool):
• Schedule large batches of jobs and have them processed in the background.
• Keep an eye on jobs and get progress updates.
• Implement your own scheduling policies for the execution order of jobs.
• Keep a log of the job activities.
• Add fault tolerance to the job execution.
• Implement policies for when jobs can run on a workstation.
Dedicated Nodes vs. Non-Dedicated Nodes
• Dedicated Node:
– Condor has all CPUs at its disposal.
• Non-Dedicated Node:
– Can't always run Condor jobs.
– If a user is at the keyboard/mouse, or the CPU is used by other processes, the Condor jobs are preempted.
– The policies for when Condor jobs can be started and when they may be preempted are defined in the Condor configuration (see the sketch below).
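A minimal sketch of such a policy in a Condor configuration file; the thresholds are illustrative choices, while START, SUSPEND, and PREEMPT are standard configuration expressions and KeyboardIdle and LoadAvg are standard machine attributes:

MINUTE = 60
# Start jobs only after 15 minutes of keyboard/mouse idle time
# and while the machine is otherwise lightly loaded.
START = KeyboardIdle > 15 * $(MINUTE) && LoadAvg <= 0.3
# Suspend a running job as soon as the owner is back at the keyboard.
SUSPEND = KeyboardIdle < $(MINUTE)
# Give up and vacate the job if it stays suspended for too long.
PREEMPT = (CurrentTime - EnteredCurrentActivity) > 10 * $(MINUTE)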
Shared File System vs. File Distribution
• Use a shared filesystem if available.
– Administration and handling are easier.
– Normally the case for a dedicated cluster.
• If there is no shared filesystem:
– Condor can transfer files (see the sketch below).
– Can automatically send back changed files.
– Atomic transfer of multiple files.
– Data can be encrypted during transfer.
– Usually the case for pools with non-dedicated nodes or in a Grid environment.
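What file transfer looks like in a submit description file; a minimal sketch in which the executable and input file names are made up, while should_transfer_files, transfer_input_files, and when_to_transfer_output are standard submit commands:

# Run without a shared filesystem: let Condor ship the files.
Universe = vanilla
Executable = my_job
# Hypothetical input files, staged to the execute machine:
transfer_input_files = params.dat, table.dat
should_transfer_files = YES
# Changed/created output files are sent back when the job exits.
when_to_transfer_output = ON_EXIT
Queue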
Personal Condor - Condor Pool - Condor Flocking
[Diagram: a submission node running a Personal Condor can “flock” jobs onward to other pools (e.g. a dedicated pool or a common-user desktop pool) when its own resources are busy.]
Condor Configuration
• Simple format (based on ClassAd).
• Possible to use environment variables.
• Global configuration and local specific to each node.
• Large set of configurable parameters.
• Example:
CONDOR_HOST = dfo09.hq.eso.org
RELEASE_DIR = /home/condor/INSTROOT/
LOCAL_DIR = $(TILDE)
LOCAL_CONFIG_FILE = $(LOCAL_DIR)/condor_config.local
REQUIRE_LOCAL_CONFIG_FILE = TRUE
CONDOR_ADMIN = [email protected]
MAIL = /bin/mail
UID_DOMAIN = hq.eso.org
FILESYSTEM_DOMAIN = $(FULL_HOSTNAME)
…
Command Line Tools
• Many command line tools are provided; some of these are (a typical session is sketched below):
– condor_config_val: Get/set value of configuration parameters.
– condor_history: Query job history queue.
– condor_off: Stop Condor daemons.
– condor_q: Check job queue.
– condor_reconfig: Force sourcing of configuration.
– condor_rm: Remove jobs from the queue.
– condor_status: Status of Condor pool.
– condor_submit: Submit a job or a cluster of jobs.
– condor_submit_dag: Submit a set of jobs with dependencies.
– …
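A plausible end-to-end session with these tools; the submit file name and the job ID are made up for illustration:

# Submit a job cluster described in my_job.submit (hypothetical file).
condor_submit my_job.submit
# Watch the jobs move through the queue...
condor_q
# ...and check the state of the machines in the pool.
condor_status
# Remove a misbehaving job (cluster 42, process 3 -- made-up IDs).
condor_rm 42.3
# After completion, look the jobs up in the history.
condor_history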
Submitting Clusters of Jobs
# Example condor_submit input file that defines
# a cluster of 600 jobs with different directories
Universe = vanilla
Executable = my_job
Log = my_job.log
Arguments = -arg1 -arg2
Input = my_job.stdin
Output = my_job.stdout
Error = my_job.stderr
InitialDir = run_$(Process)
Queue 600
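In the example above, $(Process) expands to 0 … 599, so each job runs in its own directory run_0 … run_599; these directories must exist (each containing my_job.stdin) before submission. A minimal shell sketch for preparing and submitting, assuming the submit description above is saved as my_job.submit:

# Create one working directory per job, with its input file in place.
for i in $(seq 0 599); do
    mkdir -p run_$i
    cp my_job.stdin run_$i/
done
# Submit all 600 jobs as one cluster.
condor_submit my_job.submit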