From PC Clusters to a Global Computational Grid David Abramson Head of School Computer Science and...
From PC Clusters to a Global Computational Grid
David Abramson
Head of School
Computer Science and Software Engineering
Monash University
Thanks to:
Jon Giddy, DSTC
Rok Sosic, Active Tools
Andrew Lewis, QPSF
Ian Foster, ANL
Rajkumar Buyya, Monash
Tom Peachy, Monash
©David Abramson
Research Model
Nimrod ('94 - '98)
Nimrod/O ('97 - '99): ARC
Nimrod/G ('98 - ): DSTC
ActiveSheets ('00 - ): DSTC
Applications
Commercialisation ('97 - )
Parametrised Modelling: Killer App for the Grid?
Study the behaviour of some of the output variables against a range of different input scenarios.
Computations are uncoupled (file transfer)
Allows real time analysis for many applications
More realistic simulations
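A parametric study of this kind can be pictured as the cross product of the input ranges, where each combination becomes one independent job. The sketch below is illustrative only: the parameter names (`thickness_mm`, `load_kn`) and the `run_model` stand-in are assumptions made for the example and are not part of Nimrod or Clustor.

```python
from itertools import product

def run_model(thickness_mm, load_kn):
    """Hypothetical stand-in for one uncoupled simulation run."""
    return load_kn / thickness_mm  # toy output variable

def sweep(ranges):
    """Yield one (scenario, output) pair per combination of input values."""
    names = sorted(ranges)
    for values in product(*(ranges[n] for n in names)):
        scenario = dict(zip(names, values))
        yield scenario, run_model(**scenario)

# Two input variables with 2 and 3 values give 6 independent scenarios.
ranges = {"thickness_mm": [2.0, 4.0], "load_kn": [10.0, 20.0, 30.0]}
results = list(sweep(ranges))
```

Because no scenario depends on another, the runs could be handed to different machines in any order, which is what makes this style of study a natural fit for clusters and grids.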
Working with Small Clusters
Nimrod (1994 - )
– DSTC funded project
– Designed for department-level clusters
– Proof of concept
Clustor (www.activetools.com) (1997 - )
– Commercial version of Nimrod
– Re-engineered
Features
– Workstation orientation
– Access to idle workstations
– Random allocation policy
– Password security
Execution Architecture
[Diagram labels: Input Files, Substitution, Output Files, Root Machine, Computational Nodes]
Clustor by example
Physical Model
[Figure labels: f; time to crack in this position]
(Courtesy Prof Rhys Jones, Dept Mechanical Engineering, Monash University)
Sample Applications of Clustor
Bioinformatics: Protein Modelling
Sensitivity experiments on smog formation
Combinatorial Optimization: Meta-heuristic parameter estimation
Ecological Modelling: Control Strategies for Cattle Tick
Electronic CAD: Field Programmable Gate Arrays
Computer Graphics: Ray Tracing
High Energy Physics: Searching for Rare Events
Physics: Laser-Atom Collisions
VLSI Design: SPICE Simulations
Fuzzy Logic Parameter setting
ATM Network Design
SMOG Sensitivity Experiments
[Figure labels: Control ROC, Control NOx, $$$]
Physics - Laser Interaction
Current Application Drivers
Public Health Policy: Dr Dinelli Mather, Monash University & MacFarlane Burnett
Health Standards: Lew Kotler, Australian Radiation Protection and Nuclear Safety Agency
Airframe Simulation: Dr Shane Dunn, AMRL, DSTO
Network Simulation: Dr Mahbun Hassan, Monash
Evolution of the Global Grid
[Diagram labels: Desktop, Department Clusters, Shared Supercomputer, Enterprise-Wide Clusters, Global Clusters]
The Nimrod Vision ...
Can we make it 10% smaller? We need the answer by 5 o’clock
Towards Grid Computing ... The GUSTO Testbed
Source: www.globus.org & updated
What does the Grid have to offer?
“Dependable, consistent, pervasive access to [high-end] resources”
Dependable: Can provide performance and functionality guarantees
Consistent: Uniform interfaces to a wide variety of resources
Pervasive: Ability to “plug in” from anywhere
Source: www.globus.org
Challenges for the Global Grid
Security
Resource Allocation & Scheduling
Data locality
Network Management
System Management
Resource Location
Uniform Access
Nimrod on Enterprise Wide Networks and the Global Grid
Manual resource location
– Static file of machine names
No resource scheduling
– First come, first served
No cost model
– All machines/users cost alike
Homogeneous access mechanism
Requirements
Users & system managers want to know
– Where it will run
– When it will run
– How much it will cost
– That access is secure
– That it will support a range of access mechanisms
The Globus Project
Basic research in grid-related technologies
– Resource management, QoS, networking, storage, security, adaptation, policy, etc.
Development of Globus toolkit
– Core services for grid-enabled tools & applications
Construction of large grid testbed: GUSTO
– Largest grid testbed in terms of sites & apps
Application experiments
– Tele-immersion, distributed computing, etc.
Source: www.globus.org
Layered Globus Architecture
Applications
High-level Services and Tools: DUROC, globusrun, MPI, Nimrod/G, MPI-IO, CC++, GlobusView, Testbed Status
Core Services: Metacomputing Directory Service, GRAM, Globus Security Interface, Heartbeat Monitor, Nexus, Gloperf, GASS
Local Services: LSF, Condor, MPI, NQE, Easy, TCP, UDP, Solaris, Irix, AIX
Source: www.globus.org
Resource Location
Need to locate suitable machines for an experiment
– Speed
– Number of processors
– Cost
– Availability
– User account
Available resources will vary across experiment
Supported through Directory Server (Globus MDS)
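The selection criteria listed above can be read as a filter over directory entries. The sketch below assumes a simple in-memory list of records; the `Machine` fields and the `locate` function are hypothetical and do not reflect the actual Globus MDS interface.

```python
from dataclasses import dataclass

@dataclass
class Machine:
    name: str
    speed: float        # relative speed (illustrative unit)
    processors: int
    cost: float         # cost units per job (illustrative)
    available: bool
    has_account: bool   # does the user hold an account here?

def locate(machines, min_speed, min_processors, max_cost):
    """Return names of machines meeting every criterion, cheapest first."""
    suitable = [m for m in machines
                if m.available and m.has_account
                and m.speed >= min_speed
                and m.processors >= min_processors
                and m.cost <= max_cost]
    return [m.name for m in sorted(suitable, key=lambda m: m.cost)]

directory = [
    Machine("a", 1.0, 4, 5.0, True, True),
    Machine("b", 2.0, 8, 3.0, True, True),
    Machine("c", 3.0, 16, 1.0, False, True),   # currently unavailable
    Machine("d", 2.5, 2, 2.0, True, True),     # too few processors
]
found = locate(directory, min_speed=1.0, min_processors=4, max_cost=10.0)
```

Since availability changes during an experiment, a query like this would presumably be repeated rather than run once at the start.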
Resource Scheduling
User view
– Solve problem in minimum time
System
– Spread load across machines
Soft real time problem through deadlines
– Complete by deadline
– Unreliable resource provision
  – Machine load may change at any time
  – Multiple machine queues
Resource Scheduling ...
Need to establish rate at which a machine can consume jobs
Use deadline as metric for machine performance
Move jobs to machines that are performing well
Remove jobs from machines that are falling behind
[Figure labels: Node 4, Node 2, Time]
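One way to make the "job consumption rate" idea concrete is to project each machine's observed rate forward to the deadline. The sketch below is an assumption about how such a metric might be computed; the function names and the jobs-per-hour unit are illustrative and not taken from Nimrod/G.

```python
def consumption_rate(jobs_done, elapsed_hours):
    """Jobs per hour observed on one machine so far."""
    return jobs_done / elapsed_hours if elapsed_hours > 0 else 0.0

def surplus_jobs(queued, jobs_done, elapsed_hours, hours_left):
    """Jobs the machine cannot finish by the deadline at its observed rate.

    A positive value suggests moving that many jobs elsewhere; a negative
    value is spare capacity that could absorb more jobs.
    """
    capacity = consumption_rate(jobs_done, elapsed_hours) * hours_left
    return queued - int(capacity)

# A machine that finished 10 jobs in 2 hours, with 4 hours left and 30 queued.
behind = surplus_jobs(queued=30, jobs_done=10, elapsed_hours=2.0, hours_left=4.0)
# A slower machine with a short queue still has room for one more job.
ahead = surplus_jobs(queued=3, jobs_done=2, elapsed_hours=2.0, hours_left=4.0)
```

The rate is re-measured as the run proceeds, so a machine whose load changes will show up as falling behind or catching up on the next check.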
Computational Economy
Resource selection is market based and uses real money
A large number of sellers and buyers (resources may be dedicated/shared)
Negotiation: tenders/bids; select those offers that meet the requirements
Trading and Advance Resource Reservation
Schedule computations on those resources that meet all requirements
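The negotiation step can be sketched as choosing among offers so that the work finishes by the deadline within a budget. The example below is a toy model under stated assumptions: each offer is a hypothetical `(seller, price_per_job, jobs_per_hour)` tuple, and the cheapest-first rule is one possible selection policy, not the one the talk describes.

```python
def select_offers(offers, jobs, deadline_hours, budget):
    """Cheapest-first choice of offers that can finish by the deadline.

    Each offer is (seller, price_per_job, jobs_per_hour).
    Returns a {seller: jobs_assigned} plan, or None if no plan fits.
    """
    plan, remaining, spent = {}, jobs, 0.0
    for seller, price, rate in sorted(offers, key=lambda o: o[1]):
        # A seller can take at most what it can finish before the deadline.
        take = min(int(rate * deadline_hours), remaining)
        if take > 0:
            plan[seller] = take
            remaining -= take
            spent += take * price
    if remaining > 0 or spent > budget:
        return None
    return plan

offers = [("x", 3.0, 2.0), ("y", 1.0, 1.0), ("z", 2.0, 5.0)]
plan = select_offers(offers, jobs=50, deadline_hours=10, budget=100.0)
```

With these numbers the two cheapest sellers cover all 50 jobs for 90 cost units, so the most expensive offer is never used.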
Cost Model
Without cost ANY shared system becomes unmanageable
Charge users more for remote facilities than their own
Choose cheaper resources before more expensive ones
Cost units may be
– Dollars
– Shares in global facility
– Stored in bank
Cost Model ...
Non-uniform costing
Encourages use of local resources first
Real accounting system can control machine usage
[Figure: cost matrix of users (User 1, User 5) against machines (Machine 1, Machine 5), with cell values 1, 3, 2, 1]
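Non-uniform costing can be pictured as a per-user price table consulted before any job is placed. In the sketch below the table values, user names and machine names are all illustrative assumptions, chosen only so that each user's own machine is cheaper than the remote one.

```python
# Illustrative cost units per job: COST[user][machine].
COST = {
    "user1": {"machine1": 1, "machine5": 3},
    "user5": {"machine1": 2, "machine5": 1},
}

def cheapest_first(user, machines):
    """Order candidate machines from cheapest to most expensive for a user."""
    return sorted(machines, key=lambda m: COST[user][m])

def charge(balances, user, machine, jobs):
    """Debit the user's account; refuse the work if funds are insufficient."""
    cost = COST[user][machine] * jobs
    if balances[user] < cost:
        return False
    balances[user] -= cost
    return True

order = cheapest_first("user1", ["machine5", "machine1"])
balances = {"user1": 10}
first = charge(balances, "user1", "machine5", 3)    # costs 9 units
second = charge(balances, "user1", "machine1", 2)   # costs 2, only 1 left
```

The refusal in `charge` is the point at which an accounting system would actually limit machine usage.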
Security
Uses Globus Security Layer
Generic Security Service API using an implementation of SSL (Secure Sockets Layer)
RSA encryption algorithm employing both public and private keys
X.509 certificate consisting of
– duration of the permissions
– the RSA public key
– signature of the Certificate Authority (CA)
Uniform Access
Globus Resource Allocation Manager (GRAM) provides an interface to a range of schemes
– Fork
– Queue (Easy, LoadLeveler, Condor, LSF)
Multiple pathways to same machine (if supported)
Integrated with Security scheme
Nimrod/G Architecture
[Diagram components: Nimrod/G Client (x3), Parametric Engine, Persistent Info., Schedule Advisor, Resource Discovery, Dispatcher, Grid Directory Services, Grid Middleware Services, GUSTO Test Bed]
Nimrod/G Interactions
[Diagram labels]
Nodes: Root node, Gatekeeper node, Computational node
Components: Parametric Engine, Scheduler, Dispatcher, Job Wrapper, User process, Queuing System
MDS server: resource location
GRAM server: resource allocation (local)
GASS server: file access
Additional services used implicitly:
• GSI (authentication & authorization)
• Nexus (communication)
A Nimrod/G Client
[Screenshot labels: Cost, Deadline, Available Machines]
Nimrod/G Scheduling Algorithm
Find a set of machines (MDS search)
Distribute jobs from root to machines
Establish job consumption rate for each machine
For each machine
– Can we meet deadline?
– If not, then return some jobs to root
– If yes, distribute more jobs to resource
If cannot meet deadline with current resource
– Find additional resources
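The loop above can be sketched as a single rebalancing pass over per-machine queues. This is a simplified reading of the slide and not the Nimrod/G implementation: the data structures, the jobs-per-hour rates and the fastest-first refill order are all assumptions.

```python
def schedule_pass(queues, rates, hours_left, pool):
    """One pass of a deadline-driven rebalancing loop.

    queues: {machine: jobs currently assigned}
    rates:  {machine: observed jobs per hour}
    pool:   number of unassigned jobs held at the root
    Returns (new_queues, new_pool, need_more_machines).
    """
    new_queues = {}
    # Return jobs to the root from machines that cannot meet the deadline.
    for m, queued in queues.items():
        capacity = int(rates[m] * hours_left)
        keep = min(queued, capacity)
        pool += queued - keep
        new_queues[m] = keep
    # Hand root jobs to machines with spare capacity, fastest first.
    for m in sorted(rates, key=rates.get, reverse=True):
        spare = int(rates[m] * hours_left) - new_queues[m]
        give = min(spare, pool)
        new_queues[m] += give
        pool -= give
    # Jobs still at the root mean the current machines are not enough.
    return new_queues, pool, pool > 0

queues = {"n2": 30, "n4": 5}
rates = {"n2": 2.0, "n4": 4.0}
new_queues, left_over, need_more = schedule_pass(queues, rates, 5, 0)
```

In this run the slower machine gives back 20 jobs, the faster one absorbs 15 of them, and the 5 left at the root signal that additional resources should be located.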
Nimrod/G Scheduling algorithm ...
[Flowchart steps: Locate Machines; Distribute Jobs; Establish Rates; Meet Deadlines?; Re-distribute Jobs; Locate more Machines]
Some experimental results
Graph 2 - GUSTO Usage for Ionization Chamber Study
[Plot: Average No Processors (0 to 80) against Time (0 to 20); series: 20 hour deadline, 15 hour deadline, 10 hour deadline]
Graph 3 - GUSTO Usage for 20 Hour Deadline
[Plot: Average No Processors (0 to 20) against Time (0 to 20); series: 5 CUs, 10 CUs, 15 CUs, 20 CUs, 50 CUs; annotated curves: 5 Cost Units, 10 Cost Units]
Graph 4 - GUSTO Usage for 15 Hour Deadline
[Plot: Average No Processors (0 to 20) against Time (0 to 20); series: 5 CUs, 10 CUs, 15 CUs, 20 CUs, 50 CUs; annotated curves: 5 Cost Units, 10 Cost Units, 15 Cost Units, 50 Cost Units]
Graph 5 - GUSTO Usage for 10 Hour Deadline
[Plot: No Processes (0 to 35) against Time (0 to 20); series: 5 CUs, 10 CUs, 15 CUs, 20 CUs, 50 CUs; annotated curves: 5 Cost Units, 10 Cost Units, 15 Cost Units, 20 Cost Units, 50 Cost Units]
Optimal Design using computation - Nimrod/O
Clustor allows exploration of design scenarios
– Search by enumeration
Search for local/global minima based on objective function
– How do I minimise the cost of this design?
– How do I maximise the life of this object?
Objective function evaluated by computational model
– Computationally expensive
Driven by applications
Application Drivers
Complex industrial design problems
– Air quality
– Antenna Design
– Business Simulation
– Mechanical Optimisation
Cost function minimization
Continuous functions - gradient descent
Quasi-Newton BFGS algorithm
– find gradient using finite difference approximation
– line search using bound constrained, parallel method
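The finite-difference step can be illustrated on a toy objective. The sketch below shows only the gradient estimate that a quasi-Newton method such as BFGS would consume, not BFGS itself; the `cost` function, the forward-difference scheme and the step size `h` are assumptions chosen for the example.

```python
def fd_gradient(f, x, h=1e-6):
    """Forward-difference estimate of the gradient of f at x.

    Needs len(x) + 1 evaluations of f, all independent of one another.
    """
    f0 = f(x)
    grad = []
    for i in range(len(x)):
        step = list(x)
        step[i] += h          # perturb one coordinate at a time
        grad.append((f(step) - f0) / h)
    return grad

def cost(x):
    """Toy objective standing in for an expensive computational model."""
    return (x[0] - 1.0) ** 2 + 2.0 * (x[1] + 0.5) ** 2

# The exact gradient at the origin is (-2, 2).
g = fd_gradient(cost, [0.0, 0.0])
```

Each perturbed evaluation is a separate model run, so for an n-dimensional problem the n + 1 runs can be dispatched at the same time.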
Implementation
Master - slave parallelization
Gradient-determination & line-searching
– tasks queued via IBM LoadLeveler
– (adapt to number of CPUs allocated by the Resource Manager)
Interfaced to existing dispatchers
– Clustor
– Nimrod/G
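The master side of such a scheme can be sketched with a worker pool whose size stands in for the number of CPUs granted. This is an analogy only: it uses Python threads in place of LoadLeveler, Clustor or Nimrod/G dispatch, and the `model` function and the trial step lengths are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate_batch(f, points, workers):
    """Master side: farm out independent evaluations to a pool of workers.

    The pool size stands in for the number of CPUs granted by a resource
    manager; results come back in the same order as the points.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(f, points))

def model(x):
    """Toy objective standing in for one expensive simulation run."""
    return sum(v * v for v in x)

# Trial step lengths along a search direction d from a start point x0.
x0, d = [1.0, 2.0], [-1.0, -2.0]
steps = [0.0, 0.5, 1.0, 1.5]
points = [[a + s * b for a, b in zip(x0, d)] for s in steps]
values = evaluate_batch(model, points, workers=2)
best_step = steps[values.index(min(values))]
```

Changing `workers` alters only how many evaluations run at once, which mirrors the idea of adapting to however many CPUs are allocated.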
Architecture
[Diagram components: Meta-heuristic Search (x2), BFGS, Function Evaluations, Clustor Plan File, Jobs, Clustor Dispatcher, Supercomputer or Cluster Pool]
Ongoing research
Increased parallelism
– Multi-start for better coverage
– High-dimensional problems
– Addition of other search algorithms
  – Simplex algorithm
Mixed integer problems
– BFGS modified to support mixed integer
– Mixed search/enumeration
– Meta-heuristic based search
  – Adaptive Simulated Annealing (ASA)