Grid Computing: Grid Systems and Scheduling
CSE 160/Berman, April 20th, 2009
Grid Computing 1
Grid Book, Chapters 1, 2, 3, 22
“Implementing Distributed Synthetic Forces Simulations in Metacomputing Environments”
Brunett, Davis, Gottschalk, Messina, Kesselman (http://www.globus.org)
Outline
• What is Grid computing?
• Grid computing applications
• Grid computing history
• Issues in Grid Computing
• Condor, Globus, Legion
• The next step
What is Grid Computing?
• Computational Grid is a collection of distributed, possibly heterogeneous resources which can be used as an ensemble to execute large-scale applications
• Computational Grid also called “metacomputer”
Computational Grids
• The term "computational grid" comes from an analogy with the electric power grid:
– Electric power is ubiquitous
– You don't need to know the source of the power (transformer, generator) or the power company that serves it
– The analogy falls down in the area of performance
• Ever-present search for cycles in HPC; two foci of research:
– "In the box" parallel computers: PetaFLOPS architectures
– Increasing development of infrastructure and middleware to leverage the performance potential of distributed Computational Grids
Grid Applications
• Distributed Supercomputing
– Distributed Supercomputing applications couple multiple computational resources: supercomputers and/or workstations
– Examples include:
• SF Express (large-scale modeling of battle entities with complex interactive behavior for distributed interactive simulation)
• Climate modeling (high resolution, long time scales, complex models)
Distributed Supercomputing Example – SF Express
• SF Express (Synthetic Forces Express): a large-scale distributed simulation of the behavior and movement of entities (tanks, trucks, airplanes, etc.) for interactive battle simulation
• Entities require information about:
– State of the terrain
– Location and state of other entities
• Information is updated several times a second
• Interest management allows entities to look only at relevant information, enabling scalability
SF Express
• Large-scale SF Express run goals:
– Simulation of 50,000 entities in 8/97 and 100,000 entities in 3/98
– Increase fidelity and resolution of the simulation over previous runs
– Improve:
• Refresh rate
• Training environment responsiveness
• Number of automatic behaviors
– Ultimately use the simulation for real-time planning as well as training
• Large-scale runs are extremely resource-intensive
SF Express Programming Issues
• How should entities be mapped to computational resources?
• Entities receive information based on "interests"
– Communication is reduced and localized based on "interest management" (see the sketch below)
• A consistency model for entity information must be developed
– Which entities can/should be replicated?
– How should updates be performed?
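The "interest management" idea can be illustrated with a small sketch. This is a hypothetical publish/subscribe filter, not the actual SF Express code: the InterestRouter class, the terrain-cell scheme, and the entity names are all invented for illustration.

```python
# Hypothetical interest-management sketch (not the actual SF Express code):
# entities subscribe to terrain cells they care about, and a router forwards a
# state update only to the entities interested in that cell.
from collections import defaultdict

class InterestRouter:
    def __init__(self):
        self.subscribers = defaultdict(set)   # terrain cell -> ids of interested entities

    def subscribe(self, entity_id, cells):
        for cell in cells:
            self.subscribers[cell].add(entity_id)

    def publish(self, sender_id, cell, state):
        # Deliver the update only to entities interested in this cell (not the sender).
        return {eid: state for eid in self.subscribers[cell] if eid != sender_id}

router = InterestRouter()
router.subscribe("tank_1", cells=[(10, 4), (10, 5)])
router.subscribe("truck_7", cells=[(2, 2)])

# Only tank_1 sees this update; truck_7's node receives no traffic for it.
print(router.publish("tank_2", cell=(10, 4), state={"pos": (10.3, 4.8)}))
```

Filtering updates this way is what keeps communication localized as the number of entities grows.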
SF Express Distributed Application Architecture
• D = data server, I = interest management, R = router, S = simulation node
[Diagram: simulation nodes (S) grouped with data servers (D) and interest managers (I), interconnected by routers (R)]
50,000-entity SF Express Run
• 2 large-scale simulations run on August 11, 1997

Site            Hardware        Processors   Entities (first run)   Entities (second run)
Caltech         HP Exemplar     256          13,095                 12,182
ORNL            Intel Paragon   1024         16,695                 15,996
NASA (CA)       IBM SP2         139          5,464                  5,637
CEWES (VA)      IBM SP2         229          9,739                  9,607
Maui            IBM SP2         128          5,056                  7,027
HP/Convex (TX)  HP Exemplar     128          5,348                  6,733
Total                           1904         55,397                 57,182
50,000-entity SF Express Run
• Simulation decomposed the terrain (Saudi Arabia, Kuwait, Iraq) contiguously among supercomputers
• Each supercomputer simulated a specific area and exchanged interest and state information with the other supercomputers
• All data exchanges were flow-controlled
• Supercomputers were fully interconnected and dedicated for the experiment
• Success depended on "moderate to significant system administration, interventions, competent system support personnel, and numerous phone calls."
• Subsequent Globus runs focused on improving data and control management and on operational issues for the wide area
High-Throughput Applications
• Grid used to schedule large numbers of independent or loosely coupled tasks with the goal of putting unused cycles to work
• High-throughput applications include RSA key cracking, SETI@home (detection of extraterrestrial intelligence), and MCell
High-Throughput Applications
• Biggest master/slave parallel program in the world with master = website, slaves = individual computers
High-Throughput Example - MCell
• MCell – Monte Carlo simulation of cellular microphysiology. Simulation implemented as large-scale parameter sweep.
MCell
• MCell architecture: simulations performed by independent processors with distinct parameter sets and shared input files
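As a concrete illustration of that parameter-sweep structure, here is a minimal sketch: many independent tasks, each with its own parameter set, farmed out to whatever processors are idle. The parameter names, the values, and the run_simulation body are hypothetical stand-ins, not MCell itself.

```python
# Minimal parameter-sweep sketch in the spirit of the MCell structure above:
# many independent tasks, each with its own parameter set, sharing read-only
# inputs. run_simulation and the parameter values are hypothetical stand-ins.
from concurrent.futures import ProcessPoolExecutor
from itertools import product

def run_simulation(params):
    # Placeholder for one Monte Carlo run with a distinct parameter set.
    diffusion_const, n_ligands = params
    return {"diffusion": diffusion_const, "ligands": n_ligands,
            "result": diffusion_const * n_ligands}   # stand-in for real output

if __name__ == "__main__":
    # The Cartesian product of parameter values defines the sweep.
    sweep = list(product([1e-6, 2e-6, 4e-6], [1000, 5000, 10000]))

    # Tasks are independent, so any idle processor can take the next one.
    with ProcessPoolExecutor() as pool:
        for outcome in pool.map(run_simulation, sweep):
            print(outcome)
```

Because the tasks share nothing but read-only input files, the same pattern maps directly onto distributed Grid resources rather than local processes.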
MCell Programming Issues
• How should we assign tasks to processors to optimize locality?
• How can we use partial results during execution to steer the computation?
• How do we mine all the resulting data from experiments for results?
– During execution
– After execution
• How can we use all available resources?
Data-Intensive Applications
• Focus is on synthesizing new information from large amounts of physically distributed data
• Examples include NILE (distributed system for high energy physics experiments using data from CLEO), SAR/SRB applications (Grid version of MS Terraserver), digital library applications
Data-Intensive Example - SARA
• SARA = Synthetic Aperture Radar Atlas
– Application developed at JPL and SDSC
• Goal: assemble/process files for the user's desired image
– Radar data is organized into tracks
– User selects a track of interest and the properties to be highlighted
– Raw data is filtered and converted to an image format
– Image is displayed in a web browser
SARA Application Architecture
• Application structure focused around optimizing the delivery and processing of distributed data
• Computation servers and data servers are logical entities, not necessarily different nodes
[Diagram: client connected to compute servers and data servers]
SARA Programming Issues
• Which data server should replicated data be accessed from?
• Should computation be done at the data server, or should the data be moved to a compute server, or something in between? (See the back-of-the-envelope sketch below.)
• How big are the data files and how often will they be accessed?
[Figure: sites OGI, UTK, UCSD; AppLeS/NWS]
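To make the compute-versus-data-movement question concrete, here is a back-of-the-envelope sketch comparing "move the data" against "compute at the data server." Every number in it (file size, bandwidth, machine speeds, work per GB) is an assumption chosen only to illustrate the trade-off; this is not SARA's actual logic.

```python
# Back-of-the-envelope sketch of the placement question above: is it cheaper to
# move the raw data to a fast compute server, or to process it in place at the
# data server? Every number here (sizes, bandwidth, speeds) is an assumption.

def move_data_time(file_gb, bandwidth_mb_s, remote_gflops, work_gflop_per_gb):
    transfer = (file_gb * 1024) / bandwidth_mb_s             # seconds to ship the file
    compute = (file_gb * work_gflop_per_gb) / remote_gflops  # seconds to process remotely
    return transfer + compute

def process_in_place_time(file_gb, local_gflops, work_gflop_per_gb):
    return (file_gb * work_gflop_per_gb) / local_gflops      # no transfer, slower server

file_gb = 2.0                              # assumed size of one radar track
wan_mb_s = 5.0                             # assumed wide-area bandwidth
remote_gflops, local_gflops = 50.0, 5.0    # assumed compute-server vs. data-server speed
work = 10.0                                # assumed GFLOPs of filtering per GB of raw data

ship = move_data_time(file_gb, wan_mb_s, remote_gflops, work)
stay = process_in_place_time(file_gb, local_gflops, work)
print(f"move the data: {ship:.0f} s, process at the data server: {stay:.0f} s")
print("better choice:", "move the data" if ship < stay else "compute at the data server")
```

With these (assumed) numbers the wide-area transfer dominates, so filtering at the data server wins; with a much larger compute demand per byte, the answer flips.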
TeleImmersion
• Focus is on the use of immersive virtual reality systems over a network
– Combines generators, data sets and simulations remote from the user's display environment
– Often used to support collaboration
• Examples include:
– Interactive scientific visualization ("being there with the data"), industrial design, art and entertainment
Teleimmersion Example – Combustion System Modeling
• A shared collaborative space:
– Links people at multiple locations (Chicago and San Diego in this example)
– Lets them share and steer scientific simulations on a supercomputer
• Combustion code developed by Lori Freitag at ANL
• Boiler application used to troubleshoot and design better products
Early Experiences with Grid Computing
• Gigabit Testbeds Program
– In the late 80's and early 90's, the gigabit testbed program was developed as a joint NSF, DARPA, CNRI (Corporation for National Research Initiatives, Bob Kahn) initiative
– Goals were to:
• investigate potential architectures for a gigabit/sec network testbed
• explore its usefulness for end-users
Gigabit Testbeds – Early 90's
• 6 testbeds formed:
– CASA (southwest)
– MAGIC (midwest)
– BLANCA (midwest)
– AURORA (northeast)
– NECTAR (northeast)
– VISTANET (southeast)
• Each had a unique blend of research in applications and in networking and computer science
Gigabit Testbeds

Blanca
– Sites: NCSA, UIUC, UCB, UWisc, AT&T
– Hardware: experimental ATM switches running over experimental 622 Mb/s and 45 Mb/s circuits developed by AT&T and the universities
– Application focus: virtual environments, remote visualization and steering, multimedia digital libraries
– Remarks: network spanned the US (UCB to AT&T); network research included distributed virtual memory, real-time protocols, congestion control, signaling protocols, etc.

Vistanet
– Sites: MCNC, UNC, BellSouth
– Hardware: ATM network at OC-12 (622 Mb/s) interconnecting HIPPI local area networks
– Application focus: radiation treatment planning involving a supercomputer, a remote instrument (radiation beam) and visualization
– Remarks: medical personnel planned radiation beam orientation using a supercomputer; extended the planning process from 2 beams in 2 dimensions to multiple beams in 3 dimensions

Nectar
– Sites: CMU, Bell Atlantic, Bellcore, PSC
– Hardware: OC-48 (2.4 Gb/s) links between the PSC supercomputer facility and CMU
– Application focus: coupled supercomputers running chemical reaction dynamics, plus CS research
– Remarks: metropolitan-area testbed with OC-48 links between PSC and the downtown CMU campus
Gigabit Testbeds (continued)

Aurora
– Sites: MIT, IBM, Bellcore, Penn, MCI
– Hardware: OC-12 network interconnecting 4 research sites and supporting the development of ATM host interfaces, ATM switches and network protocols
– Application focus: telerobotics, distributed virtual memory and operating system research
– Remarks: east coast sites; research focused mostly on network and computer science issues

Magic
– Sites: Army Battle Lab, Sprint, UKansas, UMinn, LBL, Army HPC Lab
– Hardware: OC-12 network to interconnect ATM-attached hosts
– Application focus: remote vehicle control applications and high-speed access to databases for terrain visualization and battle simulation
– Remarks: funded separately by DARPA after the CNRI initiative had begun

Casa
– Sites: Caltech, SDSC, LANL, JPL, MCI, USWest, PacBell
– Hardware: HIPPI switches connected by HIPPI-over-SONET at OC-12
– Application focus: distributed supercomputing
– Remarks: targeted improving the performance of distributed supercomputing applications by strategically mapping application components onto resources
I-WAY
• First large-scale "modern" Grid experiment
• Put together for SC'95 (the "Supercomputing" conference)
• The I-WAY consisted of a Grid of 17 sites connected by the vBNS
• Over 60 applications ran on the I-WAY during SC'95
I-WAY "Architecture"
• Each I-WAY site was served by an I-POP (I-WAY Point of Presence) used for:
– authentication of distributed applications
– distribution of associated libraries and other software
– monitoring the connectivity of the I-WAY virtual network
• Users could use single authentication and job submission across multiple sites, or they could work directly with the end resources
• Scheduling was done with a "human in the loop"
I-Soft – Software for the I-WAY
• Kerberos-based authentication
– I-POP initiated rsh to local resources
• AFS for distribution of software and state
• Central scheduler
– Dedicated I-WAY nodes on each resource
– Interface to the local scheduler
• Nexus-based communication libraries
– MPI, CaveComm, CC++
• In many ways, the I-WAY experience formed the foundation of Globus
I-WAY Application: Cloud Detection
• Cloud detection from multimodal satellite data
– Want to determine whether a satellite image is clear, partially cloudy or completely cloudy
• Used a remote supercomputer to enhance the instruments with:
– Real-time response
– Enhanced function and accuracy (of the pixel image)
• Developed by C. Lee (Aerospace Corporation), Kesselman (Caltech), et al.
PACIs
• 2 NSF Supercomputer Centers (PACIs) – SDSC/NPACI and NCSA/Alliance, both committed to Grid computing
• vBNS backbone between NCSA and SDSC running at OC-12 with connectivity to over 100 locations at speeds ranging from 45 Mb/s to 155 Mb/s or more
PACI Grid
NPACI Grid Activities
• The Metasystems Thrust Area is one of the NPACI technology thrust areas
– Goal is to create an operational metasystem for NPACI
• Metasystems players:
– Globus (Kesselman)
– Legion (Grimshaw)
– AppLeS (Berman and Wolski)
– Network Weather Service (Wolski)
Alliance Grid Activities
• The Grid Task Force and the Distributed Computing team are Alliance teams
• Globus is supported as the exclusive grid infrastructure by the Alliance
• The Grid concept is pervasive throughout the Alliance
– Access Grid developed for use by distributed collaborative groups
• Alliance grid players include Foster (Globus), Livny (Condor), Stevens (ANL), Reed (Pablo), etc.
Other Efforts
• Centurion Cluster = Legion testbed
– Legion cluster housed at UVA
– 128 533-MHz DEC Alphas
– 128 dual 400-MHz Pentium II nodes
– Fast Ethernet and Myrinet
• Globus testbed = GUSTO, which supports Globus infrastructure and application development
– 125 sites in 23 countries as of 2/2000
– Testbed aggregated from partner sites (including NPACI)
GUSTO (Globus) Computational Grid
IPG
• IPG = Information Power Grid
• NASA's effort in grid computing
• Globus supported as the underlying infrastructure
• Application foci include aerospace design, environmental and space applications
Research and Development Foci for the Grid: Applications
• Questions revolve around the design and development of "Grid-aware" applications
• Different programming models: polyalgorithms, components, mixed languages, etc.
• Program development environments and tools are required for the development and execution of performance-efficient applications
[Diagram of Grid layers: Applications, Middleware, Infrastructure, Resources]
Research and Development Foci for the Grid: Middleware
• Questions revolve around the development of tools and environments which facilitate application performance
• Software must be able to assess and utilize dynamic performance characteristics of resources to support the application (see the sketch below)
• Agent-based computing and resource negotiation
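As a hypothetical sketch of that middleware role, loosely in the spirit of the AppLeS/NWS scheduling mentioned earlier: pick the resource with the best predicted completion time based on dynamic forecasts. The resource names, forecast values and the cost model are all made-up assumptions, not an actual API.

```python
# Hypothetical resource-selection sketch, loosely in the AppLeS/NWS spirit:
# choose the machine with the best *predicted* completion time from dynamic
# forecasts. Resource names, forecast values and the cost model are made up.

resources = {
    # name: (forecast of available CPU fraction, peak MFLOP/s, MB/s to the input data)
    "sp2.siteA":      (0.40, 480.0, 2.0),
    "exemplar.siteB": (0.85, 720.0, 0.8),
    "cluster.siteC":  (0.60, 350.0, 6.0),
}

def predicted_time(work_mflop, input_mb, cpu_free, peak_mflops, bw_mb_s):
    # completion time = time to stage the input data
    #                 + compute time on the fraction of the machine forecast to be free
    return input_mb / bw_mb_s + work_mflop / (cpu_free * peak_mflops)

work_mflop, input_mb = 50_000.0, 300.0   # assumed application demands

best = min(resources, key=lambda r: predicted_time(work_mflop, input_mb, *resources[r]))
print("schedule on:", best)
```

The point of the sketch is that the "best" resource depends on fresh measurements (load, bandwidth), not on static peak numbers, which is exactly what makes Grid middleware harder than conventional batch scheduling.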
Research and Development Foci for the Grid: Infrastructure
• Development of infrastructure that presents a "virtual machine" view of the Grid to users (an illustrative sketch follows this list)
• Questions revolve around providing basic services to users: security, remote file transfer, resource management, etc., as well as exposing performance characteristics
• Services must be supported across heterogeneous resources and must interoperate
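One way to picture the "virtual machine" view is as a small uniform service interface that hides each site's local systems. The sketch below is purely illustrative: the class and method names are invented for this example and are not the Globus or Legion APIs.

```python
# Purely illustrative sketch of a uniform "virtual machine" interface over
# heterogeneous sites. Class and method names are invented for this example;
# they are not the Globus or Legion APIs.
from abc import ABC, abstractmethod

class GridResource(ABC):
    """The same small set of services, regardless of what runs at the site."""

    @abstractmethod
    def authenticate(self, credential: str) -> bool: ...

    @abstractmethod
    def transfer_file(self, local_path: str, remote_path: str) -> None: ...

    @abstractmethod
    def submit_job(self, executable: str, args: list[str]) -> str:
        """Return a job id; the site's local batch system stays hidden."""

class BatchCluster(GridResource):
    # One hypothetical back end: translate the generic calls into whatever the
    # local scheduler and file system actually require.
    def authenticate(self, credential: str) -> bool:
        return credential.startswith("cert:")

    def transfer_file(self, local_path: str, remote_path: str) -> None:
        print(f"staging {local_path} -> {remote_path}")

    def submit_job(self, executable: str, args: list[str]) -> str:
        print(f"submitting {executable} {' '.join(args)} to the local queue")
        return "job-001"

# An application sees only the uniform interface:
site: GridResource = BatchCluster()
if site.authenticate("cert:alice"):
    site.transfer_file("input.dat", "/scratch/input.dat")
    print("job id:", site.submit_job("simulate", ["/scratch/input.dat"]))
```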
Research and Development Foci for the Grid: Resources
• Questions revolve around heterogeneity and scale
• New challenges focus on combining wireless and wired, static and dynamic, low-power and high-power, cheap and expensive resources
• Performance characteristics of grid resources vary dramatically; integrating them to support the performance of individual and multiple applications is extremely challenging