Grid Computing: Grid Systems and Scheduling
CSE 160/Berman, April 20th, 2009
Grid Computing 1
Grid Book, Chapters 1, 2, 3, 22
“Implementing Distributed Synthetic Forces Simulations in Metacomputing Environments”
Brunett, Davis, Gottschalk, Messina, Kesselman (http://www.globus.org)
Outline
• What is Grid computing?
• Grid computing applications
• Grid computing history
• Issues in Grid Computing
• Condor, Globus, Legion
• The next step
What is Grid Computing?
• Computational Grid is a collection of distributed, possibly heterogeneous resources which can be used as an ensemble to execute large-scale applications
• Computational Grid also called “metacomputer”
Computational Grids
• The term "computational grid" comes from an analogy with the electric power grid:
– Electric power is ubiquitous
– You don't need to know the source of the power (transformer, generator) or the power company that serves it
– The analogy falls down in the area of performance
• Ever-present search for cycles in HPC; two foci of research:
– "In the box" parallel computers: PetaFLOPS architectures
– Increasing development of infrastructure and middleware to leverage the performance potential of distributed Computational Grids
Grid Applications
• Distributed Supercomputing
– Distributed Supercomputing applications couple multiple computational resources: supercomputers and/or workstations
– Examples include:
• SF Express (large-scale modeling of battle entities with complex interactive behavior for distributed interactive simulation)
• Climate modeling (high resolution, long time scales, complex models)
Distributed Supercomputing Example – SF Express
• SF Express (Synthetic Forces Express): a large-scale distributed simulation of the behavior and movement of entities (tanks, trucks, airplanes, etc.) for interactive battle simulation
• Entities require information about:
– State of the terrain
– Location and state of other entities
• Information is updated several times a second
• Interest management allows entities to look only at relevant information, enabling scalability
SF Express
• Large-scale SF Express run goals:
– Simulation of 50,000 entities in 8/97 and 100,000 entities in 3/98
– Increase fidelity and resolution of the simulation over previous runs
– Improve:
• Refresh rate
• Training environment responsiveness
• Number of automatic behaviors
– Ultimately use the simulation for real-time planning as well as training
• Large-scale runs are extremely resource-intensive
SF Express Programming Issues
• How should entities be mapped to computational resources?
• Entities receive information based on "interests"
– Communication is reduced and localized based on "interest management" (see the sketch below)
• A consistency model for entity information must be developed
– Which entities can/should be replicated?
– How should updates be performed?
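The "interest management" idea can be illustrated with a small sketch. This is a hypothetical publish/subscribe filter, not the actual SF Express code: the InterestRouter class, the terrain-cell scheme, and the entity names are all invented for illustration.

```python
# Hypothetical interest-management sketch (not the actual SF Express code):
# entities subscribe to terrain cells they care about, and a router forwards a
# state update only to the entities interested in that cell.
from collections import defaultdict

class InterestRouter:
    def __init__(self):
        self.subscribers = defaultdict(set)   # terrain cell -> ids of interested entities

    def subscribe(self, entity_id, cells):
        for cell in cells:
            self.subscribers[cell].add(entity_id)

    def publish(self, sender_id, cell, state):
        # Deliver the update only to entities interested in this cell (not the sender).
        return {eid: state for eid in self.subscribers[cell] if eid != sender_id}

router = InterestRouter()
router.subscribe("tank_1", cells=[(10, 4), (10, 5)])
router.subscribe("truck_7", cells=[(2, 2)])

# Only tank_1 sees this update; truck_7's node receives no traffic for it.
print(router.publish("tank_2", cell=(10, 4), state={"pos": (10.3, 4.8)}))
```

Filtering updates this way is what keeps communication localized as the number of entities grows.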
SF Express Distributed Application Architecture
• D = data server, I = interest management, R = router, S = simulation node
[Diagram: simulation nodes (S) grouped with data servers (D) and interest managers (I), interconnected by routers (R)]
50,000-entity SF Express Run
• 2 large-scale simulations run on August 11, 1997

Site            Hardware        Processors   Entities (first run)   Entities (second run)
Caltech         HP Exemplar     256          13,095                 12,182
ORNL            Intel Paragon   1024         16,695                 15,996
NASA (CA)       IBM SP2         139          5,464                  5,637
CEWES (VA)      IBM SP2         229          9,739                  9,607
Maui            IBM SP2         128          5,056                  7,027
HP/Convex (TX)  HP Exemplar     128          5,348                  6,733
Total                           1904         55,397                 57,182
50,000-entity SF Express Run
• Simulation decomposed the terrain (Saudi Arabia, Kuwait, Iraq) contiguously among supercomputers
• Each supercomputer simulated a specific area and exchanged interest and state information with the other supercomputers
• All data exchanges were flow-controlled
• Supercomputers were fully interconnected and dedicated for the experiment
• Success depended on "moderate to significant system administration, interventions, competent system support personnel, and numerous phone calls."
• Subsequent Globus runs focused on improving data and control management and on operational issues for the wide area
High-Throughput Applications
• Grid used to schedule large numbers of independent or loosely coupled tasks with the goal of putting unused cycles to work
• High-throughput applications include RSA key cracking, SETI@home (detection of extraterrestrial intelligence), and MCell
High-Throughput Applications
• Biggest master/slave parallel program in the world with master = website, slaves = individual computers
High-Throughput Example - MCell
• MCell – Monte Carlo simulation of cellular microphysiology. Simulation implemented as large-scale parameter sweep.
MCell
• MCell architecture: simulations performed by independent processors with distinct parameter sets and shared input files
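As a concrete illustration of that parameter-sweep structure, here is a minimal sketch: many independent tasks, each with its own parameter set, farmed out to whatever processors are idle. The parameter names, the values, and the run_simulation body are hypothetical stand-ins, not MCell itself.

```python
# Minimal parameter-sweep sketch in the spirit of the MCell structure above:
# many independent tasks, each with its own parameter set, sharing read-only
# inputs. run_simulation and the parameter values are hypothetical stand-ins.
from concurrent.futures import ProcessPoolExecutor
from itertools import product

def run_simulation(params):
    # Placeholder for one Monte Carlo run with a distinct parameter set.
    diffusion_const, n_ligands = params
    return {"diffusion": diffusion_const, "ligands": n_ligands,
            "result": diffusion_const * n_ligands}   # stand-in for real output

if __name__ == "__main__":
    # The Cartesian product of parameter values defines the sweep.
    sweep = list(product([1e-6, 2e-6, 4e-6], [1000, 5000, 10000]))

    # Tasks are independent, so any idle processor can take the next one.
    with ProcessPoolExecutor() as pool:
        for outcome in pool.map(run_simulation, sweep):
            print(outcome)
```

Because the tasks share nothing but read-only input files, the same pattern maps directly onto distributed Grid resources rather than local processes.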
MCell Programming Issues
• How should we assign tasks to processors to optimize locality?
• How can we use partial results during execution to steer the computation?
• How do we mine all the resulting data from experiments for results?
– During execution
– After execution
• How can we use all available resources?
Data-Intensive Applications
• Focus is on synthesizing new information from large amounts of physically distributed data
• Examples include NILE (distributed system for high energy physics experiments using data from CLEO), SAR/SRB applications (Grid version of MS Terraserver), digital library applications
Data-Intensive Example - SARA
• SARA = Synthetic Aperture Radar Atlas
– Application developed at JPL and SDSC
• Goal: assemble/process files for the user's desired image
– Radar data is organized into tracks
– User selects a track of interest and the properties to be highlighted
– Raw data is filtered and converted to an image format
– Image is displayed in a web browser
SARA Application Architecture
• Application structure focused around optimizing the delivery and processing of distributed data
• Computation servers and data servers are logical entities, not necessarily different nodes
[Diagram: client connected to compute servers and data servers]
SARA Programming Issues
• Which data server should replicated data be accessed from?
• Should computation be done at the data server, or should the data be moved to a compute server, or something in between? (See the back-of-the-envelope sketch below.)
• How big are the data files and how often will they be accessed?
[Figure: sites OGI, UTK, UCSD; AppLeS/NWS]
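To make the compute-versus-data-movement question concrete, here is a back-of-the-envelope sketch comparing "move the data" against "compute at the data server." Every number in it (file size, bandwidth, machine speeds, work per GB) is an assumption chosen only to illustrate the trade-off; this is not SARA's actual logic.

```python
# Back-of-the-envelope sketch of the placement question above: is it cheaper to
# move the raw data to a fast compute server, or to process it in place at the
# data server? Every number here (sizes, bandwidth, speeds) is an assumption.

def move_data_time(file_gb, bandwidth_mb_s, remote_gflops, work_gflop_per_gb):
    transfer = (file_gb * 1024) / bandwidth_mb_s             # seconds to ship the file
    compute = (file_gb * work_gflop_per_gb) / remote_gflops  # seconds to process remotely
    return transfer + compute

def process_in_place_time(file_gb, local_gflops, work_gflop_per_gb):
    return (file_gb * work_gflop_per_gb) / local_gflops      # no transfer, slower server

file_gb = 2.0                              # assumed size of one radar track
wan_mb_s = 5.0                             # assumed wide-area bandwidth
remote_gflops, local_gflops = 50.0, 5.0    # assumed compute-server vs. data-server speed
work = 10.0                                # assumed GFLOPs of filtering per GB of raw data

ship = move_data_time(file_gb, wan_mb_s, remote_gflops, work)
stay = process_in_place_time(file_gb, local_gflops, work)
print(f"move the data: {ship:.0f} s, process at the data server: {stay:.0f} s")
print("better choice:", "move the data" if ship < stay else "compute at the data server")
```

With these (assumed) numbers the wide-area transfer dominates, so filtering at the data server wins; with a much larger compute demand per byte, the answer flips.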
TeleImmersion
• Focus is on the use of immersive virtual reality systems over a network
– Combines generators, data sets and simulations remote from the user's display environment
– Often used to support collaboration
• Examples include:
– Interactive scientific visualization ("being there with the data"), industrial design, art and entertainment
Teleimmersion Example – Combustion System Modeling
• A shared collaborative space:
– Links people at multiple locations (Chicago and San Diego in this example)
– Lets them share and steer scientific simulations on a supercomputer
• Combustion code developed by Lori Freitag at ANL
• Boiler application used to troubleshoot and design better products
Early Experiences with Grid Computing
• Gigabit Testbeds Program
– In the late 80's and early 90's, the gigabit testbed program was developed as a joint NSF, DARPA, CNRI (Corporation for National Research Initiatives, Bob Kahn) initiative
– Goals were to:
• investigate potential architectures for a gigabit/sec network testbed
• explore its usefulness for end-users
Gigabit Testbeds – Early 90's
• 6 testbeds formed:
– CASA (southwest)
– MAGIC (midwest)
– BLANCA (midwest)
– AURORA (northeast)
– NECTAR (northeast)
– VISTANET (southeast)
• Each had a unique blend of research in applications and in networking and computer science
Gigabit Testbeds

Blanca
– Sites: NCSA, UIUC, UCB, UWisc, AT&T
– Hardware: experimental ATM switches running over experimental 622 Mb/s and 45 Mb/s circuits developed by AT&T and the universities
– Application focus: virtual environments, remote visualization and steering, multimedia digital libraries
– Remarks: network spanned the US (UCB to AT&T); network research included distributed virtual memory, real-time protocols, congestion control, signaling protocols, etc.

Vistanet
– Sites: MCNC, UNC, BellSouth
– Hardware: ATM network at OC-12 (622 Mb/s) interconnecting HIPPI local area networks
– Application focus: radiation treatment planning involving a supercomputer, a remote instrument (radiation beam) and visualization
– Remarks: medical personnel planned radiation beam orientation using a supercomputer; extended the planning process from 2 beams in 2 dimensions to multiple beams in 3 dimensions

Nectar
– Sites: CMU, Bell Atlantic, Bellcore, PSC
– Hardware: OC-48 (2.4 Gb/s) links between the PSC supercomputer facility and CMU
– Application focus: coupled supercomputers running chemical reaction dynamics, plus CS research
– Remarks: metropolitan-area testbed with OC-48 links between PSC and the downtown CMU campus
Gigabit Testbeds (continued)

Aurora
– Sites: MIT, IBM, Bellcore, Penn, MCI
– Hardware: OC-12 network interconnecting 4 research sites and supporting the development of ATM host interfaces, ATM switches and network protocols
– Application focus: telerobotics, distributed virtual memory and operating system research
– Remarks: east coast sites; research focused mostly on network and computer science issues

Magic
– Sites: Army Battle Lab, Sprint, UKansas, UMinn, LBL, Army HPC Lab
– Hardware: OC-12 network to interconnect ATM-attached hosts
– Application focus: remote vehicle control applications and high-speed access to databases for terrain visualization and battle simulation
– Remarks: funded separately by DARPA after the CNRI initiative had begun

Casa
– Sites: Caltech, SDSC, LANL, JPL, MCI, USWest, PacBell
– Hardware: HIPPI switches connected by HIPPI-over-SONET at OC-12
– Application focus: distributed supercomputing
– Remarks: targeted improving the performance of distributed supercomputing applications by strategically mapping application components onto resources
I-WAY
• First large-scale "modern" Grid experiment
• Put together for SC'95 (the "Supercomputing" conference)
• The I-WAY consisted of a Grid of 17 sites connected by the vBNS
• Over 60 applications ran on the I-WAY during SC'95
I-WAY "Architecture"
• Each I-WAY site was served by an I-POP (I-WAY Point of Presence) used for:
– authentication of distributed applications
– distribution of associated libraries and other software
– monitoring the connectivity of the I-WAY virtual network
• Users could use single authentication and job submission across multiple sites, or they could work directly with the end resources
• Scheduling was done with a "human in the loop"
I-Soft – Software for the I-WAY
• Kerberos-based authentication
– I-POP initiated rsh to local resources
• AFS for distribution of software and state
• Central scheduler
– Dedicated I-WAY nodes on each resource
– Interface to the local scheduler
• Nexus-based communication libraries
– MPI, CaveComm, CC++
• In many ways, the I-WAY experience formed the foundation of Globus
I-WAY Application: Cloud Detection
• Cloud detection from multimodal satellite data
– Want to determine whether a satellite image is clear, partially cloudy or completely cloudy
• Used a remote supercomputer to enhance the instruments with:
– Real-time response
– Enhanced function and accuracy (of the pixel image)
• Developed by C. Lee (Aerospace Corporation), Kesselman (Caltech), et al.
PACIs
• 2 NSF Supercomputer Centers (PACIs) – SDSC/NPACI and NCSA/Alliance, both committed to Grid computing
• vBNS backbone between NCSA and SDSC running at OC-12 with connectivity to over 100 locations at speeds ranging from 45 Mb/s to 155 Mb/s or more
PACI Grid
NPACI Grid Activities
• The Metasystems Thrust Area is one of the NPACI technology thrust areas
– Goal is to create an operational metasystem for NPACI
• Metasystems players:
– Globus (Kesselman)
– Legion (Grimshaw)
– AppLeS (Berman and Wolski)
– Network Weather Service (Wolski)
Alliance Grid Activities
• The Grid Task Force and the Distributed Computing team are Alliance teams
• Globus is supported as the exclusive grid infrastructure by the Alliance
• The Grid concept is pervasive throughout the Alliance
– Access Grid developed for use by distributed collaborative groups
• Alliance grid players include Foster (Globus), Livny (Condor), Stevens (ANL), Reed (Pablo), etc.
Other Efforts
• Centurion Cluster = Legion testbed
– Legion cluster housed at UVA
– 128 533-MHz DEC Alphas
– 128 dual 400-MHz Pentium II nodes
– Fast Ethernet and Myrinet
• Globus testbed = GUSTO, which supports Globus infrastructure and application development
– 125 sites in 23 countries as of 2/2000
– Testbed aggregated from partner sites (including NPACI)
GUSTO (Globus) Computational Grid
IPG
• IPG = Information Power Grid
• NASA's effort in grid computing
• Globus supported as the underlying infrastructure
• Application foci include aerospace design, environmental and space applications
Research and Development Foci for the Grid: Applications
• Questions revolve around the design and development of "Grid-aware" applications
• Different programming models: polyalgorithms, components, mixed languages, etc.
• Program development environments and tools are required for the development and execution of performance-efficient applications
[Diagram of Grid layers: Applications, Middleware, Infrastructure, Resources]
Research and Development Foci for the Grid: Middleware
• Questions revolve around the development of tools and environments which facilitate application performance
• Software must be able to assess and utilize dynamic performance characteristics of resources to support the application (see the sketch below)
• Agent-based computing and resource negotiation
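As a hypothetical sketch of that middleware role, loosely in the spirit of the AppLeS/NWS scheduling mentioned earlier: pick the resource with the best predicted completion time based on dynamic forecasts. The resource names, forecast values and the cost model are all made-up assumptions, not an actual API.

```python
# Hypothetical resource-selection sketch, loosely in the AppLeS/NWS spirit:
# choose the machine with the best *predicted* completion time from dynamic
# forecasts. Resource names, forecast values and the cost model are made up.

resources = {
    # name: (forecast of available CPU fraction, peak MFLOP/s, MB/s to the input data)
    "sp2.siteA":      (0.40, 480.0, 2.0),
    "exemplar.siteB": (0.85, 720.0, 0.8),
    "cluster.siteC":  (0.60, 350.0, 6.0),
}

def predicted_time(work_mflop, input_mb, cpu_free, peak_mflops, bw_mb_s):
    # completion time = time to stage the input data
    #                 + compute time on the fraction of the machine forecast to be free
    return input_mb / bw_mb_s + work_mflop / (cpu_free * peak_mflops)

work_mflop, input_mb = 50_000.0, 300.0   # assumed application demands

best = min(resources, key=lambda r: predicted_time(work_mflop, input_mb, *resources[r]))
print("schedule on:", best)
```

The point of the sketch is that the "best" resource depends on fresh measurements (load, bandwidth), not on static peak numbers, which is exactly what makes Grid middleware harder than conventional batch scheduling.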
Research and Development Foci for the Grid: Infrastructure
• Development of infrastructure that presents a "virtual machine" view of the Grid to users (an illustrative sketch follows this list)
• Questions revolve around providing basic services to users: security, remote file transfer, resource management, etc., as well as exposing performance characteristics
• Services must be supported across heterogeneous resources and must interoperate
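One way to picture the "virtual machine" view is as a small uniform service interface that hides each site's local systems. The sketch below is purely illustrative: the class and method names are invented for this example and are not the Globus or Legion APIs.

```python
# Purely illustrative sketch of a uniform "virtual machine" interface over
# heterogeneous sites. Class and method names are invented for this example;
# they are not the Globus or Legion APIs.
from abc import ABC, abstractmethod

class GridResource(ABC):
    """The same small set of services, regardless of what runs at the site."""

    @abstractmethod
    def authenticate(self, credential: str) -> bool: ...

    @abstractmethod
    def transfer_file(self, local_path: str, remote_path: str) -> None: ...

    @abstractmethod
    def submit_job(self, executable: str, args: list[str]) -> str:
        """Return a job id; the site's local batch system stays hidden."""

class BatchCluster(GridResource):
    # One hypothetical back end: translate the generic calls into whatever the
    # local scheduler and file system actually require.
    def authenticate(self, credential: str) -> bool:
        return credential.startswith("cert:")

    def transfer_file(self, local_path: str, remote_path: str) -> None:
        print(f"staging {local_path} -> {remote_path}")

    def submit_job(self, executable: str, args: list[str]) -> str:
        print(f"submitting {executable} {' '.join(args)} to the local queue")
        return "job-001"

# An application sees only the uniform interface:
site: GridResource = BatchCluster()
if site.authenticate("cert:alice"):
    site.transfer_file("input.dat", "/scratch/input.dat")
    print("job id:", site.submit_job("simulate", ["/scratch/input.dat"]))
```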
Research and Development Foci for the Grid: Resources
• Questions revolve around heterogeneity and scale
• New challenges focus on combining wireless and wired, static and dynamic, low-power and high-power, cheap and expensive resources
• Performance characteristics of grid resources vary dramatically; integrating them to support the performance of individual and multiple applications is extremely challenging