Transcript of "Grids: TACC Case Study" (26 slides)

Page 1:

TEXAS ADVANCED COMPUTING CENTER

Grids: TACC Case Study

Ashok Adiga, Ph.D.
Distributed & Grid Computing Group
Texas Advanced Computing Center
The University of Texas at Austin
[email protected]
(512) 471-8196

Page 2: Outline

• Overview of TACC Grid Computing Activities
• Building a Campus Grid – UT Grid
• Addressing common use cases
  – Scheduling & workflow
  – Grid portals
• Conclusions

Page 3: TACC Grid Program

• Building grids at TACC
  – Campus grid (UT Grid)
  – State grid (TIGRE)
  – National grid (ETF)
• Grid hardware resources
  – Wide range of hardware resources available to the research community at UT and partners
• Grid software resources
  – NMI components, NPACKage
  – User portals, GridPort
  – Job schedulers: LSF MultiCluster, Community Scheduler Framework
  – United Devices (desktop grids)
• Significantly leveraging NMI components and experience

Page 4: TACC Resources: Providing Comprehensive, Balanced Capabilities

HPC
  Cray-Dell cluster: 600 CPUs, 3.67 Tflops, 0.6 TB memory, 25 TB disk
  IBM Power4 system: 224 CPUs, 1.16 Tflops, 0.5 TB memory, 7.1 TB disk

Data storage
  Sun SAN: 12 TB across research and main campuses
  STK PowderHorn silo: 2.8 PB capacity

Visualization
  SGI Onyx2: 24 CPUs, 25 GB memory, 6 IR2 graphics pipes
  Sun V880z: 4 CPUs, 2 Zulu graphics pipes
  Dell/Windows cluster: 18 CPUs, 9 NVIDIA NV30 cards (soon)
  Large immersive environment and 10 large, tiled displays

Networking
  Nortel 10GigE DWDM: between machine room and vislab building
  Force10 switch-routers: 1.2 Tbps, in machine room and vislab building
  TeraBurst V20s: OC48 video capability for remote, collaborative 3D visualization

Page 5: TeraGrid (National)

• NSF Extensible Terascale Facility (ETF) project
  – Build and deploy the world's largest, fastest distributed computational infrastructure for general scientific research
  – Current members: San Diego Supercomputer Center, NCSA, Argonne National Laboratory, Pittsburgh Supercomputing Center, California Institute of Technology
  – Currently has a 40 Gbps backbone with hubs in Los Angeles & Chicago
  – 3 new members added in September 2003:
     The University of Texas (led by TACC)
     Oak Ridge National Laboratory
     Indiana U/Purdue U

Page 6: TeraGrid (National)

• UT awarded $3.2M to join NSF ETF in September 2003
  – Establish 10 Gbps network connection to the ETF backbone
  – Provide access to high-end computers capable of 6.2 teraflops, a new terascale visualization system, and a 2.8-petabyte mass storage system
  – Provide access to geoscience data collections used in environmental, geological, climate, and biological research:
     high-resolution digital terrain data
     worldwide hydrological data
     global gravity data
     high-resolution X-ray computed tomography data
• Current software stack includes Globus (GSI, GRAM, GridFTP), MPICH-G2, Condor-G, GPT, MyProxy, and SRB (see the sketch below)
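
To make this stack concrete, here is a hedged Python sketch of how a user session typically drives these components from the command line via subprocess: GSI authentication with grid-proxy-init, a GRAM job with globus-job-run, and a GridFTP transfer with globus-url-copy. The host names and file paths are placeholders, not actual TeraGrid endpoints.

    """Hedged sketch: driving the GT2-era stack (GSI + GRAM + GridFTP) from Python.

    Assumes the Globus Toolkit command-line clients are installed and on PATH;
    host names and paths below are illustrative placeholders.
    """
    import subprocess

    def run(cmd):
        # Echo and run a command, raising an error if it exits non-zero.
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)

    # 1. GSI: create a short-lived proxy credential (prompts for the key passphrase).
    run(["grid-proxy-init"])

    # 2. GRAM: run a trivial job through the gatekeeper on a hypothetical compute host.
    run(["globus-job-run", "compute.example.utexas.edu", "/bin/hostname"])

    # 3. GridFTP: copy a result file back to the local machine.
    run([
        "globus-url-copy",
        "gsiftp://compute.example.utexas.edu/scratch/results.dat",
        "file:///tmp/results.dat",
    ])

MPICH-G2, Condor-G, and SRB layer on top of the same GSI proxy, so the grid-proxy-init step is shared across all of them.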

Page 7: TIGRE (State-wide Grid)

• Texas Internet Grid for Research and Education
  – A computational grid to integrate computing & storage systems, databases, visualization laboratories and displays, and instruments and sensors across Texas
  – Current TIGRE participants:
     Rice
     Texas A&M
     Texas Tech University
     Univ of Houston
     Univ of Texas at Austin (TACC)
  – Grid software for the TIGRE testbed:
     Globus, MPICH-G2, NWS, SRB
     Other local packages must be integrated
     Goal: track NMI GRIDS

Page 8: UT Grid (Campus Grid)

• Mission: integrate and simplify the usage of the diverse computational, storage, visualization, data, and instrument resources of UT to facilitate new, powerful paradigms for research and education.
• UT Austin participants:
  – Texas Advanced Computing Center (TACC)
  – Institute for Computational Engineering & Sciences (ICES)
  – Information Technology Services (ITS)
  – Center for Instructional Technologies (CIT)
  – College of Engineering (COE)

Page 9: What is a Campus Grid?

• Important differences from enterprise grids
  – Researchers are generally more independent than in a company with a tight focus on mission and profits
  – No central IT group governs researchers' systems
     Systems are paid for out of grants, so authority is distributed
     Owners of PCs and clusters have total control; they reconfigure and participate if willing
  – Lots of heterogeneity; lots of low-cost, poorly supported systems
  – Accounting is potentially less important
     Focus on increasing research effectiveness allows tackling problems early (scheduling, workflow, etc.)

Page 10: UT Grid: Approach

• Unique characteristics present opportunities
  – Some campus researchers want to be on the bleeding edge, unlike commercial enterprises
  – TACC provides the high-end systems that researchers require
  – Campus users initially have trust relationships with TACC, but not with each other
• How to build a campus grid:
  – Build a hub & spoke grid first
  – Address both productivity and grid R&D

Page 11: UT Grid: Logical View

1. Integrate distributed TACC resources first (Globus, LSF, NWS, SRB, United Devices, GridPort)

[Diagram: TACC HPC, Vis, and Storage resources (actually spread across two campuses)]

Page 12: UT Grid: Logical View

2. Next, add other UT resources in one building as spokes, using the same tools and procedures

[Diagram: TACC HPC, Vis, and Storage hub with three ICES cluster spokes]

Page 13: UT Grid: Logical View

2. (continued) Add other UT resources in one building as spokes, using the same tools and procedures

[Diagram: TACC hub with the ICES cluster spokes plus two GEO cluster spokes]

Page 14: UT Grid: Logical View

2. (continued) Add other UT resources in one building as spokes, using the same tools and procedures

[Diagram: TACC hub with ICES, GEO, BIO, and PGE cluster spokes]

Page 15: UT Grid: Logical View

3. Finally, negotiate connections between spokes for willing participants to develop a P2P grid.

[Diagram: TACC hub with ICES, GEO, BIO, and PGE cluster spokes, now also interconnected as peers]

Page 16: UT Grid: Physical View

[Diagram: physical view spanning the research campus and the main campus, linked by GAATN, with external network connections through the campus NOCs. Labeled resources include the TACC Power4, TACC Vis, TACC Storage, and TACC clusters, the ICES clusters, and the PGE clusters, each attached through local switches (ACES, CMS, PGE).]

Page 17: UT Grid: Focus

• Address users interested only in increased productivity
  – Some users just want to be more productive with TACC resources and their own (and others'): scheduling throughput, data collections, workflow
  – Install 'lowest common denominator' software only on TACC production resources and user spokes, for productivity: Globus 2.x, GridPort 2.x, WebSphere, LSF MultiCluster, SRB, NWS, United Devices, etc.

Page 18: UT Grid: Focus

• Address users interested in grid R&D issues
  – Some users want to conduct grid-related R&D: grid scheduling, performance modeling, meta-applications, P2P storage, etc.
  – Also install bleeding-edge software to support grid R&D on the TACC testbed and willing spoke systems: Globus 3.0 and other OGSA software, GridPort 3.x, Community Scheduler Framework, etc.

Page 19: Scheduling & Workflow

• Use case: a researcher wants to run a climate modeling job on a compute cluster and view the results using a specified visualization resource
• Grid middleware requirements:
  – Schedule the job to the "best" compute cluster
  – Forward results to the specified visualization resource
  – Support advanced reservations on the vis resource
• Currently solved using LSF MultiCluster & Globus (GSI, GridFTP, GRAM); a sketch of this flow follows below
• Evaluating the CSF meta-scheduler for future use
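
The slides do not show the actual commands, so the following is only a hedged sketch of the Globus side of this use case: submit the compute job through a GRAM jobmanager (the local LSF scheduler sits behind it), poll until it finishes, then forward the output to the visualization host with a third-party GridFTP transfer. Contact strings, scripts, and paths are placeholders, and the advance reservation on the vis resource is not shown.

    """Hedged sketch of the climate-job use case with GT2 command-line tools.

    Host names, the jobmanager contact, and all paths are placeholders.
    """
    import subprocess
    import time

    COMPUTE = "compute.example.utexas.edu/jobmanager-lsf"  # hypothetical GRAM contact
    VIS = "vis.example.utexas.edu"                          # hypothetical vis resource

    def capture(cmd):
        # Run a command and return its stdout, raising on a non-zero exit.
        return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout.strip()

    # Submit the (placeholder) climate model run in batch mode; GRAM prints a job contact URL.
    contact = capture(["globus-job-submit", COMPUTE, "/work/climate/run_model.sh"])

    # Poll GRAM until the job reaches a terminal state.
    while True:
        state = capture(["globus-job-status", contact])
        if state in ("DONE", "FAILED"):
            break
        time.sleep(60)

    # Forward the results to the visualization resource with a third-party GridFTP transfer.
    if state == "DONE":
        subprocess.run([
            "globus-url-copy",
            "gsiftp://compute.example.utexas.edu/scratch/model_output.nc",
            f"gsiftp://{VIS}/data/incoming/model_output.nc",
        ], check=True)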

Page 20: What is CSF?

• CSF (Community Scheduler Framework):
  – Open-source meta-scheduler framework contributed by Platform Computing to Globus for possible inclusion in the Globus Toolkit
  – Developed with the latest version of OGSI, the grid guideline being developed within the Global Grid Forum (OGSA)
  – Extensible framework for implementing meta-schedulers:
     Supports heterogeneous workload execution software (LSF, PBS, SGE)
     Negotiates advanced reservations (WS-Agreement)
     Selects the best resource for a given job based on specified policies (a policy sketch follows below)
  – Provides a standard API to submit and manage jobs
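
CSF's actual OGSI interfaces are not reproduced in the slides, so the snippet below only illustrates the "select the best resource for a given job based on specified policies" idea in generic terms; the Cluster record and the scoring rule are invented for this example and are not CSF's API.

    """Illustrative (non-CSF) sketch of policy-based resource selection in a meta-scheduler.

    A real meta-scheduler would obtain this state from the underlying schedulers
    (LSF, PBS, SGE) and apply site-defined policies; the fields and policy here
    are invented for the example.
    """
    from dataclasses import dataclass

    @dataclass
    class Cluster:
        name: str
        free_cpus: int
        queue_length: int
        supports_reservation: bool

    def pick_best(clusters, cpus_needed, need_reservation=False):
        # Keep only clusters that can run the job at all.
        candidates = [c for c in clusters
                      if c.free_cpus >= cpus_needed
                      and (c.supports_reservation or not need_reservation)]
        if not candidates:
            return None
        # Example policy: prefer the shortest queue, then the most free CPUs.
        return min(candidates, key=lambda c: (c.queue_length, -c.free_cpus))

    clusters = [
        Cluster("clusterA-lsf", free_cpus=128, queue_length=4, supports_reservation=True),
        Cluster("clusterB-pbs", free_cpus=32, queue_length=0, supports_reservation=False),
    ]
    print(pick_best(clusters, cpus_needed=16, need_reservation=True).name)  # clusterA-lsf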

Page 21: Example CSF Configuration

[Diagram: two virtual organizations, VO A and VO B. Each hosts GT3.0-based CSF services (Queuing Service, Job Service, Reservation Service) and a CA; GT3.0 resource-manager adapters for LSF and PBS connect the CSF services to the local cluster schedulers.]

Page 22: Grid Portals

• Use case: a researcher logs on using a single grid portal account, which enables her to
  – Be authenticated across all resources on the grid
  – Submit and manage job sequences on the entire grid
  – View account allocations and usage
  – View the current status of all grid resources
  – Transfer files between grid resources
• GridPort provides base services used to create customized portals (e.g. HotPages). Technologies (a back-end sketch follows below):
  – Security: GSI, SSH, MyProxy
  – Job execution: GRAM Gatekeeper
  – Information services: MDS, NWS, custom information scripts
  – File management: GridFTP
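
As an illustration of how a GridPort-style portal back end might stitch these technologies together for one logged-in user, here is a hedged Python sketch: fetch a short-lived delegated credential from a MyProxy server, then use it for a GRAM status call and a GridFTP transfer. The MyProxy option names, host names, and paths are assumptions for illustration only and should be checked against the installed MyProxy and Globus versions.

    """Hedged sketch of one request handled by a hypothetical portal back end.

    All host names and paths are placeholders; the myproxy-get-delegation options
    (-s, -l, -t, -o) are assumptions and may differ between MyProxy versions.
    """
    import os
    import subprocess

    MYPROXY_SERVER = "myproxy.example.utexas.edu"  # hypothetical
    GRAM_CONTACT = "compute.example.utexas.edu"    # hypothetical
    PROXY_FILE = "/tmp/portal_proxy"

    def portal_request(username):
        # 1. Retrieve a delegated proxy the user previously stored with myproxy-init.
        #    (A real portal would supply the passphrase non-interactively; here the
        #    tool is simply allowed to prompt.)
        subprocess.run(["myproxy-get-delegation", "-s", MYPROXY_SERVER,
                        "-l", username, "-t", "2", "-o", PROXY_FILE], check=True)

        # 2. Point the Globus tools at that proxy for the rest of this request.
        env = dict(os.environ, X509_USER_PROXY=PROXY_FILE)

        # 3. "View current status": run a trivial command on the resource via GRAM.
        status = subprocess.run(["globus-job-run", GRAM_CONTACT, "/bin/uptime"],
                                env=env, capture_output=True, text=True, check=True)

        # 4. "Transfer files": pull a file from the resource with GridFTP.
        subprocess.run(["globus-url-copy",
                        f"gsiftp://{GRAM_CONTACT}/home/{username}/results.txt",
                        f"file:///tmp/{username}_results.txt"],
                       env=env, check=True)
        return status.stdout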

Page 23:

[No transcribed text on this slide.]

Page 24: GridPort Application Portals

• UT/Texas grids:
  – http://gridport.tacc.utexas.edu
  – http://tigre.hipcat.net
• NPACI/PACI/TeraGrid HotPages (also @ PACI/NCSA):
  – https://hotpage.npaci.edu
  – http://hotpage.teragrid.org
  – https://hotpage.paci.org
• Telescience/BIRN (Biomedical Informatics Research Network):
  – https://gridport.npaci.edu/Telescience
• DOE Fusion Grid Portal
• Will use a GridPort-based portal to run scheduling experiments using portals and CSF at the upcoming Supercomputing 2003
• Contributing and founding member of the NMI Portals Project:
  – Open Grid Computing Environments (OGCE)

Page 25: Conclusions

• Grid technologies are progressing & improving but are still 'raw'
  – Cautious outreach to the campus community
  – UT campus grid under construction; working with beta users now
• Computational science problems have not changed:
  – Users want easier tools, familiar user environments (e.g. the command line), or easy portals
• Workflow appears to be a desirable tool:
  – GridFlow/GridSteer project under way
  – Working with advanced file management and scheduling to automate distributed tasks

Page 26: TACC Grid Computing Activities Participants

• Participants include most of the TACC Distributed & Grid Computing Group:
  – Ashok Adiga
  – Jay Boisseau
  – Maytal Dahan
  – Eric Roberts
  – Akhil Seth
  – Mary Thomas
  – Tomislav Urban
  – David Walling
  – As of Dec. 1, Edward Walker (formerly of Platform Computing)