DCC/FCUP Grid Computing 1 Resource Management Systems.


NQE (Network Queuing Environment)


NQE

FTA: File Transfer Agent
NQS: Network Queuing System

[Figure: NQE diagram; surviving labels: "./prog.out", "snow"]


NQE user commands:
cevent   Posts, reads, and deletes job-dependency event information.
cqdel    Deletes or signals to a specified batch request.
cqstatl  Provides a line-mode display of requests and queues on a specified host.
cqsub    Submits a batch request to NQE.
ftua     Transfers a file interactively (this command is issued on an NQE server only).
ilb      Executes a load-balanced interactive command.
nqe      Provides a graphical user interface (GUI) to NQE functionality.

Commands issued on an NQE server only:
qalter   Alters the attributes of one or more NQS requests.
qchkpnt  Checkpoints an NQS request on a UNICOS, UNICOS/mk, or IRIX system.
qdel     Deletes or signals NQS requests.
qlimit   Displays NQS batch limits for the local host.
qmsg     Writes messages to stderr, stdout, or the job log file of an NQS batch request.
qping    Determines whether the local NQS daemon is running and responding to requests.
qstat    Displays the status of NQS queues, requests, and queue complexes.
qsub     Submits a batch request to NQS.
rft      Transfers a file in a batch request.

Source: http://techpubs.sgi.com/library/tpl/cgi-bin/getdoc.cgi?coll=0650&db=bks&fname=/SGI_Admin/NQE_AG/apa.html


SGE (Sun Grid Engine)

A single resource can perform more than one activity.


SGE

Commands are similar to the ones used by NQE.

Example: g.job

#!/bin/csh
gaussian < testDFT.in

To run:
qsub -pe smp 4 -M [email protected] -m ae -r n g.job

Or...


SGE

File g.job:
#!/bin/csh
#$ -pe smp 4    # parallel environment
#$ -M [email protected]
#$ -m ae        # mail sent at end/abort
#$ -r n         # no rerun
gaussian < testDFT.in

To run: qsub g.job
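The two submission styles are equivalent: each option given to qsub on the command line can instead be written in the job file as a line starting with "#$". A minimal Python sketch of that conversion follows. It assumes every option starts with "-" and is followed by its own arguments; the helper names options_to_directives and build_job_script are hypothetical and not part of SGE, and the mail address option is left out of the example.

```python
def options_to_directives(options):
    """Group qsub options with their arguments and emit one '#$' line per option."""
    directives = []
    for token in options:
        if token.startswith("-"):
            directives.append([token])        # a new option starts a new directive
        elif directives:
            directives[-1].append(token)      # argument of the current option
        else:
            raise ValueError(f"argument {token!r} appears before any option")
    return ["#$ " + " ".join(parts) for parts in directives]


def build_job_script(shell, options, command):
    """Assemble a job file: shebang line, embedded options, then the command."""
    lines = ["#!" + shell] + options_to_directives(options) + [command]
    return "\n".join(lines) + "\n"


script = build_job_script(
    "/bin/csh",
    ["-pe", "smp", "4", "-m", "ae", "-r", "n"],
    "gaussian < testDFT.in",
)
print(script)
```

With the options embedded this way, the submission command reduces to the job file name alone.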


SGE: another example

#$ -pe openmpi* 32

#$ -q short*

#$ -l dedicated=4


SGE: another example

#$ -V                      # Inherit the submission environment
#$ -cwd                    # Start job in submission directory
#$ -N myMPI                # Job name
#$ -j y                    # Combine stderr and stdout
#$ -o $JOB_NAME.o$JOB_ID   # Name of the output file (e.g. myMPI.oJobID)
#$ -pe 12way 24            # Requests 12 tasks/node, 24 cores total
#$ -q normal               # Queue name "normal"
#$ -l h_rt=01:30:00        # Run time (hh:mm:ss): 1.5 hours
#$ -M                      # Use email notification address
#$ -m be                   # Email at begin and end of job


SGE

- The user can specify requirements (CPU type, disk storage, memory, etc.)
- SGE registers the task, its requirements and control information (owner, group, department, date/time of submission, etc.)
- Planner (scheduler) to execute tasks:
  - As soon as a resource queue is available, SGE launches the execution of one of the waiting tasks
  - The task chosen is the one with the greatest priority or the longest waiting time, according to the configuration of the planner
  - If there are several available queues, the least loaded one is chosen
  - There can be more than one queue per cluster
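The planner behaviour described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the classes Task and Queue, the "slots" notion of queue capacity, and the definition of load as running tasks divided by slots are all hypothetical and are not taken from SGE itself.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    priority: int
    submitted_at: float          # submission time, in seconds


@dataclass
class Queue:
    name: str
    slots: int                   # assumed capacity of the queue
    running: list = field(default_factory=list)

    @property
    def load(self):
        return len(self.running) / self.slots

    @property
    def available(self):
        return len(self.running) < self.slots


def pick_task(waiting, now, policy="priority"):
    """Choose the next task according to the planner configuration."""
    if policy == "priority":
        # highest priority first; ties go to the task that waited longest
        return max(waiting, key=lambda t: (t.priority, now - t.submitted_at))
    if policy == "waiting_time":
        return max(waiting, key=lambda t: now - t.submitted_at)
    raise ValueError(f"unknown policy: {policy}")


def pick_queue(queues):
    """Among the queues with a free slot, choose the least loaded one."""
    candidates = [q for q in queues if q.available]
    return min(candidates, key=lambda q: q.load) if candidates else None


def dispatch(waiting, queues, now, policy="priority"):
    """Launch waiting tasks while some queue is available."""
    launched = []
    while waiting:
        queue = pick_queue(queues)
        if queue is None:
            break                # every queue is full: remaining tasks keep waiting
        task = pick_task(waiting, now, policy)
        waiting.remove(task)
        queue.running.append(task)
        launched.append((task.name, queue.name))
    return launched


queues = [Queue("short", slots=2), Queue("normal", slots=4, running=[Task("old", 0, 0.0)])]
waiting = [Task("a", 1, 10.0), Task("b", 5, 50.0), Task("c", 1, 5.0)]
print(dispatch(waiting, queues, now=100.0))
```

Task b goes first because of its priority; c precedes a because, at equal priority, it has been waiting longer.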


SGE

Planning policies:
- Based on tickets (user)
  - The more tickets a user has, the greater that user's priority
  - Tickets are assigned statically according to the queueing policy and the priorities assigned to each user
- Based on urgency (tasks)
  - Time limit to finish the task (can be specified by the user)
  - Waiting time in the queue
  - Required resources
- Customized: allows arbitrary assignment of priorities to the tasks (similar to nice)
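One way to see how the three policies can coexist is to turn each into a number and combine them with weights. The Python sketch below does that. Everything in it is an assumption made for illustration: the weights, the urgency formula, the Job fields and the convention that a higher custom priority means a more important task are hypothetical and do not reproduce SGE's actual configuration or defaults.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative weights for the three policies (assumed values, not SGE defaults).
W_TICKET, W_URGENCY, W_CUSTOM = 1.0, 0.5, 2.0


@dataclass
class Job:
    name: str
    owner_tickets: int                    # tickets held by the submitting user
    seconds_to_deadline: Optional[float]  # None when the user gave no time limit
    seconds_waiting: float                # time already spent in the queue
    slots_requested: int                  # stand-in for "required resources"
    custom_priority: int = 0              # set by hand; higher = more important (assumed)


def normalise(values):
    """Scale values to [0, 1]; identical values all map to 0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def urgency(job):
    """Toy urgency: deadline term + waiting term + resource term."""
    if job.seconds_to_deadline is None:
        deadline = 0.0
    else:
        deadline = 3600.0 / max(job.seconds_to_deadline, 1.0)
    waiting = job.seconds_waiting / 3600.0
    return deadline + waiting + job.slots_requested


def rank(jobs):
    """Return the jobs sorted from highest to lowest combined priority."""
    tickets = normalise([j.owner_tickets for j in jobs])
    urgencies = normalise([urgency(j) for j in jobs])
    custom = normalise([j.custom_priority for j in jobs])
    scored = [
        (W_TICKET * t + W_URGENCY * u + W_CUSTOM * c, j)
        for t, u, c, j in zip(tickets, urgencies, custom, jobs)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [j for _, j in scored]


jobs = [
    Job("alice_job", 100, None, 600.0, 1, custom_priority=10),
    Job("bob_job", 300, None, 60.0, 1),
    Job("carol_job", 100, 1800.0, 60.0, 4),
]
print([j.name for j in rank(jobs)])
```

In this example the hand-assigned priority puts alice_job first, bob_job follows thanks to its owner's tickets, and carol_job ranks on urgency alone.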


SGE

Life cycle of a task:
- Submission
- Master stores the task and informs the planner
- Planner inserts the task in the appropriate queue
- Master sends the task to the host
- Before executing, the execution daemon:
  - Changes to the directory of the task
  - Initializes the environment (env variables)
  - Initializes the set of processors
  - Changes the task uid to the owner's uid
  - Initializes resource limits for the process
  - Collects accounting info
- When it finishes these steps, it stores the task in its database and waits for it to finish
- Once the task is finished, it informs the master and deletes the task from the database
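The life cycle above can be followed step by step in a small simulation. The Python sketch below only records what each component would do; the class names Master and ExecutionDaemon, the state strings and the in-memory "database" are hypothetical stand-ins, and no real directory change, uid switch or resource limit is performed.

```python
PREPARATION_STEPS = [
    "change to the task directory",
    "initialize environment variables",
    "initialize the processor set",
    "switch to the owner's uid",
    "set resource limits",
    "start collecting accounting info",
]


class ExecutionDaemon:
    def __init__(self):
        self.database = {}   # tasks this daemon is currently responsible for
        self.log = []        # record of the preparation steps, for inspection

    def start(self, task_id):
        for step in PREPARATION_STEPS:
            self.log.append((task_id, step))  # a real daemon would perform the step here
        self.database[task_id] = "running"    # stored only after all steps are done

    def finish(self, task_id, master):
        master.task_finished(task_id)         # inform the master...
        del self.database[task_id]            # ...then delete the task locally


class Master:
    def __init__(self, daemon):
        self.daemon = daemon
        self.states = {}     # task_id -> current life-cycle state
        self.queue = []      # stands in for the planner's queue

    def submit(self, task_id):
        self.states[task_id] = "submitted"    # master stores the task
        self.queue.append(task_id)            # planner inserts it in a queue
        self.states[task_id] = "queued"

    def dispatch(self):
        task_id = self.queue.pop(0)           # master sends the next task to the host
        self.daemon.start(task_id)
        self.states[task_id] = "running"
        return task_id

    def task_finished(self, task_id):
        self.states[task_id] = "finished"


daemon = ExecutionDaemon()
master = Master(daemon)
master.submit("job-1")
running = master.dispatch()
daemon.finish(running, master)
print(master.states, daemon.database)
```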


SGE

Some commands:
qconf: cluster configuration
qsub: task submission
qdel: deletes tasks from the queue
qacct: accounting information
qhost: inspects host status
qstat: inspects queue status


SGE GUI

[Figure: SGE graphical user interface]


Condor

It is a specialized job and resource management system. It provides:
- Job management mechanism
- Scheduling
- Priority scheme
- Resource monitoring
- Resource management


Condor

- The user submits a job to an agent.
- The agent is responsible for remembering jobs in persistent storage while finding resources willing to run them.
- Agents and resources advertise themselves to a matchmaker, which is responsible for introducing potentially compatible agents and resources.
- At the agent, a shadow is responsible for providing all the details necessary to execute a job.
- At the resource, a sandbox is responsible for creating a safe execution environment for the job and protecting the resource from any mischief.
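The matchmaking step can be illustrated with a small Python sketch. Each side publishes an advertisement, and the matchmaker introduces a job to a resource only when both sets of requirements are satisfied. The dictionaries, the field names (memory_mb, owner, rank) and the use of Python functions for requirements are assumptions made for this example; they imitate the idea of an advertisement and are not Condor's actual ClassAd language.

```python
def matches(job_ad, resource_ad):
    """Two ads are compatible when each side's requirements accept the other ad."""
    return job_ad["requirements"](resource_ad) and resource_ad["requirements"](job_ad)


def matchmake(job_ads, resource_ads):
    """Introduce each job to its preferred compatible resource, one job per resource."""
    free = list(resource_ads)
    introductions = []
    for job in job_ads:
        candidates = [r for r in free if matches(job, r)]
        if not candidates:
            continue                                   # the agent keeps the job and retries later
        best = max(candidates, key=job["rank"])        # the job's own preference
        free = [r for r in free if r is not best]
        introductions.append((job["name"], best["name"]))
    return introductions


resources = [
    {"name": "node1", "memory_mb": 2048,
     "requirements": lambda job: job["owner"] != "guest"},   # owner policy of this machine
    {"name": "node2", "memory_mb": 8192,
     "requirements": lambda job: True},
]

jobs = [
    {"name": "sim", "owner": "alice",
     "requirements": lambda r: r["memory_mb"] >= 4096,
     "rank": lambda r: r["memory_mb"]},
    {"name": "render", "owner": "guest",
     "requirements": lambda r: r["memory_mb"] >= 1024,
     "rank": lambda r: r["memory_mb"]},
    {"name": "test", "owner": "bob",
     "requirements": lambda r: r["memory_mb"] >= 1024,
     "rank": lambda r: r["memory_mb"]},
]

print(matchmake(jobs, resources))
```

The matchmaker only performs the introduction. Claiming the resource, handing over the job details (shadow) and isolating the job (sandbox) are left to the agent and the resource.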


Condor

[Figure: Condor architecture diagram. Components: User, Problem Solver, Agent, Resource, Matchmaker, Shadow, Sandbox, Job. Arrow labels: plan of jobs, job, ClassAds, claim, details of the job, environment.]


Condor: Gateway Flocking

- Gateway passes information about participants between pools
- M(A) sends a request to M(B) through the gateways (M(A) and M(B) are the matchmakers of pools A and B)
- M(B) returns a match


Condor: Direct Flocking

A also advertises to Condor Pool B
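A rough Python sketch of direct flocking follows: the agent contacts more than one matchmaker, its own pool first, and uses whichever returns a match. The Matchmaker class, its request_match method and the "first idle resource" rule are hypothetical simplifications, and modelling advertisement as an ordered series of requests is an assumption of this example rather than a description of Condor's protocol.

```python
class Matchmaker:
    def __init__(self, pool, idle_resources):
        self.pool = pool
        self.idle = list(idle_resources)

    def request_match(self, job):
        """Return the name of an idle resource for the job, or None if none is free."""
        return self.idle.pop(0) if self.idle else None


def find_resource(job, matchmakers):
    """Direct flocking: try the home matchmaker first, then the remote ones."""
    for matchmaker in matchmakers:
        resource = matchmaker.request_match(job)
        if resource is not None:
            return matchmaker.pool, resource
    return None                      # no pool can run the job right now


pool_a = Matchmaker("A", ["a1"])
pool_b = Matchmaker("B", ["b1", "b2"])

placements = [find_resource(f"job{i}", [pool_a, pool_b]) for i in range(1, 5)]
print(placements)
```

Here the first job stays in pool A, the next two overflow to pool B, and the fourth finds no free resource in either pool.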


RMS

- Each has its own interface
- Do not provide integration
- Lack of interoperability
- Require specific administrative skills
- Increase operational costs
- Generate over-provisioning and global load imbalance