Copyright © 2002 Legato Systems, Inc. Legato Confidential.

Copyright © 2002 Legato Systems, Inc. <NetWorker Jobs Framework> <Version 1> <Pankaj Berde, Scott Quesnelle, Michal Drozd> <1783> <08/02/2005> Legato Confidential


Legato Systems, Inc - Confidential and Proprietary 2

Introduction

Prerequisites for attending this TOI session

Overview and Benefits of the new feature

Installation considerations

How to configure/enable the feature

Using the feature

Licensing considerations

Architecture and internal Design

Debugging techniques and tips

Questions and Answers


Prerequisites

List any prerequisites to attending this presentation

• Internal documentation (proprietary) – d1783 d1682

• Non-proprietary –

• http://www.snia.org/smi/tech_activities/smi_spec_pr/spec/SMIS_v101.pdf

• http://www.dmtf.org/standards/cim/cim_schema_v28

• http://www.dmtf.org/standards/documents/CIM/DSP0107.pdf


Overview and Benefits

NetWorker’s backup control and event management were limited, and monitoring and reporting were sparse

• Detecting and classifying scheduled backup failures was difficult for the NetWorker administrator

• The history of failures/events was not stored in a structured database.

• Runtime monitoring of backups was limited

• Parallelism control was not centralized


Overview and Benefits (cont.)

Savegroup had inadequate error reporting

• Savegroup failures were not easy to detect

• Failure to back up some files was not treated as an error or warning

• Failures were reported only after the saveset completed

• Runtime monitoring of savesets was limited

• Control over individual savesets was limited


Overview and Benefits (cont.)

To solve these problems, a new jobs framework is used

• The framework utilizes a new daemon called nsrjobd (The jobs daemon).

• The jobs daemon maintains a repository that stores information about jobs such as: status, indications and job session information. This information is gathered at run time to allow monitoring of active jobs.

• Jobs are managed and controlled from a central point which provides the ability to stop an individual backup, for example from the GUI.

• Jobs are queued for central parallelism control in the jobs daemon
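The central queueing idea can be sketched as follows. This is only an illustration in Python, not how nsrjobd is implemented; `JobQueue`, `max_parallelism`, `submit`, and `finish` are hypothetical names.

```python
from collections import deque


class JobQueue:
    """Hypothetical central queue that caps the number of running jobs."""

    def __init__(self, max_parallelism):
        self.max_parallelism = max_parallelism
        self.waiting = deque()   # jobs accepted but not yet started
        self.running = set()     # jobs currently running

    def submit(self, job_id):
        """Queue a job, then start as many jobs as the limit allows."""
        self.waiting.append(job_id)
        self._start_waiting()

    def finish(self, job_id):
        """Mark a job as done, freeing a slot for the next waiting job."""
        self.running.discard(job_id)
        self._start_waiting()

    def _start_waiting(self):
        # Start queued jobs in arrival order until the limit is reached.
        while self.waiting and len(self.running) < self.max_parallelism:
            self.running.add(self.waiting.popleft())


queue = JobQueue(max_parallelism=2)
for job in ("save-a", "save-b", "save-c"):
    queue.submit(job)
```

With a limit of 2, the third job stays queued until one of the first two finishes. Because every job passes through one queue, the limit is enforced in a single place.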


Overview and Benefits (cont.)

Savegroup starts jobs using the new central jobs daemon (nsrjobd)

• Savegroup receives information while processes run, rather than after the fact. This allows for better inactivity timeout monitoring.

• Jobs report indications (events) continuously during the run

• Savegroup monitors indications and generates error reports based on them


Overview and Benefits (cont.)

Old Framework

[Diagram: old framework, spanning server and client. Components shown: nsrd, savegrp, nsrexec, nsrexecd, save. Steps are numbered 1 to 4.]


Overview and Benefits (cont.)

New framework

[Diagram: new framework, spanning server and client. Components shown: nsrd, savegrp, nsrjobd, nsrexecd, save or savefs. Steps are numbered 1 to 4.]


Overview and Benefits (cont.)

System requirements to use feature

• Standard requirements

• Needs more space under /nsr/res


Overview and Benefits

Where to learn more

• D1783 D1786 D1682

• www.snia.org

• NMC TOI


Installation Considerations

Changes to installation

• /nsr/res/jobsdb created at installation

• New binary on server: nsrjobd

• RAP database used by nsrjobd does not export an RPC interface, but is viewable on disk.


Configuring the Feature

How to enable and/or configure this feature

• Always enabled (Cannot disable)

• New attributes in NSR resource

- Maximum Jobs DB size

- Minimum Retention time

• New attributes in savegroup

• Restart window

– Time limit for valid restart (default: 12:00 hr)

• Success threshold

– Threshold to determine success/failure based on indication severity (default: Warning)


Using the Feature

The daemon is started by nsrd and runs only on the server, not on storage nodes or clients.

The daemon performs all remote execution and gathers information on the client-side processes.

Information is kept in permanent storage so that NMC can use it for reporting.


Using the Feature

New commands

• No new command

GUI

• Changes in the GUI

• Described by NMC TOI


Using the Feature

Attributes

• Minimum retention time

• Use this to configure the minimum amount of time that records will stay in the jobs database.

• Maximum Jobsdb size

• Use this to configure the maximum amount of space that the records will use. (As reported by save -nq)

• Restart window

• Use this to set the time limit within which the last run is still considered a valid backup
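The restart window check can be sketched as below. This is only an illustration: the function name is hypothetical, and measuring the window from the start of the previous run is an assumption, since the slides do not say which timestamp is used.

```python
from datetime import datetime, timedelta

# Default taken from the slides: 12:00 hr.
DEFAULT_RESTART_WINDOW = timedelta(hours=12)


def last_run_is_valid(last_run_start, now, restart_window=DEFAULT_RESTART_WINDOW):
    """Return True if a restart at `now` may still rely on the previous run.

    Assumption: the window is measured from the start of the previous run.
    """
    return now - last_run_start <= restart_window


previous_start = datetime(2005, 7, 19, 16, 0)
restart_soon = last_run_is_valid(previous_start, previous_start + timedelta(hours=3))
restart_late = last_run_is_valid(previous_start, previous_start + timedelta(hours=13))
```

Here `restart_soon` is true and `restart_late` is false, so a restart after 13 hours would treat the previous run as invalid and back everything up again.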


Using the Feature (cont.)

• Success threshold

• Use the success threshold to control when savesets are reported as failed. If the success threshold is set to Warning (default), the savegroup is reported as successful even if warning indications are generated

• Setting the success threshold to “Success” means warnings are treated and reported as failures
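The threshold rule can be sketched as a comparison against the most severe indication. This is only an illustration: the `Severity` scale and the function name are hypothetical, and only the relative ordering of Success, Warning, and Error is taken from the slides.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Hypothetical severity scale; only the ordering matters here."""
    SUCCESS = 0
    WARNING = 1
    ERROR = 2


def saveset_succeeded(indications, threshold=Severity.WARNING):
    """A saveset succeeds if no indication is more severe than the threshold."""
    worst = max(indications, default=Severity.SUCCESS)
    return worst <= threshold


with_default = saveset_succeeded([Severity.WARNING])
with_strict = saveset_succeeded([Severity.WARNING], threshold=Severity.SUCCESS)
```

With the default threshold a warning still counts as success (`with_default` is true). With the threshold set to Success the same warning makes the saveset fail (`with_strict` is false).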


Using the Feature (cont.)

[Screenshot: Group Properties - Advanced]


Using the Feature (cont.)

Report changes

• Summary section

• NetWorker savegroup: (notice) Default completed, Total 3 client(s), 1 Succeeded with warnings(s), 2 Succeeded. Please see group completion details for more information.

• Succeeded with warnings: scoop.legato.com

• Succeeded: greenland.devlab.legato.com, soft

• Start time: Tue Jul 19 16:00:59 2005

• End time: Tue Jul 19 16:01:23 2005


Using the Feature (cont.)

• Indications

• --- Unsuccessful Save Sets ---

• * pa1pberde:c:\SFU\var\adm save: Saving files modified since Thu Feb 24 16:01:22 2005.

• * pa1pberde:c:\SFU\var\adm C:\SFU\var\adm\.security

• * pa1pberde:c:\SFU\var\adm C:\SFU\var\adm\utmpx

• * pa1pberde:c:\SFU\var\adm C:\SFU\var\adm\wtmpx

• * pa1pberde:c:\SFU\var\adm C:\SFU\var\adm\

• * pa1pberde:c:\SFU\var\adm C:\SFU\var\

• * pa1pberde:c:\SFU\var\adm C:\SFU\

• * pa1pberde:c:\SFU\var\adm C:\

• * pa1pberde:c:\SFU\var\adm /

• * pa1pberde:c:\SFU\var\adm

• pa1pberde: c:\SFU\var\adm level=incr, 8 KB 00:00:05 5 files

• * <WARNING> : File C:\SFU\var\adm\.security could not be opened and was not backed up. (The process cannot access the file because it is being used by another process.)

• * <WARNING> : File C:\SFU\var\adm\utmpx could not be opened and was not backed up. (The process cannot access the file because it is being used by another process.)

• * <WARNING> : File C:\SFU\var\adm\wtmpx could not be opened and was not backed up. (The process cannot access the file because it is being used by another process.)


Using the Feature (cont.)

• Previously completed in Restart

• --- Previously Completed Save Sets ---

• aragorn: / level=full, 3831 MB 02:08:07 93974 files

• aragorn: /space level=full, 6907 MB 02:48:55 65693 files

• dev-nwserv: index:aragorn level=full, 63 MB 00:00:09 9 files


Licensing Considerations

This feature is not licensed


Questions and Answers

Any questions that have not been answered yet?


Architecture and Internal Design

Architectural diagram

[Diagram: components shown: Console/GUI, nsrd, savegrp, nsrjobd, Jobs database, nsrexecd, save, nsrmmd.]


Architecture and Internal Design (cont.)

More notes on internal design

• The jobs daemon uses session channels wherever possible for doing remote execution and communication with nsrd and savegrp.

• All jobs get a record in the jobs database. This record remains for a period of time and is then purged based on the attributes set in the NSR resource.

• The daemon is multi-threaded. Not all threads are persistent. Depending on the OS this means it may appear that more than one nsrjobd is running.
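One way the two NSR resource attributes could interact during purging is sketched below. The policy shown is an assumption, not documented nsrjobd behavior: records are never purged before the minimum retention time, and older records are dropped oldest-first only while the database exceeds the maximum size. `JobRecord` and `purge` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class JobRecord:
    job_id: int
    end_time: float   # seconds since the epoch
    size: int         # bytes used by the record


def purge(records, now, min_retention, max_db_size):
    """Return the records kept after one purge pass (assumed policy)."""
    kept = sorted(records, key=lambda r: r.end_time)  # oldest first
    total = sum(r.size for r in kept)
    # Drop the oldest records while over the size limit, but never a
    # record that is still inside the minimum retention time.
    while kept and total > max_db_size and now - kept[0].end_time >= min_retention:
        total -= kept.pop(0).size
    return kept


records = [JobRecord(1, 0, 60), JobRecord(2, 50, 60), JobRecord(3, 95, 60)]
kept = purge(records, now=100, min_retention=10, max_db_size=100)
```

In this run the two oldest records are purged to get under the size limit. With a very long minimum retention time nothing would be purged and the database would stay over the limit.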


Architecture and Internal Design (cont.)

• Savegroup opens a bidirectional session channel with nsrjobd at start

• Savegroup requests nsrjobd to start a remote job

• Nsrjobd opens a bidirectional session channel with the client’s nsrexecd

• Nsrexecd forks the child job and keeps a bidirectional session channel to the job

• Job reports state changes to nsrjobd


Architecture and Internal Design (cont.)

• Nsrjobd relays the state changes to savegroup

• Once the job gets a media session, the session info is relayed to nsrjobd by nsrd

• Savegroup monitors for inactivity based on media session info and the activity timestamp in the nsrjobd database

• All stdout is redirected by nsrjobd to savegroup for backward compatibility

• Savegroup uses the stdout messages for completion reporting
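The inactivity check can be sketched as a comparison between each job's last activity timestamp and a timeout. This is only an illustration: the function and parameter names are hypothetical, and a plain dictionary stands in for the activity timestamps kept in the jobs database.

```python
def jobs_to_abort(last_activity_by_job, now, inactivity_timeout):
    """Return the ids of jobs whose last activity is older than the timeout.

    `last_activity_by_job` maps a job id to its last activity timestamp,
    in the same time unit as `now` and `inactivity_timeout`.
    """
    return sorted(
        job_id
        for job_id, last_activity in last_activity_by_job.items()
        if now - last_activity > inactivity_timeout
    )


# Job 101 has been silent for 1000 seconds, job 102 for only 50 seconds.
inactive = jobs_to_abort({101: 0, 102: 950}, now=1000, inactivity_timeout=600)
```

Only job 101 exceeds the timeout, so it is the one that would be aborted.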


Architecture and Internal Design (cont.)

• The instrumented client binaries (save) can generate indication events and completion events for the job for errors or warnings

• These indications are relayed to savegroup by nsrjobd and also stored in the jobs database

• Savegroup determines the success/failure of the backup based on indication severity
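The "store and relay" role described for nsrjobd can be sketched as below. This is only an illustration: `IndicationRelay` and its methods are hypothetical, a list stands in for the jobs database, and callbacks stand in for the session channel to savegroup.

```python
class IndicationRelay:
    """Hypothetical stand-in for the relay role described for nsrjobd."""

    def __init__(self):
        self.jobs_db = []      # stands in for the jobs database
        self.listeners = []    # callbacks standing in for savegroup channels

    def subscribe(self, listener):
        self.listeners.append(listener)

    def report(self, job_id, severity, message):
        """Store one indication, then relay it to every listener."""
        indication = {"job": job_id, "severity": severity, "message": message}
        self.jobs_db.append(indication)
        for listener in self.listeners:
            listener(indication)


relay = IndicationRelay()
seen_by_savegroup = []
relay.subscribe(seen_by_savegroup.append)
relay.report(7, "WARNING", "file could not be opened and was not backed up")
```

Each indication is stored first and then relayed, so the stored history and the live view come from the same events.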


Debugging Techniques and Tips

How to obtain debugging or tracking information

• Uses the standard debugging option -D with levels 1-9.

• All debugging and error output is logged to daemon.log

• All output is prefixed with a date/time stamp and the daemon name.

• The database at /nsr/res/jobsdb can be viewed using standard RAP tools. It contains a record of the jobs that have run and as such is a useful repository of information for debugging.

• Core file locations follow the same convention as for all other daemons.

• Use -vvv to get verbose output from the remote client

• -D is not relayed to spawned jobs

• The verbose output is copied over to daemon.log and the temp file is retained as in 7.2

• Indication level Debug is not used in 7.3; sleeker, internationalized tracing of remote backup job failures is expected in the future.


Debugging Techniques and Tips

Common pitfalls you or the customer may encounter

• Server machines need more memory, disk space, and CPU power than in the past. NetWorker still works with a decent low-end server configuration

• Data reported (file sizes and completion times) will not be exactly the same as reported by the GUI, because these come from different sources, so the numbers can be slightly skewed

• If the instrumented client binary does not report the right indication level, a warning can still look like an error or vice versa

• All messages are in the client’s locale, so messages coming from clients in a different locale are still not translated to the server’s locale. (This will be addressed in the next release)

• Too long a retention period or too large a maximum size for the jobs database.

• Client state transitions are lost, causing savegroup to appear hung. (nsrjobd periodically cleans up jobs in an incorrect state, which lets savegroup recover from the hang)

• A very small restart window will cause the previous backups to be considered invalid, and restarts will take longer

• A very large restart window can cause a restart to overlap with the next scheduled run. (Ideally the restart window should be half of the interval)

• Grouping needs to be changed so that important clients and savesets, for which a warning should be considered a failure, are placed in groups with a Success threshold of “Success”

• Loss of reporting information if the NMC daemon (gstd) is not run for a period greater than the minimum retention period.

• Savegroup unable to spawn processes. Check the new authorization settings and the servers files.

• Customer wondering why there are no nsrexec processes running on the server. This is as designed


Debugging Techniques and Tips (cont.)

Error messages customers might see

• daemon.log messages

• All jobs did not end gracefully….

– This means some jobs were not aborted at exit and savegroup was forced to exit before waiting for the exit of all jobs.

– Completion report will not be valid for all jobs

• Lost channel with server

– This means the communication with nsrjobd was broken, which caused savegroup to abort

– If this message is seen repeatedly, nsrjobd is too busy to handle requests or is hung (if a restart does not solve the problem, a daemon diagnosis (truss/pstack etc.) of nsrjobd might be needed)

• Aborting inactive job (%d)

– The job has not saved data for longer than the inactivity timeout

– The network bandwidth with the client needs to be checked

– If the save process is hung in a disk read, a retry might resolve the issue.


Known Issues and Limitations

Known issues and/or bugs

• Restarted savegroup does not clone savesets from previous runs (Existing issue in all past releases)

• Workaround – None (Plan to resolve this in next maintenance release)

Limitations

• Older clients will not have indications

• Not all binaries are fully instrumented to generate new indications (Gradual approach)

• CPE will be trained to extend existing error messages into indications for 7.3 clients

• Workaround (clients should be upgraded to 7.3)



Demonstration

If time permits

- show db layout on disk & browsing of db using nsradmin

- show savegrp -D9 and -vvv output and explain how to read new debug messages

- show temp files created (& how to clean up the debug temp files)


Thanks for attending