Copyright © 2002 Legato Systems, Inc. Legato Confidential.
NetWorker Jobs Framework, Version 1
Pankaj Berde, Scott Quesnelle, Michal Drozd
1783
08/02/2005
Legato Systems, Inc - Confidential and Proprietary 2
Introduction
Prerequisites for attending this TOI session
Overview and Benefits of the new feature
Installation considerations
How to configure/enable the feature
Using the feature
Licensing considerations
Architecture and internal Design
Debugging techniques and tips
Questions and Answers
Prerequisites
List any prerequisites to attending this presentation
• Internal documentation (proprietary) – d1783 d1682
• Non-proprietary –
• http://www.snia.org/smi/tech_activities/smi_spec_pr/spec/SMIS_v101.pdf
• http://www.dmtf.org/standards/cim/cim_schema_v28
• http://www.dmtf.org/standards/documents/CIM/DSP0107.pdf
Overview and Benefits
NetWorker’s backup control and event management were limited, and monitoring and reporting were sparse
• Detecting and classifying scheduled backup failures was difficult for the NetWorker administrator
• The history of failures and events was not stored in a structured database
• Runtime monitoring of backups was limited
• Parallelism control was not centralized
Overview and Benefits (cont.)
Savegroup error reporting was inadequate
• Savegroup failures were not easy to detect
• Failure to back up some files was not treated as an error or warning
• Failures were reported only after the saveset completed
• Runtime monitoring of savesets was limited
• Control over individual savesets was limited
Overview and Benefits (cont.)
To solve these problems, a new jobs framework is used
• The framework uses a new daemon called nsrjobd (the jobs daemon).
• The jobs daemon maintains a repository that stores information about jobs, such as status, indications, and job session information. This information is gathered at run time to allow monitoring of active jobs.
• Jobs are managed and controlled from a central point, which makes it possible to stop an individual backup, for example from the GUI.
• Jobs are queued for central parallelism control in the jobs daemon
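The central queuing described above can be pictured with a small sketch. It assumes a single parallelism limit and first-in, first-out ordering, neither of which is stated on the slide; the class `JobQueue`, its methods, and the job names are hypothetical and are not part of NetWorker.

```python
from collections import deque


class JobQueue:
    """Toy model of central parallelism control (hypothetical, not nsrjobd code)."""

    def __init__(self, parallelism):
        self.parallelism = parallelism  # assumed: one global limit on running jobs
        self.waiting = deque()          # jobs queued in arrival order
        self.running = set()            # jobs currently holding a slot

    def submit(self, job_id):
        """Queue a job and start it immediately if a slot is free."""
        self.waiting.append(job_id)
        self._dispatch()

    def finish(self, job_id):
        """Release the slot held by a finished job and start the next one."""
        self.running.discard(job_id)
        self._dispatch()

    def _dispatch(self):
        # Start waiting jobs until the parallelism limit is reached.
        while self.waiting and len(self.running) < self.parallelism:
            self.running.add(self.waiting.popleft())


queue = JobQueue(parallelism=2)
for job in ("save:/", "save:/space", "save:/home"):
    queue.submit(job)
running_before = sorted(queue.running)
queue.finish("save:/")
running_after = sorted(queue.running)
```

Because every job passes through one queue, the limit is enforced in one place instead of by each savegroup separately.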
Overview and Benefits (cont.)
Savegroup starts jobs using the new central jobs daemon (nsrjobd)
• Savegroup receives information while processes run, rather than after the fact. This allows for better inactivity timeout monitoring.
• Jobs report indications (events) continuously during the run
• Savegroup monitors indications and generates error reports based on them
Overview and Benefits (cont.)
Old Framework
[Diagram: nsrd, savegrp, nsrexec, nsrexecd, and save, laid out across Server and Client, with interactions numbered 1 to 4]
Overview and Benefits (cont.)
New framework
[Diagram: nsrd, savegrp, nsrjobd, nsrexecd, and save or savefs, laid out across Server and Client, with interactions numbered 1 to 4]
Overview and Benefits (cont.)
System requirements to use feature
• Standard requirements
• Needs more space under /nsr/res
Overview and Benefits
Where to learn more
• D1783 D1786 D1682
• www.snia.org
• NMC TOI
Installation Considerations
Changes to installation
• /nsr/res/jobsdb created at installation
• New binary on server: nsrjobd
• The RAP database used by nsrjobd does not export an RPC interface, but it can be viewed on disk.
Configuring the Feature
How to enable and/or configure this feature
• Always enabled (cannot be disabled)
• New attributes in the NSR resource
- Maximum Jobs DB size
- Minimum Retention time
• New attributes in savegroup
- Restart window: time limit for a valid restart (default: 12:00 hr)
- Success threshold: threshold to determine success/failure based on indication severity (default: Warning)
Using the Feature
The daemon is started by nsrd and runs only on the server, not on storage nodes or clients.
The daemon does all the remote execution and gathers information on the client-side processes.
Information is kept in permanent storage so that NMC can use it for reporting.
Using the Feature
New commands
• No new commands
GUI
• Changes in the GUI are described by the NMC TOI
Using the Feature
Attributes
• Minimum retention time
- Use this to configure the minimum amount of time that records will stay in the jobs database.
• Maximum Jobsdb size
- Use this to configure the maximum amount of space that the records will use. (As reported by save -nq)
• Restart window
- Use this to set the time limit within which the last run is still considered a valid backup.
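The restart window check can be sketched as a simple time comparison. This is a minimal illustration, assuming the window is measured from the start of the previous run (the slides do not say which timestamp is used); the function name `previous_run_is_valid` is hypothetical.

```python
from datetime import datetime, timedelta

DEFAULT_RESTART_WINDOW = timedelta(hours=12)  # default from the slides: 12:00 hr


def previous_run_is_valid(previous_start, now, restart_window=DEFAULT_RESTART_WINDOW):
    """Return True if a restart at `now` may reuse the previous run's results.

    Assumption: the window is measured from the start of the previous run.
    """
    return now - previous_start <= restart_window


started = datetime(2005, 7, 19, 16, 0, 59)
within = previous_run_is_valid(started, started + timedelta(hours=3))
expired = previous_run_is_valid(started, started + timedelta(hours=13))
```

A restart three hours after the previous run falls inside the default window, while one thirteen hours later does not, so in the second case the earlier backups would no longer count as valid.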
Using the Feature (cont.)
• Success threshold
- Use the Success threshold to control when savesets are reported as failed. If the Success threshold is set to Warning (default), the savegroup is reported as successful even if warning indications are generated.
- Setting the Success threshold to “Success” means that warnings are treated and reported as failures.
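The threshold rule can be expressed as a comparison of severities. The sketch below assumes three severity levels ordered Success, Warning, Error; the slides name only the Warning and Success settings, so the full list, the ordering, and the function `saveset_succeeded` are illustrative assumptions.

```python
# Assumed ordering of indication severities, least to most severe.
# The real set of levels is not given in the slides.
SEVERITY_RANK = {"Success": 0, "Warning": 1, "Error": 2}


def saveset_succeeded(indication_severities, success_threshold="Warning"):
    """A save set succeeds if no indication is more severe than the threshold."""
    limit = SEVERITY_RANK[success_threshold]
    worst = max((SEVERITY_RANK[s] for s in indication_severities), default=0)
    return worst <= limit


warnings_only = ["Warning", "Warning", "Warning"]
default_result = saveset_succeeded(warnings_only)            # threshold "Warning"
strict_result = saveset_succeeded(warnings_only, "Success")  # warnings count as failure
error_result = saveset_succeeded(["Warning", "Error"])
```

With the default threshold the warnings-only save set is reported as successful; with the threshold set to "Success" the same save set is reported as failed.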
Using the Feature (cont.)
[Screenshot: Group Properties - Advanced]
Using the Feature (cont.)
Report changes
• Summary section
• NetWorker savegroup: (notice) Default completed, Total 3 client(s), 1 Succeeded with warnings(s), 2 Succeeded. Please see group completion details for more information.
• Succeeded with warnings: scoop.legato.com
• Succeeded: greenland.devlab.legato.com, soft
• Start time: Tue Jul 19 16:00:59 2005
• End time: Tue Jul 19 16:01:23 2005
Using the Feature (cont.)
• Indications
• --- Unsuccessful Save Sets ---
• * pa1pberde:c:\SFU\var\adm save: Saving files modified since Thu Feb 24 16:01:22 2005.
• * pa1pberde:c:\SFU\var\adm C:\SFU\var\adm\.security
• * pa1pberde:c:\SFU\var\adm C:\SFU\var\adm\utmpx
• * pa1pberde:c:\SFU\var\adm C:\SFU\var\adm\wtmpx
• * pa1pberde:c:\SFU\var\adm C:\SFU\var\adm\
• * pa1pberde:c:\SFU\var\adm C:\SFU\var\
• * pa1pberde:c:\SFU\var\adm C:\SFU\
• * pa1pberde:c:\SFU\var\adm C:\
• * pa1pberde:c:\SFU\var\adm /
• * pa1pberde:c:\SFU\var\adm
• pa1pberde: c:\SFU\var\adm level=incr, 8 KB 00:00:05 5 files
• * <WARNING> : File C:\SFU\var\adm\.security could not be opened and was not backed up. (The process cannot access the file because it is being used by another process.)
• * <WARNING> : File C:\SFU\var\adm\utmpx could not be opened and was not backed up. (The process cannot access the file because it is being used by another process.)
• * <WARNING> : File C:\SFU\var\adm\wtmpx could not be opened and was not backed up. (The process cannot access the file because it is being used by another process.)
Using the Feature (cont.)
• Previously completed in Restart
• --- Previously Completed Save Sets ---
• aragorn: / level=full, 3831 MB 02:08:07 93974 files
• aragorn: /space level=full, 6907 MB 02:48:55 65693 files
• dev-nwserv: index:aragorn level=full, 63 MB 00:00:09 9 files
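For scripted checks of a completion report, lines like the ones above can be split into fields. The pattern below is inferred only from the sample lines shown in these slides, so it is an assumption and may not cover every line the report can contain; `parse_completion_line` is a hypothetical helper, not a NetWorker tool.

```python
import re

# Pattern inferred from the sample lines above; it is an assumption and may not
# cover every line format that a completion report can contain.
COMPLETION_LINE = re.compile(
    r"^(?P<client>[^:\s]+):\s*(?P<saveset>.+?)\s+level=(?P<level>\w+),\s+"
    r"(?P<size>\d+)\s+(?P<unit>[KMG]B)\s+(?P<elapsed>\d{2}:\d{2}:\d{2})\s+"
    r"(?P<files>\d+)\s+files$"
)


def parse_completion_line(line):
    """Return the fields of one completion line as a dict, or None if no match."""
    match = COMPLETION_LINE.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    fields["size"] = int(fields["size"])
    fields["files"] = int(fields["files"])
    return fields


sample = "dev-nwserv: index:aragorn level=full, 63 MB 00:00:09 9 files"
parsed = parse_completion_line(sample)
```

The client name is taken up to the first colon, which keeps save set names that themselves contain a colon, such as index:aragorn, intact.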
Licensing Considerations
This feature is not licensed
Questions and Answers
Any questions that have not been answered yet?
Architecture and Internal Design
Architectural diagram
[Diagram: nsrd, nsrexecd, save, savegrp, nsrjobd, the Jobs database, Console/GUI, and nsrmmd]
Architecture and Internal Design (cont.)
More notes on internal design
• The jobs daemon uses session channels wherever possible for remote execution and for communication with nsrd and savegrp.
• Every job gets a record in the jobs database. The record remains for a period of time and is then purged based on the attributes set in the NSR resource.
• The daemon is multi-threaded, and not all threads are persistent. Depending on the OS, it may therefore appear that more than one nsrjobd is running.
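One way the two NSR attributes could interact during purging is sketched below. The policy shown is an assumption, since the slides do not describe the purge algorithm: a record is never purged before the minimum retention time, and the oldest eligible records are purged first while the database exceeds the maximum size. `JobRecord` and `select_records_to_purge` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class JobRecord:
    job_id: int
    age_hours: float   # time since the job completed
    size_kb: int       # space used by the record


def select_records_to_purge(records, min_retention_hours, max_size_kb):
    """Pick records to purge, oldest first, until the database fits the limit.

    Assumed policy (not confirmed by the slides): a record is never purged
    before the minimum retention time, and purging only happens while the
    total size exceeds the maximum.
    """
    total = sum(r.size_kb for r in records)
    purge = []
    for record in sorted(records, key=lambda r: r.age_hours, reverse=True):
        if total <= max_size_kb:
            break
        if record.age_hours >= min_retention_hours:
            purge.append(record.job_id)
            total -= record.size_kb
    return purge


records = [
    JobRecord(1, age_hours=100, size_kb=40),
    JobRecord(2, age_hours=50, size_kb=40),
    JobRecord(3, age_hours=2, size_kb=40),
]
to_purge = select_records_to_purge(records, min_retention_hours=24, max_size_kb=50)
```

Under this assumed policy a long retention time can keep the database above its size limit, which matches the pitfall about overly long retention periods mentioned later in the deck.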
Architecture and Internal Design (cont.)
• Savegroup opens a bidirectional session channel with nsrjobd at start
• Savegroup requests nsrjobd to start a remote job
• Nsrjobd opens a bidirectional session channel with the client’s nsrexecd
• Nsrexecd forks the child job and has a bidirectional session channel to the job
• The job reports state changes to nsrjobd
Architecture and Internal Design (cont.)
• Nsrjobd relays the state changes to savegroup
• Once the job gets a media session, the session info is relayed to nsrjobd by nsrd
• Savegroup monitors for inactivity based on the media session info and the activity timestamp in the nsrjobd database
• All stdout is redirected by nsrjobd to savegroup for backward compatibility
• Savegroup uses the stdout messages for completion reporting
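The inactivity monitoring mentioned above amounts to comparing each job's last activity timestamp with a timeout. The sketch below is illustrative only: the job ids, timestamps, timeout value, and the helper `is_inactive` are hypothetical, and the real check also takes media session info into account.

```python
from datetime import datetime, timedelta


def is_inactive(last_activity, now, inactivity_timeout):
    """True if the job has shown no activity for longer than the timeout."""
    return now - last_activity > inactivity_timeout


# Hypothetical job table: job id -> last activity timestamp from the jobs database.
now = datetime(2005, 7, 19, 16, 30, 0)
last_activity = {
    101: datetime(2005, 7, 19, 16, 29, 0),
    102: datetime(2005, 7, 19, 15, 45, 0),
}
timeout = timedelta(minutes=30)
to_abort = [job for job, seen in last_activity.items() if is_inactive(seen, now, timeout)]
```

Only job 102, idle for 45 minutes, exceeds the 30-minute timeout in this example.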
Architecture and Internal Design (cont.)
• The instrumented client binaries (save) can generate indication events and completion events for the job for errors or warnings
• These indications are relayed to savegroup by nsrjobd and also stored in the jobs database
• Savegroup determines the success/failure of the backup based on indication severity
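The relay path for indications can be modelled in a few lines. This is a toy stand-in, not nsrjobd code: the class `JobsDaemonSketch`, its methods, the in-memory list used as the jobs database, and the store-then-relay order are all assumptions made for illustration.

```python
class JobsDaemonSketch:
    """Toy stand-in for nsrjobd: stores indications and relays them."""

    def __init__(self):
        self.jobs_db = []        # stands in for the jobs database
        self.listeners = {}      # job id -> callback registered by savegroup

    def start_job(self, job_id, on_indication):
        """Remember which callback should receive indications for this job."""
        self.listeners[job_id] = on_indication

    def report_indication(self, job_id, severity, message):
        """Called on behalf of a client job; store the record, then relay it.

        The order (store first, relay second) is an assumption.
        """
        record = {"job": job_id, "severity": severity, "message": message}
        self.jobs_db.append(record)
        self.listeners[job_id](record)


received = []                      # what savegroup has seen so far
daemon = JobsDaemonSketch()
daemon.start_job(7, received.append)
daemon.report_indication(7, "Warning", "File could not be opened and was not backed up.")
```

Savegroup sees each indication as it arrives, and the stored copy remains available for later reporting.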
Debugging Techniques and Tips
How to obtain debugging or tracking information
• Uses the standard debugging option -D with levels 1-9.
• All debugging and error output is logged to daemon.log.
• All output is prepended with a date/time stamp and the daemon name.
• The database at /nsr/res/jobsdb can be viewed using standard RAP tools. It contains a record of the jobs that have run and is therefore a useful repository of information for debugging.
• Core file locations follow the same convention as for all other daemons.
• Use -vvv to get verbose output from the remote client.
• -D is not relayed to spawned jobs.
• The verbose output is copied to daemon.log, and the temp file is retained as in 7.2.
• Indication level Debug is not used in 7.3; sleeker, internationalized tracing of remote backup job failures is planned for a future release.
Debugging Techniques and Tips
Common pitfalls you or the customer may encounter
• Server machines need more memory, disk space, and CPU power than in the past. NetWorker still works with a modest server configuration.
• Reported data (file sizes and completion times) will not exactly match what the GUI reports, because the values come from different sources, so the numbers can differ slightly.
• If an instrumented client binary does not report the right indication level, a warning can look like an error, or vice versa.
• All messages are in the client’s locale. Messages from clients in a different locale are not translated to the server’s locale. (This will be addressed in the next release.)
• Too long a retention period or too large a maximum size for the jobs database.
• Client state transitions are lost, making savegroup appear hung. (nsrjobd periodically cleans up jobs in an incorrect state, which lets savegroup recover from the hang.)
• A very small restart window causes the previous backups to be considered invalid, so restarts take longer.
• A very large restart window can cause a restart to overlap with the next scheduled run. (Ideally, the restart window should be half of the interval.)
• Grouping needs to be changed so that important clients and savesets, for which a warning should be considered a failure, are placed in groups with a Success threshold of “Success”.
• Reporting information is lost if the NMC daemon (gstd) is not run for a period longer than the minimum retention period.
• Savegroup is unable to spawn processes. Check the new authorization settings and the servers files.
• Customers may wonder why no nsrexec processes are running on the server. This is as designed.
Debugging Techniques and Tips (cont.)
Error messages customers might see
• daemon.log messages
• All jobs did not end gracefully….
- This means some jobs were not aborted at exit and savegroup was forced to exit before waiting for all jobs to exit.
- The completion report will not be valid for all jobs.
• Lost channel with server
- This means communication with nsrjobd was broken, which caused savegroup to abort.
- If this message is seen repeatedly, nsrjobd is too busy to handle requests or is hung. (If a restart does not solve the problem, a daemon diagnosis of nsrjobd (truss, pstack, etc.) might be needed.)
• Aborting inactive job (%d)
- The job has not saved data for longer than the inactivity timeout.
- The network bandwidth to the client needs to be checked.
- If the save process is hung in a disk read, a retry might resolve the issue.
Known Issues and Limitations
Known issues and/or bugs
• A restarted savegroup does not clone savesets from previous runs (existing issue in all past releases).
• Workaround: none (planned to be resolved in the next maintenance release)
Limitations
• Older clients will not have indications.
• Not all binaries are fully instrumented to generate new indications (gradual approach).
• CPE will be trained to extend existing error messages into indications for 7.3 clients.
• Workaround: clients should be upgraded to 7.3.
Questions and Answers
Any questions that have not been answered yet?
Demonstration
If time permits
- show db layout on disk & browsing of db using nsradmin
- show savegrp -D9 and -vvv output and explain how to read new debug messages
- show temp files created (& how to clean up the debug temp files)