Office of Science U.S. Department of Energy Troubleshooting Data Movement Dan Gunter LBNL.

Office of Science

U.S. Department of Energy

Troubleshooting Data Movement

Dan Gunter

LBNL

Background

• Work is part of SciDAC CEDPS (Center for Enabling Distributed Petascale Science)

• Basic question: Why did my transfer (or remote operation) fail?

• We want to answer this question before the users even ask it!
  - Instrument: middleware, applications, etc.
  - Monitor: gather data (in response to problems)
  - Diagnose: failures and performance problems

Topics

• Two broad categories of work:
  - Gathering and normalizing existing data to allow analysis across sites (e.g. in OSG)
  - Adding new data through instrumentation of standard Grid middleware (e.g. GridFTP)

CEDPS Troubleshooting Architecture

Syslog-ng

• Features:
  - Can filter logs based on level and content
  - Arbitrary number of sources and destinations
  - Provides remote logging
    - Can act as a proxy, tunnel through firewalls
  - Executes programs
    - Send email, load database, etc.
  - Built-in log rotation
  - Timezone support

Log collection using syslog-ng

Log Parser

• Normalizes unformatted logs to name=value pairs

• Plug-in architecture to make it easy to add additional log file formats
  - We will provide a set of example parsers
• But…
  - Parsers will be hard to write and maintain
  - If middleware and application developers follow the “logging best practices” document, the parser component will not be necessary
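
The plug-in idea above can be sketched as a small registry of per-format parser functions. This is only an illustration, not the CEDPS parser: the registry, the `register` decorator, the syslog-style input line, and the output field names are all assumptions made for the example.

```python
import re

# Registry of parser plug-ins: format name -> function(line) -> dict or None.
# (Hypothetical structure, not the actual CEDPS component.)
PARSERS = {}

def register(fmt):
    """Decorator that adds a parser function to the registry."""
    def wrap(func):
        PARSERS[fmt] = func
        return func
    return wrap

# Pattern for a classic syslog-style line (assumed input format).
SYSLOG_RE = re.compile(
    r"^(?P<month>\w{3})\s+(?P<day>\d{1,2}) (?P<time>[\d:]{8}) "
    r"(?P<host>\S+) (?P<prog>[^\[:]+)(?:\[(?P<pid>\d+)\])?: (?P<msg>.*)$"
)

@register("syslog")
def parse_syslog(line):
    """Return the named fields of one syslog-style line, or None if no match."""
    m = SYSLOG_RE.match(line)
    if m is None:
        return None
    return {k: v for k, v in m.groupdict().items() if v is not None}

def to_name_value(fields):
    """Render a dict as one line of name=value pairs, quoting values with spaces."""
    parts = []
    for name, value in fields.items():
        if " " in value:
            value = '"' + value.replace('"', "'") + '"'
        parts.append(f"{name}={value}")
    return " ".join(parts)

line = "Dec  8 18:39:23 myhost sshd[4211]: Accepted password for alice"
fields = PARSERS["syslog"](line)
normalized = to_name_value(fields)
print(normalized)
```

Adding another log format would then mean writing one more function and registering it under a new name.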

Missing Event Detector

• Assumes all ‘start’ events should have a corresponding ‘end’ event
  - Looks for missing ‘end’ events
  - Generates a replacement ‘end’ event with an error code

• Planning to develop more sophisticated anomaly detection capabilities as MDS trigger services.
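
A minimal sketch of the start/end matching described on this slide, assuming events arrive as dictionaries with `event` and `guid` keys. The function name, the `-1` error status, and the `msg` text are illustrative choices. A real detector would also need a timeout before declaring an ‘end’ event missing.

```python
def add_missing_ends(events, error_status=-1):
    """Return the events plus a synthetic 'end' for every unmatched 'start'.

    Each event is a dict with at least 'event' and 'guid' keys.
    """
    open_starts = {}  # (base event name, guid) -> start event
    for ev in events:
        base, _, suffix = ev["event"].rpartition(".")
        key = (base, ev["guid"])
        if suffix == "start":
            open_starts[key] = ev
        elif suffix == "end":
            open_starts.pop(key, None)
    # Anything still open never saw its 'end' event.
    synthetic = [
        {"event": base + ".end", "guid": guid, "status": error_status,
         "msg": "missing end event"}
        for (base, guid) in open_starts
    ]
    return list(events) + synthetic

events = [
    {"event": "org.globus.gridFTP.authn.start", "guid": "A1"},
    {"event": "org.globus.gridFTP.authn.end", "guid": "A1", "status": 0},
    {"event": "org.globus.gridFTP.transfer.start", "guid": "A1"},
]
result = add_missing_ends(events)
print(result[-1])
```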

Log Filter

• Some sites may not want to forward potentially sensitive information to the VO archive
  - E.g.: usernames, user DNs, IP addresses
• Syslog-ng can filter entire events
  - But we would prefer to filter out just the sensitive fields

• Log filter will be able to remove or anonymize specific fields in the event
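
One way field-level filtering could look, sketched under assumptions: the policy table, the salted SHA-256 truncation, and the choice of which fields count as sensitive are all hypothetical, not the CEDPS log filter itself.

```python
import hashlib

# Hypothetical per-site policy: field name -> action.
SENSITIVE = {"DN": "anonymize", "localUser": "anonymize", "destHost": "remove"}

def filter_event(fields, policy=SENSITIVE, salt="site-secret"):
    """Return a copy of the event with sensitive fields removed or anonymized."""
    out = {}
    for name, value in fields.items():
        action = policy.get(name)
        if action == "remove":
            continue
        if action == "anonymize":
            # A salted hash keeps events correlatable without exposing the value.
            digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
            value = digest[:16]
        out[name] = value
    return out

event = {
    "event": "org.globus.gridFTP.transfer.start",
    "DN": "/DC=org/DC=doegrids/OU=People/CN=Somebody",
    "destHost": "129.79.4.64",
    "id": "56010",
}
clean = filter_event(event)
print(clean)
```

Because the same input always maps to the same digest, events from one user can still be grouped at the VO archive.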

Database Loader

• This component loads normalized logs into an SQL database

• Ability to specify mapping of fields to database columns
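
A small sketch of loading normalized events with a configurable field-to-column mapping. SQLite, the table name `event`, and the column names are assumptions for illustration. The slides do not say which SQL database or schema the loader uses.

```python
import sqlite3

# Hypothetical mapping: log field name -> database column name.
COLUMN_MAP = {"ts": "time", "event": "name", "guid": "guid", "status": "status"}

def load_events(conn, events, column_map=COLUMN_MAP):
    """Insert the mapped fields of each event dict; unmapped fields are dropped."""
    columns = list(column_map.values())
    conn.execute("CREATE TABLE IF NOT EXISTS event (%s)" % ", ".join(columns))
    placeholders = ", ".join("?" for _ in columns)
    sql = "INSERT INTO event (%s) VALUES (%s)" % (", ".join(columns), placeholders)
    # Missing fields become NULL via dict.get().
    rows = [tuple(ev.get(field) for field in column_map) for ev in events]
    conn.executemany(sql, rows)
    conn.commit()
    return len(rows)

conn = sqlite3.connect(":memory:")
events = [
    {"ts": "2006-12-08T18:39:23.114369Z", "event": "org.globus.authz.gridmap.end",
     "guid": "F7D64975-069A-4152-A21F-57109AA46DFA", "status": -1, "level": "ERROR"},
    {"ts": "2006-12-08T18:45:02.214369Z", "event": "org.globus.gridFTP.transfer.end",
     "guid": "56010", "status": 0},
]
loaded = load_events(conn, events)
failed = conn.execute("SELECT name FROM event WHERE status < 0").fetchall()
print(failed)
```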

Sample Site Deployment

Sample Grid Deployment

Current Status

Logging “Best Practices” Recommendations

• Practices
  - Each logged event must contain a unique event name and an ISO-format timestamp (e.g. 2007-06-12T07:23:22.887332Z)
  - All system operations that might fail or experience performance variations should be wrapped with start and end events.
  - All logs from a given execution thread should be tagged with a globally unique ID (GUID), such as a Universally Unique Identifier (UUID)
• Log format
  - Logs should be lines of ASCII name=value pairs
  - Example:
    ts=2006-12-08T18:48:27.598448Z event=org.globus.gridFTP.transfer.start guid=ID file=filename src.host=H1 src.port=P1 dst.host=H2 dst.port=P2
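
A minimal sketch of producing a line in this format with Python's standard `logging` module. The helper `format_event` and the logger name `bp_demo` are assumptions for the example, since the recommendations do not prescribe any particular library.

```python
import datetime
import logging
import sys
import uuid

def format_event(event, guid, fields):
    """Build one name=value line with an ISO-format UTC timestamp."""
    now = datetime.datetime.now(datetime.timezone.utc)
    ts = now.strftime("%Y-%m-%dT%H:%M:%S.%fZ")
    parts = [f"ts={ts}", f"event={event}", f"guid={guid}"]
    for name, value in fields.items():
        text = str(value)
        if " " in text:
            text = '"%s"' % text  # quote values that contain spaces
        parts.append(f"{name}={text}")
    return " ".join(parts)

# Plain pass-through formatter: the message already is the full log line.
log = logging.getLogger("bp_demo")
log.setLevel(logging.INFO)
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(logging.Formatter("%(message)s"))
log.addHandler(handler)

guid = str(uuid.uuid4())
line = format_event(
    "org.globus.gridFTP.transfer.start",
    guid,
    {"file": "/tmp/myfile", "src.host": "H1", "src.port": 2811},
)
log.info(line)
```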

Event Names

• Use a '.' as a separator and go from general to specific
  - Same as Java class names

• First part of name should be used as a unique namespace (e.g.: org.globus)

• Use start/end suffixes whenever possible
  - Helps immensely with troubleshooting

• Examples
  org.globus.gridFTP.start
  org.globus.gridFTP.authn.start
  org.globus.gridFTP.authn.end
  org.globus.gridFTP.transfer.start
  org.globus.gridFTP.transfer.end
  org.globus.gridFTP.end

  org.globus.MDS.response.start
  org.globus.MDS.query.start
  org.globus.MDS.query.end
  org.globus.MDS.write.net.start
  org.globus.MDS.write.net.end
  org.globus.MDS.response.end

Reporting Errors

• Errors should be reported as part of the ‘end’ event if possible
  - Use ‘status=N’ (N >= 0 means success)
  - Not attempting to define other status codes
    - too hard to get agreement on these
• Example:
  ts=2006-12-08T18:39:23.114369Z event=org.globus.authz.gridmap.end status=-1 DN="/O=CEDS/CN=Some User" msg="Cannot open gridmap file /etc/grid-security/grid-mapfile for reading" guid=F7D64975-069A-4152-A21F-57109AA46DFA level=ERROR
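
One possible way to make sure the ‘end’ event always carries the status is to wrap the operation in a context manager. This is a sketch, not part of the recommendations: `logged_operation` and the `emit` callback are hypothetical, and timestamps are omitted for brevity.

```python
import contextlib

@contextlib.contextmanager
def logged_operation(event, emit, **fields):
    """Emit <event>.start, run the body, then emit <event>.end with a status."""
    emit(dict(fields, event=event + ".start"))
    try:
        yield
    except Exception as exc:
        # Failure: negative status plus the error text on the 'end' event.
        emit(dict(fields, event=event + ".end", status=-1,
                  msg=str(exc), level="ERROR"))
        raise
    else:
        emit(dict(fields, event=event + ".end", status=0))

records = []
try:
    with logged_operation("org.globus.authz.gridmap", records.append, guid="G1"):
        raise OSError("Cannot open gridmap file")
except OSError:
    pass

for record in records:
    print(record)
```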

Error Reporting cont.

• Depending on how the program is structured, it may be hard to propagate the error message to the ‘end’ event
• Use the ‘error’ event name suffix in this situation
• Examples:
  event=org.globus.gridFTP.write.error path="/home/grid" msg="write error, disk full"
  event=myprogram.input.error msg="invalid input"

Globally Unique IDs

• Use the ‘guid’ reserved name to allow correlation of a set of events
  event=org.globus.gridFTP.authn.start guid=BFDD3DA5-7891-4885-A3AF-1C5E3B8EF9BB
  event=org.globus.gridFTP.authn.end guid=BFDD3DA5-7891-4885-A3AF-1C5E3B8EF9BB
  event=org.globus.gridFTP.transfer.start guid=BFDD3DA5-7891-4885-A3AF-1C5E3B8EF9BB
  event=org.globus.gridFTP.transfer.end guid=BFDD3DA5-7891-4885-A3AF-1C5E3B8EF9BB

• UUID easy to generate
  - uuidgen, uuidlib
  - MD5 hash in hexadecimal

• But free-form ‘id’ also allowed (e.g. process ID)
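
For illustration, both kinds of identifier mentioned on this slide can be generated with the Python standard library. The seed used for the MD5 variant (process ID, clock, random bytes) is an assumption, since the slide does not say what should be hashed.

```python
import hashlib
import os
import time
import uuid

# Random (version 4) UUID, upper-cased to match the style of the examples.
guid = str(uuid.uuid4()).upper()

# Alternative: MD5 hash in hexadecimal over some locally unique seed.
seed = f"{os.getpid()}-{time.time_ns()}-{os.urandom(8).hex()}"
md5_id = hashlib.md5(seed.encode("utf-8")).hexdigest()

print("guid=" + guid)
print("guid=" + md5_id)
```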

Example: GridFTP

ts=2006-12-08T18:39:23.114369Z event=org.globus.gridFTP.start prog=GridFTP-4.0.3 localhost=myhost remoteHost=somehost.gov:56010 serverMode=inetd id=56010

ts=2006-12-08T18:39:23.114567Z event=org.globus.gridFTP.authn.start DN="/DC=org/DC=doegrids/OU=People/CN=Somebody" id=56010

ts=2006-12-08T18:39:25.514369Z event=org.globus.gridFTP.authn.end DN="/DC=org/DC=doegrids/OU=People/CN=Somebody" msg="123456 successfully authorized" localUser=uscmspool381 id=56010 status=0

ts=2006-12-08T18:39:25.864369Z event=org.globus.gridFTP.transfer.start file=/tmp/myfile tcpBufferSize=128KB dataBlockSize=262144 numStreams=1 numStripes=1 destHost=129.79.4.64 id=56010

ts=2006-12-08T18:45:02.214369Z event=org.globus.gridFTP.transfer.end file=/tmp/myfile bytesTransferred=678433 id=56010 status=0

ts=2006-12-08T18:45:02.214386Z event=org.globus.gridFTP.end id=56010 status=226
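
Because every line is name=value pairs with a shared ‘id’, lines like the ones above can be parsed and correlated with very little code. The sketch below pairs start/end events and computes elapsed time. The helper names are illustrative, and the two sample lines are shortened copies of the transfer events above.

```python
import shlex
from datetime import datetime

def parse_line(line):
    """Split one name=value log line into a dict; shlex handles quoted values."""
    return dict(token.split("=", 1) for token in shlex.split(line))

def durations(lines):
    """Map (base event name, id) -> seconds between its start and end events."""
    starts, result = {}, {}
    for line in lines:
        ev = parse_line(line)
        base, _, suffix = ev["event"].rpartition(".")
        key = (base, ev["id"])
        ts = datetime.strptime(ev["ts"], "%Y-%m-%dT%H:%M:%S.%fZ")
        if suffix == "start":
            starts[key] = ts
        elif suffix == "end" and key in starts:
            result[key] = (ts - starts.pop(key)).total_seconds()
    return result

LOG = """\
ts=2006-12-08T18:39:25.864369Z event=org.globus.gridFTP.transfer.start file=/tmp/myfile id=56010
ts=2006-12-08T18:45:02.214369Z event=org.globus.gridFTP.transfer.end file=/tmp/myfile bytesTransferred=678433 id=56010 status=0
"""

elapsed = durations(LOG.splitlines())
for (name, ident), seconds in elapsed.items():
    print(f"{name} id={ident} took {seconds:.2f} s")
```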

Logging API

• We are not requiring any special library to generate log messages

• We assume that programmers use one of the standard logging APIs (syslog, Java log4j, Python logging, etc.)
  - Could also use ‘printf’, a custom logging API, etc.
  - Syslog-ng can be used to forward any newline-delimited ASCII log file

Status

• Working with GT4 developers to add this to GRAM4, GridFTP, MDS4, Java Core, C Core, Delegation Service

• Working with OSG on deployment of syslog-ng to gather up logs

Log Summarizer

Issue

• Would like to have detailed I/O logging for performance analysis

• But detailed logs can be far too large and intrusive
  - For example, a trace of the I/O operations performed by a single GridFTP server capable of saturating a 10 Gigabit network will generate O(20,000) log events / second
    - over 70 million per hour
• Need:
  - ongoing report of I/O characteristics
  - negligible perturbation ( < 1% ) and storage

Solution

• Summarization
  - developer can log 1000’s of events/sec
  - run-time choice of summary granularity (easy to turn off by default) and algorithm
• NetLogger Summarization Library
  - Summarizes logs before they are ever written to disk
  - Huge reduction of log volume, while retaining important information
  - Can be used for bottleneck analysis and performance anomaly detection
  - General-purpose tool can be extended to do different kinds of summarization (currently only does time-based)

How summarization works

Code

  for (i=0; i < N; i++) {
      nl_write(log, "loop.start", "id=i", 0);
      double v = do_work();
      nl_write(log, "loop.end", "id=i val=d", 0, v);
  }

Configuration (XML version)

  <config>
    <event-name-keys>event</event-name-keys>
    <function id='f1' type='tsumm' url='/tmp/summarized.log'>
      <param name='interval'>1</param>
      <param name='value'>val</param>
    </function>
    <event-group>
      <consume/>
      <function id='f1'>
        <param name='event'>loop.summ</param>
      </function>
      <id-keys>id</id-keys>
      <event>loop.start</event>
      <event>loop.end</event>
    </event-group>
  </config>

[Figure: timeline (0 sec, 1 sec, 2 sec). Log calls: start/end of each loop. Output: summary events with average time, average value per time, etc.]
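
The idea of time-based summarization can be sketched in a few lines of Python. This is not the NetLogger Summarization Library: the function, the tuple layout of the input, and the output field names (`count`, `avg_time`, `avg_value`) are assumptions chosen to mirror the loop.start / loop.end example.

```python
from collections import defaultdict

def summarize(events, interval=1.0):
    """Collapse start/end pairs into one summary record per time interval.

    events: iterable of (time_sec, name, id, value_or_None) tuples.
    """
    starts = {}
    buckets = defaultdict(list)  # interval index -> [(duration, value), ...]
    for t, name, ident, value in events:
        if name.endswith(".start"):
            starts[ident] = t
        elif name.endswith(".end") and ident in starts:
            duration = t - starts.pop(ident)
            buckets[int(t // interval)].append((duration, value))
    summary = []
    for index in sorted(buckets):
        pairs = buckets[index]
        n = len(pairs)
        summary.append({
            "event": "loop.summ",
            "ts": index * interval,
            "count": n,
            "avg_time": sum(d for d, _ in pairs) / n,
            "avg_value": sum(v for _, v in pairs) / n,
        })
    return summary

events = [
    (0.0, "loop.start", 0, None), (0.5, "loop.end", 0, 10.0),
    (0.5, "loop.start", 1, None), (0.75, "loop.end", 1, 20.0),
    (1.0, "loop.start", 2, None), (1.5, "loop.end", 2, 30.0),
]
summary = summarize(events, interval=1.0)
for record in summary:
    print(record)
```

Six raw events become two summary records here. At thousands of events per second the reduction is correspondingly larger.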

Programmatic configuration

• New NetLogger calls (slightly simplified):
  - Add events to summarizer:
      add_eventpair("my.event",   /* my.event.start / my.event.end */
                    "nbytes",     /* value field, e.g. nbytes=131024 */
                    "guid")       /* identifier field */
  - Set summary interval:
      set_interval(10)            /* 10 second summary interval */
  - Get summary statistics:
      I = get_stats("my.event")

Sample GridFTP Deployment with Summarizer

Anomaly Detection

• Summarized events can be used for simple anomaly detection
  - Summarize disk and network throughput every 10 seconds
  - Generate an alarm if disk or network throughput drops below threshold X for duration Y
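
The threshold rule can be sketched as a scan over the per-interval summaries. The function name, the sample values, and the choice to report alarms as (start, end) times in seconds are assumptions for illustration.

```python
def find_alarms(samples, threshold, duration, interval=10):
    """Return (start_sec, end_sec) spans where throughput stayed below
    `threshold` for at least `duration` seconds.

    samples: one throughput value per summary interval, in time order.
    """
    alarms = []
    run_start = None  # index where the current below-threshold run began
    for i, value in enumerate(samples):
        if value < threshold:
            if run_start is None:
                run_start = i
        else:
            if run_start is not None and (i - run_start) * interval >= duration:
                alarms.append((run_start * interval, i * interval))
            run_start = None
    # A run that is still open at the end of the data also counts.
    if run_start is not None and (len(samples) - run_start) * interval >= duration:
        alarms.append((run_start * interval, len(samples) * interval))
    return alarms

# Hypothetical throughput summaries (e.g. Mbit/s), one per 10 seconds.
samples = [900, 880, 120, 90, 110, 870, 100, 905]
alarms = find_alarms(samples, threshold=200, duration=30)
print(alarms)
```

The single low sample near the end is ignored because it lasts only one interval, which is shorter than duration Y.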

Bottleneck Analysis

• Can configure summarizer to just output a single summary at the end

• Need to collect summary information at both client and server sides

• Because the start/end events measure both...
  - time inside instrumented function (end_i - start_i)
  - time between successive calls (start_{i+1} - end_i)
• ...there is potential for determining which functions are busy and which are mostly waiting
  - admittedly, this is somewhat complicated by OS buffering
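
The two quantities above are simple to compute from a list of (start, end) timestamps. The sketch below uses made-up timings for a hypothetical disk-read and network-write function, and ignores the OS-buffering caveat.

```python
def busy_and_wait(calls):
    """Return (busy, wait) seconds for successive calls to one function.

    calls: list of (start, end) timestamps in call order.
    busy = sum of (end_i - start_i); wait = sum of (start_{i+1} - end_i).
    """
    busy = sum(end - start for start, end in calls)
    wait = sum(nxt[0] - cur[1] for cur, nxt in zip(calls, calls[1:]))
    return busy, wait

# Hypothetical timings: reads finish quickly, writes take most of each cycle.
disk_read = [(0.0, 0.25), (1.0, 1.25), (2.0, 2.25)]
net_write = [(0.25, 1.0), (1.25, 2.0), (2.25, 3.0)]

read_busy, read_wait = busy_and_wait(disk_read)
write_busy, write_wait = busy_and_wait(net_write)
print("disk read: busy", read_busy, "wait", read_wait)
print("net write: busy", write_busy, "wait", write_wait)
```

In this made-up case the write side is busy most of the time while the read side mostly waits, which would point at the network path as the bottleneck.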

Summary

• Two broad categories of work:
  - Gathering and normalizing existing data to allow analysis across sites (e.g. in OSG)
    - syslog-ng, log parser, db loader
    - missing event detector
    - anomaly detection
  - Adding new data through instrumentation of standard Grid middleware
    - best practices logging recommendations
    - summarizer

More Information

• CEDPS TS home page: http://www.cedps.net/wiki/index.php/Troubleshooting

• Best-practices sub-page: http://www.cedps.net/wiki/index.php/LoggingBestPractice

• CEDPS TS team
  - Brian Tierney, LBNL (Area lead)
  - Jen Schopf, ANL
  - Stu Martin, ANL
  - Laura Perlman, ISI

Extra Slides

NL summarizer performance

Log dest     NL mode: Full         NL mode: Summarized
Disk         202,000 events/sec    588,000 events/sec
/dev/null    331,000 events/sec    588,000 events/sec

NL summarizer overhead

I/O, compute overlap