Un-broken Logging - Operability.io 2015 - Matthew Skelton

Un-Broken Logging: the foundation of software operability. Operability.io conference #OIO15, Friday 25th September 2015. Matthew Skelton, Skelton Thatcher Consulting, @matthewpskelton


Un-Broken Logging: the foundation of software operability

Operability.io conference #OIO15, Friday 25th September 2015

Matthew Skelton, Skelton Thatcher Consulting

@matthewpskelton

The way we use logging is (often) broken

How to make our logging more awesome

Why we should care

Matthew Skelton

@matthewpskelton

#OIO15

@Operability

#operability

WhoOwnsMyOperability.com

confession:

I am a big fan of logging

exceptional situations, edge cases

metrics, analytics, ‘audits’

…@evanphx

execution trace

BAD STUFF

Logging is often unloved

1. Discontinuous

2. Errors only, or arbitrary

3. ‘Bolted on’

4. No aggregation & search

5. Specify severity up front

GOOD STUFF

How to make logging awesome

1. Continuous event IDs

2. Transaction tracing

3. Log aggregation & search tools

4. Design for logging

5. Decoupled severity

reduce time-to-detect, increase team engagement

increase configurability, enhance DevOps collaboration

#operability

Background

Autonomous weather station

MRI brain scan imaging

Oil well monitoring

Web-scale systems

logging makes things work

(event sourcing)

(structured logging)

(CQRS)

How is logging usually broken?

Logging is often unloved

1. Discontinuous

2. Errors only, or arbitrary

3. ‘Bolted on’

4. No aggregation & search

5. Specify severity up front

using logging mainly for errors

inconsistent use of logging

logging slows down the software

logging ‘pollutes’ my precious domain model

logging is just for those weird Ops people

logging assumed to be free ($0) to implement

no budget for aggregating logs across machines

log aggregation happens only in Production

logs not available to Devs

fights over log severity levels

poor time synchronisation

Some history, with pirates

weather, course, sightings, latitude, longitude, …

(even when quiet)

John Harrison

Why log?

verification, traceability

accountability

charting the waters

- June 13th – Pirates!!!!

- Weds – Sharks!!!

- 19th Jun – BIGGER sharks!!!!

How to make logging awesome


1. Continuous event IDs

2. Transaction tracing

3. Log aggregation & search tools

4. Design for logging

5. Decoupled severity

Storage I/O

Worker Job

Queue

Upload

Continuous event IDs

How many distinct event types (state transitions) in your application?

represent distinct states

enum

Human-readable sets: unique values, sparse, immutable

C#, Java, Python, node (Ruby, PHP, …)

public enum EventID
{
    // Badly-initialised logging data
    NotSet = 0,

    // An unrecognised event has occurred
    UnexpectedError = 10000,

    ApplicationStarted = 20000,
    ApplicationShutdownNoticeReceived = 20001,

    PageGenerationStarted = 30000,
    PageGenerationCompleted = 30001,

    MessageQueued = 40000,
    MessagePeeked = 40001,

    BasketItemAdded = 60001,
    BasketItemRemoved = 60002,

    CreditCardDetailsSubmitted = 70001,

    // ...
}

Technical

Domain
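The same idea carries over to the other languages listed. A minimal Python sketch, assuming the standard library's `enum` module: `@unique` enforces the "unique values" property, and the gaps between numeric ranges keep the set sparse. The member subset and the helpers `format_event` and `log_event` are hypothetical, for illustration only.

```python
import logging
from enum import IntEnum, unique


@unique  # raises ValueError at class creation if two names share a value
class EventID(IntEnum):
    # Badly-initialised logging data
    NOT_SET = 0
    # Technical events
    UNEXPECTED_ERROR = 10000
    APPLICATION_STARTED = 20000
    MESSAGE_QUEUED = 40000
    # Domain events
    BASKET_ITEM_ADDED = 60001
    BASKET_ITEM_REMOVED = 60002


def format_event(event: EventID, message: str) -> str:
    """Prefix a message with the numeric ID and the name of the event."""
    return f"{event.value} {event.name} {message}"


def log_event(logger: logging.Logger, event: EventID, message: str) -> None:
    """Hypothetical helper: every log line carries an event ID."""
    logger.info(format_event(event, message))
```

Because each line carries both the number and the name, a log search can match on either.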


BasketItemAdded = 60001

BasketItemRemoved = 60002

represent distinct states

OrderSvc_BasketItemAdded

Monolith to microservices: debugger does not have the full view

Even with remote debugger, it’s boring to attach and detach

Storage I/O

Worker Job

Queue

Upload

Transaction tracing

‘Unique-ish’ identifier for each request

Passed through downstream layers

Unique-ish ID
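One way to sketch this in Python, assuming the standard `logging`, `contextvars` and `uuid` modules. The names `request_id_var`, `RequestIdFilter`, `new_request_id`, `handle_upload` and `enqueue_job` are hypothetical; the 12-character ID length is an arbitrary choice for "unique-ish".

```python
import contextvars
import logging
import uuid
from typing import Optional

# Holds the ID of the request being handled ("-" outside any request).
request_id_var = contextvars.ContextVar("request_id", default="-")


class RequestIdFilter(logging.Filter):
    """Copy the current request ID onto every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.request_id = request_id_var.get()
        return True


def new_request_id() -> str:
    """Short random hex string: 'unique-ish', not globally unique."""
    return uuid.uuid4().hex[:12]


def enqueue_job(logger: logging.Logger) -> None:
    """Downstream layer: logs under the same ID without receiving it as an argument."""
    logger.info("MessageQueued")


def handle_upload(logger: logging.Logger, incoming_id: Optional[str] = None) -> str:
    """Entry point: reuse the caller's ID if one was supplied, else mint one."""
    request_id = incoming_id or new_request_id()
    token = request_id_var.set(request_id)
    try:
        logger.info("UploadReceived")
        enqueue_job(logger)
        return request_id
    finally:
        # Restore the previous value so IDs do not leak between requests.
        request_id_var.reset(token)
```

Attach `RequestIdFilter` to a handler whose format string includes `%(request_id)s`. Across process boundaries (queue, worker, storage) the ID would have to travel with the message itself, for example as a header or payload field.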

What about APM?

APM gives us application insight, BUT

How much do we learn? Is APM available on the Dev box?

It’s not just ‘an Ops problem’!

Helps us to understand how the software really works

Small overhead is worth it

Configurable severity levels

Which log level is right?

DEBUG, INFO, WARNING, ERROR, CRITICAL

Log level should *not* be fixed at compile or build time!

Tune log levels


{
  "eventmappings": {
    "events": {
      "event": [ {
        "id": "CacheServiceStarted",
        "severity": { "level": "Information" }
      }, {
        "id": "PageCachePurged",
        "severity": { "level": "Debug" },
        "state": { "enabled": false }
      }, {
        "id": "DatabaseConnectionTimeOut",
        "severity": { "level": "Error" }
      } ]
    }
  }
}

Tune severity levels of specific event IDs
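A minimal Python sketch of how an application might consume a mapping shaped like the one above, so that severity is decided by configuration rather than at the call site. The `LEVELS` table and the functions `load_event_mappings` and `log_event` are assumptions of this sketch, not part of any library.

```python
import json
import logging

# Level names as they appear in the config, mapped to stdlib levels (an assumption).
LEVELS = {
    "Debug": logging.DEBUG,
    "Information": logging.INFO,
    "Warning": logging.WARNING,
    "Error": logging.ERROR,
    "Critical": logging.CRITICAL,
}


def load_event_mappings(config_text: str) -> dict:
    """Return {event id: (level, enabled)} parsed from the JSON config text."""
    entries = json.loads(config_text)["eventmappings"]["events"]["event"]
    mappings = {}
    for entry in entries:
        level = LEVELS[entry["severity"]["level"]]
        # Events are enabled unless the config says otherwise.
        enabled = entry.get("state", {}).get("enabled", True)
        mappings[entry["id"]] = (level, enabled)
    return mappings


def log_event(logger, mappings, event_id, message, default_level=logging.INFO):
    """Log event_id at its configured severity; return False if it is disabled."""
    level, enabled = mappings.get(event_id, (default_level, True))
    if not enabled:
        return False
    logger.log(level, "%s %s", event_id, message)
    return True
```

The calling code only names the event; changing `DatabaseConnectionTimeOut` from Error to Warning then means editing the config and reloading it, with no rebuild.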

Event tracing

Use enumerations (or closest thing)

Technical and Domain event types

Distributed systems: debuggers less useful

Trace calls with ‘unique-enough’ handles

Tune log levels via config

Log aggregation & search tools

Design for log aggregation

develop the software using log aggregation as a first-class thing

stories for testing logging

BasketItemAdded

grep BasketItem
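A story for testing logging could be backed by an automated check that the expected event is actually emitted. A minimal sketch using Python's `unittest.TestCase.assertLogs`; the `basket` logger name and the `add_item_to_basket` function are hypothetical.

```python
import logging
import unittest

logger = logging.getLogger("basket")


def add_item_to_basket(basket: list, item: str) -> None:
    """Hypothetical domain function that logs a BasketItemAdded event."""
    basket.append(item)
    logger.info("BasketItemAdded item=%s", item)


class BasketLoggingTest(unittest.TestCase):
    def test_adding_item_logs_event(self):
        basket = []
        # assertLogs fails the test if nothing is logged at INFO or above.
        with self.assertLogs("basket", level="INFO") as captured:
            add_item_to_basket(basket, "book")
        self.assertEqual(basket, ["book"])
        self.assertEqual(
            captured.output, ["INFO:basket:BasketItemAdded item=book"]
        )
```

If the event name changes or the log call is removed, the test fails, which treats the log line as part of the behaviour under test.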

logging is (‘just’) another system component

NTP

Dev and Ops collaboration*

* and testers too!

Where?

auditing, compliance

pre-emptive fault diagnosis, performance

metrics…

Recap

Logging is often unloved

1. Discontinuous

2. Errors only, or arbitrary

3. ‘Bolted on’

4. No aggregation & search

5. Specify severity up front

How to make logging awesome

1. Continuous event IDs

2. Transaction tracing

3. Log aggregation & search tools

4. Design for logging

5. Decoupled severity

logging makes things work

“There is no thought behind aspect-oriented programming”

MINDFUL LOGGING (?!)

database transaction logs

‘Structured Logging’ (ThoughtWorks Technology Radar: “Adopt”, May 2015)

https://www.thoughtworks.com/radar/techniques/structured-logging

http://gregoryszorc.com/

.NET: http://serilog.net/

Java: https://github.com/fluent/fluent-logger-java
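For comparison, a rough Python sketch of the structured logging idea using only the standard library: each record is rendered as one JSON object per line so that aggregation tools can index the fields. `JsonFormatter` and the `fields` convention are assumptions of this sketch and are unrelated to the libraries linked above.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single-line JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "event": record.getMessage(),
        }
        # Extra key/value pairs passed as extra={"fields": {...}} are merged in
        # (a convention invented for this sketch).
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, sort_keys=True)
```

A call such as `logger.info("BasketItemAdded", extra={"fields": {"item": "book"}})` would then yield a line whose `event` and `item` keys can be searched directly.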

sanity

More

Ditch the Debugger and Use Log Analysis Instead

Matthew Skelton

https://blog.logentries.com/2015/07/ditch-the-debugger-and-use-log-analysis-instead/

More

Using Log Aggregation Across Dev & Ops: The Pricing Advantage

Rob Thatcher

https://blog.logentries.com/2015/08/using-log-aggregation-across-dev-ops-the-pricing-advantage/

Evan Phoenix (@evanphx)

youtube.com/watch?v=Z-JskKlIBOA

Books

operabilitybook.com

operationalfeatures.com

Thank you

http://skeltonthatcher.com/

@SkeltonThatcher

+44 (0)20 8242 4103

@matthewpskelton