PinTrace Advanced AWS meetup

PinTrace Distributed Tracing@Pinterest Suman Karumuri

Transcript of PinTrace Advanced AWS meetup

Page 1: PinTrace Advanced AWS meetup

PinTrace: Distributed Tracing@Pinterest

Suman Karumuri

Page 2: PinTrace Advanced AWS meetup

Proprietary and Confidential

● About me
● What is distributed tracing?
● Why PinTrace?
● PinTrace architecture
● Challenges and Lessons
● Contributions
● Q & A

Agenda

Page 3: PinTrace Advanced AWS meetup

● Lead for the tracing effort at Pinterest.
● Former Twitter Zipkin (open source distributed tracing project) lead.
● Former Twitter, Facebook, Amazon, Yahoo, Goldman Sachs engineer.
● Published papers on automatic trace instrumentation at Brown CS.
● Passionate about distributed tracing and distributed cloud infrastructure.

About me

Page 4: PinTrace Advanced AWS meetup

Distributed system

[Diagram: Client, Service 1, Service 2, Service 3]

Page 5: PinTrace Advanced AWS meetup

10th Rule of Distributed System Monitoring

“Any sufficiently complicated distributed system contains an ad-hoc, informally-specified, siloed implementation of causal tracing.”

- Rodrigo Fonseca

Why Distributed tracing?

Page 6: PinTrace Advanced AWS meetup

What is distributed tracing?

Client Service 1 Service 2

ts1, r1, client req sent

ts2, r1, server req rcvd

ts7, r1, server resp sent

ts3, r1, client req sent

ts4, r1, server req rcvd

ts5, r1, server resp sent

ts6, r1, client resp rcvd

ts8, r1, client resp rcvd

Structured logging on steroids.

Page 7: PinTrace Advanced AWS meetup

Annotation

Client Service 1 Service 2

ts1, r1, CS

ts2, r1, server req rcvd

ts7, r1, server resp sent

ts3, r1, client req sent

ts4, r1, server req rcvd

ts5, r1, server resp sent

ts6, r1, client resp rcvd

ts8, r1, client resp rcvd

Timestamped event name with a structured payload.

Page 8: PinTrace Advanced AWS meetup

Span

Client Service 1 Service 2

ts1, r1, s1, -, CS

ts2, r1, s1, -, SR

ts7, r1, s1, -, SS

ts3, r1, client req sent

ts4, r1, server req rcvd

ts5, r1, server resp sent

ts6, r1, client resp rcvd

ts8, r1, s1, -, CR

A logical unit of work captured as a set of annotations. Ex: a request/response pair.

Page 9: PinTrace Advanced AWS meetup

Trace

Client Service 1 Service 2

ts1, r1, s1, 0, CS

ts2, r1, s1, 0, SR

ts7, r1, s1, 0, SS

ts3, r1, s2, s1, CS

ts4, r1, s2, s1, SR

ts5, r1, s2, s1, SS

ts6, r1, s2, s1, CR

ts8, r1, s1, 0, CR

A DAG of spans that belong to the same request.
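
The annotation, span and trace definitions from the last three slides can be sketched in code. This is only an illustrative Python model, not PinTrace's or Zipkin's actual data structures: the names `Annotation`, `build_spans`, `children_of` and `client_duration` are hypothetical, and the symbolic timestamps ts1..ts8 are written as the integers 1..8.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Annotation:
    """One timestamped event: (timestamp, trace id, span id, parent span id, event)."""
    ts: int
    trace_id: str
    span_id: str
    parent_id: Optional[str]  # None for the root span (shown as 0 on the slide)
    event: str                # CS, SR, SS or CR


# The eight events from the Trace slide, with ts1..ts8 written as 1..8.
EVENTS = [
    Annotation(1, "r1", "s1", None, "CS"),
    Annotation(2, "r1", "s1", None, "SR"),
    Annotation(3, "r1", "s2", "s1", "CS"),
    Annotation(4, "r1", "s2", "s1", "SR"),
    Annotation(5, "r1", "s2", "s1", "SS"),
    Annotation(6, "r1", "s2", "s1", "CR"),
    Annotation(7, "r1", "s1", None, "SS"),
    Annotation(8, "r1", "s1", None, "CR"),
]


def build_spans(events: List[Annotation]) -> Dict[str, List[Annotation]]:
    """Group annotations by span id; each group is one span."""
    spans: Dict[str, List[Annotation]] = defaultdict(list)
    for a in sorted(events, key=lambda a: a.ts):
        spans[a.span_id].append(a)
    return dict(spans)


def children_of(spans: Dict[str, List[Annotation]]) -> Dict[Optional[str], List[str]]:
    """Map each parent span id to its child span ids (the edges of the trace)."""
    edges: Dict[Optional[str], List[str]] = defaultdict(list)
    for span_id, annotations in spans.items():
        edges[annotations[0].parent_id].append(span_id)
    return dict(edges)


def client_duration(annotations: List[Annotation]) -> int:
    """Client-observed duration of a span: CR timestamp minus CS timestamp."""
    by_event = {a.event: a.ts for a in annotations}
    return by_event["CR"] - by_event["CS"]


spans = build_spans(EVENTS)
edges = children_of(spans)
```

Here `spans` holds two spans (s1 and s2) and `edges` records that s1 is the root and s2 is its child, which is the parent/child structure the Trace slide describes.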

Page 10: PinTrace Advanced AWS meetup

Tracer: A piece of software that traces a request and generates spans.

Sampler: Selects which requests to trace.

Reporter: Gathers the spans from a tracer and sends them to the collector.

Span aggregation pipeline: A mechanism to transfer spans from the reporter to the collector.

Collector: A service that gathers the spans of various services from the pipeline.

Span storage: A backend used by the collector to store the spans.

Client/UI: An interface to search, access and visualize trace data.

Components of Tracing infrastructure
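
To make the sampler and reporter roles on this slide concrete, here is a minimal Python sketch. It is an assumption-laden illustration, not PinTrace code: `ProbabilisticSampler`, `BufferingReporter`, the fixed sampling rate and the in-memory `pipeline` list (standing in for the span aggregation pipeline) are all hypothetical.

```python
import random
from typing import Callable, Dict, List


class ProbabilisticSampler:
    """Decides, once per request, whether that request is traced."""

    def __init__(self, rate: float, rng: random.Random) -> None:
        if not 0.0 <= rate <= 1.0:
            raise ValueError("rate must be between 0 and 1")
        self.rate = rate
        self.rng = rng

    def should_trace(self) -> bool:
        # random() is in [0, 1), so rate=1.0 always traces and rate=0.0 never does.
        return self.rng.random() < self.rate


class BufferingReporter:
    """Collects finished spans and hands them to the pipeline in batches."""

    def __init__(self, send: Callable[[List[Dict]], None], batch_size: int) -> None:
        self.send = send
        self.batch_size = batch_size
        self.buffer: List[Dict] = []

    def report(self, span: Dict) -> None:
        self.buffer.append(span)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        if self.buffer:
            self.send(self.buffer)
            self.buffer = []  # start a fresh list; the sent batch is left untouched


# Hypothetical wiring: a plain list stands in for the span aggregation pipeline.
pipeline: List[List[Dict]] = []
reporter = BufferingReporter(send=pipeline.append, batch_size=2)
always = ProbabilisticSampler(rate=1.0, rng=random.Random(0))
never = ProbabilisticSampler(rate=0.0, rng=random.Random(0))

for request_id in ("r1", "r2", "r3"):
    if always.should_trace():
        reporter.report({"trace_id": request_id, "span_id": "s1"})
reporter.flush()  # push the last partial batch
```

Making the sampling decision once at the edge and batching in the reporter keeps the per-request overhead small, which matters given how often the talk stresses that tracing is expensive.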

Page 11: PinTrace Advanced AWS meetup

Motivation:

Success of project prestige, HBase debugging, Pinpoint.

Make backend faster and cheaper. Speed => More engagement.

Loading the home feed involves ~50 backend services.

Uses of Traces

Understand what we built: service dependency graphs.

Understand where a request spent its time: for debugging, tuning, cost attribution.

Improve time to triage. Ex: Which service caused this request to fail? Why is the search API slow after a recent deployment?

Why PinTrace?
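
Two of the uses listed on this slide, service dependency graphs and finding where a request spent its time, can be derived directly from span data. The Python sketch below is illustrative only: the span records, service names and durations are made up, and `self_times` assumes child spans run one after another without overlapping.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical spans of one trace: id, parent id, owning service, duration.
SPANS: List[Dict] = [
    {"id": "s1", "parent": None, "service": "frontend", "duration_ms": 100},
    {"id": "s2", "parent": "s1", "service": "feed", "duration_ms": 60},
    {"id": "s3", "parent": "s1", "service": "storage", "duration_ms": 20},
    {"id": "s4", "parent": "s2", "service": "storage", "duration_ms": 45},
]


def dependency_edges(spans: List[Dict]) -> Counter:
    """Count caller -> callee service pairs seen in the trace."""
    service_of = {s["id"]: s["service"] for s in spans}
    edges: Counter = Counter()
    for s in spans:
        parent = s["parent"]
        if parent is not None and parent in service_of:
            edges[(service_of[parent], s["service"])] += 1
    return edges


def self_times(spans: List[Dict]) -> Dict[str, int]:
    """Time spent in each span itself, excluding its direct children.

    Assumes children execute sequentially; overlapping (parallel) children
    would need interval merging instead of a plain sum.
    """
    child_total: Counter = Counter()
    for s in spans:
        if s["parent"] is not None:
            child_total[s["parent"]] += s["duration_ms"]
    return {s["id"]: s["duration_ms"] - child_total[s["id"]] for s in spans}


edges = dependency_edges(SPANS)
own = self_times(SPANS)
```

Summing `edges` over many traces gives a service dependency graph, and `own` points at the span whose own work dominates the request.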

Page 12: PinTrace Advanced AWS meetup

PinTrace architecture

[Architecture diagram; labels shown:]

Varnish
ngapi
Python service: Py thrift tracer, Py span logger
Java service(s): Java thrift tracer, Java span logger
Go service
MySQL
Memcached
Decider
Singer-Kafka pipeline
(Spark) Span aggregation
Trace processing & storage
ES trace store
Zipkin UI
The Wall

Page 13: PinTrace Advanced AWS meetup

Ensuring data quality.

Tracing infrastructure can be fragile since it has a lot of moving parts.

The more customized the pipeline, the harder it is to ensure data quality.

Use metrics and alerting to monitor the pipeline for correctness.

E2E monitoring: Sentinel

Traces a known request path periodically and checks the resulting trace for correctness.

The known request path should cover all known language/protocol combinations.

Measures end to end trace latency.

Testing
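
The kind of correctness check an end-to-end monitor performs on a known request path might look like the following Python sketch. It is not Sentinel's implementation: `check_trace`, the span dictionary layout and the service names are hypothetical, and the three checks shown (single root, no orphaned spans, every expected service present) are only examples of what "correct" could mean.

```python
from typing import Dict, List, Set


def check_trace(spans: List[Dict], expected_services: Set[str]) -> List[str]:
    """Return a list of problems found in one trace; an empty list means it passed."""
    problems: List[str] = []
    span_ids = {s["span_id"] for s in spans}

    # A well-formed trace has exactly one root span.
    roots = [s for s in spans if s["parent_id"] is None]
    if len(roots) != 1:
        problems.append(f"expected exactly 1 root span, found {len(roots)}")

    # Every non-root span must point at a parent that was actually collected.
    for s in spans:
        parent = s["parent_id"]
        if parent is not None and parent not in span_ids:
            problems.append(f"span {s['span_id']} has missing parent {parent}")

    # Every service on the known request path must have reported a span.
    missing = expected_services - {s["service"] for s in spans}
    for service in sorted(missing):
        problems.append(f"no span from service {service}")
    return problems


# Hypothetical known request path: a Python service calling a Java service.
EXPECTED = {"python-api", "java-backend"}
good = [
    {"span_id": "s1", "parent_id": None, "service": "python-api"},
    {"span_id": "s2", "parent_id": "s1", "service": "java-backend"},
]
broken = [
    {"span_id": "s1", "parent_id": None, "service": "python-api"},
    {"span_id": "s3", "parent_id": "s9", "service": "python-api"},
]

good_result = check_trace(good, EXPECTED)
broken_result = check_trace(broken, EXPECTED)
```

Each problem string could be exported as a metric or alert, in line with the advice above to monitor the pipeline for correctness.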

Page 14: PinTrace Advanced AWS meetup

Collecting a lot of trace data that provides very few insights.

Spending more time scaling the trace collection infrastructure than providing value.

Using tracing when simpler methods would suffice.

Ex: use simpler time series metrics for counting the number of API calls.

Tracing is expensive.

Higher dark latency compared to other methods.

Tracing infrastructure is expensive since we are dealing with an order of magnitude more data.

Tracing tarpit

Page 15: PinTrace Advanced AWS meetup

Tracing is not the solution to a problem, it’s a tool.

Build tools around traces to solve a problem.

Tracing should augment our time series metrics and logging platform.

Traces should only be used for computing distributed metrics.

Tracing infrastructure should be cheap and easy to run.

Quality of traces is more important than quantity of traces.

Do all processing and analysis of traces on ingestion; avoid post-processing.

Our Tracing philosophy

Page 16: PinTrace Advanced AWS meetup

Instrumentation is hard.

Instrumenting the framework is less brittle, agnostic to business logic and more reusable.

Even after instrumenting the framework, there will be snowflakes.

The more opinionated the framework, the easier it is to instrument. Ex: Java/Go vs Python.

Need instrumentation for every language/protocol combination.

Use a framework that is already enabled for tracing.

Instrumentation challenges

Page 17: PinTrace Advanced AWS meetup

Deploying tracing at scale is a complex and challenging process.

Needs a company-wide span aggregation pipeline.

Enabling and deploying instrumentation across several Java/Python services is like herding cats.

Scaling the tracing backend.

Dealing with multiple stakeholders and doing things the “right” way.

Can’t see its benefits or ensure data quality until it is fully deployed.

Do deployments along key request paths first for best results.

Deployment challenges

Page 18: PinTrace Advanced AWS meetup

User Education is very important.

Most people use tracing for solving needle in haystack and

SREs get tracing. Still an esoteric concept even for good engineers.

Explain the use cases where they can use tracing.

Insights into performance bottlenecks or global visibility.

Tracing landscape is confusing.

Distributed tracing/Zipkin landscape is rapidly evolving and can be confusing.

Zipkin UI has some rough edges.

Lessons learned

Page 19: PinTrace Advanced AWS meetup

Data quality

For identifying performance bottlenecks from traces, relative durations are most important.

When deployed in the right order, even partial tracing is useful.

Trace errors are OK when they are in the leaves of a trace.

Tracing Infrastructure

Tracing infrastructure is a Tier 2 service in almost all companies.

Tracing is expensive.

Lessons learned (contd)
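
The data quality point about relative durations can be illustrated with a small calculation: express each span's duration as a fraction of the root span's duration and look for the largest shares. The Python sketch below is a hypothetical example; the function name and the durations are made up for illustration.

```python
from typing import Dict


def relative_durations(durations_ms: Dict[str, float], root: str) -> Dict[str, float]:
    """Express every span's duration as a fraction of the root span's duration."""
    total = durations_ms[root]
    if total <= 0:
        raise ValueError("root span must have a positive duration")
    return {span: d / total for span, d in durations_ms.items()}


# Hypothetical trace: the root span took 200 ms end to end.
DURATIONS = {"root": 200.0, "feed": 150.0, "storage": 120.0, "cache": 10.0}
shares = relative_durations(DURATIONS, root="root")

# Rank the non-root spans by their share of the request to find the likely bottleneck.
ranked = sorted((s for s in shares if s != "root"), key=shares.get, reverse=True)
```

In this made-up trace the `feed` span accounts for three quarters of the request, so it is the first place to look even if the absolute timestamps are imprecise.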

Page 20: PinTrace Advanced AWS meetup

● Identified that we use a really old version of the finagle-memcache client that is blocking the finagle upgrade.
● Identified ~7% of Java code as dead code and deleted 20KLoC so far.
● First company-wide log/span aggregation pipeline.
● Identified a synchronous MySQL client, now moving to an asynchronous one.
● Local Zipkin setup: debugging HBase latency issues.

Wins

Page 21: PinTrace Advanced AWS meetup

Future work

● Short term
  ○ Finish Python instrumentation.
  ○ Open source the Spark backend.
  ○ Robust and scalable backend:
    ■ Trace all employee requests by default.
    ■ Make it easy to look at trace data for a request in the Pinterest app and web UI.
● Medium term
  ○ End to end traces to measure user perceived wait time. Ex: Mobile/Browser -> Java/Python/Go -> MySQL/MemCache/HBase.
  ○ Apply tracing to other use cases like Jenkins build times.
  ○ Improve Zipkin UI.

Page 22: PinTrace Advanced AWS meetup

Q&A