Having a Pulse On Your Platform


Transcript of Having a Pulse On Your Platform

Page 1: Having a Pulse On Your Platform

Having a Pulse On Your Platform

Kamyar Mohager (@kamyarsayshello), Engineering Manager, Partner Engineering

LinkedIn

Page 2: Having a Pulse On Your Platform

WHAT WE’LL COVER

WHY BOTHER MONITORING?

THE TECHNOLOGY

HOW WE OPERATIONALIZE

Page 3: Having a Pulse On Your Platform

WHY BOTHER MONITORING?

INTERNALLY

• Operations: You need to know the health of your platform just like any other app or frontend client. Know your API is down before your developers do
• Business: Make decisions based on data

EXTERNALLY

• API availability impacts external apps and their business
• Provide some level of monitoring (and possibly alerting) for external developers so they’re not left in the dark
• Developer empathy is important

Page 4: Having a Pulse On Your Platform

Technology

Page 5: Having a Pulse On Your Platform

APACHE KAFKA

● Pub-Sub Messaging and Queuing System
● Data backbone for LinkedIn

INGRAPHS

● Visualization Frontend for metrics
● Standard tool for all LinkedIn Eng & Ops

API-ANALYZER

● Visualization Frontend specific to LinkedIn Platform

● Used by Platform and SRE teams for Operational needs

APACHE HADOOP

● Distributed Data Storage and Processing

● Used by Platform for Business / Product Analytics

Page 6: Having a Pulse On Your Platform

KAFKA AT A GLANCE

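[Diagram: the API Gateway acts as the Producer and publishes to the Kafka Topic ExternalApiAccessEvent on the Broker; InGraphs, API-Analyzer, and Hadoop are the Consumers. The diagram also shows nodes labeled AP0 and AP1.]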
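
The producer/broker/consumer relationship on this slide can be illustrated with a toy, in-memory model. This is only a sketch of the pub-sub idea, not the Kafka client API: `ToyBroker`, its methods, and the example paths are hypothetical names made up for illustration. The point it shows is that each consumer keeps its own read position, so InGraphs, API-Analyzer, and Hadoop can all read the same topic independently.

```python
from collections import defaultdict


class ToyBroker:
    """In-memory stand-in for a broker: one append-only log per topic."""

    def __init__(self):
        self._logs = defaultdict(list)    # topic -> list of messages
        self._offsets = defaultdict(int)  # (topic, consumer) -> next index to read

    def publish(self, topic, message):
        """Producer side: append a message to the topic's log."""
        self._logs[topic].append(message)

    def poll(self, topic, consumer):
        """Consumer side: return unseen messages and advance this consumer's offset."""
        log = self._logs[topic]
        start = self._offsets[(topic, consumer)]
        self._offsets[(topic, consumer)] = len(log)
        return log[start:]


broker = ToyBroker()
TOPIC = "ExternalApiAccessEvent"

# The API gateway (producer) publishes one event per external API call.
# The paths below are hypothetical examples.
broker.publish(TOPIC, {"path": "/v1/people", "status": 200})
broker.publish(TOPIC, {"path": "/v1/shares", "status": 500})

# Each consumer reads the full stream at its own pace.
ingraphs_batch = broker.poll(TOPIC, "InGraphs")
hadoop_batch = broker.poll(TOPIC, "Hadoop")
```

Because offsets are tracked per consumer, a slow consumer such as a batch load does not hold back a dashboard that reads the same events.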

Page 7: Having a Pulse On Your Platform

EXAMPLE KAFKA TOPIC

ExternalApiAccessEvent
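
The actual schema of the topic is not reproduced in this transcript. As a rough sketch, an event of this kind could carry the facets mentioned elsewhere in the deck (application ID, IP address, member ID, path, query params, HTTP method, response code, latency). Every field name and value below is an assumption chosen for illustration, not LinkedIn's real schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ExternalApiAccessEvent:
    """Hypothetical shape of one tracking event per external API call."""

    timestamp_ms: int      # when the gateway handled the call
    app_id: str            # calling application
    partner_program: str   # program the application belongs to
    ip_address: str
    member_id: int
    http_method: str
    path: str
    query_params: str
    status_code: int
    latency_ms: float


event = ExternalApiAccessEvent(
    timestamp_ms=1_400_000_000_000,
    app_id="app-a",
    partner_program="open",
    ip_address="203.0.113.7",
    member_id=12345,
    http_method="GET",
    path="/v1/people",
    query_params="format=json",
    status_code=200,
    latency_ms=42.0,
)

# Serialized form that a producer could publish to the topic.
payload = json.dumps(asdict(event))
```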

Page 8: Having a Pulse On Your Platform

INGRAPHS

• Standard visualization framework for operational metrics used @ LinkedIn

• Configuration driven with pre-selected applications to create monitoring dashboards

• Hooks into auto-alerting system

Page 9: Having a Pulse On Your Platform

DATA FLOWING TO INGRAPHS

Page 10: Having a Pulse On Your Platform

DEVIL IN THE (MONITORING) DETAILS

WHO

● Entire Platform (aggregate)
● Per Partner Program
● Per Application

WHAT

● QPS
● Latency
● HTTP Response codes (4xx, 5xx)
● APIs / Endpoints (granular to specific HTTP methods)
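
A minimal sketch of how the "who" and "what" combine: group events by application, then compute QPS, 4xx/5xx rates, and mean latency for a time window. The function name, the event keys (`app_id`, `status`, `latency_ms`), and the sample data are assumptions for illustration; the same grouping could be done per partner program or for the whole platform.

```python
from collections import defaultdict


def summarize(events, window_seconds):
    """Group events by app_id and compute QPS, 4xx/5xx rates, and mean latency."""
    by_app = defaultdict(list)
    for e in events:
        by_app[e["app_id"]].append(e)

    summary = {}
    for app_id, rows in by_app.items():
        n = len(rows)
        summary[app_id] = {
            "qps": n / window_seconds,
            "rate_4xx": sum(400 <= r["status"] < 500 for r in rows) / n,
            "rate_5xx": sum(500 <= r["status"] < 600 for r in rows) / n,
            "mean_latency_ms": sum(r["latency_ms"] for r in rows) / n,
        }
    return summary


# Hypothetical events observed in a 2-second window.
events = [
    {"app_id": "app-a", "status": 200, "latency_ms": 40.0},
    {"app_id": "app-a", "status": 500, "latency_ms": 120.0},
    {"app_id": "app-a", "status": 200, "latency_ms": 50.0},
    {"app_id": "app-a", "status": 404, "latency_ms": 30.0},
    {"app_id": "app-b", "status": 200, "latency_ms": 20.0},
]
summary = summarize(events, window_seconds=2)
```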

Page 11: Having a Pulse On Your Platform

INGRAPHS FOR PLATFORM

PROS

● Efficient: filters latency/QPS/error rates/call types based on configurations
● Stable: used by all of Engineering and Ops

CONS

● Doesn’t support ad hoc queries
● Dependency on SRE team to make any configuration changes

Page 12: Having a Pulse On Your Platform

API-ANALYZER

• Visualization frontend specifically for ExternalApiAccessEvent metrics
• Used by Platform and SRE teams supporting the API
• Ad hoc queries to help with troubleshooting

Page 13: Having a Pulse On Your Platform

API-ANALYZER PROCESS FLOW

Page 14: Having a Pulse On Your Platform

API-ANALYZER

PROS

● Supports fast ad hoc queries against a number of facets: appid, IP address, call types
● Free of dependencies on SRE team to maintain configurations for predefined applications

CONS

● Limited historical data available
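
To make "ad hoc queries against facets" concrete, here is a small sketch of a faceted count over in-memory events. It does not represent how API-Analyzer is built; `adhoc_query`, the event keys, and the sample data are hypothetical. The idea is that any facet can be used as a filter or as the grouping key without predefined configuration.

```python
from collections import Counter


def adhoc_query(events, group_by, **filters):
    """Count events that match every facet filter, grouped by one facet."""
    matching = (
        e for e in events
        if all(e.get(key) == value for key, value in filters.items())
    )
    return Counter(e[group_by] for e in matching)


# Hypothetical recent events with the facets named on the slide.
events = [
    {"app_id": "app-a", "ip": "203.0.113.7", "call_type": "GET /people", "status": 200},
    {"app_id": "app-a", "ip": "203.0.113.7", "call_type": "GET /people", "status": 500},
    {"app_id": "app-a", "ip": "198.51.100.2", "call_type": "POST /shares", "status": 500},
    {"app_id": "app-b", "ip": "203.0.113.7", "call_type": "GET /people", "status": 500},
    {"app_id": "app-a", "ip": "203.0.113.7", "call_type": "POST /shares", "status": 500},
]

# "Which IPs are producing 500s for app-a?"
errors_by_ip = adhoc_query(events, group_by="ip", app_id="app-a", status=500)

# "Which call types is app-a making?"
calls_by_type = adhoc_query(events, group_by="call_type", app_id="app-a")
```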

Page 15: Having a Pulse On Your Platform

APACHE HADOOP

• The hub of all offline tracking data @ LinkedIn
• All ExternalApiAccessEvent data gets ETL’d into Hadoop in near real-time
• Platform team relies on Hadoop for product and business analytics
• In-depth analytics beyond just QPS, Latency, Call Types, etc
• Historical Data

Page 16: Having a Pulse On Your Platform

How Do We Operationalize?

Page 17: Having a Pulse On Your Platform

PARTNER ENGINEERING AT LINKEDIN

TEAM GOAL

Provide a world-class developer platform where our partners and developers can build fantastic 3rd party applications for LinkedIn members

ROLE OF A PARTNER ENGINEER

Guide and support partners and developers using our RESTful APIs and mobile SDKs

TREAT PLATFORM AS A PRODUCT

Incorporate feedback from our external developers to influence the roadmap

Page 18: Having a Pulse On Your Platform

SUPPORT MODEL

• Organized by Partner Programs
• Open Program: Stack Overflow + Developer Portal
• Partner Programs: Dedicated Partner Engineers provide white-glove support
• SLAs vary by Partner Program (and in certain cases, by strategic partner)

Page 19: Having a Pulse On Your Platform

THE TECHNOLOGY IN ACTION

InGraphs

● Dashboards created for a given Partner Program or a specific application
● Charts any metrics we care about (e.g. QPS)
● Set up alerts for support teams based on a given threshold
● Depending on SLA, team gets emailed and/or called (via on-call rotation)

API-Analyzer

● Used for ad hoc queries
● Fast when needing to troubleshoot and triage a production issue for a partner

Hadoop

● Long term look backs
● Provides all ExternalApiAccessEvent tracking data not available in visualization frontends (e.g. member IDs, paths, query params, etc)
● Ability to create complex, in-depth reports
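
The threshold alerting described for InGraphs, where the SLA decides whether the team is emailed or called, can be sketched as follows. This is not InGraphs configuration: the function, the tier names (`"standard"`, `"strategic"`), and the channel names are hypothetical and only illustrate SLA-dependent routing.

```python
def route_alert(value, threshold, sla_tier):
    """Return notification channels for a metric reading, or [] if no breach.

    Assumption: every breach sends an email, and only the (hypothetical)
    "strategic" SLA tier also calls the on-call engineer.
    """
    if value <= threshold:
        return []
    channels = ["email"]
    if sla_tier == "strategic":
        channels.append("call_on_call")
    return channels


# 5xx rate of 7% against a 5% threshold, for two different SLA tiers.
standard_channels = route_alert(value=0.07, threshold=0.05, sla_tier="standard")
strategic_channels = route_alert(value=0.07, threshold=0.05, sla_tier="strategic")
healthy_channels = route_alert(value=0.01, threshold=0.05, sla_tier="strategic")
```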

Page 20: Having a Pulse On Your Platform

[In]SUMMARY

• Your external apps expect 99.99% API “site up”
• Monitoring and alerting are essential for knowing the health of your platform
• Use data to make business and product decisions
• It all goes back to tracking: necessary to solve operational and business needs
• Many different types of solutions: up to you to decide whether to build or buy
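
For a sense of scale, the 99.99% figure can be turned into a downtime budget with simple arithmetic. The helper below is an illustrative sketch (the function name is made up); it works out to roughly 52.6 minutes per 365-day year, or about 4.3 minutes per 30 days.

```python
def downtime_budget_minutes(availability, period_days):
    """Minutes of allowed downtime in a period for a given availability target."""
    return (1.0 - availability) * period_days * 24 * 60


per_year = downtime_budget_minutes(0.9999, 365)     # about 52.56 minutes
per_30_days = downtime_budget_minutes(0.9999, 30)   # about 4.32 minutes
```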

Page 21: Having a Pulse On Your Platform

THANKS!

Kamyar Mohager (@kamyarsayshello), Engineering Manager, Partner Engineering

LinkedIn
