It’s all about telemetry
Monitoring what matters in a useful way.
Tuesday, June 26, 12
Theo Schlossnagle @postwait
I write software
I write books
I give talks
I participate in the industry
I speak frankly about industry issues
Data, data, everywhere.
A billion pageviews / month.
100k database queries / second.
1MM memcache queries / second.
500k MQ messages / second.
10MM I/O operations / second.
Big Data
Most new big data problems are solvable
Most new big data problems are created by our solutions, and thus solvable despite their ROI
That’s a whole lot of data
Think in terms of logs (too many do)
About 26 trillion log lines / month
@ 40 bytes compressed: 1PB / month
Just because it is possible does not mean it will return on investment (and does not mean it won’t)
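A rough check of the numbers on this slide, as a sketch. It assumes the 26 trillion figure comes from the 10MM I/O operations / second line alone, counted over a 30-day month with one log line per operation, and that 1PB means 10^15 bytes. None of these assumptions are stated in the talk.

```python
# Back-of-the-envelope check of the log-volume estimate.
# Assumptions (not from the talk): a 30-day month, one log line per
# I/O operation, and decimal petabytes (10**15 bytes).
SECONDS_PER_MONTH = 30 * 24 * 60 * 60   # 2,592,000
IO_OPS_PER_SECOND = 10_000_000          # "10MM I/O operations / second"
BYTES_PER_LINE = 40                     # "@ 40 bytes compressed"

lines_per_month = IO_OPS_PER_SECOND * SECONDS_PER_MONTH
bytes_per_month = lines_per_month * BYTES_PER_LINE

print(f"{lines_per_month / 1e12:.1f} trillion log lines / month")
print(f"{bytes_per_month / 1e15:.2f} PB / month")
```

Under those assumptions the result lands close to the slide's rounded figures.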
It’s all “useful”; which data?
Think in terms of cost/benefit.
Sure the data is useful, but it costs money to store
Does it cost you more to have it or not to have it?
Maybe the right approach is to keep that level of detail for a few days?
Double-edged sword.
Eroding granularity over time keeps storage under control
MISTAKE
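One reading of why eroding granularity is marked as a mistake: averaging old data into coarser windows can erase short events. The sketch below uses made-up latency samples and a hypothetical `erode` helper. It is an illustration only, not the speaker's tooling.

```python
def erode(samples, factor):
    """Average each run of `factor` consecutive samples into one value."""
    return [
        sum(samples[i:i + factor]) / len(samples[i:i + factor])
        for i in range(0, len(samples), factor)
    ]

# Made-up data: one hour of per-minute latencies (ms) with a 2-minute spike.
per_minute = [10.0] * 60
per_minute[30] = 500.0
per_minute[31] = 500.0

per_hour = erode(per_minute, 60)

print(max(per_minute))  # the spike is visible at full granularity
print(per_hour)         # after erosion only a mildly elevated average remains
```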
1 year: at a glance
1 week: looks normalish
1 day: confidence of normalcy increases
1 week: that looks different
1 day: yup, that’s not at all like that other week
Other methods
What do you store?
How do you store it?
Why is it useful?
Winning the cost-benefit game by reducing costs more significantly than reducing benefits
[Plot: Benefit and Cost vs. monitoring activity]
Positive Value
Be in the green.
[Plot: Benefit and Cost vs. monitoring activity, wider range]
There’s a bigger picture
It’s not as easy as you think.
[Plot: Benefit and Cost vs. monitoring activity]
Value is difference, not area
Green can be misleading
[Plot: Value vs. monitoring activity]
Value = Benefit - Cost
Green means we have positive return
It’s not about return
Well, it’s not only about return
It’s about maximizing return
This is a bit like black magic
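The difference between "positive return" and "maximum return" can be sketched numerically. The benefit and cost curves below are hypothetical shapes chosen only for illustration (diminishing benefit, linear cost). The real curves on the slides are not recoverable from the transcript.

```python
import math

def benefit(activity):
    # Hypothetical curve: diminishing returns as monitoring activity grows.
    return 1.0 - math.exp(-activity)

def cost(activity):
    # Hypothetical curve: cost grows linearly with monitoring activity.
    return 0.3 * activity

def value(activity):
    # Value = Benefit - Cost, as on the slide.
    return benefit(activity) - cost(activity)

# Scan activity levels from 0 to 3 in steps of 0.01 and keep the best one.
levels = [i / 100 for i in range(301)]
best = max(levels, key=value)

print(f"value is positive at activity=2.5: {value(2.5) > 0}")
print(f"value peaks near activity={best:.2f} (value={value(best):.3f})")
```

With these made-up curves, a lot of activity levels are "in the green", but only one of them gives the largest return.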
Technique 1: text
Store changes
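A minimal sketch of storing changes for text data: keep a value only when it differs from the last one stored. The `record_changes` helper and the sample data are hypothetical.

```python
def record_changes(observations):
    """Keep only the (timestamp, text) pairs where the text changed."""
    stored = []
    last = object()  # sentinel that never equals a real value
    for timestamp, text in observations:
        if text != last:
            stored.append((timestamp, text))
            last = text
    return stored

# Made-up example: a software version string polled once a minute.
polled = [
    (0, "v1.2.0"),
    (60, "v1.2.0"),
    (120, "v1.2.0"),
    (180, "v1.3.0"),
    (240, "v1.3.0"),
]
print(record_changes(polled))
```

Five observations collapse to two stored rows, one per distinct state.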
Technique 2: numeric
Store rollups (i.e. statistical aggregates over fixed windows)
over 1 minute store
min/max/avg/stddev/covariance/50%/95%/99%
lots of information
heavy lossy compression of high-frequency data
loses population distribution information
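A rollup of one window might look like the sketch below. The `rollup` and `percentile` helpers are hypothetical, the percentiles use the nearest-rank method as one possible choice, and covariance is left out because it needs a second stream to compare against.

```python
import math
import statistics

def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]

def rollup(values):
    """Summarize one window of raw measurements as fixed-size aggregates."""
    ordered = sorted(values)
    return {
        "count": len(ordered),
        "min": ordered[0],
        "max": ordered[-1],
        "avg": statistics.fmean(ordered),
        "stddev": statistics.pstdev(ordered),
        "p50": percentile(ordered, 50),
        "p95": percentile(ordered, 95),
        "p99": percentile(ordered, 99),
    }

# Made-up window: 100 measurements taken during one minute.
window = list(range(1, 101))
print(rollup(window))
```

The output has the same size no matter how many measurements went in, which is where both the compression and the loss of distribution detail come from.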
Database replication
Lag (green) and rate of lag change (purple)
Storage Usage
We can see growth.
More useful, we can use this to project.
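Projecting from growth can be as simple as fitting a straight line and extending it. The sketch below uses made-up daily samples, a hypothetical capacity, and an ordinary least-squares fit. It is only as good as the assumption that growth stays linear.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Made-up data: storage used (TB) sampled once a day for a week.
days = [0, 1, 2, 3, 4, 5, 6]
used_tb = [40.0, 40.5, 41.0, 41.5, 42.0, 42.5, 43.0]
capacity_tb = 50.0  # hypothetical capacity

slope, intercept = fit_line(days, used_tb)
day_full = (capacity_tb - intercept) / slope

print(f"growth: {slope:.2f} TB/day, projected full on day {day_full:.0f}")
```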
With simple numeric data
Unknowns can be predicted
In sane ways with confidence
Full Disclosure
You see awesome examples of predictive analytics
Like the real-world one on the previous slide
In practice, almost all data streams predict one thing:
they have no fucking clue.
Technique 3: histograms
Store histograms
over 1 minute store
counts of datapoints seen in various buckets
retains complete population distribution
loss of precision
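A sketch of storing a histogram for one window. Fixed-width buckets are an assumption made here for simplicity. The talk does not say which bucketing scheme the speaker's system uses.

```python
from collections import Counter

def bucket_for(value, width=1.0):
    """Lower edge of the fixed-width bucket that contains `value`."""
    return (value // width) * width

def histogram(values, width=1.0):
    """Count how many values fall into each bucket."""
    return Counter(bucket_for(v, width) for v in values)

# Made-up window: request latencies (ms) seen during one minute.
latencies = [1.2, 1.7, 1.9, 2.4, 2.5, 7.8, 1.1, 2.9, 3.3, 1.4]
hist = histogram(latencies, width=1.0)

for lower_edge in sorted(hist):
    print(f"[{lower_edge:.0f}, {lower_edge + 1:.0f}) ms: {hist[lower_edge]}")
```

Every data point is still counted, so the shape of the population survives. What is lost is where inside its bucket each value sat.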
Histograms 101
This.
This is a histogram.
It shows the frequency of values within a population.
Height represents frequency
Now, height and color represent frequency
Now, only color represents frequency
Histograms ➠ time series
at a single time interval
API Service Times
We can see a full population shift of several milliseconds
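A population shift shows up when every quantile moves, not just the average. The sketch below compares two made-up per-minute histograms using a hypothetical `quantile_from_histogram` helper that reports the lower edge of the bucket containing the requested quantile.

```python
def quantile_from_histogram(hist, q):
    """Approximate quantile: lower edge of the bucket holding the q-th value."""
    total = sum(hist.values())
    target = q * total
    seen = 0
    for lower_edge in sorted(hist):
        seen += hist[lower_edge]
        if seen >= target:
            return lower_edge
    raise ValueError("empty histogram")

# Made-up histograms (bucket lower edge in ms -> count) for two minutes.
before = {10: 5, 11: 40, 12: 50, 13: 5}
after = {14: 5, 15: 40, 16: 50, 17: 5}

for q in (0.05, 0.5, 0.95):
    shift = quantile_from_histogram(after, q) - quantile_from_histogram(before, q)
    print(f"q={q}: shifted by {shift} ms")
```

In this made-up data the low, middle, and high quantiles all move by the same amount, which is what a shift of the whole population looks like.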
Combining techniques
In our system (as a reference point)
Arbitrary numbers of numeric data points on a single stream occupy 32 bytes of space for statistical aggregates and occupy about 2k of space for a histogram
This means we can store these transforms on numeric data in perpetuity
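A rough sense of what "in perpetuity" costs, as a sketch. It assumes the 32 bytes and 2k apply per one-minute window per stream (matching the one-minute windows in techniques 2 and 3), that "2k" means 2048 bytes, and that a year has 365 days. The talk states none of these explicitly.

```python
# Assumptions (not stated in the talk): the sizes apply per one-minute
# window per stream, "2k" means 2048 bytes, and a year has 365 days.
WINDOWS_PER_YEAR = 365 * 24 * 60    # one-minute windows
AGGREGATE_BYTES = 32
HISTOGRAM_BYTES = 2048

aggregates_per_year = WINDOWS_PER_YEAR * AGGREGATE_BYTES
histograms_per_year = WINDOWS_PER_YEAR * HISTOGRAM_BYTES

print(f"aggregates: {aggregates_per_year / 1e6:.1f} MB per stream-year")
print(f"histograms: {histograms_per_year / 1e9:.2f} GB per stream-year")
```

The point that matters is that neither figure depends on how many data points arrive inside a window.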
Combining techniques
Text is a bit harder
You need to be careful
Some data sources can be constantly changing
Producing gobs of change data
You’re doing it wrong
Find these and fix them
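Finding the offenders can start with counting how many changes each text source records. The `noisy_sources` helper, the threshold, and the sample sources below are all hypothetical.

```python
from collections import Counter

def noisy_sources(change_log, max_changes):
    """Return sources with more than `max_changes` changes, sorted by name."""
    counts = Counter(source for source, _text in change_log)
    return sorted(s for s, n in counts.items() if n > max_changes)

# Made-up change log: (source, new text) pairs recorded during one hour.
change_log = [("ssh_banner", "OpenSSH_5.9")]
change_log += [("uptime_string", f"up {m} min") for m in range(60)]

print(noisy_sources(change_log, max_changes=10))
```

A source that changes on every poll, like the uptime string here, is usually numeric data dressed up as text.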
Correlating Events
Change Management vs. Performance
What to monitor?
Most people don’t monitor the things that matter most
Monitor the Business
Financials:
Revenues. Costs. Margins. AR. Account delinquency.
Marketing:
Web analytics. Campaigns. Costs. Returns. Convergence.
Monitor the Support
Customer Service:
Problems. Time investment. Customer satisfaction. Resolution time.
Monitor the Engineering
Engineering:
Deployments. Test coverage. Bug reports. Bug fixes. Effort spent.
Operations:
Faults. Pages. Escalations. Provisioning time. Equipment defect rates. 3rd party failure rates.
Monitor the Service
Systems:
Networks. Systems. Storage.
Databases:
Performance. Error rates. Backups.
Middleware:
Herein lies the magic and room for awesomeness
Monitor the Middleware
Your systems are complex
Monitor their interactions
Messaging, APIs, etc.
Monitor all the things.
But, perhaps most importantly...
USE UNIFIED TOOLING
What we use...
reconnoiter
SNMP, nad, resmon, statsd, HTTP traps, jdbc, etc.
statsd (clients)
javascript beacons
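For a sense of what a statsd client does, here is a minimal sketch that formats metrics in the statsd line protocol (`name:value|type`) and sends them over UDP. The metric names are hypothetical, and the host and port (127.0.0.1:8125, the usual statsd default) are assumptions about the local setup, not details from the talk.

```python
import socket

def statsd_line(name, value, metric_type):
    """Format one metric in the statsd line protocol: name:value|type."""
    return f"{name}:{value}|{metric_type}"

def send_metric(line, host="127.0.0.1", port=8125):
    """Fire-and-forget a single metric line over UDP."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))

# Hypothetical metric names; "c" is a counter, "ms" a timer.
signup = statsd_line("site.user.signup", 1, "c")
api_time = statsd_line("api.service_time", 12, "ms")
print(signup)
print(api_time)
# send_metric(signup)  # uncomment to emit to a listening statsd daemon
```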
Middleware mix
API service times, traffic, user signup rates.
Thank you!