Mantis in Action - qconnewyork.com · Overview of scalable insights in action • 65-75 total jobs...

Mantis in Action Neeraj Joshi and Justin Becker 6/12/2015


Page 1:

Mantis in Action

Neeraj Joshi and Justin Becker

6/12/2015

Page 2:

Managing a complex operational environment is hard

Page 6:

Developing an understanding of what is going on

Knowing what works

Page 7:

Developing an understanding of what is going on

Identifying what doesn’t work

Page 8:

Developing an understanding of what is going on

Determining the impact when something doesn’t work

Page 9:

Developing a deep understanding is hard, due to complexity

• Hundreds of software services

• Processing billions of requests

• For millions of users

• Operating in multiple data centers

• Across the globe

Page 10:

To help us (operators) make sense of complexity, we need tools

Specifically, we need insight tools to help comprehend what is going on in our operational environments

Page 11:

Make the case that a relationship exists between complexity and comprehension

Page 12:

So, in order to manage complex environments, we need to rethink insights and shift the curve

Page 13:

Identified three insight ‘patterns’ to help shift the curve

• Long tail analysis

• Real time tracking and trending

• Ad hoc investigation

Page 14:

Grouped three patterns into a new effort

Page 15:

Scalable Insights Initiative

Page 16:

Goal is to help us manage (comprehend) our environments given an increase in complexity

Page 17:

Mention specific insight tools that help shift the curve

• Realtime Data Explorer

• Realtime Search

• Realtime Application Monitoring

Page 18:

Mention other generic insight jobs that help shift the curve

• Short term historical anomaly detection

• Threshold-based anomaly detection

• Realtime metrics generation

Page 19:

Overview of scalable insights in action

• 65-75 total jobs running in 3 regions

• Global access to data

• Processing 4.7 million events per second at peak

• 20 specific data sources and generic adapters

• Ability to start up jobs in ~5 seconds

Page 20:

Demo

Page 25:

Mantis, a reactive stream processing system

Page 26:

Some Basic Concepts

Page 27:

Stream

Sequence of Events

[Figure: events 1, 2, 3, 4 arriving along a time axis]

Page 28:

Higher Order Functions

Transformations applied to a Stream to create a new stream

[Figure: input stream 1, 4, 9, 16 on a time axis; map (x => sqrt(x)); output stream 1, 2, 3, 4 on a time axis]
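The map example on this slide can be sketched in a few lines. This is only an illustration using Python generators, not Mantis or RxJava code; the helper name `map_stream` is made up for the example.

```python
import math
from typing import Callable, Iterable, Iterator


def map_stream(fn: Callable[[float], float], events: Iterable[float]) -> Iterator[float]:
    """Apply fn to every event, producing a new stream (hypothetical helper)."""
    for event in events:
        yield fn(event)


# Input stream from the slide: 1, 4, 9, 16
input_stream = iter([1, 4, 9, 16])

# map (x => sqrt(x)) yields a new stream; the input stream is not modified in place.
output_stream = list(map_stream(math.sqrt, input_stream))
print(output_stream)
```

The function passed to `map_stream` is itself an argument, which is what makes the operator "higher order".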

Page 29:

Mantis Job

A sequence of functions applied to a stream

map, window, reduce, merge

[Figure: Stream IN enters a SOURCE, passes through Stage 1, Stage 2, up to Stage n, and leaves through a SINK as Result OUT]
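As a rough sketch of the source, stages, and sink idea, the snippet below chains three stage functions over a stream. It is plain Python and does not use the Mantis API; `map_stage`, `window_stage`, `reduce_stage`, and `run_job` are hypothetical names chosen for illustration.

```python
from functools import reduce
from typing import Callable, Iterable, Iterator, List


def map_stage(fn: Callable) -> Callable[[Iterable], Iterator]:
    """Stage that transforms each event."""
    def stage(events: Iterable) -> Iterator:
        for event in events:
            yield fn(event)
    return stage


def window_stage(size: int) -> Callable[[Iterable], Iterator[List]]:
    """Stage that groups events into count-based windows of `size`."""
    def stage(events: Iterable) -> Iterator[List]:
        batch: List = []
        for event in events:
            batch.append(event)
            if len(batch) == size:
                yield batch
                batch = []
        if batch:  # emit the final, possibly partial, window
            yield batch
    return stage


def reduce_stage(fn: Callable) -> Callable[[Iterable[List]], Iterator]:
    """Stage that collapses each window into a single value."""
    def stage(windows: Iterable[List]) -> Iterator:
        for window in windows:
            yield reduce(fn, window)
    return stage


def run_job(source: Iterable, stages: List[Callable], sink: Callable) -> None:
    """Feed the source through Stage 1..n in order and hand results to the sink."""
    stream = source
    for stage in stages:
        stream = stage(stream)
    for result in stream:
        sink(result)


results: List[int] = []
run_job(
    source=range(1, 7),                        # Stream IN: 1..6
    stages=[
        map_stage(lambda x: x * 10),           # Stage 1
        window_stage(3),                       # Stage 2
        reduce_stage(lambda a, b: a + b),      # Stage 3
    ],
    sink=results.append,                       # Result OUT
)
print(results)
```

Each stage only sees the stream produced by the previous one, so stages can be added or swapped without touching the source or the sink.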

Page 30:

Mantis Job

[Figure: Stream IN enters a SOURCE, passes through Stage 1, Stage 2, up to Stage n, and leaves through a SINK as Result OUT]

Page 31:

Named Jobs

[Figure: a job labeled with a JobName and a Version]

Page 32:

Are Parameterized

[Figure: a job with three parameters]

Page 33:

Have SLAs

• Min/Max instances of the job

• Min/Max runtime

• Perpetual or Transient
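The job properties on the last three slides (name and version, parameters, SLAs) could be modeled roughly as below. This is a guess at a data shape for illustration only; every class and field name here is an assumption, not the Mantis job definition format.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass(frozen=True)
class JobSla:
    """Hypothetical SLA record mirroring the bullets on the slide."""
    min_instances: int
    max_instances: int
    min_runtime_secs: int
    max_runtime_secs: Optional[int]  # None means no upper bound
    perpetual: bool                  # False means transient

    def allows(self, running_instances: int) -> bool:
        """True when the number of running instances is within the SLA."""
        return self.min_instances <= running_instances <= self.max_instances


@dataclass(frozen=True)
class NamedJob:
    """Hypothetical named, versioned, parameterized job."""
    name: str
    version: str
    parameters: Dict[str, str] = field(default_factory=dict)
    sla: Optional[JobSla] = None


sla = JobSla(min_instances=1, max_instances=3, min_runtime_secs=60,
             max_runtime_secs=None, perpetual=True)
job = NamedJob(name="TopNJob", version="1", parameters={"n": "10"}, sla=sla)
print(job.name, job.sla.allows(2), job.sla.allows(5))
```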

Page 34:

Can be chained

[Figure: chained jobs: DeviceLogs Source Job, ServerLogs Source Job, TopN Job, AnomalyDetector Job, MetricsAggregator Job, and AlertService]

Page 35:

That is fine but...

Page 36:

How does Mantis meet the Scalable Insights challenge?

Page 37:

Key Requirements

• Cost (Utilization) Sensitive

• Optimize for low latency

• High Throughput

• Resilient

Page 38:

Minimizing Costs

Page 39:

Elastic Clusters

Page 40:

Elastic Jobs

Page 41:

Job Autoscaling

Scaling Config

CPU Strategy

Network Strategy
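The slide names a scaling config with a CPU strategy and a network strategy but does not show how they are evaluated. The sketch below is one plausible reading, assuming each strategy is a pair of utilization thresholds; the thresholds, names, and the one-worker step size are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ScalingStrategy:
    """Hypothetical per-resource thresholds, as fractions of capacity."""
    scale_up_above: float
    scale_down_below: float


def desired_workers(current: int, usage: Dict[str, float],
                    strategies: Dict[str, ScalingStrategy],
                    min_workers: int, max_workers: int) -> int:
    """Scale up if any resource is hot, down only if all are idle, else hold."""
    if any(usage[name] > s.scale_up_above for name, s in strategies.items()):
        target = current + 1
    elif all(usage[name] < s.scale_down_below for name, s in strategies.items()):
        target = current - 1
    else:
        target = current
    # Never leave the configured min/max range.
    return max(min_workers, min(max_workers, target))


strategies = {
    "cpu": ScalingStrategy(scale_up_above=0.75, scale_down_below=0.25),
    "network": ScalingStrategy(scale_up_above=0.80, scale_down_below=0.30),
}

scaled_up = desired_workers(4, {"cpu": 0.90, "network": 0.40}, strategies, 2, 8)
scaled_down = desired_workers(4, {"cpu": 0.10, "network": 0.10}, strategies, 2, 8)
held = desired_workers(4, {"cpu": 0.50, "network": 0.10}, strategies, 2, 8)
print(scaled_up, scaled_down, held)
```

Scaling down only when every resource is idle is a conservative choice in this sketch: it avoids removing workers while one resource is still under load.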

Page 42:

Filtering at Source

SourceJob

Consumer Job

Data Producers
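A minimal sketch of the idea behind filtering at source, assuming the consumer job hands a predicate to the source job so that unwanted events are discarded before they are sent downstream. The function name and the event fields are invented for this example.

```python
from typing import Callable, Dict, Iterable, Iterator

Event = Dict[str, object]


def source_job(producer_events: Iterable[Event],
               predicate: Callable[[Event], bool]) -> Iterator[Event]:
    """Forward only the events the consumer asked for (hypothetical source job)."""
    for event in producer_events:
        if predicate(event):
            yield event


# Events from the data producers (made-up sample data).
producer_events = [
    {"device": "tv", "status": 200},
    {"device": "phone", "status": 500},
    {"device": "tv", "status": 500},
    {"device": "tablet", "status": 200},
]

# The consumer job is only interested in server errors.
def errors_only(event: Event) -> bool:
    return int(event["status"]) >= 500

received = list(source_job(producer_events, errors_only))
print(len(producer_events), "produced,", len(received), "delivered")
```

Only the matching events cross the link between the source job and the consumer job, which is where the cost saving would come from.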

Page 43:

Low latency - High Throughput

Page 44:

To block or not to block?

Page 45:

RxNetty (non-blocking) vs Tomcat (blocking)

by Brendan Gregg (@brendangregg)

https://docs.google.com/presentation/d/18i-d72m7tD4wKlzm-1PCR8g62l66_9Btbg5-fuFRqf0/edit#slide=id.g761289dab_0_77

Page 46:

CPU consumption

• RxNetty consumes less CPU per request

• Reduced thread migration

• Lower object allocation rate

CPU consumed decreases as load increases

Page 47:

Lower latency

• RxNetty has lower latency under high load

• Fewer lock contentions

• Fewer thread migrations

Latency knee for Tomcat ~400

Latency knee for Netty ~700

Page 48:

Async Processing

Non-blocking I/O

Async Processing

Page 49:

Designed for Resilience

http://signsofpolitics.blogspot.com/2009/03/around-and-about-resilience.html

Page 50:

Server Resilience

• Server crashes are inevitable

• Server health constantly monitored with heartbeats

• Crashed servers replaced

• Lost jobs relaunched
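The heartbeat-based health check in the list above can be sketched as follows. This is a simplified illustration, not how Mantis implements it; the class name, the timeout value, and the explicit timestamps are assumptions made to keep the example deterministic.

```python
from typing import Dict, List


class HeartbeatMonitor:
    """Tracks the last heartbeat per server and reports the ones that went silent."""

    def __init__(self, timeout_secs: float) -> None:
        self.timeout_secs = timeout_secs
        self.last_seen: Dict[str, float] = {}

    def heartbeat(self, server_id: str, now: float) -> None:
        """Record that server_id was alive at time `now` (seconds)."""
        self.last_seen[server_id] = now

    def dead_servers(self, now: float) -> List[str]:
        """Servers whose last heartbeat is older than the timeout."""
        return sorted(
            server_id
            for server_id, seen in self.last_seen.items()
            if now - seen > self.timeout_secs
        )


monitor = HeartbeatMonitor(timeout_secs=10.0)
monitor.heartbeat("server-a", now=0.0)
monitor.heartbeat("server-b", now=8.0)

# At t=12s server-a has been silent for 12s, server-b for only 4s.
crashed = monitor.dead_servers(now=12.0)
print(crashed)
```

A real system would then replace each crashed server and relaunch the jobs that were running on it; that step is left out here.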

Page 51:

Network Resilience

• Long-lived connections can fail

• Connection topology is constantly monitored and corrected

Page 52:

Backpressure

Page 53:

Cold Source

Amazon SQS

Page 54:

Hot Source

Page 55:

Reactive push-pull (Cold Source)

[Figure: a Cold Source feeding functions f1, f2, f3, f4; the links are labeled 1024 in one direction and 100 in the other]

Page 56:

Push mode

[Figure: a Cold Source feeding functions f1, f2, f3, f4; the links are labeled Max in one direction and 100 in the other]

Page 57:

Pull mode

[Figure: a Cold Source feeding functions f1, f2, f3, f4; the links are labeled 1 in both directions]
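One way to read the three cold-source figures is that one number on each link is the demand a stage signals upstream and the other is how many items actually flow back. Under that assumption, the sketch below shows a cold source that never emits more than was requested. The class and method names are hypothetical, and this is not RxJava or Mantis code.

```python
import sys
from itertools import islice
from typing import Iterable, List


class ColdSource:
    """A source that only produces items when asked, e.g. a queue being polled."""

    def __init__(self, items: Iterable[int]) -> None:
        self._items = iter(items)

    def request(self, n: int) -> List[int]:
        """Return at most n items; fewer if the source runs out."""
        return list(islice(self._items, n))


# Push-pull: ask for a bounded batch of 1024; only 100 items exist, so 100 arrive.
batched = ColdSource(range(100)).request(1024)

# Push mode: unbounded demand, the source sends everything it has.
pushed = ColdSource(range(100)).request(sys.maxsize)

# Pull mode: ask for one item at a time.
pull_source = ColdSource(range(100))
first = pull_source.request(1)
second = pull_source.request(1)

print(len(batched), len(pushed), first, second)
```

Because the source is cold, nothing is lost when demand is small: items that were not requested simply stay in the source until the next request.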

Page 58:

Backpressure Strategies (Hot Source)

[Figure: a Hot Source labeled 100 feeding functions f2, f3, f4, whose links are labeled 10; a Strategy Function is shown with the remaining 90]

• Drop

• Buffer

• Scale-up
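To make the numbers in the figure concrete, the sketch below assumes a hot source that emits 100 events while downstream can only take 10, leaving 90 for a strategy to handle. It shows the Drop and Buffer strategies only; Scale-up would mean adding consumers and is not modeled. The function names are invented for this example.

```python
from collections import deque
from typing import Deque, List, Tuple


def drop_excess(emitted: List[int], demand: int) -> Tuple[List[int], int]:
    """Drop strategy: deliver what downstream can take, discard the rest."""
    delivered = emitted[:demand]
    dropped = len(emitted) - len(delivered)
    return delivered, dropped


def buffer_excess(emitted: List[int], demand: int, buffer: Deque[int]) -> List[int]:
    """Buffer strategy: queue everything, then deliver up to `demand` in order.

    The buffer here is unbounded, which is a simplification; a real buffer
    would need a size limit and a fallback once it fills up.
    """
    buffer.extend(emitted)
    count = min(demand, len(buffer))
    return [buffer.popleft() for _ in range(count)]


events = list(range(100))  # the hot source emits 100 events

delivered, dropped = drop_excess(events, demand=10)

buffer: Deque[int] = deque()
first_batch = buffer_excess(events, demand=10, buffer=buffer)
backlog_after_first = len(buffer)
second_batch = buffer_excess([], demand=10, buffer=buffer)

print(len(delivered), dropped, backlog_after_first, second_batch[0])
```

Dropping keeps latency low at the cost of losing data, while buffering keeps the data at the cost of memory and delay.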

Page 59:

Questions