Downtime is not an option - day 2 operations - Jörg Schad
![Page 1: Downtime is not an option - day 2 operations - Jörg Schad](https://reader034.fdocuments.net/reader034/viewer/2022042723/5a6479b07f8b9a3b568b47cb/html5/thumbnails/1.jpg)
Downtime is not an option
How Fast Data and Microservices change the datacenter
@joerg_schad @dcos
![Page 2: Downtime is not an option - day 2 operations - Jörg Schad](https://reader034.fdocuments.net/reader034/viewer/2022042723/5a6479b07f8b9a3b568b47cb/html5/thumbnails/2.jpg)
© 2017 Mesosphere, Inc. All Rights Reserved. 2
Jörg Schad, Distributed Systems Engineer
@joerg_schad
![Page 3: Downtime is not an option - day 2 operations - Jörg Schad](https://reader034.fdocuments.net/reader034/viewer/2022042723/5a6479b07f8b9a3b568b47cb/html5/thumbnails/3.jpg)
© 2017 Mesosphere, Inc. All Rights Reserved. 3
In the beginning there was a big
Monolith
![Page 4: Downtime is not an option - day 2 operations - Jörg Schad](https://reader034.fdocuments.net/reader034/viewer/2022042723/5a6479b07f8b9a3b568b47cb/html5/thumbnails/4.jpg)
© 2016 Mesosphere, Inc. All Rights Reserved. 4
![Page 5: Downtime is not an option - day 2 operations - Jörg Schad](https://reader034.fdocuments.net/reader034/viewer/2022042723/5a6479b07f8b9a3b568b47cb/html5/thumbnails/5.jpg)
COMPUTERS
[Diagram: Application on top of Operating System on top of Hardware]
![Page 6: Downtime is not an option - day 2 operations - Jörg Schad](https://reader034.fdocuments.net/reader034/viewer/2022042723/5a6479b07f8b9a3b568b47cb/html5/thumbnails/6.jpg)
OVERVIEW
Microservice (noun, /ˈmīkrō ˈsərvəs/): an approach to application development in which a large application is built as a suite of modular services. Each module supports a specific business goal and uses a simple, well-defined interface to communicate with other modules.*
Microservices are designed to be flexible, resilient, efficient, robust, and individually scalable.
*From whatis.com
![Page 7: Downtime is not an option - day 2 operations - Jörg Schad](https://reader034.fdocuments.net/reader034/viewer/2022042723/5a6479b07f8b9a3b568b47cb/html5/thumbnails/7.jpg)
MICROSERVICES
- Polyglot
- Single Responsibility
- Smaller Teams
- Utilization
- Machine types/groups
- Dependency hell
[Diagram: infrastructure of three machines, each with its own operating system running a mix of apps and services]
![Page 8: Downtime is not an option - day 2 operations - Jörg Schad](https://reader034.fdocuments.net/reader034/viewer/2022042723/5a6479b07f8b9a3b568b47cb/html5/thumbnails/8.jpg)
© Gerard Julien/AFP
Run everything in containers!
![Page 9: Downtime is not an option - day 2 operations - Jörg Schad](https://reader034.fdocuments.net/reader034/viewer/2022042723/5a6479b07f8b9a3b568b47cb/html5/thumbnails/9.jpg)
CONTAINERS
- Rapid deployment
- Dependency vendoring
- Container image repositories
- Spreadsheet scheduling
[Diagram: infrastructure of three machines, each with an OS and a container runtime hosting apps and services]
![Page 10: Downtime is not an option - day 2 operations - Jörg Schad](https://reader034.fdocuments.net/reader034/viewer/2022042723/5a6479b07f8b9a3b568b47cb/html5/thumbnails/10.jpg)
CONTAINER ORCHESTRATION
CONTAINER SCHEDULING
RESOURCE MANAGEMENT
SERVICE MANAGEMENT
- Load Balancing
- Readiness Checking
![Page 11: Downtime is not an option - day 2 operations - Jörg Schad](https://reader034.fdocuments.net/reader034/viewer/2022042723/5a6479b07f8b9a3b568b47cb/html5/thumbnails/11.jpg)
CONTAINER ORCHESTRATION
CONTAINER SCHEDULING
- Placement
- Replication/Scaling
- Resurrection
- Rescheduling
- Rolling Deployment
- Upgrades
- Downgrades
- Collocation
RESOURCE MANAGEMENT
- Memory
- CPU
- GPU
- Volumes
- Ports
- IPs
- Images/Artifacts
SERVICE MANAGEMENT
- Labels
- Groups/Namespaces
- Dependencies
- Load Balancing
- Readiness Checking
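To make "Placement" under resource constraints concrete, here is a minimal first-fit sketch. It is an illustration only: the `Agent` class, the `place` function, and all numbers are hypothetical and are not how Mesos or Marathon implement placement.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Agent:
    """Hypothetical view of one machine's free resources."""
    name: str
    cpus: float
    mem_mb: int
    tasks: List[str] = field(default_factory=list)


def place(task: str, cpus: float, mem_mb: int, agents: List[Agent]) -> Optional[str]:
    """First-fit placement: use the first agent with enough free CPU and memory."""
    for agent in agents:
        if agent.cpus >= cpus and agent.mem_mb >= mem_mb:
            # Reserve the resources so later placements see the reduced capacity.
            agent.cpus -= cpus
            agent.mem_mb -= mem_mb
            agent.tasks.append(task)
            return agent.name
    return None  # nothing fits; a real scheduler would wait or reschedule


agents = [Agent("agent-1", cpus=2.0, mem_mb=4096), Agent("agent-2", cpus=4.0, mem_mb=8192)]
placements = [place(f"web-{i}", 1.5, 2048, agents) for i in range(4)]
```

Scaling to four replicas here fills both agents and leaves the last replica unplaced, which is the situation resource management has to surface.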
![Page 12: Downtime is not an option - day 2 operations - Jörg Schad](https://reader034.fdocuments.net/reader034/viewer/2022042723/5a6479b07f8b9a3b568b47cb/html5/thumbnails/12.jpg)
CONTAINER ORCHESTRATION
[Diagram: layered stack on machine infrastructure. Web Apps & Services sit on an orchestration layer (Service Management, Scheduling, Resource Management), which sits on a Container Runtime and Machine & OS, repeated across three machines]
![Page 13: Downtime is not an option - day 2 operations - Jörg Schad](https://reader034.fdocuments.net/reader034/viewer/2022042723/5a6479b07f8b9a3b568b47cb/html5/thumbnails/13.jpg)
© 2017 Mesosphere, Inc. All Rights Reserved. 13
MapReduce is crunching Data
Meanwhile...
![Page 14: Downtime is not an option - day 2 operations - Jörg Schad](https://reader034.fdocuments.net/reader034/viewer/2022042723/5a6479b07f8b9a3b568b47cb/html5/thumbnails/14.jpg)
© 2016 Mesosphere, Inc. All Rights Reserved. 14
But then business demanded
FAST DATA
We need to turn faster!
![Page 15: Downtime is not an option - day 2 operations - Jörg Schad](https://reader034.fdocuments.net/reader034/viewer/2022042723/5a6479b07f8b9a3b568b47cb/html5/thumbnails/15.jpg)
Fast Data
Processing styles: Batch, Micro-Batch, Event Processing
Latency scale: Days, Hours, Minutes, Seconds, Microseconds
- Reports what has happened using descriptive analytics
- Solves problems using predictive and prescriptive analytics
Example use cases: Billing, Chargeback; Product recommendations; Real-time Pricing and Routing; Real-time Advertising; Predictive User Interface
![Page 16: Downtime is not an option - day 2 operations - Jörg Schad](https://reader034.fdocuments.net/reader034/viewer/2022042723/5a6479b07f8b9a3b568b47cb/html5/thumbnails/16.jpg)
The SMACK Stack
EVENTS: ubiquitous data streams from connected devices (Sensors, Devices, Clients)
- INGEST: Apache Kafka. Ingest millions of events per second
- STORE: Apache Cassandra. Distributed & highly scalable database
- ANALYZE: Apache Spark. Real-time and batch process data
- ACT: Akka. Visualize data and build data driven applications
All running on Mesos / DC/OS
![Page 17: Downtime is not an option - day 2 operations - Jörg Schad](https://reader034.fdocuments.net/reader034/viewer/2022042723/5a6479b07f8b9a3b568b47cb/html5/thumbnails/17.jpg)
© 2017 Mesosphere, Inc. All Rights Reserved. 17
Datacenter
![Page 18: Downtime is not an option - day 2 operations - Jörg Schad](https://reader034.fdocuments.net/reader034/viewer/2022042723/5a6479b07f8b9a3b568b47cb/html5/thumbnails/18.jpg)
NAIVE APPROACH
Typical Datacenter: siloed, over-provisioned servers, low utilization
Industry Average: 12-15% utilization
[Diagram: separate server silos for MySQL, microservice, Cassandra, Spark/Hadoop, Kafka]
![Page 19: Downtime is not an option - day 2 operations - Jörg Schad](https://reader034.fdocuments.net/reader034/viewer/2022042723/5a6479b07f8b9a3b568b47cb/html5/thumbnails/19.jpg)
© 2017 Mesosphere, Inc. All Rights Reserved. 19
![Page 20: Downtime is not an option - day 2 operations - Jörg Schad](https://reader034.fdocuments.net/reader034/viewer/2022042723/5a6479b07f8b9a3b568b47cb/html5/thumbnails/20.jpg)
MULTIPLEXING OF DATA, SERVICES, USERS, ENVIRONMENTS
Typical Datacenter: siloed, over-provisioned servers, low utilization
Mesos / DC/OS: automated schedulers, workload multiplexing onto the same machines
[Diagram: MySQL, microservice, Cassandra, Spark/Hadoop, and Kafka workloads sharing the same machines]
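Why multiplexing raises utilization can be shown with a little arithmetic. The sketch below compares sizing each silo for its own peak against sizing one shared pool for the combined peak. The demand figures are invented for illustration and the function names are hypothetical.

```python
# Hypothetical demand in cores for three workloads over four time periods.
demand = {
    "kafka":        [10, 40, 10, 10],
    "spark":        [10, 10, 50, 10],
    "microservice": [30, 10, 10, 10],
}


def siloed_capacity(demand):
    """Each workload gets its own servers, sized for its own peak."""
    return sum(max(series) for series in demand.values())


def multiplexed_capacity(demand):
    """All workloads share machines, sized for the peak of the combined load."""
    return max(sum(period) for period in zip(*demand.values()))


def utilization(demand, capacity):
    """Average cores in use divided by cores provisioned."""
    periods = len(next(iter(demand.values())))
    average_used = sum(sum(series) for series in demand.values()) / periods
    return average_used / capacity


siloed = siloed_capacity(demand)          # 40 + 50 + 30
shared = multiplexed_capacity(demand)     # busiest combined period
```

Because the workloads peak at different times, the shared pool needs fewer cores for the same work, so the same average load yields a higher utilization figure.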
![Page 21: Downtime is not an option - day 2 operations - Jörg Schad](https://reader034.fdocuments.net/reader034/viewer/2022042723/5a6479b07f8b9a3b568b47cb/html5/thumbnails/21.jpg)
Apache Mesos
• A top-level Apache project
• A cluster resource negotiator
• Scalable to 10,000s of nodes
• Fault-tolerant, battle-tested
• An SDK for distributed apps
• Native Docker support
![Page 22: Downtime is not an option - day 2 operations - Jörg Schad](https://reader034.fdocuments.net/reader034/viewer/2022042723/5a6479b07f8b9a3b568b47cb/html5/thumbnails/22.jpg)
© 2017 Mesosphere, Inc. All Rights Reserved. 22
![Page 23: Downtime is not an option - day 2 operations - Jörg Schad](https://reader034.fdocuments.net/reader034/viewer/2022042723/5a6479b07f8b9a3b568b47cb/html5/thumbnails/23.jpg)
DC/OS ENABLES MODERN DISTRIBUTED APPS
Modern App Components:
- Microservices (in containers)
- Big Data + Analytics Engines
- Streaming, Batch, Machine Learning, Analytics, Functions & Logic, Search, Time Series, SQL / NoSQL Databases
Running on:
- Datacenter Operating System (DC/OS)
- Distributed Systems Kernel (Mesos)
- Any Infrastructure (Physical, Virtual, Cloud)
![Page 24: Downtime is not an option - day 2 operations - Jörg Schad](https://reader034.fdocuments.net/reader034/viewer/2022042723/5a6479b07f8b9a3b568b47cb/html5/thumbnails/24.jpg)
THE BASICS
DC/OS is …
● 100% open source (ASL2.0)
  + A big, diverse community
● An umbrella for ~30 OSS projects
  + Roadmap and designs
  + Docs and tutorials
● Not limited in any way
● Familiar, with more features
  + Networking, Security, CLI, UI, Service Discovery, Load Balancing, Packages, ...
![Page 25: Downtime is not an option - day 2 operations - Jörg Schad](https://reader034.fdocuments.net/reader034/viewer/2022042723/5a6479b07f8b9a3b568b47cb/html5/thumbnails/25.jpg)
© 2016 Mesosphere, Inc. All Rights Reserved.
Container Options
Enhancements to the Mesos containerizer to support launching specific container formats (Docker, AppC, OCI (future), etc.)
● Reduces need to maintain and update multiple containerizers
● Support multiple container formats with a single containerizer
Image provisioner component added to the Mesos containerizer - responsible for pulling, caching, and preparing container root filesystems
Universal containerizer components:
- Launcher: process management
- Isolators: container lifecycle hook
- Provisioner: container image support
![Page 26: Downtime is not an option - day 2 operations - Jörg Schad](https://reader034.fdocuments.net/reader034/viewer/2022042723/5a6479b07f8b9a3b568b47cb/html5/thumbnails/26.jpg)
© 2016 Mesosphere, Inc. All Rights Reserved. 26
DEMO
![Page 27: Downtime is not an option - day 2 operations - Jörg Schad](https://reader034.fdocuments.net/reader034/viewer/2022042723/5a6479b07f8b9a3b568b47cb/html5/thumbnails/27.jpg)
© 2016 Mesosphere, Inc. All Rights Reserved. 27
GEO-ENABLED IoT
![Page 28: Downtime is not an option - day 2 operations - Jörg Schad](https://reader034.fdocuments.net/reader034/viewer/2022042723/5a6479b07f8b9a3b568b47cb/html5/thumbnails/28.jpg)
© 2016 Mesosphere, Inc. All Rights Reserved. 28
DATA FLOW
![Page 29: Downtime is not an option - day 2 operations - Jörg Schad](https://reader034.fdocuments.net/reader034/viewer/2022042723/5a6479b07f8b9a3b568b47cb/html5/thumbnails/29.jpg)
© 2017 Mesosphere, Inc. All Rights Reserved. 29
Keep it running!
![Page 30: Downtime is not an option - day 2 operations - Jörg Schad](https://reader034.fdocuments.net/reader034/viewer/2022042723/5a6479b07f8b9a3b568b47cb/html5/thumbnails/30.jpg)
DAY 2 OPERATIONS
Monitoring
- Collecting metrics
- Routing events
- Downstream processing
  - Alerting
  - Dashboards
  - Storage (long-term retention)
Logging
- Scopes
- Local vs. Central
- Security considerations
![Page 31: Downtime is not an option - day 2 operations - Jörg Schad](https://reader034.fdocuments.net/reader034/viewer/2022042723/5a6479b07f8b9a3b568b47cb/html5/thumbnails/31.jpg)
DAY 2 OPERATIONS
Maintenance
- Cluster Upgrades
- Cluster Resizing
- Capacity Planning
- User & Package Management
- Networking Policies
- Auditing
- Backups & Disaster Recovery
Troubleshooting
- Debugging
  - Services
  - System
  - Access?
- Tracing
- Chaos Engineering
![Page 32: Downtime is not an option - day 2 operations - Jörg Schad](https://reader034.fdocuments.net/reader034/viewer/2022042723/5a6479b07f8b9a3b568b47cb/html5/thumbnails/32.jpg)
© 2017 Mesosphere, Inc. All Rights Reserved. 32
Troubleshooting
● Services: typically specific to the service; use logging (for example, dcos task log) and dcos node ssh for per-node investigations
● dcos task exec
  ○ Permissions?
● System:
  ○ Simple diagnostics via dcos node diagnostics
  ○ Comprehensive dump via clump
![Page 33: Downtime is not an option - day 2 operations - Jörg Schad](https://reader034.fdocuments.net/reader034/viewer/2022042723/5a6479b07f8b9a3b568b47cb/html5/thumbnails/33.jpg)
© 2016 Mesosphere, Inc. All Rights Reserved. 33
THANK YOU!
ANY QUESTIONS?
@dcos
/groups/8295652
/dcos/dcos/examples/dcos/demos
chat.dcos.io
![Page 34: Downtime is not an option - day 2 operations - Jörg Schad](https://reader034.fdocuments.net/reader034/viewer/2022042723/5a6479b07f8b9a3b568b47cb/html5/thumbnails/34.jpg)
© 2017 Mesosphere, Inc. All Rights Reserved. 34
Failures
[Diagram: a framework scheduler, three masters (one LEADER, two STANDBY) backed by three ZooKeeper (ZK) nodes, and two agents each running an executor with a task]
![Page 35: Downtime is not an option - day 2 operations - Jörg Schad](https://reader034.fdocuments.net/reader034/viewer/2022042723/5a6479b07f8b9a3b568b47cb/html5/thumbnails/35.jpg)
© 2016 Mesosphere, Inc. All Rights Reserved.
Distributed Systems could be so easy...
1. The network is reliable.
2. Latency is zero.
3. Bandwidth is infinite.
4. The network is secure.
5. Topology doesn't change.
6. There is one administrator.
7. Transport cost is zero.
8. The network is homogeneous.
*) https://en.wikipedia.org/wiki/Fallacies_of_distributed_computing
![Page 36: Downtime is not an option - day 2 operations - Jörg Schad](https://reader034.fdocuments.net/reader034/viewer/2022042723/5a6479b07f8b9a3b568b47cb/html5/thumbnails/36.jpg)
© 2017 Mesosphere, Inc. All Rights Reserved. 36
Questions?
Code: https://git.io/vXUoy
http://grnh.se/ie76ru
![Page 37: Downtime is not an option - day 2 operations - Jörg Schad](https://reader034.fdocuments.net/reader034/viewer/2022042723/5a6479b07f8b9a3b568b47cb/html5/thumbnails/37.jpg)
© 2015 Mesosphere, Inc. All Rights Reserved. 37
Monitoring
![Page 38: Downtime is not an option - day 2 operations - Jörg Schad](https://reader034.fdocuments.net/reader034/viewer/2022042723/5a6479b07f8b9a3b568b47cb/html5/thumbnails/38.jpg)
© 2016 Mesosphere, Inc. All Rights Reserved.
METRICS
Measurements captured to determine health and performance of cluster
- How utilized is the cluster?
- Are resources being optimally used?
- Is the system performing better or worse over time?
- Are there bottlenecks in the system?
- What is the response time of applications?
![Page 39: Downtime is not an option - day 2 operations - Jörg Schad](https://reader034.fdocuments.net/reader034/viewer/2022042723/5a6479b07f8b9a3b568b47cb/html5/thumbnails/39.jpg)
© 2016 Mesosphere, Inc. All Rights Reserved.
DC/OS METRIC SOURCES
● Mesos metrics
  ○ Resource, frameworks, masters, agents, tasks, system, events
● Container Metrics
  ○ CPU, mem, disk, network
● Application Metrics
  ○ QPS, latency, response time, hits, active users, errors
[Diagram: apps running in containers, on Mesos, on the OS]
![Page 40: Downtime is not an option - day 2 operations - Jörg Schad](https://reader034.fdocuments.net/reader034/viewer/2022042723/5a6479b07f8b9a3b568b47cb/html5/thumbnails/40.jpg)
© 2016 Mesosphere, Inc. All Rights Reserved.
UPGRADE PROCEDURE
Before upgrading
1. Make sure cluster is healthy!
2. Perform backup
   a. ZK
   b. Replicated logs
   c. other state
3. Review release notes
4. Generate install bundle
   a. Validate versions
[Diagram: a framework scheduler, three masters (LEADER, STANDBY, STANDBY) with three ZK nodes, and two agents each running an executor with a task]
![Page 41: Downtime is not an option - day 2 operations - Jörg Schad](https://reader034.fdocuments.net/reader034/viewer/2022042723/5a6479b07f8b9a3b568b47cb/html5/thumbnails/41.jpg)
© 2016 Mesosphere, Inc. All Rights Reserved.
MESOS MASTER METRICS
● Metrics for the master node are available at the following URL:
  ○ http://<mesos-master-ip>/mesos/master/metrics/snapshot
  ○ The response is a JSON object that contains metric names and values as key-value pairs.
● Metric Groups:
  ○ Resources
  ○ Master
  ○ System
  ○ Slaves
  ○ Frameworks
  ○ Tasks
  ○ Messages
  ○ Event Queue
  ○ Registrar
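Since the snapshot is a flat JSON object of name/value pairs, a first processing step might be to bucket metrics by the part of the name before the first slash. The sketch below assumes names such as master/uptime_secs; the sample payload is made up, `group_metrics` is a hypothetical helper, and the prefix buckets are a simplification that does not reproduce the metric groups listed above.

```python
import json
from collections import defaultdict


def group_metrics(snapshot_json: str) -> dict:
    """Bucket a flat metrics snapshot by the prefix before the first '/'."""
    snapshot = json.loads(snapshot_json)
    groups = defaultdict(dict)
    for name, value in snapshot.items():
        prefix, _, rest = name.partition("/")
        groups[prefix][rest] = value
    return dict(groups)


# Illustrative payload only; a real snapshot contains many more entries.
sample = '{"master/uptime_secs": 3600.5, "master/elected": 1, "system/load_1min": 0.4}'
grouped = group_metrics(sample)
```

In practice the JSON string would come from an HTTP GET against the snapshot URL; that call is left out so the sketch runs without a cluster.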
![Page 42: Downtime is not an option - day 2 operations - Jörg Schad](https://reader034.fdocuments.net/reader034/viewer/2022042723/5a6479b07f8b9a3b568b47cb/html5/thumbnails/42.jpg)
© 2016 Mesosphere, Inc. All Rights Reserved.
MESOS MASTER BASIC ALERTS
| Metric | Value | Inference |
| --- | --- | --- |
| master/uptime_secs | is low | The master has restarted |
| master/uptime_secs | < 60 for sustained periods of time | The cluster has a flapping master node |
| master/tasks_lost | is increasing rapidly | Tasks in the cluster are disappearing. Possible causes include hardware failures, bugs in one of the frameworks, or bugs in Mesos |
| master/slaves_active | is low | Agents are having trouble connecting to the master |
| master/cpus_percent | > 0.9 for sustained periods of time | DC/OS cluster CPU utilization is close to capacity |
| master/mem_percent | > 0.9 for sustained periods of time | DC/OS cluster memory utilization is close to capacity |
| master/disk_used & master/disk_percent | | DC/OS disk space consumed by reservations |
| master/elected | is 0 for sustained periods of time | No master is currently elected |
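A few of these rules can be sketched as a check over one metrics snapshot. The metric names and the 60-second and 0.9 thresholds come from the slide; the function name, the `min_active_agents` parameter, and the alert strings are hypothetical. The slide's "sustained periods of time" would need several samples over time, which this single-snapshot sketch does not model.

```python
from typing import Dict, List


def basic_alerts(metrics: Dict[str, float], min_active_agents: int = 1) -> List[str]:
    """Evaluate one snapshot against some of the basic master alert rules."""
    alerts = []
    if metrics.get("master/elected", 0) == 0:
        alerts.append("no master elected")
    if metrics.get("master/uptime_secs", float("inf")) < 60:
        alerts.append("master recently restarted")
    # "is low" has no threshold on the slide; min_active_agents is an assumption.
    if metrics.get("master/slaves_active", min_active_agents) < min_active_agents:
        alerts.append("few active agents")
    if metrics.get("master/cpus_percent", 0.0) > 0.9:
        alerts.append("CPU near capacity")
    if metrics.get("master/mem_percent", 0.0) > 0.9:
        alerts.append("memory near capacity")
    return alerts


healthy = {"master/elected": 1, "master/uptime_secs": 86400,
           "master/slaves_active": 5, "master/cpus_percent": 0.4, "master/mem_percent": 0.5}
stressed = dict(healthy, **{"master/uptime_secs": 30, "master/cpus_percent": 0.95})
```

A real alerting setup would feed such rules from a time series store and fire only when a condition holds across a window.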
![Page 43: Downtime is not an option - day 2 operations - Jörg Schad](https://reader034.fdocuments.net/reader034/viewer/2022042723/5a6479b07f8b9a3b568b47cb/html5/thumbnails/43.jpg)
© 2016 Mesosphere, Inc. All Rights Reserved. 43
Operations
UPGRADES
![Page 44: Downtime is not an option - day 2 operations - Jörg Schad](https://reader034.fdocuments.net/reader034/viewer/2022042723/5a6479b07f8b9a3b568b47cb/html5/thumbnails/44.jpg)
© 2016 Mesosphere, Inc. All Rights Reserved.
UPGRADE PROCEDURE
1. Master rolling upgrade
   a. Start with standby
   b. Install new DC/OS
2. Agent rolling upgrade
3. Framework upgrades
[Diagram: a framework scheduler, three masters (LEADER, STANDBY, STANDBY) with three ZK nodes, and two agents each running an executor with a task]
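The "start with standby" rule can be sketched as an ordering step followed by a one-at-a-time loop. Upgrading the leader last and gating each step on a health check are assumptions added here, not steps stated on the slide; the `upgrade` and `healthy` callables are hypothetical stand-ins for the real install and verification commands.

```python
from typing import Callable, List, Tuple


def rolling_upgrade_order(masters: List[Tuple[str, str]]) -> List[str]:
    """Order (hostname, role) pairs so standbys come first and the leader last."""
    standbys = [host for host, role in masters if role == "standby"]
    leaders = [host for host, role in masters if role == "leader"]
    return standbys + leaders


def rolling_upgrade(masters: List[Tuple[str, str]],
                    upgrade: Callable[[str], None],
                    healthy: Callable[[str], bool]) -> List[str]:
    """Upgrade one master at a time; stop as soon as a node fails its check."""
    done = []
    for host in rolling_upgrade_order(masters):
        upgrade(host)
        if not healthy(host):
            raise RuntimeError(f"{host} unhealthy after upgrade; stopping rollout")
        done.append(host)
    return done


masters = [("m1", "leader"), ("m2", "standby"), ("m3", "standby")]
order = rolling_upgrade_order(masters)
```

Stopping at the first unhealthy node keeps a quorum of untouched masters available while the problem is investigated.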
![Page 45: Downtime is not an option - day 2 operations - Jörg Schad](https://reader034.fdocuments.net/reader034/viewer/2022042723/5a6479b07f8b9a3b568b47cb/html5/thumbnails/45.jpg)
© 2016 Mesosphere, Inc. All Rights Reserved.
UPGRADE PROCEDURE
1. Master rolling upgrade
2. Agent rolling upgrade
   a. Uninstall DC/OS
   b. Install new DC/OS
3. Framework upgrades
[Diagram: a framework scheduler, three masters (LEADER, STANDBY, STANDBY) with three ZK nodes, and two agents each running an executor with a task]
![Page 46: Downtime is not an option - day 2 operations - Jörg Schad](https://reader034.fdocuments.net/reader034/viewer/2022042723/5a6479b07f8b9a3b568b47cb/html5/thumbnails/46.jpg)
© 2016 Mesosphere, Inc. All Rights Reserved.
UPGRADE PROCEDURE
1. Master rolling upgrade
2. Agent rolling upgrade
3. Framework upgrades
   a. Orthogonal to DC/OS
   b. Ensure changes don’t affect existing apps
[Diagram: a framework scheduler, three masters (LEADER, STANDBY, STANDBY) with three ZK nodes, and two agents each running an executor with a task]
![Page 47: Downtime is not an option - day 2 operations - Jörg Schad](https://reader034.fdocuments.net/reader034/viewer/2022042723/5a6479b07f8b9a3b568b47cb/html5/thumbnails/47.jpg)
© 2015 Mesosphere, Inc. All Rights Reserved. 47
Failure Handling
![Page 48: Downtime is not an option - day 2 operations - Jörg Schad](https://reader034.fdocuments.net/reader034/viewer/2022042723/5a6479b07f8b9a3b568b47cb/html5/thumbnails/48.jpg)
© 2016 Mesosphere, Inc. All Rights Reserved. 48
Failure Handling
MESOS TASK FAILURE
![Page 49: Downtime is not an option - day 2 operations - Jörg Schad](https://reader034.fdocuments.net/reader034/viewer/2022042723/5a6479b07f8b9a3b568b47cb/html5/thumbnails/49.jpg)
MESOS TASK FAILURE
[Diagram sequence: ZK, master, Marathon, a client, and three agents; one agent runs an executor with a task]
1. The task crashes (SEGFAULT) and status updates are propagated toward Marathon.
2. Launch Task messages are propagated down to an agent to start a replacement.
3. Status updates for the new task are propagated back.
![Page 53: Downtime is not an option - day 2 operations - Jörg Schad](https://reader034.fdocuments.net/reader034/viewer/2022042723/5a6479b07f8b9a3b568b47cb/html5/thumbnails/53.jpg)
© 2016 Mesosphere, Inc. All Rights Reserved. 53
Failure Handling
MESOS AGENT FAILURE
![Page 54: Downtime is not an option - day 2 operations - Jörg Schad](https://reader034.fdocuments.net/reader034/viewer/2022042723/5a6479b07f8b9a3b568b47cb/html5/thumbnails/54.jpg)
LOCAL AGENT FAILURE
[Diagram sequence, five frames: ZK, master, Marathon, a client, and three agents; one agent runs an executor with a task. In the final frame the agent re-registers.]
![Page 59: Downtime is not an option - day 2 operations - Jörg Schad](https://reader034.fdocuments.net/reader034/viewer/2022042723/5a6479b07f8b9a3b568b47cb/html5/thumbnails/59.jpg)
© 2016 Mesosphere, Inc. All Rights Reserved. 59
Failure Handling
MESOS HOST FAILURE
![Page 60: Downtime is not an option - day 2 operations - Jörg Schad](https://reader034.fdocuments.net/reader034/viewer/2022042723/5a6479b07f8b9a3b568b47cb/html5/thumbnails/60.jpg)
[Diagram sequence: ZK, master, Marathon, a client, and three agents; one agent runs an executor with a task]
1. The host is lost, leaving two agents.
2. A status update is sent.
3. A resource offer is made and Launch Task messages are propagated to a remaining agent.
4. Status updates for the new task are propagated back.
![Page 65: Downtime is not an option - day 2 operations - Jörg Schad](https://reader034.fdocuments.net/reader034/viewer/2022042723/5a6479b07f8b9a3b568b47cb/html5/thumbnails/65.jpg)
© 2016 Mesosphere, Inc. All Rights Reserved. 65
Failure Handling
MESOS MASTER FAILURE
![Page 66: Downtime is not an option - day 2 operations - Jörg Schad](https://reader034.fdocuments.net/reader034/viewer/2022042723/5a6479b07f8b9a3b568b47cb/html5/thumbnails/66.jpg)
MASTER FAILURE
[Diagram sequence: ZK, master, Marathon, a client, and three agents; one agent runs an executor with a task]
1. The leading master fails.
2. The new Leading Master is announced to Marathon and the agents.
3. Marathon and the agents send Reregister to the new master.
4. The master answers each with Reregistered.
![Page 72: Downtime is not an option - day 2 operations - Jörg Schad](https://reader034.fdocuments.net/reader034/viewer/2022042723/5a6479b07f8b9a3b568b47cb/html5/thumbnails/72.jpg)
© 2016 Mesosphere, Inc. All Rights Reserved. 72
Failure Handling
SCHEDULER FAILURE
![Page 73: Downtime is not an option - day 2 operations - Jörg Schad](https://reader034.fdocuments.net/reader034/viewer/2022042723/5a6479b07f8b9a3b568b47cb/html5/thumbnails/73.jpg)
SCHEDULER FAILURE
[Diagram sequence: ZK, master, Marathon, a client, and three agents; one agent runs an executor with a task]
1. The scheduler (Marathon) fails.
2. It looks up its Framework ID and the Leading Master.
3. It sends Reregister to the master.
4. The master answers with Reregistered.
5. The scheduler issues Reconcile Tasks and receives Status Updates.
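The idea behind the final reconciliation step can be sketched as a comparison between what the scheduler believes is running and what the cluster reports. This is a simplified model, not the Mesos scheduler API: the `reconcile` function and the task names are hypothetical, and only the TASK_RUNNING and TASK_LOST state names are taken from Mesos.

```python
from typing import Dict, List

RUNNING = "TASK_RUNNING"


def reconcile(expected: Dict[str, str], reported: Dict[str, str]) -> Dict[str, List[str]]:
    """Compare the scheduler's view of task states with the reported states."""
    # Tasks the scheduler expects to be running but that are not reported as running.
    relaunch = sorted(task for task, state in expected.items()
                      if state == RUNNING and reported.get(task) != RUNNING)
    # Tasks the cluster reports but the scheduler does not know about.
    unknown = sorted(task for task in reported if task not in expected)
    return {"relaunch": relaunch, "unknown": unknown}


expected = {"web-1": RUNNING, "web-2": RUNNING}
reported = {"web-1": RUNNING, "web-2": "TASK_LOST", "orphan-9": RUNNING}
result = reconcile(expected, reported)
```

After a restart the scheduler cannot trust its last known state, so a comparison like this decides which tasks to relaunch and which to investigate or kill.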
![Page 80: Downtime is not an option - day 2 operations - Jörg Schad](https://reader034.fdocuments.net/reader034/viewer/2022042723/5a6479b07f8b9a3b568b47cb/html5/thumbnails/80.jpg)
© 2016 Mesosphere, Inc. All Rights Reserved.
UPGRADE PROCEDURE
1. Master rolling upgrade
   a. Start with standby
   b. Uninstall DC/OS
   c. Install new DC/OS
2. Agent rolling upgrade
3. Framework upgrades
[Diagram: a framework scheduler, three masters (LEADER, STANDBY, STANDBY) with three ZK nodes, and two agents each running an executor with a task]
![Page 83: Downtime is not an option - day 2 operations - Jörg Schad](https://reader034.fdocuments.net/reader034/viewer/2022042723/5a6479b07f8b9a3b568b47cb/html5/thumbnails/83.jpg)
© 2016 Mesosphere, Inc. All Rights Reserved.
FUTURES (TBD)
Leverage maintenance primitives in Mesos to drain host
Upgrade management through DC/OS to perform rolling upgrades
![Page 85: Downtime is not an option - day 2 operations - Jörg Schad](https://reader034.fdocuments.net/reader034/viewer/2022042723/5a6479b07f8b9a3b568b47cb/html5/thumbnails/85.jpg)
© 2016 Mesosphere, Inc. All Rights Reserved. 85
DAY 2 OPERATIONS
Maintenance
- Cluster Upgrades
- Cluster Resizing
- Capacity Planning
- User & Package Management
- Networking Policies
- Auditing
- Backups & Disaster Recovery
Troubleshooting
- Debugging
  - Services
  - System
- Tracing
- Chaos Engineering
![Page 86: Downtime is not an option - day 2 operations - Jörg Schad](https://reader034.fdocuments.net/reader034/viewer/2022042723/5a6479b07f8b9a3b568b47cb/html5/thumbnails/86.jpg)
© 2016 Mesosphere, Inc. All Rights Reserved. 86
MONITORING
![Page 87: Downtime is not an option - day 2 operations - Jörg Schad](https://reader034.fdocuments.net/reader034/viewer/2022042723/5a6479b07f8b9a3b568b47cb/html5/thumbnails/87.jpg)
MONITORING CONCEPT
![Page 88: Downtime is not an option - day 2 operations - Jörg Schad](https://reader034.fdocuments.net/reader034/viewer/2022042723/5a6479b07f8b9a3b568b47cb/html5/thumbnails/88.jpg)
MONITORING TOOLING EXAMPLES
● local scraping:
a. collectd
b. cAdvisor*
● event router:
a. fluentd
b. Flume
c. Kafka*
d. logstash*
e. Riemann
*) available via Mesosphere Universe
![Page 89: Downtime is not an option - day 2 operations - Jörg Schad](https://reader034.fdocuments.net/reader034/viewer/2022042723/5a6479b07f8b9a3b568b47cb/html5/thumbnails/89.jpg)
MONITORING TOOLING EXAMPLES
● storage:
a. Elasticsearch*
b. Graphite
c. InfluxDB*
d. KairosDB/Cassandra*
e. OpenTSDB/HBase
f. others such as a local filesystem, Ceph FS*, HDFS*, etc.
*) available via Mesosphere Universe
![Page 90: Downtime is not an option - day 2 operations - Jörg Schad](https://reader034.fdocuments.net/reader034/viewer/2022042723/5a6479b07f8b9a3b568b47cb/html5/thumbnails/90.jpg)
MONITORING TOOLING EXAMPLES
● dashboard:
a. D3
b. Grafana*
c. SignalFx
● alerting:
a. BigPanda
b. PagerDuty
c. SignalFx
d. VictorOps
*) available via Mesosphere Universe
![Page 91: Downtime is not an option - day 2 operations - Jörg Schad](https://reader034.fdocuments.net/reader034/viewer/2022042723/5a6479b07f8b9a3b568b47cb/html5/thumbnails/91.jpg)
MONITORING TOOLING EXAMPLES (INTEGRATED)
● Amazon CloudWatch
● AppDynamics
● Azure Monitor
● Circonus
● DataDog*
● dcos/metrics
● Ganglia
● Google Stackdriver
● Hawkular
● Icinga
● Librato
● Nagios
● New Relic
● OpsGenie
● Pingdom
● Prometheus
● Ruxit Dynatrace*
● Sensu
● Sysdig*
● Zabbix
*) available via Mesosphere Universe
![Page 92: Downtime is not an option - day 2 operations - Jörg Schad](https://reader034.fdocuments.net/reader034/viewer/2022042723/5a6479b07f8b9a3b568b47cb/html5/thumbnails/92.jpg)
LOGGING
![Page 93: Downtime is not an option - day 2 operations - Jörg Schad](https://reader034.fdocuments.net/reader034/viewer/2022042723/5a6479b07f8b9a3b568b47cb/html5/thumbnails/93.jpg)
LOGGING SCOPES
![Page 94: Downtime is not an option - day 2 operations - Jörg Schad](https://reader034.fdocuments.net/reader034/viewer/2022042723/5a6479b07f8b9a3b568b47cb/html5/thumbnails/94.jpg)
LOGGING TOOLING EXAMPLES (PRIMITIVES)
● DC/OS logging overview
● Docker logging drivers
● systemd's journalctl
![Page 95: Downtime is not an option - day 2 operations - Jörg Schad](https://reader034.fdocuments.net/reader034/viewer/2022042723/5a6479b07f8b9a3b568b47cb/html5/thumbnails/95.jpg)
LOGGING TOOLING EXAMPLES (INTEGRATED)
● Centralized app logging with fluentd
● DC/OS
a. ELK stack log shipping
b. Splunk
● Graylog
● Loggly
● Papertrail
● Sumo Logic
![Page 96: Downtime is not an option - day 2 operations - Jörg Schad](https://reader034.fdocuments.net/reader034/viewer/2022042723/5a6479b07f8b9a3b568b47cb/html5/thumbnails/96.jpg)
MAINTENANCE
![Page 97: Downtime is not an option - day 2 operations - Jörg Schad](https://reader034.fdocuments.net/reader034/viewer/2022042723/5a6479b07f8b9a3b568b47cb/html5/thumbnails/97.jpg)
Overview
● Upgrades: How to install a new version of X?
● Sizing: When to scale what (service-level vs. nodes)?
● User and package management: Who gets to access/install which services in what way?
● Networking: What services can talk to each other and in which way?
● Auditing: Who accessed what, when and how?
● Disaster Recovery: How is the continuous operation of the cluster and the services accomplished? What happens when the cluster (or critical infra components like ZK) goes down?
![Page 98: Downtime is not an option - day 2 operations - Jörg Schad](https://reader034.fdocuments.net/reader034/viewer/2022042723/5a6479b07f8b9a3b568b47cb/html5/thumbnails/98.jpg)
OTHER TROUBLESHOOTING TECHNIQUES
● Tracing
○ Idea: identify latency issues and perform root-cause analysis in a distributed setup
○ OpenTracing
● Chaos Engineering
○ Idea: proactively break (parts of) the system to understand how it reacts
○ Chaos Monkey
○ DRAX
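The chaos engineering idea can be sketched in a few lines: remove one randomly chosen instance, then check whether the service still meets its availability target. This is a toy model and not how Chaos Monkey or DRAX are implemented; the instance names, the `min_healthy` threshold, and both helper functions are assumptions for illustration.

```python
import random
from typing import List


def kill_random_instance(instances: List[str], rng: random.Random) -> str:
    """Remove one randomly chosen instance and return its name."""
    victim = rng.choice(instances)
    instances.remove(victim)
    return victim


def service_available(instances: List[str], min_healthy: int) -> bool:
    """The service counts as available while enough instances remain."""
    return len(instances) >= min_healthy


instances = ["web-1", "web-2", "web-3"]  # hypothetical service instances
rng = random.Random(7)                   # seeded so a run is repeatable
victim = kill_random_instance(instances, rng)
still_up = service_available(instances, min_healthy=2)
print(f"killed {victim}; service available: {still_up}")
```

In a real cluster the interesting part is what happens next: whether the scheduler restarts the killed instance and whether callers notice the gap.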
![Page 99: Downtime is not an option - day 2 operations - Jörg Schad](https://reader034.fdocuments.net/reader034/viewer/2022042723/5a6479b07f8b9a3b568b47cb/html5/thumbnails/99.jpg)
![Page 100: Downtime is not an option - day 2 operations - Jörg Schad](https://reader034.fdocuments.net/reader034/viewer/2022042723/5a6479b07f8b9a3b568b47cb/html5/thumbnails/100.jpg)
ARCHITECTURE: MESOS FUNDAMENTALS
● Agents advertise resources to Master
● Master offers resources to Framework
● Framework rejects/uses resources
● Agents report task status to Master
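The four steps above can be sketched as a toy model of the offer cycle. It is not the Mesos API: the `Offer` and `Master` classes, `framework_pick`, and the resource numbers are assumptions made for this example. Only the task state name `TASK_RUNNING` is taken from Mesos.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Offer:
    agent: str
    cpus: float
    mem_mb: int


@dataclass
class Master:
    available: Dict[str, Offer] = field(default_factory=dict)
    task_status: Dict[str, str] = field(default_factory=dict)

    def advertise(self, agent: str, cpus: float, mem_mb: int) -> None:
        """Step 1: an agent advertises its free resources."""
        self.available[agent] = Offer(agent, cpus, mem_mb)

    def offers(self) -> List[Offer]:
        """Step 2: the master offers those resources to a framework."""
        return list(self.available.values())

    def status_update(self, task: str, state: str) -> None:
        """Step 4: an agent reports task status back to the master."""
        self.task_status[task] = state


def framework_pick(offers: List[Offer], need_cpus: float,
                   need_mem_mb: int) -> Optional[Offer]:
    """Step 3: use the first offer that fits; the others are rejected."""
    for offer in offers:
        if offer.cpus >= need_cpus and offer.mem_mb >= need_mem_mb:
            return offer
    return None


master = Master()
master.advertise("agent-1", cpus=1.0, mem_mb=512)
master.advertise("agent-2", cpus=4.0, mem_mb=8192)
chosen = framework_pick(master.offers(), need_cpus=2.0, need_mem_mb=1024)
if chosen is not None:
    # In this model the chosen agent reports the task as running.
    master.status_update("task-1", "TASK_RUNNING")
print(chosen, master.task_status)
```

The design point the model preserves is that the framework, not the master, decides which resources to use; the master only brokers offers.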
![Page 101: Downtime is not an option - day 2 operations - Jörg Schad](https://reader034.fdocuments.net/reader034/viewer/2022042723/5a6479b07f8b9a3b568b47cb/html5/thumbnails/101.jpg)
![Page 102: Downtime is not an option - day 2 operations - Jörg Schad](https://reader034.fdocuments.net/reader034/viewer/2022042723/5a6479b07f8b9a3b568b47cb/html5/thumbnails/102.jpg)
![Page 103: Downtime is not an option - day 2 operations - Jörg Schad](https://reader034.fdocuments.net/reader034/viewer/2022042723/5a6479b07f8b9a3b568b47cb/html5/thumbnails/103.jpg)
![Page 104: Downtime is not an option - day 2 operations - Jörg Schad](https://reader034.fdocuments.net/reader034/viewer/2022042723/5a6479b07f8b9a3b568b47cb/html5/thumbnails/104.jpg)
![Page 105: Downtime is not an option - day 2 operations - Jörg Schad](https://reader034.fdocuments.net/reader034/viewer/2022042723/5a6479b07f8b9a3b568b47cb/html5/thumbnails/105.jpg)
Questions?
Code: https://git.io/vXUoy
Psssssssst … we are hiring!
http://grnh.se/ie76ru
![Page 106: Downtime is not an option - day 2 operations - Jörg Schad](https://reader034.fdocuments.net/reader034/viewer/2022042723/5a6479b07f8b9a3b568b47cb/html5/thumbnails/106.jpg)
CONTAINER SCHEDULING
![Page 107: Downtime is not an option - day 2 operations - Jörg Schad](https://reader034.fdocuments.net/reader034/viewer/2022042723/5a6479b07f8b9a3b568b47cb/html5/thumbnails/107.jpg)
RESOURCE MANAGEMENT
![Page 108: Downtime is not an option - day 2 operations - Jörg Schad](https://reader034.fdocuments.net/reader034/viewer/2022042723/5a6479b07f8b9a3b568b47cb/html5/thumbnails/108.jpg)
SERVICE MANAGEMENT
![Page 109: Downtime is not an option - day 2 operations - Jörg Schad](https://reader034.fdocuments.net/reader034/viewer/2022042723/5a6479b07f8b9a3b568b47cb/html5/thumbnails/109.jpg)
SERVICE-ORIENTED ARCHITECTURE
- Separation of concerns
- Optimization of bottlenecks
- Smaller teams
- API Contracts
- Data replication
- Complicated provisioning
- Dependency management
[Diagram: three Service / Web App stacks, each on its own Operating System and Hardware]
![Page 110: Downtime is not an option - day 2 operations - Jörg Schad](https://reader034.fdocuments.net/reader034/viewer/2022042723/5a6479b07f8b9a3b568b47cb/html5/thumbnails/110.jpg)
MICROSERVICES
- Polyglot
- Single Responsibility
- Smaller Teams
- Utilization
- Machine types/groups
- Dependency hell
[Diagram: Services and Apps spread across three Machines, each with its own Operating System, on shared Infrastructure]
![Page 111: Downtime is not an option - day 2 operations - Jörg Schad](https://reader034.fdocuments.net/reader034/viewer/2022042723/5a6479b07f8b9a3b568b47cb/html5/thumbnails/111.jpg)
THE BIRTH OF MESOS
Spring 2009: CS262B. Ben Hindman, Andy Konwinski and Matei Zaharia create “Nexus” as their CS262B class project.
March 2010: TWITTER TECH TALK. The grad students working on Mesos give a tech talk at Twitter.
September 2010: MESOS PUBLISHED. "Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center" is published as a technical report.
December 2010: APACHE INCUBATION. Mesos enters the Apache Incubator.
April 2016: DC/OS
![Page 112: Downtime is not an option - day 2 operations - Jörg Schad](https://reader034.fdocuments.net/reader034/viewer/2022042723/5a6479b07f8b9a3b568b47cb/html5/thumbnails/112.jpg)
Monitoring
- Collecting metrics
- Routing events
- Downstream processing
  ○ Alerting
  ○ Dashboards
  ○ Storage (long-term retention)
Logging
- Scopes
- Local vs. Central
- Security considerations
Maintenance
- Cluster Upgrades
- Cluster Resizing
- Capacity Planning
- User & Package Management
- Networking Policies
- Auditing
- Backups & Disaster Recovery
Troubleshooting
- Debugging
  ○ Services
  ○ System
- Tracing
- Chaos Engineering