Scalable Eventing Over Apache Mesos
Date posted: 11-Jan-2017
Category: Engineering
Scalable Eventing Over Mesos
Olivier Paugam
SW Architect / Autodesk Cloud
Big Data Montreal
2015 Autodesk
Goals & Challenges
The Mission
General purpose, high-volume eventing system.
Batch oriented I/O.
Target audience: 20+ teams within Autodesk.
Must be active/active across multiple data-centers.
Must be able to scale at any time.
Must be able to absorb traffic spikes.
Must be accessible via a single API.
Must be secure (transport + data at rest).
Must not be tied to a specific provider.
A Few Use Cases
Application log pre-aggregation transport.
Metering updates from our Platform API.
Analytics transport prior to indexing.
Event transport for Search, Activity & other services.
Identity updates down to our IT systems.
Editing increments for large 3D model collaboration.
Our 5 Technical Commandments
Must use Docker.
Must run on Apache Mesos + Marathon.
Must leverage Apache Kafka.
Must be as autonomous & low-maintenance as possible.
No automation scripting allowed (Chef, Salt, Ansible).
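
As an illustration of the first two commandments, here is a minimal sketch of a Marathon app definition for a Docker container, built as a Python dictionary. The field names follow Marathon's JSON app format; the app id, image name, sizing and the `CLUSTER_NAME` variable are hypothetical placeholders, not values from the talk.

```python
import json


def marathon_app(app_id, image, instances, cpus, mem, env):
    """Build a Marathon app definition that runs a Docker image."""
    return {
        "id": app_id,
        "instances": instances,
        "cpus": cpus,
        "mem": mem,  # megabytes
        "container": {
            "type": "DOCKER",
            "docker": {"image": image, "network": "BRIDGE"},
        },
        # Settings are handed to the container as environment variables.
        "env": dict(env),
    }


app = marathon_app(
    app_id="/eventing/kafka",        # hypothetical app id
    image="example/kafka:latest",    # hypothetical image name
    instances=3,
    cpus=1.0,
    mem=2048,
    env={"CLUSTER_NAME": "kafka"},   # hypothetical variable
)

# The serialized document is what would be submitted to Marathon.
payload = json.dumps(app, indent=2, sort_keys=True)
print(payload)
```

Scaling such an app up or down then amounts to changing `instances`, with no provisioning script involved.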
Introducing Ochopod
Ochopod
100% Open Source!
Novel container-centric orchestration model.
Mix between a discovery & an init system.
No need for dedicated frameworks.
Direct Peer To Peer HTTP I/O.
Can run on Mesos, K8S, etc.
Relies on ZK.
The Stack
How Does It Work?
Source of truth: ZooKeeper.
Each container belongs to a cluster.
A leader is picked per cluster.
Leaders manage their peers via HTTP I/O.
Settings passed via environment vars.
Eventually consistent.
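
The slide does not say how the leader is picked. The sketch below assumes the common ZooKeeper recipe in which every member registers a sequential node and the lowest sequence number wins. `FakeZooKeeper` is a hypothetical in-memory stand-in written for this example; it is neither Ochopod's code nor a real ZooKeeper client.

```python
class FakeZooKeeper:
    """In-memory stand-in for per-cluster sequential registration."""

    def __init__(self):
        self._next_seq = 0
        self._nodes = {}  # cluster name -> {sequence number: container id}

    def register(self, cluster, container_id):
        """Add a container to a cluster and return its sequence number."""
        seq = self._next_seq
        self._next_seq += 1
        self._nodes.setdefault(cluster, {})[seq] = container_id
        return seq

    def unregister(self, cluster, seq):
        """Remove a container, e.g. when it dies and its session expires."""
        self._nodes.get(cluster, {}).pop(seq, None)

    def leader(self, cluster):
        """The member holding the lowest sequence number leads the cluster."""
        members = self._nodes.get(cluster, {})
        if not members:
            return None
        return members[min(members)]

    def peers(self, cluster):
        """All members of the cluster, oldest registration first."""
        members = self._nodes.get(cluster, {})
        return [members[seq] for seq in sorted(members)]


zk = FakeZooKeeper()
first = zk.register("kafka", "container-a")
zk.register("kafka", "container-b")
zk.register("web", "container-c")

print(zk.leader("kafka"), zk.peers("kafka"))

# When the current leader goes away, the next oldest member takes over.
zk.unregister("kafka", first)
print(zk.leader("kafka"), zk.peers("kafka"))
```

In a real deployment the elected leader would then contact each peer over HTTP to configure it, which is where the eventual consistency comes from: peers converge once the leader has seen the latest membership.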
A quick DIY Mini-PaaS
Proxy approach.
100% Mesos + Ochopod.
Used for CI/CD as well.
Proxy running on an edge node.
Could easily factor OAuth2 in.
Access via direct HTTPS or using a CLI.
Toolkit to deploy, list, query, kill & update containers.
Building verticals at scale
Architecture
Phone Switch & State Machines
Going Global
Shooting For Higher Scales
Unit of scale == 1 Kafka topic.
Keep the pressure on each broker constant.
Every sub-system can be scaled independently.
API protocol designed to account for nodes shutting down.
Mix of horizontal scaling & sharding via RabbitMQ.
Checkpoints + idempotency + state-machines.
Ochopod is critical to enable scaling.
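
The slides do not show how checkpoints and idempotency are combined. A minimal sketch of the general idea, assuming each event carries a monotonically increasing offset (as Kafka messages do within a partition): the consumer remembers the highest offset it has applied, so a batch replayed after a node shuts down has no additional effect. The class and method names are hypothetical.

```python
class CheckpointedConsumer:
    """Applies each event at most once by tracking the last applied offset."""

    def __init__(self):
        self.checkpoint = -1  # highest offset applied so far
        self.applied = []     # events applied, in order

    def handle_batch(self, batch):
        """Apply a batch of (offset, event) pairs sorted by offset.

        Returns the number of events that were new. Events at or below the
        checkpoint are skipped, which makes redelivery harmless.
        """
        fresh = 0
        for offset, event in batch:
            if offset <= self.checkpoint:
                continue  # already applied before the restart or retry
            self.applied.append(event)
            self.checkpoint = offset
            fresh += 1
        return fresh


consumer = CheckpointedConsumer()
consumer.handle_batch([(0, "created"), (1, "updated")])

# A client that never saw the acknowledgement resends part of the batch.
consumer.handle_batch([(1, "updated"), (2, "deleted")])
print(consumer.checkpoint, consumer.applied)
```

In a real system the checkpoint would have to be persisted together with the applied state; keeping it in memory here only serves to show the replay behaviour.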
Conclusion
6 man-month effort.
6 open-source third-party components (Kafka, ZooKeeper, RabbitMQ...).
3 deployments over 2 data-centers, using DCOS.
36+ c3.2xlarge CoreOS slaves on AWS/EC2 + VPC.
~20 Kafka brokers, ~40 Play! nodes.
~150 live containers.
~500 live streaming sessions at any time.
~30M events / ~65M API hits a day.
< 5 minor incidents, no major incident to date.
1 single dev/op (!).
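
To put the daily figures in perspective, the short calculation below converts them to average per-second rates, which come out at roughly 350 events and 750 API hits per second. It assumes load spread evenly over 24 hours and across brokers and slaves, which real traffic (with its spikes) will not follow, so these are averages and not peaks.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Approximate figures quoted on the slide.
events_per_day = 30_000_000
api_hits_per_day = 65_000_000
kafka_brokers = 20
slaves = 36
containers = 150

events_per_sec = events_per_day / SECONDS_PER_DAY
hits_per_sec = api_hits_per_day / SECONDS_PER_DAY
events_per_broker_per_sec = events_per_sec / kafka_brokers
containers_per_slave = containers / slaves

print(f"events/s (average):          {events_per_sec:.0f}")
print(f"API hits/s (average):        {hits_per_sec:.0f}")
print(f"events/s per broker:         {events_per_broker_per_sec:.1f}")
print(f"containers per slave:        {containers_per_slave:.1f}")
```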
Issues & Next Steps
What does one do if a slave goes offline?
Need for better placement constraints.
Need for better storage schemes.
The K8S pod concept is cool after all...
We could invest into a dedicated Mesos framework.
What about Spot instances?
https://github.com/autodesk-cloud/ochopod
Autodesk is a registered trademark of Autodesk, Inc., and/or its subsidiaries and/or affiliates in the USA and/or other countries. All other brand names, product names, or trademarks belong to their respective holders. Autodesk reserves the right to alter product and services offerings, and specifications and pricing at any time without notice, and is not responsible for typographical or graphical errors that may appear in this document. 2015 Autodesk