Scalable Eventing Over Apache Mesos

  • Date posted: 11-Jan-2017
  • Category: Engineering

Transcript of Scalable Eventing Over Apache Mesos

Scalable Eventing Over Mesos
Olivier Paugam
SW Architect / Autodesk Cloud
Big Data Montreal

2015 Autodesk

Goals & Challenges

The Mission
  • General purpose, high-volume eventing system.
  • Batch oriented I/O.
  • Target audience: 20+ teams within Autodesk.
  • Must be active/active across multiple data-centers.
  • Must be able to scale at any time.
  • Must be able to absorb traffic spikes.
  • Must be accessible via a single API.
  • Must be secure (transport + data at rest).
  • Must not be tied to a specific provider.

A Few Use Cases
  • Application log pre-aggregation transport.
  • Metering updates from our Platform API.
  • Analytics transport prior to indexing.
  • Event transport for Search, Activity & other services.
  • Identity updates down to our IT systems.
  • Editing increments for large 3D model collaboration.

Our 5 Technical Commandments
  • Must use Docker.
  • Must run on Apache Mesos + Marathon.
  • Must leverage Apache Kafka.
  • Must be as autonomous & low-maintenance as possible.
  • No automation scripting allowed (Chef, Salt, Ansible).

Introducing Ochopod

Ochopod
  • 100% open source!
  • Novel container-centric orchestration model.
  • Mix between a discovery & an init system.
  • No need for dedicated frameworks.
  • Direct peer-to-peer HTTP I/O.
  • Can run on Mesos, Kubernetes (K8S), etc.
  • Relies on ZooKeeper (ZK).

The Stack

How Does It Work?
  • Source of truth: ZooKeeper.
  • Each container belongs to a cluster.
  • A leader is picked per cluster.
  • Leaders manage their peers via HTTP I/O.
  • Settings are passed via environment variables.
  • Eventually consistent.
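The per-cluster leader pick described above can be sketched in a few lines. This is a minimal illustration, not Ochopod's actual code: the `Registry` class is an in-memory stand-in for ZooKeeper, the "lowest sequence number wins" rule is one common ZooKeeper election pattern that the slides do not confirm, and the `DEMO_CLUSTER` variable name is made up.

```python
import os
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Registry:
    """In-memory stand-in for ZooKeeper (hypothetical, for illustration only)."""
    clusters: dict = field(default_factory=dict)
    _seq: count = field(default_factory=count)

    def register(self, cluster: str, address: str) -> int:
        # Mimic a sequential node: each registration gets a growing number.
        seq = next(self._seq)
        self.clusters.setdefault(cluster, {})[seq] = address
        return seq

    def unregister(self, cluster: str, seq: int) -> None:
        # Mimic an ephemeral node vanishing when its container goes away.
        self.clusters.get(cluster, {}).pop(seq, None)

    def leader(self, cluster: str):
        # Assumed rule: the member with the lowest sequence number leads.
        members = self.clusters.get(cluster, {})
        return members[min(members)] if members else None


def cluster_from_env(default: str) -> str:
    # Settings arrive through environment variables (variable name is made up).
    return os.environ.get("DEMO_CLUSTER", default)


def demo() -> tuple:
    registry = Registry()
    cluster = cluster_from_env("kafka")
    first = registry.register(cluster, "10.0.0.1:8080")
    registry.register(cluster, "10.0.0.2:8080")
    before = registry.leader(cluster)
    # The first container dies; the next one in line takes over.
    registry.unregister(cluster, first)
    after = registry.leader(cluster)
    return before, after


print(demo())
```

In a real deployment the leader would then drive its peers over HTTP; the sketch stops at the election step because the slides give no detail on that protocol.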

A quick DIY Mini-PaaS
  • Proxy approach.
  • 100% Mesos+Ochopod.
  • Used for CI/CD as well.
  • Proxy running on an edge node.
  • Could easily factor OAuth2 in.
  • Access via direct HTTPS or using a CLI.
  • Toolkit to deploy, list, query, kill & update containers.

Building verticals at scale

Architecture

Phone Switch & State Machines

Going Global

Shooting For Higher Scales
  • Unit of scale == 1 Kafka topic.
  • Keep the pressure on each broker constant.
  • Every sub-system can be scaled independently.
  • API protocol designed to account for nodes shutting down.
  • Mix of horizontal scaling & sharding via RabbitMQ.
  • Checkpoints + idempotency + state-machines.
  • Ochopod is critical to enable scaling.
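To make the "checkpoints + idempotency" bullet concrete, here is a minimal sketch of a consumer that survives a replay after a node shuts down. The class and field names are hypothetical, and the offset-based checkpoint is an assumption; the slides do not describe how the real system tracks progress.

```python
from dataclasses import dataclass, field


@dataclass
class CheckpointedConsumer:
    """Applies each event at most once, keyed on a monotonically growing offset."""
    checkpoint: int = -1                      # offset of the last event applied
    applied: list = field(default_factory=list)

    def handle(self, offset: int, payload: str) -> bool:
        # Idempotency: anything at or below the checkpoint was already applied.
        if offset <= self.checkpoint:
            return False
        self.applied.append(payload)
        self.checkpoint = offset              # advance the checkpoint
        return True


consumer = CheckpointedConsumer()
for offset, payload in [(0, "a"), (1, "b")]:
    consumer.handle(offset, payload)

# A node went down mid-batch, so the sender replays from an earlier offset.
for offset, payload in [(1, "b"), (2, "c")]:
    consumer.handle(offset, payload)

print(consumer.applied)     # ['a', 'b', 'c']
print(consumer.checkpoint)  # 2
```

Because replays are harmless, the sender can retry a whole batch whenever a node disappears, which is what lets an API protocol tolerate shutdowns without duplicating events.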

Conclusion

  • 6 man-month effort.
  • 6 open-source third-party components (Kafka, ZooKeeper, RabbitMQ...).
  • 3 deployments over 2 data-centers, using DCOS.
  • 36+ c3.2xlarge CoreOS slaves on AWS/EC2 + VPC.
  • ~20 Kafka brokers, ~40 Play! nodes.
  • ~150 live containers.
  • ~500 live streaming sessions at any time.
  • ~30M events / ~65M API hits a day.
  • < 5 minor incidents, no major incident to date.
  • 1 single dev/op (!).
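The daily figures above are easier to compare once converted to per-second averages. The short calculation below does that conversion; it assumes load is spread evenly over the day and over the nodes, which the slides do not claim, so real peaks will be higher than these averages.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

# Figures quoted on the slide (all approximate).
events_per_day = 30_000_000
api_hits_per_day = 65_000_000
kafka_brokers = 20
play_nodes = 40

# Average rates, assuming a perfectly even spread (an assumption).
events_per_sec = events_per_day / SECONDS_PER_DAY
hits_per_sec = api_hits_per_day / SECONDS_PER_DAY
hits_per_play_node = hits_per_sec / play_nodes
events_per_broker = events_per_sec / kafka_brokers

print(round(events_per_sec, 1))      # 347.2
print(round(hits_per_sec, 1))        # 752.3
print(round(hits_per_play_node, 1))  # 18.8
print(round(events_per_broker, 1))   # 17.4
```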

Issues & Next Steps
  • What does one do if a slave goes offline?
  • Need for better placement constraints.
  • Need for better storage schemes.
  • The K8S pod concept is cool after all...
  • We could invest into a dedicated Mesos framework.
  • What about Spot instances?

https://github.com/autodesk-cloud/ochopod

Autodesk is a registered trademark of Autodesk, Inc., and/or its subsidiaries and/or affiliates in the USA and/or other countries. All other brand names, product names, or trademarks belong to their respective holders. Autodesk reserves the right to alter product and services offerings, and specifications and pricing at any time without notice, and is not responsible for typographical or graphical errors that may appear in this document. 2015 Autodesk
