Apache Mesos Ecosystem at Allegro - First Year of Production Use

Post on 15-Apr-2017



Wojciech Lesicki - Product Manager
Tomasz Ziarko - Software Engineer

Allegro


Agenda

● What do we do at Allegro?
● Our Mesos ecosystem
● How do we deploy apps?
● Problems we've had
● Q&A

What is Allegro?


● 16 years on the market
● Started as an auction site; now the biggest e-commerce company in Poland and one of the biggest in Central and Eastern Europe
● 50% of the e-commerce market and 80% of the m-commerce market in Poland
● 623 items sold every minute
● 14 million users (37% of the population of Poland)
● 201 million visits and 3 billion page views per month

Our infrastructure and IT

● Two data centers
● OpenStack: 510 hosts, 20128 CPUs, 5537 VMs + BaaS with OpenStack Ironic
● Monolith (PHP) and microservices
● Around 500 people in IT, most of them software engineers

OK, so why do we need Mesos?

Our deployment before Mesos

● No standards, no procedures
● Every team deployed in its own way
● Inefficient

Architecture

[Diagram: on OpenStack, a Mesos Master and Mesos Slaves (one with the Mesos Executor, one with the Docker Executor), each slave with a Discovery Agent; alongside them ZooKeeper, Marathon, and Discovery (Consul).]

- 100% OpenStack (VMs + bare metal),
- Marathon as the scheduler,
- ZooKeeper for synchronization, state, and leader election,
- Consul for service discovery,
- separate Mesos and Docker containerizers.

Implementation

- multiple clusters,
- each spanning two data centers,
- each with a separate ecosystem,
- fair-share distribution between data centers.

- Prod (105 slaves, 1000 CPUs)
- Test (96 slaves, 368 CPUs)
- Dev (30 slaves, 120 CPUs)

[Diagram: dc1 and dc2, with the Prod, Test, and Dev Mesos clusters each on its own network (Prod Network, Test Network, Dev Network).]


Implementation

$ terraform apply -var "buildnr=setup234" \
    -var "branch=mesoscon2016" \
    -var "marathon_version=0.15.3-1.ubuntu1404" \
    -var "mesos_version=0.28.0-1boost+glog+protobuf" \
    -var 'masters.dc1=1' \
    -var 'slaves.dc1=2' \
    -var 'slaves.dc2=1'

openstack_compute_instance_v2.mesos-master-dc1: Refreshing state... (ID: ce86ab7a-3660-4702-bba0-5825ae2350b1)

openstack_compute_instance_v2.mesos-slave-dc1.1: Refreshing state... (ID: 39bfd9c1-f6b0-4056-a3ac-28b0136cb220)

openstack_compute_instance_v2.mesos-slave-dc1.0: Refreshing state... (ID: acfb2e86-b4d1-44bd-b9e0-2eb4685a76ff)

openstack_compute_instance_v2.mesos-slave-dc2.0: Creating…
….
Apply complete! Resources: 1 added, 0 changed, 0 destroyed.

Implementation

[Diagram: Mesos surrounded by Discovery, Config Service, SSL Service, MaaS, LBaaS, and AppEngine Console (e.g. Bamboo, Stash, Artifactory).]

Service Discovery

- Registration inside the cluster,
- Automatic or manual registration,
- Failure detection and change detection,
- DC-aware services.

Service Discovery

[Diagram: Marathon Consul subscribes to the event bus of the Marathon leader and passes registrations on to the Consul Agent.]

- Event-based registration of Marathon apps in Consul,
- Forwards data to the appropriate Consul agents,
- Leader aware,
- Cyclic resyncs of all information.

https://github.com/allegro/marathon-consul

marathon-consul
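The event-based registration described above can be illustrated with a minimal Python sketch. It is not the marathon-consul implementation: it only assumes that Marathon's event bus delivers `status_update_event` messages carrying `taskStatus`, `taskId`, `appId`, `host`, and `ports`, and it maps each one to a register or deregister action. The set of terminal states and the service-naming rule are assumptions made for the example.

```python
# Task states treated as "gone" in this sketch (an assumption, not the project's list).
TERMINAL_STATES = {"TASK_FINISHED", "TASK_FAILED", "TASK_KILLED", "TASK_LOST"}


def consul_action(event):
    """Map one Marathon event (a dict) to a (verb, payload) pair, or None to ignore it."""
    if event.get("eventType") != "status_update_event":
        return None
    status = event.get("taskStatus")
    if status == "TASK_RUNNING":
        payload = {
            "ID": event["taskId"],
            # Hypothetical naming rule: "/shop/cart" becomes "shop.cart".
            "Name": event["appId"].strip("/").replace("/", "."),
            "Address": event["host"],
            "Port": event["ports"][0],
        }
        return ("register", payload)
    if status in TERMINAL_STATES:
        return ("deregister", {"ID": event["taskId"]})
    return None
```

A real bridge would also need the leader awareness and cyclic resyncs listed on the slide, since events alone can be missed.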

[Diagram: Slave 1 to Slave n, each running a Slave Process and a Consul Agent. The Marathon leader and the Mesos Master schedule tasks; Marathon Consul registers the running tasks with the Consul Agents.]

Service Discovery

[Diagram, production of discovery events: Conwatch polls the Consul Server (the Consul Master, a.k.a. the Discovery Service) and publishes events to Hermes (Kafka).]

[Diagram, discovery lookup: a running app looks services up through its Consul Agent, over DNS or REST.]
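The two lookup paths named above (DNS or REST through the local Consul agent) can be sketched in Python as follows. The helpers only build the names involved and do no network I/O. The DNS form `[tag.]<service>.service[.<datacenter>].consul` and the `/v1/catalog/service/<name>` endpoint are standard Consul; the agent address and the use of `dc1` as a Consul datacenter name are assumptions for the example.

```python
from urllib.parse import quote, urlencode


def consul_dns_name(service, tag=None, datacenter=None, domain="consul"):
    """Build the DNS name a client would resolve to find instances of a service."""
    parts = [tag, service, "service", datacenter, domain]
    return ".".join(p for p in parts if p)


def consul_catalog_url(service, tag=None, agent="127.0.0.1:8500"):
    """Build the REST URL listing the instances of a service, optionally filtered by tag."""
    url = "http://%s/v1/catalog/service/%s" % (agent, quote(service))
    if tag:
        url += "?" + urlencode({"tag": tag})
    return url
```

For instance, `consul_dns_name("mesoscon2016", tag="v1")` yields `v1.mesoscon2016.service.consul`, which any DNS client pointed at the agent could resolve.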

Discovery Service

{
  "ID": "mesoscon2016",
  "Name": "mesoscon2016",
  "Tags": ["std-srv", "v1"],
  "Address": "127.0.0.1",
  "Port": 8000
}
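A registration payload like the one above can be sent to the local agent from application code as well as from the command line. The Python sketch below builds that request with the standard library only. It uses `PUT`, which newer Consul releases require for this endpoint; the agent address is an assumption, and `register` is a hypothetical helper that is defined but not called here.

```python
import json
import urllib.request

SERVICE = {
    "ID": "mesoscon2016",
    "Name": "mesoscon2016",
    "Tags": ["std-srv", "v1"],
    "Address": "127.0.0.1",
    "Port": 8000,
}


def build_register_request(service, agent="127.0.0.1:8500"):
    """Prepare (but do not send) the agent registration request."""
    return urllib.request.Request(
        "http://%s/v1/agent/service/register" % agent,
        data=json.dumps(service).encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


def register(service, agent="127.0.0.1:8500"):
    """Send the registration; needs a reachable Consul agent."""
    with urllib.request.urlopen(build_register_request(service, agent), timeout=5) as resp:
        return resp.status
```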

$ curl -X POST -d @register_service_on_agent.json 127.0.0.1:8500/v1/agent/service/register
$ curl 127.0.0.1:8500/v1/agent/services | python -m json.tool

"mesoscon2016": {
    "Address": "127.0.0.1",
    "EnableTagOverride": false,
    "ID": "mesoscon2016",
    "ModifyIndex": 0,
    "Port": 8000,
    "Service": "mesoscon2016",
    "Tags": [
        "std-srv",
…..

SSL Service

- Custom Mesos hook,
- Part of the microservice contract,
- Vault as the CA solution,
- Short-term certificates/keys,
- Generated for each instance.

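[Diagram, application environment setup: on Slave 1, the certhook in the Slave Process works with Vault and extends the environment of the Executor. Other labels: Storage, service, Consul.]

[Diagram, application usage: app_x and app_y communicate in mutual SSL mode.]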
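Since the hook extends the task environment and applications then talk to each other in mutual SSL mode, an application might build its server-side TLS context roughly as in the Python sketch below. The environment variable names (`APP_CERT_FILE`, `APP_KEY_FILE`, `APP_CA_FILE`) are hypothetical; the slides do not say what the hook actually exports.

```python
import os
import ssl


def mutual_tls_server_context(env=None):
    """Build a server context that refuses clients without a valid certificate."""
    env = os.environ if env is None else env
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.verify_mode = ssl.CERT_REQUIRED  # mutual mode: the client must present a cert
    cert = env.get("APP_CERT_FILE")  # hypothetical variable names
    key = env.get("APP_KEY_FILE")
    ca = env.get("APP_CA_FILE")
    if cert and key:
        ctx.load_cert_chain(certfile=cert, keyfile=key)
    if ca:
        ctx.load_verify_locations(cafile=ca)
    return ctx
```

Because the certificates are short-term and generated per instance, a restarted instance would simply pick up fresh files through the same variables.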

Config Service

- Secure storage,
- Fetched over mutual SSL,
- Version-controlled config,
- Authorized apps only,
- Easy to use,
- Peer review of changes.

[Diagram: configuration is pushed with git push to a configured Git repository that holds revisions X, Y, and Z with encrypted data. A starting app fetches its config data from the Config Service over mutual SSL, and the Config Service gets the requested revision and environment config from the repository.]

MAAS

- Metrics collected,
- Dashboards set,
- Service owners get notified,
- Triggers, not mandatory,
- Multiple monitoring solutions.

[Diagram: Diamond Collectors on the Mesos Master and Mesos Slaves send metrics to Graphite. MAAS provides Grafana dashboards and Cabot checks, with check definitions kept in a Git repo. Triggers and Mesos cluster events published through Hermes (Kafka) produce notifications to developers by subscription, email, and PagerDuty.]
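As an illustration of the metrics path into Graphite, here is a minimal Python sketch of Graphite's plaintext protocol: one `path value timestamp` line per data point, conventionally sent to TCP port 2003. The metric name and the host are made up for the example, and Diamond does considerably more than this.

```python
import socket
import time


def graphite_line(path, value, timestamp=None):
    """Format one data point in Graphite's plaintext protocol."""
    ts = int(time.time() if timestamp is None else timestamp)
    return "%s %s %d\n" % (path, value, ts)


def send_metric(path, value, host="127.0.0.1", port=2003):
    """Send one data point; needs a reachable Graphite (carbon) listener."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(graphite_line(path, value).encode("ascii"))
```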

LBAAS

- Based on discovery,
- Available through discovery tags,
- HAProxy at the core.

[Diagram: the Consul service catalog holds information on Service X and Service Y and their instances x and y. Register-instance and unregister-instance events from discovery flow through Kafka/Hermes (pub/sub) to LBAAS, which configures HAProxy and Varnish (VAAS) over REST.]
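To make the idea of discovery-driven load balancing concrete, the Python sketch below renders an HAProxy `backend` section from a list of discovered instances, keeping only those that carry a given tag. The tag name `lb` and the instance dictionaries are assumptions; the slides do not show how LBAAS actually generates its configuration.

```python
def haproxy_backend(name, instances, required_tag):
    """Render an HAProxy backend from discovered instances that carry required_tag."""
    lines = ["backend %s" % name]
    for inst in instances:
        if required_tag in inst.get("Tags", []):
            lines.append(
                "    server %s %s:%d check" % (inst["ID"], inst["Address"], inst["Port"])
            )
    return "\n".join(lines) + "\n"


EXAMPLE_INSTANCES = [
    {"ID": "app-1", "Address": "10.0.0.5", "Port": 31000, "Tags": ["lb", "v1"]},
    {"ID": "app-2", "Address": "10.0.0.6", "Port": 31001, "Tags": ["v1"]},
]
```

With the example data, only `app-1` ends up in the backend, because `app-2` does not carry the `lb` tag.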

Implementation

[Diagram: the whole ecosystem: Mesos Master, Mesos Agents with Consul Agents, Marathon with marathon-consul, Consul Server, Conwatch, Kafka, Vault, Graphite, MAAS, LBAAS, and VAAS.]

Demo

Figures


What our Mesos Ecosystem gives our devs:

1. Fast and easy deployment of new applications

2. Standardization (e.g. out-of-the-box monitoring tools)

3. Automation

4. Self-healing

The bumpy road: solved


Network isolation killing slaves

- Enabled network isolation,
- Many cyclic deploys on the test environment,
- Consulted our fellow Mesos developers,
- Decided to disable it,
- Problem solved.

Marathon registers multiple times

- Triggered by an error while getting znode data,
- Marathon registers with a different framework ID,
- This exhausts resources in the cluster,
- The behaviour changed after version 0.14,
- Now Marathon just waits,
- Maybe a ZooKeeper problem, maybe not; solved either way.


Deploy constraints

- We want instances spread across DCs/zones,
- It worked unpredictably,
- It took into account application instances that were about to be shut down,
- Multi-constraint definitions were prone to unpredictable results,
- Solved in the newest version, so far.
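For readers unfamiliar with Marathon constraints, the Python sketch below builds an app definition that asks Marathon to spread instances evenly across data centers with the `GROUP_BY` operator. It assumes the agents advertise a `dc` attribute (a hypothetical name); the resource values are placeholders, and this is not Allegro's actual configuration.

```python
import json


def app_definition(app_id, cmd, instances, dc_count):
    """Build a Marathon app definition spread across dc_count data centers."""
    return {
        "id": app_id,
        "cmd": cmd,
        "instances": instances,
        "cpus": 0.1,  # placeholder resources
        "mem": 64,
        # [attribute, operator, value]; "dc" is an assumed agent attribute.
        "constraints": [["dc", "GROUP_BY", str(dc_count)]],
    }


EXAMPLE_JSON = json.dumps(app_definition("/mesoscon2016", "sleep 3600", 4, 2), indent=2)
```

The resulting JSON is what would be posted to Marathon's `/v2/apps` endpoint.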


Readiness checks ...

- Applications are deployed and upgraded on the blue-green principle,
- Freshly started instances are not ready to handle load,
- No standard mechanism for checking that applications are really running,
- Whether the check passed or not did not matter,
- We developed a custom service wrapper.
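The slides do not describe the custom wrapper, so the following Python sketch only illustrates the general idea of gating on readiness: poll a probe until it reports ready or a deadline passes. The function name, parameters, and defaults are all assumptions.

```python
import time


def wait_until_ready(probe, timeout=60.0, interval=1.0,
                     clock=time.monotonic, sleep=time.sleep):
    """Call probe() until it returns a truthy value or timeout seconds elapse.

    Returns True if the probe succeeded, False if the deadline passed first.
    clock and sleep are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + timeout
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

A wrapper built on this could, for example, announce the instance to discovery only after `wait_until_ready` returns True, so that traffic is not routed to an instance that is still warming up.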

The bumpy road: still occurring

- DC failure: an AWS standby master for quorum,
- Application scaling: usage vs. allocation (we are trying to build our own autoscaling),
- User authorization and per-user quotas,
- Graceful shutdown,
- Various endpoints exposed without authorization.

In a nutshell - you have seen

● Our Mesos ecosystem
● Our deployment
● Our bumpy road


Mesos - it takes some time and effort, but it's worth it.

More information

Tomasz Ziarko
tomasz.ziarko@allegrogroup.com

Wojciech Lesicki
wojciech.lesicki@allegrogroup.com
Twitter @WLesicki