The Path of Monitoring 2.0: Everything Has Changed / Vsevolod Polyakov...

Post on 16-Apr-2017

MONITORING. AGAIN.

Vsevolod Polyakov

Platform Engineer, Grammarly

ctrlok.com

What are metrics?
• Success
• Count
• Time
• Interaction
• Internal processes
• System metrics

Why do we need metrics?
• Alerts
• Analytics

Graphite

Default graphite architecture

what?
• RRD-like (gram.ly/gfsx)
• so.it.is.my.metric → /so/it/is/my/metric.wsp
• Fixed retention (by name/pattern)
• Fixed size (actually no)
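The name-to-file mapping on this slide can be sketched in a few lines of Go. This is only an illustration of the rule "dots become directory separators, plus a .wsp suffix"; the function name `metricPath` and the storage root passed to it are assumptions, not part of Graphite.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// metricPath maps a dotted metric name to the Whisper file that would store it.
// root is a hypothetical storage directory chosen for the example.
func metricPath(root, metric string) string {
	parts := strings.Split(metric, ".")
	elems := append([]string{root}, parts...)
	return path.Join(elems...) + ".wsp"
}

func main() {
	fmt.Println(metricPath("/", "so.it.is.my.metric"))
}
```

One consequence of this layout is that every distinct metric name becomes its own file, which is why file creation cost matters later in the load tests.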

Retention and size
• 1s:1d → 1 036 828 bytes
• 10s:10d → 1 036 828 bytes
• 1s:365d → 378 432 028 bytes (1 TB ~ 3 000)
• 10s:365d → 37 843 228 bytes (1 TB ~ 30 000)

whisper calc
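The single-archive sizes above can be reproduced with a little arithmetic. The sketch below assumes the Whisper on-disk layout of a 16-byte file header, a 12-byte header per archive, and 12 bytes per datapoint; the type and function names are made up for the example.

```go
package main

import "fmt"

const (
	metadataSize    = 16 // file header: aggregation, max retention, xFilesFactor, archive count
	archiveInfoSize = 12 // per-archive header: offset, seconds per point, number of points
	pointSize       = 12 // 4-byte timestamp plus 8-byte float value
	day             = 24 * 60 * 60
)

// archive describes one retention rule such as 10s:365d.
type archive struct {
	secondsPerPoint int64
	retentionSecs   int64
}

// whisperSize returns the file size in bytes for the given retention rules.
func whisperSize(archives []archive) int64 {
	size := int64(metadataSize)
	for _, a := range archives {
		points := a.retentionSecs / a.secondsPerPoint
		size += archiveInfoSize + points*pointSize
	}
	return size
}

func main() {
	fmt.Println(whisperSize([]archive{{1, 1 * day}}))    // 1s:1d
	fmt.Println(whisperSize([]archive{{10, 10 * day}}))  // 10s:10d
	fmt.Println(whisperSize([]archive{{1, 365 * day}}))  // 1s:365d
	fmt.Println(whisperSize([]archive{{10, 365 * day}})) // 10s:365d
}
```

1s:1d and 10s:10d come out identical because both hold 86 400 points; the size depends on the number of points, not on the resolution.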

Retention and size
• 10s:30d,1m:120d,10m:365d → 4 564 864 bytes
• 240 864 metrics in 1 TB
• aggregation: average, sum, min, max, and last.
• can be assigned per metric
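To make the aggregation methods concrete, here is a rough sketch of how several fine-grained points could be rolled up into one coarser point, for example six 10s points into one 1m point. It ignores details of the real Whisper implementation such as xFilesFactor; the function name `aggregate` is an assumption.

```go
package main

import "fmt"

// aggregate collapses the points of one coarse interval into a single value
// using one of the five methods named on the slide.
func aggregate(method string, points []float64) (float64, error) {
	if len(points) == 0 {
		return 0, fmt.Errorf("no points to aggregate")
	}
	result := points[0]
	switch method {
	case "average", "sum":
		for _, p := range points[1:] {
			result += p
		}
		if method == "average" {
			result /= float64(len(points))
		}
	case "min":
		for _, p := range points[1:] {
			if p < result {
				result = p
			}
		}
	case "max":
		for _, p := range points[1:] {
			if p > result {
				result = p
			}
		}
	case "last":
		result = points[len(points)-1]
	default:
		return 0, fmt.Errorf("unknown aggregation method %q", method)
	}
	return result, nil
}

func main() {
	tenSecondPoints := []float64{1, 2, 3, 4, 5, 9}
	for _, m := range []string{"average", "sum", "min", "max", "last"} {
		v, _ := aggregate(m, tenSecondPoints)
		fmt.Println(m, v)
	}
}
```

The choice matters per metric: counters usually want sum, gauges average or last, and latency ceilings max.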

How
• terraform (https://www.terraform.io/)
• docker (https://www.docker.com/)
• ansible (https://www.ansible.com/)
• rocker (https://github.com/grammarly/rocker)
• rocker-compose (https://github.com/grammarly/rocker-compose)

Default graphite architecture

carbon-cache.py
• single-core
• many options in config file
• default

carbon-cache.py architecture

Start load testing
• m4.xlarge instance (4 CPU, 16 GB RAM, 256 GB EBS gp2 disk)
• retentions = 1s:1d
• MAX_CACHE_SIZE, MAX_UPDATES_PER_SECOND, MAX_CREATES_PER_MINUTE = inf
• everything else at defaults
• almost 1.5 h to reach the limit :(

carbon-cache.py cache size → 75k metrics/s

[Charts: updates, update time]

results
• 75 000 metrics/s max
• 60 000 metrics/s flagman speed
• I/O :(

Try to tune!
• WHISPER_SPARSE_CREATE = true (don’t allocate space on creation): non-linear I/O load.
• CACHE_WRITE_STRATEGY = sorted (default)

cache size 1k → 195k metrics/s

results
• 120 000 metrics/s flagman speed
• cache flush problem :(

Try to tune!
• CACHE_WRITE_STRATEGY = max: gives a strong flush preference to frequently updated metrics and also reduces random file I/O.

from 1k to 150k

results
• 90 000 metrics/s flagman speed
• cache flush problem :(

Try to tune!
• CACHE_WRITE_STRATEGY = naive: just flush. Better with random I/O.

from 45k to 135k

results
• 120 000 metrics/s flagman speed
• still CPU

[Comparison charts: sorted, max, naive]
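The difference between the three write strategies is easier to see in code. The sketch below is a simplified model, not carbon-cache.py's implementation: the cache is a plain map, and the names `sortedOrder`, `popMax` and `popNaive` are invented for the example.

```go
package main

import (
	"fmt"
	"sort"
)

// cache maps a metric name to the datapoints waiting to be written.
type cache map[string][]float64

// sortedOrder models "sorted": count everything once, then flush in that
// fixed order, largest queue first.
func sortedOrder(c cache) []string {
	names := make([]string, 0, len(c))
	for n := range c {
		names = append(names, n)
	}
	sort.Slice(names, func(i, j int) bool {
		li, lj := len(c[names[i]]), len(c[names[j]])
		if li != lj {
			return li > lj
		}
		return names[i] < names[j]
	})
	return names
}

// popMax models "max": before every write, pick whichever metric has the
// most datapoints right now.
func popMax(c cache) (string, []float64, bool) {
	best, found := "", false
	for n, pts := range c {
		if !found || len(pts) > len(c[best]) || (len(pts) == len(c[best]) && n < best) {
			best, found = n, true
		}
	}
	if !found {
		return "", nil, false
	}
	pts := c[best]
	delete(c, best)
	return best, pts, true
}

// popNaive models "naive": take any metric, in no particular order.
func popNaive(c cache) (string, []float64, bool) {
	for n, pts := range c {
		delete(c, n)
		return n, pts, true
	}
	return "", nil, false
}

func main() {
	c := cache{"a": {1, 2, 3}, "b": {1}, "c": {1, 2}}
	fmt.Println(sortedOrder(c)) // snapshot order, largest first

	first, _, _ := popMax(c)
	c["b"] = append(c["b"], 2, 3, 4) // "b" receives points between writes
	second, _, _ := popMax(c)
	fmt.Println(first, second)
}
```

In this model "sorted" pays for one sort per pass, "max" pays for a scan before every write, and "naive" pays nothing for selection, which is one plausible reading of why the CPU-bound results differ between them.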

• Maybe it’s an EBS I/O limitation? → 512 GB disk.

• No.

go-carbon
• multi-core single daemon
• written in golang
• not many options to tune :(

Start load testing
• m4.xlarge instance (4 CPU, 16 GB RAM, 256 GB EBS gp2 disk)
• retentions = 1s:1d
• max-size = 0
• max-updates-per-second = 0
• almost 1 h to reach the limit :(

1k → 130k metrics/s, ~3k/min

results
• 120 000 metrics/s flagman speed
• but that’s without sparse file creation.
• try to implement it

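try to tune!

    // Fragment of the whisper file-creation code; whisper and err come from
    // the surrounding function.
    remaining := whisper.Size() - whisper.MetadataSize()
    // Sparse create: jump near the end and write a single byte instead of
    // filling the whole file with zeros.
    whisper.file.Seek(int64(remaining-1), 0)
    whisper.file.Write([]byte{0})
    chunkSize := 16384
    zeros := make([]byte, chunkSize)
    for remaining > chunkSize {
        // The original chunked zero-fill, disabled for the experiment:
        // if _, err = whisper.file.Write(zeros); err != nil {
        //     return nil, err
        // }
        remaining -= chunkSize
    }
    if _, err = whisper.file.Write(zeros[:remaining]); err != nil {
        return nil, err
    }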

Already available in go-carbon

180 000 metrics/s!

try to tune!

• max update operation = 1500

results

• TL;DR: 210 000 to 240 000 metrics/s flagman speed

• 31 000 000 cache size!

try to tune!

• max update operation = 0

• input-buffer = 400 000

results

• 270 000 metrics/s flagman speed

• 10-20 million cache size!

try to tune!

• vm.dirty_background_ratio=40

• vm.dirty_ratio=60

300 000 req/s

results
• 300 000 metrics/s flagman speed
• 180k+ metrics/s more or less without cache

Re:Lays

Default graphite architecture

arch forward

arch named/regexp

arch hash

arch hash, replication factor: 2
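Hash-based routing with a replication factor can be sketched as a small consistent-hash ring. This is a generic illustration and does not reproduce the hash function or ring layout of carbon-relay.py; the node names, the FNV hash and the number of virtual nodes are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// ring is a minimal consistent-hash ring with virtual nodes.
type ring struct {
	points []uint32
	owner  map[uint32]string
}

func hashKey(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

// newRing places vnodes points on the ring for every node.
func newRing(nodes []string, vnodes int) *ring {
	r := &ring{owner: make(map[uint32]string)}
	for _, n := range nodes {
		for i := 0; i < vnodes; i++ {
			p := hashKey(fmt.Sprintf("%s#%d", n, i))
			if _, taken := r.owner[p]; taken {
				continue // skip the rare hash collision
			}
			r.owner[p] = n
			r.points = append(r.points, p)
		}
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// destinations walks the ring clockwise from the metric's hash and returns
// up to `replicas` distinct nodes.
func (r *ring) destinations(metric string, replicas int) []string {
	if len(r.points) == 0 {
		return nil
	}
	key := hashKey(metric)
	start := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= key })
	seen := make(map[string]bool)
	var out []string
	for i := 0; i < len(r.points) && len(out) < replicas; i++ {
		n := r.owner[r.points[(start+i)%len(r.points)]]
		if !seen[n] {
			seen[n] = true
			out = append(out, n)
		}
	}
	return out
}

func main() {
	r := newRing([]string{"cache-a", "cache-b", "cache-c"}, 100)
	// With replication factor 2 every metric is sent to two different caches.
	fmt.Println(r.destinations("so.it.is.my.metric", 2))
}
```

Because the same metric name always hashes to the same place, every relay that shares the node list routes it identically, and the second replica survives the loss of one cache node.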

carbon-relay.py

• twisted based

• native

Start load testing
• c4.xlarge instance (4 CPU, 7.5 GB RAM)
• ~1 Gb LAN
• default parameters
• hashing
• 10 connections

WTF!

carbon-relay-ng
• golang-based
• web-panel
• live-updates
• aggregators
• spooling

<150 000 req/s

carbon-c-relay

• written in C
• advanced cluster management

from 100 000 to 1 600 000 req/s

1 400 000 flagman speed. Or not?

So… go-carbon + carbon-c-relay = ♡

Containers

Everything is mixed up

Differences
• Environment
• Role
• Track (modifier)
• IP
• Datacenter
• Anything at all

Tags
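As a rough illustration of why tags help here, the sketch below models a series as a name plus a set of tags and selects series by any subset of them, something a fixed dotted path cannot do without agreeing on the position of each field. The `series` type, the tag names and the `name;tag=value` rendering are assumptions for the example, not the format of any particular TSDB.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// series is a metric identified by a name plus free-form tags.
type series struct {
	name string
	tags map[string]string
}

// key renders the series in a stable "name;tag=value" form with tags sorted
// by name, so the same tag set always produces the same key.
func (s series) key() string {
	keys := make([]string, 0, len(s.tags))
	for k := range s.tags {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	var b strings.Builder
	b.WriteString(s.name)
	for _, k := range keys {
		b.WriteString(";" + k + "=" + s.tags[k])
	}
	return b.String()
}

// matches reports whether the series carries every tag in the selector.
func (s series) matches(selector map[string]string) bool {
	for k, v := range selector {
		if s.tags[k] != v {
			return false
		}
	}
	return true
}

func main() {
	s := series{
		name: "requests",
		tags: map[string]string{"env": "prod", "role": "api", "dc": "us-east"},
	}
	fmt.Println(s.key())
	fmt.Println(s.matches(map[string]string{"env": "prod"}))
}
```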

TSDBs with tags
• InfluxDB
• OpenTSDB (HBase)
• cyanite (Cassandra)
• newTS (Cassandra)
• Prometheus

InfluxDB (cluster): 130k metrics/s

OpenTSDB single instance + HBase cluster = up to 150k metrics/s

Compaction

Graphite

Find what is unique

Works with Grafana

Zipper

• https://github.com/grobian/carbonserver

• https://github.com/dgryski/carbonzipper

• https://github.com/dgryski/carbonapi

ALSO

• https://github.com/jssjr/carbonate

• https://github.com/jjneely/buckytools

• https://github.com/dgryski/carbonmem

• https://github.com/grobian/carbonwriter

Plans

• Patch: statsd → ES

• Patch: carbonserver → carbonlink

feel free to ask

• Vsevolod Polyakov

• ctrlok@gmail.com

• skype: ctrlok1987

• github.com/ctrlok

• twitter.com/ctrlok

• slack: HangOps

• Gitter: dev_ua/devops

• skype: DevOps from Ukraine

• slack.ukrops.club

We’re hiring!