GOCON Autumn (Story of our own Monitoring Agent in golang)
Story of our own Monitoring Agent
in golang
@dxhuy
LINE corp
Introduction
• @dxhuy • Vietnamese • Building monitoring stack at LINE
My goal today
• Join GoConference without lottery
My goal today
• Show that this is not 100% true
Today takeaway
→ Anatomy of a monitoring agent → How to design one → Challenges and lessons learned
Monitoring Agent !?
• A small application that runs on the host machine • Collects host machine metrics
• Request latency? • MySQL load? • Redis hit/miss rate? • .....
• Aggregate metrics (sum/avg/histogram..) • Send to collector server → alert / chart ...
• statsd / collectd / telegraf...
Not a generic log transfer
Why not reuse existing technology?
• Scale problem • We need to write our own stack
• Varied environments problem • Management problem • Development velocity problem
Let's start writing our own
Language
Features
• Modularity (for user)
• Buffer (prevent data loss)
• Management friendly (for admin)
Modularity
• What is modularity? • Easy to add new metrics from the user's point of view • Pluggable
Modularity
• How?
• Input : get metric • Codec : understand metric • Output : send metric
// Metric is central model for imonD
type Metric struct {
	ProtocolVersion ProtocolVer
	Name            string
	Val             Value
	TimeStamp       time.Time
	Fingerprint     Fingerprint
	Type            MetricType
	Labels          map[string]string
}
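The Input / Codec / Output split above can be sketched roughly as follows. This is a minimal illustration, not the agent's real code: the `Input`, `Codec` and `Output` interfaces, the reduced `Metric` struct, and the `staticInput`, `jsonCodec`, `memoryOutput` and `RunOnce` names are all assumptions made up for the example.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// Metric is a reduced, hypothetical stand-in for the agent's metric model.
type Metric struct {
	Name      string            `json:"name"`
	Val       float64           `json:"val"`
	TimeStamp time.Time         `json:"ts"`
	Labels    map[string]string `json:"labels,omitempty"`
}

// Hypothetical minimal interfaces: an input gets metrics, a codec turns
// them into bytes, an output sends the bytes somewhere.
type Input interface {
	Read() ([]Metric, error)
}

type Codec interface {
	Encode(ms []Metric) ([]byte, error)
}

type Output interface {
	Write(b []byte) error
}

// staticInput returns a fixed batch of metrics.
type staticInput struct{ ms []Metric }

func (s staticInput) Read() ([]Metric, error) { return s.ms, nil }

// jsonCodec encodes a batch as a JSON array.
type jsonCodec struct{}

func (jsonCodec) Encode(ms []Metric) ([]byte, error) { return json.Marshal(ms) }

// memoryOutput keeps every payload it receives in memory.
type memoryOutput struct{ sent [][]byte }

func (m *memoryOutput) Write(b []byte) error {
	m.sent = append(m.sent, b)
	return nil
}

// RunOnce pushes one batch through input -> codec -> output.
func RunOnce(in Input, c Codec, out Output) error {
	ms, err := in.Read()
	if err != nil {
		return fmt.Errorf("read: %w", err)
	}
	b, err := c.Encode(ms)
	if err != nil {
		return fmt.Errorf("encode: %w", err)
	}
	return out.Write(b)
}

func main() {
	in := staticInput{ms: []Metric{{Name: "cpu_user", Val: 0.42, TimeStamp: time.Unix(0, 0).UTC()}}}
	out := &memoryOutput{}
	if err := RunOnce(in, jsonCodec{}, out); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(string(out.sent[0]))
}
```

Because each stage only sees an interface, a new metric source or wire format can be added without touching the other two stages.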
Input Plugin design
• Three important things: • Process model • Plugin model • Collecting model (push vs pull)
Process model
Single process vs multiple process
Process model
- Adv: easy management / maintenance
- DisAdv: one bad plugin could affect the whole agent
Same language vs embedded language
Plugin model
- Adv: simple model, better maintenance
- DisAdv: each time a new plugin is added, the whole agent needs to be restarted
// InputPlugin represent an input plugin interface
type InputPlugin interface {
	Interval() config.Duration
	GracefulStop() error
	Name() string
	Type() InputType
}

type InputByte interface {
	Decoder() codec.Decoder
	ReadBytesWithContext(ctx context.Context) ([]byte, error)
}

type InputMetrics interface {
	ReadMetricsWithContext(ctx context.Context) (model.Metrics, error)
}
All plugins share the same interface
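One way a single process can drive many plugins through one shared interface is a polling loop per plugin. The sketch below is an assumption-laden illustration: the trimmed `InputPlugin` interface, `constInput`, `Run`, `collectFor` and `demo` are hypothetical names, and the real agent's scheduling may look quite different.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// Metric is a reduced, hypothetical stand-in for the agent's metric model.
type Metric struct {
	Name string
	Val  float64
}

// InputPlugin is a trimmed-down, hypothetical variant of the shared interface.
type InputPlugin interface {
	Name() string
	Interval() time.Duration
	ReadMetricsWithContext(ctx context.Context) ([]Metric, error)
}

// constInput is a toy plugin that always reports the same value.
type constInput struct {
	name  string
	every time.Duration
}

func (c constInput) Name() string {
	return c.name
}

func (c constInput) Interval() time.Duration {
	return c.every
}

func (c constInput) ReadMetricsWithContext(ctx context.Context) ([]Metric, error) {
	return []Metric{{Name: c.name, Val: 1}}, nil
}

// Run polls each plugin on its own interval inside one process. It returns
// once ctx is cancelled and every polling goroutine has stopped.
func Run(ctx context.Context, plugins []InputPlugin, sink chan<- Metric) {
	var wg sync.WaitGroup
	for _, p := range plugins {
		wg.Add(1)
		go func(p InputPlugin) {
			defer wg.Done()
			t := time.NewTicker(p.Interval())
			defer t.Stop()
			for {
				select {
				case <-ctx.Done():
					return
				case <-t.C:
					ms, err := p.ReadMetricsWithContext(ctx)
					if err != nil {
						continue // a failing plugin skips this tick only
					}
					for _, m := range ms {
						select {
						case sink <- m:
						case <-ctx.Done():
							return
						}
					}
				}
			}
		}(p)
	}
	wg.Wait()
}

// collectFor runs the plugins for d and returns how many metrics arrived.
func collectFor(d time.Duration, plugins ...InputPlugin) int {
	ctx, cancel := context.WithTimeout(context.Background(), d)
	defer cancel()
	sink := make(chan Metric, 1024)
	Run(ctx, plugins, sink)
	close(sink) // safe: Run has returned, so no goroutine can still send
	n := 0
	for range sink {
		n++
	}
	return n
}

// demo polls two toy plugins for a short while.
func demo() int {
	return collectFor(100*time.Millisecond,
		constInput{name: "fast", every: 10 * time.Millisecond},
		constInput{name: "slow", every: 20 * time.Millisecond},
	)
}

func main() {
	fmt.Println("metrics collected:", demo())
}
```

The goroutines share one address space, which is what makes management easy and also why one misbehaving plugin can hurt the whole agent.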
Push vs pull
Collecting model
- Adv: less impact on the middleware, simple model
- DisAdv: the application needs to expose something to "pull" (HTTP endpoint / file / ..)
func (i *MemcachedInput) ReadMetricsWithContext(ctx context.Context) (model.Metrics, error) {
	..............
	conn, err := net.DialTimeout("tcp", i.endpoint, i.timeout.Duration)
	if err != nil {
		return nil, err
	}
	defer conn.Close()

	_, err = conn.Write([]byte("stats\n"))
	if err != nil {
		return nil, err
	}
	..................
	scanner := bufio.NewScanner(conn)
	for scanner.Scan() {
		text := scanner.Text()
		if text == "END" {
			break
		}
		// Split entries which look like: STAT time 1488291730
		entries := strings.Split(text, " ")
		if len(entries) == 3 {
			v, err := strconv.ParseInt(entries[2], 10, 64)
			if err != nil {
				log.Debug("invalid value %s", entries[2])
				continue
			}
			ms = append(ms, *model.NewMetric(
				entries[1],
				model.Value(float64(v)),
				time.Now(),
				model.GaugeType,
			))
		}
	}
	..........
	return ms, nil
}
Pull sample: the input contacts the server directly
Codec Plugin / Output Plugin
type Encoder interface {
	//Name() string
	Encode(metrics model.Metrics) ([]byte, error)
	Name() string
}

type Decoder interface {
	//Name() string
	Decode(input []byte) (model.Metrics, error)
	Name() string
}
Codec interface
// OutputPlugin represent an output plugin interface
type OutputPlugin interface {
	WriteWithContext(ctx context.Context, metrics model.Metrics) error // for Cancellable write
	Encoder() codec.Encoder
	Interval() config.Duration
	GracefulStop() error
	WalReader() wal.LogReader
	Name() string
}
Output interface
Buffer design
Each output maintains its own offset i; the offset is updated when the output succeeds
Buffer design
• Advantages
• When an output fails, just roll back its index
• Chunks are organized into segments (each segment ~1GB) • To clean up, just delete old segments that have already been consumed by all outputs
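The per-output offsets and segment cleanup described above can be sketched in memory as follows. This is only an illustration under assumptions: `Log`, `NewLog`, `Peek`, `Commit` and `Clean` are hypothetical names, entries are plain strings, and a "segment" is a fixed number of entries instead of roughly 1GB on disk.

```go
package main

import "fmt"

// Log is an in-memory stand-in for the on-disk buffer.
type Log struct {
	segSize int64            // entries per segment (stand-in for ~1GB)
	base    int64            // offset of the oldest entry still retained
	entries []string         // retained entries, entries[0] has offset base
	offsets map[string]int64 // next offset to read, per output
}

// NewLog creates a log and registers the outputs that will consume it.
func NewLog(segSize int64, outputs ...string) *Log {
	l := &Log{segSize: segSize, offsets: map[string]int64{}}
	for _, o := range outputs {
		l.offsets[o] = 0
	}
	return l
}

// Append adds one entry at the end of the log.
func (l *Log) Append(e string) {
	l.entries = append(l.entries, e)
}

// Peek returns the next entry for an output without advancing its offset,
// so a failed send can simply be retried from the same position.
func (l *Log) Peek(output string) (string, bool) {
	idx := l.offsets[output] - l.base
	if idx >= int64(len(l.entries)) {
		return "", false
	}
	return l.entries[idx], true
}

// Commit advances an output's offset; call it only after a successful send.
func (l *Log) Commit(output string) {
	l.offsets[output]++
}

// Clean drops whole segments that every output has already consumed and
// returns the number of entries removed.
func (l *Log) Clean() int64 {
	lowest := int64(-1)
	for _, off := range l.offsets {
		if lowest < 0 || off < lowest {
			lowest = off
		}
	}
	if lowest < 0 {
		return 0 // no outputs registered
	}
	newBase := (lowest / l.segSize) * l.segSize
	dropped := newBase - l.base
	l.entries = l.entries[dropped:]
	l.base = newBase
	return dropped
}

func main() {
	l := NewLog(2, "kafka", "file")
	for _, e := range []string{"m0", "m1", "m2", "m3"} {
		l.Append(e)
	}
	// "kafka" consumes three entries, "file" only two.
	for i := 0; i < 3; i++ {
		l.Commit("kafka")
	}
	l.Commit("file")
	l.Commit("file")

	fmt.Println("dropped entries:", l.Clean())
	next, ok := l.Peek("file")
	fmt.Println("next for file:", next, ok)
}
```

Only the slowest output decides what can be deleted, which is why a stalled output shows up as growing buffer usage.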
Buffer design
• Other concerns
• Serialization • It's not hard to write your own serialization method: https://www.slideshare.net/dxhuy88/story-writing-byte-serializer-in-golang
• mmap vs file read • Not much difference in our case • mmap index management is cumbersome to write because it has to manipulate at 2^n addresses
• Concurrent write vs synchronized write • Synchronized write for data safety
Buffer design
type LogReader interface {
	Read() (model.Metrics, error)
	Read1() (model.Metrics, error)
	CurrentOffset() int64
	SetOffset(int64) error
	Destroy() error
}

type LogWriter interface {
	Write(*model.Metrics) error
	LastOffset() int64
}
Management friendly
• Monitoring agents is f**king hard
• Deploying agents at large scale is painful
Potential risk
• Dying unnoticed • Excessive resource consumption • Buffer overflow • Dirty data • Resend storm
A resend storm is awful
How we solve those problems
• Expose agent state as an HTTP endpoint • and monitor them all using Prometheus • Monitor everything
• Aliveness / CPU / Memory / Output lag • Use a circuit breaker / jittered resend to prevent resend storms
func (b *AutoOpenBreaker) Close() {
	log.Info("close breaker for %v", b.autoOpenTime)
	b.state = CLOSE
	b.closeTime = time.Now()
	go b.autoOpen()
}

func (b *AutoOpenBreaker) open() {
	b.state = OPEN
}

func (b *AutoOpenBreaker) IsOpen() bool {
	return b.state == OPEN
}

func (b *AutoOpenBreaker) autoOpen() {
	tick := time.Tick(b.autoOpenTime)
	select {
	case <-tick:
		log.Info("auto open breaker after %v", b.autoOpenTime)
		b.open()
	}
}
Circuit breaker
func (i *Output) retry(left int, cancelCtx context.Context, f func() error) error {
	select {
	case <-cancelCtx.Done():
		return fmt.Errorf("got cancelled")
	default: // no-op
	}

	// jitter retry
	m := math.Min(capacity, float64(base*math.Pow(2.0, float64(maxRetry-left))))
	s := rand.Intn(int(m))
	log.Debug("retry sleep %d second", s)
	time.Sleep(time.Duration(s) * time.Second)

	// do some work....
}
jitter
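The jitter idea above, capped exponential backoff with a random delay below the cap, can be isolated into a small self-contained sketch. The function names `ceilingMillis` and `jitteredDelay` and the millisecond parameters are assumptions for illustration; the snippet above works in seconds and uses its own `base`, `capacity` and `maxRetry` values, which are not shown in the slides.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// ceilingMillis returns min(maxMs, baseMs*2^attempt). The loop stops
// doubling once the cap is reached, so large attempts cannot overflow.
func ceilingMillis(attempt int, baseMs, maxMs int64) int64 {
	c := baseMs
	for i := 0; i < attempt && c < maxMs; i++ {
		c *= 2
	}
	if c > maxMs {
		c = maxMs
	}
	return c
}

// jitteredDelay picks a uniformly random delay in [0, ceiling), so agents
// that failed at the same moment do not all retry at the same moment.
func jitteredDelay(attempt int, baseMs, maxMs int64, rng *rand.Rand) time.Duration {
	c := ceilingMillis(attempt, baseMs, maxMs)
	if c <= 0 {
		return 0 // Int63n panics on non-positive arguments
	}
	return time.Duration(rng.Int63n(c)) * time.Millisecond
}

func main() {
	rng := rand.New(rand.NewSource(time.Now().UnixNano()))
	for attempt := 0; attempt < 6; attempt++ {
		d := jitteredDelay(attempt, 100, 2000, rng)
		fmt.Printf("attempt %d: ceiling %dms, sleeping %v\n",
			attempt, ceilingMillis(attempt, 100, 2000), d)
	}
}
```

Spreading retries over the whole interval, rather than sleeping for the ceiling itself, is what keeps a fleet of agents from hitting the collector in lockstep after an outage.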
Agent monitoring using prometheus / grafana
Export the agent's own metrics at http://host:port/agent_metrics
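A self-metrics endpoint of this kind might look roughly like the sketch below, using only the standard library and the Prometheus text exposition format. The `AgentStats` type and the metric names `agent_sent_total` and `agent_output_lag` are hypothetical; the slides do not list the real metric names or say which client library is used.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// AgentStats holds a few self-monitoring values. The metric names used in
// Render are made up for this example.
type AgentStats struct {
	sent      atomic.Int64 // metrics successfully written by outputs
	outputLag atomic.Int64 // buffer entries not yet consumed by outputs
}

// Render formats the stats in the Prometheus text exposition format.
func (s *AgentStats) Render() string {
	return fmt.Sprintf(
		"# TYPE agent_sent_total counter\nagent_sent_total %d\n"+
			"# TYPE agent_output_lag gauge\nagent_output_lag %d\n",
		s.sent.Load(), s.outputLag.Load())
}

// ServeHTTP lets AgentStats be mounted directly as an http.Handler.
func (s *AgentStats) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "text/plain; version=0.0.4")
	io.WriteString(w, s.Render())
}

func main() {
	stats := &AgentStats{}
	stats.sent.Add(42)
	stats.outputLag.Store(7)

	mux := http.NewServeMux()
	mux.Handle("/agent_metrics", stats)

	// httptest picks a free local port so the sketch runs anywhere; a real
	// agent would call http.ListenAndServe on its configured address.
	srv := httptest.NewServer(mux)
	defer srv.Close()

	resp, err := http.Get(srv.URL + "/agent_metrics")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Print(string(body))
}
```

A Prometheus server can then scrape every agent's endpoint, and an agent that stops answering is itself a signal that it has died.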
Admin page
Finally
• Golang is awesome • Quick prototype, works everywhere
• Never, ever write your own agent • ... unless you have to • But it's fun because there are a lot of problems
We're hiring