Project Deimos


Transcript of Project Deimos

Page 1: Project Deimos

Hi, this is simonsuo

Page 2: Project Deimos
Page 3: Project Deimos

“To provide a conceptual framework for designing a dispatch engine that reacts to a request by gathering various inputs, dispatching requests with the inputs to some pricing engines, then reassembling the results into a form the original requestor can comprehend.” – Andrei

Page 4: Project Deimos

Introducing PHOBOS DEIMOS

Page 5: Project Deimos

DEIMOS: short-term goal

Debugging: store more logs, faster

Page 6: Project Deimos

DEIMOS: long-term goal

Service-oriented performance profiling

Page 7: Project Deimos

DEIMOS: high level architecture

Ingestion / Buffering – Apache Kafka
Computation / Indexing – Apache Storm
Storage – Apache HBase

Page 8: Project Deimos

A bunch of cool toys

Page 9: Project Deimos
Page 10: Project Deimos

Kafka: concept

“Distributed publish-subscribe message queue”

Page 11: Project Deimos

Kafka: message queue
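(The slide content here was a diagram. For orientation, a minimal Java producer that publishes a log event to the BAS topic might look like the sketch below; broker addresses and the payload are illustrative assumptions, and the deck's era likely used the older 0.8 producer API, but the idea is the same.)

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class LogEventProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "kafka1:9092,kafka2:9092"); // hypothetical brokers
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

            KafkaProducer<String, String> producer = new KafkaProducer<String, String>(props);
            // Keying by root request id keeps a whole request chain in one
            // partition, so downstream consumers see its events in order.
            producer.send(new ProducerRecord<String, String>(
                    "BAS", "rootRequestId0", "serialized event payload"));
            producer.close();
        }
    }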

Page 12: Project Deimos

Kafka: storm integration
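(A sketch of the Kafka-to-Storm wiring using the storm-kafka module of that era; the ZooKeeper hosts, zkRoot, and consumer id are illustrative assumptions.)

    import backtype.storm.spout.SchemeAsMultiScheme;
    import backtype.storm.topology.TopologyBuilder;
    import storm.kafka.KafkaSpout;
    import storm.kafka.SpoutConfig;
    import storm.kafka.StringScheme;
    import storm.kafka.ZkHosts;

    public class KafkaSpoutWiring {
        public static void main(String[] args) {
            // ZooKeeper ensemble used by the Kafka cluster (hypothetical hosts).
            ZkHosts hosts = new ZkHosts("zk1:2181,zk2:2181");
            // Topic "BAS" per the Phase I plan; zkRoot and id are illustrative.
            SpoutConfig cfg = new SpoutConfig(hosts, "BAS", "/kafkaspout", "deimos-ingestion");
            cfg.scheme = new SchemeAsMultiScheme(new StringScheme()); // emit messages as strings

            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("kafka-spout", new KafkaSpout(cfg), 2);
            // ...bolts are attached to "kafka-spout" as on the following slides.
        }
    }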

Page 13: Project Deimos

Storm: concept

“Distributed real-time computation graph”

Page 14: Project Deimos

Storm: topology

[Diagram: a Number Spout emits a stream [1,2,3,4, …]. An Even Bolt receives [2,4, …] and an Odd Bolt receives [1,3, …]; a Pair Bolt joins them back into [(1,2),(3,4), …], and a Log Bolt writes the pairs to a data-store.]
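(As a rough sketch of what one of these bolts could look like under Storm's backtype.storm-era API; class and field names are illustrative, not from the deck.)

    import backtype.storm.topology.BasicOutputCollector;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.base.BaseBasicBolt;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Tuple;
    import backtype.storm.tuple.Values;

    // Passes through only even numbers; the Odd Bolt is the mirror image.
    public class EvenBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple input, BasicOutputCollector collector) {
            int n = input.getIntegerByField("number");
            if (n % 2 == 0) {
                collector.emit(new Values(n));
            }
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("number"));
        }
    }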

Page 15: Project Deimos

Storm: parallelism

Page 16: Project Deimos

Storm: tuning guidelines

• 1 worker per node per topology
• 1 executor per core for CPU-bound tasks
• 1-10 executors per core for IO-bound tasks
• Compute the total parallelism possible and distribute it amongst slow and fast tasks: high parallelism for slow tasks, low for fast tasks.
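(A sketch of how these guidelines might map onto topology configuration; the node count and back-pressure cap are assumed for illustration.)

    import backtype.storm.Config;

    public class DeimosTopologyConfig {
        public static Config tuned() {
            Config conf = new Config();
            conf.setNumWorkers(4);         // 1 worker per node per topology (4 nodes assumed)
            conf.setMaxSpoutPending(1000); // illustrative cap on in-flight tuples
            // Per-bolt executor counts are the parallelism hints passed to
            // setBolt(...): ~1 per core for CPU-bound bolts, several per core
            // for IO-bound ones such as HBase writers.
            return conf;
        }
    }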

Page 17: Project Deimos

Storm: topology code
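(The code on this slide did not survive the transcript. Below is a minimal reconstruction of the wiring for the number-stream topology shown earlier, assuming spout and bolt classes in the style of the EvenBolt sketch; all names are illustrative.)

    import backtype.storm.Config;
    import backtype.storm.LocalCluster;
    import backtype.storm.topology.TopologyBuilder;

    public class NumberTopology {
        public static void main(String[] args) {
            TopologyBuilder builder = new TopologyBuilder();

            // Spout emitting 1,2,3,4,… on field "number".
            builder.setSpout("numbers", new NumberSpout(), 1);

            // Parallel filters over the same stream.
            builder.setBolt("even", new EvenBolt(), 2).shuffleGrouping("numbers");
            builder.setBolt("odd", new OddBolt(), 2).shuffleGrouping("numbers");

            // A single task sees both streams and joins them into pairs.
            builder.setBolt("pair", new PairBolt(), 1)
                   .globalGrouping("even")
                   .globalGrouping("odd");

            // Writes the pairs to the data-store.
            builder.setBolt("log", new LogBolt(), 1).shuffleGrouping("pair");

            LocalCluster cluster = new LocalCluster();
            cluster.submitTopology("numbers-demo", new Config(), builder.createTopology());
        }
    }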

Page 18: Project Deimos

HBase: concept

“Distributed, non-relational, key/value store based on HDFS”

Page 19: Project Deimos

HBase: schema design

Raw Table: "logTable"
  Row Key:     rootRequestId|requestId
               (e.g. rootRequestId0|requestId0, rootRequestId0|requestId1,
                rootRequestId1|requestId0, rootRequestId1|requestId1)
  Families:    "rrh", "rrb", "mh", "mb"
  Qualifiers:  entryId0, entryId1, … (in each family)

Uuid/LogTime Index Table: "indexTable"
  Row Key:     logTime|uuid
               (e.g. logTime0|uuid0, logTime0|uuid1, logTime1|uuid0, logTime1|uuid1)
  Family:      "rrh"
  Qualifiers:  rootRequestId0, rootRequestId1, …
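(A sketch of writing one log entry under this schema with the HTable client API of that era; the row key, family, and qualifier follow the slide, the payload is illustrative.)

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class LogTableWriter {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable logTable = new HTable(conf, "logTable");

            // Row key groups a whole request chain: rootRequestId|requestId.
            Put put = new Put(Bytes.toBytes("rootRequestId0|requestId0"));
            // Column family "rrh" per the slide's schema; one column per entry.
            put.add(Bytes.toBytes("rrh"), Bytes.toBytes("entryId0"),
                    Bytes.toBytes("encoded entry bytes"));
            logTable.put(put);
            logTable.close();
        }
    }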

Page 20: Project Deimos

HBase: storm integration

Page 21: Project Deimos

HBase: retrieval
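(A sketch of the two-step lookup the schema implies: read the logTime|uuid index row, then prefix-scan the raw table by root request id. Keys are the example values from the schema slide; the client API is the pre-1.0 HTable style.)

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.filter.PrefixFilter;
    import org.apache.hadoop.hbase.util.Bytes;

    public class LogRetriever {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable indexTable = new HTable(conf, "indexTable");
            HTable logTable = new HTable(conf, "logTable");

            // Step 1: the logTime|uuid index row lists the matching root request ids.
            Result indexRow = indexTable.get(new Get(Bytes.toBytes("logTime0|uuid0")));
            for (byte[] rootRequestId : indexRow.getFamilyMap(Bytes.toBytes("rrh")).keySet()) {
                // Step 2: prefix scan pulls every requestId row of that request chain.
                Scan scan = new Scan();
                scan.setFilter(new PrefixFilter(Bytes.add(rootRequestId, Bytes.toBytes("|"))));
                ResultScanner rows = logTable.getScanner(scan);
                for (Result row : rows) {
                    // ...decode the entry columns from families "rrh", "rrb", "mh", "mb".
                }
                rows.close();
            }
            indexTable.close();
            logTable.close();
        }
    }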

Page 22: Project Deimos

DEIMOS: detailed implementation

[Architecture diagram: MARS & pricing tasks (via the marslogger client library) and BAS send log events to xaplog, which publishes them to the Kafka Cluster through the Kafka producer library. In the Storm Cluster a KafkaSpout feeds an IndexBolt and a LogBolt, which write to the Index and Storage tables in the HBase Cluster. mttsvc (java) reads HBase through the HBase client library and serves the MTT MLOG (Terminal) and MTT WEB (PHP) front ends.]

Page 23: Project Deimos

DEIMOS: extension

[Diagram: the HBase Cluster (Index and Storage tables) feeds a Hadoop Archive Job that writes to an Archive, and feeds Elasticsearch / Solr for Interactive Analytics.]

Page 24: Project Deimos

Stuff I learned the hard way

• Debugging is difficult (dbxtool > ddd)
• Always check the version number of open source libraries
• Find the right balance between planning and doing
• Use bcpc if you want to test things
• BASO is great
• Reading a book might be better than googling

Page 25: Project Deimos

Q&A

Page 26: Project Deimos

Scalable Logging

To Assess and Improve Performance

Page 27: Project Deimos

Problem Statement

A customer is shouting at me!
How do I find what happened quickly?
How do I prevent it next time?
How can I anticipate entirely new problems?

Page 28: Project Deimos

Use Cases (needed today)

• Debugging
  – Goal: Investigate complaints by looking at the inputs that went into a specific request.
  – What needs to be fixed: We do NOT log everything, so a lot of time is wasted trying to reproduce customer problems instead of the data already being there.
  – Motivation: Spending a week tracking down reproduction data because the logging subsystem cannot handle full selective BAEL logging in production.

Page 29: Project Deimos

Use Cases(planning for tomorrow)

• Automated Request Audit– Goal: Need to know exact inputs, path it took through the input

system, and outputs provided (all based on the logs received).– What needs to be fixed: We have no way to analyze the requests

we receive except manually one at a time. We cannot go back in time to perform hypothesis testing and automatic auditing of requests according to rules.

– Motivation: Recent malformed requests caused one of our daemons to throw an exception and crash because number of scenarios did not match number of dates in input. It is not possible to see how many malformed requests we got in past or detect this condition in production without deploying new code in the actual system itself.

Page 30: Project Deimos

Use Cases (planning for tomorrow)

• Aggregation of end-to-end trends
  – Goal: Anomaly (spike/dip) detection (define a window and build a historical distribution for the data).
  – What needs to be fixed: Need to establish expected SLAs for each kind of request received, based on input sizes and estimates of downstream system performance.
  – Motivation: The MARS team received a complaint about processing being too slow. We had no baseline. We had to use trial and error to determine what could be pushed through the system. A lot of guesswork.

• Operational analysis of the dependent systems
  – Goal: Capacity planning and performance optimization.
  – What needs to be fixed: Problem detection by analyzing deviation from historical trends for processing rates, error rates, and response times.
  – Motivation: When the downstream mortgage services started throwing errors, it took a lot of manual reproduction attempts to figure out.

Page 31: Project Deimos

The Challenge

We are reactive instead of proactive
Need More Data

Page 32: Project Deimos

Data-driven Evolution In Making Operational Substitutions

Page 33: Project Deimos

Definitions

• A log is some arbitrary sequence of events ordered in time, representing state that we want to preserve for later retrieval.
• An event is a tuple (see the sketch after this list) representing an occurrence of:
  – Input system (system type + specific instance)
  – Event time (start and end time)
  – Event ID and Parent Event ID (to establish causation)
  – Location (OS and Bloomberg process/task/machine name)
  – Privilege information (UUID)
  – Event data – can be an arbitrary object (the input system provides direction on how to interpret event data)
• Conceptually, the events are stored as a directed acyclic graph with a start node, where each node represents an event (see the MTT tool as an example).
• Input systems
  – Other systems that provide the event stream
  – Two main input system types:
    • BAEL entries
    • BAS requests
  – The only currently targeted input instance is MARS
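(A sketch of the event tuple above as a plain Java class; the field names and types are illustrative, not from the deck.)

    // Mirrors the event definition on this slide.
    public class Event {
        String systemType;      // input system type (e.g. BAS, BAEL)
        String systemInstance;  // specific input system instance (e.g. MARS)
        long startTimeMillis;   // event time: start
        long endTimeMillis;     // event time: end
        String eventId;         // this event's id
        String parentEventId;   // parent event id, to establish causation
        String location;        // OS and Bloomberg process/task/machine name
        String uuid;            // privilege information
        byte[] data;            // arbitrary event data; the input system
                                // dictates how to interpret it
    }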

Page 34: Project Deimos

Overall Architecture

Page 35: Project Deimos

Event feed – take responsibility for logging events

• MARS daemons – Send actual log events to xaplog instances.
• xaplog instances – Receive log events and forward them to the Kafka instance.
• Kafka – Middleware to queue messages; it is scalable and durable. Once Kafka accepts an event, the associated xaplog instance is freed of any further obligations.

Page 36: Project Deimos

Ingestion – group related events together

• Kafka – Collects events into two main queues.
  – First queue: BAS messages
  – Second queue: BAEL messages
  – Log events are persisted to disk.
  – Serves as a shock absorber to handle bursts in log event traffic (since it just stores the messages, it doesn’t have to process them).
  – The rest of the system can therefore be designed to handle the average load case.
• Storm Ingestion Topology – Groups the event stream by root request (a sketch follows below).
• Partitioner – Holds grouped events together.
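(A sketch of the grouping step, assuming a hypothetical ParseBolt that extracts a rootRequestId field from each raw message and a PartitionerBolt that accumulates the groups; kafkaSpout is the spout configured earlier.)

    import backtype.storm.topology.IRichSpout;
    import backtype.storm.topology.TopologyBuilder;
    import backtype.storm.tuple.Fields;

    public class IngestionWiring {
        public static TopologyBuilder wire(IRichSpout kafkaSpout) {
            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("kafka-spout", kafkaSpout, 2);

            // Extract the root request id from each raw message and emit
            // ("rootRequestId", "event").
            builder.setBolt("parse", new ParseBolt(), 8).shuffleGrouping("kafka-spout");

            // fieldsGrouping routes every event of one root request to the same
            // partitioner task, which is what holds a group together.
            builder.setBolt("partitioner", new PartitionerBolt(), 8)
                   .fieldsGrouping("parse", new Fields("rootRequestId"));
            return builder;
        }
    }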

Page 37: Project Deimos

Encoding – efficiently code the event stream at the binary level

• Partitioner – Writes the same request chain under the same rows in HBase. The data is split into three main content types:
  – BAS/BAEL headers
  – BAS string data (XML)
  – BAEL string data (trace information)
• Storm Encoding Topology – Writes each group of events as one BLOB, with special coding tailored to the data type (i.e. header data, XML, text); a sketch follows below.
• Log warehouse – Encoded blobs are written to different tables for longer-term archiving.
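(The deck does not specify the coding scheme, so as a stand-in, here is a minimal sketch that packs one group of events into a single compressed blob; gzip is purely illustrative of “coding tailored to the data type”.)

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;
    import java.util.List;
    import java.util.zip.GZIPOutputStream;

    public class BlobEncoder {
        // Packs every event of one root request into a single compressed blob.
        // Compressing similar data together (all XML, or all trace text)
        // shrinks much better than compressing events one at a time.
        public static byte[] encode(List<byte[]> eventsForRootRequest) throws IOException {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(new GZIPOutputStream(bos));
            for (byte[] event : eventsForRootRequest) {
                out.writeInt(event.length); // length-prefix so the blob can be split again
                out.write(event);
            }
            out.close();
            return bos.toByteArray();
        }
    }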

Page 38: Project Deimos

Indexing – speed up access to relevant fields for interactive querying

• Log warehouse – Storing similar data together with specialized encoding significantly reduces storage costs.
• Storm Indexing Topology – Extracts the relevant subset of data to feed the indexes.
• Indexes – The underlying implementation of the indexes. Basic ones can be stored in HBase; more complicated ones in Elasticsearch/Solr.

Page 39: Project Deimos

Querying – let users look up the event stream

• Indexes / log warehouse
  – User queries hit the indexes first.
  – If additional data is needed and is not available in an index, the query falls through to the warehouse.
• xapqrylg – New daemons to marshal requests from the UIs.
• MTT UIs – Would be unchanged. More improvements can be added later.

Page 40: Project Deimos

Phase I tasks – Replace MTT backend

• Code in xaplog to send events to the Kafka queue
  – Kafka & Storm will live on BCPC for the proof of concept; need to see about production.
  – See if we can reuse what the pricing history team did? Maybe not, it should just be a simple push.
• Design the Kafka queue layout (partitioning and topics)
  – Two topics: BAS and BAEL.
    • Maybe three later (BAS lite, BAS xml + BAEL) to decouple the ingestion rates if better latency is needed?
  – Look at the best settings and make sure DRQS 54369477 doesn’t apply.
• Storm Ingestion topology & HBase schema (in Java)
  – Write each header-data row separately and let the encoding aggregate them.
  – Blobs do not need any ingestion right now; they can be written to the target table directly.
• Storm Encoding topology & HBase schema (in Java)
  – Keeping it simple for now: split XML blobs from the rest of the data.
  – Store all non-blob data grouped by root request id (protobuf?).
  – For blob data do some basic XML-to-binary coding, and as part of the key, order responses and requests together.
  – How do we ensure that if the same log data is fed more than once it only gets written once?
• Storm Indexing topology & HBase schema (in Java)
  – A few simple indexes will live in HBase to allow query by UUID, date range, pricing #, and security.
  – How do we keep indexes synchronized with the warehouse tables?
• xapqrylg – read HBase indexes and storage tables
  – Reuse Kirill’s work on mttweb where it makes sense.

Page 41: Project Deimos

Q&A

"Go ahead, make my day.“ -Harry

Page 42: Project Deimos

Key Properties

…of a useful event stream logging system

Page 43: Project Deimos

Required Properties

1. Ownership – It accepts logging data and takes responsibility, so that input systems are freed from offering any guarantees after handoff (logging is not the main task of input systems, just a side effect).
   a. Makes it easy to generate IDs to link events in a tree.
   b. Two main causal link models can be considered (explicit is preferred):
      i.  Explicitly, by having each event carry a parent event id as well as its own event id.
      ii. Implicitly, by having a root request id and then ordering by event time and ingestion order.
2. Durability – Reduce the chances of data loss, especially in the event of crashes.
3. Idempotence – It correctly handles the same input log data if sent into the system more than once.
   a. Due to failures, input systems might send the same data twice; client-side problems are easy to handle: just send the data again.
   b. To support batch input of data from other sources (“bulk import”), to stand up another instance of the system, or to migrate from other systems in a consistent fashion.
   c. Replaying existing log data simplifies re-indexing and related side effects.
4. Time-invariance – Does not expect the event stream to be time ordered (even though it usually will be). The output of the system might differ in between, but once the exact same overall data has been fed to the system, the outputs should be the same.
5. Avoiding lock-in – Allows easy export of data in bulk into a neutral form.
   a. For exporting into other systems or into another instance.
   b. We don’t want the data to be stranded.
6. Scalable – As close to linear as possible: improve performance by just adding more machines.

Page 44: Project Deimos

Required Properties (cont’d)

7. High availability – Have some form of redundancy so that if machines in the system fail, the system can still operate, perhaps in a degraded state (performance-wise).
8. Manageable – Exports metrics to support decisions on the operation of the system.
9. Schema-agnostic – Is as schema-less as possible.
   a. Requires only knowledge of the fields it needs to index on.
   b. Otherwise shouldn’t care about the data being in a specific format.
   c. The input format should be akin to a nested JSON object, but with a parent id to correlate to a parent, and then ordered by time.
10. Space-efficient – Ability to optimize binary storage to reduce the disk space taken and improve read times, at the expense of increased complexity and CPU cost when writing the data.

Page 45: Project Deimos

Why Current Solutions Are Inadequate

• APDX (and TRACK – a functional subset of APDX)
  – Collects only numerical metrics, with no ability to store arbitrary event data or causal relationships between events. It just counts events.
  – It can be used in parallel, but it does not nearly meet our needs.
• Splunk
  – Lightweight analysis done based on:
    • {TEAM MOB2:SPLUNK TUTORIAL<GO>}
    • http://rndx.prod.bloomberg.com/questions/9584/how-should-we-do-distributed-logging
  – Main points that discourage further research:
    • Splunk expects log lines only, with no arbitrary data.
      – Hard to save space.
    • Cost is per log volume (uncompressed); we expect to easily exceed 100GiB of raw logging volume a day (supposedly that would be a one-time cost of $110k).
    • Better suited as a higher-level tool that we could maybe use on top.