Log everything! @DC13

Transcript
Page 1

Page 2

Log Everything! @DC13

Page 3

Stefan & Mike

Mike Lohmann, Co-Founder / Software Engineer

[email protected]

Dr. Stefan Schadwinkel, Co-Founder / Analytics Engineer

[email protected]

Page 4

ABOUT DECK36: Who We Are

–  DECK36 is a young spin-off from ICANS

–  Small team of 7 engineers

–  Longstanding expertise in designing, implementing and operating complex web systems

–  Developing our own data-intelligence-focused tools and web services

–  Offering our expert knowledge in Automation & Operations, Architecture & Engineering, Analytics & Data Logistics

Page 5

WHAT WE WILL TALK ABOUT: Topics

–  Log everything! – The Data Pipeline

–  Tackling the Leviathan – Realtime Stream Processing with Storm

–  JS Client DataCollector: Live Demo

–  Storm Processing with PHP: Live Demo

Page 6

Log everything! The Data Pipeline

Page 7

THE DATA PIPELINE: Requirements

Background: building and operating multiple education communities

Baseline: PokerStrategy.com KPIs

–  6M registered users, 700k posts/month, 2.8M page impressions/day, 7.6M requests/day

New products → new business models → new questions

–  Extensible, generic solution

–  Storage and accessibility matter more than specific, optimized applications

Page 8

THE DATA PIPELINE: Requirements

[Pipeline diagram: Producer → Transport → Storage, feeding Analytics and Realtime Stream Processing]

Producer

–  Monolog Plugin, JS Client

Transport

–  Flume 0.9.4 m( → RabbitMQ, Erlang Consumer

–  Evaluated Apache Kafka

Storage

–  Hadoop HDFS (our very own) → Amazon S3


Page 9

THE DATA PIPELINE: Logging Pipeline

Analytics

-  Hadoop MapReduce → Amazon EMR, Python, R

-  Exports to Excel (CSV), QlikView → Amazon Redshift

Realtime Stream Processing

-  Twitter Storm


Page 10

THE DATA PIPELINE: Unified Message Format

-  Fixed, guaranteed envelope

-  Processing driven by message content

-  A single message compresses (LZOP) to about 70% of its original size (1184 B → 817 B)

-  Message bulks compress to about 12-14% of the original size (measured at 42k and 325k messages)

Page 11

Unified Message Format
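As a rough illustration (the field names below are assumptions for this transcript, not the actual ICANS envelope), a UMF message could look like this:

var message = {
    type: 'icans.content',                // event type; drives routing and partitioning
    website: 'example.org',               // origin site; appears in the S3 path
    timestamp: '2012-10-01T12:34:56Z',    // event time; used for time partitioning
    payload: { action: 'click' }          // free-form, event-specific content
};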

Page 12

THE DATA PIPELINE: Compaction

The RabbitMQ consumer (Erlang) stores data to the cloud:

-  A relatively large number of files

-  Mixed messages

We want

-  A few files

-  Messages grouped by "Event Type" and "Time Partition"

-  Data transformation

s3://[BUCKET]/icanslog/[WEBSITE]/icans.content/year=2012/month=10/day=01/part-00000.lzo

Hive partitioning! The partition path is determined by the message content.
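A minimal sketch of that idea, reusing the hypothetical envelope fields from above (the real compaction is done by the Cascalog job described on the next pages):

function partitionPath(bucket, msg) {
    // Derive the Hive-style partition path from the message content.
    var d = new Date(msg.timestamp);
    var pad = function (n) { return (n < 10 ? '0' : '') + n; };
    return 's3://' + bucket + '/icanslog/' + msg.website + '/' + msg.type +
           '/year=' + d.getUTCFullYear() +
           '/month=' + pad(d.getUTCMonth() + 1) +
           '/day=' + pad(d.getUTCDate());
}

// partitionPath('[BUCKET]', message) with the example message above yields:
// s3://[BUCKET]/icanslog/example.org/icans.content/year=2012/month=10/day=01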

Page 13

THE DATA PIPELINE: Compaction Using Cascalog

-  Based on Clojure (LISP) and Cascading

-  Provides a Datalog-like query language

-  Don't LISP? → JCascalog

Very handy features (unavailable in Hive or Pig)

-  Cascading Output Taps can be parameterized by data records

-  Trap location for corrupted records (job finishes for all the correct messages)

-  Runs within the JVM → large available codebase, arbitrary processing is simple

Page 14

Cascalog Query Syntax

Cascalog is Clojure, Clojure is Lisp

(?<- (stdout) [?person] (age ?person ?age) … (< ?age 30))

Reading the query:

-  ?<- is the query operator

-  (stdout) is the Cascading output tap

-  [?person] lists the columns of the dataset generated by the query

-  (age ?person ?age) is a "generator", (< ?age 30) a "predicate": use as many as you want, both can be any Clojure function, and Clojure can call anything that is available within a JVM

Page 15

Cascalog Query Syntax

Run the Cascalog processing on Amazon EMR:

./elastic-mapreduce [standard parameters omitted]

--jar s3://[BUCKET]/mapreduce/compaction/icans-cascalog.jar

--main-class icans.cascalogjobs.processing.compaction

--args "s3://[BUCKET]/incoming/*/*/*/","s3://[BUCKET]/icanslog","s3://[BUCKET]/icanslog-error"

Page 16

THE DATA PIPELINE: Data Queries with Hive

Hive is table-based and provides SQL-like syntax:

-  Assumes one storage location (directory) per table

-  Simple to use if you know SQL

-  Widely used, rapid development for "simple" queries

Hive @ Amazon

-  Table locations can be S3

-  "Cluster on demand" → requires rebuilding the Hive metadata

-  CREATE TABLE for source and target S3 locations

-  Import Table metadata (auto-discovery for partitions)

-  INSERT OVERWRITE to query source table(s) and store to target S3 location

Page 17

Hive @ Amazon (1)

Page 18

Hive @ Amazon (2)

We can now simply copy the data from S3 and import it into any local analytical tool such as Excel, Redshift, QlikView, or R.

Page 19

Further Reading

-  More details in the Log Everything! ebook

-  Available at Amazon and DeveloperPress

Page 20

THE DATA PIPELINE: Still, It's Batch Processing

-  While quite efficient in flight, the logistics of getting the job started are significant.

-  Only cost-efficient for long-distance travel.

Page 21

THE DATA PIPELINE: Instant Insight through Stream Processing

-  Often, only updates for the recent day, week, or month are necessary

-  Timing matters when direct feedback or user interaction is desired

Page 22

More Wind In The Sails With Storm

Page 23

REALTIME STREAM PROCESSING: Instant Insight through Stream Processing

-  Distributed realtime processing framework

-  Battle-proven by Twitter

-  All *BINGO abilities fulfilled!

-  Hadoop = batch data processing; Storm = realtime data processing

-  More (and maybe new) *BINGO: DRPC, ETL, RTET, Spouts, Bolts, Tuple, Topology

-  Easy to use (really!)

Page 24

Realtime Stream Processing Infrastructure with Storm

[Architecture diagram: apps & servers and a NodeJS frontend act as producers and send events to a queue (transport). The Storm cluster (Nimbus as master, Zookeeper, and Supervisors running Workers) consumes the queue for realtime data stream analytics. Results flow to storage (S3, DB) and analytics; Zabbix and Graylog handle monitoring and logging.]

Page 25

REALTIME STREAM PROCESSING: JS Client Features

-  Event system

-  Master/Slave Tabs

-  Local queuing of data (see the sketch after this list)

-  Ability to use node modules

-  Easy to extend

-  Complete development suite

-  Delivers bundles with or without vendor libraries
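As a minimal sketch of the local-queuing idea (the actual starlog-client internals differ; the storage key and event name here are made up):

// Buffer events in localStorage so that a master tab, or a later
// page view, can flush them once a socket connection is available.
var QUEUE_KEY = 'starlog-queue';    // hypothetical storage key

function enqueue(event) {
    var queue = JSON.parse(localStorage.getItem(QUEUE_KEY) || '[]');
    queue.push(event);
    localStorage.setItem(QUEUE_KEY, JSON.stringify(queue));
}

function flush(socket) {
    var queue = JSON.parse(localStorage.getItem(QUEUE_KEY) || '[]');
    queue.forEach(function (event) {
        socket.emit('umf', event);  // hypothetical event name
    });
    localStorage.removeItem(QUEUE_KEY);
}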

Page 26

Realtime Stream Processing - Loading the JS Client

The client is embedded via:

<script .. src="https://cdn.tradimo.com/js/starlog-client.min.js?5193e1ba0325c756b78d87384d2f80e9"></script>

Handshake with the NodeJS backend:

1. The browser requests https://../starlog-client.min.js via the script tag.
2. The server creates a signed cookie and responds with starlog-client.min.js plus Set-Cookie: UUID.
3. The client requests /socket.io/1/websockets with Upgrade: websocket and Cookie: UUID.
4. The server checks the cookie and answers HTTP 101 – Protocol Change (Connection: Upgrade, Upgrade: websocket).
5. Over the established connection the client sends collected data in UMF to the queue ("backend magic"), and the backend sends data (e.g. counts) back to the client.
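The server side of this handshake could look roughly like the following sketch, assuming socket.io on the NodeJS side; the 'umf' event name and the omitted cookie-signing and queue-publishing code are assumptions, not the actual backend:

var http = require('http');
var server = http.createServer();
var io = require('socket.io')(server);

io.on('connection', function (socket) {
    // The signed UUID cookie set when starlog-client.min.js was served
    // arrives with the websocket handshake; verify it here (omitted).
    var cookies = socket.handshake.headers.cookie || '';

    socket.on('umf', function (message) {   // hypothetical event name
        // Forward the UMF message to the transport queue (e.g. RabbitMQ);
        // the publishing code is omitted in this sketch.
        console.log('received UMF', message);
    });
});

server.listen(3001);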

Page 27

Realtime Stream Processing - JS Client in action

Use case: if the number of clicks on a domain % 10 == 0, send the "Star Trek Commander" badge.

1. The ClickEvent collector registers an onclick event handler.
2. Clicked-Data is written to localstorage.
3. The client observes localstorage and sends the Clicked-Data over the socket connection.
4. NodeJS forwards it as Clicked-Data-UMF.

Page 28

Realtime Stream Processing - JS Client in action

// logger, localstorage, and starlogclient are provided by the starlog JS client framework.
function ClickFetcher() {
    this.collectData = function (callback) {
        var clicked = 1;
        logger.debug('ClickFetcher - collectData called!');
        // Count every click and store it under the current host + path.
        window.onclick = function () {
            var collectedData = {
                key: window.location.host.toString() + window.location.pathname.toString(),
                value: {
                    payload: clicked,
                    timestamp: +new Date()
                }
            };
            localstorage.set(collectedData, function (storageResult) {
                logger.debug("err = " + storageResult.hasError());
                logger.debug("storageResult = " + storageResult);
            }, false, true, true);
            clicked++;
        };
    };
}

// Run the fetcher whenever the client enters its data-collection phase.
var clickFetcher = new ClickFetcher();
starlogclient.on(starlogclient.COLLECTINGDATA, clickFetcher.collectData);

Page 29

Client Live Demo

https://localhost:3001/test/1-page-stub.html

Page 30

REALTIME STREAM PROCESSING: Producer Libraries

-  LoggingComponent: provides interfaces, filters and handlers

-  LoggingBundle: glues it all together for Symfony2

-  Drupal Logging Module: Using the LoggingComponent

-  JS Frontend Client: LogClient Framework for Browsers

https://github.com/ICANS/IcansLoggingComponent

https://github.com/ICANS/IcansLoggingBundle

https://github.com/ICANS/drupal-logging-module

https://github.com/DECK36/starlog-js-frontend-client

Page 31

Realtime Stream Processing - PHP & Storm

Use case: if the number of clicks on a domain % 10 == 0, send the "Star Trek Commander" badge.

Using PHP for that! https://github.com/Lazyshot/storm-php/blob/master/lib/storm.php

Flow: Clicked-Data-UMF → Queue → Storm topology → Event: "Star Trek Commander" Badge
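The demo implements this bolt in PHP via storm-php; purely as a sketch of the logic (kept in JavaScript for consistency with the client code above, so this is not the storm-php API), the bolt boils down to:

var clicksPerDomain = {};   // per-domain counter held by the bolt

function handleClickTuple(umf) {
    // umf.key is host + path as collected by the ClickFetcher.
    var domain = umf.key;
    clicksPerDomain[domain] = (clicksPerDomain[domain] || 0) + 1;
    if (clicksPerDomain[domain] % 10 === 0) {
        // Every 10th click: emit the badge event back towards the user.
        return { type: 'badge', name: 'Star Trek Commander', domain: domain };
    }
    return null;    // nothing to emit
}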

Page 32

Storm & PHP Live Demo

Page 33

REALTIME STREAM PROCESSING: Get Inspired!

Powered-by Storm: https://github.com/nathanmarz/storm/wiki/Powered-By

-  50+ companies (Twitter, Yahoo, Groupon, Ooyala, Baidu, Wayfair, …)

-  Ads & real-time bidding, Data-centric (Economic, Environmental, Health), User interactions

Language-agnostic backend systems (Operate Storm, Develop in PHP)

Streaming "counts": Sentiment Analysis, Frequent Items, Multi-armed Bandits, …

DRPC: Custom user feeds, Complex Queries (e.g. tracing graph links)

Realtime, distributed ETL

-  Buffering / Retries

-  Integrate Data: Third-party API, Machine Learning

-  Store to DBs, search engines, etc.

Page 34

Questions?

Page 35

Thanks a lot!

Page 36

You can find us:

github.com/DECK36

[email protected]

deck36.de