Log everything! @DC13


Description

Big commercial websites breathe data: they create a lot of it very fast, but they also need feedback based on that very same data in order to get better and better. In this talk we present our ideas, the drawbacks, and the solutions for building your own big data infrastructure. We further explore the possibilities of accessing and harnessing the data using map/reduce and near-realtime approaches, to prepare you for the most challenging part of it all: gaining relevant knowledge you did not have before. This talk was held at the Developer Conference 2013 (http://www.developer-conference.eu/session_post/log-everything/)

Transcript of Log everything! @DC13

Page 1: Log everything! @DC13
Page 2: Log everything! @DC13

Log Everything! @DC13

Page 3: Log everything! @DC13

Stefan & Mike

Mike Lohmann Co-Founder / Software Engineer

[email protected]

Dr. Stefan Schadwinkel Co-Founder / Analytics Engineer

[email protected]

Page 4: Log everything! @DC13

ABOUT DECK36 Who We Are

–  DECK36 is a young spin-off from ICANS

–  Small team of 7 engineers

–  Longstanding expertise in designing, implementing and operating complex web systems

–  Developing our own data-intelligence-focused tools and web services

–  Offering our expert knowledge in Automation & Operations, Architecture & Engineering, Analytics & Data Logistics

Page 5: Log everything! @DC13

WHAT WE WILL TALK ABOUT Topics

–  Log everything! – The Data Pipeline.

–  Tackling the Leviathan – Realtime Stream Processing with Storm.

–  JS Client DataCollector: Live Demo

–  Storm Processing with PHP: Live Demo

Page 6: Log everything! @DC13

Log everything! The Data Pipeline

Page 7: Log everything! @DC13

THE DATA PIPELINE Requirements

Background: Building and operating multiple education communities

Baseline: PokerStrategy.com KPIs

–  6M registered users, 700k posts/month, 2.8M page impressions/day, 7.6M requests/day

New products → new business models → new questions

–  Extensible, generic solution

–  Storage and accessibility are more important than specific, optimized applications

Page 8: Log everything! @DC13

THE DATA PIPELINE Requirements

(Pipeline diagram: Producer → Transport → Storage → Analytics / Realtime Stream Processing)

Producer

–  Monolog Plugin, JS Client

Transport

–  Flume 0.9.4 m( → RabbitMQ, Erlang Consumer

–  Evaluated Apache Kafka

Storage

–  Hadoop HDFS (our very own) → Amazon S3

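To make the producer side concrete, here is a minimal sketch of a Monolog handler that publishes every log record to RabbitMQ via php-amqplib. The class name, exchange name and connection settings are illustrative assumptions, not the actual IcansLoggingComponent API:

<?php
// Minimal sketch (assumed names, not the ICANS API): ship Monolog records to RabbitMQ.
require 'vendor/autoload.php';

use Monolog\Logger;
use Monolog\Handler\AbstractProcessingHandler;
use PhpAmqpLib\Connection\AMQPStreamConnection;
use PhpAmqpLib\Message\AMQPMessage;

class AmqpLogHandler extends AbstractProcessingHandler
{
    private $channel;

    public function __construct($channel, $level = Logger::INFO)
    {
        parent::__construct($level);
        $this->channel = $channel;
    }

    protected function write(array $record)
    {
        // Publish the record as JSON to a fanout exchange ("logs" is an assumed name).
        $this->channel->basic_publish(new AMQPMessage(json_encode($record)), 'logs');
    }
}

$connection = new AMQPStreamConnection('localhost', 5672, 'guest', 'guest');
$channel = $connection->channel();
$channel->exchange_declare('logs', 'fanout', false, true, false);

$logger = new Logger('app');
$logger->pushHandler(new AmqpLogHandler($channel));
$logger->info('user.click', array('domain' => 'example.com'));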

Page 9: Log everything! @DC13

THE DATA PIPELINE Logging Pipeline

Analytics

-  Hadoop MapReduce → Amazon EMR, Python, R

-  Exports to Excel (CSV), QlikView → Amazon Redshift

Realtime Stream Processing

-  Twitter Storm


Page 10: Log everything! @DC13

THE DATA PIPELINE Unified Message Format

-  Fixed, guaranteed envelope

-  Processing driven by message content

-  A single message compresses (LZOP) to about 70% of its original size (1184 B → 817 B)

-  Message bulks compress to about 12-14% of the original size (at 42k and 325k messages)

Page 11: Log everything! @DC13

Unified Message Format
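The slide shows the format itself. As a rough illustration only (the field names below are assumptions, the talk does not spell out the exact schema), an envelope satisfying the bullets above could be built like this on the PHP producer side:

<?php
// Hypothetical UMF envelope; all field names are illustrative.
$message = array(
    // Fixed, guaranteed envelope
    'version' => 1,
    'type'    => 'icans.content',  // event type drives routing and processing
    'created' => gmdate('c'),      // used later for time partitioning
    'host'    => gethostname(),
    // Free-form, event-specific payload
    'body'    => array(
        'userId' => 4711,
        'action' => 'post.created',
    ),
);
echo json_encode($message);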

Page 12: Log everything! @DC13

THE DATA PIPELINE Compaction

The RabbitMQ consumer (Erlang) stores data to the cloud

-  A relatively large number of files

-  Mixed messages

We want

-  A few files

-  Messages grouped by "Event Type" and "Time Partition"

-  Data transformation

s3://[BUCKET]/icanslog/[WEBSITE]/icans.content/year=2012/month=10/day=01/part-00000.lzo

-  The year=/month=/day= layout is Hive partitioning!

-  Event type and time partition are determined by the message content
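As a small sketch of that idea (reusing the hypothetical envelope fields from the UMF example earlier), the compaction target path can be derived purely from the message content:

<?php
// Sketch: derive the Hive-partitioned S3 target path from a message.
$message = array('type' => 'icans.content', 'created' => '2012-10-01T12:00:00Z');

function targetPath(array $message, $bucket, $website)
{
    $ts = strtotime($message['created']);  // 'created' and 'type' are assumed fields
    return sprintf(
        's3://%s/icanslog/%s/%s/year=%s/month=%s/day=%s/',
        $bucket,
        $website,
        $message['type'],
        date('Y', $ts),
        date('m', $ts),
        date('d', $ts)
    );
}

// Prints: s3://[BUCKET]/icanslog/[WEBSITE]/icans.content/year=2012/month=10/day=01/
echo targetPath($message, '[BUCKET]', '[WEBSITE]');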

Page 13: Log everything! @DC13

THE DATA PIPELINE Compaction

Using Cascalog

-  Based on Clojure (LISP) and Cascading

-  Provides a Datalog-like query language

-  Don't LISP? → JCascalog

Very handy features (unavailable in Hive or Pig)

-  Cascading Output Taps can be parameterized by data records

-  Trap location for corrupted records (job finishes for all the correct messages)

-  Runs within the JVM → large available codebase, arbitrary processing is simple

Page 14: Log everything! @DC13

Cascalog Query Syntax

Cascalog is Clojure, Clojure is Lisp

(?<- (stdout) [?person] (age ?person ?age) … (< ?age 30))

In the example above:

-  ?<- is the query operator, (stdout) the Cascading output tap, and [?person] lists the columns of the dataset generated by the query.

-  (age ?person ?age) is a "generator" and (< ?age 30) a "predicate"; a query can have as many of each as you want.

-  Both can be any Clojure function, and Clojure can call anything that is available within a JVM.

Page 15: Log everything! @DC13

Cascalog Query Syntax

Run the Cascalog processing on Amazon EMR:

./elastic-mapreduce [standard parameters omitted]

--jar s3://[BUCKET]/mapreduce/compaction/icans-cascalog.jar

--main-class icans.cascalogjobs.processing.compaction

--args "s3://[BUCKET]/incoming/*/*/*/","s3://[BUCKET]/icanslog","s3://[BUCKET]/icanslog-error"

Page 16: Log everything! @DC13

THE DATA PIPELINE Data Queries with Hive

Hive is table-based and provides SQL-like syntax

-  Assumes one storage location (directory) per table

-  Simple to use if you know SQL

-  Widely used, rapid development for "simple" queries

Hive @ Amazon

-  Table locations can be S3

-  "Cluster on demand" → requires rebuilding the Hive metadata:

-  CREATE TABLE for source and target S3 locations

-  Import Table metadata (auto-discovery for partitions)

-  INSERT OVERWRITE to query source table(s) and store to target S3 location

Page 17: Log everything! @DC13

Hive @ Amazon (1)

Page 18: Log everything! @DC13

Hive @ Amazon (2)

We can now simply copy the data from S3 and import it into any local analytical tool, e.g. Excel, Redshift, QlikView, R, etc.

Page 19: Log everything! @DC13

Further Reading

-  More details in the Log Everything! ebook

-  Available at Amazon and DeveloperPress

Page 20: Log everything! @DC13

THE DATA PIPELINE Still: It’s Batch Processing

-  While quite efficient in flight, the logistics of getting the job started are significant.

-  Only cost-efficient for long-distance travel.

Page 21: Log everything! @DC13

THE DATA PIPELINE Instant Insight through Stream Processing

-  Often, only updates for the recent day, week, or month are necessary

-  Time matters when direct feedback or user interaction is desired

Page 22: Log everything! @DC13

More Wind In The Sails With Storm

Page 23: Log everything! @DC13

REALTIME STREAM PROCESSING Instant Insight through Stream Processing

-  Distributed realtime processing framework

-  Battle-proven by Twitter

-  All *BINGO abilities fulfilled!

-  Hadoop = batch data processing; Storm = realtime data processing

-  More (and maybe new) *BINGO: DRPC, ETL, RTET, Spouts, Bolts, Tuple, Topology

-  Easy to use (really!)

Page 24: Log everything! @DC13

Realtime Stream Processing Infrastructure with Storm

(Architecture diagram: apps, servers and the NodeJS client produce into a queue; the Storm cluster, consisting of Nimbus (master), Zookeeper and supervisors running workers, consumes the realtime data stream and feeds storage (S3, DB), analytics and monitoring (Zabbix, Graylog).)

Page 25: Log everything! @DC13

REALTIME STREAM PROCESSING JS Client Features

-  Event system

-  Master/Slave Tabs

-  Local queuing of data

-  Ability to use node modules

-  Easy to extend

-  Complete development suite

-  Bundles can be delivered with or without vendor libraries

Page 26: Log everything! @DC13

Realtime Stream Processing - Loading the JS Client

(Sequence diagram, reconstructed as a flow:)

-  The browser loads the client from the CDN:
   <script src="https://cdn.tradimo.com/js/starlog-client.min.js?5193e1ba0325c756b78d87384d2f80e9"></script>

-  The NodeJS backend creates a signed cookie and delivers starlog-client.min.js with Set-Cookie: UUID.

-  The client requests /socket.io/1/websockets (Upgrade: websockets, Cookie: UUID); the backend checks the cookie and answers HTTP 101 Protocol Change (Connection: Upgrade, Upgrade: websocket); the connection is established.

-  While collecting data, the client sends it in UMF; the backend counts, applies its magic and forwards the messages to the queue, and can also send data back to the client.

Page 27: Log everything! @DC13

Realtime Stream Processing - JS Client in action

Use case: if the number of clicks on a domain satisfies clicks % 10 == 0, send the "Star Trek Commander" badge

(Sequence diagram, reconstructed as a flow:)

-  The ClickEvent collector registers an onclick event handler and writes the Clicked-Data to localstorage.

-  The SocketConnect observes localstorage and sends the Clicked-Data as a Clicked-Data-UMF message to NodeJS.

Page 28: Log everything! @DC13

Realtime Stream Processing - JS Client in action

function ClickFetcher() {
    this.collectData = function (callback) {
        var clicked = 1;
        logger.debug('ClickFetcher - collectData called!');
        window.onclick = function () {
            // Key the click by host + path; the value carries count and timestamp.
            var collectedData = {
                key: window.location.host.toString() + window.location.pathname.toString(),
                value: {
                    payload: clicked,
                    timestamp: +new Date()
                }
            };
            localstorage.set(collectedData, function (storageResult) {
                logger.debug("err = " + storageResult.hasError());
                logger.debug("storageResult = " + storageResult);
            }, false, true, true);
            clicked++;
        };
    };
}

// Register the collector with the starlog client's event system.
var clickFetcher = new ClickFetcher();
starlogclient.on(starlogclient.COLLECTINGDATA, clickFetcher.collectData);

Page 29: Log everything! @DC13

Client Live Demo

https://localhost:3001/test/1-page-stub.html

Page 30: Log everything! @DC13

REALTIME STREAM PROCESSING Producer Libraries

-  LoggingComponent: Provides interfaces, filters and handlers

-  LoggingBundle: Glues it all together for Symfony2

-  Drupal Logging Module: Using the LoggingComponent

-  JS Frontend Client: LogClient Framework for Browsers

https://github.com/ICANS/IcansLoggingComponent

https://github.com/ICANS/IcansLoggingBundle

https://github.com/ICANS/drupal-logging-module

https://github.com/DECK36/starlog-js-frontend-client

Page 31: Log everything! @DC13

Realtime Stream Processing - PHP & Storm

Use case: if the number of clicks on a domain satisfies clicks % 10 == 0, send the "Star Trek Commander" badge

Using PHP for that! https://github.com/Lazyshot/storm-php/blob/master/lib/storm.php

(Flow: Clicked-Data-UMF → Queue → Storm topology with a PHP bolt → event: "Star Trek Commander" badge; a sketch of the bolt follows below.)
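A hedged sketch of what such a bolt could look like on top of storm-php: the BasicBolt/Tuple class names and the emit()/run() calls below follow that library's port of Storm's Python multilang shim, so treat them as assumptions and check lib/storm.php for the exact interface.

<?php
// Sketch of the badge bolt in PHP, speaking Storm's multilang shell protocol.
require 'lib/storm.php';

class BadgeBolt extends BasicBolt
{
    private $clicks = array();

    public function process(Tuple $tuple)
    {
        // First tuple field: the domain taken from the Clicked-Data-UMF message.
        $domain = $tuple->values[0];
        $this->clicks[$domain] = isset($this->clicks[$domain])
            ? $this->clicks[$domain] + 1
            : 1;

        // Every 10th click on a domain triggers the badge event.
        if ($this->clicks[$domain] % 10 === 0) {
            $this->emit(array($domain, 'Star Trek Commander'));
        }
    }
}

$bolt = new BadgeBolt();
$bolt->run();

In a real topology the spout's stream would be connected to this bolt with a fields grouping on the domain, so that all clicks for one domain reach the same bolt instance and the counter stays consistent.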

Page 32: Log everything! @DC13

Storm & PHP Live Demo

Page 33: Log everything! @DC13

REALTIME STREAM PROCESSING Get Inspired!

Powered by Storm: https://github.com/nathanmarz/storm/wiki/Powered-By

-  50+ companies (Twitter, Yahoo, Groupon, Ooyala, Baidu, Wayfair, …)

-  Ads & real-time bidding, Data-centric (Economic, Environmental, Health), User interactions

Language-agnostic backend systems (Operate Storm, Develop in PHP)

Streaming "counts": Sentiment Analysis, Frequent Items, Multi-armed Bandits, …

DRPC: Custom user feeds, complex queries (e.g. tracing graph links)

Realtime, distributed ETL

-  Buffering / Retries

-  Integrate Data: Third-party API, Machine Learning

-  Store to DBs, search engines, etc.

Page 34: Log everything! @DC13

Questions?

Page 35: Log everything! @DC13

Thanks a lot!

Page 36: Log everything! @DC13

You can find us:

github.com/DECK36

[email protected]

deck36.de