Log everything! @DC13

Posted on 06-May-2015


Description

Big commercial websites breathe data: they create a lot of it very fast, but also need feedback based on that very same data to become better and better. In this talk we show our ideas, the drawbacks, and the solutions for building your own big data infrastructure. We further explore the possibilities of accessing and harnessing the data using map/reduce and near real-time approaches, in order to prepare you for the most challenging part of it all: gaining relevant knowledge you did not have before. This talk was held at the Developer Conference 2013 (http://www.developer-conference.eu/session_post/log-everything/)

Transcript of Log everything! @DC13

Log Everything! @DC13

Stefan & Mike

Mike Lohmann Co-Founder / Software Engineer

mike.lohmann@deck36.de

Dr. Stefan Schadwinkel Co-Founder / Analytics Engineer

stefan.schadwinkel@deck36.de

ABOUT DECK36 Who We Are

–  DECK36 is a young spin-off from ICANS

–  Small team of 7 engineers

–  Longstanding expertise in designing, implementing and operating complex web systems

–  Developing our own data-intelligence-focused tools and web services

–  Offering our expert knowledge in Automation & Operations, Architecture & Engineering, Analytics & Data Logistics

WHAT WE WILL TALK ABOUT Topics

–  Log everything! – The Data Pipeline.

–  Tackling the Leviathan – Realtime Stream Processing with Storm.

–  JS Client DataCollector: Live Demo

–  Storm Processing with PHP: Live Demo

Log everything! The Data Pipeline

THE DATA PIPELINE Requirements

Background: Building and operating multiple education communities

Baseline: PokerStrategy.com KPIs

–  6M registered users, 700k posts/month, 2.8M page impressions/day, 7.6M requests/day

New products → new business models → new questions

–  Extendable, generic solution

–  Storage and accessibility more important than specific, optimized applications

[Pipeline diagram: Producer → Transport → Storage → Analytics / Realtime Stream Processing]

THE DATA PIPELINE Logging Pipeline

Producer

–  Monolog Plugin, JS Client (see the producer sketch after this list)

Transport

–  Flume 0.9.4 (did not work out) → RabbitMQ, Erlang consumer

–  Evaluated Apache Kafka

Storage

–  Hadoop HDFS (our very own) → Amazon S3


Analytics

-  Hadoop MapReduce → Amazon EMR, Python, R

-  Exports to Excel (CSV), QlikView → Amazon Redshift

Realtime Stream Processing

-  Twitter Storm
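For illustration, a Monolog producer that ships every log record to RabbitMQ could look roughly like this. A minimal sketch using Monolog's AmqpHandler and the PHP AMQP extension; the connection settings and the 'logs' exchange name are our own, not the exact DECK36 plugin:

<?php
// Minimal producer sketch: publish every Monolog record (as JSON) to a
// RabbitMQ topic exchange. Host and exchange name are illustrative.
use Monolog\Logger;
use Monolog\Handler\AmqpHandler;

$connection = new AMQPConnection(array('host' => 'localhost'));
$connection->connect();

$channel  = new AMQPChannel($connection);
$exchange = new AMQPExchange($channel);
$exchange->setName('logs');
$exchange->setType(AMQP_EX_TYPE_TOPIC);
$exchange->declareExchange();

$logger = new Logger('website');
$logger->pushHandler(new AmqpHandler($exchange, 'logs'));

// From here on, every log call is published to RabbitMQ.
$logger->info('user.registered', array('userId' => 12345));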


THE DATA PIPELINE Unified Message Format

-  Fixed, guaranteed envelope (illustrative example below)

-  Processing driven by message content

-  A single message compresses (LZOP) to about 70% of its original size (1184 B → 817 B)

-  Message bulks compress to about 12-14% of original size (at 42k & 325k messages)

[Diagram: Unified Message Format]
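For illustration, such an envelope-plus-body message could be built like this in PHP (all field names are our own guesses, not the actual UMF definition):

<?php
// Hypothetical UMF-style message: a fixed envelope around a free-form body.
// The event type and timestamp later drive routing and Hive partitioning.
$message = array(
    'version' => '1.0',
    'host'    => 'www.example.com',       // producing host (illustrative)
    'type'    => 'icans.content',         // event type
    'created' => '2012-10-01T12:34:56Z',  // source of the time partition
    'body'    => array('userId' => 12345, 'action' => 'click'),
);

echo json_encode($message);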

THE DATA PIPELINE Compaction

The RabbitMQ consumer (Erlang) stores the data to the cloud, which leaves us with

-  A relatively large number of files

-  Mixed messages

We want

-  A few files

-  Messages grouped by "Event Type" and "Time Partition"

-  Data transformation

s3://[BUCKET]/icanslog/[WEBSITE]/icans.content/year=2012/month=10/day=01/part-00000.lzo

-  The path layout follows Hive partitioning

-  Event type and time partition are determined by the message content (see the sketch below)
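In essence, compaction maps each message's content to a partitioned output location. A PHP sketch of that mapping, reusing the hypothetical envelope fields from above (the real job implements this as a Cascading output tap):

<?php
// Derive the Hive-partitioned S3 path from the message content.
// 'type' and 'created' are the assumed envelope fields from above.
function partitionPath(array $message, $bucket, $website)
{
    $ts = strtotime($message['created']);

    return sprintf(
        's3://%s/icanslog/%s/%s/year=%s/month=%s/day=%s/',
        $bucket,
        $website,
        $message['type'],
        date('Y', $ts),
        date('m', $ts),
        date('d', $ts)
    );
}

// e.g. s3://[BUCKET]/icanslog/[WEBSITE]/icans.content/year=2012/month=10/day=01/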

THE DATA PIPELINE Compaction Using Cascalog

-  Based on Clojure (LISP) and Cascading

-  Provides a Datalog-like query language

-  Don't speak LISP? → JCascalog

Very handy features (unavailable in Hive or Pig)

-  Cascading output taps can be parameterized by data records

-  Trap location for corrupted records (the job finishes for all correct messages)

-  Runs within the JVM → large available codebase; arbitrary processing is simple

Cascalog Query Syntax

Cascalog is Clojure, Clojure is Lisp

(?<- (stdout) [?person] (age ?person ?age) … (< ?age 30))

Reading the query:

-  ?<- : the query operator

-  (stdout) : a Cascading output tap

-  [?person] : the columns of the dataset generated by the query

-  (age ?person ?age) is a "generator", (< ?age 30) a "predicate"

-  Use as many generators and predicates as you want; both can be any Clojure function

-  Clojure can call anything that is available within a JVM


Run the Cascalog processing on Amazon EMR:

./elastic-mapreduce [standard parameters omitted] \
  --jar s3://[BUCKET]/mapreduce/compaction/icans-cascalog.jar \
  --main-class icans.cascalogjobs.processing.compaction \
  --args "s3://[BUCKET]/incoming/*/*/*/","s3://[BUCKET]/icanslog","s3://[BUCKET]/icanslog-error"

THE DATA PIPELINE Data Queries with Hive

Hive is table-based and provides an SQL-like syntax

-  Assumes one storage location (directory) per table

-  Simple to use if you know SQL

-  Widely used; rapid development for "simple" queries

Hive @ Amazon

-  Table locations can be S3

-  "Cluster on demand" → requires rebuilding the Hive metadata:

-  CREATE TABLE for source and target S3 locations

-  Import table metadata (auto-discovery for partitions)

-  INSERT OVERWRITE to query the source table(s) and store to the target S3 location (see the HiveQL sketch below)
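A minimal sketch of these steps in HiveQL (table names, columns and the result query are made up; the actual statements were shown on the following slides):

-- Source table over the compacted log data on S3 (schema is illustrative).
CREATE EXTERNAL TABLE icans_content (message STRING)
PARTITIONED BY (year STRING, month STRING, day STRING)
LOCATION 's3://[BUCKET]/icanslog/[WEBSITE]/icans.content/';

-- Auto-discover the year/month/day partitions (Amazon EMR Hive extension).
ALTER TABLE icans_content RECOVER PARTITIONS;

-- Target table at another S3 location.
CREATE EXTERNAL TABLE daily_counts (day_key STRING, events BIGINT)
LOCATION 's3://[BUCKET]/results/daily_counts/';

-- Query the source table and store the result to the target S3 location.
INSERT OVERWRITE TABLE daily_counts
SELECT concat(year, '-', month, '-', day), count(*)
FROM icans_content
GROUP BY year, month, day;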

[Slides "Hive @ Amazon (1)" and "Hive @ Amazon (2)" showed the corresponding Hive statements]

We can now simply copy the data from S3 and import it into any local analytical tool, e.g. Excel, Redshift, QlikView, R, etc.

Further Reading

-  More details in the Log Everything! ebook

-  Available at Amazon and DeveloperPress

THE DATA PIPELINE Still: It’s Batch Processing

-  While quite efficient in flight, the logistics of getting the job started are significant.

-  Only cost-efficient for long-distance travel.

THE DATA PIPELINE Instant Insight through Stream Processing

-  Often, only updates for the recent day, week, or month are necessary

-  Time is of the essence when direct feedback or user interaction is desired

More Wind In The Sails With Storm

-  Distributed realtime processing framework

-  Battle-proven by Twitter

-  All *BINGO-Abilities fulfilled!

-  Hadoop = data batch processing; Storm = realtime data processing

-  More (and maybe new) *BINGO: DRPC, ETL, RTET, Spouts, Bolts, Tuple, Topology

-  Easy to use (Really!)

REALTIME STREAM PROCESSING Instant Insight through Stream Processing

Realtime Stream Processing Infrastructure with Storm

[Architecture diagram: apps & servers and the NodeJS client act as producers; transport goes through a queue into the Storm cluster (Nimbus master, Zookeeper, supervisors running workers); realtime data stream analytics results flow to storage and analytics backends such as S3, DB, Zabbix and Graylog]

REALTIME STREAM PROCESSING JS Client Features

-  Event system

-  Master/slave tabs

-  Local queuing of data

-  Ability to use Node modules

-  Easy to extend

-  Complete development suite

-  Delivers bundles with or without vendor libraries

Realtime Stream Processing - Loading the JS Client

<script .. src="https://cdn.tradimo.com/js/starlog-client.min.js?5193e1ba0325c756b78d87384d2f80e9"></script>

[Sequence diagram: browser ↔ NodeJS backend]

1. The browser requests https://../starlog-client.min.js

2. The backend creates a signed cookie and responds with Set-Cookie: UUID plus starlog-client.min.js

3. The client opens /socket.io/1/websockets with Upgrade: websocket and Cookie: UUID

4. The backend checks the cookie and answers HTTP 101 – Switching Protocols (Connection: Upgrade, Upgrade: websocket)

5. Over the established connection, collected data is sent in UMF to the backend queue; after some backend magic, results (e.g. counts) are pushed back to the client in UMF

Realtime Stream Processing - JS Client in action

UseCase: if the number of clicks on a domain % 10 == 0, send the "Star Trek Commander" badge

1. The ClickEvent collector registers an onclick event handler

2. Clicked-Data is written to localstorage

3. SocketConnect observes localstorage

4. Clicked-Data is sent to NodeJS as a Clicked-Data-UMF message

Realtime Stream Processing - JS Client in action

function ClickFetcher() {
    this.collectData = function (callback) {
        var clicked = 1;
        logger.debug('ClickFetcher - collectData called!');
        // Count every click and store it under a key derived from the current URL.
        window.onclick = function () {
            var collectedData = {
                key: window.location.host.toString() + window.location.pathname.toString(),
                value: {
                    payload: clicked,
                    timestamp: +new Date()
                }
            };
            localstorage.set(collectedData, function (storageResult) {
                logger.debug("err = " + storageResult.hasError());
                logger.debug("storageResult = " + storageResult);
            }, false, true, true);
            clicked++;
        };
    };
}

// Register the collector with the client's event system.
var clickFetcher = new ClickFetcher();
starlogclient.on(starlogclient.COLLECTINGDATA, clickFetcher.collectData);

Client Live Demo

https://localhost:3001/test/1-page-stub.html

REALTIME STREAM PROCESSING Producer Libraries

-  LoggingComponent: provides interfaces, filters and handlers

-  LoggingBundle: glues it all together for Symfony2

-  Drupal Logging Module: uses the LoggingComponent

-  JS Frontend Client: LogClient framework for browsers

https://github.com/ICANS/IcansLoggingComponent

https://github.com/ICANS/IcansLoggingBundle

https://github.com/ICANS/drupal-logging-module

https://github.com/DECK36/starlog-js-frontend-client

Realtime Stream Processing - PHP & Storm

UseCase: if the number of clicks on a domain % 10 == 0, send the "Star Trek Commander" badge

Using PHP for that (core logic sketched below)! https://github.com/Lazyshot/storm-php/blob/master/lib/storm.php

[Flow: Clicked-Data-UMF → queue → Storm topology → event: "Star Trek Commander" badge]
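The badge logic inside such a bolt boils down to a counter. A minimal sketch (class, method and event names are our own; reading tuples from and emitting to Storm is left to the storm-php library):

<?php
// Minimal sketch of the bolt's core logic: count clicks per domain and
// award a badge on every 10th click. Names are illustrative.
class BadgeCounter
{
    private $counts = array();

    // Returns a badge event for every 10th click on a domain, null otherwise.
    public function handleClick($domain)
    {
        if (!isset($this->counts[$domain])) {
            $this->counts[$domain] = 0;
        }
        $this->counts[$domain]++;

        if ($this->counts[$domain] % 10 === 0) {
            return array(
                'event'  => 'badge.awarded',
                'badge'  => 'Star Trek Commander',
                'domain' => $domain,
            );
        }

        return null;
    }
}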

Storm & PHP Live Demo

REALTIME STREAM PROCESSING Get Inspired!

Powered by Storm: https://github.com/nathanmarz/storm/wiki/Powered-By

-  50+ companies (Twitter, Yahoo, Groupon, Ooyala, Baidu, Wayfair, …)

-  Ads & real-time bidding, Data-centric (Economic, Environmental, Health), User interactions

Language-agnostic backend systems (Operate Storm, Develop in PHP)

Streaming "counts": Sentiment Analysis, Frequent Items, Multi-armed Bandits, …

DRPC: Custom user feeds, complex queries (e.g. tracing graph links)

Realtime, distributed ETL

-  Buffering / Retries

-  Integrate Data: Third-party API, Machine Learning

-  Store to DBs, search engines, etc.

Questions?

Thanks a lot!

You can find us:

github.com/DECK36

info@deck36.de

deck36.de