情報処理学会 Exciting Coding! Treasure Data


Page 1: 情報処理学会 Exciting Coding! Treasure Data

www.treasuredata.com

Treasure Data Exciting Coding! Nov 2013

Presented by

Masahiro Nakagawa Senior Software Engineer


Page 2: 情報処理学会 Exciting Coding! Treasure Data

•  Masahiro Nakagawa –  @repeatedly

–  [email protected] or d@

•  Treasure Data, Inc –  Senior Software Engineer

•  Fluentd / Client libraries / etc...

–  Since 2012/11

•  Open Source projects –  D Programming Language

–  MessagePack: D, Python, etc…

–  Fluentd: Core, Mongo, Logger, etc…

–  Etc…

Who are you


Page 3: 情報処理学会 Exciting Coding! Treasure Data

www.treasuredata.com

Board Meeting Presentation August 15th, 2013 - 3:30PM PDT

Presented by

Hironobu Yoshikawa – CEO
Kazuki Ohta – CTO
Rich Ghiossi – VP, Marketing
Keith Goldstein – VP, Sales
Kengo Hirouchi – Director, Japan
Ankush Rustagi – Director, Marketing

Company & Service Introduction


Page 4: 情報処理学会 Exciting Coding! Treasure Data

•  Founded 2011 in Mountain View, CA
–  The first cloud service for the entire data pipeline
–  Including: Acquisition, Storage, & Analysis

•  Provide a “Cloud Data Service”
–  Fast Time to Value
–  Cloud Flexibility and Economics
–  Simple and Well Supported

•  Treasure Data has 100+ customers in production
–  Incl. Fortune 500 companies
–  500+ Billion new records / month
–  Around 2 Trillion records under management
–  Variety of use cases and verticals

Company Background

The Treasure Data Team

Hiro Yoshikawa – CEO: Open source business veteran
Kaz Ohta – CTO: Founder of the world’s largest Hadoop group
Jeff Yuan – Director, Engineering: LinkedIn, MIT / Michael Stonebraker Lab
Keith Goldstein – VP Sales & Bus Dev: VP of Bus Dev from Tibco and Talend
Rich Ghiossi – VP Marketing: VP of Marketing from ParAccel

Notable Investors

Othman Laraki: Ex-VP of Growth at Twitter
Jerry Yang: Founder of Yahoo!
Yukihiro “Matz” Matsumoto: Creator of the “Ruby” programming language
James Lindenbaum: Founder of Heroku

Page 5: 情報処理学会 Exciting Coding! Treasure Data

•  Lots of companies today produce Big Data by having “New Data Sources” (Sensor, Weblog, etc)
–  But few have the resources to build a Big Data Analytics system

•  60-70% of a company’s Big Data time & budget consumed by:
–  Infrastructure setup & Maintenance
–  Building Collection & Storage Flows
–  Hiring/Training Hadoop Expertise

•  On average, it takes 6 months to get a Hadoop environment into production

Problem Statement


Page 6: 情報処理学会 Exciting Coding! Treasure Data

6

Page 7: 情報処理学会 Exciting Coding! Treasure Data


Treasure Data’s Focus

(80% of the needs)

Page 8: 情報処理学会 Exciting Coding! Treasure Data

8

Page 9: 情報処理学会 Exciting Coding! Treasure Data

Treasure Data Service: Overview

[Diagram: data flows from the sources through Treasure Data to BI tools and other destinations]

Data sources: Web logs, App logs, Sensor, CRM, ERP, RDBMS

Acquire
•  Treasure Agent: Streaming Log Collector (JSON)
•  Bulk Import: Parallel Upload from CSV, MySQL, etc.

Store
•  Treasure Data Cloud: Flexible, Scalable, Columnar Storage

Analyze
•  BI Connectivity (REST API, SQL, Pig, JDBC / ODBC): BI Tools (Tableau, Metric Insights, QlikView, Excel, etc.)
•  Result Push (REST API, SQL, Pig): Dashboards, Custom App, Local DB, FTP Server, etc.

Value: Time to Value, Economy & Flexibility, Simple & Supported

Page 10: 情報処理学会 Exciting Coding! Treasure Data

Our Value Propositions

•  Faster time to value
On-demand cloud infrastructure & versatile streaming data collection agent
–  Instantly provision a fully tuned & managed infrastructure
–  Go live into production on average in 14 days (collection, analytics, & BI)

•  Cloud flexibility and economics
Fraction of the cost of traditional solutions by leveraging cloud storage and processing, which scales to meet your needs
–  Leverage the cost-advantage of the cloud
–  Leverage the elasticity of the cloud – scale on demand
–  Predictable monthly subscription fee
–  No upfront costs & no long-term commitment

•  Simple and well supported
We are passionate about simplicity, and customer support excellence
–  Focus your time on analyzing your data
–  Rely on us to keep your data secure & online
–  We love making customers successful & building long-term relationships

Page 11: 情報処理学会 Exciting Coding! Treasure Data

Initial Setup & Onboarding – Two Weeks


1. Data Collection 2. Data Storage

3. Data Analysis 4. Service & Support

•  Setup, tuning, and monitoring of Treasure Agent

•  Embed Treasure Agent code into applications

•  Basic log templates (register, pay, login, etc.)

•  Basic KPI queries (DAU, MAU, ARPU, etc.)

•  Setup dashboards with basic KPIs

•  Training on creating customized reports and ad-hoc querying

•  Assigned a dedicated technical account manager

•  Real-time support via email, online chat, and call

Page 12: 情報処理学会 Exciting Coding! Treasure Data


Solutions Accelerators

Treasure Data Platform

Out-of-the Box Reporting

Configured Treasure Agent

Solution Components:
-  Treasure Data Platform
-  Event Collection Template
-  Pre-configured Treasure Agent Configuration
-  BI Dashboard with KPIs

Page 13: 情報処理学会 Exciting Coding! Treasure Data


- Vision - Single Analytics Platform for the World

Page 14: 情報処理学会 Exciting Coding! Treasure Data

Treasure Data Platform Architecture Overview

Page 15: 情報処理学会 Exciting Coding! Treasure Data

Treasure Data Cloud

Data Acquisition – Streaming Capture

# Application Code
...
...
# Post event to Treasure Data
TD.event.post('access', {:uid=>123})
...
...

Treasure Data Library: Java, Ruby, PHP, Perl, Python, Scala, Node.js

Application Server

Treasure Agent (local)

•  Automatic Micro-batching

•  Local buffering Fall-back

•  Network Tolerance

Open-Sourced as Fluentd Project ( http://fluentd.org/ )
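The three agent behaviours listed on this slide (automatic micro-batching, local buffering as a fall-back, and network tolerance) can be sketched in a few lines of Ruby. This is only an illustration under stated assumptions: `MicroBatchBuffer`, `batch_size`, and the `sender` callable are hypothetical names and are not the Treasure Agent or Fluentd API.

```ruby
# Illustrative sketch only: MicroBatchBuffer is a hypothetical class,
# not part of Treasure Agent or Fluentd.
class MicroBatchBuffer
  attr_reader :pending

  # sender: any callable that takes an array of events and raises on failure.
  def initialize(batch_size:, sender:)
    @batch_size = batch_size
    @sender = sender
    @pending = []   # events kept locally until delivery succeeds
  end

  # Queue one event; flush automatically once a full batch has accumulated.
  def post(tag, record)
    @pending << { tag: tag, time: Time.now.to_i, record: record }
    flush if @pending.size >= @batch_size
  end

  # Try to deliver everything that is pending as one batch.
  # On failure the events stay in the local buffer and are retried later.
  def flush
    return true if @pending.empty?
    @sender.call(@pending.dup)
    @pending.clear
    true
  rescue StandardError
    false
  end
end

# Usage: a fake sender that fails first to simulate a network outage.
sent_batches = []
network_up = false
sender = lambda do |batch|
  raise IOError, "network down" unless network_up
  sent_batches << batch
end

buffer = MicroBatchBuffer.new(batch_size: 2, sender: sender)
buffer.post("access", { uid: 123 })
buffer.post("access", { uid: 456 })   # batch is full, flush fails, events are kept
network_up = true
buffer.post("access", { uid: 789 })   # the next flush delivers all three events
```

The application only calls `post`; batching and retry stay inside the buffer, which is the point of running a local agent next to the application.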

Page 16: 情報処理学会 Exciting Coding! Treasure Data

Data Acquisition – Bulk Loader


Treasure Data Cloud

RDBMS App SaaS

FTP

CSV, TSV, JSON, MessagePack, Apache, regex, MySQL, FTP

Bulk Loader

Prepare Upload Perform Commit

Page 17: 情報処理学会 Exciting Coding! Treasure Data

Data Storage

Treasure Data Cloud

•  Stored “schema-less” as JSON
–  Schema can be applied/updated AFTER storage

•  Compressed & columnar format
–  For higher query performance

•  Optimized for time-based filtering

•  Quickly scale-up processing power
–  WITHOUT reloading/redistributing the data

Default (schema-less)

time        v
1384160400  {"ip":"135.52.211.23", "code":"0"}
1384162200  {"ip":"45.25.38.156", "code":"-1"}
1384164000  {"ip":"97.12.76.55", "code":"99"}

SELECT v['ip'] as ip, v['code'] as code …

Schema applied

time        ip : string    code : int
1384160400  135.52.211.23  0
1384162200  45.25.38.156   -1
1384164000  97.12.76.55    99

SELECT ip, code …

~30% Faster
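To make "schema applied AFTER storage" concrete, the sketch below keeps the three rows from this slide as a timestamp plus a JSON value and projects them onto typed columns only at read time. It is an illustration, not Treasure Data's storage engine: `cast`, `apply_schema`, and the `schema` hash are hypothetical names.

```ruby
require "json"

# Rows as stored schema-less: a timestamp plus a JSON value column "v".
rows = [
  { time: 1384160400, v: '{"ip":"135.52.211.23","code":"0"}' },
  { time: 1384162200, v: '{"ip":"45.25.38.156","code":"-1"}' },
  { time: 1384164000, v: '{"ip":"97.12.76.55","code":"99"}' }
]

# Hypothetical schema, declared after the data was stored.
schema = { "ip" => :string, "code" => :int }

# Cast one raw value to the declared column type.
def cast(value, type)
  case type
  when :int    then Integer(value)
  when :string then value.to_s
  else raise ArgumentError, "unknown type: #{type}"
  end
end

# Project a stored row onto the schema: typed columns instead of v['...'].
def apply_schema(row, schema)
  parsed = JSON.parse(row[:v])
  typed = schema.map { |column, type| [column, cast(parsed[column], type)] }.to_h
  { "time" => row[:time] }.merge(typed)
end

typed_rows = rows.map { |row| apply_schema(row, schema) }
typed_rows.each { |row| puts row.inspect }
```

Because the stored rows are never rewritten, changing the `schema` hash and reading again is enough to pick up a new column type.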

Page 18: 情報処理学会 Exciting Coding! Treasure Data

Data Analysis

Treasure Data Cloud

Scripted Processing (Pig):
-  DataFu (LinkedIn)
-  Piggybank (Apache)

Heavy Lifting SQL (Hive):
-  Hive’s Built-in UDFs
-  TD Added Functions: Time Functions; First, Last, Rank; Sessionize

JDBC Connectivity:
-  Custom Java Apps
-  Standards-based
-  BI Tool Integration

Tableau ODBC connector:
-  Leverages Impala

Push Query Results:
-  MySQL, PostgreSQL
-  Google Spreadsheet
-  Web, FTP, S3
-  Leftronic, Indicee
-  Treasure Data Table

Interactive SQL:
-  Treasure Query Accelerator (Impala)

Scheduled Jobs:
-  SQL, Pig Scripts
-  Data Pushes

REST API
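The slide lists "Sessionize" among the functions TD adds to Hive without describing it. As a rough illustration of what sessionizing usually means, the Ruby sketch below starts a new session for a user whenever the gap since that user's previous event exceeds a timeout. This is not Treasure Data's implementation: the `sessionize` helper, the session id format, and the 30-minute default are all assumptions.

```ruby
# Illustrative sketch of sessionization; not Treasure Data's Hive function.
# events:  array of hashes with :uid and :time (Unix seconds).
# timeout: gap in seconds after which a new session starts (assumed value).
def sessionize(events, timeout: 1800)
  last_seen = {}             # uid => time of that user's previous event
  session_no = Hash.new(0)   # uid => running session counter

  events.sort_by { |e| e[:time] }.map do |event|
    uid = event[:uid]
    previous = last_seen[uid]
    # Start a new session on the first event or after a long gap.
    session_no[uid] += 1 if previous.nil? || event[:time] - previous > timeout
    last_seen[uid] = event[:time]
    event.merge(session: "#{uid}-#{session_no[uid]}")
  end
end

events = [
  { uid: 123, time: 1384160400 },
  { uid: 123, time: 1384160700 },   # 5 minutes later: same session
  { uid: 123, time: 1384164000 },   # 55 minutes later: new session
  { uid: 456, time: 1384160500 }
]

sessionize(events).each { |e| puts "#{e[:time]} #{e[:session]}" }
```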

Page 19: 情報処理学会 Exciting Coding! Treasure Data

Treasure Data General Use Cases

Page 20: 情報処理学会 Exciting Coding! Treasure Data


A case: “14 Days” from Signup to Success

1.  Europe’s largest mobile ad exchange.

2.  Serving >60 billion imps/month for >30,000 mobile apps (Q4 2013)

3.  Immediate need of analytics infrastructure: ASAP!

4.  With TD, MobFox got into production in only 14 days, with one engineer.

"Time is the most precious asset in our fast-moving business, and Treasure Data saved us a lot of it." Julian Zehetmayr, CEO & Founder

Page 21: 情報処理学会 Exciting Coding! Treasure Data

A case: “Replace” in-house Hadoop with TD

1.  Global “Hulu” - Online Video Service with millions of users

2.  Video content is distributed in over 150 languages.

3.  Had a hard time maintaining its Hadoop cluster

4.  With TD, Viki deprecated its in-house Hadoop cluster and now uses its engineers for the core business.

Before

After

"Treasure Data has always given us thorough and timely support peppered with insightful tips to make the best use of their service." Huy Nguyen, Software Engineer

Page 22: 情報処理学会 Exciting Coding! Treasure Data

A case: Treasure Data with BI Tool (Tableau)

1.  World’s largest Android application market

2.  Serving >3 billion app downloads for >100 million users

3.  Only one engineer managing the data infrastructure

4.  With TD, the data engineer can focus on analyzing data with the existing BI tool

"I will recommend Treasure Data to my friends in a heartbeat because it benefits all three stakeholders: Operations, Engineering and Business." Simon Dong, Principal Architect - Data Engineering

Page 23: 情報処理学会 Exciting Coding! Treasure Data

Treasure Data Platform Fluentd Overview

Page 24: 情報処理学会 Exciting Coding! Treasure Data

•  Open-source log collector written in Ruby
–  Easy to use, reliable, and performant
–  Streaming event processing

•  Uses the RubyGems ecosystem to distribute plugins

What is Fluentd?

Fluentd: the missing log collector

fluentd.org

Page 25: 情報処理学会 Exciting Coding! Treasure Data

Data processing pipeline


Collect Store Process Visualize

Data source

Reporting Monitoring

Page 26: 情報処理学会 Exciting Coding! Treasure Data

Data processing pipeline


Collect Store Process Visualize

Data source

Reporting Monitoring

Important, but no de facto middleware!

Page 27: 情報処理学会 Exciting Coding! Treasure Data

Fluentd general example

[Diagram: Web Server → tail → Fluentd (event buffering) → insert]

Web Server access log (input):

127.0.0.1 - - [11/Dec/2012:07:26:27] "GET / ...
127.0.0.1 - - [11/Dec/2012:07:26:30] "GET / ...
127.0.0.1 - - [11/Dec/2012:07:26:32] "GET / ...
127.0.0.1 - - [11/Dec/2012:07:26:40] "GET / ...
127.0.0.1 - - [11/Dec/2012:07:27:01] "GET / ...
...

Fluentd event (time, tag, record):

2012-02-04 01:33:51
apache.log
{ "host": "127.0.0.1", "method": "GET", ...}
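The step from a raw access-log line to a structured event with a time, a tag, and a record can be sketched as follows. This is a simplified illustration, not the parser that ships with Fluentd: `parse_access_line` and its regular expression are hypothetical, and because the slide truncates its log lines, the complete sample line (including the `+0000` offset, status, and size) is made up.

```ruby
require "time"

# Hypothetical helper: turn one access-log line into a (time, tag, record)
# event. The pattern is a simplified common-log-format regex, not the
# parser that ships with Fluentd.
def parse_access_line(line, tag)
  pattern = /\A(?<host>\S+) \S+ \S+ \[(?<time>[^\]]+)\] "(?<method>\S+) (?<path>\S+)[^"]*" (?<code>\d{3}) (?<size>\d+|-)\z/
  match = pattern.match(line.chomp)
  return nil unless match   # this sketch simply skips unparseable lines

  time = Time.strptime(match[:time], "%d/%b/%Y:%H:%M:%S %z")
  {
    time: time.to_i,
    tag: tag,
    record: {
      "host" => match[:host],
      "method" => match[:method],
      "path" => match[:path],
      "code" => match[:code].to_i
    }
  }
end

# The slide truncates its log lines, so this complete line is made up.
line = '127.0.0.1 - - [11/Dec/2012:07:26:27 +0000] "GET / HTTP/1.1" 200 777'
event = parse_access_line(line, "apache.log")
puts event.inspect
```

Once every line has this shape, buffering and routing can work on the tag and time alone, without knowing anything about the original log format.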

Page 28: 情報処理学会 Exciting Coding! Treasure Data

Pluggable Architecture

[Diagram: Engine with pluggable Input, Buffer, and Output components]

Input
> Forward
> HTTP
> File tail
> dstat
> ...

Buffer
> File
> Memory

Output
> Forward
> File
> MongoDB
> ...

Output
> rewrite
> ...
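A pluggable input / buffer / output design like the one on this slide can be illustrated with a small registry. The sketch below is written under loud assumptions: `Plugins`, `ArrayInput`, `MemoryBuffer`, `CollectOutput`, and `run_engine` are invented for this example and do not mirror Fluentd's real plugin API.

```ruby
# Illustrative sketch of a plugin registry; all names are hypothetical
# and do not mirror Fluentd's real plugin API.
module Plugins
  REGISTRY = { input: {}, buffer: {}, output: {} }

  def self.register(kind, name, klass)
    REGISTRY.fetch(kind)[name] = klass
  end

  def self.create(kind, name, *args)
    REGISTRY.fetch(kind).fetch(name).new(*args)
  end
end

# Input plugin: yields events from an in-memory list.
class ArrayInput
  def initialize(events)
    @events = events
  end

  def each_event(&block)
    @events.each(&block)
  end
end

# Buffer plugin: keeps events in memory until the engine drains it.
class MemoryBuffer
  def initialize
    @chunk = []
  end

  def push(event)
    @chunk << event
  end

  def drain
    chunk = @chunk
    @chunk = []
    chunk
  end
end

# Output plugin: collects written events so the result can be inspected.
class CollectOutput
  attr_reader :written

  def initialize
    @written = []
  end

  def write(chunk)
    @written.concat(chunk)
  end
end

Plugins.register(:input, "array", ArrayInput)
Plugins.register(:buffer, "memory", MemoryBuffer)
Plugins.register(:output, "collect", CollectOutput)

# The engine only knows the three plugin roles, not the concrete classes.
def run_engine(input, buffer, output)
  input.each_event { |event| buffer.push(event) }
  output.write(buffer.drain)
end

input  = Plugins.create(:input, "array", [{ "host" => "127.0.0.1", "method" => "GET" }])
buffer = Plugins.create(:buffer, "memory")
output = Plugins.create(:output, "collect")
run_engine(input, buffer, output)
```

Swapping the memory buffer for a file-backed one, or the collecting output for a database writer, would only require registering another class under the same role.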

Page 29: 情報処理学会 Exciting Coding! Treasure Data

Resolve your requirement by writing plugin

[Diagram: log sources → filter / buffer / routing → destinations]

Sources
•  Access logs: Apache
•  App logs: Frontend, Backend
•  System logs: syslogd
•  Databases

Destinations
•  Alerting: Nagios
•  Analysis: MongoDB, MySQL, Hadoop
•  Archiving: Amazon S3

Page 30: 情報処理学会 Exciting Coding! Treasure Data

•  Open-source distribution package of Fluentd
–  ETL part of Treasure Data
–  deb / rpm / homebrew

•  Including useful components
–  Ruby, jemalloc, fluentd
–  3rd party gems: td, mongo, webhdfs, etc…
–  Init script

Treasure Agent (td-agent)

http://packages.treasure-data.com/

Page 31: 情報処理学会 Exciting Coding! Treasure Data

Fluentd users


Page 32: 情報処理学会 Exciting Coding! Treasure Data

Treasure Data Platform Backend Overview

Page 33: 情報処理学会 Exciting Coding! Treasure Data

•  RDS
–  Store user information, job, status, etc…
–  Queue Worker / Scheduler

•  EC2
–  API Server, Hadoop Cluster, Job Worker / Scheduler

•  S3
–  Columnar storage
   •  Realtime / Archive storage
   •  MessagePack columnar

•  ELB

AWS components

Page 34: 情報処理学会 Exciting Coding! Treasure Data

Plazma (Hadoop, Storage, Queue and Workers)

[Diagram: Frontend, Queue, Worker, and Hadoop, with metrics flowing through Fluentd]

•  Applications push metrics to Fluentd (via local Fluentd)
•  Fluentd sums up data per minute (partial aggregation)
•  Librato Metrics for realtime analysis
•  Treasure Data for historical analysis
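The partial aggregation step, where metrics are summed into minute buckets before being forwarded, might look roughly like this. It is a sketch under assumptions: `aggregate_by_minute`, the sample layout, and the `requests` metric name are hypothetical and are not taken from the Treasure Data backend.

```ruby
# Illustrative sketch of per-minute partial aggregation; the helper and
# the metric name are hypothetical.
# samples: array of hashes with :time (Unix seconds), :name and :value.
def aggregate_by_minute(samples)
  totals = Hash.new(0)
  samples.each do |sample|
    minute = sample[:time] - (sample[:time] % 60)   # truncate to the minute
    totals[[minute, sample[:name]]] += sample[:value]
  end
  totals.map { |(minute, name), value| { time: minute, name: name, value: value } }
end

samples = [
  { time: 1384160400, name: "requests", value: 3 },
  { time: 1384160430, name: "requests", value: 2 },
  { time: 1384160465, name: "requests", value: 7 }
]

aggregated = aggregate_by_minute(samples)
aggregated.each { |row| puts "#{row[:time]} #{row[:name]} #{row[:value]}" }
```

Three samples collapse into two rows here, so the downstream services receive one record per metric per minute instead of one per event.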

Page 35: 情報処理学会 Exciting Coding! Treasure Data

Treasure Data Development Philosophy

Page 36: 情報処理学会 Exciting Coding! Treasure Data

•  TD prefers engineers who contribute to OSS products
–  MessagePack, Fluentd, ZeroMQ, Hadoop, MongoDB, Angular.js, Huahin, D-Lang, etc.
–  https://github.com/treasure-data?tab=members

•  Reasons
–  Fixing & improving other people’s code is crucial for a distributed team.
–  TD’s engineering workflow is very similar to an OSS product workflow.
–  A+ OSS engineers will bring another A+ OSS engineer!

Open-Source Culture

Page 37: 情報処理学会 Exciting Coding! Treasure Data

•  OSS Everything on the Client Side
–  http://github.com/treasure-data/
–  http://fluentd.org/

•  TD is helping the world to collect more data in an analytics-ready format

•  2000+ companies (e.g. Nintendo, SlideShare/LinkedIn) use it as an OSS product. 3-4% of those users are TD customers.

•  We also leverage other OSS products as much as possible.

•  Closed Source on the Cloud Side
–  The core value must be proprietary to sustain the business.
–  The components can be OSS, but most of the system will remain proprietary to create the value chain.

OSS vs. Proprietary

Page 38: 情報処理学会 Exciting Coding! Treasure Data

•  Solving the Customer Pain is the #1 Priority
–  Developers support customers directly and spend 30%-40% of their development time talking with customers
–  Developers are the BEST people to come up with the solution.
–  # of code lines != value

•  Suffering Oriented Development
–  First, make it possible
–  Then, make it beautiful
–  Then, make it fast

•  The Largest Customer Pain is NOT always applicable to other customers.
–  Need to be brave to say NO. NO. NO. NO. NO….

•  TD doesn’t have a 1-year Product Roadmap. Having a 3-month roadmap accelerates development, and helps other teams (marketing / sales), too.

How to decide Product Roadmap?

Page 39: 情報処理学会 Exciting Coding! Treasure Data

•  13 Engineers as of Nov. 2013
–  5 Engineers in Tokyo, Japan
–  8 Engineers in Mountain View, USA
–  40% of the whole company

•  Asynchronous Communication
–  Use async communication tools as much as possible: Chat, JIRA, Email, GitHub, etc.
–  Use video conferencing for weekly sync-up

•  English is the primary communication language
–  If you cannot speak English, your value is nearly zero on the Treasure Data engineering team.

Distributed Team (International)

Page 40: 情報処理学会 Exciting Coding! Treasure Data

•  Predictable Deployment Cycle
–  Weekly Deployment
   •  Continuous Deployment didn’t fit our B2B SaaS application; our customers want changes to be predictable.
   •  As a distributed team, it’s hard to track every change and its deployment status.
–  Track every change in JIRA; the QA engineer is also responsible for the deployment.

•  Continuous Deployment for Staging
–  Single branch, always automatically deployed to the staging environment
–  Monitoring is continuous testing

•  On-Call Alert Schedule, based on the Timezone
–  No need to get up around 3am

Distributed Team (Deployment)

Page 41: 情報処理学会 Exciting Coding! Treasure Data

•  Use Cloud Services as Much as Possible
–  Don’t hire people, use cloud services.
–  Outsource everything, except your core value.
–  Developers tend to forget their own cost. If you spend 1 hour, it already costs the company around $50.

•  Examples
–  EC2 (IaaS)
–  CopperEgg (Infrastructure Monitoring)
–  New Relic (Application Performance Management)
–  Hosted Chef (Configuration Management)
–  Librato Metrics (Application Metrics)
–  PagerDuty (Alerting)
–  Logentries (Log Search)
–  CircleCI, Travis CI (Continuous Integration)
–  HipChat, JIRA, Confluence (Development Tool)
–  Etc.

Leverage Cloud Services

Page 42: 情報処理学会 Exciting Coding! Treasure Data

Treasure Data Conclusion

Page 43: 情報処理学会 Exciting Coding! Treasure Data

•  Treasure Data, Inc
–  Cloud based Data Service for the world
–  Customer oriented development

•  Our Unique Products and Culture
–  Fluentd / Plazma (backend)
–  OSS enthusiast

•  Use Cloud or not?
–  The cloud gives an idea leverage, but it is not a differentiator
–  Focus on your own vision!

Key points