Scaling to Infinity - Open Source meets Big Data


Transcript of Scaling to Infinity - Open Source meets Big Data

Page 1: Scaling to Infinity - Open Source meets Big Data

Dealing with Unstructured Data

Scaling to Infinity

Image: Boykung/Shutterstock

Page 2: Scaling to Infinity - Open Source meets Big Data

Image: John Hammink

Page 3: Scaling to Infinity - Open Source meets Big Data
Page 4: Scaling to Infinity - Open Source meets Big Data

There are many sources of information

Page 5: Scaling to Infinity - Open Source meets Big Data

Copyright ©2014 Treasure Data. All Rights Reserved.

Big Data Simplified: One Approach

Collection: App Servers, Mobile SDKs, Web SDK, Embedded SDKs, and Server-side Agents send Multi-structured Events

• register • login • start_event • purchase • etc.

Data collected

✓ App log data ✓ Mobile event data ✓ Sensor data ✓ Telemetry

Storage: Infinite & Economical Cloud Data Store

Query: SQL (Familiar & Table-oriented)

Results Push

• SQL-based Ad-hoc Queries

• SQL-based Dashboards

• DBs & Data Marts

• Other Apps
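
The "multi-structured events" in the diagram above can be pictured as JSON-like records whose fields differ by event type, which a table-oriented store then exposes with one column per observed field. A minimal Python sketch of that idea follows. Only the event names come from the slide; the field names (`user`, `plan`, `device`, `item`, `price`) and the helper `to_table` are hypothetical.

```python
import json

# Hypothetical events; only the event names come from the slide.
raw_lines = [
    '{"event": "register", "user": "u1", "plan": "free"}',
    '{"event": "login", "user": "u1", "device": "ios"}',
    '{"event": "purchase", "user": "u1", "item": "sword", "price": 4.99}',
]

def to_table(lines):
    """Parse JSON lines and return (columns, rows), using None for missing fields."""
    records = [json.loads(line) for line in lines]
    columns = []
    for record in records:
        for key in record:
            if key not in columns:
                columns.append(key)  # keep first-seen column order
    rows = [[record.get(col) for col in columns] for record in records]
    return columns, rows

columns, rows = to_table(raw_lines)
print(columns)
for row in rows:
    print(row)
```

Each event keeps only the fields it needs at collection time, and the table view fills the gaps with nulls so the data can still be queried with SQL.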

Page 6: Scaling to Infinity - Open Source meets Big Data


What is the point of all this data?

BI: Business Intelligence Using Very Large Sets of Data

Page 7: Scaling to Infinity - Open Source meets Big Data
Page 8: Scaling to Infinity - Open Source meets Big Data

Copyright ©2015 Treasure Data. All Rights Reserved.

Data Records Stored in the Treasure Data Cloud Service

[Chart: records stored, 0 to 14 trillion, Aug-12 to Dec-14 (last 2 years)]

Milestones marked on the chart:

• Service Launched

• Series A Funding

• 100 Customers

• Selected by Gartner as Cool Vendor in Big Data

• 5 Trillion Records

• 10 Trillion Records

• 13 Trillion Records

• Series B Funding

Treasure Data By the Numbers (Jan-2015):

• 13T+ records of data imported since launch

• 500K+ records imported each second

• 1.5 Trillion+ records imported each month

• 12B records sent per day by one customer
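
As a rough consistency check on the figures above, a sustained per-second rate can be converted to a monthly total and back. A small sketch, assuming a 30-day month (an assumption, not stated on the slide); the function names are hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400
DAYS_PER_MONTH = 30              # assumption: a 30-day month

def monthly_total(per_second):
    """Records accumulated in 30 days at a sustained per-second rate."""
    return per_second * SECONDS_PER_DAY * DAYS_PER_MONTH

def per_second_rate(per_month):
    """Average whole records per second implied by a 30-day total."""
    return per_month // (SECONDS_PER_DAY * DAYS_PER_MONTH)

at_500k = monthly_total(500_000)                   # the "500K+ per second" figure
implied = per_second_rate(1_500_000_000_000)       # the "1.5 Trillion+ per month" figure
one_customer = 12_000_000_000 // SECONDS_PER_DAY   # the "12B per day" customer

print(at_500k, implied, one_customer)
```

At exactly 500K records per second the total is about 1.3 trillion records in 30 days, so "1.5 Trillion+ each month" corresponds to an average of roughly 580K records per second, in line with the "500K+" label. The single customer sending 12B records per day averages about 139K records per second.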

Page 9: Scaling to Infinity - Open Source meets Big Data

Statistics

• Total Records Stored: 25 Trillion

• Managed & Supported: 24 * 7 * 365

• Uptime: 99.99%

• New Records / second: 1 Million

• Daily Twitter volume: 100x

Page 10: Scaling to Infinity - Open Source meets Big Data

A solution?

• There are trade-offs to consider

• Any trade-off should make it easy to collect data

• Easy does it! Un- and semi-structured data (multi-structured data)

• Open source means it’s free; it also means that you need someone on hand to implement and maintain it

• Cloud storage means you don’t have to scale and/or shard; the trade-off is a performance hit compared with bare metal

Image: John Hammink

Page 11: Scaling to Infinity - Open Source meets Big Data

Image: Dreamstime

Page 12: Scaling to Infinity - Open Source meets Big Data

Images: Lightspring/Shutterstock, John Hammink, Treasure Data

There are a few introductory Data Science posts at blog.treasuredata.com!

Page 13: Scaling to Infinity - Open Source meets Big Data

What does a pipeline need?

Page 14: Scaling to Infinity - Open Source meets Big Data

Open vs. Closed source

Image: Heather Craig/Shutterstock

Page 15: Scaling to Infinity - Open Source meets Big Data

Images: PC World, Data-Hive, Wallpapersmela

... or ... or ... ?

Page 16: Scaling to Infinity - Open Source meets Big Data

LAMBDA ARCHITECTURE
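
As a reminder of the pattern named on this slide: the Lambda Architecture keeps an append-only master dataset, periodically recomputes batch views from it, maintains a real-time view over events the last batch run has not yet covered, and answers queries by merging the two. A toy Python sketch under those assumptions follows. The class name `LambdaCounter`, the event names, and the in-memory list standing in for real storage are all hypothetical.

```python
from collections import Counter

class LambdaCounter:
    """Toy event counter with a batch layer, a speed layer, and a merged query."""

    def __init__(self):
        self.master = []              # append-only master dataset
        self.batch_view = Counter()   # recomputed from scratch by run_batch()
        self.speed_view = Counter()   # incremental counts since the last batch run

    def ingest(self, event_name):
        self.master.append(event_name)
        self.speed_view[event_name] += 1   # speed layer updates immediately

    def run_batch(self):
        # Batch layer: recompute the view from all data, then reset the speed layer.
        self.batch_view = Counter(self.master)
        self.speed_view = Counter()

    def query(self, event_name):
        # Serving layer: merge the batch view with the real-time view.
        return self.batch_view[event_name] + self.speed_view[event_name]

lc = LambdaCounter()
for name in ["login", "login", "purchase"]:
    lc.ingest(name)
before = lc.query("login")   # answered from the speed view only
lc.run_batch()
lc.ingest("login")
after = lc.query("login")    # batch view plus one new real-time event
print(before, after)
```

The batch layer can be slow and simple because the speed layer covers the gap between runs; the cost is maintaining two code paths that must agree.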

Page 17: Scaling to Infinity - Open Source meets Big Data

# logs from a file
<source>
  type tail
  path /var/log/httpd.log
  format apache2
  tag web.access
</source>

# logs from client libraries
<source>
  type forward
  port 24224
</source>

# store logs to ES and HDFS
<match *.*>
  type copy
  <store>
    type elasticsearch
    logstash_format
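
To illustrate how the `<match *.*>` pattern and `type copy` in the configuration above route events, here is a toy Python model. It is not Fluentd's implementation; it only assumes the documented rule that `*` matches exactly one dot-separated tag part. The function names and the two in-memory lists standing in for the Elasticsearch and HDFS outputs are hypothetical.

```python
def tag_matches(pattern, tag):
    """Return True if every dot-separated part matches; '*' is a one-part wildcard."""
    pattern_parts = pattern.split(".")
    tag_parts = tag.split(".")
    if len(pattern_parts) != len(tag_parts):
        return False
    return all(p == "*" or p == t for p, t in zip(pattern_parts, tag_parts))

def copy_to_stores(pattern, stores, tag, record):
    """Append the event to every store when the tag matches (a 'copy' fan-out)."""
    if not tag_matches(pattern, tag):
        return False
    for store in stores:
        store.append((tag, record))
    return True

es_store, hdfs_store = [], []   # hypothetical in-memory stand-ins for the two outputs
routed = copy_to_stores("*.*", [es_store, hdfs_store], "web.access", {"path": "/index.html"})
skipped = copy_to_stores("*.*", [es_store, hdfs_store], "web", {"path": "/"})
print(routed, skipped, len(es_store), len(hdfs_store))
```

In this model the tag `web.access` has two parts and is copied to both stores, while the one-part tag `web` does not match `*.*` and is left alone.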

Page 18: Scaling to Infinity - Open Source meets Big Data

LESS SIMPLE FORWARDING

Page 19: Scaling to Infinity - Open Source meets Big Data

Before fluentd

Page 20: Scaling to Infinity - Open Source meets Big Data

Multi-structured data

• Unstructured data is better for data ultimately used in statistics

Page 21: Scaling to Infinity - Open Source meets Big Data

fluentd!

http://www.fluentd.org/

Page 22: Scaling to Infinity - Open Source meets Big Data

http://msgpack.org/
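
MessagePack is a binary serialization format that is typically more compact than JSON for the same data. The sketch below hand-encodes a tiny subset of the format (nil, booleans, integers 0 to 127, short strings, and small maps) purely for illustration. It is not the `msgpack` library, and the function name `pack_small` is hypothetical.

```python
import json

def pack_small(value):
    """Encode a small MessagePack subset: nil, bool, 0..127 ints, short str, small dict."""
    if value is None:
        return b"\xc0"                                # nil
    if value is True:
        return b"\xc3"                                # true
    if value is False:
        return b"\xc2"                                # false
    if isinstance(value, int) and 0 <= value <= 0x7F:
        return bytes([value])                         # positive fixint
    if isinstance(value, str):
        data = value.encode("utf-8")
        if len(data) <= 31:
            return bytes([0xA0 | len(data)]) + data   # fixstr: length in the low 5 bits
    if isinstance(value, dict) and len(value) <= 15:
        out = bytes([0x80 | len(value)])              # fixmap: entry count in the low 4 bits
        for key, item in value.items():
            out += pack_small(key) + pack_small(item)
        return out
    raise ValueError(f"outside the illustrated subset: {value!r}")

record = {"compact": True, "schema": 0}
packed = pack_small(record)
as_json = json.dumps(record, separators=(",", ":")).encode("utf-8")
print(packed.hex(), len(packed), len(as_json))
```

For this record the binary encoding is 18 bytes against 27 bytes of compact JSON, because short strings, small integers, and booleans each need at most one byte of type overhead.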

Page 23: Scaling to Infinity - Open Source meets Big Data

Embulk: an open-source bulk data loader that helps data transfer between various databases, storages, file formats, and cloud services

embulk.org/docs
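
The idea of a bulk loader can be sketched with the Python standard library alone: read records from one format and write them to another store in batches. This is not Embulk's API or configuration. The function name `bulk_load`, the table name `events`, the column names, and the batch size are assumptions for illustration.

```python
import csv
import io
import sqlite3

def bulk_load(csv_text, conn, batch_size=2):
    """Load CSV rows (columns: event, user_id) into an SQLite table in batches."""
    conn.execute("CREATE TABLE IF NOT EXISTS events (event TEXT, user_id TEXT)")
    reader = csv.DictReader(io.StringIO(csv_text))
    batch, loaded = [], 0
    for row in reader:
        batch.append((row["event"], row["user_id"]))
        if len(batch) >= batch_size:
            conn.executemany("INSERT INTO events VALUES (?, ?)", batch)
            loaded += len(batch)
            batch = []
    if batch:  # flush the final partial batch
        conn.executemany("INSERT INTO events VALUES (?, ?)", batch)
        loaded += len(batch)
    conn.commit()
    return loaded

csv_text = "event,user_id\nregister,u1\nlogin,u1\npurchase,u2\n"
conn = sqlite3.connect(":memory:")
loaded = bulk_load(csv_text, conn)
count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
print(loaded, count)
```

Batching keeps memory use bounded and amortizes per-statement overhead, which is the main reason to use a bulk loader rather than inserting one record at a time.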

Page 24: Scaling to Infinity - Open Source meets Big Data
Page 25: Scaling to Infinity - Open Source meets Big Data
Page 26: Scaling to Infinity - Open Source meets Big Data

Hivemall

Hivemall is a scalable machine learning library that runs on Apache Hive.

Hivemall is designed to be scalable to the number of training instances as well as the number of training features.

• Classification

• Regression

• Recommendation

• k-nearest neighbor

• Anomaly Detection

• Feature Engineering

https://github.com/myui/hivemall

Page 27: Scaling to Infinity - Open Source meets Big Data

The Hadoop Story on MongoDB

Image courtesy of Steven Francia @ Docker

Page 28: Scaling to Infinity - Open Source meets Big Data

Questions?