Big Data Rampage

Big Data Rampage ! NIKO VUOKKO 13 MAY 2013, HIIT SEMINAR

Slides for the 45-minute talk (plus Q&A) I gave at the HIIT seminar on 13 May 2013 for local data science researchers.

Transcript of Big Data Rampage

Page 1: Big Data Rampage

Big Data Rampage!

Niko Vuokko, 13 May 2013, HIIT Seminar

Page 2: Big Data Rampage

The data

Page 3: Big Data Rampage

About that data of yours…

• Researchers generally live in a nice utopia where data just works *

* Yes, you do munge it for days, that’s nice

Reality check

Page 4: Big Data Rampage

What if you suddenly notice that there’s

• … corrupted JSON/XML/whatever

• … corrupted ids

• … transient ids

• … 5 different transient ids

• … text in number fields

• … new fields

• … disappeared fields

• … fields whose meaning just changed

• … but you have no idea of the new definition

• … all of these, regularly, without advance notice

• … and the bad data is coming at you at 1 GB per hour

• … and yours or someone else’s business depends on the data
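One way to cope with the problems listed above is to validate every incoming record and quarantine the bad ones instead of letting them reach the analysis. The sketch below assumes newline-delimited JSON; the schema in EXPECTED_FIELDS and all function names are hypothetical, chosen only for illustration.

```python
import json

# Hypothetical schema for illustration: field name -> required Python type.
EXPECTED_FIELDS = {"user_id": str, "event": str, "value": float}


def validate_line(line, expected=EXPECTED_FIELDS):
    """Return (record, problems); record is None if the line cannot be parsed."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return None, ["corrupted JSON"]
    if not isinstance(record, dict):
        return None, ["not a JSON object"]

    problems = []
    for name, wanted in expected.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif wanted is float:
            # Accept ints and floats, but not bools or text in number fields.
            value = record[name]
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                problems.append(f"non-numeric value in: {name}")
        elif not isinstance(record[name], wanted):
            problems.append(f"wrong type in: {name}")
    for name in record:
        if name not in expected:
            problems.append(f"new field: {name}")
    return record, problems


def split_stream(lines):
    """Separate clean records from quarantined (line, problems) pairs."""
    clean, quarantined = [], []
    for line in lines:
        record, problems = validate_line(line)
        if problems:
            quarantined.append((line, problems))
        else:
            clean.append(record)
    return clean, quarantined
```

Keeping the raw line next to its list of problems makes it possible to re-process quarantined data later, once someone finds out what the new definition actually is. A check like this cannot detect a field whose meaning changed while its type stayed the same.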

[Diagram labels: You, Garbage, Great insights]

Page 5: Big Data Rampage

The data

• Enriched by many operationally attainable sources

--> varying schema and complicated ID soup

• Developed by the front line instead of through an IT waterfall process

--> faster process, but volatile data definition

• Data scientists often require access to more data

--> further risks of lapses

• Big and streaming in

--> risks of discontinuity

Page 6: Big Data Rampage

The Big Data

Please don’t shoot me for using the term

Page 7: Big Data Rampage

What is big?

Human-generated

• 5K tweets / s

• 25K events / s from a mobile game (that’s 200 GB / day)

• 40K Google searches / s

Machine-generated

• 5M quotes / s in the US options market

• 120 MB / s of diagnostics from a single gas turbine

• 1 PB / s peaking from CERN LHC
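The rates above are easier to compare once converted to daily volumes. The short calculation below is only a back-of-the-envelope check of the slide's figures; it assumes decimal units (1 GB = 10^9 bytes) and a constant rate over 24 hours.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

# Mobile game: 25K events/s, reported as 200 GB/day.
events_per_day = 25_000 * SECONDS_PER_DAY
bytes_per_event = 200e9 / events_per_day

# Gas turbine: 120 MB/s of diagnostics.
turbine_tb_per_day = 120e6 * SECONDS_PER_DAY / 1e12

print(f"{events_per_day:,} events/day")
print(f"{bytes_per_event:.1f} bytes/event")
print(f"{turbine_tb_per_day:.2f} TB/day from one turbine")
```

Under these assumptions the game produces about 2.16 billion events per day at roughly 93 bytes each, while a single turbine produces a bit over 10 TB per day, about fifty times the whole game.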

Page 8: Big Data Rampage

What will be big?

• Human-generated data will get more detailed

• … but won’t grow much faster than the userbase

• It will become small eventually

• Machine-generated data will grow according to Moore’s law

• … and it’s already massive

Page 9: Big Data Rampage

How many of you consider this scale?

• Why not?

• We already understand CPU and memory intensive problems

• But the new world out there is data intensive

• How can research stay in touch with change and stay relevant?

Page 10: Big Data Rampage

The Curriculum

Retrofitting CS studies

Page 11: Big Data Rampage

Software Architectures

• Single-thread performance and disk I/O are hitting a wall

• How do learning algorithms scale out of this corner?

• Stochastic methods

• Ensembles

• Online learning
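As an illustration of the stochastic and online methods named above, here is a minimal sketch of logistic regression trained by stochastic gradient descent, one example at a time. The class name, the learning rate, and the plain-list feature format are assumptions made for the example, not something taken from the talk.

```python
import math


class OnlineLogisticRegression:
    """Logistic regression trained by SGD, one example at a time."""

    def __init__(self, n_features, learning_rate=0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict_proba(self, x):
        """Return the estimated probability that the label of x is 1."""
        z = self.bias + sum(w * xi for w, xi in zip(self.weights, x))
        z = max(-30.0, min(30.0, z))  # clamp so exp() cannot overflow
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, x, y):
        """Take a single gradient step on one (features, label) pair."""
        error = self.predict_proba(x) - y  # gradient of the log loss w.r.t. z
        for i, xi in enumerate(x):
            self.weights[i] -= self.learning_rate * error * xi
        self.bias -= self.learning_rate * error
```

Each call to update touches one example and then forgets it, so memory use stays constant no matter how much data streams through.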

Page 12: Big Data Rampage

Databases 1

• In memory: MongoDB, Exasol, Redis

• On disk (single/sharded): MySQL, PostgreSQL

• On data warehouse: Teradata, DB2, Oracle

• Distributed: HDFS, Cassandra, Riak

• Cloud: S3, Azure, GCE, OpenStack

Page 13: Big Data Rampage

Databases 2

• Good old OLTP

• Analytic

• Key-value stores

• Document stores

• HDFS

• What is the best choice for this job?

Page 14: Big Data Rampage

Data Structures and Algorithms

• Transforming data is expensive --> play it safe with data structures

• Normalization dilemma

• Algorithms must tolerate the volatile nature of data

• Data drift, errors, missing values, outliers

• Models need to be explanatory

• Attention to complexity

• The usual obvious (CPU, memory, disk scans & seeks)

• Iterations

• Model size: What is an example of this?

Page 15: Big Data Rampage

Real-time Systems

What is real-time?

Very different requirements:

• Analyst: “What’s the user count today? By source? Now? From France?”

• Sysadmin: “Network traffic up 5x in 5 seconds! What’s going on?”

• Google: “Make a bid for these placements. You have 50 ms”
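The sysadmin case can be sketched as a comparison of two adjacent time windows. The class below is a hypothetical illustration: the name, the 5-second window, and the 5x factor simply mirror the quote above, and timestamps are assumed to be in seconds and to arrive in non-decreasing order.

```python
from collections import deque


class SpikeDetector:
    """Flag when the latest window's volume is `factor` x the previous window's."""

    def __init__(self, window_seconds=5.0, factor=5.0):
        self.window = window_seconds
        self.factor = factor
        self.samples = deque()  # (timestamp, byte_count), oldest first
        self.start = None       # timestamp of the first observation

    def observe(self, timestamp, byte_count):
        """Record one measurement and return True if traffic has spiked."""
        if self.start is None:
            self.start = timestamp
        self.samples.append((timestamp, byte_count))
        # Only the two most recent windows are needed; drop anything older.
        while self.samples and self.samples[0][0] <= timestamp - 2 * self.window:
            self.samples.popleft()

        cutoff = timestamp - self.window
        recent = sum(b for t, b in self.samples if t > cutoff)
        previous = sum(b for t, b in self.samples if t <= cutoff)
        # Do not alert until two full windows of history exist.
        warmed_up = timestamp - self.start >= 2 * self.window
        return warmed_up and previous > 0 and recent >= self.factor * previous
```

The analyst and ad-bidding cases need very different machinery: the first tolerates seconds of latency over large aggregates, the second needs a complete answer within a hard deadline.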

Page 16: Big Data Rampage

User Interfaces

• Operations or not, visualization is critical for acceptance

• From business concept to implementation

• What information do these users want to see?

• How does this information support decision making?

• How can it be visualized clearly yet powerfully?

Page 17: Big Data Rampage

Significance Testing

• Data-driven actions must be backed by numbers

• Early analytics glossed over significance

• Executive: “Can I trust these numbers? Is my decision justified?”

• Systems must act conservatively

• Trust is built slowly, but lost quickly

• Data solutions must not screw up
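A common concrete form of the executive's question is whether a difference between two conversion rates is real. The sketch below uses a pooled two-proportion z-test as one possible choice; it assumes samples large enough for the normal approximation and a pooled rate strictly between 0 and 1. The function name is made up for the example.

```python
import math


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Return (z, two-sided p-value) for H0: both groups share one rate."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# The same 10 % vs 15 % difference at two sample sizes.
print(two_proportion_z_test(10, 100, 15, 100))
print(two_proportion_z_test(100, 1000, 150, 1000))
```

With 100 users per group the difference is well within noise, while with 1,000 users per group it is hard to explain by chance. A system that acts conservatively would wait for the second situation before changing anything.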

Page 18: Big Data Rampage

Modeling Information Business Systems

• Understanding business and how to improve it with data

• The mapping: business problem → data solution

• The most important quality of a data scientist

Page 19: Big Data Rampage

Contrasts

Page 20: Big Data Rampage

Hand-written Turing Machine vs Excel

• Average business has tons of low-hanging data fruit

• Developing and automating all that takes years (and years)

• No use for “advanced” stuff without visibility into what lies underneath

• There is no shortcut

• The organization itself needs to mature

Page 21: Big Data Rampage

Supervised vs Unsupervised

• Decide the purpose of the analysis now or later?

• Most often the need is already formulated

• Here’s a standard clustering of human behavior

• Power laws will screw things up
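To see why power laws cause trouble, consider a small synthetic example (the data is invented for illustration): the k-th most active of 1,000 users generates 1000/k events. Averages and equal-sized clusters say little about such a population.

```python
import statistics

# Synthetic Zipf-like activity: the k-th most active user generates 1000/k events.
activity = [1000 / rank for rank in range(1, 1001)]

mean_activity = statistics.mean(activity)
median_activity = statistics.median(activity)
# Share of all events produced by the 10 most active users (the top 1 %).
top_1_percent_share = sum(sorted(activity, reverse=True)[:10]) / sum(activity)

print(f"mean   = {mean_activity:.2f}")
print(f"median = {median_activity:.2f}")
print(f"top 1% of users produce {top_1_percent_share:.0%} of all events")
```

In this toy population the mean is several times the median, and a handful of users account for a large share of all activity, so any method that treats the "average user" as typical will be misled.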

Page 22: Big Data Rampage

Ad-hoc vs Operations

• Operational data algorithms run day and night without supervision

• Can produce massive leverage and ROI to a business

• … but they are crazy hard to develop

• Ad hoc analysis can employ all the cool stuff from last month’s JMLR

• … but it can’t scale

• … and 90 % of effort goes to communication and visualization

Page 23: Big Data Rampage

Computation Models

Page 24: Big Data Rampage

State snapshots

• User actions modify the current state in an OLTP

• Single actions go to offline audit log for re-running

• Data algorithms need to export and import data

• Things are run in batches

• What data used to be (and still often is)
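The relationship between the audit log and the state snapshot can be sketched as a replay function: a snapshot is whatever you get by applying the logged actions in order. The event format (user, type, amount) and the function names below are hypothetical.

```python
def apply_event(state, event):
    """Return a new state dict with one event applied (input is not mutated)."""
    new_state = dict(state)
    user = event["user"]
    if event["type"] == "deposit":
        new_state[user] = new_state.get(user, 0) + event["amount"]
    elif event["type"] == "withdraw":
        new_state[user] = new_state.get(user, 0) - event["amount"]
    else:
        raise ValueError(f"unknown event type: {event['type']}")
    return new_state


def replay(events, snapshot=None):
    """Rebuild state by re-running logged events on top of an optional snapshot."""
    state = dict(snapshot or {})
    for event in events:
        state = apply_event(state, event)
    return state


log = [
    {"user": "a", "type": "deposit", "amount": 100},
    {"user": "b", "type": "deposit", "amount": 50},
    {"user": "a", "type": "withdraw", "amount": 30},
]
snapshot = replay(log[:2])                 # state after the first batch
print(replay(log[2:], snapshot=snapshot))  # same result as replay(log)
```

Replaying the tail of the log on top of a snapshot gives the same state as replaying everything from the start, which is what makes batch-wise processing workable.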

[Diagram labels: Events; Snapshot (×3)]

Page 25: Big Data Rampage

Data Warehouse

• Additional endpoint specialized for analytics

• Can run surprisingly many algorithms

• … because the speed is so worth the effort

Page 26: Big Data Rampage

Cloud

• “Scalable SOA for computation, networking and storage”

• Really all about strict APIs

• Service dog wagging the infrastructure tail

• Public cloud very competitive for the small guys

• Hybrid clouds increasingly replace enterprise systems

Page 27: Big Data Rampage

Event data

• The event stream itself becomes a first-class citizen, labeled as the master data

• Needs novel storage

• Needs novel processing

• Data scientists beware! Sugar high imminent!

Page 28: Big Data Rampage

Stream processing

• New data is coming in all the time

• Process it online

• Data becomes somewhat disposable

• “Why bother with month-old data when there’s too much of it anyway?”
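Processing data online means keeping only a small summary and discarding each value once it has been seen. A standard example is Welford's algorithm for a running mean and variance, sketched below; the class name is an assumption made for the example.

```python
class RunningStats:
    """Running mean and sample variance in O(1) memory (Welford's algorithm)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the current mean

    def push(self, x):
        """Fold one new value into the summary; the value need not be kept."""
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance; 0.0 until at least two values have been seen."""
        return self._m2 / (self.count - 1) if self.count > 1 else 0.0


stats = RunningStats()
for value in [2, 4, 4, 4, 5, 5, 7, 9]:
    stats.push(value)
print(stats.mean, stats.variance)
```

The summary is three numbers regardless of how many values have streamed through, which is what makes the raw data "somewhat disposable".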

Page 29: Big Data Rampage

Iterative processing

• Has always been the problem with large data

• Keeping state in memory is necessary, but hard

• Spark doesn’t solve this, but makes it less painful

• Common fix: don’t do iterations
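One possible reading of "don't do iterations" is to prefer models that can be fitted from statistics accumulated in a single pass over the data. The sketch below fits a straight line this way instead of running gradient descent; it is an illustration of that reading, not a method described in the talk, and the function name is made up.

```python
def fit_line_single_pass(points):
    """Least-squares fit of y = slope * x + intercept from one pass over (x, y)."""
    n = sum_x = sum_y = sum_xx = sum_xy = 0.0
    for x, y in points:
        n += 1
        sum_x += x
        sum_y += y
        sum_xx += x * x
        sum_xy += x * y
    denominator = n * sum_xx - sum_x * sum_x
    if denominator == 0:
        raise ValueError("need at least two distinct x values")
    slope = (n * sum_xy - sum_x * sum_y) / denominator
    intercept = (sum_y - slope * sum_x) / n
    return slope, intercept


# Works on any iterable, including a generator that reads the data only once.
print(fit_line_single_pass((x, 2 * x + 1) for x in range(4)))
```

The five running sums can also be computed independently on separate chunks of data and added together, so the fit parallelizes without any iteration. For badly scaled data these raw sums lose precision, so a production version would need more care.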

Page 30: Big Data Rampage

Hadoop the Hairy Framework

• HDFS, ZooKeeper, MapReduce, Hive, Pig, YARN, Flume, Mahout, Bigtop, Oozie, Hue, HCatalog, Avro, Whirr, Sqoop, Impala, DataFu, …

• Premise of insanely large and/or unstructured data

• You probably don’t need it

Page 31: Big Data Rampage

Will Hadoop replace the Data Warehouse?

Separate concepts: Hadoop The Framework vs. MapReduce

• MapReduce suited for totally different tasks

• Hadoop can host a data warehouse

• … but it won’t be any easier or quicker to develop
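To make the distinction concrete, MapReduce as a programming model is just three steps: map each record to key-value pairs, group the pairs by key, and reduce each group. The in-memory sketch below shows the model only; it does not use Hadoop or any of its APIs, and all function names are made up for the example.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every record and yield all (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group values by key, as the framework would do between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Collapse each key's list of values into a single result."""
    return {key: reducer(key, values) for key, values in groups.items()}


def word_mapper(line):
    for word in line.lower().split():
        yield word, 1


def count_reducer(word, counts):
    return sum(counts)


def word_count(lines):
    return reduce_phase(shuffle(map_phase(lines, word_mapper)), count_reducer)


print(word_count(["big data", "Big rampage"]))
```

Everything the framework adds (distributing the map and reduce calls, moving data between machines, recovering from failures) sits around these three steps, which is why the model suits bulk scans much better than the interactive queries a data warehouse serves.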

Page 32: Big Data Rampage

The Purpose

Page 33: Big Data Rampage

What does Big Data mean for a business?

• Answers … a lot more answers

• Better, more reliable decision making

• Treating customers as individuals instead of segments

• How to design processes (both business and social) to employ data?

Page 34: Big Data Rampage

Data-driven decision making

Page 35: Big Data Rampage

Thank you!

• Always eager to talk about this stuff, feel free to contact me!

• Now it’s time for lots of questions!

[email protected]

• linkedin.com/in/nikovuokko

• @nikovuokko
