Big data, why now?

Transcript of Big data, why now?

1©MapR Technologies - Confidential

MapR: The Next Generation Big Data Platform

Big is the next big thing

Big data and Hadoop are exploding

Companies are being funded

Books are being written

Applications sprouting up everywhere

Slow Motion Explosion

Hadoop Explosion

Why Now?

But Moore’s law has applied for a long time

Why is Hadoop exploding now?

Why not 10 years ago?

Why not 20?

6/1/2012

Size Matters, but …

If it were just availability of data then existing big companies would adopt big data technology first

They didn’t

Or Maybe Cost

If it were just a net positive value then finance companies should adopt first because they have higher opportunity value / byte

They didn’t

Backwards adoption

Under almost any threshold argument startups would not adopt big data technology first

They did

Everywhere at Once?

Something very strange is happening

– Big data is being applied at many different scales

– At many value scales

– By large companies and small

Why?

The Conventional Answer

More data is being produced more quickly

Data sizes are bigger than even a very large computer can hold

Cost to create and store continues to decrease

Analytics Scaling Laws

Analytics scaling is all about the 80-20 rule

– Big gains for little initial effort

– Rapidly diminishing returns

The key to net value is how costs scale

– Old school – exponential scaling

– Big data – linear scaling, low constant

Cost/performance has changed radically

– IF you can use many commodity boxes
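The net-value argument above can be sketched numerically. This is only an illustration under assumed curve shapes: the saturating value curve, both cost functions, and every constant below are hypothetical choices made to show the idea, not figures from the talk.

```python
import math


def value(scale, knee=200.0):
    """Assumed diminishing-returns curve: most of the value arrives early."""
    return 1.0 - math.exp(-scale / knee)


def exponential_cost(scale, unit=0.01, efold=300.0):
    """Assumed 'old school' cost: grows exponentially with scale."""
    return unit * (math.exp(scale / efold) - 1.0)


def linear_cost(scale, slope=0.00005):
    """Assumed 'big data' cost: linear in scale with a low constant."""
    return slope * scale


def best_scale(cost, max_scale=2000):
    """Return (scale, net value) maximizing value - cost over 0..max_scale."""
    candidates = ((s, value(s) - cost(s)) for s in range(max_scale + 1))
    return max(candidates, key=lambda pair: pair[1])


exp_scale, exp_net = best_scale(exponential_cost)
lin_scale, lin_net = best_scale(linear_cost)

print(f"exponential cost: best scale {exp_scale}, net value {exp_net:.3f}")
print(f"linear cost:      best scale {lin_scale}, net value {lin_net:.3f}")
```

With these assumed curves, the linear cost law moves the best operating point to a larger scale and a higher net value, and both optima fall well short of the maximum scale.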

We knew that

We should have known that

We didn’t know that!

You’re kidding, people do that?

[Figure: value (0 to 1) versus scale (0 to 2,000), with these labels on the plot:]

Anybody with eyes

Intern with a spreadsheet

In-house analytics

Industry-wide data consortium

NSA, non-proliferation

[Figure: value versus scale, two panels]

Net value optimum has a sharp peak well before maximum effort

But scaling laws are changing both slope and shape

[Figure: value versus scale]

More than just a little

[Figure: value versus scale]

They are changing a LOT!

[Figures: value versus scale]

Initially, linear cost scaling actually makes things worse

A tipping point is reached and things change radically …

Pre-requisites for Tipping

To reach the tipping point,

Algorithms must scale out horizontally

– On commodity hardware

– That can and will fail

Data practice must change

– Denormalized is the new black

– Flexible data dictionaries are the rule

– Structured data becomes rare

But there is more

Especially for large enterprises

Physics of startup companies

For startups

History is always small

The future is huge

Must adopt new technology to survive

Compatibility is not as important

– In fact, incompatibility is assumed

Physics of large companies

Startup phase

Absolute growth still very large

For large businesses

Present state is always large

Relative growth is much smaller

Absolute growth rate can be very large

Must adopt new technology to survive

– Cautiously!

– But must integrate technology with legacy

Compatibility is crucial

The startup technology picture

Old computers and software

Current computers and software

Expected hardware and software growth

No compatibility requirement

The large enterprise picture

Proof of concept Hadoop cluster

Long-term Hadoop cluster

Current hardware and software

?

Must work together

So that is why, and why now

What can you do with it?

And how?

Scale-free Computing

Map-reduce

– pure functions for practical batch parallel computation

– high level languages like Hive and Pig available

– MapR provides standard access systems via NFS and ODBC

BSP

– pure functions for synchronous iterative actor-based compute

– Apache Giraph provides practical implementation

Actors

– tuple passing with transformations

– Storm provides practical implementation
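As a rough illustration of the map-reduce item above, here is a minimal single-process sketch in which the mapper and reducer are pure functions. The names map_words, reduce_counts, shuffle and map_reduce are hypothetical helpers for this example and are not Hadoop or MapR APIs; a real cluster would run the same two functions in parallel across many machines.

```python
from collections import defaultdict


def map_words(line):
    """Pure mapper: one input line in, a list of (word, 1) pairs out."""
    return [(word.lower(), 1) for word in line.split()]


def reduce_counts(key, values):
    """Pure reducer: all values for one key in, a single (key, total) out."""
    return key, sum(values)


def shuffle(pairs):
    """Group mapper output by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, val in pairs:
        groups[key].append(val)
    return groups


def map_reduce(records, mapper, reducer):
    """Run mapper over every record, group by key, then reduce each group."""
    pairs = [pair for record in records for pair in mapper(record)]
    return dict(reducer(key, vals) for key, vals in shuffle(pairs).items())


lines = ["big data is big", "data has mass"]
counts = map_reduce(lines, map_words, reduce_counts)
print(counts)
```

Because neither function touches shared state, the records can be split across workers in any way and the per-key results are unchanged, which is what makes the batch computation easy to parallelize.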

Future Proof Schemas

Denormalize data where possible to avoid seeks

– use embedded lists

– duplicate data

Flexible Schemas

– use standard system for data serialization

– must provide protocol migration without versioning

– Protobufs (Google), Avro (Apache) and Thrift can all be used
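The schema points above can be sketched with nothing but the standard library. This is an assumption-laden illustration: the Order record, its field names, and the decode_order helper are hypothetical, and plain JSON stands in for Protobufs, Avro or Thrift, which handle schema evolution through their own mechanisms.

```python
import json
from dataclasses import dataclass, field, fields


@dataclass
class Order:
    order_id: str
    customer_name: str = ""                    # duplicated here to avoid a lookup
    items: list = field(default_factory=list)  # embedded list, not a join table
    currency: str = "USD"                      # added later; default covers old data


def decode_order(text):
    """Decode a record, ignoring unknown fields and defaulting missing ones."""
    raw = json.loads(text)
    known = {f.name: raw[f.name] for f in fields(Order) if f.name in raw}
    return Order(**known)


# Written before the currency field existed.
old_record = '{"order_id": "o-1", "customer_name": "Ada", "items": [{"sku": "a", "qty": 2}]}'
# Written by a newer producer that added a field this reader does not know.
new_record = '{"order_id": "o-2", "customer_name": "Bob", "items": [], "currency": "EUR", "coupon": "X1"}'

old_order = decode_order(old_record)
new_order = decode_order(new_record)
print(old_order)
print(new_order)
```

Old and new records are read by the same code with no version number in the data: missing fields fall back to defaults and unknown fields are skipped.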

Open Compute and Storage

Big data has mass and inertia

– once it lands, it should not move

Computation must move to the data

– map-reduce, Storm, Giraph … all OK

– conventional relational models … not OK

One model is not enough

– must allow access by multiple models of computation

Thank You