Download - Introduction to Big Data - Amir H. PayberahBig Data, the Noam Chomsky Way Big data is a step forward. But, our problems are not lack of access to data, but understanding them. [Big

Transcript
Page 1: Introduction to Big Data - Amir H. PayberahBig Data, the Noam Chomsky Way Big data is a step forward. But, our problems are not lack of access to data, but understanding them. [Big

Introduction to Big Data

Amir H. PayberahSwedish Institute of Computer Science

[email protected] 8, 2014

Amir H. Payberah (SICS) Introduction April 8, 2014 1 / 36

Page 2: Introduction to Big Data - Amir H. PayberahBig Data, the Noam Chomsky Way Big data is a step forward. But, our problems are not lack of access to data, but understanding them. [Big

Data are not much use without human intuition ...

Data is not information, information is not knowledge, knowledge isnot understanding, understanding is not wisdom.

- Clifford Stoll

Amir H. Payberah (SICS) Introduction April 8, 2014 2 / 36

Page 3: Introduction to Big Data - Amir H. PayberahBig Data, the Noam Chomsky Way Big data is a step forward. But, our problems are not lack of access to data, but understanding them. [Big

... analyzing data gives power.

Without big data analytics, companies are blind and deaf, wanderingout onto the web like deer on a freeway.

- Geoffrey Moore

Amir H. Payberah (SICS) Introduction April 8, 2014 3 / 36

Page 4: Introduction to Big Data - Amir H. PayberahBig Data, the Noam Chomsky Way Big data is a step forward. But, our problems are not lack of access to data, but understanding them. [Big

Analyzing data is worth the cost ...

The price of light is less than the cost of darkness.

- Arthur C. Nielsen

Amir H. Payberah (SICS) Introduction April 8, 2014 4 / 36

Page 5: Introduction to Big Data - Amir H. PayberahBig Data, the Noam Chomsky Way Big data is a step forward. But, our problems are not lack of access to data, but understanding them. [Big

..., but there are problems with relying on data too much.

Not everything that can be counted counts, and not everything thatcounts can be counted.

- Albert Einstein

Amir H. Payberah (SICS) Introduction April 8, 2014 5 / 36

Page 6: Introduction to Big Data - Amir H. PayberahBig Data, the Noam Chomsky Way Big data is a step forward. But, our problems are not lack of access to data, but understanding them. [Big

Data is a treasure ..., except when it is not.

Getting information off the Internet is like taking a drinkfrom a fire hose.

- Mitchell Kapor

Amir H. Payberah (SICS) Introduction April 8, 2014 6 / 36

Page 7: Introduction to Big Data - Amir H. PayberahBig Data, the Noam Chomsky Way Big data is a step forward. But, our problems are not lack of access to data, but understanding them. [Big

However, any data is better than none.

An approximate answer to the right problem is worth a good dealmore than an exact answer to an approximate problem.

- John Tukey

Amir H. Payberah (SICS) Introduction April 8, 2014 7 / 36

Page 8: Introduction to Big Data - Amir H. PayberahBig Data, the Noam Chomsky Way Big data is a step forward. But, our problems are not lack of access to data, but understanding them. [Big

Big Data, the Noam Chomsky Way

Big data is a step forward. But, our problems are not lack of access to data,

but understanding them. [Big data] is very useful if I want to find out something

without going to the library, but I have to understand it, and that’s the problem.

Hmmm, not very much Chomsky-ish ..., but wait!

We can be confident that any system of power - whether it’s the state, Google, or

whatever - is going to use the best available technology to control, to dominate,

and to maximize their power. And they’ll want to do it in secret.

Now that’s sounding more like Chomsky.

Amir H. Payberah (SICS) Introduction April 8, 2014 8 / 36

Page 9: Introduction to Big Data - Amir H. PayberahBig Data, the Noam Chomsky Way Big data is a step forward. But, our problems are not lack of access to data, but understanding them. [Big

Big Data, the Noam Chomsky Way

Big data is a step forward. But, our problems are not lack of access to data,

but understanding them. [Big data] is very useful if I want to find out something

without going to the library, but I have to understand it, and that’s the problem.

Hmmm, not very much Chomsky-ish ..., but wait!

We can be confident that any system of power - whether it’s the state, Google, or

whatever - is going to use the best available technology to control, to dominate,

and to maximize their power. And they’ll want to do it in secret.

Now that’s sounding more like Chomsky.

Amir H. Payberah (SICS) Introduction April 8, 2014 8 / 36

Page 10: Introduction to Big Data - Amir H. PayberahBig Data, the Noam Chomsky Way Big data is a step forward. But, our problems are not lack of access to data, but understanding them. [Big

Big Data, the Noam Chomsky Way

Big data is a step forward. But, our problems are not lack of access to data,

but understanding them. [Big data] is very useful if I want to find out something

without going to the library, but I have to understand it, and that’s the problem.

Hmmm, not very much Chomsky-ish ..., but wait!

We can be confident that any system of power - whether it’s the state, Google, or

whatever - is going to use the best available technology to control, to dominate,

and to maximize their power. And they’ll want to do it in secret.

Now that’s sounding more like Chomsky.

Amir H. Payberah (SICS) Introduction April 8, 2014 8 / 36

Page 11: Introduction to Big Data - Amir H. PayberahBig Data, the Noam Chomsky Way Big data is a step forward. But, our problems are not lack of access to data, but understanding them. [Big

They Want to Do It In Secret ...

The truth cannot stay hidden forever!

Amir H. Payberah (SICS) Introduction April 8, 2014 9 / 36

Page 12: Introduction to Big Data - Amir H. PayberahBig Data, the Noam Chomsky Way Big data is a step forward. But, our problems are not lack of access to data, but understanding them. [Big

A Brief History ofData Management!

Amir H. Payberah (SICS) Introduction April 8, 2014 10 / 36

Page 13: Introduction to Big Data - Amir H. PayberahBig Data, the Noam Chomsky Way Big data is a step forward. But, our problems are not lack of access to data, but understanding them. [Big

4000 B.C

I Manual recording

I From tablets to papyrus, to parchment, and then to paper

Amir H. Payberah (SICS) Introduction April 8, 2014 11 / 36

Page 14: Introduction to Big Data - Amir H. PayberahBig Data, the Noam Chomsky Way Big data is a step forward. But, our problems are not lack of access to data, but understanding them. [Big

1450

I Gutenberg’s printing press

Amir H. Payberah (SICS) Introduction April 8, 2014 12 / 36

Page 15: Introduction to Big Data - Amir H. PayberahBig Data, the Noam Chomsky Way Big data is a step forward. But, our problems are not lack of access to data, but understanding them. [Big

1800’s - 1940’s

I Punched cards (no fault-tolerance)

I Binary data

I 1890: US census

I 1911: IBM appeared

Amir H. Payberah (SICS) Introduction April 8, 2014 13 / 36

Page 16: Introduction to Big Data - Amir H. PayberahBig Data, the Noam Chomsky Way Big data is a step forward. But, our problems are not lack of access to data, but understanding them. [Big

1940’s - 1970’s

I Magnetic tapes

I Batch transaction processing

I File-oriented record processing model (e.g., COBOL)

I Hierarchical DBMS (one-to-many)

I Network DBMS (many-to-many)

Amir H. Payberah (SICS) Introduction April 8, 2014 14 / 36

Page 17: Introduction to Big Data - Amir H. PayberahBig Data, the Noam Chomsky Way Big data is a step forward. But, our problems are not lack of access to data, but understanding them. [Big

1980’s

I Relational DBMS (tables) and SQL

I ACID

I Client-server computing

I Parallel processing

Amir H. Payberah (SICS) Introduction April 8, 2014 15 / 36

Page 18: Introduction to Big Data - Amir H. PayberahBig Data, the Noam Chomsky Way Big data is a step forward. But, our problems are not lack of access to data, but understanding them. [Big

1990’s - 2000’s

I The Internet...

Amir H. Payberah (SICS) Introduction April 8, 2014 16 / 36

Page 19: Introduction to Big Data - Amir H. PayberahBig Data, the Noam Chomsky Way Big data is a step forward. But, our problems are not lack of access to data, but understanding them. [Big

2010’s

I NoSQL: BASE instead of ACID

I Big Data

Amir H. Payberah (SICS) Introduction April 8, 2014 17 / 36

Page 20: Introduction to Big Data - Amir H. PayberahBig Data, the Noam Chomsky Way Big data is a step forward. But, our problems are not lack of access to data, but understanding them. [Big

Big Data

I In recent years we have witnessed a dramatic increase in availabledata.

I For example, the number of web pages indexed by Google, whichwere around one million in 1998, have exceeded one trillion in 2008,and its expansion is accelerated by appearance of the social net-works.

Amir H. Payberah (SICS) Introduction April 8, 2014 18 / 36

Page 21: Introduction to Big Data - Amir H. PayberahBig Data, the Noam Chomsky Way Big data is a step forward. But, our problems are not lack of access to data, but understanding them. [Big

Big Data Definition

I Big Data refers to datasets and flows largeenough that has outpaced our capability tostore, process, analyze, and understand.

Amir H. Payberah (SICS) Introduction April 8, 2014 19 / 36

Page 22: Introduction to Big Data - Amir H. PayberahBig Data, the Noam Chomsky Way Big data is a step forward. But, our problems are not lack of access to data, but understanding them. [Big

The Four Dimensions of Big Data

I Volume: data size

I Velocity: data generation rate

I Variety: data heterogeneity

I Veracity: uncertainty of accuracy andauthenticity of data

Amir H. Payberah (SICS) Introduction April 8, 2014 20 / 36

Page 23: Introduction to Big Data - Amir H. PayberahBig Data, the Noam Chomsky Way Big data is a step forward. But, our problems are not lack of access to data, but understanding them. [Big

Big Data Market Driving Factors

I Mobile devices

I Internet of Things (IoT)

I Cloud computing

I Open source initiatives

Amir H. Payberah (SICS) Introduction April 8, 2014 21 / 36

Page 24: Introduction to Big Data - Amir H. PayberahBig Data, the Noam Chomsky Way Big data is a step forward. But, our problems are not lack of access to data, but understanding them. [Big

The Big Data Stack!

Amir H. Payberah (SICS) Introduction April 8, 2014 22 / 36

Page 25: Introduction to Big Data - Amir H. PayberahBig Data, the Noam Chomsky Way Big data is a step forward. But, our problems are not lack of access to data, but understanding them. [Big

Big Data Analytics Stack

Amir H. Payberah (SICS) Introduction April 8, 2014 23 / 36

Page 26: Introduction to Big Data - Amir H. PayberahBig Data, the Noam Chomsky Way Big data is a step forward. But, our problems are not lack of access to data, but understanding them. [Big

Big Data - Storage (Filesystem)

I Traditional filesystems are not well-designed for large-scale dataprocessing systems.

I Efficiency has a higher priority than other features, e.g., directoryservice.

I Massive size of data tends to store it across multiple machines in adistributed way.

I HDFS, Amazon S3, ...

Amir H. Payberah (SICS) Introduction April 8, 2014 24 / 36

Page 27: Introduction to Big Data - Amir H. PayberahBig Data, the Noam Chomsky Way Big data is a step forward. But, our problems are not lack of access to data, but understanding them. [Big

Big Data - Database

I Relational Databases Management Systems (RDMS) were not de-signed to be distributed.

I NoSQL databases relax one or more of the ACID properties: BASE

I Different data models: key/value, column-family, graph, document.

I Dynamo, Scalaris, BigTable, Hbase, Cassandra, MongoDB, Volde-mort, Riak, Neo4J, ...

Amir H. Payberah (SICS) Introduction April 8, 2014 25 / 36

Page 28: Introduction to Big Data - Amir H. PayberahBig Data, the Noam Chomsky Way Big data is a step forward. But, our problems are not lack of access to data, but understanding them. [Big

Big Data - Resource Management

I Different frameworks require different computing resources.

I Large organizations need the ability to share data and resourcesbetween multiple frameworks.

I Resource management share resources in a cluster between multipleframeworks while providing resource isolation.

I Mesos, YARN, Quincy, ...

Amir H. Payberah (SICS) Introduction April 8, 2014 26 / 36

Page 29: Introduction to Big Data - Amir H. PayberahBig Data, the Noam Chomsky Way Big data is a step forward. But, our problems are not lack of access to data, but understanding them. [Big

Big Data - Execution Engine

I Scalable and fault tolerance parallel data processing on clusters ofunreliable machines.

I Data-parallel programming model for clusters of commodity ma-chines.

I MapReduce, Spark, Stratosphere, Dryad, Hyracks, ...

Amir H. Payberah (SICS) Introduction April 8, 2014 27 / 36

Page 30: Introduction to Big Data - Amir H. PayberahBig Data, the Noam Chomsky Way Big data is a step forward. But, our problems are not lack of access to data, but understanding them. [Big

Big Data - Query/Scripting Language

I Low-level programming of execution engines, e.g., MapReduce, isnot easy for end users.

I Need high-level language to improve the query capabilities of exe-cution engines.

I It translates user-defined functions to low-level API of the executionengines.

I Pig, Hive, Shark, Meteor, DryadLINQ, SCOPE, ...

Amir H. Payberah (SICS) Introduction April 8, 2014 28 / 36

Page 31: Introduction to Big Data - Amir H. PayberahBig Data, the Noam Chomsky Way Big data is a step forward. But, our problems are not lack of access to data, but understanding them. [Big

Big Data - Stream Processing

I Providing users with fresh and low latency results.

I Database Management Systems (DBMS) vs. Stream ProcessingSystems (SPS)

I Storm, S4, SEEP, D-Stream, Naiad, ...

Amir H. Payberah (SICS) Introduction April 8, 2014 29 / 36

Page 32: Introduction to Big Data - Amir H. PayberahBig Data, the Noam Chomsky Way Big data is a step forward. But, our problems are not lack of access to data, but understanding them. [Big

Big Data - Graph Processing

I Many problems are expressed using graphs: sparse computationaldependencies, and multiple iterations to converge.

I Data-parallel frameworks, such as MapReduce, are not ideal forthese problems: slow

I Graph processing frameworks are optimized for graph-based prob-lems.

I Pregel, Giraph, GraphX, GraphLab, PowerGraph, GraphChi, ...

Amir H. Payberah (SICS) Introduction April 8, 2014 30 / 36

Page 33: Introduction to Big Data - Amir H. PayberahBig Data, the Noam Chomsky Way Big data is a step forward. But, our problems are not lack of access to data, but understanding them. [Big

Big Data - Machine Learning

I Implementing and consuming machine learning techniques at scaleare difficult tasks for developers and end users.

I There exist platforms that address it by providing scalable machine-learning and data mining libraries.

I Mahout, MLBase, SystemML, Ricardo, Presto, ...

Amir H. Payberah (SICS) Introduction April 8, 2014 31 / 36

Page 34: Introduction to Big Data - Amir H. PayberahBig Data, the Noam Chomsky Way Big data is a step forward. But, our problems are not lack of access to data, but understanding them. [Big

Hadoop Big Data Analytics Stack

Amir H. Payberah (SICS) Introduction April 8, 2014 32 / 36

Page 35: Introduction to Big Data - Amir H. PayberahBig Data, the Noam Chomsky Way Big data is a step forward. But, our problems are not lack of access to data, but understanding them. [Big

Stratosphere Big Data Analytics Stack

Amir H. Payberah (SICS) Introduction April 8, 2014 33 / 36

Page 36: Introduction to Big Data - Amir H. PayberahBig Data, the Noam Chomsky Way Big data is a step forward. But, our problems are not lack of access to data, but understanding them. [Big

Spark Big Data Analytics Stack

Amir H. Payberah (SICS) Introduction April 8, 2014 34 / 36

Page 37: Introduction to Big Data - Amir H. PayberahBig Data, the Noam Chomsky Way Big data is a step forward. But, our problems are not lack of access to data, but understanding them. [Big

Summary

Amir H. Payberah (SICS) Introduction April 8, 2014 35 / 36

Page 38: Introduction to Big Data - Amir H. PayberahBig Data, the Noam Chomsky Way Big data is a step forward. But, our problems are not lack of access to data, but understanding them. [Big

Questions?

Amir H. Payberah (SICS) Introduction April 8, 2014 36 / 36