Big Data in Action

Post on 08-Sep-2014



Big Data technologies and applications in Vietnam


Big Data in Action
Ngon Pham, Lana Engineer

Introduction
● Introduction
● Problem
● Approach
● Demo
● Big Data in Vietnam

Introduction
● Internet-enabled devices
  ○ Tons of data generated every second
● Hardware becomes much cheaper
  ○ We can now store and process much more data

Problem
● How to process 10TB: how long does it take and how much does it cost?
  ○ Assume
    ■ Amazon EC2
    ■ HDD reads at 50MB/s
    ■ Computation time is less than I/O time

Problem
● 1 machine, 1 core, 1 HDD
  ○ Time: 55.56 hours
  ○ Amazon cost: $0.12/hour x 55.56 hours = $6.67
● 10 machines, 40 cores, 40 HDDs
  ○ Time: 1.39 hours
  ○ Amazon cost: $0.48/hour x 10 machines x 1.39 hours = $6.67
⇒ The same cost but 40x faster
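The figures on this slide can be reproduced with a short calculation. The sketch below assumes decimal units (10TB = 10,000,000MB), that the job is purely I/O bound, that the quoted prices are per machine per hour, and that each of the 10 machines holds 4 of the 40 disks. The function and variable names are made up for illustration.

```python
# Back-of-the-envelope estimate using the slide's assumptions.
DATA_MB = 10 * 1000 * 1000      # 10TB expressed in MB (decimal units)
DISK_MB_PER_S = 50              # sequential read speed of one HDD


def estimate(machines, disks_per_machine, price_per_machine_hour):
    """Return (hours, dollars) for reading all the data once, I/O bound."""
    total_disks = machines * disks_per_machine
    seconds = DATA_MB / (DISK_MB_PER_S * total_disks)
    hours = seconds / 3600
    cost = hours * machines * price_per_machine_hour
    return hours, cost


single = estimate(machines=1, disks_per_machine=1, price_per_machine_hour=0.12)
cluster = estimate(machines=10, disks_per_machine=4, price_per_machine_hour=0.48)

print(f"1 machine:   {single[0]:.2f} h, ${single[1]:.2f}")
print(f"10 machines: {cluster[0]:.2f} h, ${cluster[1]:.2f}")
print(f"speed-up:    {single[0] / cluster[0]:.0f}x")
```

The cost stays flat because the total number of disk-hours paid for is the same in both cases; only the wall-clock time changes.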

Question
● How to divide data and processes between machines?
● How to make each process read data directly from its own machine instead of from another one?
● How to replicate data and restore the process if there is a failure?
● Lots of task management questions...

Approach
● Hadoop
● MongoDB
● Spark

Hadoop Approach
● Storage
  ○ HDFS
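To give a feel for how HDFS-style storage addresses the division and replication questions, here is a toy sketch that splits a file into fixed-size blocks and stores several copies of each block on different machines. The 128MB block size and 3 copies are treated as assumptions here, and the round-robin placement is a simplification for illustration, not HDFS's actual placement policy. All names are hypothetical.

```python
# Toy model of HDFS-style storage: split a file into fixed-size blocks and
# place each block on several machines. Round-robin placement is an
# assumption made for illustration, not HDFS's real placement policy.
BLOCK_SIZE_MB = 128   # assumed block size
REPLICATION = 3       # assumed number of copies per block


def place_blocks(file_size_mb, machines):
    """Return {block_index: [machine, ...]} for one file."""
    num_blocks = -(-file_size_mb // BLOCK_SIZE_MB)  # ceiling division
    placement = {}
    for block in range(num_blocks):
        placement[block] = [
            machines[(block + copy) % len(machines)]
            for copy in range(REPLICATION)
        ]
    return placement


def survives(placement, failed_machine):
    """True if every block still has a copy after one machine fails."""
    return all(
        any(m != failed_machine for m in replicas)
        for replicas in placement.values()
    )


machines = [f"node{i}" for i in range(1, 6)]
layout = place_blocks(file_size_mb=1000, machines=machines)
print(len(layout))                 # number of blocks
print(layout[0])                   # machines holding block 0
print(survives(layout, "node1"))   # whether the data is still readable
```

Because every block lives on more than one machine, a task can be scheduled on a machine that already holds the block it needs, and a single failure does not lose data.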

Hadoop Approach
● Computation
  ○ MapReduce
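The MapReduce model can be illustrated with a word count written in plain Python. This is a single-process sketch of the three phases (map, shuffle, reduce), not Hadoop's API; in a real cluster the map and reduce calls would run in parallel on different machines. The function names and sample input are made up for illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


lines = ["big data in action", "big data in Vietnam"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)
```

Each map call only needs its own line and each reduce call only needs the values of one key, which is what lets the framework spread the work across machines.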

MongoDB Approach
● Storage
  ○ Document

MongoDB Approach
● Computation
  ○ SQL-like queries
  ○ Aggregation
  ○ MapReduce
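As a rough illustration of the aggregation option, the sketch below shows a one-stage MongoDB aggregation pipeline (`$group` with `$sum`) next to a pure-Python function that computes the same grouping, so it runs without a MongoDB server. The documents, the field names `source` and `views`, and the helper `group_sum` are hypothetical.

```python
from collections import defaultdict

# Hypothetical documents, shaped like records in a MongoDB collection.
pages = [
    {"source": "blog", "views": 120},
    {"source": "news", "views": 300},
    {"source": "blog", "views": 80},
]

# The aggregation pipeline as it could be passed to a MongoDB driver.
pipeline = [{"$group": {"_id": "$source", "total": {"$sum": "$views"}}}]


def group_sum(docs, key_field, value_field):
    """Pure-Python equivalent of the single $group stage above."""
    totals = defaultdict(int)
    for doc in docs:
        totals[doc[key_field]] += doc[value_field]
    return [{"_id": key, "total": total} for key, total in totals.items()]


result = group_sum(pages, "source", "views")
print(result)
```

With a running server, the same `pipeline` list would be handed to the collection's aggregate call instead of being emulated locally.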

Spark Approach
● Storage
  ○ Resilient distributed dataset (RDD)
  ○ Persistence backed by HDFS / HBase...

Spark Approach
● Computation
  ○ Mixed
  ○ In-memory computing

Demo
● Hadoop
  ○ Run a script to create an Amazon cluster
  ○ Play with Hadoop / HDFS / Spark
  ○ Process Wikipedia data
● MongoDB
  ○ Collect data from different sources and analyze it

Big Data in Vietnam

Big Data in Vietnam
● Why is MongoDB popular?
  ○ Lots of PHP developers prefer it
  ○ Simple to set up and use
  ○ Similar to MySQL

Big Data in Vietnam
● Hadoop is used by a few big local online companies & international startups
  ○ Analyze tons of data
  ○ Create new competitive advantage
⇒ But there is a big shortage of skilled engineers

Q & A
