Big Data in Action
Ngon Pham, Lana Engineer
Introduction
● Introduction
● Problem
● Approach
● Demo
● Big Data in Vietnam
Introduction
● Internet-enabled devices
○ Tons of data generated every second
● Hardware has become much cheaper
○ We can now store and process much more data
Problem
● How do we process 10 TB of data? How long does it take, and how much does it cost?
○ Assume
■ Amazon EC2
■ HDD reads at 50 MB/s
■ Computation time is less than I/O time
Problem
● 1 machine, 1 core, 1 HDD
○ Time: 10 TB / 50 MB/s = 200,000 s ≈ 55.56 hours
○ Amazon cost: $0.12/hour x 55.56 hours = $6.67
● 10 machines, 40 cores, 40 HDDs
○ Time: 55.56 hours / 40 ≈ 1.39 hours
○ Amazon cost: $0.48/hour x 10 machines x 1.39 hours = $6.67
⇒ The same cost but 40x faster
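The arithmetic on this slide can be reproduced with a short script. This is only a back-of-the-envelope sketch: it assumes 1 TB = 1,000,000 MB (which matches the 55.56-hour figure), takes the $0.12 and $0.48 prices as per-machine hourly rates, and ignores per-hour billing granularity. The function names are made up for illustration.

```python
# Back-of-the-envelope scan time and cost, using the slide's figures.
DATA_MB = 10 * 1_000_000   # 10 TB, assuming decimal units (1 TB = 1,000,000 MB)
READ_MB_PER_S = 50         # sequential read speed of one HDD


def scan_hours(disks):
    """Hours to read the whole dataset when `disks` HDDs read in parallel."""
    return DATA_MB / (READ_MB_PER_S * disks) / 3600


def cost_usd(machines, price_per_hour, hours):
    """Total bill, assuming a flat hourly price per machine."""
    return machines * price_per_hour * hours


single_hours = scan_hours(disks=1)
cluster_hours = scan_hours(disks=40)
single_cost = cost_usd(1, 0.12, single_hours)
cluster_cost = cost_usd(10, 0.48, cluster_hours)

print(f"1 machine:   {single_hours:.2f} h, ${single_cost:.2f}")
print(f"10 machines: {cluster_hours:.2f} h, ${cluster_cost:.2f}")
```

The cost stays the same because the larger machine costs 4x more per hour but brings 4 disks, so the price per disk-hour is unchanged.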
Question
● How do we divide data and processing between machines?
● How do we make each process read data stored on its own machine instead of fetching it from another?
● How do we replicate data and restart a process after a failure?
● Lots of task management questions...
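A toy model may help make the first three questions concrete. The sketch below splits a dataset into numbered blocks, stores each block on several machines, and then checks which blocks a machine can read locally and whether every block survives one machine failing. The round-robin placement, the replication factor of 3, and the node names are assumptions chosen for illustration; real systems such as HDFS use more elaborate placement policies.

```python
def place_blocks(num_blocks, machines, replicas=3):
    """Assign each block to `replicas` distinct machines, round-robin."""
    if replicas > len(machines):
        raise ValueError("need at least as many machines as replicas")
    return {
        block: [machines[(block + i) % len(machines)] for i in range(replicas)]
        for block in range(num_blocks)
    }


def local_blocks(placement, machine):
    """Blocks that a task running on `machine` can read from its own disk."""
    return [b for b, holders in placement.items() if machine in holders]


def survivors(placement, failed):
    """Remaining holders of each block after `failed` goes down."""
    return {b: [m for m in holders if m != failed]
            for b, holders in placement.items()}


machines = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical cluster
placement = place_blocks(8, machines)
after_failure = survivors(placement, "node-b")

print("local to node-a:", local_blocks(placement, "node-a"))
print("all blocks still readable:", all(after_failure.values()))
```

A scheduler built on this idea would send the task for a block to one of the machines that already holds it, and would rerun the task on another holder if the first machine fails.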
Approach
● Hadoop
● MongoDB
● Spark
Hadoop Approach
● Storage
○ HDFS
Hadoop Approach
● Computation
○ MapReduce
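The MapReduce model can be sketched in plain Python with a word count, the usual introductory example. This runs in a single process and only mimics the three phases (map, shuffle, reduce); it is not Hadoop code, and the function names and sample lines are made up for illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


lines = ["big data in action", "big data in Vietnam"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)
```

In a real cluster the map calls run on the machines that hold the input blocks, and the shuffle moves data over the network so that each reducer sees every value for its keys.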
MongoDB Approach
● Storage
○ Document
MongoDB Approach
● Computation
○ SQL-like queries
○ Aggregation
○ MapReduce
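To illustrate document storage and aggregation without a running database, the sketch below keeps documents as Python dicts and emulates a filter followed by a group-and-sum. The sample documents and helper names are hypothetical. In MongoDB itself the same idea would be written as an aggregation pipeline with a `$match` stage followed by a `$group` stage using `$sum`.

```python
from collections import defaultdict

# Hypothetical documents: fields may differ from one document to the next.
page_views = [
    {"source": "web", "country": "VN", "views": 120},
    {"source": "mobile", "country": "VN", "views": 80, "device": "android"},
    {"source": "web", "country": "US", "views": 50},
]


def match(docs, **criteria):
    """Keep documents whose fields equal the given values (like $match)."""
    return [d for d in docs if all(d.get(k) == v for k, v in criteria.items())]


def group_sum(docs, key, field):
    """Sum `field` per distinct value of `key` (like $group with $sum)."""
    totals = defaultdict(int)
    for d in docs:
        totals[d[key]] += d.get(field, 0)
    return dict(totals)


views_by_source = group_sum(match(page_views, country="VN"), "source", "views")
print(views_by_source)
```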
Spark Approach
● Storage
○ Resilient distributed dataset (RDD)
○ Persistence backed by HDFS / HBase...
Spark Approach
● Computation
○ Mixed
○ In-memory computing
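The idea behind in-memory computing on an RDD can be pictured with a toy class. `MiniRDD` below is a made-up, single-process stand-in, not the Spark API; it only borrows the method names `map`, `filter`, `cache` and `collect`. Transformations are recorded lazily, and a cached result is kept in memory so that later computations do not re-read the input.

```python
class MiniRDD:
    """Toy stand-in for an RDD: remembers how to compute its items."""

    def __init__(self, compute):
        self._compute = compute   # zero-argument function producing the items
        self._keep = False        # whether to keep the result in memory
        self._cached = None

    def map(self, fn):
        # Lazy: nothing runs until collect() is called.
        return MiniRDD(lambda: [fn(x) for x in self.collect()])

    def filter(self, pred):
        return MiniRDD(lambda: [x for x in self.collect() if pred(x)])

    def cache(self):
        self._keep = True
        return self

    def collect(self):
        if self._cached is not None:
            return self._cached
        result = self._compute()
        if self._keep:
            self._cached = result
        return result


stats = {"reads": 0}


def read_input():
    """Pretend to read the input from disk and count how often that happens."""
    stats["reads"] += 1
    return list(range(10))


squares = MiniRDD(read_input).map(lambda x: x * x).cache()
evens = squares.filter(lambda x: x % 2 == 0).collect()
total = sum(squares.collect())
print(evens, total, stats["reads"])
```

Because each dataset remembers how it was derived, a lost or uncached result can be recomputed from its parents instead of being restored from a copy.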
Demo
● Hadoop
○ Run a script to create an Amazon cluster
○ Play with Hadoop / HDFS / Spark
○ Process Wikipedia data
● MongoDB
○ Collect data from different sources and analyze it
Big Data in Vietnam
Big Data in Vietnam
● Why is MongoDB popular?
○ Many PHP developers prefer it
○ Simple to set up and use
○ Similar to MySQL
Big Data in Vietnam
● Hadoop is used by a few big local online companies & international startups
○ Analyze tons of data
○ Create new competitive advantage
⇒ But there is a big shortage of skilled engineers
Q & A