Working with Big Data - MATLAB EXPO · 2015-10-27
© 2015 The MathWorks, Inc.
Working with Big Data using MATLAB
Ben Tordoff
Agenda
How big is big?
Reading big data
Processing quite big data
Processing big data
Summary
How big is big?
What does “Big Data” even mean?
“Any collection of data sets so large and complex that it becomes difficult to process using … traditional data processing applications.” (Wikipedia)
“Any collection of data sets so large that it becomes difficult to process using traditional MATLAB functions, which assume all of the data is in memory.” (MATLAB)
How big is big?
Not a new problem
In 1085 William I commissioned a survey of England
– ~2 million words and figures collected over two years
– too big to handle in one piece
– collected and summarized in several pieces
– used to generate revenue (tax), but most of the data then sat unused
How big is big?
A new problem
The Large Hadron Collider was switched back on earlier this year
– ~600 million collisions per second (only a fraction get recorded)
– amounts to 30 petabytes per year
– too big to even store in one place
– used to explore interesting science, but taking researchers a long time to get through
Image courtesy of CERN. Copyright 2011 CERN.
How big is big?
Sizes of data in this talk
Most of our data lies somewhere in between
– a few MB up to a few TB
– <1 GB can typically be handled in memory on one machine (small data)
– 1-100 GB can typically be handled in the memory of many machines (quite big data)
– >100 GB typically requires processing in pieces using many machines (big data)
Reading big data
What tools are there?
[Figure: tools arranged along a scale from SMALL to BIG data]
load, imread, readtable, Import Tool, ImageAdapter
memmapfile, matfile, API, fread, System objects (streaming data), textscan, database, xlsread
datastore
Reading big data
Datastore:
Simple interface for data in multiple files/folders
Presents data a piece at a time
Access pieces in serial (desktop) or in parallel (cluster)
Back-ends for tabular text, images, databases and more
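The "piece at a time" access pattern can be sketched as follows. This is a minimal illustration, not the demo from the talk: the CSV file, its column names `x` and `y`, and the `ReadSize` of 4 rows are assumptions chosen so the example is self-contained.

```matlab
% Write a small CSV so the sketch is self-contained (hypothetical data).
csvFile = [tempname '.csv'];
T = table((1:10)', (1:10)'.^2, 'VariableNames', {'x', 'y'});
writetable(T, csvFile);

% Create a datastore over the file; a folder of files is handled the same way.
ds = datastore(csvFile);
ds.ReadSize = 4;              % rows returned by each call to read

total = 0;
numRows = 0;
while hasdata(ds)
    piece = read(ds);         % a table holding at most ReadSize rows
    total = total + sum(piece.y);
    numRows = numRows + height(piece);
end
meanY = total / numRows;      % mean of y without loading the whole file at once

delete(csvFile);              % remove the temporary file
```

Only the running totals stay in memory between reads, so the same loop applies when the files are far larger than RAM.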
Reading big data
Datastore DEMO
Processing quite big data
When the data fits in cluster memory
Using distributed arrays
– Use the memory of multiple machines as though it were your own
– Client sees a “normal” MATLAB variable
– Work happens on cluster
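A minimal sketch of this workflow, assuming Parallel Computing Toolbox is installed and a pool can be opened on the default profile (a local pool stands in for a cluster here). The matrix size is an arbitrary small value for illustration.

```matlab
% Open a parallel pool if none is running (assumes the default profile works).
if isempty(gcp('nocreate'))
    parpool;
end

N = 2000;                        % small illustrative size
A = rand(N, 'distributed');      % created directly in the workers' memory
B = rand(N, 'distributed');

C = A * B;                       % same syntax as for an in-memory matrix
s = gather(sum(sum(C)));         % bring only a scalar back to the client
```

`C` stays on the workers; `gather` is the explicit step that copies data back, so it is best applied to small results.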
Processing quite big data
Distributed array functions
Many common MATLAB functions supported (about 250)
Includes most linear algebra
Scale up your maths
Processing quite big data
Multiplication of 2 NxN matrices
>> C = A * B
Execution time (seconds):
| N | 1 node, 16 workers | 2 nodes, 32 workers | 4 nodes, 64 workers |
|---|---|---|---|
| 8000 | 19 | 13 | 11 |
| 16000 | 120 | 75 | 50 |
| 20000 | 225 | 132 | 86 |
| 25000 | - | 243 | 154 |
| 30000 | - | 406 | 248 |
| 35000 | - | - | 376 |
| 45000 | - | - | 743 |
| 50000 | - | - | - |
Processor: Intel Xeon E5-class v2, 16 cores, 60 GB RAM per compute node, 10 Gb Ethernet
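As a worked reading of the timings, the N = 20000 row (225 s, 132 s and 86 s on 1, 2 and 4 nodes) can be turned into speedup and parallel efficiency. The numbers are taken from the slide; the two derived quantities are standard definitions added here for illustration.

```matlab
% Execution times in seconds for N = 20000, copied from the timing table.
nodes = [1 2 4];
t = [225 132 86];

speedup = t(1) ./ t;              % time on one node divided by time on k nodes
efficiency = speedup ./ nodes;    % fraction of ideal linear scaling

fprintf('%d node(s): speedup %.2f, efficiency %.0f%%\n', ...
    [nodes; speedup; 100 * efficiency]);
```

On these figures, doubling the hardware gives well under double the speed, which is consistent with communication between nodes taking a growing share of the time.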
Processing quite big data
Distributed DEMO
Processing really big data
When you can never see all the data
Can never have all the data loaded
Must process small pieces of data independently
Extract (“map”) some pertinent information from each independent piece
– Typically summary statistics, example records, etc.
– No communication between pieces
Combine (“reduce”) this information to give a final (small) result
– Intermediate results from each piece must be communicated
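The map and reduce steps above can be sketched with MATLAB's `mapreduce` function. This is an illustration under stated assumptions rather than the talk's demo: the CSV data, the key names, and the helper functions `sumMapper` and `sumReducer` are hypothetical, and local functions in a script require R2016b or later (otherwise place them in separate files).

```matlab
% Self-contained input: a small CSV of hypothetical values 1..100.
csvFile = [tempname '.csv'];
writetable(table((1:100)', 'VariableNames', {'value'}), csvFile);

ds = datastore(csvFile);
ds.ReadSize = 25;                 % each map call sees one piece of 25 rows

mapreducer(0);                    % run in the local MATLAB session
outds = mapreduce(ds, @sumMapper, @sumReducer);
result = readall(outds);          % table with Key and Value columns
meanValue = result.Value{1};

delete(csvFile);

function sumMapper(data, ~, intermKVStore)
% Map: summarise one piece as [sum, count]; no communication with other pieces.
add(intermKVStore, 'sumAndCount', [sum(data.value), height(data)]);
end

function sumReducer(~, intermValIter, outKVStore)
% Reduce: combine the per-piece summaries into one small final result.
totals = [0 0];
while hasnext(intermValIter)
    totals = totals + getnext(intermValIter);
end
add(outKVStore, 'mean', totals(1) / totals(2));
end
```

Each piece contributes only two numbers, so the data communicated to the reduce step stays small however large the input grows.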
Introduction to Map-Reduce
Data: Input files → Intermediate files (local disk) → Output files
Stages: MAP → SHUFFLE / SORT → REDUCE
Introduction to Map-Reduce
Example: National popularity contest
– Input files: newspaper pages
– For each page, how many times do “David”, “Nicola” and “Jeremy” get mentioned?
– Total mentions
– Relative popularity: Nicola 9%, David 53%, Jeremy 38%
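The popularity contest can be mimicked in a few lines of plain MATLAB. The page text below is invented for illustration, so the resulting percentages are not the ones on the slide; the point is only the shape of the computation (count per page, then total, then normalise).

```matlab
% Hypothetical page text standing in for the newspaper pages.
pages = { ...
    'David met Nicola. David spoke first.', ...
    'Jeremy replied to David.', ...
    'Nicola and Jeremy disagreed; Jeremy left.'};
names = {'David', 'Nicola', 'Jeremy'};

% Map: for each page independently, count the mentions of each name.
countNames = @(page) cellfun(@(n) numel(strfind(page, n)), names);
perPage = cellfun(countNames, pages, 'UniformOutput', false);

% Reduce: add the per-page counts, then convert to relative popularity.
totalMentions = sum(vertcat(perPage{:}), 1);
relativePopularity = 100 * totalMentions / sum(totalMentions);
```

Because each page is counted without reference to any other, the map step could run on many machines at once; only the three-element count vectors need to be brought together.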
Processing medium data
Map-Reduce DEMO
MATLAB with Hadoop
[Diagram: MATLAB with a Datastore; Hadoop cluster with HDFS and three nodes, each holding Data]
Datastore accesses data stored in HDFS from MATLAB
MATLAB Distributed Computing Server with Hadoop
[Diagram: Datastore, map.m and reduce.m; MATLAB Distributed Computing Server; Hadoop cluster with HDFS and three nodes, each with Data, Map and Reduce]
Summary
Reading data:
1. When data gets big you need to work on pieces
2. Use datastore to read pieces from a large data-set
Processing data:
1. If it fits in memory, use MATLAB
2. If it fits in cluster memory, use distributed arrays
3. If you need to scale beyond cluster memory, use map-reduce
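The processing advice can be written as a small decision helper. The 1 GB and 100 GB thresholds are the rough size bands quoted earlier in the talk, not hard limits, and `chooseApproach` is a hypothetical function introduced only for this sketch (local functions in a script require R2016b or later).

```matlab
% Try the helper on three illustrative data sizes (in GB).
sizesGB = [0.5 20 500];
for k = 1:numel(sizesGB)
    fprintf('%6.1f GB -> %s\n', sizesGB(k), chooseApproach(sizesGB(k)));
end

function approach = chooseApproach(sizeGB)
% chooseApproach is a hypothetical helper, not a MATLAB function.
% Thresholds follow the talk's "small", "quite big" and "big" bands.
if sizeGB < 1
    approach = 'in-memory MATLAB';
elseif sizeGB <= 100
    approach = 'distributed arrays';
else
    approach = 'map-reduce';
end
end
```

In practice the boundaries depend on the RAM of each machine and the size of the cluster, so the thresholds would be adjusted to the hardware at hand.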