Social and Economic Challenges of Open Data and Big Data
Transcript of "Social and Economic Challenges of Open Data and Big Data"
![Page 1: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/1.jpg)
www.bsc.es
Social and Economic Challenges of Open Data and Big Data
Prof. Mateo Valero
BSC Director
Madrid, Dec. 11th, 2014
![Page 2: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/2.jpg)
Social and Economic Challenges of Open Data and Big Data
Organized by: Department of Libraries and Documentation, Directorate of Culture, Instituto Cervantes
Prof. Mateo Valero, Director
Barcelona Supercomputing Center
![Page 3: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/3.jpg)
Outline
The Big Data era: data generation explosion
Storing data
Processing data
Where do we place data?
How do we use data?
Research at BSC
3Prof. Mateo Valero – Big Data
![Page 4: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/4.jpg)
How does science advance today?
EXPENSIVE, DANGEROUS, IMPOSSIBLE
Simulation
Simulation = computing the formulas of the theory
Theory
Experimentation
![Page 5: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/5.jpg)
5
Higgs and Englert's Nobel Prize in Physics, 2013
Last year one of the most computer-intensive scientific experiments ever undertaken confirmed Peter Higgs and François Englert's theory by producing the Higgs boson, the so-called "God particle", in an $8bn atom smasher, the Large Hadron Collider at CERN outside Geneva.
"The LHC produces 600 TB/sec and, after filtering, needs to store 25 PB/year" … 15 million sensors.
![Page 6: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/6.jpg)
Big Data in Biology
High resolution imaging
Clinical records
Simulations
Omics
![Page 7: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/7.jpg)
Life Sciences and Health: personalized medicine
Atom
Small Molecule
Macromolecule
Cell
Tissue
Organ
Population
![Page 8: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/8.jpg)
8
Sequencing Costs
Source: National Human Genome Research Institute (NHGRI), http://www.genome.gov/sequencingcosts/
(1) "Cost per Megabase of DNA Sequence": the cost of determining one megabase (Mb; a million bases) of DNA sequence of a specified quality.
(2) "Cost per Genome": the cost of sequencing a human-sized genome. For each, a graph is provided showing the data since 2001.
In both graphs, the data from 2001 through October 2007 represent the costs of generating DNA sequence using Sanger-based chemistries and capillary-based instruments ('first generation' sequencing platforms). Beginning in January 2008, the data represent the costs of generating DNA sequence using 'second-generation' (or 'next-generation') sequencing platforms. The change in instruments represents the rapid evolution of DNA sequencing technologies that has occurred in recent years.
![Page 9: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/9.jpg)
9
World-class genomic consortia
• The European Genome-phenome Archive (EGA) repository will allow us to explore datasets from numerous genetic studies.
• Pan-Cancer will produce rich data that will provide a major opportunity to develop an integrated picture of differences across tumor lineages.
![Page 10: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/10.jpg)
ICGC-TCGA PanCancer Project
Identification and classification of genome and expression variation
Different tumor types (2,000 patients, 4,000 genomes)
Participants: European Bioinformatics Institute; University of Chicago; Electronics and Telecom. Res. Inst. (Korea); IMSUT; RIKEN; Barcelona Supercomputing Center; Broad Institute MIT; Ontario Institute for Cancer Research
Largest worldwide initiative in biomedical genomics. Computing protocols and resources are a limiting factor.
![Page 11: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/11.jpg)
Square Kilometer Array (SKA)
![Page 12: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/12.jpg)
Square Kilometer Array (SKA)
Tbps → Gbps
![Page 13: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/13.jpg)
Pervasive Connectivity
Explosion of Information
In 60 seconds today: 400,710 ad requests; 2,000 lyrics played on TuneWiki; 1,500 pings sent on PingMe; 208,333 minutes of Angry Birds played; 23,148 apps downloaded; 98,000 tweets.
Smart device expansion: 30 billion connected devices (1) and 40 trillion GB of data for 8 billion people (2) (3) by 2020; 10 million mobile apps (4).
(1) IDC Directions 2013: Why the Datacenter of the Future Will Leverage a Converged Infrastructure, March 2013, Matt Eastwood; (2) & (3) IDC Predictions 2012: Competing for 2020, Document 231720, December 2011, Frank Gens; (4) http://en.wikipedia.org
A New Era of Information Technology
Current infrastructure sagging under its own weight
Internet of Things
![Page 14: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/14.jpg)
The Data Deluge
2005: 0.1 ZB → 2010: 1.2 ZB → 2012: 2.8 ZB → 2015: 8.5 ZB → 2020: 40 ZB*
(the 2020 figure exceeds prior forecasts by 5 ZB)
* Source: IDC
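The figures above imply a remarkably steady growth rate. A quick back-of-the-envelope check (the function and variable names are mine, for illustration):

```python
# Implied compound annual growth rate (CAGR) of the digital universe,
# using the IDC figures quoted on this slide (values in zettabytes).
idc_forecast_zb = {2005: 0.1, 2010: 1.2, 2012: 2.8, 2015: 8.5, 2020: 40.0}

def cagr(v0, v1, years):
    """Compound annual growth rate between two data points."""
    return (v1 / v0) ** (1.0 / years) - 1.0

growth = cagr(idc_forecast_zb[2005], idc_forecast_zb[2020], 2020 - 2005)
print(f"2005-2020 CAGR: {growth:.0%}")               # roughly 49% per year
print(f"doubling time: {0.693 / growth:.1f} years")  # under two years
```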
![Page 15: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/15.jpg)
How big is big?
Megabyte = 10^6 bytes
Gigabyte = 10^9 bytes
Terabyte = 10^12 bytes: 500 TB of new data per day are ingested into Facebook's databases
Petabyte = 10^15 bytes: the CERN Large Hadron Collider generates 1 PB per second
Exabyte = 10^18 bytes: 1 EB of data is created on the internet each day, equal to 250 million DVDs' worth of information; the proposed Square Kilometer Array telescope will generate an EB of data per day
Zettabyte = 10^21 bytes: 1.3 ZB of network traffic by 2016
Yottabyte = 10^24 bytes: this is our digital universe today, equal to 250 trillion DVDs
Brontobyte = 10^27 bytes: this will be our digital universe tomorrow…
Geopbyte = 10^30 bytes: this will take us beyond our decimal system (Saganbyte, Jotabyte, …)
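The DVD equivalences on this scale are easy to sanity-check; the 250 million and 250 trillion figures correspond to a rounded 4 GB per DVD (the names below are mine, for illustration):

```python
# Check the DVD equivalences, assuming the rounded 4 GB per DVD that the
# slide's own figures imply, and decimal SI prefixes.
PREFIX = {"GB": 10**9, "EB": 10**18, "YB": 10**24}
DVD = 4 * PREFIX["GB"]  # rounded single-layer DVD capacity, in bytes

print(f"1 EB = {PREFIX['EB'] / DVD:,.0f} DVDs")  # 250,000,000
print(f"1 YB = {PREFIX['YB'] / DVD:,.0f} DVDs")  # 250,000,000,000,000
```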
![Page 16: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/16.jpg)
The data explosion is transforming science
Every area of science is now engaged in data-intensive research. Researchers need:
– Technology to publish and share data in the cloud
– Data analytics tools to explore massive data collections
– A sustainable economic model for scientific analysis, collaboration and data curation
![Page 17: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/17.jpg)
Outline
The Big Data era: data generation explosion
Storing data
Processing data
Where do we place data?
How do we use data?
Research at BSC
![Page 18: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/18.jpg)
Magnetic Tape Memory
Invented by Eckert & Mauchly for the UNIVAC I, March 21, 1951
– Model: UNISERVO
– 224 KB of data
– 1/2-inch wide, 1,200 feet long, 128 characters per inch
– Speed: 100 inches per second, equivalent to 12,800 characters per second
StorageTek, 2013, T10000D:
– 8.5 terabytes (a ~40,000,000× increase)
– Bandwidth: 250 MB/second (a ~20,000× increase)
– Load time: 10 seconds
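The improvement factors in parentheses can be verified from the raw figures on the slide (both factors are rounded up there; a rough check):

```python
# Verify the tape improvement factors: 1951 UNISERVO vs. 2013 StorageTek
# T10000D, using the figures quoted on the slide.
uniservo_capacity = 224 * 10**3    # bytes (224 KB)
uniservo_rate     = 12_800         # characters/second
t10000d_capacity  = 8.5 * 10**12   # bytes (8.5 TB)
t10000d_rate      = 250 * 10**6    # bytes/second

print(f"capacity:  x{t10000d_capacity / uniservo_capacity:,.0f}")  # ~38 million
print(f"bandwidth: x{t10000d_rate / uniservo_rate:,.0f}")          # ~19,500
```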
![Page 19: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/19.jpg)
First Disk, IBM, 1956
1956, IBM 305 RAMAC
4 MB, 50 disks of 24", 1,200 rpm, 100 bits/track
Intertrack spacing: 0.1 inches; density: 1,000 bits/in2
100 ms access time, vacuum tubes, $35k/year rent
Year 2013: 4 terabytes (a 1,000,000× increase)
Average access time: a few milliseconds (a 40-to-1 improvement)
Areal density: doubling on average every 2 to 4 years, but not now
Predicted: 14 terabytes in 2020 at a cost of $40
HDD: Hard Disk Drive
![Page 20: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/20.jpg)
Tapes advancing fast
May 2014
IBM and Fujifilm have demoed a 154 TB LTO-size tape cartridge which could come to market in 10 years' time.
The Sony development involved a 148 Gbit/in2 tape density and its own tape design to achieve a 185 TB uncompressed capacity.
148 Gbit/in2 vs. 3.1 Gbit/in2 (T10000C)
http://www.insic.org/news/2012Roadmap/12index.html
![Page 21: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/21.jpg)
Seagate Step up to SAS; SAS Roadmap. Source: SCSI Trade Association
How We Increase 10x and Beyond…
Bandwidth and Connectivity
Capacity Drive Technologies
Example roadmap: 1 TB (2010) → 8 TB (2015) → 20 TB (~2020)
Example features for capacity driven usage
![Page 22: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/22.jpg)
Source: Seagate Storage Effect web site
Areal Density
– Shingled Magnetic Recording (SMR): projected 25% areal density increase; enables 10 TB drives and above
– Heat-Assisted Magnetic Recording (HAMR): projected 55% areal density increase; theoretical 30 TB to 60 TB hard drives
– Future candidate(s), single-molecule magnets: envisioned 1,000× areal density increase; theoretical up to 3,000 TB hard drives
![Page 23: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/23.jpg)
Capability computing: research directions
Can we "trade" memory capacity for other metrics of interest, e.g. bandwidth?
Packaging
– 2.5D stacking
– 3D stacking
Technologies
– Wide I/O
– Hybrid Memory Cube
![Page 24: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/24.jpg)
Emerging (non-volatile) memories
(DRAM, SRAM: traditional technologies; NOR, NAND: improved Flash; FeRAM through NEMS: emerging technologies)

| | DRAM | SRAM | NOR | NAND | FeRAM | MRAM | PCM | Memristor | NEMS |
|---|---|---|---|---|---|---|---|---|---|
| Cell elements | 1T1C | 6T | 1T | 1T | 1T1C | 1T1R | 1T1R | 1M | 1T1N |
| Half pitch (F) (nm) | 50 | 65 | 90 | 90 | 180 | 130 | 65 | 3–10 | 10 |
| Smallest cell area (F2) | 6 | 140 | 10 | 5 | 22 | 45 | 16 | 4 | 36 |
| Read time (ns) | < 1 | < 0.3 | < 10 | < 50 | < 45 | < 20 | < 60 | < 50 | 0 |
| Write/erase time (ns) | < 0.5 | < 0.3 | 10^5 | 10^6 | 10 | 20 | 60 | < 250 | 1 (140 ps–5 ns) |
| Retention time | seconds | N/A | > 10 y | > 10 y | > 10 y | > 10 y | > 10 y | > 10 y | > 10 y |
| Write op. voltage (V) | 2.5 | 1 | 12 | 15 | 0.9–3.3 | 1.5 | 3 | < 3 | < 1 |
| Read op. voltage (V) | 1.8 | 1 | 2 | 2 | 0.9–3.3 | 1.5 | 3 | < 3 | < 1 |
| Write endurance | 10^16 | 10^16 | 10^5 | 10^5 | 10^14 | 10^16 | 10^9 | 10^15 | 10^11 |
| Write energy (fJ/bit) | 5 | 0.7 | 10 | 10 | 30 | 1.5×10^5 | 6×10^3 | < 50 | < 0.7 |
| Density (Gbit/cm2) | 6.67 | 0.17 | 1.23 | 2.47 | 0.14 | 0.13 | 1.48 | 250 | 48 |

Scaling outlook: the traditional technologies range from fairly scalable (DRAM) to highly scalable (SRAM), improved Flash faces major technological barriers and poor voltage scaling, while the emerging technologies look promising.
Sources: Chong et al. ICCAD09, Eshraghian TVLSI11, ITRS13
![Page 25: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/25.jpg)
Outline
The Big Data era: data generation explosion
Storing data
Processing data
Where do we place data?
How do we use data?
Research at BSC
![Page 26: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/26.jpg)
26
The MultiCore Era
Moore’s Law + Memory Wall + Power Wall
Chip MultiProcessors (CMPs)
UltraSPARC T2 (2007)
Intel Xeon 7100 (2006)
POWER4 (2001)
![Page 27: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/27.jpg)
27
Fujitsu SPARC64 XIfx
32 compute cores (single-threaded) + 2 assistant cores
24 MB L2 sector cache
256-bit wide SIMD
20 nm, 3.75 billion transistors
2.2 GHz frequency
1.1 TFLOPS peak performance
High-bandwidth interconnects:
– HMC (240 GB/s ×2, in/out)
– Tofu2 (125 GB/s ×2, in/out)
![Page 28: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/28.jpg)
Intel Xeon Phi, or Intel Many Integrated Core Architecture (MIC)
Knights Corner (2011)
– Coprocessor, 61 x86 cores, 22 nm, 512-bit vector units, 4 HTs per core
– 1.2 TFLOPS (DP), 300 W TDP, 4 GFLOPS/W
– 512 KB/core coherent L2
– Internal network: ring
– Memory BW: 352 GB/s
Knights Landing (expected 2015)
– Coprocessor or host processor
– 72 Atom cores, 14 nm, AVX-512 per core, 4 HTs
– Up to 16 GB of DRAM 3D-stacked on-package, 384 GB DDR
– 3 TFLOPS (DP), 200 W TDP, 15 GFLOPS/W
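The GFLOPS/W figures quoted for the two generations are simply peak performance divided by TDP, which is easy to check (the helper function is mine, for illustration):

```python
# Energy efficiency implied by the slide's peak-performance and TDP figures.
def gflops_per_watt(peak_tflops, tdp_watts):
    """Peak GFLOPS divided by thermal design power."""
    return peak_tflops * 1000 / tdp_watts

knc = gflops_per_watt(1.2, 300)  # Knights Corner
knl = gflops_per_watt(3.0, 200)  # Knights Landing (expected)
print(f"KNC: {knc:.0f} GFLOPS/W, KNL: {knl:.0f} GFLOPS/W")  # 4 and 15
```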
![Page 29: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/29.jpg)
Accelerators: NVIDIA Kepler GK110 GPU (2014)
DP performance: 1.43 TFLOPS
Memory BW (ECC off): 288 GB/s
Memory size (GDDR5): 12 GB
15 SMX units, each with:
– 192 single-precision CUDA cores
– 64 double-precision units
– 32 special function units
– 32 load/store units
Six 64-bit memory controllers
![Page 30: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/30.jpg)
Evolution of the computing power of supercomputers
FLOP/second (operations on 64-bit real numbers)
1988: Cray Y-MP (8 processors)
1998: Cray T3E (1,024 processors)
2008: Cray XT5 (150,000 processors)
~2018?: (1×10^7 processors)
![Page 31: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/31.jpg)
The "Formula 1" of supercomputers today
FLOP/second (operations on 64-bit real numbers)
#1 (55 PF): Tianhe-2 @ National University of Defense Technology, 54.9 PFLOPS
#1 EU (5 PF): JUQUEEN @ Forschungszentrum Jülich
#1 Spain (1 PF): MareNostrum 3 @ BSC
Samsung Exynos > 50 GFLOPS: MontBlanc prototypes @ BSC
![Page 32: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/32.jpg)
NVIDIA: Node for the Exaflop Computer
Thanks to Bill Dally
![Page 33: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/33.jpg)
Graph 500 Project
HPC benchmarks and performance metrics do not provide useful information on the suitability of supercomputing systems for data-intensive applications.
Graph 500 establishes a set of large-scale benchmarks for these applications.
Steering committee: 50 international HPC experts from academia and industry.
Graph 500 is developing benchmarks to address three application kernels:
– Concurrent search
– Optimization (single-source shortest path)
– Edge-oriented (maximal independent set)
It also addresses five graph-related business areas:
– Cybersecurity
– Medical informatics
– Data enrichment
– Social networks
– Symbolic networks
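The "concurrent search" kernel is a breadth-first search over a large graph; the real benchmark runs it on huge Kronecker-generated graphs and reports traversed edges per second (TEPS). A minimal level-synchronous sketch of the kernel's core on a toy graph (illustrative only, not the official reference code):

```python
# Breadth-first search producing the parent tree that Graph 500 validates.
from collections import deque

def bfs_parents(adj, root):
    """Return the BFS parent tree and the number of edges traversed."""
    parent = {root: root}
    frontier = deque([root])
    edges_traversed = 0
    while frontier:
        v = frontier.popleft()
        for w in adj[v]:
            edges_traversed += 1
            if w not in parent:     # first visit: record parent, extend frontier
                parent[w] = v
                frontier.append(w)
    return parent, edges_traversed

# toy undirected graph: a square 0-1-3-2-0 plus the diagonal 0-3
adj = {0: [1, 2, 3], 1: [0, 3], 2: [0, 3], 3: [0, 1, 2]}
parent, edges = bfs_parents(adj, 0)
print(parent)  # {0: 0, 1: 0, 2: 0, 3: 0}
print(edges)   # 10 edge traversals; TEPS = edges / elapsed time
```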
![Page 34: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/34.jpg)
Graph500 vs. Top500
Top500 defined a benchmark (Linpack) to rate HPC machines on performance. This benchmark is not suitable to address the characteristics of Big Data applications.
Linpack:
– computation-bounded
– focused on floating-point operations
– Bulk-Synchronous-Parallel model: behavior based on big computation and short communication bursts
– dense, highly organized and coalesced data structures (spatial locality)
Graph500:
– communication-bounded
– focused on integer operations
– asynchronous, spatially uniform communication interleaved with computation
– larger sparse datasets (very low spatial and temporal locality)
Green Graph 500 list: collects performance-per-watt metrics to compare the energy consumption of data-intensive computing workloads.
Graph500: graph500.org; Top500: top500.org
![Page 35: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/35.jpg)
35
Accelerators: NVIDIA Pascal (expected 2016)
Aimed at fixing the data-movement bottleneck
Based on NVLink:
– a chip-to-chip communication approach
– comprised of bidirectional 8-lane links
– providing between 80 and 200 GB/s of bandwidth
This approach is expected to provide 4× speedups with respect to current GPU-based designs.
![Page 36: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/36.jpg)
36
Specialization is Everything (?)
ASICs, FPGAs
Source: Bob Brodersen, Berkeley Wireless group
![Page 37: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/37.jpg)
Outline
The Big Data era: data generation explosion
Storing data
Processing data
Where do we place data?
How do we use data?
Research at BSC
![Page 38: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/38.jpg)
38
BioInformatics, Big Data and Supercomputing
1 PB of compressed data
100,000 subjects suffering from different diseases
![Page 39: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/39.jpg)
If we ever had a 1 PB disk (100 MB/s)
Scanning 1 petabyte takes about 3,000 hours, roughly 125 days.
What is the time required to retrieve information?
![Page 40: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/40.jpg)
What if we want to process the data in 1 hour?
Supercomputing is about doing things FAST…
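The arithmetic behind the two previous slides, and the reason parallel I/O matters: scan time falls linearly with the number of disks streaming in parallel (the slides round the single-disk figure up to 3,000 hours / 125 days):

```python
# Time to scan 1 PB at 100 MB/s on one disk, and the parallelism needed
# to finish in one hour.
PB = 10**15
DISK_RATE = 100 * 10**6          # bytes/second for one disk

seconds = PB / DISK_RATE         # sequential scan on a single disk
print(f"one disk: {seconds / 3600:,.0f} hours = {seconds / 86400:.0f} days")

# time scales inversely with disk count, so finishing in 1 hour needs
# as many disks as there are hours of work
disks_needed = seconds / 3600
print(f"disks for a 1-hour scan: {disks_needed:,.0f}")
```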
![Page 41: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/41.jpg)
Evolution of storage architecture for Big Data
[Figure: a fat-tree InfiniBand network. 36-port FDR10 leaf switches, each serving 560 nodes, connect with 2 or 3 links to each of six Mellanox 648-port IB core switches. Latency: 0.7 µs; bandwidth: 40 Gb/s. The compute network (40 Gbps node adapters) serves the compute nodes; the storage network (1 Gbps node adapters) serves the storage racks holding 1 PB: about 11 hrs to move 1 PB.]
![Page 42: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/42.jpg)
Solid State Hybrid Drive (SSHD) Technology
SSHD: capacity, performance, and value
HDD large capacity + SSD high speed
Adaptive Memory™ technology:
• Self-learning software algorithms make the HDD and SLC NAND flash work together
• Enables SSD-like performance when accessing the most frequently used files
• Reduces the workload on, and increases the reliability of, the SLC NAND flash
Reduces costs: it is now practical to employ enterprise-class SLC NAND flash
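As an illustration of the idea only (not Seagate's proprietary Adaptive Memory algorithm), a toy model that promotes the most frequently accessed blocks to a small flash cache in front of the HDD:

```python
# Toy hot-data promotion: serve the N most frequently read blocks from
# flash, everything else from the HDD. Purely illustrative.
from collections import Counter

class HybridDrive:
    def __init__(self, flash_blocks):
        self.flash_blocks = flash_blocks   # cache capacity, in blocks
        self.hits = Counter()              # access frequency per block
        self.flash = set()                 # blocks currently promoted

    def read(self, block):
        self.hits[block] += 1
        # keep the current top-N hottest blocks in flash
        self.flash = {b for b, _ in self.hits.most_common(self.flash_blocks)}
        return "flash" if block in self.flash else "hdd"

drive = HybridDrive(flash_blocks=2)
for b in [1, 1, 1, 2, 2, 3]:
    drive.read(b)
print(drive.read(1))  # the hottest block is now served from flash
```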
![Page 43: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/43.jpg)
Compute and Communication Energies
43
Source: Processors and sockets: What's next? Greg Astfalk / Salishan Conference / April 25, 2013
![Page 44: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/44.jpg)
44
Paradigm shift
Old: compute-centric model
New: data-centric model
– Massive parallelism
– Persistent memory
– Flash / storage-class memory
– Manycore accelerators
Source: Heiko Joerg http://www.slideshare.net/schihei/petascale-analytics-the-world-of-big-data-requires-big-analytics
![Page 45: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/45.jpg)
Optimized System Design for Data-Centric Deep Computing
45
Source: IBM Corporation, 2013
![Page 46: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/46.jpg)
Solutions for Supercomputing
© 2013 IBM Corporation. Gene Active Storage
[Figure: per-node and aggregate all-to-all bandwidth (GB/s) versus node count, for 3D-, 4D- and 5D-torus networks and a full-bisection network.]
![Page 47: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/47.jpg)
47
HW-SW-Network co-design: Data Appliances
![Page 48: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/48.jpg)
48
Microsoft Catapult
Microsoft is planning to replace traditional CPUs in data centers with field-programmable gate arrays (FPGAs), processors that Microsoft can modify specifically for use with its own software. These FPGAs are already available in the market and Microsoft is sourcing them from a company called Altera. The FPGAs are 40 times faster than a CPU at processing Bing's custom algorithms.
Doug Burger from Microsoft Research talks about Project Catapult, which will make Bing twice as fast (Jun 17, 2014): http://microsoft-news.com/doug-burger-from-microsoft-research-talks-about-project-catapult-which-will-make-bing-twice-as-fast-video/
![Page 49: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/49.jpg)
49
HP unveils "The Machine"
June 11, 2014
It uses clusters of special-purpose cores, rather than a few generalized cores; photonics link everything instead of slow, energy-hungry copper wires; memristors give it unified memory that's as fast as RAM yet stores data permanently, like a flash drive.
A Machine server could address 160 petabytes of data in 250 nanoseconds; HP says its hardware should be about six times more powerful than an existing server, even as it consumes 80 times less energy. Ditching older technology like copper also encourages non-traditional, three-dimensional computing shapes, since you're not bound by the usual distance limits.
Source: http://www.engadget.com/2014/06/11/hp-the-machine/
![Page 50: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/50.jpg)
Outline
The Big Data era: data generation explosion
Storing data
Processing data
Where do we place data?
How do we use data?
Research at BSC
![Page 51: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/51.jpg)
Relational databases are sometimes not good for scale-out
[Figure: five data items (1–5) spread across five nodes, each node holding a three-item subset.]
NoSQL for non-structured data, eventual consistency
![Page 52: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/52.jpg)
Programming model
To meet the challenges: MapReduce
– A programming model introduced by Google in the early 2000s to support distributed computing (with special emphasis on fault tolerance)
– An ecosystem of big data processing tools: open source, distributed, and running on commodity hardware
– The key innovation of MapReduce is the ability to take a query over a data set, divide it, and run it in parallel over many nodes
– Two phases: a map phase and a reduce phase (input data flows through the mappers, and the reducers produce the output data)
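The two phases can be shown with the canonical word-count example, written here in plain Python rather than against a real MapReduce framework: map emits (key, value) pairs, the framework groups them by key, and reduce aggregates each group.

```python
# Word count in the MapReduce style, with the shuffle/sort step done by
# sorting and grouping the mapper output by key.
from itertools import groupby

def map_phase(document):
    for word in document.split():
        yield (word, 1)            # emit one (key, value) pair per word

def reduce_phase(word, counts):
    return (word, sum(counts))     # aggregate all values for one key

documents = ["big data needs big machines", "open data"]

# shuffle/sort: collect and group all mapper output by key
pairs = sorted(kv for doc in documents for kv in map_phase(doc))
result = dict(reduce_phase(k, (v for _, v in g))
              for k, g in groupby(pairs, key=lambda kv: kv[0]))
print(result)  # {'big': 2, 'data': 2, 'machines': 1, 'needs': 1, 'open': 1}
```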
![Page 53: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/53.jpg)
Google progressively dropping MapReduce
2004: Google introduces MapReduce.
2006: The Apache Hadoop project brings an open-source MapReduce implementation, built by the team of the Apache Nutch crawler with the support of Yahoo!
2010: Google announces Pregel, used for building incremental reverse indexes instead of MapReduce. Philosophy: "think like a vertex" (event-driven).
June 25, 2014: Google announces Cloud Dataflow: write code once, run it in batch or stream mode. Cloud Dataflow is a managed service for creating data pipelines that ingest, transform and analyze data in both batch and streaming modes.
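The "think like a vertex" idea can be sketched in a few lines: computation proceeds in supersteps, and in each superstep every vertex processes its incoming messages, updates its state, and messages its neighbours. This toy single-source shortest path with unit edge weights only illustrates the model; it is not Google's Pregel API:

```python
# Vertex-centric single-source shortest path in the Pregel style.
def pregel_sssp(adj, source, max_supersteps=10):
    INF = float("inf")
    dist = {v: INF for v in adj}
    inbox = {v: [] for v in adj}
    inbox[source] = [0]                     # the source starts at distance 0
    for _ in range(max_supersteps):
        outbox = {v: [] for v in adj}
        active = False
        for v in adj:                       # every vertex runs the same program
            if inbox[v] and min(inbox[v]) < dist[v]:
                dist[v] = min(inbox[v])
                for w in adj[v]:            # tell neighbours about the new path
                    outbox[w].append(dist[v] + 1)
                active = True
        inbox = outbox                      # messages are delivered next superstep
        if not active:                      # all vertices voted to halt
            break
    return dist

adj = {"a": ["b"], "b": ["c"], "c": ["a"], "d": []}
print(pregel_sssp(adj, "a"))  # {'a': 0, 'b': 1, 'c': 2, 'd': inf}
```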
![Page 54: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/54.jpg)
Deep Learning: generating high-level abstractions
Toronto Face Database (TFD): automatic recognition of facial expressions
Fig. 15. Expression recognition accuracy on the TFD dataset when both training and test labeled images are subject to 7 types of occlusion.
Ranzato, M., Mnih, V., Susskind, J. and Hinton, G. E. Modeling Natural Images Using Gated MRFs. IEEE Trans. Pattern Analysis and Machine Intelligence, to appear.
![Page 55: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/55.jpg)
55
Watson: Calculating vs. Learning
Source: Courtesy of IBM
![Page 56: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/56.jpg)
Outline
The Big Data era: data generation explosion
Storing data
Processing data
Where do we place data?
How do we use data?
Research at BSC
![Page 57: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/57.jpg)
57
Barcelona Supercomputing Center – Centro Nacional de Supercomputación
BSC-CNS objectives:
– R&D in Computer, Life, Earth and Engineering Sciences
– Supercomputing services and support to Spanish and European researchers
BSC-CNS is a consortium that includes:
– Spanish Government: 51%
– Catalan Government: 37%
– Universitat Politècnica de Catalunya (UPC): 12%
400+ people from 45 countries
![Page 58: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/58.jpg)
58
Mission of BSC Scientific Departments
EARTH SCIENCES: to develop and implement global and regional state-of-the-art models for short-term air quality forecast and long-term climate applications.
LIFE SCIENCES: to understand living organisms by means of theoretical and computational methods (molecular modeling, genomics, proteomics).
CASE: to develop scientific and engineering software to efficiently exploit supercomputing capabilities (biomedical, geophysics, atmospheric, energy, social and economic simulations).
COMPUTER SCIENCES: to influence the way machines are built, programmed and used: programming models, performance tools, Big Data, computer architecture, energy efficiency.
![Page 59: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/59.jpg)
MareNostrum 3
59
48,896 Intel Sandy Bridge cores at 2.6 GHz; 84 Intel Xeon Phi
Peak performance of 1.1 petaflops; 100.8 TB of main memory
2 PB of disk storage; 8.5 PB of archive storage
9th in Europe, 29th in the world (June 2013 Top500 List)
![Page 60: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/60.jpg)
60
Abstraction of computer middleware
[Figure: a layered middleware stack, one example per layer, e.g. NoSQL DBs, MapReduce, cloud, machine learning, Linux, FPGAs/GPUs, supporting applications such as genomics.]
![Page 61: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/61.jpg)
61
Ongoing Big Data projects - I
Spark: BSC is visible in the Spark community, leading the Barcelona Spark Meetup, in contact with Databricks (UC Berkeley), and turning MareNostrum into a Big Data platform.
BigIoT: an ultra-large vision of the Internet of Things (IoT). Website of IoT Europe, http://ioteurope.bsc.es/, a site where you can find research and companies working on IoT.
Optical networks & memories: new architectures developed using optical interconnections and optical memory.
Roadmap: bring together the key European hardware, networking, and system architects with the key producers and consumers of Big Data to identify the industry coordination points that will maximize European competitiveness.
DB analytics acceleration: focus on automatic scaling of complex analytics, while addressing the full requirements of real data sets.
![Page 62: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/62.jpg)
62
Ongoing Big Data projects – II
Big Data Application Monitoring: Design and implement a set of tools to monitor and trace network traffic for Hadoop applications
Software Defined Environments: Building cloud environments that embrace hardware and network heterogeneity to host a variety of workloads (JSA-SDN)
Cloud Recommendation: Enhance a decision support system providing guidance in the selection of the right set of Clouds, analysing the trade-offs between cost, reliability, risks and quality impacts
Hi-EST: Holistic Integration of Emerging Supercomputing Technologies. ERC Starting Grant, David Carrera
Fog Computing: Fog extends the cloud by implementing a distributed platform capable of processing data close to the end-points, making it an excellent platform for the Internet of Things (IoT). Pioneered by Mario Nemirovsky
![Page 63: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/63.jpg)
63
Ongoing Big Data projects - III
High Performance Key/Value Stores: Explore the use of high-performance key/value databases for fast persistent memory technologies
Management of Data Streaming Environments: Explore novel architectures for the emerging IoT stream processing platforms that provide data stream composition, transformation and filtering in real time
Extending PyCOMPSs to NoSQL DB: Integration of the COMPSs runtime with Cassandra, and exploration of scheduling policies driven by data locality
NoSQL Data Management Research: Design and implement a software layer that enables NoSQL databases to decouple data organization from the data model, and provides NoSQL databases with efficient multi-level indexing support
Improving Cost-effectiveness of Big Data Deployments: Develop mechanisms for an automated characterization of the cost-effectiveness of Hadoop deployments (runtime performance vs. software and hardware configuration choices)
Secured: An innovative architecture to achieve protection from Internet threats by offloading the execution of security applications into a programmable device at the edge of the network
![Page 64: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/64.jpg)
64
Ongoing Big Data projects - IV
Large-scale graph processing for RT: Use of probabilistic Bayesian graph algorithms for real-time applications, and their efficient parallel computation
Deep Learning: Enhance new deep neural network algorithms, such as autoencoders and Boltzmann machines, to create generative models
Multimedia Big Data Computing: Visual concept detection & annotation; scalable indexing of image and video; multimodal Big Data analytics
Master Thesis
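The autoencoder idea mentioned above – a network trained to reproduce its input through a narrower hidden layer – can be sketched in a few lines. This is a minimal illustration on toy data (all sizes and names are our own, not BSC code):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((200, 8))                 # toy data: 200 samples, 8 features

n_in, n_hidden = 8, 3                    # compress 8 features down to 3
W1 = rng.normal(0.0, 0.1, (n_in, n_hidden))
W2 = rng.normal(0.0, 0.1, (n_hidden, n_in))
lr = 0.05

def forward(X):
    H = np.tanh(X @ W1)                  # encoder
    return H, H @ W2                     # linear decoder (reconstruction)

loss_before = np.mean((forward(X)[1] - X) ** 2)
for _ in range(500):                     # plain gradient descent on MSE
    H, Xhat = forward(X)
    err = (Xhat - X) / len(X)
    gW2 = H.T @ err                      # gradient w.r.t. decoder weights
    gW1 = X.T @ ((err @ W2.T) * (1.0 - H ** 2))  # backprop through tanh
    W1 -= lr * gW1
    W2 -= lr * gW2
loss_after = np.mean((forward(X)[1] - X) ** 2)
print(f"reconstruction MSE: {loss_before:.4f} -> {loss_after:.4f}")
```

Training drives the reconstruction error down, forcing the 3-unit hidden layer to learn a compressed representation of the 8 input features.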
![Page 65: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/65.jpg)
65
Ongoing Big Data projects - V
Large-scale genome analysis: Use of statistical imputation theory to infer unknown genetic variants; large input data analysis and efficient parallel computation. Big-data analytics.
Large-scale input data analysis: Optimized DNA compression and indexing algorithms, such as BWA, suffix arrays, FM-indexing, etc., to compare DNA information across multiple patients. Big-data analytics.
– Somatic Mutation Finder on cancer genomes
– GWImp-COMPSs for large-scale imputation and genome-wide association studies
Development of new architectures for computational genomics tasks
Efficient deployment of Big-data genomic workflows on current and future platforms: HPC clusters, clouds, etc.
![Page 66: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/66.jpg)
66
Ongoing Big Data projects - VI
Processing & integrating urban data from heterogeneous data sources into GIS and different simulators
GrowSmarter: H2020 Lighthouse project on Smart Cities and sustainability. 12 smart solutions to meet the three different aspects of sustainability: economic, social and environmental.
Urban semantics: Our model is based on the concept of linked data and has several advantages: (1) the mapped data may be accessed uniformly, (2) the user can explore and navigate the metadata graph to understand its contents, (3) the user can query the model and visualize the result, and (4) data conflicts can be automatically detected.
![Page 67: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/67.jpg)
67
Ongoing Big Data projects: Global picture
[Diagram mapping the projects above to BSC departments (CS, CASE, LS, ES): NoSQL Data Management Research; Extending PyCOMPSs to NoSQL DB; High Performance Key/Value Stores; Improving Cost-effectiveness of Big Data Deployments; Management of Data Streaming Environments; Software Defined Environments; Big Data Application Monitoring; Large-scale graph processing for RT; Deep Learning; Multimedia Big Data Computing; GrowSmarter; Optical Network & Memories; BigIoT; Fog Computing; DB Analytics Acceleration; Roadmap]
![Page 69: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/69.jpg)
Big Data VOLUME @ BSC
Some examples:
Alya System Computational Mechanics code
Alya simulation software with particles module: 300 GB or more per simulation.
The MoDEL database is the largest molecular simulation database in Europe: 18 TB of data.
The Chronic Lymphocytic Leukemia (CLL) Genome Project, which aims to generate a comprehensive catalogue of genomic alterations: 1.5 PB of data.
Biomolecular simulation project with the Ascona B-DNA Consortium to generate a database containing structural and dynamic information of DNA sequences: 30 TB of data.
69
![Page 70: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/70.jpg)
Big Data VOLUME @ BSC
Problem:
It is difficult to share, manage and analyse such a large amount of data.
Solution:
Research at the Computer Science department, with a holistic approach at different layers:
Applications – Algorithms – Programming Models – Run Time – Architecture
70
![Page 71: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/71.jpg)
71
Computer Science strategy
Create a unified, productive, easy-to-use and efficient environment for our broad range of applications.
Big Data software stack @ BSC (main components):
App – Common API – COMPSs – NoSQL data management (Active Store, SKV, Wasabi, others) – hierarchical storage + computing resources
Resource management policies: data organization, query plans, computation scheduling
– Allow the use of NoSQL database technologies as a backend
– Facilitate access to Big Data storage from COMPSs applications, offering an abstraction to the programmer
– Data sharing platform based on self-contained objects (data + code)
– Use of a high-performance in-memory database architecture to accelerate workloads
![Page 72: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/72.jpg)
72
BSC Performance Tools
An efficient and resilient system is a crucial part of a Big Data solution
– An entire suite of performance tools and an analysis methodology are necessary to ensure it
– Needed to detect and understand structure and behavior, and to perform projections and extrapolations, …
A fundamental new component: Performance Analytics – a combination of techniques from many areas, such as signal processing, clustering, sequence alignment, image tracking, modeling, …
![Page 73: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/73.jpg)
Big Data collaborationwith IT industry @ BSC
Learning from Large & Sparse Graphs
• Algorithmic development: Machine Learning, Data mining, Pattern discovery/predictive analysis, …
• Applied to Neuroscience data, Behavior in Location Based Social Networks, …
Real-Time Monitoring of Image/Video Streaming
• Algorithmic development: Massive scale image replica detection, Massive scale image concept annotation, …
• MOU with SFY and CAR: Sport learning support system with personal cameras.
Citizen as a sensor
• Citizens, using their mobile devices, leave a digital footprint that can be gathered.
• Algorithmic development: Stream mining algorithms for massive real-time data, Adaptive clustering, Geo-aware DM methods, …
Deep Learning • Deep Learning attempts to mimic, in software, the activity in layers of neurons.
• In the last decade researchers made some fundamental conceptual breakthroughs, but computers were not powerful enough.
• Now we have the resources!
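A classic building block behind the stream-mining algorithms mentioned above is reservoir sampling, which keeps a uniform random sample of k items from an unbounded stream in O(k) memory. The slides name no specific algorithm, so this is an illustrative sketch of our own:

```python
import random

def reservoir_sample(stream, k, rng=random.Random(42)):
    """Keep a uniform sample of k items from a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)      # fill the reservoir first
        else:
            j = rng.randrange(i + 1)    # item i survives with probability k/(i+1)
            if j < k:
                reservoir[j] = item
    return reservoir

# A million "citizen sensor" readings, sampled with constant memory
sample = reservoir_sample(range(1_000_000), 5)
print(len(sample))  # 5
```

The key property is that memory stays O(k) no matter how long the stream runs, which is exactly what massive real-time data requires.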
73
![Page 74: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/74.jpg)
Big Data collaborationwith IT industry @ BSC
Systematic Study of Hadoop Deployment Variables to Enable Automated Characterization of Cost-Effectiveness
Use of high performance in-memory databases architecture to accelerate workloads.
Network-aware placement on "Software Defined Environments" (SDE) for HPC/Big Data
Create a Decision Support System using social information (stackoverflow, ….) for cloud services
74
SDE
ALOJA
BGAS
![Page 75: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/75.jpg)
Big Data relatedEuropean Projects @ BSC
Management of Big Data for Smart Cities & IoT.
Deliver a strategic roadmap for how technology advancements in hardware and networking can be exploited for Big Data analytics and beyond.
Accelerate "traditional" extremely large databases &
improve visualization using advanced hardware.
Offload security applications from the user terminal to a trusted and secure network node placed at the edge of the network.
75
![Page 76: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/76.jpg)
Energy European Projects& Big Data @ BSC
Parallel Distributed Infrastructure for Minimization of Energy
Install, configure, benchmark Hadoop on ARM prototypes
Network energy proportionality for Hadoop
Energy Benchmarking for Big Data (Modelling & Prediction)
76
![Page 77: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/77.jpg)
77
[Diagram: the BSC Big Data ecosystem at the intersection of HPC, Cloud Computing and (real-time) Analytics — BSC Programming Models, BSC Tools, BSC-CA Tech, SDE, ALOJA, BGAS, NoSQL Data Management, Citizen as a sensor, Deep Learning, Real-time Monitoring of Image Streaming, servIoTicy.com, Adaptive Scheduler, collaborations with CASE and Life Science]
![Page 78: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/78.jpg)
[The same ecosystem diagram, extended with Sustainable Computing and Smart Cities & IoT]
![Page 79: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/79.jpg)
www.bsc.es
Thank you !
![Page 80: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/80.jpg)
Emerging (non-volatile) memories
Columns grouped as: traditional technologies (DRAM, SRAM), improved Flash (NOR, NAND), emerging technologies (FeRAM, MRAM, PCM, Memristor, NEMS).

| | DRAM | SRAM | NOR | NAND | FeRAM | MRAM | PCM | Memristor | NEMS |
|---|---|---|---|---|---|---|---|---|---|
| Cell elements | 1T1C | 6T | 1T | 1T | 1T1C | 1T1R | 1T1R | 1M | 1T1N |
| Half pitch F (nm) | 50 | 65 | 90 | 90 | 180 | 130 | 65 | 3-10 | 10 |
| Smallest cell area (F^2) | 6 | 140 | 10 | 5 | 22 | 45 | 16 | 4 | 36 |
| Read time (ns) | < 1 | < 0.3 | < 10 | < 50 | < 45 | < 20 | < 60 | < 50 | 0 |
| Write/erase time (ns) | < 0.5 | < 0.3 | 10^5 | 10^6 | 10 | 20 | 60 | < 250 | 1 (140 ps-5 ns) |
| Retention time (years) | seconds | N/A | > 10 | > 10 | > 10 | > 10 | > 10 | > 10 | > 10 |
| Write op. voltage (V) | 2.5 | 1 | 12 | 15 | 0.9-3.3 | 1.5 | 3 | < 3 | < 1 |
| Read op. voltage (V) | 1.8 | 1 | 2 | 2 | 0.9-3.3 | 1.5 | 3 | < 3 | < 1 |
| Write endurance | 10^16 | 10^16 | 10^5 | 10^5 | 10^14 | 10^16 | 10^9 | 10^15 | 10^11 |
| Write energy (fJ/bit) | 5 | 0.7 | 10 | 10 | 30 | 1.5×10^5 | 6×10^3 | < 50 | < 0.7 |
| Density (Gbit/cm^2) | 6.67 | 0.17 | 1.23 | 2.47 | 0.14 | 0.13 | 1.48 | 250 | 48 |

Scalability notes (as in the source): voltage scaling – fairly scalable / no / poor / promising / promising; highly scalable / major technological barriers / poor / promising / promising / promising.
Sources: Chong et al ICCAD09, Eshraghian TVLSI11, ITRS13
![Page 81: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/81.jpg)
81
Introduction
• The appearance of writing and of numeral systems
• The Library of Alexandria, the printing press
• Amazon and Google…, from information to… suggesting new books to buy, and translating automatically between languages
• The example of Matthew F. Maury's navigation charts and his "twelve computers" (human clerks); his book "The Physical Geography of the Sea", 1855; 1.2 million data points; voyage durations cut by a third; Neptune…
• The US census: Herman Hollerith and the 1890 census, held every 10 years and projected to take 13 years to process; the 8 years needed for the 1880 census dropped to 1 for 1890… plus the capacity to process the information quickly
• Now we can generate enormous amounts of information and process it… the transistor is to blame
• How big is Big Data?
![Page 82: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/82.jpg)
BigData needs Intelligence!
Deep Learning: Building high-level abstractions
Watson: Reasoning
![Page 83: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/83.jpg)
MapReduce: example
Input Data → Mappers → Reducers → Output Data

Input:
  hamlet.txt: "to be or not"
  12th.txt: "be not afraid of greatness"
Map:
  (to, hamlet.txt), (be, hamlet.txt), (or, hamlet.txt), (not, hamlet.txt)
  (be, 12th.txt), (not, 12th.txt), (afraid, 12th.txt), (of, 12th.txt), (greatness, 12th.txt)
Reduce:
  afraid → (12th.txt)
  be → (12th.txt, hamlet.txt)
  greatness → (12th.txt)
  not → (12th.txt, hamlet.txt)
  of → (12th.txt)
  or → (hamlet.txt)
  to → (hamlet.txt)
Output: the inverted index above
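The inverted-index example above can be written directly as a map phase and a reduce phase. A minimal single-machine sketch (the filenames follow the slide; the function names are ours):

```python
from collections import defaultdict

def map_phase(filename, text):
    # Map: emit one (word, filename) pair per word occurrence
    return [(word, filename) for word in text.split()]

def reduce_phase(pairs):
    # Shuffle + reduce: group the filenames by word, deduplicated and sorted
    index = defaultdict(set)
    for word, filename in pairs:
        index[word].add(filename)
    return {word: sorted(files) for word, files in index.items()}

docs = {"hamlet.txt": "to be or not", "12th.txt": "be not afraid of greatness"}
pairs = [p for name, text in docs.items() for p in map_phase(name, text)]
index = reduce_phase(pairs)
print(index["be"])   # ['12th.txt', 'hamlet.txt']
```

In a real MapReduce system the mappers and reducers run on different machines and the framework performs the shuffle; the per-record logic is the same.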
![Page 84: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/84.jpg)
84
What is behind Watson?
IBM Watson – How to replicate Watson hardware and systems design for your own use in your basement. By Tony Pearson (ibm.co/Pearson), http://tinyurl.com/pt6mdfu
DeepQA: Massively Parallel Probabilistic Evidence-Based Architecture. For each question, it generates and scores many hypotheses using a combination of thousands of Natural Language Processing, Information Retrieval, Machine Learning and Reasoning algorithms. These gather, evaluate, weigh and balance different types of evidence to deliver the answer with the best evidence support it can find.
![Page 85: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/85.jpg)
85
Software tuning keyto exploit hardware!
![Page 86: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/86.jpg)
86
Internet & Big Data
![Page 87: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/87.jpg)
Challenges of data generation
Source: http://www-01.ibm.com/software/data/bigdata/
![Page 88: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/88.jpg)
88
![Page 89: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/89.jpg)
Graph500 and Top500 Lists: June 2014

Graph500
| # | Machine | Site | Cores | Problem size | GTEPS |
|---|---|---|---|---|---|
| 1 | K computer (Fujitsu – custom supercomputer) | RIKEN Advanced Institute for Computational Science (AICS) | 524,288 | 40 | 17,977.1 |
| 2 | Sequoia (IBM BlueGene/Q, Power BQC 16C 1.60 GHz) | Lawrence Livermore National Laboratory | 1,048,576 | 40 | 16,599 |
| 3 | Mira (IBM BlueGene/Q, Power BQC 16C 1.60 GHz) | Argonne National Laboratory | 786,432 | 40 | 14,328 |
| 4 | JUQUEEN (IBM BlueGene/Q, Power BQC 16C 1.60 GHz) | Forschungszentrum Juelich (FZJ) | 262,144 | 38 | 5,848 |
| 5 | Fermi (IBM BlueGene/Q, Power BQC 16C 1.60 GHz) | CINECA | 131,072 | 37 | 2,567 |

Top500
| # | Machine | Site | Cores | Rmax (TFlop/s) | Rpeak (TFlop/s) |
|---|---|---|---|---|---|
| 1 | Tianhe-2 (TH-IVB-FEP Cluster, Intel Xeon E5-2692 12C 2.200 GHz, Intel Xeon Phi 31S1P) | National Super Computer Center in Guangzhou | 3,120,000 | 33,862.7 | 54,902.4 |
| 2 | Titan (Cray XK7, Opteron 6274 16C 2.200 GHz, NVIDIA K20x) | Oak Ridge National Laboratory | 560,640 | 17,590.0 | 27,112.5 |
| 3 | Sequoia (IBM BlueGene/Q, Power BQC 16C 1.60 GHz) | Lawrence Livermore National Laboratory | 1,572,864 | 17,173.2 | 20,132.7 |
| 4 | K computer (Fujitsu SPARC64 VIIIfx 2.0 GHz) | RIKEN Advanced Institute for Computational Science (AICS) | 705,024 | 10,510.0 | 11,280.4 |
| 5 | Mira (IBM BlueGene/Q, Power BQC 16C 1.60 GHz) | Argonne National Laboratory | 786,432 | 8,586.6 | 10,066.3 |
![Page 90: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/90.jpg)
90
Limitations of MapReduce as a programming model?
• MapReduce is great, but not everyone is a MapReduce expert
  – "I am a Python expert but …"
• There is a class of algorithms that cannot be efficiently implemented with the MapReduce programming model
• Different programming models deal with different challenges
  – PyCOMPSs from BSC
  – Spark from Berkeley
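One example of the algorithm class that fits MapReduce poorly is anything iterative: each iteration becomes a full map/reduce job that re-reads and re-writes the whole dataset, which is exactly the cost that in-memory models such as Spark or PyCOMPSs avoid. A toy PageRank over a hypothetical three-page graph shows the per-iteration map/shuffle/reduce structure:

```python
from collections import defaultdict

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}   # hypothetical link graph
ranks = {page: 1.0 for page in links}

for _ in range(50):                       # one full map/reduce job per iteration
    contribs = defaultdict(float)         # shuffle target
    for page, outgoing in links.items():  # "map": emit (dest, contribution)
        for dest in outgoing:
            contribs[dest] += ranks[page] / len(outgoing)
    # "reduce": combine contributions per page, with 0.85 damping
    ranks = {page: 0.15 + 0.85 * contribs[page] for page in links}

print({page: round(r, 3) for page, r in ranks.items()})
```

On Hadoop, the 50 iterations above would mean 50 passes over disk; iterative models keep `ranks` in memory between rounds.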
![Page 91: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/91.jpg)
Commodity Solution –Cloud Oriented
What code runs on such a complex environment (many nodes, many disk volumes), with a short time-to-market deployment and on commodity HW?
91
![Page 92: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/92.jpg)
92
What happens if SW does not consider HW
• Terasort contest: sorting 100 TB of data
• Number 1: Hadoop
  • 2,100 nodes, 12 cores per node, 64 GB per node
  • 24,000 cores
  • 134 TB of memory
  • Time: 4,300 s
  • Cost on Amazon: $8,800
• Number 2: TritonSort
  • 52 nodes, 8 cores per node, 24 GB per node
  • 416 cores
  • 1.2 TB of memory
  • Time: 8,300 s and 6,400 s
  • Cost on Amazon: $294 and $226
• Hadoop is easy to program, but needs 57× more cores and 100× more memory, and only gets 2× performance
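The ratios quoted in the last bullet follow directly from the figures above; a quick check (the slide rounds them to 57×, 100× and 2×):

```python
# Figures taken from the slide above
hadoop = {"cores": 24_000, "mem_tb": 134.0, "time_s": 4_300, "cost_usd": 8_800}
triton = {"cores": 416,    "mem_tb": 1.2,   "time_s": 8_300, "cost_usd": 294}

core_ratio = hadoop["cores"] / triton["cores"]    # ~57.7x more cores
mem_ratio  = hadoop["mem_tb"] / triton["mem_tb"]  # ~112x more memory
speedup    = triton["time_s"] / hadoop["time_s"]  # ~1.9x faster

print(f"cores: {core_ratio:.0f}x  memory: {mem_ratio:.0f}x  speedup: {speedup:.1f}x")
```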
![Page 93: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/93.jpg)
Graph500 characterization
The Graph500 benchmark performs a Breadth-First Search over a random graph: it finds a spanning tree embedded in the graph, starting from a given vertex.
The benchmark application receives two parameters:
– Scale of the graph: the number of vertices is N = 2^SCALE.
– Edge factor, the number of edges per vertex: the number of edges is M = edgefactor × N.
The performance metric is TEPS (as opposed to FLOPS in the Top500):
– Traversed Edges Per Second counts the number of graph edge tuples traversed by the search algorithm, divided by its execution time.
There are multiple implementations of the BFS algorithm; the most scalable is the 'simple' one. We will focus on it in the following charts.
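A minimal sketch of what the benchmark measures — a BFS that builds a parent (spanning-tree) map and reports traversed edges per second — on a hypothetical toy graph rather than the benchmark's Kronecker generator:

```python
import time
from collections import deque

def bfs_teps(adj, root):
    """BFS from root; returns the spanning tree and a TEPS estimate."""
    parent = {root: root}            # spanning tree: vertex -> parent
    traversed = 0
    start = time.perf_counter()
    queue = deque([root])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            traversed += 1           # every scanned edge tuple counts
            if w not in parent:
                parent[w] = v
                queue.append(w)
    elapsed = time.perf_counter() - start
    return parent, traversed / elapsed   # TEPS = traversed edges / seconds

# Toy undirected graph with N = 2**3 vertices (SCALE = 3 in Graph500 terms)
adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4],
       4: [3, 5], 5: [4, 6], 6: [5, 7], 7: [6]}
tree, teps = bfs_teps(adj, 0)
print(len(tree))  # 8: every vertex was reached
```

The real benchmark times many BFS runs from random roots over a graph with billions of edges; the kernel structure is the same.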
![Page 94: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/94.jpg)
Evolution over time of the research paradigm
• In the last millennium, science was empirical: the description of natural phenomena
• A few centuries ago the theoretical approach appeared: models, formulas and generalizations
• In recent decades computational science appeared: the simulation of complex phenomena
• Science now focuses on the exploration of big data (eScience): the unification of theory, experiment and simulation
  • massive data captured by instruments, or generated through simulation, and processed by computer
  • knowledge and information stored in computers
  • scientists analyse databases and files on data infrastructures
Jim Gray, National Research Council, http://sites.nationalacademies.org/NRC/index.htm; Computer Science and Telecommunications Board, http://sites.nationalacademies.org/cstb/index.htm.
![Page 95: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/95.jpg)
95
Peering into the Big Data HW future
In the next 10 years, the HW will change more dramatically than it has in the past 10 years:
3D Stacking
Non-Volatile Memories
Dark Silicon
HW will influence the products and services that you provide
![Page 96: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/96.jpg)
Top10 (Top500, with power efficiency)
| Rank | Site | Computer | Cores (total / accel.) | Rmax (PFlop/s) | Rpeak (PFlop/s) | Power (MW) | GFlops/W | Name |
|---|---|---|---|---|---|---|---|---|
| 1 | National Super Computer Center in Guangzhou | TH-IVB-FEP Cluster, Intel Xeon E5-2692 12C 2.200 GHz, TH Express-2, Intel Xeon Phi 31S1P | 3,120,000 / 2,736,000 | 33.86 | 54.90 | 17.8 | 1.90 | Tianhe-2 (MilkyWay-2) |
| 2 | DOE/SC/Oak Ridge National Lab | Cray XK7, Opteron 6274 16C 2.20 GHz, Cray Gemini interconnect, NVIDIA K20x | 560,640 / 261,632 | 17.59 | 27.11 | 8.21 | 2.14 | Titan |
| 3 | DOE/NNSA/LLNL | BlueGene/Q, Power BQC 16C 1.60 GHz, custom | 1,572,864 | 17.17 | 20.13 | 7.89 | 2.18 | Sequoia |
| 4 | RIKEN Advanced Institute for Computational Science (AICS) | Fujitsu K computer, SPARC64 VIIIfx 2.0 GHz, Tofu interconnect | 705,024 | 10.51 | 11.28 | 12.65 | 0.83 | K |
| 5 | DOE/SC/Argonne National Laboratory | BlueGene/Q, Power BQC 16C 1.60 GHz, custom | 786,432 | 8.58 | 10.06 | 3.94 | 2.18 | Mira |
| 6 | CSCS | Cray XC30, Xeon E5-2670 8C 2.600 GHz, Aries interconnect, NVIDIA K20x | 115,984 / 73,808 | 6.27 | 7.79 | 2.32 | 2.70 | Piz Daint |
| 7 | Texas Advanced Computing Center | PowerEdge C8220, Xeon E5-2680 8C 2.700 GHz, Infiniband FDR, Intel Xeon Phi | 462,462 / 366,366 | 5.17 | 8.52 | 4.51 | 1.14 | Stampede |
| 8 | Forschungszentrum Juelich (FZJ) | BlueGene/Q, Power BQC 16C 1.60 GHz, custom | 458,752 | 5.00 | 5.87 | 2.30 | 2.18 | JUQUEEN |
| 9 | DOE/NNSA/LLNL | BlueGene/Q, Power BQC 16C 1.60 GHz, custom | 393,216 | 4.29 | 5.03 | 1.97 | 2.18 | Vulcan |
| 10 | Government | Cray XC30, Intel Xeon E5-2660v2 10C 2.2 GHz, Aries, NVIDIA K40 | 72,800 / 62,400 | 3.57 | 6.13 | 1.50 | 2.39 | |
![Page 97: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/97.jpg)
Top10 — Graph500, June 2014
| Rank | Site | Computer | Cores | Problem size | GTEPS |
|---|---|---|---|---|---|
| 1 | RIKEN Advanced Institute for Computational Science (AICS) | K computer (Fujitsu – custom supercomputer) | 524,288 | 40 | 17,977.1 |
| 2 | DOE/NNSA/LLNL, Lawrence Livermore National Laboratory | Sequoia (IBM BlueGene/Q, Power BQC 16C 1.60 GHz) | 1,048,576 | 40 | 16,599 |
| 3 | DOE/SC/Argonne National Laboratory | Mira (IBM BlueGene/Q, Power BQC 16C 1.60 GHz) | 786,432 | 40 | 14,328 |
| 4 | Forschungszentrum Juelich (FZJ) | JUQUEEN (IBM BlueGene/Q, Power BQC 16C 1.60 GHz) | 262,144 | 38 | 5,848 |
| 5 | CINECA | Fermi (IBM BlueGene/Q, Power BQC 16C 1.60 GHz) | 131,072 | 37 | 2,567 |
| 6 | Changsha, China | Tianhe-2 (MilkyWay-2) (National University of Defense Technology – MPP) | 196,608 | 36 | 2,061.48 |
| 7 | CNRS/IDRIS-GENCI | Turing (IBM BlueGene/Q, Power BQC 16C 1.60 GHz) | 65,536 | 36 | 1,427 |
| 7 | Science and Technology Facilities Council – Daresbury Laboratory | Blue Joule (IBM BlueGene/Q, Power BQC 16C 1.60 GHz) | 65,536 | 36 | 1,427 |
| 7 | University of Edinburgh | DIRAC (IBM BlueGene/Q, Power BQC 16C 1.60 GHz) | 65,536 | 36 | 1,427 |
| 7 | EDF R&D | Zumbrota (IBM BlueGene/Q, Power BQC 16C 1.60 GHz) | 65,536 | 36 | 1,427 |
![Page 98: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/98.jpg)
98
What are FPGAs?
FPGA: Field Programmable Gate Array
Generic sea of programmable logic and interconnect
Program into specialized computing & networking circuits
Can be reconfigured in as little as 100 ms
[Figure: FPGA floorplan — generic logic (LUTs), DSP multiplier blocks, 36 Kb dual-port RAMs, and specialized I/O blocks (network, PCIe)]
![Page 99: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/99.jpg)
99
Solutions
Generic Architectures
– High-End: Data Appliances on premise
– Commodity: Hadoop / SPARK / …
Architecture Specialization – IBM BGAS / Netezza
– Microsoft Catapult
![Page 100: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/100.jpg)
100
High-End Data Appliance
![Page 101: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/101.jpg)
101
IBM Netezza Appliance (> 50 TB)