Social and Economic Challenges of Open Data and Big Data
Transcript of "Social and Economic Challenges of Open Data and Big Data"
![Page 1: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/1.jpg)
www.bsc.es
Social and Economic Challenges of Open Data and Big Data
Prof. Mateo Valero
BSC Director
Madrid, Dec. 11th, 2014
![Page 2: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/2.jpg)
Social and Economic Challenges of Open Data and Big Data
Organized by: Department of Libraries and Documentation, Directorate of Culture, Instituto Cervantes
Prof. Mateo Valero, Director
Barcelona Supercomputing Center
![Page 3: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/3.jpg)
Outline
The Big Data era: data generation explosion
Storing data
Processing data
Where do we place data?
How do we use data?
Research at BSC
3Prof. Mateo Valero – Big Data
![Page 4: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/4.jpg)
How does science advance today?
EXPENSIVE, DANGEROUS, IMPOSSIBLE
Simulation
Simulation = computing the formulas of the theory
Theory
Experimentation
![Page 5: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/5.jpg)
5
Higgs and Englert's Nobel Prize in Physics, 2013
Last year one of the most computer-intensive scientific experiments ever undertaken confirmed Peter Higgs and François Englert's theory by producing the Higgs boson, the so-called "God particle", in an $8bn atom smasher, the Large Hadron Collider at CERN outside Geneva.
"The LHC produces 600 TB/sec and, after filtering, needs to store 25 PB/year" … 15 million sensors.
![Page 6: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/6.jpg)
Big Data in Biology
High resolution imaging
Clinical records
Simulations
Omics
![Page 7: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/7.jpg)
Life Sciences and Health: personalized medicine
Atom
Small Molecule
Macromolecule
Cell
Tissue
Organ
Population
![Page 8: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/8.jpg)
8
Sequencing Costs
Source: National Human Genome Research Institute (NHGRI), http://www.genome.gov/sequencingcosts/
(1) "Cost per Megabase of DNA Sequence": the cost of determining one megabase (Mb; a million bases) of DNA sequence of a specified quality.
(2) "Cost per Genome": the cost of sequencing a human-sized genome. For each, a graph is provided showing the data since 2001.
In both graphs, the data from 2001 through October 2007 represent the costs of generating DNA sequence using Sanger-based chemistries and capillary-based instruments ('first generation' sequencing platforms). Beginning in January 2008, the data represent the costs of generating DNA sequence using 'second-generation' (or 'next-generation') sequencing platforms. The change in instruments represents the rapid evolution of DNA sequencing technologies that has occurred in recent years.
![Page 9: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/9.jpg)
9
World-class genomic consortia
• The European Genome-phenome Archive (EGA) repository will allow us to explore datasets from numerous genetic studies.
• Pan-Cancer will produce rich data that will provide a major opportunity to develop an integrated picture of differences across tumor lineages.
![Page 10: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/10.jpg)
ICGC-TCGA PanCancer Project
Identification and classification of genome and expression variation
Different tumor types (2,000 patients, 4,000 genomes)
Participants: European Bioinformatics Institute; University of Chicago; Electronics and Telecom. Res. Inst. (Korea); IMSUT; RIKEN; Barcelona Supercomputing Center; Broad Institute MIT; Ontario Institute for Cancer Research
Largest worldwide initiative in biomedical genomics. Computing protocols and resources are a limiting factor.
![Page 11: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/11.jpg)
Square Kilometer Array (SKA)
![Page 12: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/12.jpg)
Square Kilometer Array (SKA)
Tbps → Gbps
![Page 13: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/13.jpg)
Pervasive Connectivity
Explosion of Information
In 60 seconds today: 400,710 ad requests; 2,000 lyrics played on TuneWiki; 1,500 pings sent on PingMe; 208,333 minutes of Angry Birds played; 23,148 apps downloaded; 98,000 tweets.
Smart device expansion: 30 billion connected devices (1) and 40 trillion GB of data for 8 billion people (2) (3) by 2020; 10 million mobile apps (4).
(1) IDC Directions 2013: Why the Datacenter of the Future Will Leverage a Converged Infrastructure, March 2013, Matt Eastwood; (2) & (3) IDC Predictions 2012: Competing for 2020, Document 231720, December 2011, Frank Gens; (4) http://en.wikipedia.org
A New Era of Information Technology
Current infrastructure sagging under its own weight
Internet of Things
![Page 14: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/14.jpg)
The Data Deluge
2005: 0.1 ZB → 2010: 1.2 ZB → 2012: 2.8 ZB → 2015: 8.5 ZB → 2020: 40 ZB*
(the 2020 figure exceeds prior forecasts by 5 ZB)
* Source: IDC
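The figures above imply a remarkably steady growth rate. A quick back-of-the-envelope check (the function and variable names are mine, for illustration):

```python
# Implied compound annual growth rate (CAGR) of the digital universe,
# using the IDC figures quoted on this slide (values in zettabytes).
idc_forecast_zb = {2005: 0.1, 2010: 1.2, 2012: 2.8, 2015: 8.5, 2020: 40.0}

def cagr(v0, v1, years):
    """Compound annual growth rate between two data points."""
    return (v1 / v0) ** (1.0 / years) - 1.0

growth = cagr(idc_forecast_zb[2005], idc_forecast_zb[2020], 2020 - 2005)
print(f"2005-2020 CAGR: {growth:.0%}")               # roughly 49% per year
print(f"doubling time: {0.693 / growth:.1f} years")  # under two years
```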
![Page 15: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/15.jpg)
How big is big?
Megabyte = 10^6 bytes
Gigabyte = 10^9 bytes
Terabyte = 10^12 bytes: 500 TB of new data per day are ingested into Facebook's databases
Petabyte = 10^15 bytes: the CERN Large Hadron Collider generates 1 PB per second
Exabyte = 10^18 bytes: 1 EB of data is created on the internet each day, equal to 250 million DVDs' worth of information; the proposed Square Kilometer Array telescope will generate an EB of data per day
Zettabyte = 10^21 bytes: 1.3 ZB of network traffic by 2016
Yottabyte = 10^24 bytes: this is our digital universe today, equal to 250 trillion DVDs
Brontobyte = 10^27 bytes: this will be our digital universe tomorrow…
Geopbyte = 10^30 bytes: this will take us beyond our decimal system (Saganbyte, Jotabyte, …)
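The DVD equivalences on this scale are easy to sanity-check; the 250 million and 250 trillion figures correspond to a rounded 4 GB per DVD (the names below are mine, for illustration):

```python
# Check the DVD equivalences, assuming the rounded 4 GB per DVD that the
# slide's own figures imply, and decimal SI prefixes.
PREFIX = {"GB": 10**9, "EB": 10**18, "YB": 10**24}
DVD = 4 * PREFIX["GB"]  # rounded single-layer DVD capacity, in bytes

print(f"1 EB = {PREFIX['EB'] / DVD:,.0f} DVDs")  # 250,000,000
print(f"1 YB = {PREFIX['YB'] / DVD:,.0f} DVDs")  # 250,000,000,000,000
```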
![Page 16: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/16.jpg)
The data explosion is transforming science
Every area of science is now engaged in data-intensive research. Researchers need:
– Technology to publish and share data in the cloud
– Data analytics tools to explore massive data collections
– A sustainable economic model for scientific analysis, collaboration and data curation
![Page 17: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/17.jpg)
Outline
The Big Data era: data generation explosion
Storing data
Processing data
Where do we place data?
How do we use data?
Research at BSC
![Page 18: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/18.jpg)
Magnetic Tape Memory
Invented by Eckert & Mauchly for the UNIVAC I, March 21, 1951
– Model: UNISERVO
– 224 KB of data
– 1/2-inch wide, 1,200 feet long, 128 characters per inch
– Speed: 100 inches per second, equivalent to 12,800 characters per second
StorageTek, 2013, T10000D:
– 8.5 terabytes (a ~40,000,000× increase)
– Bandwidth: 250 MB/second (a ~20,000× increase)
– Load time: 10 seconds
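The improvement factors in parentheses can be verified from the raw figures on the slide (both factors are rounded up there; a rough check):

```python
# Verify the tape improvement factors: 1951 UNISERVO vs. 2013 StorageTek
# T10000D, using the figures quoted on the slide.
uniservo_capacity = 224 * 10**3    # bytes (224 KB)
uniservo_rate     = 12_800         # characters/second
t10000d_capacity  = 8.5 * 10**12   # bytes (8.5 TB)
t10000d_rate      = 250 * 10**6    # bytes/second

print(f"capacity:  x{t10000d_capacity / uniservo_capacity:,.0f}")  # ~38 million
print(f"bandwidth: x{t10000d_rate / uniservo_rate:,.0f}")          # ~19,500
```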
![Page 19: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/19.jpg)
First Disk, IBM, 1956
1956, IBM 305 RAMAC
4 MB, 50 disks of 24", 1,200 rpm, 100 bits/track
Intertrack spacing: 0.1 inches; density: 1,000 bits/in2
100 ms access time, vacuum tubes, $35k/year rent
Year 2013: 4 terabytes (a 1,000,000× increase)
Average access time: a few milliseconds (a 40-to-1 improvement)
Areal density: doubling on average every 2 to 4 years, but not now
Predicted: 14 terabytes in 2020 at a cost of $40
HDD: Hard Disk Drive
![Page 20: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/20.jpg)
Tapes advancing fast
May 2014
IBM and Fujifilm have demoed a 154 TB LTO-size tape cartridge which could come to market in 10 years' time.
The Sony development involved a 148 Gbit/in2 tape density and its own tape design to achieve a 185 TB uncompressed capacity.
148 Gbit/in2 vs. 3.1 Gbit/in2 (T10000C)
http://www.insic.org/news/2012Roadmap/12index.html
![Page 21: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/21.jpg)
Seagate Step up to SAS; SAS Roadmap. Source: SCSI Trade Association
How We Increase 10x and Beyond…
Bandwidth and Connectivity
Capacity Drive Technologies
Example roadmap: 1 TB (2010) → 8 TB (2015) → 20 TB (~2020)
Example features for capacity driven usage
![Page 22: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/22.jpg)
Source: Seagate Storage Effect web site
Areal Density
– Shingled Magnetic Recording (SMR): projected 25% areal density increase; enables 10 TB drives and above
– Heat-Assisted Magnetic Recording (HAMR): projected 55% areal density increase; theoretical 30 TB to 60 TB hard drives
– Future candidate(s), single-molecule magnets: envisioned 1,000× areal density increase; theoretical up to 3,000 TB hard drives
![Page 23: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/23.jpg)
Capability computing: research directions
Can we "trade" memory capacity for other metrics of interest, e.g. bandwidth?
Packaging
– 2.5D stacking
– 3D stacking
Technologies
– Wide I/O
– Hybrid Memory Cube
![Page 24: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/24.jpg)
Emerging (non-volatile) memories
(DRAM, SRAM: traditional technologies; NOR, NAND: improved Flash; FeRAM through NEMS: emerging technologies)

| | DRAM | SRAM | NOR | NAND | FeRAM | MRAM | PCM | Memristor | NEMS |
|---|---|---|---|---|---|---|---|---|---|
| Cell elements | 1T1C | 6T | 1T | 1T | 1T1C | 1T1R | 1T1R | 1M | 1T1N |
| Half pitch (F) (nm) | 50 | 65 | 90 | 90 | 180 | 130 | 65 | 3–10 | 10 |
| Smallest cell area (F2) | 6 | 140 | 10 | 5 | 22 | 45 | 16 | 4 | 36 |
| Read time (ns) | < 1 | < 0.3 | < 10 | < 50 | < 45 | < 20 | < 60 | < 50 | 0 |
| Write/erase time (ns) | < 0.5 | < 0.3 | 10^5 | 10^6 | 10 | 20 | 60 | < 250 | 1 (140 ps–5 ns) |
| Retention time | seconds | N/A | > 10 y | > 10 y | > 10 y | > 10 y | > 10 y | > 10 y | > 10 y |
| Write op. voltage (V) | 2.5 | 1 | 12 | 15 | 0.9–3.3 | 1.5 | 3 | < 3 | < 1 |
| Read op. voltage (V) | 1.8 | 1 | 2 | 2 | 0.9–3.3 | 1.5 | 3 | < 3 | < 1 |
| Write endurance | 10^16 | 10^16 | 10^5 | 10^5 | 10^14 | 10^16 | 10^9 | 10^15 | 10^11 |
| Write energy (fJ/bit) | 5 | 0.7 | 10 | 10 | 30 | 1.5×10^5 | 6×10^3 | < 50 | < 0.7 |
| Density (Gbit/cm2) | 6.67 | 0.17 | 1.23 | 2.47 | 0.14 | 0.13 | 1.48 | 250 | 48 |

Scaling outlook: the traditional technologies range from fairly scalable (DRAM) to highly scalable (SRAM), improved Flash faces major technological barriers and poor voltage scaling, while the emerging technologies look promising.
Sources: Chong et al. ICCAD09, Eshraghian TVLSI11, ITRS13
![Page 25: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/25.jpg)
Outline
The Big Data era: data generation explosion
Storing data
Processing data
Where do we place data?
How do we use data?
Research at BSC
![Page 26: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/26.jpg)
26
The MultiCore Era
Moore’s Law + Memory Wall + Power Wall
Chip MultiProcessors (CMPs)
UltraSPARC T2 (2007)
Intel Xeon 7100 (2006)
POWER4 (2001)
![Page 27: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/27.jpg)
27
Fujitsu SPARC64 XIfx
32 compute cores (single-threaded) + 2 assistant cores
24 MB L2 sector cache
256-bit wide SIMD
20 nm, 3.75 billion transistors
2.2 GHz frequency
1.1 TFLOPS peak performance
High-bandwidth interconnects:
– HMC (240 GB/s ×2, in/out)
– Tofu2 (125 GB/s ×2, in/out)
![Page 28: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/28.jpg)
Intel Xeon Phi, or Intel Many Integrated Core Architecture (MIC)
Knights Corner (2011)
– Coprocessor, 61 x86 cores, 22 nm, 512-bit vector units, 4 HTs per core
– 1.2 TFLOPS (DP), 300 W TDP, 4 GFLOPS/W
– 512 KB/core coherent L2
– Internal network: ring
– Memory BW: 352 GB/s
Knights Landing (expected 2015)
– Coprocessor or host processor
– 72 Atom cores, 14 nm, AVX-512 per core, 4 HTs
– Up to 16 GB of DRAM 3D-stacked on-package, 384 GB DDR
– 3 TFLOPS (DP), 200 W TDP, 15 GFLOPS/W
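The GFLOPS/W figures quoted for the two generations are simply peak performance divided by TDP, which is easy to check (the helper function is mine, for illustration):

```python
# Energy efficiency implied by the slide's peak-performance and TDP figures.
def gflops_per_watt(peak_tflops, tdp_watts):
    """Peak GFLOPS divided by thermal design power."""
    return peak_tflops * 1000 / tdp_watts

knc = gflops_per_watt(1.2, 300)  # Knights Corner
knl = gflops_per_watt(3.0, 200)  # Knights Landing (expected)
print(f"KNC: {knc:.0f} GFLOPS/W, KNL: {knl:.0f} GFLOPS/W")  # 4 and 15
```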
![Page 29: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/29.jpg)
Accelerators: NVIDIA Kepler GK110 GPU (2014)
DP performance: 1.43 TFLOPS
Memory BW (ECC off): 288 GB/s
Memory size (GDDR5): 12 GB
15 SMX units, each with:
– 192 single-precision CUDA cores
– 64 double-precision units
– 32 special function units
– 32 load/store units
Six 64-bit memory controllers
![Page 30: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/30.jpg)
Evolution of the computing power of supercomputers
FLOP/second (operations on 64-bit real numbers)
1988: Cray Y-MP (8 processors)
1998: Cray T3E (1,024 processors)
2008: Cray XT5 (150,000 processors)
~2018?: (1×10^7 processors)
![Page 31: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/31.jpg)
The "Formula 1" of supercomputers today
FLOP/second (operations on 64-bit real numbers)
#1 (55 PF): Tianhe-2 @ National University of Defense Technology, 54.9 PFLOPS
#1 EU (5 PF): JUQUEEN @ Forschungszentrum Jülich
#1 Spain (1 PF): MareNostrum 3 @ BSC
Samsung Exynos > 50 GFLOPS: MontBlanc prototypes @ BSC
![Page 32: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/32.jpg)
NVIDIA: Node for the Exaflop Computer
Thanks to Bill Dally
![Page 33: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/33.jpg)
Graph 500 Project
HPC benchmarks and performance metrics do not provide useful information on the suitability of supercomputing systems for data-intensive applications.
Graph 500 establishes a set of large-scale benchmarks for these applications.
Steering committee: 50 international HPC experts from academia and industry.
Graph 500 is developing benchmarks to address three application kernels:
– Concurrent search
– Optimization (single-source shortest path)
– Edge-oriented (maximal independent set)
It also addresses five graph-related business areas:
– Cybersecurity
– Medical informatics
– Data enrichment
– Social networks
– Symbolic networks
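The "concurrent search" kernel is a breadth-first search over a large graph; the real benchmark runs it on huge Kronecker-generated graphs and reports traversed edges per second (TEPS). A minimal level-synchronous sketch of the kernel's core on a toy graph (illustrative only, not the official reference code):

```python
# Breadth-first search producing the parent tree that Graph 500 validates.
from collections import deque

def bfs_parents(adj, root):
    """Return the BFS parent tree and the number of edges traversed."""
    parent = {root: root}
    frontier = deque([root])
    edges_traversed = 0
    while frontier:
        v = frontier.popleft()
        for w in adj[v]:
            edges_traversed += 1
            if w not in parent:     # first visit: record parent, extend frontier
                parent[w] = v
                frontier.append(w)
    return parent, edges_traversed

# toy undirected graph: a square 0-1-3-2-0 plus the diagonal 0-3
adj = {0: [1, 2, 3], 1: [0, 3], 2: [0, 3], 3: [0, 1, 2]}
parent, edges = bfs_parents(adj, 0)
print(parent)  # {0: 0, 1: 0, 2: 0, 3: 0}
print(edges)   # 10 edge traversals; TEPS = edges / elapsed time
```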
![Page 34: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/34.jpg)
Graph500 vs. Top500
Top500 defined a benchmark (Linpack) to rate HPC machines on performance. This benchmark is not suitable to address the characteristics of Big Data applications.
Linpack:
– computation-bounded
– focused on floating-point operations
– Bulk-Synchronous-Parallel model: behavior based on big computation and short communication bursts
– dense, highly organized and coalesced data structures (spatial locality)
Graph500:
– communication-bounded
– focused on integer operations
– asynchronous, spatially uniform communication interleaved with computation
– larger sparse datasets (very low spatial and temporal locality)
Green Graph 500 list: collects performance-per-watt metrics to compare the energy consumption of data-intensive computing workloads.
Graph500: graph500.org; Top500: top500.org
![Page 35: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/35.jpg)
35
Accelerators: NVIDIA Pascal (expected 2016)
Aimed at fixing the data-movement bottleneck
Based on NVLink:
– a chip-to-chip communication approach
– comprised of bidirectional 8-lane links
– providing between 80 and 200 GB/s of bandwidth
This approach is expected to provide 4× speedups with respect to current GPU-based designs.
![Page 36: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/36.jpg)
36
Specialization is Everything (?)
ASICs, FPGAs
Source: Bob Brodersen, Berkeley Wireless group
![Page 37: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/37.jpg)
Outline
The Big Data era: data generation explosion
Storing data
Processing data
Where do we place data?
How do we use data?
Research at BSC
![Page 38: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/38.jpg)
38
BioInformatics, Big Data and Supercomputing
1 PB of compressed data
100,000 subjects suffering from different diseases
![Page 39: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/39.jpg)
If we ever had a 1 PB disk (100 MB/s)
Scanning 1 petabyte takes about 3,000 hours, roughly 125 days.
What is the time required to retrieve information?
![Page 40: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/40.jpg)
What if we want to process the data in 1 hour?
Supercomputing is about doing things FAST…
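The arithmetic behind the two previous slides, and the reason parallel I/O matters: scan time falls linearly with the number of disks streaming in parallel (the slides round the single-disk figure up to 3,000 hours / 125 days):

```python
# Time to scan 1 PB at 100 MB/s on one disk, and the parallelism needed
# to finish in one hour.
PB = 10**15
DISK_RATE = 100 * 10**6          # bytes/second for one disk

seconds = PB / DISK_RATE         # sequential scan on a single disk
print(f"one disk: {seconds / 3600:,.0f} hours = {seconds / 86400:.0f} days")

# time scales inversely with disk count, so finishing in 1 hour needs
# as many disks as there are hours of work
disks_needed = seconds / 3600
print(f"disks for a 1-hour scan: {disks_needed:,.0f}")
```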
![Page 41: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/41.jpg)
Evolution of storage architecture for Big Data
[Figure: a fat-tree InfiniBand network. 36-port FDR10 leaf switches, each serving 560 nodes, connect with 2 or 3 links to each of six Mellanox 648-port IB core switches. Latency: 0.7 µs; bandwidth: 40 Gb/s. The compute network (40 Gbps node adapters) serves the compute nodes; the storage network (1 Gbps node adapters) serves the storage racks holding 1 PB: about 11 hrs to move 1 PB.]
![Page 42: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/42.jpg)
Solid State Hybrid Drive (SSHD) Technology
SSHD: capacity, performance, and value
HDD large capacity + SSD high speed
Adaptive Memory™ technology:
• Self-learning software algorithms make the HDD and SLC NAND flash work together
• Enables SSD-like performance when accessing the most frequently used files
• Reduces the workload on, and increases the reliability of, the SLC NAND flash
Reduces costs: it is now practical to employ enterprise-class SLC NAND flash
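As an illustration of the idea only (not Seagate's proprietary Adaptive Memory algorithm), a toy model that promotes the most frequently accessed blocks to a small flash cache in front of the HDD:

```python
# Toy hot-data promotion: serve the N most frequently read blocks from
# flash, everything else from the HDD. Purely illustrative.
from collections import Counter

class HybridDrive:
    def __init__(self, flash_blocks):
        self.flash_blocks = flash_blocks   # cache capacity, in blocks
        self.hits = Counter()              # access frequency per block
        self.flash = set()                 # blocks currently promoted

    def read(self, block):
        self.hits[block] += 1
        # keep the current top-N hottest blocks in flash
        self.flash = {b for b, _ in self.hits.most_common(self.flash_blocks)}
        return "flash" if block in self.flash else "hdd"

drive = HybridDrive(flash_blocks=2)
for b in [1, 1, 1, 2, 2, 3]:
    drive.read(b)
print(drive.read(1))  # the hottest block is now served from flash
```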
![Page 43: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/43.jpg)
Compute and Communication Energies
43
Source: Processors and sockets: What's next? Greg Astfalk / Salishan Conference / April 25, 2013
![Page 44: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/44.jpg)
44
Paradigm shift
Old: compute-centric model
New: data-centric model
– Massive parallelism
– Persistent memory
– Flash / storage-class memory
– Manycore accelerators
Source: Heiko Joerg http://www.slideshare.net/schihei/petascale-analytics-the-world-of-big-data-requires-big-analytics
![Page 45: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/45.jpg)
Optimized System Design for Data-Centric Deep Computing
45
Source: IBM Corporation, 2013
![Page 46: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/46.jpg)
Solutions for Supercomputing
© 2013 IBM Corporation. Gene Active Storage
[Figure: per-node and aggregate all-to-all bandwidth (GB/s) versus node count, for 3D-, 4D- and 5D-torus networks and a full-bisection network.]
![Page 47: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/47.jpg)
47
HW-SW-Network co-design: Data Appliances
![Page 48: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/48.jpg)
48
Microsoft Catapult
Microsoft is planning to replace traditional CPUs in data centers with field-programmable gate arrays (FPGAs), processors that Microsoft can modify specifically for use with its own software. These FPGAs are already available in the market and Microsoft is sourcing them from a company called Altera. The FPGAs are 40 times faster than a CPU at processing Bing's custom algorithms.
Doug Burger from Microsoft Research talks about Project Catapult, which will make Bing twice as fast (Jun 17, 2014): http://microsoft-news.com/doug-burger-from-microsoft-research-talks-about-project-catapult-which-will-make-bing-twice-as-fast-video/
![Page 49: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/49.jpg)
49
HP unveils "The Machine"
June 11, 2014
It uses clusters of special-purpose cores, rather than a few generalized cores; photonics link everything instead of slow, energy-hungry copper wires; memristors give it unified memory that's as fast as RAM yet stores data permanently, like a flash drive.
A Machine server could address 160 petabytes of data in 250 nanoseconds; HP says its hardware should be about six times more powerful than an existing server, even as it consumes 80 times less energy. Ditching older technology like copper also encourages non-traditional, three-dimensional computing shapes, since you're not bound by the usual distance limits.
Source: http://www.engadget.com/2014/06/11/hp-the-machine/
![Page 50: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/50.jpg)
Outline
The Big Data era: data generation explosion
Storing data
Processing data
Where do we place data?
How do we use data?
Research at BSC
![Page 51: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/51.jpg)
Relational databases are sometimes not good for scale-out
[Figure: five data items (1–5) spread across five nodes, each node holding a three-item subset.]
NoSQL for non-structured data, eventual consistency
![Page 52: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/52.jpg)
Programming model
To meet the challenges: MapReduce
– A programming model introduced by Google in the early 2000s to support distributed computing (with special emphasis on fault tolerance)
– An ecosystem of big data processing tools: open source, distributed, and running on commodity hardware
– The key innovation of MapReduce is the ability to take a query over a data set, divide it, and run it in parallel over many nodes
– Two phases: a map phase and a reduce phase (input data flows through the mappers, and the reducers produce the output data)
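The two phases can be shown with the canonical word-count example, written here in plain Python rather than against a real MapReduce framework: map emits (key, value) pairs, the framework groups them by key, and reduce aggregates each group.

```python
# Word count in the MapReduce style, with the shuffle/sort step done by
# sorting and grouping the mapper output by key.
from itertools import groupby

def map_phase(document):
    for word in document.split():
        yield (word, 1)            # emit one (key, value) pair per word

def reduce_phase(word, counts):
    return (word, sum(counts))     # aggregate all values for one key

documents = ["big data needs big machines", "open data"]

# shuffle/sort: collect and group all mapper output by key
pairs = sorted(kv for doc in documents for kv in map_phase(doc))
result = dict(reduce_phase(k, (v for _, v in g))
              for k, g in groupby(pairs, key=lambda kv: kv[0]))
print(result)  # {'big': 2, 'data': 2, 'machines': 1, 'needs': 1, 'open': 1}
```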
![Page 53: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/53.jpg)
Google progressively dropping MapReduce
2004: Google introduces MapReduce.
2006: The Apache Hadoop project brings an open-source MapReduce implementation, built by the team of the Apache Nutch crawler with the support of Yahoo!
2010: Google announces Pregel, used for building incremental reverse indexes instead of MapReduce. Philosophy: "think like a vertex" (event-driven).
June 25, 2014: Google announces Cloud Dataflow: write code once, run it in batch or stream mode. Cloud Dataflow is a managed service for creating data pipelines that ingest, transform and analyze data in both batch and streaming modes.
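The "think like a vertex" idea can be sketched in a few lines: computation proceeds in supersteps, and in each superstep every vertex processes its incoming messages, updates its state, and messages its neighbours. This toy single-source shortest path with unit edge weights only illustrates the model; it is not Google's Pregel API:

```python
# Vertex-centric single-source shortest path in the Pregel style.
def pregel_sssp(adj, source, max_supersteps=10):
    INF = float("inf")
    dist = {v: INF for v in adj}
    inbox = {v: [] for v in adj}
    inbox[source] = [0]                     # the source starts at distance 0
    for _ in range(max_supersteps):
        outbox = {v: [] for v in adj}
        active = False
        for v in adj:                       # every vertex runs the same program
            if inbox[v] and min(inbox[v]) < dist[v]:
                dist[v] = min(inbox[v])
                for w in adj[v]:            # tell neighbours about the new path
                    outbox[w].append(dist[v] + 1)
                active = True
        inbox = outbox                      # messages are delivered next superstep
        if not active:                      # all vertices voted to halt
            break
    return dist

adj = {"a": ["b"], "b": ["c"], "c": ["a"], "d": []}
print(pregel_sssp(adj, "a"))  # {'a': 0, 'b': 1, 'c': 2, 'd': inf}
```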
![Page 54: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/54.jpg)
Deep Learning: generating high-level abstractions
Toronto Face Database (TFD): automatic recognition of facial expressions
Fig. 15. Expression recognition accuracy on the TFD dataset when both training and test labeled images are subject to 7 types of occlusion.
Ranzato, M., Mnih, V., Susskind, J. and Hinton, G. E. Modeling Natural Images Using Gated MRFs. IEEE Trans. Pattern Analysis and Machine Intelligence, to appear.
![Page 55: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/55.jpg)
55
Watson: Calculating vs. Learning
Source: Courtesy of IBM
![Page 56: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/56.jpg)
Outline
The Big Data era: data generation explosion
Storing data
Processing data
Where do we place data?
How do we use data?
Research at BSC
![Page 57: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/57.jpg)
57
Barcelona Supercomputing Center – Centro Nacional de Supercomputación
BSC-CNS objectives:
– R&D in Computer, Life, Earth and Engineering Sciences
– Supercomputing services and support to Spanish and European researchers
BSC-CNS is a consortium that includes:
– Spanish Government: 51%
– Catalan Government: 37%
– Universitat Politècnica de Catalunya (UPC): 12%
400+ people from 45 countries
![Page 58: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/58.jpg)
58
Mission of BSC Scientific Departments
EARTH SCIENCES: to develop and implement global and regional state-of-the-art models for short-term air quality forecast and long-term climate applications.
LIFE SCIENCES: to understand living organisms by means of theoretical and computational methods (molecular modeling, genomics, proteomics).
CASE: to develop scientific and engineering software to efficiently exploit supercomputing capabilities (biomedical, geophysics, atmospheric, energy, social and economic simulations).
COMPUTER SCIENCES: to influence the way machines are built, programmed and used: programming models, performance tools, Big Data, computer architecture, energy efficiency.
![Page 59: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/59.jpg)
MareNostrum 3
59
48,896 Intel Sandy Bridge cores at 2.6 GHz; 84 Intel Xeon Phi
Peak performance of 1.1 petaflops; 100.8 TB of main memory
2 PB of disk storage; 8.5 PB of archive storage
9th in Europe, 29th in the world (June 2013 Top500 List)
![Page 60: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/60.jpg)
60
Abstraction of computer middleware
[Figure: a layered middleware stack, one example per layer, e.g. NoSQL DBs, MapReduce, cloud, machine learning, Linux, FPGAs/GPUs, supporting applications such as genomics.]
![Page 61: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/61.jpg)
61
Ongoing Big Data projects - I
Spark: BSC is visible in the Spark community, leading the Barcelona Spark Meetup, in contact with Databricks (UC Berkeley), and turning MareNostrum into a Big Data platform.
BigIoT: an ultra-large vision of the Internet of Things (IoT). Website of IoT Europe, http://ioteurope.bsc.es/, a site where you can find research and companies working on IoT.
Optical networks & memories: new architectures developed using optical interconnections and optical memory.
Roadmap: bring together the key European hardware, networking, and system architects with the key producers and consumers of Big Data to identify the industry coordination points that will maximize European competitiveness.
DB analytics acceleration: focus on automatic scaling of complex analytics, while addressing the full requirements of real data sets.
![Page 62: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/62.jpg)
62
Ongoing Big Data projects – II
Big Data Application Monitoring: Design and implement a set of tools to monitor and trace network traffic for Hadoop applications
Software Defined Environments: Building cloud environments that embrace hardware and network heterogeneity to host a variety of workloads (JSA-SDN)
Cloud Recommendation: Enhance a decision support system providing guidance in the selection of the right set of Clouds, analysing the trade-offs between cost, reliability, risks and quality impacts
Hi-EST: Holistic Integration of Emerging Supercomputing Technologies. ERC Starting Grant, David Carrera
Fog Computing: Fog extends the cloud by implementing a distributed platform capable of processing data close to the end-points, making it an excellent platform for the Internet of Things (IoT). Pioneered by Mario Nemirovsky
![Page 63: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/63.jpg)
63
Ongoing Big Data projects - III
High Performance Key/Value Stores: Explore the use of high-performance key/value databases for fast persistent memory technologies
Management of Data Streaming Environments: Explore novel architectures for the emerging IoT stream processing platforms that provide data stream composition, transformation and filtering in real time
Extending PyCOMPSs to NoSQL DB: Integration of the COMPSs runtime with Cassandra, and exploration of scheduling policies driven by data locality
NoSQL Data Management Research: Design and implement a software layer that enables NoSQL databases to decouple data organization from the data model, and provides NoSQL databases with efficient multi-level indexing support
Improving Cost-effectiveness of Big Data Deployments: Develop mechanisms for an automated characterization of the cost-effectiveness of Hadoop deployments (runtime performance vs. software and hardware configuration choices)
Secured: An innovative architecture to achieve protection from Internet threats by offloading the execution of security applications into a programmable device at the edge of the network
![Page 64: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/64.jpg)
64
Ongoing Big Data projects - IV
Large-scale graph processing for RT: Use of probabilistic Bayesian graph algorithms for real-time applications, and their efficient parallel computation
Deep Learning: Enhance new deep neural network algorithms, such as autoencoders and Boltzmann machines, to create generative models
Multimedia Big Data Computing: Visual concept detection & annotation; scalable indexing of image and video; multimodal Big Data analytics
Master Thesis
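The autoencoder idea mentioned above – a network trained to reproduce its input through a narrower hidden layer – can be sketched in a few lines. This is a minimal illustration on toy data (all sizes and names are our own, not BSC code):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((200, 8))                 # toy data: 200 samples, 8 features

n_in, n_hidden = 8, 3                    # compress 8 features down to 3
W1 = rng.normal(0.0, 0.1, (n_in, n_hidden))
W2 = rng.normal(0.0, 0.1, (n_hidden, n_in))
lr = 0.05

def forward(X):
    H = np.tanh(X @ W1)                  # encoder
    return H, H @ W2                     # linear decoder (reconstruction)

loss_before = np.mean((forward(X)[1] - X) ** 2)
for _ in range(500):                     # plain gradient descent on MSE
    H, Xhat = forward(X)
    err = (Xhat - X) / len(X)
    gW2 = H.T @ err                      # gradient w.r.t. decoder weights
    gW1 = X.T @ ((err @ W2.T) * (1.0 - H ** 2))  # backprop through tanh
    W1 -= lr * gW1
    W2 -= lr * gW2
loss_after = np.mean((forward(X)[1] - X) ** 2)
print(f"reconstruction MSE: {loss_before:.4f} -> {loss_after:.4f}")
```

Training drives the reconstruction error down, forcing the 3-unit hidden layer to learn a compressed representation of the 8 input features.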
![Page 65: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/65.jpg)
65
Ongoing Big Data projects - V
Large-scale genome analysis: Use of statistical imputation theory to infer unknown genetic variants; large input data analysis and efficient parallel computation. Big-data analytics.
Large-scale input data analysis: Optimized DNA compression and indexing algorithms, such as BWA, suffix arrays, FM-indexing, etc., to compare DNA information across multiple patients. Big-data analytics.
– Somatic Mutation Finder on cancer genomes
– GWImp-COMPSs for large-scale imputation and genome-wide association studies
Development of new architectures for computational genomics tasks
Efficient deployment of Big-data genomic workflows on current and future platforms: HPC clusters, clouds, etc.
![Page 66: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/66.jpg)
66
Ongoing Big Data projects - VI
Processing & integrating urban data from heterogeneous data sources into GIS and different simulators
GrowSmarter: H2020 Lighthouse project on Smart Cities and sustainability. 12 smart solutions to meet the three different aspects of sustainability: economic, social and environmental.
Urban semantics: Our model is based on the concept of linked data and has several advantages: (1) the mapped data may be accessed uniformly, (2) the user can explore and navigate the metadata graph to understand its contents, (3) the user can query the model and visualize the result, and (4) data conflicts can be automatically detected.
![Page 67: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/67.jpg)
67
Ongoing Big Data projects: Global picture
[Diagram mapping the projects above to BSC departments (CS, CASE, LS, ES): NoSQL Data Management Research; Extending PyCOMPSs to NoSQL DB; High Performance Key/Value Stores; Improving Cost-effectiveness of Big Data Deployments; Management of Data Streaming Environments; Software Defined Environments; Big Data Application Monitoring; Large-scale graph processing for RT; Deep Learning; Multimedia Big Data Computing; GrowSmarter; Optical Network & Memories; BigIoT; Fog Computing; DB Analytics Acceleration; Roadmap]
![Page 69: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/69.jpg)
Big Data VOLUME @ BSC
Some examples:
Alya System Computational Mechanics code
Alya simulation software with particles module: 300 GB or more per simulation.
The MoDEL database is the largest molecular simulation database in Europe: 18 TB of data.
The Chronic Lymphocytic Leukemia (CLL) Genome Project, which aims to generate a comprehensive catalogue of genomic alterations: 1.5 PB of data.
Biomolecular simulation project with the Ascona B-DNA Consortium to generate a database containing structural and dynamic information of DNA sequences: 30 TB of data.
69
![Page 70: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/70.jpg)
Big Data VOLUME @ BSC
Problem:
It is difficult to share, manage and analyse such a large amount of data.
Solution:
Research at the Computer Science department, with a holistic approach at different layers:
Applications – Algorithms – Programming Models – Run Time – Architecture
70
![Page 71: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/71.jpg)
71
Computer Science strategy
Create a unified, productive, easy-to-use and efficient environment for our broad range of applications.
Big Data software stack @ BSC (main components):
App – Common API – COMPSs – NoSQL data management (Active Store, SKV, Wasabi, others) – hierarchical storage + computing resources
Resource management policies: data organization, query plans, computation scheduling
– Allow the use of NoSQL database technologies as a backend
– Facilitate access to Big Data storage from COMPSs applications, offering an abstraction to the programmer
– Data sharing platform based on self-contained objects (data + code)
– Use of a high-performance in-memory database architecture to accelerate workloads
![Page 72: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/72.jpg)
72
BSC Performance Tools
An efficient and resilient system is a crucial part of a Big Data solution
– An entire suite of performance tools and an analysis methodology are necessary to ensure it
– Needed to detect and understand structure and behavior, and to perform projections and extrapolations, …
A fundamental new component: Performance Analytics – a combination of techniques from many areas, such as signal processing, clustering, sequence alignment, image tracking, modeling, …
![Page 73: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/73.jpg)
Big Data collaborationwith IT industry @ BSC
Learning from Large & Sparse Graphs
• Algorithmic development: Machine Learning, Data mining, Pattern discovery/predictive analysis, …
• Applied to Neuroscience data, Behavior in Location Based Social Networks, …
Real-Time Monitoring of Image/Video Streaming
• Algorithmic development: Massive scale image replica detection, Massive scale image concept annotation, …
• MOU with SFY and CAR: Sport learning support system with personal cameras.
Citizen as a sensor
• Citizens, using their mobile devices, leave a digital footprint that can be gathered.
• Algorithmic development: Stream mining algorithms for massive real-time data, Adaptive clustering, Geo-aware DM methods, …
Deep Learning • Deep Learning attempts to mimic, in software, the activity in layers of neurons.
• In the last decade researchers made some fundamental conceptual breakthroughs, but computers were not powerful enough.
• Now we have the resources!
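A classic building block behind the stream-mining algorithms mentioned above is reservoir sampling, which keeps a uniform random sample of k items from an unbounded stream in O(k) memory. The slides name no specific algorithm, so this is an illustrative sketch of our own:

```python
import random

def reservoir_sample(stream, k, rng=random.Random(42)):
    """Keep a uniform sample of k items from a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)      # fill the reservoir first
        else:
            j = rng.randrange(i + 1)    # item i survives with probability k/(i+1)
            if j < k:
                reservoir[j] = item
    return reservoir

# A million "citizen sensor" readings, sampled with constant memory
sample = reservoir_sample(range(1_000_000), 5)
print(len(sample))  # 5
```

The key property is that memory stays O(k) no matter how long the stream runs, which is exactly what massive real-time data requires.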
73
![Page 74: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/74.jpg)
Big Data collaborationwith IT industry @ BSC
Systematic Study of Hadoop Deployment Variables to Enable Automated Characterization of Cost-Effectiveness
Use of high performance in-memory databases architecture to accelerate workloads.
Network-aware placement on "Software Defined Environments" (SDE) for HPC/Big Data
Create a Decision Support System using social information (stackoverflow, ….) for cloud services
74
SDE
ALOJA
BGAS
![Page 75: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/75.jpg)
Big Data relatedEuropean Projects @ BSC
Management of Big Data for Smart Cities & IoT.
Deliver a strategic roadmap for how technology advancements in hardware and networking can be exploited for Big Data analytics and beyond.
Accelerate "traditional" extremely large databases &
improve visualization using advanced hardware.
Offload security applications from the user terminal to a trusted and secure network node placed at the edge of the network.
75
![Page 76: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/76.jpg)
Energy European Projects& Big Data @ BSC
Parallel Distributed Infrastructure for Minimization of Energy
Install, configure, benchmark Hadoop on ARM prototypes
Network energy proportionality for Hadoop
Energy Benchmarking for Big Data (Modelling & Prediction)
76
![Page 77: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/77.jpg)
77
[Diagram: the BSC Big Data ecosystem at the intersection of HPC, Cloud Computing and (real-time) Analytics — BSC Programming Models, BSC Tools, BSC-CA Tech, SDE, ALOJA, BGAS, NoSQL Data Management, Citizen as a sensor, Deep Learning, Real-time Monitoring of Image Streaming, servIoTicy.com, Adaptive Scheduler, collaborations with CASE and Life Science]
![Page 78: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/78.jpg)
[The same ecosystem diagram, extended with Sustainable Computing and Smart Cities & IoT]
![Page 79: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/79.jpg)
www.bsc.es
Thank you !
![Page 80: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/80.jpg)
Emerging (non-volatile) memories
Columns grouped as: traditional technologies (DRAM, SRAM), improved Flash (NOR, NAND), emerging technologies (FeRAM, MRAM, PCM, Memristor, NEMS).

| | DRAM | SRAM | NOR | NAND | FeRAM | MRAM | PCM | Memristor | NEMS |
|---|---|---|---|---|---|---|---|---|---|
| Cell elements | 1T1C | 6T | 1T | 1T | 1T1C | 1T1R | 1T1R | 1M | 1T1N |
| Half pitch F (nm) | 50 | 65 | 90 | 90 | 180 | 130 | 65 | 3-10 | 10 |
| Smallest cell area (F^2) | 6 | 140 | 10 | 5 | 22 | 45 | 16 | 4 | 36 |
| Read time (ns) | < 1 | < 0.3 | < 10 | < 50 | < 45 | < 20 | < 60 | < 50 | 0 |
| Write/erase time (ns) | < 0.5 | < 0.3 | 10^5 | 10^6 | 10 | 20 | 60 | < 250 | 1 (140 ps-5 ns) |
| Retention time (years) | seconds | N/A | > 10 | > 10 | > 10 | > 10 | > 10 | > 10 | > 10 |
| Write op. voltage (V) | 2.5 | 1 | 12 | 15 | 0.9-3.3 | 1.5 | 3 | < 3 | < 1 |
| Read op. voltage (V) | 1.8 | 1 | 2 | 2 | 0.9-3.3 | 1.5 | 3 | < 3 | < 1 |
| Write endurance | 10^16 | 10^16 | 10^5 | 10^5 | 10^14 | 10^16 | 10^9 | 10^15 | 10^11 |
| Write energy (fJ/bit) | 5 | 0.7 | 10 | 10 | 30 | 1.5×10^5 | 6×10^3 | < 50 | < 0.7 |
| Density (Gbit/cm^2) | 6.67 | 0.17 | 1.23 | 2.47 | 0.14 | 0.13 | 1.48 | 250 | 48 |

Scalability notes (as in the source): voltage scaling – fairly scalable / no / poor / promising / promising; highly scalable / major technological barriers / poor / promising / promising / promising.
Sources: Chong et al ICCAD09, Eshraghian TVLSI11, ITRS13
![Page 81: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/81.jpg)
81
Introduction
• The appearance of writing and of numeral systems
• The Library of Alexandria, the printing press
• Amazon and Google…, from information to… suggesting new books to buy, and translating automatically between languages
• The example of Matthew F. Maury's navigation charts and his "twelve computers" (human clerks); his book "The Physical Geography of the Sea", 1855; 1.2 million data points; voyage durations cut by a third; Neptune…
• The US census: Herman Hollerith and the 1890 census, held every 10 years and projected to take 13 years to process; the 8 years needed for the 1880 census dropped to 1 for 1890… plus the capacity to process the information quickly
• Now we can generate enormous amounts of information and process it… the transistor is to blame
• How big is Big Data?
![Page 82: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/82.jpg)
BigData needs Intelligence!
Deep Learning: Building high-level abstractions
Watson: Reasoning
![Page 83: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/83.jpg)
MapReduce: example
Input Data → Mappers → Reducers → Output Data

Input:
  hamlet.txt: "to be or not"
  12th.txt: "be not afraid of greatness"
Map:
  (to, hamlet.txt), (be, hamlet.txt), (or, hamlet.txt), (not, hamlet.txt)
  (be, 12th.txt), (not, 12th.txt), (afraid, 12th.txt), (of, 12th.txt), (greatness, 12th.txt)
Reduce:
  afraid → (12th.txt)
  be → (12th.txt, hamlet.txt)
  greatness → (12th.txt)
  not → (12th.txt, hamlet.txt)
  of → (12th.txt)
  or → (hamlet.txt)
  to → (hamlet.txt)
Output: the inverted index above
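The inverted-index example above can be written directly as a map phase and a reduce phase. A minimal single-machine sketch (the filenames follow the slide; the function names are ours):

```python
from collections import defaultdict

def map_phase(filename, text):
    # Map: emit one (word, filename) pair per word occurrence
    return [(word, filename) for word in text.split()]

def reduce_phase(pairs):
    # Shuffle + reduce: group the filenames by word, deduplicated and sorted
    index = defaultdict(set)
    for word, filename in pairs:
        index[word].add(filename)
    return {word: sorted(files) for word, files in index.items()}

docs = {"hamlet.txt": "to be or not", "12th.txt": "be not afraid of greatness"}
pairs = [p for name, text in docs.items() for p in map_phase(name, text)]
index = reduce_phase(pairs)
print(index["be"])   # ['12th.txt', 'hamlet.txt']
```

In a real MapReduce system the mappers and reducers run on different machines and the framework performs the shuffle; the per-record logic is the same.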
![Page 84: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/84.jpg)
84
What is behind Watson?
IBM Watson – How to replicate Watson hardware and systems design for your own use in your basement. By Tony Pearson (ibm.co/Pearson), http://tinyurl.com/pt6mdfu
DeepQA: Massively Parallel Probabilistic Evidence-Based Architecture. For each question, it generates and scores many hypotheses using a combination of thousands of Natural Language Processing, Information Retrieval, Machine Learning and Reasoning algorithms. These gather, evaluate, weigh and balance different types of evidence to deliver the answer with the best evidence support it can find.
![Page 85: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/85.jpg)
85
Software tuning keyto exploit hardware!
![Page 86: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/86.jpg)
86
Internet & Big Data
![Page 87: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/87.jpg)
Challenges of data generation
Source: http://www-01.ibm.com/software/data/bigdata/
![Page 88: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/88.jpg)
88
![Page 89: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/89.jpg)
Graph500 and Top500 Lists: June 2014

Graph500
| # | Machine | Site | Cores | Problem size | GTEPS |
|---|---|---|---|---|---|
| 1 | K computer (Fujitsu – custom supercomputer) | RIKEN Advanced Institute for Computational Science (AICS) | 524,288 | 40 | 17,977.1 |
| 2 | Sequoia (IBM BlueGene/Q, Power BQC 16C 1.60 GHz) | Lawrence Livermore National Laboratory | 1,048,576 | 40 | 16,599 |
| 3 | Mira (IBM BlueGene/Q, Power BQC 16C 1.60 GHz) | Argonne National Laboratory | 786,432 | 40 | 14,328 |
| 4 | JUQUEEN (IBM BlueGene/Q, Power BQC 16C 1.60 GHz) | Forschungszentrum Juelich (FZJ) | 262,144 | 38 | 5,848 |
| 5 | Fermi (IBM BlueGene/Q, Power BQC 16C 1.60 GHz) | CINECA | 131,072 | 37 | 2,567 |

Top500
| # | Machine | Site | Cores | Rmax (TFlop/s) | Rpeak (TFlop/s) |
|---|---|---|---|---|---|
| 1 | Tianhe-2 (TH-IVB-FEP Cluster, Intel Xeon E5-2692 12C 2.200 GHz, Intel Xeon Phi 31S1P) | National Super Computer Center in Guangzhou | 3,120,000 | 33,862.7 | 54,902.4 |
| 2 | Titan (Cray XK7, Opteron 6274 16C 2.200 GHz, NVIDIA K20x) | Oak Ridge National Laboratory | 560,640 | 17,590.0 | 27,112.5 |
| 3 | Sequoia (IBM BlueGene/Q, Power BQC 16C 1.60 GHz) | Lawrence Livermore National Laboratory | 1,572,864 | 17,173.2 | 20,132.7 |
| 4 | K computer (Fujitsu SPARC64 VIIIfx 2.0 GHz) | RIKEN Advanced Institute for Computational Science (AICS) | 705,024 | 10,510.0 | 11,280.4 |
| 5 | Mira (IBM BlueGene/Q, Power BQC 16C 1.60 GHz) | Argonne National Laboratory | 786,432 | 8,586.6 | 10,066.3 |
![Page 90: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/90.jpg)
90
Limitations of MapReduce as a programming model?
• MapReduce is great, but not everyone is a MapReduce expert
  – "I am a Python expert but …"
• There is a class of algorithms that cannot be efficiently implemented with the MapReduce programming model
• Different programming models deal with different challenges
  – PyCOMPSs from BSC
  – Spark from Berkeley
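One example of the algorithm class that fits MapReduce poorly is anything iterative: each iteration becomes a full map/reduce job that re-reads and re-writes the whole dataset, which is exactly the cost that in-memory models such as Spark or PyCOMPSs avoid. A toy PageRank over a hypothetical three-page graph shows the per-iteration map/shuffle/reduce structure:

```python
from collections import defaultdict

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}   # hypothetical link graph
ranks = {page: 1.0 for page in links}

for _ in range(50):                       # one full map/reduce job per iteration
    contribs = defaultdict(float)         # shuffle target
    for page, outgoing in links.items():  # "map": emit (dest, contribution)
        for dest in outgoing:
            contribs[dest] += ranks[page] / len(outgoing)
    # "reduce": combine contributions per page, with 0.85 damping
    ranks = {page: 0.15 + 0.85 * contribs[page] for page in links}

print({page: round(r, 3) for page, r in ranks.items()})
```

On Hadoop, the 50 iterations above would mean 50 passes over disk; iterative models keep `ranks` in memory between rounds.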
![Page 91: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/91.jpg)
Commodity Solution –Cloud Oriented
What code runs on such a complex environment (many nodes, many disk volumes), with a short time-to-market deployment and on commodity HW?
91
![Page 92: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/92.jpg)
92
What happens if SW does not consider HW
• Terasort contest: sorting 100 TB of data
• Number 1: Hadoop
  • 2,100 nodes, 12 cores per node, 64 GB per node
  • 24,000 cores
  • 134 TB of memory
  • Time: 4,300 s
  • Cost on Amazon: $8,800
• Number 2: TritonSort
  • 52 nodes, 8 cores per node, 24 GB per node
  • 416 cores
  • 1.2 TB of memory
  • Time: 8,300 s and 6,400 s
  • Cost on Amazon: $294 and $226
• Hadoop is easy to program, but needs 57× more cores and 100× more memory, and only gets 2× performance
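The ratios quoted in the last bullet follow directly from the figures above; a quick check (the slide rounds them to 57×, 100× and 2×):

```python
# Figures taken from the slide above
hadoop = {"cores": 24_000, "mem_tb": 134.0, "time_s": 4_300, "cost_usd": 8_800}
triton = {"cores": 416,    "mem_tb": 1.2,   "time_s": 8_300, "cost_usd": 294}

core_ratio = hadoop["cores"] / triton["cores"]    # ~57.7x more cores
mem_ratio  = hadoop["mem_tb"] / triton["mem_tb"]  # ~112x more memory
speedup    = triton["time_s"] / hadoop["time_s"]  # ~1.9x faster

print(f"cores: {core_ratio:.0f}x  memory: {mem_ratio:.0f}x  speedup: {speedup:.1f}x")
```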
![Page 93: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/93.jpg)
Graph500 characterization
The Graph500 benchmark performs a Breadth-First Search over a random graph: it finds a spanning tree embedded in the graph, starting from a given vertex.
The benchmark application receives two parameters:
– Scale of the graph: the number of vertices is N = 2^SCALE.
– Edge factor, the number of edges per vertex: the number of edges is M = edgefactor × N.
The performance metric is TEPS (as opposed to FLOPS in the Top500):
– Traversed Edges Per Second counts the number of graph edge tuples traversed by the search algorithm, divided by its execution time.
There are multiple implementations of the BFS algorithm; the most scalable is the 'simple' one. We will focus on it in the following charts.
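A minimal sketch of what the benchmark measures — a BFS that builds a parent (spanning-tree) map and reports traversed edges per second — on a hypothetical toy graph rather than the benchmark's Kronecker generator:

```python
import time
from collections import deque

def bfs_teps(adj, root):
    """BFS from root; returns the spanning tree and a TEPS estimate."""
    parent = {root: root}            # spanning tree: vertex -> parent
    traversed = 0
    start = time.perf_counter()
    queue = deque([root])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            traversed += 1           # every scanned edge tuple counts
            if w not in parent:
                parent[w] = v
                queue.append(w)
    elapsed = time.perf_counter() - start
    return parent, traversed / elapsed   # TEPS = traversed edges / seconds

# Toy undirected graph with N = 2**3 vertices (SCALE = 3 in Graph500 terms)
adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4],
       4: [3, 5], 5: [4, 6], 6: [5, 7], 7: [6]}
tree, teps = bfs_teps(adj, 0)
print(len(tree))  # 8: every vertex was reached
```

The real benchmark times many BFS runs from random roots over a graph with billions of edges; the kernel structure is the same.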
![Page 94: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/94.jpg)
Evolution over time of the research paradigm
• In the last millennium, science was empirical: the description of natural phenomena
• A few centuries ago the theoretical approach appeared: models, formulas and generalizations
• In recent decades computational science appeared: the simulation of complex phenomena
• Science now focuses on the exploration of big data (eScience): the unification of theory, experiment and simulation
  • massive data captured by instruments, or generated through simulation, and processed by computer
  • knowledge and information stored in computers
  • scientists analyse databases and files on data infrastructures
Jim Gray, National Research Council, http://sites.nationalacademies.org/NRC/index.htm; Computer Science and Telecommunications Board, http://sites.nationalacademies.org/cstb/index.htm.
![Page 95: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/95.jpg)
95
Peering into the Big Data HW future
In the next 10 years, the HW will change more dramatically than it has in the past 10 years:
3D Stacking
Non-Volatile Memories
Dark Silicon
HW will influence the products and services that you provide
![Page 96: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/96.jpg)
Top10 (Top500, with power efficiency)
| Rank | Site | Computer | Cores (total / accel.) | Rmax (PFlop/s) | Rpeak (PFlop/s) | Power (MW) | GFlops/W | Name |
|---|---|---|---|---|---|---|---|---|
| 1 | National Super Computer Center in Guangzhou | TH-IVB-FEP Cluster, Intel Xeon E5-2692 12C 2.200 GHz, TH Express-2, Intel Xeon Phi 31S1P | 3,120,000 / 2,736,000 | 33.86 | 54.90 | 17.8 | 1.90 | Tianhe-2 (MilkyWay-2) |
| 2 | DOE/SC/Oak Ridge National Lab | Cray XK7, Opteron 6274 16C 2.20 GHz, Cray Gemini interconnect, NVIDIA K20x | 560,640 / 261,632 | 17.59 | 27.11 | 8.21 | 2.14 | Titan |
| 3 | DOE/NNSA/LLNL | BlueGene/Q, Power BQC 16C 1.60 GHz, custom | 1,572,864 | 17.17 | 20.13 | 7.89 | 2.18 | Sequoia |
| 4 | RIKEN Advanced Institute for Computational Science (AICS) | Fujitsu K computer, SPARC64 VIIIfx 2.0 GHz, Tofu interconnect | 705,024 | 10.51 | 11.28 | 12.65 | 0.83 | K |
| 5 | DOE/SC/Argonne National Laboratory | BlueGene/Q, Power BQC 16C 1.60 GHz, custom | 786,432 | 8.58 | 10.06 | 3.94 | 2.18 | Mira |
| 6 | CSCS | Cray XC30, Xeon E5-2670 8C 2.600 GHz, Aries interconnect, NVIDIA K20x | 115,984 / 73,808 | 6.27 | 7.79 | 2.32 | 2.70 | Piz Daint |
| 7 | Texas Advanced Computing Center | PowerEdge C8220, Xeon E5-2680 8C 2.700 GHz, Infiniband FDR, Intel Xeon Phi | 462,462 / 366,366 | 5.17 | 8.52 | 4.51 | 1.14 | Stampede |
| 8 | Forschungszentrum Juelich (FZJ) | BlueGene/Q, Power BQC 16C 1.60 GHz, custom | 458,752 | 5.00 | 5.87 | 2.30 | 2.18 | JUQUEEN |
| 9 | DOE/NNSA/LLNL | BlueGene/Q, Power BQC 16C 1.60 GHz, custom | 393,216 | 4.29 | 5.03 | 1.97 | 2.18 | Vulcan |
| 10 | Government | Cray XC30, Intel Xeon E5-2660v2 10C 2.2 GHz, Aries, NVIDIA K40 | 72,800 / 62,400 | 3.57 | 6.13 | 1.50 | 2.39 | |
![Page 97: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/97.jpg)
Top10 — Graph500, June 2014
| Rank | Site | Computer | Cores | Problem size | GTEPS |
|---|---|---|---|---|---|
| 1 | RIKEN Advanced Institute for Computational Science (AICS) | K computer (Fujitsu – custom supercomputer) | 524,288 | 40 | 17,977.1 |
| 2 | DOE/NNSA/LLNL, Lawrence Livermore National Laboratory | Sequoia (IBM BlueGene/Q, Power BQC 16C 1.60 GHz) | 1,048,576 | 40 | 16,599 |
| 3 | DOE/SC/Argonne National Laboratory | Mira (IBM BlueGene/Q, Power BQC 16C 1.60 GHz) | 786,432 | 40 | 14,328 |
| 4 | Forschungszentrum Juelich (FZJ) | JUQUEEN (IBM BlueGene/Q, Power BQC 16C 1.60 GHz) | 262,144 | 38 | 5,848 |
| 5 | CINECA | Fermi (IBM BlueGene/Q, Power BQC 16C 1.60 GHz) | 131,072 | 37 | 2,567 |
| 6 | Changsha, China | Tianhe-2 (MilkyWay-2) (National University of Defense Technology – MPP) | 196,608 | 36 | 2,061.48 |
| 7 | CNRS/IDRIS-GENCI | Turing (IBM BlueGene/Q, Power BQC 16C 1.60 GHz) | 65,536 | 36 | 1,427 |
| 7 | Science and Technology Facilities Council – Daresbury Laboratory | Blue Joule (IBM BlueGene/Q, Power BQC 16C 1.60 GHz) | 65,536 | 36 | 1,427 |
| 7 | University of Edinburgh | DIRAC (IBM BlueGene/Q, Power BQC 16C 1.60 GHz) | 65,536 | 36 | 1,427 |
| 7 | EDF R&D | Zumbrota (IBM BlueGene/Q, Power BQC 16C 1.60 GHz) | 65,536 | 36 | 1,427 |
![Page 98: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/98.jpg)
98
What are FPGAs?
FPGA: Field Programmable Gate Array
Generic sea of programmable logic and interconnect
Program into specialized computing & networking circuits
Can be reconfigured in as little as 100 ms
[Figure: FPGA floorplan — generic logic (LUTs), DSP multiplier blocks, 36 Kb dual-port RAMs, and specialized I/O blocks (network, PCIe)]
![Page 99: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/99.jpg)
99
Solutions
Generic Architectures
– High-End: Data Appliances on premise
– Commodity: Hadoop / SPARK / …
Architecture Specialization – IBM BGAS / Netezza
– Microsoft Catapult
![Page 100: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/100.jpg)
100
High-End Data Appliance
![Page 101: Social and Economic Challenges of Open Data and Big data”](https://reader034.fdocuments.net/reader034/viewer/2022052607/5874b4761a28abab268bd79b/html5/thumbnails/101.jpg)
101
IBM Netezza Appliance (> 50 TB)