Jianfeng Zhan: BigDataBench, Benchmarking Big Data Systems

INSTITUTE OF COMPUTING TECHNOLOGY
BigDataBench: Benchmarking Big Data Systems
Jianfeng Zhan
Computer Systems Research Center, ICT, CAS
CCF Big Data Technology Conference, 2013-12-06
http://prof.ict.ac.cn/jfzhan


BDTC 2013, Beijing, China


Page 2

Why Big Data Benchmarking?

Measuring big data architecture and systems quantitatively

Page 3

What is BigDataBench?
An open source project on big data benchmarking: http://prof.ict.ac.cn/BigDataBench/
• 6 real-world data sets and 19 workloads
  – To be extended in the near future
• 4V characteristics
  – Volume, Variety, Velocity, and Veracity

Page 4

Comparison of Big Data Benchmarking Efforts

Page 5

Possible Users of BigDataBench
• Architecture: processor, memory, networks, …
• Systems: OS for big data, file systems for big data, …
• Data management: …
• Performance optimization
• Co-design
• Distributed systems
• Scheduling
• Programming systems

Page 6

Research Publications

Zhen Jia, Lei Wang, Jianfeng Zhan, Lixin Zhang, and Chunjie Luo. Characterizing data analysis workloads in data centers. IISWC 2013. Best paper award.

Lei Wang, Jianfeng Zhan, et al. BigDataBench: a Big Data Benchmark Suite from Internet Services. HPCA 2014, Industry Session.

Page 7

Outline
1. Benchmarking Methodology and Decision
2. Case Study
3. How to Use
4. Future Work

Page 8

BigDataBench Methodology
[Diagram labels: 4V of Big Data; BigDataBench]

Page 9

Methodology (Cont'd)

Representative Data Sets
• Data sources: text data, graph data, table data, extended …
• Data types: structured, semi-structured, unstructured
• Big data sets preserving 4V
• Data generation tool preserving data characteristics

Diverse Workloads
• Investigate typical application domains
• Application types: offline analytics, realtime analytics, online services
• Basic and important operations and algorithms, extended …
• Represent software stack, extended …
• Big data workloads

BigDataBench

Page 10

Methodology (Cont'd)

4V of Big Data

System and architecture characteristics

Similarity analysis

BigDataBench

Page 11

Top Sites on the Web

More details at http://www.alexa.com/topsites/global;0

Search engines, social networks, and electronic commerce account for 80% of the page views of all Internet services.

Page 12

Workloads Chosen
• Cover workloads in diverse and representative application scenarios
  – Search engine, e-commerce, social network
• Pay equal attention to different application types
  – Online services, real-time analytics, offline analytics
• Include different data sources
  – Text data, graph data, table data
• Cover representative software stacks

Page 13

19 Chosen Workloads

Application scenarios:
• Micro benchmarks
• Basic datastore operations
• Relational queries
• Search engines
• Social networks
• E-commerce systems

Page 14

Data Generation Tools

Data sources: text, graph, and table
• Six real-world raw data sets

Synthetic data scale
• From GB to PB

Features
• Preserve the characteristics of real-world data

Page 15

Naïve Text Generator

[Diagram: words following a multinomial distribution; select a word randomly; documents]

Example words: big, data, architecture, system, CPU, memory, mining, benchmarking, evaluate, machine, learning

Models the text only at the word level.
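The word-level scheme on this slide can be sketched in a few lines of Python. This is only an illustration under assumptions, not the BigDataBench generator itself: the vocabulary reuses the example words from the slide, while the weights, the function name `generate_naive_document`, and the document length are hypothetical.

```python
import random

# Example words taken from the slide; the relative weights are made up
# for illustration and are NOT learned from any real data set.
VOCABULARY = ["big", "data", "architecture", "system", "cpu", "memory",
              "mining", "benchmarking", "evaluate", "machine", "learning"]
WORD_WEIGHTS = [5, 5, 2, 3, 3, 3, 2, 2, 1, 2, 2]


def generate_naive_document(vocabulary, word_weights, num_words, seed=None):
    """Draw each word independently from one multinomial over the vocabulary."""
    rng = random.Random(seed)
    # random.choices samples with replacement according to the given weights.
    words = rng.choices(vocabulary, weights=word_weights, k=num_words)
    return " ".join(words)


if __name__ == "__main__":
    print(generate_naive_document(VOCABULARY, WORD_WEIGHTS, num_words=12, seed=7))
```

Because every word is drawn independently, the output matches the word frequencies but carries no topical structure, which is the limitation the slide points out.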

Page 16

Improved Text Generator

[Diagram: topics (topic1, topic2, topic3) following a multinomial distribution; select a topic randomly; words following a multinomial distribution under the selected topic (topic2 in the figure); select a word randomly; document]

Example words: big, data, architecture, system, CPU, memory, mining, benchmarking, evaluate, machine, learning

Models the text at both the topic level and the word level.
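The two-level sampling on this slide (pick a topic, then pick a word under that topic) can be sketched as follows. This is a simplified illustration under assumptions: the topic weights, the word lists per topic, and the function name `generate_topic_document` are hypothetical, and the actual tool may fit these distributions to real data with a topic model rather than hard-coding them.

```python
import random

# Hypothetical topic-level multinomial (relative weights).
TOPIC_WEIGHTS = {"topic1": 5, "topic2": 3, "topic3": 2}

# Hypothetical word-level multinomials, one per topic.
TOPIC_WORDS = {
    "topic1": {"big": 3, "data": 3, "mining": 2},
    "topic2": {"cpu": 3, "memory": 2, "architecture": 2, "system": 1},
    "topic3": {"machine": 2, "learning": 2, "benchmarking": 1, "evaluate": 1},
}


def generate_topic_document(topic_weights, topic_words, num_words, seed=None):
    """For each position, sample a topic, then sample a word under that topic."""
    rng = random.Random(seed)
    topics = list(topic_weights)
    topic_w = [topic_weights[t] for t in topics]
    words = []
    for _ in range(num_words):
        topic = rng.choices(topics, weights=topic_w, k=1)[0]
        candidates = list(topic_words[topic])
        word_w = [topic_words[topic][w] for w in candidates]
        words.append(rng.choices(candidates, weights=word_w, k=1)[0])
    return " ".join(words)


if __name__ == "__main__":
    print(generate_topic_document(TOPIC_WEIGHTS, TOPIC_WORDS, num_words=12, seed=7))
```

Restricting the topic weights to a single topic yields documents drawn only from that topic's word distribution, which is an easy way to check the two levels separately.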

Page 17

Outline
1. Benchmarking Methodology and Decision
2. Case Study
3. How to Use
4. Future Work

Page 18

BigDataBench Case Study

BigDataBench use cases:
• Workload characterization
• Evaluating big data hardware systems
• Performance evaluation and diagnosis
• Networks for big data
• Energy efficiency of big data systems

Users: USTC and Florida International University; ICT, CAS; SIAT, CAS; CNCERT; OSU; SJTU and XJTU

http://prof.ict.ac.cn/BigDataBench/#users

Page 19

Testbed

Page 20

Workloads Analyzed

http://prof.ict.ac.cn/BigDataBench

Page 21

Floating Point Operation Intensity

Operation intensity is the total number of (floating point or integer) instructions divided by the total number of memory access bytes in a run of a workload.

Big data workloads have very low floating point operation intensities (0.009), two orders of magnitude lower than the theoretical value of a state-of-practice CPU (1.8).

[Figure groups: Data Analytics, Services]
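The definition of operation intensity above is a simple ratio, sketched below. The function names and the counter values are hypothetical and chosen only so that the result lands on the 0.009 figure quoted on the slide; in practice the counts would come from hardware performance counters.

```python
def operation_intensity(instruction_count, memory_access_bytes):
    """Instructions of one kind (floating point or integer) per byte of memory access."""
    if memory_access_bytes <= 0:
        raise ValueError("memory_access_bytes must be positive")
    return instruction_count / memory_access_bytes


def integer_to_fp_ratio(integer_instructions, fp_instructions):
    """Ratio of integer instructions to floating point instructions."""
    if fp_instructions <= 0:
        raise ValueError("fp_instructions must be positive")
    return integer_instructions / fp_instructions


if __name__ == "__main__":
    # Hypothetical counters for one run of a workload (not measured data).
    fp_instructions = 9_000_000
    memory_bytes = 1_000_000_000
    fp_intensity = operation_intensity(fp_instructions, memory_bytes)
    print(f"FP operation intensity: {fp_intensity}")
    # Gap to the theoretical CPU value of 1.8 quoted on the slide.
    print(f"Theoretical / measured: {1.8 / fp_intensity:.0f}x")
```

With these illustrative numbers the gap is a factor of about 200, consistent with the "two orders of magnitude" wording on the slide.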

Page 22

Instruction Breakdown

Fewer floating point operations, more integer operations.

[Figure groups: Data Analytics, Services]

Page 23

Ratio of Integer to Floating Point Operations

The average ratio for big data workloads is 100; for PARSEC, HPCC, and SPECFP it is 1.4, 1.0, and 0.67, respectively.

[Figure groups: Data Analytics, Services]

Page 24

Integer Operation Intensity

The average integer operation intensity of big data workloads is 0.49.

That of PARSEC, HPCC, and SPECFP is 1.5, 0.38, and 0.23, respectively.

[Figure groups: Data Analytics, Services]

Page 25

Cache Behaviors
• Big data workloads have higher L1I misses than HPC workloads.
• Data analysis workloads have better L2 cache behavior than service workloads, except BFS.
• Big data workloads have good L3 behavior.

[Figure groups: Data Analytics, Services]

Page 26

TLB Behaviors
• ITLB misses of big data workloads are higher than those of HPC workloads.
• DTLB misses of big data workloads are higher than those of HPC workloads.

[Figure groups: data analysis, service]

Page 27

BigDataBench Case Study

BigDataBench use cases:
• Big data workload characterization
• Evaluating big data hardware systems
• Performance evaluation and diagnosis
• Networks for big data
• Energy efficiency of big data systems

Users: USTC and Florida International University; ICT, CAS; SIAT, CAS; CNCERT; OSU; SJTU and XJTU

http://prof.ict.ac.cn/BigDataBench/#users

Page 28

Evaluating Big Data Hardware Systems

Page 29

Experimental Platforms

Basic Information
• Xeon (common processor)
• Atom (low-power processor)
• Tilera (many-core processor)

Brief Comparison
CPU Type        Intel Xeon E5310     Intel Atom D510      Tilera TilePro36
CPU Core        4 cores @ 1.6GHz     2 cores @ 1.66GHz    36 cores @ 500MHz
L1 I/D Cache    32KB                 24KB                 16KB/8KB
L2 Cache        4096KB               512KB                64KB

Page 30

Experimental Platforms

Hadoop Cluster Information
• Xeon VS Atom comparison (the same logical core number): [1 Xeon master + 7 Xeon slaves] VS [1 Atom master + 7 Atom slaves]
• Xeon VS Tilera comparison (the same logical core number): [1 Xeon master + 7 Xeon slaves] VS [1 Xeon master + 1 Tilera slave]
• Hadoop setting: following the guidance on the Hadoop official website

Page 31

Benchmark Selection (BigDataBench 1.0)

Application    Time Complexity    Characteristics
Sort           O(n*log2(n))       Integer comparison
WordCount      O(n)               Integer comparison and calculation
Grep           O(n)               String comparison
Naïve Bayes    O(m*n)             Floating-point computation
SVM            O(n^3)             Floating-point computation

Page 32

Metrics
• Performance: data processed per second (DPS)
• Energy efficiency: Application Performance Power Usage Effectiveness (DPJ)
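The two metrics can be computed from a run's input size, elapsed time, and power draw, as in the sketch below. The slide does not spell out DPJ, so this assumes it means data processed per joule, with energy approximated as average power multiplied by elapsed time; the function names and sample values are hypothetical.

```python
def data_processed_per_second(bytes_processed, elapsed_seconds):
    """DPS: amount of input data processed per second of run time."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return bytes_processed / elapsed_seconds


def data_processed_per_joule(bytes_processed, average_power_watts, elapsed_seconds):
    """DPJ (assumed meaning): data processed per joule of energy consumed.

    Energy is approximated as average power (W) times elapsed time (s).
    """
    energy_joules = average_power_watts * elapsed_seconds
    if energy_joules <= 0:
        raise ValueError("energy must be positive")
    return bytes_processed / energy_joules


if __name__ == "__main__":
    # Hypothetical run: 10 GB processed in 200 s at an average of 250 W.
    size = 10 * 10**9
    print(f"DPS: {data_processed_per_second(size, 200):.0f} bytes/s")
    print(f"DPJ: {data_processed_per_joule(size, 250, 200):.0f} bytes/J")
```

Under this reading, DPJ equals DPS divided by average power, so a slower platform can still score higher on DPJ if it draws proportionally less power.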

Page 33

Xeon VS Atom: DPJ

Page 34

Xeon VS Tilera: DPJ

Page 35

Reference

Jing Quan (University of Science and Technology of China), Yingjie Shi (Chinese Academy of Sciences), Ming Zhao (Florida International University), and Wei Yang (University of Science and Technology of China). "The Implications from Benchmarking Three Different Data Center Platforms." The First Workshop on Big Data Benchmarks, Performance Optimization, and Emerging Hardware (BPOE 2013), in conjunction with the 2013 IEEE International Conference on Big Data (IEEE Big Data 2013).

Page 36

Outline
1. Benchmarking Methodology and Decision
2. Case Study
3. How to Use
4. Future Work

Page 37

BigDataBench Class
• For architecture: 19 of 19 workloads
• For OS: 19 of 19 workloads
• For runtime environment (Hadoop): 9 of 19 workloads
  – Sort, Grep, WordCount, PageRank, Index, Kmeans, Connected Components, Collaborative Filtering, and Naive Bayes
• For data management: 6 of 19 workloads
  – Read, Write, Scan, Select Query, Aggregate Query, and Join Query

Page 38

BigDataBench Class: Data Sources
• Text related: 6 of 19 workloads
  – Sort, Grep, WordCount, Index, Collaborative Filtering, and Naive Bayes
• Graph related: 4 of 19 workloads
  – BFS, PageRank, Kmeans, and Connected Components
• Table related: 9 of 19 workloads
  – Read, Write, Scan, Select Query, Aggregate Query, Join Query, Nutch Server, Olio Server, and Rubis Server

Page 39

BigDataBench Class: Application Types
• Online services: 6 of 19 workloads
  – Read, Write, Scan, Nutch Server, Olio Server, and Rubis Server
• Offline analytics: 10 of 19 workloads
  – Sort, Grep, WordCount, BFS, PageRank, Index, Kmeans, Connected Components, Collaborative Filtering, and Naive Bayes
• Realtime analytics: 3 of 19 workloads
  – Select Query, Aggregate Query, and Join Query

Page 40

BigDataBench Class: Application Domains
• Search engine related (Basic Operations + Search Engine): 7 of 19 workloads
  – Sort, Grep, WordCount, BFS, PageRank, Index, and Nutch Server
• Social network related (Basic Cloud OLTP + Basic Relational Query + Social Network): 9 of 19 workloads
  – Read, Write, Scan, Select Query, Aggregate Query, Join Query, Olio Server, Kmeans, and Connected Components
• E-commerce related (Basic Cloud OLTP + Basic Relational Query + E-commerce): 9 of 19 workloads
  – Read, Write, Scan, Select Query, Aggregate Query, Join Query, Rubis Server, Collaborative Filtering, and Naive Bayes
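The data-source and application-type classes listed on the "BigDataBench Class" slides can be encoded as a small lookup table for picking a subset of workloads. This is only a sketch: the workload names and class memberships are transcribed from the slides, while the dictionary layout and the helper `select_workloads` are hypothetical and not part of BigDataBench.

```python
# workload -> (data source, application type), transcribed from the slides.
WORKLOAD_CLASSES = {
    "Sort": ("text", "offline analytics"),
    "Grep": ("text", "offline analytics"),
    "WordCount": ("text", "offline analytics"),
    "Index": ("text", "offline analytics"),
    "Collaborative Filtering": ("text", "offline analytics"),
    "Naive Bayes": ("text", "offline analytics"),
    "BFS": ("graph", "offline analytics"),
    "PageRank": ("graph", "offline analytics"),
    "Kmeans": ("graph", "offline analytics"),
    "Connected Components": ("graph", "offline analytics"),
    "Read": ("table", "online services"),
    "Write": ("table", "online services"),
    "Scan": ("table", "online services"),
    "Nutch Server": ("table", "online services"),
    "Olio Server": ("table", "online services"),
    "Rubis Server": ("table", "online services"),
    "Select Query": ("table", "realtime analytics"),
    "Aggregate Query": ("table", "realtime analytics"),
    "Join Query": ("table", "realtime analytics"),
}


def select_workloads(data_source=None, application_type=None):
    """Return workload names matching the given classes (None means 'any')."""
    selected = []
    for name, (source, app_type) in WORKLOAD_CLASSES.items():
        if data_source is not None and source != data_source:
            continue
        if application_type is not None and app_type != application_type:
            continue
        selected.append(name)
    return selected


if __name__ == "__main__":
    print(select_workloads(data_source="graph"))
    print(select_workloads(application_type="realtime analytics"))
```

Filtering on both arguments at once narrows the suite further, for example to the table-based online services.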

Page 41

Outline
1. Benchmarking Methodology and Decision
2. Case Study
3. How to Use
4. Future Work

Page 42

Near Future Work
• Multi-media data
• Deep learning workloads
• HPC
• Refine BigDataBench

Page 43

Related Resources
• BigDataBench project: http://prof.ict.ac.cn/BigDataBench
• BPOE workshop: http://prof.ict.ac.cn/bpoe
  – A series of workshops on Big Data Benchmarks, Performance Optimization, and Emerging Hardware
  – BPOE-4: interaction among OS, architecture, and data management
  – Co-located with ASPLOS 2014

Page 44

BPOE-4 SC
• Christos Kozyrakis, Stanford
• Xiaofang Zhou, University of Queensland
• Dhabaleswar K Panda, Ohio State University
• Raghunath Nambiar, Cisco
• Lizy K John, University of Texas at Austin
• Xiaoyong Du, Renmin University of China
• H. Peter Hofstee, IBM Austin Research Laboratory
• Ippokratis Pandis, IBM Almaden Research Center
• Alexandros Labrinidis, University of Pittsburgh
• Bill Jia, Facebook
• Jianfeng Zhan, ICT, Chinese Academy of Sciences

Page 45

Thanks