Ase2010 shang

An Experience Report on Scaling Tools for MSR Studies Using MapReduce Weiyi Shang, Bram Adams, Ahmed E. Hassan Software Analysis and Intelligence Lab (SAIL) School of Computing, Queen’s University

Transcript of Ase2010 shang

Page 1: Ase2010 shang

An Experience Report on Scaling Tools for MSR Studies Using MapReduce

Weiyi Shang, Bram Adams, Ahmed E. Hassan

Software Analysis and Intelligence Lab (SAIL)

School of Computing, Queen’s University

Page 2: Ase2010 shang

Mining Software Repositories: Propagating code changes

Method A is changed.

Method A calls method B.

Method C calls method A.

Change methods B and C.

Method A is changed.

When method A is changed, method D is changed 90% of the time.

Change method D.

Not Enough

History helps!
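A rule such as "when method A is changed, method D is changed 90% of the time" can be estimated by counting co-changes in the commit history. The sketch below is a minimal illustration, not the tooling used in the study; the function name `co_change_confidence` and the ten-commit history are hypothetical.

```python
from collections import Counter
from itertools import permutations

def co_change_confidence(commits):
    """Return {(a, b): share of commits changing a that also change b}."""
    changed = Counter()   # number of commits touching each method
    together = Counter()  # number of commits touching both a and b
    for commit in commits:
        methods = set(commit)
        changed.update(methods)
        together.update(permutations(methods, 2))
    return {(a, b): n / changed[a] for (a, b), n in together.items()}

# Hypothetical history: 10 commits change A, 9 of them also change D.
history = [{"A", "D"}] * 9 + [{"A", "B"}]
conf = co_change_confidence(history)
print(conf[("A", "D")])  # 0.9
```

The measure is directional: in this made-up history every commit that changes D also changes A, so the confidence for (D, A) is 1.0 while the confidence for (A, D) is 0.9.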

Page 3: Ase2010 shang

Traditional pipeline for MSR studies

Software repositories:
• Source code history
• Bug database
• Mailing list
• System log

Data preparation (ETL):
• Extraction
• Transformation
• Loading

Data warehouse

Data analysis

Continues to grow

More complex algorithms

MSR studies must scale

Page 4: Ase2010 shang

Existing solutions to scale

powerful machines

ad hoc distributed computing

multi-threaded and multi-core

EXPENSIVE

LARGE PROGRAMMING EFFORT

NOT RE-USABLE

Page 5: Ase2010 shang

Example: D-CCFinder Clone Detector

40 days on 1 PC

52 hours on an 80-machine cluster

Page 6: Ase2010 shang

Web Analysis is similar to MSR studies

Large-scale data

Scan-centric

Rapidly evolving

Page 7: Ase2010 shang

Web-scale platforms

We believe that the MSR field can benefit from web-scale platforms to overcome the limitations of current approaches.

Page 8: Ase2010 shang

In our previous research

Hadoop is up to 3 times faster on a 4-machine cluster

Feasibility study using Hadoop to scale a software evolution study on Eclipse.

Page 9: Ase2010 shang

In this paper

1. Does MapReduce scale to other MSR studies and larger clusters?

2. What are the challenges and experiences of scaling MSR studies?

Page 10: Ase2010 shang

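An example of MapReduce: counting the frequency of word lengths

Data: good, hello, fish, cat, school, night, happy, dog

Map output (in input order):

Key   Value
4     good
5     hello
4     fish
3     cat
6     school
5     night
5     happy
3     dog

Grouped by key:

Key   Value
3     dog, cat
4     fish, good
5     hello, night, happy
6     school

Reduce output:

Key   Value
3     2
4     2
5     3
6     1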
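The word-length example on this slide can be reproduced in a few lines of plain Python. This is a single-process sketch of the Map, group and Reduce steps, not Hadoop code; the function names `map_phase`, `shuffle` and `reduce_phase` are made up for illustration.

```python
from collections import defaultdict

def map_phase(words):
    """Map: emit one (key, value) pair per word, keyed by word length."""
    return [(len(word), word) for word in words]

def shuffle(pairs):
    """Group values by key, as the framework does between Map and Reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: count the words collected under each length."""
    return {key: len(values) for key, values in sorted(groups.items())}

data = ["good", "hello", "fish", "cat", "school", "night", "happy", "dog"]
counts = reduce_phase(shuffle(map_phase(data)))
print(counts)  # {3: 2, 4: 2, 5: 3, 6: 1}
```

On a real cluster, each Map call and each per-key Reduce call could run on a different machine, because neither depends on the other records.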

Page 11: Ase2010 shang

Three large-scale MSR studies

• Software evolution study
– J-REX: code-change information abstractor for Java from line level to program entity level
• Code clone detection
– CCFinder: code clone detection tool
• Log analysis
– JACK: log analysis tool for detecting system anomalies during load testing

Page 12: Ase2010 shang

Experimental environment

CPU type                          #Machines  Memory size  Operating system
Intel Quad Core Q6600 (2.40 GHz)  18         3 GB         Ubuntu 8.04
8 Xeon (3.0 GHz)                  10         8 GB         CentOS 5.2

Page 13: Ase2010 shang

Input data

Data            Size     Data type       #Files
Eclipse         10.4 GB  CVS repository  189,156
Datatools       227 MB   CVS repository  10,629
FreeBSD         5.1 GB   source code     317,740
Log files No.1  9.9 GB   execution log   54
Log files No.2  2.1 GB   execution log   54

Page 14: Ase2010 shang

1. Does MapReduce scale to other MSR studies and larger clusters?

Page 15: Ase2010 shang

Software evolution and log analysis (J-REX, JACK): running time in minutes, 1 machine vs. SHARCNET (×10)

[Bar chart] 1 machine: 580 min; SHARCNET (×10): 98 min (×6)

[Bar chart] 1 machine: 755 min; SHARCNET (×10): 80 min (×9)
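The ×6 and ×9 factors follow from the running times on this slide (580 vs. 98 minutes and 755 vs. 80 minutes). The sketch below redoes that arithmetic and adds a parallel-efficiency figure. It assumes that "SHARCNET (×10)" means ten machines, which the slide does not state explicitly, and the helper names are hypothetical.

```python
def speedup(single_machine_minutes, cluster_minutes):
    """Speedup = running time on one machine / running time on the cluster."""
    return single_machine_minutes / cluster_minutes

def efficiency(single_machine_minutes, cluster_minutes, machines):
    """Share of the ideal linear speedup that was actually achieved."""
    return speedup(single_machine_minutes, cluster_minutes) / machines

MACHINES = 10  # assumption: "SHARCNET (×10)" denotes a 10-machine run

for single, cluster in [(580, 98), (755, 80)]:
    s = speedup(single, cluster)
    e = efficiency(single, cluster, MACHINES)
    print(f"{single} min -> {cluster} min: x{s:.1f}, efficiency {e:.0%}")
```

Under that assumption the two runs reach roughly 59% and 94% of linear speedup.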

Page 16: Ase2010 shang

Code clone detection

Can MapReduce scale up CCFinder?

Yes! 58 hours on an 18-machine cluster.

Page 17: Ase2010 shang

2. What are the challenges and experiences of scaling MSR studies?

Page 18: Ase2010 shang

Challenge 1: Locality of MSR analysis

Local analysis: Web, MSR

Semi-local analysis: MSR

Global analysis: MSR

Page 19: Ase2010 shang

Challenge 2: Granularity of MSR analysis

Fine-grained analysis: Web, MSR

Coarse-grained analysis: MSR

• Web community experience:
– #Map: 10 ~ 100 × #machines
– #Reduce: 0.95 or 1.75 × #CPU cores
• MSR experience:
– #Reduce tasks = #CPU cores (fine-grained analysis)
– #Reduce tasks = #input records (coarse-grained analysis)
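The task-count rules of thumb on this slide can be written down as small helper functions. This is a sketch of the heuristics as stated here, not a Hadoop API; all function names are hypothetical. The sample values (18 quad-core machines and 54 input records) are chosen only for illustration and do not describe a configuration reported in the talk.

```python
def web_map_task_range(machines):
    """Web experience: between 10 and 100 Map tasks per machine."""
    return 10 * machines, 100 * machines

def web_reduce_tasks(cpu_cores, factor=0.95):
    """Web experience: 0.95 or 1.75 Reduce tasks per CPU core."""
    if factor not in (0.95, 1.75):
        raise ValueError("the slide lists only the factors 0.95 and 1.75")
    return int(cpu_cores * factor)

def msr_reduce_tasks(cpu_cores, input_records, fine_grained):
    """MSR experience: one Reduce task per core (fine-grained analysis)
    or one per input record (coarse-grained analysis)."""
    return cpu_cores if fine_grained else input_records

machines, cores_per_machine = 18, 4   # illustrative cluster size
cores = machines * cores_per_machine  # 72 cores in total
print(web_map_task_range(machines))                     # (180, 1800)
print(web_reduce_tasks(cores))                          # 68
print(web_reduce_tasks(cores, 1.75))                    # 126
print(msr_reduce_tasks(cores, 54, fine_grained=True))   # 72
print(msr_reduce_tasks(cores, 54, fine_grained=False))  # 54
```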

Page 20: Ase2010 shang

Challenges of migrating MSR studies to MapReduce

1. Locality of MSR analysis
2. Granularity of MSR analysis
3. Locating a suitable cluster
4. Managing data during analysis
5. Recovering from errors

Page 21: Ase2010 shang

Questions?