Detecting Thread-Safety Violations in Hybrid OpenMP/MPI ...lwang/papers/Cluster2015.pdfthread-safety...

Detecting Thread-Safety Violations in Hybrid OpenMP/MPI Programs

Hongyi Ma, Liqiang Wang, and Krishanthan Krishnamoorthy

Department of Computer Science, University of Wyoming. {hma3, lwang7, kkrishna}@uwyo.edu

Abstract—Hybrid MPI/OpenMP programming is be-coming an increasingly popular parallel programmingmodel on High Performance Computing (HPC), wheremultiple OpenMP threads could execute within a singleMPI process. Such a hybrid model is restricted byseveral rules stated in the MPI standard for cor-rectness. However, it is very tricky to ensure hy-brid MPI/OpenMP programs to be thread-safe. Thereare several concurrency problems happen in hybridMPI/OpenMP model, such as data race on source andtag information inside of MPI calls and wrong wayto synchronize MPI calls using threads. Concurrencyerrors in MPI/OpenMP model are very difficult todebug due to the complex mixed semantics by MPIroutines and OpenMP directives. So developing anuseful and efficient tool is necessary to help usersdebug errors in the hybrid MPI/OpenMP model. Inthis paper, we propose an approach by integratingstatic and dynamic analyses to check the potentialproblems of hybrid MPI/OpenMP programs in orderto ensure correctness. In order to obtain thread levelinformation, we instrument monitored variables inthe MPI calls in order to obtain both thread andprocess runtime information. In our approach, thestatic analysis would help to reduce unnecessary codeinstrumentation during runtime detection and createthe variable checklist for thread-safety checking. Indynamic analysis, both happen-before and lockset-based classical dynamic algorithms are used to checkconcurrency problems. Static analysis firstly find themonitored variables and MPI calls using control flowgraph, then dynamic analysis would take code in-strumentation for verifying concurrent races in thesemonitored variables, at last, our tool would merge theconcurrency reports from dynamic analysis analysisto check whether there are violations to thread-safetyspecifications. Our experimental evaluation over real-world applications shows that our approach is accurateand efficient. The observed average overhead is around30% using our test experiment setup.

Keywords-Hybrid MPI/OpenMP, Concurrency, Vio-lations, Thread-safety

I. INTRODUCTION

MPI/OpenMP are the two most popular pro-gramming models in High-Performance Computing(HPC). MPI is based on a distributed memory modeland supports large-scale processes across computenodes. OpenMP is based on a shared memory model

and usually fork multiple threads within a process. Itis not easy to write a correct program using MPI andOpenMP due to the tricky semantics of MPI routinesand OpenMP directives.

There are two common problems for MPI pro-grams: message race and deadlock. In MPI pro-grams, the process communicate with each otherthrough message-passing and those messages mayarrive at a process in a non-deterministic orderby variations in process scheduling and networklatency. If two or more messages are sent overcommunication channels on which a receive listens,and they simultaneously in transit without guaran-teeing the order of their arrivals, then a messagerace [14] occurs in the receive event and causesnondeterministic execution of the program. The un-expected message race is difficult for a developerto debug or reproduce without exploring the fullprogram state space. Although most message racesare benign without breaking runtime execution, somemay lead to incorrect computation and violation ofuser defined assertions. Deadlock is a more commonproblem for MPI programs, which is often caused byimproper usage of incompatibility of MPI semantics.But in this paper, we only care about how to detectthese thread-safety issues instead of pure MPI errors,since some existing work for MPI errors alreadypresented these issues.

The one of current major techniques to detect mes-sage races is using dynamic model checking methodto replay all different interleavings of MPI applica-tion, then check whether there is non-deterministicorder for messages receiving. For deadlock, the dy-namic graph-based method is used to detect whetherthere is a state circle inside of execution.

The OpenMP Specifications 3.0[17] briefly re-views the memory consistency model in OpenMP,which is supported by the commodity hardware havebeen developed in the past few years. Hoeflinger etal [8] presents a thorough discussion of the OpenMPmodel. Sarita Adve et al [3] covers both the sequen-tial consistency model and weak memory model indetails. These model structure make the concurrency

error happening in OpenMP. A race condition occurswhen two or more concurrent threads perform con-flicting accesses (i.e., accesses to the same sharedvariable and at least one access is a write) andthe threads use no explicit mechanism to preventthe accesses from being simultaneous.The majortechniques to check OpenMP errors include usinghappen-before dynamic analysis algorithm [16] toanalyze runtime execution order, or using symbolicexecution [13] to check whether there is a executiontrace can have nondeterministic results, like updatingwith different values by at least two threads. Re-cently, concolic execution is used to simulate inter-leaving execution order of multithreaded programs[4].

MPI-2 supports forking multiple threads in anMPI process, which enables a hybrid MPI/OpenMPprogramming model [6]. However, implementingthread-safety in MPI is not easy because enforcingproper synchronization among threads and processesis very tricky. For example, by default, a hybridMPI/OpenMP program allows only the master threadto execute within one process if there is no explicitmulti-thread specification in MPI_Init() (the cur-rent one should MPI_Init_Thread(). In Figure1, only MPI_Send or MPI_Recv is executed, butnot both. It is difficult to check the error becausethere is no compilation errors or warning before run-ning. Even there is no problem in the specification ofthreads in MPI process, but errors may still arise dueto improper MPI communication on threads level.For example, Figure 2 shows that two processesare running with two threads in each process. Adeadlock may happen nondeterministically in someexecutions (but not always). Such a scenario violatesthe thread-safety specification, which requires thatall arrival messages within an MPI process should bedifferentiated by their tags. Messages with differenttags will be handled by different threads.In theexample of Figure 2, some MPI_Recv could beblocked because the corresponding thread does notobtain an arrival message, as all arrival messagesare not differentiated because of the same tag value.A common solution is to use thread ID as tag todistinguish these messages.

In this paper, we propose a hybrid approach tocheck thread-safety violations for MPI/OpenMP pro-grams that utilizes the results from static analysisto guide the dynamic analysis on error detection.Specifically, our paper makes the following contri-butions.

• Our approach uses a static analysis that

MPI_Init();omp_set_num_threads(2);#pragma omp parallel{#pragma omp sections{

#pragma omp sectionif (rank == 0 )

MPI_Send(rank1);#pragma omp sectionif (rank ==0)

MPI_Recv(rank1);}

}

Figure 1. A Hybrid MPI/OpenMP case study 1

MPI_Init_thread(0,0,MPI_THREAD_MULTIPLE, &provided);

MPI_Comm_rank(MPI_COMM_WORLD, &rank);int tag=0;omp_set_num_threads(2);

#pragma omp parallel for private(i)for(j = 0; j < 2; j++) {

if(rank==0) {MPI_Send(&a, 1, MPI_INT, 1, tag,

MPI_COMM_WORLD);MPI_Recv(&a, 1, MPI_INT, 1, tag,

MPI_COMM_WORLD,MPI_STATUS_IGNORE);

} // rank == 0if(rank==1) {

MPI_Recv(&a, 1, MPI_INT, 0, tag,MPI_COMM_WORLD,MPI_STATUS_IGNORE);

MPI_Send(&a, 1, MPI_INT, 0, tag,MPI_COMM_WORLD);

} // rank == 1}

Figure 2. Hybrid MPI/OpenMP case study 2.

can statically detect potential unsafe hybridMPI/OpenMP programming styles and moni-tored and can reduce some error-free regionchecking for dynamic runtime analysis.

• Our approach uses a static analysis that canreport and statistically provide all possible codelocations that are involved in errors in HybridOpenMP/MPI programs.

• This approach requires little human interventionor annotation, imposes lightweight performanceoverhead and produces more precise error re-port than the purely static and dynamic datarace detection approaches for MPI/OpenMP.

In this paper, we summarize all concurrency errors

in hybrid MPI/OpenMP programs and categorizethem into different scenarios. We propose an ap-proach by integrate static and dynamic programanalysis to check these concurrency errors. Specifi-cally, we instrument monitored variables within MPIruntime libraries to specify write operations for theseshared variable. Our approach first utilizes staticanalysis to generate the control flow graph of thesource code, and to retrieve the arguments in MPIcalls inside of OpenMP parallel region, since thread-safety violations only happen in hybrid program-ming region of source code. We set up monitoredvariable using our instrumented function to initializemonitored variables for dynamic analysis. Second,linking to our instrumented MPI wrapper duringthe runtime execution, after getting result from dy-namic concurrency analysis, our tool would checkmonitored variables with the other thread-safety ar-gument list to see whether they have violations inhybrid MPI/OpenMP program. In order to showour accuracy and efficiency, we artificially modifiedseveral testing benchmarks from NPB-MZ hybridMPI/OpenMP benchmark for testing, also we runIntel Thread Checker and Marmot for comparisonwith our approach.

We build our test bed in Amazon EC2 cloudplatform [1] with 32 instance, each instance has4 cores. Since we only monitor the variables forissues in hybrid MPI/OpenMP,lots of variables incomputation are not covered by detection duringthis study. In our approach overhead is very lowand runtime error detection is more efficient thanmonitoring all writing and reading operations inprogram. The highest overhead we observed is 45%using 64 processes. Since we tested our approachusing NAS parallel benchmark [15],these well-testedbenchmarks do not have thread-safety issues men-tioned in this paper.So we artifically implementedseveral tricky errors inside of these benchmarks forthe accuracy testing in our approach. Based on theexperiments observation, our approach would detectall the violations we constructed in the applica-tion with less overhead, which is 50% less whencompared with the Intel Thread Checker [18] andMarmot[6].

Our paper is organized as follows. Section IIprovides existing work for detecting errors inMPI/OpenMP. Section III provides the thread-safetyspecifications in MPI and constraints for these spec-ifications in our approach. Section IV describes theworkflow of our approach, and details the imple-mentation of the static and dynamic analyses to

check traces for thread-safety. Section V presents ourexperimental evaluation over a set of benchmarksand real world applications. Section VI concludesour results and provides directions for future work.

II. RELATED WORKS AND BACKGROUND

In recent years, there are several existing toolsdeveloped for the error detections in both of MPIand OpenMP programs. But to the best of ourknowledge, there are no tools to ensure thread-safetyin hybrid MPI/OpenMP programs which can reportviolations and locate the issues in programs.

In these existing tools, both static and dynamicmethods are used to check parallel programs. Whenit comes to static program analysis approaches, suchas model checking and symbolic execution, oftensuffer state explosion problem due to checking com-binatorial number of schedules or reachable states.Like the MPI-SPIN [22], it utilizes model checkingand symbolic execution method to detect deadlock.If a property of execution schedule is violated byexploring reachable states of the model, then anexplicit violation example is returned to the user withexecution traces.

When it comes to dynamic analysis, we canmention DAMPI [5], Marmot[6], Umpire[23], MPI-CHECK [20], and MUST [7] . Umpire, Marmotand MUST rely on a dynamic analysis of MPIcalls instrumented through the MPI profiling inter-face (PMPI). Like, Marmot and MPI-CHECK, theyutilize a timeout approach to detect deadlock, andMUST detects deadlock with a scheduling checking.The timeout approach in above tools sometimeswould to produce false positives. When it comesto DAMPI, it uses a scalable algorithm based onLamport Clocks (vector clocks focused on call order)to capture possible non deterministic matches. Foreach MPI collective operation, participating pro-cesses update their clock, based on operation se-mantics. Umpire, which relies on dependency graphswith additional edges for collective operations todetect deadlocks. In Marmot, an additional MPIprocess performs a global analysis of function callsand communication patterns. When it comes to MPI-CHECK [20], it instruments the source code atcompile time adding extra arguments to MPI callsand it allows collective verification for the full MPIstandard. For Umpire, it monitors the MPI operationsof an application by interposing itself between theapplication and the MPI runtime system using theMPI profiling, which increases the overhead in thecommunication between processes.

Several research works have been proposed to ad-dress the error detection issues in OpenMP program.One of the earliest work is the Intel Thread checker[2] which rewrites the program binary code withadditional intercepting instructions to monitor theprogram serial execution and infer possible parallelexecution events of the program. However, it lacksspecific knowledge about the OpenMP program andis unable to consider the happens-before relation-ship when checking for the data race in OpenMPprograms.Young-Joo Kim et al [10] designed a prac-tical tool utilizes the on-the-fly dynamic monitoringto detect the data races in OpenMP programs. Mun-Hye Kang et al [9] presents a new tool that focuseson the detection of first data races which are conflict-ing accesses with no explicit happen-before order inOpenMP programs. The tool first executes the in-strumented program in order to obtain the conflictingaccesses. Later, it refines the conflicting accesses thatare involved in the first data races, by rerunning theprogram with happened-before analysis. In this paperwe show that, our approach is different from othertools. We combine the static analysis results in orderto assist on the fly monitoring, and our experimentsreveal that the approach is scalable and efficient.

After static and dynamic analysis have been com-bined for multi-threaded programs, Lee et la [11]presents a similar two-phase static/dynamic inter-procedural and inter-thread escape analysis. Ourprevious approach[13] utilized symbolic executionorder for simulating the OpenMP multi-threadedruntime behaviors, but this approach has to confrontinterleaving exploration issues when many threadsare encoded in SMT solver.

Our approach performs a static analysis followedby more accurate and faster online dynamic analysiswhich integrates the information from static analysis.Our tool combines the static analysis for reducing theoverhead and providing the check variables for dy-namic analysis. In order to improve dynamic analysisaccuracy, and using static analysis to retrieve runtimehybrid MPI/OpenMP information without modifica-tion semantics on source code level. Then we utilizeclassical lockset analysis and happen-before analysisto check the concurrency of monitored variables,after merging these concurrency reports into thread-safety specification argument list, violations reportwould be ready for generation in our tool.

III. THREAD-SAFETY IN THE HYBRIDMPI/OPENMP PROGRAMMING

The MPI standard allows using multiple threadswithin an MPI process, which is restricted byseveral rules stated in the MPI standard. Thiswas rened in the MPI-2 standard by introduc-ing thread support. In order to support threads,MPI should be initialized using the followingway: (1) MPI_THREAD_SINGLE indicates thatonly one thread is used in MPI process; (2)MPI_THREAD_FUNNELED means that multiplethreads can exist, but only the main thread cancall MPI routines; MPI_THREAD_SERILIZED al-lows multiple threads, but only one thread can callan MPI routine at a time in the current process;(4) MPI_THREAD_MULTIPLE allows that multiplethreads call MPI routines without restriction. Thethread-safety of hybrid MPI/OpenMP programs isvital, otherwise the performance of normal threadbehavior can show incorrect.

The correct hybrid MPI/OpenMP applicationshould specify a desired thread level and passes itto the MPI implementation using appropriate aboveMPI-Thread representations to return to thread levelfrom process level.

A. Thread-safety Properties

According to [6], the violations to thread-safety properties of hybrid MPI/OpenMPprograms can be categorized as follows.Totally, we have 6 violations need to detect,which are isInitializationViolation,isMPIFinalizationVoilation,isConcurrentRecvVoilation ,isConcurrentRequestViolation, isProbeViolation andisCollectiveCallViolation . concurrentfunction is results for monitored variables fromdynamic concurrency analysis procedure usinglockset analysis and happen-before analysis. thereturn value of Concurrent(a) is true, which meansa has concurrent execution issues on variable a, otherwise, there is no concurrency issues onvariable a . In order to detect concurrency executionissues of MPI calls at thread level, this concurrentfunction is for monitoring these variables srctmp,tagtmp, commtmp, collectivecalltmp and requesttmpand finializetmp in MPI calls at thread level. Themonitored variable would be introduced moreclearly in the MPI wrapper implementation section,since these are inserted into monitored MPI callsfor thread-safety violation detection.

• Initialization Violation: Executing MPIcalls within threads should follow thespecification in MPI thread initialization.For example, if MPI_THREAD_SINGLE isspecified, omp parallel returns false. IfMPI_THREAD_SERIALIZED is specified,it is not allowed to call MPI routines in twoconcurrent threads.isInitializationViolation =MPI THREAD SINGLE == true ∧ompparallel == true) ∧(MPI THREAD SERIALIZED == true∧ (Concurrent(srctmp) ∨Concurrent(tagtmp) ∨Concurrent(commtmp) ∨Concurrent(requesttmp) ∨Concurrent(collectivecalltmp))∧ (MPI THREAD FUNNELED == true∧ MPI IS THREAD MAIN() == false)

• Concurrent MPI finalizing violation:MPI_Finalize should be called in themain thread at each process. In addition,prior to call MPI_Finalize, each threadneeds to finish existing MPI calls under itsown thread to make sure there is no pendingcommunication.isMPIFinalizationVoilation =MPI IS THREAD MAIN() == false ∧mpitypetk = MPIFinialize) ∨Concurrent(terminationtmp) ∧timestamp(MPI Finialize()) <timestamp(MPICalls)

• Concurrent MPI Recv violation: Each threadwithin an MPI process may issue MPI calls;however, threads are not separately addressable.In other words, the rank of a send or receive callidentifies a process instead of threads, whichmeans when two threads call MPIRecev withsame tag and communicator the order is unde-fined. In such cases, we can expect a data race.Moreover, we can prevent such data races usingdistinct communicators or tags for each thread.isConcurrentRecvVoilation =Concurrent(srctmp) ∧Concurrent(tagtmp)∧ Concurrent(commtmp)∧mpitypet1 == MPIRecv ∧mpitypet2 ==MPIRecv∧ t1 6= t2

• Concurrent request violation in MPI Waitand MPI Test: MPI does not allow that

two or more threads are concurrentlyinvokes MPI Wait(request) andMPI Test(request) with the same sharedrequest variable.isConcurrentRequestViolation =Concurrent(requesttmp) ∧(mpitypet1 == MPIWait ∨MPITest)∧ (mpitypet2 == MPIWait ∨MPITest)∧ (t1! = t2)

• Probe Violation in MPI Probe andMPI IProbe: Two concurrent invocationsof MPI Probe() or MPI IProbe() fromdifferent threads on the same communicatorshould not have the same “source and “tag astheir parameters.isProbeViolation =Concurrent(srctmp)∧Concurrent(tagtmp)∧ (mpitypet1 == MPIProbe,MPIIProbe)∧ (mpitypet2 == MPIRecev) ∧ (t1 6= t2)

• Collective Call Violation: According to the MPIrequirements all processes on a given com-municator must make the same collective call.Furthermore, the user is required to ensure thatthe same communicator is not concurrently usedby two different collective calls by threads inthe same process.isCollectiveCallViolation =Concurrent(collectivetmp) ∧mpitypet1 == collectiveroutine ∧mpitypet2 == collectiveroutine ∧ t1 6= t2The lockset analysis and happen-before anal-ysis help to detect concurrency for monitoredvariables. Later, HOME will deliver the match-ing rules to detect violations for thread-safetyspecification.

IV. DETECTING THREAD-SAFETY VIOLATIONSUSING INTEGRATED STATIC AND DYNAMIC

ANALYSIS

This paper proposes a novel approach to detectthread-safety violations using the integrated staticand dynamic program analysis. Dynamic analysisreasons about behavior of a program through ob-serving its executions. It is usually performed byinstrumenting source code or byte/ binary code andmonitoring the programs’ executions. The observedevents can be online analysis (i.e., during executions)or offline (i.e., after executions terminate). In order todetect concurrency errors, dynamic analysis extendsthe traditional testing techniques. In other words,

it inspects potential concurrency errors by search-ing specific patterns based on the current observedevents, even the errors do not show up in the currentexecution paths. Most dynamic analysis approacheshave the same weakness on path coverage issue,which is that they cannot detect concurrency andlogic errors in unexecuted code. Pure Static anal-ysis makes predictions about a program’s runtimebehavior based on analyzing its source code. Staticanalysis can often explore the whole code, but sac-rifice accuracy and may report many false positives.

In order to avoid reporting false positives and highoverhead, we combine static and dynamic analysistechniques together to utilize their advantages andavoid their shortcomings in HOME implementation.Specifically, we design a lightweight technique tocheck thread-safety violations without sacrificinganalysis accuracy and precision. Our proposed hy-brid program analysis speculatively approximatesthe behaviors of unexecuted code by instantiatingits static summary using runtime information. Inaddition, the monitoring overhead is another problemof dynamic analysis, which usually slows down thespeed of programs by a factor of 2 to 100. Ourproposed approach significantly reduces the over-head by using a static analysis to perform selectivemonitoring.

A. Workflow

Figure 3 shows the architecture overview of ourapproach HOME, which consists of two phases:compile-time checking and runtime checking. Dur-ing the compilation phase, we classify code sectionsinto correct and potentially erroneous. The correctcode sections are filtered out, and MPI routine callsin the potentially erroneous code are instrumented.This filtering approach avoid systematic instrumenta-tion, thus reducing the overhead of the dynamic anal-ysis. Since compile-time checking procedure doesnot require running program, so we often call it staticanalysis. In this compile-time analysis procedure,HOME would generate the control flow graph ofthis hybrid MPI/OpenMP program. In this way, eachMPI call would be represented as a node in controlflow graph. At beginning of visiting all the nodesin the control flow graph, we first insert severalmonitored variables into control flow graph at globalvariable region. These variables are inserted intoMPI wrappers by our instrumented and can help todetect the violations of MPI calls in thread level.The reason is that, since one property in most of theviolations to thread-safety specifications in hybrid

Hybrid MPI and OpenMP source code

MPI Wrapper with inserted monitored

variables

Hybrid Dynamic Analysis for concurrency detection

of monitored variables

Static Analysis for inserting monitored

variables

Executable Binary Code

Thread-Safety specification argument

list generation

Concurrency Reports

Final Reports

Happen-Before analysis

instrumentation

Lockset analysis instrumentation

Figure 3. The architecture of the tool HOME.

MPI/OpenMP is about at least two threads executethe same operations at the same time, which canbe abstracted to concurrency execution by severalthreads. By using the MPI wrapper designed by ourinstrumented MPI library, each monitored variable isassociated to one violation to thread-safety specifi-cation. If this monitored variable is detected to be aconcurrent operations on that, then we can say thisMPI calls are executed concurrently by associatedthreads.

The way to detect the concurrent execution ofmonitored variables is to combine lockset analy-sis [21] and happen-before analysis [16], both ofthese two algorithm are classical algorithm to detectdata races in multi-threaded programs. For locksetsanalysis, the key idea behind it is to track locksets that govern access to each shared location. Adata race on monitored variable is an access to ashared variable that is not governed by a set oflocks. The purpose of happens-before analysis is toestablish partial ordering of events across differentthreads. The reason why dynamic analysis procedurecombines the algorithm of lockset analysis algorithmand happen-before algorithm is to reduce false pos-itive and overhead on binary code instrumentation.Also, these two data race detection approaches donot require errors real happen in runtime. After theresults of these monitored variables are generatedduring dynamic code instrumentation analysis, thenour HOME would analyze the recorded argumentsinformation with the concurrency issues of moni-tored variables in MPI calls, then match them todetermine whether there is a match to violation tothread-safety specification in MPI standard.

B. MPI Wrapper Instrumentation

Our mechanism utilizes the concurrency executionproperty of monitored variables in MPI calls atthread level to determine whether two MPI callscan be executed at the same time at thread level.In order obtain the runtime information for dy-namic analysis, we instrument MPI wrappers thatcan catch runtime information in MPI call, suchas source, tag, communicator, and thread ID in-formation, these MPI wrapper would execute theappropriate MPI call to perform MPI function-ality. MPI MonitorV ariableSetup is the func-tion to For example, in Figure IV-B, our wrapperfor MPI Recv is named as HMPI Recv, whenHOME would detect Concurrent MPI Recv vio-lation, then dynamic analysis using lockset analysisand happen-before analysis obtains the thread ID ex-ecution information, then it would detect the WRITEoperations on source src, tag tag and communicatorcomm in MPI calls whether there are concurrentexecutions happen on src, tag and comm at the sametime, if there are concurrent execution issues onthese three variables at the same time and have atleast two different thread IDs in log. Then we wouldreport there is a violation.

Listing 1. source code#include<mympi.h>

MPI_MonitorVariableSetup(srctmp, tagtmp,terminationtmp,requesttmp,collectivetmp);

int main(){code();return

}void code(){...HMPI_Recv(&var, count,

tag, dest)

}

Listing 2. mympi.h#include<mpi.h>

//! different routineshas its ownmonitored vairbale

//! MPI receive is notenough for coveringall cases

int HMPI_Recv(&var,count, src, tag,comm)

{int tid = getThreadID

();tagtmp = tag;srctmp = src;commtmp = comm;mpitype = receive;StartExecLog();// to

record all thearguments in log

MPI_Recv(var,count,src, tag, comm);

}


MPI_MonitorVariableSetup(srctmp, tagtmp,terminationtmp,requesttmp,collectivetmp,commtmp);

int main(){code();return}void code(){...HMPI_Recv(&var, count,

tag, dest)

}




int HMPI_Wait(request){int tid = getThreadID

();mpitype = mpiwait;requesttmp = request;StartExecLog();// to


MPI_Wait(var,count,src, tag, comm);

}


MPI_MonitorVariableSetup(srctmp, tagtmp,terminationtmp,requesttmp,collectivetmp,commtmp);

int main(){code();return}void code(){...HMPI_Recv(&var, count,

tag, dest)

}




int HMPI_Barrier(comm){int tid = getThreadID

();mpitype =

mpicollective;commtmp = comm;StartExecLog();// to


MPI_Barrier(comm);}

C. Static Analysis

The idea of overhead reduction is to reduce thenumber of monitored variables as many as possibleduring runtime code instrumentation. Since the vio-lations only happen in hybrid MPI/OpenMP region,so non-hybrid region is guaranteed error-free regionwhich have no violations, when it comes to hybridMPI/OpenMP regions, the violation to thread-safetyspecification in MPI standard would only happenin this region, so we assume this part is potentialerror region. For this concept, the static analysisin our approach HOME utilizes the control flowgraph of the hybrid MPI/OpenMP program to helpdistinguish whether the MPI call node is in ompparallel region, which can significantly reduce theoverhead of OpenMP binary code instrumentationduring the runtime. In Algorithm 1, the node of

Algorithm 1 Static Analysis Procedure1: % source code level pre-processing based on Control

Flow Graph %2:3: void StaticAnalysis(){4: src represents the source code of Hybrid MPI/OpenMP

program5: List srcCFG = CFGGeneration(src)6: add MPIMonitoredV ariables() at beginning of

global region7: While (!srcCFG.isEmpty())8: if (srcCFG.get(i) == ompParallelBegin()) then9: k = i;

10: While (srcCFG.get(k)!= ompParallelEnd() )11: if (srcCFG.get(k).type == MPIcalls) then12: replace srcCFG.get(k) with our instrumented

MPI call13: end if14: k++;15: it }16: end if17: i++;18: }19: }

source code CFG is put into a list srcCFG, whenstatic analysis procedure traverse all the node issrcCFG, if one node is indicated as omp parallelor omp parallel for, then the following MPI callsafter this omp parallel block would be replaced usingour MPI wrapper util it reaches the end of thisomp parallel block. The other MPI calls which arenot replaced by our MPI wrapper would be skippedduring binary code instrumentation in order to reducethe unnecessary overhead.

D. Hybrid Dynamic Analysis

Bases on the monitored variables provided fromthe static analysis, the list of variable accessed inMPI call in hybrid MPI/OpenMP region sites areinstrumented using the tool Intel Pin [12] for mon-itored variable concurrency detection. Then we runthe instrumented program with the dynamic locksetanalysis combining with happen-before analysis todetect concurrency issues for monitored variables.Since we would like to check one important propertyin thread-safety specification in MPI standard, whichis whether there two MPI calls can be executed bydifferent threads at the same time.

Since the operations of MPI calls are not catego-rized in READ or WRITE, so we have to leveragevariables inside of MPI calls at thread level torepresent current execution status, so HOME utilizesmonitored variables srctmp, tagtmp, commtmp, col-lectivecalltmp and requesttmp and finializetmp for

representing concurrency status in MPI calls. Thatis the most important innovation for our approach,since the dynamic analysis would detect the concur-rency status of these variables to determined thereis concurrent execution for MPI calls at thread levelor not. Another benefit of using lockset analysis andhappen-before analysis is that combination analysisof them do not require these races must happen inthe runtime, then some potential violations would bedetected no matter it real happens during runtime.

The dynamic analysis techniques in HOME forconcurrency detection of monitored variables arefully based on dynamic instrumentation. HOMEobserves a stream of events generated by instrumen-tation inserted into the program and sets up severalrules to determine concurrency happen conditionsfor monitored variables. The instrumented programwould output a sequence of events to our detec-tion approach. Since pure lockset analysis wouldfind more races then happens-before based tools,but would increase the overhead. That is why weimplement happen-before analysis for this step also.In this paper, we treat each event in sequence hasfollowing properties:

Table ITABLE OF NOTATION IN DYNAMIC ANALYSIS

V arNameMem(mi, ai, ti a set of eventsemory access location is mi

for variable ai at thread ti.t thread t.ei event at step i.(atm , btn ) a vector clock

for event a and bat thread tm and tn

LockSets a set of locks.READ, WRITE the two possible access types

for a memory access in event.

The Lockset-based analysis approach is imple-mented based on several hypothesis: Whenever twodifferent threads access a shared memory location,and one of the accesses is a write, the two ac-cesses are performed holding some common lock.Formally, given an access sequence ei, dynamicanalysis procedure in HOME would maintain thelock sets for each monitored variable before stepi by a thread t by using LockSetsi(t) for currentlockset updates, LockSetsi(t) for each live threadt, can be efficiently maintained online as acquisitionand release events are received.IsPotentialLockSetRace(i, j) =

ei = V arNameMem(mi, ai, ti) ∧ ej =V arNameMem(mj , aj , tj)

∧ ti 6= tj ∧mi = mj ∧(ai = WRITE) ∨ (aj = WRITE) ∧LockSetsi(ti) ∩ LockSetsj(tj) = ∅

When it comes to Happens-before, which is a par-tial order of all events of all threads in a concurrentexecution. For any single thread, events are orderedin the order in which they occur. The happens-before relation was first defined by Lamport as apartial order on events occurring in a distributedsystem[16]. To report the data race more accuratelyand efficiently, we apply a happen-before analysisto break down the execution of each thread intoseveral periods by the the synchronization eventsincurred by the thread synchronization, like ompbarrier directive.

The happens-before relation can be computed on-line using standard vector clocks, and each threadhas a vector maintainces the event order, and all thevectors should have a global order to represent onepartial order of execution. For example, thread t1 hasevent a happens before event b, then we have (at1, 0)¡ (bt1, 0), and thread t2 has event c happens beforeevent d, then we have (0, ct2) ¡ (0, dt2), and wesupposed that have (at1, 0) ¡ (0, ct2), which meansevent a at thread t1 happens before event c at threadt2 , but we have no conditions for the order of eventc and b, so there is concurrency issues happen onevent c and event d, if both of c and d using thesame memory address with WRITE within differentthreads, then we say there is concurrency issue. Thenwe apply the lockset analysis for these two eventsagain.

The formal representation forhappen-before analysis is listed below.IsPotentialHappenBeforeRace(ei, ej)=ei = V arNameMem(mi, ai, ti) ∧ ej =V arNameMem(mj , aj , tj) ∧ti 6= tj ∧mi = mj ∧(ai = WRITE) ∨ (aj = WRITE) ∧¬(eiti → ejtj ) ∧ ¬(ejtj → eiti)

There are several challenges in the instrumentationof the OpenMP binary programs using Intel Pin.Since implementation API within Intel Pin onlyprovides memory access read/write, synchronizationwait and locking lock, unlock operations, in order tosupport the OpenMP standard specifies several high-level synchronization points in dynamic analysis,the explicit synchronization points include #pragmaomp barrier , #pragma omp critical and implicitsynchronization include #pragma omp single should

0%

20%

40%

2 4 8 16 32 64

Ave

rgag

e O

verh

ead

Number of Processors

MARMOT

ITC

0

500

1000

1500

2000

2500

3000

3500

2 4 8 16 32 64

Exe

cuti

on

tim

e in

se

con

ds

Number of Processes

Base

HOME

MARMOT

ITC

Figure 4. LU-MZ hybrid MPI/OpenMP Testing

be supported.

V. EXPERIMENTS

We conducted all the experiments on real cloudenvironment, Amazon EC2, to demonstrate the ourapproach HOME is accuracy and efficiency and tomeasure the overhead of our approach. We summa-rize major results we observed in this section. Theoverhead of HOME is ranging from 16% to 45%.We can detect all 6 kinds of violations mentioned inthis paper using artificial inserted violations.

A. System Setup

This section discusses our experimental evaluationof HOME over some microbenchmarks for casestudy and real world applications for scalability test.In Amazon EC2 cluster, our instance type is C3instances with 2.8 GHz Intel Xeon E5-2680v2 . Thenumbers, e.g., 8, 16, 32, 64, 128, in the figures of thissection represent the number of MPI processes in ourcloud platform. We use up to 32 C3 instances totally.Our experiments are performanced on NPB 3.3-MZin NAS Parallel Benchmark, which includes BT, SPand LU with Class C size. The number of threads isset up to 2 by default in our experiment. Otherwise,the overhead of Intel Thread Checker would be veryhigh with number increasing of threads in processes.

B. Performance Analysis and Comparison

We also compare our approach HOME with IntelThread Checker[19] and Marmot [6] using evalu-ation results with injected or modified violationsin benchmark. In above Table I, it shows that weinserted 6 violations () into source code programs.ITC stands for Intel Thread Checker in this ex-periment section. The first column lists benchmarknames which are LU, BT and SP in NAS-MZ [15].The second column shows our detected results forall these inserted 6 violations in LU, BT and SP

0%

20%

40%

2 4 8 16 32 64

Ave

rgag

e O

verh

ead


MARMOT

ITC

0

500

1000

1500

2000

2500

3000

3500

2 4 8 16 32 64

Exe

cuti

on

tim

e in

se

con

ds

Number of Processes

Base

HOME

MARMOT

ITC

0

500

1000

1500

2000

2500

3000

3500

4000

4500

Exe

cuti

on

tim

e in

se

con

ds

0

1000

2000

3000

4000

5000

6000

2 4 8 16 32 64

Exe

cuti

on

Tim

e in

se

con

ds

Number of Processes

Base

HOME

MARMOT

ITC

Figure 5. BT-MZ hybrid MPI/OpenMP Testing

8 16 32 64

Number of Processes

MARMOT

ITC

0

500

1000

1500

2000

2500

3000

3500

4000

4500

2 4 8 16 32 64

Exe

cuti

on

tim

e in

se

con

ds

Number of Processes

Base

HOME

MARMOT

ITC

Figure 6. SP-MZ hybrid MPI/OpenMP Testing

benchmark, the experiment shows that our HOMEcan detect all of these expected violations (termina-tion, communication and so on). For NPB LU, thereason why Intel Thread Checker reports 5 falses isbecause it cannot recognize omp critical directivescorrectly, so it would not detect violations to requestin MPI Probe. When it comes to Marmot, since theirapproaches only depends on the violations which arereal happening, so some potential violations wouldbe ignored. There is one false positive happen in BTtest using Intel Thread Checker, since the omp criti-cal directive can not be recognized correctly, so thissingle thread execution routine would be reportedas concurrent execution using different threads. Alsoboth of the marmot and intel thread checker woulddetect similar violations like HOME, but the over-head in Marmot and Intel Thread Checker are muchhigher than that in HOME.

Benchmarks HOME ITC MarmotNPB-MZ LU (6) 6 5 5NPB-MZ BT (6) 6 7 6NPB-MZ SP (6) 6 6 5

The way we add violations

0%

20%

40%

60%

80%

100%

120%

140%

2 4 8 16 32 64

Ave

rgag

e O

verh

ead


HOME

MARMOT

ITC

Figure 7. Overhead measurement

In Figure 4, Figure 5 and Figure 6, these figuresshows the execution runtime with inserted violationsof LU, BT and SP hybrid MPI/OpenMP bench-marks, we only inserted MPI calls within openmpparallel regions without any computation influenceon original benchmark semantics, so some extraoverhead would also caused by these inserted MPIcalls in thread level. The Base in the figure meansoriginal runtime of application, and ITC representsIntel Thread Checker. The runtime time in HOME,Marmot and Intel Checker shown in the Figure 4, 5and 6 includes overhead caused by instrumentation.

The problem of the Marmot error detection is thatit can only detect violations if they actually appearin a run made with MARMOT. It would not find theerrors which is a possible violation but not happenduring checking runtime. For the MPI calls of ahybrid application different runs may have a differentexecution order of the MPI calls and it might happenthat certain MPI calls are issued by different threads.So this approach would have false negatives on someright execution order but has potential violationsapplications.

The runtime detection for these errors with veryhigh overhead, since intel thread checker may mon-itor all the thread level instructions, and the sourceand tag information in MPI Probe() is not de-tected by intel thread checker. So our approachwould have better performance and scalability onthese hybrid MPI/OpenMP error checking with lessoverhead on runtime.

C. Overhead Analysis for HOME

The overhead of HOME in Figure 7 is rangingfrom 16 % to 45 % based on our experimental obser-vation. The overhead is caused by extra instructionsexecuted in MPI wrapper and binary code instru-mentation for dynamic analysis. Since binary codeinstrumentation is very expensive, so only monitored

variables mentioned for thread-safety specificationchecking are instrumented during dynamic analysis.The other variables or arguments lists in functioncall during computation are not considered in thisapproach, since lots of approaches are developedto detect concurrency errors in pure MPI and pureOpenMP programs.

We proved that our optimization would signif-icantly reduce the lots of overhead. In Figure 7,the overhead is increasing with number of processesin MPI raises. he reason is that our approach re-quires all the processes run the code instrumentationduring runtime, and these checking requires theextra procedure maintenance for lockset analysis andhappen-before analysis during the runtime. With thenumber of processes increasing, the overhead ofHOME is increasing also, the reason is that eachthread would have to being instrumented duringdynamic analysis, so with our observation, overheadof HOME is ranging from around 16% to 45%, whenit comes to Marmot it is ranging from 15% to 56%,and overhead it is much higher using Intel ThreadChecker which is up to around 200%.

VI. CONCLUSIONS AND FUTURE WORKS

In this paper, we present a practical and scalableerror detection approach for violations to thread-safety specification in hybrid MPI/OpenMP pro-gram, which combines the static analysis on provid-ing monitored variable to improve the dynamic datarace detection. Since HOME applies a static analysisto replace MPI calls with our MPI wrappers to get alist of variables that can be involved in the concur-rency checking for monitored variables, which areassociate with violations in Hybrid MPI/OpenMP.Then the dynamic analysis focus on monitoring thesevariables in the runtime and apply a enhanced locksetanalysis and happen before analysis to detect theconcurrency issues on these variables more accu-rately and efficiently, the overhead is about rangefrom 16% to 45% and each specified variable raceis associate with specified violations in thread-safetyspecification in hybrid MPI/OpenMP application.Our future works include testing HOME’s scalabil-ity and accuracy on more large-scale benchmarks,which would be incorporating with inter-procedureanalysis provided by front-end compiler to producemore refined and precise static analysis results inGUI, extending HOME to handle not only MPIand OpenMP but also the other distributed andshared memory programming model, like UPC andPThreads Programming.

VII. ACKNOWLEDGMENT

The work was supported in part by NSF underGrant 1118059 and CAREER 1054834.

REFERENCES

[1] Amazon EC2. http://http://aws.amazon.com/ec2/.[2] Intel thread checker. http://software.intel.com/en-

us/intel-thread-checker/.[3] S. V. Adve and K. Gharachorloo. Shared memory

consistency models: A tutorial. Computer, 29:66–76,December 1996.

[4] P. Garg, F. Ivancic, G. Balakrishnan, N. Maeda, andA. Gupta. Feedback-directed unit test generationfor c/c++ using concolic execution. In Proceedingsof the 2013 International Conference on SoftwareEngineering, ICSE ’13, pages 132–141, Piscataway,NJ, USA, 2013. IEEE Press.

[5] G. Gopalakrishnan, R. M. Kirby, S. Siegel,R. Thakur, W. Gropp, E. Lusk, B. R. De Supinski,M. Schulz, and G. Bronevetsky. Formal analysisof mpi-based parallel programs. Commun. ACM,54(12):82–91, Dec. 2011.

[6] T. Hilbrich, M. S. Muller, and B. Krammer. De-tection of violations to the mpi standard in hybridopenmp/mpi applications. In Proceedings of the4th International Conference on OpenMP in a NewEra of Parallelism, IWOMP’08, pages 26–35, Berlin,Heidelberg, 2008. Springer-Verlag.

[7] T. Hilbrich, J. Protze, M. Schulz, B. R. de Supinski,and M. S. Muller. Mpi runtime error detectionwith must: Advances in deadlock detection. InProceedings of the International Conference on HighPerformance Computing, Networking, Storage andAnalysis, SC ’12, Los Alamitos, CA, USA, 2012.IEEE Computer Society Press.

[8] J. P. Hoeflinger and B. R. De Supinski. The openmpmemory model. In Proceedings of the 2005 and 2006international conference on OpenMP shared mem-ory parallel programming, IWOMP’05/IWOMP’06,pages 167–177, Berlin, Heidelberg, 2008. Springer-Verlag.

[9] M.-H. Kang, O.-K. Ha, S.-W. Jun, and Y.-K. Jun.A tool for detecting first races in openmp programs.In PaCT ’09: Proceedings of the 10th InternationalConference on Parallel Computing Technologies,pages 299–303, Berlin, Heidelberg, 2009. Springer-Verlag.

[10] Y.-J. Kim, M.-Y. Park, S.-H. Park, and Y.-K. Jun.A practical tool for detecting races in openmp pro-grams. In PaCT, pages 321–330, 2005.

[11] K. Lee and S. P. Midkiff. A two-phase escapeanalysis for parallel java programs. In PACT ’06:Proceedings of the 15th international conference onParallel architectures and compilation techniques,pages 53–62, New York, NY, USA, 2006. ACM.

[12] C.-K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser,G. Lowney, S. Wallace, V. J. Reddi, and K. Hazel-wood. Pin: Building customized program analysistools with dynamic instrumentation. In Proceedingsof the 2005 ACM SIGPLAN Conference on Program-ming Language Design and Implementation, PLDI

’05, pages 190–200, New York, NY, USA, 2005.ACM.

[13] H. Ma, S. R. Diersen, L. Wang, C. Liao, D. Quinlan,and Z. Yang. Symbolic analysis of concurrencyerrors in openmp programs. In Proceedings ofthe 2013 42Nd International Conference on ParallelProcessing, ICPP ’13, pages 510–516, Washington,DC, USA, 2013. IEEE Computer Society.

[14] R. H. B. Netzer, T. W. Brennan, and S. K.Damodaran-Kamal. Debugging race conditionsin message-passing programs. In Proceedings ofthe SIGMETRICS Symposium on Parallel and Dis-tributed Tools, SPDT ’96, pages 31–40, New York,NY, USA, 1996. ACM.

[15] Nasa nas parallel benchmarks, OpenMPc versions 2.3. Available fromwww.nas.nasa.gov/Software/NPB.

[16] R. O’Callahan and J.-D. Choi. Hybrid dynamic datarace detection. In Proceedings of the Ninth ACMSIGPLAN Symposium on Principles and Practice ofParallel Programming, PPoPP ’03, pages 167–178,New York, NY, USA, 2003.

[17] OpenMP Architecture Review Board. Openmp ap-plication program interface. Specification, 2008.

[18] P. Sack, B. E. Bliss, Z. Ma, P. Petersen, and J. Torrel-las. Accurate and efficient filtering for the intel threadchecker race detector. In ASID ’06: Proceedingsof the 1st Workshop on Architectural and SystemSupport for Improving Software Dependability, pages34–41, New York, NY, USA, 2006. ACM.

[19] P. Sack, B. E. Bliss, Z. Ma, P. Petersen, and J. Torrel-las. Accurate and efficient filtering for the intel threadchecker race detector. In Proceedings of the 1stWorkshop on Architectural and System Support forImproving Software Dependability, ASID ’06, pages34–41, New York, NY, USA, 2006. ACM.

[20] E. Saillard, P. Carribault, and D. Barthou. Parcoach:Combining static and dynamic validation of mpicollective communications. Int. J. High Perform.Comput. Appl., 28(4), Nov. 2014.

[21] S. Savage, M. Burrows, G. Nelson, P. Sobalvarro, andT. E. Anderson. Eraser: A dynamic data race detectorfor multi-threaded programs. ACM Transactions onComputer Systems, 15(4):391–411, Nov. 1997.

[22] S. F. Siegel. Verifying parallel programs with mpi-spin. In Proceedings of the 14th European PVM/MPIUser’s Group Meeting on Recent Advances in Paral-lel Virtual Machine and Message Passing Interface,pages 13–14, Berlin, Heidelberg, 2007. Springer-Verlag.

[23] J. S. Vetter and B. R. de Supinski. Dynamicsoftware testing of mpi applications with umpire.In Proceedings of the 2000 ACM/IEEE Conferenceon Supercomputing, SC ’00, Washington, DC, USA,2000. IEEE Computer Society.

Detecting Thread-Safety Violations in Hybrid OpenMP/MPI ...lwang/papers/Cluster2015.pdfthread-safety...

Documents

Transcript of Detecting Thread-Safety Violations in Hybrid OpenMP/MPI ...lwang/papers/Cluster2015.pdfthread-safety...