Virtual TechDays INDIA │ 18-20 August 2010
Parallelize Applications Using Intel Threading Building Blocks
Om Sachan │ SSG, Intel Corporation


Session Agenda

• Intel® Threading Building Blocks overview
• Generic Parallel Algorithms
• Lab: Parallelize a serial application
• Generic Concurrent Containers
• Synchronization Primitives
• Advanced Features Overview
• Summary

Intel® Threading Building Blocks
Overview

• Enables you to specify tasks instead of threads
  – Automatically maps tasks onto physical threads in a way that makes efficient use of processor resources
• Targets threading for performance
  – A solution for parallelizing computationally intensive work units while preserving good scalability across various hardware
• Compatible with other threading packages
  – Works well for CPU-bound tasks, not I/O-bound ones; coexists with other threading packages
• Emphasizes scalable, data-parallel programming
  – Scales well to larger numbers of processors
• Relies on generic programming
  – The set of templates implemented in Intel® TBB allows writing flexible algorithms


Intel® Threading Building Blocks
Overview

– Supported platforms:
  • IA-32, Intel 64
  • Parallel Studio
– Product package includes:
  • Dynamic libraries (debug and release)
  • Header files
  • Sample code
  • Documentation: tutorial, getting started guide, reference
– Intel® TBB is a set of generic algorithms and data structures (C++ templates)

Trivial Intel® TBB program:

    #include "tbb/task_scheduler_init.h"
    using namespace tbb;   // all public classes and functions are in the tbb namespace

    int main ()
    {
        // The library requires explicit initialization:
        // at least one task_scheduler_init object must be active.
        task_scheduler_init TBB_Init;
        return 0;
    }


Intel® Threading Building Blocks
Usage Model

• Algorithms and data structures are defined in terms of concepts
  – A concept is a set of requirements on a type
  – A type models a concept
  – The program defines the types required by Intel® TBB constructs
• Parallel Generic Algorithms and Concurrent Containers
  – C++ programming experience, basic STL and basic threading knowledge are required to get started. No need to be a threading expert.
• Task Scheduler
  – An engine that powers the Parallel Generic Algorithms, which hide the complexity of task management. The Task Scheduler may be used directly for advanced programming when your algorithm doesn’t naturally map onto one of the pre-packaged Parallel Algorithms. Threading programming and tuning experience is required.
• Synchronization Primitives
  – These objects should be used carefully, as inappropriate use of synchronization may lead to performance and correctness issues. Solid threading programming and tuning experience is required.


Intel® Threading Building Blocks

Generic Parallel Algorithms


Intel® Threading Building Blocks
Generic Parallel Algorithms : Basic Concepts

• Splittable Concept
  – The type X is splittable if it has a constructor that allows an instance to be split into two pieces
  – Splitting constructor (splits x into x and y): X::X (X& x, split)
• Range Concept
  – The type R represents a recursively divisible set of values; it must model the Splittable Concept
  – Splitting constructor: R::R (R& r, split)
  – Returns ‘true’ if the range can be partitioned into two sub-ranges: bool R::is_divisible () const
  – Returns ‘true’ if the range is empty: bool R::empty () const
  – Destructor: R::~R ()
  – Copy constructor: R::R (const R&)
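The Range requirements listed above can be illustrated with a small type written in standard C++ only. This is a minimal sketch and not Intel® TBB code: `IndexRange`, `split_tag` and `collect_leaves` are hypothetical names chosen for illustration, and the pieces are collected sequentially, whereas a task scheduler would hand them to worker threads.

```cpp
#include <cstddef>
#include <vector>

// Tag type that only selects the splitting constructor
// (hypothetical stand-in for the library's split tag).
struct split_tag {};

// Hypothetical half-open index range [begin, end) modeling the Range concept.
class IndexRange {
    std::size_t begin_, end_, grain_;
public:
    // A grain size of 0 is clamped to 1 so that splitting always terminates.
    IndexRange(std::size_t b, std::size_t e, std::size_t grain)
        : begin_(b), end_(e), grain_(grain ? grain : 1) {}

    // Splitting constructor: the new object takes the upper half
    // and r is shrunk to the lower half.
    IndexRange(IndexRange& r, split_tag)
        : begin_(r.begin_ + (r.end_ - r.begin_) / 2), end_(r.end_), grain_(r.grain_) {
        r.end_ = begin_;
    }

    bool empty() const { return begin_ >= end_; }
    bool is_divisible() const { return !empty() && end_ - begin_ > grain_; }
    std::size_t begin() const { return begin_; }
    std::size_t end() const { return end_; }
};

// Recursively splits r until no piece is divisible and appends the pieces to out.
void collect_leaves(IndexRange r, std::vector<IndexRange>& out) {
    if (r.empty()) return;
    if (r.is_divisible()) {
        IndexRange upper(r, split_tag());  // r now holds the lower half
        collect_leaves(r, out);
        collect_leaves(upper, out);
    } else {
        out.push_back(r);
    }
}
```

With a grain size of 3, the range [0, 10) ends up as the pieces [0, 2), [2, 5), [5, 7) and [7, 10): splitting stops as soon as a piece is no larger than the grain size.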


Intel® Threading Building Blocks
Generic Parallel Algorithms : parallel_for Template Function

• #include "tbb/parallel_for.h"
• template <typename Range, typename Body> void parallel_for (const Range& range, const Body& body);
  – Represents parallel execution of Body over each value in the Range
• The Range type must model the Intel® Threading Building Blocks Range Concept described on the previous slide
• parallel_for Body Concept requirements
  – Apply Body to Range: void Body::operator() (Range& range) const
  – Destructor: Body::~Body ()
  – Copy constructor: Body::Body (const Body&)


Intel® Threading Building Blocks
Example: Parallelizing Simple Loops

• Task: loop over a fixed-size array of elements and apply a function to each of them (iterations are independent)
• Serial version of the solution:

    const int N = 20000000;

    void ChangeArraySerial (int* array, int M)
    {
        for (int i = 0; i < M; i++) {
            array[i] *= 2;
        }
    }

    int main ()
    {
        static int A[N];   // static storage: an array this large would overflow a typical stack
        for (int i = 0; i < N; i++) { A[i] = i; }
        ChangeArraySerial (A, N);
        return 0;
    }


Intel® Threading Building Blocks
Parallel solution with Intel® TBB : using parallel_for

    #include "tbb/blocked_range.h"
    #include "tbb/parallel_for.h"
    using namespace tbb;

    const int IdealGrainSize = <some number>;   // experiment with the grain size

    // The ChangeArray class models the parallel_for Body Concept
    class ChangeArray {
        int* array;
    public:
        ChangeArray (int* a): array(a) {}
        // blocked_range is a pre-packaged 1D iteration space that models the Range Concept
        void operator() (const blocked_range<int>& r) const {
            for (int i = r.begin(); i != r.end(); i++) {
                array[i] *= 2;   // apply the change to each array element
            }
        }
    };

    void ChangeArrayParallel (int* a, int n)
    {
        // Call the generic function parallel_for<Range, Body>
        // with Range = blocked_range<int> and Body = ChangeArray
        parallel_for (blocked_range<int>(0, n, IdealGrainSize), ChangeArray(a));
    }

    int main ()
    {
        static int A[N];   // N as defined in the serial version
        // initialize tbb, array here…
        ChangeArrayParallel (A, N);
        return 0;
    }


Intel® Threading Building Blocks
Lab 1: Convert the serial matrix multiplication application into a parallel application using parallel_for.
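One possible starting point for the lab, sketched in standard C++ without any Intel® TBB dependency: the serial multiplication is restructured so that the work for a block of rows sits in one function. That function has the shape a parallel_for Body's operator() would need, because calls on disjoint row ranges write disjoint rows of the result. The names `Matrix`, `multiply_rows` and `multiply_serial`, and the flat row-major layout, are assumptions made for illustration; the lab's actual source code is not shown in these slides.

```cpp
#include <cstddef>
#include <vector>

// Square n x n matrix stored row-major in a flat vector (illustrative layout).
typedef std::vector<double> Matrix;

// Computes rows [row_begin, row_end) of c = a * b.
// Each row of c is written by exactly one call, so calls on disjoint
// row ranges do not depend on each other.
void multiply_rows(const Matrix& a, const Matrix& b, Matrix& c,
                   std::size_t n, std::size_t row_begin, std::size_t row_end) {
    for (std::size_t i = row_begin; i < row_end; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            for (std::size_t k = 0; k < n; ++k) {
                sum += a[i * n + k] * b[k * n + j];
            }
            c[i * n + j] = sum;
        }
    }
}

// Serial driver: a single call over the whole row range.
void multiply_serial(const Matrix& a, const Matrix& b, Matrix& c, std::size_t n) {
    multiply_rows(a, b, c, n, 0, n);
}
```

In a parallel version, the row bounds would come from the begin() and end() of the range passed to the Body instead of from the serial driver.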


Intel® Threading Building Blocks
Generic Parallel Algorithms : parallel_reduce Template Function

• #include "tbb/parallel_reduce.h"
• template <typename Range, typename Body> void parallel_reduce (const Range& range, Body& body);
  – Represents parallel reduction of Body over each value in the Range
• The Range type must model the Intel® Threading Building Blocks Range Concept
• parallel_reduce Body Concept requirements
  – Apply Body to Range: void Body::operator() (Range& range)
  – Destructor: Body::~Body ()
  – Copy constructor: Body::Body (const Body&)
  – Splitting constructor: Body::Body (Body&, split); must be able to run concurrently with ‘join’ and ‘operator()’
  – Merge results: void Body::join (const Body& rhs); the result of rhs must be merged into the result of ‘this’


Intel® Threading Building Blocks
Parallel solution with Intel® TBB : using parallel_reduce

    #include "tbb/blocked_range.h"
    #include "tbb/parallel_reduce.h"
    using namespace tbb;

    const int IdealGrainSize = <some number>;

    // The SumArray class models the parallel_reduce Body Concept
    class SumArray {
        int* array;
    public:
        int sum;
        SumArray (int* a): array(a), sum(0) {}
        // Calculate the partial ‘sum’ of array elements in the body of operator()
        void operator() (const blocked_range<int>& r) {
            for (int i = r.begin(); i != r.end(); i++) {
                sum += array[i];
            }
        }
        // Splitting constructor
        SumArray (SumArray& partial_sum, split): array(partial_sum.array), sum(0) {}
        // Perform the reduction in the body of ‘join’
        void join (const SumArray& partial_sum) { sum += partial_sum.sum; }
    };

    int SumArrayParallel (int* a, int n)
    {
        SumArray sum_array (a);
        // Call the generic function parallel_reduce<Range, Body>
        parallel_reduce (blocked_range<int>(0, n, IdealGrainSize), sum_array);
        return sum_array.sum;
    }
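To make the split and join protocol concrete, here is a minimal sketch in standard C++ that walks through the same steps sequentially: split the body, reduce each half, then join. It is not how Intel® TBB implements parallel_reduce; `SumBody`, `split_tag` and `reduce_sketch` are hypothetical names, and a real scheduler would run the two halves on different threads and may skip the split when no worker is free.

```cpp
#include <cstddef>
#include <vector>

struct split_tag {};  // hypothetical stand-in for the library's split tag

// Body in the style of the SumArray class, with no TBB dependency.
class SumBody {
    const int* array_;
public:
    long sum;
    explicit SumBody(const int* a) : array_(a), sum(0) {}
    // Splitting constructor: shares the input and starts a fresh partial sum.
    SumBody(SumBody& other, split_tag) : array_(other.array_), sum(0) {}
    // Accumulates over the half-open index range [begin, end).
    void operator()(std::size_t begin, std::size_t end) {
        for (std::size_t i = begin; i != end; ++i) sum += array_[i];
    }
    // Merges the partial result of rhs into this body.
    void join(const SumBody& rhs) { sum += rhs.sum; }
};

// Divide-and-conquer reduction over [begin, end), executed sequentially here.
void reduce_sketch(SumBody& body, std::size_t begin, std::size_t end, std::size_t grain) {
    if (end - begin <= grain || end - begin < 2) {
        body(begin, end);               // small enough: process directly
        return;
    }
    std::size_t mid = begin + (end - begin) / 2;
    SumBody right(body, split_tag());   // split: new body for the upper half
    reduce_sketch(body, begin, mid, grain);
    reduce_sketch(right, mid, end, grain);
    body.join(right);                   // join: fold the upper half's result in
}
```

Because every split body starts with sum equal to 0 and every join adds a partial sum exactly once, the final value in the original body is the sum of the whole array regardless of where the splits fall.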


Intel® Threading Building Blocks

Generic Concurrent Containers


Intel® Threading Building Blocks
Concurrent Containers

• Provides concurrent containers
  – STL containers are not thread-safe: attempts to modify them concurrently can corrupt the container
  – Standard practice is to wrap a lock around STL containers
    • Turns the container into a serial bottleneck
• Interfaces are similar to STL but don’t match 100%
  – Some STL interfaces are inherently not thread-safe
• Fine-grained locking or lockless implementations
  – Worse single-thread performance, but better scalability
  – Can be used with the library, OpenMP, or native threads
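For comparison, the "wrap a lock around an STL container" practice mentioned above might look like the following sketch in standard C++. `LockedQueue` is a hypothetical name. Every operation takes the same mutex, which is why such a wrapper becomes a serial bottleneck. It also shows why the interface cannot match STL exactly: the separate empty(), front() and pop() calls of std::queue have to be fused into one `try_pop`, otherwise another thread could empty the queue between the check and the pop.

```cpp
#include <cstddef>
#include <mutex>
#include <queue>

// Hypothetical wrapper: a std::queue guarded by a single std::mutex.
template <typename T>
class LockedQueue {
    std::queue<T> items_;
    mutable std::mutex guard_;
public:
    void push(const T& value) {
        std::lock_guard<std::mutex> lock(guard_);
        items_.push(value);
    }
    // Fuses the emptiness check and the pop into one locked step.
    // Returns false and leaves out untouched if the queue is empty.
    bool try_pop(T& out) {
        std::lock_guard<std::mutex> lock(guard_);
        if (items_.empty()) return false;
        out = items_.front();
        items_.pop();
        return true;
    }
    std::size_t size() const {
        std::lock_guard<std::mutex> lock(guard_);
        return items_.size();
    }
};
```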


Intel® Threading Building Blocks
Concurrent Containers : concurrent_hash_map

• concurrent_hash_map <Key, T, HashCompare>
  – Maps Key to an element of type T
  – Hash table of std::pair <const Key, T>
  – You should implement a HashCompare class and define 2 methods: ‘hash’ (mapping a Key to a hash code of type size_t) and the predicate ‘equal’ (returns true if two Keys are equal)
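A HashCompare class meeting the two requirements above could be sketched as follows for std::string keys. `StringHashCompare` is a hypothetical name and the multiplicative hash is only one simple choice, not a recommendation from the slides. The sketch uses standard C++ only, so it can be compiled and checked without the library; it would be passed as the third template argument of the container.

```cpp
#include <cstddef>
#include <string>

// Hypothetical HashCompare for std::string keys.
struct StringHashCompare {
    // Maps a key to a hash code of type size_t (simple multiplicative hash).
    std::size_t hash(const std::string& key) const {
        std::size_t h = 0;
        for (std::string::size_type i = 0; i < key.size(); ++i) {
            h = h * 31 + static_cast<unsigned char>(key[i]);
        }
        return h;
    }
    // Predicate: returns true if the two keys are equal.
    bool equal(const std::string& a, const std::string& b) const {
        return a == b;
    }
};
```

The two methods have to agree with each other: keys that `equal` reports as equal must also produce the same `hash` value.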


Intel® Threading Building Blocks
Concurrent Containers : concurrent_vector

• concurrent_vector <T>
  – Dynamically growable array of T: grow_by and grow_to_at_least
  – The clear() method is not thread-safe with respect to resizing
  – concurrent_vector never moves an element until the array is cleared


Intel® Threading Building Blocks
Concurrent Containers : concurrent_queue

• concurrent_queue <T>
  – In a single-threaded run it supports “first-in-first-out” ordering
  – If one thread pushes two values and another thread pops those two values, they come out in the order in which they were pushed
  – The type of ‘size’ is a signed number: if the queue is empty and size() returns ‘–n’, this means ‘n’ pops are pending
  – Method ‘empty’ returns true if size is zero or negative


Intel® Threading Building Blocks

Synchronization Primitives


Intel® Threading Building Blocks
Synchronization Primitives : Mutex Concept

Mutexes are C++ objects based on the scoped locking pattern.

M ()                            Construct unlocked mutex
~M ()                           Destroy unlocked mutex
typename M::scoped_lock         Corresponding scoped_lock type
M::scoped_lock ()               Construct lock without acquiring a mutex
M::scoped_lock (M&)             Construct lock and acquire lock on mutex
M::~scoped_lock ()              Release lock if acquired
M::scoped_lock::acquire (M&)    Acquire lock on mutex
M::scoped_lock::release ()      Release lock
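The operations in the table above can be illustrated by a small mutex type built on std::mutex. This is a sketch of the scoped locking pattern under the assumption that the listed members are the whole interface; `SimpleMutex`, `counter`, `counter_mutex` and `increment` are hypothetical names, and the code is not taken from Intel® TBB.

```cpp
#include <mutex>

// Hypothetical mutex type offering the members listed in the table above.
class SimpleMutex {
    std::mutex impl_;
public:
    class scoped_lock {
        SimpleMutex* held_;  // mutex currently held, or nullptr if none
    public:
        scoped_lock() : held_(nullptr) {}                      // no mutex acquired
        explicit scoped_lock(SimpleMutex& m) : held_(nullptr) { acquire(m); }
        ~scoped_lock() { release(); }                          // release if acquired
        scoped_lock(const scoped_lock&) = delete;
        scoped_lock& operator=(const scoped_lock&) = delete;

        void acquire(SimpleMutex& m) { m.impl_.lock(); held_ = &m; }
        void release() {
            if (held_) { held_->impl_.unlock(); held_ = nullptr; }
        }
    };
};

int counter = 0;
SimpleMutex counter_mutex;

// The lock is released automatically when 'lock' goes out of scope,
// including when the function exits early or throws.
void increment() {
    SimpleMutex::scoped_lock lock(counter_mutex);
    ++counter;
}
```

The benefit of the pattern is that there is no code path on which the mutex stays locked by accident: leaving the scope always runs the destructor of the lock object.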


Intel® Threading Building Blocks
Synchronization Primitives : Mutex Flavors

• spin_mutex
  – Non-reentrant, unfair, spins in user space
  – VERY FAST in lightly contended situations; use it if you need to protect very few instructions
• queuing_mutex
  – Non-reentrant, fair, spins in user space
  – Use queuing_mutex when scalability and fairness are important
• queuing_rw_mutex
  – Non-reentrant, fair, spins in user space
• spin_rw_mutex
  – Non-reentrant, unfair, spins in user space
  – Use a reader-writer mutex to allow non-blocking reads by multiple threads
• mutex
  – Wrapper for OS synchronization: CRITICAL_SECTION on Windows*, pthread_mutex on Linux*
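What "non-reentrant, unfair, spins in user space" means can be seen in a minimal spin lock written with std::atomic_flag. This sketch illustrates the general technique only and is not the implementation of spin_mutex; `SpinLock` is a hypothetical name.

```cpp
#include <atomic>

// Hypothetical spin lock: waiting threads busy-wait in user space
// instead of blocking in the operating system.
class SpinLock {
    std::atomic_flag busy_;
public:
    SpinLock() { busy_.clear(); }

    // test_and_set returns the previous value, so 'false' means the lock
    // was free and is now held by the caller.
    bool try_lock() { return !busy_.test_and_set(std::memory_order_acquire); }

    // Unfair: whichever spinning thread runs test_and_set first after an
    // unlock wins, regardless of how long the others have been waiting.
    void lock() { while (!try_lock()) { /* spin */ } }

    void unlock() { busy_.clear(std::memory_order_release); }
};
```

The lock does not record which thread holds it, so a second `try_lock` from the same thread fails and a second `lock` would spin forever; that is the non-reentrant property. Spinning is cheap when the protected region is a few instructions long, and wasteful when the holder keeps the lock for a long time.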


Intel® Threading Building Blocks
Synchronization Primitives : Example of spin_rw_mutex

• Allows multiple threads to read the protected data, but only one thread (the writer) can change the data exclusively
• Upgrade/downgrade operations
  – upgrade_to_writer: returns true if it successfully upgraded the lock without temporarily releasing the mutex
  – downgrade_to_reader

    #include "tbb/spin_rw_mutex.h"
    using namespace tbb;

    spin_rw_mutex MyMutex;

    int foo ()
    {
        /* Construction of ‘lock’ acquires ‘MyMutex’ */
        spin_rw_mutex::scoped_lock lock (MyMutex, /*is_writer*/ false);
        /* … */
        if (!lock.upgrade_to_writer ()) { /* … */ }
        else { /* … */ }
        return 0;
        /* Destructor of ‘lock’ releases ‘MyMutex’ */
    }


Intel® Threading Building Blocks

Advanced Features Overview


Intel® Threading Building Blocks

• Generic Parallel Algorithms
  – parallel_for
  – parallel_reduce
  – parallel_sort
  – parallel_while
  – pipeline
  – parallel_scan
• Concurrent Containers
  – concurrent_hash_map
  – concurrent_queue
  – concurrent_vector
• Task Scheduler
• Low-Level Synchronization Primitives
  – spin_mutex
  – queuing_rw_mutex
  – spin_rw_mutex
  – mutex


Intel® Threading Building Blocks : Summary

• Scalable data-parallel decomposition providing patterns for parallel algorithms and concurrent data structures

• Paradigm of logical tasks that are efficiently and automatically mapped onto physical threads by task scheduler

• Works well for computationally intensive tasks, because the task scheduler efficiently load-balances tasks across the physical threads and is cache-aware


Resources

• http://www.threadingbuildingblocks.org/
• You may participate in our community support web site:
  – Tools Knowledge Base: http://software.intel.com/en-us/articles/tools
  – User forums: http://software.intel.com/en-us/forums/
  – Intel® Software Product support info: http://www.intel.com/software/support


Thanks

email id │[email protected]