One, Two, ... Many

Florian Rappl, Department of Theoretical Physics, University of Regensburg

One, Two, ... Many - Frameworks for programming on massively parallel architectures


  • Motivation and challenges
  • Popular choices
  • OpenMP, Intel Cilk Plus and Intel TBB
  • Comparison
  • Benchmarks and conclusions

Reason for many cores

  • Moore's law still holds thanks to improved engineering
  • Clock frequencies have stagnated since the breakdown of Dennard scaling
  • More transistors possible - but how to use?
  • One possible solution is to place more cores
  • Another is to integrate a full system (SoC)

Programming challenges

  • How to use the system efficiently?
  • What about deadlocks, barriers and memory?
  • How to debug massively parallel code?
  • How to do communication?
  • Role of the operating system ...

Detecting and splitting problems

Parallel programming frameworks

  • MPI
  • OpenMP
  • Cilk
  • Intel Cilk Plus
  • Intel Threading Building Blocks (TBB)
  • Intel Array Building Blocks (ArBB)
  • AMP, CUDA, OpenAcc, OpenCL

The old way...

... and the new way!

Regarding GPUs

  • GPUs execute code in the form of so-called kernels
  • Data must be transferred explicitly from the CPU to the GPU
  • A major problem is the portability of applications
  • Various frameworks currently try to address this issue
  • CPU vendors such as AMD and Intel also provide solutions

Look familiar?


  • The magic phrase is Many Integrated Cores (MIC)
  • The concept is to provide a (PCIe) board with many integrated cores
  • CPU vendors produce an x86-like architecture
  • This retains compatibility and strengthens portability
  • However, programming it efficiently requires the right framework

A simple program

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
	int sum = 0;
	int add = atoi(argv[1]);
	for (int i = 0; i < 100; i++) {
		sum += i + add;
	}
	printf("Result = %d\n", sum);
	return 0;
}

OpenMP

  • Old and robust standard with compiler-specific implementations
  • The idea is to make changes to the code that remain portable (serial and parallel)
  • The latest standard, 3.0, also offers a task-driven model
  • Became popular due to the very easy parallelization of for-loops
  • Available for C, C++ and Fortran

How can we use OpenMP?

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
	int sum = 0;
	int add = atoi(argv[1]);
#pragma omp parallel for reduction(+: sum)
	for (int i = 0; i < 100; i++) {
		sum += i + add;
	}
	printf("Result = %d\n", sum);
	return 0;
}

A look behind the curtain

	;some moves...
	addl	%eax, -8(%rbp)
	addl	$1, -12(%rbp)
	cmpl	$99, -12(%rbp)
	;again some moves
	call	printf
	movl	$OMP_SECTION, %edi
	call	GOMP_parallel_start
	;load and move
	call	GOMP_parallel_end
	;some moves
	call	printf

What happened to the loop body?

	call omp_get_num_threads
	call omp_get_thread_num
	; moves and shifts
	idivl	%ebx
	imull	%ebx, %edx
	cmpl	$100, %edx
	; further computations
	; some moves
	addl	-20(%rbp), %eax
	addl	%eax, -24(%rbp)
	addl	$1, -20(%rbp)
	cmpl	%edx, -20(%rbp)
	; some moves again
	lock addl	%eax, (%rdx)

OpenMP (up to 3.0) features

  • Blocks for single, critical and master thread only
  • Reductions and other global operations
  • omp_set_nested, omp_get_nested
  • Synchronization and scheduling possibilities
  • omp_set_schedule, omp_get_schedule
  • Sections with different code for different threads

Intel Cilk Plus

  • Originally developed at MIT
  • Very minimalistic approach: just three keywords
  • The idea is to make as few changes as possible to serial code
  • Language extension available for Intel's compiler and GCC
  • Extends C (partly) and C++ but is not available for Fortran

How can we use Cilk Plus?

#include <cstdio>
#include <cstdlib>
#include <cilk/cilk.h>
#include <cilk/reducer_opadd.h>

int main(int argc, char **argv) {
	cilk::reducer_opadd<int> sum(0);
	int add = std::atoi(argv[1]);
	_Cilk_for (int i = 0; i < 100; i++)
		sum += i + add;
	std::printf("Result = %d\n", sum.get_value());
}

Cilk Plus features

  • cilk_spawn, cilk_sync and cilk_for
  • #pragma cilk_grainsize
  • Special reducer objects
  • Possibility to perform set_worker_count()
  • Special vectorization features:
    c[0:2*n-1] = 0; //Initialize
    for (size_t i=0; i<n; ++i)
    	c[i:n] += a[i]*b[0:n];
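The three keywords compose naturally for recursive fork-join code. A minimal sketch (my example, not from the talk; requires a Cilk Plus-enabled compiler, e.g. GCC with -fcilkplus):

```cpp
#include <cilk/cilk.h>

// Fork-join recursion with the Cilk keywords.
int fib(int n) {
	if (n < 2) return n;
	int x = cilk_spawn fib(n - 1);  // may run on another worker
	int y = fib(n - 2);             // continues on this worker
	cilk_sync;                      // wait for the spawned child
	return x + y;
}
```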

Intel Threading Building Blocks

  • Task driven programming model
  • Collection of parallel algorithms
  • C++ is required
  • Heavy usage of templates
  • Available in commercial and open source form
  • In some cases C++11 features (e.g. lambdas) are useful

How can we use TBB?

#include <cstdio>
#include <cstdlib>
#include <tbb/parallel_reduce.h>
#include <tbb/blocked_range.h>

int main(int argc, char **argv) {
	int add = std::atoi(argv[1]);
	int sum = tbb::parallel_reduce(tbb::blocked_range<int>(0, 100), 0,
		[add](const tbb::blocked_range<int>& r, int local) -> int {
			for (int i = r.begin(); i != r.end(); i++)
				local += i + add;
			return local;
		},
		[](int x, int y) -> int { return x + y; });
	std::printf("Result = %d\n", sum);
}

TBB features

  • Lots of premade algorithms
  • Scalable memory allocation possible
  • Race-condition free containers e.g. concurrent_vector
  • Atomic operations e.g. compare_and_swap
  • Efficient task scheduling
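The race-condition-free containers can be sketched in a few lines (my example, not from the talk; assumes a TBB installation to link against): many threads may grow the container concurrently without corrupting it.

```cpp
#include <tbb/parallel_for.h>
#include <tbb/concurrent_vector.h>

int main() {
	tbb::concurrent_vector<int> v;
	// Concurrent push_back is safe; a std::vector would be corrupted here.
	tbb::parallel_for(0, 100, [&v](int i) {
		v.push_back(i);
	});
	return v.size() == 100 ? 0 : 1;
}
```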

A look back ...

OpenMP
  • Options
  • Portable
  • Languages
  • Performance
  • Memory hog
  • Unreliable

Cilk Plus
  • Performance
  • Scaling
  • Global variables
  • No Fortran
  • Less control
  • Possibilities

TBB
  • Possibilities
  • Managed res.
  • Maintenance
  • C++ only
  • OOP bound
  • Not much control


  • under NDA

Conclusions

  • Programming in parallel by using threads will no longer be optional
  • Different frameworks offer different philosophies
  • The best framework is the one that is suitable for the job
  • Depending on the requirements some choices will drop out instantly
  • Benchmarks suggest that both the system and the compiler influence the choice

There are 3 rules to follow when parallelizing large codes. Unfortunately, no one knows what these rules are.

W. Somerset Maugham, Gary Montry