Parallel Programming

../../_images/hero_081de380-fc9a-42b0-8312-f414667ccd0d.jpg

© iStock.com/Photokanok

Shared-Variables Model

On a computer, a variable corresponds to some piece of information that we need to store in memory. In the traffic model, for example, we need to store all the cells of the old road (containing the state from the previous step) and the new road (containing the current state), as well as other quantities that we calculate, such as the number of cars that move or the density of the cars. All of these are variables - they take different values throughout the calculation.
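For example, in C the variables for the traffic model might be declared like this (a sketch only; the names and the road length are illustrative, not taken from a real implementation):

    #define ROAD_LENGTH 1000        /* number of cells - illustrative value */

    int old_road[ROAD_LENGTH];      /* state from the previous step */
    int new_road[ROAD_LENGTH];      /* state being calculated this step */

    int   moving_cars;              /* number of cars that move this step */
    float density;                  /* fraction of cells occupied by cars */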

Remember that the shared-memory architecture (many CPU-cores connected to the same piece of memory) is like several office mates sharing a whiteboard. In this model, we have two choices as to where we store any variables:

  • shared variables: accessible by everyone in the office

  • private variables: can only be accessed by the person who owns them

A shared variable corresponds to writing the value on the whiteboard so that everyone in the office can read or modify it. You can think of private variables being stored on a personal notepad that can only be seen by the owner.

Although writing everything on the whiteboard for all to see might seem like a good idea, it is important to ensure that the officemates do not interfere with each other’s calculations. If you are working on the cells for a section of road, you do not want someone else changing the values without you knowing about it. It is crucial to divide up the work so that the individual tasks are independent of each other (if possible) and to make sure that workers coordinate whenever there is a chance that they might interfere with each other.

In the shared-variables model, the workers are often referred to as threads.

Things to consider

When parallelising a calculation in the shared-variables model, the most important questions are:

  • which variables are shared (stored on the whiteboard) and which are private (written in your own notepad);

  • how to divide up the calculation between workers;

  • how to ensure that, when workers need to coordinate with each other, they do so correctly;

  • how to minimise the number of times workers must coordinate with each other.

The most basic methods of coordination are:

  • master region: certain calculations are only carried out by one of the workers - a nominated boss worker;

  • barrier: everybody waits until all workers have reached a certain point in the calculation; when everyone has reached that point, workers can then proceed;

  • locking: if you are working with a variable and don’t want anyone else to touch it, you can lock it. This means that only one worker can access the variable at a time - if the variable is locked by someone else, you have to wait until they unlock it. On a shared whiteboard you could imagine circling a variable to show to everyone else that you have it locked, then erasing the circle when you are finished.

Clearly, all of these have the potential to slow things down as they can lead to workers waiting around for others to finish, so you should try and do as little coordination as possible (while still ensuring that you get the correct result!).
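To make these mechanisms concrete, here is a sketch of how they look in OpenMP, the shared-variables system we meet later in this section. The function and variable names are invented for illustration; note that OpenMP's critical region locks a section of code rather than a single variable:

    #include <omp.h>

    void coordination_sketch(int *total)    /* 'total' is a shared variable */
    {
        #pragma omp parallel
        {
            /* master region: only the nominated boss thread runs this */
            #pragma omp master
            *total = 0;

            /* barrier: everyone waits here until all threads arrive */
            #pragma omp barrier

            /* locking: only one thread at a time may execute this region,
               so updates to the shared variable cannot interfere */
            #pragma omp critical
            *total += 1;
        }
    }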

Adding to a Variable

One of our basic operations is to increment a variable, for example to add up the total number of cars that move on each iteration. It may not be obvious but, on a computer, adding one to a variable is not a single operation. Using the whiteboard analogy, it has the following stages:

  • take a copy of the value on the whiteboard and write it in your notepad (load the value from memory into a register);

  • add one to the value on your notepad (issue an increment instruction on the register);

  • copy the new value back to the whiteboard (store the new value from the register to memory).

In the shared-variables model, the problem occurs if two or more workers try to do this at the same time: if one worker takes a copy of the variable while another worker is modifying it on their notepad, then you will not get the correct answer. Sometimes you might be lucky and no-one else modifies the variable while you are working on your notepad, but there is no guarantee.

This situation is called a race condition and is a disaster for parallel programming: sometimes you get the right answer, but sometimes the wrong answer. To fix this you need to coordinate the actions of the workers, for example using locking as described above.
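To see the race in action, here is a minimal OpenMP sketch in C: the first loop contains exactly the load-increment-store race described above, while the second fixes it by making each update indivisible (a lock or a critical region would also work). It would be compiled with an OpenMP-aware compiler, e.g. gcc -fopenmp:

    #include <stdio.h>

    int main(void)
    {
        int moved = 0;

        /* WRONG: many threads load, increment and store 'moved' at once,
           so some updates are lost - a race condition */
        #pragma omp parallel for
        for (int i = 0; i < 100000; i++) {
            moved = moved + 1;
        }
        printf("racy total:   %d\n", moved);    /* often less than 100000 */

        moved = 0;

        /* CORRECT: each update now completes as one indivisible operation */
        #pragma omp parallel for
        for (int i = 0; i < 100000; i++) {
            #pragma omp atomic
            moved = moved + 1;
        }
        printf("atomic total: %d\n", moved);    /* always 100000 */

        return 0;
    }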


../../_images/hero_fefa5108-2096-4de1-a50c-275f73bfe94d.jpg

© iStock.com/Dutko

How to parallelise the Income Calculation example?

Consider how to parallelise the salaries example using the shared-variables model, i.e. how could 4 office mates add up the numbers correctly on a shared whiteboard?

Remember that you are not allowed to talk directly to your office mates - all communications must take place via the whiteboard. Things to consider include:

  • how do you decompose the calculation evenly between workers?

  • how do you minimise communication between workers?

  • how do you ensure that you do not have any race conditions?

Can you think of any other aspects that should be taken into account?


Solution to the Income Calculation in Shared-Variables

In this video David outlines how to parallelise the income calculation on a shared whiteboard.

Making sure that the workers cooperate correctly is the main issue - ensuring correct synchronisation can be surprisingly subtle.

Compare your answers from the last step with this solution. How did you do? Have you learned anything surprising? We are curious to know!
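If you would also like to compare against actual code, here is a minimal sketch of the standard shared-variables approach: each thread adds up a private subtotal, and the subtotals are only combined into the shared total at the end, with OpenMP's reduction handling the synchronisation for you. The salary figures are invented for illustration:

    #include <stdio.h>

    #define NUM_PEOPLE 8

    int main(void)
    {
        /* illustrative salaries, not the figures from the example */
        int salary[NUM_PEOPLE] = {21, 34, 56, 28, 43, 61, 39, 47};
        long total = 0;

        /* each thread sums a private subtotal; 'reduction' safely combines
           the subtotals into the shared 'total' at the end */
        #pragma omp parallel for reduction(+:total)
        for (int i = 0; i < NUM_PEOPLE; i++) {
            total += salary[i];
        }

        printf("total income: %ld\n", total);
        return 0;
    }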


../../_images/hero_b3c04339-9b4e-4471-a57b-295a8ac2c1e5.jpg

© iStock.com/erhui1979

Message-Passing Model

The Message-Passing Model is closely associated with the distributed-memory architecture of parallel computers.

Remember that a distributed-memory computer is effectively a collection of separate computers, each called a node, connected by some network cables. It is not possible for one node to directly read or write to the memory of another node, so there is no concept of shared memory. Using the office analogy, each worker is in a separate office with their own personal whiteboard that no-one else can see. In this sense, all variables in the Message-Passing Model are private - there are no shared variables.

In this model, the only way to share information with another worker is to send data over the network. We say that workers are passing messages between each other, where the message contains the data that is to be transferred (e.g. the values of some variables). A very good analogy is making a phone call.

The fundamental points of message passing are:

  • the sender decides what data to communicate and sends it to a specific destination (i.e. you make a phone call to another office);

  • the data is only fully communicated after the destination worker decides to receive the data (i.e. the worker in the other office picks up the phone);

  • there are no time-outs: if a worker decides they need to receive data, they wait by the phone for it to ring; if it never rings, they wait forever!

The message-passing model requires participation at both ends: for it to be successful, the sender has to actively send the message and the receiver has to actively receive it. It is very easy to get this wrong and write a program where the sends and receives do not match up properly, resulting in someone waiting for a phone call that never arrives. This situation is called deadlock and typically results in your program grinding to a halt.

In this model, each worker is called a process rather than a thread (as in the shared-variables model), and each worker is given a number to uniquely identify it.
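In MPI, the message-passing system introduced later in this section, the phone-call picture maps onto a matched pair of calls: each process discovers its own identifying number (its rank) at startup, one process sends, and the other receives. A minimal sketch, to be run with at least two processes:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, value;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* my unique worker number */

        if (rank == 0) {
            value = 42;
            /* "make the phone call": send one integer to process 1 */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* "wait by the phone": blocks until the message arrives */
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("process 1 received %d\n", value);
        }

        MPI_Finalize();
        return 0;
    }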

Things to consider

When parallelising a calculation in the message-passing model, the most important questions are:

  • how are the variables (e.g. the old and new roads) divided up among workers?

  • when do workers need to send messages to each other?

  • how do we minimise the number of messages that are sent?

Because there are no shared variables (i.e. no shared whiteboard), you do not usually have to consider how the work is divided up. Since workers can only see the data on their own whiteboards, the distribution of the work is normally determined automatically from the distribution of the data: you work on the variables you have in front of you on your whiteboard.

To communicate a lot of data, we can send one big message or lots of small ones. Which do you think is more efficient? Why?


../../_images/hero_43a975cf-2390-4948-8ac2-9d596a2214b0.jpg

© iStock.com/adrian825

How to parallelise the traffic simulation?

Consider how you could parallelise the traffic model among 4 workers, each with their own whiteboards in separate offices, communicating by making phone calls to each other.

Remember that the cars are on a roundabout (we are using periodic boundary conditions) so cars leaving the end of the road reappear at the start.

To get you started:

  • think carefully about how the traffic model works; what are its basic rules?

  • think about the characteristics of the message-passing model;

  • how can you combine them?

  • which workers need to phone each other, when and how often?

You do not need to provide a clear-cut answer. Instead, list the things that you think need to be considered and why.

Extra Exercises

In fact, sending a message can be implemented in two different ways:

  • like making a phone call (synchronously) or

  • like sending an email (asynchronously).

The difference is whether the sender waits until the receiver is actively taking part (a phone call) or carries on with their own work regardless (sending an email).
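In MPI terms, the two flavours look roughly like this (a sketch, not a complete program; value and dest stand for the data and the destination rank):

    #include <mpi.h>

    void send_both_ways(int value, int dest)
    {
        /* synchronous: like a phone call - does not complete until the
           matching receive has started at 'dest' */
        MPI_Ssend(&value, 1, MPI_INT, dest, 0, MPI_COMM_WORLD);

        /* asynchronous: like sending an email - returns immediately so we
           can carry on working while the message is in flight */
        MPI_Request request;
        MPI_Isend(&value, 1, MPI_INT, dest, 1, MPI_COMM_WORLD, &request);

        /* ... do other useful work here ... */

        MPI_Wait(&request, MPI_STATUS_IGNORE);  /* check it has been sent */
    }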

Do you think that solving the traffic model in parallel is simpler using synchronous or asynchronous messages? Which do you think might be faster? Do you think the boundary conditions are important here?

Imagine that you want all workers to know the average speed of the cars at every iteration. How could you achieve this using as few phone calls as possible?
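As a hint rather than a full answer: MPI provides collective operations that combine a value from every process and deliver the result to everyone in a single, optimised call, which is far cheaper than every worker phoning every other worker. A sketch, assuming each worker counts its local moves and cars (the function and variable names are invented):

    #include <mpi.h>

    /* every process contributes its local counts; everyone gets the
       global average speed back */
    double global_average_speed(int local_moves, int local_cars)
    {
        int local[2]  = { local_moves, local_cars };
        int totals[2];

        /* one collective call sums both counts across all processes and
           returns the result to every process */
        MPI_Allreduce(local, totals, 2, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

        return (double)totals[0] / (double)totals[1];
    }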


Solution to the Traffic Simulation in Message-Passing

In this video David describes the basics of how you can parallelise the traffic model using message passing, i.e. on a distributed-memory machine.

Try to list the most important points of this parallelisation. Was there anything that you failed to consider when coming up with your answer? For example, how does each worker know whether it’s supposed to call or wait for a call? Can you think of any other rules that need to be established for this to work?
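The essential communication pattern that emerges is a boundary (halo) exchange: each worker stores its own section of road plus a copy of one cell from each neighbour, and at every iteration it phones its neighbours to refresh those copies. Here is a sketch, assuming each process holds local_n cells in cells[1..local_n] with halo copies at indices 0 and local_n+1, and that the roundabout means the neighbours wrap around:

    #include <mpi.h>

    /* exchange boundary cells with the left and right neighbours on a
       periodic (roundabout) road */
    void halo_exchange(int *cells, int local_n, int rank, int size)
    {
        int left  = (rank - 1 + size) % size;   /* periodic wrap-around */
        int right = (rank + 1) % size;

        /* send my last real cell right; receive my left halo from the left */
        MPI_Sendrecv(&cells[local_n],   1, MPI_INT, right, 0,
                     &cells[0],         1, MPI_INT, left,  0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* send my first real cell left; receive my right halo from the right */
        MPI_Sendrecv(&cells[1],         1, MPI_INT, left,  1,
                     &cells[local_n+1], 1, MPI_INT, right, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

The combined send-and-receive call avoids the deadlock that could occur if every worker tried to phone its neighbour at exactly the same moment.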

Hopefully, you now have a better understanding of how both programming models work and how they differ from each other. In the next two steps we will talk about the actual implementations of both models.


../../_images/hero_1ea03df9-7321-4b2f-ac7c-8a400de20bc6.jpg

© iStock.com/georgeclerk

MPI and processes

So far we have discussed things at a conceptual level, but it’s worth going into some of the details, particularly so you are familiar with certain terminology such as process, thread, MPI and OpenMP.

Message Passing

The way that message-passing typically works is that you write one computer program, but run many copies of it at the same time. Any supercomputer will have ways to let you spread these programs across the nodes, and we normally ensure we run exactly one copy of the program for every CPU-core so that they can all run at full speed without being swapped out by the operating system. From the OS’s point of view, each program is a separate process and by default they all run completely independently from each other.

For example, if you run a Word Processor and a Spreadsheet Application at the same time, each of these becomes a separate process that can only run on a single CPU-core at a time. In the message-passing model, we exploit the parallel nature of our distributed-memory computer by running many processes. The only real differences from the word-processor and spreadsheet example are that every process is a copy of the same program, and that we want our parallel processes to work together rather than independently.

Each process only has access to its own memory, totally separate from the others (this is strictly enforced by the OS to ensure that, for example, your Word Processor cannot accidentally overwrite the memory belonging to your Spreadsheet Application!). This way of implementing message-passing is called the Single Program Multiple Data or SPMD approach.

When they want to send messages to each other, the processes call special functions to exchange data. For example, there will be a function to send data and a function to receive data. These functions are not directly part of the program but stored in a separate library which will be pre-installed on the supercomputer (if you are familiar with the way that applications are created from source code, this means that the compiler is not directly involved in the parallelisation).

Almost all modern message-passing programs use the Message-Passing Interface library - MPI. Essentially, MPI is a collection of communication functions that can be called from any user process.
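As a sketch of the SPMD idea: the program below is written once, but when launched as, say, four processes, every copy runs the same code and uses its rank to behave differently:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, size;

        MPI_Init(&argc, &argv);                 /* join the parallel program */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which copy am I? */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many copies are there? */

        printf("hello from process %d of %d\n", rank, size);

        MPI_Finalize();
        return 0;
    }

It would typically be launched with a command such as mpirun -n 4 ./program, giving one process per CPU-core.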

So, to summarise:

  • the message-passing model is implemented by running many processes at the same time;

  • each process can only run on a single CPU-core and is allocated its own private memory;

  • inter-process communication is enabled by using the MPI library and so does not require a special compiler;

  • this is also called the SPMD approach.

../../_images/hero_4177e963-f697-4b49-bcce-01940d651fd3.png

Can you see any problems with the Message-Passing approach if one of the nodes has a hardware failure and crashes? As supercomputers get larger, does this become more or less of an issue?


../../_images/hero_45bcc77c-696b-4e72-a685-2f93205b0aa6.jpg

© iStock.com/Westhoff

OpenMP and threads

Shared Variables

Shared variables are implemented in quite a different way from message passing. For shared variables, our CPU-cores need to be able to share the same memory (i.e. read and write to the same whiteboard). However, we said above that different processes cannot access each other’s memory, so what can we do?

The shared-variables approach is implemented using threads. Threads are just like normal programs except they are created by processes while they are running, not explicitly launched by the user. So, every thread belongs to a parent process; unlike processes, threads can share memory.

The sequence is:

  1. we run a single program which starts out running as a single process on a single CPU-core with its own block of memory;

  2. while it is running, the process creates many threads which act like separate programs except they can all share the memory belonging to their parent process;

  3. the operating system will notice that there are lots of threads running at the same time and ensure that, if possible, they are assigned to different CPU-cores.

So, in the shared-variables model, we exploit the parallel nature of our shared-memory computer by running many threads, all created from a single program (the parent process). We rely on the operating system to do a good job of spreading the threads across the CPU-cores.

In supercomputing, we normally use something called OpenMP to create and manage all our threads. Unlike the MPI library, OpenMP is something that needs to be built into the compiler. There are actually many ways of creating parallel threads, but OpenMP was designed to be well suited to large-scale numerical computations, which is why it is so popular in the field.
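A sketch of what this looks like in practice: the directive below asks an OpenMP-aware compiler (e.g. gcc -fopenmp) to create a team of threads, each of which can find out its own number, while all of them share the memory of the parent process:

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        int counter = 0;   /* belongs to the parent process: shared by default */

        /* the process creates a team of threads here */
        #pragma omp parallel
        {
            /* each thread gets its own private copy of 'id' */
            int id = omp_get_thread_num();

            printf("hello from thread %d of %d\n",
                   id, omp_get_num_threads());

            /* every thread can see and safely update the shared variable */
            #pragma omp atomic
            counter++;
        }

        printf("%d threads ran\n", counter);
        return 0;
    }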

To summarise:

  • the shared-variables model is implemented by running many threads at the same time;

  • each thread can only run on a single CPU-core, but they can all share memory belonging to their parent process;

  • in supercomputing, we usually create threads using a special compiler that understands OpenMP.

../../_images/hero_fdf2bb5d-b1af-4305-aa9b-795c7cd61fec.png

When we create threads we rely on the OS to assign them to different CPU-cores. How do you think the OS makes that decision? What does it need to take into account when there may be many more threads than CPU-cores?


../../_images/hero_25614aaf-f218-4284-af91-503871ddae9f.jpg

© iStock.com/vkbhat

Comparing the Message-Passing and Shared-Variables Models

In your opinion, what are the pros and cons of the two models of parallel programming?

Things to consider include:

  • how difficult it is to parallelise a calculation in the two models

  • how do they use memory? Can they share it? Why? How?

  • how many CPU-cores can you use in each model?

  • what happens if you do it incorrectly - will the program ever complete? will it get the right answer?

  • how does the speed of the two models compare - what are the overheads of each?


Terminology Quiz

Question 1

What does the term programming model describe?

A) a particular kind of computer simulation

B) a specific computer programming language

C) a high-level view of how to solve a problem using a computer

D) the low-level details of how a particular computer is constructed

Question 2

What is a race condition in the shared-variables model?

A) when a CPU-core is expecting to receive a message but it never arrives

B) when two CPU-cores have different values for some private variable

C) when lack of synchronisation leads to one CPU-core running ahead of the others

D) when two CPU-cores try to modify the same shared variable at the same time

Question 3

Which of these could cause deadlock in the message-passing model?

A) a CPU-core asks to receive a message but no message is ever sent to it

B) a CPU-core asks to receive a message from another a few seconds before it is sent

C) a CPU-core sends a message to another a few seconds before it is ready to receive it

D) two CPU-cores try to modify the same shared variable at the same time

Question 4

Which of the following best describes the process of decomposing a calculation?

A) deciding which variables should be private and which should be shared

B) deciding which parts of a calculation can be done independently by different CPU-cores

C) choosing between the shared-variables and message-passing models

D) running a parallel program on a parallel computer

Question 5

What is a cellular automaton?

A) a type of computer specifically built to run simple computer simulations

B) a modular system for building parallel computers from simple cellular components

C) a computer simulation technique based on repeated application of simple rules to a grid of cells

D) a parallel programming model