How does your laptop work?

../../_images/hero_506fdfe2-9cb8-4e6d-9c7c-c62c6f286dc1.jpg

© iStock.com/Denis83

Computer Basics

Before we look at how supercomputers are built, it’s worth recapping what we learned last week about how a standard home computer or laptop works.

Things have become slightly more complicated in the past decade, so for a short while let's pretend we are back in 2005 (notable events from 2005, at least from a UK point of view, include Microsoft founder Bill Gates receiving an honorary knighthood and the BBC relaunching Doctor Who after a gap of more than fifteen years).

A personal computer from 2005 had three major components: a single processor for performing calculations, Random Access Memory (RAM) for temporary storage of data and a hard-disk for long-term storage of programs and files.

../../_images/hero_9090d93c-0a48-4a33-8ed4-3b8fc6acf6cf.png

For our purposes, it turns out that the way the memory is configured is the most fundamental aspect, so we won't worry about the disk for now.

Moore's law had meant that, over the previous three decades, processors had become exponentially faster, largely due to ever-increasing CPU clock speeds. However, around 2005, the increase in clock speed stopped and frequencies started to plateau at around a couple of GHz.

The reason was simple: the amount of electrical power required to run processors at these speeds had become so large that they were becoming too hot for the domestic market (could not be cooled by a simple fan) and too expensive to run for the commercial market (large electricity bills and expensive cooling infrastructure). So, around 2005, the application of Moore’s law changed: rather than using twice as many transistors to build a new, more complicated CPU with twice the frequency, manufacturers started to put two of the old CPUs on the same silicon chip - this is called a dual-core CPU.
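Why is doubling the frequency so much more expensive than doubling the number of cores? A rough rule of thumb (our own aside, not a course formula) is that the dynamic power of a chip scales as P ∝ C V²f, where C is the capacitance, V the supply voltage and f the clock frequency. Since V has to rise roughly in line with f, power grows roughly with the cube of the frequency: doubling the clock speed can cost around eight times the power, whereas doubling the number of cores at a fixed frequency only costs around twice the power.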

The trend continued with four CPUs on a chip, then more … Generically, they are called multicore CPUs, although for very large numbers the term manycore CPU is now commonplace.

The terminology gets somewhat confusing now as, when you say processor or CPU, it’s sometimes not clear if you mean the physical chip that you buy from a hardware store (which actually has more than one CPU on it) or whether you mean the individual CPUs themselves. To avoid confusion, in this course we call each individual CPU a CPU-core, and reserve CPU or processor for the entire multicore CPU. So, a quad-core CPU (or quad-core processor) has four CPU-cores.

We now have two complementary ways of building a parallel computer: we can build a single multicore computer using a processor with perhaps a few tens of CPU-cores, or we can take lots of separate computers, each with their own processor and memory, and link them together using a high-speed network. These two approaches are called the shared-memory architecture and the distributed-memory architecture and we will now look at them in detail.

What do you think the main differences between these two approaches are? Can you think of any advantages and/or disadvantages for both of them?


../../_images/hero_55c8a23e-686f-42a9-b7e9-de0a12208486.jpg

© iStock.com/oonal

Shared Memory Architecture

The fundamental feature of a shared-memory computer is that all the CPU-cores are connected to the same piece of memory.

../../_images/hero_2160d5f2-d404-4b10-b77e-e7b2a04ac529.png

This is achieved by having a memory bus that takes requests for data from multiple sources (here, each of the four separate CPU-cores) and fetches the data from a single piece of memory. The term bus apparently comes from the Latin omnibus meaning for all, indicating that it is a single resource shared by many CPU-cores.

This is the basic architecture of a modern mobile phone, laptop or desktop PC. If you buy a system with a quad-core processor and 4 GBytes of RAM, each of the 4 CPU-cores will be connected to the same 4 GBytes of RAM, and they'll therefore have to play nicely and share the memory fairly between each other.

A good analogy here is to think of four office-mates or workers (the CPU-cores) sharing a single office (the computer) with a single whiteboard (the memory). Each worker has their own set of whiteboard pens and an eraser, but they are not allowed to talk to each other: they can only communicate by writing to and reading from the whiteboard.

Later on, we’ll start to think about how we can use this shared whiteboard to get the four workers to cooperate to solve the same problem more quickly than they can do it alone. However, the analogy already illustrates two key limitations of this approach:

  1. memory capacity: there is a limit to the size of the whiteboard that you can fit into an office, i.e. there is a limit to the amount of memory you can put into a single shared-memory computer;

  2. memory access speed: imagine that there were ten people in the same office - although they can in principle all read and write to the whiteboard, there's simply not enough room for more than around four of them to do so at the same time before they start to get in each other's way. You can fill the office with more and more workers, but productivity will stall beyond about four of them because the additional workers spend more and more time idle, queueing up to access the shared whiteboard.

Limitations

It turns out that memory access speed is a real issue in shared-memory machines. If you look at the processor diagram above, you’ll see that all the CPU-cores share the same bus: the connection between the bus and the memory eventually becomes a bottleneck and there is simply no point in adding additional CPU-cores. Coupled with the fact that the kinds of programs we run on supercomputers tend to read and write large quantities of data, it is often memory access speed that is the limiting factor controlling how fast we can do a calculation, not the floating-point performance of the CPU-cores.

There are various tricks to overcoming these two issues, but the overcrowded office clearly illustrates the fundamental challenges of this approach if we require many hundreds of thousands of CPU-cores.

Despite these limitations, the shared-memory architecture is universal in modern processors. What do you think the advantages are?

Think of owning one quad-core laptop compared to two dual-core laptops - which is more useful to you and why?


../../_images/hero_658e81ad-80f0-4383-8c94-f469ace0b757.jpg

© iStock.com/silverjohn

Simple Parallel Calculation

We can investigate a very simple example of how we might use multiple CPU-cores by returning to the calculation we encountered in the first week: computing the average income of the entire world’s population.

If we're a bit less ambitious and think about dozens of people rather than several billion, we can imagine that all the individual salaries are already written on the shared whiteboard. Let's imagine that the whiteboard is just large enough to fit 80 individual salaries. Think about the following:

  • how could four workers cooperate to add up the salaries faster than a single worker?

  • using the estimates of how fast a human is from last week, how long would a single worker take to add up all the salaries?

  • how long would 4 workers take for the same number of salaries?

  • how long would 8 workers take (you can ignore the issue of overcrowding)?

  • would you expect to get exactly the same answer as before?

We’ll revisit this problem in much more detail later but you know enough already to start thinking about the fundamental issues.
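To make the idea concrete, here is one way the four-worker summation might look in code. This is a minimal sketch in C using OpenMP - an assumption on our part, since the course has not introduced OpenMP - in which each thread adds up a share of the salaries and the partial sums are combined at the end:

// A minimal sketch (using OpenMP, which this course has not yet
// introduced) of several workers sharing the summation.
// Compile with OpenMP enabled, e.g. gcc -fopenmp.

double parallel_salarysum(double salarylist[], int npeople)
{
  double total;
  int i;

  total = 0.0;

  // Each thread sums its own share of the list; the reduction
  // clause combines the partial sums into a single total.
  #pragma omp parallel for reduction(+:total)
  for (i=0; i < npeople; i++)
    {
      total = total + salarylist[i];
    }

  return total;
}

Because the partial sums are combined in a different order from the serial loop, and floating-point addition is not exactly associative, the parallel total may differ very slightly from the serial one - a hint for the last question above.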


../../_images/hero_57d4f566-8830-4f0e-a6a9-e48351b03a5c.jpg

© iStock.com/Rost-9D

Who needs a multicore laptop?

We’ve motivated the need for many CPU-cores in terms of the need to build more powerful computers in an era when the CPU-cores themselves aren’t getting any faster. Although this argument makes sense for the world’s largest supercomputers, we now have multicore laptops and mobile phones - why do we need them?

You might think the answer is obvious: surely two CPU-cores will run my computer program twice as fast as a single CPU-core? Well, it may not be apparent until we cover how to parallelise a calculation next week, but it turns out that this is not the case: it usually requires manual intervention to enable a computer program to take advantage of multiple CPU-cores. Although many programs are now written this way, that certainly wasn't the case back in 2005 when multicore CPUs first became commonplace.

So what is the advantage for a normal user whose programs have not been parallelised? We call these serial programs.

Operating Systems

The important point is that, as a user, you don't actually say "please run this program on that CPU-core". There is a piece of software that sits between you and the hardware - the Operating System or OS - that isolates you from direct access to the CPU-cores, memory etc. There are several common OSs around today - e.g. Windows, macOS, Linux and Android - but they all perform the same basic function: you ask the OS to execute a program, and it then decides if and when to actually run it on a physical CPU-core.

../../_images/hero_6d93ece3-84b2-495f-b5c5-0e0f652196ea.png

This enables even a single CPU-core machine to appear to be doing more than one thing at once - it will seem to be running dozens of programs at the same time. What is actually happening is that the OS is running one program for, say, a hundredth of a second, then stopping that program and running another one for a hundredth of a second, etc. Just like an animation running at many individual frames per second, this gives the illusion of continuous motion.

How the OS exploits many CPU-cores

On a shared-memory computer, the important point is that all the CPU-cores are under the control of a single OS (meaning you don't need to buy 4 Windows licences for your quad-core laptop!). This means that your computer can genuinely run more than one program at the same time. It's a bit more complicated for the OS - it has to decide not just which programs to run but also where to run them - but a good OS performs a juggling act to keep all the CPU-cores busy.

../../_images/hero_4a65543e-9635-4624-9811-5da1a0ab431e.png

This means that you can run a web browser, listen to music, edit a document and run a spreadsheet all at the same time without these different programs slowing each other down. Because of the shared memory, the OS can stop a program on CPU-core 1 and then restart it later on CPU-core 3: all the workers can read and write to the same shared whiteboard so can easily pick up where someone else has left off.

A shared-memory computer looks like a more powerful single-core computer: it operates like a single computer because it has a single OS, which fundamentally relies on all the CPU-cores being able to access the same memory. It is this flexibility that makes multicore shared-memory systems so useful.
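Incidentally, a program can ask the OS how many CPU-cores it can see. Here is a minimal sketch in C, assuming a POSIX system such as Linux or macOS:

// A minimal sketch, assuming a POSIX system (Linux or macOS):
// ask the OS how many CPU-cores are currently online.

#include <stdio.h>
#include <unistd.h>

int main(void)
{
  long ncores = sysconf(_SC_NPROCESSORS_ONLN);

  printf("The OS reports %ld CPU-cores online\n", ncores);

  return 0;
}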

So, for home use, the Operating System does everything for us, running many separate programs at the same time. On the other hand, in supercomputing we want to run a single program but make it go faster - the OS can’t really help us here and we’ll see that we have to work a bit harder.

In your opinion, what are the downsides of this more advanced ‘single-core computer’ approach?


How does your laptop use multiple CPU-cores?

This video shows a simple demo to illustrate how modern operating systems take advantage of many CPU-cores.

Watch what happens when David runs multiple copies of a simple income calculation program on his quad-core laptop. Do you find this behaviour surprising?

Note that, although we are running a few instances of our toy program at the same time, there is no time advantage in doing so. We are running exactly the SAME program many times, so each run will give us exactly the same results in approximately the same time. This demo illustrates nicely how an operating system handles execution on multiple CPU-cores, but it is otherwise a waste of resources.

Can you think of a situation in which this kind of execution may be useful?

We haven’t really explained what the concept of minimum interference is about - think of David closing down his browser before running his code - but can you think of a reason why it may be important to isolate your program as much as possible, especially when running on a supercomputer? What are the implications of not doing this?

Share your thoughts in the comments section!

If you are interested, here is the function that David actually timed. It's written in the C language - it's purely here for reference and it's not meant to be compiled or run as it stands.

// Add up a given number of salaries to compute total income.
// Use floating-point numbers to better represent real calculations.

double salarysum(double salarylist[], int npeople)
{
  double total;
  int i;

  total = 0.0;

  for (i=0; i < npeople; i++)
    {
      total = total + salarylist[i];
    }

  return total;
}
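If you did want to try it out, a driver program along the following lines would work. This is purely our own sketch - the array size, dummy data and timing method are illustrative assumptions, not the code David ran:

// A hypothetical driver that times salarysum() - an illustration
// only, not the program used in the video.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

double salarysum(double salarylist[], int npeople);

int main(void)
{
  int npeople = 100000000;   /* e.g. the "large" dataset of 100 million */
  double *salaries = malloc(npeople * sizeof(double));
  int i;

  if (salaries == NULL) return 1;

  for (i=0; i < npeople; i++)
    {
      salaries[i] = 30000.0;   /* dummy salary data */
    }

  clock_t start = clock();
  double total = salarysum(salaries, npeople);
  double seconds = (double)(clock() - start) / CLOCKS_PER_SEC;

  printf("Total income %.1f computed in %.2f seconds\n", total, seconds);

  free(salaries);
  return 0;
}

Note that clock() measures CPU time rather than wall-clock time; for a serial program like this the two are essentially the same.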

David: I re-ran the same studies covered in the video but with almost all other tasks disabled; for example, I did not run the graphical performance monitor, which allowed me to have access to all four CPU-cores. Here are the results.

dataset    #copies    runtime (seconds)
small      1          9.7
small      4          11.1
small      8          22.2


../../_images/hero_e4699d69-bd25-4d86-b020-f3ad31557589.jpg

© iStock.com/4x6

Memory Caches

We mentioned before that memory access speeds are a real issue in supercomputing, and adding more and more CPU-cores to the same memory bus just makes the contention even worse.

The standard solution is to have a memory cache: a small amount of scratch memory attached to every CPU-core, which is very fast. However, it is also quite small - well under a megabyte, when the total memory may be more than a thousand times larger - so how can it help us?

Think of the analogy with many workers sharing an office. The obvious solution to avoid always queueing up to access the shared whiteboard is to take a temporary copy of what you are working on. Imagine that each worker has a small notebook: when you need to read data from the whiteboard, you fill your notebook with everything you need and then you can work happily on your own at your desk while other people access the whiteboard.

This works very well for a single worker: you can work entirely from your personal notebook for long periods, and then transfer any updated results to the whiteboard before moving on to the next piece of work.

Writing data

It also works very well for multiple workers if they only ever read data. Unfortunately, real programs also write data, i.e. workers will want to modify the data on the whiteboard. If two people are working on the same data at the same time, we have a problem: if one worker changes some numbers in their notebook then the other worker needs to know about it. The compromise solution is to let everyone know whenever you modify any results in your notebook. Whenever you alter a number, you have to shout out:

“I’ve just changed the entry for the 231st salary - if you have a copy of it then you’ll need to get the new value from me!”

Although this is OK for a small number of workers, it clearly has problems when there are lots of workers. Imagine 100 workers: whenever you change a number you have to let 99 other people know about it, which wastes time. Even worse, you have to be continually listening for updates from 99 other workers instead of concentrating on doing your own calculation.

This is the fundamental dilemma: memory access is so slow that we need small, fast caches so we can access data as fast as we can process it. However, whenever we write data there is an overhead which grows with the number of CPU-cores and will eventually make everything slow down again.

Keeping the data consistent and up-to-date on all the CPU-cores is called cache coherency. It means that we always have up-to-date values in our notebook (or, at the very least, that we know when our notebook is out of date and we must return to the whiteboard). Ensuring cache coherency is the major obstacle to building very large multicore processors.
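The cost of this shouting can actually be seen in code. Here is a minimal sketch in C using OpenMP (again an assumption on our part, not course material) in which four threads repeatedly write to neighbouring counters. Because the counters typically sit on the same cache line, every write forces the caches to exchange updates, and the loop runs far slower than if each counter had a cache line to itself:

// A minimal sketch (using OpenMP, an assumption on our part) of
// coherency overhead: four threads write to adjacent counters that
// share a cache line, so the caches constantly exchange updates -
// the "shouting" described above.

#include <omp.h>

#define NREPS 100000000

long counter[4];   /* four adjacent longs: typically one cache line */

void hammer_counters(void)
{
  #pragma omp parallel num_threads(4)
  {
    int me = omp_get_thread_num();
    long i;

    for (i=0; i < NREPS; i++)
      {
        counter[me]++;   /* each write invalidates the other cores' copies */
      }
  }
}

Padding the array so that each counter sits in its own cache line (typically 64 bytes apart) removes the contention and speeds the loop up dramatically.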

../../_images/hero_f158c8fd-2092-4272-a9dc-e4806b44f9cc.png

Keeping all the caches coherent when we write data is the major challenge.

What do you think is the current state-of-the-art? How many CPU-cores do high-end processors have?


Resource Contention

This video shows a simple demo to illustrate what happens when multiple cores try to use the same resources at the same time.

As we have mentioned before, the situation when multiple CPU-cores try to use the same resources, e.g. memory, disk storage or network buses, is called resource contention. Here we look at memory access.

Watch what happens when three copies of a larger income calculation program are running on three CPU-cores at the same time. Is this what you expected?

Keep in mind that, although the CPU-cores do affect each other, it's not because they exchange any data but because they need the same data from the memory. In other words, the CPU-cores do not collaborate with each other, i.e. they do not share the total work amongst themselves.

Note that I accidentally mis-spoke in the video and the larger calculation actually processes 100 million salaries and not 1 million - David

As for Step 2.6, I also re-ran the same calculations with the graphical monitor turned off so I had access to all 4 CPU-cores. Here are the timings for this large dataset, with the previous small-dataset results reproduced for comparison.

dataset    #copies    runtime (seconds)
small      1          9.7
small      4          11.1
small      8          22.2
large      1          10.7
large      4          28.5
large      8          57.0


Terminology Quiz

Question 1

A system built from a single multicore processor (perhaps with a few tens of CPU-cores) is an example of the ____ ____ architecture, whereas a system composed of many separate processors connected via a high-speed network is referred to as the ____ ____ architecture.

Question 2

The two main limitations of the shared-memory architecture are: memory ____ and memory ____ ____. The hierarchical memory structure is used to improve the memory access speeds. The smallest but also the fastest memory is called ____ memory. And keeping the data consistent and up-to-date on all the CPU-cores is called ____ ____.

Question 3

The situation when multiple CPU-cores try to use the same resources, e.g. memory, disk storage or network buses, is called ____ ____.