Connecting Multiple Computers

../../_images/hero_cc560367-265d-4dd8-aaf1-330a55528278.jpg

© iStock.com/3D_generator

Distributed Memory Architecture

Because of the difficulty of having very large numbers of CPU-cores in a single shared-memory computer, all of today’s supercomputers use the same basic approach to build a very large system: take lots of separate computers and connect them together with a fast network.

For the moment, let’s ignore the complication that each computer is itself a shared-memory computer, and just say we have one processor in each computer:

../../_images/hero_cb43e824-caf6-43f9-9d5c-85f347dc7373.png

The most important points are:

  • every separate computer is usually called a node

  • each node has its own memory, totally separate from all the other nodes

  • each node runs a separate copy of the operating system

  • the only way that two nodes can interact with each other is by communication over the network.

The office analogy is very useful here: a distributed-memory parallel computer is like workers all in separate offices, each with their own personal whiteboard, who can only communicate by phoning each other.

Advantages

The number of whiteboards (i.e. the total memory) grows as we add more offices.

There is no overcrowding so every worker has easy access to a whiteboard.

We can, in principle, add as many workers as we want provided the telephone network can cope.

Disadvantages

If we have large amounts of data, we have to decide how to split it up across all the different offices.

We need to have lots of separate copies of the operating system.

It is more difficult to communicate with each other: you cannot see each other's whiteboards, so you have to make a phone call.

The second disadvantage can be a pain when we do software updates - we have to upgrade thousands of copies of the OS! However, it doesn't have any direct cost implications, as almost all supercomputers run some version of the Linux OS, which is free.

It turns out that it is much easier to build networks that can connect large numbers of computers together than it is to have large numbers of CPU-cores in a single shared-memory computer. This means it is relatively straightforward to build very large supercomputers - it is an engineering challenge, but a challenge that the computer engineers seem to be very good at tackling!

So, if building a large distributed-memory supercomputer is relatively straightforward, does that mean we've cracked the problem?

Well, unfortunately not. The compromises we have had to make (many separate computers each with their own private memory) mean that the difficulties are now transferred to the software side. Having built a supercomputer, we now have to write a program that can take advantage of all those thousands of CPU-cores and this can be quite challenging in the distributed-memory model.
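
To give a flavour of what this means in practice, here is a minimal sketch of a message-passing program written in Python with the mpi4py library (purely illustrative: it assumes mpi4py and an underlying MPI library are installed, and the salary figure is made up). Each process has its own private memory, and data only moves when one process explicitly sends a message and another explicitly receives it - the programming equivalent of a phone call.

```python
# A minimal message-passing sketch using mpi4py (an illustration only;
# assumes mpi4py and an MPI library are available).
# Run with, for example:  mpirun -n 2 python phone_call.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()   # this process's id - its "office number"

# Each process has its own private memory; the only way to share data
# is an explicit message.
if rank == 0:
    comm.send({"office_total": 1_234_567}, dest=1)   # made-up number
    print("Process 0 phoned its total through to process 1")
elif rank == 1:
    data = comm.recv(source=0)
    print(f"Process 1 received: {data}")
```

Every piece of data that needs to move between nodes has to be transferred explicitly like this, which is why programming in the distributed-memory model takes more thought than in the shared-memory model.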

Why do you think the distributed memory architecture is common in supercomputing but is not used in your laptop?


../../_images/hero_91d652a7-98f2-49d1-85ee-62d3ff46bac6.jpg

© iStock.com/Maxiphoto

Simple Parallel Calculation

Let's return to the income calculation example. This time we'll be a bit more ambitious and try to add up 800 salaries rather than 80. This list of salaries fills 8 whiteboards (100 salaries on each), all in separate offices.

Here we are exploiting the fact that distributed-memory architectures allow us to have a large amount of memory.

If we have one worker per office, think about how you could get them all to cooperate to add up all the salaries. Consider two cases:

  • only one boss worker needs to know the final result;

  • all the workers need to know the final result.

To minimise the communication-related costs, try to make as few phone calls as possible.
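
One strategy, sketched below in plain Python, is for every office to add up its own whiteboard first and then combine the partial sums in a tree pattern: in each round, half of the remaining offices phone their sub-total to a neighbour, so the boss has the grand total after log2(8) = 3 rounds and only 7 phone calls in total. (This is just a simulation of the offices on one computer, not a real parallel program, and the salary numbers are made up.)

```python
# Simulating 8 "offices", each holding 100 salaries in its own private list.
import random

NUM_OFFICES = 8
offices = [[random.randint(20_000, 80_000) for _ in range(100)]
           for _ in range(NUM_OFFICES)]

# Step 1: every office adds up its own whiteboard - no communication needed.
partial = [sum(salaries) for salaries in offices]

# Step 2: combine the partial sums in a tree. Each "+=" below stands for one
# phone call passing a sub-total to another office.
calls = 0
stride = 1
while stride < NUM_OFFICES:
    for sender in range(stride, NUM_OFFICES, 2 * stride):
        partial[sender - stride] += partial[sender]   # one phone call
        calls += 1
    stride *= 2

print(f"Boss's total: {partial[0]} after {calls} phone calls")
# If everyone needs the answer, the boss can phone it back down the same
# tree, again taking log2(8) = 3 rounds of calls.
```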


../../_images/ARCHER2.jpg

© ARCHER2

Case study of a real machine

To help you understand the general concepts we have introduced this week, we’ll now look at a specific supercomputer. I’m going to use the UK National Supercomputer, ARCHER2, as a concrete example. As well as being a machine I’m very familiar with, it has a relatively straightforward construction and is therefore a good illustration of supercomputer hardware in general.

General

ARCHER2 is an HPE Cray EX machine, built by the American supercomputer company Cray, now part of Hewlett Packard Enterprise. It contains 750,080 CPU-cores and has a theoretical performance of 28 Pflop/s. It is operated by EPCC at the University of Edinburgh on behalf of EPSRC and NERC, and is the major HPC resource for UK research in engineering and in physical and environmental science.

Node design

The basic processor used in ARCHER2 is the AMD Zen2 (Rome) EPYC 7742 CPU, which has a clock speed of 2.25 GHz. Each ARCHER2 node has 128 cores spread across two of these AMD processors, and all the cores are under the control of a single operating system. The OS is the HPE Cray Linux Environment, a specialised version of SUSE Linux.

Network

The complete ARCHER2 system contains 5,860 nodes, i.e. ARCHER2 is effectively almost 6,000 separate computers, each running its own copy of Linux. They are connected by the HPE Slingshot interconnect, which has a complicated hierarchical structure specifically designed for supercomputing applications. Each node has two 100 Gb/s network connections, which means every node has a network bandwidth around 2,000 times greater than a 100 Mb/s fast broadband connection!

System performance

ARCHER2 has a total of 750,080 CPU-cores: 5,860 nodes, each with 128 CPU-cores. With a clock frequency of 2.25 GHz, each CPU-core can execute 2.25 billion instructions per second. However, on a modern processor, a single instruction can perform more than one floating-point operation.

For example, on ARCHER2 one instruction can perform up to four separate additions. In fact, the cores have separate units for doing additions and for doing multiplications (two of each) that all run in parallel. With the wind in the right direction and everything going to plan, a core can therefore perform 16 floating-point operations per cycle: eight additions and eight multiplications.

This gives a peak performance of 750,080 * 2.25 * 16 Gflop/s = 27,002,880 Gflop/s, i.e. around 27 Pflop/s, consistent with the theoretical performance of 28 Pflop/s quoted above.
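
The arithmetic is easy to check for yourself; the short Python snippet below simply reproduces the numbers quoted above.

```python
# Reproducing the ARCHER2 peak-performance arithmetic from the text.
nodes = 5_860
cores_per_node = 128
clock_ghz = 2.25          # billions of cycles per second
flops_per_cycle = 16      # 8 additions + 8 multiplications per core

cores = nodes * cores_per_node                      # 750,080 CPU-cores
peak_gflops = cores * clock_ghz * flops_per_cycle   # Gflop/s
print(f"{cores:,} cores -> {peak_gflops:,.0f} Gflop/s "
      f"= {peak_gflops / 1e6:.1f} Pflop/s")
```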

ARCHER2 comprises 23 separate cabinets, each about the height and width of a standard door, with around 32,768 CPU-cores (256 nodes), or about 65,000 virtual cores (using multi-threading), in each cabinet.

../../_images/hero_73afa9aa-74db-4ad2-893e-971956518bdf.jpg

© EPCC

Storage

Most of the ARCHER2 nodes have 256 GByte of memory (some have 512 GByte), giving a total memory in excess of 1.5 PByte of RAM.

Disk storage systems are quite complicated, but they follow the same basic approach as supercomputers themselves: connect many standard units together to create a much more powerful parallel system. ARCHER2 has over 15 PBytes of disk storage.

Power and Cooling

If all the CPU-cores are fully loaded, ARCHER2 requires in excess of 4 Megawatts of power, roughly equivalent to a small town of around 4,000 houses. This is a significant amount of power to consume, with an associated environmental impact that we must look to mitigate; ARCHER2 is therefore supplied by a 100% renewable energy contract.

The ARCHER2 cabinets are kept cool by pumping water through cooling pipes. The water enters at approximately 18°C and, after passing through the cabinets, comes out at around 29°C; it can then be cooled back down and re-circulated. When necessary the water is cooled by electrical chillers but, most of the time, ARCHER2 can take advantage of the mild Scottish climate and cool the water for free simply by pumping it through external cooling towers, so saving significant amounts of energy.

../../_images/hero_87e2018b-86eb-4aa5-a7c4-efd271a505b2.webp

© Mike Brown

../../_images/hero_a887d8cf-e9a0-4810-b7ab-b7a016dfc47f.webp

ARCHER’s cooling towers © Mike Brown


Wee ARCHIE case study

Finally, Wee ARCHIE makes its appearance again! This video uses Wee ARCHIE to explain the parallel computer architecture concepts introduced this week.

It is worth emphasising that the physical distance between the nodes does impact their communication time, i.e. the further apart they are, the longer it takes to send a message between them. Can you think of any reason why this behaviour may be problematic on large machines, and any possible workarounds?

As usual, share your thoughts with your fellow learners!

For anyone interested in how Wee ARCHIE has been put together (and possibly wanting to build their own cluster), we invite you to follow the links from this blog article - Setting up your own Raspberry Pi cluster.

../../_images/hero_d827be57-5840-4339-b47c-f70c0d36fcd1.jpg
../../_images/hero_5bca5e55-5548-4a13-8f20-f07f498cec7e.jpg

../../_images/hero_eee58e21-6f26-4139-b812-54694c3d254e.jpg

© iStock.com/filonmar

ARCHER2 - it’s more complicated…

In the last few steps we have glossed over a few details of the processors and the network.

If you look up the specifications of the AMD Zen2 (Rome) EPYC 7742 processor you will see that it has 64 CPU-cores, whereas the ARCHER2 nodes have 128 CPU-cores. This is possible because each node actually contains two physical processors. However, they are put together so that both processors share the same memory, meaning that, to the user, it looks essentially identical to a single 128-core processor. How this is done is illustrated below and is called a Non-Uniform Memory Access (NUMA) architecture.

../../_images/hero_9f93cf41-f24d-4ab2-8a7e-d25a78a8089c.png

NUMA architecture

Although every CPU-core can access all the memory regardless of which processor it is located on, this can involve going through an additional memory bus, so reading data from the other processor's memory is slower than reading from your own. This hardware arrangement is really a detail that isn't usually very important - the most important feature is that the 128 CPU-cores form a single shared-memory computer, all under the control of a single operating system.
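
If you are curious what this looks like in practice, the short Python sketch below reads the NUMA layout that Linux exposes under /sys (an illustration only: it assumes the standard sysfs files are present, which is the case on most Linux systems, and the exact layout will differ from machine to machine).

```python
# Print which CPU-cores belong to each NUMA region on a Linux machine.
from pathlib import Path

for node in sorted(Path("/sys/devices/system/node").glob("node[0-9]*")):
    cpus = (node / "cpulist").read_text().strip()
    print(f"{node.name}: CPU-cores {cpus}")
# On a two-processor node like ARCHER2's you would expect to see several
# regions, each listing the cores that sit closest to that region's memory.
```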


The details of the network are even more complicated with four separate levels ranging from direct connections between the nodes packaged together on the same blade, up to fibre-optic connections between separate cabinets. If you are interested in the details see the ARCHER2 website.


ARCHER2: building a real supercomputer

Wee ARCHIE is very small and was built on someone’s desk. Real supercomputers are very large and require a lot of infrastructure to support them and manpower to build them.

This time lapse video documents the installation of the ARCHER2 system at EPCC in Edinburgh, UK. We use it to pick out various aspects of supercomputer hardware that are not so well illustrated by Wee ARCHIE.

Is there anything that surprised you? We are curious to know so share your impressions in the comments section!


Quiz - Processors, ARCHER2 and Wee ARCHIE

Question 1

Which of these are true about a typical processor in a modern supercomputer?

Select all the answers you think are correct.

A) it contains a single CPU-core

B) it contains many separate CPU-cores

C) it is a special processor, custom-designed for supercomputing

D) it is basically the same processor you would find in a high-end PC or compute server

Question 2

How are the CPU-cores attached to the memory in a modern multicore processor?

Select all the answers you think are correct.

A) the memory is physically sliced up between them

B) the memory is shared between all the CPU-cores

C) cores share access to the memory so they can sometimes slow each other down

D) each core can access the memory completely unaffected by the other cores

Question 3

Like almost all supercomputers, ARCHER2 is constructed as a series of separate cabinets (23 in the case of ARCHER2), each standing about as high and wide as a standard door. Why do you think this size of cabinet is chosen?

A) it is the minimum size that can be cooled effectively

B) it is the maximum size that can be run from a single power supply

C) any larger and the cabinets would not fit through the doors of the computer room

D) freight companies will not ship anything larger than this

Question 4

How are ARCHER2’s 750,080 cores arranged?

A) as one large shared-memory system

B) as 750,080 separate nodes

C) as 5,860 nodes each with 128 cores

D) as 11,720 nodes each with 64 cores

Question 5

Which of these features make the UK National Supercomputer ARCHER2 different from the toy system Wee ARCHIE?

Select all the answers you think are correct.

A) it has multicore CPUs

B) it has multiple nodes connected by a network

C) it runs the Linux operating system

D) it has a much faster network

E) it has many more CPU-cores