

# Introduction to High-Performance Computing

Session 07 Performance Optimization



## **Performance Modelling**

• the following slides are based on

https://moodle.rrze.uni-erlangen.de/course/view.php?id=311

- 2-day course during MCS Summer School 2014 given by Georg Hager
- Book: G. Hager and G. Wellein:

Introduction to High Performance Computing for Scientists and Engineers,

CRC Computational Science Series, 2010. ISBN 978-1439811924 http://www.hpc.rrze.uni-erlangen.de/HPC4SE/



## **Computer Software and Hardware**



16.03.2021



## Modern Computer Architecture



- today: dual-socket node
  - multiple cores per socket/CPU
  - ccNUMA architecture
  - socket interconnect





## **Detailed View Compute Core**





## Example: Divide Throughput

• in the **Pi.cpp** code the function **f(x)** has one division

```
// define f so that integral of f from 0 to 1 is Pi
double f(const double x) {
   return (4.0/(1.0+x*x));
}
```

- division is the dominant operation (other instructions can be hidden)
- for **n** evaluations of **f** we get  $t = n \cdot \frac{c}{v}$
- Broadwell CPUs need c = 5 cycles/division (throughput) and assuming turbo mode (clock speed  $\nu = 2.5$ GHz) we would expect t = 0.2s for  $n = 10^8$



## **Execution of Instructions**

| <ul> <li>programmer's view:</li> </ul>                  | <ul> <li>hardware's view:</li> </ul>                             |
|---------------------------------------------------------|------------------------------------------------------------------|
| for (int i=0; i <n; i++)<br="">A[i] = A[i] + B[i];</n;> | load r1 = A(i)<br>load r2 = B(i)<br>add r1 = r1 + r2             |
| <ul> <li>user work:</li> <li>N Flops (ADDs)</li> </ul>  | <pre>store A(i) = r1 inc i branch top if i<n< pre=""></n<></pre> |

programm performs computation, FLOP is the basic work done processor executes instructions, instructions is the basic work done

7



## **Basic Compute Resources**

- instruction execution
  - primary resource for computations, hardware is designed to increase instruction throughput as much as possible
  - difficult for general purpose computing, what is a typical workload?
- data movement
  - consequence of instruction execution
  - in the example two loads and one store (double 24 byte)

### What is the bottleneck of an application?



## Flop/s vs. Memory Bandwidth

- a floating-point operation (Flop) is the basic unit of work
  - theoretical peak performance Intel Xeon E5-2650 v4

 $P_{\text{peak}} = 422.5 \text{ GFlop/s}$ 

- equivalent to  $16 \operatorname{Flop}/(\operatorname{core} \cdot \operatorname{cy})$
- memory bandwidth
  - maximum for Intel Xeon E5-2650 v4 is 76.8 GB/s

(https://ark.intel.com/products/91767/Intel-Xeon-Processor-E5-2650-v4-30M-Cache-2\_20-GHz)

equivalent to

(more info: http://sites.utexas.edu/jdm4372/tag/memory-bandwidth/)

<sup>35</sup> Byte/cy



## **Example Bandwidth Limited Execution**

consider the vector-triad

# for (j=0; j<STREAM\_ARRAY\_SIZE; j++) a[j] = b[j]+scalar\*c[j];</pre>

- included in the STREAM benchmark (see <a href="https://www.cs.virginia.edu/stream/">https://www.cs.virginia.edu/stream/</a>)
- 2 Flop/iteration and 24 Byte/iteration
- at 16 Flop/cy on a single core 192 Byte/cy are needed

### ➔ memory bandwidth is the limiting factor here



## STREAM Benchmark

https://www.cs.virginia.edu/stream/

- simple tool to measure memory bandwidth
  - timing of bandwidth-limited vector operations
- some results
  - single core bandwidth is about 20 GB/s
  - maximum bandwidth measured is about 64 GB/s per socket and 128 GB/s per node (two sockets)
  - about half of the cores are need to get (close to) maximum bandwidth



## **Parallel Speedup**

version 1, very good scaling version 2, almost no scaling







• 3d "Stencil" update (Jacobi)

note that the order of the loops is important (and depends on the ordering of multi-dimensional arrays in memory)



## **Memory Access Patterns**

- Caches help with getting instructions and data to the CPU "fast"
- How does data travel from memory to the CPU and back?
- Remember: Caches are organized in cache lines (e.g., 64 bytes)
- Only complete cache lines are transferred between memory hierarchy levels (except registers)
- MISS: Load or store instruction does not find the data in a cache level
   CL transfer required
- Example: Array copy A(:)=C(:)





## Hardware Locality

- compute nodes are increasingly complex
  - ccNUMA architectures
- the hwloc library provides some tools to

(https://www.open-mpi.org/projects/hwloc/)

- obtain information about the node topology (lstopo)
- bind processes to specific cores/sockets/...
- binding/pinning of threads may improve performance (hwloc-bind ... <command>)
- difficult to decide, e.g. is it better to use neighboring cores or different sockets?

| Package P#0                                           |            |            |            |            |            |             |            |            |            |            |           |
|-------------------------------------------------------|------------|------------|------------|------------|------------|-------------|------------|------------|------------|------------|-----------|
| L3 (30MB)                                             |            |            |            |            |            |             |            |            |            |            |           |
| L2 (256KB)                                            | L2 (256KB) | L2 (256KB) | L2 (256KB) | L2 (256KB) | L2 (256KB) | L2 (256KB)  | L2 (256KB) | L2 (256KB) | L2 (256KB) | L2 (256KB) | L2 (256KE |
| L1d (32KB)                                            | L1d (32KB) | Lld(32KB)  | L1d (32KB) | Lld(32KB)  | Lld(32KB)  | L1d (32KB)  | Lld(32KB)  | L1d (32KB) | L1d(32KB)  | L1d (32KB) | L1d (32KE |
| L1i (32KB)                                            | L11(32KB)  | L11(32KB)  | L11 (32KB) | L11(32KB)  | L11(32KB)  | L11(32KB)   | L11(32KB)  | L11(32KB)  | L11(32KB)  | L11(32KB)  | L11(32KB  |
| Core P#0                                              | Core P#1   | Core P#2   | Core P#3   | Core P#4   | Core P#5   | Core P#S    | Core P#9   | Core P#10  | Core P#11  | Core P#12  | Core P#13 |
|                                                       |            |            |            |            |            |             |            |            |            |            |           |
| PU PW0                                                | PU PW1     | PU PW2     | PU PV3     | PU 9#4     | PU P#5     | PU 996      | PU PW7     | PU PW8     | PU P#9     | PU P#10    | PU P#11   |
| PU PW0                                                | PU Pv1     | PU PW2     | PU PV3     | PU P#4     | PU P#5     | PU P#6      | PU PW7     | PU P#8     | PU P#9     | PU P#10    | PU P#11   |
| PU P#0                                                |            | PU PW2     | PU PV3     | PU P#4     | PU PW5     | PU P#6      | PU PW7     | PU Pws     | PU P#9     | PU Pw10    | PU P#11   |
|                                                       |            | PU PW2     | PU PV3     | PU PM      | PU PW5     | PU P#6      | PU PW7     | PU PV3     | PU Pit9    | PU P#10    | PU PU11   |
| NUMANode P#                                           |            | PU PW2     | PU PV3     | PU PM4     | PU P#5     | PU PP6      | PU PW7     | PU Pvs     | PU Pii     | PU Pe10    | PU P#11   |
| NUMANode P#<br>Package P#1                            |            | PU PW2     | PU 993     | PU PH4     | PU P#5     | PU P96      | PU PW7     | PU Pv8     | PU PH9     | PU P#10    | PU P911   |
| NUMANode P#<br>Package P#1<br>L3 (30MB)               | 1 (123GB)  |            |            |            |            |             |            |            |            |            |           |
| NUMANode P#<br>Package P#1<br>L3 (30MB)<br>L2 (256KB) | 1 (123GB)  | L2 (256KB) | L2 (256KB) | L2 (256KB) | L2 (256KB) | 1.2 (256KB) | L2 (256KB) | L2 (255KB) | L2 (256KB) | L2 (256KB) | L2 (256KE |

CARL VON OSSIETZKY UNIVERSITÄT OLDENBURG

## performance of version 2 is better by factor of few

## **Parallel Performance**





## How is the Hardware optimized for performance?

- speedup memory access with cache (see before)
- pipelining of arithmethic units
- instruction pipeline
- instruction level parallelism
- simultaneous multi-threading (SMT)
- SIMD processing



## Pipelining

### Idea:

- Split complex instruction into several simple / fast steps (stages)
- Each step takes the same amount of time, e.g., a single cycle
- Execute different steps on different instructions at the same time (in parallel)

### Allows for shorter cycle times (simpler logic circuits), e.g.:

- floating point multiplication takes 5 cycles, but
- processor can work on 5 different multiplications simultaneously
- one result at each cycle after the pipeline is full

#### Drawback:

- Pipeline must be filled startup times (#Instructions >> pipeline steps)
- Efficient use of pipelines requires large number of independent instructions → instruction level parallelism
- Requires complex instruction scheduling by compiler/hardware softwarepipelining / out-of-order

### Pipelining is widely used in modern computer architectures



## Pipelinig – 5 stage Multiplication



First result is available after 5 cycles (=latency of pipeline)! Wind-up/-down phases: Empty pipeline stages

Introduction to HPC - Session 07





## simultaneous multi-threading (SMT)

SMT principle (2-way example):





## SIMD processing

- Single Instruction Multiple Data (SIMD) operations allow the concurrent execution of the same operation on "wide" registers
- x86 SIMD instruction sets:
  - SSE: register width = 128 Bit  $\rightarrow$  2 double precision floating point operands
  - AVX: register width = 256 Bit  $\rightarrow$  4 double precision floating point operands
- Adding two registers holding double precision floating point operands R0 R1 R2 R0 R1





## **Processor Peak Performance**



### But: P = 8.8 GFlop/s for serial, non-SIMD code



## **Performance Bottleneck**

- many floating point computation on little data
   → bound by the processing speed of the CPU
  - possibly increase number of cores
  - make use of SIMD processing
  - note: recent CPU may have lower clock speed for AVX
- few floating point operation per data
   → bound by memory bandwidth
  - change algorithm/parallelization to make better use of cache
  - increase compute intensity





- OMP\_Pi
  - how many CPU cycles are required for a DIV operation?
- STREAM
  - determine memory bandwidth
- Stencil
  - optimization vs. speedup
  - memory access pattern

# measuring/getting optimal performance may require process binding