Previously in CME213

- Warp = 32 threads
- Block: 1D, 2D, 3D group of threads; max = 1,024 threads
- Grid: 1D, 2D, 3D group of blocks; required for performance and large problems
- Memory access:
  - Analysis based on warp: memory requests made by all threads in warp
  - Map requests from all threads to 128-byte memory segments
  - Peak performance: full segment/cache line used by warp
  - Loss in performance reflected by fraction of cache line not used by warp.
How to calculate the warp ID

• How do you know if 2 threads are in the same warp?
• Answer: look at the thread ID. Divide by 32: you get the warp ID.
• Thread ID is computed from the 1D, 2d or 3D thread index using:

\[ x + yD_x + zD_xD_y \]

```c
int tID = threadIdx.x
           + threadIdx.y * blockDim.x
           + threadIdx.z * blockDim.x * blockDim.y;
int warpID = tID/32;
```
Matrix transpose

• Let’s put all these concepts into play through a specific example: a matrix transpose.
• It’s all about bandwidth!
• Even for such a simple calculation, there are many optimizations.
```cpp
#define simpleTranspose

template<typename T>
__global__
void simpleTranspose(T* array_in, T* array_out, int n_rows, int n_cols) {
    const int tid = threadIdx.x + blockDim.x * blockIdx.x;

    int col = tid % n_cols;
    int row = tid / n_cols;

    if(col < n_cols && row < n_rows) {
        array_out[col * n_rows + row] = array_in[row * n_cols + col];
    }
}
```
Memory access pattern

Coalesced reads

Strided writes
2D kernel

```cpp
template<typename T>
__global__
void simpleTranspose2D(T* array_in, T* array_out, int n_rows, int n_cols) {
    const int col = threadIdx.x + blockDim.x * blockIdx.x;
    const int row = threadIdx.y + blockDim.y * blockIdx.y;

    if(col < n_cols && row < n_rows) {
        array_out[col * n_rows + row] = array_in[row * n_cols + col];
    }
}
```
Access pattern

```c
dim3 block_dim(8, 32);
dim3 grid_dim(n / 8, n / 32);
```

• For a given warp:
  • col: 0 to 7
  • row: 0 to 3

• Reads are partially coalesced

• Writes are partially coalesced
Performance

Bandwidth bench
GPU took 0.305329 ms
Effective bandwidth is 109.896 GB/s

simpleTranspose
GPU took 22.8106 ms
Effective bandwidth is 16.181 GB/s

simpleTranspose2D
GPU took 6.98102 ms
Effective bandwidth is 52.8717 GB/s

<table>
<thead>
<tr>
<th>Avg</th>
<th>Min</th>
<th>Max</th>
<th>Name</th>
</tr>
</thead>
<tbody>
<tr>
<td>299.22us</td>
<td>298.72us</td>
<td>300.54us</td>
<td>[CUDA memcpy DtoD]</td>
</tr>
<tr>
<td>2.0720ms</td>
<td>2.0589ms</td>
<td>2.0858ms</td>
<td>simpleTranspose(int)</td>
</tr>
<tr>
<td>629.13us</td>
<td>627.36us</td>
<td>631.22us</td>
<td>simpleTranspose2D(:)</td>
</tr>
</tbody>
</table>
Increase in performance is due to

more threads are being used (2D vs 1D block)

2D indexing maps naturally to the shape of the matrix

Memory access is partially coalesced

Start the presentation to activate live content

If you see this message in presentation mode, install the add-in or get help at PollEv.com/app
How can you reconcile the reads and writes?
Shared memory

• You need to read data in a coalesced way.
• Transpose locally using fast memory.
• Write data in a coalesced way.
• For this, we need to use shared memory!
It’s here!
Shared memory

• On-chip: high bandwidth, low latency
• Data in shared memory is only accessible by threads in the same thread block!
Imagine a highway with 32 lanes.
Load data from global memory

```c
const int bc = blockIdx.x;
const int br = blockIdx.y;

// Load 32x32 block into shared memory
int gc = bc * warp_size + lane; // Global column index

for(int i = 0; i < warp_size / num_warps; ++i) {
    int gr = br * warp_size + i * num_warps + warp_id; // Global row index
    block[i * num_warps + warp_id][lane] = array_in[gr * n_cols + gc];
}
__syncthreads();
```

All threads in block participate in this; so we need to wait that all threads are done before continuing.
Write is the same but transpose

```c
int gr = br * warp_size + lane;

for(int i = 0; i < warp_size / num_warps; ++i) {
    int gc = bc * warp_size + i * num_warps + warp_id;
    array_out[gc * n_rows + gr] = block[lane][i * num_warps + warp_id];
}
```

You know this works now!

```c
lane = threadIdx.x
gr = ... + lane
array_out[gc * n_rows + gr]
```

Stride is 1 with respect to threadIdx.x
chronize blocks in start the presentation to activate live content

synchronize warps inside a block

synchronize threads inside a block

avoid a read/write race condition

race condition

void a write/read syncthreads is required to
Performance

Bandwidth bench
GPU took 0.306039 ms
Effective bandwidth is 109.641 GB/s

simpleTranspose
GPU took 22.7943 ms
Effective bandwidth is 16.1926 GB/s

simpleTranspose2D
GPU took 6.99616 ms
Effective bandwidth is 52.7573 GB/s

fastTranspose
GPU took 7.3289 ms
Effective bandwidth is 50.3621 GB/s

Yeah!
Wait...

• We only went from 53 GB/s down to 50 GB/s
• We are still far from the peak of 110 GB/s
• What happened?
Banks

• Shared memory suffers from bank conflicts.
• The shared memory is divided into equally-sized memory modules, called banks, which can be accessed simultaneously.
• Any memory read or write request made of n addresses that fall in n distinct memory banks can be serviced simultaneously, yielding an overall bandwidth that is n times as high as the bandwidth of a single module.
• However, if two addresses of a memory request fall in the same memory bank, there is a bank conflict and the access has to be serialized.
• Each bank has a bandwidth of 4 bytes per two clock cycles.
Conflict free
Two way conflict
Conflict free

Stride of 3
Conflict free

General permutation
Conflict free

Broadcast capability
Accessing the same word inside the bank
Shared memory writes

$$\text{block}[i \ast \text{num\_warps} + \text{warp\_id}][\text{lane}] = \text{array\_in}[gr \ast \text{n\_cols} + gc];$$

Stride of 1
Shared memory reads

```c
__shared__ int block[warp_size][warp_size];

array_out[gc * n_rows + gr] = block[lane][i * num_warps + warp_id];
```

Stride of 32!
```
__shared__ int block[warp_size][warp_size+1];
```

<p>| | | | | | | | | | | |</p>
<table>
<thead>
<tr>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>32</td>
<td>64</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>x</td>
<td>1</td>
<td>33</td>
<td>65</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td>x</td>
<td>2</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

32 banks

Contiguous memory space

```
array_out[gc * n_rows + gr] = block[lane][i * num_warps + warp_id];
```

Loop with stride num_warps
<table>
<thead>
<tr>
<th>Name</th>
<th>Avg</th>
<th>Min</th>
<th>Max</th>
<th>GPU Time</th>
<th>Effective Bandwidth</th>
</tr>
</thead>
<tbody>
<tr>
<td>Bandwidth bench</td>
<td>299.24us</td>
<td>298.68us</td>
<td>300.28us</td>
<td>0.305327 ms</td>
<td>109.897 GB/s</td>
</tr>
<tr>
<td>simpleTranspose</td>
<td>2.0783 ms</td>
<td>2.0614ms</td>
<td>2.0762ms</td>
<td>22.7924 ms</td>
<td>16.194 GB/s</td>
</tr>
<tr>
<td>simpleTranspose2D</td>
<td>629.47us</td>
<td>621.31us</td>
<td>633.11us</td>
<td>6.98426 ms</td>
<td>52.8473 GB/s</td>
</tr>
<tr>
<td>fastTranspose</td>
<td>335.04us</td>
<td>333.96us</td>
<td>336.33us</td>
<td>3.74134 ms</td>
<td>98.6541 GB/s</td>
</tr>
</tbody>
</table>

Note: The images are not relevant to the text content.
You avoid bank conflicts when

- The stride is not a multiple of 32
- The stride is an odd integer
- The stride is a prime number
- The stride is equal to 33

Threads access the same memory location in each bank.
Concurrency, latency

• Imagine you are a pencil manufacturer.
• You outsource your manufacturing plants to China but your market is in the US.
• How do you organize the logistics of the transport?
• You may have great and cheap bandwidth.
• But the latency is very long.
• This works as long as you have enough parallel work to do, i.e., lots of boxes of pencils to ship.
• Then they can be put on slow but big ships.
• Once the shipping pipeline is full, you get cheap and frequent deliveries in large quantities.
• This is the same on a GPU.
Concurrency!

- Concurrency is used to hide memory latency.
- Concurrency is also important for other pipelines in the system, for example in the floating point units.
- The system keeps switching between warps, to avoid being idle.
How is concurrency managed?

• The key is therefore to have as many threads live on an SM as possible. (That’s not 100% true all the time but that’s a good rule of thumb.)

• This is constrained by 2 requirements:
  • We can only fit a whole number of blocks on an SM.
  • There are hardware limits.

• Main ones:
  • Max dim of grid along x, y, and z: 65,535
  • Max x-y dim of block: 1,024
  • Max no. of threads per block: 1,024
  • Max blocks per SM: 8
  • Max resident warps: 48
  • Max threads per SM: 1,536
  • No. of 4-byte registers per SM: 32 K
  • Max shared mem per SM: 48 KB
How can we make sense of this?

• Use the spreadsheet: CUDA occupancy calculator.
• Use CUDA API:

  \texttt{cudaOccupancyMaxActiveBlocksPerMultiprocessor}

  Provides an occupancy prediction based on the block size and shared memory usage of a kernel

  \texttt{cudaOccupancyMaxPotentialBlockSize}

  \texttt{cudaOccupancyMaxPotentialBlockSizeVariableSMem}

  Returns grid and block size that achieves maximum potential occupancy for a device function.
CUDA Occupancy Calculator

1. Select Compute Capability (click):
   - 2.0

2. Enter your resource usage:
   - Threads Per Block: 256
   - Registers Per Thread: 14
   - Shared Memory Per Block (bytes): 4224

3. GPU Occupancy Data is displayed here and in the graphs:
   - Active Threads per MultiProcessor: 1536
   - Active Warps per MultiProcessor: 48
   - Active Thread Blocks per MultiProcessor: 6
   - Occupancy of each MultiProcessor: 1000%

Physical Limits for GPU Compute Capability:

- Threads per Warp: 32
- Max Warps per MultiProcessor: 48
- Max Thread Blocks per MultiProcessor: 8
- Max Threads per MultiProcessor: 1536
- Maximum Thread Block Size: 1024
- Registers per MultiProcessor: 32768
- Max Registers per Thread Block: 32768
- Max Registers per Thread: 63
- Shared Memory per MultiProcessor (bytes): 49152
- Max Shared Memory per Block: 49152
- Register allocation unit size: 64
- Register allocation granularity: warp
- Shared Memory allocation unit size: 128
- Warp allocation granularity: 2

Allocated Resources

<table>
<thead>
<tr>
<th>Resource</th>
<th>Per Block</th>
<th>Limit Per SM</th>
<th>= Allocatable Blocks Per SM</th>
</tr>
</thead>
<tbody>
<tr>
<td>Warps</td>
<td>8</td>
<td>48</td>
<td>6</td>
</tr>
<tr>
<td>Registers</td>
<td>8</td>
<td>72</td>
<td>9</td>
</tr>
<tr>
<td>Shared Memory (Bytes)</td>
<td>4224</td>
<td>49152</td>
<td>11</td>
</tr>
</tbody>
</table>

Maximum Thread Blocks Per MultiProcessor

<table>
<thead>
<tr>
<th>Limited by Max Warps or Max Blocks per MultiProcessor</th>
<th>Blocks/SM</th>
</tr>
</thead>
<tbody>
<tr>
<td>6 * Warps/Block = Warps/SM</td>
<td>48</td>
</tr>
</tbody>
</table>

Limited by Registers per MultiProcessor

Limited by Shared Memory per MultiProcessor

Note: Occupancy limit is shown in orange

Your chosen resource usage is indicated by the red triangle on the graphs. The data points represent the range of possible block sizes, register counts, and shared memory allocation.
How to get kernel information

$ nvcc --ptxas-options=-v -03 -arch=sm_20 transpose.cu
ptxas info  : 0 bytes gmem
ptxas info  : Compiling entry function '_Z17simpleTranspose2DPiS_ii' for 'sm_20'
ptxas info  : Function properties for _Z17simpleTranspose2DPiS_ii
              0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info  : Used 8 registers, 56 bytes cmem[0]
ptxas info  : Compiling entry function '_Z15simpleTransposePiS_ii' for 'sm_20'
ptxas info  : Function properties for _Z15simpleTransposePiS_ii
              0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info  : Used 8 registers, 56 bytes cmem[0]
ptxas info  : Compiling entry function '_Z13fastTranspose1Li8EEvPiS0_ii' for 'sm_20'
ptxas info  : Function properties for _Z13fastTranspose1Li8EEvPiS0_ii
              0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info  : Used 14 registers, 4224 bytes smem, 56 bytes cmem[0]

$ nvcc --ptxas-options=-v
How to choose the number of warps per block?

• With our implementation, the number of warps must divide the 
  L1 cache line size of 32.
• Generally, you pick the smallest number of warps that gives you 
  the highest occupancy.
• Here that number is 8 warps = 256 threads / block.

```c
for(int i = 0; i < warp_size / num_warps; ++i) {
    int gc = bc * warp_size + i * num_warps + warp_id;
    array_out[gc * n_rows + gr] = block[lane][i * num_warps + warp_id];
}
```

4 iterations required to complete read
It's best to maximize occupancy because you generate more parallel memory requests and... you cannot achieve peak performance unless you reach 100% occupancy. It makes it easier to hide latency otherwise part of the processor becomes idle.

Start the presentation to activate live content.
Branch divergence

• Should be avoided at all cost.
• This leads to idle threads.
Due to this lock-step behavior the divergency in the left warp causes longer execution. The threads cannot advance individually.
Branch occurrences

- **If/while** statements
- **For loop** of varying lengths, e.g., processing an unstructured grid
- Branch divergence is an issue only for threads belonging to the same warp.
- If warps do different things, there is no performance impact.
__global__ void branch_thread(float* out) {
    int tid = threadIdx.x;

    if(tid%2 == 0) {
        ...
    } else {
        ...
    }
}

__global__ void branch_warp(float* out) {
    int wid = threadIdx.x/32;

    if(wid%2 == 0) {
        ...
    } else {
        ...
    }
}
<table>
<thead>
<tr>
<th>Threads that execute the same branch</th>
<th>The number of groups execute each branch</th>
</tr>
</thead>
<tbody>
<tr>
<td>start the presentation to activate live content</td>
<td>how many warps execute the same branch or not</td>
</tr>
</tbody>
</table>