CUDA Succinctly®
by Chris Rose

CHAPTER 7

Blocking with Shared Memory


In Chapter 5, we examined a simple algorithm for solving the nearest neighbor problem for many points in a list of 3-D points. We followed this in Chapter 6, where we examined shared memory in some detail. We will now apply this new knowledge to greatly increase the efficiency of the program from Chapter 5.

One of the most basic uses of shared memory is known as blocking. Blocking allows more efficient use of the global memory bus, and it also allows the bulk of the calculations to be performed on shared memory and registers instead of relying on the L1 and L2 caches of global memory for speed. It works like this: instead of repeatedly having all threads read and write to and from global memory, we copy blocks of global memory to shared memory. The threads work on the shared memory copy of the data, and when they are done they load another block from global memory. In this way the threads do almost all of their operations with shared memory, minimizing the reads and writes to global memory.
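As a rough sketch of the pattern (the kernel and variable names here are purely illustrative and are not part of the nearest neighbor code later in this chapter), a blocking kernel looks something like the following. It assumes the kernel is launched with TILE_SIZE threads per block.

// Illustrative blocking skeleton (hypothetical names, not Listing 7.1).
// Each iteration stages one tile of global memory in shared memory;
// every thread then works on that tile from shared memory and registers.
#define TILE_SIZE 128

__global__ void BlockingSkeleton(const float* globalData, float* result, int count) {
    __shared__ float tile[TILE_SIZE];
    float accumulator = 0.0f;          // Per-thread work happens in registers.

    for (int tileStart = 0; tileStart < count; tileStart += TILE_SIZE) {
        // Each thread copies one element of the current tile.
        if (tileStart + threadIdx.x < count)
            tile[threadIdx.x] = globalData[tileStart + threadIdx.x];

        __syncthreads();               // Tile fully loaded before anyone uses it.

        // Work on the tile from shared memory (here, a trivial sum).
        int validInTile = min(TILE_SIZE, count - tileStart);
        for (int i = 0; i < validInTile; i++)
            accumulator += tile[i];

        __syncthreads();               // Everyone finished before the tile is reused.
    }

    int idx = threadIdx.x + blockIdx.x * TILE_SIZE;
    if (idx < count)
        result[idx] = accumulator;     // Each thread has summed every tile it saw.
}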

Tip: Another side benefit of blocking is that it offers an opportunity for the programmer to change the format of the stored data when it is copied to shared memory. For instance, we could copy 3-D points from global memory and store them as 4-D points in shared memory. This particular operation may or may not have an impact on performance, but it certainly opens up many possibilities.
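For example, a sketch of that idea might look like the following. The padding of 3-D points out to float4 here is illustrative only; it is not used in Listing 7.1.

// Illustrative only: while staging a block, convert 3-D points from global
// memory into padded 4-D points in shared memory (w is unused padding).
__global__ void StageAsFloat4(const float3* points, int count) {
    __shared__ float4 sharedPoints[128];

    int loadIdx = threadIdx.x + blockIdx.x * 128;
    if (loadIdx < count) {
        float3 p = points[loadIdx];
        sharedPoints[threadIdx.x] = make_float4(p.x, p.y, p.z, 0.0f);
    }
    __syncthreads();

    // ... distance calculations would read sharedPoints[] here ...
}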

Shared Memory Nearest Neighbor

The following version of the nearest neighbor (Listing 7.1) is an optimized version of the one we saw previously in Chapter 5. It uses the blocking technique I just described. The threads of a thread block each copy a single point from global memory to shared memory. Each thread then checks whether any of the points in shared memory is nearer to its own point than the closest it has found so far. Once all the threads are done, they copy the next block of points from global memory. Working from shared memory in this way greatly increases the speed of the algorithm. The code in Listing 7.1 can replace the code in the FindClosestGPU.cu file.

// FindClosestGPU.cu
#include <cuda.h>
#include <cuda_runtime.h>
#include <device_launch_parameters.h>
#include "FindClosestGPU.h"           // Not required!

// Constant block size
__device__ const int blockSize = 128;

// Find nearest neighbor using shared memory blocking
__global__ void FindClosestGPU(float3* points, int* indices, int count) {
    // Shared memory for a block of points
    __shared__ float3 OtherBlocksPoints[128];

    if(count <= 1) return;

    // Calculate a unique idx
    int idx = threadIdx.x + blockIdx.x * blockSize;

    // Assume the closest is points[-1], i.e. none found yet
    int indexOfClosest = -1;

    // This thread's point
    float3 thisPoint;

    // Read in this thread's point
    if(idx < count) {
        thisPoint = points[idx];
    }

    // Assume distance to nearest is float.max
    float distanceToClosest = 3.40282e38f;

    // Read in blocks of other points
    for(int currentBlockOfPoints = 0; currentBlockOfPoints < gridDim.x; currentBlockOfPoints++) {
        // Read in a block of points to the OtherBlocksPoints array
        if(threadIdx.x + currentBlockOfPoints * blockSize < count)
            OtherBlocksPoints[threadIdx.x] =
                points[threadIdx.x + (currentBlockOfPoints * blockSize)];

        // Wait until the block is read from global into shared memory
        __syncthreads();

        if(idx < count) {
            // Use pointer for faster addressing
            float* ptr = &OtherBlocksPoints[0].x;

            // For each point in the shared memory block:
            for(int i = 0; i < blockSize; i++) {
                // Calculate squared distance
                float dist = (thisPoint.x - ptr[0]) * (thisPoint.x - ptr[0]);
                dist += (thisPoint.y - ptr[1]) * (thisPoint.y - ptr[1]);
                dist += (thisPoint.z - ptr[2]) * (thisPoint.z - ptr[2]);
                ptr += 3;

                // If this point is within the list and nearer than the
                // current closest, update nearest with this one:
                if(dist < distanceToClosest &&
                   (i + currentBlockOfPoints * blockSize) < count &&
                   (i + currentBlockOfPoints * blockSize) != idx) {
                    distanceToClosest = dist;
                    indexOfClosest = (i + currentBlockOfPoints * blockSize);
                }
            }
        }

        // Wait until all threads are finished with the shared memory block
        __syncthreads();
    }

    if(idx < count)
        indices[idx] = indexOfClosest;
}

Listing 7.1: Nearest neighbor with shared memory

The kernel in Listing 7.1 is much faster than the previous version from Chapter 5. This version requires that the grid be launched with blocks of 128 threads. At the top of the listing I have used a constant int called blockSize; this will become a literal in the compiled code. Literals are usually faster than variables, even built-in variables like blockDim.x.
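On the host side, the launch (a sketch; the device pointer names d_points and d_indices are assumptions and may differ from your Chapter 5 code) would look something like this, with the block size matching the kernel's blockSize constant:

// Host-side launch sketch: the block size must match the kernel's
// blockSize constant of 128. d_points, d_indices, and count are assumed
// to be the device pointers and point count set up as in Chapter 5.
int threadsPerBlock = 128;
int blocks = (count + threadsPerBlock - 1) / threadsPerBlock;  // Round up.
FindClosestGPU<<<blocks, threadsPerBlock>>>(d_points, d_indices, count);
cudaDeviceSynchronize();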

Each thread calculates a unique idx value and reads the corresponding point from the list of points. Each thread will find the nearest neighbor to its own point. This is identical to the operation the previous version performs in Chapter 5.

Next we have a for loop that counts up using the variable currentBlockOfPoints. On each iteration, the threads read a block of points from global memory into the shared memory belonging to their thread block. Note how I call __syncthreads() after reading the points from global memory into shared memory, ensuring that no thread begins checking distances before the entire block has been copied to shared memory.

Once the points are in shared memory, each thread iterates through them and checks whether any is nearer to its own point than the closest found so far. When the entire global memory array has passed through shared memory in these blocks, the calculation is complete. Note also the second __syncthreads() call after the inner for loop, at the end of each iteration of the outer loop. It ensures that all the threads have finished checking the points in the shared memory block before the next block is read in.

I have manually used a pointer (the variable called ptr) to address the values of the shared memory points. This is often, though not always, faster than structure member syntax. When performing complicated calculations, it can also be a good idea to break them into small steps, as with the calculation of the dist variable in Listing 7.1; a compiler can often optimize a sequence of short statements better than one long expression on a single line. As with any optimization, it is worth benchmarking to see whether it actually helps.
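For comparison, the same inner-loop distance calculation written with structure member syntax (an illustrative alternative, not the version in Listing 7.1, reusing the listing's variable names) would read as follows:

// Illustrative alternative to the pointer version in Listing 7.1:
// index the shared array directly and use structure member syntax.
float dx = thisPoint.x - OtherBlocksPoints[i].x;
float dy = thisPoint.y - OtherBlocksPoints[i].y;
float dz = thisPoint.z - OtherBlocksPoints[i].z;
float dist = dx * dx + dy * dy + dz * dz;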

This version of the algorithm (Listing 7.1) runs at three times the speed of the original CUDA version presented in Chapter 5, and at 14 times the speed of the original CPU implementation (these times will differ depending on the performance ratio between your particular GPU and CPU). It is very possible that this code could be optimized further, but for the current illustration, it is fast enough.
