## Torch CUDA Extension Tricks

### Some tricks I found useful for writing CUDA extensions for PyTorch.

*May 06, 2020*

This is a tracking document for some things I’ve found useful when writing CUDA extensions for PyTorch.

## Python

I found it useful to put these at the top of my Python file: `manual_seed` is for reproducibility, and `set_printoptions` makes it easier to quickly tell whether or not two numbers match up.

```
import torch

seed = 0  # any fixed value works
torch.manual_seed(seed)
torch.set_printoptions(precision=6, sci_mode=False)
```

## CUDA Debugging

A commonly suggested first step for debugging CUDA code is to enable CUDA launch blocking, which makes kernel launches synchronous so that errors surface at the offending call, by setting this at the top of the Python file:

```
import os

os.environ['CUDA_LAUNCH_BLOCKING'] = '1'
```

However, this didn’t work for a weird memory access issue I was having. The illegal-memory-access guide listed under Additional Resources was more helpful; I made a minor improvement to its test function:

```
#include <cassert>
#include <cstdlib>
#include <iostream>
#include <vector>

#include <cuda_runtime.h>

// Checks the result of a CUDA runtime call and aborts on failure.
#define CUDA_CHECK(X)                                                         \
  do {                                                                        \
    cudaError_t err = X;                                                      \
    if (err != cudaSuccess) {                                                 \
      std::cerr << "CUDA error in " << __FILE__ << "(" << __LINE__            \
                << "): " << cudaGetErrorString(err) << std::endl;             \
      exit(EXIT_FAILURE);                                                     \
    }                                                                         \
  } while (0)

// Round-trips a small buffer through the GPU. If an earlier kernel corrupted
// memory or the CUDA context, one of these calls fails and flags it.
#define cudaMemoryTestREMOVE_WHEN_DONE()                                      \
  do {                                                                        \
    const int N = 1337, bytes = N * sizeof(float);                            \
    std::vector<float> cpuvec(N);                                             \
    for (int i = 0; i < N; i++)                                               \
      cpuvec[i] = (float)i;                                                   \
    float *gpuvec = NULL;                                                     \
    CUDA_CHECK(cudaMalloc(&gpuvec, bytes));                                   \
    assert(gpuvec != NULL);                                                   \
    CUDA_CHECK(                                                               \
        cudaMemcpy(gpuvec, cpuvec.data(), bytes, cudaMemcpyHostToDevice));    \
    CUDA_CHECK(                                                               \
        cudaMemcpy(cpuvec.data(), gpuvec, bytes, cudaMemcpyDeviceToHost));    \
    CUDA_CHECK(cudaFree(gpuvec));                                             \
  } while (0)
```
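
The way I used this, sketched below with made-up kernel names: drop the macro between suspect launches and move it around until the failing call is bracketed. Since each call round-trips a buffer through the GPU, the first one after the corrupting launch is the one that dies.

```
// Hypothetical bisection session; my_kernel_{a,b} are placeholders.
my_kernel_a<<<blocks, threads>>>(out, in, n);
cudaMemoryTestREMOVE_WHEN_DONE(); // passes: kernel A looks fine
my_kernel_b<<<blocks, threads>>>(out, in, n);
cudaMemoryTestREMOVE_WHEN_DONE(); // dies here: kernel B is the culprit
```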

## Math Tricks

I found it very important to know how to divide *rounding up*; normal integer division *rounds down*. For example, the code below prints `0, 0, 0, 0, 1, 1, 1, 1, 1, 2`.

```
int denominator = 5;
for (int numerator = 1; numerator <= 10; numerator++)
  printf("%d\n", numerator / denominator);
```

To *round up*, you use the identity `(numerator + denominator - 1) / denominator`. The code below prints `1, 1, 1, 1, 1, 2, 2, 2, 2, 2`:

```
int denominator = 5;
for (int numerator = 1; numerator <= 10; numerator++)
  printf("%d\n", (numerator + denominator - 1) / denominator);
```
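
This identity comes up in almost every kernel launch: the grid needs enough blocks to cover every element, so the block count must round up. A minimal sketch (the kernel name and sizes are made-up placeholders):

```
// Hypothetical launch covering n elements with 256-thread blocks.
// Rounding down would leave the last n % 256 elements unprocessed.
const int n = 1000, threads = 256;
const int blocks = (n + threads - 1) / threads; // 4 blocks, not 3
my_kernel<<<blocks, threads>>>(out, in, n);
```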

## C++ Definitions

These are some definitions which I found useful.

```
#include <torch/extension.h>

// Short-hand for getting a packed accessor of a particular type.
#define ACCESSOR(x, n, type)                                                  \
  x.packed_accessor32<type, n, torch::RestrictPtrTraits>()
// Short-hand for the CUDA thread index along a dimension (x, y or z).
#define CUDA_IDX(d) (blockIdx.d * blockDim.d + threadIdx.d)
// Checks that a number is between two other numbers.
#define CHECK_BETWEEN(v, a, b, n)                                             \
  TORCH_CHECK(v >= a && v < b, n, " should be between ", a, " and ", b,       \
              ", got ", v)
// Checks that a tensor has the right size in a given dimension.
#define CHECK_DIM(x, v, n, d)                                                 \
  TORCH_CHECK(x.size(d) == v, n, " should have size ", v, " in dimension ",   \
              d, ", got ", x.size(d))
// Checks that a tensor has the right number of dimensions.
#define CHECK_DIMS(x, d, n)                                                   \
  TORCH_CHECK(x.ndimension() == d, n, " should have ", d,                     \
              " dimensions, got ", x.ndimension())
// Checks that the tensor is a CUDA tensor.
#define CHECK_CUDA(x) TORCH_CHECK(x.is_cuda(), #x " must be a CUDA tensor")
// Checks that the tensor is contiguous.
#define CHECK_CONTIGUOUS(x)                                                   \
  TORCH_CHECK(x.is_contiguous(), #x " must be contiguous")
// Combines the CUDA tensor and contiguousness checks.
#define CHECK_INPUT(x)                                                        \
  CHECK_CUDA(x);                                                              \
  CHECK_CONTIGUOUS(x)
```
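
To show how these fit together, here’s a sketch of a toy extension entry point. The kernel and function names are made up, but the dispatch and accessor calls are real PyTorch APIs:

```
// Hypothetical element-wise square, using the macros above.
template <typename scalar_t>
__global__ void square_kernel(
    torch::PackedTensorAccessor32<scalar_t, 1, torch::RestrictPtrTraits> out,
    torch::PackedTensorAccessor32<scalar_t, 1, torch::RestrictPtrTraits> in) {
  const int i = CUDA_IDX(x);
  if (i < in.size(0))
    out[i] = in[i] * in[i];
}

torch::Tensor square(torch::Tensor in) {
  CHECK_INPUT(in);         // must be a contiguous CUDA tensor
  CHECK_DIMS(in, 1, "in"); // expect a 1-D tensor
  auto out = torch::empty_like(in);
  const int n = in.size(0), threads = 256;
  const int blocks = (n + threads - 1) / threads; // round up
  AT_DISPATCH_FLOATING_TYPES(in.scalar_type(), "square", [&] {
    square_kernel<scalar_t><<<blocks, threads>>>(ACCESSOR(out, 1, scalar_t),
                                                 ACCESSOR(in, 1, scalar_t));
  });
  return out;
}
```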

Additionally, here’s a useful struct to handle CUDA streams.

```
template <int N> struct streams_t {
  cudaStream_t streams[N];
  streams_t() {
    for (int i = 0; i < N; i++)
      cudaStreamCreate(&streams[i]);
  }
  ~streams_t() {
    sync();
    destroy();
  }
  void destroy() {
    for (int i = 0; i < N; i++)
      cudaStreamDestroy(streams[i]);
  }
  void sync() {
    for (int i = 0; i < N; i++)
      cudaStreamSynchronize(streams[i]);
  }
  void sync(int i) {
    assert(i < N);
    cudaStreamSynchronize(streams[i]);
  }
  cudaStream_t get(int i) const {
    assert(i < N);
    return streams[i];
  }
};
```
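
As a sketch of how this gets used (the kernels here are placeholders): launch independent work on separate streams so it can overlap, then synchronize before reading the results. The destructor also syncs and destroys the streams on scope exit.

```
{
  streams_t<2> streams;
  // Hypothetical independent kernels that can overlap on the GPU.
  kernel_a<<<blocks, threads, 0, streams.get(0)>>>(a, n);
  kernel_b<<<blocks, threads, 0, streams.get(1)>>>(b, n);
  streams.sync(); // wait for both before using a and b
} // destructor syncs again and destroys the streams
```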

Here’s a worthwhile addition for printing `dim3` objects.

```
#include <iostream>

std::ostream &operator<<(std::ostream &os, const dim3 &d) {
  os << "(" << d.x << ", " << d.y << ", " << d.z << ")";
  return os;
}
```
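
This makes it easy to sanity-check a launch configuration before a kernel call. Note that `dim3` fills unspecified dimensions with 1:

```
dim3 blocks(4, 2), threads(256);
std::cout << "blocks=" << blocks << ", threads=" << threads << std::endl;
// prints: blocks=(4, 2, 1), threads=(256, 1, 1)
```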

## Additional Resources

Below are some of the resources that I found useful.

- Illegal Memory Access: “An empirical method of debugging ‘illegal memory access’ bugs in CUDA programming”, a useful guide for debugging memory issues.
- CUDA Extension Write-up: introduces how to get started writing a CUDA extension for PyTorch and walks through a complete code example.
- Optimizing Parallel Reduction in CUDA: slides describing how to speed up reductions (useful for any operation over a ring).