Fixing the RuntimeError: CUDA error: invalid device ordinal in Python

When you use GPUs from Python through frameworks like PyTorch or TensorFlow, you may encounter the “RuntimeError: CUDA error: invalid device ordinal” error. It means your code asked for a GPU index that does not exist among the devices CUDA can see. In most cases the issue can be resolved with a few simple checks.

What Causes the RuntimeError?

The “invalid device ordinal” error occurs when you request a GPU index that CUDA cannot map to a visible device, either because the index is out of range or because the device is not set up correctly. Some common causes include:

  • Specifying a GPU index that is not available on your system. For example:
import torch
torch.cuda.set_device(1)

This fails on a machine with a single GPU, because the only valid index there is 0.

  • Setting an invalid value for the CUDA_VISIBLE_DEVICES environment variable. For example:
export CUDA_VISIBLE_DEVICES=3

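If the machine has no GPU with index 3, CUDA sees no devices at all, so every index is invalid. Note also that the devices listed in this variable are renumbered from 0 inside the process: with CUDA_VISIBLE_DEVICES=2,3 your code must use indices 0 and 1, not 2 and 3.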

  • Having outdated GPU drivers installed. The drivers may not detect all devices properly.
  • Using incompatible CUDA and PyTorch/TensorFlow versions. This can cause GPU detection issues.
  • Hardware problems with the GPU device. It may be disabled or not seated correctly.
  • Working in a remote session, virtual machine, or container that has no GPU passthrough configured. The GPU is not visible to that environment.

In short, the “invalid device ordinal” error stems from a mismatch between the GPU index you request and the devices CUDA can actually see.
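
The effect of CUDA_VISIBLE_DEVICES on valid indices can be modeled in a few lines. The sketch below is a simplified model, not CUDA's own implementation: the function name visible_devices is hypothetical, it handles integer indices only (CUDA also accepts GPU UUIDs), and it follows the documented rule that enumeration stops at the first invalid entry.

```python
from typing import List, Optional


def visible_devices(physical_count: int, env_value: Optional[str]) -> List[int]:
    """Return the physical GPU indices a process would see (simplified model)."""
    if env_value is None:
        # Variable unset: every physical GPU is visible.
        return list(range(physical_count))
    visible = []
    for token in env_value.split(","):
        try:
            index = int(token.strip())
        except ValueError:
            break  # an empty or non-integer entry ends the list
        if not 0 <= index < physical_count:
            break  # an invalid index hides itself and everything after it
        visible.append(index)
    return visible


# The position in the returned list is the ordinal your code must use.
print(visible_devices(1, "3"))    # [] : no GPU 3 on a one-GPU machine
print(visible_devices(4, "2,3"))  # [2, 3] : physical GPUs 2 and 3 become ordinals 0 and 1
```

An empty result means no ordinal is valid, which is exactly the situation that triggers the error.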

Fixing the RuntimeError

Here are some troubleshooting steps to resolve the “invalid device ordinal” error:

1. Check Your GPU Index

First, confirm you are specifying a valid index for the GPU. Print the number of available devices:

import torch
print(torch.cuda.device_count())

This prints the number of GPUs visible to PyTorch. Valid indices run from 0 up to one less than this number.

For example, if it prints 1, you have a single GPU and its index is 0. Code that requests index 1 must be changed to use 0 instead.
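
A small guard makes this check explicit before the index ever reaches CUDA. This is a minimal sketch: checked_index and select_torch_device are hypothetical helpers, not part of PyTorch, and the first one takes the device count as an argument so it can be used with any framework.

```python
def checked_index(requested: int, device_count: int) -> int:
    """Return `requested` if it is a valid ordinal, else raise a clear error."""
    if device_count == 0:
        raise ValueError("no CUDA devices are visible to this process")
    if not 0 <= requested < device_count:
        raise ValueError(
            f"device index {requested} is out of range; "
            f"valid indices are 0 to {device_count - 1}"
        )
    return requested


def select_torch_device(requested: int) -> None:
    """Validate the index against PyTorch's device count, then select it."""
    import torch  # imported here so checked_index works without PyTorch

    torch.cuda.set_device(checked_index(requested, torch.cuda.device_count()))
```

Calling select_torch_device(1) on a single-GPU machine then fails with "device index 1 is out of range; valid indices are 0 to 0" instead of the less descriptive CUDA error.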

2. Verify the CUDA_VISIBLE_DEVICES Variable

Make sure this environment variable is not set to a non-existent GPU.

Print its current value:

echo $CUDA_VISIBLE_DEVICES

If it is set incorrectly, correct it in the current shell, and also in your shell profile (for example ~/.bashrc) if that is where it is set:

export CUDA_VISIBLE_DEVICES=0 # For one GPU
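
The same check can be made from inside Python, which is useful when the variable is set by a job scheduler or launcher rather than by your shell profile. A minimal sketch; describe_cuda_visible_devices is a hypothetical helper, and it takes the environment as a mapping so it is easy to test.

```python
import os
from typing import Mapping


def describe_cuda_visible_devices(env: Mapping[str, str] = os.environ) -> str:
    """Summarize what CUDA_VISIBLE_DEVICES means for this process."""
    value = env.get("CUDA_VISIBLE_DEVICES")
    if value is None:
        return "unset: all GPUs are visible"
    if value.strip() == "":
        return "empty: no GPUs are visible"
    entries = [entry.strip() for entry in value.split(",")]
    # Entries are renumbered from 0, so the highest usable ordinal is len - 1.
    return f"restricted to {entries}; ordinals run from 0 to {len(entries) - 1} at most"


print(describe_cuda_visible_devices())
```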

3. Update GPU Drivers

Outdated drivers can cause GPU detection issues.

Update to the latest stable NVIDIA driver for your specific GPU model.

4. Install Compatible Framework Versions

Use driver, CUDA toolkit, and framework versions that are validated to work together.

For example, around the CUDA 10.x releases:

  • PyTorch 1.5 introduced builds for CUDA 10.2
  • TensorFlow 2.2 prebuilt packages targeted CUDA 10.1, not 10.2

Consult the official compatibility tables for each framework.
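
Before consulting the tables, it helps to know exactly what is installed. The sketch below collects the relevant PyTorch fields (torch.__version__, torch.version.cuda, torch.cuda.is_available() and torch.cuda.device_count()) and returns a placeholder when PyTorch is not installed. The function name installed_versions is hypothetical.

```python
def installed_versions() -> dict:
    """Collect version details relevant to CUDA compatibility."""
    try:
        import torch
    except ImportError:
        return {"torch": None}
    return {
        "torch": torch.__version__,
        # CUDA version the PyTorch build was compiled against (None for CPU-only builds)
        "cuda_build": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
        "device_count": torch.cuda.device_count(),
    }


for name, value in installed_versions().items():
    print(f"{name}: {value}")
```

A cuda_build of None means a CPU-only build is installed, in which case no GPU index is valid regardless of the driver.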

5. Verify GPU Hardware and PCIe Slot

Check that the GPU is seated properly and powered on.

Try moving it to a different PCIe slot on the motherboard if possible.

Restarting the system can help the OS redetect devices as well.

6. Enable GPU Access in Remote Sessions

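If you reach the machine remotely, the GPU must be visible inside the environment where your code actually runs.

For SSH, no extra flag is needed: CUDA programs run on the remote host and use its GPUs directly. X11 forwarding (the -X flag) only affects graphical windows and has no effect on CUDA.

VNC, like SSH, only displays a session on the remote host. If that host is a virtual machine, set up PCI passthrough or a vendor vGPU so that the guest is assigned a GPU.

For remote desktops, install the virtual GPU drivers and agents that your platform requires.

If the GPU is not exposed to that environment, the device count is 0 and every index is invalid.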
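
To confirm what a remote session can see, you can ask the driver directly. A minimal sketch, assuming the NVIDIA driver's nvidia-smi tool is installed; `nvidia-smi -L` prints one line per GPU. The helper name list_gpus is hypothetical, and it returns an empty list if the tool is missing or fails.

```python
import shutil
import subprocess
from typing import List


def list_gpus() -> List[str]:
    """Return one description line per GPU reported by `nvidia-smi -L`."""
    if shutil.which("nvidia-smi") is None:
        return []  # driver tools are not installed or not on PATH
    try:
        result = subprocess.run(
            ["nvidia-smi", "-L"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        return []
    if result.returncode != 0:
        return []  # for example, the driver is not loaded in this session
    # Keep only the top-level "GPU n: ..." lines.
    return [line for line in result.stdout.splitlines() if line.startswith("GPU ")]


gpus = list_gpus()
print(f"{len(gpus)} GPU(s) visible to the driver")
```

Run it inside the SSH session, virtual machine, or container itself; a count of 0 there points at the passthrough configuration rather than at your Python code.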

7. Create a New Python Environment

In some cases, creating a fresh Python environment can resolve CUDA mismatch issues:

conda create -n cuda_test python=3.8
conda activate cuda_test
pip install torch tensorflow

Then run your Python code inside this environment.

Example Fixes

Here are some real examples of how to fix the “invalid device ordinal” error:

Specify correct index

# Old
torch.cuda.set_device(1) 

# Fix
torch.cuda.set_device(0)
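
Hard-coding 0 fixes one machine but not necessarily the next. An alternative is to derive the device string from the device count at run time, falling back to the first GPU, or to the CPU when no GPU is visible. A sketch under that assumption; pick_device is a hypothetical helper.

```python
def pick_device(preferred: int, device_count: int) -> str:
    """Return a device string, falling back when `preferred` is not available."""
    if device_count <= 0:
        return "cpu"  # no GPU is visible at all
    if 0 <= preferred < device_count:
        return f"cuda:{preferred}"
    return "cuda:0"  # requested GPU is missing; use the first visible one


print(pick_device(1, 1))  # cuda:0
print(pick_device(1, 0))  # cpu
```

With PyTorch, the result can be passed straight to torch.device, for example torch.device(pick_device(1, torch.cuda.device_count())).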

Set CUDA_VISIBLE_DEVICES properly

# Old 
export CUDA_VISIBLE_DEVICES=2

# Fix
export CUDA_VISIBLE_DEVICES=0
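
If you prefer not to change your shell profile, the variable can also be set for a single launch. The sketch below starts a child Python process with CUDA_VISIBLE_DEVICES overridden; the child here only echoes the variable, standing in for your own script. run_with_visible_devices is a hypothetical helper.

```python
import os
import subprocess
import sys
from typing import List


def run_with_visible_devices(devices: str, argv: List[str]) -> str:
    """Run `argv` with CUDA_VISIBLE_DEVICES set to `devices`; return its stdout."""
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=devices)
    result = subprocess.run(argv, env=env, capture_output=True, text=True, check=True)
    return result.stdout.strip()


# Stand-in for a real script: print the value the child process sees.
child = [sys.executable, "-c", "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"]
print(run_with_visible_devices("0", child))  # 0
```

Setting the variable before the child starts matters because CUDA reads it when it initializes; changing os.environ after a framework has already initialized CUDA in the same process has no effect.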

Update drivers

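# Linux (Ubuntu; the driver version is an example, use the one recommended for your GPU)
sudo apt update
sudo apt install nvidia-driver-510

# Windows
nvidia-smi # check the installed driver version
# Then download and run the current driver installer for your GPU from NVIDIA's website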

Install compatible versions

# TensorFlow GPU
pip install tensorflow-gpu==2.5.0

# PyTorch GPU (build for CUDA 11.1)
pip install torch==1.10.0+cu111 -f https://download.pytorch.org/whl/torch_stable.html

Enable remote GPU access

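# SSH: a plain session is enough; confirm the remote host sees its GPUs
ssh user@remote_host nvidia-smi

# Virtual machine or VNC desktop: run this inside the guest session
# If no GPU is listed, configure passthrough on the host first
nvidia-smi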

Summary

The “invalid device ordinal” RuntimeError occurs when the GPU index you request does not match a device CUDA can see. Check, in order, the index value, the CUDA_VISIBLE_DEVICES variable, the driver, the framework and CUDA versions, the hardware connections, and the remote access configuration. Once the index in your Python code matches an available device, the error goes away and your code can run on the GPU.
