wiki/docs/Miscellaneous/470-datacenter-drivers.md
2023-10-20 11:27:00 -05:00

Nvidia 470 datacenter drivers

Note: Desktop drivers and datacenter drivers are different.

Debian

This hasn't yet been tested. If you have tested it, please open a PR to update this section.

Add apt repos

You can skip this if you already have the repositories enabled. To add the non-free and contrib repos, edit /etc/apt/sources.list and add non-free contrib to the end of each line, like this:

deb http://deb.debian.org/debian/ bullseye main non-free contrib
deb-src http://deb.debian.org/debian/ bullseye main non-free contrib

Then, run apt update
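If you would rather not edit the file by hand, the change can be sketched as a small shell function. This is only an illustration: `add_components` is a hypothetical helper name, not part of apt, and it only prints the rewritten line rather than touching any file.

```shell
#!/bin/sh
# Append any missing components (e.g. non-free, contrib) to one sources.list line.
# Lines that do not start with "deb " or "deb-src " are printed unchanged.
add_components() {
    line=$1; shift
    case $line in
        deb\ *|deb-src\ *) ;;
        *) printf '%s\n' "$line"; return ;;
    esac
    for comp in "$@"; do
        case " $line " in
            *" $comp "*) ;;               # component already present, nothing to do
            *) line="$line $comp" ;;      # otherwise append it
        esac
    done
    printf '%s\n' "$line"
}

# Example on a single line (no file is modified):
add_components "deb http://deb.debian.org/debian/ bullseye main" non-free contrib
```

To use it on the real file, you could feed each line of /etc/apt/sources.list through the function, write the result to a temporary file, and review it before replacing the original.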

Installation

To install the driver:

apt install nvidia-tesla-470-driver

And to install CUDA:

apt install nvidia-cuda-dev nvidia-cuda-toolkit

Fedora

This guide uses the RPM Fusion repositories, and if you install CUDA, it uses Nvidia's repositories as well. Note that this guide is only compatible with Fedora 35+; I'm not sure about RHEL versions.

Add RPM Fusion repository

You can skip this if you already have the repository installed.

To add the RPM Fusion repository:

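# Add gpg keys (the Nvidia driver packages are in the nonfree repo, which depends on the free repo)
sudo dnf install distribution-gpg-keys
sudo rpmkeys --import /usr/share/distribution-gpg-keys/rpmfusion/RPM-GPG-KEY-rpmfusion-free-fedora-$(rpm -E %fedora)
sudo rpmkeys --import /usr/share/distribution-gpg-keys/rpmfusion/RPM-GPG-KEY-rpmfusion-nonfree-fedora-$(rpm -E %fedora)
# Add repos with gpg check
sudo dnf --setopt=localpkg_gpgcheck=1 install https://mirrors.rpmfusion.org/free/fedora/rpmfusion-free-release-$(rpm -E %fedora).noarch.rpm https://mirrors.rpmfusion.org/nonfree/fedora/rpmfusion-nonfree-release-$(rpm -E %fedora).noarch.rpm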

Install Driver

First, update everything, and reboot if you're not on the latest kernel.

dnf update -y

Then, install the driver:

dnf install akmod-nvidia-470xx

Do not reboot yet.

Before rebooting, use top or ps to make sure the kernel module has finished building: no akmods, cc*, or gcc* process should be running (* is either nothing or a number), and neither kthreadd nor anything else should be using far more CPU than you expect.
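Instead of watching top by hand, the check can be sketched as a polling loop. This is a rough sketch, not a tested procedure: `any_running` and `wait_until_idle` are hypothetical helper names, and the process names passed to them are assumptions based on the list above (cc1 is the name gcc's compiler stage usually runs under).

```shell
#!/bin/sh
# Return 0 if a process whose name exactly matches one of the given names is running.
any_running() {
    for name in "$@"; do
        if pgrep -x "$name" >/dev/null 2>&1; then
            return 0
        fi
    done
    return 1
}

# Poll every $1 seconds until none of the named processes is left.
wait_until_idle() {
    interval=$1; shift
    while any_running "$@"; do
        sleep "$interval"
    done
}

# Usage (assumed process names):
#   wait_until_idle 5 akmods gcc cc cc1
#   modinfo -F version nvidia   # should print the driver version once the module is built
```

If modinfo cannot find the nvidia module after the loop returns, the build probably has not finished or has failed, so rebooting at that point is likely premature.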

Note: nvidia-smi and other tools are not included with the driver. For that, you need to install CUDA.

Install CUDA

Install packages needed for CUDA with:

export FEDORA_VERSION=$(rpm -E %fedora) # Nvidia's repo doesn't support Fedora 38 yet, so change this to 37 if you're on Fedora 38
dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/fedora${FEDORA_VERSION}/x86_64/cuda-fedora${FEDORA_VERSION}.repo
dnf clean all
dnf module disable nvidia-driver
dnf -y install cuda

Note: Don't re-enable the nvidia-driver module afterwards.
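The "use 37 if you're on 38" rule can be written down as a small function, so the version does not have to be edited by hand. This is a sketch under two assumptions: that the directory in Nvidia's URL follows the same fedoraNN pattern as the .repo file name, and that 37 is the newest release Nvidia publishes a repo for (taken from the comment above, so it will go stale). `cuda_repo_url` is a hypothetical helper name.

```shell
#!/bin/sh
# Print the CUDA repo URL for a Fedora release, capped at the newest release
# that Nvidia is assumed to publish a repo for.
cuda_repo_url() {
    version=$1
    newest_supported=${2:-37}   # assumption: raise this as Nvidia adds releases
    if [ "$version" -gt "$newest_supported" ]; then
        version=$newest_supported
    fi
    printf 'https://developer.download.nvidia.com/compute/cuda/repos/fedora%s/x86_64/cuda-fedora%s.repo\n' "$version" "$version"
}

# Example: Fedora 38 falls back to the Fedora 37 repo.
cuda_repo_url 38
```

On a real system the result could be passed straight to dnf, for example `dnf config-manager --add-repo "$(cuda_repo_url "$(rpm -E %fedora)")"`.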

Problems

Suspend Issues

I had issues with my K80 not working after the machine was suspended. For example, torch.cuda.is_available() would give an error and return False instead of True. To fix this, install xorg-x11-drv-nvidia-470xx-power:

dnf install xorg-x11-drv-nvidia-470xx-power

CUDA is a higher version than the driver

Sometimes the driver in the CUDA repo, and therefore the driver version that CUDA depends on, is newer than the installed 470 driver. To fix this, run:

dnf module enable nvidia-driver -y && dnf download cuda-drivers && dnf module disable nvidia-driver -y
rpm -Uvh cuda-drivers*.rpm --nodeps
dnf update
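To tell whether you are in this situation before forcing anything with --nodeps, you can compare the two version strings. The sketch below assumes GNU sort (for its -V version ordering); `version_lt` is a hypothetical helper name and the version numbers in the example are only illustrative.

```shell
#!/bin/sh
# Return 0 if version $1 is strictly older than version $2 (relies on GNU sort -V).
version_lt() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Example with illustrative version strings:
if version_lt 470.182.03 530.30.02; then
    echo "installed driver is older than the one CUDA wants"
fi
```

The installed version can be read with `modinfo -F version nvidia`; the other side of the comparison would be whatever version dnf reports for cuda-drivers.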

More stuff

Why not install xorg-x11-drv-nvidia-470xx?

  • That's the display driver, not the data center driver. It has the same version number, but it is not the same driver.