Recently we noticed that our AWS G2 GPU instances were no longer working correctly after a reboot. We were greeted with this joyful message:
[ 6710.061115] NVRM: The NVIDIA GRID K520 GPU installed in this system is
NVRM: supported through the NVIDIA 367.xx Legacy drivers. Please
NVRM: visit http://www.nvidia.com/object/unix.html for more
NVRM: information. The 375.39 NVIDIA driver will ignore
NVRM: this GPU. Continuing probe…
It was evident that at some point in the past, the NVIDIA driver had been upgraded to a version that no longer supports the GRID K520 GPU card in the machine. Of course, the first thought is to blame whoever had root access on the system. Let’s have a look at /var/log/apt/history.log then…
Start-Date: 2017-03-21 08:51:00
Commandline: /usr/bin/unattended-upgrade
Install: libcuda1-375:amd64 (375.39-0ubuntu0.16.04.1, automatic), nvidia-opencl-icd-375:amd64 (375.39-0ubuntu0.16.04.1, automatic), nvidia-375:amd64 (375.39-0ubuntu0.16.04.1, automatic)
Upgrade: libc6-dev:amd64 (2.23-0ubuntu5, 2.23-0ubuntu6), libcuda1-367:amd64 (367.57-0ubuntu0.16.04.1, 375.39-0ubuntu0.16.04.1), libc6:amd64 (2.23-0ubuntu5, 2.23-0ubuntu6), locales:amd64 (2.23-0ubuntu5, 2.23-0ubuntu6), libc-bin:amd64 (2.23-0ubuntu5, 2.23-0ubuntu6), libc6-i386:amd64 (2.23-0ubuntu5, 2.23-0ubuntu6), libc-dev-bin:amd64 (2.23-0ubuntu5, 2.23-0ubuntu6), multiarch-support:amd64 (2.23-0ubuntu5, 2.23-0ubuntu6), libfreetype6:amd64 (2.6.1-0.1ubuntu2, 2.6.1-0.1ubuntu2.1), nvidia-opencl-icd-367:amd64 (367.57-0ubuntu0.16.04.1, 375.39-0ubuntu0.16.04.1), nvidia-367:amd64 (367.57-0ubuntu0.16.04.1, 375.39-0ubuntu0.16.04.1)
End-Date: 2017-03-21 08:52:28
There we go: it was Ubuntu’s unattended-upgrade feature that upgraded the NVIDIA driver to a version that does not support the GPU in AWS G2 machines.
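If you want to find the culprit without scrolling through the whole log, something along these lines can help. It is only a sketch: the function name nvidia_apt_events and its optional path argument are made up for illustration, and it assumes the usual layout of the apt history log, where entries are separated by blank lines.

```shell
# Print the Start-Date and Commandline of every apt history entry that
# mentions an nvidia or libcuda package.
# Assumption: entries in the log are separated by blank lines.
# nvidia_apt_events is a hypothetical helper name, not part of apt.
nvidia_apt_events() {
    log="${1:-/var/log/apt/history.log}"
    # RS='' puts awk in paragraph mode, so each log entry is one record.
    awk -v RS='' '/nvidia-|libcuda/ {
        n = split($0, lines, "\n")
        for (i = 1; i <= n; i++)
            if (lines[i] ~ /^(Start-Date|Commandline):/) print lines[i]
        print ""
    }' "$log"
}
```

Calling nvidia_apt_events with no argument reads /var/log/apt/history.log. Older entries may have been rotated into compressed files next to it, which this sketch does not look at.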
To fix this, since version 367 of the NVIDIA driver is no longer available in the Ubuntu archives, it has to be obtained as a build artifact. It’s not the cleanest way, but it would seem that the quickest way to resolve this is to apt-get remove nvidia-375 and any dependencies, and then install the build artifacts from https://launchpad.net/~ubuntu-security/+archive/ubuntu/ppa/+build/11078476
Namely,
apt-get remove libcuda1-375 nvidia-opencl-icd-375 nvidia-375 nvidia-cuda-toolkit
wget https://launchpad.net/~ubuntu-security/+archive/ubuntu/ppa/+build/11078476/+files/nvidia-367_367.57-0ubuntu0.16.04.1_amd64.deb
wget https://launchpad.net/~ubuntu-security/+archive/ubuntu/ppa/+build/11078476/+files/nvidia-opencl-icd-367_367.57-0ubuntu0.16.04.1_amd64.deb
wget https://launchpad.net/~ubuntu-security/+archive/ubuntu/ppa/+build/11078476/+files/libcuda1-367_367.57-0ubuntu0.16.04.1_amd64.deb
dpkg -i --auto-deconfigure libcuda1-367_367.57-0ubuntu0.16.04.1_amd64.deb nvidia-opencl-icd-367_367.57-0ubuntu0.16.04.1_amd64.deb nvidia-367_367.57-0ubuntu0.16.04.1_amd64.deb
If any installed package still depends on nvidia-375, uninstall it, rinse and repeat the steps above, and then re-install it. It most likely does not depend on -375 explicitly, but on a metapackage provided by -375, which we’re now providing from -367 instead.
Once the core -367 packages are installed and happy, check dmesg to make sure the GPU has been discovered, and then reinstall nvidia-cuda-toolkit and any other packages. Assuming all goes well, you can now test your software against the installed package suite.
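For the dmesg check, a small filter like the one below can save some squinting. It is a rough sketch that only looks for the "will ignore" line from the message quoted at the top of this post; the function name check_gpu_log and its two output strings are invented for illustration. Note that the absence of that line does not prove the GPU was found, only that the driver did not reject it.

```shell
# Read kernel log text on stdin and report whether the NVIDIA driver said
# it will ignore a GPU (the message quoted at the top of this post).
# check_gpu_log and its output strings are hypothetical, for illustration.
check_gpu_log() {
    if grep -q 'NVIDIA driver will ignore'; then
        echo "rejected"
        return 1
    fi
    # Absence of the message is a good sign, not a guarantee.
    echo "no-rejection-found"
}
```

A typical call would be dmesg | check_gpu_log, which exits non-zero if the driver is still refusing the card.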
If things are working as expected, simply mark your now critical packages as held in order to prevent them from being upgraded again:
apt-mark hold libcuda1-367 nvidia-367 nvidia-opencl-icd-367
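It is worth double-checking that the holds actually took. apt-mark showhold lists the held packages, one per line; the helper below (missing_holds is a made-up name for this sketch) reads that list on stdin and prints any package you expected to be held that is not.

```shell
# Print every package named on the command line that is NOT in the hold
# list supplied on stdin (one package name per line, as printed by
# `apt-mark showhold`). missing_holds is a hypothetical helper name.
missing_holds() {
    held=$(cat)
    for pkg in "$@"; do
        # -x: match the whole line, -F: treat the name as a fixed string
        printf '%s\n' "$held" | grep -qxF -- "$pkg" || echo "$pkg"
    done
}
```

For the packages in this post, that would be apt-mark showhold | missing_holds libcuda1-367 nvidia-367 nvidia-opencl-icd-367, and empty output means all three are held.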
This has been reported to Canonical at https://bugs.launchpad.net/ubuntu/+source/nvidia-graphics-drivers-375/+bug/1674666