Recently, I needed to use GPU instances with a custom AMI on EC2. The image did not contain the necessary NVIDIA and CUDA modules and instead shipped with nouveau. This was an unusual situation, because a reboot is normally required after installing the official module, and the software I was provisioning the instances with could not resume after a reboot. Using this short script, I was able to swap the open-source nouveau module for the NVIDIA module without rebooting the instance.
Using the script below and a kernel boot parameter, the proper modules will be installed without requiring a reboot. The approach applies to AWS GPU instance types including g2.2xlarge, g2.8xlarge, and p2.16xlarge, although the PCI device identifier used in the eject step must match the GPU model on the instance. The script can be run in the user-data portion of cloud-init during instance spin-up.
This was written and tested for CentOS 6 but can be adapted to other OSes.
# install gcc, cpp, kernel-headers
# dkms is also installed to rebuild the modules during kernel upgrades
sudo yum groupinstall -y 'Development tools'
sudo yum install -y kernel-devel-$(uname -r) dkms
# download the nvidia module and extract it
wget --quiet 'http://us.download.nvidia.com/XFree86/Linux-x86_64/367.35/NVIDIA-Linux-x86_64-367.35.run' -O 'NVIDIA-Linux-x86_64-367.35.run'
chmod +x NVIDIA-Linux-x86_64-*.run
# unbind the vtconsole from the nouveau module
sudo sh -c 'echo 0 > /sys/class/vtconsole/vtcon1/bind'
# unload nouveau and the direct rendering modules
sudo rmmod nouveau
sudo rmmod ttm
sudo rmmod drm_kms_helper
sudo rmmod drm
# install the nvidia modules silently
sudo ./NVIDIA-Linux-x86_64-*.run --accept-license \
--no-questions --ui=none --silent \
--run-nvidia-xconfig --dkms --disable-nouveau
# rebuild the initramfs so it will include the nvidia modules
sudo dracut -f -v /boot/initramfs-$(uname -r).img $(uname -r)
# load the nvidia and direct rendering modules
sudo modprobe nvidia
sudo modprobe ttm
sudo modprobe drm_kms_helper
sudo modprobe drm
# loop through all the pci devices matching the nvidia vendor (10de) and GPU device
# identifier (118a here; confirm yours with lspci -nn) and tell linux to remove them
# from the system to force them to reset.
# single quotes ensure ${DEV} is expanded by the root shell, not by the calling shell
sudo sh -c 'for DEV in $(lspci -d 10de:118a -mm | cut -d" " -f1); do
    echo 1 > "/sys/bus/pci/devices/0000:${DEV}/remove"
done'
# tell linux to rescan the pci bus. The cards will be added using the nvidia modules
sudo sh -c 'echo 1 > /sys/bus/pci/rescan'
# download and install cuda
wget --quiet 'http://developer.download.nvidia.com/compute/cuda/7.5/Prod/local_installers/cuda-repo-rhel6-7-5-local-7.5-18.x86_64.rpm' \
-O 'cuda-repo-rhel6-7-5-local-7.5-18.x86_64.rpm'
sudo rpm -i ./cuda-repo-rhel6-7-5-local-7.5-18.x86_64.rpm
sudo yum clean all
sudo yum install -y cuda
The above does the following:
- Installs items necessary to build the NVIDIA modules.
- Downloads the NVIDIA module.
- Unbinds the virtual text console from the loaded nouveau module so it can be unloaded.
- Unloads the nouveau and direct rendering manager modules.
- Builds and installs the NVIDIA module and tools.
- Rebuilds the kernel boot image with the new modules.
- Loads the nvidia module and reloads the direct rendering module. This effectively switches the kernel drm modules to use NVIDIA modules.
- Ejects all NVIDIA GPUs matching the vendor:device identifier. This is necessary to force the cards to reset, as they were initialized with nouveau.
- Instructs Linux to rescan the PCI bus to initialize the NVIDIA cards with the official module.
- Downloads and installs CUDA.
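The eject step in the script matches a single vendor:device identifier, so it only covers one GPU model. As a rough sketch of how it could be generalized, the helper below turns `lspci` output into the sysfs `remove` paths for every matching card. The function name `pci_remove_paths` is hypothetical, and it assumes input in `lspci -D -mm` format, where the first space-separated field is the full PCI address including the domain.

```shell
#!/bin/sh
# Hypothetical helper (not part of the original script).
# Reads `lspci -D -mm` style lines on stdin and prints the sysfs "remove"
# path for each device. Assumes the first space-separated field is the
# full domain:bus:device.function address, which is what -D provides.
pci_remove_paths() {
    cut -d' ' -f1 | while read -r addr; do
        if [ -n "$addr" ]; then
            printf '/sys/bus/pci/devices/%s/remove\n' "$addr"
        fi
    done
}

# Possible usage, matching any device from the NVIDIA vendor ID (10de):
#   lspci -D -mm -d 10de: | pci_remove_paths | while read -r path; do
#       sudo sh -c "echo 1 > '$path'"
#   done
#   sudo sh -c 'echo 1 > /sys/bus/pci/rescan'
```

Leaving the device part of `-d 10de:` empty matches every NVIDIA device on the bus, which avoids hard-coding a model-specific identifier. I have only tested the original single-identifier form, so treat this as a starting point.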
You will also need to edit the kernel line in /boot/grub/grub.conf and append the following:
rd.driver.blacklist=nouveau nouveau.modeset=0
This tells the kernel to blacklist the nouveau module so that it is not loaded at boot time. It also disables kernel mode setting (KMS) for the nouveau module.
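If you would rather script the grub.conf edit than make it by hand, something like the following could work. It is a minimal sketch: the function name `disable_nouveau_at_boot` is hypothetical, and it assumes a legacy GRUB config in which each boot entry has a line beginning with `kernel`. It keeps a `.bak` copy of the file and skips lines that already carry the parameter, so running it twice is harmless.

```shell
#!/bin/sh
# Hypothetical helper (not part of the original script).
# Appends the nouveau blacklist parameters to every "kernel" line of a
# legacy GRUB config, unless the line already contains them.
# A backup of the original file is written alongside it with a .bak suffix.
disable_nouveau_at_boot() {
    conf=$1
    sed -i.bak '/^[[:space:]]*kernel[[:space:]]/{/rd\.driver\.blacklist=nouveau/!s/[[:space:]]*$/ rd.driver.blacklist=nouveau nouveau.modeset=0/;}' "$conf"
}

# Possible usage (run as root):
#   disable_nouveau_at_boot /boot/grub/grub.conf
```

It is worth inspecting the resulting file before the next reboot, since a malformed kernel line can leave the instance unbootable.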