Troubleshooting NICE DCV on AWS EC2 with Nvidia Drivers

NICE DCV Troubleshooting

Context and Symptoms

  • We had been using the boxes and DCV just fine for several days.
  • When we deployed 23 of these boxes this morning, we found that we couldn’t connect to DCV
  • Nobody could connect to NICE DCV. Blank bluish screen.

Troubleshooting Attempted

  • Looked at logs, but didn’t find anything helpful:
    • DCV Client Logs: ~/.local/share/nicedcv/loggy-mcloggerson.log
    • DCV Server Logs: /var/log/dcv/server.log
      • Set the log level to DEBUG
      • there was a red herring in there about “certificate not found”, but that was completely unrelated
      • there just didn’t appear to be anything in the output that indicated that anything had gone wrong
  • Decided that it might a problem with an Xserver misconfiguration
    • Xorg Logs: /var/log/Xorg.0.log

The kernel updated and broke the Nvidia driver

Ultimately it was the /var/log/Xorg.0.log that helped me figure out what was going on.

  • /var/log/Xorg.0.log talks about not being able to load the NVIDIA driver:
[    28.902] (EE) NVIDIA: Failed to initialize the NVIDIA kernel module. Please see the
[    28.902] (EE) NVIDIA:     system's kernel log for additional error messages and
[    28.902] (EE) NVIDIA:     consult the NVIDIA README for details.
[    28.902] (EE) No devices detected.

...

[    28.907] (EE) Device(s) detected, but none match those in the config file.
[    28.907] (EE)
Fatal server error:
[    28.907] (EE) no screens found(EE)
[    28.907] (EE)
Please consult the The X.Org Foundation support
	 at http://wiki.x.org
 for help.
[    28.907] (EE) Please also check the log file at "/var/log/Xorg.0.log" for additional information.
[    28.908] (EE)
[    28.966] (EE) Server terminated with error (1). Closing log file.
  • nvidia-smi says “NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.”
  • ls -al /var/modules shows TWO kernel versions:
    • 5.13.0-1031-aws was created on July 8 (probably the day we built the Packer image)
    • 5.15.0-1015-aws was created on July 15 (today, probably because we ran some unsafe apt upgrade command as part of the Ansible playbooks for application-layer and create-user

Research