Troubleshooting
A program fails with the message "too many open files"
The error indicates that the limit on simultaneously open files has been reached. This limit can be raised for the current shell session with ulimit -Sn followed by the desired number. For example:
ulimit -Sn 4096
It is advisable to execute the ulimit command only when a workflow requires keeping many files open at the same time. For example:
ulimit -Sn 4096 && python mywork.py
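The current soft and hard limits can be inspected before choosing a value; the soft limit cannot be raised above the hard limit set by the system:
# Print the current soft limit on simultaneously open files.
ulimit -Sn
# Print the hard limit, which the soft limit may not exceed.
ulimit -Hn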
Fixing permissions for shared files
If the permissions of files that are supposed to be shared are too restrictive, ask the owner to extend the permissions with:
# Allow everyone to read everything in a directory.
chmod -R a+r <path to directory>
# Allow the owner's group to modify every file in a directory.
chmod -R g+w <path to directory>
# Extend the execution and directory browsing permissions that the owner has in a directory to everyone.
find <path to directory> -executable -exec chmod a+x {} \;
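To check whether the change had the intended effect, the permission bits can be listed before and after:
# Show the permissions of the directory itself.
ls -ld <path to directory>
# Show the permissions of the files inside it.
ls -l <path to directory>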
Are GPU drivers installed everywhere?
Yes, but you might need to install CUDA in your environment:
Enrico writes that
the CUDA version shown by nvidia-smi doesn't necessarily mean that CUDA is installed; it simply tells you which CUDA release matches the installed drivers.
On Slurm nodes, CUDA is not preinstalled, because I encourage people to install it in their own conda environments so that they can keep it stable.
On non-Slurm nodes, i.e. g1-7 and g1-9 currently, CUDA is still preinstalled.
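As a rough sketch, installing CUDA inside a personal conda environment can look like the following; the environment name, Python version, and exact package/channel are assumptions to adapt to your framework and to the driver version reported by nvidia-smi:
# Create and activate a personal environment (the name is arbitrary).
conda create -n myenv python=3.11
conda activate myenv
# Install a CUDA toolkit from NVIDIA's conda channel; choose a release
# compatible with the installed drivers.
conda install -c nvidia cuda-toolkit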
GPU memory is held by orphan processes
If a parent process terminates abnormally while its children are holding GPU memory, the orphaned children can be left with claimed memory and no work to do.
In such a case, nvidia-smi will display occupied memory with no associated processes.
The fuser command can help perform cleanup and reclaim GPU memory.
For example, if nvidia-smi reports memory allocated in GPU 0, fuser -v /dev/nvidia0 can display which processes are accessing it.
                     USER        PID ACCESS COMMAND
/dev/nvidia0:        bob      869108 F...m  python
                     bob     1236874 F...m  python
                     bob     1236922 F...m  python
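To learn more about one of the listed processes, for example whether it has been orphaned (orphans are typically reparented to PID 1), ps can be queried with a PID taken from the fuser output:
# Show the parent PID, elapsed run time, and command line of process 869108.
ps -o pid,ppid,etime,cmd -p 869108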
Users can only see processes owned by themselves.
Administrators can run fuser as root to see all processes.
fuser -k kills the listed processes, sending SIGKILL unless a different signal is specified.
It can be convenient as a last resort if certain processes are difficult to terminate cleanly.
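If a gentler attempt is preferred before the default SIGKILL, fuser accepts an explicit signal, for example:
# Ask the processes holding GPU 0 to terminate gracefully before resorting to SIGKILL.
fuser -k -TERM /dev/nvidia0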
A typical troubleshooting session against leftover GPU memory might look like this:
- Examine the output of nvidia-smi and look for GPUs with allocated memory but no processes attached.
- Execute fuser -v /dev/nvidiaN to list processes owned by you that are accessing the device (replace N with the GPU index).
- Save all important work, then try closing open programs cleanly.
- Execute fuser -v /dev/nvidiaN again. If any processes still show up and you don't have a way to terminate them cleanly, you can use fuser -k /dev/nvidiaN to kill them bluntly.
- Examine the output of nvidia-smi again. If GPU memory is still occupied, ask an administrator to look into processes owned by other users.
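The steps above, condensed into shell commands (GPU index 0 is assumed here; substitute the index reported by nvidia-smi):
# Look for memory allocated on a GPU with no processes attached.
nvidia-smi
# List your own processes holding the device.
fuser -v /dev/nvidia0
# Save work and close programs cleanly, then check again; as a last resort,
# kill the remaining processes (SIGKILL by default).
fuser -k /dev/nvidia0
# Confirm that the memory has been released.
nvidia-smi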