Troubleshooting Issues with GPU Experiments on Google Cloud (Bus Error)
Recently, I ran experiments on Google Cloud using an A100-40GB GPU. At first my programs crashed frequently, sometimes without any error message, which made the root cause hard to identify. When there were symptoms, they looked like this:
- No space left
- Bus Error
- Generation of numerous core.python files.
Possible Solutions
I explored various potential causes, including issues with my image, PyTorch version, CUDA, and DeepSpeed version. However, none of these turned out to be the actual problem.
After extensive searching online, I discovered that the issue was insufficient space on /dev/shm, the shared-memory filesystem. My /dev/shm had only 64MB of space, which is the default size Docker gives a container.
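To check whether you are hitting the same limit, you can inspect /dev/shm before launching a job. The snippet below is a minimal sketch using only the Python standard library; the helper name shm_report is my own and not part of PyTorch or DeepSpeed.

```python
import os
import shutil


def shm_report(path="/dev/shm"):
    """Return (total, used, free) for the filesystem at `path`, in MiB."""
    usage = shutil.disk_usage(path)
    mib = 1024 * 1024
    return usage.total / mib, usage.used / mib, usage.free / mib


if __name__ == "__main__":
    path = "/dev/shm"
    if os.path.isdir(path):
        total, used, free = shm_report(path)
        print(f"{path}: total={total:.0f} MiB, used={used:.0f} MiB, free={free:.0f} MiB")
    else:
        # /dev/shm is Linux-specific; other platforms may not have it.
        print(f"{path} not found on this system")
```

If the reported total is around 64 MiB, the shared-memory size is the likely culprit.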
To address this problem, there are two solutions:
If you are using Docker, set the shared-memory size when you start the container (add your image name after the options):
docker run -it --shm-size 40G
If you are not using Docker, you need to remount /dev/shm with a larger size.
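For the non-Docker case, the usual approach on Linux is to remount the tmpfs with a new size option. The sketch below only builds and prints the command rather than running it, since remounting requires root. The function name remount_command is hypothetical, and the 40G value simply mirrors the Docker example; adjust it to your machine.

```python
import shlex


def remount_command(size="40G", path="/dev/shm"):
    """Build the argument list that remounts `path` with a new tmpfs size."""
    return ["mount", "-o", f"remount,size={size}", path]


if __name__ == "__main__":
    # Print the command so it can be reviewed and run manually with sudo.
    print("sudo " + shlex.join(remount_command()))
    # prints: sudo mount -o remount,size=40G /dev/shm
```

Keep in mind that /dev/shm is backed by RAM, so the size should fit within the host's memory, and that a remount done this way does not survive a reboot unless it is also configured in /etc/fstab.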
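Note: Setting num_workers = 0 for the dataloader can also make the errors go away, because data is then loaded in the main process instead of being passed from worker processes through shared memory. However, it can make data loading noticeably slower.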