I recently ran experiments on Google Cloud with an A100-40GB GPU. At first, my programs crashed repeatedly, sometimes without any error message, which made the root cause hard to identify. At other times there were specific errors or symptoms such as:

No space left
Bus Error
Generation of numerous core.python files.

Possible Solutions

I explored various potential causes, including issues with my image, PyTorch version, CUDA, and DeepSpeed version. However, none of these turned out to be the actual problem.

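After extensive searching online, I discovered that the issue was insufficient space on /dev/shm, the shared-memory filesystem. My /dev/shm had only 64MB of space, which is Docker's default for containers. PyTorch DataLoader worker processes pass batches to the main process through shared memory, so a /dev/shm that small fills up quickly.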
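Before changing anything, it helps to confirm how large /dev/shm actually is. Interactively, `df -h /dev/shm` is enough. The sketch below does the same check in a script; the helper name `shm_size_mb` and the 1024 MiB warning threshold are my own illustrative choices, not something PyTorch or Docker defines.

```shell
#!/bin/sh
# shm_size_mb is a hypothetical helper name, not a standard command.
# Prints the total size of the filesystem holding the given path, in MiB.
shm_size_mb() {
    # -P: one line per filesystem (POSIX format); -k: sizes in 1024-byte blocks.
    df -Pk "${1:-/dev/shm}" | awk 'NR == 2 { printf "%d\n", $2 / 1024 }'
}

size_mb=$(shm_size_mb /dev/shm)
echo "/dev/shm size: ${size_mb} MiB"

# 1024 MiB is an arbitrary illustrative threshold, not a PyTorch requirement.
if [ "${size_mb:-0}" -lt 1024 ]; then
    echo "warning: /dev/shm is small; multi-worker data loading may crash" >&2
fi
```

In a container started with Docker's defaults, this should report roughly 64 MiB.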

To address this problem, there are two solutions:

If you are using Docker, you can set the shm size using the following command:

docker run -it --shm-size=40G <your-image>

If you are not using Docker, you need to remount /dev/shm with a larger size.
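On a typical Linux machine, /dev/shm is a tmpfs mount, so it can be resized in place with a remount. The following is a sketch that assumes you have root access; the 40G value simply mirrors the Docker example and should be adjusted to fit the machine's RAM.

```shell
# Remount /dev/shm (a tmpfs) with a larger size limit; requires root.
# 40G is an example value; tmpfs is backed by RAM and swap, so size it accordingly.
sudo mount -o remount,size=40G /dev/shm

# Confirm the new size.
df -h /dev/shm
```

A remount like this lasts only until the next reboot. To make it permanent, the usual approach is an /etc/fstab entry for /dev/shm with the same size option.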

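Note: Setting num_workers = 0 for the DataLoader can sometimes avoid the crash, because data is then loaded in the main process instead of in worker processes that communicate through shared memory. The trade-off is slower data loading.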
