Weird "Missing ranks" error in parallel training using horovod #3937
-
Dear all,
These errors typically occurred after thousands of training steps, though they could also appear immediately after training began. During training, GPU memory usage looked fine, but GPU utilization was not balanced across the four cards. For example, see the nvidia-smi output below (the CUDA runtime API version is actually 11.8 according to nvcc):
I have no idea what's going wrong, and I don't think it's a deepmd-kit bug, since everything worked fine before the unsuccessful CUDA update. I've tried several fixes: (1) completely purging and reinstalling the driver, CUDA, and deepmd-kit, then rebooting the machine; (2) trying different versions of deepmd-kit, from 2.2.7 to 2.2.10; (3) trying different CUDA versions, from 11.8 to 12.0. Unfortunately, none of these worked. Does anyone have suggestions? Thanks. P.S. The LAMMPS shipped with deepmd-kit works fine.
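For context, a minimal sketch of how such a run is typically launched with horovod (the poster's exact command isn't shown in the thread, so the GPU list and `input.json` filename here are placeholders):

```bash
# Hypothetical 4-GPU single-node launch, following the usual
# deepmd-kit + horovod pattern; adjust to your actual setup.
CUDA_VISIBLE_DEVICES=0,1,2,3 \
horovodrun -np 4 dp train input.json
```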
-
Could you use …
-
Yes. Here is the output:
-
I don't see anything wrong before step 1500.
-
Finally, I've figured out that this problem was caused by the environment variable KMP_AFFINITY (I manually changed it to scatter). This variable should be set automatically by deepmd-kit ...
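For anyone hitting the same symptom, a minimal sketch of applying this fix, assuming the same hypothetical 4-GPU launch as above (`input.json` is a placeholder):

```bash
# KMP_AFFINITY is the Intel OpenMP thread-affinity policy;
# "scatter" spreads threads across cores instead of packing them.
export KMP_AFFINITY=scatter
horovodrun -np 4 dp train input.json
```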