Tracking Down the Cause of High GPU Memory Usage with Low GPU Utilization

2025-11-21 07:05:22

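While training a model today, I noticed that GPU memory was almost full but GPU utilization was very low: it would hit 100% only once every few seconds and then drop straight back to 0.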

See the figure below. One epoch took about a day to train, which completely wrecked my mood.

I found some useful resources online:

"Training too slow? GPU utilization won't go up? Take a look at other people's tricks" (训练效率低？GPU利用率上不去？快来看看别人家的tricks吧～)

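"A summary and analysis of low GPU utilization, low CPU utilization, and slow training in PyTorch and TensorFlow" (深度学习PyTorch，TensorFlow中GPU利用率较低，CPU利用率很低，且模型训练速度很慢的问题总结与分析)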

GPU: high memory usage, low GPU volatile-util

My guess was that the train function was spending too long on the CPU; after all, I really did see GPU utilization spike for a moment. So I tried the following:

I changed the num_workers and pin_memory parameters of the DataLoader. Neither had any effect.
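One way to check whether loader settings like these can help at all is to measure how much of each iteration is spent waiting for the next batch versus running the training step. The following is a minimal sketch of that measurement, not the code used in this post: `profile_loop`, `step`, and `sync` are hypothetical names, and the loader here is a stand-in generator. With a real model on a GPU, one would pass `torch.cuda.synchronize` as `sync`, because CUDA kernels run asynchronously and the clock would otherwise be read before the GPU work has finished.

```python
import time
from typing import Callable, Iterable, Optional, Tuple


def profile_loop(
    loader: Iterable,
    step: Callable[[object], None],
    max_batches: int = 50,
    sync: Optional[Callable[[], None]] = None,
) -> Tuple[float, float]:
    """Return (seconds spent waiting for batches, seconds spent in step)."""
    data_time = 0.0
    step_time = 0.0
    it = iter(loader)
    for _ in range(max_batches):
        t0 = time.perf_counter()
        try:
            # Blocks while the loader (including its collate_fn) prepares a batch.
            batch = next(it)
        except StopIteration:
            break
        t1 = time.perf_counter()
        # In a real loop: forward pass, backward pass, optimizer step.
        step(batch)
        if sync is not None:
            # Assumption: for GPU work, sync would be torch.cuda.synchronize.
            sync()
        t2 = time.perf_counter()
        data_time += t1 - t0
        step_time += t2 - t1
    return data_time, step_time


if __name__ == "__main__":
    # Stand-in loader and step, only to show how the function is called.
    def slow_loader(n: int = 5):
        for i in range(n):
            time.sleep(0.02)  # pretend batch preparation is slow
            yield i

    data_s, step_s = profile_loop(slow_loader(), step=lambda b: time.sleep(0.005))
    print(f"data wait: {data_s:.3f}s, step: {step_s:.3f}s")
```

If the data wait dominates, the bottleneck is on the input side (loading, collate_fn, too few workers). If the data wait stays near zero no matter what num_workers is set to, the loader is not the problem and the time is going somewhere else.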

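Next I suspected that padding each batch as it is generated was taking too long. I first read the source of dataloader.py and fetch.py and found that every iteration does indeed call collate_fn and then returns the padded data.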
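To test the padding hypothesis directly, the collate function can be timed on its own, outside the training loop. The sketch below is only an illustration under stated assumptions: `pad_collate` and `pad_id` are hypothetical names, not the collate_fn from this post, and it pads plain Python lists so that it runs without PyTorch. A real collate_fn would usually return tensors, for example built with `torch.nn.utils.rnn.pad_sequence`.

```python
import timeit
from typing import List, Sequence, Tuple


def pad_collate(
    batch: Sequence[Sequence[int]], pad_id: int = 0
) -> Tuple[List[List[int]], List[int]]:
    """Pad every sequence to the longest one in this batch.

    Returns the padded sequences together with their original lengths.
    """
    lengths = [len(seq) for seq in batch]
    max_len = max(lengths, default=0)
    padded = [list(seq) + [pad_id] * (max_len - len(seq)) for seq in batch]
    return padded, lengths


if __name__ == "__main__":
    # Synthetic batch of 32 sequences with lengths 1 through 32.
    batch = [[1] * n for n in range(1, 33)]
    per_call = timeit.timeit(lambda: pad_collate(batch), number=1000) / 1000
    print(f"pad_collate: {per_call * 1e6:.1f} microseconds per batch")
```

Comparing the per-batch time measured this way with the duration of one training iteration gives a rough idea of whether padding alone could account for the idle GPU.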

class DataLoader(Generic[T_co]):
    ...

    def __iter__(self) -> '_BaseDataLoaderIter':
        # When using a single worker the returned iterator should be
        # created everytime to avoid reseting its state
        # However, in the case of a multiple workers iterator
        # the iterator is only created once in the lifetime of the
        # DataLoader object so that workers can be reused
        if self.persistent_workers and self.num_workers > 0:
            if self._iterator is <