I am trying to run Keras code on a GPU node inside a cluster. Each GPU node has 4 GPUs, and I have confirmed that all 4 GPUs on my node are available to me. I run the code below to make TensorFlow use the GPUs:
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
            logical_gpus = tf.config.list_logical_devices('GPU')
            print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        print(e)
The output lists 4 available GPUs. However, when I run my code I get the following error:
Traceback (most recent call last):
File "/BayesOptimization.py", line 20, in <module>
logical_gpus = tf.config.experimental.list_logical_devices('GPU')
File "/.conda/envs/thesis/lib/python3.9/site-packages/tensorflow/python/framework/config.py", line 439, in list_logical_devices
return context.context().list_logical_devices(device_type=device_type)
File "/.conda/envs/thesis/lib/python3.9/site-packages/tensorflow/python/eager/context.py", line 1368, in list_logical_devices
self.ensure_initialized()
File "/.conda/envs/thesis/lib/python3.9/site-packages/tensorflow/python/eager/context.py", line 511, in ensure_initialized
config_str = self.config.SerializeToString()
File "/.conda/envs/thesis/lib/python3.9/site-packages/tensorflow/python/eager/context.py", line 1015, in config
gpu_options = self._compute_gpu_options()
File "/.conda/envs/thesis/lib/python3.9/site-packages/tensorflow/python/eager/context.py", line 1074, in _compute_gpu_options
raise ValueError("Memory growth cannot differ between GPU devices")
ValueError: Memory growth cannot differ between GPU devices
Shouldn't this code list all available GPUs and set memory growth to true for each of them?
I am currently using the following TensorFlow packages with Python 3.9.7:
tensorflow 2.4.1 gpu_py39h8236f22_0
tensorflow-base 2.4.1 gpu_py39h29c2da4_0
tensorflow-estimator 2.4.1 pyheb71bc4_0
tensorflow-gpu 2.4.1 h30adc30_0
Any idea what the problem is and how to fix it? Thanks in advance!
Just try os.environ["CUDA_VISIBLE_DEVICES"] = "0" instead of tf.config.experimental.set_memory_growth. That worked for me.
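If a single GPU is enough for your job, a minimal sketch of this workaround (the crucial point is that the variable must be set before TensorFlow is imported, since the CUDA runtime reads it at initialization):

```python
import os

# Expose only GPU 0 to the process. This must happen *before*
# `import tensorflow`, because the CUDA runtime reads the variable
# when it initializes.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# import tensorflow as tf  # import TensorFlow only after the line above
```

With only one visible device, the "memory growth cannot differ between GPU devices" situation cannot arise in the first place.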
On a multi-GPU machine, the memory-growth setting must be the same for all available GPUs: either set it to true for all of them, or leave it false for all of them.
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    try:
        # Currently, memory growth needs to be the same across GPUs
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
        logical_gpus = tf.config.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        # Memory growth must be set before GPUs have been initialized
        print(e)
I think it is just a matter of being careful with the indentation. For me, your code does not work:
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
            logical_gpus = tf.config.list_logical_devices('GPU')
            print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        print(e)
But if you wait until the loop has finished before calling logical_gpus = tf.config.list_logical_devices('GPU'), it works, because that call initializes the runtime and at that point memory growth is already set on every GPU. So change it to:
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
        logical_gpus = tf.config.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        print(e)
I think what happens is that you need to set all visible GPUs in the config before calling any other accessor on it. So another option is to call tf.config.set_visible_devices first. If you restrict yourself to one GPU you don't need the loop at all; if you use two or more GPUs, you need to be careful to call set_memory_growth on all of them before accessing the config or running any tf code.
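A hedged sketch of that single-GPU option, assuming TensorFlow 2.x (the import is guarded so the snippet degrades to a no-op on a machine without TensorFlow installed):

```python
try:
    import tensorflow as tf
except ImportError:  # no TensorFlow available: the sketch becomes a no-op
    tf = None

if tf is not None:
    gpus = tf.config.list_physical_devices('GPU')
    if gpus:
        # Both calls must run before anything else initializes the TF runtime.
        # With a single visible device, no loop over GPUs is needed.
        tf.config.set_visible_devices(gpus[0], 'GPU')
        tf.config.experimental.set_memory_growth(gpus[0], True)
```

Restricting visibility to one device sidesteps the "memory growth cannot differ" check entirely, since there is only one GPU left for the setting to apply to.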