So I built a rig with 2 Titan Xp cards and followed the multi-GPU training example at https://github.com/awslabs/keras-apache-mxnet/wiki/Multi-GPU-Model-Training-with-Keras-MXNet. I only changed two pieces of code: the gpus value in the model section (gpus=4 in the example, which I set to 2 for my two cards) and batch_size = 32 * 2 in the batch-size section.
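For reference, a minimal sketch of the relevant part of my script (the network itself is stubbed out here as build_model(), a placeholder rather than the wiki's actual model):

import keras
from keras.models import Sequential
from keras.layers import Dense

def build_model():
    # placeholder; the real script builds the model from the wiki example
    model = Sequential()
    model.add(Dense(10, activation='softmax', input_shape=(784,)))
    return model

model = build_model()

# the two values I changed relative to the wiki example:
model = keras.utils.multi_gpu_model(model, gpus=2)  # example uses gpus=4
batch_size = 32 * 2                                  # 32 samples per GPU

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])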
Then I get this strange error: the first part of the log actually detects my GPUs (compute capability and so on), but by the end it only recognizes my CPU:
2019-11-19 10:43:32.935282: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-11-19 10:43:32.940953: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1
2019-11-19 10:43:33.115668: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-11-19 10:43:33.116756: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x27557f0 executing computations on platform CUDA. Devices:
2019-11-19 10:43:33.116793: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): TITAN Xp COLLECTORS EDITION, Compute Capability 6.1
2019-11-19 10:43:33.116799: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (1): TITAN Xp COLLECTORS EDITION, Compute Capability 6.1
2019-11-19 10:43:33.135701: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3500025000 Hz
2019-11-19 10:43:33.137115: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x277ba60 executing computations on platform Host. Devices:
2019-11-19 10:43:33.137144: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): <undefined>, <undefined>
2019-11-19 10:43:33.139168: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties: name: TITAN Xp COLLECTORS EDITION major: 6 minor: 1 memoryClockRate(GHz): 1.582 pciBusID: 0000:0a:00.0
2019-11-19 10:43:33.139381: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-11-19 10:43:33.140815: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 1 with properties: name: TITAN Xp COLLECTORS EDITION major: 6 minor: 1 memoryClockRate(GHz): 1.582 pciBusID: 0000:41:00.0
2019-11-19 10:43:33.141201: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcudart.so.10.0'; dlerror: libcudart.so.10.0: cannot open shared object file: No such file or directory
2019-11-19 10:43:33.141268: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcublas.so.10.0'; dlerror: libcublas.so.10.0: cannot open shared object file: No such file or directory
2019-11-19 10:43:33.141330: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcufft.so.10.0'; dlerror: libcufft.so.10.0: cannot open shared object file: No such file or directory
2019-11-19 10:43:33.141389: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcurand.so.10.0'; dlerror: libcurand.so.10.0: cannot open shared object file: No such file or directory
2019-11-19 10:43:33.141452: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcusolver.so.10.0'; dlerror: libcusolver.so.10.0: cannot open shared object file: No such file or directory
2019-11-19 10:43:33.141512: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcusparse.so.10.0'; dlerror: libcusparse.so.10.0: cannot open shared object file: No such file or directory
2019-11-19 10:43:33.207406: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2019-11-19 10:43:33.207452: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1663] Cannot dlopen some GPU libraries. Skipping registering GPU devices...
2019-11-19 10:43:33.207550: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-11-19 10:43:33.207568: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187] 0 1
2019-11-19 10:43:33.207578: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0: N Y
2019-11-19 10:43:33.207584: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 1: Y N
2019-11-19 10:43:33.229007: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1412] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set.
If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU. To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile.
Traceback (most recent call last):
File "multi-gpu.py", line 42, in <module>
model = keras.utils.multi_gpu_model(model, gpus=2)
File "/home/gormosity/.local/lib/python3.6/site-packages/keras/utils/multi_gpu_utils.py", line 184, in multi_gpu_model available_devices))
ValueError: To call `multi_gpu_model` with `gpus=2`, we expect the following devices to be available: ['/cpu:0', '/gpu:0', '/gpu:1']. However this machine only has: ['/cpu:0']. Try reducing `gpus`.
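A quick way to cross-check what TensorFlow itself registers (independent of nvidia-smi) is to list its local devices; a small diagnostic sketch, run in the same environment as multi-gpu.py:

import tensorflow as tf
from tensorflow.python.client import device_lib

# Lists every device TensorFlow registered. If the CUDA 10.0 libraries failed
# to load (the "Could not dlopen library 'libcudart.so.10.0'" lines above),
# only the CPU shows up here, which is exactly what the ValueError complains about.
for dev in device_lib.list_local_devices():
    print(dev.name, dev.device_type)

print(tf.test.is_gpu_available())  # expected False while the GPUs are not registered

nvidia-smi, on the other hand, sees both cards: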
nvidia-smi
| NVIDIA-SMI 418.87.00 Driver Version: 418.87.00 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 TITAN Xp COLLEC... On | 00000000:0A:00.0 Off | N/A |
| 23% 24C P8 10W / 250W | 157MiB / 12196MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 TITAN Xp COLLEC... On | 00000000:41:00.0 On | N/A |
| 23% 36C P5 27W / 250W | 460MiB / 12192MiB | 4% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 3860 C python3 145MiB |
| 1 1253 G /usr/lib/xorg/Xorg 18MiB |
| 1 1282 G /usr/bin/gnome-shell 51MiB |
| 1 1650 G /usr/lib/xorg/Xorg 116MiB |
| 1 1781 G /usr/bin/gnome-shell 124MiB |
| 1 3860 C python3 145MiB |
+-----------------------------------------------------------------------------+
Your error messages show TensorFlow running as the backend, and CUDA 10.1 may be the compatibility problem here: the log is looking for the CUDA 10.0 libraries (libcudart.so.10.0 and friends), so unless you compiled TensorFlow yourself, that mismatch is probably the issue. You may also need to install mxnet-cu101 (only if you actually want MXNet as the backend, of course; if you don't, keras-mxnet is not the right package to be using). You could try switching the Keras backend to the MXNet backend.
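If you do want to try that, a minimal sketch of the switch, assuming keras-mxnet is already installed and that mxnet-cu101 is the MXNet build matching your CUDA 10.1 driver (both the package name and the exact CUDA match are assumptions to verify against the MXNet install docs):

# pip install mxnet-cu101               # GPU build of MXNet for CUDA 10.1 (assumption)
import os
os.environ['KERAS_BACKEND'] = 'mxnet'   # Keras honors this env var at import time

import keras                            # should report "Using MXNet backend"
print(keras.backend.backend())          # expect 'mxnet'

# multi_gpu_model(model, gpus=2) then runs on MXNet instead of TensorFlow,
# so the missing libcudart.so.10.0 no longer blocks GPU registration.

Equivalently, you can set "backend": "mxnet" in ~/.keras/keras.json.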