我正在训练 Keras 模型,我需要切换设备以获得更多功能(从 Windows i3 core 到 Ubuntu i7)。问题是,我的代码在 Windows 上运行良好,但显示以下错误,该错误甚至在运行第一个纪元之前就停止了计算。 这是完整的输出:
/home/willylutz/PycharmProjects/hiv_image_analysis/venv/bin/python /home/willylutz/PycharmProjects/hiv_image_analysis/main.py
2022-09-19 09:36:52.801711: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-09-19 09:36:52.956260: E tensorflow/stream_executor/cuda/cuda_blas.cc:2981] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2022-09-19 09:36:53.502748: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/willylutz/PycharmProjects/hiv_image_analysis/venv/lib/python3.8/site-packages/cv2/../../lib64:
2022-09-19 09:36:53.502794: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/willylutz/PycharmProjects/hiv_image_analysis/venv/lib/python3.8/site-packages/cv2/../../lib64:
2022-09-19 09:36:53.502800: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
0
Found 480 files belonging to 2 classes.
Using 384 files for training.
2022-09-19 09:37:00.058171: E tensorflow/stream_executor/cuda/cuda_driver.cc:265] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2022-09-19 09:37:00.058202: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (zhang): /proc/driver/nvidia/version does not exist
2022-09-19 09:37:00.067388: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
Found 480 files belonging to 2 classes.
Using 96 files for validation.
['INF', 'NI']
2022-09-19 09:37:10.149236: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:390] Filling up shuffle buffer (this may take a while): 373 of 512
2022-09-19 09:37:10.197351: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:415] Shuffle buffer filled.
(64, 1024, 1024, 3)
(64,)
WARNING:tensorflow:Using a while_loop for converting RngReadAndSkip cause there is no registered converter for this op.
WARNING:tensorflow:Using a while_loop for converting Bitcast cause there is no registered converter for this op.
WARNING:tensorflow:Using a while_loop for converting Bitcast cause there is no registered converter for this op.
WARNING:tensorflow:Using a while_loop for converting StatelessRandomUniformV2 cause there is no registered converter for this op.
WARNING:tensorflow:Using a while_loop for converting ImageProjectiveTransformV3 cause there is no registered converter for this op.
WARNING:tensorflow:Using a while_loop for converting RngReadAndSkip cause there is no registered converter for this op.
WARNING:tensorflow:Using a while_loop for converting Bitcast cause there is no registered converter for this op.
WARNING:tensorflow:Using a while_loop for converting Bitcast cause there is no registered converter for this op.
WARNING:tensorflow:Using a while_loop for converting StatelessRandomUniformV2 cause there is no registered converter for this op.
WARNING:tensorflow:Using a while_loop for converting ImageProjectiveTransformV3 cause there is no registered converter for this op.
WARNING:tensorflow:Using a while_loop for converting RngReadAndSkip cause there is no registered converter for this op.
WARNING:tensorflow:Using a while_loop for converting Bitcast cause there is no registered converter for this op.
WARNING:tensorflow:Using a while_loop for converting Bitcast cause there is no registered converter for this op.
WARNING:tensorflow:Using a while_loop for converting StatelessRandomUniformV2 cause there is no registered converter for this op.
WARNING:tensorflow:Using a while_loop for converting ImageProjectiveTransformV3 cause there is no registered converter for this op.
WARNING:tensorflow:Using a while_loop for converting RngReadAndSkip cause there is no registered converter for this op.
WARNING:tensorflow:Using a while_loop for converting Bitcast cause there is no registered converter for this op.
WARNING:tensorflow:Using a while_loop for converting Bitcast cause there is no registered converter for this op.
WARNING:tensorflow:Using a while_loop for converting StatelessRandomUniformV2 cause there is no registered converter for this op.
WARNING:tensorflow:Using a while_loop for converting ImageProjectiveTransformV3 cause there is no registered converter for this op.
Model: "sequential_1"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
sequential (Sequential) (None, 1024, 1024, 3) 0
rescaling (Rescaling) (None, 1024, 1024, 3) 0
conv2d (Conv2D) (None, 1024, 1024, 16) 448
max_pooling2d (MaxPooling2D (None, 512, 512, 16) 0
)
conv2d_1 (Conv2D) (None, 512, 512, 32) 4640
max_pooling2d_1 (MaxPooling (None, 256, 256, 32) 0
2D)
conv2d_2 (Conv2D) (None, 256, 256, 64) 18496
max_pooling2d_2 (MaxPooling (None, 128, 128, 64) 0
2D)
dropout (Dropout) (None, 128, 128, 64) 0
flatten (Flatten) (None, 1048576) 0
dense (Dense) (None, 128) 134217856
outputs (Dense) (None, 2) 258
=================================================================
Total params: 134,241,698
Trainable params: 134,241,698
Non-trainable params: 0
_________________________________________________________________
Epoch 1/5
WARNING:tensorflow:Using a while_loop for converting RngReadAndSkip cause there is no registered converter for this op.
WARNING:tensorflow:Using a while_loop for converting Bitcast cause there is no registered converter for this op.
WARNING:tensorflow:Using a while_loop for converting Bitcast cause there is no registered converter for this op.
WARNING:tensorflow:Using a while_loop for converting StatelessRandomUniformV2 cause there is no registered converter for this op.
WARNING:tensorflow:Using a while_loop for converting ImageProjectiveTransformV3 cause there is no registered converter for this op.
WARNING:tensorflow:Using a while_loop for converting RngReadAndSkip cause there is no registered converter for this op.
WARNING:tensorflow:Using a while_loop for converting Bitcast cause there is no registered converter for this op.
WARNING:tensorflow:Using a while_loop for converting Bitcast cause there is no registered converter for this op.
WARNING:tensorflow:Using a while_loop for converting StatelessRandomUniformV2 cause there is no registered converter for this op.
WARNING:tensorflow:Using a while_loop for converting ImageProjectiveTransformV3 cause there is no registered converter for this op.
WARNING:tensorflow:Using a while_loop for converting RngReadAndSkip cause there is no registered converter for this op.
WARNING:tensorflow:Using a while_loop for converting Bitcast cause there is no registered converter for this op.
WARNING:tensorflow:Using a while_loop for converting Bitcast cause there is no registered converter for this op.
WARNING:tensorflow:Using a while_loop for converting StatelessRandomUniformV2 cause there is no registered converter for this op.
WARNING:tensorflow:Using a while_loop for converting ImageProjectiveTransformV3 cause there is no registered converter for this op.
WARNING:tensorflow:Using a while_loop for converting RngReadAndSkip cause there is no registered converter for this op.
WARNING:tensorflow:Using a while_loop for converting Bitcast cause there is no registered converter for this op.
WARNING:tensorflow:Using a while_loop for converting Bitcast cause there is no registered converter for this op.
WARNING:tensorflow:Using a while_loop for converting StatelessRandomUniformV2 cause there is no registered converter for this op.
WARNING:tensorflow:Using a while_loop for converting ImageProjectiveTransformV3 cause there is no registered converter for this op.
2022-09-19 09:37:16.377367: W tensorflow/core/framework/cpu_allocator_impl.cc:82] Allocation of 4294967296 exceeds 10% of free system memory.
Process finished with exit code 137 (interrupted by signal 9: SIGKILL)
如果需要,我也可以放置我的代码,但我觉得这不是问题。 感谢您的帮助。
keras/tensorflow 2.9 和 2.10 中存在一个 bug,导致像重新缩放这样的预处理层极其缓慢: https://github.com/tensorflow/tensorflow/issues/56242
尝试没有重新缩放层的模型。如果您想使用此或类似的预处理层,您应该使用 TF 2.8.3 或更早版本。
我不确定以下内容,但是
Could not load dynamic library 'libnvinfer.so.7'
似乎表明还有其他一些问题,可能是tensorflow / keras / cuda版本错误。
我遇到了同样的问题。我通过将源代码从 keras Rescaling 层复制到我的代码中来解决这个问题,并制作了我的“自己的”Rescaling 类。我将 math_ops.cast() 更改为 tf.cast(),它的效果非常好。没有警告,代码运行得更快。
class Rescaling(tf.keras.layers.Layer):
"""Multiply inputs by `scale` and adds `offset`.
For instance:
1. To rescale an input in the `[0, 255]` range
to be in the `[0, 1]` range, you would pass `scale=1./255`.
2. To rescale an input in the `[0, 255]` range to be in the `[-1, 1]`
range,
you would pass `scale=1./127.5, offset=-1`.
The rescaling is applied both during training and inference.
Input shape:
Arbitrary.
Output shape:
Same as input.
Arguments:
scale: Float, the scale to apply to the inputs.
offset: Float, the offset to apply to the inputs.
name: A string, the name of the layer.
"""
def __init__(self, scale, offset=0., name=None, **kwargs):
self.scale = scale
self.offset = offset
super(Rescaling, self).__init__(name=name, **kwargs)
def call(self, inputs):
dtype = self._compute_dtype
scale = tf.cast(self.scale, dtype)
offset = tf.cast(self.offset, dtype)
return tf.cast(inputs, dtype) * scale + offset
def compute_output_shape(self, input_shape):
return input_shape
def get_config(self):
config = {
'scale': self.scale,
'offset': self.offset,
}
base_config = super(Rescaling, self).get_config()
return dict(list(base_config.items()) + list(config.items()))
我也出现过同样的错误。可能的原因是您用于增强目的的预处理层导致了此错误。
这两个警告:
Could not load dynamic library 'libnvinfer.so.7'
libnvinfer_plugin.so.7: cannot open shared object file
可以安全地忽略。
NVidia 有一个名为 TensorRT 的自定义神经网络框架。
简而言之,TensorRT 对内部模型图进行优化,这意味着模型执行得更快。通常,您首先将张量流模型保存到onnx,然后将模型从onnx转换为TensorRT。最近,TF 和 NVidia 的进步使得在推理模式下运行时(在具有 NVidia GPU 的计算机上运行时)可以利用这一点更快地执行张量流模型。
以前,您需要从源代码构建 TF 才能启用 TensorRT 集成。似乎在最新版本中,它现在默认启用(我想在降级到 tf v2.8 之前,您使用的是 tf v2.11)。
但是,要使用 TensorRT 优化运行 TF 模型,您需要在 PC 上安装 NVidia 的某些驱动程序和库。因为您没有安装它们,TF 会警告您找不到它们。
如顶部所述,您可以安全地忽略该警告,它不会以任何方式影响您的训练或 TF 表现。
数据增强是这个问题的原因之一。我注释了旋转选项(layers.experimental.preprocessing.RandomRotation(0.2)并再次尝试,警告消失了。
我对 autokeras 也有同样的问题。第一次训练总是效果很好,然后我收到这些警告,从那里开始训练变得非常慢。我尝试返回 Tensorflow 2.8.3 和 Keras 2.8,但它引发了另一个问题(无法在 tf 实验库中找到某些内容)。 还在 GPU 上尝试过 TF 2.10,但由于很快就面临内存不足问题而放弃了它。 又尝试了最新的TF和Keras,现象是一样的,只是没有警告。第一次训练后,训练一个周期从 2-3 分钟缩短到 2-3 小时。
您使用什么版本的 Keras?