Why does this AutoKeras NAS fail?


I am using:

  1. NVIDIA GeForce GTX 780 (Kepler)
  2. Driver version: 470.223.02
  3. CUDA Toolkit v11.4.0
  4. cuDNN v8.2.4
  5. TensorFlow and Keras v2.8.0
  6. AutoKeras v1.0.17
  7. Ubuntu 20.04

========================

I have two directories, `train_data_npy` and `valid_data_npy`, which contain 3013 and 1506 `*.npy` files respectively.

Each `*.npy` file has 12 columns of type float: the first nine columns are features, and the last three are the one-hot-encoded labels of three classes.
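Because the original data is not available, a minimal reproducible example needs files in the same layout. The sketch below writes synthetic `.npy` files with 9 random feature columns followed by a 3-column one-hot label block. The function name `write_synthetic_npy`, the file naming scheme and the random distributions are hypothetical choices for illustration, not part of the question.

```python
import os
import tempfile

import numpy as np

N_FEATURES = 9
N_CLASSES = 3


def write_synthetic_npy(folder, n_files, rows_per_file, seed=0):
    """Write n_files .npy files of shape (rows_per_file, 12).

    Hypothetical helper: 9 random float features, then a 3-column one-hot label block.
    """
    rng = np.random.default_rng(seed)
    os.makedirs(folder, exist_ok=True)
    paths = []
    for i in range(n_files):
        x = rng.normal(size=(rows_per_file, N_FEATURES))
        labels = rng.integers(0, N_CLASSES, size=rows_per_file)
        y = np.eye(N_CLASSES)[labels]  # one-hot encode the integer labels
        data = np.hstack([x, y]).astype(np.float32)
        path = os.path.join(folder, f"part_{i:05d}.npy")
        np.save(path, data)
        paths.append(path)
    return paths


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        paths = write_synthetic_npy(tmp, n_files=2, rows_per_file=5)
        print(np.load(paths[0]).shape)  # (5, 12)
```

Pointing the script's two folder paths at directories filled this way lets others run it end to end without the real data.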

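The following Python script is meant to load these `*.npy` files in chunks so that memory does not overflow while searching for a neural network model.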
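The chunked loading described above can also be sketched in plain NumPy. This is a variant, not the asker's code: the hypothetical `iter_npy_batches` opens each file with `np.load(..., mmap_mode="r")`, so only the rows of the current batch are actually read into memory (the script's `np.load` reads each whole file), and it sorts the file names because `os.listdir` returns them in arbitrary order.

```python
import os

import numpy as np


def iter_npy_batches(folder_path, batch_size, n_features):
    """Yield (x, y) batches from every .npy file in folder_path (hypothetical helper).

    x: float32 array of shape (<= batch_size, n_features)
    y: int32 class indices recovered from the one-hot columns after the features
    """
    npy_files = sorted(f for f in os.listdir(folder_path) if f.endswith(".npy"))
    for npy_file in npy_files:
        # mmap_mode="r" maps the file instead of loading it fully into RAM.
        data = np.load(os.path.join(folder_path, npy_file), mmap_mode="r")
        for start in range(0, data.shape[0], batch_size):
            chunk = data[start:start + batch_size]  # lazy view into the mapped file
            x = np.array(chunk[:, :n_features], dtype=np.float32)  # reads only these rows
            y = np.argmax(chunk[:, n_features:], axis=1).astype(np.int32)
            yield x, y
```

To feed it to Keras or AutoKeras, it can be wrapped the same way as in the script, e.g. `tf.data.Dataset.from_generator(lambda: iter_npy_batches(train_data_folder, BATCH_SIZE, N_FEATURES), output_signature=...)`.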

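However, the script fails.

What exactly is wrong with the given script?

Why does the script fail?

Or is the problem not the script at all, but the CUDA, TF, or AutoKeras installation?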

```python
# File: cnn_search_by_chunk.py
import numpy as np
import tensorflow as tf
import os
import autokeras as ak

N_FEATURES = 9
BATCH_SIZE = 100

def get_data_generator(folder_path, batch_size, n_features):
    """Get a generator returning batches of data from .npy files in the specified folder.

    The shape of the features is (batch_size, n_features).
    """
    def data_generator():
        files = os.listdir(folder_path)
        npy_files = [f for f in files if f.endswith('.npy')]

        for npy_file in npy_files:
            data = np.load(os.path.join(folder_path, npy_file))
            x = data[:, :n_features]
            y = data[:, n_features:]
            y = np.argmax(y, axis=1)  # Convert one-hot-encoded labels back to integers

            for i in range(0, len(x), batch_size):
                yield x[i:i+batch_size], y[i:i+batch_size]

    return data_generator

train_data_folder = '/home/my_user_name/original_data/train_data_npy'
validation_data_folder = '/home/my_user_name/original_data/valid_data_npy'

train_dataset = tf.data.Dataset.from_generator(
    get_data_generator(train_data_folder, BATCH_SIZE, N_FEATURES),
    output_signature=(
        tf.TensorSpec(shape=(None, N_FEATURES), dtype=tf.float32),
        tf.TensorSpec(shape=(None,), dtype=tf.int32)  # Labels are now 1D integers
    )
)

validation_dataset = tf.data.Dataset.from_generator(
    get_data_generator(validation_data_folder, BATCH_SIZE, N_FEATURES),
    output_signature=(
        tf.TensorSpec(shape=(None, N_FEATURES), dtype=tf.float32),
        tf.TensorSpec(shape=(None,), dtype=tf.int32)  # Labels are now 1D integers
    )
)

clf = ak.StructuredDataClassifier(overwrite=True, max_trials=1, seed=5)
clf.fit(x=train_dataset, validation_data=validation_dataset, batch_size=BATCH_SIZE)
print(clf.evaluate(validation_dataset))
```
my_user_name@192:~/my_project_name_v2$ python3 cnn_search_by_chunk.py
2023-11-29 20:05:53.532005: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Using TensorFlow backend
2023-11-29 20:05:55.467804: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1960] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...

Search: Running Trial #1

Hyperparameter    |Value             |Best Value So Far
structured_data...|True              |?
structured_data...|2                 |?
structured_data...|False             |?
structured_data...|0                 |?
structured_data...|32                |?
structured_data...|32                |?
classification_...|0                 |?
optimizer         |adam              |?
learning_rate     |0.001             |?

Epoch 1/1000
33143/33143 [==============================] - 149s 4ms/step - loss: 0.0670 - accuracy: 0.9677 - val_loss: 0.0612 - val_accuracy: 0.9708
Epoch 2/1000
33143/33143 [==============================] - 146s 4ms/step - loss: 0.0625 - accuracy: 0.9697 - val_loss: 0.0598 - val_accuracy: 0.9715
Epoch 3/1000
33143/33143 [==============================] - 147s 4ms/step - loss: 0.0617 - accuracy: 0.9702 - val_loss: 0.0593 - val_accuracy: 0.9717
Epoch 4/1000
33143/33143 [==============================] - 146s 4ms/step - loss: 0.0614 - accuracy: 0.9703 - val_loss: 0.0591 - val_accuracy: 0.9718
Epoch 5/1000
33143/33143 [==============================] - 147s 4ms/step - loss: 0.0612 - accuracy: 0.9705 - val_loss: 0.0590 - val_accuracy: 0.9719
Epoch 6/1000
33143/33143 [==============================] - 145s 4ms/step - loss: 0.0610 - accuracy: 0.9707 - val_loss: 0.0588 - val_accuracy: 0.9721
Epoch 7/1000
33143/33143 [==============================] - 147s 4ms/step - loss: 0.0608 - accuracy: 0.9707 - val_loss: 0.0586 - val_accuracy: 0.9721
Epoch 8/1000
33143/33143 [==============================] - 147s 4ms/step - loss: 0.0607 - accuracy: 0.9709 - val_loss: 0.0585 - val_accuracy: 0.9723
Epoch 9/1000
33143/33143 [==============================] - 146s 4ms/step - loss: 0.0605 - accuracy: 0.9710 - val_loss: 0.0584 - val_accuracy: 0.9723
Epoch 10/1000
33143/33143 [==============================] - 146s 4ms/step - loss: 0.0604 - accuracy: 0.9710 - val_loss: 0.0583 - val_accuracy: 0.9724
Epoch 11/1000
33143/33143 [==============================] - 148s 4ms/step - loss: 0.0603 - accuracy: 0.9711 - val_loss: 0.0583 - val_accuracy: 0.9724
Epoch 12/1000
33143/33143 [==============================] - 146s 4ms/step - loss: 0.0602 - accuracy: 0.9712 - val_loss: 0.0582 - val_accuracy: 0.9724
Epoch 13/1000
33143/33143 [==============================] - 147s 4ms/step - loss: 0.0601 - accuracy: 0.9712 - val_loss: 0.0582 - val_accuracy: 0.9724
Epoch 14/1000
33143/33143 [==============================] - 148s 4ms/step - loss: 0.0601 - accuracy: 0.9712 - val_loss: 0.0582 - val_accuracy: 0.9724
Epoch 15/1000
33143/33143 [==============================] - 146s 4ms/step - loss: 0.0600 - accuracy: 0.9713 - val_loss: 0.0582 - val_accuracy: 0.9724
Epoch 16/1000
33143/33143 [==============================] - 146s 4ms/step - loss: 0.0600 - accuracy: 0.9713 - val_loss: 0.0581 - val_accuracy: 0.9725
Epoch 17/1000
33143/33143 [==============================] - 147s 4ms/step - loss: 0.0600 - accuracy: 0.9713 - val_loss: 0.0581 - val_accuracy: 0.9725
Epoch 18/1000
33143/33143 [==============================] - 147s 4ms/step - loss: 0.0599 - accuracy: 0.9713 - val_loss: 0.0582 - val_accuracy: 0.9724
Epoch 19/1000
33143/33143 [==============================] - 145s 4ms/step - loss: 0.0599 - accuracy: 0.9713 - val_loss: 0.0581 - val_accuracy: 0.9724
Epoch 20/1000
33143/33143 [==============================] - 144s 4ms/step - loss: 0.0599 - accuracy: 0.9713 - val_loss: 0.0582 - val_accuracy: 0.9724
Epoch 21/1000
33143/33143 [==============================] - 147s 4ms/step - loss: 0.0599 - accuracy: 0.9713 - val_loss: 0.0582 - val_accuracy: 0.9724
Epoch 22/1000
33143/33143 [==============================] - 144s 4ms/step - loss: 0.0599 - accuracy: 0.9713 - val_loss: 0.0581 - val_accuracy: 0.9724
Epoch 23/1000
33143/33143 [==============================] - 146s 4ms/step - loss: 0.0600 - accuracy: 0.9713 - val_loss: 0.0582 - val_accuracy: 0.9724
Epoch 24/1000
33143/33143 [==============================] - 145s 4ms/step - loss: 0.0599 - accuracy: 0.9714 - val_loss: 0.0581 - val_accuracy: 0.9725
Epoch 25/1000
33143/33143 [==============================] - 147s 4ms/step - loss: 0.0599 - accuracy: 0.9714 - val_loss: 0.0581 - val_accuracy: 0.9724
Epoch 26/1000
33143/33143 [==============================] - 147s 4ms/step - loss: 0.0599 - accuracy: 0.9713 - val_loss: 0.0581 - val_accuracy: 0.9724
Trial 1 Complete [01h 16m 38s]
val_accuracy: 0.9724819660186768

Best val_accuracy So Far: 0.9724819660186768
Total elapsed time: 01h 16m 38s
WARNING:tensorflow:Detecting that an object or model or tf.train.Checkpoint is being deleted with unrestored values. See the following logs for the specific values in question. To silence these warnings, use `status.expect_partial()`. See https://www.tensorflow.org/api_docs/python/tf/train/Checkpoint#restorefor details about the status object returned by the restore function.
WARNING:tensorflow:Detecting that an object or model or tf.train.Checkpoint is being deleted with unrestored values. See the following logs for the specific values in question. To silence these warnings, use `status.expect_partial()`. See https://www.tensorflow.org/api_docs/python/tf/train/Checkpoint#restorefor details about the status object returned by the restore function.
WARNING:tensorflow:Value in checkpoint could not be found in the restored object: (root).optimizer._variables.1
WARNING:tensorflow:Value in checkpoint could not be found in the restored object: (root).optimizer._variables.1
WARNING:tensorflow:Value in checkpoint could not be found in the restored object: (root).optimizer._variables.2
WARNING:tensorflow:Value in checkpoint could not be found in the restored object: (root).optimizer._variables.2
WARNING:tensorflow:Value in checkpoint could not be found in the restored object: (root).optimizer._variables.3
WARNING:tensorflow:Value in checkpoint could not be found in the restored object: (root).optimizer._variables.3
WARNING:tensorflow:Value in checkpoint could not be found in the restored object: (root).optimizer._variables.4
WARNING:tensorflow:Value in checkpoint could not be found in the restored object: (root).optimizer._variables.4
WARNING:tensorflow:Value in checkpoint could not be found in the restored object: (root).optimizer._variables.5
WARNING:tensorflow:Value in checkpoint could not be found in the restored object: (root).optimizer._variables.5
WARNING:tensorflow:Value in checkpoint could not be found in the restored object: (root).optimizer._variables.6
WARNING:tensorflow:Value in checkpoint could not be found in the restored object: (root).optimizer._variables.6
WARNING:tensorflow:Value in checkpoint could not be found in the restored object: (root).optimizer._variables.7
WARNING:tensorflow:Value in checkpoint could not be found in the restored object: (root).optimizer._variables.7
WARNING:tensorflow:Value in checkpoint could not be found in the restored object: (root).optimizer._variables.8
WARNING:tensorflow:Value in checkpoint could not be found in the restored object: (root).optimizer._variables.8
WARNING:tensorflow:Value in checkpoint could not be found in the restored object: (root).optimizer._variables.9
WARNING:tensorflow:Value in checkpoint could not be found in the restored object: (root).optimizer._variables.9
WARNING:tensorflow:Value in checkpoint could not be found in the restored object: (root).optimizer._variables.10
WARNING:tensorflow:Value in checkpoint could not be found in the restored object: (root).optimizer._variables.10
WARNING:tensorflow:Value in checkpoint could not be found in the restored object: (root).optimizer._variables.11
WARNING:tensorflow:Value in checkpoint could not be found in the restored object: (root).optimizer._variables.11
WARNING:tensorflow:Value in checkpoint could not be found in the restored object: (root).optimizer._variables.12
WARNING:tensorflow:Value in checkpoint could not be found in the restored object: (root).optimizer._variables.12
2023-11-29 21:23:57.450991: W tensorflow/core/framework/op_kernel.cc:1828] OP_REQUIRES failed at lookup_table_op.cc:929 : FAILED_PRECONDITION: Table not initialized.
2023-11-29 21:23:57.451029: W tensorflow/core/framework/op_kernel.cc:1828] OP_REQUIRES failed at lookup_table_op.cc:929 : FAILED_PRECONDITION: Table not initialized.
2023-11-29 21:23:57.451059: W tensorflow/core/framework/op_kernel.cc:1828] OP_REQUIRES failed at lookup_table_op.cc:929 : FAILED_PRECONDITION: Table not initialized.
2023-11-29 21:23:57.451091: W tensorflow/core/framework/op_kernel.cc:1828] OP_REQUIRES failed at lookup_table_op.cc:929 : FAILED_PRECONDITION: Table not initialized.
2023-11-29 21:23:57.451123: W tensorflow/core/framework/op_kernel.cc:1828] OP_REQUIRES failed at lookup_table_op.cc:929 : FAILED_PRECONDITION: Table not initialized.
2023-11-29 21:23:57.451157: W tensorflow/core/framework/op_kernel.cc:1828] OP_REQUIRES failed at lookup_table_op.cc:929 : FAILED_PRECONDITION: Table not initialized.
2023-11-29 21:23:57.451185: W tensorflow/core/framework/op_kernel.cc:1828] OP_REQUIRES failed at lookup_table_op.cc:929 : FAILED_PRECONDITION: Table not initialized.
2023-11-29 21:23:57.451213: W tensorflow/core/framework/op_kernel.cc:1828] OP_REQUIRES failed at lookup_table_op.cc:929 : FAILED_PRECONDITION: Table not initialized.
2023-11-29 21:23:57.451250: W tensorflow/core/framework/op_kernel.cc:1828] OP_REQUIRES failed at lookup_table_op.cc:929 : FAILED_PRECONDITION: Table not initialized.
Traceback (most recent call last):
  File "cnn_search_by_chunk.py", line 50, in <module>
    print(clf.evaluate(validation_dataset))
  File "/home/my_user_name/.local/lib/python3.8/site-packages/autokeras/tasks/structured_data.py", line 187, in evaluate
    return super().evaluate(x=x, y=y, **kwargs)
  File "/home/my_user_name/.local/lib/python3.8/site-packages/autokeras/auto_model.py", line 492, in evaluate
    return utils.evaluate_with_adaptive_batch_size(
  File "/home/my_user_name/.local/lib/python3.8/site-packages/autokeras/utils/utils.py", line 68, in evaluate_with_adaptive_batch_size
    return run_with_adaptive_batch_size(
  File "/home/my_user_name/.local/lib/python3.8/site-packages/autokeras/utils/utils.py", line 101, in run_with_adaptive_batch_size
    history = func(x=x, validation_data=validation_data, **fit_kwargs)
  File "/home/my_user_name/.local/lib/python3.8/site-packages/autokeras/utils/utils.py", line 70, in <lambda>
    lambda x, validation_data, **kwargs: model.evaluate(
  File "/home/my_user_name/.local/lib/python3.8/site-packages/keras/src/utils/traceback_utils.py", line 70, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/home/my_user_name/.local/lib/python3.8/site-packages/tensorflow/python/eager/execute.py", line 53, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.FailedPreconditionError: Graph execution error:

Detected at node 'model/multi_category_encoding/string_lookup_15/None_Lookup/LookupTableFindV2' defined at (most recent call last):
    File "cnn_search_by_chunk.py", line 50, in <module>
      print(clf.evaluate(validation_dataset))
    File "/home/my_user_name/.local/lib/python3.8/site-packages/autokeras/tasks/structured_data.py", line 187, in evaluate
      return super().evaluate(x=x, y=y, **kwargs)
    File "/home/my_user_name/.local/lib/python3.8/site-packages/autokeras/auto_model.py", line 492, in evaluate
      return utils.evaluate_with_adaptive_batch_size(
    File "/home/my_user_name/.local/lib/python3.8/site-packages/autokeras/utils/utils.py", line 68, in evaluate_with_adaptive_batch_size
      return run_with_adaptive_batch_size(
    File "/home/my_user_name/.local/lib/python3.8/site-packages/autokeras/utils/utils.py", line 101, in run_with_adaptive_batch_size
      history = func(x=x, validation_data=validation_data, **fit_kwargs)
    File "/home/my_user_name/.local/lib/python3.8/site-packages/autokeras/utils/utils.py", line 70, in <lambda>
      lambda x, validation_data, **kwargs: model.evaluate(
    File "/home/my_user_name/.local/lib/python3.8/site-packages/keras/src/utils/traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "/home/my_user_name/.local/lib/python3.8/site-packages/keras/src/engine/training.py", line 2200, in evaluate
      logs = test_function_runner.run_step(
    File "/home/my_user_name/.local/lib/python3.8/site-packages/keras/src/engine/training.py", line 4000, in run_step
      tmp_logs = self._function(dataset_or_iterator)
    File "/home/my_user_name/.local/lib/python3.8/site-packages/keras/src/engine/training.py", line 1972, in test_function
      return step_function(self, iterator)
    File "/home/my_user_name/.local/lib/python3.8/site-packages/keras/src/engine/training.py", line 1956, in step_function
      outputs = model.distribute_strategy.run(run_step, args=(data,))
    File "/home/my_user_name/.local/lib/python3.8/site-packages/keras/src/engine/training.py", line 1944, in run_step
      outputs = model.test_step(data)
    File "/home/my_user_name/.local/lib/python3.8/site-packages/keras/src/engine/training.py", line 1850, in test_step
      y_pred = self(x, training=False)
    File "/home/my_user_name/.local/lib/python3.8/site-packages/keras/src/utils/traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "/home/my_user_name/.local/lib/python3.8/site-packages/keras/src/engine/training.py", line 569, in __call__
      return super().__call__(*args, **kwargs)
    File "/home/my_user_name/.local/lib/python3.8/site-packages/keras/src/utils/traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "/home/my_user_name/.local/lib/python3.8/site-packages/keras/src/engine/base_layer.py", line 1150, in __call__
      outputs = call_fn(inputs, *args, **kwargs)
    File "/home/my_user_name/.local/lib/python3.8/site-packages/keras/src/utils/traceback_utils.py", line 96, in error_handler
      return fn(*args, **kwargs)
    File "/home/my_user_name/.local/lib/python3.8/site-packages/keras/src/engine/functional.py", line 512, in call
      return self._run_internal_graph(inputs, training=training, mask=mask)
    File "/home/my_user_name/.local/lib/python3.8/site-packages/keras/src/engine/functional.py", line 669, in _run_internal_graph
      outputs = node.layer(*args, **kwargs)
    File "/home/my_user_name/.local/lib/python3.8/site-packages/keras/src/utils/traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "/home/my_user_name/.local/lib/python3.8/site-packages/keras/src/engine/base_layer.py", line 1150, in __call__
      outputs = call_fn(inputs, *args, **kwargs)
    File "/home/my_user_name/.local/lib/python3.8/site-packages/keras/src/utils/traceback_utils.py", line 96, in error_handler
      return fn(*args, **kwargs)
    File "/home/my_user_name/.local/lib/python3.8/site-packages/autokeras/keras_layers.py", line 91, in call
      for input_node, encoding_layer in zip(split_inputs, self.encoding_layers):
    File "/home/my_user_name/.local/lib/python3.8/site-packages/autokeras/keras_layers.py", line 92, in call
      if encoding_layer is None:
    File "/home/my_user_name/.local/lib/python3.8/site-packages/autokeras/keras_layers.py", line 100, in call
      output_nodes.append(
    File "/home/my_user_name/.local/lib/python3.8/site-packages/keras/src/utils/traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "/home/my_user_name/.local/lib/python3.8/site-packages/keras/src/engine/base_layer.py", line 1150, in __call__
      outputs = call_fn(inputs, *args, **kwargs)
    File "/home/my_user_name/.local/lib/python3.8/site-packages/keras/src/utils/traceback_utils.py", line 96, in error_handler
      return fn(*args, **kwargs)
    File "/home/my_user_name/.local/lib/python3.8/site-packages/keras/src/layers/preprocessing/index_lookup.py", line 756, in call
      lookups = self._lookup_dense(inputs)
    File "/home/my_user_name/.local/lib/python3.8/site-packages/keras/src/layers/preprocessing/index_lookup.py", line 792, in _lookup_dense
      lookups = self.lookup_table.lookup(inputs)
Node: 'model/multi_category_encoding/string_lookup_15/None_Lookup/LookupTableFindV2'
Table not initialized.
         [[{{node model/multi_category_encoding/string_lookup_15/None_Lookup/LookupTableFindV2}}]] [Op:__inference_test_function_5785123]
2023-11-29 21:23:57.618149: W tensorflow/core/kernels/data/generator_dataset_op.cc:108] Error occurred when finalizing GeneratorDataset iterator: FAILED_PRECONDITION: Python interpreter state is not initialized. The process may be terminated.
         [[{{node PyFunc}}]]
2023-11-29 21:23:57.618266: W tensorflow/core/kernels/data/generator_dataset_op.cc:108] Error occurred when finalizing GeneratorDataset iterator: FAILED_PRECONDITION: Python interpreter state is not initialized. The process may be terminated.
         [[{{node PyFunc}}]]
2023-11-29 21:23:57.618360: W tensorflow/core/kernels/data/generator_dataset_op.cc:108] Error occurred when finalizing GeneratorDataset iterator: FAILED_PRECONDITION: Python interpreter state is not initialized. The process may be terminated.
         [[{{node PyFunc}}]]
2023-11-29 21:23:57.618434: W tensorflow/core/kernels/data/generator_dataset_op.cc:108] Error occurred when finalizing GeneratorDataset iterator: FAILED_PRECONDITION: Python interpreter state is not initialized. The process may be terminated.
         [[{{node PyFunc}}]]
my_user_name@192:~/my_project_name_v2$
1 answer (1 vote)

Judging from the error log, the cause is either your data or your GPU setup and installation.
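A quick way to check the installation side is to print the package versions that are actually installed. In the question's traceback, the Keras frames live under `site-packages/keras/src/...`; as far as I know, that directory layout only appeared in later Keras 2.x releases, so it may be worth confirming that the installed Keras really is the v2.8.0 listed in the question. The sketch below assumes the packages were installed under the distribution names `tensorflow`, `keras` and `autokeras` (a `tensorflow-gpu` install, for example, would be reported as missing); `installed_versions` is a hypothetical helper name.

```python
from importlib import metadata

PACKAGES = ("tensorflow", "keras", "autokeras")


def installed_versions(packages=PACKAGES):
    """Return {distribution name: installed version string, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name:12s} {version or 'not installed'}")
```

Because this reads package metadata instead of importing TensorFlow, it runs quickly and does not trigger the GPU library loading at all.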

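  1. GPU library warning: Cannot dlopen some GPU libraries

    2023-11-29 20:05:55.467804: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1960] Cannot dlopen some GPU libraries.

This warning means TensorFlow failed to load some of the GPU libraries it needs; in your log, the line that follows it, `Skipping registering GPU devices...`, shows that it then ran on the CPU instead. Make sure the required GPU drivers and libraries are installed. For details on setting up GPU support, see the official TensorFlow GPU installation guide: TensorFlow GPU support guide (https://www.tensorflow.org/install/gpu).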
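To see which libraries the `Cannot dlopen` warning could be about, one rough check is to try loading them directly. The sketch below is only an approximation: the list of library base names in `CANDIDATE_LIBS` is illustrative, and `ctypes.util.find_library` may resolve a different version than the exact sonames TensorFlow asks for, so a successful load here does not prove that TensorFlow can use the GPU. `check_gpu_libs` is a hypothetical helper name.

```python
import ctypes
import ctypes.util

# Illustrative CUDA/cuDNN base names (assumption: not the exact list TensorFlow loads).
CANDIDATE_LIBS = ("cudart", "cublas", "cublasLt", "cufft", "curand",
                  "cusolver", "cusparse", "cudnn")


def check_gpu_libs(names=CANDIDATE_LIBS):
    """Return {name: resolved path, or a string describing why it is unusable}."""
    report = {}
    for name in names:
        path = ctypes.util.find_library(name)  # e.g. "libcudart.so.11.0", or None
        if path is None:
            report[name] = "not found"
            continue
        try:
            ctypes.CDLL(path)  # attempt the same kind of dlopen TensorFlow performs
            report[name] = path
        except OSError as exc:
            report[name] = f"found but failed to load: {exc}"
    return report


if __name__ == "__main__":
    for name, status in check_gpu_libs().items():
        print(f"{name:10s} {status}")
```

Inside TensorFlow itself, `tf.config.list_physical_devices("GPU")` returning an empty list confirms that no GPU was registered.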

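  2. LookupTableFindV2 error: Table not initialized

2023-11-29 21:23:57.451123: W tensorflow/core/framework/op_kernel.cc:1828] OP_REQUIRES failed at lookup_table_op.cc:929 : FAILED_PRECONDITION: Table not initialized.

There may be a problem with how the model is defined or how the data is processed. From the comments, I gather the model is defined correctly, but without access to your data we can only build a scaffold (as was done successfully with MNIST). A possible next step is to check the input data to make sure it is in the format the model expects. Note that the failing node, `model/multi_category_encoding/string_lookup_15/None_Lookup/LookupTableFindV2`, belongs to the layer AutoKeras uses to encode columns it treats as categorical, so check in particular whether any of your nine float feature columns could have been inferred as categorical.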
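Checking the input data can be partly automated. The hypothetical helpers below assume the layout from the question (9 feature columns followed by 3 one-hot label columns). `check_npy_folder` reports files with an unexpected shape, non-finite values or invalid one-hot rows, and `low_cardinality_features` lists feature columns with few distinct values. Since the failing node in the log sits under `multi_category_encoding/string_lookup_15`, low-cardinality numeric columns are plausible candidates for having been encoded as categorical; the `limit` of 20 is an arbitrary threshold, not the heuristic AutoKeras itself uses.

```python
import os

import numpy as np


def _npy_files(folder_path):
    return sorted(f for f in os.listdir(folder_path) if f.endswith(".npy"))


def check_npy_folder(folder_path, n_features=9, n_classes=3):
    """Return a list of human-readable problems found in the .npy files."""
    problems = []
    for name in _npy_files(folder_path):
        data = np.load(os.path.join(folder_path, name), mmap_mode="r")
        if data.ndim != 2 or data.shape[1] != n_features + n_classes:
            problems.append(f"{name}: unexpected shape {data.shape}")
            continue
        if not np.isfinite(data).all():
            problems.append(f"{name}: contains NaN or inf")
        labels = np.asarray(data[:, n_features:])
        one_hot = np.isin(labels, (0.0, 1.0)).all() and (labels.sum(axis=1) == 1).all()
        if not one_hot:
            problems.append(f"{name}: label columns are not valid one-hot rows")
    return problems


def low_cardinality_features(folder_path, n_features=9, limit=20):
    """Indices of feature columns with at most `limit` distinct values across all files."""
    seen = [set() for _ in range(n_features)]
    for name in _npy_files(folder_path):
        data = np.load(os.path.join(folder_path, name), mmap_mode="r")
        for col in range(n_features):
            if len(seen[col]) <= limit:  # stop collecting once a column exceeds the limit
                seen[col].update(np.unique(data[:, col]).tolist())
    return [col for col, values in enumerate(seen) if len(values) <= limit]
```

Running both on `train_data_npy` and `valid_data_npy` separately also shows whether the two folders differ in a way that only surfaces at evaluation time.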

© www.soinside.com 2019 - 2024. All rights reserved.