How do I fit a tf.Dataset to a Keras autoencoder model when the dataset has been generated using TFX?


Problem

As the title suggests, I have been trying to create a pipeline that trains an autoencoder model using TFX. The problem I am running into is fitting the tf.Dataset returned by DataAccessor.tf_dataset_factory to the autoencoder.

Below I have summarised the steps I have taken on this project, with some questions at the bottom if you would like to skip the background information.

Background

The TFX pipeline

The TFX components I have used so far are:

  • CsvExampleGenerator (the dataset has 82 columns, all numeric; the sample csv has 739 rows)
  • StatisticsGenerator / SchemaGenerator; the schema has been edited and is now loaded with an Importer (see the sketch after this list)
  • Transform
  • Trainer (this is the component I am currently having problems with)
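
For reference, the schema Importer and the Transform component are wired up roughly like this (a sketch; the schema path and module file name are placeholders, not from my actual pipeline):

schema_importer = tfx.dsl.Importer(
    source_uri = 'path/to/curated_schema',    # placeholder: directory holding the edited schema
    artifact_type = tfx.types.standard_artifacts.Schema
).with_id('schema_importer')

transform = tfx.components.Transform(
    examples = example_gen.outputs['examples'],
    schema = schema_importer.outputs['result'],
    module_file = 'preprocessing.py'          # placeholder module file
)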

The model

The model I am attempting to train is based on the example listed here: https://www.tensorflow.org/tutorials/generative/autoencoder. However, my model is being trained on tabular data, searching for anomalous results, rather than on image data.

As I have attempted a couple of solutions, I have tried defining the model in both the Keras.layers and Keras.Model formats; I have outlined both below:

Subclassing Keras.Model
# Imports used throughout the snippets in this post
from typing import Dict, List

import tensorflow as tf
import tensorflow_transform as tft
from tensorflow import keras
from tfx import v1 as tfx
from tfx_bsl.public import tfxio


class Autoencoder(keras.models.Model):
    def __init__(self, features):
        super(Autoencoder, self).__init__()
        # keep the feature names so call() can reference them
        self.features = features

        self.encoder = tf.keras.Sequential([
            keras.layers.Dense(82, activation = 'relu'),
            keras.layers.Dense(32, activation = 'relu'),
            keras.layers.Dense(16, activation = 'relu'),
            keras.layers.Dense(8, activation = 'relu')
        ])
        
        self.decoder = tf.keras.Sequential([
            keras.layers.Dense(16, activation = 'relu'),
            keras.layers.Dense(32, activation = 'relu'),
            keras.layers.Dense(len(features), activation = 'sigmoid')
        ])

    def call(self, x):
        inputs = [keras.layers.Input(shape = (1,), name = f) for f in self.features]
        dense = keras.layers.concatenate(inputs)
        
        encoded = self.encoder(dense)
        decoded = self.decoder(encoded)
    
        return decoded
Using Keras.layers (functional API)
def _build_keras_model(features: List[str]) -> tf.keras.Model:
    inputs = [keras.layers.Input(shape = (1,), name = f) for f in features]
    dense = keras.layers.concatenate(inputs)

    dense = keras.layers.Dense(32, activation = 'relu')(dense)
    dense = keras.layers.Dense(16, activation = 'relu')(dense)
    dense = keras.layers.Dense(8, activation = 'relu')(dense)
    dense = keras.layers.Dense(16, activation = 'relu')(dense)
    dense = keras.layers.Dense(32, activation = 'relu')(dense)
    outputs = keras.layers.Dense(len(features), activation = 'sigmoid')(dense)
    
    model = keras.Model(inputs = inputs, outputs = outputs)
    model.compile(
        optimizer = 'adam',
        loss = 'mae'
    )

    return model
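
As a quick sanity check (a snippet of my own, with placeholder feature names), the functional model builds cleanly:

features = [f'feature_{i}' for i in range(82)]   # placeholder names
model = _build_keras_model(features = features)
model.summary()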

The TFX Trainer component

To create the Trainer component I mostly followed the implementation details listed here: https://www.tensorflow.org/tfx/guide/trainer

along with the default penguin example: https://www.tensorflow.org/tfx/tutorials/tfx/penguin_simple#write_model_training_code

run_fn definition
def run_fn(fn_args: tfx.components.FnArgs) -> None:
    tft_output = tft.TFTransformOutput(fn_args.transform_output)
    
    train_dataset = _input_fn(
        file_pattern = fn_args.train_files,
        data_accessor = fn_args.data_accessor,
        tf_transform_output = tft_output,
        batch_size = fn_args.train_steps
    )

    eval_dataset = _input_fn(
        file_pattern = fn_args.eval_files,
        data_accessor = fn_args.data_accessor,
        tf_transform_output = tft_output,
        batch_size = fn_args.custom_config['eval_batch_size']
    )

#   model = Autoencoder(
#       features = fn_args.custom_config['features']
#   )
    model = _build_keras_model(features = fn_args.custom_config['features'])
        
    model.compile(optimizer = 'adam', loss = 'mse')
    
    model.fit(
        train_dataset,
        steps_per_epoch = fn_args.train_steps,
        validation_data = eval_dataset,
        validation_steps = fn_args.eval_steps
    )
    
    ...
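
For context, the Trainer itself is constructed roughly like this (a sketch; the module file name and step counts are placeholders); custom_config is how 'features' and 'eval_batch_size' reach run_fn:

trainer = tfx.components.Trainer(
    module_file = 'trainer.py',               # placeholder: the file containing run_fn
    examples = transform.outputs['transformed_examples'],
    transform_graph = transform.outputs['transform_graph'],
    train_args = tfx.proto.TrainArgs(num_steps = 100),
    eval_args = tfx.proto.EvalArgs(num_steps = 5),
    custom_config = {
        'features': features,                 # the list of 82 feature names
        'eval_batch_size': 32
    }
)
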
_input_fn definition
def _apply_preprocessing(raw_features, tft_layer):
    transformed_features = tft_layer(raw_features)
    return transformed_features

def _input_fn(
    file_pattern,
    data_accessor: tfx.components.DataAccessor,
    tf_transform_output: tft.TFTransformOutput,
    batch_size: int) -> tf.data.Dataset:
    """
    Generates features and label for tuning/training.
      Args:
        file_pattern: List of paths or patterns of input tfrecord files.
        data_accessor: DataAccessor for converting input to RecordBatch.
        tf_transform_output: A TFTransformOutput.
        batch_size: representing the number of consecutive elements of returned
          dataset to combine in a single batch
      Returns:
        A dataset that contains features where features is a
          dictionary of Tensors.
    """
    dataset = data_accessor.tf_dataset_factory(
        file_pattern,
        tfxio.TensorFlowDatasetOptions(batch_size = batch_size),
        tf_transform_output.transformed_metadata.schema
     )
    
    transform_layer = tf_transform_output.transform_features_layer()
    def apply_transform(raw_features):
        return _apply_preprocessing(raw_features, transform_layer)
    
    return dataset.map(apply_transform).repeat()
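
A debugging line I found useful (my own addition, placed just before the return in _input_fn) is to take one batch and print the shapes; each element is a dict mapping feature name to a (batch_size, 1) tensor:

    for features in dataset.map(apply_transform).take(1):
        print({name: tensor.shape for name, tensor in features.items()})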

This differs from the _input_fn example given above, as I followed the example in the next tfx tutorial, found here: https://www.tensorflow.org/tfx/tutorials/tfx/penguin_tft#run_fn

Also for reference, the sample data has no target, so there is no label_key that can be passed to the tfxio.TensorFlowDatasetOptions object.
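
If there were a target column, the factory could split it out by itself; a hypothetical example, assuming a column named 'target':

options = tfxio.TensorFlowDatasetOptions(batch_size = batch_size, label_key = 'target')
# the dataset would then yield (features_dict, label_tensor) tuples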

The error

When attempting to run the Trainer component using the TFX InteractiveContext, I receive the following error:

ValueError: No gradients provided for any variable: ['dense_460/kernel:0', 'dense_460/bias:0', 'dense_461/kernel:0', 'dense_461/bias:0', 'dense_462/kernel:0', 'dense_462/bias:0', 'dense_463/kernel:0', 'dense_463/bias:0', 'dense_464/kernel:0', 'dense_464/bias:0', 'dense_465/kernel:0', 'dense_465/bias:0'].

From my own attempts to solve this, I believe the problem lies in how the autoencoder is trained. In the autoencoder example linked above (https://www.tensorflow.org/tutorials/generative/autoencoder), the data is fitted like so:

autoencoder.fit(x_train, x_train,
                epochs=10,
                shuffle=True,
                validation_data=(x_test, x_test))
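
In tf.data terms, pairing the data with itself would look something like this (my own sketch):

dataset = tf.data.Dataset.from_tensor_slices((x_train, x_train)).batch(32)
autoencoder.fit(dataset, epochs = 10)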

So it stands to reason that the tf.Dataset should mimic this behaviour. When testing with plain tensor objects, I was able to recreate the error above and then resolve it by adding a target, identical to the training data, to the .fit() function.

Things I have tried so far

Duplicating the train dataset
    model.fit(
        train_dataset,
        train_dataset,
        steps_per_epoch = fn_args.train_steps,
        validation_data = eval_dataset,
        validation_steps = fn_args.eval_steps
    )

This raises an error, as Keras does not accept a 'y' argument when a dataset is passed as input:

ValueError: `y` argument is not supported when using dataset as input.
Returning a dataset containing a tuple of itself
def _input_fn(...


    dataset = data_accessor.tf_dataset_factory(
        file_pattern,
        tfxio.TensorFlowDatasetOptions(batch_size = batch_size),
        tf_transform_output.transformed_metadata.schema
     )
    
    transform_layer = tf_transform_output.transform_features_layer()
    def apply_transform(raw_features):
        return _apply_preprocessing(raw_features, transform_layer)
    
    dataset = dataset.map(apply_transform)
    
    return dataset.map(lambda x: (x, x))

This raises an error where the keys in the features dictionary do not match the outputs of the model:

ValueError: Found unexpected keys that do not correspond to any Model output: dict_keys(['feature_string', ...]). Expected: ['dense_477']

At this point I switched to using the keras.Model Autoencoder subclass, and tried to add output keys to the model, creating the outputs dynamically in the same way as the inputs:

    def call(self, x):
        inputs = [keras.layers.Input(shape = (1,), name = f) for f in x]
        dense = keras.layers.concatenate(inputs)
        
        encoded = self.encoder(dense)
        decoded = self.decoder(encoded)
    
        outputs = {}
        for feature_name in x:
            outputs[feature_name] = keras.layers.Dense(1, activation = 'sigmoid')(decoded)

        return outputs

This raises the following error:

TypeError: Cannot convert a symbolic Keras input/output to a numpy array. This error may indicate that you're trying to pass a symbolic value to a NumPy call, which is not supported. Or, you may be trying to pass Keras symbolic inputs/outputs to a TF API that does not register dispatching, preventing Keras from automatically converting the API call to a lambda layer in the Functional Model.
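
In hindsight this error makes sense: keras.layers.Input creates symbolic functional-API tensors, which cannot be instantiated inside a subclassed model's call() while it is running on concrete batches. A sketch (my own, assuming the dataset yields a dict of (batch, 1) tensors keyed by feature name) of a call() that works on the incoming feature dict directly:

    def call(self, x):
        # x is the features dict yielded by the dataset
        dense = tf.concat([x[f] for f in self.features], axis = 1)
        encoded = self.encoder(dense)
        decoded = self.decoder(encoded)
        return decoded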

I have been going round in circles on this; I am no longer sure the data is being passed through correctly, and I am starting to think I have strayed from the actual problem.

Questions

  • Has anyone managed to get an autoencoder working when hooked up via TFX Examples?
  • Did you alter the tf.Dataset, or handle the Examples in a different way from the _input_fn demonstrations?
tensorflow keras deep-learning tfx
1 Answer

So I managed to find the answer to this, and I wanted to leave what I found here in case anyone else stumbles across a similar problem.

It turns out my instinct about the error was right: the solution did indeed lie in how the tf.Dataset object is presented.

This can be demonstrated with some code that simulates the incoming data using randomly generated tensors:

tensors = [tf.random.uniform(shape = (1, 82)) for i in range(739)]
# This gives us a list of 739 tensors which hold 1 value for 82 'features' simulating the dataset I had

dataset = tf.data.Dataset.from_tensor_slices(tensors)
dataset = dataset.map(lambda x : (x, x))
# This returns a dataset which marks the training set and target as the same
# which is what the Autoencoder model is looking for

model.fit(dataset ...) 

Next, I went on to do the same with the dataset returned by _input_fn. Since the tfx DataAccessor object returns a features_dict, I needed to concatenate the tensors in that dictionary to create a single target tensor.

This is what my _input_fn looks like now:

def create_target_values(features_dict: Dict[str, tf.Tensor]) -> tuple:
    value_tensor = tf.concat(list(features_dict.values()), axis = 1)
    return (features_dict, value_tensor)
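
As a toy illustration (my own numbers) of what create_target_values produces for a batch of two rows and two features:

features_dict = {
    'a': tf.constant([[1.], [2.]]),
    'b': tf.constant([[3.], [4.]])
}
x, y = create_target_values(features_dict)
print(y)   # [[1. 3.] [2. 4.]], shape (2, 2)

Note that the concatenation follows the dict's iteration order, so it should line up with the order in which the model's inputs are concatenated.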

def _input_fn(
    file_pattern,
    data_accessor: tfx.components.DataAccessor,
    tf_transform_output: tft.TFTransformOutput,
    batch_size: int) -> tf.data.Dataset:
    """
    Generates features and label for tuning/training.
      Args:
        file_pattern: List of paths or patterns of input tfrecord files.
        data_accessor: DataAccessor for converting input to RecordBatch.
        tf_transform_output: A TFTransformOutput.
        batch_size: representing the number of consecutive elements of returned
          dataset to combine in a single batch
      Returns:
        A dataset that contains (features, target_tensor) tuple where features is a
          dictionary of Tensors, and target_tensor is a single Tensor that is a concatenated tensor of all the
          feature values.
    """
    dataset = data_accessor.tf_dataset_factory(
        file_pattern,
        tfxio.TensorFlowDatasetOptions(batch_size = batch_size),
        tf_transform_output.transformed_metadata.schema
    )
    
    dataset = dataset.map(lambda x: create_target_values(features_dict = x))
    return dataset.repeat()
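
With this in place, model.fit(train_dataset, ...) works unchanged: each element is now a (features_dict, target) pair, so Keras routes the dict to the named inputs and compares the model's single output against the concatenated target. A quick way to confirm (my own check; the feature names shown are illustrative):

print(train_dataset.element_spec)
# roughly: ({'feature_a': TensorSpec(shape=(None, 1), ...), ...},
#           TensorSpec(shape=(None, 82), dtype=tf.float32, name=None))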