Custom batching with tf.data.Dataset

Question

I am using TensorFlow's Estimator API and want to create custom batches for training.

My examples look like this:

example1 = {
   "num_sentences": 3,
   "sentences": [[1, 2], [3, 4], [5, 6]] 
}
example2 = {
   "num_sentences": 2,
   "sentences": [[1, 2], [3, 4]] 
}

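So an example can contain any number of fixed-size sentences. I want to build batches whose size depends on the number of sentences in the batch. Otherwise I would have to use a batch size of 1, because a single example may already contain "batch size" sentences, and anything larger would not fit into GPU memory.

For example: with a batch size of 6 and examples whose sentence counts are [5, 3, 3, 2, 2, 1], I would group the examples into the batches [5], [3, 3] and [2, 2, 1]. Note that the example "1" in the last batch will be padded.

I wrote an algorithm that groups the examples into such batches. What I cannot do is feed those batches to tf.data.Dataset.

I have tried tf.data.Dataset.from_generator, but this method seems to expect individual examples. If the generator yields a batch such as [example1, example2], I get an error.

How can I feed a dataset with custom batches? Is there a more elegant way to solve my problem?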
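The grouping described above can be sketched roughly as follows. This is not the asker's algorithm (which is not shown); it is a minimal greedy version that assumes the cost of a batch is the number of examples times the largest sentence count in it, since shorter examples are padded. The function name `group_by_sentence_budget` and the parameter `budget` are made up for illustration.

```python
def group_by_sentence_budget(sentence_counts, budget):
    """Greedily group sentence counts so that len(batch) * max(batch) <= budget.

    Hypothetical helper: the padded cost of a batch is assumed to be the
    number of examples times the largest sentence count in the batch.
    """
    batches = []
    current = []
    # Descending order means the first element of a batch is its maximum.
    for count in sorted(sentence_counts, reverse=True):
        if count > budget:
            raise ValueError("example with %d sentences exceeds budget %d" % (count, budget))
        # Close the current batch if adding one more example would exceed the budget.
        if current and (len(current) + 1) * current[0] > budget:
            batches.append(current)
            current = []
        current.append(count)
    if current:
        batches.append(current)
    return batches


print(group_by_sentence_budget([5, 3, 3, 2, 2, 1], budget=6))
# prints [[5], [3, 3], [2, 2, 1]]
```

In real use the function would carry the examples themselves rather than only their counts; the counts alone are enough to show the grouping rule.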

Update: I assume I am not specifying the output_shapes parameter correctly. The following code works fine.

import tensorflow as tf

def gen():
    for b in range(3):
        #yield [{"num_sentences": 3, "sentences": [[1, 2], [3, 4], [5, 6]]}]
        yield {"num_sentences": 3, "sentences": [[1, 2], [3, 4], [5, 6]]}


dataset = tf.data.Dataset.from_generator(generator=gen, 
                                         output_types={'num_sentences': tf.int32, 'sentences': tf.int32},
                                         #output_shapes=tf.TensorShape([None,  {'num_sentences': tf.TensorShape(None), 'sentences': tf.TensorShape(None)}])
                                         output_shapes={'num_sentences': tf.TensorShape(None), 'sentences': tf.TensorShape(None)}
                                        )

def print_dataset(dataset):
    it = dataset.make_one_shot_iterator()
    # Build the get_next op once; calling it inside the loop would add
    # a new op to the graph on every iteration.
    data = it.get_next()
    with tf.Session() as sess:
        print(dataset.output_shapes)
        print(dataset.output_types)
        while True:
            try:
                print("data" + str(sess.run(data)))
            except tf.errors.OutOfRangeError:
                break

print_dataset(dataset)

If I yield a list instead and uncomment the other output_shapes line, I get the error "int() argument must be a string, a bytes-like object or a number, not 'dict'".

Answer 1

I think I have figured out how to solve the problem above: I have to merge the examples of a batch into a single dictionary.

# a batch of two examples, each with 3 sentences
yield {"num_sentences": [3, 3], "sentences": [[[1, 2], [3, 4], [5, 6]], [[7, 8], [9, 10], [11, 12]]]}
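Building on that idea, here is a minimal sketch of how a list of examples might be merged into one dictionary, with padding, and then handed to from_generator. The helper names (`merge_examples`, `batch_generator`, `make_dataset`), the zero padding value and the fixed sentence size of 2 are assumptions for illustration. The key point is that output_shapes has to mirror the structure the generator yields: a dictionary of shapes, each with a leading batch dimension, rather than a TensorShape wrapped around a dictionary.

```python
def merge_examples(examples, sentence_size=2, pad_value=0):
    """Merge a list of example dicts into one dict, padding short examples.

    Hypothetical helper: pads with all-`pad_value` sentences up to the
    largest sentence count in the batch.
    """
    max_sentences = max(ex["num_sentences"] for ex in examples)
    merged = {"num_sentences": [], "sentences": []}
    for ex in examples:
        missing = max_sentences - ex["num_sentences"]
        padding = [[pad_value] * sentence_size for _ in range(missing)]
        merged["num_sentences"].append(ex["num_sentences"])
        merged["sentences"].append([list(s) for s in ex["sentences"]] + padding)
    return merged


def batch_generator(batches, sentence_size=2):
    """Yield one merged dict per pre-grouped batch (a list of example dicts)."""
    for examples in batches:
        yield merge_examples(examples, sentence_size=sentence_size)


def make_dataset(batches, sentence_size=2):
    """Wrap the generator in a tf.data.Dataset of variable-sized batches."""
    # Imported here so the helpers above stay usable without TensorFlow.
    import tensorflow as tf

    return tf.data.Dataset.from_generator(
        generator=lambda: batch_generator(batches, sentence_size),
        output_types={"num_sentences": tf.int32, "sentences": tf.int32},
        output_shapes={
            # [batch]
            "num_sentences": tf.TensorShape([None]),
            # [batch, max sentences in batch, sentence size]
            "sentences": tf.TensorShape([None, None, sentence_size]),
        },
    )
```

Because every yielded element is already a full batch, the resulting dataset should not be passed through dataset.batch() again. The true sentence counts stay available in "num_sentences" so the model can mask out the padded rows.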
