I have a .jsonl dataset that I'm trying to convert into a TensorFlow dataset. Each line of the .jsonl file has the form
{"text": "some text", "meta": "irrelevant"}
I need to load it into a TensorFlow dataset in which each element has a key "text" associated with a tf.string value. The closest I've gotten is the following:
import tensorflow as tf
ds = tf.data.TextLineDataset('train_mini.jsonl')
def f(tnsr):
    text = eval(tnsr.numpy())['text']
    return tf.constant(text)
    #return {'text':text}
ds = ds.map(lambda x: tf.py_function(func=f,inp=[x], Tout=tf.string))
ds = tf.data.Dataset({"text": list(ds.as_numpy_iterator())})
which throws the following error:
InvalidArgumentError: ValueError: Error converting unicode string while converting Python sequence to Tensor.
Traceback (most recent call last):
  File "/home/crytting/.local/lib/python3.6/site-packages/tensorflow/python/ops/script_ops.py", line 241, in __call__
    return func(device, token, args)
  File "/home/crytting/.local/lib/python3.6/site-packages/tensorflow/python/ops/script_ops.py", line 130, in __call__
    ret = self._func(*args)
  File "/home/crytting/.local/lib/python3.6/site-packages/tensorflow/python/autograph/impl/api.py", line 309, in wrapper
    return func(*args, **kwargs)
  File "/home/crytting/persuasion/json_to_tfds.py", line 7, in f
    return tf.constant(text)
  File "/home/crytting/.local/lib/python3.6/site-packages/tensorflow/python/framework/constant_op.py", line 262, in constant
    allow_broadcast=True)
  File "/home/crytting/.local/lib/python3.6/site-packages/tensorflow/python/framework/constant_op.py", line 270, in _constant_impl
    t = convert_to_eager_tensor(value, ctx, dtype)
  File "/home/crytting/.local/lib/python3.6/site-packages/tensorflow/python/framework/constant_op.py", line 96, in convert_to_eager_tensor
    return ops.EagerTensor(value, ctx.device_name, dtype)
ValueError: Error converting unicode string while converting Python sequence to Tensor.
  [[{{node EagerPyFunc}}]]
I've tried many ways of doing this and nothing works. It seems like it shouldn't be this hard, so I wonder if I'm missing something very simple.
Nowadays (02/2024) you can use pandas together with TensorFlow to make this easy:
import tensorflow as tf
import pandas as pd
# assume `data` is a list of record dicts coming from your .jsonl file,
# e.g. obtained by calling json.loads() on each line
df = pd.DataFrame.from_records(data)
dataset = tf.data.Dataset.from_tensor_slices(df.to_dict(orient='list'))
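Putting the pieces together, here is a minimal end-to-end sketch; the file name `sample.jsonl` and its two sample records are made up for illustration, standing in for your train_mini.jsonl:

```python
import json

import pandas as pd
import tensorflow as tf

# Write a tiny stand-in .jsonl file (hypothetical sample data)
with open("sample.jsonl", "w") as fh:
    fh.write('{"text": "hello", "meta": "irrelevant"}\n')
    fh.write('{"text": "world", "meta": "irrelevant"}\n')

# A .jsonl file is one JSON object per line, so parse each line
# with json.loads rather than calling json.load on the whole file
with open("sample.jsonl") as fh:
    records = [json.loads(line) for line in fh]

df = pd.DataFrame.from_records(records)

# from_tensor_slices on a dict of columns yields one {"text": tf.string}
# element per row; selecting only the "text" column drops "meta"
dataset = tf.data.Dataset.from_tensor_slices(df[["text"]].to_dict(orient="list"))

for elem in dataset:
    print(elem["text"].numpy())  # b'hello' then b'world'
```

Keeping only the "text" column before building the dataset avoids carrying the irrelevant "meta" field into every element.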