Mapping text data through huggingface tokenizer

Question

My encode function looks like this:

import pandas as pd
import tensorflow as tf
from transformers import BertTokenizer, BertModel

MODEL = 'bert-base-multilingual-uncased'
tokenizer = BertTokenizer.from_pretrained(MODEL)

def encode(texts, tokenizer=tokenizer, maxlen=10):
#     import pdb; pdb.set_trace()
    inputs = tokenizer.encode_plus(
        texts,
        return_tensors='tf',
        return_attention_masks=True, 
        return_token_type_ids=True,
        pad_to_max_length=True,
        max_length=maxlen
    )

    return inputs['input_ids'], inputs["token_type_ids"], inputs["attention_mask"]

I would like to encode my data on the fly like this:

x_train = (tf.data.Dataset.from_tensor_slices(df_train.comment_text.astype(str).values)
           .map(encode))

However, this raises an error:

ValueError: Input is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers.

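From what I saw when I set a breakpoint inside encode, this happens because I am passing in a non-integer array. How do I get huggingface transformers to play nicely with tensorflow strings as inputs?

If you need a dummy dataframe, here it is: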

df_train = pd.DataFrame({'comment_text': ['Today was a good day']*5})

My attempt

So I tried using from_generator so that I can pass the strings to the encode_plus function. However, this does not work with TPUs.

AUTO = tf.data.experimental.AUTOTUNE

def get_gen(df):
    def gen():
        for i in range(len(df)):
            yield encode(df.loc[i, 'comment_text']) , df.loc[i, 'toxic']
    return gen

shapes = ((tf.TensorShape([maxlen]), tf.TensorShape([maxlen]), tf.TensorShape([maxlen])), tf.TensorShape([]))

train_dataset = tf.data.Dataset.from_generator(
    get_gen(df_train),
    ((tf.int32, tf.int32, tf.int32), tf.int32),
    shapes
)
train_dataset = train_dataset.batch(BATCH_SIZE).prefetch(AUTO)

Version info:

transformers.__version__, tf.__version__ => ('2.7.0', '2.1.0')

Answer 1

When you create the tensorflow dataset with tf.data.Dataset.from_tensor_slices(df_train.comment_text.astype(str).values), tensorflow converts your strings into tensors of string type, which tokenizer.encode_plus does not accept as input. As the error message says, it only accepts a string, a list/tuple of strings or a list/tuple of integers. You can verify this by adding print(type(texts)) inside your encode function (output: <class 'tensorflow.python.framework.ops.Tensor'>).

I'm not sure what your follow-up plan is or why you need a tf.data.Dataset, but you have to encode your input before you turn it into a tf.data.Dataset:
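To make the type issue concrete, here is a minimal sketch that needs only tensorflow. The inspect function and the observed list are illustrative names, not part of the original code. It records what Dataset.map actually hands to the mapped function:

```python
import tensorflow as tf

observed = []  # filled while tf.data traces the mapped function

def inspect(text):
    # Dataset.map traces this function in graph mode, so `text` is a
    # symbolic tf.string tensor rather than a Python str.
    observed.append((isinstance(text, str), isinstance(text, tf.Tensor), text.dtype))
    return text

dataset = tf.data.Dataset.from_tensor_slices(["Today was a good day"] * 5).map(inspect)

is_str, is_tensor, dtype = observed[0]
print(is_str, is_tensor, dtype)
```

Since the argument is never a Python string, a tokenizer that checks for str, list or tuple inputs rejects it.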

import tensorflow as tf
from transformers import BertTokenizer, BertModel

MODEL = 'bert-base-multilingual-uncased'
tokenizer = BertTokenizer.from_pretrained(MODEL)

texts = ['Today was a good day', 'Today was a bad day',
       'Today was a rainy day', 'Today was a sunny day',
       'Today was a cloudy day']


#inputs['input_ids'], inputs["token_type_ids"], inputs["attention_mask"]
inputs = tokenizer.batch_encode_plus(
        texts,
        return_tensors='tf',
        return_attention_masks=True, 
        return_token_type_ids=True,
        pad_to_max_length=True,
        max_length=10
    )

dataset = tf.data.Dataset.from_tensor_slices((inputs['input_ids'],
                                              inputs['attention_mask'],
                                              inputs['token_type_ids']))
print(type(dataset))

Output:

<class 'tensorflow.python.data.ops.dataset_ops.TensorSliceDataset'>
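If labels are needed as well (the question's generator yields a 'toxic' column), the same "encode first, then build the dataset" pattern might look roughly like the sketch below. To keep it runnable without downloading a model, toy_batch_encode is a hypothetical stand-in for tokenizer.batch_encode_plus that uses word lengths as ids; the labels are made up for illustration.

```python
import numpy as np
import tensorflow as tf

MAXLEN = 10

def toy_batch_encode(texts, maxlen=MAXLEN):
    # Hypothetical stand-in for tokenizer.batch_encode_plus:
    # each "token id" is just the length of the word.
    ids = np.zeros((len(texts), maxlen), dtype=np.int32)
    mask = np.zeros((len(texts), maxlen), dtype=np.int32)
    for row, text in enumerate(texts):
        tokens = [len(word) for word in text.split()][:maxlen]
        ids[row, :len(tokens)] = tokens   # pad the rest with zeros
        mask[row, :len(tokens)] = 1       # 1 marks real tokens
    return ids, mask

texts = ["Today was a good day", "Today was a bad day"]
labels = np.array([0, 1], dtype=np.int32)  # made-up labels

# Encode in plain Python first, then slice the resulting arrays.
ids, mask = toy_batch_encode(texts)
dataset = tf.data.Dataset.from_tensor_slices(((ids, mask), labels)).batch(2)

(batch_ids, batch_mask), batch_labels = next(iter(dataset))
print(batch_ids.shape, batch_labels.numpy())
```

Because the encoding happens before the dataset is built, no Python code runs inside the input pipeline, which may matter given that the question reports from_generator not working on TPUs.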

Answer 2

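The BERT tokenizer works on a string, a list/tuple of strings or a list/tuple of integers, so check whether your data is actually being converted to a string. To apply the tokenizer to the whole dataset I used Dataset.map, but that runs in graph mode, so I needed to wrap the call in a tf.py_function. tf.py_function passes regular tensors (with a value and a .numpy() method to access it) to the wrapped Python function. My data came out as bytes after going through py_function, so I applied tf.compat.as_str to convert the bytes to strings.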

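import tensorflow as tf
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

def encode(lang1, lang2):
    # Inside tf.py_function the tensors are eager, so .numpy() returns bytes
    lang1 = tokenizer.encode(tf.compat.as_str(lang1.numpy()), add_special_tokens=True)
    lang2 = tokenizer.encode(tf.compat.as_str(lang2.numpy()), add_special_tokens=True)
    return lang1, lang2

def tf_encode(pt, en):
    # Wrap the Python function so it can be called from Dataset.map (graph mode)
    result_pt, result_en = tf.py_function(func=encode, inp=[pt, en], Tout=[tf.int64, tf.int64])
    # py_function drops static shape information, so restore the rank
    result_pt.set_shape([None])
    result_en.set_shape([None])
    return result_pt, result_en

# dataset3: a tf.data.Dataset of (pt, en) string pairs, defined elsewhere (not shown)
train_dataset = dataset3.map(tf_encode)

BUFFER_SIZE = 200
BATCH_SIZE = 64

train_dataset = train_dataset.shuffle(BUFFER_SIZE).padded_batch(BATCH_SIZE,
                                                           padded_shapes=(60, 60))
a, p = next(iter(train_dataset))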
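For the single text column in the question, the same py_function wrapping could be sketched as follows. This is only an illustration: toy_encode is a hypothetical stand-in for tokenizer.encode (it returns word lengths as ids) so the snippet runs without downloading a model, and py_encode and tf_encode are illustrative names.

```python
import tensorflow as tf

def toy_encode(text):
    # Hypothetical stand-in for tokenizer.encode: one id per word (its length).
    return [len(word) for word in text.split()]

def py_encode(text_tensor):
    # Inside tf.py_function the tensor is eager: .numpy() yields bytes,
    # which tf.compat.as_str turns into a Python str.
    text = tf.compat.as_str(text_tensor.numpy())
    return tf.constant(toy_encode(text), dtype=tf.int64)

def tf_encode(text_tensor):
    # Bridge from graph mode (Dataset.map) to plain Python.
    ids = tf.py_function(func=py_encode, inp=[text_tensor], Tout=tf.int64)
    ids.set_shape([None])  # restore the rank lost by py_function
    return ids

texts = ["Today was a good day", "Rain again"]
dataset = tf.data.Dataset.from_tensor_slices(texts).map(tf_encode)

# Pad each batch to the length of its longest sequence.
batch = next(iter(dataset.padded_batch(2, padded_shapes=[None])))
print(batch.numpy())
```

Note that tf.py_function executes Python on the host, so this approach likely shares the TPU limitation the question describes for from_generator.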
