I am working with the question answering dataset UCLNLP/adversarial_qa.
from datasets import load_dataset
ds = load_dataset("UCLNLP/adversarial_qa", "adversarialQA")
After tokenizing the context and the question together with a tokenizer such as BERT's, how do I map the character-based answer indices to token-based indices? Here is an example row from my dataset:
d0 = ds['train'][0]
d0
{'id': '7ba1e8f4261d3170fcf42e84a81dd749116fae95',
'title': 'Brain',
'context': 'Another approach to brain function is to examine the consequences of damage to specific brain areas. Even though it is protected by the skull and meninges, surrounded by cerebrospinal fluid, and isolated from the bloodstream by the blood–brain barrier, the delicate nature of the brain makes it vulnerable to numerous diseases and several types of damage. In humans, the effects of strokes and other types of brain damage have been a key source of information about brain function. Because there is no ability to experimentally control the nature of the damage, however, this information is often difficult to interpret. In animal studies, most commonly involving rats, it is possible to use electrodes or locally injected chemicals to produce precise patterns of damage and then examine the consequences for behavior.',
'question': 'What sare the benifts of the blood brain barrir?',
'answers': {'text': ['isolated from the bloodstream'], 'answer_start': [195]},
'metadata': {'split': 'train', 'model_in_the_loop': 'Combined'}}
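For reference, answer_start is a character offset into the context string, so the answer span can be verified directly (a quick sanity check, not part of the course material):
ans = d0['answers']
char_start = ans['answer_start'][0]
char_end = char_start + len(ans['text'][0])
d0['context'][char_start:char_end]  # 'isolated from the bloodstream'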
After tokenization, the answer's token indices are 56 and 60:
from transformers import BertTokenizerFast
bert_tokenizer = BertTokenizerFast.from_pretrained('bert-large-uncased', return_token_type_ids=True)
bert_tokenizer.decode(bert_tokenizer.encode(d0['question'], d0['context'])[56:61])
'isolated from the bloodstream'
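To see the individual wordpieces behind that slice, convert_ids_to_tokens can be used instead of decode (a quick illustration; the exact subword split of 'bloodstream' depends on the tokenizer's vocabulary):
ids = bert_tokenizer.encode(d0['question'], d0['context'])
bert_tokenizer.convert_ids_to_tokens(ids[56:61])
# likely ['isolated', 'from', 'the', 'blood', '##stream'] for bert-large-uncased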
I want to create a new dataset that stores the token indices of the answer, e.g. 56 and 60.
This is from a LinkedIn Learning course. The instructor performed this conversion and created a CSV file, but he did not share the file or the code used to produce it. This is the expected result:
You need to encode the question and the context together, find the token span of the answer within the tokenized context, and update the dataset with the token-level indices. The following function does this for you:
def get_token_indices(example):
    # Tokenize the question/context pair; `return_offsets_mapping=True` gives the
    # character span that each token covers within its own sequence
    encoded = tokenizer(
        example['question'],
        example['context'],
        return_offsets_mapping=True
    )
    # Character start and end of the answer inside the context
    char_start = example['answers']['answer_start'][0]
    char_end = char_start + len(example['answers']['text'][0])
    # sequence_ids() is 0 for question tokens, 1 for context tokens and None for
    # special tokens; only context tokens should be matched against the answer span
    sequence_ids = encoded.sequence_ids()
    start_token_idx = None
    end_token_idx = None
    for i, (start, end) in enumerate(encoded['offset_mapping']):
        if sequence_ids[i] != 1:
            continue
        if start <= char_start < end:
            start_token_idx = i
        if start < char_end <= end:
            end_token_idx = i
            break
    example['answer_start_token_idx'] = start_token_idx
    example['answer_end_token_idx'] = end_token_idx
    return example
Here is how to use and test this function:
ds = load_dataset("UCLNLP/adversarial_qa", "adversarialQA")
tokenizer = BertTokenizerFast.from_pretrained('bert-large-uncased', return_token_type_ids=True)
tokenized_ds = ds['train'].map(get_token_indices)
# Example
d0_tokenized = tokenized_ds[0]
print("Tokenized start index:", d0_tokenized['answer_start_token_idx'])
print("Tokenized end index:", d0_tokenized['answer_end_token_idx'])
answer_tokens = tokenizer.decode(
    tokenizer.encode(d0_tokenized['question'], d0_tokenized['context'])[
        d0_tokenized['answer_start_token_idx']:d0_tokenized['answer_end_token_idx'] + 1
    ]
)
print("Tokenized answer:", answer_tokens)
Output:
Tokenized start index: 56
Tokenized end index: 60
Tokenized answer: isolated from the bloodstream
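Since the end goal is a file like the instructor's CSV, the mapped dataset can be written out with Dataset.to_csv (a minimal sketch; the filename is arbitrary, and the filter step just drops any rows where no token span could be found):
tokenized_ds = tokenized_ds.filter(
    lambda ex: ex['answer_start_token_idx'] is not None
    and ex['answer_end_token_idx'] is not None
)
tokenized_ds.to_csv("adversarial_qa_train_token_indices.csv")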