I am trying to train an ensemble model on audio data, but I keep getting the following error:
"ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your features (`input_ids` in this case) have excessive nesting (inputs type `list` where type `int` is expected)."
Although I have already set truncation and padding to True in both the DataCollator and the prepare_dataset function, as shown below, the problem persists.
The data collator:
def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
    if not features:
        return {}
    # split inputs and labels since they have to be of different lengths and
    # need different padding methods
    input_features = [{"input_values": feature["input_values"]} for feature in features]
    label_features = [{"input_ids": feature["labels"]} for feature in features]
    batch = self.processor.pad(
        input_features,
        padding=self.padding,
        max_length=self.max_length,
        truncation=True,  # added truncation parameter
        pad_to_multiple_of=self.pad_to_multiple_of,
        return_tensors="pt",
    )
    with self.processor.as_target_processor():
        labels_batch = self.processor.pad(
            label_features,
            padding=self.padding,
            max_length=self.max_length_labels,
            pad_to_multiple_of=self.pad_to_multiple_of_labels,
            return_tensors="pt",
        )
    # replace padding with -100 so the loss ignores padded positions
    labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)
    batch["labels"] = labels
    return batch
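For reference, the -100 masking step in the collator above can be sketched in plain Python (torch-free; the helper name and IGNORE_INDEX constant are illustrative, not from torch): token positions where the attention mask is 0, i.e. padding, are replaced with -100 so the loss function skips them.

```python
# Illustrative, torch-free sketch of masked_fill(attention_mask.ne(1), -100):
# every padded position in the label ids is overwritten with -100.
IGNORE_INDEX = -100  # the index PyTorch's cross-entropy loss ignores by default

def mask_padding(input_ids, attention_mask):
    # keep the token where the mask is 1, otherwise substitute IGNORE_INDEX
    return [
        [tok if m == 1 else IGNORE_INDEX for tok, m in zip(row, mask)]
        for row, mask in zip(input_ids, attention_mask)
    ]

labels = mask_padding([[11, 12, 0, 0]], [[1, 1, 0, 0]])
# labels == [[11, 12, -100, -100]]
```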
The prepare_dataset function:
MAX_TRANSCRIPTION_LENGTH = 128

def prepare_dataset(batch):
    speech, transcription = batch["audio"], batch["text"]
    # speech_tensors = [torch.tensor(s["array"]) for s in speech]
    speech_tensors = []
    for s in speech:
        tensor = torch.tensor(s["array"])[:MAX_SPEECH_LENGTH] if isinstance(s, dict) else torch.zeros((1,))
        speech_tensors.append(tensor)
    inputs = torch.nn.utils.rnn.pad_sequence(speech_tensors, batch_first=True, padding_value=0.0)
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(transcription, padding=True, truncation=True, add_special_tokens=True,
                           max_length=MAX_TRANSCRIPTION_LENGTH, return_tensors="pt").input_ids
    with tokenizer.as_target_tokenizer():
        input_ids = tokenizer(transcription, padding=True, truncation=True, add_special_tokens=True,
                              max_length=MAX_TRANSCRIPTION_LENGTH, return_tensors="pt").input_ids
    return {"input_values": inputs, "attention_mask": inputs != 0.0, "input_ids": input_ids, "labels": labels}
Please help me resolve this issue.
With padding=True, the tokenizer pads each batch only to the longest sequence in that batch, not to a fixed max_length, so different batches end up with different shapes. Try changing the tokenizer arguments to padding="max_length" (keeping truncation=True), so every example is padded and truncated to the same fixed length.
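The difference between the two padding strategies can be sketched in plain Python (illustrative names only, not transformers code): "longest" padding gives each batch its own width, while "max_length" padding gives every batch the same fixed width, so the resulting tensors stack cleanly across batches.

```python
# Illustrative sketch, assuming a fixed MAX_LENGTH and a pad id of 0;
# this mimics the tokenizer's padding strategies, it is not transformers code.
MAX_LENGTH = 8
PAD_ID = 0

def pad_batch(batch, strategy):
    # strategy="longest" behaves like padding=True,
    # strategy="max_length" behaves like padding="max_length"
    target = max(len(seq) for seq in batch) if strategy == "longest" else MAX_LENGTH
    return [seq[:target] + [PAD_ID] * (target - len(seq)) for seq in batch]

batch_a = [[5, 6], [5, 6, 7]]
batch_b = [[9]]

# "longest": batch_a becomes width 3, batch_b width 1 -- shapes disagree
# "max_length": both batches become width 8 -- shapes agree everywhere
```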