How do I access custom columns in my model's forward() function when training with the Hugging Face Trainer?

Problem description

I am using the Hugging Face Trainer to train a custom model subclassed from a Llama LLM. After tokenization my dataset has fields such as `input_ids` and `labels`, and I have also added two custom columns, `interact_ids` and `candidate_ids`. However, I cannot access these custom fields inside the forward() function of my model class `LLMWithCustomLayer(LlamaForCausalLM)`:

    def forward(
            self,
            input_ids: torch.LongTensor = None,
            attention_mask: Optional[torch.Tensor] = None,
            position_ids: Optional[torch.LongTensor] = None,
            past_key_values: Optional[List[torch.FloatTensor]] = None,
            inputs_embeds: Optional[torch.FloatTensor] = None,
            labels: Optional[torch.LongTensor] = None,
            use_cache: Optional[bool] = None,
            output_attentions: Optional[bool] = None,
            output_hidden_states: Optional[bool] = None,
            return_dict: Optional[bool] = None,
            interact_ids=None,
            candidate_ids=None,
        ):
            print('interact_ids, candidate_ids', interact_ids, candidate_ids)  # they print as None, which is the problem

            interact_embs = []
            candidate_embs = []
            for i in range(interact_ids.shape[0]):  # iterate over the batch
                # O_i = F_i(e_i)
                interact_embs.append(self.item_emb_proj(self.get_item_emb(interact_ids[i])))
                # O_i = F_i(e_i)
                candidate_embs.append(self.item_emb_proj(self.get_item_emb(candidate_ids[i])))

            # replace the [CandidateEmb] and [HistoryEmb] placeholder tokens with the projected embeddings
            inputs_embeds = self.replace_hist_candi_token(input_ids, inputs_embeds, interact_embs, candidate_embs)

            return super().forward(
                input_ids=input_ids,
                attention_mask=attention_mask,
                position_ids=position_ids,
                past_key_values=past_key_values,
                inputs_embeds=inputs_embeds,
                use_cache=use_cache,
                output_attentions=output_attentions,
                output_hidden_states=output_hidden_states,
                return_dict=return_dict,
                labels=labels,
            )

I am new to LLM fine-tuning. Can anyone help me? I would really appreciate it.

pytorch nlp large-language-model huggingface-trainer
1 Answer

You need to modify the data collator so that `interact_ids` and `candidate_ids` are passed through to your model, because the Trainer ignores the extra columns by default.

Modify the data collator

    import torch
    from transformers import DataCollatorWithPadding

    class CustomDataCollator(DataCollatorWithPadding):
        def __call__(self, features):
            # Let the base collator pad input_ids / attention_mask / labels as usual.
            batch = super().__call__(features)
            # Re-attach the custom columns so they reach the model's forward().
            batch["interact_ids"] = torch.tensor([f["interact_ids"] for f in features])
            batch["candidate_ids"] = torch.tensor([f["candidate_ids"] for f in features])
            return batch
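
One caveat, based on an assumption about your data rather than anything stated in the question: `torch.tensor(...)` only works if every example has the same number of ids. If the lists are ragged, pop the custom columns out before calling the base collator and pad them yourself, for example with `pad_sequence` (a sketch; `padding_value=0` is an arbitrary choice):

    import torch
    from torch.nn.utils.rnn import pad_sequence
    from transformers import DataCollatorWithPadding

    class PaddingCustomDataCollator(DataCollatorWithPadding):
        def __call__(self, features):
            # Pop the custom columns so the base collator never sees them.
            interact = [torch.as_tensor(f.pop("interact_ids")) for f in features]
            candidate = [torch.as_tensor(f.pop("candidate_ids")) for f in features]
            batch = super().__call__(features)
            # padding_value=0 is an assumption; use whatever id your model treats as padding.
            batch["interact_ids"] = pad_sequence(interact, batch_first=True, padding_value=0)
            batch["candidate_ids"] = pad_sequence(candidate, batch_first=True, padding_value=0)
            return batch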

Then pass the collator to the `Trainer`:

    trainer = Trainer(
        model=LLMWithCustomLayer.from_pretrained("your-llama-model"),
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        tokenizer=tokenizer,
        data_collator=CustomDataCollator(tokenizer),  # the collator defined above
    )
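
Before launching a full run, a quick smoke test (my own sketch, not part of the original answer) is to push a couple of examples through the collator by hand and confirm the custom keys survive:

    # Batch two training examples manually and inspect the result.
    sample = [train_dataset[i] for i in range(2)]
    batch = CustomDataCollator(tokenizer)(sample)
    print(batch.keys())  # should include 'interact_ids' and 'candidate_ids'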

Now your `forward()` method will receive `interact_ids` and `candidate_ids`.
Hope it works!
