Pytorch BertForSequenceClassification 模型总是预测 0

问题描述 投票:0回答:0

我在 https://www.kaggle.com/datasets/deepcontractor/supreme-court-judgment-prediction 数据集上使用 Bert 来执行二进制分类,我的模型预测全 0 时遇到问题。我认识到这 2/3 的数据是 0 个标签与 1 个标签,无论我调整了哪些参数,我的准确度始终是 67%,当我切换到 50/50 0 和 1 个标签时,我的准确度达到 50%,表明我的模型仅预测其中一个变量。

这是我的预处理和准备代码:

cases = pd.read_csv("justice.csv")
cases.drop(columns=['Unnamed: 0', 'ID', 'name', 'href', 'docket', 'term',  
                    'majority_vote', 'minority_vote', 'decision_type', 'disposition', 'issue_area'], inplace=True)
cases.dropna(inplace=True)

cases = cases.rename(columns={'first_party_winner': 'winning_party_idx'})
for i, row in cases.iterrows():
    if row['winning_party_idx'] == True:
        cases.loc[i, 'winning_party_idx'] = 0
    else:
        cases.loc[i, 'winning_party_idx'] = 1

# Create a mirrored case for each case, where the parties are swapped to prevent favoring first_party
mirrored_cases = cases.copy()
mirrored_cases['first_party'], mirrored_cases['second_party'] = mirrored_cases['second_party'], mirrored_cases['first_party']
mirrored_cases['winning_party_idx'] = (mirrored_cases['winning_party_idx'] == 0).astype(int)
mirrored_cases.reset_index(drop=True, inplace=True)

cases = pd.concat([cases, mirrored_cases])
cases.reset_index(drop=True, inplace=True)

cases['facts'] = cases['facts'].str.replace(r'<[^<]+?>', '', regex=True)
cases['facts'] = cases['facts'].apply(lambda x: re.sub(r'[^a-zA-Z0-9\'\s]', '', x))
#cases['facts'] = cases['facts'].str.lower()

def word_count(text):
  return len(text.split())

cases['facts_len'] = cases['facts'].apply(word_count)
cases['facts_len'].describe()

cases['facts'] = cases.loc[cases['facts_len'] <= 390, 'facts']
cases['facts'] = cases.apply(lambda x: f"{x['first_party']} [SEP] {x['second_party']} [SEP] {x['facts']}", axis=1)
cases = cases.drop(columns=['first_party', 'second_party', 'facts_len'])

train_facts, val_facts, train_winners,  val_winners = train_test_split(
    cases['facts'], cases['winning_party_idx'], test_size=0.20)

train_facts, val_facts = train_facts.tolist(), val_facts.tolist()
train_winners, val_winners = [str(i) for i in train_winners], [str(i) for i in val_winners]

#leave truncate flag off to ensure that no data is truncated
#if data is too large this code will not run
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
train_encodings = tokenizer(train_facts, padding=True)
val_encodings = tokenizer(val_facts, padding=True)

class TextDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(int(self.labels[idx]))
        type(item)
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = TextDataset(train_encodings, train_winners)
val_dataset = TextDataset(val_encodings, val_winners)


这里是加载和训练模型的代码:

#Load pretrained model
model = BertForSequenceClassification.from_pretrained('bert-base-cased', 
                                                      num_labels=2, 
                                                      hidden_dropout_prob=0.4,
                                                      attention_probs_dropout_prob=0.4)

training_args = TrainingArguments(
    output_dir="test_trainer", 
    logging_dir='logs', 
    evaluation_strategy="epoch",
    per_device_train_batch_size=32,  
    per_device_eval_batch_size=32,
    num_train_epochs=3,
    logging_steps=50,
)
metric = evaluate.load("accuracy")
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=1)
    return metric.compute(predictions=predictions, references=labels)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics,
)

trainer.train()

如果有人对为什么我的模型似乎无法做出预测有任何见解,我将不胜感激!我当时认为这可能是损失函数的问题,但我不清楚这里的默认损失函数是什么或如何在这种情况下正确覆盖它。

python machine-learning deep-learning pytorch bert-language-model
© www.soinside.com 2019 - 2024. All rights reserved.