Hugging Face evaluate function with multiple labels

Question

I have two sentences combined with the encode_plus function, and I want to fine-tune a BERT base model for an NLI task. I am looking for a metric name for the Hugging Face evaluate function that can score multiple labels. I ran this code:

import evaluate

metric = evaluate.combine(["accuracy", "f1", "precision", "recall"])
metrics = metric.compute(predictions=[0, 1, 1, 2], references=[0, 2, 1, 0])

and got this result:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[31], line 2
      1 metric = evaluate.combine(["accuracy", "f1", "precision", "recall"])
----> 2 metrics = metric.compute(predictions=[0,1,1,2], references=[0,2,1,0])
      4 metrics

File ~/anaconda3/envs/NER/lib/python3.10/site-packages/evaluate/module.py:862, in CombinedEvaluations.compute(self, predictions, references, **kwargs)
    860     batch = {"predictions": predictions, "references": references, **kwargs}
    861     batch = {input_name: batch[input_name] for input_name in evaluation_module._feature_names()}
--> 862     results.append(evaluation_module.compute(**batch))
    864 return self._merge_results(results)

File ~/anaconda3/envs/NER/lib/python3.10/site-packages/evaluate/module.py:444, in EvaluationModule.compute(self, predictions, references, **kwargs)
    442 inputs = {input_name: self.data[input_name] for input_name in self._feature_names()}
    443 with temp_seed(self.seed):
--> 444     output = self._compute(**inputs, **compute_kwargs)
    446 if self.buf_writer is not None:
    447     self.buf_writer = None

File ~/.cache/huggingface/modules/evaluate_modules/metrics/evaluate-metric--f1/0ca73f6cf92ef5a268320c697f7b940d1030f8471714bffdb6856c641b818974/f1.py:127, in F1._compute(self, predictions, references, labels, pos_label, average, sample_weight)
    126 def _compute(self, predictions, references, labels=None, pos_label=1, average="binary", sample_weight=None):
--> 127     score = f1_score(
    128         references, predictions, labels=labels, pos_label=pos_label, average=average, sample_weight=sample_weight
    129     )
...
   (...)
   1401         UserWarning,
   1402     )

ValueError: Target is multiclass but average='binary'. Please choose another average setting, one of [None, 'micro', 'macro', 'weighted'].
Tags: python, deep-learning, pytorch, torch, huggingface
1 Answer

This error occurs because "f1", "precision", and "recall" require the average argument to be set when computing scores for multiclass classification.
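As an aside: if you are calling the metrics directly, as in the question, rather than through an Evaluator, the simplest workaround is to load each metric separately and pass average only to the three that accept it. Note that you cannot just pass average to the combined metric: as the traceback above shows, CombinedEvaluations.compute filters its inputs down to each metric's feature names (predictions and references), so the keyword would never reach f1. A minimal sketch using the question's toy labels ("weighted" is one of the valid choices; "micro" and "macro" work too):

import evaluate

accuracy = evaluate.load("accuracy")
f1 = evaluate.load("f1")
precision = evaluate.load("precision")
recall = evaluate.load("recall")

preds, refs = [0, 1, 1, 2], [0, 2, 1, 0]

# Merge the per-metric result dicts into one dict
results = {
    **accuracy.compute(predictions=preds, references=refs),
    **f1.compute(predictions=preds, references=refs, average="weighted"),
    **precision.compute(predictions=preds, references=refs, average="weighted"),
    **recall.compute(predictions=preds, references=refs, average="weighted"),
}
print(results)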

Solution #1 - Evaluator with a modified Accuracy metric

Starting from the code in the 🤗 tutorial on using multiple metrics:

import evaluate
from evaluate import evaluator
from datasets import load_dataset

# Set up the evaluator and load some evaluation data
# (the tutorial uses a slice of the IMDB test split)
task_evaluator = evaluator("text-classification")
data = load_dataset("imdb", split="test").shuffle(seed=42).select(range(1000))

eval_results = task_evaluator.compute(
    model_or_pipeline="lvwerra/distilbert-imdb",
    data=data,
    metric=evaluate.combine(["accuracy", "recall", "precision", "f1"]),
    label_mapping={"NEGATIVE": 0, "POSITIVE": 1}
)
print(eval_results)

If you run this against a multiclass classification problem, you will get the error. However, if you follow the call stack inside the evaluator to see where and how _compute is called on each metric, you will find this inside the Evaluator.compute_metric method:

result = metric.compute(**metric_inputs, **self.METRIC_KWARGS)

The METRIC_KWARGS class attribute is used only in this method, so we can simply set it to contain the average argument after instantiating the evaluator, like this:

task_evaluator = evaluator("text-classification")
task_evaluator.METRIC_KWARGS = {"average": "weighted"}
eval_results = task_evaluator.compute(
    ...
    metric=evaluate.combine(["accuracy", "recall", "precision", "f1"]),
    ...
)

However, this raises another error, TypeError: Accuracy._compute() got an unexpected keyword argument 'average', which shows that the default Accuracy metric's _compute method does not accept the new average argument. To work around this, we need to define a custom MulticlassAccuracy class.

class MulticlassAccuracy(Metric):
    """Workaround for the default Accuracy class which doesn't support passing 'average' to the compute method."""

    def _info(self):
        # need to define this method too, can be copied from the default Accuracy class
        return evaluate.MetricInfo(...)

    def _compute(self, predictions, references, normalize=True, sample_weight=None, **kwargs):
        # take **kwargs to avoid breaking when the metric is used with a compute method that takes additional arguments
        return {
            "accuracy": float(
                accuracy_score(references, predictions, normalize=normalize, sample_weight=sample_weight)
            )
        }

The last thing left to do is to use this class in the task_evaluator instead of the default "accuracy" metric (using a dict instead of a list):

task_evaluator = evaluator("text-classification")
task_evaluator.METRIC_KWARGS = {"average": "weighted"}
metrics_dict = {
    "accuracy": MulticlassAccuracy(),
    "precision": "precision",
    "recall": "recall",
    "f1": "f1",
}
eval_results = task_evaluator.compute(
    ...
    metric=evaluate.combine(metrics_dict),
    ...
)

That's it. The full code snippet:

import datasets
import evaluate
from evaluate import evaluator, Metric
from sklearn.metrics import accuracy_score


class MulticlassAccuracy(Metric):
    """Workaround for the default Accuracy class which doesn't support passing 'average' to the compute method."""

    def _info(self):
        return evaluate.MetricInfo(
            description="Accuracy",
            citation="",
            inputs_description="",
            features=datasets.Features(
                {
                    "predictions": datasets.Sequence(datasets.Value("int32")),
                    "references": datasets.Sequence(datasets.Value("int32")),
                }
                if self.config_name == "multilabel"
                else {
                    "predictions": datasets.Value("int32"),
                    "references": datasets.Value("int32"),
                }
            ),
            reference_urls=["https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html"],
        )

    def _compute(self, predictions, references, normalize=True, sample_weight=None, **kwargs):
        # take **kwargs to avoid breaking when the metric is used with a compute method that takes additional arguments
        return {
            "accuracy": float(
                accuracy_score(references, predictions, normalize=normalize, sample_weight=sample_weight)
            )
        }

# Evaluation data: the demo model below is an IMDB sentiment classifier,
# so we evaluate on a slice of the IMDB test split
data = datasets.load_dataset("imdb", split="test").shuffle(seed=42).select(range(1000))

task_evaluator = evaluator("text-classification")
task_evaluator.METRIC_KWARGS = {"average": "weighted"}
metrics_dict = {
    "accuracy": MulticlassAccuracy(),
    "precision": "precision",
    "recall": "recall",
    "f1": "f1",
}

eval_results = task_evaluator.compute(
    model_or_pipeline="lvwerra/distilbert-imdb",
    data=data,
    metric=evaluate.combine(metrics_dict),
    label_mapping={"NEGATIVE": 0, "POSITIVE": 1}
)
print(eval_results)
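Since the original question is about NLI, a three-class problem, the same pattern applies unchanged; only the model, data, and label_mapping differ. A hedged sketch with placeholder names (the model identifier and nli_data below are hypothetical, not tested):

eval_results = task_evaluator.compute(
    model_or_pipeline="your-username/bert-base-finetuned-nli",  # hypothetical fine-tuned NLI model
    data=nli_data,  # hypothetical dataset with text and label columns
    metric=evaluate.combine(metrics_dict),
    label_mapping={"entailment": 0, "neutral": 1, "contradiction": 2},
)
print(eval_results)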

(Optional) Solution #2

For completeness, you can also extend Solution #1: the approach 🤗 recommends is to publish any custom metric to a 🤗 Space, which makes it easier to reuse and less hacky, but it requires setting up the _info method carefully.
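Once published, the custom metric loads like any built-in one; for example (the Space name below is hypothetical):

metric = evaluate.load("your-username/multiclass_accuracy")  # hypothetical Space
print(metric.compute(predictions=[0, 1, 1, 2], references=[0, 2, 1, 0]))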
