How are the BLEU weights determined? What do they depend on?

Question

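I am trying to compute an n-gram BLEU score in Python. The weights I use for unigrams, bigrams, trigrams and 4-grams are (0.25, 0.25, 0, 0).

When I run the script with the first reference, it gives me a BLEU score of 0.51

The script is: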

    from nltk.translate.bleu_score import sentence_bleu

    # Weights for unigrams, bigrams, trigrams and 4-grams
    # (only unigrams and bigrams contribute here)
    weights = (0.25, 0.25, 0, 0)

    # Tokenized reference(s) and predicted text
    reference = [["the", "alleyway", "barely", "lives", "in", "semi", "isolation"]]
    predictions = ["midaq", "alley", "lives", "in", "almost", "complete", "isolation"]

    # Calculate the BLEU score with the custom weights
    score = sentence_bleu(reference, predictions, weights=weights)
    print(score)

But when I run the same script with the second reference, it gives a BLEU score of 6.91

The script is:

    from nltk.translate.bleu_score import sentence_bleu

    # Weights for unigrams, bigrams, trigrams and 4-grams
    # (only unigrams and bigrams contribute here)
    weights = (0.25, 0.25, 0, 0)

    # Tokenized reference(s) and predicted text (same prediction as before)
    reference = [["the", "alley", "is", "almost", "living", "in", "a", "state", "of", "isolation"]]
    predictions = ["midaq", "alley", "lives", "in", "almost", "complete", "isolation"]

    # Calculate the BLEU score with the custom weights
    score = sentence_bleu(reference, predictions, weights=weights)
    print(score)

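My question is: the weights and the code are the same, so why is there such a large difference? How are the weights determined? Are there any specific criteria?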

Answer 1

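As stated here:

Only large differences in metric scores are meaningful in MT

If system A has a BLEU score 1-2 points higher than system B (common in academic papers), human evaluators have only a 50% chance of preferring system A over system B

If system A has a BLEU score 3-5 points higher than system B, human evaluators have a 75% chance of preferring A over B.

For human evaluators to have a 95% chance of preferring A over B, we need a BLEU improvement of 10 points (they do not state this; I am guessing it from their graph).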

So a difference of 5.4 is acceptable.

You have completely different input data, and it is already very small. So of course the weights are different.
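To see how much the two inputs actually differ, it may help to look at the n-gram overlaps directly. The following is a minimal standard-library sketch, not NLTK's implementation: the helper names `ngrams`, `modified_precision` and `brevity_penalty` are made up for illustration, and it assumes a single reference per prediction.

```python
from collections import Counter
from fractions import Fraction
import math


def ngrams(tokens, n):
    """Return all n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(reference, hypothesis, n):
    """Clipped n-gram precision of the hypothesis against one reference."""
    hyp_counts = Counter(ngrams(hypothesis, n))
    ref_counts = Counter(ngrams(reference, n))
    # Credit each hypothesis n-gram at most as often as it occurs in the reference.
    matched = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    return Fraction(matched, sum(hyp_counts.values()))


def brevity_penalty(reference, hypothesis):
    """Penalty applied when the hypothesis is shorter than the reference."""
    c, r = len(hypothesis), len(reference)
    return 1.0 if c >= r else math.exp(1 - r / c)


hypothesis = ["midaq", "alley", "lives", "in", "almost", "complete", "isolation"]
reference_1 = ["the", "alleyway", "barely", "lives", "in", "semi", "isolation"]
reference_2 = ["the", "alley", "is", "almost", "living", "in", "a", "state", "of", "isolation"]

for name, ref in (("reference 1", reference_1), ("reference 2", reference_2)):
    p1 = modified_precision(ref, hypothesis, 1)
    p2 = modified_precision(ref, hypothesis, 2)
    bp = brevity_penalty(ref, hypothesis)
    print(name, "| unigram:", p1, "| bigram:", p2, "| BP:", round(bp, 3))
# reference 1 | unigram: 3/7 | bigram: 1/6 | BP: 1.0
# reference 2 | unigram: 4/7 | bigram: 0 | BP: 0.651
```

With the first reference, the weighted geometric mean (3/7 * 1/6)^0.25 is about 0.517, which is consistent with the reported 0.51. With the second reference no bigram matches at all. As far as I know, NLTK's default (unsmoothed) behaviour then substitutes a tiny positive value for the zero precision and emits a warning, so the result is on the order of 1e-78. Since BLEU from `sentence_bleu` lies between 0 and 1, it is worth checking whether the printed 6.91 was actually followed by an exponent such as `e-78`.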
