I don't understand how scikit-learn's TfidfVectorizer works

Question

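I know that tf-idf is computed as TF * IDF, where TF is the number of times the word occurs in document D, and IDF is (number of documents) / (number of documents containing the word) + 1.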

This is my dataset:

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

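When I compute the tf-idf of the word "document" in document 1 by hand, I get 0.22. But when I use scikit-learn's TfidfVectorizer, the output is 1.22314355. I created the vectorizer with the following parameters: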

from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(norm=None)

Please explain why the two answers differ.

Answer 1

The difference comes from how the IDF is computed.

The textbook IDF used in TF-IDF:

IDF(t)=log(N/DF(t)), where N is the total number of documents and DF(t) is the number of documents that contain the term t.

scikit-learn's IDF (with the default smooth_idf=True):

IDF(t)=log((1+N)/(1+DF(t)))+1

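scikit-learn adds 1 to both the numerator and the denominator inside the logarithm, as if one extra document contained every term, which prevents division by zero. It then adds 1 to the result so that a term occurring in every document does not end up with an IDF of zero. The logarithm is the natural log.

For "document" in the first document, TF = 1, N = 4 and DF = 3, so the result is 1 * (ln(5/4) + 1) ≈ 1.22314355, which matches the vectorizer's output. Your 0.22 is ln(5/4) ≈ 0.2231, the same value without the final + 1.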
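To check both formulas against the corpus from the question, here is a minimal sketch in plain Python. The helper names (`tokenize`, `textbook_idf`, `smoothed_idf`) are made up for this illustration and are not part of scikit-learn. The tokenizer only approximates the vectorizer's default behavior (lowercasing, tokens of two or more word characters).

```python
import math
import re

corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]


def tokenize(text):
    # Hypothetical helper: lowercase and keep tokens of two or more
    # word characters, approximating TfidfVectorizer's default tokenization.
    return re.findall(r"\b\w\w+\b", text.lower())


def textbook_idf(n_docs, doc_freq):
    # Textbook definition: log(N / DF(t)), natural log.
    return math.log(n_docs / doc_freq)


def smoothed_idf(n_docs, doc_freq):
    # Smoothed variant from the answer: log((1 + N) / (1 + DF(t))) + 1.
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1


term = "document"
docs = [tokenize(d) for d in corpus]

n_docs = len(docs)                        # N: total number of documents
doc_freq = sum(term in d for d in docs)   # DF(t): documents containing the term
tf = docs[0].count(term)                  # raw count of the term in document 1

tfidf_textbook = tf * textbook_idf(n_docs, doc_freq)
tfidf_smoothed = tf * smoothed_idf(n_docs, doc_freq)

print(round(tfidf_textbook, 8))  # 0.28768207
print(round(tfidf_smoothed, 8))  # 1.22314355
```

If scikit-learn is installed, the `idf_` attribute of a fitted `TfidfVectorizer(norm=None)` should hold the same smoothed values, with column positions given by `vocabulary_`.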