Link to the original dataset
I have downloaded the dataset The TREC 2006 Public Corpus -- 75MB (trec06p.tgz). This is the folder structure:
.
└── trec06p/
    ├── data
    ├── data-delay
    ├── full
    ├── full-delay
    ├── ham25
    ├── ham25-delay
    ├── ham50
    ├── ham50-delay
    ├── spam25
    ├── spam25-delay
    ├── spam50
    └── spam50-delay
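Some questions:
What do the folders (data-delay, full-delay) mean? Is full just the labels?
What is the difference between HAM and ham in the full-delay subfolder?
Why is the data-delay folder empty?

Before reading the answer, note that I did not take part in the TREC06 task and am not the creator or provider of the data, so I can only make educated guesses about your questions on the dataset.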
First, reading the task paper would help: https://trec.nist.gov/pubs/trec16/papers/SPAM.OVERVIEW16.pdf =)
Next, for future readers, the correct download link is https://plg.uwaterloo.ca/~gvcormac/treccorpus06/
Now, a summary:
A: All of the actual text data is in the files under trec06p/data/**/*:
/trec06p
    /data
        /000
            /000
            ...
            /299
        ...
        /126
            /000
            /021
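The layout above can be walked with the standard library alone. The following is a minimal sketch, assuming every email sits exactly two levels below data/ as shown; iter_emails is a hypothetical helper name, not something shipped with the corpus.

```python
from pathlib import Path


def iter_emails(root='.'):
    """Yield ((directory, file), path) for every email under <root>/data.

    Assumes the two-level layout data/<directory>/<file>; hidden files
    such as .DS_Store are skipped.
    """
    data_dir = Path(root) / 'data'
    for path in sorted(data_dir.glob('*/*')):
        if path.is_file() and not path.name.startswith('.'):
            yield (path.parent.name, path.name), path
```

Called from inside the trec06p directory, `dict(iter_emails('.'))` would map each (directory, file) id to its path.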
The remaining directories are just indexes that point to subsets of the data, to simulate different forms of evaluation.
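trec06p/full/index: index of the full email list, pointing to the files in trec06p/data/**/*
trec06p/full-delay/index: index for the delayed feedback evaluation
trec06p/ham*-delay/index: index pointing only to the ham-labeled (non-spam) emails in the delayed feedback evaluation
trec06p/spam*-delay/index: index pointing only to the spam-labeled emails in the delayed feedback evaluation
trec06p/ham*-delay/index + trec06p/spam*-delay/index = the unique list in trec06p/full-delay/index

Q: Why is the data-delay folder empty?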
Q: Is there any special way to parse the contents of the data folder?
Let's take a step back and think about what we essentially have:
A list of emails in trec06p/data/**/*
A spam/ham label for each email in trec06p/full/index
Spam/SPAM/Ham/HAM labels for a subset of the emails in trec06p/full-delay/index
import pandas as pd
from tqdm import tqdm
from lazyme import find_files

data_rows = {}
# Assuming you're in the `trec06p` directory.
# P/S: you can use any other function that lists file paths,
# I just use lazyme.find_files because I find it convenient.
for fn in tqdm(find_files('./data/**/*')):
    if fn.endswith('.DS_Store'):
        continue
    # Note that not all files are in a utf8/ascii charset,
    # so you'll have to read them in binary to store them.
    # Also note: THIS CAN BE DANGEROUS IF THERE ARE EXECUTABLES IN THE DATA!!!
    # Assuming that there aren't any.
    with open(fn, 'rb') as fin:
        # Identify each email by its (directory, file) pair, e.g. ('000', '021').
        data_id = tuple(fn.split('/')[-2:])
        data_rows[data_id] = fin.read()

full_labels = {}
with open('./full/index') as fin:
    for line in tqdm(fin):
        label, fn = line.strip().split()
        data_id = tuple(fn.split('/')[-2:])
        full_labels[data_id] = label

full_delay_labels = {}
with open('./full-delay/index') as fin:
    for line in tqdm(fin):
        label, fn = line.strip().split()
        data_id = tuple(fn.split('/')[-2:])
        # Some data points are listed more than once in this index,
        # but their labels are exactly the same once case is ignored.... -_-
        if data_id in full_delay_labels:
            assert label.lower() == full_delay_labels[data_id].lower()
        full_delay_labels[data_id] = label.lower()
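The statement that the ham*-delay and spam*-delay indexes together make up the full-delay index is easy to check rather than take on trust. Here is a rough sketch, assuming every index file uses the same "label path" line format as full/index; index_ids and subsets_cover_full_delay are hypothetical helper names.

```python
from pathlib import Path


def index_ids(index_path):
    """Return the set of (directory, file) ids listed in one index file."""
    ids = set()
    with open(index_path) as fin:
        for line in fin:
            parts = line.split()
            if len(parts) != 2:
                continue  # skip blank or malformed lines
            _label, fn = parts
            ids.add(tuple(fn.split('/')[-2:]))
    return ids


def subsets_cover_full_delay(root='.'):
    """True if the ham*-delay and spam*-delay ids together equal the full-delay ids."""
    root = Path(root)
    combined = set()
    for pattern in ('ham*-delay/index', 'spam*-delay/index'):
        for index_path in sorted(root.glob(pattern)):
            combined |= index_ids(index_path)
    return combined == index_ids(root / 'full-delay' / 'index')
```

If this returns False on the real corpus, the relationship described above is only approximate and is worth inspecting by hand.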
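Q: What is the difference between the HAM, Ham, SPAM and Spam labels in trec06p/*-delay/index?
If we look closely at the lines

if data_id in full_delay_labels:
    assert label.lower() == full_delay_labels[data_id].lower()

we see that the assertion never fails: whenever an email is listed more than once, its uppercase and lowercase labels are the same label.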
Q: Then why do the labels differ in case?
A: Not sure, it is best to ask the data provider/creator.
Q: Is there any difference between the labels in trec06p/full-delay/index and trec06p/full/index?
It seems not:
>>> any(full_labels[data_id] != full_delay_labels[data_id] for data_id in full_labels)
False
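The one-liner above raises a KeyError if an email is listed in one index but not the other. A slightly more defensive comparison could look like the sketch below; label_mismatches is a hypothetical helper, and it assumes both arguments are dicts keyed by (directory, file) as built earlier.

```python
def label_mismatches(full_labels, full_delay_labels):
    """Return the ids whose labels differ (ignoring case) or that are missing from one index."""
    mismatches = []
    for data_id in sorted(set(full_labels) | set(full_delay_labels)):
        a = full_labels.get(data_id)
        b = full_delay_labels.get(data_id)
        if a is None or b is None or a.lower() != b.lower():
            mismatches.append(data_id)
    return mismatches
```

An empty list means the two indexes agree on every email they list.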
Q: How do I read it into a pandas DataFrame?
import pandas as pd
from tqdm import tqdm
from lazyme import find_files

# Raw bytes of every email, keyed by (directory, file).
data_rows = {}
for fn in tqdm(find_files('./data/**/*')):
    if fn.endswith('.DS_Store'):
        continue
    with open(fn, 'rb') as fin:
        data_id = tuple(fn.split('/')[-2:])
        data_rows[data_id] = fin.read()

# spam/ham label of every email, keyed the same way.
full_labels = {}
with open('./full/index') as fin:
    for line in tqdm(fin):
        label, fn = line.strip().split()
        data_id = tuple(fn.split('/')[-2:])
        full_labels[data_id] = label

# Both columns are aligned on the (directory, file) key.
df = pd.DataFrame({'binary': pd.Series(data_rows), 'label': pd.Series(full_labels)})
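Since each value in the binary column is a raw email, one option for getting readable text is Python's standard email package. This is only a sketch under the assumption that the files are ordinary RFC 822 style messages; extract_text is a hypothetical helper, and falling back to latin-1 is a pragmatic choice rather than anything the corpus prescribes.

```python
import email
from email import policy


def extract_text(raw):
    """Return (subject, body) decoded from the raw bytes of one email."""
    msg = email.message_from_bytes(raw, policy=policy.default)
    chunks = []
    for part in msg.walk():
        # Only text/* parts carry readable content; skip containers and attachments.
        if part.get_content_maintype() != 'text':
            continue
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        charset = part.get_content_charset() or 'latin-1'
        try:
            chunks.append(payload.decode(charset, errors='replace'))
        except LookupError:
            # The declared charset is unknown to Python; latin-1 never fails.
            chunks.append(payload.decode('latin-1'))
    return str(msg['subject'] or ''), '\n'.join(chunks)
```

With the DataFrame above, something like `df['binary'].map(extract_text)` would then give a (subject, body) pair per row.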
Q: But the input column is still binary; can I guess the encoding somehow? (For example, from the charset=... declaration in the file.)
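import mmap
import re

def find_charset(fn):
    """Return the first charset declared in the file, or None if there is none."""
    with open(fn, 'rb') as f:
        # Memory-map the file so the regex can scan it without reading it all in.
        view = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        match = re.search(br'charset=([!-~\s]{5,})\n', view)
        if match is None:
            return None
        # Keep only the text before the first ';', ',' or newline, minus any quotes.
        value = re.split(r';|,|\n', match.group(1).decode('utf8'))[0]
        return value.strip('"').strip("'")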
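A guessed charset is not always one Python knows, and some files declare none at all, so decoding needs a fallback. Below is a minimal sketch of how that could be handled; decode_bytes is a hypothetical helper, and latin-1 is chosen as the fallback only because it can decode any byte sequence.

```python
def decode_bytes(raw, charset=None, fallback='latin-1'):
    """Decode raw bytes with the guessed charset, falling back if it is missing or unknown."""
    for name in (charset, fallback):
        if not name:
            continue
        try:
            # errors='replace' keeps going when individual bytes are invalid.
            return raw.decode(name, errors='replace')
        except LookupError:
            # Python has no codec with this name; try the next candidate.
            continue
    return raw.decode('latin-1')
```

Typical use would be `decode_bytes(raw, find_charset(fn))`, where raw holds the bytes read from fn.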