I have a fairly large CSV file, about 5 GB, containing entries like the following:
"8976897","this is the abstract text of, some document, and can pretty much contain anything"
"23423","this is the subject text of, some document, and can pretty much contain anything"
"23","this is the full text of, some document, and can pretty much contain anything"
"3443","this is the subject text of, some document, and can pretty much contain anything"
If you look closely at the second column, you will see many exact duplicates. I want to remove them. My questions are:
[1] is `sort` the correct tool for this job?
[2] how do I ask sort to work only on the second column?
[3] does sort find duplicates (via -u flag) anywhere on the file or just immediately next line duplicates?
I tried this:
sort -u infile > outfile
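It seems to work, but the file is so large that I can't check whether it really did what I wanted, since I never told it on the command line to operate on the second column.
I apologize if these are stupid questions.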
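Because your data has quoted fields with embedded commas, a simple tool like `sort` is not suited to this task. You need something that natively understands the CSV format.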
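To illustrate why splitting on commas is not enough, here is a minimal Python sketch (Python and its standard `csv` module are my choice for illustration, not something the answer uses). It compares a naive split with a CSV-aware parse on one of the sample rows from the question:

```python
import csv
import io

# One sample row from the question: two quoted fields, the second with embedded commas.
line = '"23","this is the full text of, some document, and can pretty much contain anything"\n'

# Naive approach: treat every comma as a field separator.
naive = line.rstrip("\n").split(",")

# CSV-aware approach: the quotes protect the embedded commas.
parsed = next(csv.reader(io.StringIO(line)))

print(len(naive))   # number of pieces from the naive split
print(len(parsed))  # number of real CSV fields
print(parsed[1])    # the complete second field
```

The naive split cuts the second field into several pieces, so any tool that keys on "the text between commas" ends up comparing fragments rather than the whole column.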
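Here is a Perl one-liner that skips printing any line whose second column has already been printed once (in other words, if there are duplicates, only the first entry is printed):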
$ perl -MText::CSV_XS -e '
my $csv = Text::CSV_XS->new({binary=>1, always_quote=>1});
while (my $rec = $csv->getline(*ARGV)) {
$csv->say(*STDOUT, $rec) unless $seen{$rec->[1]}++
}' input.csv
"8976897","this is the abstract text of, some document, and can pretty much contain anything"
"23423","this is the subject text of, some document, and can pretty much contain anything"
"23","this is the full text of, some document, and can pretty much contain anything"
The Text::CSV_XS module is available through your operating system's package manager or your favorite CPAN client.
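For readers who do not have Perl handy, the same "keep the first row per second-column value" idea can be sketched in Python with the standard `csv` module. This is an assumption-laden sketch, not a translation of the one-liner: the function name `dedupe_by_column` is hypothetical, and `QUOTE_ALL` is used to mimic `always_quote`. As in the Perl version, every distinct second-column value is held in memory.

```python
import csv
import io

def dedupe_by_column(src, dst, column=1):
    """Copy CSV rows from src to dst, keeping only the first row
    for each distinct value in the given (0-based) column."""
    seen = set()
    reader = csv.reader(src)
    # QUOTE_ALL quotes every field, similar to always_quote in Text::CSV_XS.
    writer = csv.writer(dst, quoting=csv.QUOTE_ALL, lineterminator="\n")
    for row in reader:
        if len(row) <= column:
            # Short or blank row: pass it through unchanged.
            writer.writerow(row)
            continue
        key = row[column]
        if key not in seen:
            seen.add(key)
            writer.writerow(row)

# Small demo using the sample rows from the question.
sample = io.StringIO(
    '"8976897","this is the abstract text of, some document, and can pretty much contain anything"\n'
    '"23423","this is the subject text of, some document, and can pretty much contain anything"\n'
    '"23","this is the full text of, some document, and can pretty much contain anything"\n'
    '"3443","this is the subject text of, some document, and can pretty much contain anything"\n'
)
out = io.StringIO()
dedupe_by_column(sample, out)
print(out.getvalue(), end="")
```

For a real file, open it with `open(path, newline="")` and pass the file object as `src`; rows are streamed one at a time, so only the set of seen keys grows with the input.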
Warning: blatant self-promotion. My tawk utility, an awk-like program built around Tcl, has a CSV-aware input mode:
$ tawk -csv 'line { if {![info exists seen($F(2))]} { set seen($F(2)) 1; print }}' input.csv
"8976897","this is the abstract text of, some document, and can pretty much contain anything"
"23423","this is the subject text of, some document, and can pretty much contain anything"
"23","this is the full text of, some document, and can pretty much contain anything"
I think choose is the right tool for this workload (I am the author).
The solution looks like this:
$ cat file_contents | choose -u --field '^[^,]*+\K.*+'
"8976897","this is the abstract text of, some document, and can pretty much contain anything"
"23423","this is the subject text of, some document, and can pretty much contain anything"
"23","this is the full text of, some document, and can pretty much contain anything"
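This keeps the first instance of each line, judging uniqueness only by the part that matches the --field argument. See here for an explanation of the expression. GNU sort cannot do this, because it can only match on what lies between commas, while your data fields can themselves contain commas.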
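The keying idea behind that expression (ignore the first field, compare everything from the first comma onward) can be sketched in Python. This only illustrates the idea and says nothing about how choose works internally. The helper name `unique_by_tail` is hypothetical, and the sketch assumes one record per line and no comma inside the first field:

```python
def unique_by_tail(lines):
    """Yield the first line seen for each distinct 'tail',
    where the tail is everything from the first comma onward."""
    seen = set()
    for line in lines:
        # partition splits at the FIRST comma only; later commas stay in the tail.
        _first, sep, tail = line.partition(",")
        key = sep + tail
        if key not in seen:
            seen.add(key)
            yield line

rows = [
    '"8976897","abstract text of, some document"',
    '"23423","subject text of, some document"',
    '"3443","subject text of, some document"',
]
for row in unique_by_tail(rows):
    print(row)
```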