I have a 3 GB txt file:
Lucy
Mary
Lily
John
Mary
Ann
John
Lily
Lily
Mary
Lucy
Mark
The output must contain files like these:
3instances.txt:
Lily
Lily
Lily
Mary
Mary
Mary
2instances.txt:
John
John
Lucy
Lucy
1instance.txt:
Ann
Mark
If you want to use the shell, you can do it like this:
sort < INPUT_FILE | uniq -c | awk '{ print $2 > $1"instances.txt"}'
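sort: puts identical names next to each other
uniq -c: collapses each run of identical lines into one line and prefixes it with its count, e.g. 1 Alice
awk '{ print $2 > $1"instances.txt"}': writes the name (second column) to a file named after the count (first column)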
This will produce:
1instances.txt
Ann
Mark
.
.
.
2instances.txt
John
Lucy
.
.
.
3instances.txt
Lily
Mary
.
.
.
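To see what each stage of that pipeline contributes, the same steps can be mirrored in Python. This is only an illustrative sketch, not part of the answer above: the names pipeline_counts and group_names are made up here, and everything is held in memory, so it suits the small sample rather than the 3 GB file.

```python
from itertools import groupby

def pipeline_counts(lines):
    """Mimic `sort | uniq -c`: return (count, name) pairs in sorted order."""
    ordered = sorted(lines)               # sort: identical names become adjacent
    return [(len(list(run)), name)        # uniq -c: count each run of equal lines
            for name, run in groupby(ordered)]

def group_names(pairs):
    """Mimic the awk step: collect names under a file name built from the count."""
    files = {}
    for count, name in pairs:
        files.setdefault(f"{count}instances.txt", []).append(name)
    return files

sample = ["Lucy", "Mary", "Lily", "John", "Mary", "Ann",
          "John", "Lily", "Lily", "Mary", "Lucy", "Mark"]
pairs = pipeline_counts(sample)
files = group_names(pairs)
```

As with the shell version, each name ends up in its file once, not once per occurrence.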
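Some wizards can do this kind of thing with awk and other shell commands.
For ordinary people, there is Python.
Note that I tested the following code on your small example, but not on a 3 GB file.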
#!/usr/bin/env python3
from collections import Counter
import sys


def group_by_count(filename):
    with open(filename, 'r') as f:
        c = Counter(line.strip() for line in f)
    groups = {}
    for (line, count) in c.items():
        groups.setdefault(count, []).append(line)
    return groups


def write_files(groups):
    for n, lines in sorted(groups.items()):
        filename = f'{n}instances.txt'
        with open(filename, 'w') as f:
            for line in lines:
                f.write(line + '\n')


def main(argv):
    if len(argv) > 1:
        groups = group_by_count(argv[1])
        write_files(groups)
    else:
        print('Please specify a file name to read from.')


if __name__ == '__main__':
    main(sys.argv)
Result:
$ chmod +x sort_by_repetitions.py
$ cat test.txt
Lucy
Mary
Lily
John
Mary
Ann
John
Lily
Lily
Mary
Lucy
Mark
$ ./sort_by_repetitions.py test.txt
$ ls *instances*
1instances.txt 2instances.txt 3instances.txt
$ cat 1instances.txt
Ann
Mark
$ cat 2instances.txt
Lucy
John
$ cat 3instances.txt
Mary
Lily
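The script above writes each name once per file, whereas the sample output in the question repeats each name as many times as it occurs. A minimal variant that does the repetition might look like the sketch below. The function names count_names and write_repeated and the out_dir parameter are assumptions introduced here for illustration, and like the script above it has only been reasoned through on the small sample, not run on a 3 GB file. Memory use grows with the number of distinct names, not with the file size.

```python
import os
from collections import Counter

def count_names(lines):
    """Count occurrences of each stripped line while streaming over the input."""
    return Counter(line.strip() for line in lines)

def write_repeated(counts, out_dir="."):
    """Write every name `count` times into '<count>instances.txt' inside out_dir."""
    groups = {}
    for name, count in counts.items():
        groups.setdefault(count, []).append(name)
    for count, names in groups.items():
        path = os.path.join(out_dir, f"{count}instances.txt")
        with open(path, "w") as f:
            for name in sorted(names):
                # Repeat the name once per occurrence, as in the question's sample.
                f.write((name + "\n") * count)
    return sorted(groups)  # the counts for which a file was written
```

An open file object can be passed straight to count_names, for example count_names(open('test.txt')), since it only iterates over lines.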
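awk '
{
    cnt[$0]++                       # count occurrences of each line
}
END {
    # asorti() is a GNU awk (gawk) extension: it sorts the array indices (the names)
    n = asorti(cnt, sorted)
    for (i = 1; i <= n; i++) {
        # singular file name for a count of 1, plural otherwise
        out = cnt[sorted[i]] (cnt[sorted[i]] > 1 ? "instances.txt" : "instance.txt")
        # print the name once per occurrence
        for (j = 1; j <= cnt[sorted[i]]; j++)
            print sorted[i] > out
    }
}' file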
$ head ?instance*.txt
==> 1instance.txt <==
Ann
Mark
==> 2instances.txt <==
John
John
Lucy
Lucy
==> 3instances.txt <==
Lily
Lily
Lily
Mary
Mary
Mary