cat file1.txt
abc bcd abc ...
abcd bcde cdef ...
abcd bcde cdef ...
abcd bcde cdef ...
efg fgh ...
efg fgh ...
hig ...
My expected result is as follows:
abc bcd abc ...
abcd bcde cdef ...
<!!! pay attention, above sentence has repeated 3 times !!!>
efg fgh ...
<!!! pay attention, above sentence has repeated 2 times !!!>
hig ...
I found a way to solve the problem, but my code is rather messy.
cat file1.txt | uniq -c | sed -e 's/ \+/ /g' -e 's/^.//g' | awk '{print $0," ",$1}'| sed -e 's/^[2-9] /\n/g' -e 's/^[1] //g' |sed -e 's/[^1]$/\n<!!! pay attention, above sentence has repeated & times !!!> \n/g' -e 's/[1]$//g'
abc bcd abc ...
abcd bcde cdef ...
<!!! pay attention, above sentence has repeated 3 times !!!>
efg fgh ...
<!!! pay attention, above sentence has repeated 2 times !!!>
hig ...
Is there a more efficient way to achieve this? Thank you.
If your lines are not already grouped, you can use
awk '
    NR == FNR {count[$0]++; next}
    !seen[$0]++ {
        print
        if (count[$0] > 1)
            print "... repeated", count[$0], "times"
    }
' file1.txt file1.txt
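If the file is very large, this will consume a lot of memory, because every distinct line is kept in an array. You may want to sort it first.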
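To make "not already grouped" concrete, here is a minimal sketch that runs the two-pass awk above on a small interleaved input. The wrapper name `count_repeats` and the demo data are assumptions made up for illustration; they are not from the question.

```shell
# Hypothetical demo input: duplicates are interleaved, not adjacent.
demo=$(mktemp)
printf '%s\n' b a b c a b > "$demo"

# count_repeats is an illustrative wrapper around the two-pass awk above.
# It takes a file name, because the file has to be read twice.
count_repeats() {
    awk '
        NR == FNR { count[$0]++; next }   # pass 1: tally every line
        !seen[$0]++ {                     # pass 2: act on first occurrence only
            print
            if (count[$0] > 1)
                print "... repeated", count[$0], "times"
        }
    ' "$1" "$1"
}

count_repeats "$demo"
# b
# ... repeated 3 times
# a
# ... repeated 2 times
# c
rm -f "$demo"
```

Each distinct line is reported once, in order of first appearance, which is something `uniq -c` alone cannot do on ungrouped input.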
sort + uniq + sed solution:
sort file1.txt | uniq -c | sed -E 's/^ +1 (.+)/\1\n/;
s/^ +([2-9]|[0-9]{2,}) (.+)/\2\n<!!! pay attention, the above sentence has repeated \1 times !!!>\n/'
Output:
abc bcd abc ...
abcd bcde cdef ...
<!!! pay attention, the above sentence has repeated 3 times !!!>
efg fgh ...
<!!! pay attention, the above sentence has repeated 2 times !!!>
hig ...
Or with awk:
sort file1.txt | uniq -c | awk '{ n=$1; sub(/^ +[0-9]+ +/,"");
printf "%s\n%s",$0,(n==1? ORS:"<!!! pay attention, the above sentence has repeated "n" times !!!>\n\n") }'
$ awk '
$0==prev { cnt++; next }
{ prt(); prev=$0; cnt=1 }
END { prt() }
function prt() {
if (NR>1) print prev (cnt>1 ? ORS "repeated " cnt " times" : "") ORS
}
' file
abc bcd abc ...
abcd bcde cdef ...
repeated 3 times
efg fgh ...
repeated 2 times
hig ...
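If the duplicates are already adjacent, as they are in the question's file1.txt, the same single-pass idea can be wrapped in a shell function that prints the asker's exact message. This is only a sketch: the function name `annotate_runs` and the `have` flag are assumptions introduced here. The flag replaces the `NR>1` test so that a one-line input is still printed.

```shell
# annotate_runs: hypothetical helper name, not taken from the answers above.
# Reads lines on stdin and assumes duplicate lines are adjacent.
annotate_runs() {
    awk '
        function emit() {
            if (!have) return            # nothing buffered yet
            print prev
            if (cnt > 1)
                print "<!!! pay attention, above sentence has repeated " cnt " times !!!>"
        }
        have && $0 == prev { cnt++; next }         # same run: just count it
        { emit(); prev = $0; cnt = 1; have = 1 }   # new run: emit the previous one
        END { emit() }                             # emit the final run
    '
}

printf '%s\n' 'abc' 'abcd' 'abcd' 'abcd' 'efg' 'efg' 'hig' | annotate_runs
# abc
# abcd
# <!!! pay attention, above sentence has repeated 3 times !!!>
# efg
# <!!! pay attention, above sentence has repeated 2 times !!!>
# hig
```

Only the previous line and a counter are held in memory, so this should stay cheap on large files. Unlike the sort-based pipelines, it keeps the original line order.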