How to count repeated sentences in the shell

Question

cat file1.txt
abc bcd abc ...
abcd bcde cdef ...
abcd bcde cdef ...
abcd bcde cdef ...
efg fgh ...
efg fgh ...
hig ...

My expected result is as follows:

abc bcd abc ...      

abcd bcde cdef ...  
<!!! pay attention, above sentence has repeated 3 times !!!>

efg fgh ...
<!!! pay attention, above sentence has repeated 2 times !!!>

hig ...

I found a way to solve the problem, but my code is a bit noisy.

cat file1.txt | uniq -c | sed -e 's/ \+/ /g' -e 's/^.//g' | awk '{print $0," ",$1}'| sed -e 's/^[2-9] /\n/g' -e 's/^[1] //g' |sed -e 's/[^1]$/\n<!!! pay attention, above sentence has repeated & times !!!> \n/g' -e 's/[1]$//g'

abc bcd abc ...

abcd bcde cdef ...
<!!! pay attention, above sentence has repeated 3 times !!!>

efg fgh ...
<!!! pay attention, above sentence has repeated 2 times !!!>

hig ...

I would like to know whether there is a more efficient way to achieve this. Thank you.

Answer 1

If your lines are not already grouped, you can use:

awk '
    NR == FNR {count[$0]++; next} 
    !seen[$0]++ {
        print
        if (count[$0] > 1)
            print "... repeated", count[$0], "times"
    }
' file1.txt file1.txt

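If the file is very large, this will consume a lot of memory, because awk keeps every distinct line in its arrays. You may want to sort the file first.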
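The two-pass command above names the file twice, so it cannot read from a pipe. A minimal single-pass sketch, assuming the whole input fits in memory, records the first-seen order in an array instead. The function name `count_repeats` and the variables `order` and `n` are illustrative and do not come from the answer.

```shell
# Hypothetical helper: count repeated lines in one pass, keeping first-seen order.
count_repeats() {
    awk '
        # Remember each distinct line the first time it appears.
        !($0 in count) { order[++n] = $0 }
        # Count every occurrence of the line.
        { count[$0]++ }
        END {
            for (i = 1; i <= n; i++) {
                print order[i]
                if (count[order[i]] > 1)
                    print "... repeated", count[order[i]], "times"
            }
        }
    ' "$@"
}

# Works on a pipe as well as on a file name.
printf '%s\n' a b a c b a | count_repeats
```

This trades the second read of the file for an extra array, so memory use stays proportional to the number of distinct lines.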

Answer 2

sort + uniq + sed solution:

sort file1.txt | uniq -c | sed -E 's/^ +1 (.+)/\1\n/; 
 s/^ +([2-9]|[0-9]{2,}) (.+)/\2\n<!!! pay attention, the above sentence has repeated \1 times !!!>\n/'

Output:

abc bcd abc ...

abcd bcde cdef ...
<!!! pay attention, the above sentence has repeated 3 times !!!>

efg fgh ...
<!!! pay attention, the above sentence has repeated 2 times !!!>

hig ...

Or with awk:

sort file1.txt | uniq -c | awk '{ n=$1; sub(/^ +[0-9]+ +/,""); 
printf "%s\n%s",$0,(n==1? ORS:"<!!! pay attention, the above sentence has repeated "n" times !!!>\n\n") }'
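Sorting reorders the lines, so these pipelines report them in sorted order rather than in the file's original order. When the duplicates are already adjacent, as in the question's file1.txt, `uniq -c` can run without `sort`, since it only collapses adjacent identical lines. The following is a sketch under that assumption; the function name `annotate_runs` is illustrative.

```shell
# Hypothetical wrapper: annotate runs of adjacent duplicate lines, no sorting.
annotate_runs() {
    uniq -c "$@" | awk '{
        n = $1                      # count printed by uniq -c
        sub(/^ *[0-9]+ /, "")       # strip the count, keep the line text
        print
        if (n > 1)
            print "<!!! pay attention, the above sentence has repeated " n " times !!!>"
        print ""                    # blank separator line
    }'
}

printf '%s\n' 'b b' 'b b' 'a' | annotate_runs
```

Because nothing is sorted, a line that reappears later in the input after other lines is counted as a separate run.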

Answer 3

$ awk '
    $0==prev { cnt++; next }
    { prt(); prev=$0; cnt=1 }
    END { prt() }
    function prt() {
        # skip the initial call, made before any line has been stored
        if (cnt) print prev (cnt>1 ? ORS "repeated " cnt " times" : "") ORS
    }
' file
abc bcd abc ...

abcd bcde cdef ...
repeated 3 times

efg fgh ...
repeated 2 times

hig ...
