Comparing two files in the Linux terminal

Question

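Two files named "a.txt" and "b.txt" each contain a list of words. I want to check which words are extra in "a.txt", that is, present in "a.txt" but not in "b.txt".

I need an efficient algorithm, because I need to compare two dictionaries.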

Answer 1

If you have vim installed, try this:

vimdiff file1 file2

or

vim -d file1 file2

You will find it fantastic.

Answer 2

Sort them and use comm:

comm -23 <(sort a.txt) <(sort b.txt)

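comm compares (sorted) input files and by default outputs three columns: lines unique to a, lines unique to b, and lines present in both. By specifying -1, -2 and/or -3, you can suppress the corresponding output. Therefore comm -23 a b lists only the entries unique to a. I use the <(...) syntax to sort the files on the fly; if they are already sorted, you don't need it.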
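To make the three columns concrete, here is a minimal sketch that runs comm with each pair of suppression flags. The sample words, the temporary directory, and the variable names are illustrative assumptions, and it sorts into temporary files instead of using <(...) so that it also runs in shells without process substitution.

```shell
# Illustrative sample data (assumed, not from the question).
tmpdir=$(mktemp -d)
printf '%s\n' pear apple fig > "$tmpdir/a.txt"
printf '%s\n' fig kiwi apple > "$tmpdir/b.txt"

# comm expects sorted input, so sort each file into a temporary copy first.
sort "$tmpdir/a.txt" > "$tmpdir/a.sorted"
sort "$tmpdir/b.txt" > "$tmpdir/b.sorted"

# -23 suppresses columns 2 and 3, leaving lines found only in a.txt.
only_a=$(comm -23 "$tmpdir/a.sorted" "$tmpdir/b.sorted")
# -13 suppresses columns 1 and 3, leaving lines found only in b.txt.
only_b=$(comm -13 "$tmpdir/a.sorted" "$tmpdir/b.sorted")
# -12 suppresses columns 1 and 2, leaving lines found in both files.
both=$(comm -12 "$tmpdir/a.sorted" "$tmpdir/b.sorted")

echo "only in a.txt:"; echo "$only_a"
echo "only in b.txt:"; echo "$only_b"
echo "in both:";       echo "$both"

rm -rf "$tmpdir"
```

With this data, pear is unique to a.txt, kiwi is unique to b.txt, and apple and fig appear in both.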

Answer 3

If you like the diff output style of git diff, you can use it with the --no-index flag to compare files that are not in a git repository:

git diff --no-index a.txt b.txt

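Using a couple of files that each contain about 200k file name strings, I benchmarked this approach against some of the other answers here (using the built-in time command):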

git diff --no-index a.txt b.txt
# ~1.2s

comm -23 <(sort a.txt) <(sort b.txt)
# ~0.2s

diff a.txt b.txt
# ~2.6s

sdiff a.txt b.txt
# ~2.7s

vimdiff a.txt b.txt
# ~3.2s

comm seems to be the fastest by far, while git diff --no-index appears to be the fastest approach for diff-style output.

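Update 2018-03-25: You can actually omit the --no-index flag unless you are inside a git repository and want to compare untracked files within that repository. From the man page:

This form is to compare the given two paths on the filesystem. You can omit the --no-index option when running the command in a working tree controlled by Git and at least one of the paths points outside the working tree, or when running the command outside a working tree controlled by Git.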

Answer 4


Try sdiff (man sdiff):

sdiff -s file1 file2

Answer 5

You can use the diff tool in Linux to compare two files. The --changed-group-format and --unchanged-group-format options let you filter the output down to the data you need.

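Each of these options accepts one of the following three values to select the relevant group:

'%<' gets lines from FILE1
'%>' gets lines from FILE2
'' (empty string) removes lines from both files.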

For example: diff --changed-group-format="%<" --unchanged-group-format="" file1.txt file2.txt

[root@vmoracle11 tmp]# cat file1.txt 
test one
test two
test three
test four
test eight
[root@vmoracle11 tmp]# cat file2.txt 
test one
test three
test nine
[root@vmoracle11 tmp]# diff --changed-group-format='%<' --unchanged-group-format='' file1.txt file2.txt 
test two
test four
test eight
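The opposite direction works the same way. As a sketch, assuming GNU diff (these group-format options are GNU extensions) and the same sample files recreated in a temporary directory, '%>' prints only the lines that come from the second file. The variable name is a hypothetical choice for illustration.

```shell
# Recreate the sample files from the transcript above in a temporary directory.
tmpdir=$(mktemp -d)
printf '%s\n' 'test one' 'test two' 'test three' 'test four' 'test eight' > "$tmpdir/file1.txt"
printf '%s\n' 'test one' 'test three' 'test nine' > "$tmpdir/file2.txt"

# '%>' keeps the FILE2 side of every changed group; unchanged groups print nothing.
# diff exits with status 1 when the files differ, so "|| true" keeps
# scripts that run under "set -e" from aborting here.
only_in_file2=$(diff --changed-group-format='%>' --unchanged-group-format='' \
    "$tmpdir/file1.txt" "$tmpdir/file2.txt" || true)

echo "$only_in_file2"

rm -rf "$tmpdir"
```

Here the only line that file2.txt contributes is "test nine".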

Answer 6

You can also use colordiff, which displays the output of diff in color.

About vimdiff: it allows you to compare files over SSH, for example:

vimdiff /var/log/secure scp://192.168.1.25/var/log/secure

Source: http://www.sysadmit.com/2016/05/linux-diferencias-entre-dos-archivos.html

Answer 7

Also, don't forget mcdiff, the internal diff viewer of GNU Midnight Commander.

For example:

mcdiff file1 file2

Enjoy!

Answer 8

Use comm -13 (requires sorted files):

$ cat file1
one
two
three

$ cat file2
one
two
three
four

$ comm -13 <(sort file1) <(sort file2)
four

Answer 9

You can also use:

sdiff file1 file2

to display the differences side by side in the terminal!

Answer 10

Here is my solution:

# Create the working directories in $HOME, where the commands below expect them
mkdir -p ~/temp ~/results
cp /usr/share/dict/american-english ~/temp/american-english-dictionary
cp /usr/share/dict/british-english ~/temp/british-english-dictionary
# Count the words in each dictionary
wc -l < ~/temp/american-english-dictionary > ~/results/count-american-english-dictionary
wc -l < ~/temp/british-english-dictionary > ~/results/count-british-english-dictionary
# Words present in both dictionaries (-F fixed strings, -x whole lines, -f patterns from file)
grep -Fxf ~/temp/american-english-dictionary ~/temp/british-english-dictionary > ~/results/common-english
# Words unique to each dictionary (-v inverts the match)
grep -Fxvf ~/results/common-english ~/temp/american-english-dictionary > ~/results/unique-american-english
grep -Fxvf ~/results/common-english ~/temp/british-english-dictionary > ~/results/unique-british-english
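For the original question (words in a.txt that are missing from b.txt), the same grep flags can probably be applied in a single step, without building the intermediate list of common words. The following is a minimal sketch; the sample words, temporary directory, and variable name are assumptions for illustration.

```shell
# Illustrative sample data (assumed, not from the question).
tmpdir=$(mktemp -d)
printf '%s\n' apple banana cherry > "$tmpdir/a.txt"
printf '%s\n' banana date > "$tmpdir/b.txt"

# -F: treat patterns as fixed strings, -x: match whole lines only,
# -v: invert the match, -f: read the patterns from b.txt.
# grep exits with status 1 when it selects no lines, hence "|| true".
extra_in_a=$(grep -Fxvf "$tmpdir/b.txt" "$tmpdir/a.txt" || true)

echo "$extra_in_a"

rm -rf "$tmpdir"
```

Unlike comm, this does not need sorted input and preserves the order of a.txt.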

Answer 11

diff a.txt b.txt | grep '^<'

The result can then be piped to cut for clean output (cut -c 3- drops the leading "< " marker):

diff a.txt b.txt | grep '^<' | cut -c 3-

Answer 12

You can use cmp.

cmp file1.c file2.c

Example (the -b option prints the differing bytes):

$ cmp -b quine2.c quine3.c
quine2.c quine3.c differ: byte 13, line 1 is  15 ^M  12 ^J

Be sure to check the man page for cmp.
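Note that cmp only reports whether (and where) two files differ byte by byte; it does not list the extra words. It can still be handy in scripts as a quick identity check. A minimal sketch, where the file names and variable names are hypothetical:

```shell
# Illustrative files (assumed): two identical, one differing in a single byte.
tmpdir=$(mktemp -d)
printf 'hello\n' > "$tmpdir/one.txt"
printf 'hello\n' > "$tmpdir/two.txt"
printf 'hellO\n' > "$tmpdir/three.txt"

# -s suppresses all output; only the exit status is used
# (0 when the files are identical, 1 when they differ).
if cmp -s "$tmpdir/one.txt" "$tmpdir/two.txt"; then
    first_check=identical
else
    first_check=different
fi

if cmp -s "$tmpdir/one.txt" "$tmpdir/three.txt"; then
    second_check=identical
else
    second_check=different
fi

echo "one vs two:   $first_check"
echo "one vs three: $second_check"

rm -rf "$tmpdir"
```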

Answer 13

Using awk. Test files:

$ cat a.txt
one
two
three
four
four
$ cat b.txt
three
two
one

awk：

$ awk '
NR==FNR {                    # process b.txt  or the first file
    seen[$0]                 # hash words to hash seen
    next                     # next word in b.txt
}                            # process a.txt  or all files after the first
!($0 in seen)' b.txt a.txt   # if word is not hashed to seen, output it

This outputs duplicates:

four
four

To avoid duplicates, add each newly encountered word in a.txt to the seen hash:

$ awk '
NR==FNR {
    seen[$0]
    next
}
!($0 in seen) {              # if word is not hashed to seen
    seen[$0]                 # hash unseen a.txt words to seen to avoid duplicates 
    print                    # and output it
}' b.txt a.txt

Output:

four
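The deduplicating version can also be written more compactly with the common awk idiom !seen[$0]++. This is a sketch of an alternative formulation, not the answer's own code; it assigns 1 to the words from b.txt so that the same array can serve as a counter, and the temporary directory and variable name are assumptions.

```shell
# Recreate the test files from above in a temporary directory.
tmpdir=$(mktemp -d)
printf '%s\n' one two three four four > "$tmpdir/a.txt"
printf '%s\n' three two one > "$tmpdir/b.txt"

# First file (b.txt): mark every word as seen.
# Later files (a.txt): print a word only if its counter is still zero,
# then increment the counter so that repeats are suppressed.
unique_extra=$(awk 'NR==FNR { seen[$0] = 1; next } !seen[$0]++' \
    "$tmpdir/b.txt" "$tmpdir/a.txt")

echo "$unique_extra"

rm -rf "$tmpdir"
```

As with the longer version, four is printed once.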

If the word lists are comma-separated, for example:

$ cat a.txt
four,four,three,three,two,one
five,six
$ cat b.txt
one,two,three

you have to run a few extra laps (for loops):

awk -F, '                    # comma-separated input
NR==FNR {
    for(i=1;i<=NF;i++)       # loop all comma-separated fields
        seen[$i]
    next
}
{
    for(i=1;i<=NF;i++)
        if(!($i in seen)) {
             seen[$i]        # this time we buffer output (below):
             buffer=buffer (buffer==""?"":",") $i
        }
    if(buffer!="") {         # output unempty buffers after each record in a.txt
        print buffer
        buffer=""
    }
}' b.txt a.txt

Output this time:

four
five,six
