How do I count the differences between two files on Linux?

Question

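I need to work with large files and must find the differences between two of them. I don't need the differing bits themselves, only the number of differences.

To find the number of differing lines, I came up with this: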

diff --suppress-common-lines --speed-large-files -y File1 File2 | wc -l

It works, but is there a better way?

And how do I count the exact number of differences (using standard tools such as bash, diff, awk, sed, or an older version of perl)?

Answer 1

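If you want to count the number of differing lines (strictly speaking, the number of changed hunks, since each @@ header can cover several lines), use this command: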

diff -U 0 file1 file2 | grep ^@ | wc -l

Doesn't John's answer double-count the differing lines?

Answer 2

diff -U 0 file1 file2 | grep -v ^@ | wc -l

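Subtract 2 from the result for the two file-name lines at the top of the diff listing. Unified format is probably a bit faster than side-by-side format.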
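The difference between counting `@` lines and counting `+`/`-` lines can be illustrated with a minimal Python sketch. It uses the standard `difflib` module rather than the `diff` binary, so its matching algorithm (and therefore the exact counts) may differ from GNU diff, and it holds both inputs in memory, which may not suit very large files. The function name `count_hunks_and_lines` is only an illustration.

```python
import difflib
from itertools import islice

def count_hunks_and_lines(old_lines, new_lines):
    """Return (hunks, changed_lines) for two lists of lines."""
    # n=0 requests no context lines, similar to `diff -U 0`
    diff = difflib.unified_diff(old_lines, new_lines, n=0, lineterm="")
    hunks = changed = 0
    # Skip the two file-name header lines, like `tail -n +3`
    for line in islice(diff, 2, None):
        if line.startswith("@@"):
            hunks += 1          # one range header per hunk
        elif line.startswith(("+", "-")):
            changed += 1        # each added or removed line
    return hunks, changed

# One two-line edit plus one appended line: 2 hunks, 5 changed lines
print(count_hunks_and_lines(["a", "b", "c", "d"], ["a", "x", "y", "d", "e"]))
```

A modified line is reported once as a removal and once as an addition, which is why the line count can be much larger than the hunk count.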

Answer 3

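If you are on Linux/Unix, what about

comm -23 file1 file2

to print the lines in file1 that are not in file2 (comm expects both files to be sorted; -2 and -3 suppress the lines unique to file2 and the lines common to both),

comm -23 file1 file2 | wc -l

to count them, and similarly

comm -13 ...

for the lines that are only in file2?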
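The same idea (lines present in one file but not the other, ignoring position) can be sketched in Python with a multiset comparison. This is an assumption-laden illustration rather than a replacement for `comm`: it does not need sorted input, it keeps all lines in memory, and the name `count_unique_lines` is hypothetical.

```python
from collections import Counter

def count_unique_lines(lines1, lines2):
    """Return (only_in_first, only_in_second), respecting duplicate lines."""
    c1, c2 = Counter(lines1), Counter(lines2)
    # Counter subtraction keeps only positive counts, i.e. the surplus lines
    only_first = sum((c1 - c2).values())
    only_second = sum((c2 - c1).values())
    return only_first, only_second

# "a" and one of the two "b" lines exist only in the first list,
# "d" exists only in the second
print(count_unique_lines(["a", "b", "b", "c"], ["b", "c", "d"]))
```

Unlike diff, this treats the files as unordered collections, so a line that merely moved is not counted as a difference.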

Answer 4

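Since every differing output line starts with a < or > character, I would suggest this:

diff file1 file2 | grep '^[<>]' | wc -l

By keeping only < or only > in the bracket expression, you can count the differences from just one of the files.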

Answer 5

I believe the correct solution is in this answer, namely:

$ diff -y --suppress-common-lines a b | grep '^' | wc -l
1

Answer 6

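If you are dealing with files with analogous content that should be ordered the same way line for line (for example, CSV files describing similar things), and you would, for example, want to find 2 differences in the following files: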

File a:    File b:
min,max    min,max
1,5        2,5
3,4        3,4
-2,10      -1,1

you could implement it in Python like this:

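from itertools import zip_longest

file1, file2 = "a.csv", "b.csv"  # placeholder paths for the two files

different_lines = 0
with open(file1) as a, open(file2) as b:
    # zip_longest pads the shorter file with None, so extra lines count too
    for line, other_line in zip_longest(a, b):
        if line != other_line:
            different_lines += 1
print(different_lines)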

Answer 7

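Here is a way to count any kind of difference between two files, using a regular expression to specify what counts as a unit of difference; here . is used, meaning any character except a newline: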

git diff --patience --word-diff=porcelain --word-diff-regex=. file1 file2 | pcre2grep -M "^@[\s\S]*" | pcre2grep -M --file-offsets "(^-.*\n)(^\+.*\n)?|(^\+.*\n)" | wc -l

From man git-diff:

--patience
           Generate a diff using the "patience diff" algorithm.
--word-diff[=<mode>]
           Show a word diff, using the <mode> to delimit changed words. By default, words are delimited by whitespace; see --word-diff-regex below.
           porcelain
               Use a special line-based format intended for script consumption. Added/removed/unchanged runs are printed in the usual unified diff
               format, starting with a +/-/` ` character at the beginning of the line and extending to the end of the line. Newlines in the input
               are represented by a tilde ~ on a line of its own.
--word-diff-regex=<regex>
           Use <regex> to decide what a word is, instead of considering runs of non-whitespace to be a word. Also implies --word-diff unless it
           was already enabled.
           Every non-overlapping match of the <regex> is considered a word. Anything between these matches is considered whitespace and ignored(!)
           for the purposes of finding differences. You may want to append |[^[:space:]] to your regular expression to make sure that it matches
           all non-whitespace characters. A match that contains a newline is silently truncated(!) at the newline.
           For example, --word-diff-regex=.  will treat each character as a word and, correspondingly, show differences character by character.

pcre2grep is part of the pcre2-utils package on Ubuntu 20.04.
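For comparison, a character-level count can also be sketched in Python with `difflib.SequenceMatcher`. This is not what the git pipeline above computes: it uses a different matching algorithm than patience diff, it can be slow on large inputs, and the function name `count_char_differences` is only illustrative.

```python
import difflib

def count_char_differences(text1, text2):
    """Return (deleted_chars, inserted_chars) between two strings."""
    # autojunk=False disables the popularity heuristic used on long inputs
    matcher = difflib.SequenceMatcher(None, text1, text2, autojunk=False)
    deleted = inserted = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            deleted += i2 - i1    # characters dropped from text1
            inserted += j2 - j1   # characters added in text2
    return deleted, inserted

# Replacing "1" with "2" is one deleted and one inserted character
print(count_char_differences("1,5", "2,5"))
```

Reporting deletions and insertions separately avoids deciding whether a replaced character is one difference or two.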

Answer 8

I would have liked to edit @Josh's or @John's answer, but the edit queue is full, so here it is:

diff -U 0 file1 file2 | tail -n +3 | grep -c '^@'

Why?

diff -U 0 file1 file2

outputs something like:

--- file1  (+ timestamp)
+++ file2  (+ timestamp)
@@ range information for first difference @@
+ some added
+ lines
@@ range info for second difference @@
- some
- removed
- lines
@@ range info for edit @@
- I changed
- this
+ into
+ these new
+ lines

For more about the range information, see this answer.

So:

```
tail -n +3
```
prints from the third line onward. In other words, it drops the 2 file-information lines.
```
grep -c '^@'
```
counts the lines starting with "@", i.e. the modified ranges.

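So the output is the count of differences, where a "difference" is taken to be a modified range.

But by difference I mean the count of modified lines!

Because, as pointed out in other answers, a modification to a single line shows up twice, once as a deletion (-) and once as an addition (+), it is better to count additions and deletions separately.

Here you go: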

diff -U 0 file1 file2 | tail -n +3 | perl -ne 'if (/^\+/) { $add +=1 }; if (/^-/) { $del += 1 }; END { if (!$add) { $add=0 }; if (!$del) { $del=0 }; print "+$add -$del\n"}'

What does the perl "one-liner" do?

if (/^\+/) { $add +=1 }; # counts the number of added lines, starting with +
if (/^-/) { $del += 1 }; # counts the number of deleted lines, starting with -
END {
  if (!$add) { $add=0 }; # set count to 0 if no added line
  if (!$del) { $del=0 }; # set count to 0 if no deleted line
  print "+$add -$del\n"  # print the count of added lines and the count of deleted lines
}

Example output:

+23 -10
