bash - Remove all Unicode spaces and replace them with normal spaces

Question

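I have a file with a lot of text, mixed with special space characters, namely Unicode spaces.

I need to replace all of these characters with the normal "space" character.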

Answer 1

Easy with perl:

perl -CSDA -plE 's/\s/ /g' file

But as @mklement0 correctly pointed out in the comments, this also matches \t (TAB). If that is a problem, you can use

perl -CSDA -plE 's/[^\S\t]/ /g'

Demo:

X            　X

The above contains:

U+00058 X LATIN CAPITAL LETTER X
U+01680   OGHAM SPACE MARK
U+02002   EN SPACE
U+02003   EM SPACE
U+02004   THREE-PER-EM SPACE
U+02005   FOUR-PER-EM SPACE
U+02006   SIX-PER-EM SPACE
U+02007   FIGURE SPACE
U+02008   PUNCTUATION SPACE
U+02009   THIN SPACE
U+0200A   HAIR SPACE
U+0202F   NARROW NO-BREAK SPACE
U+0205F   MEDIUM MATHEMATICAL SPACE
U+03000 　 IDEOGRAPHIC SPACE
U+00058 X LATIN CAPITAL LETTER X

Using:

perl -CSDA -plE 's/\s/_/g'  <<<"X            　X"

(note that the replacement is an underscore here, for demonstration) prints

X_____________X
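For comparison, the same tab-preserving substitution can be sketched in Python, whose re module also treats \s as Unicode-aware for str patterns. The helper name normalize_line is made up for this illustration. Unlike perl -l, nothing strips the line terminator here, so the sketch expects a line without its trailing newline.

```python
import re

# Same double-negative class as in the perl one-liner:
# "not non-whitespace and not TAB" = any whitespace except TAB.
NON_TAB_SPACE = re.compile(r"[^\S\t]")


def normalize_line(line: str, replacement: str = " ") -> str:
    """Replace each whitespace character except TAB (hypothetical helper)."""
    return NON_TAB_SPACE.sub(replacement, line)


if __name__ == "__main__":
    demo = "X\u1680\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a\u202f\u205f\u3000X"
    print(normalize_line(demo, "_"))       # the 13 spaces become underscores
    print(normalize_line("a\tb\u00a0c"))   # TAB survives, the no-break space does not
```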

It can also be done in pure bash

LC_ALL=en_US.UTF-8 spaces=$(printf "%b" "\U00A0\U1680\U180E\U2000\U2001\U2002\U2003\U2004\U2005\U2006\U2007\U2008\U2009\U200A\U200B\U202F\U205F\U3000\UFEFF")

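# IFS= keeps leading/trailing whitespace; printf avoids echo's option and escape quirks
while IFS= read -r line; do
    printf '%s\n' "${line//[$spaces]/ }"
done < file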

The LC_ALL=en_US.UTF-8 is only needed if your default locale isn't UTF-8 (which it should be, if you work with UTF-8 text) :) Demo:

str="X            　X"
echo "${str//[$spaces]/_}"

again prints:

X_____________X

Likewise with sed - prepare the variable $spaces as above and use:

sed "s/[$spaces]/ /g" file
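The explicit character list used by the bash and sed variants carries over to other languages as well. Below is a minimal Python sketch, assuming the same 19 code points as in $spaces; the names SPACE_CODE_POINTS, TO_SPACE and replace_listed_spaces are invented here. Unlike \s, the list also covers the zero-width characters U+200B and U+FEFF, and it leaves tabs and newlines alone.

```python
# Code points copied from the $spaces variable (U+2000..U+200B written as a range).
SPACE_CODE_POINTS = [
    0x00A0, 0x1680, 0x180E,
    *range(0x2000, 0x200C),
    0x202F, 0x205F, 0x3000, 0xFEFF,
]

# str.translate accepts a dict that maps code points to replacement strings.
TO_SPACE = {cp: " " for cp in SPACE_CODE_POINTS}


def replace_listed_spaces(text: str, table=TO_SPACE) -> str:
    """Replace every listed character with a normal space (hypothetical helper)."""
    return text.translate(table)


if __name__ == "__main__":
    print(replace_listed_spaces("a\u00a0b\u3000c\td"))  # TAB is not in the list
```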

Edit - because of some strange copy/paste (or locale) issues:

xxd -ps <<<"$spaces"

shows

c2a0e19a80e1a08ee28080e28081e28082e28083e28084e28085e28086e2
8087e28088e28089e2808ae2808be280afe2819fe38080efbbbf0a

md5 digest (two different programs)

md5sum <<<"$spaces"
LC_ALL=C md5 <<<"$spaces"

both print the same md5

35cf5e1d7a5f512031d18f3d2ec6612f  -
35cf5e1d7a5f512031d18f3d2ec6612f
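If the hex dump on your machine differs, it can help to rebuild the expected bytes independently of the shell. This Python sketch assumes the same code-point list as $spaces; the variable names are illustrative. The trailing 0a is the newline that the <<< here-string appends.

```python
# Code points from the $spaces variable; U+2000..U+200B written as a range.
code_points = [0x00A0, 0x1680, 0x180E, *range(0x2000, 0x200C),
               0x202F, 0x205F, 0x3000, 0xFEFF]
spaces = "".join(chr(cp) for cp in code_points)

# A here-string (<<<) appends a newline, so add one before encoding.
hex_dump = (spaces + "\n").encode("utf-8").hex()

if __name__ == "__main__":
    # Print 60 hex digits (30 bytes) per line, the way xxd -ps does.
    for i in range(0, len(hex_dump), 60):
        print(hex_dump[i:i + 60])
```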

Answer 2

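The characters can be identified by their Unicode code points; unfortunately, sed 's/[[:space:]]\+/\ /g' will not do the job.

Reworking another SO answer, we list all the Unicode code points, store them in a variable, and then use sed to do the replacement (note that with -i.bak we also keep a copy of the original file)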

 CHARS=$(printf "%b" "\U00A0\U1680\U180E\U2000\U2001\U2002\U2003\U2004\U2005\U2006\U2007\U2008\U2009\U200A\U200B\U202F\U205F\U3000\UFEFF")

 sed -i.bak 's/['"$CHARS"']/ /g' /tmp/file_to_edit.txt
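The backup-then-edit behaviour of sed -i.bak can also be sketched in Python. This is only an illustration under stated assumptions: the function name fix_file_in_place is invented, the file is assumed to be UTF-8, and the whole file is read into memory.

```python
import shutil
from pathlib import Path

# Same code points as in the CHARS variable.
CHARS = "".join(map(chr, [0x00A0, 0x1680, 0x180E, *range(0x2000, 0x200C),
                          0x202F, 0x205F, 0x3000, 0xFEFF]))
TO_SPACE = str.maketrans(CHARS, " " * len(CHARS))


def fix_file_in_place(path: str, backup_suffix: str = ".bak") -> None:
    """Keep a copy at path + backup_suffix, then rewrite path (hypothetical helper)."""
    target = Path(path)
    shutil.copy2(target, str(target) + backup_suffix)  # like sed -i.bak
    # newline="" leaves the file's original line endings untouched
    with open(target, encoding="utf-8", newline="") as f:
        text = f.read()
    with open(target, "w", encoding="utf-8", newline="") as f:
        f.write(text.translate(TO_SPACE))
```

Calling fix_file_in_place("/tmp/file_to_edit.txt") would leave the original content in /tmp/file_to_edit.txt.bak.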

Answer 3

If you face this task repeatedly, consider installing nws (normalize whitespace), a utility (mine) that simplifies the task.

nws --ascii file # convert non-ASCII whitespace and punctuation to ASCII

nws --ascii -i file  # update file in place

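nws's --ascii mode:

Transliterates (non-ASCII) Unicode whitespace (e.g. a no-break space) and punctuation such as curly quotes (“”), en dashes (–), ... to their closest ASCII equivalents,
while preserving any other Unicode characters.

This mode is useful for source-code samples that have been formatted for display with typographic quotes, dashes, etc., which usually makes the code indigestible to compilers/interpreters.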
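To illustrate what such a transliteration involves, here is a small Python sketch. It is not how nws is implemented; the mapping table and the name to_ascii_punctuation are assumptions chosen for this example, and the table covers only a handful of characters.

```python
# Illustrative mapping only; a real tool would cover many more characters.
ASCII_EQUIVALENTS = str.maketrans({
    "\u00a0": " ",    # no-break space
    "\u2003": " ",    # em space
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2013": "-",    # en dash
    "\u2014": "--",   # em dash
    "\u2026": "...",  # horizontal ellipsis
})


def to_ascii_punctuation(text: str) -> str:
    """Transliterate the mapped characters; leave all other text unchanged."""
    return text.translate(ASCII_EQUIVALENTS)


if __name__ == "__main__":
    print(to_ascii_punctuation("print(\u201chello\u201d)\u00a0# caf\u00e9 \u2013 ok"))
```

Characters outside the table, such as the é in the sample, pass through untouched, which matches the "preserve any other Unicode characters" behaviour described above.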

Installation of `nws` from the npm registry (Linux and macOS)

Note: Even if you don't use Node.js, its package manager, npm, works across platforms and is easy to install; try curl -L https://git.io/n-install | bash

With Node.js installed, install as follows:

[sudo] npm install nws-cli -g

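Note:

Whether you need sudo depends on how you installed Node.js and whether you have changed permissions later; if you get an EACCES error, try again with sudo.
The -g ensures a global installation and is needed to put nws-cli in your system's $PATH.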

Manual installation (any Unix platform with `bash`)

Download this bash script as nws.
Make it executable with chmod +x nws.
Move it or symlink it to a folder in your $PATH, such as /usr/local/bin (macOS) or /usr/bin (Linux).

Optional reading: POSIX character classes `[:space:]` and `[:blank:]` and non-ASCII Unicode whitespace

In UTF-8-based locales, POSIX-compatible utilities should make the POSIX character classes [:space:] and [:blank:] match (non-ASCII) Unicode whitespace.

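This relies on the locale's charmap correctly classifying Unicode characters based on the POSIX-mandated character classifications, which directly correspond to character classes such as [:space:] that are available in patterns and regular expressions.

There are two pitfalls:

Unicode is an evolving standard (version 9 as of this writing); your platform's UTF-8 charmap may not be current. For instance, on Ubuntu 16.04 the following characters are not properly classified and therefore not matched by [:space:] / [:blank:]: no-break space, figure space, narrow no-break space, next line
Utilities should use the active locale's charmap, but there are regrettable exceptions; the following utilities are not Unicode-aware (there may be more): among GNU utilities (as of coreutils v8.27): cut, tr; Mawk awk, which is the default awk implementation on Ubuntu, for instance. Among BSD / macOS utilities (as of macOS 10.12): sed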
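Whether a given character counts as whitespace depends on the character database consulted, which is the root of both pitfalls. As a cross-check outside the POSIX tools, this Python sketch reports how Python's own Unicode tables classify a few of the characters discussed. It reflects the Unicode version bundled with your Python, not your platform's locale charmap; the names CANDIDATES and classify are assumptions.

```python
import unicodedata

# The four characters the first pitfall names (no-break space, figure space,
# narrow no-break space, next line), plus two zero-width ones from the explicit lists.
CANDIDATES = ["\u00a0", "\u2007", "\u202f", "\u0085", "\u200b", "\ufeff"]


def classify(ch: str):
    """Return (code point label, general category, str.isspace() result)."""
    return (f"U+{ord(ch):04X}", unicodedata.category(ch), ch.isspace())


if __name__ == "__main__":
    for ch in CANDIDATES:
        print(*classify(ch))
```

In Python, the zero-width characters U+200B and U+FEFF are format characters (category Cf) rather than whitespace, which is why the answers that want them gone list them explicitly.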

Therefore, on a platform with a current UTF-8 charmap, the following sed command should work, but note that [:space:] also matches tabs and therefore replaces them with a single space too:

sed 's/[[:space:]]/ /g' file

Answer 4

If you use python3, this worked for me. It's makeshift code, but it does work.

FILENAME = 'File.txt'
OUTPUTNAME = 'Fixed.txt'
# Copy FILENAME to OUTPUTNAME, replacing each EM SPACE (U+2003) with a normal space.
with open(FILENAME, 'r', encoding='utf8') as f, open(OUTPUTNAME, 'w', encoding='utf8') as o:
    for line in f:
        o.write(line.replace('\u2003', ' '))
