How do I handle UTF-8 emoji in sed on Cygwin?

Question

I have seen many threads about escaping and replacing special characters in sed, but none of them helped me.

I need to run this sed command on a file:

sed -i "s/This[^\|]\+/& (cool) /g" "file.txt"

For reasons I don't understand, when it is applied to this test case:

This is my funny 🎺 char and this | char is the char after which  i want to stop my job.

...it turns it into:

This is my funny 🎠(cool) ڠchar and this | char is the char after which  i want to stop my job.

...instead of:

This is my funny 🎺 char and this  (cool) | char is the char after which  i want to stop my job.

Can someone tell me how to handle this case?

Note: the file is UTF-8 encoded, my Cygwin is set to UTF-8, and the sed command itself lives in a UTF-8 encoded ".sh" file.
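A quick way to double-check the encoding claim is to dump the raw bytes of the emoji. In UTF-8, U+1F3BA (the trumpet in the test case) is the four bytes f0 9f 8e ba. The sketch below is only illustrative: it builds the character with printf octal escapes instead of reading file.txt, and the variable name hex is made up for the example.

```shell
#!/bin/sh
# U+1F3BA written as raw UTF-8 bytes (octal escapes keep this script pure ASCII).
# od prints one hex value per byte; tr strips the spaces and the newline.
hex=$(printf '\360\237\216\272' | od -An -tx1 | tr -d ' \n')
printf '%s\n' "$hex"
```

Running the same od command on the real file and on the sed output makes it possible to see exactly which bytes were changed.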

Answer 1

The bug seems to be caused by my using sed on Cygwin, because the same command works fine on GNU/Linux.

Thanks for your attention. I hope this thread helps another Cygwin user.

Answer 2

I can confirm this issue with sed (v4.8) on Cygwin (v3.3.4). I'm not quite sure what is going on here, but I found that setting a UTF-16-compatible system locale makes it work:

$ python3 -c "print('This is my funny {} char and this | char is the char after which i want to stop my job.'.format(chr(0x1f3ba)))" | \
  sed 's/This[^|]*/&(cool) /g'
This is my funny (cool)  char and this | char is the char after which i want to stop my job.

$  python3 -c "print('This is my funny {} char and this | char is the char after which i want to stop my job.'.format(chr(0x1f3ba)))" | \
  LANG=C.UTF16 sed 's/This[^|]*/&(cool) /g'
This is my funny 🎺 char and this (cool) | char is the char after which i want to stop my job.
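Another option that may be worth trying, though it is not one of the fixes reported in this thread, is to run sed in the C locale so that it matches byte by byte. This relies on the assumption that | is a single ASCII byte that never occurs inside a UTF-8 multibyte sequence, so [^|]* can safely run across the emoji's bytes. The sketch below behaves this way with GNU sed on Linux; whether Cygwin's sed treats LC_ALL=C the same way is untested here.

```shell
#!/bin/sh
# Test line with U+1F3BA given as raw UTF-8 bytes in octal.
input=$(printf 'This is my funny \360\237\216\272 char and this | char')

# Assumption: with LC_ALL=C, sed works on single bytes, so it never has to
# decode the emoji as one wide character.
result=$(printf '%s\n' "$input" | LC_ALL=C sed 's/This[^|]*/&(cool) /g')
printf '%s\n' "$result"
```

The trade-off is that in the C locale a pattern such as . matches one byte rather than one character, so this only suits expressions whose delimiters are plain ASCII.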

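By the way, always prefer single quotes when using sed, so that the shell does not expand special characters in the script before sed sees it.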
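The effect of the quoting style can be seen by printing what the shell would hand to sed. This is a small sketch; the variable names demo_dir, single and double are hypothetical and exist only for the illustration.

```shell
#!/bin/sh
demo_dir=/tmp/demo

# Single quotes: the shell passes the text through untouched.
single='s/$demo_dir/X/'
# Double quotes: the shell substitutes $demo_dir before sed ever runs.
double="s/$demo_dir/X/"

printf '%s\n' "$single"   # s/$demo_dir/X/
printf '%s\n' "$double"   # s//tmp/demo/X/
```

In the double-quoted case the expanded slashes also break the s command's delimiters, which is one more reason to default to single quotes.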

Answer 3

I ran into a similar situation, but no file is involved, so file encoding is not the cause.

< 7:36a> <10%> C:\>echo gOlIaTh |:u8 sed -e 's/goliath/🦇GOLIATH🦇/gi'
/cygdrive/c/cygwin/bin/sed: -e expression #1, char 1: unknown command: `''

< 7:37a> <15%> C:\>echo gOlIaTh |:u8 sed -e 's/goliath/GOLIATH/gi'
GOLIATH
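If the emoji is being damaged while the command line is passed to sed, one thing to try is keeping the expression out of the command line entirely by putting it in a script file and using sed -f. This is a sketch under that assumption and has not been verified against the cmd.exe setup shown above. It uses mktemp for the temporary file, writes the bat emoji (U+1F987) as raw UTF-8 bytes, and keeps the i flag from the original command, which is a GNU sed extension.

```shell
#!/bin/sh
# Temporary file holding the sed script; removed when the shell exits.
script=$(mktemp)
trap 'rm -f "$script"' EXIT

# U+1F987 as UTF-8 bytes in octal, on both sides of the replacement text.
printf 's/goliath/\360\237\246\207GOLIATH\360\237\246\207/gi\n' > "$script"

# sed reads the expression from the file, not from its argument list.
result=$(echo gOlIaTh | sed -f "$script")
printf '%s\n' "$result"
```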
