URI.unescape crashes when trying to convert "%C3%9Fą" to "ßą"

Question

I'm using URI.unescape to unescape text, and unfortunately I've run into a strange error:

 # encoding: utf-8
 require('uri')
 URI.unescape("%C3%9Fą")

The result is

 C:/Ruby193/lib/ruby/1.9.1/uri/common.rb:331:in `gsub': incompatible character encodings: ASCII-8BIT and UTF-8 (Encoding::CompatibilityError)
    from C:/Ruby193/lib/ruby/1.9.1/uri/common.rb:331:in `unescape'
    from C:/Ruby193/lib/ruby/1.9.1/uri/common.rb:649:in `unescape'
    from exe/fail.rb:3:in `<main>'

Why?

Answer 1

The implementation of URI.unescape is broken for non-ASCII input. The 1.9.3 version looks like this:

def unescape(str, escaped = @regexp[:ESCAPED])
  str.gsub(escaped) { [$&[1, 2].hex].pack('C') }.force_encoding(str.encoding)
end

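The regex being used is /%[a-fA-F\d]{2}/. So it goes through the string looking for a percent sign followed by two hexadecimal digits; $& in the block will be the matched text ('%C3' for example) and $&[1,2] is the matched text without the leading percent sign ('C3'). Then we call String#hex to convert that hexadecimal number to a Fixnum (195) and wrap it in an Array ([195]) so that we can use Array#pack to do the byte mangling for us. The problem is that pack gives us a single binary byte: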

> puts [195].pack('C').encoding
ASCII-8BIT

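The ASCII-8BIT encoding is also known as "binary" (i.e. plain bytes with no particular encoding). The block then returns that byte, String#gsub tries to insert it into the UTF-8 encoded copy of str that it is building, and you get your error:

incompatible character encodings: ASCII-8BIT and UTF-8 (Encoding::CompatibilityError)

because you can't (in general) just stuff binary bytes into a UTF-8 string. You can often get away with it, though: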

URI.unescape("%C3%9F")         # Works
URI.unescape("%C3µ")           # Fails
URI.unescape("µ")              # Works, but nothing to gsub here
URI.unescape("%C3%9Fµ")        # Fails
URI.unescape("%C3%9Fpancakes") # Works

Things start falling apart once you start mixing non-ASCII data into your URL-encoded string.
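The works/fails pattern above is consistent with Ruby's general encoding-compatibility rule: a binary string and a UTF-8 string can be combined as long as at least one of them contains only 7-bit ASCII. The sketch below illustrates that rule with plain concatenation; it is not a trace through gsub's internals, and the variable names are made up for illustration.

```ruby
# A binary (ASCII-8BIT) string holding one high byte, as Array#pack('C') produces.
high_byte = [0xC3].pack('C')

ascii_only = 'pancakes'  # UTF-8, but only 7-bit characters
non_ascii  = "\u00B5"    # UTF-8 "µ", a multi-byte character

# Encoding.compatible? returns the encoding a combination would have, or nil.
ok_pair  = Encoding.compatible?(high_byte, ascii_only) # an Encoding: the pair can be joined
bad_pair = Encoding.compatible?(high_byte, non_ascii)  # nil: joining would raise

joined = high_byte + ascii_only # fine, the result is a binary string

error = nil
begin
  high_byte + non_ascii
rescue Encoding::CompatibilityError => e
  error = e # same error class as in the question's backtrace
end
```

Under this reading, "%C3%9Fpancakes" works because everything outside the escapes is plain ASCII, while "%C3%9Fµ" fails because a raw high byte meets a real multi-byte UTF-8 character.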

One simple fix is to switch the string to binary before trying to decode it:

def unescape(str, escaped = @regexp[:ESCAPED])
  encoding = str.encoding
  str = str.dup.force_encoding('binary')
  str.gsub(escaped) { [$&[1, 2].hex].pack('C') }.force_encoding(encoding)
end
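To try this fix without patching the URI library, the same idea can be written as a standalone method. This is a minimal sketch: the name binary_safe_unescape is hypothetical, and the default pattern is simply the regex quoted earlier in this answer.

```ruby
# Hypothetical standalone helper, not part of the URI library.
def binary_safe_unescape(str, escaped = /%[a-fA-F\d]{2}/)
  encoding = str.encoding
  # Work on a binary copy so decoded bytes never clash with UTF-8 text.
  bytes = str.dup.force_encoding(Encoding::BINARY)
  # Replace each %XX with the single byte it encodes, then restore the encoding.
  bytes.gsub(escaped) { |m| [m[1, 2].hex].pack('C') }.force_encoding(encoding)
end

binary_safe_unescape("%C3%9Fą") # => "ßą"
```

Because the substitution happens entirely on bytes, it no longer matters whether the text around the escapes is ASCII.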

Another option would be to push the force_encoding into the block:

def unescape(str, escaped = @regexp[:ESCAPED])
  # Tag each decoded byte with the source string's encoding before gsub inserts it.
  str.gsub(escaped) { [$&[1, 2].hex].pack('C').force_encoding(str.encoding) }
end

I'm not sure why the gsub fails in some cases but succeeds in others.

Answer 2

I don't know why, but you can use the CGI.unescape method instead:

# encoding: utf-8
require 'cgi'
CGI.unescape("%C3%9Fą")

Answer 3

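To expand on Vasiliy's answer, which suggests using CGI.unescape:

As of Ruby 2.5.0, URI.unescape is obsolete.

See https://ruby-doc.org/stdlib-2.5.0/libdoc/uri/rdoc/URI/Escape.html#method-i-unescape.

"This method is obsolete and should not be used. Instead, use CGI.unescape, URI.decode_www_form or URI.decode_www_form_component depending on your specific use case."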
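As an illustration of one of the replacements named in that note, here is a short sketch using URI.decode_www_form_component on the string from the question. It assumes a reasonably recent Ruby. Unlike URI.unescape, this method also turns "+" into a space, which matters if the input is not form-encoded.

```ruby
require 'uri'

# Decodes percent-escapes and returns a UTF-8 string by default.
decoded = URI.decode_www_form_component("%C3%9Fą") # => "ßą"

# Form decoding also maps "+" to a space.
spaced = URI.decode_www_form_component("a+b%20c")  # => "a b c"
```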
