Is there a way to remove the BOM from a UTF-8 encoded file?

Question

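I know that all of my JSON files are encoded as UTF-8, but the data entry person who edits them saved them as UTF-8 with a BOM.

When I run my Ruby script to parse the JSON, it fails with an error. I don't want to manually open 58+ JSON files and convert each one to UTF-8 without a BOM.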

Answer 1

With Ruby >= 1.9.2 you can use the mode r:bom|utf-8.

This should work (I haven't tested it in combination with JSON):

require 'json'

json = nil # define the variable outside the block to keep the data
File.open('file.txt', "r:bom|utf-8"){|file|
  json = JSON.parse(file.read)
}

It doesn't matter whether the file has a BOM or not.
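To make the "with or without a BOM" claim concrete, here is a minimal sketch that parses two temporary files, one saved with a BOM and one without. The helper name parse_json_skipping_bom and the file names are made up for this illustration; only the r:bom|utf-8 mode comes from the answer above.

```ruby
require 'json'
require 'tmpdir'

# Hypothetical helper (name chosen for this example): parse a JSON file,
# skipping a leading UTF-8 BOM if the file has one.
def parse_json_skipping_bom(path)
  File.open(path, 'r:bom|utf-8') { |file| JSON.parse(file.read) }
end

Dir.mktmpdir do |dir|
  with_bom    = File.join(dir, 'with_bom.json')
  without_bom = File.join(dir, 'without_bom.json')

  # Write the raw bytes so that we control exactly whether a BOM is present.
  File.binwrite(with_bom, "\xEF\xBB\xBF".b + '{"a": 1}')
  File.binwrite(without_bom, '{"a": 1}')

  # Both calls should print the same parsed hash.
  p parse_json_skipping_bom(with_bom)
  p parse_json_skipping_bom(without_bom)
end
```

The same helper could be called in a loop over all 58+ files, so none of them needs to be converted by hand.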

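Andrew pointed out that File#rewind cannot be used together with the BOM mode, because it goes back to position 0, in front of the BOM.

If you need rewind functionality, you have to remember the position after opening the file and replace rewind with pos=: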

# Prepare a test file
File.open('file.txt', "w:utf-8"){|f|
  f << "\xEF\xBB\xBF" # add BOM
  f << 'some content'
}

# Read the file, skipping the BOM if present
File.open('file.txt', "r:bom|utf-8"){|f|
  pos = f.pos         # position just after the BOM (0 if there is none)
  p content = f.read  # read and print the file content
  f.pos = pos         # f.rewind would go back to pos 0, before the BOM
  p content = f.read  # re-read and print the file content
}

Answer 2

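So, the solution was to search for the BOM and replace it using gsub! I forced the string's encoding to UTF-8 and also forced the pattern's encoding to UTF-8.

I arrived at the solution by looking at http://self.d-struct.org/195/howto-remove-byte-order-mark-with-ruby-and-iconv and http://blog.grayproducts.net/articles/ruby_19s_string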

require 'json'

def read_json_file(file_name, index)
  # File.read closes the file for us; tag the raw content as UTF-8
  content = File.read("#{file_name}\\game.json").force_encoding("UTF-8")

  # Remove the BOM (bytes EF BB BF)
  content.gsub!("\xEF\xBB\xBF".force_encoding("UTF-8"), '')

  json = JSON.parse(content)

  print json
end

Answer 3

You can also specify the encoding with the File.read and CSV.read methods, without specifying the read mode.

File.read(path, :encoding => 'bom|utf-8')
CSV.read(path, :encoding => 'bom|utf-8')

Answer 4

The "bom|utf-8" encoding works well if you only read the file once, but it fails if you call File#rewind, as I do in my code. To work around this, I did the following:

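def ignore_bom
  return unless @file.pos == 0

  first_char = @file.getc  # consumes the BOM if there is one
  # IO#ungetc needs the character to push back; put back anything that is not a BOM
  @file.ungetc(first_char) if first_char && first_char != "\xEF\xBB\xBF".force_encoding("UTF-8")
end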

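This seems to work well. I'm not sure whether there are other similar characters to watch out for, but they could easily be built into this method, which can be called whenever you rewind or open the file.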
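As a standalone illustration of this idea, the sketch below wraps the rewind and the BOM check in one function that takes the IO object as an argument. The name rewind_skipping_bom and the constant BOM_CHAR are invented for this example, and it assumes the file was opened with a UTF-8 external encoding so that getc returns the BOM as a single character.

```ruby
require 'tmpdir'

BOM_CHAR = "\uFEFF" # U+FEFF, encoded in UTF-8 as the bytes EF BB BF

# Hypothetical helper: rewind to position 0, then step over a leading BOM.
# Assumes `io` reads UTF-8, so getc yields whole characters.
def rewind_skipping_bom(io)
  io.rewind
  first_char = io.getc
  # Push the character back unless it was the BOM (or the file was empty).
  io.ungetc(first_char) if first_char && first_char != BOM_CHAR
  io
end

Dir.mktmpdir do |dir|
  path = File.join(dir, 'file.txt')
  File.binwrite(path, "\xEF\xBB\xBF".b + 'some content')

  File.open(path, 'r:bom|utf-8') do |f|
    first_pass = f.read
    rewind_skipping_bom(f)
    second_pass = f.read
    p first_pass == second_pass # both passes should see the same text
  end
end
```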

Answer 5

Server-side cleanup of the UTF-8 BOM bytes that worked for me:

csv_text.gsub!("\xEF\xBB\xBF".force_encoding(Encoding::BINARY), '')
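One thing to watch: the line above suits a string that is tagged as binary, which is an assumption about how csv_text was read. If the string is tagged as UTF-8 instead, I would expect mixing it with a binary pattern to raise an Encoding::CompatibilityError. A possible variant that compares bytes and keeps the string's encoding tag is sketched below; strip_leading_bom and UTF8_BOM_BYTES are hypothetical names, and unlike gsub! it only removes a BOM at the very start.

```ruby
UTF8_BOM_BYTES = "\xEF\xBB\xBF".b

# Hypothetical helper: drop a leading UTF-8 BOM by comparing raw bytes,
# so it does not matter whether `text` is tagged as UTF-8 or as binary.
def strip_leading_bom(text)
  return text unless text.byteslice(0, 3).b == UTF8_BOM_BYTES

  # byteslice keeps the encoding tag of the original string.
  text.byteslice(3..-1)
end

utf8_text   = "\uFEFFname,city\nZo\u00EB,K\u00F6ln\n"
binary_text = utf8_text.b

p strip_leading_bom(utf8_text).encoding
p strip_leading_bom(binary_text).encoding
```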

Answer 6

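I just implemented this for the smarter_csv gem and wanted to share it in case anyone else runs into this problem.

The problem is to remove the byte sequence regardless of the string's encoding. The solution is to use the bytes and byteslice methods of the String class.

See: https://ruby-doc.org/core-3.1.1/String.html#method-i-bytes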

    UTF_8_BOM = %w[ef bb bf].freeze

    def remove_bom(str)
      str_as_hex = str.bytes.map{|x| x.to_s(16)}
      return str.byteslice(3..-1) if str_as_hex[0..2] == UTF_8_BOM

      str
    end
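The same byte-level idea can be applied to the files on disk, which is what the original question asks for. The following is only a sketch under assumptions: strip_bom_from_files, BOM_BYTES and the glob pattern are names chosen for this example, and it overwrites files in place, so try it on a copy first.

```ruby
require 'tmpdir'

BOM_BYTES = "\xEF\xBB\xBF".b

# Hypothetical helper: rewrite every file matching `pattern` without its
# leading UTF-8 BOM. Returns the paths that were actually changed.
def strip_bom_from_files(pattern)
  Dir.glob(pattern).select do |path|
    raw = File.binread(path) # raw bytes, no encoding conversion
    next false unless raw.start_with?(BOM_BYTES)

    File.binwrite(path, raw.byteslice(BOM_BYTES.bytesize..-1))
    true
  end
end

Dir.mktmpdir do |dir|
  File.binwrite(File.join(dir, 'a.json'), BOM_BYTES + '{"a": 1}')
  File.binwrite(File.join(dir, 'b.json'), '{"b": 2}')

  changed = strip_bom_from_files(File.join(dir, '*.json'))
  puts "Rewrote #{changed.size} file(s)"
end
```

Reading and writing in binary mode avoids any newline or encoding conversion, so everything after the BOM is left byte-for-byte identical.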

Answer 7

This worked for me:

data.delete("\uFEFF")
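Note that String#delete removes every U+FEFF character in the string, not just a leading one. If only the BOM at the start should go, String#delete_prefix (Ruby 2.5 and later) is a possible alternative. The sketch below compares the two; strip_leading_bom_char is a hypothetical name, and the input is assumed to be tagged as UTF-8.

```ruby
# Hypothetical helper: remove U+FEFF only when it is the first character.
# Assumes `data` is a UTF-8 string.
def strip_leading_bom_char(data)
  data.delete_prefix("\uFEFF")
end

sample = "\uFEFFfirst\uFEFFsecond" # BOM at the start plus one U+FEFF in the middle

p sample.delete("\uFEFF").length        # every U+FEFF removed
p strip_leading_bom_char(sample).length # only the leading one removed
```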
