Windows-1252 encoding to UTF-8

Question

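I am using legacy software to collect acoustic data from a manufacturing process. The encoding of the files it generates is unknown to me and to every application I have used to open them. I also tried converting the file from latin1 to UTF-8 with Python, with no luck.

Can anyone suggest an alternative way to convert this into something sensible? Or at least help me confirm what encoding I am dealing with? Thanks a lot!

The output should be just numbers, ideally separated into columns and rows, but any suggestions would be appreciated.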

Answer 1

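First, we need to determine the charset of the file. We can use `chardet` (a Python library) or `file -bi $file` (the Linux `file` command) for this. In practice, I observed that `chardet` takes slightly longer than the `file` command. I also run my code in a Linux container, so I don't have to worry about whether the `file` command is available. For those reasons, I call the `file` command through Python's `subprocess` library to get the charset.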
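Whichever detector is used, its answer is a guess. As a cross-check, one can try decoding the raw bytes with a few candidate encodings and see which ones succeed. The sketch below is only an illustration: the helper name `try_decodings` and the candidate list are assumptions, not part of the original answer. Note that `latin-1` accepts every byte sequence, so a successful decode proves little for it, whereas `utf-8` and `cp1252` reject some inputs.

```python
def try_decodings(data: bytes, candidates: list[str]) -> list[str]:
    """Return the candidate encodings that decode `data` without error."""
    ok = []
    for enc in candidates:
        try:
            data.decode(enc)  # strict mode raises on invalid bytes
        except UnicodeDecodeError:
            continue
        ok.append(enc)
    return ok


# 0x93/0x94 are curly quotes in cp1252 but are not valid UTF-8 on their own.
sample = b"\x93spectrum\x94"
print(try_decodings(sample, ["utf-8", "cp1252", "latin-1"]))
```

If only `latin-1` survives, the file may not be text at all, which is worth ruling out before converting anything.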

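A function to run a Linux command from Python: `run_cmd` runs the given command and returns a tuple of `stdout` and `stderr` as strings.

```python
import subprocess


def run_cmd(cmd: str) -> tuple[str, str]:
    # shell=True hands the string to the shell, so callers must quote any
    # untrusted parts (file names, for example) themselves.
    process = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
        shell=True
    )
    std_out, std_err = process.communicate()
    return std_out, std_err
```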
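Because `run_cmd` passes a single string to the shell, file names containing spaces or shell metacharacters need quoting. A possible alternative, sketched here under the assumption that the command can be given as an argument list, is `subprocess.run` without a shell. The name `run_cmd_args` is hypothetical.

```python
import subprocess
import sys


def run_cmd_args(args: list[str]) -> tuple[str, str]:
    """Run a command given as an argument list and return (stdout, stderr)."""
    completed = subprocess.run(
        args,
        capture_output=True,  # collect stdout and stderr
        text=True,            # decode them to str
        check=False,          # leave error handling to the caller
    )
    return completed.stdout, completed.stderr


# Demonstrated with the Python interpreter itself so the example runs anywhere;
# for the charset lookup the list would be ["file", "-bi", path].
out, err = run_cmd_args([sys.executable, "-c", "print('hello world')"])
print(out.strip())
```

With an argument list, each element reaches the program unchanged, so no quoting step is needed.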

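Run the `file` command and process the result of `run_cmd`. For example, if the file is encoded as `us-ascii`, this returns `us-ascii` as the charset.

```python
import shlex


def get_charset(file: str) -> str | None:
    # Quote the path so spaces or shell metacharacters cannot break the command.
    charset_cmd = f"file -bi {shlex.quote(file)}"
    std_out, std_err = run_cmd(charset_cmd)
    # Typical output: "text/plain; charset=us-ascii"
    for result in std_out.strip("\n").split():
        if result.startswith('charset='):
            return result.split('charset=')[1]
    return None
```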
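`file` can report values that Python has no codec for, such as `binary` or `unknown-8bit`, and passing those on to a conversion step fails with a `LookupError`. A minimal sketch of a guard, assuming the MIME line has the usual `type; charset=name` shape; `parse_charset` and `to_python_codec` are hypothetical names introduced for illustration:

```python
import codecs
from typing import Optional


def parse_charset(mime_line: str) -> Optional[str]:
    """Extract the charset value from a line like 'text/plain; charset=us-ascii'."""
    for part in mime_line.strip().split(";"):
        key, _, value = part.strip().partition("=")
        if key == "charset" and value:
            return value
    return None


def to_python_codec(charset: Optional[str]) -> Optional[str]:
    """Return Python's canonical codec name, or None if Python has no such codec."""
    if charset is None:
        return None
    try:
        return codecs.lookup(charset).name
    except LookupError:
        return None


print(to_python_codec(parse_charset("text/plain; charset=us-ascii")))
print(to_python_codec(parse_charset("text/plain; charset=unknown-8bit")))
```

A `None` result signals that the detected charset cannot be used directly and that the file needs a closer look.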

With the detected charset, you can convert the file to `utf-8` or some other encoding.

```python
import logging
import os
import shutil

log = logging.getLogger(__name__)


def convert_charset(file: str, from_encoding: str, to_encoding: str, inplace=False):
    # Write next to the input as "<name>.utf8<ext>", e.g. data.csv -> data.utf8.csv
    name, ext = os.path.splitext(file)
    output_file = f"{name}.utf8{ext}"
    try:
        # newline='' keeps the original line endings untouched
        with open(file, 'r', encoding=from_encoding, newline='') as f_in:
            with open(output_file, 'w', encoding=to_encoding, newline='') as f_out:
                f_out.write(f_in.read())
        if inplace:
            shutil.move(output_file, file)
    except Exception:
        log.error(f"Error occurred while converting charset of file={file} from={from_encoding} to={to_encoding}")
        raise
```
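`f_in.read()` loads the whole file into memory, which may matter for long recordings. The following sketch copies the text in fixed-size chunks instead; `convert_in_chunks` and the chunk size are assumptions chosen for illustration, not part of the original answer. The demo writes a small Windows-1252 file to a temporary directory so that the byte-level effect is visible.

```python
import os
import shutil
import tempfile


def convert_in_chunks(src: str, dst: str, from_encoding: str,
                      to_encoding: str = "utf-8", chunk_chars: int = 64 * 1024) -> None:
    """Re-encode src into dst without holding the whole file in memory."""
    # newline="" disables newline translation so line endings are preserved.
    with open(src, "r", encoding=from_encoding, newline="") as f_in, \
            open(dst, "w", encoding=to_encoding, newline="") as f_out:
        shutil.copyfileobj(f_in, f_out, chunk_chars)


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "in.txt")
    dst = os.path.join(tmp, "out.txt")
    with open(src, "wb") as f:
        f.write(b"caf\xe9\r\n")          # 0xE9 is 'é' in Windows-1252
    convert_in_chunks(src, dst, "cp1252")
    with open(dst, "rb") as f:
        converted = f.read()
    print(converted)                     # 'é' becomes the two bytes 0xC3 0xA9
```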

Now use the functions above to convert your file.

```python
# `file` is the path of the file to convert.
charset = get_charset(file)
log.info(f"charset: {charset}")

convert_charset(file, charset, 'utf-8', inplace=True)
```

This converts the file to a `utf-8` encoded file.

Note: I use `log` to print messages. You can use `print` instead.

Answer 2

VSCode is fairly good at suggesting the correct encoding. Here is the complete list of Python's standard encodings: https://docs.python.org/3/library/codecs.html#standard-encodings
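Entries in that list usually have several aliases, and `codecs.lookup` shows which codec a given name resolves to. The short sketch below also illustrates why Latin-1 and Windows-1252 are easy to confuse: they agree except for bytes 0x80 to 0x9F, where Latin-1 has control characters and Windows-1252 has printable ones such as the euro sign. The sample bytes are made up for illustration.

```python
import codecs

# Different spellings resolve to the same codec.
for name in ("windows-1252", "cp1252", "latin1", "iso-8859-1"):
    print(name, "->", codecs.lookup(name).name)

raw = b"\x80 12.5"
as_cp1252 = raw.decode("cp1252")    # 0x80 is the euro sign here
as_latin1 = raw.decode("latin-1")   # 0x80 is an invisible C1 control character
print(repr(as_cp1252), repr(as_latin1))
```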

@Rahul's example did not work for me because some functions were missing. Here are some utility functions that may help:

```python
import glob
from chardet.universaldetector import UniversalDetector


def list_encodings(glob_pattern: str):
    """Print chardet's best guess for every file matching glob_pattern."""
    detector = UniversalDetector()
    for filename in glob.glob(glob_pattern):
        print(filename.ljust(60), end='')
        detector.reset()
        # Feed raw bytes line by line and stop once chardet is confident.
        with open(filename, 'rb') as f:
            for line in f:
                detector.feed(line)
                if detector.done:
                    break
        detector.close()
        print(detector.result)


def convert_to_utf8(file_path: str, out: str, from_enc: str):
    """Re-encode file_path from from_enc into a new UTF-8 file at out."""
    with open(file_path, 'r', encoding=from_enc) as inp, \
            open(out, 'w', encoding='utf-8') as outp:
        for line in inp:
            outp.write(line)


list_encodings('data/*.csv')
convert_to_utf8('data/weird_excel.csv', 'data/nice.csv', 'cp1252')
```

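Requires https://pypi.org/project/chardet/ to run. I also just noticed that giving Pandas the correct encoding worked, but reading then failed because of an incorrect delimiter argument, which is not obvious from the cryptic error `ParserError: Error tokenizing data. C error: Expected 1 fields in line 3, saw 2`.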
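One way to catch a wrong delimiter before handing the file to Pandas is to let the standard library's `csv.Sniffer` guess it from a decoded sample. This is a sketch only: `guess_delimiter`, the candidate delimiters, and the sample data are assumptions, and sniffing is heuristic, so the result is worth checking by eye.

```python
import csv


def guess_delimiter(sample: str, candidates: str = ";,\t|") -> str:
    """Guess the column delimiter of a text sample using csv.Sniffer."""
    dialect = csv.Sniffer().sniff(sample, delimiters=candidates)
    return dialect.delimiter


# Semicolon-separated data, as some spreadsheet exports produce.
sample = "time;level;freq\n0;12;440\n1;13;441\n2;11;439\n"
delimiter = guess_delimiter(sample)
print(repr(delimiter))
```

The guessed value could then be passed on, for example as `sep=delimiter` together with `encoding=` in `pandas.read_csv`.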
