I'm new to Python and, more generally, new to coding, so any help is much appreciated.
I have over 3000 text files in one directory, in a variety of encodings. I need to convert them all to a single encoding (e.g. UTF-8) for further NLP work. When I checked the file types from the shell, I identified the following encodings:
Algol 68 source text, ISO-8859 text, with very long lines
Algol 68 source text, Little-endian UTF-16 Unicode text, with very long lines
Algol 68 source text, Non-ISO extended-ASCII text, with very long lines
Algol 68 source text, Non-ISO extended-ASCII text, with very long lines, with LF, NEL line terminators
ASCII text
ASCII text, with very long lines
data
diff output text, ASCII text
ISO-8859 text, with very long lines
ISO-8859 text, with very long lines, with LF, NEL line terminators
Little-endian UTF-16 Unicode text, with very long lines
Non-ISO extended-ASCII text
Non-ISO extended-ASCII text, with very long lines
Non-ISO extended-ASCII text, with very long lines, with LF, NEL line terminators
UTF-8 Unicode (with BOM) text, with CRLF line terminators
UTF-8 Unicode (with BOM) text, with very long lines, with CRLF line terminators
UTF-8 Unicode text, with very long lines, with CRLF line terminators
Any ideas on how to convert text files with the encodings above into text files with UTF-8 encoding?
I ran into the same problem as you. I solved it in two steps.
The code is as follows:
import os, codecs
import chardet
First, use the chardet package to identify each file's encoding.
path = 'YourTxtDir'  # placeholder: the directory containing the text files

for text in os.listdir(path):
    txtPATH = os.path.join(path, text)
    with open(txtPATH, 'rb') as f:
        data = f.read()
    f_charInfo = chardet.detect(data)
    coding = str(f_charInfo['encoding'])
    print(coding)
Second, if a file's encoding is not UTF-8, rewrite it to the directory in UTF-8.
    if coding.lower() != 'utf-8':
        print(txtPATH)
        print(coding)
        with codecs.open(txtPATH, "r", coding) as sourceFile:
            contents = sourceFile.read()
        with codecs.open(txtPATH, "w", "utf-8") as targetFile:
            targetFile.write(contents)
Hope this helps! Thanks.
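If chardet is not available, a standard-library-only fallback (my own sketch, not part of the answer above) is to try a short list of candidate encodings in order. The candidate list and the helper name `to_utf8` are assumptions based on the encodings listed in the question:

```python
import codecs

# Candidate encodings, tried in order -- an assumption based on the `file`
# output in the question (UTF-8 with/without BOM, ISO-8859 / Windows-1252).
# latin-1 is last as a catch-all: it decodes any byte sequence.
CANDIDATES = ["utf-8-sig", "cp1252", "latin-1"]

def to_utf8(path, candidates=CANDIDATES):
    """Rewrite the file at `path` as UTF-8; return the encoding that worked."""
    with open(path, "rb") as f:
        raw = f.read()
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The UTF-16 files in the listing typically carry a BOM,
        # so decode them directly.
        text, used = raw.decode("utf-16"), "utf-16"
    else:
        for enc in candidates:
            try:
                text, used = raw.decode(enc), enc
                break
            except UnicodeDecodeError:
                continue
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    return used
```

Note that this guesses less accurately than chardet: cp1252 will happily decode many non-cp1252 byte sequences, so the candidate order matters.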
I ran into the same problem today and adapted Jessica20119's script to my own needs, making a few adjustments to how the files are converted.
import os
import chardet
import codecs
import zipfile
zip_path = 'YourZipFile'
temp_dir = 'temp_dir'
# Ensure temp_dir exists
os.makedirs(temp_dir, exist_ok=True)
def detect_encoding(file_path, chunk_size=1024):
    # Feed the file to chardet incrementally so large files
    # don't have to be read into memory all at once.
    detector = chardet.UniversalDetector()
    with open(file_path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            detector.feed(chunk)
            if detector.done:
                break
    detector.close()
    return detector.result['encoding']
with zipfile.ZipFile(zip_path, 'r') as zip_file:
    for file_name in zip_file.namelist():
        if file_name.endswith('/'):  # skip directory entries
            continue
        full_path = os.path.join(temp_dir, file_name)
        # Skip files that were already extracted and converted to UTF-8
        if os.path.exists(full_path):
            current_encoding = detect_encoding(full_path)
            if current_encoding and current_encoding.lower() == 'utf-8':
                print(f"{file_name} is already UTF-8 encoded and will not be processed again.")
                continue
        # Extract the file, then convert it in place if needed
        zip_file.extract(file_name, temp_dir)
        detected_encoding = detect_encoding(full_path)
        # detect_encoding may return None for binary data, so guard against it
        if detected_encoding and detected_encoding.lower() != 'utf-8':
            try:
                with codecs.open(full_path, "r", encoding=detected_encoding) as source_file:
                    contents = source_file.read()
                with codecs.open(full_path, "w", encoding='utf-8') as target_file:
                    target_file.write(contents)
                print(f"Converted {full_path} to UTF-8.")
            except Exception as e:
                print(f"Error processing {full_path}: {e}")
Hope it helps!
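After a run like the above, a quick sanity check (my own addition; `verify_utf8` is a hypothetical helper, with `temp_dir` as in the script) is to confirm that every extracted file now decodes cleanly as UTF-8:

```python
import os

def verify_utf8(root):
    """Return the list of files under `root` that fail to decode as UTF-8."""
    bad = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding="utf-8") as f:
                    f.read()
            except UnicodeDecodeError:
                bad.append(path)
    return bad
```

An empty return value (e.g. from `verify_utf8(temp_dir)`) means the conversion pass covered everything; any paths it returns are candidates for manual inspection.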