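I am currently trying to get a Python 3 program to do some manipulations on a text file filled with information, through the Spyder IDE/GUI. However, when trying to read the file, I get the following error: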
File "<ipython-input-13-d81e1333b8cd>", line 77, in <module>
parser(f)
File "<ipython-input-13-d81e1333b8cd>", line 18, in parser
data = infile.read()
File "C:\ProgramData\Anaconda3\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 29815: character maps to <undefined>
The code of the program is as follows:
import os
os.getcwd()
import glob
import re
import sqlite3
import csv

def parser(file):
    # Open a TXT file. Store all articles in a list. Each article is an item
    # of the list. Split articles based on the location of such string as
    # 'Document PRN0000020080617e46h00461'
    articles = []
    with open(file, 'r') as infile:
        data = infile.read()
    start = re.search(r'\n HD\n', data).start()
    for m in re.finditer(r'Document [a-zA-Z0-9]{25}\n', data):
        end = m.end()
        a = data[start:end].strip()
        a = '\n ' + a
        articles.append(a)
        start = end

    # In each article, find all used Intelligence Indexing field codes. Extract
    # content of each used field code, and write to a CSV file.
    # All field codes (order matters)
    fields = ['HD', 'CR', 'WC', 'PD', 'ET', 'SN', 'SC', 'ED', 'PG', 'LA', 'CY', 'LP',
              'TD', 'CT', 'RF', 'CO', 'IN', 'NS', 'RE', 'IPC', 'IPD', 'PUB', 'AN']
    for a in articles:
        used = [f for f in fields if re.search(r'\n ' + f + r'\n', a)]
        unused = [[i, f] for i, f in enumerate(fields) if not re.search(r'\n ' + f + r'\n', a)]
        fields_pos = []
        for f in used:
            f_m = re.search(r'\n ' + f + r'\n', a)
            f_pos = [f, f_m.start(), f_m.end()]
            fields_pos.append(f_pos)
        obs = []
        n = len(used)
        for i in range(0, n):
            used_f = fields_pos[i][0]
            start = fields_pos[i][2]
            if i < n - 1:
                end = fields_pos[i + 1][1]
            else:
                end = len(a)
            content = a[start:end].strip()
            obs.append(content)
        for f in unused:
            obs.insert(f[0], '')
        obs.insert(0, file.split('/')[-1].split('.')[0])  # insert Company ID, e.g., GVKEY
        # print(obs)
        cur.execute('''INSERT INTO articles
                       (id, hd, cr, wc, pd, et, sn, sc, ed, pg, la, cy, lp, td, ct, rf,
                       co, ina, ns, re, ipc, ipd, pub, an)
                       VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?,
                       ?, ?, ?, ?, ?, ?, ?, ?)''', obs)

# Write to SQLITE
conn = sqlite3.connect('factiva.db')
with conn:
    cur = conn.cursor()
    cur.execute('DROP TABLE IF EXISTS articles')
    # Mirror all field codes except changing 'IN' to 'INC' because it is an invalid name
    cur.execute('''CREATE TABLE articles
                   (nid integer primary key, id text, hd text, cr text, wc text, pd text,
                   et text, sn text, sc text, ed text, pg text, la text, cy text, lp text,
                   td text, ct text, rf text, co text, ina text, ns text, re text, ipc text,
                   ipd text, pub text, an text)''')
    for f in glob.glob('*.txt'):
        print(f)
        parser(f)

# Write to CSV to feed Stata
with open('factiva.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    with conn:
        cur = conn.cursor()
        cur.execute('SELECT * FROM articles WHERE hd IS NOT NULL')
        colname = [desc[0] for desc in cur.description]
        writer.writerow(colname)
        for obs in cur.fetchall():
            writer.writerow(obs)
As you can see from https://en.wikipedia.org/wiki/Windows-1252, the code 0x9D is not defined in CP1252.
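To make this concrete, here is a minimal sketch of one way a 0x9d byte can end up in a file. It assumes the input is UTF-8 encoded and contains a right double quotation mark (U+201D); the question does not confirm the real encoding, so treat this only as an illustration of the failure mode.

```python
# The right double quotation mark (U+201D) is common in news text.
# In UTF-8 it is stored as three bytes, the last of which is 0x9d.
raw = "\u201d".encode("utf-8")
print(raw)  # b'\xe2\x80\x9d'

# Decoding those bytes as cp1252 fails, because 0x9d has no mapping there.
try:
    raw.decode("cp1252")
    message = None
except UnicodeDecodeError as exc:
    message = str(exc)
print(message)

# Decoding with the encoding that was actually used works.
decoded = raw.decode("utf-8")
print(decoded == "\u201d")  # True
```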
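The "error" is in, for example, your open call: you do not specify an encoding, so Python falls back to the system's locale encoding, which on Windows is typically cp1252. In general, if you read a file that may not have been created on the same machine, it is better to specify the encoding.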
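The following sketch shows how to check which encoding open() would fall back to, and how passing encoding explicitly removes that dependence. The sample text and the file name sample.txt are made up for the demonstration, and UTF-8 is an assumption about the data, not something the question states.

```python
import locale
import os
import tempfile

# Encoding used by open() when none is given (often cp1252 on Windows).
default_encoding = locale.getpreferredencoding(False)
print(default_encoding)

# Hypothetical sample text containing curly quotes.
sample = "He said \u201chello\u201d\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    # Write the file as UTF-8 ...
    with open(path, "w", encoding="utf-8") as out:
        out.write(sample)
    # ... and read it back with the same explicit encoding.
    with open(path, "r", encoding="utf-8") as infile:
        data = infile.read()

print(data == sample)  # True
```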
I recommend also adding an encoding to the open call you use for writing the CSV. It really is better to be explicit.
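A minimal sketch of writing a CSV with an explicit encoding might look like this. The rows and the file name factiva_sample.csv are invented for illustration, and UTF-8 is only one reasonable choice; use whatever encoding the consumer of the file expects.

```python
import csv
import os
import tempfile

# Hypothetical rows with non-ASCII characters.
rows = [["id", "hd"], ["1", "Caf\u00e9 \u201cquoted\u201d headline"]]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "factiva_sample.csv")
    # newline='' is what the csv module expects; encoding is stated explicitly.
    with open(path, "w", newline="", encoding="utf-8") as csvfile:
        csv.writer(csvfile).writerows(rows)
    # Read the file back with the same encoding to confirm the round trip.
    with open(path, "r", newline="", encoding="utf-8") as csvfile:
        read_back = list(csv.reader(csvfile))

print(read_back == rows)  # True
```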
I don't know the format of the original file, but adding , encoding='utf-8' to open is usually a good thing (it is the default on Linux and macOS).
Add an encoding to the open statement, for example:
f=open("filename.txt","r",encoding='utf-8')
The above did not work for me. Try adding , errors='ignore' to the open call instead. It worked wonders!
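It helps to see what errors='ignore' actually does before relying on it: undecodable bytes are silently dropped, so the text may lose characters. The sketch below compares it with errors='replace' on a made-up byte string; the same errors argument can be passed to open().

```python
# Made-up bytes containing 0x9d, which cp1252 cannot decode.
raw = b"caf\x9d ok"

# 'ignore' drops the offending byte without any warning.
ignored = raw.decode("cp1252", errors="ignore")
print(ascii(ignored))   # 'caf ok'

# 'replace' substitutes U+FFFD, so the damage stays visible.
replaced = raw.decode("cp1252", errors="replace")
print(ascii(replaced))  # 'caf\ufffd ok'
```

If losing a few characters is acceptable, either handler gets past the exception; if not, finding the file's real encoding is the safer route.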
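If you don't need to decode the file, you can also try
file = open(filename, 'rb')
'rb' stands for "read binary". That is enough if you only want to upload the file to a website.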
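As a small sketch of why binary mode avoids the error: reading with 'rb' returns bytes and never runs a decoder, so no UnicodeDecodeError can occur at that point. The payload and the file name sample.txt are invented for the demonstration.

```python
import os
import tempfile

# Hypothetical file content that includes the byte 0x9d.
payload = b"quote: \xe2\x80\x9d\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    with open(path, "wb") as out:
        out.write(payload)
    # Binary mode: the bytes come back exactly as stored, undecoded.
    with open(path, "rb") as infile:
        raw = infile.read()

print(type(raw).__name__)  # bytes
print(raw == payload)      # True
```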
errors='ignore' solved my headache. Here is how to find the word "coma" in a directory and its subdirectories:
import os

rootdir = ('K:\\0\\000.THU.EEG.nedc_tuh_eeg\\000edf.01_tcp_ar\\01_tcp_ar\\')
for folder, dirs, files in os.walk(rootdir):
    for file in files:
        if file.endswith('.txt'):
            fullpath = os.path.join(folder, file)
            # errors='ignore' skips bytes that cannot be decoded
            with open(fullpath, 'r', errors='ignore') as f:
                for line in f:
                    if "coma" in line:
                        print(fullpath)
                        break
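I also ran into this problem when trying to append HTML as text to a file. You can try what I did: first get the content back as bytes, then decode it with 'utf-8' to convert it to a string.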
converted_file = binary_file.decode('utf-8')
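When the encoding of the bytes is not known in advance, one possible approach is to try a few candidates in order. The helper decode_bytes below is hypothetical (it is not from this thread or from any library), and the candidate list is an assumption; latin-1 is placed last because it maps every byte and therefore never fails, though it may produce the wrong characters.

```python
def decode_bytes(binary_data, encodings=("utf-8", "cp1252", "latin-1")):
    """Return (text, encoding) for the first encoding that decodes cleanly.

    Hypothetical helper for illustration only.
    """
    for encoding in encodings:
        try:
            return binary_data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")

# Valid UTF-8 is decoded by the first candidate.
text, used = decode_bytes("caf\u00e9".encode("utf-8"))
print(used)  # utf-8

# These bytes are invalid UTF-8 and contain 0x9d, so cp1252 fails as well.
text2, used2 = decode_bytes(b"caf\xe9 \x9d")
print(used2)  # latin-1
```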
Use this:
with open(file_path, 'r', encoding='utf-8') as file:
    content = file.read()
instead of this:
with open(file_path, 'r') as file:
    content = file.read()