UnicodeEncodeError in BeautifulSoup web scraper

Question

I'm getting a Unicode encoding error with the code below for a simple web scraper.

print 'JSON scraper initializing'

from bs4 import BeautifulSoup
import json
import requests
import geocoder


# Set page variable
page = 'https://www.bandsintown.com/?came_from=257&page='
urlBucket = []
for i in range (1,3):
    uniqueUrl = page + str(i)
    urlBucket.append(uniqueUrl)

# Build response container
responseBucket = []

for i in urlBucket:
    uniqueResponse = requests.get(i)
    responseBucket.append(uniqueResponse)


# Build soup container
soupBucket = []
for i in responseBucket:
    individualSoup = BeautifulSoup(i.text, 'html.parser')
    soupBucket.append(individualSoup)


# Build events container
allSanFranciscoEvents = []
for i in soupBucket:
    script = i.find_all("script")[4]

    eventsJSON = json.loads(script.text)

    allSanFranciscoEvents.append(eventsJSON)


with open("allSanFranciscoEvents.json", "w") as writeJSON:
   json.dump(allSanFranciscoEvents, writeJSON, ensure_ascii=False)
print ('end')

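Strangely, this code sometimes works and doesn't give an error. It has to do with the for i in range line of the code. For example, if I use the range (2,4), it works fine. If I change it to (1,3), it reports: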

UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 12: ordinal not in range(128)

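Can someone tell me how to fix this problem in my code? If I print allSanFranciscoEvents, it has read in all the data, so I believe the problem is happening in the last block of the code, with the JSON dump. Thanks so much.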

Answer 1

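eventsJSON is an object (the parsed JSON), so you can't call eventsJSON.encode('utf-8') on it. To write UTF-8 or Unicode text to a file in Python 2.7, you can use codecs, or write in binary mode with the wb flag.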

with open("allSanFranciscoEvents.json", "wb") as writeJSON:
   jsStr = json.dumps(allSanFranciscoEvents)
   # json.dumps() escapes non-ASCII characters by default (ensure_ascii=True),
   # so in Python 2 jsStr is already a plain ASCII byte string and can be written as-is
   writeJSON.write(jsStr)
print ('end')

# and read it normally
with open("allSanFranciscoEvents.json", "r") as readJson:
    data = json.load(readJson)
    print(data[0][0]["startDate"])
    # 2019-02-04
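
The codecs route mentioned above could look roughly like the following sketch. The sample data and the file name events_codecs.json are placeholders of mine, not from the question; the point is only that the file wrapper does the UTF-8 encoding, so no implicit 'ascii' conversion happens.

```python
import codecs
import json

# Placeholder data standing in for the scraped events (assumption).
events = [[{"artist": u"M\xf8", "startDate": "2019-02-04"}]]

# ensure_ascii=False keeps non-ASCII characters as-is instead of \uXXXX escapes.
text = json.dumps(events, ensure_ascii=False)
if isinstance(text, bytes):
    # Python 2 can return a byte string here; normalise to unicode text.
    text = text.decode("utf-8")

# codecs.open returns a wrapper that encodes unicode text to UTF-8 on write.
with codecs.open("events_codecs.json", "w", encoding="utf-8") as fh:
    fh.write(text)

# Reading through the same wrapper decodes the bytes back to unicode text.
with codecs.open("events_codecs.json", "r", encoding="utf-8") as fh:
    round_trip = json.load(fh)
```

Unlike the wb version, this keeps the characters readable in the file rather than as escape sequences.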

Answer 2

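The best solution

Use Python 3! Python 2 is going EOL very soon. New code written in legacy Python today will have a very short shelf life.

The only thing I had to change to make your code work in Python 3 was to call the print() function instead of using the print keyword. Your example code then works without any error.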
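
For reference, the file-writing part in Python 3 might look like the sketch below. The sample list and the file name events_py3.json are placeholders, and passing encoding="utf-8" explicitly is my addition so the result does not depend on the platform's default encoding.

```python
import json

# Placeholder for the parsed events; the real script builds this from the pages.
all_events = [[{"artist": "M\u00f8", "startDate": "2019-02-04"}]]

print("JSON scraper initializing")  # print is a function in Python 3

# Text mode with an explicit encoding: str objects are encoded as UTF-8 on write.
with open("events_py3.json", "w", encoding="utf-8") as write_json:
    json.dump(all_events, write_json, ensure_ascii=False)

# Read the raw bytes back to confirm what actually landed on disk.
with open("events_py3.json", "rb") as raw:
    raw_bytes = raw.read()

print("end")
```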

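Sticking with Python 2

"Strangely, this code sometimes works and doesn't give an error. It has to do with the range line of my code. For example, if I use the range (2,4), it works fine."

That's because you are requesting different pages with those different ranges, and not every page contains a character that can't be converted to str using the ASCII codec. I had to go to page 5 of the response to get the same error as you. In my case, it was the artist name u'Mø' that caused the problem. So here's a one-liner that reproduces the issue: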

>>> str(u'Mø')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf8' in position 0: ordinal not in range(128)
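
To check which pages contain such characters before writing anything, a small helper could walk the parsed data and collect the strings that are not pure ASCII. This is only a sketch: the function name find_non_ascii and the sample data are hypothetical, not part of the original script.

```python
def find_non_ascii(obj):
    """Return every string in a nested list/dict structure that is not pure ASCII."""
    found = []
    if isinstance(obj, dict):
        for key, value in obj.items():
            found.extend(find_non_ascii(key))
            found.extend(find_non_ascii(value))
    elif isinstance(obj, (list, tuple)):
        for item in obj:
            found.extend(find_non_ascii(item))
    elif isinstance(obj, type(u"")):
        # Code points above 127 are exactly the ones the 'ascii' codec rejects.
        if any(ord(ch) > 127 for ch in obj):
            found.append(obj)
    return found


# Hypothetical sample shaped like the parsed event data.
sample = [{"artist": u"M\xf8", "venue": u"Some Venue"}, {"artist": u"Plain Name"}]
hits = find_non_ascii(sample)
```

Running it on each entry of allSanFranciscoEvents would show which page range triggers the error and why another range does not.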

Your error explicitly singles out the character u'\xe9':

>>> str(u'\xe9')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 0: ordinal not in range(128)

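Same problem, just a different character. The character is LATIN SMALL LETTER E WITH ACUTE. Python is trying to use the default encoding, 'ascii', to convert the Unicode string to str, but 'ascii' doesn't know what that code point is.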
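
As a quick check of that claim, the standard unicodedata module can name the code point, and a look at its value shows why ASCII cannot hold it. The variable names here are just for illustration.

```python
import unicodedata

char = u"\xe9"

# Official Unicode name of the code point from the error message.
name = unicodedata.name(char)

# ASCII only covers code points 0..127; this one is 233.
code_point = ord(char)
fits_in_ascii = code_point < 128

# UTF-8 can represent it, using two bytes.
utf8_bytes = char.encode("utf-8")
```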

"I believe the problem is happening in the last block of the code, with the JSON dump."

Yes:

>>> with open('tmp.json', 'w') as f:
...     json.dump(u'\xe9', f, ensure_ascii=False)
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/usr/lib/python2.7/json/__init__.py", line 190, in dump
    fp.write(chunk)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 1: ordinal not in range(128)

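And from the traceback, you can see that it actually comes from writing to the file (fp.write(chunk)).

file.write() writes a string to a file, but u'\xe9' is a unicode object. The error message 'ascii' codec can't encode character... tells us that Python is trying to encode the unicode object to turn it into a str so that it can write it to the file. That implicit encode uses the "default string encoding", which in Python 2 is 'ascii'.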
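
The implicit conversion can be made visible by doing the same encode step by hand. This sketch only imitates what Python 2 does inside file.write(); the variable names are mine.

```python
text = u"\xe9"

try:
    # Roughly what Python 2 attempts implicitly when a unicode object
    # is written to a file opened in the default (byte) mode.
    text.encode("ascii")
    error = None
except UnicodeEncodeError as exc:
    error = exc

# The exception carries the same details that appear in the traceback.
details = (error.encoding, error.start, error.reason)
```

The tuple holds the codec name, the position of the offending character, and the reason text, which are the same pieces that make up the message in the question.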

To fix it, don't leave it to Python to use the default encoding:

>>> with open('tmp.json', 'w') as f:
...     json.dump(u'\xe9'.encode('utf-8'), f, ensure_ascii=False)
...
# No error :)

In your specific example, you can fix the intermittent error by changing this:

allSanFranciscoEvents.append(eventsJSON)

to this:

allSanFranciscoEvents.append(eventsJSON.encode('utf-8'))

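That way, you are explicitly using the 'utf-8' codec to convert the Unicode strings to str, so Python won't apply the default encoding, 'ascii', when writing to the file.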
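
One caveat, raised in the other answer: json.loads() typically returns a list or dict, and those have no .encode() method. A variant that keeps the same idea (encode explicitly with UTF-8) but applies it to the serialised text rather than to the parsed object might look like the sketch below. The sample data and the file name events_bytes.json are placeholders.

```python
import json

# Placeholder for the list of parsed pages (assumption).
all_events = [[{"artist": u"M\xf8"}], [{"artist": u"caf\xe9 band"}]]

text = json.dumps(all_events, ensure_ascii=False)

# Encode the serialised text, not the list itself. On Python 2 the result
# may already be a byte string, in which case it is written unchanged.
payload = text if isinstance(text, bytes) else text.encode("utf-8")

# Binary mode: the bytes are written exactly as given, no implicit codec.
with open("events_bytes.json", "wb") as fh:
    fh.write(payload)

with open("events_bytes.json", "rb") as fh:
    loaded = json.loads(fh.read().decode("utf-8"))
```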
