Python requests POST returns plain text

Question

As the title says, I am trying to scrape a website that requires POST rather than GET.

Here is the code; any help would be greatly appreciated.

import requests
from bs4 import BeautifulSoup

headers = {'Accept-Encoding': 'gzip, deflate',
           'Accept-Language': 'en,zh;q=0.9,zh-CN;q=0.8',
           'Connection': 'keep-alive',
           'Content-Length': '71',
           'Content-Type': 'application/x-www-form-urlencoded;charset=UTF-8',
           'Cookie':'acw_tc=65c86a0915562424980896166e8d7e63f2a68a3ce0960e074dfd8883b55f5a; __utmc=105455707; __utmz=105455707.1556243245.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); ajaxkey=1F7A239ABF2F548B9A3EF4A0F6FF5FDC66906C5D06FBF3C0; __utma=105455707.894288893.1556243245.1556400728.1556404658.5; __utmt=1; __utmb=105455707.1.10.1556404658; SERVERID=8abfb74b5c7dce7c6fa0fa50eb3d63af|1556404667|1556404656',
           'Host': 'www.ipe.org.cn',
           'Origin': 'http://www.ipe.org.cn',
           'Referer': 'http://www.ipe.org.cn/GreenSupplyChain/Communication.aspx',
           'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36',
           'X-Requested-With': 'XMLHttpRequest'}

url = "http://www.ipe.org.cn/data_ashx/GetAirData.ashx"
from_data = {'cmd': 'getcommunicationlist',
             'pageSize': 4,
             'pageIndex': 2,
             'industryId': 'on',
             'storyId': 0}
html = requests.post(url,
                      data=from_data,
                      headers=headers)

bsobj = BeautifulSoup(html.content,'html.parser')
bsobj # just a part of all the results
{isSuccess:'1',content:'%3Cul%3E%3Cli%3E%3Ctable%3E%3Ctr%3E%3Ctd%3E%3Cimg%20id%3D%223

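I can reach the site successfully, but I cannot make sense of the returned result. It is neither HTML/XML nor JSON, just text/plain. Is there any reason why this would happen? Also, this approach does not return everything I can actually see on the page, whereas Selenium does (but that is slow, so I am trying to find a better solution).

My desired result: find("div", {"class": "f26"}) should return something like '推动一家泡沫材料对废气违规记录做出整改' (the site also has an English version), rather than only the HTML tag or None.

EDIT: I know that in the usual case I could use bs to parse the result, but I cannot here because the returned type is just text/plain. It would be great if you could try the code above.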

Answer 1

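This is a pretty hacky approach, but it seems to work...

From inspecting the data, it looks as though the server is returning a Python dictionary that has been converted to a string, like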

>>> s = str({'a': 'b'})
>>> s
"{'a': 'b'}"

The usual way to extract a dictionary from a string is ast.literal_eval, but ast.literal_eval cannot evaluate this string (it fails with ValueError: malformed node or string: <_ast.Name object at 0x7f719518c7b8>)*.
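
One plausible reason for the failure, offered as a guess from the shape of the payload rather than anything the server documents: the keys isSuccess and content are not quoted, so the text is closer to a JavaScript object literal than to a Python dict. ast.literal_eval parses the bare keys as variable names and rejects them. The sample string below is a hypothetical, shortened stand-in for the real response, and try_literal_eval is just an illustrative helper.

```python
import ast

# Hypothetical, shortened stand-in for the server's response body.
sample = "{isSuccess:'1',content:'%3Cul%3E'}"


def try_literal_eval(text):
    """Return the evaluated literal, or the exception if evaluation fails."""
    try:
        return ast.literal_eval(text)
    except (ValueError, SyntaxError) as exc:
        return exc


# Bare keys are parsed as names, which literal_eval refuses to evaluate.
unquoted_keys = try_literal_eval(sample)

# The same data with quoted keys is a valid Python literal.
quoted_keys = try_literal_eval("{'isSuccess':'1','content':'%3Cul%3E'}")

print(type(unquoted_keys).__name__)
print(quoted_keys)
```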

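However, the stringified dictionary appears to have only two keys, "isSuccess" and "content". Only the value of "content" is of interest, so we can extract it from the string.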

quoted = re.sub(r'\{.*content:', '', html.text[:-1])
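
As a slightly stricter variant of the substitution above, a capturing group can pull out just the value between the quotes and fail loudly when the field is missing. This is only a sketch: extract_content and the sample payload are hypothetical, and it assumes content is the last field in the response, as it appears to be here.

```python
import re


def extract_content(payload):
    """Return the text between the quotes of the trailing content field."""
    # Greedy match up to the last quote that is followed by the closing brace.
    match = re.search(r"content:'(.*)'\s*\}\s*$", payload, flags=re.DOTALL)
    if match is None:
        raise ValueError("no content field found in payload")
    return match.group(1)


# Hypothetical, shortened stand-in for html.text.
sample = "{isSuccess:'1',content:'%3Cul%3E%3Cli%3E'}"
content = extract_content(sample)
print(content)
```

Unlike the re.sub result, the returned string does not include the leading single quote.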

quoted looks like this:

quoted[:20]
"'%3Cul%3E%3Cli%3E%3C"

This looks like %-encoded text. It can be decoded with urllib.parse.unquote:

unquoted = urllib.parse.unquote(quoted)

unquoted looks like

unquoted[:60]
'\'<ul><li><table><tr><td><img id="3383" title="%u54C1%u724CX"'

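This looks better, but what should be unicode escape sequences have a "%" where there should be a "\". Let's try replacing the "%" with a backslash wherever it is followed by "u" and four hexadecimal characters.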

replaced = re.sub(r'(%)(u[A-Fa-f0-9]{4})', r'\\\g<2>', unquoted)  
replaced[:60]
'\'<ul><li><table><tr><td><img id="3383" title="\\u54C1\\u724CX"'

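This is almost right, but the doubled backslashes need to be removed. Encoding the text as latin-1 preserves all the bytes, and then decoding with the 'unicode-escape' codec removes the extra backslashes.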

markup = replaced.encode('latin-1').decode('unicode-escape')
markup[:60]
'\'<ul><li><table><tr><td><img id="3383" title="品牌X" src="http'
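
The %XX and %uXXXX sequences look like the output of JavaScript's escape() function, though that is an inference from the data and not something the site confirms. Under that assumption, both forms can be decoded in a single pass by converting each hex code to its character. This skips the latin-1 round trip, which would raise UnicodeEncodeError if the text already contained characters outside latin-1. js_unescape is a hypothetical helper name, not a library function, and the sample string is made up to resemble the response.

```python
import re

# Matches either %uXXXX (four hex digits) or %XX (two hex digits).
_ESCAPE_RE = re.compile(r'%u([0-9A-Fa-f]{4})|%([0-9A-Fa-f]{2})')


def js_unescape(text):
    """Decode %XX and %uXXXX sequences into the characters they name."""
    def replace(match):
        hex_code = match.group(1) or match.group(2)
        return chr(int(hex_code, 16))
    return _ESCAPE_RE.sub(replace, text)


# Made-up fragment in the same style as the server's content value.
sample = '%3Cimg%20title%3D%22%u54C1%u724CX%22%3E'
decoded = js_unescape(sample)
print(decoded)
```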

This looks good enough to pass to BeautifulSoup.

soup = bs4.BeautifulSoup(markup, 'html.parser')
soup.find("div", {"class": "con"})
<div class="con"><img src="/public/static/images/icons/g-gas.png"/> 废气<br/>● 环境违规事项：工业废气污染源；<br/>● 潜在影响：空气质量、公众健康。</div>

*I would be interested to know why ast.literal_eval cannot parse the stringified dictionary.

Answer 2

To parse the result you should use the BeautifulSoup library; your code should look like this:

import requests
from bs4 import BeautifulSoup


headers = {'Accept-Encoding': 'gzip, deflate',
           'Accept-Language': 'en,zh;q=0.9,zh-CN;q=0.8',
           'Connection': 'keep-alive',
           'Content-Length': '71',
           'Content-Type': 'application/x-www-form-urlencoded;charset=UTF-8',
           'Cookie':'acw_tc=65c86a0915562424980896166e8d7e63f2a68a3ce0960e074dfd8883b55f5a; __utmc=105455707; __utmz=105455707.1556243245.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); ajaxkey=1F7A239ABF2F548B9A3EF4A0F6FF5FDC66906C5D06FBF3C0; __utma=105455707.894288893.1556243245.1556400728.1556404658.5; __utmt=1; __utmb=105455707.1.10.1556404658; SERVERID=8abfb74b5c7dce7c6fa0fa50eb3d63af|1556404667|1556404656',
           'Host': 'www.ipe.org.cn',
           'Origin': 'http://www.ipe.org.cn',
           'Referer': 'http://www.ipe.org.cn/GreenSupplyChain/Communication.aspx',
           'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36',
           'X-Requested-With': 'XMLHttpRequest'}

url = "http://www.ipe.org.cn/data_ashx/GetAirData.ashx"
from_data = {'cmd': 'getcommunicationlist',
             'pageSize': 4,
             'pageIndex': 2,
             'industryId': 'on',
             'storyId': 0}
html = requests.post(url,
                      data=from_data,
                      headers=headers)
soup = BeautifulSoup(html.content,"lxml")
all_div = soup.find("div", {"class": "list-recent"})

(If you are trying to find multiple divs, make sure to use findAll("div", {"class": "list-recent"}) rather than find("div", {"class": "list-recent"}).)

Hope this helps!
