I'm trying to scrape the website whoscored.com. Here is the simple code I'm using to scrape one particular page:
import requests
import pandas as pd
from bs4 import BeautifulSoup

headers = {
    'User-Agent': (
        'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
        '(KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'
    )
}
page = "https://www.whoscored.com/Teams/13/RefereeStatistics/England-Arsenal"
pageTree = requests.get(page, headers=headers)
pageSoup = BeautifulSoup(pageTree.content, 'lxml')
print(pageSoup)
The code runs without errors, but this is what it returns:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<title>404 - File or directory not found.</title>
<style type="text/css">
<!--
body{margin:0;font-size:.7em;font-family:Verdana, Arial, Helvetica, sans-serif;background:#EEEEEE;}
fieldset{padding:0 15px 10px 15px;}
h1{font-size:2.4em;margin:0;color:#FFF;}
h2{font-size:1.7em;margin:0;color:#CC0000;}
h3{font-size:1.2em;margin:10px 0 0 0;color:#000000;}
#header{width:96%;margin:0 0 0 0;padding:6px 2% 6px 2%;font-family:"trebuchet MS", Verdana, sans-serif;color:#FFF;
background-color:#555555;}
#content{margin:0 0 0 2%;position:relative;}
.content-container{background:#FFF;width:96%;margin-top:8px;padding:10px;position:relative;}
-->
</style>
</head>
<body>
<div id="header"><h1>Server Error</h1></div>
<div id="content">
<div class="content-container"><fieldset>
<h2>404 - File or directory not found.</h2>
<h3>The resource you are looking for might have been removed, had its name
changed, or is temporarily unavailable.</h3>
</fieldset></div>
</div>
<script type="text/javascript">
//<![CDATA[
(function() {
var _analytics_scr = document.createElement('script');
_analytics_scr.type = 'text/javascript'; _analytics_scr.async = true;
_analytics_scr.src = '/_Incapsula_Resource?SWJIYLWA=719d34d31c8e3a6e6fffd425f7e032f3&ns=1&cb=1578388490';
var _analytics_elem = document.getElementsByTagName('script')[0];
_analytics_elem.parentNode.insertBefore(_analytics_scr, _analytics_elem);
})();
// ]]>
</script></body>
</html>
As you can see, it returns "404 - File or directory not found" and "The resource you are looking for might have been removed, had its name changed, or is temporarily unavailable."
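There is also a block of script at the end that I'm not too familiar with.
I have a few ideas about why this might be happening. Maybe it's the JavaScript (which I can see at the end of the output), or maybe the website has some kind of anti-scraping countermeasure. However, I'd like to know exactly what the problem is and what I can do to fix it, so that I get the data I'm trying to scrape from the page, which, by the way, is the whole table. The little I gathered from reading similar questions here is that I need to use Selenium, but I'm not sure how. Any help would be appreciated.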
I'm using IDLE. My Python version is 3.7 (64-bit), and my computer is 64-bit.
In your code, the URL contains England/Arsenal, but it has to be England-Arsenal. Note the difference between / and -.
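However, the page uses JavaScript, so you can't get the data with requests and BeautifulSoup alone. You will have to use Selenium to control a web browser, which loads the page and runs the JavaScript. Once the page is rendered, you can get the HTML from the browser (using Selenium) and use BeautifulSoup to search it for your data.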
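The two points above (the exact URL form and JavaScript-rendered content) can be checked roughly from the raw response before reaching for a browser. The following is only an illustrative sketch: `describe_response` and `check_url` are hypothetical helper names, not part of requests or BeautifulSoup, and the `_Incapsula_Resource` marker is simply the string visible in the output from the question. Its presence is a hint, not proof, that automated requests are being filtered.

```python
def describe_response(status_code, body):
    """Return rough, human-readable guesses at why a response may be unusable."""
    notes = []
    if status_code == 404:
        notes.append("404: the server did not recognise this URL; check the path")
    elif status_code >= 400:
        notes.append(f"{status_code}: the request was rejected")
    # Marker taken from the HTML shown in the question (assumption: it signals
    # that an Incapsula script is being served with the page).
    if "_Incapsula_Resource" in body:
        notes.append("Incapsula script present: the site may filter automated requests")
    if "<table" not in body.lower():
        notes.append("no <table> in the raw HTML: the data may be added by JavaScript")
    return notes


def check_url(url, headers=None):
    """Fetch `url` with requests and describe the response (hypothetical helper)."""
    import requests  # imported lazily so describe_response() works without it

    response = requests.get(url, headers=headers, timeout=10)
    return describe_response(response.status_code, response.text)
```

Calling `check_url(page, headers=headers)` with the variables from the question would print nothing by itself; it returns a list of notes, and an empty list only means none of these simple checks fired.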
Getting the tables with Selenium and BeautifulSoup:
import selenium.webdriver
from bs4 import BeautifulSoup

url = "https://www.whoscored.com/Teams/13/RefereeStatistics/England-Arsenal"

# Start a real browser so that the page's JavaScript runs
driver = selenium.webdriver.Firefox()
#driver = selenium.webdriver.Chrome()
driver.get(url)

#print(driver.page_source) # HTML

# Parse the rendered HTML taken from the browser
soup = BeautifulSoup(driver.page_source, 'lxml')
driver.quit()  # close the browser once the HTML has been copied out

all_tables = soup.find_all('table')
print('len(all_tables):', len(all_tables))

for table in all_tables:
    print(table)
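Because the tables are filled in by JavaScript, the HTML may be read before they exist. A possible refinement, sketched below under assumptions, is to wait explicitly until at least one `<table>` is present and then turn each table into plain rows of text instead of printing raw tags. The helper names `table_rows` and `fetch_rendered_html` and the 15-second timeout are illustrative choices, and `html.parser` is used here instead of `lxml` only to avoid an extra dependency.

```python
from bs4 import BeautifulSoup


def table_rows(html):
    """Return every table in `html` as a list of rows, each a list of cell text."""
    soup = BeautifulSoup(html, "html.parser")
    tables = []
    for table in soup.find_all("table"):
        rows = []
        for tr in table.find_all("tr"):
            cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
            if cells:  # skip rows that contain no header or data cells
                rows.append(cells)
        tables.append(rows)
    return tables


def fetch_rendered_html(url, timeout=15):
    """Load `url` in Firefox, wait for a <table>, and return the rendered HTML."""
    # Selenium is imported here so table_rows() can be used without a browser.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Firefox()
    try:
        driver.get(url)
        # Block until a <table> element exists, or raise TimeoutException.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.TAG_NAME, "table"))
        )
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even if the wait times out
```

With these helpers, `table_rows(fetch_rendered_html(url))` would give one list of rows per table found on the page. Waiting for any `<table>` is a coarse condition; if the page has several tables, a more specific locator for the one you need is likely to be more reliable.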