I am trying to scrape a website using Scrapy. To get the content I want, I first need to log in; the login URL is login_url.
My form looks like this:
My code is as follows:
import scrapy
from scrapy.http import FormRequest
from scrapy.shell import inspect_response
from scrapy.crawler import CrawlerProcess

LOGIN_URL1 = "https://www.partslink24.com/partslink24/user/login.do"

class PartsSpider(scrapy.Spider):
    name = "parts"
    login_url = LOGIN_URL1
    start_urls = [
        login_url,
    ]

    def parse(self, response):
        # COMPANY_ID, USERNAME and PASSWORD are my real credentials,
        # defined elsewhere.
        form_data = {
            'accountLogin': COMPANY_ID,
            'userLogin': USERNAME,
            'loginBean.password': PASSWORD
        }
        yield FormRequest(url=self.login_url, formdata=form_data, callback=self.parse1)

    def parse1(self, response):
        inspect_response(response, self)
        print("RESPONSE: {}".format(response))

def start_scraper(vin_number):
    process = CrawlerProcess()
    process.crawl(PartsSpider)
    process.start()
The problem is that the site checks whether a session is already active; I get an error and the form cannot be submitted.
When I inspect the response I receive after submitting the login form, I see the following error:
The check their site performs is implemented by this code:
var JSSessionChecker = {
    check: function()
    {
        if (!Ajax.getTransport())
        {
            alert('NO_AJAX_IN_BROWSER');
        }
        else
        {
            new Ajax.Request('/partslink24/checkSessionCookies.do', {
                method: 'post',
                onSuccess: function(transport)
                {
                    if (transport.responseText != 'true')
                    {
                        if (Object.isFunction(JSSessionChecker.showError)) JSSessionChecker.showError();
                    }
                },
                onFailure: function(e)
                {
                    if (Object.isFunction(JSSessionChecker.showError)) JSSessionChecker.showError();
                },
                onException: function(request, e)
                {
                    if (Object.isFunction(JSSessionChecker.showError)) JSSessionChecker.showError();
                }
            });
        }
    },
    showError: function()
    {
        var errorElement = $('sessionCheckError');
        if (errorElement)
        {
            errorElement.show();
        }
    }
};
JSSessionChecker.check();
On success it simply returns true.
Is there any way to activate the session before submitting the form?
Thanks in advance.
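One thing worth trying is sketched below, under assumptions: reproduce the browser's session check by POSTing to checkSessionCookies.do after the first GET, so the session cookies get established, and only then submit the login form. The CHECK_URL path is taken from the Ajax.Request in the JavaScript above; the spider name, the after_check/after_login callbacks, and the placeholder credentials are hypothetical.

import scrapy
from scrapy.http import FormRequest

LOGIN_URL = "https://www.partslink24.com/partslink24/user/login.do"
# Endpoint used by the site's own JSSessionChecker (see the JavaScript above).
CHECK_URL = "https://www.partslink24.com/partslink24/checkSessionCookies.do"

class SessionCheckSpider(scrapy.Spider):
    name = "parts_session_check"
    start_urls = [LOGIN_URL]

    def parse(self, response):
        # The initial GET has set whatever cookies the server handed out;
        # now POST to the same endpoint the browser's session check uses.
        yield scrapy.Request(
            CHECK_URL,
            method="POST",
            callback=self.after_check,
            cb_kwargs={"login_page": response},
        )

    def after_check(self, response, login_page):
        # The endpoint answers with the literal string 'true' on success.
        self.logger.info("Session check says: %r", response.text)
        yield FormRequest.from_response(
            login_page,
            formdata={
                "loginBean.accountLogin": "COMPANY_ID",
                "loginBean.userLogin": "USERNAME",
                "loginBean.password": "PASSWORD",
            },
            callback=self.after_login,
        )

    def after_login(self, response):
        self.logger.info("Login landed on: %s", response.url)

Scrapy's cookie middleware carries the cookies between these requests automatically, which is what the browser check appears to rely on.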
EDIT
This is the error page I get with @fam's answer.
Please check this code.
import scrapy
from scrapy.crawler import CrawlerProcess

LOGIN_URL1 = "https://www.partslink24.com/partslink24/user/login.do"

class PartsSpider(scrapy.Spider):
    name = "parts"
    login_url = LOGIN_URL1
    start_urls = [
        login_url,
    ]

    def parse(self, response):
        form_data = {
            'loginBean.accountLogin': "COMPANY_ID",
            'loginBean.userLogin': "USERNAME",
            'loginBean.sessionSqueezeOut': "false",
            'loginBean.password': "PASSWORD",
            'loginBean.userOffsetSec': "18000",
            'loginBean.code2f': ""
        }
        yield scrapy.FormRequest.from_response(response, url=self.login_url, formdata=form_data, callback=self.parse1)

    def parse1(self, response):
        # from scrapy.shell import inspect_response
        # inspect_response(response, self)
        print("RESPONSE: {}".format(response))

def start_scraper(vin_number):
    process = CrawlerProcess()
    process.crawl(PartsSpider)
    process.start()
I do not get an error, and the response is as follows:
RESPONSE: <200 https://www.partslink24.com/partslink24/user/login.do>
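Note that a 200 from login.do does not by itself prove the login succeeded; the server may simply have re-rendered the login form. A minimal check in parse1 is sketched below; the sessionCheckError id comes from the site's JavaScript above, while the password-field selector is an assumption about the form's markup.

    def parse1(self, response):
        # If the password field is still in the page, we are most likely
        # still looking at the login form.
        if response.css("input[name='loginBean.password']"):
            self.logger.warning("Login form still present - login probably failed.")
        # The error element the site's own session check unhides on failure.
        if response.css("#sessionCheckError"):
            self.logger.warning("sessionCheckError element found in response.")
        print("RESPONSE: {}".format(response))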
EDIT: The following code works with Selenium. It logs you in to the page without trouble; you only need to download the Chrome driver and install Selenium.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import time

chrome_options = Options()
# chrome_options.add_argument("--headless")
driver = webdriver.Chrome(executable_path="./chromedriver", options=chrome_options)
driver.get("https://www.partslink24.com/partslink24/user/login.do")

# enter the form fields
company_ID = "company id"
user_name = "user name"
password = "password"

company_ID_input = driver.find_element_by_xpath("//input[@name='accountLogin']")
company_ID_input.send_keys(company_ID)
time.sleep(1)

user_name_input = driver.find_element_by_xpath("//input[@name='userLogin']")
user_name_input.send_keys(user_name)
time.sleep(1)

password_input = driver.find_element_by_xpath("//input[@id='inputPassword']")
password_input.send_keys(password)
time.sleep(1)

# click the login button and wait for the next page to load
click_btn = driver.find_element_by_xpath("//a[@tabindex='5']")
click_btn.click()
time.sleep(5)
Don't forget to change the credentials.
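If the Selenium login works but you want to keep scraping without driving a browser, one common pattern (a sketch, untested against this site) is to copy the authenticated cookies out of the driver into a requests.Session and continue from there:

import requests

# Run this after click_btn.click() and the final sleep, while the
# driver still holds the logged-in session.
session = requests.Session()
for cookie in driver.get_cookies():
    session.cookies.set(cookie["name"], cookie["value"], domain=cookie.get("domain"))

# Subsequent requests reuse the authenticated session cookies.
resp = session.get("https://www.partslink24.com/partslink24/user/login.do")
print(resp.status_code)

Whether this holds up depends on the site's session checks; if it also fingerprints the client, the cookies alone may not be enough.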
I would also like to scrape information from this site. If you get it working, please contact me at [email protected]