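When I run a search on this website using the account number 0523620090003, I can see the details for that account in the results. I've written a script with the requests module to scrape two parts of the results: account details and fiduciary. I can already scrape the account details shown at the top left. However, I can't parse the information for Fiduciary, which sits in the middle of the upper-right area.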
import requests
from pprint import pprint

link = 'https://arcweb.hcad.org/server/rest/services/public/public_query/MapServer/0/query'

# Query parameters; `where` selects the account by its HCAD_NUM.
params = {
    'f': 'json',
    'distance': 2,
    'outFields': '*',
    'outSR': '102100',
    'spatialRel': 'esriSpatialRelIntersects',
    'units': 'esriSRUnit_StatuteMile',
    'where': "HCAD_NUM = '0523620090003'",
}
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
    'Accept': '*/*',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'en-US,en;q=0.9',
}

with requests.Session() as s:
    s.headers.update(headers)
    res = s.get(link, params=params)
    # Print the attributes of the first matching feature.
    pprint(res.json()['features'][0]['attributes'])
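One way to check whether the fiduciary data is present in this JSON response at all is to search the parsed structure for keys containing a keyword. The following is only a minimal sketch: find_keys is a hypothetical helper, and the sample dictionary (including the owner_name field) is made up to mimic the shape of the response, not copied from the real endpoint.

```python
def find_keys(data, keyword, path=""):
    """Yield (path, value) for every dict key containing `keyword` (case-insensitive)."""
    if isinstance(data, dict):
        for key, value in data.items():
            current = f"{path}.{key}" if path else key
            if keyword.lower() in key.lower():
                yield current, value
            # Keep descending so nested matches are found too.
            yield from find_keys(value, keyword, current)
    elif isinstance(data, list):
        for index, item in enumerate(data):
            yield from find_keys(item, keyword, f"{path}[{index}]")


# Made-up sample shaped like the response above; field names are placeholders.
sample = {
    "features": [
        {"attributes": {"HCAD_NUM": "0523620090003", "owner_name": "EXAMPLE OWNER"}}
    ]
}

matches = list(find_keys(sample, "fiduc"))
print(matches)
```

With a real response, passing res.json() instead of sample would show whether any field name resembles "fiduciary". An empty result suggests the data comes from a different request.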
How can I scrape the fiduciary-related information from the website using the requests module?
As suggested in the comments, you may find it useful to automate the interaction with the website using Selenium.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
import time
ACCOUNT_NUMBER = "0523620090003"
URL = "https://hcad.org/property-search/property-search"
options = Options()
options.add_argument("--disable-gpu")
options.add_argument("--no-sandbox")
driver = webdriver.Remote("http://127.0.0.1:4444/wd/hub", options=options)
driver.get(URL)
# Change focus to <iframe>.
iframe = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, "iframe")))
driver.switch_to.frame(iframe)
# Locate the <input> field.
search_input = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, 'input[type="search"]')))
# Insert search term.
search_input.send_keys(ACCOUNT_NUMBER)
time.sleep(2)
# Trigger search.
button = driver.find_element(By.CSS_SELECTOR, ".input-group-append button")
button.click()
time.sleep(5)
# Find first search result and click.
row = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, "tr.resulttr.odd")))
row.click()
time.sleep(5)
# Get fiduciary details.
fiduciary = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, "i.fa-person-walking-luggage")))
details = fiduciary.find_element(By.XPATH, './../following-sibling::*[1]')
print(details.text)
# quit() ends the whole WebDriver session, not just the current window.
driver.quit()
I'm using a remote instance of Selenium. You can replace the call to webdriver.Remote() with:
driver = webdriver.Chrome(options=options)
Output for the fiduciary section of the page:
BETTENCOURT TAX ADVISORS LLC - 05082
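If the name and the trailing number are needed separately, the printed text could be split on its last " - " separator. This is a rough sketch that assumes the output always follows the "NAME - CODE" pattern seen above; split_fiduciary is a hypothetical helper, and treating the number as a code is a guess based on this single example.

```python
def split_fiduciary(text):
    """Split 'NAME - CODE' on the last ' - '; return (name, None) if there is no separator."""
    name, sep, code = text.strip().rpartition(" - ")
    if not sep:
        # No separator found: treat the whole string as the name.
        return text.strip(), None
    return name.strip(), code.strip()


name, code = split_fiduciary("BETTENCOURT TAX ADVISORS LLC - 05082")
print(name)
print(code)
```

Splitting on the last separator rather than the first keeps names that themselves contain " - " intact, and the code stays a string so its leading zero is preserved.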