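I am trying to scrape real estate listings from https://www.utahrealestate.com/search/map.search/page/1, and I am having trouble getting Selenium's webdriver to scrape all of the HTML.
As far as I can tell, the site uses JavaScript functions to dynamically load the listings onto the map.
Instead of returning HTML that contains the data I need under the tags, it returns something like this: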
<div id="results-listings">
<div style="height: 400px;"></div>
</div>
</div>
</div>
<!--right ad zone-->
<div class="advert-160-600 advert-right-zone" data-google-query-id="CKDYtP2Ol-ECFVAMswAd7vcDAg" id="div-gpt-ad-1533933823557-0" style="">
<div id="google_ads_iframe_/21730996110/UtahRealEstate/ListingResults/Right-Side-160x600_0__container__" style="border: 0pt none; display: inline-block; width: 160px; height: 600px;"><iframe data-google-container-id="1" data-is-safeframe="true" data-load-complete="true" frameborder="0" height="600" id="google_ads_iframe_/21730996110/UtahRealEstate/ListingResults/Right-Side-160x600_0" marginheight="0" marginwidth="0" name="" sandbox="allow-forms allow-pointer-lock allow-popups allow-popups-to-escape-sandbox allow-same-origin allow-scripts allow-top-navigation-by-user-activation" scrolling="no" src="https://tpc.googlesyndication.com/safeframe/1-0-32/html/container.html" style="border: 0px; vertical-align: bottom;" title="3rd party ad content" width="160"></iframe></div></div>
<div id="map_notification"></div>
<div id="map_markers_container" style="display: none;"></div>
</div>
</div>
<div class="advert-728-90" data-google-query-id="CKHYtP2Ol-ECFVAMswAd7vcDAg" id="div-gpt-ad-1533933779531-0" style="margin-top: 15px">
<div id="google_ads_iframe_/21730996110/UtahRealEstate/ListingResults/Center-Below-Map-728x90_0__container__" style="border: 0pt none;"><iframe data-google-container-id="2" data-load-complete="true" frameborder="0" height="90" id="google_ads_iframe_/21730996110/UtahRealEstate/ListingResults/Center-Below-Map-728x90_0" marginheight="0" marginwidth="0" name="google_ads_iframe_/21730996110/UtahRealEstate/ListingResults/Center-Below-Map-728x90_0" scrolling="no" srcdoc="" style="border: 0px; vertical-align: bottom;" title="3rd party ad content" width="728"></iframe></div></div>
<div class="container" style="margin-top: 20px;">
<p style="margin: 20px 0 40px 0;">UtahRealEstate.com is Utah's favorite place to find a home. MLS Listings are provided by the Wasatch Front Regional Multiple Listing Service, Inc., which is powered by Utah's REALTORS®. UtahRealEstate.com offers you the most complete and current property information available. Browse our website to find an accurate list of homes for sale in Utah and homes for sale in Southeastern Idaho.</p>
<h5>Find Utah Homes for Sale by City</h5>
<div class="row">
<div class="col-sm-7 five-three">
<div class="row">
<div class="col-sm-4">
<b><a href="/davis-county-homes">Davis County</a></b>
<ul>
<li><a href="/bountiful-homes">Bountiful</a></li>
<li><a href="/clearfield-homes">Clearfield</a></li>
<li><a href="/clinton-homes">Clinton</a></li>
<li><a href="/layton-homes">Layton</a></li>
<li><a href="/kaysville-homes">Kaysville</a></li>
<li><a href="/north-salt-lake-homes">North Salt Lake</a></li>
<li><a href="/south-weber-homes">South Weber</a></li>
<li><a href="/syracuse-homes">Syracuse</a></li>
<li><a href="/woods-cross-homes">Woods Cross</a></li>
My current code looks like this:
from selenium import webdriver
from bs4 import BeautifulSoup as soup
utahRealEstate = 'https://www.utahrealestate.com/search/map.search/page/1'
browser = webdriver.Chrome()
page = browser.get(utahRealEstate)
innerHTML = browser.execute_script("return document.body.innerHTML")
page_soup = soup(innerHTML)
page_soup
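What I am really after is the information contained in the "listings-info-left-col" and "listings-info-right-col" classes.
I am very new to this, so please explain as thoroughly as you can. I appreciate any help!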
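The following calculates the pagination info (so it stays flexible if the pagination changes) and loops over every page of available results. It extracts the price, property address, and property details into a list of lists, which is flattened, converted to a dataframe, and written to CSV. Regular expressions are used to tidy up the output. It uses wait conditions so that the listing content is present before the page is parsed.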
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import re
import math
from bs4 import BeautifulSoup as bs
import pandas as pd

def getInfo(html):
    """Return [price, address, property details] for each listing in the page source html."""
    soup = bs(html, 'lxml')
    items = soup.select('.inline_info')
    rowsToReturn = []
    for item in items:
        data = item.select('.list-info-content')  # address info and property details, e.g. baths, beds
        price = item.select_one('h3').text.strip()
        address = re.sub(r'\s\s+', ' ', data[0].text.strip())  # replace 2+ whitespace characters with a single space
        propertyInfo = re.sub(r'\s\s+', ' ', data[1].text.strip())
        rowToReturn = [price, address, propertyInfo]
        rowsToReturn.append(rowToReturn)
    return rowsToReturn

url = 'https://www.utahrealestate.com/search/map.search/page/1'  # landing page
driver = webdriver.Chrome()
driver.get(url)
WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".list-info-content")))  # wait for all listings content

# The element with class view-results holds the pagination and total results info
reg = re.compile(r'(\d+)')  # one or more digits
matches = reg.findall(driver.find_element(By.CSS_SELECTOR, '.view-results').text)  # e.g. ['1', '50', '500'] from "1 to 50 of 500"
numResults = int(matches[2])
resultsPerPage = int(matches[1])
numPages = math.ceil(numResults / resultsPerPage)

results = []
results.append(getInfo(driver.page_source))  # add page one results

if numPages > 1:
    for page in range(2, numPages + 1):  # loop over the calculated number of pages
        driver.get('https://www.utahrealestate.com/search/map.search/page/{}'.format(page))  # put the new page number into the url
        WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".list-info-content")))  # wait for all listings content
        results.append(getInfo(driver.page_source))  # add next page results

# flatten list of lists
finalList = [item for sublist in results for item in sublist]

# convert to dataframe and write to csv
df = pd.DataFrame(finalList, columns=['price', 'address', 'property details'])
df.to_csv(r'C:\Users\User\Desktop\Data.csv', sep=',', encoding='utf-8-sig', index=False)
driver.quit()
Sample results: (image not shown)
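The pagination arithmetic in the code above can be checked on its own, without a browser. The sketch below assumes the summary text has the shape "1 to 50 of 500" mentioned in the code comment; the function name `page_count` is hypothetical, and the thousands-separator case is an assumption about how the site might print large totals, not something confirmed here.

```python
import math
import re

def page_count(summary: str) -> int:
    """Return the number of result pages implied by a summary such as '1 to 50 of 500'."""
    # Assumption: large totals may be printed as '1,234'. Drop the commas first,
    # otherwise r'\d+' would read that total as two separate numbers.
    numbers = [int(n) for n in re.findall(r'\d+', summary.replace(',', ''))]
    _, per_page, total = numbers[:3]
    # Round up so a partially filled last page is still visited.
    return math.ceil(total / per_page)

print(page_count('1 to 50 of 500'))    # 10
print(page_count('1 to 50 of 1,234'))  # 25
```

If the total on the live page does contain a comma, the same comma-stripping step would probably be needed before `reg.findall` in the scraper.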
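This code starts at the first page, parses it for the details, then keeps loading the remaining pages, parsing them one at a time, until there are no more pages. You can optimize it to suit your needs.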
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
import time
from selenium.common.exceptions import NoSuchElementException

utahRealEstate = 'https://www.utahrealestate.com/search/map.search/page/1'
browser = webdriver.Chrome()
browser.get(utahRealEstate)

# parse the page
def parse(html):
    soup = BeautifulSoup(html, 'html.parser')
    for i in soup.find_all('div', {'class': 'listings-info'}):
        print(i.get_text())

while True:
    try:
        # Give the listings time to load, then parse the current page.
        time.sleep(3)
        parse(browser.page_source)
        # Find the next page button and click it.
        browser.find_element(By.XPATH, "//a[text()='Next ']").click()
    except NoSuchElementException:
        # Couldn't find a next page button, so this must be the last page.
        break

browser.quit()
Output:
$615,000
3217 W 10305 S
South Jordan, UT 84095
5Beds
5Baths
4002Sq.Ft.
#1588082
Domain Real Estate LLC
...
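If structured values are more useful than printed text, the output above could be split into fields with a few regular expressions. This is only a sketch based on the single sample listing shown: the function name `parse_listing` and the dictionary keys are hypothetical, treating the `#` number as a listing identifier is an assumption, and other listings may be laid out differently (for example with fractional bath counts, which this does not handle).

```python
import re

# The sample listing text from the output above.
SAMPLE = """$615,000
3217 W 10305 S
South Jordan, UT 84095
5Beds
5Baths
4002Sq.Ft.
#1588082
Domain Real Estate LLC"""

def parse_listing(text: str) -> dict:
    """Pull a few numeric fields out of one listing's text; missing fields become None."""
    def first_int(pattern):
        match = re.search(pattern, text)
        # Strip thousands separators before converting to int.
        return int(match.group(1).replace(',', '')) if match else None

    return {
        'price': first_int(r'\$([\d,]+)'),
        'beds': first_int(r'(\d+)\s*Beds'),
        'baths': first_int(r'(\d+)\s*Baths'),
        'sqft': first_int(r'([\d,]+)\s*Sq\.Ft\.'),
        'listing_id': first_int(r'#(\d+)'),  # assumption: the '#' number identifies the listing
    }

print(parse_listing(SAMPLE))
```

In the scraper, `parse_listing(i.get_text())` could be called in place of `print(i.get_text())`, with the resulting dictionaries collected into a list.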