Unable to parse a website link from a webpage


I've written a script in Python with Selenium to grab the website address shown under Contact details on a site. The problem is that there is no URL associated with that link, even though I can click it.

How can I parse the website link under Contact details?

from selenium import webdriver

URL = 'https://www.truelocal.com.au/business/vitfit/sydney'

def get_website_link(driver,link):
    driver.get(link)
    website = driver.find_element_by_css_selector("[ng-class*='getHaveSecondaryWebsites'] > span").text
    print(website)

if __name__ == '__main__':
    driver = webdriver.Chrome()
    try:
        get_website_link(driver,URL)
    finally:
        driver.quit()

When I run the script, all I get is the visible text of that link, i.e. Visit website.

python python-3.x selenium selenium-webdriver web-scraping
1 Answer

The element with the "Visit website" text is a span whose click handler runs the JavaScript vm.openLink(vm.getReadableUrl(vm.getPrimaryWebsite()),'_blank') instead of carrying a real href. My suggestion: if your goal is scraping rather than testing, use the requests-based solution below to fetch the listing data as JSON and extract whatever information you need. The other option is to actually perform the click, as you are already doing (sketched after the code below).

import requests
import re

headers = {
    'Referer': 'https://www.truelocal.com.au/business/vitfit/sydney',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/73.0.3683.75 Safari/537.36',
    'DNT': '1',
}
# the public configuration script embeds the token required by the listings API
response = requests.get('https://www.truelocal.com.au/www-js/configuration.constant.js?v=1552032205066',
                        headers=headers)
assert response.ok

# extract token from response text
token = re.search(r"token:\s'(.*)'", response.text)[1]

headers['Accept'] = 'application/json, text/plain, */*'
headers['Origin'] = 'https://www.truelocal.com.au'

response = requests.get(f'https://api.truelocal.com.au/rest/listings/vitfit/sydney?&passToken={token}', headers=headers)
assert response.ok
# use response.text to get full json as text and see what information can be extracted.

# pick the first contact entry whose type is "website" and take its value
contact = response.json()["data"]["listing"][0]["contacts"]["contact"]
website = list(filter(lambda x: x["type"] == "website", contact))[0]["value"]
print(website)

print("the end")