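I've written a script that uses the requests module to collect company names from this website, but when I run it I get nothing back. I looked for the company names in the page source and they are present there, so the content appears to be static.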
import requests
from bs4 import BeautifulSoup

link = 'https://clutch.co/agencies/digital-marketing'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
}

with requests.Session() as s:
    s.headers.update(headers)
    res = s.get(link)
    soup = BeautifulSoup(res.text, "lxml")
    for item in soup.select("h3.company_info > a"):
        print(item.text)
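Judging by the output of the code below, the site returns status code 403, which means the server refuses to serve this client even though the URL is valid.
The response headers indicate that the site is protected by Cloudflare:
'Server': 'cloudflare', 'CF-RAY': '78d95f0bafebad68-ATL'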
import requests

link = 'https://clutch.co/agencies/digital-marketing'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
}

with requests.Session() as s:
    s.headers.update(headers)
    res = s.get(link)
    print(res.status_code)  # printed 403
    print('\n')
    print(res.headers)      # printed the headers shown on the next line
{'Date': 'Sun, 22 Jan 2023 15:37:30 GMT', 'Content-Type': 'text/html; charset=UTF-8', 'Transfer-Encoding': 'chunked', 'Connection': 'close', 'Permissions-Policy': 'accelerometer=(),autoplay=(),camera=(),clipboard-read=(),clipboard-write=(),fullscreen=(),geolocation=(),gyroscope=(),hid=(),interest-cohort=(),magnetometer=(),microphone=(),payment=(),publickey-credentials-get=(),screen-wake-lock=(),serial=(),sync-xhr=(),usb=()', 'Referrer-Policy': 'same-origin', 'X-Frame-Options': 'SAMEORIGIN', 'Cache-Control': 'private, max-age=0, no-store, no-cache, must-revalidate, post-check=0, pre-check=0', 'Expires': 'Thu, 01 Jan 1970 00:00:01 GMT', 'Set-Cookie': '__cf_bm=SR3boTo67liuRCP9u9YJcmvRZKWm5jFrnJcxtKXB42c-1674401850-0-AWOak5THdaypQLptfJnLhSTY5z2JO5+6rWurdKJQLQBPXB5tYhE0Z4NYGvJ3mjcG89KTFEkgKruhJ8XN/kTnfpo=; path=/; expires=Sun, 22-Jan-23 16:07:30 GMT; domain=.clutch.co; HttpOnly; Secure; SameSite=None', 'Vary': 'Accept-Encoding', 'Strict-Transport-Security': 'max-age=2592000', 'Server': 'cloudflare', 'CF-RAY': '78d95f0bafebad68-ATL', 'Content-Encoding': 'gzip', 'alt-svc': 'h3=":443"; ma=86400, h3-29=":443"; ma=86400'}
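The check described above (a 403 status together with Cloudflare response headers) can be sketched as a small helper. This is only a heuristic under stated assumptions: the function name `looks_like_cloudflare_block` and its `blocked_statuses` parameter are hypothetical, and the header values are the two copied from the response above.

```python
def looks_like_cloudflare_block(status_code, headers, blocked_statuses=(403,)):
    """Heuristic: True when a response looks like a refusal served by Cloudflare."""
    # HTTP header names are case-insensitive, so normalize before looking them up.
    normalized = {str(k).lower(): str(v).lower() for k, v in headers.items()}
    served_by_cloudflare = (
        normalized.get("server") == "cloudflare" or "cf-ray" in normalized
    )
    # 403 is the status observed in this answer; other codes are up to the caller.
    return status_code in blocked_statuses and served_by_cloudflare


# Header values copied from the response shown above.
observed = {"Server": "cloudflare", "CF-RAY": "78d95f0bafebad68-ATL"}
print(looks_like_cloudflare_block(403, observed))  # True
print(looks_like_cloudflare_block(200, observed))  # False
```

With a real response you would pass `res.status_code` and `res.headers`. A positive result suggests that changing the parser or the CSS selector will not help, because the HTML that came back is not the page you see in the browser.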
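Because the site is protected by Cloudflare, you can try cloudscraper, a Python module that attempts to get past Cloudflare's anti-bot page.
With that module you can retrieve the data you need.
For example: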
import cloudscraper
import pandas as pd
from bs4 import BeautifulSoup
from tabulate import tabulate

scraper = cloudscraper.create_scraper()
source_html = scraper.get("https://clutch.co/agencies/digital-marketing").text
soup = BeautifulSoup(source_html, "lxml")

# One [name, profile URL] pair per company link
company_data = [
    [item.get_text(strip=True), f"https://clutch.co{item['href']}"]
    for item in soup.select("h3.company_info > a")
]

df = pd.DataFrame(company_data, columns=["Company", "URL"])
print(tabulate(df, headers="keys", tablefmt="github", showindex=False))
This should print:
| Company | URL |
|----------------------------------|------------------------------------------------------------|
| WebFX | https://clutch.co/profile/webfx |
| Ignite Visibility | https://clutch.co/profile/ignite-visibility |
| SocialSEO | https://clutch.co/profile/socialseo |
| Lilo Social | https://clutch.co/profile/lilo-social |
| Favoured | https://clutch.co/profile/favoured |
| Power Digital | https://clutch.co/profile/power-digital |
| Belkins | https://clutch.co/profile/belkins |
| SmartSites | https://clutch.co/profile/smartsites |
| Straight North | https://clutch.co/profile/straight-north |
| Victorious | https://clutch.co/profile/victorious |
| Uplers | https://clutch.co/profile/uplers |
| Daniel Brian Advertising | https://clutch.co/profile/daniel-brian-advertising |
| Thrive Internet Marketing Agency | https://clutch.co/profile/thrive-internet-marketing-agency |
| Big Leap | https://clutch.co/profile/big-leap |
| Mad Fish Digital | https://clutch.co/profile/mad-fish-digital |
| Razor Rank | https://clutch.co/profile/razor-rank |
| Brolik | https://clutch.co/profile/brolik |
| Search Berg | https://clutch.co/profile/search-berg |
| Socialfix Media | https://clutch.co/profile/socialfix-media |
| Kanbar Digital, LLC | https://clutch.co/profile/kanbar-digital |
| NextLeft | https://clutch.co/profile/nextleft |
| Fruition | https://clutch.co/profile/fruition |
| Impactable | https://clutch.co/profile/impactable |
| Lets Tok | https://clutch.co/profile/lets-tok |
| Pyxl | https://clutch.co/profile/pyxl |
| Sagefrog Marketing Group | https://clutch.co/profile/sagefrog-marketing-group |
| Foreignerds INC. | https://clutch.co/profile/foreignerds |
| Social Driver | https://clutch.co/profile/social-driver |
| 3 Media Web | https://clutch.co/profile/3-media-web |
| Brand Vision | https://clutch.co/profile/brand-vision-1 |
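The example above builds each profile URL by prepending the site root to the link's `href`, which works as long as every `href` is a root-relative path. A slightly more defensive sketch, assuming some links could already be absolute, uses `urllib.parse.urljoin` from the standard library. The helper name `absolute_url` and the constant `BASE_URL` are hypothetical.

```python
from urllib.parse import urljoin

BASE_URL = "https://clutch.co"  # site root used in the example above


def absolute_url(href, base=BASE_URL):
    """Resolve href against base; an href that is already absolute is returned as-is."""
    return urljoin(base, href)


print(absolute_url("/profile/webfx"))                   # https://clutch.co/profile/webfx
print(absolute_url("https://clutch.co/profile/webfx"))  # https://clutch.co/profile/webfx
```

In the list comprehension this would replace the f-string with `absolute_url(item["href"])`.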
Try this:
from selenium import webdriver
from bs4 import BeautifulSoup
import time
link = 'https://clutch.co/agencies/digital-marketing'
driver = webdriver.Chrome()
# Go to the website
driver.get(link)
# Wait for the page to load
time.sleep(5)
# Get the page source
html = driver.page_source
# Create a BeautifulSoup object
soup = BeautifulSoup(html, 'lxml')
# Select the links inside each h3.company_info and print their text
for item in soup.select("h3.company_info > a"):
    print(item.text)
# Close the browser
driver.quit()