Python: Scraping links from multiple URLs

Question

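I'm trying to adapt the code below to scrape links from several pages. For example, if one page has 40 links, I'd expect to get 400 links from 10 pages.

The web pages follow this pattern:

https://www.examples.com/user/username#page1-videos (the "1" in "page1" is the part that changes).

The links on each page follow this pattern:

https://www.example.com/video/1423061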

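I have a couple of questions:

The original code uses "name_list" and "link_list". I don't need the "Name" column in the final csv, only one column (the url). I tried simply deleting everything involving name_list, but the df ended up empty. How do I fix this?
I'd like to put all the URLs to scrape into a .txt file and have the code iterate over each line of that file. How do I do that?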

import requests
from bs4 import BeautifulSoup
import pandas as pd
i=0
name_list =[]
link_list = []
while(i<=13780):
    print("https://jito.org/members?start={}".format(i))
    res=requests.get("https://jito.org/members?start={}".format(i))
    soup=BeautifulSoup(res.text,"html.parser")
    for item in soup.select('.name>a'):
        name_list.append(item.text)
        link_list.append("https://jito.org" + item['href'])
    i=i+20
    print(name_list)
    print(link_list)

df=pd.DataFrame({"Name":name_list,"Link":link_list})
print(df)
df.to_csv('JITO_Directory.csv', index=False)
print('Done')

Each web page I scrape has 40 links, so if I scrape 2 pages:

http://www.example.com/user/person#page2

http://www.example.com/user/person3#page7

I want a txt or csv file containing the 80 links scraped from those 2 pages:

https://www.example.com/video/1423061

https://www.example.com/video/1417634

https://www.example.com/video/1417639

Answer 1

import time
import requests
from bs4 import BeautifulSoup
import pandas as pd

BASE_URL = "https://jito.org"

i=0

profiles =[]

while True:
    url = BASE_URL + f"/members?start={i}"
    print(url)
    response = requests.get(url)
    print(f"- Status: {response.status_code}.")

    soup = BeautifulSoup(response.text,"html.parser")

    links = soup.select('h3.name > a')
    print(f"- Got {len(links)} links.")
    
    # Stop when there are no more profiles.
    if len(links) == 0:
        break

    for link in links:
        profiles.append({
            "name": link.text,
            "url": BASE_URL + link["href"]
        })

    # Go easy on the site.
    time.sleep(3)

    i += 20

df = pd.DataFrame(profiles)
print(df)
df.to_csv('jito-directory.csv', index=False)

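The changes I made to your original code:

Don't hard-code the number of profiles. Instead, keep paginating until there is no more data.
Rather than maintaining two separate lists of names and links, just build a single list of dictionaries.
Pause between requests. This is good web-scraping practice because it reduces the impact on the target site's performance.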
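
The two points raised in the question (a single url column, and reading the pages to scrape from a .txt file) could be handled along the following lines. This is only a minimal sketch under several assumptions: the function and class names are made up for illustration, the "/video/" filter is inferred from the link pattern given in the question, and it uses the standard library (urllib, html.parser, csv) in place of requests, BeautifulSoup and pandas so that it runs without extra installs. Note also that the part of a URL after "#" is never sent to the server, so pages that differ only in that fragment may well return the same HTML; the sketch does not try to work around that.

```python
import csv
import time
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class VideoLinkParser(HTMLParser):
    """Collects the href of every <a> tag whose href contains '/video/'."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Assumption: wanted links look like .../video/<id>, as in the question.
        if href and "/video/" in href:
            # Turn relative hrefs into absolute URLs.
            self.links.append(urljoin(self.base_url, href))


def extract_links(html, base_url):
    """Return the video links found in one page of HTML."""
    parser = VideoLinkParser(base_url)
    parser.feed(html)
    return parser.links


def read_urls(path):
    """Read one URL per line from a text file, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


def fetch(url):
    """Download a page and return its body as text."""
    with urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def scrape(urls, fetch=fetch, delay=3):
    """Collect the links from every URL, pausing between requests."""
    all_links = []
    for n, url in enumerate(urls):
        if n:
            time.sleep(delay)  # Go easy on the site.
        all_links.extend(extract_links(fetch(url), url))
    return all_links


def write_links(links, path):
    """Write the links to a CSV file with a single 'url' column."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["url"])
        writer.writerows([link] for link in links)
```

With a hypothetical urls.txt holding one page address per line, calling write_links(scrape(read_urls("urls.txt")), "links.csv") would produce a one-column file. Because only a single list of links is ever built, there is no second list left over to drop, which avoids the empty-DataFrame problem described in the question.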
