Python BeautifulSoup and openpyxl

Question

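I'm trying to use BeautifulSoup for data extraction (a web crawler/scraper), and I'm iterating over each tag in the HTML to find the data I want. My goal is to pick out specific information and put it into an Excel sheet using the openpyxl library. For example: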

<table id="Table">   
    <tr>
        <th>Info A1</th>
        <th>Info B1</th>
        <th>Info C1</th>
        <th>Info D1</th>
        <th>Info E1</th>
    </tr>
    <tr>
        <th>Info A2</th>
        <th>Info B2</th>
        <th>Info C2</th>
        <th>Info D2</th>
        <th>Info E2</th>
    </tr>
</table>

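Basically, what I want to do is compare all the "A number" values in the table, and if one of them matches information I already have, get the rest of the information in the same tr and put it into an Excel file. The real table is bigger than the one in this example, and I have already managed to iterate over it, but I don't know how to identify the information I want and compare it with what I already have.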

Answer 1

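# Assumes `soup` is the parsed BeautifulSoup document and
# `my_list` is the list of A values you want to match.
d = {}
for tr in soup.find_all('tr'):
    # Read each cell separately so multi-word values like "Info A1" stay intact
    cells = [cell.get_text(strip=True) for cell in tr.find_all(['th', 'td'])]
    if not cells:
        continue  # skip rows without any cells
    d[cells[0]] = cells[1:]
for key in d:
    if key in my_list:
        print(key)  # prints the match from your list
        print(d[key])  # prints the values attached to the match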

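Create an empty dictionary, iterate over the soup (where your table should reside), and add each A value as a key, with the B/C/D/E values from the same row stored as its value in a list.

Then, for each key in the dictionary (an A value), check whether it appears in my_list (your list of A values); if a match is found, run the print statements (which you should change to fit your needs), where key corresponds to the A value and d[key] to the B/C/D/E values for that A value.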
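
Since the goal is an Excel file rather than console output, the print statements can be swapped for openpyxl calls. Here is a minimal sketch, assuming the sample table from the question. The names `matching_rows`, `build_workbook`, `my_list`, and the file name `matches.xlsx` are made up for illustration and are not part of the original answer:

```python
from bs4 import BeautifulSoup
from openpyxl import Workbook

# Sample table copied from the question
HTML = """
<table id="Table">
    <tr>
        <th>Info A1</th><th>Info B1</th><th>Info C1</th><th>Info D1</th><th>Info E1</th>
    </tr>
    <tr>
        <th>Info A2</th><th>Info B2</th><th>Info C2</th><th>Info D2</th><th>Info E2</th>
    </tr>
</table>
"""


def matching_rows(html, wanted):
    """Return the cell texts of every row whose first cell is in `wanted`."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find_all("tr"):
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        if cells and cells[0] in wanted:
            rows.append(cells)
    return rows


def build_workbook(rows):
    """Create a workbook with one spreadsheet row per matched table row."""
    wb = Workbook()
    ws = wb.active
    for row in rows:
        ws.append(row)  # append() writes the list into the next empty row
    return wb


my_list = ["Info A2"]  # hypothetical: the A values you already have
rows = matching_rows(HTML, my_list)
wb = build_workbook(rows)

if __name__ == "__main__":
    wb.save("matches.xlsx")  # hypothetical output file name
```

If my_list is large, turning it into a set makes the `in` check faster, but for a handful of values a list is fine.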

Answer 2

import requests
from bs4 import BeautifulSoup
import pandas as pd

# URL of the YÖK university list
yok_url = "https://www.yok.gov.tr/universiteler-listesi"

# Headers that mimic a browser
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}

try:
    # Fetch the HTML content of the page
    response = requests.get(yok_url, headers=headers)
    response.raise_for_status()

    # Parse the page using BeautifulSoup
    soup = BeautifulSoup(response.content, "html.parser")

    # Find the section containing the university list
    universities = []
    for row in soup.select("table tr"):  # Assuming data is in a table structure
        columns = row.find_all("td")
        if len(columns) >= 2:  # need both a name cell and a website cell
            name = columns[0].get_text(strip=True)
            link = columns[1].find("a")
            website = link["href"] if link and link.has_attr("href") else "No Website"
            universities.append({"University Name": name, "Website": website})

    # Save results to an Excel file (pandas needs openpyxl installed to write .xlsx)
    df = pd.DataFrame(universities)
    file_path = "Turkish_Universities_List.xlsx"
    df.to_excel(file_path, index=False)
    print(f"Data successfully scraped and saved to '{file_path}'.")

except Exception as e:
    print(f"An error occurred: {e}")
