How to make a pagination loop for a specific paginated page (the pages differ every day)

Question

Summary

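I am working on a university project on supply chain management and want to analyze the daily postings on a website in order to record industry demand for services/products. The specific page changes every day and has a varying number of containers and pages: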

https://buyandsell.gc.ca/procurement-data/search/site?f%5B0%5D=sm_facet_procurement_data%3Adata_data_tender_notice&f%5B1%5D=dds_facet_date_published%3Adds_facet_date_published_today

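Background

The code generates a CSV file by scraping HTML tags and recording data points (never mind the headers). I tried to use a 'for' loop, but the code still only scans the first page.

Python knowledge level: beginner, learning the "hard way" through YouTube and Google searches. I found examples that work at my level of understanding, but I have trouble combining people's different solutions.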

Code at the moment

import bs4
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

problem starts here

for page in range (1,3):
    my_url = 'https://buyandsell.gc.ca/procurement-data/search/site?f%5B0%5D=sm_facet_procurement_data%3Adata_data_tender_notice&f%5B1%5D=dds_facet_date_published%3Adds_facet_date_published_today'

uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "html.parser")
containers = page_soup.findAll("div",{"class":"rc"})

this part does not write in addition to existing line items

filename = "BuyandSell.csv"
f = open(filename, "w")
headers = "Title, Publication Date, Closing Date, GSIN, Notice Type, Procurement Entity\n"
f.write(headers)

for container in containers:
    Title = container.h2.text

    publication_container = container.findAll("dd",{"class":"data publication-date"})
    Publication_date = publication_container[0].text

    closing_container = container.findAll("dd",{"class":"data date-closing"})
    Closing_date = closing_container[0].text

    gsin_container = container.findAll("li",{"class":"first"})
    Gsin = gsin_container[0].text

    notice_container = container.findAll("dd",{"class":"data php"})
    Notice_type = notice_container[0].text

    entity_container = container.findAll("dd",{"class":"data procurement-entity"})
    Entity = entity_container[0].text

    print("Title: " + Title)
    print("Publication_date: " + Publication_date)
    print("Closing_date: " + Closing_date)
    print("Gsin: " + Gsin)
    print("Notice: " + Notice_type)
    print("Entity: " + Entity)

    f.write(Title + "," +Publication_date + "," +Closing_date + "," +Gsin + "," +Notice_type + "," +Entity +"\n")

f.close()

Please let me know if you would like to see more. The rest defines the data containers that are found in the HTML code and printed to the CSV. Any help/advice would be highly appreciated. Thanks!

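Actual results:

The code generates a CSV file for the first page only.

The code does not append to what has already been scraped (day over day).

Expected results:

The code scans the next page and recognizes when there are no more pages to go through.

The CSV file gets 10 rows per page (or however many there are on the last page, since the number is not always 10).

The code appends to what has already been scraped (so that the historical data can be used for more advanced analysis with Excel tools).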

Answer 1

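Some might say that using pandas is overkill here, but I am personally comfortable with it and like using it to create tables and write them to files.

There is probably also a more robust way to go from page to page, but I just wanted to get this to you so you can work with it.

As of now, I simply hard-code the page values (I arbitrarily picked 20 pages as the maximum), so it starts at page 1 and goes through 20 pages (or stops as soon as it reaches an invalid page).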

import pandas as pd
from bs4 import BeautifulSoup
import requests
import os

filename = "BuyandSell.csv"

# Initialize an empty 'results' dataframe
results = pd.DataFrame()

# Iterate through the pages (the site's 'page' parameter starts at 0)
for page in range(0,20):
    url = 'https://buyandsell.gc.ca/procurement-data/search/site?page=' + str(page) + '&f%5B0%5D=sm_facet_procurement_data%3Adata_data_tender_notice&f%5B1%5D=dds_facet_date_published%3Adds_facet_date_published_today'

    page_html = requests.get(url).text
    page_soup = BeautifulSoup(page_html, "html.parser")
    containers = page_soup.findAll("div",{"class":"rc"})

    # Get data from each container
    if containers != []:
        for each in containers:
            title = each.find('h2').text.strip()
            publication_date = each.find('dd', {'class':'data publication-date'}).text.strip()
            closing_date = each.find('dd', {'class':'data date-closing'}).text.strip()
            gsin = each.find('dd', {'class':'data gsin'}).text.strip()
            notice_type = each.find('dd', {'class':'data php'}).text.strip()
            procurement_entity = each.find('dd', {'class':'data procurement-entity'}).text.strip()

            # Create 1 row dataframe
            temp_df = pd.DataFrame([[title, publication_date, closing_date, gsin, notice_type, procurement_entity]], columns = ['Title', 'Publication Date', 'Closing Date', 'GSIN', 'Notice Type', 'Procurement Entity'])

            # Append that row to the 'results' dataframe
            # (DataFrame.append was removed in pandas 2.0, so use pd.concat)
            results = pd.concat([results, temp_df], ignore_index=True)
        print ('Acquired page ' + str(page+1))

    else:
        print ('No more pages')
        break


# If already have a file saved
if os.path.isfile(filename):

    # Read in previously saved file
    df = pd.read_csv(filename)

    # Append the newest results (ignore_index avoids adding a stray 'index' column)
    df = pd.concat([df, results], ignore_index=True)

    # Drop any duplicates (in case the newest results aren't really new)
    df = df.drop_duplicates()

    # Save the previous file, with appended results
    df.to_csv(filename, index=False)

else:

    # If a previous file not already saved, save a new one
    df = results.copy()
    df.to_csv(filename, index=False)
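
As a possible follow-up to the hard-coded limit of 20 pages, here is a minimal sketch of a loop that keeps requesting pages until one comes back empty and then appends the rows to the CSV with the standard csv module. The names scrape_all, append_rows, fetch and the max_pages cap are my own, made up for illustration; fetch stands for any function that takes a zero-based page number and returns a list of rows, for example a wrapper around the requests/BeautifulSoup code above.

```python
import csv
import os


def scrape_all(fetch, max_pages=500):
    """Collect rows from page 0, 1, 2, ... until a page returns no rows.

    `fetch` is a caller-supplied function (hypothetical): page number -> list of rows.
    `max_pages` is an assumed safety cap so a misbehaving site cannot loop forever.
    """
    all_rows = []
    for page in range(max_pages):
        rows = fetch(page)
        if not rows:  # an empty page means we ran past the last one
            break
        all_rows.extend(rows)
    return all_rows


def append_rows(path, rows, header):
    """Append rows to a CSV file; write the header only if the file is new."""
    is_new = not os.path.isfile(path)
    # Mode "a" adds to the end of the file instead of overwriting it like "w".
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)  # quotes fields that contain commas
        if is_new:
            writer.writerow(header)
        writer.writerows(rows)
```

Calling something like append_rows("BuyandSell.csv", scrape_all(my_fetch), header) once a day (my_fetch being your own page-scraping function) would then add that day's rows under the existing ones. Unlike the pandas version, this sketch does not drop duplicates, so running it twice on the same day writes the same rows twice.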
