I'm scraping links from a website that is updated with new links every day, and downloading the linked files. I want each run of my code to scrape/download only the links added since the last run, instead of processing everything all over again.
I've tried appending previously scraped links to an empty list and only executing the rest of the code (downloading and renaming the file) when a scraped link isn't found in that list. But it doesn't seem to work as hoped: every time I run the code, it starts over from scratch and overwrites the previously downloaded files.
Should I try a different approach?
Here is my code (general advice on how to clean it up and improve it is also welcome):
import praw
import requests
from bs4 import BeautifulSoup
import urllib.request
from difflib import get_close_matches
import os
period = '2018 Q4'
url = 'https://old.reddit.com/r/test/comments/b71ug1/testpostr23432432/'
headers = {'User-Agent': 'Mozilla/5.0'}
page = requests.get(url, headers=headers)
#set soup
soup = BeautifulSoup(page.text, 'html.parser')
table = soup.find_all('table')[0]
#create list of desired file names from existing directory names
candidates = os.listdir('/Users/test/Desktop/Test')
#set directory to download scraped files to
downloads_folder = '/Users/test/Desktop/Python/project/downloaded_files/'
#create empty list of names
scraped_name_list = []
#scrape site for names and links
for anchor in table.findAll('a'):
    try:
        if not anchor:
            continue
        name = anchor.text
        letter_link = anchor['href']
        #if name doesn't exist in list of names: append it to the list, download it, and rename it
        if name not in scraped_name_list:
            #append it to name list
            scraped_name_list.append(name)
            #download it
            urllib.request.urlretrieve(letter_link, '/Users/test/Desktop/Python/project/downloaded_files/' + period + " " + name + '.pdf')
            #rename it
            best_options = get_close_matches(name, candidates, n=1, cutoff=.33)
            try:
                if best_options:
                    name = (downloads_folder + period + " " + name + ".pdf")
                    os.rename(name, downloads_folder + period + " " + best_options[0] + ".pdf")
            except:
                pass
    except:
        pass
    #else skip it
    else:
        pass
Every time you run the script, it re-creates scraped_name_list as a new, empty list. What you need to do is save the list at the end of each run and load it back in at the start of every subsequent run. The pickle library is great for this.
Instead of defining scraped_name_list = [], try something like this:
import pickle

try:
    with open('/path/to/your/stuff/scraped_name_list.lst', 'rb') as f:
        scraped_name_list = pickle.load(f)
except IOError:
    scraped_name_list = []
This will try to open your saved list, but if it's the first run (meaning the list doesn't exist yet), it starts with an empty list. Then, at the end of your code, you just save the file so it's available on the next run:
with open('/path/to/your/stuff/scraped_name_list.lst', 'wb') as f:
    pickle.dump(scraped_name_list, f)
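Putting the pieces together, the flow of your script would look roughly like this. This is a minimal sketch, not a drop-in replacement: it assumes table, downloads_folder, and period are defined exactly as in your code, and list_path is a placeholder you'd point at your own directory:

import pickle
import urllib.request

list_path = '/Users/test/Desktop/Python/project/scraped_name_list.lst'  # placeholder path

# load the names scraped on previous runs (falls back to empty on the first run)
try:
    with open(list_path, 'rb') as f:
        scraped_name_list = pickle.load(f)
except IOError:
    scraped_name_list = []

for anchor in table.findAll('a'):
    name = anchor.text
    if name in scraped_name_list:
        continue  # already handled on a previous run, so skip it
    scraped_name_list.append(name)
    urllib.request.urlretrieve(anchor['href'], downloads_folder + period + " " + name + '.pdf')

# persist the updated list so the next run skips everything downloaded so far
with open(list_path, 'wb') as f:
    pickle.dump(scraped_name_list, f)

If the list grows large, storing the names in a set would make the membership test constant-time, and pickle can serialize a set just as easily as a list.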