I'm currently using BS4 to extract some information from a Kickstarter page: https://www.kickstarter.com/projects/louisalberry/louis-alberry-debut-album-uk-european-tour
The project information sits inside one of the script tags: (pseudocode)
...
<script>...</script>
<script>
window.current_ip = ...
...
window.current_project = "<I want this part>"
</script>
...
My current code:
from bs4 import BeautifulSoup
from urllib.request import urlopen
import html
html_ = urlopen("https://www.kickstarter.com/projects/louisalberry/louis-alberry-debut-album-uk-european-tour").read()
soup = BeautifulSoup(html_, 'html.parser')
# why does this not work?
# soup.find('script', re.compile("window.current_project"))
# currently, I'm doing this:
all_string = html.unescape(soup.find_all('script')[4].get_text())
# then some regex here on all_string to extract the current_project information
Currently I can get the part I want using the index [4], but since I'm not sure that index is reliable in general: how can I extract the text from the correct script tag?
Thanks!
You can collect all the script elements and loop over them, using requests to access the response content:
from bs4 import BeautifulSoup
import requests
res = requests.get("https://www.kickstarter.com/projects/louisalberry/louis-alberry-debut-album-uk-european-tour")
soup = BeautifulSoup(res.content, 'lxml')
scripts = soup.select('script')
for script in scripts:
    if 'window.current_project' in script.text:
        print(script)
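Alternatively, the commented-out `find` from the question fails because a plain second argument to `find` filters on tag *attributes*; to match against a tag's text you pass the regex via the `string=` keyword instead. Here is a minimal sketch against an inlined HTML snippet (assumption: the real Kickstarter page embeds the project JSON HTML-escaped inside a `window.current_project` assignment, as the question's pseudocode suggests):

```python
import re
import html
import json
from bs4 import BeautifulSoup

# Stand-in for the downloaded page (assumption: the real page embeds the
# project JSON HTML-escaped inside a window.current_project assignment).
sample = """
<html><body>
<script>window.current_ip = "0.0.0.0";</script>
<script>
window.current_project = "{&quot;id&quot;:123,&quot;name&quot;:&quot;Louis Alberry&quot;}";
</script>
</body></html>
"""

soup = BeautifulSoup(sample, 'html.parser')

# find() with a bare second argument matches attributes, not text;
# string= applies the regex to the tag's text content instead.
tag = soup.find('script', string=re.compile('window.current_project'))

# Pull the quoted value out of the assignment and undo the HTML escaping.
raw = re.search(r'window\.current_project = "(.*)";', tag.string).group(1)
project = json.loads(html.unescape(raw))
print(project['name'])  # -> Louis Alberry
```

Swapping `sample` for the real page content fetched with requests should give you the project dict directly, without relying on the `[4]` index.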
This should work (instead of dumping to json you could just print the output if needed; and remember to change the placeholder variables where I wrote "CHOOSE A PATH" and "If theres any class add it here"):
from bs4 import BeautifulSoup
import requests
import json
import os

website = requests.get("https://www.kickstarter.com/projects/louisalberry/louis-alberry-debut-album-uk-european-tour")
soup = BeautifulSoup(website.content, 'lxml')
# Keep the class filter only if the target script tags actually have a class
mytext = soup.find_all("script", {"class": "If theres any class add it here, or else delete this part"})
save_path = 'CHOOSE A PATH'
ogname = "kickstarter_text.json"
completename = os.path.join(save_path, ogname)
# Tag objects aren't JSON-serializable, so dump their string form
with open(completename, "w") as output:
    json.dump([str(tag) for tag in mytext], output)