Python web scraping with bs4 not working with class pg-bodyCopy has-apos

Question · Votes: 1 · Answers: 2

I am trying to scrape:

https://www.washingtonpost.com/graphics/politics/trump-claims-database/?noredirect=on&utm_term=.66adf0edf80b

All the dates and the text on the left-hand side of the page.

So far I have tried the following code, but it retrieves only 17 results, and only some of them come from the correct text.

import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.washingtonpost.com/graphics/politics/trump-claims-database/?noredirect=on&utm_term=.4a34a0231c12')
html = BeautifulSoup(r.content, 'html.parser')
results = html.find_all('p', 'pg-bodyCopy')  # matches <p class="pg-bodyCopy">, but only the claims present in the initial HTML

My question is:

How do I get one list containing all the text on the left-hand side, and another list containing the dates corresponding to each text?

Sample output:

[(Mar 3 2019,After more than two years of Presidential Harassment, the only things that have been proven is that Democrats and other broke the law. The hostile Cohen testimony, given by a liar to reduce his prison time, proved no Collusion! His just written book manuscript showed what he said was a total lie, but Fake Media won't show it. I am an innocent man being persecuted by some very bad, conflicted & corrupt people in a Witch Hunt that is illegal & should never have been allowed to start - And only because I won the Election!)]

Edit: Just wondering whether it is also possible to retrieve the source (Twitter, Facebook, etc.) shown in the image. So far I have tried:

loc=row.find('div',class_='details expanded').text.strip()

python web-scraping beautifulsoup request
2 Answers
1 vote

The items you are looking for are not all available directly; the page loads them dynamically. You can use selenium to click the Load More button multiple times to load all the data, then grab the page source.

Code:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from bs4 import BeautifulSoup
driver = webdriver.Chrome(executable_path='/home/bitto/chromedriver')
url="https://www.washingtonpost.com/graphics/politics/trump-claims-database/?noredirect=on&utm_term=.777b6a97b73d"#your url here
driver.get(url)
claim_list=[]
date_list=[]
source_list=[]
i=50
while i<=50: #change 50 to 9000 to scrape all the texts
    element=WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR ,'button.pg-button')))
    element.click()
    i+=50
#getting the data and printing it out
soup=BeautifulSoup(driver.page_source,'html.parser')
claim_rows=soup.find_all('div',class_='claim-row')
for row in claim_rows:
    date=row.find('div',class_='dateline').text.strip()
    claim=row.find('div',class_='claim').text.replace('"','').strip()
    source=row.find('div',class_='details not-expanded').find_all('p')[1].find('span').text
    claim_list.append(claim)
    date_list.append(date)
    source_list.append(source)

#zip the lists to make the output easier to view
print(list(zip(date_list,claim_list,source_list)))

Output

[('Mar 3 2019', "“Presidential Harassment by 'crazed' Democrats at the highest level in the history of our Country. Likewise, the most vicious and corrupt Mainstream Media that any president has ever had to endure.”", 'Twitter'), ('Mar 3 2019', "“After more than two years of Presidential Harassment, the only things that have been proven is that Democrats and other broke the law. The hostile Cohen testimony, given by a liar to reduce his prison time, proved no Collusion! His just written book manuscript showed what he said was a total lie, but Fake Media won't show it. I am an innocent man being persecuted by some very bad, conflicted & corrupt people in a Witch Hunt that is illegal & should never have been allowed to start - And only because I won the Election!”", 'Twitter'), ('Mar 3 2019', '“The reason I do not want military drills with South Korea is to save hundreds of millions of dollars for the U.S. for which we are not reimbursed. ”', 'Twitter'), ('Mar 3 2019', "“For the Democrats to interview in open hearings a convicted liar & fraudster, at the same time as the very important Nuclear Summit with North Korea, is perhaps a new low in American politics and may have contributed to the 'walk.' Never done when a president is overseas. Shame!”", 'Twitter'), ('Mar 3 2019', '“The most successful first two years for any President. We are WINNING big, the envy of the WORLD.”', 'Twitter'), ('Mar 2 2019', '“Remember you have Nebraska. We won both [Electoral College votes] in Nebraska. We won the half.”', 'Remarks'),...]
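
As written, the loop above clicks the button a fixed number of times. A variant (an untested sketch, reusing the driver, By, EC, and WebDriverWait names from the script above) keeps clicking until no clickable button appears within the timeout, using the TimeoutException that is already imported:

from selenium.common.exceptions import TimeoutException

# Keep clicking "Load More" until no clickable button is found within
# 10 seconds, then assume all claims have been loaded. Assumes the same
# 'button.pg-button' selector used above still matches the button.
while True:
    try:
        button = WebDriverWait(driver, 10).until(
            EC.element_to_be_clickable((By.CSS_SELECTOR, 'button.pg-button')))
        button.click()
    except TimeoutException:
        break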

1 vote

The data you are looking for is available here:

https://www.washingtonpost.com/graphics/politics/trump-claims-database/js/base.js?c=230b1e82e2fc6c49a25a4c6554455c3bf0f527d5-1551707436

It is a JS array named 'claims'. Each entry looks like this:

{
  id: "8920",
  date: "Mar 3 2019",
  location: "Twitter",
  claim: "“Presidential Harassment by 'crazed' Democrats at the highest level in the history of our Country. Likewise, the most vicious and corrupt Mainstream Media that any president has ever had to endure.â€",
  analysis: 'The scrutiny of President Trump by the House of Representatives is little different than the probes launched by Republicans of Barack Obama, Democrats of George W. Bush or Republicans of Bill Clinton, just to name of few recent examples. President John Tyler was actually ousted by his party (the Whigs) while Andrew Johnson and Clinton were impeached. As for media coverage, Trump regularly appears to believe it should only be positive. He has offered little evidence the media is "corrupt."',
  pinocchios: null,
  category: "Miscellaneous",
  repeated: null,
  r_id: null,
  full_story_url: null,
  unixDate: "1551589200"
}
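
Rather than saving the page to disk by hand (as in the code below), the same base.js file could be fetched directly with requests; a minimal sketch, assuming the URL above still serves the script (the cache-busting ?c=... query string may change over time):

import requests

# Hypothetical direct fetch of base.js instead of downloading it manually;
# the URL is copied from above, including its cache-busting query string.
js_url = ('https://www.washingtonpost.com/graphics/politics/'
          'trump-claims-database/js/base.js'
          '?c=230b1e82e2fc6c49a25a4c6554455c3bf0f527d5-1551707436')
dirty_claims = requests.get(js_url).text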

Code (I have downloaded the page content to my file system as claims.txt):

I am using demjson to decode the JS array (which is not strict JSON, since its keys are unquoted) into Python dictionaries:

import demjson

start_str = 'e.exports={claims:'
end_str = 'lastUpdated'
with open('c:\\temp\\claims.txt', 'r', encoding="utf8") as claims_file:
    dirty_claims = claims_file.read()
    start_str_idx = dirty_claims.find(start_str)
    end_str_idx = dirty_claims.rfind(end_str)
    print('{} {}'.format(start_str_idx, end_str_idx))
    # Slice out just the claims array: skip the prefix and stop before
    # 'lastUpdated', dropping the separating comma.
    claims_str = dirty_claims[start_str_idx + len(start_str):end_str_idx - 1]
    claims = demjson.decode(claims_str)
    for claim in claims:
        print(claim)
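
Once decoded, claims is a list of dictionaries, so producing the (date, text, source) tuples the question asks for takes one more line; a small sketch, assuming the 'location' field holds the source (Twitter, Remarks, ...) as the sample entry above suggests:

# Build (date, claim, source) tuples from the decoded entries.
# 'location' appears to hold the source, per the sample entry above.
triples = [(c['date'], c['claim'], c['location']) for c in claims]
print(triples[:3])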