I have seen several questions and posts about how to scrape tweets from a specific handle, but none about how to get all the replies to a specific tweet using Python in a Jupyter Notebook.
Example: I want to scrape and export all 340 replies to this public BBC tweet, "Microplastics found in fresh Antarctic snow for first time" (https://twitter.com/BBCWorld/status/1534777385249390593).
I need the following information: the date of the reply, who it is replying to (so I keep only replies to BBC, not replies to other users in the thread), and the reply text.
Inspecting the page elements at that URL, I found that the reply container has the class:
css-1dbjc4n
Likewise:
css-1dbjc4n r-1loqt21 r-18u37iz r-1ny4l3l r-1udh08x r-1qhn6m8 r-i023vh r-o7ynqc r-6416eg
css-4rbku5 css-18t94o4 css-901oao r-14j79pv r-1loqt21 r-1q142lx r-37j5jr r-a023e6 r-16dba41 r-rjixqe r-bcqeeo r-3s2u2q r-qvutc0
css-901oao css-16my406 r-poiln3 r-bcqeeo r-qvutc0
I tried running the code below, but the lists stay empty :(
Current result:
Empty DataFrame
Columns: [Date of Tweet, Replying to, Tweet]
Index: []
Can anyone help me? Thanks a lot! :)
Code:
import sys
sys.path.append("path to site-packages in your pc")
from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd

driver = webdriver.Chrome(executable_path=r"C:chromedriver path in your pc")

dates = []     # list to store the date of each tweet
replies = []   # list to store the "replying to" info
comments = []  # list to store the reply text

driver.get("https://twitter.com/BBCWorld/status/1534777385249390593")
twts = []
content = driver.page_source
soup = BeautifulSoup(content)
for a in soup.findAll('div', href=True, attrs={'class':'css-1dbjc4n'}):
    datetweet = a.find('div', attrs={'class':'css-1dbjc4n r-1loqt21 r-18u37iz r-1ny4l3l r-1udh08x r-1qhn6m8 r-i023vh r-o7ynqc r-6416eg'})
    replytweet = a.find('div', attrs={'class':'css-4rbku5 css-18t94o4 css-901oao r-14j79pv r-1loqt21 r-1q142lx r-37j5jr r-a023e6 r-16dba41 r-rjixqe r-bcqeeo r-3s2u2q r-qvutc0'})
    commenttweet = a.find('div', attrs={'class':'css-901oao css-16my406 r-poiln3 r-bcqeeo r-qvutc0'})
    dates.append(datetweet.text)
    replies.append(replytweet.text)
    comments.append(commenttweet.text)

df = pd.DataFrame({'Date of Tweet': dates, 'Replying to': replies, 'Tweet': comments})
df.to_csv('tweets.csv', index=False, encoding='utf-8')
print(df)
I see two problems:

The page uses JavaScript to add elements, and the JavaScript may need time to add all of them to the HTML - so you may need time.sleep(...) before you get driver.page_source. Or use waits in Selenium to wait for certain objects (before you get driver.page_source).
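For example, an explicit wait for the first tweet article might look like this (a minimal sketch; the 10-second timeout and the article[@data-testid="tweet"] selector are assumptions based on the markup used in the code further below):

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

# wait up to 10 seconds until at least one tweet <article> is present,
# instead of sleeping for a fixed amount of time
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.XPATH, '//article[@data-testid="tweet"]'))
)
content = driver.page_source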
HTML doesn't use <div href="..."> so your findAll('div', href=True, ...) is wrong. You have to remove href=True.
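That is, the loop would start like this (your original line, only with the href filter dropped):

for a in soup.findAll('div', attrs={'class': 'css-1dbjc4n'}):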
EDIT:

Below is the code I created, but it still needs to scroll the page to get more tweets, and later it may also need to click Show more replies to load the rest (see the scrolling sketch after the code).
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
#from webdriver_manager.chrome import ChromeDriverManager
from webdriver_manager.firefox import GeckoDriverManager
import pandas as pd
import time

#driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver = webdriver.Firefox(service=Service(GeckoDriverManager().install()))

driver.get("https://twitter.com/BBCWorld/status/1534777385249390593")
time.sleep(10)

# TODO: scroll page to get more tweets
#for _ in range(2):
#    last = driver.find_elements(By.XPATH, '//div[@data-testid="cellInnerDiv"]')[-1]
#    driver.execute_script("arguments[0].scrollIntoView(true)", last)
#    time.sleep(3)

all_tweets = driver.find_elements(By.XPATH, '//div[@data-testid]//article[@data-testid="tweet"]')

tweets = []

print(len(all_tweets)-1)

for item in all_tweets[1:]:  # skip the first tweet because it is the BBC tweet itself
    #print('--- item ---')
    #print(item.text)

    print('--- date ---')
    try:
        date = item.find_element(By.XPATH, './/time').text
    except:
        date = '[empty]'
    print(date)

    print('--- text ---')
    try:
        text = item.find_element(By.XPATH, './/div[@data-testid="tweetText"]').text
    except:
        text = '[empty]'
    print(text)

    print('--- replying_to ---')
    try:
        replying_to = item.find_element(By.XPATH, './/div[contains(text(), "Replying to")]//a').text
    except:
        replying_to = '[empty]'
    print(replying_to)

    tweets.append([date, replying_to, text])

df = pd.DataFrame(tweets, columns=['Date of Tweet', 'Replying to', 'Tweet'])
df.to_csv('tweets.csv', index=False, encoding='utf-8')
print(df)
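For the scrolling TODO, one rough sketch (under the same markup assumptions; the 3-second sleep is arbitrary) is to keep scrolling to the last loaded row until no new rows appear:

seen = 0
while True:
    cells = driver.find_elements(By.XPATH, '//div[@data-testid="cellInnerDiv"]')
    if len(cells) == seen:  # nothing new was loaded, stop scrolling
        break
    seen = len(cells)
    driver.execute_script("arguments[0].scrollIntoView(true)", cells[-1])
    time.sleep(3)  # give JavaScript time to load more replies

Note that Twitter removes off-screen rows from the DOM as you scroll, so in practice you would collect the tweet data inside this loop rather than in one pass afterwards.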