Converting HTML data to text format - Python

Question

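I am using Selenium WebDriver to extract data points from LinkedIn profiles. In this example, I want to extract each skill from the skills section, but the data comes back as HTML.

When I try to convert the HTML to text, I get the attached error message.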

from parsel import Selector  
from selenium import webdriver 
from selenium.webdriver.common.keys import Keys  
from bs4 import BeautifulSoup 

driver = webdriver.Chrome('/Users/davidcraven/Downloads/chromedriver')

# get profile URL
driver.get('https://www.linkedin.com/AnyProfileURL')

# assigning the source code for the web page to variable sel
sel = Selector(text=driver.page_source)

# get skills
skills = sel.xpath('//*[starts-with(@class, "skills searchable has-several ")]').extract()

newtext = BeautifulSoup(skills, "lxml").text

Answer 1

You can use Selenium to get all of the text from the page.

Try this: the following code prints the text to the console.

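from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# Selenium 4 style: the driver path is passed through a Service object
driver = webdriver.Chrome(service=Service("chromedriver.exe"))

driver.get("https://www.linkedin.com/in/profile")

# find_element(By.TAG_NAME, ...) replaces the removed find_element_by_tag_name
elem = driver.find_element(By.TAG_NAME, "body")
print(elem.text)

driver.quit()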

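Edit:

In your code, sel.xpath().extract() returns a list, which is assigned to skills. BeautifulSoup expects a string of markup (or a file-like object), not a list, so you have to iterate over the list to get the text of each item. The following code prints the text it finds to the console.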

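from parsel import Selector
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from bs4 import BeautifulSoup

# Selenium 4 style: the driver path is passed through a Service object
driver = webdriver.Chrome(service=Service("chromedriver"))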

# get profile URL
driver.get('https://www.linkedin.com/in/AnyProfile')

# assigning the source code for the web page to variable sel
sel = Selector(text=driver.page_source)

# get skills
skills = sel.xpath('//*[starts-with(@class, "skills searchable has-several ")]').extract()

# newtext = BeautifulSoup(skills, "lxml").text
for skl in skills:
    print(BeautifulSoup(skl,"lxml").text)

driver.quit()
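
To try the "iterate over the list" idea without a browser or a LinkedIn login, here is a minimal sketch. It is not the code from the answer: it swaps BeautifulSoup for the standard library's html.parser so it runs with no extra packages, and the skills list is a made-up stand-in for what sel.xpath(...).extract() might return. The names TextExtractor and html_to_text are hypothetical helpers used only for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML fragment and ignores the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for every run of text found between tags.
        stripped = data.strip()
        if stripped:
            self.chunks.append(stripped)


def html_to_text(fragment):
    """Return the text content of one HTML fragment given as a str."""
    parser = TextExtractor()
    parser.feed(fragment)
    parser.close()  # flush any text still buffered by the parser
    return " ".join(parser.chunks)


# Hypothetical stand-in for the list returned by sel.xpath(...).extract()
skills = [
    '<li class="skill"><span>Python</span></li>',
    '<li class="skill"><span>Data <b>Analysis</b></span></li>',
]

# Convert each fragment on its own; the list as a whole is not valid input.
texts = [html_to_text(fragment) for fragment in skills]
print(texts)
```

The point is the same as in the answer above: the extractor hands back one HTML string per matched element, so the conversion has to be applied per item rather than to the list.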
