BeautifulSoup: replace a line break with a period and a space

Question Votes: 0 Answers: 4

I'm a bit stuck with BeautifulSoup.

Here is the relevant part of the source of the page I am scraping:

<div class="description">
Planet Nine was initially proposed to explain the clustering of orbits
Of Planet Nine's other effects, one was unexpected, the perpendicular orbits, and the other two were found after further analysis. Although other mechanisms have been offered for many of these peculiarities, the gravitational influence of Planet Nine is the only one that explains all four. 
</div>

Here is my BeautifulSoup code (relevant part only) that gets the text inside the description tag:

# Python 2 script (uses urllib2 and the print statement)
import sys
import urllib2
from bs4 import BeautifulSoup

quote_page = sys.argv[1]
page = urllib2.urlopen(quote_page)
soup = BeautifulSoup(page, 'html.parser')

description_box = soup.find('div', {'class':'description'})
description = description_box.get_text(separator=" ").strip()
print description

Running the script with python script.py https://example.com/page/2000 gives the following output:

Planet Nine was initially proposed to explain the clustering of orbits
Of Planet Nine's other effects, one was unexpected, the perpendicular orbits, and the other two were found after further analysis. Although other mechanisms have been offered for many of these peculiarities, the gravitational influence of Planet Nine is the only one that explains all four. 

How can I replace the line break with a period followed by a space, like this:

Planet Nine was initially proposed to explain the clustering of orbits. Of Planet Nine's other effects, one was unexpected, the perpendicular orbits, and the other two were found after further analysis. Although other mechanisms have been offered for many of these peculiarities, the gravitational influence of Planet Nine is the only one that explains all four.

Any ideas on how I can do this?

python web-scraping beautifulsoup
4 Answers
1
Votes

From here:

from bs4 import BeautifulSoup

html = '''<div class="description">
Planet Nine was initially proposed to explain the clustering of orbits
Of Planet Nine's other effects, one was unexpected, the perpendicular orbits, and the other two were found after further analysis. Although other mechanisms have been offered for many of these peculiarities, the gravitational influence of Planet Nine is the only one that explains all four.
</div>'''
n = 2                                # occurrence, i.e. the 2nd newline in this case
sep = '\n'                           # separator, i.e. newline
cells = html.split(sep)

# Rejoin the pieces, putting ". " in place of the n-th newline
html = sep.join(cells[:n]) + ". " + sep.join(cells[n:])
soup = BeautifulSoup(html, 'html.parser')
title_box = soup.find('div', attrs={'class': 'description'})
title = title_box.get_text().strip()
print(title)

OUTPUT:

Planet Nine was initially proposed to explain the clustering of orbits. Of Planet Nine's other effects, one was unexpected, the perpendicular orbits, and the other two were found after further analysis. Although other mechanisms have been offered for many of these peculiarities, the gravitational influence of Planet Nine is the only one that explains all four.

Edit:

import requests
from bs4 import BeautifulSoup

page = requests.get("https://blablabla.com")
soup = BeautifulSoup(page.content, 'html.parser')
description_box = soup.find('div', attrs={'class': 'description'})
description = description_box.get_text().strip()

# After strip() the text starts with the first sentence,
# so the newline to replace is the 1st one, not the 2nd.
n = 1                                # occurrence, i.e. the 1st newline in this case
sep = '\n'                           # separator, i.e. newline
cells = description.split(sep)
desired = sep.join(cells[:n]) + ". " + sep.join(cells[n:])

print(desired)

0
Votes

Try this:

description = description_box.get_text(separator=" ").rstrip("\n")
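Note that separator is only inserted between separate text nodes. The div in the question holds a single text node, so the line break inside it is left untouched, and rstrip only removes trailing newlines. A minimal sketch of replacing the inner newline directly, assuming raw stands in for the string that get_text() returns for that div (the variable names and the shortened sample text are illustrative):

```python
# Assumption: raw is the string get_text() returns for the div in the question
# (the second sentence is shortened here).
raw = (
    "\nPlanet Nine was initially proposed to explain the clustering of orbits\n"
    "Of Planet Nine's other effects, one was unexpected, the perpendicular orbits.\n"
)

# Strip the outer whitespace first so that only the inner line break is replaced.
description = raw.strip().replace("\n", ". ")
print(description)
```

This relies on the text containing exactly one inner line break with no blank lines around it; a page with extra blank lines would produce doubled periods.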

0
Votes

Use select with split and join:

from bs4 import BeautifulSoup as bs

html = '''
<div class="description">
Planet Nine was initially proposed to explain the clustering of orbits
Of Planet Nine's other effects, one was unexpected, the perpendicular orbits, and the other two were found after further analysis. Although other mechanisms have been offered for many of these peculiarities, the gravitational influence of Planet Nine is the only one that explains all four. 
</div>
'''
soup = bs(html, 'lxml')
# strip() drops the outer newlines; joining with '. ' puts a period at the line break
text = '. '.join(soup.select_one('.description').text.strip().split('\n'))
print(text)
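A plain join treats every line the same way, which becomes a problem when some lines already end in a period and others do not. The following is a sketch of a punctuation-aware variant, assuming that ".", "!" and "?" are the only sentence endings that matter. The helper name join_sentences and the shortened sample text are hypothetical, not part of the answer above.

```python
def join_sentences(text):
    """Join the non-empty lines of text into a single line of sentences."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    # Add a period only where a line does not already end a sentence.
    closed = [
        line if line.endswith((".", "!", "?")) else line + "."
        for line in lines
    ]
    return " ".join(closed)


# Assumption: raw stands in for the text extracted from the description div.
raw = (
    "\nPlanet Nine was initially proposed to explain the clustering of orbits\n"
    "Of Planet Nine's other effects, one was unexpected. \n"
)
print(join_sentences(raw))
```

One side effect of this design is that a final line without terminal punctuation also gets a period appended.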

0
Votes

Split the lines and then join them before parsing.

from bs4 import BeautifulSoup

htmldata='''<div class="description">
Planet Nine was initially proposed to explain the clustering of orbits
Of Planet Nine's other effects, one was unexpected, the perpendicular orbits, and the other two were found after further analysis. Although other mechanisms have been offered for many of these peculiarities, the gravitational influence of Planet Nine is the only one that explains all four. 
</div>'''
htmldata="".join(item.strip() for item in htmldata.split("\n"))
soup=BeautifulSoup(htmldata,'html.parser')
description_box = soup.find('div', class_='description')
print(description_box.text)

Output:

Planet Nine was initially proposed to explain the clustering of orbitsOf Planet Nine's other effects, one was unexpected, the perpendicular orbits, and the other two were found after further analysis. Although other mechanisms have been offered for many of these peculiarities, the gravitational influence of Planet Nine is the only one that explains all four.

Edit:

import requests
from bs4 import BeautifulSoup

htmldata=requests.get("url here").text

htmldata="".join(item.strip() for item in htmldata.split("\n"))
soup=BeautifulSoup(htmldata,'html.parser')
description_box = soup.find('div', class_='description')
print(description_box.text.strip())
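Joining the raw HTML lines with an empty string fuses the two sentences ("orbitsOf" in the output shown earlier). One possible alternative, sketched here on an inline sample rather than a live URL, is to parse first and join the extracted lines afterwards. The sample markup is a shortened copy of the div from the question, and the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Sample markup based on the question; the second sentence is shortened.
html = '''<div class="description">
Planet Nine was initially proposed to explain the clustering of orbits
Of Planet Nine's other effects, one was unexpected, the perpendicular orbits.
</div>'''

soup = BeautifulSoup(html, "html.parser")
raw = soup.find("div", class_="description").get_text()

# Drop blank lines and surrounding whitespace, then join with ". ".
lines = [line.strip() for line in raw.splitlines() if line.strip()]
description = ". ".join(lines)
print(description)
```

Because the join happens on the extracted text rather than on the markup, the tags are never touched and blank lines in the source do not produce stray periods.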