I'm having a bit of trouble with BeautifulSoup.
Here is the relevant part of the source of the page I'm scraping:
<div class="description">
Planet Nine was initially proposed to explain the clustering of orbits
Of Planet Nine's other effects, one was unexpected, the perpendicular orbits, and the other two were found after further analysis. Although other mechanisms have been offered for many of these peculiarities, the gravitational influence of Planet Nine is the only one that explains all four.
</div>
Here is my BeautifulSoup code (relevant part only) to get the text inside the description tag:
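# Python 2 script, as posted (urllib2 and the print statement are Python 2 only)
import sys
import urllib2
from bs4 import BeautifulSoup

quote_page = sys.argv[1]
page = urllib2.urlopen(quote_page)
soup = BeautifulSoup(page, 'html.parser')
description_box = soup.find('div', {'class':'description'})
description = description_box.get_text(separator=" ").strip()
print description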
Running the script with python script.py https://example.com/page/2000 gives the following output:
Planet Nine was initially proposed to explain the clustering of orbits
Of Planet Nine's other effects, one was unexpected, the perpendicular orbits, and the other two were found after further analysis. Although other mechanisms have been offered for many of these peculiarities, the gravitational influence of Planet Nine is the only one that explains all four.
How can I replace the newline with a period followed by a space, like this:
Planet Nine was initially proposed to explain the clustering of orbits. Of Planet Nine's other effects, one was unexpected, the perpendicular orbits, and the other two were found after further analysis. Although other mechanisms have been offered for many of these peculiarities, the gravitational influence of Planet Nine is the only one that explains all four.
Any ideas on how I can do this?
From here:
html = '''<div class="description">
Planet Nine was initially proposed to explain the clustering of orbits
Of Planet Nine's other effects, one was unexpected, the perpendicular orbits, and the other two were found after further analysis. Although other mechanisms have been offered for many of these peculiarities, the gravitational influence of Planet Nine is the only one that explains all four.
</div>'''
from bs4 import BeautifulSoup

n = 2  # occurrence, i.e. the 2nd newline in this case
sep = '\n'  # separator, i.e. newline
cells = html.split(sep)
# Rejoin the pieces, inserting ". " in place of the n-th newline
html = sep.join(cells[:n]) + ". " + sep.join(cells[n:])
soup = BeautifulSoup(html, 'html.parser')
title_box = soup.find('div', attrs={'class': 'description'})
title = title_box.get_text().strip()
print(title)
OUTPUT:
Planet Nine was initially proposed to explain the clustering of orbits. Of Planet Nine's other effects, one was unexpected, the perpendicular orbits, and the other two were found after further analysis. Although other mechanisms have been offered for many of these peculiarities, the gravitational influence of Planet Nine is the only one that explains all four.
Edit:
import requests
from bs4 import BeautifulSoup

page = requests.get("https://blablabla.com")
soup = BeautifulSoup(page.content, 'html.parser')
description_box = soup.find('div', attrs={'class': 'description'})
description = description_box.get_text().strip()
# strip() has already removed the leading newline, so the newline
# between the two sentences is now the 1st occurrence, not the 2nd
n = 1
sep = '\n'  # separator, i.e. newline
cells = description.split(sep)
desired = sep.join(cells[:n]) + ". " + sep.join(cells[n:])
print(desired)
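The split-and-join step above targets one specific newline, so it can be wrapped in a small helper. This is only a sketch: the function name replace_nth and the sample text are hypothetical, and it works on a plain string such as the one get_text() returns.

```python
def replace_nth(text, sep, replacement, n):
    """Replace only the n-th occurrence (1-based) of sep in text."""
    cells = text.split(sep)
    # With fewer than n occurrences there is nothing to replace.
    if n < 1 or n >= len(cells):
        return text
    return sep.join(cells[:n]) + replacement + sep.join(cells[n:])


# Hypothetical sample text standing in for the scraped description.
text = "first line\nsecond line\nthird line"
result = replace_nth(text, "\n", ". ", 1)
print(result)
```

Only the first newline is replaced here; the second one is left in place, which is the behaviour to expect when n is fixed by hand.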
Try this:
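# get_text(separator=" ") only separates distinct text nodes, and rstrip("\n") only
# trims the end, so the newline inside the text has to be replaced explicitly
description = description_box.get_text().strip().replace("\n", ". ")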
Another option, using split and join:
from bs4 import BeautifulSoup as bs
html = '''
<div class="description">
Planet Nine was initially proposed to explain the clustering of orbits
Of Planet Nine's other effects, one was unexpected, the perpendicular orbits, and the other two were found after further analysis. Although other mechanisms have been offered for many of these peculiarities, the gravitational influence of Planet Nine is the only one that explains all four.
</div>
'''
soup = bs(html, 'lxml')
# Strip the surrounding newlines first, then join the lines with ". "
text = '. '.join(soup.select_one('.description').text.strip().split('\n'))
print(text)
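If the description may contain blank lines, or lines that already end with a period, a slightly more defensive version of split and join may help. The sketch below is an assumption-laden illustration: lines_to_sentences is a hypothetical name, and the input is a plain string shaped like what .text returns for the div.

```python
def lines_to_sentences(text):
    """Join the non-empty lines of text into one line of sentences."""
    # Drop empty lines and trim whitespace around the remaining ones.
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    # Append a period only where a line has no closing punctuation yet.
    fixed = [
        line if line.endswith(('.', '!', '?')) else line + '.'
        for line in lines
    ]
    return ' '.join(fixed)


# Shortened, hypothetical stand-in for the scraped text.
raw = "\nPlanet Nine was proposed to explain the clustering of orbits\n\nIt explains all four.\n"
result = lines_to_sentences(raw)
print(result)
```

Checking for existing punctuation avoids doubled periods when a line already ends a sentence.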
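Split the lines and join them before parsing. Note that joining with an empty string removes the newline but does not insert the period and space, as the output shows.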
from bs4 import BeautifulSoup
htmldata='''<div class="description">
Planet Nine was initially proposed to explain the clustering of orbits
Of Planet Nine's other effects, one was unexpected, the perpendicular orbits, and the other two were found after further analysis. Although other mechanisms have been offered for many of these peculiarities, the gravitational influence of Planet Nine is the only one that explains all four.
</div>'''
htmldata = "".join(item.strip() for item in htmldata.split("\n"))
soup = BeautifulSoup(htmldata, 'html.parser')
description_box = soup.find('div', class_='description')
print(description_box.text)
Output:
Planet Nine was initially proposed to explain the clustering of orbitsOf Planet Nine's other effects, one was unexpected, the perpendicular orbits, and the other two were found after further analysis. Although other mechanisms have been offered for many of these peculiarities, the gravitational influence of Planet Nine is the only one that explains all four.
Edit:
import requests
from bs4 import BeautifulSoup
htmldata = requests.get("url here").text
htmldata = "".join(item.strip() for item in htmldata.split("\n"))
soup = BeautifulSoup(htmldata, 'html.parser')
description_box = soup.find('div', class_='description')
print(description_box.text.strip())
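Joining the raw HTML lines before parsing also touches the markup, so an alternative is to leave the HTML alone and normalize the extracted text afterwards. A minimal sketch using the standard re module, assuming the text has already been pulled out with .text or get_text(); the name newlines_to_periods and the sample string are hypothetical.

```python
import re


def newlines_to_periods(text):
    """Replace each run of line breaks (and the whitespace around it) with '. '."""
    # strip() first, so leading and trailing newlines do not become periods.
    return re.sub(r'\s*\n\s*', '. ', text.strip())


# Hypothetical stand-in for description_box.text
extracted = "\n  Planet Nine was initially proposed \n\n  It explains all four.\n"
result = newlines_to_periods(extracted)
print(result)
```

This version adds a period at every line break without checking for existing punctuation, so a line that already ends with a period would get a second one.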