I'm trying to split text on br tags.
I have this markup:
<div class="grseq"><p class="tigrseq"><span id="id0-I."></span>Section I: Contracting authority</p><div class="mlioccur"><span style="color:black" class="nomark"><!--Non empty span 2-->I.1)</span><span class="timark" style="font-weight:bold;color:black;">Name and addresses</span><div style="color:black" class="txtmark">Official name: WOBA mbH Oranienburg<br>Postal address: Villacher Straße 2<br>Town: Oranienburg<br>NUTS code: <span class="nutsCode" title="Oberhavel">DE40A</span><br>Postal code: 16515<br>Country: Germany<br>E-mail: <a class="ojsmailto" href="mailto:[email protected]?subject=TED">[email protected]</a><p><b>Internet address(es): </b></p><p>Main address: <a class="ojshref" href="http://www.woba.de" target="_blank">www.woba.de</a></p></div><!--//txtmark end--></div><div class="mlioccur"><span style="color:black" class="nomark"><!--Non empty span 2-->I.2)</span><span class="timark" style="font-weight:bold;color:black;">Information about joint procurement</span></div><div class="mlioccur"><span style="color:black" class="nomark"><!--Non empty span 2-->I.4)</span><span class="timark" style="font-weight:bold;color:black;">Type of the contracting authority</span><div style="color:black" class="txtmark">Other type: Wohnungswirtschaft</div><!--//txtmark end--></div><div class="mlioccur"><span style="color:black" class="nomark"><!--Non empty span 2-->I.5)</span><span class="timark" style="font-weight:bold;color:black;">Main activity</span><div style="color:black" class="txtmark">Housing and community amenities</div><!--//txtmark end--></div></div>
I'm trying to get a list with one entry per line, like this:
['Official name: WOBA mbH Oranienburg', 'Postal address: Villacher Straße 2', ...]
Here is my code:
import requests
from bs4 import BeautifulSoup

webpage = 'https://ted.europa.eu/udl?uri=TED:NOTICE:565570-2019:TEXT:EN:HTML&src=0&tabId=0#id1-I.'
webpage_response = requests.get(webpage)
soup = BeautifulSoup(webpage_response.content, 'lxml')
tags = soup.find(class_="mlioccur")
br_tags = tags.text.strip().split('\n\n')
print(br_tags)
What I get back is a list with a single entry:
['I.1)Name and addressesOfficial name: WOBA mbH OranienburgPostal address: Villacher Straße 2Town: OranienburgNUTS code: DE40APostal code: 16515Country: GermanyE-mail: [email protected] address(es): Main address: www.woba.de']
Thanks a lot for your help :)
You can use the .get_text() method with the separator= parameter, then call str.split() on that separator:
txt = '''<div class="grseq"><p class="tigrseq"><span id="id0-I."></span>Section I: Contracting authority</p><div class="mlioccur"><span style="color:black" class="nomark"><!--Non empty span 2-->I.1)</span><span class="timark" style="font-weight:bold;color:black;">Name and addresses</span><div style="color:black" class="txtmark">Official name: WOBA mbH Oranienburg<br>Postal address: Villacher Straße 2<br>Town: Oranienburg<br>NUTS code: <span class="nutsCode" title="Oberhavel">DE40A</span><br>Postal code: 16515<br>Country: Germany<br>E-mail: <a class="ojsmailto" href="mailto:[email protected]?subject=TED">[email protected]</a><p><b>Internet address(es): </b></p><p>Main address: <a class="ojshref" href="http://www.woba.de" target="_blank">www.woba.de</a></p></div><!--//txtmark end--></div><div class="mlioccur"><span style="color:black" class="nomark"><!--Non empty span 2-->I.2)</span><span class="timark" style="font-weight:bold;color:black;">Information about joint procurement</span></div><div class="mlioccur"><span style="color:black" class="nomark"><!--Non empty span 2-->I.4)</span><span class="timark" style="font-weight:bold;color:black;">Type of the contracting authority</span><div style="color:black" class="txtmark">Other type: Wohnungswirtschaft</div><!--//txtmark end--></div><div class="mlioccur"><span style="color:black" class="nomark"><!--Non empty span 2-->I.5)</span><span class="timark" style="font-weight:bold;color:black;">Main activity</span><div style="color:black" class="txtmark">Housing and community amenities</div><!--//txtmark end--></div></div>'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(txt, 'html.parser')
out = []
for tag in soup.select('.txtmark'):
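    # '|' marks every boundary between text nodes inside the block
    out.append(tag.get_text(strip=True, separator='|'))
# A label ending in ':' followed by a tag boundary belongs on the same line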
out = '|'.join(out).replace(':|', ': ').split('|')
from pprint import pprint
pprint(out)
Prints:
['Official name: WOBA mbH Oranienburg',
 'Postal address: Villacher Straße 2',
 'Town: Oranienburg',
 'NUTS code: DE40A',
 'Postal code: 16515',
 'Country: Germany',
 'E-mail: [email protected]',
 'Internet address(es): Main address: www.woba.de',
 'Other type: Wohnungswirtschaft',
 'Housing and community amenities']
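If the text itself might contain a '|' character, one possible variant is to turn each br tag into a real newline before extracting the text. The snippet below is only a sketch: the html string is a shortened, hypothetical stand-in for the .txtmark block from the question, not the live page.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical stand-in for the .txtmark block in the question.
html = (
    '<div class="txtmark">Official name: WOBA mbH Oranienburg<br>'
    'Postal address: Villacher Straße 2<br>'
    'NUTS code: <span class="nutsCode">DE40A</span><br>'
    'Country: Germany</div>'
)

soup = BeautifulSoup(html, "html.parser")
block = soup.select_one(".txtmark")

# Replace every <br> with a newline string, so inline tags such as
# <span> stay on the same line as the label that precedes them.
for br in block.find_all("br"):
    br.replace_with("\n")

# Split on the newlines we just inserted and drop empty pieces.
lines = [line.strip() for line in block.get_text().split("\n") if line.strip()]
print(lines)
```

Because the split happens only where a br tag used to be, there is no need for the replace(':|', ': ') step here. Paragraph tags are not treated as line breaks by this sketch, so blocks that mix br and p tags would need extra handling.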
import requests
from bs4 import BeautifulSoup
r = requests.get(
"https://ted.europa.eu/udl?uri=TED:NOTICE:565570-2019:TEXT:EN:HTML&src=0&tabId=0#id1-I.")
soup = BeautifulSoup(r.text, 'html.parser')
result = [item.text for item in soup.find_all("div", class_="mlioccur")]
print(result)
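Note that item.text joins all text nodes without a separator, so each entry is still one run-together string, as in the question. A rough sketch of combining this loop over .mlioccur with the separator= idea, keyed by the heading in the .timark span, could look like the following. The html string is a trimmed, hypothetical sample of the markup above, and the sections name is only for illustration.

```python
from bs4 import BeautifulSoup

# Trimmed, hypothetical sample of the markup from the question.
html = (
    '<div class="mlioccur"><span class="nomark">I.1)</span>'
    '<span class="timark">Name and addresses</span>'
    '<div class="txtmark">Official name: WOBA mbH Oranienburg<br>'
    'Town: Oranienburg<br>'
    'NUTS code: <span class="nutsCode">DE40A</span></div></div>'
    '<div class="mlioccur"><span class="nomark">I.5)</span>'
    '<span class="timark">Main activity</span>'
    '<div class="txtmark">Housing and community amenities</div></div>'
)

soup = BeautifulSoup(html, "html.parser")
sections = {}
for item in soup.find_all("div", class_="mlioccur"):
    title = item.find(class_="timark")
    body = item.find(class_="txtmark")
    # Some entries (e.g. I.2 in the question) have a heading but no body.
    if title is None or body is None:
        continue
    text = body.get_text(strip=True, separator="|").replace(":|", ": ")
    sections[title.get_text(strip=True)] = text.split("|")

print(sections)
```

Keeping the lines grouped under their heading makes it easier to tell which section a value such as 'Town: Oranienburg' came from.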