Questions tagged beautifulsoup

How can I split an HTML string on a specified unpaired tag using Python?

For example, split('hello<br >there', 'br') should return ['hello', 'there'], and split('<div id="d71">text1<br data-i="1">text2<br>text3</div>', 'br') should return ['<div id="d71">text1', 'text2', 'text3</div>'].

I have looked at

    def get_start_stop(source, tag_name):
        soup = BeautifulSoup(source, 'html.parser')
        return dir(soup.find(tag_name))

but the attributes I was hoping would help (sourcepos, string, strings, self_and_descendants, .nextSibling.sourcepos) do not, as far as I can tell, carry the information needed to get the start and end indices of the tag in the source string.

I also tried something like

    from lxml import html

    def split(input_str, tag_name):
        tree = html.fromstring(input_str)
        output_list = []
        for element in tree.iter():
            if element.tag == tag_name:
                output_list.append(element.tail)
            else:
                output_list.append(html.tostring(element, encoding='unicode', with_tail=False))
        return output_list

but with_tail=False does not do what I expected.

Use lxml as the HTML parser. Assume the document fragment is

    <div id="d71">text1<br/>text2<br/>text3</div>

Find all text nodes inside the div element:

    from lxml import html

    doc = html.parse('temp.html')
    # returns a node set of text nodes
    d71txt = doc.xpath('//div[@id="d71"]/text()')

Result:

    ['text1', 'text2', 'text3']

The tail attribute of the child elements also holds some of these text nodes. Using the descendant-or-self XPath axis:

    >>> d71nodes = doc.xpath('//div[@id="d71"]/descendant-or-self::*')
    >>> d71nodes[0].text
    'text1'
    >>> d71nodes[1].tail
    'text2'
    >>> d71nodes[2].tail
    'text3'

Here is an example:

    from bs4 import BeautifulSoup

    text = """
    <p>one<br>two</p>
    <p>two<br>three</p>
    """

    page = BeautifulSoup(text, "html.parser")
    for t in page.find_all('p'):
        print("Got this:", t)
        # iterating over a tag yields its direct children in document order
        for u in t:
            print(u)

Output:

    Got this: <p>one<br/>two</p>
    one
    <br/>
    two
    Got this: <p>two<br/>three</p>
    two
    <br/>
    three

I think it should be pretty clear how to get the "before" and "after" parts from that.

I don't think BeautifulSoup directly provides the start and end indices of an HTML tag, but you can find them by locating the tag in the original string:

    from bs4 import BeautifulSoup

    def get_start_stop(source, tag_name):
        soup = BeautifulSoup(source, 'html.parser')
        tag = soup.find(tag_name)
        if tag:
            # index of the first "<tag_name" in the raw source
            start = source.find(f"<{tag_name}")
            # index just past the ">" that closes that tag
            end = source.find(">", start) + 1
            return start, end
        return None
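The string-search idea from the last answer can be extended to every occurrence of the tag, which gives the split() behaviour the question asks for. The following is only a minimal sketch, not a full HTML tokenizer: the function name split_on_tag is hypothetical, it uses a regular expression rather than BeautifulSoup or lxml, and it assumes that no attribute value of the tag contains a ">" character and that the tag does not appear inside comments or scripts.

```python
import re


def split_on_tag(source, tag_name):
    """Split `source` at every opening tag named `tag_name`.

    Hypothetical helper. Assumes attribute values contain no '>' and
    that the tag does not occur inside comments or <script> blocks.
    """
    # "<br", then a word boundary so that "<browser>" is not matched,
    # then any attributes (or a trailing "/") up to the closing ">".
    pattern = re.compile(rf"<{re.escape(tag_name)}\b[^>]*>", re.IGNORECASE)
    return pattern.split(source)


print(split_on_tag('hello<br >there', 'br'))
# ['hello', 'there']

print(split_on_tag('<div id="d71">text1<br data-i="1">text2<br>text3</div>', 'br'))
# ['<div id="d71">text1', 'text2', 'text3</div>']
```

Because the split works on the raw string, the surrounding markup (such as the opening and closing div) is left untouched, which a parse-and-reserialize approach would not guarantee.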

python beautifulsoup lxml

3 answers, 0 votes

Beautiful Soup: find a tag that contains both a string and other elements

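I am trying to find the <li> tag that contains "Country:" as inner text. The tag also contains other children (<a>).

    <li> Country: <a href="example.com">Germany</a> </li>

Using soup.find("li", string="Country: ") seems to work only when the <li> contains exactly that string and no other elements. Using a regular expression, soup.find("li", string=re.compile("Country: ")), does not return any results either. What is the correct query in this case?

The reason you cannot do what you want is that soup.find(string=re.compile("Country: ")) finds the text node inside the li element, not the li node itself, so the node name filter does not match. One way to achieve this is to pass soup.find a filter function that accepts only li elements containing a node that matches the regular expression:

    import re
    from bs4 import BeautifulSoup

    soup = BeautifulSoup("""<li>
    Country:
    <a href="example.com">Germany</a>
    </li>""", features="html.parser")

    country_re = re.compile("Country:")

    def match_li_with_country(node):
        # accept only <li> elements that contain a matching text node
        if node.name != "li":
            return False
        return bool(node.find(string=country_re))

    country_li = soup.find(match_li_with_country)
    print(country_li)

Another option follows the same idea but does it by hand: first find all li nodes, then filter them to find the one containing the desired string:

    country_li = next((
        node
        for node in soup.find_all('li')
        if node.find(string=country_re)
    ), None)
    print(country_li)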
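Once the li has been located, the linked value can be read from its child a element. The sketch below is one possible variant of the filter-function approach, not the answer's own code: the helper name find_li_with_label and the extra "Name:" list item are made up for illustration, and it checks the element's combined text with get_text() instead of searching for a single matching text node.

```python
from bs4 import BeautifulSoup

# Sample markup; the "Name:" item is invented to show that only one <li> matches.
html_doc = """
<ul>
  <li> Name: <a href="example.com/x">Example</a> </li>
  <li> Country: <a href="example.com">Germany</a> </li>
</ul>
"""

soup = BeautifulSoup(html_doc, "html.parser")


def find_li_with_label(soup, label):
    """Hypothetical helper: first <li> whose combined text contains `label`."""
    # get_text() joins the text of the element and all of its descendants,
    # so the <a> child does not prevent the match.
    return soup.find(lambda tag: tag.name == "li" and label in tag.get_text())


country_li = find_li_with_label(soup, "Country:")
# Guard against a missing <li> or a missing <a> child.
country = country_li.a.get_text(strip=True) if country_li and country_li.a else None
print(country)  # Germany
```

Note that get_text() also matches when the label is split across nested elements, which node.find(string=...) would not; which behaviour is preferable depends on the markup being scraped.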

python-3.x beautifulsoup

1 answer, 0 votes
