How do I extract specific nested elements that share the same tag in Beautiful Soup?

Question

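I'm new to Python, so I'm still figuring out Beautiful Soup. I'm trying to scrape a website and pull out the five elements that come right after a marker I locate in my code.

I tried next_element, which only returns the text of the tag I matched with soup.find, and I tried next_sibling, which comes back blank.

The page has many "first" and "last" classes, so I have to use the text to pick out the row I want. This is what I want to grab: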

 <li>
        <ul>
            <li class="first">Maintenance</li>
                        <li>$number1</li>
                        <li>$number2</li>
                        <li>$number3</li>
                        <li>$number4</li>
                        <li>$number5</li>
                    <li class="last">$linetotal</li>
        </ul>
    </li>

This is what I have tried:

for x,y in zip(make, model):
    url = ('https://URL with variables goes here')
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
    search = requests.get(url, headers = headers)
    html = search.text
    soup = BeautifulSoup(html, 'lxml')
    search_results = soup.find('li', class_ = 'first', text = re.compile('Maintenance'))
    try:
        d = search_results.next_element
        print(d)
    except:
        print('pass')

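The end goal is to append number1 through number5 to a list, but with the code above the output is just "Maintenance". Where am I going wrong? Also, since I'm so new, I'd really appreciate any background you can provide.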

Answer 1

Given your example, the simplest approach is to append every li element that has no class defined to the result list.

from bs4 import BeautifulSoup

html = """ <li>
        <ul>
            <li class="first">Maintenance</li>
                        <li>$number1</li>
                        <li>$number2</li>
                        <li>$number3</li>
                        <li>$number4</li>
                        <li>$number5</li>
                    <li class="last">$linetotal</li>
        </ul>
    </li>"""

soup = BeautifulSoup(html, 'lxml')
start = soup.find('li', class_ = 'first').parent
result = []

for ele in start.find_all('li'):

    if not ele.get('class'):
        result.append(ele.text)

print(result)

Output:

['$number1', '$number2', '$number3', '$number4', '$number5']
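For the background the question asks for: `.next_element` on a tag returns the next object in parse order, which is the tag's own text string ("Maintenance"), and `.next_sibling` returns the whitespace text node that sits between two `<li>` tags, which is why it looks blank. A minimal sketch of an alternative, assuming the same sample HTML and the built-in `html.parser`, is to start from the matched tag and call `find_next_siblings`, which returns only tags:

```python
import re
from bs4 import BeautifulSoup

html = """<li>
    <ul>
        <li class="first">Maintenance</li>
        <li>$number1</li>
        <li>$number2</li>
        <li>$number3</li>
        <li>$number4</li>
        <li>$number5</li>
        <li class="last">$linetotal</li>
    </ul>
</li>"""

soup = BeautifulSoup(html, "html.parser")

# Locate the row label by class and text, as in the question.
label = soup.find("li", class_="first", string=re.compile("Maintenance"))

# find_next_siblings skips text nodes and returns the later <li> tags
# at the same level; the class check drops the "last" total cell.
values = [
    li.get_text(strip=True)
    for li in label.find_next_siblings("li")
    if not li.get("class")
]
print(values)
```

For the sample HTML this should print the same five values as the loop above. Because it starts from the matched label rather than the parent, it keeps working if the same `<ul>` ever contains unrelated items before the label.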

Answer 2

You can use an XPath expression with something like tree.xpath:

//li[@class='first' and text()='Maintenance']/following-sibling::li[not(@class)]

For example:

from lxml.html import fromstring
# To parse a live page instead of the sample string (requires `import requests`):
# url = ''
# tree = fromstring(requests.get(url).content)
h = '''
 <li>
    <ul>
        <li class="first">Maintenance</li>
        <li>$number1</li>
        <li>$number2</li>
        <li>$number3</li>
        <li>$number4</li>
        <li>$number5</li>
        <li class="last">$linetotal</li>
    </ul>
</li>
'''
tree = fromstring(h)
items = [item.text for item in tree.xpath("//li[@class='first' and text()='Maintenance']/following-sibling::li[not(@class)]")]
print(items)
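Since the question loops over several pages, the same expression can be wrapped in a small function. This is only a sketch: `values_after_label` is a hypothetical helper name, and it relies on lxml's support for XPath variables passed as keyword arguments, so the label text is not hard-coded into the expression.

```python
from lxml.html import fromstring


def values_after_label(tree, label):
    """Return the text of the class-less <li> siblings that follow the
    <li class="first"> whose text equals `label` (hypothetical helper)."""
    expr = (
        "//li[@class='first' and text()=$label]"
        "/following-sibling::li[not(@class)]"
    )
    # lxml binds $label to the keyword argument of the same name.
    return [li.text for li in tree.xpath(expr, label=label)]


sample = '''
<li>
    <ul>
        <li class="first">Maintenance</li>
        <li>$number1</li>
        <li>$number2</li>
        <li>$number3</li>
        <li>$number4</li>
        <li>$number5</li>
        <li class="last">$linetotal</li>
    </ul>
</li>
'''

tree = fromstring(sample)
print(values_after_label(tree, "Maintenance"))
```

If no row carries the given label, the XPath matches nothing and the function returns an empty list, which is easier to handle in a loop than a bare `except`.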

Answer 3

This is the same idea as QHarr's answer, with a small variation:

from lxml import etree

h = '''
<li>
  <ul>
    <li class="first">Maintenance</li>
    <li>$number1</li>
    <li>$number2</li>
    <li>$number3</li>
    <li>$number4</li>
    <li>$number5</li>
    <li class="last">$linetotal</li>
  </ul>
</li>
'''

# The fragment is well-formed, so the XML parser can read it directly.
doc = etree.fromstring(h)
# Select only the inner list items; '//li' would also match the outer <li>.
for cost in doc.xpath('//ul/li'):
    if 'class' not in cost.attrib:
        print(cost.text)

Output:

$number1
$number2
$number3
$number4
$number5
