Python and BeautifulSoup 4 - unable to filter by class?

Question

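I am trying to scrape shoe sizes from this URL: http://www.jimmyjazz.com/mens/footwear/jordan-retro-13--atmosphere-grey-/414571-016?color=Grey

What I want is to get only the available sizes, i.e. only those that are not greyed out.

The sizes are all wrapped in a elements. The available sizes have the class box, and the unavailable ones have the classes box piunavailable.

I have tried using lambda functions, ifs and CSS selectors - none of them seem to work. My guess is that this is because of the way my code is structured.

It is structured as follows:

if attempt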

size = soup2.find('div', attrs={'class': 'psizeoptioncontainer'})
getsize = str([e.get_text() for e in size.findAll('a', attrs={'class': 'box'}) if 'piunavailable' not in e.attrs['class']])

lambda attempt

size = soup2.find('div', attrs={'class': 'psizeoptioncontainer'})
getsize = str([e.get_text() for e in size.findAll(lambda tag: tag.name == 'a' and tag.get('class') == ['box piunavailable'])])

CSS selector attempt

size = soup2.find('div', attrs={'class': 'psizeoptioncontainer'})
getsize = str([e.get_text() for e in size.findAll('a[class="box"]')])

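So, for the URL provided, I expect the result to be a string (converted from a list) of all the available sizes. At the time of writing this question, that should be: '8', '8.5', '9', '9.5', '10', '10.5', '11', '11.5', '13'

Instead, I get every size: '7.5', '8', '8.5', '9', '9.5', '10', '10.5', '11', '11.5', '12', '13'

Does anyone know how to make this work (or know an elegant solution to my problem)? Thank you in advance!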

Answer 1

You want the CSS :not pseudo-class selector to exclude the other class. This uses bs4 4.7.1.

sizes = [item.text for item in soup.select('.box:not(.piunavailable)')]

In full:

import requests
from bs4 import BeautifulSoup

r = requests.get('http://www.jimmyjazz.com/mens/footwear/jordan-retro-13--atmosphere-grey-/414571-016?color=Grey')  
soup = BeautifulSoup(r.content,'lxml')  
sizes = [item.text for item in soup.select('.box:not(.piunavailable)')]
print(sizes)
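
The same selector can be tried without a network request. This is a minimal sketch, assuming markup shaped like the size list described in the question; the HTML snippet below is made up for illustration and not copied from the site. It also assumes a bs4 version where select() supports :not (4.7 or later).

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the question's description:
# available sizes carry class "box", unavailable ones "box piunavailable".
html = """
<div class="psizeoptioncontainer">
<a class="box piunavailable">7.5</a>
<a href="#" class="box">8</a>
<a href="#" class="box">8.5</a>
<a class="box piunavailable">12</a>
<a href="#" class="box">13</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Keep <a> tags that have class "box" but do not have class "piunavailable".
available = [a.get_text(strip=True) for a in soup.select("a.box:not(.piunavailable)")]
print(available)  # ['8', '8.5', '13']
```

Prefixing the selector with a narrows the match to anchor tags, in case other elements on a real page also use the box class.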

Answer 2

What you are asking for is to get the a tags that have the specific class box and no other class. This can be done by passing a custom function as a filter to find_all.

def my_match_function(elem):
    if isinstance(elem, Tag) and elem.name == 'a' and ''.join(elem.attrs.get('class', '')) == 'box':
        return True

Here ''.join(elem.attrs.get('class',''))=='box' ensures that the a tag has only the class box and no other class.

Let's see this in action:

from bs4 import BeautifulSoup,Tag
html="""
<a>This is also not needed.</a>
<div class="box_wrapper">
<a id="itemcode_11398535" class="box piunavailable">7.5</a>
<a href="#" id="itemcode_11398536" class="box">8</a>
<a href="#" id="itemcode_11398537" class="box">8.5</a>
<a href="#" id="itemcode_11398538" class="box">9</a>
<a href="#" id="itemcode_11398539" class="box">9.5</a>
<a href="#" id="itemcode_11398540" class="box">10</a>
<a href="#" id="itemcode_11398541" class="box">10.5</a>
<a href="#" id="itemcode_11398542" class="box">11</a>
<a href="#" id="itemcode_11398543" class="box">11.5</a>
<a id="itemcode_11398544" class="box piunavailable">12</a>
<a href="#" id="itemcode_11398545" class="box">13</a>
</div>
"""
def my_match_function(elem):
    if isinstance(elem, Tag) and elem.name == 'a' and ''.join(elem.attrs.get('class', '')) == 'box':
        return True

soup = BeautifulSoup(html, 'html.parser')
my_list = [x.text for x in soup.find_all(my_match_function)]
print(my_list)

Output:

['8', '8.5', '9', '9.5', '10', '10.5', '11', '11.5', '13']
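
The join check works because Beautiful Soup treats class as a multi-valued attribute and returns it as a list of tokens, not as one string. The sketch below uses a made-up two-element snippet to show that behaviour. It also shows why comparing against ['box piunavailable'], as in the lambda attempt from the question, never matches, and an alternative exact check against ['box'] that is offered here as an assumption about what is wanted, not as part of the original answer.

```python
from bs4 import BeautifulSoup

# Made-up snippet: one unavailable size and one available size.
soup = BeautifulSoup(
    '<a class="box piunavailable">7.5</a><a class="box">8</a>',
    "html.parser",
)
first, second = soup.find_all("a")

# class is split on whitespace into a list of tokens.
print(first.get("class"))   # ['box', 'piunavailable']
print(second.get("class"))  # ['box']

# A single-string comparison therefore never matches.
print(first.get("class") == ["box piunavailable"])  # False

# Comparing against the exact token list keeps only tags whose sole class is "box".
exact = [a.get_text() for a in soup.find_all("a") if a.get("class") == ["box"]]
print(exact)  # ['8']
```

Comparing against ['box'] is slightly stricter than joining the tokens, since a joined string would also equal 'box' for an unusual attribute such as class="bo x".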
