我想从熊猫数据框中提取一件衣服的长度。该数据框的行如下所示:
A-line dress with darting at front and back | Surplice neckline | Long sleeves | About 23" from shoulder to hem | Triacetate/polyester | Dry clean | Imported | Model shown is 5'10" (177cm) wearing a size 4
你可以看到大小包含在About和肩膀之间,但在某些情况下,肩膀被腰部,下摆等取代。下面是我的python脚本找到长度但是当我说切割时我们说About
之后有一个逗号它失败了列表。
import re
def regexfinder(string_var):
res=''
x=re.search(r"(?<=About).*?(?=[shoulder,waist,hem,bust,neck,bust,top,hips])", string_var).group(0)
tohave=int(x[1:3])
if tohave >=16 and tohave<=36:
res="Mini"
return res
if tohave>36 and tohave<40:
res="Above the Knee"
return res
if tohave >=40 and tohave<=46:
res="Knee length"
return res
if tohave>46 and tohave<49:
res="Mid/Tea length"
return res
if tohave >=49 and tohave<=59:
res="Long/Maxi length"
return res
if tohave>59:
res="Floor Length"
return res
你的正则表达式(?<=About).*?(?=[shoulder,waist,hem,bust,neck,bust,top,hips])
使用character class的话肩,腰,下摆,胸围,颈部,胸围,顶部,臀部。
我想你想使用或者|
把它们放在一个非捕获组中。
使用可选的逗号,?
尝试这样:
(?<=About),? (\d+)(?=.*?(?:shoulder|waist|hem|bust|neck|bust|top|hips]))
大小在第一个捕获组中。
import re
s = """A-line dress with darting at front and back | Surplice neckline | Long sleeves | About 23" from shoulder to hem | Triacetate/polyester | Dry clean | Imported | Model shown is 5'10" (177cm) wearing a size 4"""
q = """'Velvet dress featuring mesh front, back and sleeves | Crewneck | Long bell sleeves | Self-tie closure at back cutout | About, 31" from shoulder to hem | Viscose/nylon | Hand wash | Imported | Model shown is 5\'10" (177cm) wearing a size Small.'1"""
def getSize(stringVal, strtoCheck):
for i in stringVal.split("|"): #Split string by "|"
if i.strip().startswith(strtoCheck): #Check if string startswith "About"
val = i.strip()
return re.findall("\d+", val)[0] #Extract int
print getSize(s, "About")
print getSize(q, "About")
输出:
23
31