Regular expressions with pandas string matching

Question

Input data:

                        name  Age Zodiac Grade            City  pahun
0                   /extract   30  Aries     A            Aura  a_b_c
1  /abc/236466/touchbar.html   20    Leo    AB      Somerville  c_d_e
2                    Brenda4   25  Virgo     B  Hendersonville    f_g
3     /abc/256476/mouse.html   18  Libra    AA          Gannon  h_i_j

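I am trying to extract rows based on a regular expression applied to the name column. The regex should pick out a number that is 6 digits long.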

For example:
/abc/236466/touchbar.html  - 236466

This is the code I used:

df=df[df['name'].str.match(r'\d{6}') == True]

The line above does not match anything at all.

Expected:

                         name  Age Zodiac Grade            City  pahun
0  /abc/236466/touchbar.html   20    Leo    AB      Somerville  c_d_e
1     /abc/256476/mouse.html   18  Libra    AA          Gannon  h_i_j

Can anyone tell me what I am doing wrong?

Answer 1

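`str.match` only looks for a match at the start of the string. So, to match `/` + 6 digits + `/` anywhere in a string with `str.match`, you need to use one of the following: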

df=df[df['name'].str.match(r'.*/\d{6}/')]      # assuming the match is closer to the end of the string
df=df[df['name'].str.match(r'(?s).*/\d{6}/')]  # same, but allows a multiline search
df=df[df['name'].str.match(r'.*?/\d{6}/')]     # assuming the match is closer to the start of the string
df=df[df['name'].str.match(r'(?s).*?/\d{6}/')] # same, but allows a multiline search
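The anchoring behaviour can be checked with a minimal sketch. It assumes a cut-down frame that reproduces only the name column from the question; the variable names `anchored` and `prefixed` are illustrative.

```python
import pandas as pd

# Only the "name" column from the question; the other columns are omitted for brevity.
df = pd.DataFrame({
    "name": [
        "/extract",
        "/abc/236466/touchbar.html",
        "Brenda4",
        "/abc/256476/mouse.html",
    ]
})

# str.match anchors the pattern at the start of each string,
# so a bare \d{6} fails: no name begins with six digits.
anchored = df["name"].str.match(r"\d{6}")

# Letting .* consume the leading text allows the digits to be found later on.
prefixed = df["name"].str.match(r".*/\d{6}/")

print(anchored.tolist())  # every entry is False
print(prefixed.tolist())  # True only for the two /abc/... rows
```

This is why the filter in the question returns an empty frame rather than raising an error: the pattern is valid, it just never matches at position 0.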

However, it is more sensible and more efficient here to use `str.contains` with a regex such as `/\d{6}/`:

df=df[df['name'].str.contains(r'/\d{6}/')]

This finds the entries that contain `/` + 6 digits + `/`.
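As a rough end-to-end sketch, the frame from the question can be rebuilt and filtered as below. The `reset_index(drop=True)` call is an addition that is not part of the answer; it is assumed here only because the expected output in the question is numbered 0 and 1.

```python
import pandas as pd

# The input data from the question.
df = pd.DataFrame({
    "name": ["/extract", "/abc/236466/touchbar.html", "Brenda4", "/abc/256476/mouse.html"],
    "Age": [30, 20, 25, 18],
    "Zodiac": ["Aries", "Leo", "Virgo", "Libra"],
    "Grade": ["A", "AB", "B", "AA"],
    "City": ["Aura", "Somerville", "Hendersonville", "Gannon"],
    "pahun": ["a_b_c", "c_d_e", "f_g", "h_i_j"],
})

# str.contains searches anywhere in the string (regex=True is the default).
mask = df["name"].str.contains(r"/\d{6}/")

# Keep the matching rows and renumber them from 0 (assumed from the expected output).
result = df[mask].reset_index(drop=True)
print(result)
```

Without `reset_index`, the surviving rows would keep their original labels 1 and 3.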

Or, to make sure you only match chunks of exactly 6 digits and not chunks of 7 or more digits:

df=df[df['name'].str.contains(r'(?<!\d)\d{6}(?!\d)')]

where

- `(?<!\d)` - ensures there is no digit immediately to the left
- `\d{6}` - any six digits
- `(?!\d)` - no digit is allowed immediately to the right.
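The effect of the two lookarounds can be seen in isolation with Python's standard `re` module. This is only a sketch: the seven-digit path is a hypothetical example that does not appear in the question's data.

```python
import re

# Pattern from the answer: six digits with no digit on either side.
exact_six = re.compile(r"(?<!\d)\d{6}(?!\d)")
# Plain pattern without lookarounds, for comparison.
any_six = re.compile(r"\d{6}")

samples = [
    "/abc/236466/touchbar.html",   # exactly six digits
    "/abc/1234567/keyboard.html",  # hypothetical path with seven digits
    "Brenda4",                     # a single digit
]

exact_hits = [bool(exact_six.search(s)) for s in samples]
plain_hits = [bool(any_six.search(s)) for s in samples]

print(exact_hits)  # the seven-digit path is rejected
print(plain_hits)  # the seven-digit path is accepted
```

In the seven-digit case every window of six digits has another digit on one side, so one of the two lookarounds always fails.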

Answer 2

You are almost there, use `str.contains` instead:

df[df['name'].str.contains(r'\d{6,}')]
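Note that `\d{6,}` means six or more digits. A small sketch of what that implies is given below; the seven-digit row is hypothetical, and the use of `str.extract` to pull the number out is an assumption about what one might want next, not something this answer proposes.

```python
import pandas as pd

names = pd.Series([
    "/abc/236466/touchbar.html",
    "/abc/1234567/keyboard.html",  # hypothetical row with a seven-digit run
    "Brenda4",
])

at_least_six = names.str.contains(r"\d{6,}")
six = names.str.contains(r"\d{6}")

# As filters the two patterns agree, because any longer run contains six digits.
print(at_least_six.equals(six))

# They differ when extracting: {6,} captures the whole run, {6} only its first six digits.
whole_run = names.str.extract(r"(\d{6,})", expand=False)
first_six = names.str.extract(r"(\d{6})", expand=False)
print(whole_run.tolist())
print(first_six.tolist())
```

Rows without a match come back as missing values from `str.extract`, so they may need to be dropped or filled before further use.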
