Python pandas vectorized regex is producing NaN


I have a dataset with columns that look like this:

[Image: sample of the dataset columns]

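The Loc_Description column contains town and road names. The position of the town name within each string varies, depending mainly on whether the 6th character of Loc_id is "L". I am trying to use pandas vectorized operations to extract the town name into a new column named "Town_Name".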

My goal is to produce this output:

[Image: desired output with the Town_Name column]

This is the method I tried in order to produce the desired output:

def wrangle_data(self, ih08_df):
    ih08_df['Town_Name'] = ih08_df['Loc_Description'].str.extract(r'^(.+?)(?=\s?\-)')
    trans_floc_rows = ih08_df['Loc_id'].str[5] != 'L'
    ih08_df.loc[trans_floc_rows, 'Town_Name'] = ih08_df.loc[
        trans_floc_rows, 'Loc_Description'
    ].str.extract(r'(?<=\s\-)(.+?)(?=\s?\-)')

Instead, the output shows NaN values wherever the 6th character of Loc_id is not "L".

[Image: actual output, with NaN in Town_Name]

For some reason, str.extract(r'(?<=\s-)(.+?)(?=\s?-)') is not working. I have tried variations of the code using np.where() and np.select() without success. I would prefer a vectorized operation because the dataset is quite large. Any help you can offer is greatly appreciated.

python pandas regex vectorization nan
1 Answer
0 votes

Split the description and take the appropriate piece depending on whether the id has "L" as its 6th character (index 5):

df.loc[df["Loc id"].str[5] == "L", "Town_Name"] = (
    df["Loc_Description"].str.split().str[0]
)
df.loc[df["Loc id"].str[5] != "L", "Town_Name"] = (
    df["Loc_Description"].str.split().str[1].str.strip("-")
)

Output:

                  Loc id                Loc_Description Town_Name
0  9441-ET0043-1111-2222     753       -PORTAGE      -1   PORTAGE
1  9441-ET0043-1111-2223  75-445-1      -ALBION      -1    ALBION
2   9441-L0043-1111-2224      GREECE     -BELMONT RD-B2    GREECE
3   9441-L0043-1111-2225   ATTICA     -BELMONT RD   -B3    ATTICA
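
If the regex approach from the question is preferred, the following is a minimal sketch of one way to keep it vectorized. It rests on assumptions not confirmed in the thread. The sample rows are copied from the output above, and the column name "Loc id" follows the answer (the question uses 'Loc_id'). Series.str.extract returns a DataFrame by default (expand=True), even for a single capture group. Assigning that DataFrame to a single column through .loc may then fail to align on column labels. That is one possible explanation for the NaN values, but it is a guess. Passing expand=False returns a Series instead. The two patterns below are simplified variants, not the asker's originals.

```python
import pandas as pd

# Sample rows copied from the output above (assumed representative of the data).
df = pd.DataFrame(
    {
        "Loc id": [
            "9441-ET0043-1111-2222",
            "9441-ET0043-1111-2223",
            "9441-L0043-1111-2224",
            "9441-L0043-1111-2225",
        ],
        "Loc_Description": [
            "753       -PORTAGE      -1",
            "75-445-1      -ALBION      -1",
            "GREECE     -BELMONT RD-B2",
            "ATTICA     -BELMONT RD   -B3",
        ],
    }
)

desc = df["Loc_Description"]
# True where the 6th character (index 5) of the id is "L".
is_l = df["Loc id"].str[5] == "L"

# expand=False returns a Series (not a DataFrame) for a single capture group.
# Text before the first dash, ignoring whitespace in front of the dash:
before_dash = desc.str.extract(r"^(.+?)\s*-", expand=False)
# Text between the first whitespace-preceded dash and the next dash:
between_dashes = desc.str.extract(r"\s-(.+?)\s*-", expand=False)

# Keep before_dash where the id has "L"; otherwise use between_dashes.
df["Town_Name"] = before_dash.where(is_l, between_dashes).str.strip()

print(df)
```

Unlike splitting on whitespace, this keeps multi-word names together, which may matter if some town names contain spaces. That has not been checked against the full dataset.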