Expand DataFrame rows into multiple rows based on string conditions


I have some raw data similar to the DataFrame below:


import pandas as pd

df = pd.DataFrame([{'var1': '220-224 (Even) roadname1', 'var2': 'location 1', 'var3': 'area 1'},
                   {'var1': 'site of 5 to 9 (odd) roadname2', 'var2': 'location 2', 'var3': 'area 2'},
                   {'var1': '16, 19 roadname3', 'var2': 'location 3', 'var3': 'area 3'}]
                 )
df

                             var1        var2    var3
0        220-224 (Even) roadname1  location 1  area 1
1  site of 5 to 9 (odd) roadname2  location 2  area 2
2                16, 19 roadname3  location 3  area 3

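I want to write a function that splits the var1 string so that each house number it indicates becomes a separate row in the DataFrame, giving output like this: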


df = pd.DataFrame([{'var1': '220 roadname1', 'var2': 'location 1', 'var3': 'area 1'},
                   {'var1': '222 roadname1', 'var2': 'location 1', 'var3': 'area 1'},
                   {'var1': '224 roadname1', 'var2': 'location 1', 'var3': 'area 1'},
                   {'var1': '5 roadname2', 'var2': 'location 2', 'var3': 'area 2'},
                   {'var1': '7 roadname2', 'var2': 'location 2', 'var3': 'area 2'},
                   {'var1': '9 roadname2', 'var2': 'location 2', 'var3': 'area 2'},
                   {'var1': '16 roadname3', 'var2': 'location 3', 'var3': 'area 3'},
                   {'var1': '19 roadname3', 'var2': 'location 3', 'var3': 'area 3'}]
                 )
df

            var1        var2    var3
0  220 roadname1  location 1  area 1
1  222 roadname1  location 1  area 1
2  224 roadname1  location 1  area 1
3    5 roadname2  location 2  area 2
4    7 roadname2  location 2  area 2
5    9 roadname2  location 2  area 2
6   16 roadname3  location 3  area 3
7   19 roadname3  location 3  area 3

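The string conditions vary somewhat in letter case and in how the number ranges are written, and I'm not sure whether there is an efficient way to handle these variations.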

1 Answer

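Use a custom function to split the ranges (the example below uses regular expressions), then call explode so that each list element becomes its own row: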

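import re

def parse_range(s):
    # Range form: "220-224 (Even) road" or "5 to 9 (odd) road"
    pat1 = r'^\D*(\d+)(?:-|\s+to\s+)(\d+)(?:\s*\((even|odd)\))?\s*(.*)$'
    # List form: "16, 19 road"
    pat2 = r'^\D*([\d ,]+)\s*(.*)$'
    # Match case-insensitively instead of lowercasing s, so the road name
    # keeps its original case
    m1 = re.search(pat1, s, flags=re.IGNORECASE)
    if m1:
        end = m1.group(4)
        # Step by 2 when (even)/(odd) is given; this assumes the first
        # number already has the stated parity
        step = 2 if m1.group(3) else 1
        return [f'{i} {end}' for i in
                range(int(m1.group(1)), int(m1.group(2)) + 1, step)]
    m2 = re.search(pat2, s)
    if m2:
        end = m2.group(2)
        return [f'{i} {end}' for i in re.findall(r'\d+', m2.group(1))]
    # Neither pattern matched: keep the original string as a single row
    return [s]

# Replace each var1 string with a list, then give each element its own row
out = (df.assign(var1=df['var1'].map(parse_range))
         .explode('var1')
      )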

Output:

            var1        var2    var3
0  220 roadname1  location 1  area 1
0  222 roadname1  location 1  area 1
0  224 roadname1  location 1  area 1
1    5 roadname2  location 2  area 2
1    7 roadname2  location 2  area 2
1    9 roadname2  location 2  area 2
2   16 roadname3  location 3  area 3
2   19 roadname3  location 3  area 3
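
Note that the exploded rows keep the index of the row they came from (0, 0, 0, 1, ...), whereas the desired output in the question is numbered 0 to 7. A minimal sketch of one way to renumber, assuming pandas 1.1 or later (where explode accepts ignore_index); the frame df_lists and its contents are made up for illustration and stand in for the result of mapping parse_range over var1:

```python
import pandas as pd

# Hypothetical toy frame whose 'var1' cells already hold lists
df_lists = pd.DataFrame({
    'var1': [['220 roadname1', '222 roadname1'],
             ['16 roadname3', '19 roadname3']],
    'var2': ['location 1', 'location 3'],
})

# ignore_index=True renumbers the exploded rows 0..n-1
out_flat = df_lists.explode('var1', ignore_index=True)
print(out_flat)
```

On older pandas versions, calling .reset_index(drop=True) after explode should give the same numbering.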