Serpentine sort of a Pandas DataFrame in Python

Question

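I am trying to find a way to do a special kind of sort on a pandas DataFrame in Python: a serpentine sort.

Here is a brief description of serpentine sorting:

A serpentine sort reverses the sort order each time a higher-level sort variable crosses a boundary, which helps ensure that adjacent records are similar with respect to as many sort variables as possible.

Consider a serpentine sort on three variables, each with three categories (low, medium, high). The sort order reverses when sorting Variable 2 within the categories of Variable 1, and again when sorting Variable 3 within the categories of Variable 2, so adjacent records are similar with respect to all three variables.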
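To make the ordering concrete, here is a minimal pure-Python sketch for three variables with three levels each. The names `serpentine`, `seen`, and `LEVELS` are illustrative and do not come from any library. It assumes the direction at each depth flips every time a new group is visited at that depth.

```python
from itertools import product

LEVELS = ["Low", "Medium", "High"]


def serpentine(records, rank, depth=0, seen=None):
    """Order equal-length tuples, flipping direction at each new group per depth."""
    if not records:
        return []
    if seen is None:
        # seen[d] counts how many groups have been ordered so far at depth d.
        seen = [0] * len(records[0])
    if depth == len(records[0]):
        return list(records)
    # Odd-numbered visits at this depth are sorted in descending order.
    reverse = seen[depth] % 2 == 1
    seen[depth] += 1
    values = sorted({r[depth] for r in records}, key=rank, reverse=reverse)
    ordered = []
    for value in values:
        group = [r for r in records if r[depth] == value]
        ordered.extend(serpentine(group, rank, depth + 1, seen))
    return ordered


ordered = serpentine(list(product(LEVELS, repeat=3)), LEVELS.index)
for row in ordered[:9]:
    print(row)
```

In this sketch each record differs from its neighbour in exactly one variable, which is the property the description above is after.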

When I was able to work in SAS, I did the serpentine sort with the SAS procedure PROC SURVEYSELECT:


proc surveyselect data=&outfile sort=serp method=seq rate=1;

 /* Grouping variables */
 strata &class;

 /* Sort variables */
 control &sortvar;

run;

There is also a utility in an R package that performs a serpentine sort:

https://rdrr.io/github/adamMaier/samplingTools/man/serpentine.html

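Does anyone know of an equivalent way to do this in Python with Pandas?

Here is my best attempt. It does work, but it breaks down after a few variables, and I have never been able to make its output match that of the other methods above (e.g. PROC SURVEYSELECT).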


import pandas


def serpentine_sort(testframe, classvar, isortvar):

    newframe = pandas.DataFrame(testframe)

    # Create a list of True values, one per sort variable, so that every
    # sort variable is sorted in ascending order in the initial sort.
    ascenlist = [True for x in isortvar]

    # Perform an initial sort where we first sort by the class, and then the sort variables
    # in ascending order within the class
    newframe = newframe.sort_values(classvar, ascending=True) \
    .groupby(classvar, sort=True) \
    .apply(lambda x: x.sort_values(isortvar, ascending=ascenlist)) \
    .reset_index(drop=True).copy(deep=True)
    
    
    
    
    
    ##### SERPENTINE SORT THE DATA ONE COLUMN AT A TIME                   #####
    ##### ==============================================                  #####
    ##### Create a sort variable that is a cumulative count within        #####
    ##### each group in the preceding variable. Use modulus division      #####
    ##### to reverse the count. If the count from the prior grouping      #####
    ##### variable is 0 modulo 2, sort ascending; if not, sort descending #####


    # Copy the list so the appends below do not modify the caller's classvar.
    grouplist = list(classvar)
    for i in range(1, len(isortvar)):

        print ("Iteration: " , i )
        ranklist=[]
        ranklist.append(classvar[0])
        ranklist.append(isortvar[i-1])
        
        
        grouplist.append(isortvar[i-1])
        print (ranklist)
        print (grouplist)
        
        
        # Count the groups within the prior column within the class
        newframe["counter"] = newframe.groupby([isortvar[i-1]]).ngroup()
        
        newframe["serpvar"] = newframe.loc[ newframe["counter"] % 2 == 0 ].groupby(ranklist)[isortvar[i-1]].cumcount(ascending=True ) + 1
        newframe["t"]       = newframe.loc[ newframe["counter"] % 2 != 0 ].groupby(ranklist)[isortvar[i-1]].cumcount(ascending=False) + 1
        newframe.loc[ newframe["serpvar"].isna() , "serpvar" ] = newframe["t"]
        
        #print ("")
        #print ("Pre sorted")
        #print ("==========")
        #print ("")
        #print (newframe)
    
    
        newframe = newframe.groupby(grouplist, sort=False) \
        .apply(lambda x: x.sort_values("serpvar", ascending=[True])) \
        .reset_index(drop=True).copy(deep=True)
    
    
    return newframe



nscg=serpentine_sort(testframe=df, classvar=["BTHRGN"], isortvar=["AGEGRP", "WAPRI", "PRMBR"])

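Notes: the classvar argument is the grouping variable; I want the sort to happen within each group. The isortvar argument is the list of my sort variables.

Any help would be greatly appreciated.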

Answer 1

Assuming this example input is similar to yours:

from itertools import product

import pandas as pd

cat = pd.CategoricalDtype(categories=['Low', 'Medium', 'High'], ordered=True)

df = (pd.DataFrame(product(['Low', 'Medium', 'High'], repeat=3), dtype=cat)
        .rename(columns=lambda x: f'Variable{x+1}')
     )

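You can first sort in the normal order, then compute for each column a running count of the repeated values, turn its parity into a sign (+1 for even counts, -1 for odd counts), multiply the category codes by that sign, and sort again: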

cols = ['Variable1', 'Variable2', 'Variable3']

tmp =(df[cols]
 .sort_values(by=cols)
 .apply(lambda x: x.cat.codes)
)

order = (tmp.apply(lambda x: tmp.groupby(list(tmp.loc[:, x.name:]))
                                .cumcount().mod(2).mul(2).rsub(1))
            .mul(tmp).sort_values(by=cols).index
        )

out = df.reindex(order)

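Note: I am not sure how this generalizes to other datasets; feel free to update the question if you think it does not suit all use cases.

Output: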
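As one possible way to generalize, here is a minimal sketch that numbers the groups of the preceding columns in the order they appear after the sort so far, and flips the direction on every odd-numbered group. The names `_codes` and `serpentine_sort_sketch` are hypothetical helpers, not pandas API. It assumes there are no missing values in the sort columns, and it has not been compared against PROC SURVEYSELECT output.

```python
from itertools import product

import pandas as pd


def _codes(s):
    """Integer codes 0..n-1 that follow the column's sort order (hypothetical helper)."""
    if isinstance(s.dtype, pd.CategoricalDtype):
        return s.cat.codes.to_numpy()
    return s.rank(method="dense").to_numpy().astype("int64") - 1


def serpentine_sort_sketch(df, cols):
    """Return df with rows in serpentine order over cols; assumes no missing values."""
    cols = list(cols)
    # Work on row positions so a non-unique index in df cannot break alignment.
    keys = pd.DataFrame(index=pd.RangeIndex(len(df)))
    for i, col in enumerate(cols):
        codes = pd.Series(_codes(df[col]), index=keys.index)
        if i > 0:
            # Sort by the signed keys built so far, then number each run of
            # identical key rows 0, 1, 2, ... in their current order.
            prev = keys.sort_values(cols[:i])
            block = prev.ne(prev.shift()).any(axis=1).cumsum() - 1
            # Even-numbered blocks sort ascending (+1), odd ones descending (-1).
            codes = codes * (1 - 2 * (block % 2))
        keys[col] = codes
    order = keys.sort_values(cols).index.to_numpy()
    return df.iloc[order]


# Two levels per variable, i.e. an even number of categories.
demo = pd.DataFrame(list(product([0, 1], repeat=3)), columns=["a", "b", "c"])
result = serpentine_sort_sketch(demo, ["a", "b", "c"])
print(result.to_string(index=False))
```

The sign for each column is taken from the group's position in the partially sorted keys rather than from the plain ascending order, so consecutive rows of the demo differ in a single column.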

   Variable1 Variable2 Variable3
0        Low       Low       Low
1        Low       Low    Medium
2        Low       Low      High
5        Low    Medium      High
4        Low    Medium    Medium
3        Low    Medium       Low
6        Low      High       Low
7        Low      High    Medium
8        Low      High      High
17    Medium      High      High
16    Medium      High    Medium
15    Medium      High       Low
12    Medium    Medium       Low
13    Medium    Medium    Medium
14    Medium    Medium      High
11    Medium       Low      High
10    Medium       Low    Medium
9     Medium       Low       Low
18      High       Low       Low
19      High       Low    Medium
20      High       Low      High
23      High    Medium      High
22      High    Medium    Medium
21      High    Medium       Low
24      High      High       Low
25      High      High    Medium
26      High      High      High

Intermediates:

# tmp
    Variable1  Variable2  Variable3
0           0          0          0
1           0          0          1
2           0          0          2
3           0          1          0
4           0          1          1
5           0          1          2
6           0          2          0
7           0          2          1
8           0          2          2
9           1          0          0
10          1          0          1
11          1          0          2
12          1          1          0
13          1          1          1
14          1          1          2
15          1          2          0
16          1          2          1
17          1          2          2
18          2          0          0
19          2          0          1
20          2          0          2
21          2          1          0
22          2          1          1
23          2          1          2
24          2          2          0
25          2          2          1
26          2          2          2

# tmp.apply(lambda x: tmp.groupby(list(tmp.loc[:, x.name:]))
#                        .cumcount().mod(2).mul(2).rsub(1))
    Variable1  Variable2  Variable3
0           1          1          1
1           1          1          1
2           1          1          1
3           1          1         -1
4           1          1         -1
5           1          1         -1
6           1          1          1
7           1          1          1
8           1          1          1
9           1         -1         -1
10          1         -1         -1
11          1         -1         -1
12          1         -1          1
13          1         -1          1
14          1         -1          1
15          1         -1         -1
16          1         -1         -1
17          1         -1         -1
18          1          1          1
19          1          1          1
20          1          1          1
21          1          1         -1
22          1          1         -1
23          1          1         -1
24          1          1          1
25          1          1          1
26          1          1          1
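The sign table above comes from mapping the parity of a running count to +1 or -1. A small sketch of that step in isolation, using a made-up one-column frame that mirrors the first nine `Variable3` codes:

```python
import pandas as pd

# Codes of the last column for the first nine rows of tmp.
tmp_small = pd.DataFrame({"Variable3": [0, 1, 2, 0, 1, 2, 0, 1, 2]})

# Running count of earlier rows that share the same value.
counts = tmp_small.groupby("Variable3").cumcount()

# Even counts map to +1 and odd counts to -1, i.e. 1 - 2 * (count % 2).
signs = counts.mod(2).mul(2).rsub(1)
print(signs.tolist())
```

Multiplying the codes by these signs makes an ordinary ascending sort run backwards inside every second block.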
