How to group by a single column and create Python-style list strings from multiple rows and columns


I want to group the data by mcode and create two different types of rows for each group.

Here is the sample data.

   Cat1  Cat2  Cat3  mcode key   pcode needed
 1 C1    C2    C31   B3100 TRUE  P001  P001  
 2 C1    C2    C31   B3100 FALSE P002  P002  
 3 C1    C2    C31   B5500 TRUE  P003  P003  
 4 C1    C2    C31   B5500 FALSE P004  NA    
 5 C1    C2    C31   B5500 FALSE P005  NA    
 6 C1    C2    C32   B1000 TRUE  P006  NA    
 7 C1    C2    C32   B1000 FALSE P007  P007  
 8 C1    C2    C32   B1000 FALSE P008  NA    
 9 C1    C2    C32   B1000 FALSE P009  P009  
10 C1    C2    C32   B1000 FALSE P010  P010  

For each group, I want to take the category values (Cat1, Cat2, Cat3) from the row where key is TRUE.

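In addition, I need to create Python-style list strings that combine all values in the pcode column and in the needed column, respectively, excluding NA values.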

Note that key is TRUE on the first row where mcode takes a new value.

Here is the expected output.

  mcode Cat1  Cat2  Cat3  type   extended_info                                             
1 B1000 C1    C2    C32   pcode  ['P006','P007','P008','P009','P010']
2 B1000 C1    C2    C32   needed ['P007','P009','P010']              
3 B3100 C1    C2    C31   pcode  ['P001','P002']                     
4 B3100 C1    C2    C31   needed ['P001','P002']                     
5 B5500 C1    C2    C31   pcode  ['P003','P004','P005']              
6 B5500 C1    C2    C31   needed ['P003']   

Here are tribbles that reproduce the data and the expected output:

library(tibble)  # provides tribble()

df <- tribble(
  ~Cat1, ~Cat2, ~Cat3, ~mcode, ~key,   ~pcode, ~needed,
  "C1",        "C2",        "C31",       "B3100",      TRUE,   "P001",       "P001",
  "C1",        "C2",        "C31",       "B3100",      FALSE,  "P002",       "P002",
  "C1",        "C2",        "C31",       "B5500",      TRUE,   "P003",       "P003",
  "C1",        "C2",        "C31",       "B5500",      FALSE,  "P004",       NA,
  "C1",        "C2",        "C31",       "B5500",      FALSE,  "P005",       NA,
  "C1",        "C2",        "C32",       "B1000",      TRUE,   "P006",       NA,
  "C1",        "C2",        "C32",       "B1000",      FALSE,  "P007",       "P007",
  "C1",        "C2",        "C32",       "B1000",      FALSE,  "P008",       NA,
  "C1",        "C2",        "C32",       "B1000",      FALSE,  "P009",       "P009",
  "C1",        "C2",        "C32",       "B1000",      FALSE,  "P010",       "P010"
)
expected_output <- tribble(
  ~mcode, ~Cat1, ~Cat2, ~Cat3, ~type,   ~extended_info,
  "B1000", "C1", "C2", "C32", "pcode",  "['P006','P007','P008','P009','P010']",
  "B1000", "C1", "C2", "C32", "needed", "['P007','P009','P010']",
  "B3100", "C1", "C2", "C31", "pcode",  "['P001','P002']",
  "B3100", "C1", "C2", "C31", "needed", "['P001','P002']",
  "B5500", "C1", "C2", "C31", "pcode",  "['P003','P004','P005']",
  "B5500", "C1", "C2", "C31", "needed", "['P003']"
)
r dplyr

1 Answer

It looks like there is only one row with key = TRUE per mcode.

Something like this should do what you need:

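library(dplyr)  # the .by argument needs dplyr >= 1.1.0
library(tidyr)

expected_output <- df %>% 
  # One row per mcode: categories come from the key == TRUE row
  summarise(Cat1 = first(Cat1[key]), 
            Cat2 = first(Cat2[key]),
            Cat3 = first(Cat3[key]),
            # Drop NA, then collapse into a string such as "['P001','P002']"
            pcode = paste0("['", paste(sort(unique(pcode[!is.na(pcode)])), collapse = "','"), "']"),
            needed = paste0("['", paste(sort(unique(needed[!is.na(needed)])), collapse = "','"), "']"),
            .by = mcode) %>% 
  # Stack the two string columns into type / extended_info rows
  pivot_longer(cols = c(pcode, needed), names_to = "type", values_to = "extended_info") %>% 
  arrange(mcode)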
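
To turn a vector into the Python-style list string shown in the expected output, a small base R helper may be convenient. The sketch below is only one possible way to do it. The name to_py_list is hypothetical and not part of dplyr or tidyr. It keeps the original order of the values, and it returns "[]" when every value is NA, a case that does not occur in the sample data.

```r
# Hypothetical helper: format a vector as a Python-style list string.
# NA values are dropped; the remaining values keep their original order.
to_py_list <- function(x) {
  x <- x[!is.na(x)]
  # sprintf() returns character(0) for empty input, so the result is "[]"
  quoted <- sprintf("'%s'", x)
  paste0("[", paste(quoted, collapse = ","), "]")
}

to_py_list(c("P006", NA, "P007"))   # "['P006','P007']"
to_py_list(c(NA, NA))               # "[]"
```

Inside summarise() it could be called as pcode = to_py_list(pcode) and needed = to_py_list(needed), which yields character columns rather than list-columns.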