dplyr: stacking repeated columns

Question (votes: 1, answers: 2)

I receive a data set that is much larger than this one, but has a similar structure:

cost <- data.frame(Date=rep('1970-01-01', 3), Atr=runif(3),Atrb=runif(3),a=runif(3), b=runif(3), a=runif(3), b=runif(3),a=runif(3), b=runif(3))
names(cost) <- c('Date', 'Atr','Atrb','a','b','a','b','a','b')

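In the real data, more columns than just a and b have repeated names, and there are many more attribute columns than shown here.

Ideally, I would like to be able to express any exclusions in the -c(Date:Atrb) style.

I would like to be able to create two kinds of tall (long) data: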

    Date         Atr     Atrb   Group    a         b     
1 1970-01-01 0.988181929 0.123  A        0.3836335 0.7793414
... 
1 1970-01-01 0.988181929 0.456  B        0.6464    0.98687
...
1 1970-01-01 0.988181929 0.789  C        0.123     0.3456

The other is:

    Date         Atr     Atrb   Group    Metric         Value     
1 1970-01-01 0.988181929 0.123    A        a              0.3836335 
1 1970-01-01 0.988181929 0.456    A        b              0.7793414
... 
1 1970-01-01 0.988181929 0.678    B        a              0.6464    
1 1970-01-01 0.988181929 0.345    B        b              0.98687
...
1 1970-01-01 0.988181929 0.789    C        a              0.123     
1 1970-01-01 0.988181929 0.456    C        b              0.3456

Thanks for any hints or tips.

r dplyr
2 Answers
1 vote

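Here is a tidyverse solution, but it requires first deduplicating the repeated column names. There is probably a way to do this inside the pipe with rename_at or something similar, but nothing occurs to me at the moment: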

library(tidyverse)

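# Deduplicate column names so that each a/b pair shares one group letter:
# a.A, b.A, a.B, b.B, ... (assumes the repeated columns arrive in complete a, b blocks)
idx = grep("^[ab]$", names(cost))
n.metrics = length(unique(names(cost)[idx]))
names(cost)[idx] = paste0(names(cost)[idx], ".", rep(LETTERS[seq_len(length(idx)/n.metrics)], each=n.metrics))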

cost.long = cost %>%
  gather(key, value, -Date, -Atr) %>%
  separate(key, into=c("Metric", "Group"))

cost.long
         Date       Atr Metric Group      value
1  1970-01-01 0.5567203      a     A 0.43008996
2  1970-01-01 0.9421835      a     A 0.94488436
3  1970-01-01 0.4672264      a     A 0.70847981
...
16 1970-01-01 0.5567203      b     C 0.53797611
17 1970-01-01 0.9421835      b     C 0.02623668
18 1970-01-01 0.4672264      b     C 0.72440841
cost.long %>% spread(Metric, value)
        Date       Atr Group         a          b
1 1970-01-01 0.4672264     A 0.7084798 0.34580760
2 1970-01-01 0.4672264     B 0.6537164 0.69052451
3 1970-01-01 0.4672264     C 0.6811653 0.72440841
4 1970-01-01 0.5567203     A 0.4300900 0.35140699
5 1970-01-01 0.5567203     B 0.7600863 0.76989417
6 1970-01-01 0.5567203     C 0.7536469 0.53797611
7 1970-01-01 0.9421835     A 0.9448844 0.61829407
8 1970-01-01 0.9421835     B 0.8929478 0.95985575
9 1970-01-01 0.9421835     C 0.9765727 0.02623668
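
The question mentions that more columns than a and b repeat, so a deduplication step that does not hard-code the names may be handy. The following is only a sketch under assumptions: the toy data, the helper variables nm, dup and occ, and the choice of letters as group labels are illustrative, and it assumes that the k-th occurrence of every repeated name belongs to the k-th group.

```r
# Toy data with the same repeated-name layout as the question (values are random).
set.seed(42)
cost <- data.frame(Date = rep("1970-01-01", 3), matrix(runif(24), nrow = 3))
names(cost) <- c("Date", "Atr", "Atrb", "a", "b", "a", "b", "a", "b")

nm  <- names(cost)
dup <- nm %in% nm[duplicated(nm)]               # TRUE for every column whose name repeats
occ <- ave(seq_along(nm), nm, FUN = seq_along)  # 1 for the first "a", 2 for the second, ...

# Append a group letter to the repeated names only (assumes at most 26 repeats).
names(cost)[dup] <- paste0(nm[dup], ".", LETTERS[occ[dup]])
names(cost)
```

Because the group label comes from the occurrence count of each name rather than from column position, this does not depend on the repeated columns being interleaved in any particular order.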

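Update: to address your comment and the updated sample data: if grep is not an option, there are other ways to pick out the columns. For example: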

idx = which(names(cost) %in% c("a","b"))

# each= keeps every a/b pair under the same group letter
n.metrics = length(unique(names(cost)[idx]))
names(cost)[idx] = paste0(names(cost)[idx], ".", rep(LETTERS[seq_len(length(idx)/n.metrics)], each=n.metrics))

head(cost)
        Date       Atr      Atrb       a.A       b.A       a.B       b.B       a.C       b.C
1 1970-01-01 0.9437837 0.6600156 0.6084679 0.7855074 0.5800305 0.5174202 0.8040882 0.5707653
2 1970-01-01 0.7802974 0.5401236 0.4929647 0.5453363 0.5287559 0.3684388 0.5405421 0.2421765
3 1970-01-01 0.5336272 0.9807934 0.4530894 0.6434532 0.8517817 0.1269552 0.6914684 0.3035965

I don't get any warnings with the following code:

cost %>%
  gather(key, value, -c(Date:Atrb)) %>%
  separate(key, into=c("Metric", "Group"))
         Date       Atr      Atrb Metric Group     value
1  1970-01-01 0.9437837 0.6600156      a     A 0.6084679
2  1970-01-01 0.7802974 0.5401236      a     A 0.4929647
3  1970-01-01 0.5336272 0.9807934      a     A 0.4530894
4  1970-01-01 0.9437837 0.6600156      b     A 0.7855074
5  1970-01-01 0.7802974 0.5401236      b     A 0.5453363
6  1970-01-01 0.5336272 0.9807934      b     A 0.6434532
7  1970-01-01 0.9437837 0.6600156      a     B 0.5800305
8  1970-01-01 0.7802974 0.5401236      a     B 0.5287559
9  1970-01-01 0.5336272 0.9807934      a     B 0.8517817
10 1970-01-01 0.9437837 0.6600156      b     B 0.5174202
11 1970-01-01 0.7802974 0.5401236      b     B 0.3684388
12 1970-01-01 0.5336272 0.9807934      b     B 0.1269552
13 1970-01-01 0.9437837 0.6600156      a     C 0.8040882
14 1970-01-01 0.7802974 0.5401236      a     C 0.5405421
15 1970-01-01 0.5336272 0.9807934      a     C 0.6914684
16 1970-01-01 0.9437837 0.6600156      b     C 0.5707653
17 1970-01-01 0.7802974 0.5401236      b     C 0.2421765
18 1970-01-01 0.5336272 0.9807934      b     C 0.3035965
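
In current tidyr, gather() and spread() are superseded by pivot_longer() and pivot_wider(), so the same reshaping might be written as below. This is a sketch rather than part of the original answer: it assumes tidyr 1.0.0 or later and column names that have already been deduplicated into the metric.group form (a.A, b.A, ...); the toy data and the object names long and wide are illustrative.

```r
library(tidyr)

# Toy data whose repeated columns are already named metric.group.
set.seed(1)
cost <- data.frame(Date = rep("1970-01-01", 3), Atr = runif(3), Atrb = runif(3),
                   a.A = runif(3), b.A = runif(3),
                   a.B = runif(3), b.B = runif(3),
                   a.C = runif(3), b.C = runif(3))

# Second form from the question: one row per Date/Group/Metric.
long <- pivot_longer(cost, cols = -c(Date:Atrb),
                     names_to = c("Metric", "Group"), names_sep = "\\.",
                     values_to = "Value")

# First form: the special ".value" name keeps a and b as separate columns.
wide <- pivot_longer(cost, cols = -c(Date:Atrb),
                     names_to = c(".value", "Group"), names_sep = "\\.")
```

Both calls accept the -c(Date:Atrb) exclusion style asked for in the question, and the ".value" variant produces the first layout without a separate spread step.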

1 vote

data.table's extension of melt is very useful for this, because it can melt several sets of repeated columns at the same time:

library(data.table)

setDT(cost)
melt(cost, id.vars=c("Date","Atr"), measure.vars=patterns("^a$","^b$"), value.name=c("a","b"))

         Date        Atr variable          a          b
1: 1970-01-01 0.09643571        1 0.38316876 0.69935636
2: 1970-01-01 0.10714089        1 0.12154920 0.42598159
3: 1970-01-01 0.91581813        1 0.03301164 0.21327371
4: 1970-01-01 0.09643571        2 0.06866915 0.05604199
5: 1970-01-01 0.10714089        2 0.74418388 0.17013278
6: 1970-01-01 0.91581813        2 0.33784588 0.33794886
7: 1970-01-01 0.09643571        3 0.19680638 0.51427164
8: 1970-01-01 0.10714089        3 0.71372700 0.71134925
9: 1970-01-01 0.91581813        3 0.34700614 0.84975838

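To get the second form, you can simply apply the standard form of melt to that result, along the lines of melt(cost2, id.vars=c("Date","Atr", "variable")), assuming cost2 holds the output of the call above.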
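
Putting the two melt steps together might look like the sketch below. It is an illustration under assumptions, not the answerer's exact code: the toy data, the names cost2 and cost3, the extra Atrb id column, and the relabeling of the numeric variable index to letters are all added here for clarity.

```r
library(data.table)

# Toy data with the repeated a/b columns from the question (values are random).
set.seed(1)
cost <- data.frame(Date = rep("1970-01-01", 3), matrix(runif(24), nrow = 3))
names(cost) <- c("Date", "Atr", "Atrb", "a", "b", "a", "b", "a", "b")
setDT(cost)

# First form: melt all "a" columns and all "b" columns side by side.
cost2 <- melt(cost, id.vars = c("Date", "Atr", "Atrb"),
              measure.vars = patterns("^a$", "^b$"),
              value.name = c("a", "b"), variable.name = "Group")
# Group holds the occurrence index 1, 2, 3; map it to letters as in the question.
cost2[, Group := LETTERS[as.integer(Group)]]

# Second form: an ordinary melt of the two value columns.
cost3 <- melt(cost2, id.vars = c("Date", "Atr", "Atrb", "Group"),
              measure.vars = c("a", "b"),
              variable.name = "Metric", value.name = "Value")
```

Naming the index column Group in the first call avoids ending up with two columns called variable after the second melt.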
