Is it possible to set column types beforehand when using `data.table`'s DT[i, j, by]?

Question

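I am trying to compute the correlation between two variables for several different groups (e.g. DT[, cor.test(var1, var2), group]). This works fine whenever I use cor.test(var1, var2, method = 'pearson'), but it throws an error when I use cor.test(var1, var2, method = 'spearman').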

library(data.table)
DT <- as.data.table(iris)

# works perfectly 
DT[,cor.test(Sepal.Length,Sepal.Width, method = 'pearson'), Species]
#       Species statistic parameter      p.value  estimate null.value
# 1:     setosa  7.680738        48 6.709843e-10 0.7425467          0
# 2:     setosa  7.680738        48 6.709843e-10 0.7425467          0
# 3: versicolor  4.283887        48 8.771860e-05 0.5259107          0
# 4: versicolor  4.283887        48 8.771860e-05 0.5259107          0
# 5:  virginica  3.561892        48 8.434625e-04 0.4572278          0
# 6:  virginica  3.561892        48 8.434625e-04 0.4572278          0
#    alternative                               method
# 1:   two.sided Pearson's product-moment correlation
# 2:   two.sided Pearson's product-moment correlation
# 3:   two.sided Pearson's product-moment correlation
# 4:   two.sided Pearson's product-moment correlation
# 5:   two.sided Pearson's product-moment correlation
# 6:   two.sided Pearson's product-moment correlation
#                       data.name  conf.int
# 1: Sepal.Length and Sepal.Width 0.5851391
# 2: Sepal.Length and Sepal.Width 0.8460314
# 3: Sepal.Length and Sepal.Width 0.2900175
# 4: Sepal.Length and Sepal.Width 0.7015599
# 5: Sepal.Length and Sepal.Width 0.2049657
# 6: Sepal.Length and Sepal.Width 0.6525292

# error
DT[,cor.test(Sepal.Length,Sepal.Width, method = 'spearman'), Species]
# Error in `[.data.table`(DT, , cor.test(Sepal.Length, Sepal.Width, method = "spearman"), : 
# Column 2 of j's result for the first group is NULL. We rely on the column types of the first 
# result to decide the type expected for the remaining groups (and require consistency). NULL 
# columns are acceptable for later groups (and those are replaced with NA of appropriate type 
# and recycled) but not for the first. Please use a typed empty vector instead, such as 
# integer() or numeric().

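Question:

I know there are workarounds for this particular example, but in general, is it possible to tell DT[i, j, by = 'something'] in advance what the column types are when using data.table?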

Answer 1

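If you want to keep all columns rather than dropping the ones that are NULL, you can manually set the class of the 'problem' column (in this case, the column causing the problem is "parameter"). This is preferable to dropping the NULLs if the column does contain values for some groups but not for others.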

DT[, {
  res <- cor.test(Sepal.Length, Sepal.Width, method = 'spearman')
  class(res$parameter) <- 'integer'
  res
  }, Species]

#      Species statistic parameter      p.value  estimate null.value alternative                          method                    data.name
#1:     setosa  5095.097        NA 2.316710e-10 0.7553375          0   two.sided Spearman's rank correlation rho Sepal.Length and Sepal.Width
#2: versicolor 10045.855        NA 1.183863e-04 0.5176060          0   two.sided Spearman's rank correlation rho Sepal.Length and Sepal.Width
#3:  virginica 11942.793        NA 2.010675e-03 0.4265165          0   two.sided Spearman's rank correlation rho Sepal.Length and Sepal.Width
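
The same idea can be generalized a little. Below is a minimal sketch, assuming you know in advance which elements may come back as NULL and what type they should have. `fill_null` and its `types` argument are hypothetical names made up for this illustration, not part of data.table. `exact = FALSE` is added only to request the asymptotic p-value, which avoids the warning about ties on the iris data.

```r
library(data.table)

DT <- as.data.table(iris)

# Hypothetical helper (not a data.table function): replace NULL elements of a
# result list with a typed NA taken from `types`, a named list of templates.
fill_null <- function(res, types) {
  res <- unclass(res)  # work on a plain list rather than an "htest" object
  for (nm in names(types)) {
    if (is.null(res[[nm]])) res[[nm]] <- types[[nm]]
  }
  res
}

out <- DT[, fill_null(
  cor.test(Sepal.Length, Sepal.Width, method = "spearman", exact = FALSE),
  types = list(parameter = NA_integer_)  # assumed type for the df column
), by = Species]

out
```

Because the first group now returns a typed NA instead of NULL, data.table can decide the column type from the first result, which is what the error message asks for.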

Answer 2

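In my opinion, the error message is actually self-explanatory:

Column 2 of j's result for the first group is NULL. We rely on the column types of the first result to decide the type expected for the remaining groups (and require consistency). NULL columns are acceptable for later groups (and those are replaced with NA of appropriate type and recycled) but not for the first. Please use a typed empty vector instead, such as integer() or numeric().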
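
To see which element the message refers to, you can inspect a single cor.test() result outside of data.table. This is just a quick diagnostic sketch in base R. The name `null_idx` is arbitrary, and `exact = FALSE` is only there to avoid the ties warning.

```r
# Run the test once on the whole data set and look for NULL elements.
res <- cor.test(iris$Sepal.Length, iris$Sepal.Width,
                method = "spearman", exact = FALSE)

# Named integer vector: positions (and names) of the NULL elements.
null_idx <- which(vapply(res, is.null, logical(1)))
null_idx
```

The position reported here should match the "Column 2" in the error, i.e. the "parameter" element.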

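You may want to use Filter to drop the NULLs (but be aware that this relies on the NULLs being in the same positions for every by group):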

DT[, Filter(Negate(is.null), cor.test(Sepal.Length,Sepal.Width, method = 'spearman')), Species]

Output:

      Species statistic      p.value  estimate null.value alternative                          method                    data.name
1:     setosa  5095.097 2.316710e-10 0.7553375          0   two.sided Spearman's rank correlation rho Sepal.Length and Sepal.Width
2: versicolor 10045.855 1.183863e-04 0.5176060          0   two.sided Spearman's rank correlation rho Sepal.Length and Sepal.Width
3:  virginica 11942.793 2.010675e-03 0.4265165          0   two.sided Spearman's rank correlation rho Sepal.Length and Sepal.Width
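
If you are worried about the NULL positions differing between groups, one alternative is to return only the elements you need, by name, so every group yields the same columns. A minimal sketch, assuming only the statistic, p-value and estimate are of interest (the choice of columns is mine, and `exact = FALSE` is there only to avoid the ties warning):

```r
library(data.table)

DT <- as.data.table(iris)

out <- DT[, {
  res <- cor.test(Sepal.Length, Sepal.Width,
                  method = "spearman", exact = FALSE)
  # Build the result explicitly so the column set and types are fixed.
  list(statistic = unname(res$statistic),
       p.value   = res$p.value,
       estimate  = unname(res$estimate))
}, by = Species]

out
```

This does not answer the general question of declaring types up front, but it sidesteps the NULL column entirely.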

See also: R: removing NULL elements from a list
