R factor function is slow on long data frames


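I have a very long data frame (millions of rows, a few columns). To run a fixed-effects regression, I want to use the factor function to declare the categorical variables as factors, but this is very slow. I am looking for a way to speed it up.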

My code is as follows:

library(lfe)
my_data=read.csv("path_to//data.csv")
attach(data.frame(my_data))

The following line is the very slow one:

my_data$col <- factor(my_data$col)
r performance dataframe categorical-data dummy-variable
2 Answers
5 votes

If you know the levels of the factor you are creating, passing them explicitly can speed things up considerably. Observe:

library(microbenchmark)
set.seed(237)
test <- sample(letters, 10^7, replace = TRUE)
microbenchmark(noLevels = factor(test), withLevels = factor(test, levels = letters), times = 20)
Unit: milliseconds
      expr      min       lq     mean   median       uq      max neval cld
  noLevels 523.6078 545.3156 653.4833 696.4768 715.9026 862.2155    20   b
withLevels 248.6904 270.3233 325.0762 291.6915 345.7774 534.2473    20  a 

To get the levels in the OP's case, we simply call unique:

myLevels <- unique(my_data$col)
my_data$col <- factor(my_data$col, levels = myLevels)
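One caveat worth checking: unique() returns values in order of first appearance, whereas factor() without a levels argument sorts them. The level order determines the reference category in a regression, so sorting the unique values first reproduces the default ordering. A minimal sketch on toy data (the vector x is made up for illustration and is not the OP's data):

```r
# Toy data, made up for illustration
x <- c("b", "a", "c", "a", "b")

f_default <- factor(x)                            # default: levels are sorted
f_unique  <- factor(x, levels = unique(x))        # levels in order of first appearance
f_sorted  <- factor(x, levels = sort(unique(x)))  # sorted levels, supplied explicitly

levels(f_default)
levels(f_unique)
levels(f_sorted)
```

Whether the extra sort() matters depends on whether anything downstream relies on the level order.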

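There is also an Rcpp offering written by Kevin Ushey (Fast factor generation with Rcpp). I modified the code slightly, on the assumption that the levels are known a priori. The function from the referenced site is RcppNoLevs, and the modified Rcpp function is RcppWithLevs in the benchmark below: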

microbenchmark(noLevels = factor(test),
               withLevels = factor(test, levels = letters),
               RcppNoLevs = fast_factor(test),
               RcppWithLevs = fast_factor_Levs(test, letters), times = 20)
Unit: milliseconds
        expr      min       lq     mean   median       uq       max neval  cld
    noLevels 571.5482 609.6640 672.1249 645.4434 704.4402 1032.7595    20    d
  withLevels 275.0570 294.5768 318.7556 309.2982 342.8374  383.8741    20   c 
  RcppNoLevs 189.5656 203.3362 213.2624 206.9281 215.6863  292.8997    20  b  
RcppWithLevs 105.7902 111.8863 120.0000 117.9411 122.8043  173.8130    20 a   

Here is the modified Rcpp function, which assumes the levels are passed as an argument:

#include <Rcpp.h>
using namespace Rcpp;

template <int RTYPE>
IntegerVector fast_factor_template_Levs( const Vector<RTYPE>& x, const Vector<RTYPE>& levs) {
    IntegerVector out = match(x, levs);
    out.attr("levels") = as<CharacterVector>(levs);
    out.attr("class") = "factor";
    return out;
}

// [[Rcpp::export]]
SEXP fast_factor_Levs( SEXP x, SEXP levs) {
    switch( TYPEOF(x) ) {
    case INTSXP: return fast_factor_template_Levs<INTSXP>(x, levs);
    case REALSXP: return fast_factor_template_Levs<REALSXP>(x, levs);
    case STRSXP: return fast_factor_template_Levs<STRSXP>(x, levs);
    }
    return R_NilValue;
}
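For readers who want to see what the C++ code does without compiling it, the same steps can be sketched in base R: match each value against the supplied levels, then attach the levels and class attributes. The name fast_factor_base is hypothetical and is not part of the answer above. This sketch only illustrates the logic and makes no claim about speed.

```r
# Base R sketch of the same logic; fast_factor_base is a hypothetical name
fast_factor_base <- function(x, levs) {
  out <- match(x, levs)                      # integer codes; NA where x is not in levs
  attr(out, "levels") <- as.character(levs)  # levels are stored as character
  attr(out, "class") <- "factor"
  out
}

# Toy data, made up for illustration
x <- c("b", "a", "c", "a")
f <- fast_factor_base(x, letters[1:3])
levels(f)
as.character(f)
```

Values that are absent from levs become NA, which matches how factor() treats values outside the supplied levels.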

0 votes

Another option is cheapr::factor_(), which works similarly to factor().

Setting up the data and functions

  library(microbenchmark)
  library(ggplot2)
  library(cheapr)
  library(Rcpp)
#> Warning: package 'Rcpp' was built under R version 4.4.1
  
  set.seed(237)
  test <- sample(letters, 10^7, replace = TRUE)
  
  # sourceCpp(code = ...) is used because the string is a complete C++ file
  # (includes plus an exported function), not a single function definition.
  sourceCpp(code = '#include <Rcpp.h>
using namespace Rcpp;

template <int RTYPE>
IntegerVector fast_factor_template_Levs( const Vector<RTYPE>& x, const Vector<RTYPE>& levs) {
    IntegerVector out = match(x, levs);
    out.attr("levels") = as<CharacterVector>(levs);
    out.attr("class") = "factor";
    return out;
}

// [[Rcpp::export]]
SEXP fast_factor_Levs( SEXP x, SEXP levs) {
    switch( TYPEOF(x) ) {
    case INTSXP: return fast_factor_template_Levs<INTSXP>(x, levs);
    case REALSXP: return fast_factor_template_Levs<REALSXP>(x, levs);
    case STRSXP: return fast_factor_template_Levs<STRSXP>(x, levs);
    }
    return R_NilValue;
}')
  
  # sourceCpp(code = ...) is used because the string is a complete C++ file
  # (includes plus an exported function), not a single function definition.
  sourceCpp(code = '#include <Rcpp.h>
using namespace Rcpp;

template <int RTYPE>
IntegerVector fast_factor_template( const Vector<RTYPE>& x ) {
    Vector<RTYPE> levs = sort_unique(x);
    IntegerVector out = match(x, levs);
    out.attr("levels") = as<CharacterVector>(levs);
    out.attr("class") = "factor";
    return out;
}

// [[Rcpp::export]]
SEXP fast_factor( SEXP x ) {
    switch( TYPEOF(x) ) {
    case INTSXP: return fast_factor_template<INTSXP>(x);
    case REALSXP: return fast_factor_template<REALSXP>(x);
    case STRSXP: return fast_factor_template<STRSXP>(x);
    }
    return R_NilValue;
}')

Benchmark

  
  microbenchmark(
    cheapr_no_levels = factor_(test),
    cheapr_with_levels = factor_(test, levels = letters),
    base_no_levels = factor(test), 
    base_with_levels = factor(test, levels = letters),
    rcpp_no_levels = fast_factor(test),
    rcpp_with_levels = fast_factor_Levs(test, levs = letters),
    times = 5
  ) |> 
    autoplot()

Created on 2024-08-16 with reprex v2.1.0
