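I am using glmnet and glmnetcr to fit an ordinal regression model.
Unfortunately, my model matrix is roughly 640,000 × 5,000. That is about 3.2 billion elements, more than a 32-bit integer can index, and I am running into the same problem others have described: R vector size limit: "long vectors (argument 5) are not supported in .C"
If I use only half of the data, it runs without problems on my local server, which has plenty of memory.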
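The size arithmetic can be checked directly in R. This is only a minimal sketch using the approximate dimensions quoted above; the variable names are made up for illustration.

```r
# Approximate dimensions of the model matrix (taken from the description above)
n_obs  <- 640000
n_vars <- 5000

# Numeric literals are doubles in R, so this product does not overflow
n_elements <- n_obs * n_vars

# Largest value a 32-bit signed integer can hold (2^31 - 1)
int_max <- .Machine$integer.max

# The full matrix has more elements than a 32-bit index can address
exceeds_limit <- n_elements > int_max

# Half of the rows stays under the limit, which matches the observation
# that half of the data runs fine
half_fits <- (n_obs / 2) * n_vars <= int_max

print(c(full_exceeds = exceeds_limit, half_fits = half_fits))
```

Any object with more than `.Machine$integer.max` elements is a long vector, which the `.C` and `.Fortran` interfaces refuse to accept.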
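I tried to implement the 'solution' from the post above using the dotCall64 package. I replaced the .Fortran call with .C64 and specified a data type for each argument. However, every time I run my code I get either nonsensical lambda values (9.9e35) or a segfault, for example:
 *** caught segfault ***
address 0x1511aaeb0, cause 'memory not mapped'
Which of the two I get, and the exact address, differ from run to run, so I assume I did something wrong when implementing this solution.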
Here is the code from the lognet() function (which is eventually called by both glmnetcr and glmnet, and passes the variables to the Fortran code):
.Fortran("lognet", parm = alpha, nobs, nvars, nc, as.double(x),
y, offset, jd, vp, cl, ne, nx, nlam, flmin, ulam, thresh,
isd, intr, maxit, kopt, lmu = integer(1), a0 = double(nlam *
nc), ca = double(nx * nlam * nc), ia = integer(nx),
nin = integer(nlam), nulldev = double(1), dev = double(nlam),
alm = double(nlam), nlp = integer(1), jerr = integer(1),
PACKAGE = "glmnet")
And here is my modified version, with the .Fortran call replaced by .C64:
.C64("lognet", SIGNATURE = c("double","int", "int", "int", "int64",
"double","double","int", "double","double",
"int", "int", "int", "double","double",
"double","int", "int", "int", "int",
"int", "double","double","int", "int",
"double","double","double","int", "int"),
parm = alpha, nobs, nvars, nc, as.double(x),
y, offset, jd, vp, cl, ne, nx, nlam, flmin, ulam, thresh,
isd, intr, maxit, kopt, lmu = integer(1), a0 = double(nlam * nc), ca = double(nx * nlam * nc), ia = integer(nx),
nin = integer(nlam), nulldev = double(1), dev = double(nlam),
alm = double(nlam), nlp = integer(1), jerr = integer(1),
PACKAGE = "glmnet")
A small reproducible example:
library(glmnetcr)
library(dotCall64)
x1 <- cbind(c(0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1),c(0,0,0,1,0,1,1,1,0,0,0,0,0,1,1,1),c(0,0,1,0,1,0,1,1,0,0,0,0,1,0,1,1),c(0,1,0,0,1,1,0,1,0,0,0,0,1,1,0,1),c(0,0,0,0,0,0,1,1,0,0,0,0,0,0,1,1),c(0,0,0,0,0,1,0,1,0,0,0,0,0,1,0,1),c(0,0,0,0,1,0,0,1,0,0,0,0,1,0,0,1))
y1 <- c(0,0,0,1,1,1,2,2,0,1,0,1,1,2,1,2)
testA <- glmnetcr(x=x1,y=y1,method = "forward", nlambda=10,lambda.min.ratio=0.001, alpha =1,maxit = 500,standardize=FALSE)
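Running this with the original lognet() code causes no problems. Running it with the modified lognet() code produces strange lambda estimates and/or segfaults (seemingly at random). My first guess was that one of the argument types I entered is wrong, but I have gone through them twice and cannot see the problem. The other possibility is that the underlying Fortran code cannot handle 64-bit integers. I know zero Fortran, so if that is the case I am not even sure how to start fixing it.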
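As a rough illustration of how a wrong type declaration could lead to nonsense values rather than a clean error: compiled code interprets the bytes it receives according to the type it expects. The base R sketch below is an analogy only, not what glmnet or dotCall64 actually do internally. It writes two integers to raw bytes and then reads the same bytes back as a double.

```r
# Two 32-bit integers, serialized to their raw bytes (4 bytes each)
ints  <- c(1L, 2L)
bytes <- writeBin(ints, raw())

# Read the same 8 bytes back as a single 64-bit double
reinterpreted <- readBin(bytes, what = "double", n = 1)

# The result is neither 1 nor 2; the exact value depends on the
# platform's byte order
print(length(bytes))
print(reinterpreted)
```

If an argument is declared with one type while the Fortran routine expects another, something similar may happen to every element of that argument, which would be consistent with both the absurd lambda values and the out-of-bounds memory access.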
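So I contacted the maintainers of the glmnet package. They had experience with converting to .C64. With their help and a bit of fiddling, I was able to get the following code working. To run it, I created a new function called glmnet64, which calls another new function, lognet64, instead of the original lognet. lognet64 is identical to the original lognet function, except that the .Fortran call is replaced with the following: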
.C64("lognet", SIGNATURE = c("double", "integer","integer","integer","double",
"double", "double", "integer","double", "double",
"integer","integer","integer","double", "double",
"double", "integer","integer","integer","integer",
"integer","double", "double", "integer","integer",
"double", "double", "double","integer","integer"),
parm = alpha,nobs, nvars, nc, as.double(x),
y, offset, jd, vp, cl,
ne, nx, nlam, flmin, ulam,
thresh, isd, intr, maxit, kopt,
lmu = integer(1), a0 = double(nlam * nc),
ca = double(nx * nlam * nc), ia = integer(nx), nin = integer(nlam),
nulldev = double(1), dev = double(nlam), alm = double(nlam),
nlp = integer(1), jerr = integer(1),
INTENT = c(rep("rw",4),"r",rep("rw",15),rep("w",10)),
PACKAGE = "glmnet",
NAOK = TRUE)
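The key seems to be specifying all of the variable types correctly. I was able to work them out by placing browser() just before the .Fortran call and inspecting each argument. In addition, specifying INTENT and setting NAOK = TRUE improved speed (as expected). I would definitely recommend both.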
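Checking the types by hand for 30 arguments is error-prone, so a small helper can do it. The function below is a hypothetical sketch (signature_from_args is not part of glmnet or dotCall64): it maps each argument's storage type to the "double" or "integer" strings used in SIGNATURE. The example arguments are stand-ins, not the real values passed to lognet.

```r
# Hypothetical helper: derive a SIGNATURE-style character vector from the
# storage type of each argument. Anything other than double or integer
# stops with an error so that it gets looked at manually.
signature_from_args <- function(args) {
  vapply(args, function(a) {
    switch(typeof(a),
           double  = "double",
           integer = "integer",
           stop("unsupported storage type: ", typeof(a)))
  }, character(1), USE.NAMES = FALSE)
}

# Stand-in values for a few of the arguments (illustrative only)
example_args <- list(
  parm = 1,                               # alpha, stored as a double
  nobs = 16L,                             # integer scalar
  x    = matrix(0, nrow = 16, ncol = 7),  # double matrix
  lmu  = integer(1),                      # integer output slot
  a0   = double(10)                       # double output slot
)

sig <- signature_from_args(example_args)
print(sig)
```

Inside the browser() session, the same helper could be applied to a list of the actual arguments, and the result compared against the SIGNATURE vector written by hand.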