我有几个变量的数据帧:地区,季节,年,海拔高度和响应(这里为例):
region season year altitud response
IT wint 2013 800 45
IT wint 2013 815 47
IT wint 2013 840 54
IT wint 2014 800 49
IT wint 2014 815 59
等等。有三个区域,四季两年了,我想altitud与反应之间执行数线性模型和绘制,子集化根据所有可能的组合数据。即
subset(region&season&year) and get altitud~response
IT&wint&2013
IT&wint&2014
IT&spring&2013
IT&spring&2014
等等。因此,24个的组合。有任何想法吗?
预先感谢您非常
大卫
我的解决方案使用broom
与tidy
功能。
读取数据:
library(readr)
data <- read_table("region season year altitud response
IT wint 2013 800 45
IT wint 2013 815 47
IT wint 2013 840 54
IT wint 2014 800 49
IT wint 2014 815 59")
实际的解决方案:
library(dplyr)
library(broom)
data_fit <- data %>%
group_by(region, season, year) %>%
do(fit = lm(altitud ~ response, data = .))
dfCoefs <- tidy(data_fit, fit)
dfCoefs
其给出的示例数据如下回归系数:
# A tibble: 4 x 8
# Groups: region, season, year [2]
region season year term estimate std.error statistic p.value
<chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 IT wint 2013 (Intercept) 613. 34.7 17.7 0.0360
2 IT wint 2013 response 4.22 0.711 5.93 0.106
3 IT wint 2014 (Intercept) 726. NaN NaN NaN
4 IT wint 2014 response 1.5 NaN NaN NaN
虽然,你想altitud ~ response
(即预测从响应高度)或response ~ altitud
(预测给予高度的反应如何?)
但愿我给你正确的,这里有一个purrr的解决方案:
library(purrr)
library(dplyr)
nested<-df %>%
mutate_if(is.character,as.factor) %>%
group_by(year,season,region) %>%
nest()
my_model<-function(df){
lm(altitud~response,data=df)
}
nested %>%
mutate(Mod=map(data,my_model))
结果:部分修改的数据得到的因素。
A tibble: 3 x 5
year season region data Mod
<int> <fct> <fct> <list> <list>
1 2013 wint IT <tibble [3 x 2]> <S3: lm>
2 2014 wint IT <tibble [1 x 2]> <S3: lm>
3 2014 Summer IF <tibble [1 x 2]> <S3: lm>
与modelr
预测。你可以使用broom
统计如图对方的回答。
require(modelr)
nested %>%
mutate(Mod=map(data,my_model)) %>%
mutate(Preds=map2(data,Mod,add_predictions)) %>%
unnest(Preds)
# A tibble: 5 x 6
year season region altitud response pred
<int> <fct> <fct> <int> <int> <dbl>
1 2013 wint IT 800 45 44.4
2 2013 wint IT 815 47 47.9
3 2013 wint IT 840 54 53.7
4 2014 wint IT 800 49 49
5 2014 Summer IF 815 59 59
掌握broom
和purrr
汇总统计:
# A tibble: 4 x 8
year season region term estimate std.error statistic p.value
<int> <fct> <fct> <chr> <dbl> <dbl> <dbl> <dbl>
1 2013 wint IT (Intercept) -140. 31.8 -4.40 0.142
2 2013 wint IT altitud 0.231 0.0389 5.93 0.106
3 2014 wint IT (Intercept) 49 NaN NaN NaN
4 2014 Summer IF (Intercept) 59 NaN NaN NaN
nested %>%
mutate(Mod=map(data,my_model)) %>%
mutate(Preds=map2(data,Mod,add_predictions),Tidy=map(Mod,tidy)) %>%
unnest(Tidy)
数据:
df<-read.table(text="region season year altitud response
IT wint 2013 800 45
IT wint 2013 815 47
IT wint 2013 840 54
IT wint 2014 800 49
IF Summer 2014 815 59",header=T)
为了完整起见,这里也是基础R和data.table解决方案。
使用split()
和lapply()
一个可能的基础R的方法是suggested by Jogo:
result <- lapply(split(DT, list(DT$region, DT$season, DT$year)),
lm, formula = response ~ altitud)
print(result)
$IT.wint.2013
Call:
FUN(formula = ..1, data = X[[i]])
Coefficients:
(Intercept) altitud
-140.0510 0.2306
$IT.wint.2014
Call:
FUN(formula = ..1, data = X[[i]])
Coefficients:
(Intercept) altitud
-484.3333 0.6667
或者,使用管道来提高可读性
library(magrittr)
result <- split(DT, list(DT$region, DT$season, DT$year)) %>%
lapply(lm, formula = response ~ altitud)
随着broom
的一些帮助:
library(data.table)
library(magrittr)
setDT(DT)[, lm(response ~ altitud, .SD) %>% broom::tidy(), by = .(region, season, year)]
region season year term estimate std.error statistic p.value
1: IT wint 2013 (Intercept) -140.0510204 31.82553603 -4.400586 0.1422513
2: IT wint 2013 altitud 0.2306122 0.03888277 5.930962 0.1063382
3: IT wint 2014 (Intercept) -484.3333333 NaN NaN NaN
4: IT wint 2014 altitud 0.6666667 NaN NaN NaN
setDT(DT)[, lm(response ~ altitud, .SD) %>% broom::glance(), by = .(region, season, year)]
region season year r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual
1: IT wint 2013 0.9723576 0.9447152 1.111168 35.17631 0.1063382 2 -2.925132 11.85026 9.1461 1.234694 1
2: IT wint 2014 1.0000000 NaN NaN NaN NaN 2 Inf -Inf -Inf 0.000000 0
如果计算lm()
为不同的基团是耗时这可能是值得的存储结果,并使用这些用于随后的处理步骤:
mod <- setDT(DT)[, .(model = .(lm(response ~ altitud, .SD))), by = .(region, season, year)]
mod
region season year models
1: IT wint 2013 <lm>
2: IT wint 2014 <lm>
mod$models
是模型等价的列表result
。
现在,我们可以从计算模型中提取所需的信息,例如,
mod[, models[[1]] %>% broom::tidy(), by = .(region, season, year)]
region season year term estimate std.error statistic p.value
1: IT wint 2013 (Intercept) -140.0510204 31.82553603 -4.400586 0.1422513
2: IT wint 2013 altitud 0.2306122 0.03888277 5.930962 0.1063382
3: IT wint 2014 (Intercept) -484.3333333 NaN NaN NaN
4: IT wint 2014 altitud 0.6666667 NaN NaN NaN
library(data.table)
DT <- fread("
region season year altitud response
IT wint 2013 800 45
IT wint 2013 815 47
IT wint 2013 840 54
IT wint 2014 800 49
IT wint 2014 815 59")