有两个表,table_a 和 table_b。一个表具有每月数据,而另一个表具有年度数据(尽管两个表都有“年份”列/两个表可能具有不同的年份范围)。示例:
table_a table_b
year month infltn year h_index
[1] 2010 May 4.0% [1] 2011 102.0
[2] 2010 Jun 4.0% [2] 2012 102.5
[3] 2010 Jul 4.0% [3] 2013 103.0
[4] 2010 Aug 4.0% [4] 2014 103.6
[5] 2010 Sep 4.0% [5] 2015 104.1
[6] 2010 Oct 3.0% [6] 2016 104.6
[7] 2010 Nov 3.0% [7] 2017 105.1
[8] 2010 Dec 3.0% [8] 2018 105.6
[9] 2011 Jan 4.0% [9] 2019 106.2
[10] 2011 Feb 4.0% [10] 2020 106.7
[11] 2011 Mar 4.0% [11] 2021 107.2
[12] 2011 Apr 5.0% [12] 2018 107.8
[13] 2011 May 5.0% [13] 2019 108.3
[14] 2011 Jun 5.0% [14] 2020 108.8
[15] 2011 Jul 5.0% [15] 2021 109.4
问题/请求的第一部分:如果创建一个new_table_a(基于table_a的年度数据),则需要对table_b中的数据进行插值以填充新表。作为示例,table_a 中的列 h_index 已作为 h_index2 插入到 new_table_a 中(见下文,new_table_a)。另请注意,在示例中,数据开始于 2011 年,因此在 2011 年之前,h_index2 显示字符“-”。
new_table_a
year month infltn h_index2
[1] 2010 May 4% -
[2] 2010 Jun 4% -
[3] 2010 Jul 4% -
[4] 2010 Aug 4% -
[5] 2010 Sep 4% -
[6] 2010 Oct 3% -
[7] 2010 Nov 3% -
[8] 2010 Dec 3% -
[9] 2011 Jan 4% 102.0%
[10] 2011 Feb 4% 102.1%
[11] 2011 Mar 4% 102.1%
[12] 2011 Apr 5% 102.2%
[13] 2011 May 5% 102.2%
[14] 2011 Jun 5% 102.3%
[15] 2011 Jul 5% 102.3%
问题/请求的第二部分:如果根据table_b的每月数据创建new_table_b,如何平均相应年份的日历月份,以便可以用平均值填充new_table_b(从table_a中的infltn到infltn_avg)新表_b)。示例:
new_table_b
Year h_index infltn_avg
[1] 2011 102.0 4.0%
[2] 2012 102.5 4.0%
[3] 2013 103.0 4.0%
[4] 2014 103.6 4.0%
[5] 2015 104.1 4.0%
[6] 2016 104.6 3.0%
[7] 2017 105.1 3.0%
[8] 2018 105.6 3.0%
[9] 2019 106.2 4.0%
[10] 2020 106.7 4.0%
[11] 2021 107.2 4.0%
[12] 2018 107.8 5.0%
[13] 2019 108.3 5.0%
[14] 2020 108.8 5.0%
[15] 2021 109.4 5.0%
我尝试使用一系列循环和 if 来对此进行编码,这在 VBA 之类的东西中似乎是合理的,尽管在 R 中有点笨拙,而且我确信有一种更智能的方法来实现它。
table_b$month ="Jan"
new_table_a = table_a %>% left_join(table_b, by =c("year", "month"))
new_table_a$h_index2 = na.approx(new_table_a$h_index, na.rm = FALSE)
new_table_a$infltn = as.numeric(sub("%", "", new_table_a$infltn))
new_table_b = new_table_a %>% group_by(year) %>% summarise(h_index2= min(h_index2), infltn_avg = mean(infltn))
新表_a
year month infltn h_index h_index2
1 2010 May 4 NA NA
2 2010 Jun 4 NA NA
3 2010 Jul 4 NA NA
4 2010 Aug 4 NA NA
5 2010 Sep 4 NA NA
6 2010 Oct 3 NA NA
7 2010 Nov 3 NA NA
8 2010 Dec 3 NA NA
9 2011 Jan 4 102.0 102.0000
10 2011 Feb 4 NA 102.0417
11 2011 Mar 4 NA 102.0833
12 2011 Apr 5 NA 102.1250
13 2011 May 5 NA 102.1667
14 2011 Jun 5 NA 102.2083
15 2011 Jul 5 NA 102.2500
16 2011 Aug 5 NA 102.2917
17 2011 Sep 5 NA 102.3333
18 2011 Oct 5 NA 102.3750
19 2011 Nov 5 NA 102.4167
20 2011 Dec 5 NA 102.4583
21 2012 Jan 5 102.5 102.5000
新表_b
# A tibble: 6 x 3
year h_index2 infltn_avg
<int> <dbl> <dbl>
1 2010 NA 3.62
2 2011 102 4.75
3 2012 102. 5
4 2013 103 5
5 2014 104. 6
6 2015 104. 6
首先,为了便于推断
h_index
,我们在 month
中添加一个 table_b
列,设置为“Jan”
table_b$month ="Jan"
然后我们通过
year
和 month
连接两个数据框
new_table_a = table_a %>% left_join(table_b, by =c("year", "month"))
对于外推,您可以使用
na.approx
包中的 zoo
。 h_index2
是 2 个已知 h_index
之间的外推法。注意:由于此列是数字,因此最好保留 NA 而不是添加“-”。
library(zoo)
new_table_a$h_index2 = na.approx(new_table_a$h_index, na.rm = FALSE)
现在要生成
new_table_b
,我们需要每年的平均百分比。为此,我们必须通过删除“%”将 infltn
列转换为数值:
new_table_a$infltn = as.numeric(sub("%", "", new_table_a$infltn))
最后,我们只需要总结
new_table_a
就可以得到new_table_b
。按年份分组后,我们取min(h_index2)
的最低值,做每月通胀的平均值mean(infltn)
new_table_b = new_table_a %>% group_by(year) %>% summarize(h_index2= min(h_index2), infltn_avg = mean(infltn))
table_a = read.table(text="year month infltn
1 2010 May 4.00%
2 2010 Jun 4.00%
3 2010 Jul 4.00%
4 2010 Aug 4.00%
5 2010 Sep 4.00%
6 2010 Oct 3.00%
7 2010 Nov 3.00%
8 2010 Dec 3.00%
9 2011 Jan 4.00%
10 2011 Feb 4.00%
11 2011 Mar 4.00%
12 2011 Apr 5.00%
13 2011 May 5.00%
14 2011 Jun 5.00%
15 2011 Jul 5.00%
16 2011 Aug 5.00%
17 2011 Sep 5.00%
18 2011 Oct 5.00%
19 2011 Nov 5.00%
20 2011 Dec 5.00%
21 2012 Jan 5.00%
22 2012 Feb 5.00%
23 2012 Mar 5.00%
24 2012 Apr 5.00%
25 2012 May 5.00%
26 2012 Jun 5.00%
27 2012 Jul 5.00%
28 2012 Aug 5.00%
29 2012 Sep 5.00%
30 2012 Oct 5.00%
31 2012 Nov 5.00%
32 2012 Dec 5.00%
33 2013 Jan 5.00%
34 2013 Feb 5.00%
35 2013 Mar 5.00%
36 2013 Apr 5.00%
37 2013 May 5.00%
38 2013 Jun 5.00%
39 2013 Jul 5.00%
40 2013 Aug 5.00%
41 2013 Sep 5.00%
42 2013 Oct 5.00%
43 2013 Nov 5.00%
44 2013 Dec 5.00%
45 2014 Jan 6.00%
46 2014 Feb 6.00%
47 2014 Mar 6.00%
48 2014 Apr 6.00%
49 2014 May 6.00%
50 2014 Jun 6.00%
51 2014 Jul 6.00%
52 2014 Aug 6.00%
53 2014 Sep 6.00%
54 2014 Oct 6.00%
55 2014 Nov 6.00%
56 2014 Dec 6.00%
57 2015 Jan 6.00%", header= TRUE)
table_b = read.table(text="year h_index
1 2011 102.0
2 2012 102.5
3 2013 103.0
4 2014 103.6
5 2015 104.1
6 2016 104.6
7 2017 105.1
8 2018 105.6
9 2019 106.2
10 2020 106.7
11 2021 107.2
12 2018 107.8
13 2019 108.3
14 2020 108.8
15 2021 109.4", header = TRUE )