I have a dataset that looks like this:
Account Number  6m      7m      8m      9m      10m     11m     6m_Metric  7m_Metric  8m_Metric  9m_Metric  10m_Metric  11m_Metric
1               Better  X < 10  X < 10  Better  X < 30  X < 30  0.6        0.6        0.9        1.2        0.1         5.0
2               X < 10  X < 20  X < 30  X < 20  X < 20  X < 20  0.4        0.4        3.4        3.7        4.4         0.3
3               Better  Better  Better  Better  X < 10  X < 20  1.5        1.5        1.5        0.3        1.5         1.8
4               X < 10  Better  Same    Same    Same    Same    3.4        3.4        1.8        5.0        5.2         6.8
5               Same    Better  Same    Same    Same    Same    0.1        0.1        5.0        5.3        5.0         1.8
6               Same    Same    Same    Better  Better  Better  4.4        4.4        0.3        0.3        5.2         7.4
7               Same    X < 10  X < 10  X < 10  X < 10  Better  5.0        5.0        1.3        2.1        2.2         0.3
8               Better  Better  Better  Better  Better  Better  7.8        7.8        5.0        1.5        1.9         7.4
9               X < 10  X < 10  X < 10  X < 20  X < 30  Better  9.1        9.1        9.4        5.5        5.6         4.6
10              X < 20  X < 30  X < 30  X < 30  X < 30  X < 30  0.3        0.3        1.5        1.8        2.2         1.5
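Each cell shows what happened to each account 6 to 11 months on, together with that account's metric value for each month. I want to be able to show any trends here, so for each month I would like the number of accounts in each category ("Better", etc.) and the average metric value for that month. So I think the result should look like this: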
Result  6m   7m   8m   9m   10m  11m  Avg_met_6m  Avg_met_7m  Avg_met_8m  Avg_met_9m  Avg_met_10m  Avg_met_11m
X < 10  3    3    3    2    3    0    4.3         4.3         3.9         2.9         3.9          2.2
X < 20  1    1    0    1    1    2    0.3         0.3         3.4         0           4.4          0.3
X < 30  0    1    2    1    2    1    0           0           1.5         2.8         2.2          3.3
Same    3    1    3    2    2    2    3.2         3.2         0.3         3.5         5.1          4.3
Better  1    4    2    4    2    4    3.3         3.3         3.3         0.9         2.2          7.4
This is only an example to illustrate what I am trying to do, so apologies if it contains any typos.
* Name literals such as "6m"n require VALIDVARNAME=ANY;
options validvarname=any;
data have;
infile datalines dlm='|';
input "Account Number"n "6m"n$ "7m"n$ "8m"n$ "9m"n$ "10m"n$ "11m"n$ "6m_Metric"n "7m_Metric"n "8m_Metric"n "9m_Metric"n "10m_Metric"n "11m_Metric"n;
datalines;
1|Better|X < 10|X < 10|Better|X < 30|X < 30|0.6|0.6|0.9|1.2|0.1|5.0
2|X < 10|X < 20|X < 30|X < 20|X < 20|X < 20|0.4|0.4|3.4|3.7|4.4|0.3
3|Better|Better|Better|Better|X < 10|X < 20|1.5|1.5|1.5|0.3|1.5|1.8
4|X < 10|Better|Same|Same|Same|Same|3.4|3.4|1.8|5.0|5.2|6.8
5|Same|Better|Same|Same|Same|Same|0.1|0.1|5.0|5.3|5.0|1.8
6|Same|Same|Same|Better|Better|Better|4.4|4.4|0.3|0.3|5.2|7.4
7|Same|X < 10|X < 10|X < 10|X < 10|Better|5.0|5.0|1.3|2.1|2.2|0.3
8|Better|Better|Better|Better|Better|Better|7.8|7.8|5.0|1.5|1.9|7.4
9|X < 10|X < 10|X < 10|X < 20|X < 30|Better|9.1|9.1|9.4|5.5|5.6|4.6
10|X < 20|X < 30|X < 30|X < 30|X < 30|X < 30|0.3|0.3|1.5|1.8|2.2|1.5
;
run;
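I took the following approach:
(I would suggest keeping the data in the intermediate long format, as it is likely to be easier to work with.)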
* Transpose the character variables;
proc transpose data=have out=char_t (rename = (col1 = Result)) name = time;
by 'account number'n;
var _character_;
run;
* Transpose the numeric variables;
proc transpose data=have out=num_t (where = (time ne 'Account Number') rename = (col1 = Metric)) name = time;
by 'account number'n;
var _numeric_;
run;
* Recode the time variable to match char_t;
data num_t (rename = (t = time) drop = time);
length t $ 3;
set num_t;
t = prxchange('s/\_Metric\s*$//', -1, time);
run;
* Merge them back together;
proc sort data = num_t; by 'Account number'n time; run;
proc sort data = char_t; by 'Account number'n time; run;
data have_t;
merge char_t num_t;
by 'Account Number'n time;
run;
* NOTE: I would leave the data in this format, and use PROCs to
do any further analysis;
* Tabulate to get the required results (also outputting to a data set);
proc tabulate data = have_t out=tab;
class Result time / order = fmt;
var Metric;
table Result, metric * (n mean) * time / misstext="0";
run;
* Need 2 transposes to get the correct column layout in the final data;
* First get the values from Metric_N and Metric_Mean into one column;
proc transpose data = tab out = tab_t;
by Result time;
var metric_n metric_mean;
run;
* Then transpose them into the desired wide format;
proc transpose data = tab_t out = want (drop = _name_);
by Result;
id time _name_;
var col1;
run;
* Finally re-order the columns;
data want;
retain Result '6mMetric_N'n '7mMetric_N'n '8mMetric_N'n '9mMetric_N'n '10mMetric_N'n '11mMetric_N'n
'6mMetric_Mean'n '7mMetric_Mean'n '8mMetric_Mean'n '9mMetric_Mean'n '10mMetric_Mean'n '11mMetric_Mean'n;
set want;
run;
If you need the column names to match your layout exactly, you can use a rename statement in another data step.
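As a rough sketch of that last step, the data step below renames the columns of want to the headings used in the question's desired output. It assumes the wide data set really does contain the columns listed in the retain statement above, and that VALIDVARNAME=ANY is in effect so that names starting with a digit are allowed. The output name want_renamed is hypothetical, chosen only for illustration.

```sas
* Sketch only: assumes WANT has the columns named in the RETAIN statement above;
options validvarname=any;

* want_renamed is a hypothetical name for the final data set;
data want_renamed;
  set want;
  * Counts take the plain month names, means take the Avg_met_ prefix;
  rename '6mMetric_N'n     = '6m'n
         '7mMetric_N'n     = '7m'n
         '8mMetric_N'n     = '8m'n
         '9mMetric_N'n     = '9m'n
         '10mMetric_N'n    = '10m'n
         '11mMetric_N'n    = '11m'n
         '6mMetric_Mean'n  = Avg_met_6m
         '7mMetric_Mean'n  = Avg_met_7m
         '8mMetric_Mean'n  = Avg_met_8m
         '9mMetric_Mean'n  = Avg_met_9m
         '10mMetric_Mean'n = Avg_met_10m
         '11mMetric_Mean'n = Avg_met_11m;
run;
```

The RENAME statement only changes names in the output data set, so the column order set by the earlier retain step should carry over unchanged.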