我有一个按用户评分的电影列表。
{"_id":59607,"title":"King Corn (2007)",
"genres":["Documentary"],
"ratings":[ {"userId":1860,"rating":3},
{"userId":9970,"rating":3.5},
{"userId":16929,"rating":1.5},
{"userId":23473,"rating":4},
{"userId":23733,"rating":4},
{"userId":27584,"rating":3},
{"userId":28232,"rating":4},
{"userId":29482,"rating":3},
{"userId":40976,"rating":5},
{"userId":44631,"rating":4},
{"userId":47613,"rating":3},
{"userId":49763,"rating":3},
{"userId":58160,"rating":4.5},
{"userId":62249,"rating":3},
{"userId":65923,"rating":4},
{"userId":67507,"rating":4},
{"userId":68259,"rating":3.5},
{"userId":70331,"rating":5},
{"userId":71420,"rating":3.5}
]
}
我需要计算每个用户完成的评分数量。这是我试图进入收视率。
a = load '/movies_1m.json' using JsonLoader('id:int, title : chararray, genres : { ( genre : chararray ) }, ratings: { ( userId : int, rating: float) } ');
然后
b = FOREACH a GENERATE FLATTEN(ratings);
描述给我以下:
b: {ratings::userId: int,ratings::rating: float}
只是为了计算我需要访问评级内部的用户。但这是它没有取得成功的关键。我试过这个:
c = FOREACH b GENERATE COUNT(ratings);
它让我错了。
我需要得到这样的东西:
{userId: int, rating: float}
你需要GROUP
以便COUNT
,因为这是一个集合操作。
b = FOREACH a GENERATE FLATTEN(ratings);
gr = GROUP b by ratings::userId;
c = FOREACH gr GENERATE group,COUNT($1);
\d c
产量
请注意,示例中没有任何用户重复,因此这些都是一个。
(1860,1)
(9970,1)
(16929,1)
(23473,1)
(23733,1)
(27584,1)
(28232,1)
(29482,1)
(40976,1)
(44631,1)
(47613,1)
(49763,1)
(58160,1)
(62249,1)
(65923,1)
(67507,1)
(68259,1)
(70331,1)
(71420,1)