I plan to run the Apriori algorithm on a dataset to find associations between product purchases. The dataset looks like this:
order_id <- c('AG-2011-2040','IN-2011-47883', 'HU-2011-1220','IT-2011-3647632','IN-2011-47883','IN-2011-47883','IN-2011-30733','CA-2011-115161','AO-2011-1390','ID-2011-56493' )
product_name <- c('Tenex Lockers, Blue','Acme Trimmer, High Speed', 'Tenex Box, Single Width', 'Enermax Note Cards, Premium','Eldon Light Bulb, Duo Pack','Eaton Computer Printout Paper, 8.5 x 11', 'Brother Personal Copier, Laser','Sauder Facets Collection Library, Sky Alder Finish', 'Fellowes Lockers, Wire Frame','Tenex Trays, Single Width')
df <- data.frame(order_id, product_name)
df
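In the data, the left column is the order ID and the right column is the product name. I want to join all products that share the same order ID into a single row. In other words, I want to aggregate the products by order ID. Since the products are not numeric, I am not sure how to do this.
Any ideas on how to solve this would be much appreciated. Alternatively, if you have a different idea for how to prepare this data for the Apriori algorithm, that would be great too!
Thanks in advance!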
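You can use summarise from the dplyr package (with its .by argument set to the order ID, available from dplyr 1.1.0) together with base R's paste to concatenate all products that share the same order ID. Here I chose a comma as the separator, but you can change it to anything you like: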
library(dplyr)
df %>%
  summarise(ProductName = paste(ProductName, collapse = ", "),
            .by = OrderID)
# OrderID ProductName
#1 EX-1 B, E, J, W
#2 EX-2 D, I
#3 EX-3 S, Y
#4 EX-4 K, V
#5 EX-5 C
#6 EX-6 F, G, H, N, O
#7 EX-7 L, T
#8 EX-8 A, M, P, Q, R, U
#9 EX-9 X
# or for older versions of dplyr:
df %>%
  group_by(OrderID) %>%
  summarise(ProductName = paste(ProductName, collapse = ", "))
Data:
df <- data.frame(OrderID = paste0("EX-", sample(1:9, 25, replace = TRUE)),
                 ProductName = LETTERS[1:25])
df <- df[order(df$OrderID),]
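For the column names used in the question (order_id and product_name), the same idea can be sketched in base R without any packages. This is only an illustration, not part of the answer above: the variable name collapsed and the "; " separator are my own choices. A semicolon is used because the product names themselves contain commas, so a comma separator would make the joined string ambiguous.

```r
# Sample data from the question.
order_id <- c('AG-2011-2040', 'IN-2011-47883', 'HU-2011-1220', 'IT-2011-3647632',
              'IN-2011-47883', 'IN-2011-47883', 'IN-2011-30733', 'CA-2011-115161',
              'AO-2011-1390', 'ID-2011-56493')
product_name <- c('Tenex Lockers, Blue', 'Acme Trimmer, High Speed',
                  'Tenex Box, Single Width', 'Enermax Note Cards, Premium',
                  'Eldon Light Bulb, Duo Pack',
                  'Eaton Computer Printout Paper, 8.5 x 11',
                  'Brother Personal Copier, Laser',
                  'Sauder Facets Collection Library, Sky Alder Finish',
                  'Fellowes Lockers, Wire Frame', 'Tenex Trays, Single Width')
df <- data.frame(order_id, product_name)

# Collapse all products of one order into a single string.
# Extra arguments after FUN (here collapse) are passed on to paste().
collapsed <- aggregate(product_name ~ order_id, data = df,
                       FUN = paste, collapse = "; ")
collapsed
```

The result has one row per distinct order ID, with the three products of IN-2011-47883 joined into one string.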
A solution based on split and map (possibly one of the fastest options for large datasets):
library(magrittr)
aggregater <- function(tibble) {
  tibble::tibble(
    order_id = tibble$order_id[1],
    product_name = stringr::str_flatten(tibble$product_name, '; ')
  )
}
order_id <- c('AG-2011-2040','IN-2011-47883', 'HU-2011-1220','IT-2011-3647632','IN-2011-47883','IN-2011-47883','IN-2011-30733','CA-2011-115161','AO-2011-1390','ID-2011-56493' )
product_name <- c('Tenex Lockers, Blue','Acme Trimmer, High Speed', 'Tenex Box, Single Width', 'Enermax Note Cards, Premium','Eldon Light Bulb, Duo Pack','Eaton Computer Printout Paper, 8.5 x 11', 'Brother Personal Copier, Laser','Sauder Facets Collection Library, Sky Alder Finish', 'Fellowes Lockers, Wire Frame','Tenex Trays, Single Width')
df <- data.frame(order_id, product_name)
df %>%
  split.data.frame(.$order_id) %>%
  purrr::map(aggregater) %>%
  dplyr::bind_rows()
# A tibble: 8 x 2
#   order_id       product_name
#   <chr>          <chr>
# 1 AG-2011-2040   Tenex Lockers, Blue
# 2 AO-2011-1390   Fellowes Lockers, Wire Frame
# 3 CA-2011-115161 Sauder Facets Collection Library, Sky Alder Finish
# 4 HU-2011-1220   Tenex Box, Single Width
# 5 ID-2011-56493  Tenex Trays, Single Width
# 6 IN-2011-30733  Brother Personal Copier, Laser
# 7 IN-2011-47883  Acme Trimmer, High Speed; Eldon Light Bulb, Duo Pack; Eaton Co~
# 8 IT-2011-36476~ Enermax Note Cards, Premium
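On the second part of the question (preparing the data for Apriori), pasting the products into one string may not be needed at all. A possible alternative, sketched below under my own assumptions, is to keep each order as a vector of items, which is the usual "basket" shape for association rule mining. The name baskets is hypothetical, and the commented-out lines assume the arules package is installed; the support and confidence thresholds there are placeholder values, not recommendations.

```r
# Sample data from the question.
df <- data.frame(
  order_id = c('AG-2011-2040', 'IN-2011-47883', 'HU-2011-1220', 'IT-2011-3647632',
               'IN-2011-47883', 'IN-2011-47883', 'IN-2011-30733', 'CA-2011-115161',
               'AO-2011-1390', 'ID-2011-56493'),
  product_name = c('Tenex Lockers, Blue', 'Acme Trimmer, High Speed',
                   'Tenex Box, Single Width', 'Enermax Note Cards, Premium',
                   'Eldon Light Bulb, Duo Pack',
                   'Eaton Computer Printout Paper, 8.5 x 11',
                   'Brother Personal Copier, Laser',
                   'Sauder Facets Collection Library, Sky Alder Finish',
                   'Fellowes Lockers, Wire Frame', 'Tenex Trays, Single Width')
)

# One character vector of products per order (a named list keyed by order ID).
baskets <- split(as.character(df$product_name), df$order_id)

# Drop products that appear more than once within the same order.
baskets <- lapply(baskets, unique)

# Number of distinct products in each order.
lengths(baskets)

# Assumption: with arules installed, a list like this can be coerced to a
# transactions object and passed to apriori(). Thresholds are placeholders.
# library(arules)
# trans <- as(baskets, "transactions")
# rules <- apriori(trans, parameter = list(support = 0.1, confidence = 0.5))
```

Keeping the items as separate elements also sidesteps the separator problem, since product names that contain commas never need to be split apart again.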