Data preparation for market basket analysis with the Apriori algorithm in R

Question (votes: 0, answers: 2)

I plan to run the Apriori algorithm on a dataset to find associations between product purchases. The dataset looks like this:

order_id <- c('AG-2011-2040','IN-2011-47883', 'HU-2011-1220','IT-2011-3647632','IN-2011-47883','IN-2011-47883','IN-2011-30733','CA-2011-115161','AO-2011-1390','ID-2011-56493'  )
    
product_name <- c('Tenex Lockers, Blue','Acme Trimmer, High Speed', 'Tenex Box, Single Width', 'Enermax Note Cards, Premium','Eldon Light Bulb, Duo Pack','Eaton Computer Printout Paper, 8.5 x 11', 'Brother Personal Copier, Laser','Sauder Facets Collection Library, Sky Alder Finish', 'Fellowes Lockers, Wire Frame','Tenex Trays, Single Width')
        
df <- data.frame(order_id, product_name)
df 

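In the data, the order ID is on the left and the product name is on the right. I want to concatenate all products that share the same ID into a single row. In other words, I want to aggregate all products by order ID. Because the products are not numeric, the usual numeric aggregation functions do not apply.

Any ideas on how to solve this would be greatly appreciated. Alternatively, if you have a different idea of how to prepare this data for the Apriori algorithm, that would be great too!

Thanks in advance!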

r apriori string-aggregation market-basket-analysis
2 Answers
0
votes

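You can use summarise() from the dplyr package (grouping on the order ID through its .by argument) together with base R's paste() to concatenate all products that share an order ID. Here I chose a comma as the separator, but you can change it to anything you like: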

library(dplyr)
df %>% 
  summarise(ProductName = 
              paste(ProductName, collapse = ", "), 
            .by = OrderID)

#  OrderID      ProductName
#1    EX-1       B, E, J, W
#2    EX-2             D, I
#3    EX-3             S, Y
#4    EX-4             K, V
#5    EX-5                C
#6    EX-6    F, G, H, N, O
#7    EX-7             L, T
#8    EX-8 A, M, P, Q, R, U
#9    EX-9                X

# or, for dplyr versions before 1.1.0 (which lack the .by argument):
df %>% 
  group_by(OrderID) %>%
  summarise(ProductName = 
              paste(ProductName, collapse=", "))

Data:

df <- data.frame(OrderID = paste0("EX-", sample(1:9, 25, replace = TRUE)),
                 ProductName = LETTERS[1:25])
df <- df[order(df$OrderID),]
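
The product names in the question themselves contain commas (for example Tenex Lockers, Blue), so a comma separator would make the joined string ambiguous. As a minimal sketch of the same idea without dplyr, base R's aggregate() can do the concatenation on the question's data. The "; " separator and the name combined are choices made for this illustration, not part of the answer above:

```r
# Data from the question (long format: one row per order line).
order_id <- c("AG-2011-2040", "IN-2011-47883", "HU-2011-1220",
              "IT-2011-3647632", "IN-2011-47883", "IN-2011-47883",
              "IN-2011-30733", "CA-2011-115161", "AO-2011-1390",
              "ID-2011-56493")
product_name <- c("Tenex Lockers, Blue", "Acme Trimmer, High Speed",
                  "Tenex Box, Single Width", "Enermax Note Cards, Premium",
                  "Eldon Light Bulb, Duo Pack",
                  "Eaton Computer Printout Paper, 8.5 x 11",
                  "Brother Personal Copier, Laser",
                  "Sauder Facets Collection Library, Sky Alder Finish",
                  "Fellowes Lockers, Wire Frame", "Tenex Trays, Single Width")
df <- data.frame(order_id, product_name)

# One row per order; products are joined with "; " because the names
# already contain commas. `collapse` is passed through to paste().
combined <- aggregate(product_name ~ order_id, data = df,
                      FUN = paste, collapse = "; ")
print(combined)
```

The result has one row per distinct order ID, sorted by order ID, with the three products of order IN-2011-47883 joined into one string.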

0
votes

A split-map-based solution (possibly one of the faster options for large datasets):

library(magrittr)
aggregater <- function(tibble) {
  tibble::tibble(
    order_id = tibble$order_id[1],
    product_name = stringr::str_flatten(tibble$product_name, '; ')
  )
}


order_id <- c('AG-2011-2040','IN-2011-47883', 'HU-2011-1220','IT-2011-3647632','IN-2011-47883','IN-2011-47883','IN-2011-30733','CA-2011-115161','AO-2011-1390','ID-2011-56493'  )
    
product_name <- c('Tenex Lockers, Blue','Acme Trimmer, High Speed', 'Tenex Box, Single Width', 'Enermax Note Cards, Premium','Eldon Light Bulb, Duo Pack','Eaton Computer Printout Paper, 8.5 x 11', 'Brother Personal Copier, Laser','Sauder Facets Collection Library, Sky Alder Finish', 'Fellowes Lockers, Wire Frame','Tenex Trays, Single Width')
        
df <- data.frame(order_id, product_name)

df %>%
  split.data.frame(.$order_id) %>%
  purrr::map(aggregater) %>%
  dplyr::bind_rows()


# A tibble: 8 x 2
  order_id       product_name                                                          
  <chr>          <chr>                                                          
1 AG-2011-2040   Tenex Lockers, Blue                                            
2 AO-2011-1390   Fellowes Lockers, Wire Frame                                   
3 CA-2011-115161 Sauder Facets Collection Library, Sky Alder Finish             
4 HU-2011-1220   Tenex Box, Single Width                                        
5 ID-2011-56493  Tenex Trays, Single Width                                      
6 IN-2011-30733  Brother Personal Copier, Laser                                 
7 IN-2011-47883  Acme Trimmer, High Speed; Eldon Light Bulb, Duo Pack; Eaton Co~
8 IT-2011-36476~ Enermax Note Cards, Premium 
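
On the question's second point (other ways to prepare the data for Apriori): a joined string is probably not what the algorithm needs as input. The sketch below assumes the arules package is the intended implementation, which the thread does not state. It builds one item vector per order with base R's split() and, only if arules is installed, coerces that list to a transactions object and mines rules. The names baskets, trans and rules, and the support and confidence thresholds, are arbitrary choices for illustration:

```r
# Data from the question (long format: one row per order line).
df <- data.frame(
  order_id = c("AG-2011-2040", "IN-2011-47883", "HU-2011-1220",
               "IT-2011-3647632", "IN-2011-47883", "IN-2011-47883",
               "IN-2011-30733", "CA-2011-115161", "AO-2011-1390",
               "ID-2011-56493"),
  product_name = c("Tenex Lockers, Blue", "Acme Trimmer, High Speed",
                   "Tenex Box, Single Width", "Enermax Note Cards, Premium",
                   "Eldon Light Bulb, Duo Pack",
                   "Eaton Computer Printout Paper, 8.5 x 11",
                   "Brother Personal Copier, Laser",
                   "Sauder Facets Collection Library, Sky Alder Finish",
                   "Fellowes Lockers, Wire Frame", "Tenex Trays, Single Width")
)

# One character vector of products per order. unique() drops repeated
# products within an order, since a basket is a set of items.
baskets <- lapply(split(df$product_name, df$order_id), unique)
print(lengths(baskets))

# Optional step, run only when arules is available (an assumption here).
if (requireNamespace("arules", quietly = TRUE)) {
  suppressPackageStartupMessages(library(arules))
  trans <- as(baskets, "transactions")   # list of item vectors -> transactions
  # Thresholds below are placeholders; tune them for real data.
  rules <- apriori(trans,
                   parameter = list(supp = 0.1, conf = 0.5),
                   control = list(verbose = FALSE))
  inspect(rules)
}
```

With only eight orders, any rules found are not meaningful; the point is the shape of the input, a named list with one element per order.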