Creating a transition matrix from Wikipedia Clickstream data

Question

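I am trying to create a transition matrix from the Wikipedia Clickstream dataset. With it, I want to show the probability that a user moves from one Wikipedia article to another.

I have a data frame with three columns. source.category holds the title of the source article and target.category holds the title of the target article. The third column, total, is the number of clicks (i.e. the number of times users moved from that source article to that target article).

From this, I want to compute the transition probability from a source article to a target article, given the click counts.

Here is a summary of my data frame: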

source.category    target.category        total      
 Length:98          Length:98          Min.   :   21  
 Class :character   Class :character   1st Qu.:  684  
 Mode  :character   Mode  :character   Median : 2132  
                                       Mean   : 5395  
                                       3rd Qu.: 5296  
                                       Max.   :53378

Would the best approach be to write a function?

trans.matrix <- function(...)

What would this function look like?

And would I then apply it like this: trans.matrix(as.matrix(df))?

Answer 1

I would use the reshape2 package for this. I created a minimal dataset to illustrate:

set.seed(42)

dataset <- expand.grid(letters[1:4], LETTERS[1:4])
dataset$total <- rpois(16, 1)

names(dataset) <- c("source.category", "target.category", "total")
# set the last row to the first row to illustrate fill and aggregate
dataset[16, ] <- dataset[1, ]

Then simply use the acast function to create the matrix, and finally normalize the row sums to 1.

library(reshape2)

# reshape to wide format
res <- acast(
  dataset, # the dataset
  source.category ~ target.category, # the margins of the result
  value.var = "total", # which variable should be in the cells
  fill=0L, # fill empty cells with this value
  fun.aggregate = sum # aggregate double cells with this function
  )

# normalize rowSums to 1
res <- res / rowSums(res)

# this is your result
res
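
If you would rather have something shaped like the trans.matrix function from the question, here is a minimal base R sketch of the same two steps (sum the counts per source/target pair, then divide each row by its row sum). It swaps acast for xtabs and prop.table, so treat it as an alternative and not as the reshape2 approach itself. The function body and the toy clicks data frame are made up for illustration.

```r
# Hypothetical helper: the name comes from the question, the body is an assumption.
trans.matrix <- function(df) {
  # Sum clicks for every source/target pair; pairs that never occur become 0
  counts <- xtabs(total ~ source.category + target.category, data = df)
  # Divide each row by its row sum so that every row sums to 1
  probs <- prop.table(counts, margin = 1)
  # Return a plain numeric matrix that keeps the article names as dimnames
  matrix(probs, nrow = nrow(probs), dimnames = dimnames(probs))
}

# Toy data, invented for this example (note the duplicated b -> B pair)
clicks <- data.frame(
  source.category = c("a", "a", "b", "b", "b"),
  target.category = c("A", "B", "A", "B", "B"),
  total           = c(3, 1, 2, 1, 1),
  stringsAsFactors = FALSE
)

P <- trans.matrix(clicks)
P["a", "A"]   # 3 / (3 + 1) = 0.75
rowSums(P)    # every row sums to 1
```

Note that it takes the data frame directly, so there is no need for as.matrix(df), which would turn the counts into character strings.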

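Edit: On larger datasets this will take forever or even fail. For larger datasets, use a sparse matrix from the Matrix package instead, which is faster and gives a much smaller stored result.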

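library(Matrix)

dataset$source.category <- factor(dataset$source.category)
dataset$target.category <- factor(dataset$target.category)

# rows are sources, columns are targets; duplicated pairs are summed
res <- sparseMatrix(
  i = as.integer(dataset$source.category),
  j = as.integer(dataset$target.category),
  x = dataset$total,
  dimnames = list(levels(dataset$source.category), levels(dataset$target.category))
)

# normalize rowSums to 1
res <- res / rowSums(res)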

This is fast enough to work interactively on the entire dataset.
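
One wrinkle with the sparse route: if a source has only zero counts, dividing by its row sum gives NaN for that row. A minimal sketch of one way around that is to scale the rows with a diagonal matrix and use a factor of 0 for empty rows. The clicks data frame and the variable names here are made up for illustration, and leaving an empty row at all zeros is my assumption about what you would want.

```r
library(Matrix)

# Toy data, invented for this example (note the duplicated b -> B pair)
clicks <- data.frame(
  source.category = factor(c("a", "a", "b", "b", "b")),
  target.category = factor(c("A", "B", "A", "B", "B")),
  total           = c(3, 1, 2, 1, 1)
)

# Rows are sources, columns are targets; duplicated pairs are summed
counts <- sparseMatrix(
  i = as.integer(clicks$source.category),
  j = as.integer(clicks$target.category),
  x = clicks$total,
  dimnames = list(levels(clicks$source.category),
                  levels(clicks$target.category))
)

# Per-row scaling factor: 1 / row sum, or 0 for a row with no clicks
rs <- rowSums(counts)
row_scale <- ifelse(rs > 0, 1 / rs, 0)

# Left-multiplying by a diagonal matrix rescales each row
P <- Diagonal(x = row_scale) %*% counts
dimnames(P) <- dimnames(counts)

P["b", "B"]   # (1 + 1) / (2 + 1 + 1) = 0.5
```

Rows of P can then be looked up by article title, e.g. P["a", ] for all transitions out of a.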
