Extracting label membership/classification from a cut dendrogram in R (i.e., a cutree function for dendrograms)

Question

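I'm trying to extract a classification from a dendrogram in R that I've `cut` at a certain height. This is easy to do with `cutree` on an `hclust` object, but I can't figure out how to do it on a `dendrogram` object.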

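Further, I can't just use the clusters from the original hclust, because (frustratingly) the class numbering from `cutree` differs from the class numbering from `cut`.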

hc <- hclust(dist(USArrests), "ave")

classification<-cutree(hc,h=70)

dend1 <- as.dendrogram(hc)
dend2 <- cut(dend1, h = 70)


str(dend2$lower[[1]]) #group 1 here is not the same as
classification[classification==1] #group 1 here

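Is there a way to either make the classifications map onto each other, or extract the lower-branch memberships from the `dendrogram` object (perhaps with some clever use of `dendrapply`?) in a format more like what `cutree` gives?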

Answer 1

I suggest using the `cutree` function from the dendextend package. It includes a dendrogram method (i.e., `dendextend:::cutree.dendrogram`).

You can learn more about the package from its introductory vignette.

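I should add that while your function (`classify`) is good, using `cutree` from `dendextend` has several advantages:

1. It also lets you use a specific `k` (number of clusters), not just `h` (a specific height).
2. It is consistent with the result you get from `cutree` on an hclust object (`classify` is not).
3. It will often be faster.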

Here is an example of using the code:

# Toy data:
hc <- hclust(dist(USArrests), "ave")
dend1 <- as.dendrogram(hc)

# Get the package:
install.packages("dendextend")
library(dendextend)

# Cut the tree:
cutree(dend1,h=70) # it now works on a dendrogram
# It is like using:
dendextend:::cutree.dendrogram(dend1,h=70)

By the way, on top of this function, dendextend lets the user do more cool things, such as coloring branches/labels based on a cut of the dendrogram:

dend1 <- color_branches(dend1, k = 4)
dend1 <- color_labels(dend1, k = 5)
plot(dend1)


Lastly, here is some more code to demonstrate my other points:

# This would also work with k:
cutree(dend1,k=4)

# and would give identical result as cutree on hclust:
identical(cutree(hc,h=70)  , cutree(dend1,h=70)  )
   # TRUE

# But this is not the case for classify:
identical(classify(dend1,70)   , cutree(dend1,h=70)  )
   # FALSE


install.packages("microbenchmark")
require(microbenchmark)
microbenchmark(classify = classify(dend1,70),
               cutree = cutree(dend1,h=70)  )
#    Unit: milliseconds
#        expr      min       lq   median       uq       max neval
#    classify  9.70135  9.94604 10.25400 10.87552  80.82032   100
#      cutree 37.24264 37.97642 39.23095 43.21233 141.13880   100
# 4 times faster for this tree (it will be more for larger trees)

# To be exact about it: if I force cutree.dendrogram not to go through hclust (which can happen for "weird" trees), the speeds are similar:
microbenchmark(classify = classify(dend1,70),
               cutree = cutree(dend1,h=70, try_cutree_hclust = FALSE)  )
# Unit: milliseconds
#        expr       min        lq    median       uq      max neval
#    classify  9.683433  9.819776  9.972077 10.48497 29.73285   100
#      cutree 10.275839 10.419181 10.540126 10.66863 16.54034   100

If you are thinking about ways to improve this function, please patch it through here:

https://github.com/talgalili/dendextend/blob/master/R/cutre.dendrogram.R

I hope you or others will find this answer helpful.

Answer 2

I ended up creating a function to do this using `dendrapply`. It's not elegant, but it works:

classify <- function(dendrogram,height){

#mini-function to use with dendrapply to return tip labels
 members <- function(n) {
    labels<-c()
    if (is.leaf(n)) {
        a <- attributes(n)
        labels<-c(labels,a$label)
    }
    labels
 }

 dend2 <- cut(dendrogram,height) #the cut dendrogram object
 branchesvector<-c()
 membersvector<-c()

 for(i in 1:length(dend2$lower)){                             #for each lower tree resulting from the cut
  memlist <- unlist(dendrapply(dend2$lower[[i]],members))     #get the tip labels
  branchesvector <- c(branchesvector,rep(i,length(memlist)))  #add the lower tree identifier to a vector
  membersvector <- c(membersvector,memlist)                   #add the tip labels to a vector
 }
out<-as.integer(branchesvector)                               #make the output a vector of named integers, to match cutree() output
names(out)<-membersvector
out
}

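Using the function makes the cause of the problem clear: cut numbers the lower subtrees from left to right across the dendrogram, while cutree numbers the clusters in the order in which their first member appears in the original data (alphabetical by state for USArrests).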

hc <- hclust(dist(USArrests), "ave")
dend1 <- as.dendrogram(hc)

classify(dend1,70) #Florida 1, North Carolina 1, etc.
cutree(hc,h=70)    #Alabama 1, Arizona 1, Arkansas 1, etc.
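The two numberings can also be related directly, without `classify`. The following is a minimal base-R sketch, assuming the same USArrests tree and cut height as above; the name `cut_to_cutree` is made up for illustration. For each lower subtree returned by `cut()`, it looks up the `cutree()` id shared by that subtree's leaves.

```r
# Same toy tree as in the question
hc    <- hclust(dist(USArrests), "ave")
dend1 <- as.dendrogram(hc)

dend2 <- cut(dend1, h = 70)   # list with $upper and $lower
cl    <- cutree(hc, h = 70)   # named integer vector, one entry per state

# For every lower subtree (indexed as cut() numbers them), collect the
# cutree() ids of its leaves; a subtree should map to exactly one id.
cut_to_cutree <- sapply(dend2$lower, function(subtree) {
  ids <- unique(cl[labels(subtree)])
  stopifnot(length(ids) == 1L)
  ids
})

# Element i is the cutree() cluster id of dend2$lower[[i]]
cut_to_cutree
```

Indexing with this vector translates a `cut()` subtree index into the matching `cutree()` cluster id, so either numbering can be used as the key.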

Answer 3

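After making the dendrogram, use the cutree method and then convert the result to a data frame. The following code uses the dendextend library to make a nice dendrogram: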

library(dendextend)

# set the number of clusters
clust_k <- 8

# make the hierarchical clustering
par(mar = c(2.5, 0.5, 1.0, 7))
d <- dist(mat, method = "euclidean")
hc <- hclust(d)
dend <- d %>% hclust %>% as.dendrogram
labels_cex(dend) <- .65
dend %>% 
  color_branches(k=clust_k) %>%
  color_labels() %>%
  highlight_branches_lwd(3) %>% 
  plot(horiz=TRUE, main = "Branch (Distribution) Clusters by Heloc Attributes", axes = T)

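Based on the coloring scheme, it looks like the clusters form around a threshold of 4. So, to get the assignments into a data frame, we need to get the clusters and then `unlist()` them.

First you need to get the clusters themselves. The result is just a single vector of numbers, whose names are the actual labels.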

# creates a single item vector of the clusters    
myclusters <- cutree(dend, k=clust_k, h=4)

# make the dataframe of two columns cluster number and label
clusterDF <-  data.frame(Cluster = as.numeric(unlist(myclusters)),
                         Branch = names(myclusters))

# sort by cluster ascending (arrange() comes from dplyr)
library(dplyr)
clusterDFSort <- clusterDF %>% arrange(Cluster)
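Since `mat` is not defined in the snippet above, here is a self-contained base-R sketch of the same idea. It swaps in the built-in USArrests data and `k = 4` purely as assumptions for illustration, and it calls `stats::cutree` on the hclust object rather than the dendextend method. For `stats::cutree`, `k` overrides `h` when both are given, so only `k` is passed here.

```r
# Assumed stand-in data: USArrests instead of the undefined `mat`
hc <- hclust(dist(USArrests, method = "euclidean"))

# Named integer vector: names are the labels, values are cluster ids
clusters <- cutree(hc, k = 4)

# Two-column data frame: cluster number and label
cluster_df <- data.frame(Cluster = unname(clusters),
                         Branch  = names(clusters))

# Sort by cluster, ascending, without needing dplyr
cluster_df_sorted <- cluster_df[order(cluster_df$Cluster), ]
head(cluster_df_sorted)
```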

Answer 4

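The accepted answer, which uses the `dendextend` package's extension of the base `cutree` function to dendrogram objects, is a quick and easy way to "make the classifications map onto each other", as the OP asked.

The OP's other suggestion, to "extract the lower-branch memberships from the dendrogram object", would also solve their problem. I needed to get this approach working because I use the `dend2$lower` sub-dendrograms in my analysis, so I need my membership key to match the `cut()` indices specifically.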

# Use the dendextend package for necessity and the data.table package for convenience
library(data.table)
library(dendextend)

# start with OP's dend2 object created using cut(dend1)

# pull out the "labels" under each cluster using get_nodes_attr from the dendextend package
clust.key <- lapply(X = dend2$lower, FUN = get_nodes_attr, attribute = "label", include_branches = F)

# reformat into a table of cluster membership using rbindlist from the data.table package
clust.key <- rbindlist(lapply(X = 1:length(clust.key), FUN = function(e){data.table(cluster = e, label = clust.key[[e]])}))

# remove NA's that correspond to internal node labels instead of leaf labels (this is data.table syntax)
clust.key <- clust.key[!is.na(label)]

> head(clust.key)
   cluster          label
     <int>         <char>
1:       1        Florida
2:       1 North Carolina
3:       2     California
4:       2       Maryland
5:       2        Arizona
6:       2     New Mexico

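I'll point out that if you take this approach, you can still use dendextend's nice plotting to color the clusters through the `branches_attr_by_labels()` function, which also allows more complex operations such as coloring only a subset of the clusters, for example: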

branches_attr_by_labels(dend = dend1, labels = clust.key[cluster == 2, label], TF_values = "blue", attr = "col") %>%
  branches_attr_by_labels(labels = clust.key[cluster == 4, label], TF_values = "orange", attr = "col") %>%
  plot()
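If adding data.table is not an option, a similar membership key can probably be built with base R alone. The sketch below is one possible variant, not the method used above: it assumes the OP's USArrests tree cut at `h = 70`, and `clust_key` is a made-up name. Because `labels()` on a dendrogram returns only leaf labels, no `NA` filtering step is needed.

```r
# Rebuild the OP's cut dendrogram
hc    <- hclust(dist(USArrests), "ave")
dend2 <- cut(as.dendrogram(hc), h = 70)

# Leaf labels of each lower subtree, in cut() order
leaf_labels <- lapply(dend2$lower, labels)

# One row per leaf; `cluster` is the index into dend2$lower
clust_key <- data.frame(
  cluster = rep(seq_along(leaf_labels), lengths(leaf_labels)),
  label   = unlist(leaf_labels)
)
head(clust_key)
```

The `cluster` column here indexes `dend2$lower`, so `dend2$lower[[clust_key$cluster[i]]]` is the sub-dendrogram that contains the label in row `i`.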
