I am trying to retrieve the most repeated value in a particular column of a data frame. My sample data and code are below.
data("Forbes2000", package = "HSAUR")
head(Forbes2000)
  rank                name        country             category  sales profits  assets marketvalue
1    1           Citigroup  United States              Banking  94.71   17.85 1264.03      255.30
2    2    General Electric  United States        Conglomerates 134.19   15.59  626.93      328.54
3    3 American Intl Group  United States            Insurance  76.66    6.46  647.66      194.87
4    4          ExxonMobil  United States Oil & gas operations 222.88   20.96  166.99      277.02
5    5                  BP United Kingdom Oil & gas operations 232.57   10.27  177.57      173.54
6    6     Bank of America  United States              Banking  49.01   10.81  736.45      117.55
Based on my sample data, I need to return the most repeated category, i.e. Insurance.
subset(Forbes2000, country == "Bermuda")
tail(names(sort(table(Forbes2000$category))), 1)
If two or more categories may be tied for most frequent, use something like this:
x <- c("Insurance", "Insurance", "Capital Goods", "Food markets", "Food markets")
tt <- table(x)
names(tt[tt==max(tt)])
[1] "Food markets" "Insurance"
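The tie-aware lookup above can be wrapped in a small reusable function. This is only a sketch: `most_frequent` and its `na.rm` argument are hypothetical names chosen for illustration, not part of base R.

```r
# Hypothetical helper: return every value that shares the highest count.
most_frequent <- function(values, na.rm = TRUE) {
  # Count occurrences; optionally treat NA as a category of its own
  counts <- table(values, useNA = if (na.rm) "no" else "ifany")
  # Nothing to count: return an empty character vector instead of warning
  if (length(counts) == 0) return(character(0))
  # Keep the names of all entries whose count equals the maximum
  names(counts)[counts == max(counts)]
}

x <- c("Insurance", "Insurance", "Capital Goods", "Food markets", "Food markets")
most_frequent(x)
```

Because it returns all tied values, the result may have length greater than one, so callers that need a single value have to pick one explicitly.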
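Another approach uses the data.table package, which is faster for large datasets:
library(data.table)
set.seed(1)
x <- sample(seq(1, 100), 5000000, replace = TRUE)
Method 1 (the solution proposed above)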
start.time <- Sys.time()
tt <- table(x)
names(tt[tt==max(tt)])
end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken
Time difference of 4.883488 secs
Method 2 (data.table)
start.time <- Sys.time()
ds <- data.table( x )
setkey(ds, x)
sorted <- ds[,.N,by=list(x)]
most_repeated_value <- sorted[order(-N)]$x[1]
most_repeated_value
end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken
Time difference of 0.328033 secs
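A single Sys.time() difference can be noisy, so it may help to time each approach with base R's system.time() and to check that the results agree. The sketch below does that for Method 1 and for a base R alternative built on unique(), match() and tabulate(). The helper names `mode_table` and `mode_tabulate` are hypothetical, the vector is smaller than the one used above, and no particular speed-up is claimed.

```r
set.seed(1)
x <- sample(seq(1, 100), 500000, replace = TRUE)  # smaller than the 5,000,000 above

# Method 1 from above, wrapped in a function (hypothetical name)
mode_table <- function(v) {
  tt <- table(v)
  names(tt[tt == max(tt)])
}

# Hypothetical base R alternative: count via integer codes instead of table()
mode_tabulate <- function(v) {
  u <- unique(v)                                      # distinct values
  counts <- tabulate(match(v, u), nbins = length(u))  # occurrences of each distinct value
  u[counts == max(counts)]                            # all values tied for the maximum
}

time_table <- system.time(res_table <- mode_table(x))
time_tabulate <- system.time(res_tabulate <- mode_tabulate(x))

print(time_table["elapsed"])
print(time_tabulate["elapsed"])

# table() returns names as character strings, so align types before comparing
stopifnot(setequal(res_table, as.character(res_tabulate)))
```

Note that `mode_tabulate` keeps the original type of the values, whereas the table()-based version always returns character strings.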
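You can use table(Forbes2000$category, useNA="ifany"). This gives you a list of every value that occurs in the chosen column, along with the number of times each value appears in that data frame.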
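To illustrate what useNA changes, here is a minimal sketch on a made-up vector (the data is invented for illustration). By default table() drops missing values, while useNA = "ifany" counts them as a category of their own.

```r
x <- c("Insurance", NA, NA, NA, "Banking", "Insurance")

# Default behaviour: NA values are silently excluded from the counts
counts_default <- table(x)
print(counts_default)

# With useNA = "ifany", NA gets its own entry when at least one is present
counts_with_na <- table(x, useNA = "ifany")
print(counts_with_na)
```

This matters when looking for the most frequent value: in this vector NA is the most common entry, which the default call would never report.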
I know my answer is a bit late, but I wrote the following function, which does the job in under a second for data frames with more than 50,000 rows:
print_count_of_unique_values <- function(df, column_name,
                                         remove_items_with_freq_equal_or_lower_than = 0,
                                         return_df = FALSE, sort_desc = TRUE,
                                         return_most_frequent_value = FALSE)
{
  # Count how often each distinct value occurs in the requested column
  temp <- df[[column_name]]
  output <- as.data.frame(table(temp))
  names(output) <- c("Item", "Frequency")
  # Drop items at or below the frequency threshold
  output_df <- output[output[[2]] > remove_items_with_freq_equal_or_lower_than, ]
  if (sort_desc) {
    output_df <- output_df[order(output_df[[2]], decreasing = TRUE), ]
  }
  cat("\nThis is the (head) count of the unique values in dataframe column '",
      column_name, "':\n", sep = "")
  print(head(output_df))
  if (return_df) {
    return(output_df)
  }
  if (return_most_frequent_value) {
    output_df$Item <- as.character(output_df$Item)
    output_df$Frequency <- as.numeric(output_df$Frequency)
    # The first row is the most frequent item only when sort_desc = TRUE
    most_freq_item <- output_df[1, "Item"]
    cat("\nReturning most frequent item: ", most_freq_item)
    return(most_freq_item)
  }
}
So, if you have a data frame called "df" with a column called "name", and you want to know the most common value in the "name" column, you can run:
most_common_name <- print_count_of_unique_values(df = df, column_name = "name", return_most_frequent_value = TRUE)