data("pbc2.id", package = "JM") # Mayo Clinic Primary Biliary Cirrhosis Data
df <- pbc2.id
vars_num1 <- c("years", "age", "serBilir", "serChol", "albumin",
"alkaline", "SGOT", "platelets", "prothrombin", "histologic",
"status2")
cor(df[vars_num1], use = "complete.obs", method="pearson") # years vs age: -0.17719866
cor(df$years, df$age, use = "complete.obs", method="pearson") # -0.1631033
Other column pairs do give consistent results between the two calls, for example serBilir vs serChol (0.39675890). I also coded the calculation myself to check it:
v <- function(x,y=x) mean(x*y) - mean(x)*mean(y)
my_corr <- function(x,y) v(x,y) / sqrt(v(x) * v(y))
my_corr(df$years, df$age) # -0.1631033
Why does cor(df[vars_num1], use = "complete.obs", method="pearson") give a different result?
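I think the problem comes from your NA values. With use = "complete.obs", the matrix call drops every row that has an NA in any of the columns in vars_num1, whereas the two-vector call only drops rows where years or age is NA, so it keeps more rows than the first call. Apply na.omit first and you will see both calls give the same result: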
data("pbc2.id", package = "JM") # Mayo Clinic Primary Biliary Cirrhosis Data
df <- pbc2.id
vars_num1 <- c("years", "age", "serBilir", "serChol", "albumin",
"alkaline", "SGOT", "platelets", "prothrombin", "histologic",
"status2")
df = na.omit(df)
cor(df[vars_num1], use = "complete.obs", method="pearson") # years vs age: -0.17719866
cor(df$years, df$age, use = "complete.obs", method="pearson") # -0.17719866
df[vars_num1]
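To see the row-dropping effect without the JM data, here is a minimal sketch on a made-up data frame (the name toy and its values are purely illustrative, not taken from pbc2.id). It compares the matrix call, the two-vector call, and use = "pairwise.complete.obs", which handles each pair of columns separately:

```r
# Hypothetical toy data: x and y are complete, z has two missing values.
toy <- data.frame(
  x = c(1, 2, 3, 4, 5, 6),
  y = c(2, 1, 4, 3, 6, 8),
  z = c(1, NA, 3, NA, 5, 6)
)

# Matrix call with "complete.obs": rows 2 and 4 are removed for every pair
# of columns, including x vs y, because z is NA in those rows.
m_complete <- cor(toy, use = "complete.obs", method = "pearson")

# Two-vector call: only NAs in x or y count, so all six rows are used.
r_xy <- cor(toy$x, toy$y, use = "complete.obs", method = "pearson")

# "pairwise.complete.obs" drops rows per pair, so its x vs y entry
# is computed on the same rows as the two-vector call.
m_pairwise <- cor(toy, use = "pairwise.complete.obs", method = "pearson")

# Number of rows that survive listwise deletion across all columns.
n_complete <- sum(complete.cases(toy))

print(m_complete["x", "y"])
print(r_xy)
print(m_pairwise["x", "y"])
```

If the goal is for the matrix entries to match the two-vector results, "pairwise.complete.obs" is one option, at the cost that different entries of the matrix are then based on different subsets of rows.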