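Suppose I have two data frames that I want to join. However, the join ends up as a many-to-many relationship, which I do not want. If there are multiple matches from y to x, I want to take further columns into account (in this case, country). A solution that extends to more "condition" columns would be nice. The way I picture it, this would be a left_join with several columns passed in the "by" argument, where the first column acts as the "necessary condition" and the remaining columns act as additional evidence for narrowing the matches down to the correct row.
If there is only one possible match, the further conditions should not be checked. If the further column (here, country) is NA, that condition should be ignored.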
library(dplyr)  # provides left_join(), group_by(), filter(), distinct() and %>%

# Create the Questions DataFrame
questions_df <- data.frame(
  question_id = c(1, 2, 3, 4, 5),
  title = c("How to use Python?", "What is SQL?", "Django tutorial", "Data science with R", "Stackoverflow"),
  tag = c("python", "sql", "django", "r", "python"),
  country = c("NM", "TSE", "FR", "Z", "ZAF")
)
# Create the Tags DataFrame
tags_df <- data.frame(
  tag = c("python", "python", "sql", "django", "django", "r", "r"),
  expert = c("Expert A", "Expert B", "Expert C", "Expert D", "Expert E", "Expert F", "Expert G"),
  country = c("TGV", "NM", "TSE", "FR", "Z", "ZAF", NA)
)
# Perform a left join to illustrate the many-to-many relationship
result <- left_join(questions_df, tags_df, by = "tag")
# Perform the join based on the primary `tag` condition
result <- left_join(questions_df, tags_df, by = "tag") %>%
  group_by(question_id) %>%
  # If there are multiple matches, filter by `country`
  filter(n() == 1 | (country.x == country.y | is.na(country.y))) %>%
  ungroup() %>%
  distinct()
result
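To make the idea of a "necessary" key plus optional "evidence" columns more concrete, here is a rough sketch of how the filter above might be generalised to several condition columns. The helper `join_with_tiebreakers()` and the columns `.row_id` and `.agree` are hypothetical names made up for this illustration; they are not part of dplyr. The sketch assumes that every tiebreaker column exists in both data frames (so the join produces `.x`/`.y` suffixed copies). It also assumes one behaviour that the attempt above does not have: when a question has several candidates and none of them satisfies a condition, that condition is skipped instead of the question being dropped. With dplyr 1.1.0 or later the join emits a many-to-many warning, which is expected here.

```r
library(dplyr)

# Same sample data as above; stringsAsFactors = FALSE keeps the text columns
# as character vectors on older R versions too.
questions_df <- data.frame(
  question_id = c(1, 2, 3, 4, 5),
  title = c("How to use Python?", "What is SQL?", "Django tutorial",
            "Data science with R", "Stackoverflow"),
  tag = c("python", "sql", "django", "r", "python"),
  country = c("NM", "TSE", "FR", "Z", "ZAF"),
  stringsAsFactors = FALSE
)
tags_df <- data.frame(
  tag = c("python", "python", "sql", "django", "django", "r", "r"),
  expert = c("Expert A", "Expert B", "Expert C", "Expert D",
             "Expert E", "Expert F", "Expert G"),
  country = c("TGV", "NM", "TSE", "FR", "Z", "ZAF", NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not a dplyr function): join on `key`, then use each
# column in `tiebreakers`, in order, to narrow down rows of `x` that still
# have more than one match.
join_with_tiebreakers <- function(x, y, key, tiebreakers = character()) {
  # Tag every row of x so its matches can be grouped together after the join
  x <- mutate(x, .row_id = row_number())
  out <- left_join(x, y, by = key, suffix = c(".x", ".y"))

  for (col in tiebreakers) {
    cx <- paste0(col, ".x")
    cy <- paste0(col, ".y")
    out <- out %>%
      group_by(.row_id) %>%
      # A candidate "agrees" if either side is NA or both values are equal
      mutate(.agree = is.na(.data[[cx]]) | is.na(.data[[cy]]) |
               .data[[cx]] == .data[[cy]]) %>%
      # Keep single matches as they are; if no candidate agrees, skip the
      # condition (assumption); otherwise keep only the agreeing candidates
      filter(n() == 1 | !any(.agree) | .agree) %>%
      ungroup()
  }

  # Drop the helper columns again
  select(out, -any_of(c(".row_id", ".agree")))
}

result <- join_with_tiebreakers(questions_df, tags_df,
                                key = "tag", tiebreakers = "country")
result
```

On the sample data this keeps one expert each for questions 1 to 4 and both python experts for question 5, because neither of them matches ZAF. Removing the `!any(.agree)` term gives stricter behaviour closer to the filter above, where such a question disappears from the result. More condition columns can be passed as `tiebreakers = c("country", "other_col")`, provided they exist in both data frames.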