Many-to-many in left_join

Question

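Suppose I have two data frames that I want to join to each other. The join ends up as a many-to-many relationship, which I do not want. If there are multiple matches from y to x, I want to take further columns into account (in this example, country). A solution that extends to more "condition" columns would be nice. What I have in mind is a left_join that receives several columns in the "by" argument, where the first "by" column is the necessary condition and the remaining "by" columns serve as additional evidence for narrowing down to the correct row.

If there is only one possible match, the other conditions should not be checked. If a further column (here, country) is NA, that condition should be ignored.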

library(dplyr)

# Create the Questions DataFrame
questions_df <- data.frame(
  question_id = c(1, 2, 3, 4, 5),
  title = c("How to use Python?", "What is SQL?", "Django tutorial", "Data science with R", "Stackoverflow"),
  tag = c("python", "sql", "django", "r", "python"),
  country = c("NM", "TSE", "FR", "Z", "ZAF")
)

# Create the Tags DataFrame
tags_df <- data.frame(
  tag = c("python", "python", "sql", "django", "django", "r", "r"),
  expert = c("Expert A", "Expert B", "Expert C", "Expert D", "Expert E", "Expert F", "Expert G"),
  country = c("TGV", "NM", "TSE", "FR", "Z", "ZAF", NA)
)

# Perform a left join to illustrate the many-to-many relationship
result <- left_join(questions_df, tags_df, by = "tag")
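
To see where the many-to-many relationship comes from, it can help to count the matches per question. The following is a minimal sketch that rebuilds the sample data so it runs on its own; the names `naive` and `matches_per_question` are illustrative only. Recent dplyr versions (1.1.0 and later) also emit a many-to-many warning for this join, which is expected here.

```r
library(dplyr)

questions_df <- data.frame(
  question_id = c(1, 2, 3, 4, 5),
  title = c("How to use Python?", "What is SQL?", "Django tutorial",
            "Data science with R", "Stackoverflow"),
  tag = c("python", "sql", "django", "r", "python"),
  country = c("NM", "TSE", "FR", "Z", "ZAF")
)

tags_df <- data.frame(
  tag = c("python", "python", "sql", "django", "django", "r", "r"),
  expert = c("Expert A", "Expert B", "Expert C", "Expert D",
             "Expert E", "Expert F", "Expert G"),
  country = c("TGV", "NM", "TSE", "FR", "Z", "ZAF", NA)
)

# Join on tag alone; the two country columns survive as country.x / country.y
naive <- left_join(questions_df, tags_df, by = "tag")

# Number of tag matches per question; anything above 1 needs a tie-breaker
matches_per_question <- count(naive, question_id, name = "n_matches")
matches_per_question
```

Every question except question 2 has two candidate experts, so the join returns 9 rows for 5 questions.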

Tags: r, left-join

1 Answer
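library(dplyr)

# Join on the required key `tag` first
result <- left_join(questions_df, tags_df, by = "tag") %>%
  group_by(question_id) %>%
  # A question's only match is kept as is. With several matches, keep the rows
  # whose country agrees or whose tags-side country is NA (NA means "ignore").
  # Caveat: a question with several matches and no agreeing country
  # (question 5 in the sample data) is dropped from the result.
  filter(n() == 1 | (country.x == country.y | is.na(country.y))) %>%
  ungroup() %>%
  distinct()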

result
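
One way this idea might be extended to several "condition" columns is sketched below. It rests on assumptions the question does not settle: `cond_cols` is a hypothetical vector naming tie-breaker columns that exist in both data frames, an NA on the tags side is treated as a wildcard, and a question whose several candidates all fail is kept once with `expert = NA` (left-join style) rather than dropped. The sample data is rebuilt so the block runs on its own.

```r
library(dplyr)

questions_df <- data.frame(
  question_id = c(1, 2, 3, 4, 5),
  title = c("How to use Python?", "What is SQL?", "Django tutorial",
            "Data science with R", "Stackoverflow"),
  tag = c("python", "sql", "django", "r", "python"),
  country = c("NM", "TSE", "FR", "Z", "ZAF")
)

tags_df <- data.frame(
  tag = c("python", "python", "sql", "django", "django", "r", "r"),
  expert = c("Expert A", "Expert B", "Expert C", "Expert D",
             "Expert E", "Expert F", "Expert G"),
  country = c("TGV", "NM", "TSE", "FR", "Z", "ZAF", NA)
)

# Hypothetical: tie-breaker columns present in BOTH data frames
cond_cols <- c("country")

# Required key only; shared condition columns get .x / .y suffixes
joined <- left_join(questions_df, tags_df, by = "tag", suffix = c(".x", ".y"))

# A candidate passes when every condition column agrees,
# or when the tags-side value is NA (treated as "ignore this condition")
passes <- rep(TRUE, nrow(joined))
for (col in cond_cols) {
  x <- joined[[paste0(col, ".x")]]
  y <- joined[[paste0(col, ".y")]]
  passes <- passes & (is.na(y) | (!is.na(x) & x == y))
}
joined$passes <- passes

# Single candidates are kept unchecked; otherwise only passing rows survive
kept <- joined %>%
  group_by(question_id) %>%
  filter(n() == 1 | passes) %>%
  ungroup() %>%
  select(-ends_with(".y"), -passes) %>%
  rename_with(~ sub("\\.x$", "", .x), ends_with(".x"))

# Assumption: questions that lost all candidates come back with expert = NA
unmatched <- anti_join(questions_df, kept, by = "question_id")

final <- bind_rows(kept, unmatched) %>%
  arrange(question_id)
final
```

With the sample data this gives one row per question: Expert B, Expert C, Expert D and Expert G for questions 1 to 4, and an NA expert for question 5. If dropping such questions is the desired behavior, the `anti_join` and `bind_rows` steps can simply be left out.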