I have a list of columns that I need to scan in a SQL database to see whether each column contains one of the ICD-10 codes stored in a list variable named first_three_chars.
#List of columns to scan
list <- c('D1','D2','D3','D4','D5','D6','D7','D8','D9','D10','D11','D12','D13')
I have already done this successfully by hand with the following code. Note: claims_tbl is a tibble (a temporary dataset from a large database table).
claim_dplyr_tbl <- claims_tbl %>%
filter(
((substr(D1, 1, 3) %in% first_three_chars) |
(substr(D2, 1, 3) %in% first_three_chars) |
(substr(D3, 1, 3) %in% first_three_chars) |
(substr(D4, 1, 3) %in% first_three_chars) |
(substr(D5, 1, 3) %in% first_three_chars) |
(substr(D6, 1, 3) %in% first_three_chars) |
(substr(D7, 1, 3) %in% first_three_chars) |
(substr(D8, 1, 3) %in% first_three_chars) |
(substr(D9, 1, 3) %in% first_three_chars) |
(substr(D10, 1, 3) %in% first_three_chars) |
(substr(D11, 1, 3) %in% first_three_chars) |
(substr(D12, 1, 3) %in% first_three_chars) |
(substr(D13, 1, 3) %in% first_three_chars))) %>%
collect()
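However, I need to make this dynamic, because I have multiple datasets and the list above will change in both the number of columns and the column names. So I am trying to build the filter statement dynamically, but I am having trouble doing so.
I tried the following, which builds a string and then uses that string as an expression. However, even though the string is identical to the condition above, whenever I run it I get a syntax error near first_three_chars. Does anyone know how to convert this string into a form I can use in a filter statement?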
# Construct the filter command
filter_conditions <- sapply(list, function(col) {
paste0("(substr(", col, ", 1, 3) %in% first_three_chars)")
})
# Join the conditions with `|`
final_command <- paste(filter_conditions, collapse = " | ")
# Ensure the final command is properly formatted
print(final_command)
# Construct the filter expression
filter_expr <- str2lang(final_command)
claim_dplyr_tbl2 <- claims_tbl %>%
filter(eval(filter_expr)) %>%
collect()
After many attempts, I found a solution that works.
claim_dplyr_tbl2 <- claims_tbl %>%
  filter_at(
    vars(all_of(list)),
    any_vars(substr(., 1, 3) %in% first_three_chars)
  ) %>%
  collect()
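As a side note, filter_at() and any_vars() are superseded in current dplyr, and the same row filter can probably be written with if_any(). The sketch below runs on a small in-memory tibble rather than a database table: the toy claims_tbl, its values, and the name icd_cols (standing in for the column list) are made up for illustration. It also shows the string from the question being parsed and injected with !!, which is one way to use such an expression inside filter(). I have not checked here how either form translates to SQL on a particular backend.

```r
library(dplyr)

# Toy stand-in for the real table (values are made up for illustration)
claims_tbl <- tibble(
  id = 1:4,
  D1 = c("E119", "I10", "J449", NA),
  D2 = c(NA, "E785", "Z000", "I2510")
)
first_three_chars <- c("E11", "I25")
icd_cols <- c("D1", "D2")  # hypothetical name for the column list

# Keep rows where any listed column starts with one of the codes
res_if_any <- claims_tbl %>%
  filter(if_any(all_of(icd_cols), ~ substr(.x, 1, 3) %in% first_three_chars))

# Alternative: build the condition as a string, parse it, inject it with !!
cond <- paste0("(substr(", icd_cols, ", 1, 3) %in% first_three_chars)",
               collapse = " | ")
res_inject <- claims_tbl %>%
  filter(!!str2lang(cond))
```

On this toy data both versions keep the rows with id 1 and 4. The if_any() form avoids building code from strings, so it is likely the easier one to maintain when the column list changes.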
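The solution below works independently of the column names and returns all columns that contain values starting with one of the three-character strings in the first_three_chars vector.
It works by converting first_three_chars into a regular expression, e.g. c("ABC", "DEF") becomes ^(ABC|DEF), where "^" anchors the match at the start of the string.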
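The pattern construction described above can be tried in isolation. This is only a small sketch: the test strings are made up, and it assumes the prefixes are plain alphanumeric codes, so no regex metacharacters need escaping.

```r
library(stringr)

first_three_chars <- c("ABC", "DEF")

# Collapse the prefixes into one alternation anchored at the string start
pattern <- paste0("^(", paste(first_three_chars, collapse = "|"), ")")
pattern
# "^(ABC|DEF)"

# Only strings that begin with one of the prefixes match
hits <- str_detect(c("ABC1", "XABC", "DEF2", ""), pattern)
hits
# TRUE FALSE TRUE FALSE
```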
I changed the order of filter() and select(). If select() comes first, columns whose matching strings fall outside the date range are retained.
library(dplyr)
library(stringr)
# Create date filter strings
start_date <- "2024-08-24"
end_date <- "2024-08-30"
# Create vector of three letter strings to search for
first_three_chars <- c("ABC", "DEF")
# Create example df
claim_dplyr_tbl <-
data.frame(
date1 = as.character(seq(as.Date("2024-08-22"), by = "day", length.out = 10)),
A1 = c("ABC1", rep("", 9)),
B1 = c(rep("", 4), "ABC1", rep("", 5)),
C1 = c(c(rep("", 9), "DEF1")),
D1 = c(rep("", 6), "DEF2", rep("", 3)),
E1 = c(rep("", 2), "ABC2", rep("", 7)),
F1 = c(rep("", 3), "GHI3", rep("", 6)))
# Convert date1 to as.Date(), filter df by dates, select columns containing strings
# that start with first_three_chars values
result <- claim_dplyr_tbl |>
mutate(start_claim = as.Date(date1)) |>
filter((start_claim >= as.Date(start_date)) &
(start_claim <= as.Date(end_date))) |>
select(start_claim,
where(~ any(str_detect(., paste0("^(", paste(first_three_chars, collapse = "|"), ")")))))
result
#   start_claim   B1   D1   E1
# 1  2024-08-24           ABC2
# 2  2024-08-25
# 3  2024-08-26 ABC1
# 4  2024-08-27
# 5  2024-08-28      DEF2
# 6  2024-08-29
# 7  2024-08-30