Edit (solved): How can I convert a string into an expression, or otherwise build a dynamic filter expression that can be used in an R SQL pull? [closed]


I have a list of columns in a SQL database that I need to scan to see whether each column contains one of the ICD-10 codes stored in a list variable named first_three_chars.

 #List of columns to scan
 list <- c('D1','D2','D3','D4','D5','D6','D7','D8','D9','D10','D11','D12','D13')

I have already done this successfully by hand with the following code. Note: claims_tbl is a tibble (a temporary dataset drawn from a large database table).

  claim_dplyr_tbl <- claims_tbl %>%
    filter(
          ((substr(D1, 1, 3) %in% first_three_chars) | 
           (substr(D2, 1, 3) %in% first_three_chars) | 
           (substr(D3, 1, 3) %in% first_three_chars) |
           (substr(D4, 1, 3) %in% first_three_chars) | 
           (substr(D5, 1, 3) %in% first_three_chars) |
           (substr(D6, 1, 3) %in% first_three_chars) | 
           (substr(D7, 1, 3) %in% first_three_chars) | 
           (substr(D8, 1, 3) %in% first_three_chars) | 
           (substr(D9, 1, 3) %in% first_three_chars) |
           (substr(D10, 1, 3) %in% first_three_chars) | 
           (substr(D11, 1, 3) %in% first_three_chars) |   
           (substr(D12, 1, 3) %in% first_three_chars) | 
           (substr(D13, 1, 3) %in% first_three_chars))) %>%
    collect()

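However, I need to make this dynamic, because I have several datasets and the list above will change in both the number of columns and the column names. I have therefore been trying to build the filter statement dynamically, but I have run into trouble doing so.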

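I tried the following: build a string, then use that string as an expression. However, even though the string is identical to the condition above, every time I run it I get a syntax error near first_three_chars. Does anyone know how to convert this string into a form I can use in the filter statement?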

  # Construct the filter command
  filter_conditions <- sapply(list, function(col) {
    paste0("(substr(", col, ", 1, 3) %in% first_three_chars)")
  })
  # Join the conditions with `|`
  final_command <- paste(filter_conditions, collapse = " | ")
  # Ensure the final command is properly formatted
  print(final_command)
  # Construct the filter expression
  filter_expr <- str2lang(final_command)

  claim_dplyr_tbl2 <- claims_tbl %>%
    filter((filter_expr)) %>%
    collect()
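
For comparison, one common way to use a dynamically built condition in filter() is to inject it with rlang's `!!` operator. The sketch below is only an illustration under stated assumptions: a small local tibble stands in for the remote table, and the sample data and the name `cols` are hypothetical. How the injected expression is translated to SQL on a real database connection depends on the dbplyr backend and is not tested here.

```r
library(dplyr)
library(rlang)

# Hypothetical stand-ins for the real objects in the question
first_three_chars <- c("A01", "B20")
cols <- c("D1", "D2")
claims_tbl <- tibble(
  id = 1:4,
  D1 = c("A011", "C500", "Z999", NA),
  D2 = c("X000", "B209", "Y111", "A019")
)

# Option 1: build one call per column without going through strings
conditions <- lapply(cols, function(col) {
  expr(substr(!!sym(col), 1, 3) %in% first_three_chars)
})
# Combine the per-column calls with `|`
filter_expr <- Reduce(function(a, b) expr(!!a | !!b), conditions)

# Option 2: parse a string, as in the attempt above
final_command <- paste(
  sprintf("(substr(%s, 1, 3) %%in%% first_three_chars)", cols),
  collapse = " | "
)
filter_expr_str <- parse_expr(final_command)

# Inject the expression into filter() with `!!`
result <- claims_tbl %>% filter(!!filter_expr)
result_str <- claims_tbl %>% filter(!!filter_expr_str)
```

Both results keep the rows with id 1, 2 and 4. Building the calls directly (option 1) avoids pasting column names into code strings, which tends to be less fragile when names contain spaces or special characters.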

After many attempts, I found a solution that works.

    claim_dplyr_tbl2 <- claims_tbl %>%
      filter_at(
        vars(all_of(list)), 
        any_vars(substr(., 1, 3) %in% first_three_chars)
      ) %>%
      collect()
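
filter_at() and any_vars() are superseded in recent dplyr releases in favour of if_any(). A possible equivalent is sketched below, assuming dplyr 1.0.4 or later. The local tibble and the name `cols` are hypothetical stand-ins, and whether if_any() translates to SQL on a remote table depends on the installed dbplyr version.

```r
library(dplyr)

# Hypothetical stand-ins for the real objects in the question
first_three_chars <- c("A01", "B20")
cols <- c("D1", "D2")
claims_tbl <- tibble(
  id = 1:4,
  D1 = c("A011", "C500", "Z999", NA),
  D2 = c("X000", "B209", "Y111", "A019")
)

# Keep rows where any of the listed columns starts with one of the codes
claim_dplyr_tbl3 <- claims_tbl %>%
  filter(if_any(all_of(cols), ~ substr(.x, 1, 3) %in% first_three_chars))
```

With this sample data, the rows with id 1, 2 and 4 are kept.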
Tags: sql r string dplyr expression

1 Answer

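This solution works regardless of the column names and returns every column that contains a value starting with one of the three-character strings in the first_three_chars vector.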

It works by converting first_three_chars into a regular expression: for example, c("ABC", "DEF") becomes ^(ABC|DEF), where "^" anchors the match to the start of the string.
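
The pattern construction can be tried on its own. This is a minimal sketch with made-up test strings. It assumes the codes contain no regex metacharacters, which holds for plain alphanumeric prefixes.

```r
library(stringr)

first_three_chars <- c("ABC", "DEF")

# Join the codes with "|" and anchor the group to the start of the string
pattern <- paste0("^(", paste(first_three_chars, collapse = "|"), ")")
pattern
# [1] "^(ABC|DEF)"

# Only strings that begin with one of the codes match
matches <- str_detect(c("ABC1", "XABC", "DEF2", ""), pattern)
matches
# [1]  TRUE FALSE  TRUE FALSE
```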

I changed the order of filter() and select(). If select() comes first, columns whose only matching strings fall outside the date range are retained.

library(dplyr)
library(stringr)

# Create date filter strings
start_date <- "2024-08-24"
end_date <- "2024-08-30"

# Create vector of three letter strings to search for
first_three_chars <- c("ABC", "DEF")

# Create example df
claim_dplyr_tbl <- 
  data.frame(
    date1 = as.character(seq(as.Date("2024-08-22"), by = "day", length.out = 10)),
    A1 = c("ABC1", rep("", 9)),
    B1 = c(rep("", 4), "ABC1", rep("", 5)),
    C1 = c(c(rep("", 9), "DEF1")),
    D1 = c(rep("", 6), "DEF2", rep("", 3)),
    E1 = c(rep("", 2), "ABC2", rep("", 7)),
    F1 = c(rep("", 3), "GHI3", rep("", 6)))

# Convert date1 to as.Date(), filter df by dates, select columns containing strings
# that start with first_three_chars values
result <- claim_dplyr_tbl |>
  mutate(start_claim = as.Date(date1)) |>
  filter((start_claim >= as.Date(start_date)) & 
           (start_claim <= as.Date(end_date))) |>
  select(start_claim,
         where(~ any(str_detect(., paste0("^(", paste(first_three_chars, collapse = "|"), ")")))))

result
#   start_claim   B1   D1   E1
# 1  2024-08-24           ABC2
# 2  2024-08-25               
# 3  2024-08-26 ABC1          
# 4  2024-08-27               
# 5  2024-08-28      DEF2     
# 6  2024-08-29               
# 7  2024-08-30                     
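
To illustrate why the order of filter() and select() matters, the sketch below runs both orders on the same example data. The helper `has_code()` is a hypothetical name introduced here. It also adds an is.character() guard, which is not in the answer's code, so that only character columns are passed to str_detect().

```r
library(dplyr)
library(stringr)

start_date <- "2024-08-24"
end_date <- "2024-08-30"
first_three_chars <- c("ABC", "DEF")
pattern <- paste0("^(", paste(first_three_chars, collapse = "|"), ")")

# Same example data as above
df <- data.frame(
  date1 = as.character(seq(as.Date("2024-08-22"), by = "day", length.out = 10)),
  A1 = c("ABC1", rep("", 9)),
  B1 = c(rep("", 4), "ABC1", rep("", 5)),
  C1 = c(rep("", 9), "DEF1"),
  D1 = c(rep("", 6), "DEF2", rep("", 3)),
  E1 = c(rep("", 2), "ABC2", rep("", 7)),
  F1 = c(rep("", 3), "GHI3", rep("", 6))
) |>
  mutate(start_claim = as.Date(date1))

# Hypothetical helper: TRUE if a character column has any matching value
has_code <- function(x) is.character(x) && any(str_detect(x, pattern))

# filter() first: columns are chosen from the rows inside the date range
filter_first <- df |>
  filter(start_claim >= as.Date(start_date), start_claim <= as.Date(end_date)) |>
  select(start_claim, where(has_code))

# select() first: columns are chosen from all rows, then rows are filtered
select_first <- df |>
  select(start_claim, where(has_code)) |>
  filter(start_claim >= as.Date(start_date), start_claim <= as.Date(end_date))

names(filter_first)
names(select_first)
```

With select() first, A1 and C1 are kept even though their only matches (on 2024-08-22 and 2024-08-31) lie outside the date range, so both columns are empty in the filtered result.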