How can I use string detection to check whether any of several strings appear across multiple columns?


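I am searching several diagnosis columns for certain codes (S00-S99, T07-T34, T4112, W00-W99) and want to flag cases in which at least one of these codes is present. When I run the following code: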

library(tidyverse)
library(stringr)

raw_df <- tibble::tribble(
  ~dx1,   ~dx2,     ~ecm1,  ~ecm2,
  "S045", "T401",   "X64",  "V99",
  "R901", "T5621A", "Y141", "U033",
  "J76",  "I51",    "K44",  "G304"
)

dat <- raw_df %>%
  mutate(diagn = case_when(
    if_any(c(dx1:dx2, ecm1:ecm2),
           ~ str_detect(., regex("[S]+[00-99]|[T]+[07-34]|T4112|[W]+[00-99]"))) ~ 1,
    TRUE ~ 0))

I get this error:

Error in `mutate()`:
ℹ In argument: `diagn = case_when(...)`.
Caused by error in `case_when()`:
! Failed to evaluate the left-hand side of formula 1.
Caused by error in `if_any()`:
! Can't compute column `dx1`.
Caused by error in `stri_detect_regex()`:
! In a character range [x-y], x is greater than y. (U_REGEX_INVALID_RANGE, context=`[S]+[00-99]|[T]+[07-34]|T4112|[W]+[00-99]`)
1 Answer

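One approach is to define the codes in a vector, write a helper function that checks whether an input code falls within a given range, and then build a reference matrix showing which rows satisfy each condition. This also lets you identify the matching rows and columns, or just the rows, whichever you need.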

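Your sample data has only one cell that meets your criteria (first row, first column). To test more thoroughly, I added a fourth row to the sample data in which every column matches a different code: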

raw_df <- tibble::tribble(
  ~dx1, ~dx2, ~ecm1, ~ecm2,
  "S045",   "T401",    "X64", "V99",
  "R901",   "T5621A",  "Y141", "U033",
  "J76",   "I51",    "K44", "G304",
  "S33", "T14", "T4112", "W01"
)

First, define the vector and the helper function:

codes <- c("S00-S99", "T07-T34", "T4112", "W00-W99")

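range_fun <- function(code, range_str) {
  if (grepl("-", range_str)) {
    # Range such as "S00-S99": split into lower and upper bounds
    range_parts <- stringr::str_split(range_str, "-", simplify = TRUE)
    # String comparison of each code against the two bounds
    dplyr::between(code, range_parts[1], range_parts[2])
  } else {
    # Single code such as "T4112": exact match only
    code == range_str
  }
}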
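
One caveat worth checking against your data: because the helper compares whole strings, a longer code such as S991 sorts after the upper bound S99 and would not be counted as part of S00-S99. If the ranges are meant to cover every subcode of a three-character category, a minimal sketch of a variant that compares only the first three characters could look like this. The name range_prefix_fun and the three-character assumption are mine, not part of the original answer.

```r
# Hypothetical variant of range_fun: for ranges, compare only the
# three-character category prefix (assumes all codes are uppercase
# and every range bound is exactly three characters long).
range_prefix_fun <- function(code, range_str) {
  if (grepl("-", range_str, fixed = TRUE)) {
    bounds <- strsplit(range_str, "-", fixed = TRUE)[[1]]
    prefix <- substr(code, 1, 3)
    prefix >= bounds[1] & prefix <= bounds[2]
  } else {
    # Single codes are still matched exactly
    code == range_str
  }
}

test_codes <- c("S045", "S991", "T401", "T4112")
range_prefix_fun(test_codes, "S00-S99")
# [1]  TRUE  TRUE FALSE FALSE
range_prefix_fun(test_codes, "T4112")
# [1] FALSE FALSE FALSE  TRUE
```

It can be dropped in wherever range_fun is used, since it takes the same two arguments.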

Then use the *apply functions to run it:

ref_matrix <- sapply(codes, \(x)
                     apply(raw_df, 1, \(y) any(range_fun(y, x))))

#      S00-S99 T07-T34 T4112 W00-W99
# [1,]    TRUE   FALSE FALSE   FALSE
# [2,]   FALSE   FALSE FALSE   FALSE
# [3,]   FALSE   FALSE FALSE   FALSE
# [4,]    TRUE    TRUE  TRUE    TRUE

If you only want to identify the rows, you can index with apply:
raw_df[apply(ref_matrix, 1, any),]

#    dx1  dx2  ecm1 ecm2
# 1 S045 T401   X64  V99
# 4  S33  T14 T4112  W01
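
Since the question asks for a 0/1 flag rather than a filtered table, the same row-wise test can also be stored as a column. The following is a minimal sketch that repeats the definitions from this answer so it runs on its own; the column name diagn is taken from the question, and character support in dplyr::between is assumed (a recent dplyr version).

```r
library(tibble)

raw_df <- tribble(
  ~dx1,   ~dx2,     ~ecm1,   ~ecm2,
  "S045", "T401",   "X64",   "V99",
  "R901", "T5621A", "Y141",  "U033",
  "J76",  "I51",    "K44",   "G304",
  "S33",  "T14",    "T4112", "W01"
)

codes <- c("S00-S99", "T07-T34", "T4112", "W00-W99")

# Helper from the answer: range check or exact match
range_fun <- function(code, range_str) {
  if (grepl("-", range_str)) {
    range_parts <- stringr::str_split(range_str, "-", simplify = TRUE)
    dplyr::between(code, range_parts[1], range_parts[2])
  } else {
    code == range_str
  }
}

# One column per code definition, one row per case
ref_matrix <- sapply(codes, \(x)
                     apply(raw_df, 1, \(y) any(range_fun(y, x))))

# 1 if any code definition matched anywhere in the row, otherwise 0
raw_df$diagn <- as.integer(apply(ref_matrix, 1, any))
raw_df$diagn
# [1] 1 0 0 1
```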

If you want to identify both the rows and the columns, you can use which:

which(ref_matrix, arr.ind = TRUE)

#      row col
# [1,]   1   1
# [2,]   4   1
# [3,]   4   2
# [4,]   4   3
# [5,]   4   4
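
As for the error in the question itself: square brackets in a regular expression define a set of single characters, not a numeric range. [07-34] is read as the character 0, the range 7-3, and the character 4, and 7-3 is invalid because 7 sorts after 3, hence U_REGEX_INVALID_RANGE. If you would rather stay with str_detect, a minimal sketch of a corrected pattern follows. It assumes every code should be matched by its leading characters, so T4112 also matches longer codes that start with T4112; the name pattern and the ^ anchor are my additions, not part of the question.

```r
library(dplyr)
library(stringr)

raw_df <- tibble::tribble(
  ~dx1,   ~dx2,     ~ecm1,   ~ecm2,
  "S045", "T401",   "X64",   "V99",
  "R901", "T5621A", "Y141",  "U033",
  "J76",  "I51",    "K44",   "G304",
  "S33",  "T14",    "T4112", "W01"
)

# S00-S99: S followed by two digits
# T07-T34: T07-T09, T10-T29, T30-T34 spelled out digit by digit
# T4112:   literal
# W00-W99: W followed by two digits
pattern <- "^(S[0-9]{2}|T(0[7-9]|[12][0-9]|3[0-4])|T4112|W[0-9]{2})"

dat <- raw_df %>%
  mutate(diagn = as.integer(
    if_any(c(dx1:dx2, ecm1:ecm2), ~ str_detect(.x, pattern))
  ))

dat$diagn
# [1] 1 0 0 1
```

Spelling the T range out digit by digit is verbose, which is one reason the vector-plus-helper approach above can be easier to maintain when the list of codes changes.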