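I have a table with roughly 2,000 entries containing the `names`, `positions`, `field of expertise`, and `addresses` of professors. The table is very messy, and I am struggling to find a programmatic way to convert it into a tidy format.
My goal is a tidy table with at least the following columns: `names`, `positions`, `field of expertise`, and `email address`.
Here is a sample of the data: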
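data | field |
---|---|
Person 1, MD | A |
Associate Professor of A | A |
Program Associate, A | A |
Senior Medical Director, FGP Operations | A |
UMMG Ambulatory Surgery | A |
Residency Program | A |
Core Educational Lead | A |
[email protected] | A |
NA | A |
Person 2, MD | B |
Person 2 | B |
Clinical Assistant Professor of B | B |
Medical Student Clerkship and Core Educational Lead | B |
[email protected] | B |
NA | B |
labcoat | B |
Person 3, MD | B |
Clinical Assistant Professor of B | B |
[email protected] | B |
NA | B |
labcoat | B |
Person 4, MD | B |
Professor of B | B |
Professor of B | B |
[email protected] | B |
NA | B |
Person 5, MD | C |
Person 5 | C |
Professor of C | C |
Professor of C | C |
Associate Chair, Quality | C |
Department of Urology and Service Chief | C |
[email protected] | C |
132-547-1321 | C |
NA | C |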
Here is the `tibble` code to reproduce it (assigned to `df`):
df <- tibble::tribble(
~data, ~field,
"Person 1, MD", "A",
"Associate Professor of A", "A",
"Program Associate, A", "A",
"Senior Medical Director, FGP Operations", "A",
"UMMG Ambulatory Surgery", "A",
"Residency Program", "A",
"Core Educational Lead", "A",
"[email protected]", "A",
NA, "A",
"Person 2, MD", "B",
"Person 2", "B",
"Clinical Assistant Professor of B", "B",
"Medical Student Clerkship and Core Educational Lead", "B",
"[email protected]", "B",
NA, "B",
"labcoat", "B",
"Person 3, MD", "B",
"Clinical Assistant Professor of B", "B",
"[email protected]", "B",
NA, "B",
"labcoat", "B",
"Person 4, MD", "B",
"Professor of B", "B",
"Professor of B", "B",
"[email protected]", "B",
NA, "B",
"Person 5, MD", "C",
"Person 5", "C",
"Professor of C", "C",
"Professor of C", "C",
"Associate Chair, Quality", "C",
"Department of Urology and Service Chief", "C",
"[email protected]", "C",
"132-547-1321", "C",
NA, "C"
)
A step-by-step approach:
library(dplyr)
library(tidyr)
library(stringr)

df_wide <- df |>
  # Tag the rows that open (name) and close (email) each record
  mutate(names = case_when(str_detect(data, ", MD$") ~ "name",
                           str_detect(data, "@") ~ "email",
                           .default = NA)) |>
  mutate(data = str_remove(data, ", MD$")) |>
  # Number the records; rows after an email and before the next name get NA
  mutate(tmp_start = cumsum(!is.na(names) & names == "name"),
         tmp_end = lag(cumsum(!is.na(names) & names == "email"), default = 0),
         id = if_else(tmp_start == tmp_end, NA, tmp_start)) |>
  group_by(id) |>
  fill(id, .direction = "downup") |>
  select(-starts_with("tmp")) |>
  filter(!is.na(id)) |>
  # Drop repeated entries within each record
  filter(!duplicated(data)) |>
  # Label the remaining rows position1, position2, ...
  mutate(names = if_else(is.na(names), paste0("position", row_number() - 1), names)) |>
  pivot_wider(names_from = names,
              values_from = data) |>
  ungroup() |>
  select(-id)

data.frame(df_wide)
# field  name      position1                          position2                                            position3                                position4                position5          position6              email
# A      Person 1  Associate Professor of A           Program Associate, A                                 Senior Medical Director, FGP Operations  UMMG Ambulatory Surgery  Residency Program  Core Educational Lead  [email protected]
# B      Person 2  Clinical Assistant Professor of B  Medical Student Clerkship and Core Educational Lead  <NA>                                     <NA>                     <NA>               <NA>                   [email protected]
# B      Person 3  Clinical Assistant Professor of B  <NA>                                                 <NA>                                     <NA>                     <NA>               <NA>                   [email protected]
# B      Person 4  Professor of B                     <NA>                                                 <NA>                                     <NA>                     <NA>               <NA>                   [email protected]
# C      Person 5  Professor of C                     Associate Chair, Quality                             Department of Urology and Service Chief  <NA>                     <NA>               <NA>                   [email protected]
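If a single `positions` column is closer to what you want than `position1` ... `position6`, the same record-numbering idea can be sketched in base R. This is only an illustration under stated assumptions: every record opens with a row ending in ", MD" and closes with the first row containing "@". The toy data, the `example.org` addresses, and the names `raw`, `records`, and `tidy` are all made up for the example.

```r
# Toy input in the same shape as the question's data (all values are invented).
raw <- data.frame(
  data = c("Person 1, MD", "Associate Professor of A", "Core Educational Lead",
           "p1@example.org", NA,
           "Person 2, MD", "Person 2", "Professor of B", "Professor of B",
           "p2@example.org", NA, "labcoat"),
  field = c(rep("A", 5), rep("B", 7))
)

is_name  <- grepl(", MD$", raw$data)               # NA rows give FALSE
is_email <- grepl("@", raw$data, fixed = TRUE)

# A row belongs to record k when k names have been seen so far but fewer
# than k emails were seen strictly before it.
started  <- cumsum(is_name)
finished <- cumsum(is_email) - is_email
raw$id   <- ifelse(started > finished, started, NA)

# Rows outside any record (trailing NA, "labcoat") have id NA and are dropped.
kept    <- raw[!is.na(raw$id), ]
records <- split(kept, kept$id)

tidy <- do.call(rbind, lapply(records, function(rec) {
  name   <- sub(", MD$", "", rec$data[1])
  email  <- rec$data[nrow(rec)]
  # Everything between the name row and the email row is treated as a position;
  # drop missing values, a bare repeat of the name, and exact duplicates.
  middle <- rec$data[-c(1, nrow(rec))]
  middle <- unique(middle[!is.na(middle) & middle != name])
  data.frame(name = name,
             positions = paste(middle, collapse = "; "),
             field = rec$field[1],
             email = email)
}))
rownames(tidy) <- NULL
tidy
```

Collapsing with "; " keeps one row per person without a ragged set of position columns. A phone number or other stray line that appears between the name and the email would still end up in `positions`, so the real 2,000 rows may need an extra filter.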