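I have two files in R (f1, f2) that contain country names (each country appears only once in each file). I want to join the two files with an inner join and then, for the rows that did not match, use a fuzzy join to match them approximately.
Here is my approach: first, I standardize all names (e.g. convert all letters to uppercase and remove spaces, commas, hyphens, apostrophes, etc.), then I do an inner join to get the exact matches. Then, for the unmatched names, I use a fuzzy matching metric and join when the distance is below some constant threshold: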
library(dplyr)
library(fuzzyjoin)
library(stringdist)
library(stringr)
set.seed(123)
f1 <- data.frame(
  country = c("United States", "United Kingdom", "France", "Germany", "Italy", "Spain", "Canada", "Japan", "Australia", "Brazil"),
  value1 = runif(10, 1, 100), value2 = runif(10, 1, 100), value3 = runif(10, 1, 100)
)
f2 <- data.frame(
  country = c("United States of America", "United Kingdom", "French Republic", "Germany", "Italian Republic", "Kingdom of Spain", "Canada", "Japan", "Commonwealth of Australia", "Federative Republic of Brazil"),
  value2 = runif(10, 1, 100)
)
standardize_name <- function(name) {
  name %>%
    str_to_upper() %>%
    str_replace_all("[[:punct:]]", "") %>%
    str_replace_all("\\s+", "")
}
f1 <- f1 %>%
  mutate(country_std = standardize_name(country))
f2 <- f2 %>%
  mutate(country_std = standardize_name(country))
# Exact matches on the standardized names
inner_joined <- inner_join(f1, f2, by = "country_std")
# Rows that have no exact match in the other table
unmatched_f1 <- anti_join(f1, f2, by = "country_std")
unmatched_f2 <- anti_join(f2, f1, by = "country_std")
# Fuzzy match the leftovers on Levenshtein distance
fuzzy_joined <- stringdist_join(unmatched_f1, unmatched_f2,
                                by = "country_std",
                                mode = "left",
                                method = "lv",
                                max_dist = 5,
                                distance_col = "distance")
final_result <- bind_rows(inner_joined, fuzzy_joined)
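As a quick check of how the max_dist = 5 threshold behaves on names like these, the sketch below computes the Levenshtein distance for a few of the pairs using base R's adist(). The helper std() is a hypothetical base-R equivalent of standardize_name(), and short_names / long_names are illustrative vectors copied from f1 and f2.

```r
# Hypothetical base-R equivalent of standardize_name():
# uppercase, then strip punctuation and whitespace.
std <- function(x) gsub("\\s+", "", gsub("[[:punct:]]", "", toupper(x)))

# A few of the name pairs from f1 (short form) and f2 (long form).
short_names <- c("United States", "France", "Italy", "Spain")
long_names  <- c("United States of America", "French Republic",
                 "Italian Republic", "Kingdom of Spain")

# Levenshtein distance for each pair; adist() ships with base R (utils).
d <- unname(mapply(function(a, b) as.integer(adist(a, b)),
                   std(short_names), std(long_names)))

# One row per pair, with a flag for whether it falls within the threshold.
check <- data.frame(short_names, long_names, distance = d, within_5 = d <= 5)
print(check)
```

For these pairs the standardized long form is at least 8 characters longer than the short form, so the Levenshtein distance is necessarily above 5 and none of them would be matched by the fuzzy step as configured.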
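However, in reality, f2 is not a data frame but a shapefile (https://www.naturalearthdata.com/downloads/110m-culture-vectors/, Admin 0 – Countries). I am not sure whether fuzzy matching can be used when one file is a shapefile and the other is a data frame.
Does anyone know if this is possible? The final result should be an sf object for plotting, with all the columns from f1 and all the rows from f2.
Cheers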
I'm not sure this is what you are looking for, as the question may need further clarification.
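I'll use the rnaturalearth package to load the Natural Earth data directly into the R environment. I'll also make use of countrycode, which usually works well for matching country names. We can use it to standardize the country names in both f1 and f2, although I suspect you don't need it for f2, since those names should already be standardized by Natural Earth.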
library(dplyr)
library(countrycode)
library(rnaturalearth)
f1 <- data.frame(
  country = c("United States", "United Kingdom", "France", "Germany", "Italy", "Spain", "Canada", "Japan", "Australia", "Brazil"),
  value1 = runif(10, 1, 100),
  value2 = runif(10, 1, 100),
  value3 = runif(10, 1, 100)
) %>%
  # Recode the existing country column to countrycode's standard country name.
  # This is useful because a name is often still recognized and corrected to the
  # standard format even if it is misspelled. I have never had a name so badly
  # written that it was not recognized, but I suppose it could happen.
  mutate(country = countrycode(country, 'country.name', 'country.name'))
f2 <- ne_countries(scale = 110, type = 'countries') %>%
  # Create a new column called country by recoding the name_en field to the
  # same standard country name used for f1.
  mutate(country = countrycode(name_en, 'country.name', 'country.name')) %>%
  # It's unclear whether you are interested in all the fields of the Natural Earth
  # data frame f2. I assumed you only wanted to turn f1 into a polygon shapefile.
  # To keep all the Natural Earth fields, delete the select() line.
  select(country, geometry)
class(f2) # note that, because we used rnaturalearth, f2 is already an sf data frame object
# In theory the country names should match now
final_result <- f1 %>%
  left_join(f2, by = 'country')
print(final_result)
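Since the question asks for an sf object that keeps every row of f2, one option is to put the sf object on the left side of the join. The sketch below uses a tiny hand-made sf layer (f2_toy, with made-up point coordinates) and a small data frame (f1_toy) as stand-ins for the real data. Both names are illustrative and not part of the code above, and the sketch assumes the sf and dplyr packages are installed.

```r
library(dplyr)
library(sf)

# Toy stand-in for the Natural Earth layer: three countries with point
# geometries (hypothetical coordinates, for illustration only).
f2_toy <- st_sf(
  country  = c("France", "Germany", "Japan"),
  geometry = st_sfc(st_point(c(2, 46)),
                    st_point(c(10, 51)),
                    st_point(c(138, 36)),
                    crs = 4326)
)

# Toy stand-in for f1: only two of the three countries have data.
f1_toy <- data.frame(country = c("France", "Japan"), value1 = c(10, 20))

# With the sf object on the left, the join keeps the sf class and every
# row of f2_toy; countries missing from f1_toy get NA in value1.
result <- f2_toy %>%
  left_join(f1_toy, by = "country")

class(result)
print(result)
```

With the real data, the same pattern would be f2 %>% left_join(f1, by = 'country'). Countries that are missing from f1 keep their geometry and get NA in the value columns, which is usually what a map needs.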
Please let me know if this helps.