Swapping misplaced cells in R?

Question  Votes: 0  Answers: 3

I have a huge database (more than 65M rows), and I noticed that some cells are misplaced. As an example, suppose I have this:

library("tidyverse")

DATA <- tribble(
  ~SURNAME,~NAME,~STATE,~COUNTRY,
  'Smith','Emma','California','USA',
  'Johnson','Oliia','Texas','USA',
  'Williams','James','USA','California',
  'Jones','Noah','Pennsylvania','USA',
  'Williams','Liam','Illinois','USA',
  'Brown','Sophia','USA','Louisiana',
  'Daves','Evelyn','USA','Oregon',
  'Miller','Jacob','New Mexico','USA',
  'Williams','Lucas','Connecticut','USA',
  'Daves','John','California','USA',
  'Jones','Carl','USA','Illinois'
)

=====

> DATA
# A tibble: 11 x 4
   SURNAME  NAME   STATE        COUNTRY   
   <chr>    <chr>  <chr>        <chr>     
 1 Smith    Emma   California   USA       
 2 Johnson  Oliia  Texas        USA       
 3 Williams James  USA          California
 4 Jones    Noah   Pennsylvania USA       
 5 Williams Liam   Illinois     USA       
 6 Brown    Sophia USA          Louisiana 
 7 Daves    Evelyn USA          Oregon    
 8 Miller   Jacob  New Mexico   USA       
 9 Williams Lucas  Connecticut  USA       
10 Daves    John   California   USA       
11 Jones    Carl   USA          Illinois 

As you can see, COUNTRY and STATE are swapped in some rows. How can I swap them back efficiently?

Kind regards, Luis.

r
3 Answers
2
votes

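Using data.table and the built-in state.name vector:

library(data.table)

setDT(DATA)  # convert to a data.table by reference, without copying
# Where COUNTRY holds a US state name, swap the two columns in place
DATA[COUNTRY %in% state.name, `:=`(COUNTRY = STATE, STATE = COUNTRY)]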

DATA
#      SURNAME   NAME        STATE COUNTRY
#  1:    Smith   Emma   California     USA
#  2:  Johnson  Oliia        Texas     USA
#  3: Williams  James   California     USA
#  4:    Jones   Noah Pennsylvania     USA
#  5: Williams   Liam     Illinois     USA
#  6:    Brown Sophia    Louisiana     USA
#  7:    Daves Evelyn       Oregon     USA
#  8:   Miller  Jacob   New Mexico     USA
#  9: Williams  Lucas  Connecticut     USA
# 10:    Daves   John   California     USA
# 11:    Jones   Carl     Illinois     USA
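
For readers without data.table, the same idea can be sketched in base R. This is only an illustration on a three-row toy data frame (`df` and `swapped` are made-up names, not part of the answer). It assumes every misplaced value in COUNTRY is one of the 50 names in `state.name`, which does not include entries such as "District of Columbia".

```r
# Toy data: row 2 has STATE and COUNTRY swapped
df <- data.frame(
  STATE   = c("California", "USA", "Texas"),
  COUNTRY = c("USA", "Oregon", "USA"),
  stringsAsFactors = FALSE
)

# Flag rows whose COUNTRY is actually a US state name
swapped <- df$COUNTRY %in% state.name

# Assign both columns at once; data frame assignment is positional,
# so reversing the column order on the right-hand side swaps the values
df[swapped, c("STATE", "COUNTRY")] <- df[swapped, c("COUNTRY", "STATE")]

df
```

On 65M rows this makes more copies than the by-reference `:=` update above, so it is likely slower and more memory-hungry.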

1
votes

Check this solution (it assumes the COUNTRY column is in ISO3 format, e.g. MEX, CAN):

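# Rows whose COUNTRY is exactly three capital letters are kept as they are;
# in all other rows STATE and COUNTRY are swapped.
DATA %>%
  mutate(
    COUNTRY_TMP = if_else(str_detect(COUNTRY, '^[A-Z]{3}$'), COUNTRY, STATE),
    STATE = if_else(str_detect(COUNTRY, '^[A-Z]{3}$'), STATE, COUNTRY),
    COUNTRY = COUNTRY_TMP
  ) %>%
  select(-COUNTRY_TMP)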
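
A three-capital-letter pattern only identifies an ISO3 code reliably when it is anchored to the whole string. The sketch below uses base R's `grepl` on a made-up vector `x` to show the difference; the value "NEW YORK" is a hypothetical example of a state entered in upper case.

```r
# Hypothetical column values: a code, a normal state name, an upper-case state name
x <- c("USA", "California", "NEW YORK")

# Unanchored: matches any string containing three consecutive capitals
unanchored <- grepl("[A-Z]{3}", x)

# Anchored: matches only strings that are exactly three capitals
anchored <- grepl("^[A-Z]{3}$", x)

data.frame(x, unanchored, anchored)
```

With the unanchored pattern, "NEW YORK" would be treated as a country code and the row would be left unswapped.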

0
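votes

Assuming all country names follow the ISO3 format, we can first install the countrycode package. The package contains a data frame named codelist, whose column iso3c holds the ISO3 country codes. We can swap the misplaced values as follows.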

library(tidyverse)
library(countrycode)

DATA2 <- DATA %>%
  # A row is misplaced when STATE holds an ISO3 code and COUNTRY does not;
  # all other rows are left unchanged
  mutate(SWAPPED = STATE %in% codelist$iso3c & !COUNTRY %in% codelist$iso3c,
         STATE2 = ifelse(SWAPPED, COUNTRY, STATE),
         COUNTRY2 = ifelse(SWAPPED, STATE, COUNTRY)) %>%
  select(-STATE, -COUNTRY, -SWAPPED) %>%
  rename(STATE = STATE2, COUNTRY = COUNTRY2)

DATA2
# # A tibble: 11 x 4
#    SURNAME  NAME   STATE        COUNTRY
#    <chr>    <chr>  <chr>        <chr>  
#  1 Smith    Emma   California   USA    
#  2 Johnson  Oliia  Texas        USA    
#  3 Williams James  California   USA    
#  4 Jones    Noah   Pennsylvania USA    
#  5 Williams Liam   Illinois     USA    
#  6 Brown    Sophia Louisiana    USA    
#  7 Daves    Evelyn Oregon       USA    
#  8 Miller   Jacob  New Mexico   USA    
#  9 Williams Lucas  Connecticut  USA    
# 10 Daves    John   California   USA    
# 11 Jones    Carl   Illinois     USA   
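
With tens of millions of rows, it may be worth classifying the rows before swapping anything, since some rows could have a code in both columns or in neither. The following is a minimal sketch in base R; `valid_codes` is a hypothetical three-element stand-in for `codelist$iso3c`, and the toy data frame and the labels "ok", "swap" and "review" are invented for illustration.

```r
# Hypothetical stand-in for countrycode::codelist$iso3c
valid_codes <- c("USA", "CAN", "MEX")

# Toy data: row 2 is swapped, row 4 has codes in both columns,
# row 5 has a code in neither column
df <- data.frame(
  STATE   = c("California", "USA", "Texas", "USA", "Ontario"),
  COUNTRY = c("USA", "Oregon", "USA", "CAN", "Canada"),
  stringsAsFactors = FALSE
)

state_is_code   <- df$STATE %in% valid_codes
country_is_code <- df$COUNTRY %in% valid_codes

# "ok": only COUNTRY is a code; "swap": only STATE is a code;
# "review": both or neither, so the rule cannot decide
status <- ifelse(country_is_code & !state_is_code, "ok",
          ifelse(state_is_code & !country_is_code, "swap", "review"))

table(status)
```

Rows labelled "review" can then be inspected by hand instead of being silently rewritten.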