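I'm trying to automatically pull in data from the dataset on this site: https://ww2.nycourts.gov/oca-stat-act-31371. The data imports into an R data frame without complaint, but when I try to visualize it, the plot is completely garbled. Likewise, when I try to run it in an R Shiny app, it fails with the error "input string 1 is invalid UTF-8". What should I do?
Here is the code that imports and processes the data: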
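library(tidyverse)  # provides select(), %>% and ggplot2
library(ggthemes)   # provides theme_economist()
NYSdata <- read.csv("https://www.nycourts.gov/LegacyPDFS/court-research/OCA-STAT-Act.csv")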
new <- c("row_num", "court_type", "region", "district", "county", "court", "arresting_agency", "arrest_type", "arraign_year", "arraign_month", "top_charge_at_arraignment", "severity", "weight", "law", "article_section", "attempt_flag", "gender", "ethnicity", "race", "arrest_age", "docket_status", "disposition_type", "disposition_detail", "dismissal_reason", "most_severe_sentence", "fines_imposed", "fees_imposed", "surcharges_imposed")
names(NYSdata) <- new
NYSdata <- select(NYSdata, -c("row_num"))
Here is the code that visualizes the data:
NYSdata %>%
  ggplot(aes(x = race)) +
  geom_bar() +
  xlab("Court") +
  ylab("Number of People") +
  labs(title = "Racial Breakdown of New York State Courts") +
  theme_economist() +
  theme(plot.title = element_text(hjust = 0.5)) +
  geom_text(stat = 'count', aes(label = ..count..), vjust = -.3)
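This is what I get when I run the code above:
The csv file is relatively large (~50 MB) and the site's download speed is fairly slow, so you may be hitting the "timeout" limit. Try increasing the timeout and see whether you end up with an uncorrupted data file, e.g.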
library(tidyverse)
library(ggthemes)        # needed for theme_economist()
options(timeout = 1200)  # allow up to 20 minutes for the download
NYSdata <- read.csv("https://www.nycourts.gov/LegacyPDFS/court-research/OCA-STAT-Act.csv")
new <- c("row_num", "court_type", "region", "district", "county", "court", "arresting_agency", "arrest_type", "arraign_year", "arraign_month", "top_charge_at_arraignment", "severity", "weight", "law", "article_section", "attempt_flag", "gender", "ethnicity", "race", "arrest_age", "docket_status", "disposition_type", "disposition_detail", "dismissal_reason", "most_severe_sentence", "fines_imposed", "fees_imposed", "surcharges_imposed")
names(NYSdata) <- new
NYSdata <- select(NYSdata, -c("row_num"))
NYSdata %>%
  ggplot(aes(x = race)) +
  geom_bar() +
  xlab("Court") +
  ylab("Number of People") +
  labs(title = "Racial Breakdown of New York State Courts") +
  theme_economist() +
  theme(plot.title = element_text(hjust = 0.5)) +
  geom_text(stat = 'count', aes(label = ..count..), vjust = -.3)
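As a side note on the timeout approach, it may help to download the file once and read the local copy afterwards, so a slow connection is only paid for a single time. The following is a minimal sketch in base R; `fetch_once` is a hypothetical helper name, and the default timeout of 1200 seconds is just the value used above, not a recommendation.

```r
# Download a file only if no local copy exists yet.
# fetch_once() is a hypothetical helper written for illustration.
fetch_once <- function(url, dest, timeout = 1200) {
  if (!file.exists(dest)) {
    old <- options(timeout = timeout)  # download.file() honours this option
    on.exit(options(old), add = TRUE)  # restore the previous setting afterwards
    download.file(url, destfile = dest, mode = "wb")  # "wb" keeps the bytes unchanged
  }
  dest  # return the local path either way
}

# Usage (commented out because it needs network access):
# csv_path <- fetch_once(
#   "https://www.nycourts.gov/LegacyPDFS/court-research/OCA-STAT-Act.csv",
#   "OCA-STAT-Act.csv"
# )
# NYSdata <- read.csv(csv_path)
```

If the first download was cut short, the truncated file would need to be deleted by hand before calling the helper again, since it only checks whether the file exists.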
A better option is to use the vroom package, e.g.
library(tidyverse)
library(vroom)
library(ggthemes)
options(timeout = 2400)
NYSdata <- vroom("https://www.nycourts.gov/LegacyPDFS/court-research/OCA-STAT-Act.csv")
new <- c("row_num", "court_type", "region", "district", "county", "court", "arresting_agency", "arrest_type", "arraign_year", "arraign_month", "top_charge_at_arraignment", "severity", "weight", "law", "article_section", "attempt_flag", "gender", "ethnicity", "race", "arrest_age", "docket_status", "disposition_type", "disposition_detail", "dismissal_reason", "most_severe_sentence", "fines_imposed", "fees_imposed", "surcharges_imposed")
names(NYSdata) <- new
NYSdata <- select(NYSdata, -c("row_num"))
NYSdata %>%
  filter(grepl("[[:alpha:]]+", x = race)) %>%
  ggplot(aes(x = race)) +
  geom_bar() +
  xlab("Court") +
  ylab("Number of People") +
  labs(title = "Racial Breakdown of New York State Courts") +
  theme_economist() +
  theme(plot.title = element_text(hjust = 0.5)) +
  geom_text(stat = 'count', aes(label = ..count..), vjust = -.3)
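(Also, I used
filter(grepl("[[:alpha:]]+", x = race))
to filter out the small number of records that have a numeric code for race instead of a word, but depending on your use case you may not want to do this)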
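To make the effect of that filter concrete, here is a small base R sketch of what the regular expression keeps and drops. The `race` values below are made up for illustration and are not taken from the dataset.

```r
# Hypothetical race values: some words, some numeric codes, one missing value.
race <- c("White", "Black", "1", "23", "American Indian/Alaska Native", NA)

# TRUE where the value contains at least one alphabetic character.
# grepl() returns FALSE for NA, so missing values are dropped too.
keep <- grepl("[[:alpha:]]+", race)

race[keep]   # the values that would survive the filter
table(keep)  # how many values are kept versus dropped
```

Checking `table(keep)` on the real column before filtering is a cheap way to see how many rows the filter would discard.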
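This is definitely an encoding problem. I re-saved the .csv on my MacBook as "CSV UTF-8", and it then loaded and plotted fine. I tried fixing the encoding within R, but that still didn't work:
library(dplyr)
library(utf8)  # provides utf8_encode()
NYSdata <- read.csv("OCA-STAT-Act.csv") %>%
  dplyr::mutate_if(is.character, utf8_encode)
# Also tried setting the encoding when reading in the csv.
NYSdata <- read.csv("OCA-STAT-Act.csv", encoding = "UTF-8")
This code gets you closer to what you want, but it still produces an odd figure (there is a lot of text at the bottom of the bar chart).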
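For reference, base R can also detect and re-encode invalid strings with `validUTF8()` and `iconv()`. The sketch below assumes the problem bytes are Latin-1, which is only a guess and has not been confirmed for this file; `fix_encoding` is a hypothetical helper and the sample values are invented.

```r
# Re-encode character columns that contain invalid UTF-8.
# fix_encoding() is a hypothetical helper; `from = "latin1"` is an assumption
# about the source encoding, not something known about the court data.
fix_encoding <- function(df, from = "latin1") {
  is_chr <- vapply(df, is.character, logical(1))
  df[is_chr] <- lapply(df[is_chr], function(col) {
    bad <- !is.na(col) & !validUTF8(col)  # only touch strings that are invalid
    col[bad] <- iconv(col[bad], from = from, to = "UTF-8")
    col
  })
  df
}

# Made-up example: "\xf1" and "\xe9" are Latin-1 bytes, not valid UTF-8.
raw_df <- data.frame(
  race = c("White", "Black", "Espa\xf1ol", "Caf\xe9"),
  n = 1:4,
  stringsAsFactors = FALSE
)

validUTF8(raw_df$race)          # shows which strings are malformed
fixed <- fix_encoding(raw_df)
validUTF8(fixed$race)           # every string should now be valid
```

If the file turns out to use a different encoding, the `from` argument would need to change accordingly.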
Using
read_csv
seems to resolve the encoding problem without having to specify an encoding.
library(readr)
NYSdata <- read_csv("OCA-STAT-Act.csv")
There are many entries on Stack Overflow about UTF-8 encoding. Here is one on handling it in a Shiny app: Input string 1 is invalid UTF-8 Shiny app