Converting New York State court data into an R data frame for visualization (it imports, but not as a proper data frame)

Question

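I am trying to automatically pull important data from the dataset on this website: https://ww2.nycourts.gov/oca-stat-act-31371. When I import the data, it loads into an R data frame without complaint, but when I try to visualize it, the plot is completely garbled. Likewise, when I try to run it in an R Shiny app, it gives the error "input string 1 is invalid UTF-8". What should I do?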

Here is the code that imports and processes the data:

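library(tidyverse)  # provides select(), %>% and ggplot()
library(ggthemes)   # provides theme_economist()

NYSdata <- read.csv("https://www.nycourts.gov/LegacyPDFS/court-research/OCA-STAT-Act.csv")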

new <- c("row_num", "court_type", "region", "district", "county", "court", "arresting_agency", "arrest_type", "arraign_year", "arraign_month", "top_charge_at_arraignment", "severity", "weight", "law", "article_section", "attempt_flag", "gender", "ethnicity", "race", "arrest_age", "docket_status", "disposition_type", "disposition_detail", "dismissal_reason", "most_severe_sentence", "fines_imposed", "fees_imposed", "surcharges_imposed")

names(NYSdata) <- new

NYSdata <- select(NYSdata, -c("row_num"))

Here is the code that visualizes the data:

NYSdata %>%
  ggplot(aes(x = race)) +
  geom_bar() +
  xlab("Court") +
  ylab("Number of People") +
  labs(title = "Racial Breakdown of New York State Courts") +
  theme_economist() +
  theme(plot.title = element_text(hjust = 0.5)) +
  geom_text(stat = 'count', aes(label = ..count..), vjust = -.3)

This is what I get when I run the code above:

Answer 1

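The CSV file is fairly large (~50 MB) and the website's download speed is fairly slow, so you may be hitting the "timeout" limit (60 seconds by default in R). Try increasing the timeout and see whether you end up with an uncorrupted data file, e.g.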

library(tidyverse)
library(ggthemes)  # provides theme_economist()
options(timeout = 1200)
NYSdata <- read.csv("https://www.nycourts.gov/LegacyPDFS/court-research/OCA-STAT-Act.csv")
new <- c("row_num", "court_type", "region", "district", "county", "court", "arresting_agency", "arrest_type", "arraign_year", "arraign_month", "top_charge_at_arraignment", "severity", "weight", "law", "article_section", "attempt_flag", "gender", "ethnicity", "race", "arrest_age", "docket_status", "disposition_type", "disposition_detail", "dismissal_reason", "most_severe_sentence", "fines_imposed", "fees_imposed", "surcharges_imposed")
names(NYSdata) <- new
NYSdata <- select(NYSdata, -c("row_num"))
NYSdata %>%
ggplot(aes(x = race)) + 
  geom_bar() + 
  xlab("Court") + 
  ylab("Number of People") + 
  labs(title = "Racial Breakdown of New York State Courts") + 
  theme_economist() + 
  theme(plot.title = element_text(hjust = 0.5))+
  geom_text(stat='count', aes(label=..count..), vjust = -.3)
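
One way to check whether a download was cut short is to save the file to disk first and confirm that every row has the same number of fields. The following is only a minimal sketch in base R: csv_looks_complete() is a hypothetical helper name, not part of any package, and it assumes a comma-separated file with double-quoted fields.

```r
# Hypothetical helper (not from any package): returns TRUE when every
# row of the CSV at `path` has the same number of comma-separated fields.
csv_looks_complete <- function(path) {
  # count.fields() returns one field count per line; lines that sit inside
  # a multi-line quoted field come back as NA, so those are dropped.
  n_fields <- count.fields(path, sep = ",", quote = "\"", comment.char = "")
  n_fields <- n_fields[!is.na(n_fields)]
  length(n_fields) > 0 && length(unique(n_fields)) == 1
}

# Usage sketch (commented out because it needs network access):
# options(timeout = 1200)
# dest <- tempfile(fileext = ".csv")
# download.file("https://www.nycourts.gov/LegacyPDFS/court-research/OCA-STAT-Act.csv",
#               dest, mode = "wb")
# csv_looks_complete(dest)
```

A FALSE result does not prove the download timed out, since a malformed row would also trigger it, but it is a cheap signal that the file is worth downloading again before plotting.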

Edit

A better option is to use the vroom package, e.g.

library(tidyverse)
library(vroom)
library(ggthemes)
options(timeout = 2400)
NYSdata <- vroom("https://www.nycourts.gov/LegacyPDFS/court-research/OCA-STAT-Act.csv")
new <- c("row_num", "court_type", "region", "district", "county", "court", "arresting_agency", "arrest_type", "arraign_year", "arraign_month", "top_charge_at_arraignment", "severity", "weight", "law", "article_section", "attempt_flag", "gender", "ethnicity", "race", "arrest_age", "docket_status", "disposition_type", "disposition_detail", "dismissal_reason", "most_severe_sentence", "fines_imposed", "fees_imposed", "surcharges_imposed")
names(NYSdata) <- new
NYSdata <- select(NYSdata, -c("row_num"))
NYSdata %>%
  filter(grepl("[[:alpha:]]+", x = race)) %>%
  ggplot(aes(x = race)) +
  geom_bar() +
  xlab("Court") + 
  ylab("Number of People") + 
  labs(title = "Racial Breakdown of New York State Courts") + 
  theme_economist() + 
  theme(plot.title = element_text(hjust = 0.5))+
  geom_text(stat='count', aes(label=..count..), vjust = -.3)

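(Also, I used filter(grepl("[[:alpha:]]+", x = race)) to drop the small number of records whose race is a numeric code rather than a word, but depending on your use case you may not want to do that.)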
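
To see what that filter keeps and drops, here is a small base R sketch on a made-up vector. The values are illustrative assumptions, not taken from the actual dataset.

```r
# Illustrative values only; the real column may contain different codes.
race <- c("White", "Black", "1", "Asian/Pacific Islander", "23", NA)

# TRUE for entries containing at least one letter; grepl() returns FALSE for NA.
keep <- grepl("[[:alpha:]]+", race)

kept    <- race[keep]   # the entries that contain letters
dropped <- race[!keep]  # the numeric codes and the NA

print(kept)
print(dropped)
```

Note that missing values are dropped along with the numeric codes, which may or may not be what you want when counting records.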

Answer 2

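This is definitely an encoding problem. I re-saved the .csv on my MacBook as "CSV UTF-8", and it then loaded and plotted fine. I tried fixing the encoding in R, but that still didn't work: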

library(dplyr)
library(utf8)  # provides utf8_encode()

NYSdata <- read.csv("OCA-STAT-Act.csv") %>%
  dplyr::mutate_if(is.character, utf8_encode)

# Also tried setting the encoding when reading in the CSV.
NYSdata <- read.csv("OCA-STAT-Act.csv", encoding = "UTF-8")

This code gets you closer to what you want, but it still produces a strange figure (with a lot of text along the bottom of the bar chart).

Using read_csv() seems to fix this without having to specify an encoding.

library(readr)

NYSdata <- read_csv("OCA-STAT-Act.csv")

There are many Stack Overflow posts about UTF-8 encoding. This one covers handling it in a Shiny app: Input string 1 is invalid UTF-8 Shiny app
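
If you would rather repair the strings inside R than re-save the file, one possible approach is to find the values that are not valid UTF-8 and convert only those. This is a sketch in base R, not a confirmed fix for this dataset: fix_encoding() is a hypothetical helper, and the assumption that the bad bytes are Latin-1 is a guess that you would need to verify against the file.

```r
# Hypothetical helper: re-encode invalid UTF-8 strings in character columns.
# ASSUMPTION: invalid values are encoded as `from` (Latin-1 by default).
fix_encoding <- function(df, from = "latin1") {
  is_chr <- vapply(df, is.character, logical(1))
  if (!any(is_chr)) return(df)
  df[is_chr] <- lapply(df[is_chr], function(col) {
    # validUTF8() is base R; only strings that fail the check are converted.
    bad <- !is.na(col) & !validUTF8(col)
    col[bad] <- iconv(col[bad], from = from, to = "UTF-8")
    col
  })
  df
}

# Small demonstration with a Latin-1 byte (0xE9, "e" with acute accent).
latin1_cafe <- rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xe9)))
demo <- data.frame(x = c("plain", latin1_cafe), n = 1:2,
                   stringsAsFactors = FALSE)
fixed <- fix_encoding(demo)
all(validUTF8(fixed$x))
```

Running something like this before passing the data frame to Shiny may avoid the "invalid UTF-8" error, provided the source encoding guess is right.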