Converting New York State court data into an R data frame for visualization (it imports, but not as a proper data frame)

Question (votes: 0, answers: 2)

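I'm trying to automate pulling key data from the dataset on this site: https://ww2.nycourts.gov/oca-stat-act-31371. The data imports into an R data frame without complaint, but when I try to visualize it the plot is completely garbled. Likewise, when I try to run it in an R Shiny app, I get the error "input string 1 is invalid UTF-8". What should I do?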

Here is the code that imports and processes the data:

library(dplyr)     # select(), %>%
library(ggplot2)   # plotting
library(ggthemes)  # theme_economist()

NYSdata <- read.csv("https://www.nycourts.gov/LegacyPDFS/court-research/OCA-STAT-Act.csv")

new <- c("row_num", "court_type", "region", "district", "county", "court", "arresting_agency", "arrest_type", "arraign_year", "arraign_month", "top_charge_at_arraignment", "severity", "weight", "law", "article_section", "attempt_flag", "gender", "ethnicity", "race", "arrest_age", "docket_status", "disposition_type", "disposition_detail", "dismissal_reason", "most_severe_sentence", "fines_imposed", "fees_imposed", "surcharges_imposed")

names(NYSdata) <- new

NYSdata <- select(NYSdata, -c("row_num"))

Here is the code that visualizes the data:

NYSdata %>%
  ggplot(aes(x = race)) +
  geom_bar() +
  xlab("Court") +
  ylab("Number of People") +
  labs(title = "Racial Breakdown of New York State Courts") +
  theme_economist() +
  theme(plot.title = element_text(hjust = 0.5)) +
  geom_text(stat = 'count', aes(label = ..count..), vjust = -.3)

This is what I get when I run the code above:

[image: the resulting plot]

r shiny
2 Answers
2
votes

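The csv file is relatively large (~50 MB) and the site's download speed is relatively slow, so you may be hitting the "timeout" limit. Try raising the timeout and see whether you end up with an uncorrupted data file, e.g.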

library(tidyverse)
library(ggthemes)  # theme_economist()
options(timeout = 1200)  # allow up to 20 minutes for the download
NYSdata <- read.csv("https://www.nycourts.gov/LegacyPDFS/court-research/OCA-STAT-Act.csv")
new <- c("row_num", "court_type", "region", "district", "county", "court", "arresting_agency", "arrest_type", "arraign_year", "arraign_month", "top_charge_at_arraignment", "severity", "weight", "law", "article_section", "attempt_flag", "gender", "ethnicity", "race", "arrest_age", "docket_status", "disposition_type", "disposition_detail", "dismissal_reason", "most_severe_sentence", "fines_imposed", "fees_imposed", "surcharges_imposed")
names(NYSdata) <- new
NYSdata <- select(NYSdata, -c("row_num"))
NYSdata %>%
  ggplot(aes(x = race)) +
  geom_bar() +
  xlab("Court") +
  ylab("Number of People") +
  labs(title = "Racial Breakdown of New York State Courts") +
  theme_economist() +
  theme(plot.title = element_text(hjust = 0.5)) +
  geom_text(stat = 'count', aes(label = ..count..), vjust = -.3)
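
If you would rather not change the timeout for the whole session, one option is to raise it only around the download and restore it afterwards. The following is a minimal base-R sketch; `with_download_timeout` is a hypothetical helper name, not part of any package, and the commented usage assumes you want a local copy of the file.

```r
# Hypothetical helper: evaluate `expr` with a temporarily raised
# download timeout, then restore the previous setting.
with_download_timeout <- function(seconds, expr) {
  old <- options(timeout = seconds)   # options() returns the old values
  on.exit(options(old), add = TRUE)   # restore even if `expr` fails
  force(expr)                         # evaluate the lazy argument now
}

# Usage sketch (not run here because it needs network access):
# local_csv <- tempfile(fileext = ".csv")
# with_download_timeout(1200, download.file(
#   "https://www.nycourts.gov/LegacyPDFS/court-research/OCA-STAT-Act.csv",
#   destfile = local_csv, mode = "wb"
# ))
# NYSdata <- read.csv(local_csv)
```

Downloading to a local file first also makes it easy to check the file size before reading it, which may help confirm whether the download was cut short.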

Edit

A better option is to use the vroom package, e.g.

library(tidyverse)
library(vroom)
library(ggthemes)
options(timeout = 2400)
NYSdata <- vroom("https://www.nycourts.gov/LegacyPDFS/court-research/OCA-STAT-Act.csv")
new <- c("row_num", "court_type", "region", "district", "county", "court", "arresting_agency", "arrest_type", "arraign_year", "arraign_month", "top_charge_at_arraignment", "severity", "weight", "law", "article_section", "attempt_flag", "gender", "ethnicity", "race", "arrest_age", "docket_status", "disposition_type", "disposition_detail", "dismissal_reason", "most_severe_sentence", "fines_imposed", "fees_imposed", "surcharges_imposed")
names(NYSdata) <- new
NYSdata <- select(NYSdata, -c("row_num"))
NYSdata %>%
  filter(grepl("[[:alpha:]]+", x = race)) %>%
  ggplot(aes(x = race)) +
  geom_bar() +
  xlab("Court") + 
  ylab("Number of People") + 
  labs(title = "Racial Breakdown of New York State Courts") + 
  theme_economist() + 
  theme(plot.title = element_text(hjust = 0.5))+
  geom_text(stat='count', aes(label=..count..), vjust = -.3)

example_1.png

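(Also, I used

filter(grepl("[[:alpha:]]+", x = race))
to filter out the few records whose race is a numeric code rather than a word, but depending on your use case you may not want to do this.)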
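
To see what that filter keeps, here is a small base-R illustration on made-up values (the sample vector is an assumption, not taken from the court data):

```r
# Made-up sample of race values, including numeric codes and a missing value.
race <- c("White", "Black", "3", "Asian/Pacific Islander", "12", NA)

# TRUE where the value contains at least one alphabetic character;
# grepl() returns FALSE for NA, so missing values are dropped as well.
keep <- grepl("[[:alpha:]]+", race)

race[keep]
```

Note that missing values are removed along with the numeric codes, which may or may not be what you want.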


0
votes

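This is definitely an encoding issue. I re-saved the .csv as "CSV UTF-8" on my MacBook, and it then loaded and plotted fine. I tried setting the encoding in R, but it still didn't work properly: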

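library(dplyr)
library(utf8)  # provides utf8_encode()

# Attempt 1: re-encode every character column after reading.
NYSdata <- read.csv("OCA-STAT-Act.csv") %>%
  dplyr::mutate_if(is.character, utf8_encode)

# Attempt 2: declare the encoding when reading in the csv.
NYSdata <- read.csv("OCA-STAT-Act.csv", encoding = "UTF-8")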

This code gets you closer to what you want, but it still produces a strange figure (with a lot of text along the bottom of the bar chart).

Plot after assigning UTF-8 encoding in R
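
The "invalid UTF-8" error suggests that some strings contain bytes that are not valid UTF-8. Below is a minimal base-R sketch for detecting and re-encoding such strings. The Latin-1 source encoding and the sample value are assumptions for illustration; I have not confirmed what encoding the OCA-STAT-Act file actually uses.

```r
# Build a string holding the Latin-1 bytes for "Peña" (0xf1 is "ñ" in Latin-1).
# This sample value is made up for illustration.
latin1_name <- rawToChar(as.raw(c(0x50, 0x65, 0xf1, 0x61)))
x <- c("Queens", latin1_name)

# validUTF8() reports which elements consist of valid UTF-8 bytes.
bad <- !validUTF8(x)

# Re-encode only the offending elements, ASSUMING they are Latin-1.
fixed <- x
fixed[bad] <- iconv(x[bad], from = "latin1", to = "UTF-8")

all(validUTF8(fixed))
```

Relatedly, the `encoding` argument of read.csv only marks strings as being in a given encoding, whereas `fileEncoding` re-encodes the file as it is read, so `fileEncoding` may be the argument to try if you know the file's real encoding.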

Using

read_csv
seems to solve the problem without having to specify an encoding.

library(readr)

NYSdata <- read_csv("OCA-STAT-Act.csv")

Plot after using read_csv

There are many posts on Stack Overflow about UTF-8 encoding. This one covers handling it in a Shiny app: "input string 1 is invalid UTF-8 Shiny app"
