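I have this figure at this URL: "Who is granted asylum in the US?"
My goal is to extract the data from that figure (as a data frame).
I don't know how to do this. Here is my attempt: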
library(rvest)
html_url<-read_html("https://www.statista.com/chart/25619/asylum-grants-in-the-us-by-nationality/")
html_url %>% html_elements(xpath = "//*[contains(@class, 'image')]")
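But I got nowhere.
This is very specific to your example. Extracting some text from an image is the "easy" part. As @Wimpel already said, extracting reliable data from an image, or from the text inside it, is very hard. How would you know which kind of chart the figure represents? There are of course digitizing tools for scatter plots or other point-based charts, such as digitize. In general, though, you are better off digging up the underlying data directly.
Nevertheless, I built this code for your specific example.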
library(tesseract)  # OCR
library(rvest)      # HTML scraping; also provides %>%
library(magick)     # image loading and preprocessing
# Read the webpage and collect the image URLs that look like infographics
html_url <- read_html("https://www.statista.com/chart/25619/asylum-grants-in-the-us-by-nationality/")
image_url <- html_url %>% html_elements("img") %>% html_attr("src")
graphics <- image_url[grepl("Infographic", image_url)]
# Download the image (the URL is hardcoded; `graphics` holds the matches scraped above but is not used below)
download.file("https://cdn.statcdn.com/Infographic/images/normal/25619.jpeg", destfile = "chart_image.jpeg", mode = "wb")
# Load and preprocess the image: resize and convert to grayscale
img <- image_read("chart_image.jpeg") %>%
  image_resize("800x800") %>%
  image_convert(colorspace = "gray")
# Save the processed image and apply OCR
image_write(img, "processed_image.png")
text <- tesseract::ocr("processed_image.png")
text_to_asylum_df <- function(text) {
  # Split the OCR text into lines
  lines <- strsplit(text, "\n")[[1]]
  # Keep only lines that contain at least one digit
  data_lines <- lines[grepl("[0-9]", lines)]
  # Extract country and number from each line
  asylum_data <- lapply(data_lines, function(line) {
    # Country: the leading run of letters and spaces
    country <- gsub("^([A-Za-z ]+).*$", "\\1", line)
    country <- trimws(country)
    # Number: every digit on the line, concatenated ("," and "." are dropped)
    number <- gsub("[^0-9,.]", "", line)
    number <- gsub("[,.]", "", number)
    number <- as.numeric(number)
    c(country = country, granted = number)
  })
  # Convert to a data frame (both columns are text at this point)
  df <- as.data.frame(do.call(rbind, asylum_data))
  # Convert the granted column to numeric
  df$granted <- as.numeric(as.character(df$granted))
  # Store the year as an attribute
  attr(df, "year") <- 2022
  df
}
# Create the dataframe
asylum_df <- text_to_asylum_df(text)
# View the result
print(asylum_df)
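Because `text_to_asylum_df` glues together every digit it finds on a line, a single stray digit from the OCR step changes the resulting count. A minimal sketch of a stricter variant is shown below. It assumes the count is always the last numeric token on the line; `parse_ocr_line` is a hypothetical helper name, and the sample lines are made up for illustration, not real tesseract output.

```r
# Hypothetical helper: parse one OCR line into a country name and a count.
# Assumption: the count is the LAST numeric token on the line.
parse_ocr_line <- function(line) {
  # Find every run of digits, allowing "," or "." as thousands separators
  tokens <- regmatches(line, gregexpr("[0-9][0-9.,]*", line))[[1]]
  if (length(tokens) == 0) {
    return(NULL)  # no number on this line, so skip it
  }
  # Keep only the last token; earlier digits are treated as OCR noise
  last_token <- tokens[length(tokens)]
  granted <- as.numeric(gsub("[.,]", "", last_token))
  # Country: the leading run of letters and spaces
  country <- trimws(sub("^([A-Za-z ]+).*$", "\\1", line))
  data.frame(country = country, granted = granted, stringsAsFactors = FALSE)
}

# Made-up lines for illustration only
sample_lines <- c("Guatemala 2,329", "india 2 2,203", "no digits here")
rows <- Filter(Negate(is.null), lapply(sample_lines, parse_ocr_line))
parsed <- do.call(rbind, rows)
print(parsed)
```

With the original approach the second sample line would yield 22203, while this variant keeps only the trailing `2,203`. Whether that matches what actually happened in the OCR run is a guess, so the values should still be checked against the chart by eye.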
As the output below shows, tesseract does not even recognize China and Venezuela.
Output:
> print(asylum_df)
           country granted
1  asylum in the U    2022
2 El Salvador S TS    2639
3        Guatemala    2329
4            india   22203
5         Honduras    1829
6      Afghanistan    1493
7           turkey    1228
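The result still needs cleaning: the first row is a fragment of the chart title, "El Salvador" carries trailing noise, and the casing is inconsistent. One possible cleanup, sketched below, matches each OCR string against a hand-written list of expected country names and drops anything that matches none. The `expected` list and the `match_country` helper are assumptions made for this sketch, and the data frame is retyped from the output above rather than produced by a fresh OCR run. The `suspect` flag additionally assumes the chart is sorted in descending order, which the other rows suggest but which is not confirmed.

```r
# Retyped from the printed output above
asylum_df <- data.frame(
  country = c("asylum in the U", "El Salvador S TS", "Guatemala", "india",
              "Honduras", "Afghanistan", "turkey"),
  granted = c(2022, 2639, 2329, 22203, 1829, 1493, 1228),
  stringsAsFactors = FALSE
)

# Assumption: a hand-written list of the countries expected in the chart
expected <- c("China", "Venezuela", "El Salvador", "Guatemala", "India",
              "Honduras", "Afghanistan", "Turkey")

# Hypothetical helper: return the expected name that the OCR string starts
# with (case-insensitive), or NA if none matches
match_country <- function(raw, candidates) {
  hit <- candidates[startsWith(tolower(raw), tolower(candidates))]
  if (length(hit) == 0) NA_character_ else hit[1]
}

asylum_df$country <- vapply(asylum_df$country, match_country, character(1),
                            candidates = expected, USE.NAMES = FALSE)

# Drop rows that matched no expected country (here: the title fragment)
clean_df <- asylum_df[!is.na(asylum_df$country), ]
rownames(clean_df) <- NULL

# Flag values larger than the row above them (assumes descending order)
clean_df$suspect <- c(FALSE, diff(clean_df$granted) > 0)
print(clean_df)
```

This does not recover China or Venezuela, which never made it into the OCR text, and it only flags the India value as questionable without correcting it. Both still have to be read off the chart manually.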