Extracting data from a chart on a web page


I have this figure at this URL: Who is granted asylum in the U.S.?

My goal is to extract the data from that chart (as a data frame).

I don't know how to do this. Here is my attempt:

library(rvest)
html_url <- read_html("https://www.statista.com/chart/25619/asylum-grants-in-the-us-by-nationality/")
html_url %>% html_elements(xpath = "//*[contains(@class, 'image')]")

But I got nowhere.

r url
1 Answer

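This is very specific to your example. Extracting some text from the image is the "easy" part. As @Wimpel already said, extracting reliable data from an image, or from the text inside it, is very hard. How would you know what kind of chart the figure represents? Granted, there are digitizing tools for scatter plots and other point-based charts, such as digitize. In general, though, you are better off going straight to the underlying data. Nevertheless, I built this code for your specific example.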

library(tesseract)  # OCR
library(rvest)      # read and query the web page
library(dplyr)      # provides the %>% pipe
library(magick)     # image preprocessing
# Read the webpage
html_url <- read_html("https://www.statista.com/chart/25619/asylum-grants-in-the-us-by-nationality/")

image_url <- html_url %>% html_elements("img") %>% html_attr("src")

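# Candidate infographic URLs found on the page (kept for inspection only)
graphics <- image_url[grepl("Infographic", image_url)]
# Download the image from its direct URL. The file is a JPEG even though it is
# saved with a .png name; image_read() detects the format from the file content.
download.file("https://cdn.statcdn.com/Infographic/images/normal/25619.jpeg", destfile = "chart_image.png", mode = "wb")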

# Load and preprocess image
img <- image_read("chart_image.png") %>%
  image_resize("800x800") %>%
  image_convert(colorspace = "gray")

# Save processed image and apply OCR
image_write(img, "processed_image.png")
text <- tesseract::ocr("processed_image.png")

text_to_asylum_df <- function(text) {
  # Split text into lines
  lines <- strsplit(text, "\n")[[1]]
  
  # Keep only lines that contain a digit (drops empty lines and most header/footer text)
  data_lines <- lines[grepl("[0-9]", lines)]
  
  # Extract country and number using regex
  asylum_data <- lapply(data_lines, function(line) {
    # Extract country (leading run of letters and spaces)
    country <- gsub("^([A-Za-z ]+).*$", "\\1", line)
    country <- trimws(country)
    
    # Extract number (digits, possibly with comma or period)
    number <- gsub("[^0-9,.]", "", line)
    number <- gsub(",", "", number)
    number <- gsub("\\.", "", number)
    number <- as.numeric(number)
    
    return(c(country = country, granted = number))
  })
  
  # Convert to dataframe
  df <- as.data.frame(do.call(rbind, asylum_data))
  
  # Convert granted column to numeric
  df$granted <- as.numeric(as.character(df$granted))
  
  # Add year as attribute
  attr(df, "year") <- 2022
  
  return(df)
}

# Create the dataframe
asylum_df <- text_to_asylum_df(text)

# View the result
print(asylum_df)
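
One caveat about the number extraction in text_to_asylum_df: it keeps every digit on a line, so a line with two separate numbers gets them fused into one value. The sketch below shows this on made-up OCR lines (the strings are illustrative assumptions, not actual tesseract output) and compares it with a hypothetical helper, last_number, that keeps only the last number-like token. Like the code above, it assumes both "." and "," are thousands separators.

```r
# Hypothetical OCR lines, invented purely to illustrate the parsing behaviour
lines <- c("Guatemala 2,329", "El Salvador 2.639", "Top 5 countries 1,200")

# Same effect as the gsub() chain above: keep every digit on the line
all_digits <- as.numeric(gsub("[^0-9]", "", lines))

# Alternative (hypothetical helper): use only the last number-like token
last_number <- function(line) {
  tokens <- regmatches(line, gregexpr("[0-9][0-9.,]*", line))[[1]]
  if (length(tokens) == 0) return(NA_real_)
  # Treat "." and "," as thousands separators, as the code above does
  as.numeric(gsub("[.,]", "", tokens[length(tokens)]))
}
last_only <- vapply(lines, last_number, numeric(1), USE.NAMES = FALSE)

print(data.frame(line = lines, all_digits = all_digits, last_only = last_only))
```

On the third line the first approach fuses "5" and "1,200" into 51200, while the last-token variant returns 1200. Which behaviour is right depends on how the chart text is laid out.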

As you can see, tesseract does not even recognize China and Venezuela.
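
For the rows that were missed, one thing that may be worth trying is enlarging the image before OCR instead of shrinking it to 800x800, and then looking at word-level confidence scores. The following is only a sketch under that assumption: ocr_words is a hypothetical helper name, the width of 2000 pixels and the confidence cut-off of 60 are arbitrary choices, and whether this actually recovers China and Venezuela has not been verified.

```r
library(magick)
library(tesseract)

# Hypothetical helper: upscale, convert to grayscale, return word-level OCR results
ocr_words <- function(path, width = 2000) {
  img <- image_read(path)
  img <- image_resize(img, paste0(width, "x"))   # enlarge, keeping the aspect ratio
  img <- image_convert(img, type = "Grayscale")
  processed <- tempfile(fileext = ".png")
  image_write(img, path = processed, format = "png")
  # ocr_data() returns one row per recognized word, with a confidence score
  ocr_data(processed, engine = tesseract("eng"))
}

# Only runs if the chart image from the code above is present
if (file.exists("chart_image.png")) {
  words <- ocr_words("chart_image.png")
  print(words[words$confidence < 60, ])  # arbitrary cut-off for a manual check
}
```

Low-confidence words are good candidates for manual correction, which is usually quicker than tuning the OCR further for a single chart.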

Output of print(asylum_df):

> print(asylum_df)
           country granted
1  asylum in the U    2022
2 El Salvador S TS    2639
3        Guatemala    2329
4            india   22203
5         Honduras    1829
6      Afghanistan    1493
7           turkey    1228
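
The raw result still needs cleaning: the first row comes from the chart title, one country name carries trailing OCR debris, and the capitalization is inconsistent. A minimal sketch of such a cleanup follows. It rebuilds the data frame from the printed output, matches names against a hand-written reference list (the `known` vector is an assumption, not something read from the page), and flags values that break a descending order (assuming the chart lists its bars from largest to smallest).

```r
# Data frame rebuilt from the printed output above
asylum_df <- data.frame(
  country = c("asylum in the U", "El Salvador S TS", "Guatemala", "india",
              "Honduras", "Afghanistan", "turkey"),
  granted = c(2022, 2639, 2329, 22203, 1829, 1493, 1228),
  stringsAsFactors = FALSE
)

# Assumed reference list of country names expected in the chart
known <- c("China", "Venezuela", "El Salvador", "Guatemala", "India",
           "Honduras", "Afghanistan", "Turkey")

# Return the longest known name that the OCR string starts with, else NA
match_country <- function(x, known) {
  hits <- known[startsWith(rep(tolower(x), length(known)), tolower(known))]
  if (length(hits) == 0) NA_character_ else hits[which.max(nchar(hits))]
}

asylum_df$country_clean <- vapply(asylum_df$country, match_country,
                                  character(1), known = known, USE.NAMES = FALSE)

# Drop rows that match no known country (here: the title row)
clean_df <- asylum_df[!is.na(asylum_df$country_clean), c("country_clean", "granted")]
rownames(clean_df) <- NULL

# Flag values larger than the row before them (assumes descending bars)
clean_df$check_value <- c(FALSE, diff(clean_df$granted) > 0)

print(clean_df)
```

With the output above, this drops the title row and flags the India value for a manual check against the chart. It cannot restore the China and Venezuela rows, which never made it into the OCR text.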