从网站转换文本文件到表格中

问题描述 投票:0回答:1

我正在尝试将包含文本表的 htm 文件转换为数据框。我浏览了之前的问题herehere,但没有解决我的问题。桌子乱了。

遵循可重现的示例。

# URL of the website 
  url <- paste0("https://www.uspto.gov/web/offices/ac/ido/oeip/taf/st_co_93.htm") 
  
# Read the HTML code of the page regarding the interested table
  html_code <- paste(readLines(url))[44:99]
  
# Transform text into a table
  table_df <- read.table(text = html_code, skip = 3, fill = NA, 
                         col.names = c("CODE", "STATE/COUNTRY", "UTILITY", "DESIGN", "PLANT","REISSUE","TOTALS","SIRS"))
  
html r dataframe parsing text
1个回答
0
投票

恐怕你的行索引有点不对劲。它仍然是一个 html 文档,可见文本内容从第 12 行和第一个表开始,没有标题和空行,涵盖第 47 .. 101 行。因此,如果您同意硬编码索引,则应该这样做:

library(readr)
l <- read_lines("https://www.uspto.gov/web/offices/ac/ido/oeip/taf/st_co_93.htm")

# check a section where first table starts
stringr::str_view(l)[40:50]
#> [40] │ 
#> [41] │ 
#> [42] │ STATE-COUNTRY COUNTS FROM CALENDAR YEAR 1993 PATENT FILE
#> [43] │ 
#> [44] │ MAIL
#> [45] │ CODE   STATE/COUNTRY    UTILITY   DESIGN   PLANT   REISSUE   SIRS    TOTALS
#> [46] │ 
#> [47] │ AL     ALABAMA          271       51       1       1         0       324     
#> [48] │ AK     ALASKA           50        14       0       0         0       64      
#> [49] │ AZ     ARIZONA          848       75       0       5         1       929     
#> [50] │ AR     ARKANSAS         114       36       4       0         0       154

# a fixed width file and readr::read_fwf() provides convenient column width guessing
l[47:101] |> 
  I() |> 
  read_fwf() |> 
  setNames(strsplit(paste(l[44:45], collapse = "_"), "\\s+")[[1]])
#> Rows: 55 Columns: 8
#> ── Column specification ────────────────────────────────────────────────────────
#> 
#> chr (2): X1, X2
#> dbl (6): X3, X4, X5, X6, X7, X8
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

结果:

#> # A tibble: 55 × 8
#>    MAIL_CODE `STATE/COUNTRY` UTILITY DESIGN PLANT REISSUE  SIRS TOTALS
#>    <chr>     <chr>             <dbl>  <dbl> <dbl>   <dbl> <dbl>  <dbl>
#>  1 AL        ALABAMA             271     51     1       1     0    324
#>  2 AK        ALASKA               50     14     0       0     0     64
#>  3 AZ        ARIZONA             848     75     0       5     1    929
#>  4 AR        ARKANSAS            114     36     4       0     0    154
#>  5 CA        CALIFORNIA         8170   1211   159      28    11   9579
#>  6 CO        COLORADO            910    140     0       4     3   1057
#>  7 CT        CONNECTICUT        1544    194     0       8     3   1749
#>  8 DE        DELAWARE            507     13     1       0     4    525
#>  9 FL        FLORIDA            1777    316    12       9     3   2117
#> 10 GA        GEORGIA             705    154     0       2     1    862
#> # ℹ 45 more rows

创建于 2024 年 11 月 14 日,使用 reprex v2.1.1

© www.soinside.com 2019 - 2024. All rights reserved.