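I am trying to convert an htm file that contains a plain-text table into a data frame. I went through the earlier questions here and here, but they did not solve my problem: the resulting table is scrambled.
A reproducible example follows.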
# URL of the website
url <- paste0("https://www.uspto.gov/web/offices/ac/ido/oeip/taf/st_co_93.htm")
# Read the lines of the page's HTML source that hold the table of interest
html_code <- paste(readLines(url))[44:99]
# Transform text into a table
table_df <- read.table(text = html_code, skip = 3, fill = NA,
col.names = c("CODE", "STATE/COUNTRY", "UTILITY", "DESIGN", "PLANT","REISSUE","TOTALS","SIRS"))
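I'm afraid your line indices are slightly off. The file is still an HTML document: the visible text content starts at line 12, and the first table, excluding its header and the blank lines, spans lines 47 .. 101. So if you are fine with hard-coded indices, this should do: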
library(readr)
l <- read_lines("https://www.uspto.gov/web/offices/ac/ido/oeip/taf/st_co_93.htm")
# check a section where first table starts
stringr::str_view(l)[40:50]
#> [40] │
#> [41] │
#> [42] │ STATE-COUNTRY COUNTS FROM CALENDAR YEAR 1993 PATENT FILE
#> [43] │
#> [44] │ MAIL
#> [45] │ CODE STATE/COUNTRY UTILITY DESIGN PLANT REISSUE SIRS TOTALS
#> [46] │
#> [47] │ AL ALABAMA 271 51 1 1 0 324
#> [48] │ AK ALASKA 50 14 0 0 0 64
#> [49] │ AZ ARIZONA 848 75 0 5 1 929
#> [50] │ AR ARKANSAS 114 36 4 0 0 154
# the table is fixed-width text, and readr::read_fwf() conveniently guesses the column widths
l[47:101] |>
I() |>
read_fwf() |>
setNames(strsplit(paste(l[44:45], collapse = "_"), "\\s+")[[1]])
#> Rows: 55 Columns: 8
#> ── Column specification ────────────────────────────────────────────────────────
#>
#> chr (2): X1, X2
#> dbl (6): X3, X4, X5, X6, X7, X8
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Result:
#> # A tibble: 55 × 8
#> MAIL_CODE `STATE/COUNTRY` UTILITY DESIGN PLANT REISSUE SIRS TOTALS
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 AL ALABAMA 271 51 1 1 0 324
#> 2 AK ALASKA 50 14 0 0 0 64
#> 3 AZ ARIZONA 848 75 0 5 1 929
#> 4 AR ARKANSAS 114 36 4 0 0 154
#> 5 CA CALIFORNIA 8170 1211 159 28 11 9579
#> 6 CO COLORADO 910 140 0 4 3 1057
#> 7 CT CONNECTICUT 1544 194 0 8 3 1749
#> 8 DE DELAWARE 507 13 1 0 4 525
#> 9 FL FLORIDA 1777 316 12 9 3 2117
#> 10 GA GEORGIA 705 154 0 2 1 862
#> # ℹ 45 more rows
Created on 2024-11-14 with reprex v2.1.1
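If you would rather not hard-code the indices, one possible alternative is to locate the header line and then keep the run of lines that look like table rows. The following is only a minimal base R sketch, not the readr approach above: the inline `lines` vector is a hand-made excerpt with assumed spacing (only the four rows shown earlier), and `row_pattern`, `proto` and the column name `STATE_COUNTRY` are names I made up for illustration. The real page may contain rows this pattern does not cover.

```r
# Hand-made excerpt standing in for readLines(url); the spacing is an assumption
lines <- c(
  "<pre>",
  "     MAIL",
  "     CODE  STATE/COUNTRY   UTILITY  DESIGN  PLANT  REISSUE  SIRS  TOTALS",
  "",
  "      AL   ALABAMA             271      51      1        1     0     324",
  "      AK   ALASKA               50      14      0        0     0      64",
  "      AZ   ARIZONA             848      75      0        5     1     929",
  "      AR   ARKANSAS            114      36      4        0     0     154",
  "",
  "</pre>"
)

# A data row: two-letter code, a name (may contain spaces), then six integers
row_pattern <- paste0(
  "^\\s*([A-Z]{2})\\s+(.+?)",
  strrep("\\s+(\\d+)", 6),
  "\\s*$"
)

# Locate the header line, then look only at what follows it
header_idx <- grep("CODE\\s+STATE/COUNTRY", lines)[1]
after <- lines[(header_idx + 1):length(lines)]
is_row <- grepl(row_pattern, after, perl = TRUE)

# Keep the first consecutive run of matching lines (the first table)
first <- which(is_row)[1]
stop_at <- which(!is_row & seq_along(is_row) > first)[1]
if (is.na(stop_at)) stop_at <- length(after) + 1
rows <- after[first:(stop_at - 1)]

# Prototype data frame: defines the column names and types of the result
proto <- data.frame(
  MAIL_CODE = character(), STATE_COUNTRY = character(),
  UTILITY = integer(), DESIGN = integer(), PLANT = integer(),
  REISSUE = integer(), SIRS = integer(), TOTALS = integer()
)
df <- utils::strcapture(row_pattern, rows, proto, perl = TRUE)

# Sanity check that holds for the rows shown here: the parts add up to TOTALS
stopifnot(with(df, UTILITY + DESIGN + PLANT + REISSUE + SIRS == TOTALS))
df
```

The sum check at the end is a cheap way to notice swapped columns, for example if SIRS and TOTALS end up in the wrong order.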