Extract text before a table

Question

I want to extract the subheading that sits one or two lines before a table in an XML file. For example, on this web page: https://en.wikipedia.org/wiki/Cost_database

There are several tables, and I can extract their headers using library(xml) and the R code provided at https://rud.is/b/2015/08/23/using-r-to-get-data-out-of-word-docs/

Now I just want to index the line above each table and get the corresponding text. Is there a good way to do this?

Answer 1

You can use the rvest package to get the first paragraph of a web page. SelectorGadget can help identify the correct element selectors on an HTML page.

library(rvest)

read_html("https://en.wikipedia.org/wiki/Cost_database") |> 
  html_element("p:nth-child(1)") |> 
  html_text2()

#> [1] "A cost database is a computerized database of cost estimating information, which is normally used with construction estimating software to support the formation of cost estimates. A cost database may also simply be an electronic reference of cost data."

Created on 2022-06-18 by the reprex package (v2.0.1)
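The snippet above returns only the first paragraph of the page. To get the element directly above each table, as the question asks, one option is an XPath `preceding-sibling` query. This is a minimal sketch on a hypothetical inline page (the HTML below is invented for illustration), and it assumes each table and its caption text are siblings. On real pages such as Wikipedia, tables are sometimes wrapped in extra containers, so the expression may need adjusting.

```r
library(rvest)

# Hypothetical inline page standing in for a real article
page <- minimal_html('
  <h2>Costs</h2>
  <p>Table 1 lists unit costs.</p>
  <table><tr><th>item</th></tr><tr><td>brick</td></tr></table>
  <p>Unrelated paragraph.</p>
  <h3>Labour rates</h3>
  <table><tr><th>trade</th></tr><tr><td>mason</td></tr></table>
')

# For every <table>, select the nearest preceding sibling element
before_tables <- page |>
  html_elements(xpath = "//table/preceding-sibling::*[1]") |>
  html_text2()

before_tables
#> [1] "Table 1 lists unit costs." "Labour rates"
```

Changing `*` to `p` or `h3` in the XPath restricts the match to the nearest preceding paragraph or heading instead of any element.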

To get the first paragraph of a Word document:

library(tidyverse)
library(officer)

# Create table
df <- tribble(
  ~col1, ~col2,
  "a", 1,
  "b", 2
)

# Create Word doc with para and table
example_doc <- read_docx() |> 
  body_add_par("Some text.") |> 
  body_add_table(df, style = "table_template")

# Save Word doc
print(example_doc, target = "example.docx")

# Read the doc
content <- read_docx("example.docx") |> 
  docx_summary()

# Get the text before the table
content |> 
  filter(doc_index == 1) |> 
  select(text)
#>         text
#> 1 Some text.

Created on 2022-06-18 by the reprex package (v2.0.1)
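The filter on `doc_index == 1` works because the paragraph is the first element in this particular document. For a document with several tables, a possible generalisation is to find where each table starts in the `docx_summary()` output and pick the nearest earlier paragraph. The sketch below assumes that all cells of a table share one `doc_index` and that `content_type` takes the values `"paragraph"` and `"table cell"`. The two-caption document is a made-up example.

```r
library(officer)
library(dplyr)

# Hypothetical document: two captions, each followed by a table
df <- data.frame(col1 = c("a", "b"), col2 = c(1, 2))
path <- tempfile(fileext = ".docx")

read_docx() |>
  body_add_par("First caption.") |>
  body_add_table(df, style = "table_template") |>
  body_add_par("Second caption.") |>
  body_add_table(df, style = "table_template") |>
  print(target = path)

content <- read_docx(path) |>
  docx_summary()

# One doc_index per table (assumes cells of a table share it)
table_starts <- content |>
  filter(content_type == "table cell") |>
  distinct(doc_index) |>
  pull(doc_index)

paragraphs <- content |>
  filter(content_type == "paragraph")

# For each table, take the text of the closest paragraph before it
text_before <- vapply(table_starts, function(i) {
  earlier <- paragraphs |> filter(doc_index < i)
  if (nrow(earlier) == 0) return(NA_character_)
  earlier$text[which.max(earlier$doc_index)]
}, character(1))

text_before
```

If two tables follow each other with no paragraph in between, this returns the same paragraph for both, which may or may not be what you want.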
