Skipping rows when reading multiple Excel sheets in R

Question · Votes: 3 · Answers: 2

I am using the readxl library to read many Excel sheets from the same workbook (called data.xlsx). Each sheet has the following layout:

The data starts at row 3.

  row1
  row2
 companyName   1980    1981    1982 ... 2016
 company1       5       6       7        8
 company2       10      20      30       40
 company3       20      40      60       80
 ....

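The data range differs in length from sheet to sheet, in both rows and columns. However, all sheets share companyName as a common key. The year range varies: it starts in 1980 or 1990 and runs to 2016. Each sheet name is the name of the data it holds.

I would like to create one Excel file that gathers all the data from all the sheets: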

 companyName   Year   dataname     values
 company1      1980   sheetname1     5
 company1      1981   sheetname1     6
 company1      1982   sheetname1     7
 company1      ...    sheetname1     ...
 company1      2016   sheetname1     8
 company2      1980   sheetname1     10
 company2      1981   sheetname1     20
 company2      1982   sheetname1     30
 company2      ...    sheetname1     ...
 company2      2016   sheetname1     40
 ....          ....     ...           ...
 company1      2000    sheetname2     xxx
 company1      2001    sheetname2     yyy
  etc
  etc
  etc

This is as far as I have managed to get:

  library(tidyverse)
  library(readxl)
  library(data.table)

   # read excel file (approach adapted from the post cited at the end of the question)
   file.list <- "data.xlsx"

   # read all sheets (and skip the first two rows)

   df.list <- lapply(file.list,function(x) {
     sheets <- excel_sheets(x)
     dfs <- lapply(sheets, function(y) {
       read_excel(x, sheet = y,skip=2)
       })
     names(dfs) <- sheets
     dfs
   })

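I have the following problems:

  • The first two rows are not being skipped.
  • How can I create a data frame that contains only selected sheets (e.g. sheet 5, sheet 10 and sheet 15)?

Thank you for your help.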

Source: R: reading multiple excel files, extract first sheet names, and create new column

r
2 Answers
3
votes

I just removed one level of nesting from df.list.

df.list <- lapply(file.list, function(x) {
  sheets <- excel_sheets(x)
  dfs <- lapply(sheets, function(y) {
    read_excel(x, sheet = y, skip = 2)
  })
  names(dfs) <- sheets
  dfs
})[[1]]  # only one workbook, so keep its list of sheets directly

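This works for me. I cannot reproduce your problem with skip. Also, if the rows are simply empty, read_excel() should skip them by default: readxl drops leading empty rows automatically and treats skip as a minimum number of rows to skip (trim_ws = TRUE only trims leading and trailing whitespace inside cell values).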
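One way to check what readxl actually sees at the top of a sheet is to read a few rows without treating any of them as a header. The sketch below assumes nothing about data.xlsx; `peek_top` is a hypothetical helper name, and the demo uses a workbook that ships with readxl rather than the asker's file.

```r
library(readxl)

# Hypothetical helper (not from the original post): return the first `n` rows
# of a sheet as readxl sees them, without treating any row as a header.
peek_top <- function(path, sheet = 1, n = 5) {
  read_excel(path, sheet = sheet, col_names = FALSE, n_max = n)
}

# Demo on a workbook bundled with readxl; use "data.xlsx" in practice.
demo_path <- readxl_example("datasets.xlsx")
top <- peek_top(demo_path, sheet = "iris", n = 3)
print(top)
```

If the first row printed for a sheet of data.xlsx is already the companyName header, the two rows above it were empty and readxl dropped them on its own, so skip = 2 changes nothing.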

I use the following list to demonstrate what to do after the import.

df.list <- structure(list(sheetname1 = structure(list(companyName = c("company1", 
"company2", "company3"), `1980` = c(5, 10, 40), `1981` = c(6, 
20, 50), `1982` = c(7, 30, 60)), .Names = c("companyName", "1980", 
"1981", "1982"), row.names = c(NA, -3L), class = c("tbl_df", 
"tbl", "data.frame")), sheetname2 = structure(list(companyName = c("company1", 
"company2", "company3"), `1980` = c(6, 11, 42), `1981` = c(7, 
21, 52), `1982` = c(8, 31, 62)), .Names = c("companyName", "1980", 
"1981", "1982"), row.names = c(NA, -3L), class = c("tbl_df", 
"tbl", "data.frame")), sheetname3 = structure(list(companyName = c("company1", 
"company2", "company3"), `1990` = c(8, 12, 43), `1991` = c(9, 
22, 53), `1992` = c(10, 32, 63)), .Names = c("companyName", "1990", 
"1991", "1992"), row.names = c(NA, -3L), class = c("tbl_df", 
"tbl", "data.frame"))), .Names = c("sheetname1", "sheetname2", 
"sheetname3"))

The following works whether the years start in 1980 or in 1990.

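dat <- lapply(df.list, function(x) {
  # reshape one sheet to long format: every column except companyName
  # is a year column, whichever year the sheet starts with
  x %>% gather(year, values, -companyName)
}) %>% enframe() %>% unnest(cols = value)  # tidyr >= 1.0 expects `cols`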

dat

# # A tibble: 27 x 4
#    name       companyName year  values
#    <chr>      <chr>       <chr>  <dbl>
#  1 sheetname1 company1    1980      5.
#  2 sheetname1 company2    1980     10.
#  3 sheetname1 company3    1980     40.
#  4 sheetname1 company1    1981      6.
#  5 sheetname1 company2    1981     20.
#  6 sheetname1 company3    1981     50.
#  7 sheetname1 company1    1982      7.
#  8 sheetname1 company2    1982     30.
#  9 sheetname1 company3    1982     60.
# 10 sheetname2 company1    1980      6.
# # ... with 17 more rows
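In tidyr 1.0 and later, gather() is superseded by pivot_longer(), and the list of sheets can be stacked with dplyr::bind_rows(). The following is a minimal sketch of the same reshaping under that assumption; the `sheets` list is a made-up stand-in for the imported data, not the asker's workbook.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the imported list; the values are made up for illustration.
sheets <- list(
  sheetname1 = tibble(companyName = c("company1", "company2"),
                      `1980` = c(5, 10), `1981` = c(6, 20)),
  sheetname3 = tibble(companyName = c("company1", "company2"),
                      `1990` = c(8, 12), `1991` = c(9, 22))
)

long <- lapply(sheets, function(x) {
  # every column except companyName is treated as a year column
  pivot_longer(x, cols = -companyName, names_to = "year", values_to = "values")
}) %>%
  bind_rows(.id = "name") %>%       # the list names become the `name` column
  mutate(year = as.integer(year))   # years arrive as column names, i.e. text

print(long)
```

Converting year to integer is optional; it only matters if the years are later sorted or compared numerically.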

You can now use dplyr::filter() on the sheet name to pick out specific sheets.

For example:

dat %>% filter(name == "sheetname1")

#   name       companyName year  values
#   <chr>      <chr>       <chr>  <dbl>
# 1 sheetname1 company1    1980      5.
# 2 sheetname1 company2    1980     10.
# 3 sheetname1 company3    1980     40.
# 4 sheetname1 company1    1981      6.
# 5 sheetname1 company2    1981     20.
# 6 sheetname1 company3    1981     50.
# 7 sheetname1 company1    1982      7.
# 8 sheetname1 company2    1982     30.
# 9 sheetname1 company3    1982     60.
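If only a few sheets are needed (the question mentions sheets 5, 10 and 15), an alternative to filtering afterwards is to read only those sheets in the first place. This is a sketch, not part of the original answer; `read_selected_sheets` is a hypothetical helper name, and it assumes the sheets are chosen by position.

```r
library(readxl)

# Hypothetical helper (not from the original post): read only the sheets at
# the given positions, skipping at least `skip` rows at the top of each one.
read_selected_sheets <- function(path, positions, skip = 2) {
  sheets <- excel_sheets(path)
  # fail early if a requested position does not exist in the workbook
  stopifnot(all(positions %in% seq_along(sheets)))
  wanted <- sheets[positions]
  dfs <- lapply(wanted, function(s) read_excel(path, sheet = s, skip = skip))
  names(dfs) <- wanted
  dfs
}

# Usage, assuming data.xlsx has at least 15 sheets:
# df.list <- read_selected_sheets("data.xlsx", c(5, 10, 15))
```

The returned list has the same shape as df.list above, so the reshaping step can be applied to it unchanged.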

2
votes

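I suggest using the openxlsx package, which lets you specify startRow, together with melt from the reshape2 package, which makes it easy to change a data frame to long format.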

library(openxlsx)
library(reshape2)

first.Row <- 6 # supposing the data starts at row 6
sheets.2.read <- loadWorkbook(file.list)$sheet_names # retrieving the sheet names
df <- data.frame()
for(tmp.sheet in sheets.2.read){
  tmp.dat <- read.xlsx(file.list, sheet = tmp.sheet, startRow = first.Row, colNames = TRUE)
  tmp.dat <- cbind(melt(tmp.dat, id.vars = "companyName"), tmp.sheet)
  df <- rbind(df, tmp.dat)
}

Here is my output with some dummy data (only 10 rows printed):

> df[c(1:3, 50:53, 300:302),]
    company.name variable     value tmp.sheet
1          comp7     1968 0.3359298    Sheet1
2          comp8     1968 0.3359298    Sheet1
3          comp9     1968 0.3359298    Sheet1
50        comp16     1966 0.3359298    Sheet2
51        comp17     1966 0.3359298    Sheet2
52        comp18     1966 0.3359298    Sheet2
53        comp19     1966 0.3359298    Sheet2
300       comp16     2000 0.3359298    Sheet3
301       comp17     2000 0.3359298    Sheet3
302       comp18     2000 0.3359298    Sheet3
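The loop above grows df with rbind() on every iteration. A variant that collects the pieces in a list and binds them once could look like the sketch below. `read_all_long` is a hypothetical helper name, and the default first row of 3 and the `companyName` id column are assumptions taken from the question's layout.

```r
library(openxlsx)
library(reshape2)

# Hypothetical helper (not from the original answer): read every sheet from
# `first_row` on, melt it to long format, and bind all sheets together once.
read_all_long <- function(path, first_row = 3, id = "companyName") {
  sheet_names <- getSheetNames(path)
  pieces <- lapply(sheet_names, function(s) {
    wide <- read.xlsx(path, sheet = s, startRow = first_row, colNames = TRUE)
    long <- melt(wide, id.vars = id, variable.name = "year")
    long$dataname <- s   # remember which sheet each row came from
    long
  })
  do.call(rbind, pieces)
}

# Usage, assuming the header row of every sheet in data.xlsx is row 3:
# df <- read_all_long("data.xlsx")
```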