I am using the readxl library to read many Excel sheets from a single workbook (called data.xlsx), each laid out as follows.
The data starts at row 3.
row1
row2
companyName 1980 1981 1982 ... 2016
company1 5 6 7 8
company2 10 20 30 40
company3 20 40 60 80
....
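The data range differs from sheet to sheet in both its number of rows and its number of columns. However, all sheets share companyName as a common key. The year range varies: it starts at 1980 or 1990 and runs to 2016. The sheet name is the data name.
I want to create one Excel file that gathers all the data from all the sheets: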
companyName Year dataname values
company1 1980 sheetname1 5
company1 1981 sheetname1 6
company1 1982 sheetname1 7
company1 ... sheetname1 ...
company1 2016 sheetname1 8
company2 1980 sheetname1 10
company2 1981 sheetname1 20
company2 1982 sheetname1 30
company2 ... sheetname1 ...
company2 2016 sheetname1 40
.... .... ... ...
company1 2000 sheetname2 xxx
company1 2001 sheetname2 yyy
etc
etc
etc
This is as far as I have managed to get:
library(tidyverse)
library(readxl)
library(data.table)
# read the Excel file (adapted from the source cited below)
file.list <- "data.xlsx"
# read all sheets (and skip the first two rows)
df.list <- lapply(file.list, function(x) {
  sheets <- excel_sheets(x)
  dfs <- lapply(sheets, function(y) {
    read_excel(x, sheet = y, skip = 2)
  })
  names(dfs) <- sheets
  dfs
})
I have the following questions:
Thank you for your help.
Source: R: reading multiple excel files, extract first sheet names, and create new column
I just removed one level of nesting from df.list.
df.list <- lapply(file.list, function(x) {
  sheets <- excel_sheets(x)
  dfs <- lapply(sheets, function(y) {
    read_excel(x, sheet = y, skip = 2)
  })
  names(dfs) <- sheets
  dfs
})[[1]]
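This works for me. I could not reproduce your problem with skip. Also, if the rows are just empty leading rows, read_excel() should skip them by default (trim_ws = TRUE, also a default, only trims whitespace inside cells).
I use the following list to demonstrate what to do after the import.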
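To check the behaviour of skip without the original workbook, one can build a small file with the same layout and read it back. This is only a sketch under assumptions: the workbook contents, the temp-file path and the names wb, demo, path and res are all hypothetical, and openxlsx is used here purely to create the test file.

```r
library(openxlsx)
library(readxl)

# Hypothetical workbook laid out like the question: two junk rows, header on row 3
wb <- createWorkbook()
addWorksheet(wb, "sheetname1")
writeData(wb, "sheetname1", "row1", startRow = 1)
writeData(wb, "sheetname1", "row2", startRow = 2)
demo <- data.frame(companyName = c("company1", "company2"),
                   `1980` = c(5, 10), `1981` = c(6, 20),
                   check.names = FALSE, stringsAsFactors = FALSE)
writeData(wb, "sheetname1", demo, startRow = 3)

path <- tempfile(fileext = ".xlsx")  # assumed location for the demo file
saveWorkbook(wb, path, overwrite = TRUE)

# skip = 2 drops the two junk rows, so row 3 supplies the column names
res <- read_excel(path, sheet = "sheetname1", skip = 2)
names(res)
```

If names(res) does not show companyName followed by the years, the header is probably not on the row you expect, and skip needs adjusting.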
df.list <- structure(list(sheetname1 = structure(list(companyName = c("company1",
"company2", "company3"), `1980` = c(5, 10, 40), `1981` = c(6,
20, 50), `1982` = c(7, 30, 60)), .Names = c("companyName", "1980",
"1981", "1982"), row.names = c(NA, -3L), class = c("tbl_df",
"tbl", "data.frame")), sheetname2 = structure(list(companyName = c("company1",
"company2", "company3"), `1980` = c(6, 11, 42), `1981` = c(7,
21, 52), `1982` = c(8, 31, 62)), .Names = c("companyName", "1980",
"1981", "1982"), row.names = c(NA, -3L), class = c("tbl_df",
"tbl", "data.frame")), sheetname3 = structure(list(companyName = c("company1",
"company2", "company3"), `1990` = c(8, 12, 43), `1991` = c(9,
22, 53), `1992` = c(10, 32, 63)), .Names = c("companyName", "1990",
"1991", "1992"), row.names = c(NA, -3L), class = c("tbl_df",
"tbl", "data.frame"))), .Names = c("sheetname1", "sheetname2",
"sheetname3"))
The following works regardless of whether the years start at 1980 or 1990.
dat <- lapply(df.list, function(x) {
  # reshape each sheet to long format: one row per company and year
  x %>% gather(year, values, -companyName)
}) %>% enframe() %>% unnest(value)
dat
# # A tibble: 27 x 4
# name companyName year values
# <chr> <chr> <chr> <dbl>
# 1 sheetname1 company1 1980 5.
# 2 sheetname1 company2 1980 10.
# 3 sheetname1 company3 1980 40.
# 4 sheetname1 company1 1981 6.
# 5 sheetname1 company2 1981 20.
# 6 sheetname1 company3 1981 50.
# 7 sheetname1 company1 1982 7.
# 8 sheetname1 company2 1982 30.
# 9 sheetname1 company3 1982 60.
# 10 sheetname2 company1 1980 6.
# # ... with 17 more rows
You can now use the sheet name to pick out specific data with dplyr::filter().
For example:
dat %>% filter(name == "sheetname1")
# name companyName year values
# <chr> <chr> <chr> <dbl>
# 1 sheetname1 company1 1980 5.
# 2 sheetname1 company2 1980 10.
# 3 sheetname1 company3 1980 40.
# 4 sheetname1 company1 1981 6.
# 5 sheetname1 company2 1981 20.
# 6 sheetname1 company3 1981 50.
# 7 sheetname1 company1 1982 7.
# 8 sheetname1 company2 1982 30.
# 9 sheetname1 company3 1982 60.
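Since gather() is superseded in current tidyr, the same reshape can also be sketched with pivot_longer() and dplyr::bind_rows(). This is an alternative, not the code used above; it assumes tidyr 1.0.0 or later, and the small sheets list is a hypothetical stand-in for df.list. Note that pivot_longer() orders rows by company first, whereas gather() orders them by year first.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for df.list: two sheets with different year ranges
sheets <- list(
  sheetname1 = tibble(companyName = c("company1", "company2"),
                      `1980` = c(5, 10), `1981` = c(6, 20)),
  sheetname3 = tibble(companyName = c("company1", "company2"),
                      `1990` = c(8, 12), `1991` = c(9, 22))
)

long <- lapply(sheets, function(x) {
  # every column except the key becomes a (year, values) pair
  pivot_longer(x, cols = -companyName, names_to = "year", values_to = "values")
}) %>%
  bind_rows(.id = "dataname") %>%   # the list names (sheet names) become a column
  mutate(year = as.integer(year))   # years arrive as character column names

long
```

Converting year to integer is optional, but it makes numeric filters such as filter(year >= 1990) behave as expected.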
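I suggest using the openxlsx package, which lets you specify startRow, together with melt from the reshape2 package, which makes it easy to change a data frame to long format.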
library(openxlsx)
library(reshape2)
file.list <- "data.xlsx" # the workbook from the question
first.Row <- 6 # supposing the data starts at row 6
sheets.2.read <- getSheetNames(file.list) # retrieving the sheet names
df <- data.frame()
for(tmp.sheet in sheets.2.read){
  # read one sheet; the row at first.Row supplies the column names
  tmp.dat <- read.xlsx(file.list, sheet = tmp.sheet, startRow = first.Row, colNames = TRUE)
  # melt to long format and tag every row with its sheet name
  tmp.dat <- cbind(melt(tmp.dat, id.vars = "companyName"), tmp.sheet)
  df <- rbind(df, tmp.dat)
}
Here is my output with some dummy data (only 10 rows printed):
> df[c(1:3, 50:53, 300:302),]
company.name variable value tmp.sheet
1 comp7 1968 0.3359298 Sheet1
2 comp8 1968 0.3359298 Sheet1
3 comp9 1968 0.3359298 Sheet1
50 comp16 1966 0.3359298 Sheet2
51 comp17 1966 0.3359298 Sheet2
52 comp18 1966 0.3359298 Sheet2
53 comp19 1966 0.3359298 Sheet2
300 comp16 2000 0.3359298 Sheet3
301 comp17 2000 0.3359298 Sheet3
302 comp18 2000 0.3359298 Sheet3
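To end up with the single Excel file the question asks for, the combined long-format data frame can be written back out with write.xlsx() from openxlsx. This is a minimal sketch: the df below is a small hypothetical stand-in for the real result, and the temp-file path is an assumption that you would replace with your own file name.

```r
library(openxlsx)

# Hypothetical stand-in for the combined long-format data frame
df <- data.frame(
  companyName = c("company1", "company1", "company2"),
  variable    = c("1980", "1981", "1980"),
  value       = c(5, 6, 10),
  tmp.sheet   = c("sheetname1", "sheetname1", "sheetname1"),
  stringsAsFactors = FALSE
)

# Rename and reorder to the column layout requested in the question
names(df) <- c("companyName", "Year", "values", "dataname")
df <- df[, c("companyName", "Year", "dataname", "values")]

out.file <- tempfile(fileext = ".xlsx")  # assumed output path; use any path you like
write.xlsx(df, out.file)

# Read the file back to check the round trip
check <- read.xlsx(out.file)
```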