How can I identify the data source when reading all files in a directory?


A directory contains a set of files:

cpu_server01.csv
cpu_server02.csv
cpu_server03.csv

and so on.

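I can read the contents of the files and collect them in dflist as shown below. However, I also need to create another column in each data frame in dflist and put the file name there.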

path <- "C:/Server/web/"
# cpu

filenames <- list.files(path, pattern = "cpu_*", full.names = TRUE)

# read every file into a list of data frames
dflist <- lapply(filenames, function(i) {
  read.csv(i, header = TRUE)
})

How can I add each file's name to the data frame read from that file? The desired columns are:

Date Cpu filename
r dataframe
2 Answers
2
votes

This should work:

# seq_along() also behaves correctly when dflist is empty
for (i in seq_along(dflist))
  dflist[[i]]$file_name <- filenames[i]

Example:

filenames=c("a","b")
dflist = list(head(mtcars,3),head(mtcars,3))

for (i in seq_along(dflist))
  dflist[[i]]$file_name <- filenames[i]

Output:

[[1]]
               mpg cyl disp  hp drat    wt  qsec vs am gear carb file_name
Mazda RX4     21.0   6  160 110 3.90 2.620 16.46  0  1    4    4         a
Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4         a
Datsun 710    22.8   4  108  93 3.85 2.320 18.61  1  1    4    1         a

[[2]]
               mpg cyl disp  hp drat    wt  qsec vs am gear carb file_name
Mazda RX4     21.0   6  160 110 3.90 2.620 16.46  0  1    4    4         b
Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4         b
Datsun 710    22.8   4  108  93 3.85 2.320 18.61  1  1    4    1         b
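As a variation on the loop above, here is a minimal sketch that does the same tagging with Map(), which walks over the list and the vector of names in parallel. The inputs are the same toy data as in the example; the helper variable names df and fn are only illustrative.

```r
# Same toy inputs as in the example above
filenames <- c("a", "b")
dflist <- list(head(mtcars, 3), head(mtcars, 3))

# Map() pairs each data frame with its file name;
# the function adds the name as a new column and returns the data frame
dflist <- Map(function(df, fn) {
  df$file_name <- fn
  df
}, dflist, filenames)

# Inspect the new column of the second element
dflist[[2]]$file_name
```

Whether the loop or Map() reads better is mostly a matter of taste; both leave a list of data frames that each carry a file_name column.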

0
votes

In addition to Florian's answer, there are two alternative ways to handle this common situation.

Name the list elements

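IMHO, copying the file name into a column of each individual data.frame only makes sense if you plan to rbind() them into one large data object (see the example further down).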

If you want to keep each data.frame as a separate element of the list, you can name the list elements appropriately, e.g.,

path <- "."
# get vector of filenames, note that pattern includes the csv extension
filenames <- list.files(path, pattern = "cpu_.*csv$", full.names = TRUE)
# read files as a list of data.frames
dflist <- lapply(filenames, read.csv, header = TRUE)
# rename list element using file names without path
names(dflist) <- basename(filenames)
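A short aside on the pattern argument: list.files() interprets it as a regular expression, not as a shell-style wildcard. The sketch below compares the patterns on a few made-up file names (the candidate names are assumptions for illustration only) using grepl(), and shows utils::glob2rx() as one way to convert a wildcard into a regular expression.

```r
# Hypothetical file names, used only to illustrate the matching rules
candidates <- c("cpu_server01.csv", "cpu_server01.csv.bak",
                "mycpu.txt", "mem_server01.csv")

# "cpu_*" as a regex means "cpu" followed by zero or more underscores
loose <- grepl("cpu_*", candidates)

# the pattern used in this answer also requires the name to end in "csv"
stricter <- grepl("cpu_.*csv$", candidates)

# glob2rx() turns a shell-style wildcard into an anchored regex
from_glob <- grepl(utils::glob2rx("cpu_*.csv"), candidates)

data.frame(candidates, loose, stricter, from_glob)
```

The result of glob2rx() can be passed directly as the pattern argument of list.files().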

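Note that there is no need to define an anonymous function when calling lapply(), because lapply() passes any arguments it does not recognize on to the called function. So we can write concisely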

lapply(filenames, read.csv, header = TRUE)

instead of

lapply(filenames, function(i) read.csv(i, header = TRUE)) 
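To check that the two forms really are interchangeable, here is a small self-contained sketch. It writes one throwaway CSV to a temporary file (the file and its contents are assumptions made for this demonstration) and reads it back both ways.

```r
# Write one small CSV to a temporary file so the snippet is self-contained
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(V1 = c("A", "B"), V2 = 1:2), tmp, row.names = FALSE)

# Extra arguments after the function are forwarded to read.csv() via `...`
short <- lapply(tmp, read.csv, header = TRUE)

# Equivalent call with an explicit anonymous function
long <- lapply(tmp, function(i) read.csv(i, header = TRUE))

# Compare the two results
identical(short, long)
```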

With names(dflist) set as above, dflist is now named appropriately:

$cpu_server01.csv
  V1   V2
1  A 1001
2  B 1002
3  C 1003

$cpu_server02.csv
  V1   V2
1  A 2001
2  B 2002
3  C 2003

$cpu_server03.csv
  V1   V2
1  A 3001
2  B 3002
3  C 3003
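Once the elements are named, a single data.frame can be looked up by its source file. The following sketch rebuilds a two-element stand-in for the list shown above (so that it runs on its own) and also shows tools::file_path_sans_ext() as one optional way to shorten the names.

```r
# Stand-in for the named list shown above (first two elements only)
dflist <- list(
  "cpu_server01.csv" = data.frame(V1 = c("A", "B", "C"), V2 = 1001:1003),
  "cpu_server02.csv" = data.frame(V1 = c("A", "B", "C"), V2 = 2001:2003)
)

# Pick one data.frame by the name of its source file
server02 <- dflist[["cpu_server02.csv"]]

# Optionally drop the extension so the names are shorter
names(dflist) <- tools::file_path_sans_ext(names(dflist))
names(dflist)
```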

Identify source file in a combined data object

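If the goal is to combine all the chunks of data into one large data object, the original source file of each row needs to be identifiable.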

This can be achieved with Florian's approach followed by an rbind(). Alternatively, we can use the rbindlist() function from the data.table package.

If the list elements have already been named as described above, we can simply add:

combi <- data.table::rbindlist(dflist, idcol = "file.name")
combi
          file.name V1   V2
1: cpu_server01.csv  A 1001
2: cpu_server01.csv  B 1002
3: cpu_server01.csv  C 1003
4: cpu_server02.csv  A 2001
5: cpu_server02.csv  B 2002
6: cpu_server02.csv  C 2003
7: cpu_server03.csv  A 3001
8: cpu_server03.csv  B 3002
9: cpu_server03.csv  C 3003

rbindlist() has created the id column "file.name" and filled it with the names of the list elements.


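Alternatively, we can call rbindlist() first, which numbers the chunks in the id column, and then relabel those numbers with the file names as a factor: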

library(data.table)
path <- "."
# get vector of filenames, note that pattern includes the csv extension
filenames <- list.files(path, pattern = "cpu_.*csv$", full.names = TRUE)
# read files as a list of data.frames and combine immediately
combi <- rbindlist(lapply(filenames, read.csv, header = TRUE), idcol = "file.name")
# change file number to appropriately labeled factor
combi[, file.name := factor(file.name, labels = basename(filenames))][]
          file.name V1   V2
1: cpu_server01.csv  A 1001
2: cpu_server01.csv  B 1002
3: cpu_server01.csv  C 1003
4: cpu_server02.csv  A 2001
5: cpu_server02.csv  B 2002
6: cpu_server02.csv  C 2003
7: cpu_server03.csv  A 3001
8: cpu_server03.csv  B 3002
9: cpu_server03.csv  C 3003
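For comparison, here is a minimal base R sketch of the other route mentioned at the start of this section: tag each chunk with its file name, then rbind(). It is only a sketch under a few assumptions: the dummy files are written to a subdirectory of tempdir() (the name cpu_demo is made up) so that the snippet is self-contained, and the file.name column ends up last rather than first.

```r
# Self-contained setup: write three small CSV files to a temporary directory
# (same layout as the dummy files used in this answer)
path <- file.path(tempdir(), "cpu_demo")
dir.create(path, showWarnings = FALSE)
for (i in 1:3) {
  x <- data.frame(V1 = LETTERS[1:3], V2 = 1000L * i + 1:3)
  write.csv(x, file.path(path, sprintf("cpu_server%02i.csv", i)),
            row.names = FALSE)
}

filenames <- list.files(path, pattern = "cpu_.*csv$", full.names = TRUE)

# Read each file and tag its rows with the source file name in one pass
dflist <- lapply(filenames, function(f) {
  df <- read.csv(f, header = TRUE)
  df$file.name <- basename(f)
  df
})

# Stack the pieces into one data.frame with base R
combi <- do.call(rbind, dflist)
combi
```

For many or large files, rbindlist() is generally the faster option; the base R version mainly avoids the extra dependency.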

Data

For reproducibility, the dummy files were created with

idx_vec <- 1:3
invisible(sapply(1:3, function(i) {
  x <- data.frame(V1 = LETTERS[idx_vec], V2 = 1000L * i + idx_vec)
  write.csv(x, sprintf("cpu_server%02i.csv", i), row.names = FALSE)
}))