How to identify the data source when reading all files in a directory?

Question

A directory contains a number of files:

cpu_server01.csv
cpu_server02.csv
cpu_server03.csv

and so on.

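I can read the contents of the files and collect them in dflist as shown below. But I need to create another column in each data frame in dflist and put the file name there.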

path <- "C:/Server/web/"
#cpu

# pattern is a regular expression, not a glob
filenames <- list.files(path, pattern="^cpu_.*\\.csv$", full.names=TRUE)

dflist <- lapply(filenames, function(i) {
  read.csv(i, header=TRUE)
})

How can I add the name of the file to each data frame, so that the columns look like this?

Date Cpu filename

Answer 1

This should work:

# seq_along() is safe even if dflist is empty
for(i in seq_along(dflist))
  dflist[[i]]$file_name <- filenames[i]

Example:

filenames <- c("a","b")
dflist <- list(head(mtcars,3),head(mtcars,3))

for(i in seq_along(dflist))
   dflist[[i]]$file_name <- filenames[i]

Output:

[[1]]
               mpg cyl disp  hp drat    wt  qsec vs am gear carb file_name
Mazda RX4     21.0   6  160 110 3.90 2.620 16.46  0  1    4    4         a
Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4         a
Datsun 710    22.8   4  108  93 3.85 2.320 18.61  1  1    4    1         a

[[2]]
               mpg cyl disp  hp drat    wt  qsec vs am gear carb file_name
Mazda RX4     21.0   6  160 110 3.90 2.620 16.46  0  1    4    4         b
Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4         b
Datsun 710    22.8   4  108  93 3.85 2.320 18.61  1  1    4    1         b
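The column can also be added at read time, inside the function passed to lapply(). The following is only a sketch under stated assumptions: the demo directory, the Date and Cpu values, and the column name filename are made up for illustration and are not the asker's data.

```r
# Hypothetical demo files in a temporary directory (illustration only)
demo_dir <- file.path(tempdir(), "cpu_demo_read")
dir.create(demo_dir, showWarnings = FALSE)
for (i in 1:2) {
  demo <- data.frame(Date = c("2020-01-01", "2020-01-02"), Cpu = c(10, 20) * i)
  write.csv(demo, file.path(demo_dir, sprintf("cpu_server%02i.csv", i)),
            row.names = FALSE)
}

filenames <- list.files(demo_dir, pattern = "^cpu_.*\\.csv$", full.names = TRUE)

# Read each file and tag its rows with the file name (without the path)
dflist <- lapply(filenames, function(f) {
  df <- read.csv(f, header = TRUE)
  df$filename <- basename(f)
  df
})

names(dflist[[1]])
```

Doing it in one pass avoids keeping filenames and dflist in sync by index afterwards.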

Answer 2

In addition to Florian's answer, there are two alternative approaches to this common situation.

Name the list elements

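Copying the file name as a column into each individual data.frame only makes sense, IMHO, if you plan to rbind() them into one large data object (see the example below).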

If you want to keep each data.frame separately in the list, you can name the list elements appropriately, e.g.,

path <- "."
# get vector of filenames, note that pattern includes the csv extension
filenames <- list.files(path, pattern = "cpu_.*csv$", full.names = TRUE)
# read files as a list of data.frames
dflist <- lapply(filenames, read.csv, header = TRUE)
# rename list element using file names without path
names(dflist) <- basename(filenames)

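Note that there is no need to define an anonymous function when calling lapply(), because lapply() passes any additional arguments on to the function being called. So we can write concisely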

lapply(filenames, read.csv, header = TRUE)

instead of

lapply(filenames, function(i) read.csv(i, header = TRUE))
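The same forwarding of extra arguments can be checked on a toy input that needs no files. This is only a sketch: the vectors are made up for illustration, and mean() with na.rm stands in for read.csv() with header.

```r
# Made-up input: two numeric vectors with missing values
x <- list(c(1, NA, 3), c(4, 5, NA))

# Extra argument na.rm = TRUE is passed through lapply() to mean()
short_form <- lapply(x, mean, na.rm = TRUE)

# Equivalent call with an explicit anonymous function
long_form <- lapply(x, function(v) mean(v, na.rm = TRUE))

identical(short_form, long_form)
```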

Now, dflist is named appropriately:

$cpu_server01.csv
  V1   V2
1  A 1001
2  B 1002
3  C 1003

$cpu_server02.csv
  V1   V2
1  A 2001
2  B 2002
3  C 2003

$cpu_server03.csv
  V1   V2
1  A 3001
2  B 3002
3  C 3003
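With named elements, a single file's data can be looked up by name rather than by position. The sketch below recreates the dummy files in a temporary directory (an assumption made so the example is self-contained) and, as an optional variation not taken from the answer, strips the extension with tools::file_path_sans_ext() to get shorter names.

```r
# Recreate the dummy files in a temporary directory (illustration only)
demo_dir <- file.path(tempdir(), "cpu_demo_names")
dir.create(demo_dir, showWarnings = FALSE)
for (i in 1:3) {
  x <- data.frame(V1 = LETTERS[1:3], V2 = 1000L * i + 1:3)
  write.csv(x, file.path(demo_dir, sprintf("cpu_server%02i.csv", i)),
            row.names = FALSE)
}

filenames <- list.files(demo_dir, pattern = "cpu_.*csv$", full.names = TRUE)
dflist <- lapply(filenames, read.csv, header = TRUE)

# Optional variation: drop both the path and the ".csv" extension
names(dflist) <- tools::file_path_sans_ext(basename(filenames))

# Access one data.frame by name instead of by index
dflist[["cpu_server02"]]
```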

Identify source file in a combined data object

If the goal is to combine all the pieces of data into one large data object, the source file of each row needs to be identifiable.

This can be achieved with Florian's approach followed by rbind-ing. Alternatively, we can use the rbindlist() function from data.table.

If the list elements are already named as described above, we can simply write:

combi <- data.table::rbindlist(dflist, idcol = "file.name")
combi

          file.name V1   V2
1: cpu_server01.csv  A 1001
2: cpu_server01.csv  B 1002
3: cpu_server01.csv  C 1003
4: cpu_server02.csv  A 2001
5: cpu_server02.csv  B 2002
6: cpu_server02.csv  C 2003
7: cpu_server03.csv  A 3001
8: cpu_server03.csv  B 3002
9: cpu_server03.csv  C 3003

rbindlist() has created the id column "file.name" and filled it with the names of the list elements.

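Alternatively, we can call rbindlist() first and add the file names as a factor afterwards. For an unnamed list, idcol is filled with the position of each list element, which is then relabeled: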

library(data.table)
path <- "."
# get vector of filenames, note that pattern includes the csv extension
filenames <- list.files(path, pattern = "cpu_.*csv$", full.names = TRUE)
# read files as a list of data.frames and combine immediately
combi <- rbindlist(lapply(filenames, read.csv, header = TRUE), idcol = "file.name")
# change file number to appropriately labeled factor
combi[, file.name := factor(file.name, labels = basename(filenames))][]

          file.name V1   V2
1: cpu_server01.csv  A 1001
2: cpu_server01.csv  B 1002
3: cpu_server01.csv  C 1003
4: cpu_server02.csv  A 2001
5: cpu_server02.csv  B 2002
6: cpu_server02.csv  C 2003
7: cpu_server03.csv  A 3001
8: cpu_server03.csv  B 3002
9: cpu_server03.csv  C 3003
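For comparison, the route without data.table (tag each data.frame, then rbind) can be sketched in base R. The two small data.frames and their names a.csv and b.csv are made up for illustration; in practice they would be the named dflist from above.

```r
# Made-up named list standing in for dflist
dfs <- list(
  a.csv = data.frame(V1 = c("A", "B"), V2 = 1:2),
  b.csv = data.frame(V1 = c("A", "B"), V2 = 3:4)
)

# Add the element name as a column to each data.frame
tagged <- Map(function(df, nm) {
  df$file.name <- nm
  df
}, dfs, names(dfs))

# Stack all pieces; unname() keeps the list names out of the row names
combi <- do.call(rbind, unname(tagged))
rownames(combi) <- NULL
combi
```

Unlike rbindlist(), rbind() requires every piece to have the same column names.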

Data

For reproducibility, the dummy files were created with:

idx_vec <- 1:3
invisible(sapply(1:3, function(i) {
  x <- data.frame(V1 = LETTERS[idx_vec], V2 = 1000L * i + idx_vec)
  write.csv(x, sprintf("cpu_server%02i.csv", i), row.names = FALSE)
}))
