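I am running code in GitHub Actions (Linux environment) and trying to retrieve data from a CSV file. However, the first element of the CSV file is displayed as `???"`. Interestingly, with the BOM (byte order mark) handling in place, the same CSV file works fine on Windows. Here is the code snippet I use to retrieve data from the CSV file: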
public void retrieveCSVFile(String directoryPath, String filename, List<String> csvContent, List<ExtentTest> extentLog) throws IOException {
    File directory = new File(directoryPath);
    Collection<File> files = FileUtils.listFiles(directory, new String[]{"csv"}, false);
    // Pick the first CSV file whose name contains the given filename
    Optional<File> fileOptional = files.stream().filter(file -> file.getName().contains(filename)).findFirst();
    if (fileOptional.isPresent()) {
        File file = fileOptional.get();
        validateCsvHeaders(file, csvContent, extentLog);
    } else {
        extentLog.stream().forEach(Exlog -> Exlog.log(Status.FAIL, filename + " is not present in downloads folder"));
    }
}
private static void validateCsvHeaders(File file, List<String> csvContent, List<ExtentTest> extentLog) {
    try (CSVReader csvReader = new CSVReader(new FileReader(file))) {
        List<String> header = Arrays.asList(csvReader.readNext());
        for (int i = 0; i < header.size(); i++) {
            int finalI = i;
            if (removeBOMFromCSV(header.get(i)).contains(csvContent.get(i))) {
                extentLog.stream().forEach(Exlog -> Exlog.log(Status.PASS, header.get(finalI) + " is present in CSV file: " + csvContent.get(finalI)));
            } else {
                extentLog.stream().forEach(Exlog -> Exlog.log(Status.INFO, "Expected header is " + header.get(finalI) + " and actual header is" + csvContent.get(finalI)));
                extentLog.stream().forEach(Exlog -> Exlog.log(Status.FAIL, header.get(finalI) + " is not present in CSV file" + csvContent.get(finalI)));
            }
        }
    } catch (IOException | CsvValidationException e) {
        extentLog.stream().forEach(Exlog -> Exlog.log(Status.FAIL, "Error reading CSV file " + file.getName() + ": " + e.getMessage()));
    }
}
public static String removeBOMFromCSV(String s) {
    String[] BOMs = {"\uFEFF", // UTF-8 BOM
            "\uFFFE", // UTF-16 BOM (BE)
            "\uFEFF", // UTF-16 BOM (LE)
            "\uFFFE", // UTF-32 BOM (BE)
            "\uFEFF"}; // UTF-32 BOM (LE)
    for (String bom : BOMs) {
        if (s.startsWith(bom)) {
            s = s.substring(bom.length());
            s = s.trim();
            s = s.replaceAll("[^\\p{Print}]", "");
            if (s.startsWith("???\"")) {
                s = s.substring(4);
            }
        }
    }
    return s;
}
The CSV file looks like this:
|Age|name|location|
|:--|:--:|------:|
|30 |John|India|
|25 |Jane|USA|
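After running the code in GitHub Actions (env = Linux), the output is:
???"Age is not present in CSV file
name is present in CSV file
location is present in CSV file
Expected output:
Age is present in CSV file
name is present in CSV file
location is present in CSV file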
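How a byte order mark is represented in a file depends on the charset used to encode the file. Because `new FileReader(file)` has already committed to a charset (the JVM's default, which can differ between a Windows machine and a Linux runner), checking for a BOM after the Reader has been created is too late.
You must check for the BOM before creating a Reader, and you must check bytes, not characters. unicode.org has a handy reference on how to check the initial bytes of a file. The code might look like this: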
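import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.ByteBuffer;

// Opens the file with the charset indicated by its BOM and leaves the
// stream positioned just past the BOM. Uses skipNBytes, so it needs Java 12+.
private static Reader openWithoutBOM(File file)
throws IOException {
    InputStream in = new BufferedInputStream(new FileInputStream(file));
    in.mark(4);
    ByteBuffer initialBytes = ByteBuffer.wrap(in.readNBytes(4));
    // Fewer than 4 bytes are available for very short files, so guard each read.
    int n = initialBytes.remaining();
    String charset;
    if (n >= 4 && initialBytes.getInt(0) == 0x0000feff) {
        // All four bytes read so far are the BOM, so no reset is needed.
        charset = "UTF-32BE";
    } else if (n >= 4 && initialBytes.getInt(0) == 0xfffe0000) {
        // Checked before UTF-16LE, whose BOM (FF FE) is a prefix of this one.
        charset = "UTF-32LE";
    } else if (n >= 2 && initialBytes.getShort(0) == (short) 0xfeff) {
        charset = "UTF-16BE";
        in.reset();
        in.skipNBytes(2);
    } else if (n >= 2 && initialBytes.getShort(0) == (short) 0xfffe) {
        charset = "UTF-16LE";
        in.reset();
        in.skipNBytes(2);
    } else if (n >= 3 &&
               initialBytes.get(0) == (byte) 0xef &&
               initialBytes.get(1) == (byte) 0xbb &&
               initialBytes.get(2) == (byte) 0xbf) {
        charset = "UTF-8";
        in.reset();
        in.skipNBytes(3);
    } else {
        // No BOM found. You may want to make a different assumption here.
        // For example, files from older Windows tools may be in a legacy
        // code page such as windows-1252.
        charset = "UTF-8";
        in.reset();
    }
    return new InputStreamReader(in, charset);
}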
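To see why checking characters after decoding is too late, here is a minimal sketch that decodes the same six bytes (a UTF-8 BOM followed by `Age`) with three different charsets. The class name `BomDecodingDemo` and its helper methods are made up for this illustration and are not part of the question's code. Only the UTF-8 decoding produces a string that starts with `\uFEFF`. US-ASCII turns each BOM byte into the replacement character U+FFFD, and ISO-8859-1 turns them into U+00EF U+00BB U+00BF, so a `startsWith("\uFEFF")` check never matches in those cases.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class BomDecodingDemo {
    // UTF-8 BOM (EF BB BF) followed by the ASCII text "Age".
    static final byte[] SAMPLE = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'A', 'g', 'e'};

    // Decodes the bytes with the given charset; malformed input is replaced
    // with U+FFFD, which is also what a Reader does by default.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    // Renders each char as U+XXXX so invisible characters become visible.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            sb.append(String.format("U+%04X ", (int) c));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        Charset[] charsets = {
            StandardCharsets.UTF_8,
            StandardCharsets.US_ASCII,
            StandardCharsets.ISO_8859_1
        };
        for (Charset cs : charsets) {
            String decoded = decode(SAMPLE, cs);
            System.out.println(cs + ": " + codePoints(decoded)
                    + " | startsWith(U+FEFF) = " + decoded.startsWith("\uFEFF"));
        }
    }
}
```

If the Linux runner's default charset is ASCII, the three replacement characters could be what gets printed as `???` in the log, although that depends on how the runner is configured. With the byte-level approach, `validateCsvHeaders` would presumably pass `openWithoutBOM(file)` to `new CSVReader(...)` instead of `new FileReader(file)`, which makes `removeBOMFromCSV` unnecessary.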