CSV data displays '???' in the first element on Linux but works fine on Windows - how to resolve?

Question

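I'm running code in GitHub Actions (Linux environment) and trying to retrieve data from a CSV file. However, the first element of the CSV file is displayed as '???"'. Interestingly, the same CSV file works fine on Windows after handling the BOM (byte order mark). Here is the code snippet I use to retrieve the data from the CSV file: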

public void retrieveCSVFile(String directoryPath, String filename, List<String> csvContent, List<ExtentTest> extentLog) throws IOException {
    File directory = new File(directoryPath);
    Collection<File> files = FileUtils.listFiles(directory, new String[]{"csv"}, false);
    Optional<File> fileOptional = files.stream().filter(file -> file.getName().contains(filename)).findFirst(); // take the first file whose name contains the given text
    if (fileOptional.isPresent()) {
        File file = fileOptional.get();
        validateCsvHeaders(file, csvContent, extentLog);
    } else {
        extentLog.stream().forEach(Exlog -> Exlog.log(Status.FAIL, filename + " is not present in downloads folder"));
    }
}

private static void validateCsvHeaders(File file, List<String> csvContent, List<ExtentTest> extentLog) {
    try (CSVReader csvReader = new CSVReader(new FileReader(file))) {
        List<String> header = Arrays.asList(csvReader.readNext());
        for (int i = 0; i < header.size(); i++) {
            int finalI = i;
            if (removeBOMFromCSV(header.get(i)).contains(csvContent.get(i))) {
                extentLog.stream().forEach(Exlog -> Exlog.log(Status.PASS, header.get(finalI) + " is present in CSV file: " + csvContent.get(finalI)));
            } else {
                extentLog.stream().forEach(Exlog -> Exlog.log(Status.INFO, "Expected header is " + header.get(finalI) + " and actual header is" + csvContent.get(finalI)));
                extentLog.stream().forEach(Exlog -> Exlog.log(Status.FAIL, header.get(finalI) + " is not present in CSV file" + csvContent.get(finalI)));
            }
        }

    } catch (IOException | CsvValidationException e) {
        extentLog.stream().forEach(Exlog -> Exlog.log(Status.FAIL, "Error reading CSV file " + file.getName() + ": " + e.getMessage()));
    }
}

public static String removeBOMFromCSV(String s) {
    String[] BOMs = {"\uFEFF",      // UTF-8 BOM
            "\uFFFE",      // UTF-16 BOM (BE)
            "\uFEFF",      // UTF-16 BOM (LE)
            "\uFFFE",      // UTF-32 BOM (BE)
            "\uFEFF"};     // UTF-32 BOM (LE)
    for (String bom : BOMs) {
        if (s.startsWith(bom)) {
            s = s.substring(bom.length());
            s=s.trim();
            s = s.replaceAll("[^\\p{Print}]", "");
            if(s.startsWith("???\"")){
                s=s.substring(4);
            }
        }
    }
    return s;
}

The CSV file looks like this:

|Age|name|location|
|:--|:--:|------:|
|30 |John|India|
|25 |Jane|USA|

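After running the code in GitHub Actions (env = Linux), the output is:

???"Age is not present in CSV file

name is present in CSV file

location is present in CSV file

Expected output

Age is present in CSV file

name is present in CSV file

location is present in CSV file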

Answer 1

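The representation of a byte order mark in a file depends on the charset used to encode the file. Because new FileReader(file) already assumes a charset (the platform default, since none is specified), it is too late to check for a BOM after the Reader has been created.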
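To make the point concrete, here is a minimal sketch of how the same bytes decode under two charsets. It assumes the file starts with a UTF-8 BOM and that the Linux runner's default charset is US-ASCII; neither is stated in the question, and the class name and sample bytes are purely illustrative.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class BomDecodingDemo {
    // Hypothetical sample: a UTF-8 BOM (EF BB BF) followed by the header text "Age".
    static final byte[] SAMPLE = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'A', 'g', 'e'};

    /** Decodes the sample bytes the way a Reader using the given charset would. */
    static String decode(Charset charset) {
        return new String(SAMPLE, charset);
    }

    public static void main(String[] args) {
        String utf8 = decode(StandardCharsets.UTF_8);
        String ascii = decode(StandardCharsets.US_ASCII);

        // UTF-8 turns the three BOM bytes into a single U+FEFF character.
        System.out.println(utf8.length() + " " + utf8.startsWith("\uFEFF"));   // prints "4 true"

        // US-ASCII cannot decode those bytes; each one becomes U+FFFD,
        // which a console typically renders as '?'.
        System.out.println(ascii.length() + " " + ascii.startsWith("\uFEFF")); // prints "6 false"
    }
}
```

Under the second charset the string never starts with "\uFEFF", so a character-level check such as removeBOMFromCSV has nothing to match.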

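You must check for the BOM before creating a Reader, and you must check bytes, not characters. unicode.org has a handy reference for how to check the initial bytes of a file. The code could look like this: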

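import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.ByteBuffer;
import java.util.Arrays;

private static Reader openWithoutBOM(File file)
throws IOException {
    InputStream in = new BufferedInputStream(new FileInputStream(file));

    // Remember the start of the stream so we can rewind after peeking.
    in.mark(4);
    byte[] head = in.readNBytes(4);
    int n = head.length;  // fewer than 4 for very short files
    // Zero-pad to 4 bytes so the absolute reads below cannot go out of bounds.
    ByteBuffer initialBytes = ByteBuffer.wrap(Arrays.copyOf(head, 4));

    String charset;
    if (n == 4 && initialBytes.getInt(0) == 0x0000feff) {
        // The 4 bytes already read were the BOM, so the stream is past it.
        charset = "UTF-32BE";
    } else if (n == 4 && initialBytes.getInt(0) == 0xfffe0000) {
        // Must be tested before UTF-16LE, whose BOM is a prefix of this one.
        charset = "UTF-32LE";
    } else if (n >= 2 && initialBytes.getShort(0) == (short) 0xfeff) {
        charset = "UTF-16BE";

        in.reset();
        in.skipNBytes(2);
    } else if (n >= 2 && initialBytes.getShort(0) == (short) 0xfffe) {
        charset = "UTF-16LE";

        in.reset();
        in.skipNBytes(2);
    } else if (n >= 3 &&
               initialBytes.get(0) == (byte) 0xef &&
               initialBytes.get(1) == (byte) 0xbb &&
               initialBytes.get(2) == (byte) 0xbf) {

        charset = "UTF-8";

        in.reset();
        in.skipNBytes(3);
    } else {
        // You may want to make a different assumption here.
        // For example, files created with Windows Notepad are
        // UTF-16LE by default.

        charset = "UTF-8";
        in.reset();
    }

    return new InputStreamReader(in, charset);
}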
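If the files are known to be UTF-8 only, a narrower variant may be enough. The sketch below is not the method above but a simplified alternative: it uses a PushbackInputStream to drop a leading EF BB BF and always decodes as UTF-8. The class and helper names, and the comma-separated sample content, are assumptions made for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class Utf8BomSkipDemo {
    /** Opens a UTF-8 reader, dropping a leading EF BB BF if present (hypothetical helper). */
    static Reader openUtf8SkippingBom(Path path) throws IOException {
        PushbackInputStream in = new PushbackInputStream(Files.newInputStream(path), 3);
        byte[] head = new byte[3];
        int n = in.readNBytes(head, 0, 3);
        boolean hasBom = n == 3
                && head[0] == (byte) 0xEF && head[1] == (byte) 0xBB && head[2] == (byte) 0xBF;
        if (!hasBom && n > 0) {
            in.unread(head, 0, n); // not a BOM: push the bytes back for the decoder
        }
        // The charset is explicit, so the platform default no longer matters.
        return new InputStreamReader(in, StandardCharsets.UTF_8);
    }

    /** Writes a BOM-prefixed sample CSV to a temp file and returns its first line. */
    static String firstLineOfSample() {
        try {
            Path tmp = Files.createTempFile("bom-demo", ".csv");
            try {
                byte[] bom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};
                byte[] body = "Age,name,location\n30,John,India\n".getBytes(StandardCharsets.UTF_8);
                byte[] all = new byte[bom.length + body.length];
                System.arraycopy(bom, 0, all, 0, bom.length);
                System.arraycopy(body, 0, all, bom.length, body.length);
                Files.write(tmp, all);
                try (BufferedReader reader = new BufferedReader(openUtf8SkippingBom(tmp))) {
                    return reader.readLine();
                }
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(firstLineOfSample()); // prints "Age,name,location"
    }
}
```

In the question's validateCsvHeaders, the same idea would mean passing a BOM-aware Reader (for example the result of openWithoutBOM(file)) to the CSVReader constructor in place of new FileReader(file), after which removeBOMFromCSV should no longer be needed.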
