Diagnostic messages report non-breaking spaces in files as errors when using CSharpCompilation

Question

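My source files contain non-breaking spaces. Sometimes it is just 0xA0, but sometimes it is 0xC2 0xA0 (as a pair). When I parse these files and feed them to CSharpCompilation, it returns diagnostic messages like this: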

c:\xyz\SomeFile.cs(8,1): error CS1056: Unexpected character '�'

Here is how I compile the code:

private static readonly List<KeyValuePair<string, ReportDiagnostic>> s_specificDiagnosticOptions = new[]
{
    // Assembly 'AssemblyName1' uses 'TypeName' which has a higher version than referenced assembly 'AssemblyName2'
    // https://learn.microsoft.com/en-us/dotnet/csharp/language-reference/compiler-messages/CS1705
    "CS1705",
    // Assuming assembly reference "Assembly Name #1" matches "Assembly Name #2", you may need to supply runtime policy
    // https://learn.microsoft.com/en-us/dotnet/csharp/language-reference/compiler-messages/CS1701
    "CS1701",
    // Assuming assembly reference "Assembly Name #1" used by "Type Name #1" matches identity "Assembly Name #2" of "Type Name #2", you may need to supply runtime policy
    "CS1702",
    // 'member1' hides inherited member 'member2'. Use the new keyword if hiding was intended
    // https://learn.microsoft.com/en-us/dotnet/csharp/language-reference/compiler-messages/cs0108
    "CS0108",
    // The member 'member' does not hide an inherited member. The new keyword is not required
    // https://learn.microsoft.com/en-us/dotnet/csharp/misc/cs0109
    "CS0109",
    // 'function1' hides inherited member 'function2'. To make the current method override that implementation, add the override keyword. Otherwise add the new keyword.
    // https://learn.microsoft.com/en-us/dotnet/csharp/misc/cs0114
    "CS0114",
    // The result of the expression is always 'value1' since a value of type 'value2' is never equal to 'null' of type 'value3'
    // https://learn.microsoft.com/en-us/dotnet/csharp/misc/cs0472
    "CS0472",
    // 'class' overrides Object.Equals(object o) but does not override Object.GetHashCode()
    // https://learn.microsoft.com/en-us/dotnet/csharp/misc/cs0659
    "CS0659",
    // Unreachable code detected
    // https://learn.microsoft.com/en-us/dotnet/csharp/misc/cs0162
    "CS0162",
    // Invalid name for a preprocessing symbol; '' is not a valid identifier
    "CS8301",
    // The annotation for nullable reference types should only be used in code within a '#nullable' annotations context
    "CS8632",
    // The using directive for 'XYZ' appeared previously as global using
    "CS8933",
    // Unnecessary using directive
    "CS8019",
    // 'member' is obsolete
    // https://learn.microsoft.com/en-us/dotnet/csharp/misc/cs0612
    "CS0612",
    // 'member' is obsolete: 'text'
    // https://learn.microsoft.com/en-us/dotnet/csharp/language-reference/compiler-messages/CS0618
    "CS0618"
}.Select(id => new KeyValuePair<string, ReportDiagnostic>(id, ReportDiagnostic.Suppress)).ToList();
...
public static SyntaxTree ParseBytes(this byte[] bytes, CSharpParseOptions options, string filePath) => CSharpSyntaxTree.ParseText(SourceText.From(bytes, bytes.Length), options, filePath);
...
var compilation = CSharpCompilation.Create(asmProps.AssemblyName,
    csFiles.Select(o => o.Bytes.ParseBytes(parseOptions, o.FilePath)),
    references,
    new CSharpCompilationOptions(OutputKind.DynamicallyLinkedLibrary,
        assemblyIdentityComparer: DesktopAssemblyIdentityComparer.Default,
        generalDiagnosticOption: ReportDiagnostic.Error,
        specificDiagnosticOptions: s_specificDiagnosticOptions));

The source files compile fine with the C# compiler on the command line, so I know that using non-breaking spaces should not be a problem.

The CSharpParseOptions object only contains the defined constants and specifies the latest language version.

How can I tell CSharpCompiler not to freak out when it sees a non-breaking space? I am wary of suppressing CS1056; that does not seem right.

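Edit 1

I double-checked the situation. At first I wanted to clean up every source file containing 0xA0 or the 0xC2A0 combination. But that is too costly (thousands of files) and redundant: 99.9% of these files compile without errors. I don't know why. But one file does fail.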

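It has no BOM. It does contain the 0xA0 character (not the 2-byte sequence 0xC2A0):

The file displays fine in Notepad++ (which reports ASCII encoding). The command-line build works too.

But it does not render correctly when interpreted as UTF8: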

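Frankly, this file looks bad to me, but here is the catch: it passes compilation on the command line! Otherwise it would never have been pushed to master, given our PR build policy.

I know I can clean it up, but how do I know which files to clean in general? Normally I do not run the compilation with diagnostics, because that slows things down significantly, and I know that every file in master should compile, so I can skip the diagnostics.

Ideas on how to solve my problem in a generally valid way are very welcome.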

Answer 1

After digging a little deeper, I believe the file is encoded as win-1252. Some files without any BOM contain the 0xC2A0 pair, which means they are UTF8.
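To make the difference concrete, here is a minimal sketch (not part of the original answer) that decodes the byte sequences discussed above with both encodings. The class and method names are made up for illustration, and it assumes .NET Core 3.0 or later, where `CodePagesEncodingProvider` is available and has to be registered before code page 1252 can be used.

```csharp
using System;
using System.Text;

// Hypothetical helper, for illustration only.
public static class NbspDecodingDemo
{
    // Decodes the bytes with the given encoding and returns the code point
    // of the first decoded character, formatted like "U+00A0".
    public static string FirstCodePoint(byte[] bytes, Encoding encoding)
    {
        string text = encoding.GetString(bytes);
        return $"U+{(int)text[0]:X4}";
    }

    // Returns the Windows-1252 encoding.
    // Assumption: running on .NET Core / .NET 5+, where code-page encodings
    // must be registered explicitly before Encoding.GetEncoding(1252) works.
    public static Encoding Windows1252()
    {
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(1252);
    }
}
```

Calling `NbspDecodingDemo.FirstCodePoint(new byte[] { 0xA0 }, Encoding.UTF8)` gives `U+FFFD`, the replacement character, which would explain the character shown in the CS1056 message. The same byte decoded with `Windows1252()` gives `U+00A0`, an ordinary non-breaking space, and so does the pair 0xC2 0xA0 decoded as UTF8.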

Inspired by the StreamReader.DetectEncoding function, I came up with the following variant:

private static readonly Encoding s_unicodeBigEndianWithBOM = new UnicodeEncoding(bigEndian: true, byteOrderMark: true);
private static readonly Encoding s_unicodeLittleEndianWithBOM = new UnicodeEncoding(bigEndian: false, byteOrderMark: true);
private static readonly Encoding s_utf32BigEndianWithBOM = new UTF32Encoding(bigEndian: true, byteOrderMark: true);
private static readonly Encoding s_utf32LittleEndianWithBOM = new UTF32Encoding(bigEndian: false, byteOrderMark: true);
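// Note: on .NET Core / .NET 5+, code page 1252 is only available after calling
// Encoding.RegisterProvider(CodePagesEncodingProvider.Instance).
private static readonly Encoding s_win1252 = Encoding.GetEncoding(1252);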
...
private Encoding DetectEncoding(byte[] bytes)
{
    const byte NBSP_PREFIX = 0xC2;
    const byte NBSP = 0xA0;

    if (bytes.Length < 2)
    {
        return Encoding.UTF8;
    }
    if (bytes[0] == 254 && bytes[1] == byte.MaxValue)
    {
        return s_unicodeBigEndianWithBOM;
    }
    if (bytes[0] == byte.MaxValue && bytes[1] == 254)
    {
        if (bytes.Length < 4 || bytes[2] != 0 || bytes[3] != 0)
        {
            return s_unicodeLittleEndianWithBOM;
        }
        return s_utf32LittleEndianWithBOM;
    }
    if (bytes.Length >= 3 && bytes[0] == 239 && bytes[1] == 187 && bytes[2] == 191)
    {
        return Encoding.UTF8;
    }
    if (bytes.Length >= 4 && bytes[0] == 0 && bytes[1] == 0 && bytes[2] == 254 && bytes[3] == byte.MaxValue)
    {
        return s_utf32BigEndianWithBOM;
    }
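    // No BOM: inspect the first 0xA0 byte only. If it is not preceded by 0xC2, it is
    // not the UTF-8 encoding of U+00A0, so treat the whole file as Windows-1252.
    int pos = bytes.AsSpan().IndexOf(NBSP);
    if (pos >= 0 && (pos == 0 || bytes[pos - 1] != NBSP_PREFIX))
    {
        return s_win1252;
    }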
    return Encoding.UTF8;
}

Using the encoding returned by this function when parsing the files seems to have solved my problem.
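One caveat with the check above: 0xA0 can also appear as a continuation byte in other valid UTF8 sequences (for example 0xE2 0x80 0xA0), and only the first occurrence is inspected. A more general alternative, sketched here as an assumption rather than as what the answer actually uses, is to validate the whole buffer with a strict UTF8 decoder and fall back to win-1252 only when validation fails. The names `StrictEncodingDetector`, `IsValidUtf8` and `DetectBomlessEncoding` are hypothetical, and the sketch only covers files without a BOM, so the BOM checks from DetectEncoding would still need to run first.

```csharp
using System;
using System.Text;

// Hypothetical helper, for illustration only.
public static class StrictEncodingDetector
{
    // Strict UTF-8: throws on invalid byte sequences instead of
    // silently substituting U+FFFD.
    private static readonly Encoding s_strictUtf8 =
        new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);

    // Returns true when every byte sequence in the buffer is well-formed UTF-8.
    public static bool IsValidUtf8(byte[] bytes)
    {
        try
        {
            // Counting characters forces a full validation pass
            // without allocating the decoded string.
            s_strictUtf8.GetCharCount(bytes);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }

    // For BOM-less input: UTF-8 when the bytes are valid UTF-8, otherwise Windows-1252.
    public static Encoding DetectBomlessEncoding(byte[] bytes)
    {
        if (IsValidUtf8(bytes))
        {
            return Encoding.UTF8;
        }
        // Assumption: .NET Core / .NET 5+, where code page 1252 must be registered first.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(1252);
    }
}
```

The resulting encoding can then be passed as the third argument of `SourceText.From(bytes, bytes.Length, encoding)` in ParseBytes. `IsValidUtf8` could also be used on its own to list the files worth cleaning up. The command-line compiler may apply a similar fallback, which would explain why the command-line build succeeds, but I have not verified that.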
