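My source files contain non-breaking spaces. Sometimes it is just a single 0xA0 byte, but sometimes it is 0xC2 0xA0 (as a pair).
When I parse these files and feed them to CSharpCompilation, it returns diagnostic messages like the following: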
c:\xyz\SomeFile.cs(8,1): error CS1056: Unexpected character '�'
Here is how I compile the code:
private static readonly List<KeyValuePair<string, ReportDiagnostic>> s_specificDiagnosticOptions = new[]
{
    // Assembly 'AssemblyName1' uses 'TypeName' which has a higher version than referenced assembly 'AssemblyName2'
    // https://learn.microsoft.com/en-us/dotnet/csharp/language-reference/compiler-messages/CS1705
    "CS1705",
    // Assuming assembly reference "Assembly Name #1" matches "Assembly Name #2", you may need to supply runtime policy
    // https://learn.microsoft.com/en-us/dotnet/csharp/language-reference/compiler-messages/CS1701
    "CS1701",
    // Assuming assembly reference "Assembly Name #1" used by "Type Name #1" matches identity "Assembly Name #2" of "Type Name #2", you may need to supply runtime policy
    "CS1702",
    // 'member1' hides inherited member 'member2'. Use the new keyword if hiding was intended
    // https://learn.microsoft.com/en-us/dotnet/csharp/language-reference/compiler-messages/cs0108
    "CS0108",
    // The member 'member' does not hide an inherited member. The new keyword is not required
    // https://learn.microsoft.com/en-us/dotnet/csharp/misc/cs0109
    "CS0109",
    // 'function1' hides inherited member 'function2'. To make the current method override that implementation, add the override keyword. Otherwise add the new keyword.
    // https://learn.microsoft.com/en-us/dotnet/csharp/misc/cs0114
    "CS0114",
    // The result of the expression is always 'value1' since a value of type 'value2' is never equal to 'null' of type 'value3'
    // https://learn.microsoft.com/en-us/dotnet/csharp/misc/cs0472
    "CS0472",
    // 'class' overrides Object.Equals(object o) but does not override Object.GetHashCode()
    // https://learn.microsoft.com/en-us/dotnet/csharp/misc/cs0659
    "CS0659",
    // Unreachable code detected
    // https://learn.microsoft.com/en-us/dotnet/csharp/misc/cs0162
    "CS0162",
    // Invalid name for a preprocessing symbol; '' is not a valid identifier
    "CS8301",
    // The annotation for nullable reference types should only be used in code within a '#nullable' annotations context
    "CS8632",
    // The using directive for 'XYZ' appeared previously as global using
    "CS8933",
    // Unnecessary using directive
    "CS8019",
    // 'member' is obsolete
    // https://learn.microsoft.com/en-us/dotnet/csharp/misc/cs0612
    "CS0612",
    // 'member' is obsolete: 'text'
    // https://learn.microsoft.com/en-us/dotnet/csharp/language-reference/compiler-messages/CS0618
    "CS0618"
}.Select(id => new KeyValuePair<string, ReportDiagnostic>(id, ReportDiagnostic.Suppress)).ToList();
...
public static SyntaxTree ParseBytes(this byte[] bytes, CSharpParseOptions options, string filePath) =>
    CSharpSyntaxTree.ParseText(SourceText.From(bytes, bytes.Length), options, filePath);
...
var compilation = CSharpCompilation.Create(asmProps.AssemblyName,
    csFiles.Select(o => o.Bytes.ParseBytes(parseOptions, o.FilePath)),
    references,
    new CSharpCompilationOptions(OutputKind.DynamicallyLinkedLibrary,
        assemblyIdentityComparer: DesktopAssemblyIdentityComparer.Default,
        generalDiagnosticOption: ReportDiagnostic.Error,
        specificDiagnosticOptions: s_specificDiagnosticOptions));
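The C# compiler compiles these source files fine on the command line, so I know that non-breaking spaces in themselves should not be a problem.
The CSharpParseOptions object only defines the preprocessor constants and specifies the latest language version.
How do I tell CSharpCompilation not to freak out when it sees a non-breaking space? I am wary of suppressing CS1056; that does not seem right.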
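Edit 1
I double-checked the situation. At first I wanted to clean up every source file containing 0xA0 or the 0xC2 0xA0 pair. But that is too expensive (thousands of files) and redundant: 99.9% of these files compile without failure. I do not know why. But one file does fail.
It has no BOM. It does contain the 0xA0 byte (not the two-byte sequence 0xC2 0xA0).
The file displays cleanly in Notepad++ (which reports ASCII encoding). The command-line build also works fine.
Frankly, this file looks bad to me, but I have a problem: it passes compilation on the command line! Otherwise it would never have been pushed to master, given our PR build policy.
I know I can clean it up, but how do I know which files to clean in general? Normally I do not run the compilation with diagnostics, because that slows things down significantly, and I know that every file in master should compile, so I can skip the diagnostics.
Ideas on how to solve my problem in a generally effective way are very welcome.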
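One possible way to list candidate files without running the compiler at all is to check whether each file is valid UTF-8. The following is only a sketch under assumptions: IsValidUtf8 is a hypothetical helper name, the byte arrays are made-up samples, and the check relies on UTF8Encoding constructed with throwOnInvalidBytes: true. Files that start with a UTF-16 or UTF-32 BOM would need to be handled separately.

```csharp
using System;
using System.Text;

// Sample inputs for illustration only: 'A', a non-breaking space, 'B'.
byte[] loneByte = { 0x41, 0xA0, 0x42 };       // lone 0xA0
byte[] utf8Pair = { 0x41, 0xC2, 0xA0, 0x42 }; // 0xC2 0xA0 pair

Console.WriteLine(IsValidUtf8(loneByte)); // False
Console.WriteLine(IsValidUtf8(utf8Pair)); // True

// Hypothetical helper: returns false when the bytes are not well-formed UTF-8.
static bool IsValidUtf8(byte[] bytes)
{
    // A strict decoder throws instead of silently substituting U+FFFD.
    var strictUtf8 = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);
    try
    {
        strictUtf8.GetCharCount(bytes);
        return true;
    }
    catch (DecoderFallbackException)
    {
        return false;
    }
}
```

A file can still be win-1252 and happen to pass this check, so it narrows the list rather than proving anything.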
Digging a little deeper, I think the encoding of that file is win-1252. Some files without any BOM contain the 0xC2 0xA0 pair, which means they are UTF-8.
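To make the difference concrete, here is a small sketch of how the same non-breaking space decodes under each encoding. The byte arrays are made up for illustration, and the RegisterProvider call assumes .NET Core / .NET 5+, where code page 1252 is not available by default (on .NET Framework it should not be needed).

```csharp
using System;
using System.Text;

// Assumption: running on .NET Core / .NET 5+, where code page 1252 needs this provider.
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

// Illustrative inputs: 'A', a non-breaking space, 'B'.
byte[] loneByte = { 0x41, 0xA0, 0x42 };       // NBSP as win-1252 stores it
byte[] utf8Pair = { 0x41, 0xC2, 0xA0, 0x42 }; // NBSP as UTF-8 stores it

string loneAsUtf8 = Encoding.UTF8.GetString(loneByte);
string loneAsWin1252 = Encoding.GetEncoding(1252).GetString(loneByte);
string pairAsUtf8 = Encoding.UTF8.GetString(utf8Pair);

// A lone 0xA0 is not valid UTF-8, so the decoder substitutes U+FFFD.
Console.WriteLine(loneAsUtf8[1] == '\uFFFD');    // True
// The same byte read as win-1252 is a proper non-breaking space.
Console.WriteLine(loneAsWin1252[1] == '\u00A0'); // True
// The two-byte pair read as UTF-8 is also a proper non-breaking space.
Console.WriteLine(pairAsUtf8[1] == '\u00A0');    // True
```

The U+FFFD in the first case is consistent with the '�' shown in the CS1056 message.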
Inspired by StreamReader.DetectEncoding, I came up with the following variant:
private static readonly Encoding s_unicodeBigEndianWithBOM = new UnicodeEncoding(bigEndian: true, byteOrderMark: true);
private static readonly Encoding s_unicodeLittleEndianWithBOM = new UnicodeEncoding(bigEndian: false, byteOrderMark: true);
private static readonly Encoding s_utf32BigEndianWithBOM = new UTF32Encoding(bigEndian: true, byteOrderMark: true);
private static readonly Encoding s_utf32LittleEndianWithBOM = new UTF32Encoding(bigEndian: false, byteOrderMark: true);
private static readonly Encoding s_win1252 = Encoding.GetEncoding(1252);
...
private Encoding DetectEncoding(byte[] bytes)
{
    const byte NBSP_PREFIX = 0xC2; // first byte of a UTF-8 encoded non-breaking space
    const byte NBSP = 0xA0;        // win-1252 non-breaking space, or second byte of the UTF-8 pair
    if (bytes.Length < 2)
    {
        return Encoding.UTF8;
    }
    // FE FF: UTF-16 big-endian BOM
    if (bytes[0] == 0xFE && bytes[1] == 0xFF)
    {
        return s_unicodeBigEndianWithBOM;
    }
    // FF FE: UTF-16 little-endian BOM, or UTF-32 little-endian when followed by 00 00
    if (bytes[0] == 0xFF && bytes[1] == 0xFE)
    {
        if (bytes.Length < 4 || bytes[2] != 0 || bytes[3] != 0)
        {
            return s_unicodeLittleEndianWithBOM;
        }
        return s_utf32LittleEndianWithBOM;
    }
    // EF BB BF: UTF-8 BOM
    if (bytes.Length >= 3 && bytes[0] == 0xEF && bytes[1] == 0xBB && bytes[2] == 0xBF)
    {
        return Encoding.UTF8;
    }
    // 00 00 FE FF: UTF-32 big-endian BOM
    if (bytes.Length >= 4 && bytes[0] == 0 && bytes[1] == 0 && bytes[2] == 0xFE && bytes[3] == 0xFF)
    {
        return s_utf32BigEndianWithBOM;
    }
    // No BOM: only the first 0xA0 is inspected. If it is not preceded by 0xC2, assume win-1252.
    int pos = bytes.AsSpan().IndexOf(NBSP);
    if (pos >= 0 && (pos == 0 || bytes[pos - 1] != NBSP_PREFIX))
    {
        return s_win1252;
    }
    return Encoding.UTF8;
}
Using the encoding returned by this function when parsing the files seems to have solved my problem.
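For completeness, this is roughly what the adjusted ParseBytes might look like. Treat it as a sketch rather than the exact code: the ParseExtensions class name and the extra encoding parameter are added here for illustration, and it relies on the optional Encoding argument of SourceText.From(byte[], int, Encoding) from the Microsoft.CodeAnalysis packages.

```csharp
using System.Text;
using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.CSharp;
using Microsoft.CodeAnalysis.Text;

// Hypothetical container class; the original post does not show where ParseBytes lives.
public static class ParseExtensions
{
    // Same shape as the original ParseBytes, except that the encoding is
    // passed explicitly to SourceText.From instead of being left unspecified.
    public static SyntaxTree ParseBytes(this byte[] bytes, Encoding encoding, CSharpParseOptions options, string filePath) =>
        CSharpSyntaxTree.ParseText(SourceText.From(bytes, bytes.Length, encoding), options, filePath);
}
```

The call site then becomes something like o.Bytes.ParseBytes(DetectEncoding(o.Bytes), parseOptions, o.FilePath).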