Length of the substring matched by the culture-sensitive String.IndexOf method

Question

I tried to write a culture-aware string replacement method:

public static string Replace(string text, string oldValue, string newValue)
{
    int index = text.IndexOf(oldValue, StringComparison.CurrentCulture);
    return index >= 0
        ? text.Substring(0, index) + newValue + text.Substring(index + oldValue.Length)
        : text;
}

However, it chokes on Unicode combining characters:

// \u0301 is Combining Acute Accent
Console.WriteLine(Replace("déf", "é", "o"));       // 1. CORRECT: dof
Console.WriteLine(Replace("déf", "e\u0301", "o")); // 2. INCORRECT: do
Console.WriteLine(Replace("de\u0301f", "é", "o")); // 3. INCORRECT: dóf

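To fix my code, I need to know that in the second example, String.IndexOf matched only one character (é) even though it searched for two (e\u0301). Similarly, I need to know that in the third example, String.IndexOf matched two characters (e\u0301) even though it searched for only one (é).

How can I determine the actual length of the substring matched by String.IndexOf?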

Note: Performing Unicode normalization on text and oldValue (as suggested by James Keesey) would accommodate combining characters, but ligatures would still be a problem:

Console.WriteLine(Replace("œf", "œ", "i"));  // 4. CORRECT: if
Console.WriteLine(Replace("œf", "oe", "i")); // 5. INCORRECT: i
Console.WriteLine(Replace("oef", "œ", "i")); // 6. INCORRECT: ief

Answer 1

You will have to call FindNLSString or FindNLSStringEx directly yourself. String.IndexOf uses FindNLSStringEx, but all the information you need is available from FindNLSString.

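Here is an example of how to rewrite your Replace method so that it works for your test cases. Note that I am using the current user locale; if you want to use the system locale or supply your own, read the API documentation. I am also passing 0 for the flags, which means it will use the locale's default string comparison options; again, the documentation can help you supply different options.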

// Requires: using System.Runtime.InteropServices;
public const int LOCALE_USER_DEFAULT = 0x0400;

[DllImport("kernel32.dll", SetLastError = true, ExactSpelling = true)]
internal static extern int FindNLSString(int locale, uint flags, [MarshalAs(UnmanagedType.LPWStr)] string sourceString, int sourceCount, [MarshalAs(UnmanagedType.LPWStr)] string findString, int findCount, out int found);

public static string ReplaceWithCombiningCharSupport(string text, string oldValue, string newValue)
{
    int foundLength;
    int index = FindNLSString(LOCALE_USER_DEFAULT, 0, text, text.Length, oldValue, oldValue.Length, out foundLength);
    return index >= 0 ? text.Substring(0, index) + newValue + text.Substring(index + foundLength) : text;
}

Answer 2

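I spoke too soon (I had never seen this method before), but there is another option. You can use the StringInfo.ParseCombiningCharacters() method to get the start of each actual character and use that to determine the length of the string to replace.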
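A minimal sketch of that idea, assuming the match begins on a text-element boundary and that oldValue is a single text element. The names TextElementDemo and TextElementLengthAt are hypothetical and used only for illustration.

```csharp
using System;
using System.Globalization;

public static class TextElementDemo
{
    // Returns the number of UTF-16 code units in the text element that starts
    // at the given index, or -1 if no text element starts at that index.
    public static int TextElementLengthAt(string text, int index)
    {
        // Start indexes of every base character, surrogate pair,
        // or combining character sequence in the string.
        int[] starts = StringInfo.ParseCombiningCharacters(text);

        int position = Array.IndexOf(starts, index);
        if (position < 0)
            return -1;

        // The element ends where the next one starts, or at the end of the string.
        int end = position + 1 < starts.Length ? starts[position + 1] : text.Length;
        return end - index;
    }
}
```

For "de\u0301f" the element at index 1 spans two code units, while for "déf" it spans one. This does not help with ligatures such as "œ", which are single text elements.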

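You need to normalize both strings before you make the IndexOf call. This ensures that the source and target strings have the same length.

See the String.Normalize() reference page, which describes this exact problem.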
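The normalization approach could look roughly like the sketch below. It assumes Normalization Form C is acceptable for the output, and the class name NormalizedReplace is hypothetical. It covers combining characters only; as the question notes, ligatures are not handled by normalization.

```csharp
using System;
using System.Text;

public static class NormalizedReplace
{
    public static string Replace(string text, string oldValue, string newValue)
    {
        // Compose both strings so that "e\u0301" and "é" share one representation.
        string normalizedText = text.Normalize(NormalizationForm.FormC);
        string normalizedOld = oldValue.Normalize(NormalizationForm.FormC);

        int index = normalizedText.IndexOf(normalizedOld, StringComparison.CurrentCulture);
        if (index < 0)
            return text; // no match: return the input unchanged

        // Assumption: after normalization the matched length equals normalizedOld.Length.
        return normalizedText.Substring(0, index)
             + newValue
             + normalizedText.Substring(index + normalizedOld.Length);
    }
}
```

With this sketch, examples 1 to 3 from the question should all produce "dof". Note that the returned string is the normalized form of the input, which may differ from the original encoding outside the replaced span.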

Answer 3

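The following method works for your examples. It works by comparing values until it finds how many characters of the source string are needed to equal oldValue, and then uses that count instead of simply oldValue.Length.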

public static string Replace(string text, string oldValue, string newValue)
{
    int index = text.IndexOf(oldValue, StringComparison.CurrentCulture);
    if (index >= 0)
        return text.Substring(0, index) + newValue +
                 text.Substring(index + LengthInString(text, oldValue, index));
    else
        return text;
}
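// Finds the shortest substring of text starting at index that compares equal
// to oldValue under the current culture, and returns its length.
static int LengthInString(string text, string oldValue, int index)
{
    for (int length = 1; length <= text.Length - index; length++)
        if (string.Equals(text.Substring(index, length), oldValue,
                                            StringComparison.CurrentCulture))
            return length;
    // Should not normally be reached when index comes from a successful IndexOf.
    throw new InvalidOperationException("No substring at the given index equals oldValue.");
}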

Answer 4

Starting with .NET 5, the CompareInfo.IndexOf method has an overload that returns the number of matched characters through an out parameter:

public int IndexOf(
    ReadOnlySpan<char> source, ReadOnlySpan<char> value, CompareOptions options,
    out int matchLength);

So the culture-aware string replacement method can be rewritten like this:

public static string Replace(string text, string oldValue, string newValue)
{
    int index = CultureInfo.CurrentCulture.CompareInfo.IndexOf(text, oldValue, CompareOptions.IgnoreCase, out int matchLength);
    return index >= 0
        ? text.Substring(0, index) + newValue + text.Substring(index + matchLength)
        : text;
}

The results are as follows:

// \u0301 is Combining Acute Accent
Console.WriteLine(Replace("déf", "é", "o"));       // 1. CORRECT: dof
Console.WriteLine(Replace("déf", "e\u0301", "o")); // 2. CORRECT: dof
Console.WriteLine(Replace("de\u0301f", "é", "o")); // 3. CORRECT: dof
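The same overload can also drive a replace-all loop. The following is only a sketch under a few assumptions: it targets .NET 5 or later, the names CultureAwareText and ReplaceAll are hypothetical, and it passes CompareOptions.None to mirror the StringComparison.CurrentCulture behavior from the question. It searches each remaining slice independently, so context before the slice is not considered.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class CultureAwareText
{
    public static string ReplaceAll(string text, string oldValue, string newValue, CultureInfo culture)
    {
        if (string.IsNullOrEmpty(oldValue))
            throw new ArgumentException("oldValue must not be empty.", nameof(oldValue));

        CompareInfo compareInfo = culture.CompareInfo;
        var result = new StringBuilder();
        ReadOnlySpan<char> remaining = text.AsSpan();

        while (true)
        {
            int index = compareInfo.IndexOf(
                remaining, oldValue.AsSpan(), CompareOptions.None, out int matchLength);

            // Stop when nothing is found; a zero-length match is also treated
            // as "not found" so the loop always makes progress.
            if (index < 0 || matchLength == 0)
                break;

            result.Append(remaining.Slice(0, index)).Append(newValue);
            remaining = remaining.Slice(index + matchLength);
        }

        result.Append(remaining);
        return result.ToString();
    }
}
```

Because the loop advances by matchLength rather than oldValue.Length, a text that mixes precomposed and decomposed forms of "é" should have every occurrence replaced.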

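Also, starting with .NET 5, because of the switch from NLS to ICU, the ligature "œ" no longer matches "oe" in the en-US culture. Reverting to NLS makes examples 4-6 work correctly: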

Console.WriteLine(Replace("œf", "œ", "i"));  // 4. CORRECT: if
Console.WriteLine(Replace("œf", "oe", "i")); // 5. CORRECT: if
Console.WriteLine(Replace("oef", "œ", "i")); // 6. CORRECT: if
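For reference, one documented way to revert to NLS on Windows is a runtime configuration setting. The fragment below is a sketch of a runtimeconfig.json file, assuming the application runs on Windows and that switching globalization behavior for the whole application is acceptable:

```json
{
  "runtimeOptions": {
    "configProperties": {
      "System.Globalization.UseNls": true
    }
  }
}
```

This setting affects all culture-sensitive operations in the process, not only IndexOf, so it is worth checking other comparisons and sorting after enabling it.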
