Java substring by code point index (treating surrogate code unit pairs as a single code point)

Question (votes: 2, answers: 1)

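I have a small demo application that shows a problem with Java's substring implementation when the string contains Unicode code points that need surrogate pairs (that is, code points that cannot be represented in 2 bytes, so they take two chars). I would like to know whether my solution works well or whether I am missing something. I considered posting on codereview, but this has more to do with Java's String implementation than with my simple code itself.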

public class SubstringTest {
    public static void main(String[] args) {

        String stringWithPlus2ByteCodePoints = "👦👩👪👫";

        String substring1 = stringWithPlus2ByteCodePoints.substring(0, 1);
        String substring2 = stringWithPlus2ByteCodePoints.substring(0, 2);
        String substring3 = stringWithPlus2ByteCodePoints.substring(1, 3);

        System.out.println(stringWithPlus2ByteCodePoints);
        System.out.println("invalid sub: " + substring1);
        System.out.println("invalid sub: " + substring2);
        System.out.println("invalid sub: " + substring3);

        String realSub1 = getRealSubstring(stringWithPlus2ByteCodePoints, 0, 1);
        String realSub2 = getRealSubstring(stringWithPlus2ByteCodePoints, 0, 2);
        String realSub3 = getRealSubstring(stringWithPlus2ByteCodePoints, 1, 3);
        System.out.println("real sub: " + realSub1);
        System.out.println("real sub: " + realSub2);
        System.out.println("real sub: " + realSub3);
    }

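    // Substring by code point indices rather than UTF-16 code unit (char) indices.
    private static String getRealSubstring(String string, int beginIndex, int endIndex) {
        if (string == null)
            throw new IllegalArgumentException("String should not be null");
        // Note: length() counts code units, so this is only a coarse pre-check;
        // offsetByCodePoints below still throws IndexOutOfBoundsException when an
        // index is negative or exceeds the number of code points.
        int length = string.length();
        if (endIndex < 0 || beginIndex > endIndex || beginIndex > length || endIndex > length)
            throw new IllegalArgumentException("Invalid indices");
        // Map the code point indices to code unit offsets.
        int realBeginIndex = string.offsetByCodePoints(0, beginIndex);
        int realEndIndex = string.offsetByCodePoints(0, endIndex);
        return string.substring(realBeginIndex, realEndIndex);
    }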

}

Output:

👦👩👪👫
invalid sub: ?
invalid sub: 👦
invalid sub: ??
real sub: 👦
real sub: 👦👩
real sub: 👩👪

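Can I rely on my substring implementation to always give the desired substring, avoiding the problem that Java's substring method indexes by chars rather than by code points?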
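
For readers who want to see where the `?` characters in the output come from, here is a minimal sketch, not part of the original question. It assumes the same four emoji as the demo, written as surrogate-pair escapes so the source file encoding does not matter. The class name `CodeUnitDemo` and the helper `hasLoneSurrogate` are made up for illustration. It shows that `length()` counts UTF-16 code units while `codePointCount` counts code points, and that two of the three plain `substring` calls cut a surrogate pair in half.

```java
// Sketch: UTF-16 code units vs. code points for the question's sample string.
class CodeUnitDemo {
    // The four emoji from the question (U+1F466, U+1F469, U+1F46A, U+1F46B),
    // each stored as a high surrogate followed by a low surrogate.
    static final String SAMPLE = "\uD83D\uDC66\uD83D\uDC69\uD83D\uDC6A\uD83D\uDC6B";

    // Hypothetical helper: true if s contains a surrogate that is not part of a valid pair.
    static boolean hasLoneSurrogate(String s) {
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            // codePointAt returns the surrogate value itself when it is unpaired.
            if (cp >= Character.MIN_SURROGATE && cp <= Character.MAX_SURROGATE) {
                return true;
            }
            i += Character.charCount(cp);
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(SAMPLE.length());                            // 8 code units
        System.out.println(SAMPLE.codePointCount(0, SAMPLE.length()));  // 4 code points
        System.out.println(hasLoneSurrogate(SAMPLE.substring(0, 1)));   // true
        System.out.println(hasLoneSurrogate(SAMPLE.substring(0, 2)));   // false
        System.out.println(hasLoneSurrogate(SAMPLE.substring(1, 3)));   // true
    }
}
```

`substring(0, 2)` happens to line up with a pair boundary, which is why the second "invalid sub" line in the output still shows a complete emoji.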

java string unicode character-encoding char
1 answer
2 votes

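There is no need to walk to beginIndex twice. The second offsetByCodePoints call can start from the offset that was already computed for the beginning: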

    public String codePointSubstring(String s, int start, int end) {
        int a = s.offsetByCodePoints(0, start);
        return s.substring(a, s.offsetByCodePoints(a, end - start));
    }

Translated from this Scala snippet:

def codePointSubstring(s: String, begin: Int, end: Int): String = {
  val a = s.offsetByCodePoints(0, begin)
  s.substring(a, s.offsetByCodePoints(a, end - begin))
}
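
As a quick sanity check of the Java version above, the following sketch runs it on a string that mixes one-char and two-char code points. The wrapper class `CodePointSubstringDemo`, the `static` modifier, and the sample string are assumptions added for illustration; the method body is the one from this answer.

```java
// Sketch: exercising the single-pass code point substring on mixed input.
class CodePointSubstringDemo {
    // Same logic as the answer's method, made static so main can call it directly.
    static String codePointSubstring(String s, int start, int end) {
        int a = s.offsetByCodePoints(0, start);
        // Continue counting from a instead of starting again at offset 0.
        return s.substring(a, s.offsetByCodePoints(a, end - start));
    }

    public static void main(String[] args) {
        // Four code points in six chars: 'a', U+1F466, 'b', U+1F469.
        String sample = "a\uD83D\uDC66b\uD83D\uDC69";
        System.out.println(codePointSubstring(sample, 1, 3).equals("\uD83D\uDC66b")); // true
        System.out.println(codePointSubstring(sample, 0, 4).equals(sample));          // true
        System.out.println(codePointSubstring(sample, 2, 2).isEmpty());               // true
    }
}
```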

I omitted the IllegalArgumentExceptions because they do not seem to carry any more information than the exceptions that are thrown anyway.
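
To illustrate that last point, here is a rough sketch of what happens without explicit checks. The class name `CodePointBoundsDemo` and the helper `throwsIndexOutOfBounds` are hypothetical. As far as I can tell from the documented behaviour of `offsetByCodePoints` and `substring`, bad code point indices surface as an `IndexOutOfBoundsException` (or a subclass of it), and a null string would fail with a `NullPointerException`.

```java
// Sketch: exceptions raised by the unchecked version for out-of-range indices.
class CodePointBoundsDemo {
    static String codePointSubstring(String s, int start, int end) {
        int a = s.offsetByCodePoints(0, start);
        return s.substring(a, s.offsetByCodePoints(a, end - start));
    }

    // Hypothetical helper: reports whether the call fails with IndexOutOfBoundsException.
    static boolean throwsIndexOutOfBounds(String s, int start, int end) {
        try {
            codePointSubstring(s, start, end);
            return false;
        } catch (IndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // Four code points (U+1F466, U+1F469, U+1F46A, U+1F46B) in eight chars.
        String sample = "\uD83D\uDC66\uD83D\uDC69\uD83D\uDC6A\uD83D\uDC6B";
        System.out.println(throwsIndexOutOfBounds(sample, 1, 3));  // false: valid range
        System.out.println(throwsIndexOutOfBounds(sample, 0, 5));  // true: end past the last code point
        System.out.println(throwsIndexOutOfBounds(sample, -1, 2)); // true: negative start
        System.out.println(throwsIndexOutOfBounds(sample, 2, 1));  // true: end before start
    }
}
```

Note that an end index of 5 is within the char length of 8, so a length check based on `length()` would let it through and the exception would come from `offsetByCodePoints` in either version.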
