Encoding a code point


I have a Unicode code point, which could be anything: it might be ASCII, something in the BMP, or an exotic emoji such as U+1F612.

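I expected there to be an easy way to take a code point and encode it into a byte array, but I can't find one. I can convert it to a string and then encode that, but that is a roundabout route: it first encodes the code point as UTF-16 and then re-encodes it into the desired encoding. I'd like to encode it directly to bytes.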

public static byte[] encodeCodePoint(int codePoint, Charset charset) {
    // Surely there's got to be a better way than this:
    return new StringBuilder().appendCodePoint(codePoint).toString().getBytes(charset);
}
java unicode character-encoding
2 Answers
1 vote

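There is really no way to avoid UTF-16, because Java represents text as UTF-16 and that is what the charset encoders are designed to consume. That does not mean you have to go through a String for the UTF-16 data, though: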

public static byte[] encodeCodePoint(int codePoint, Charset charset) {
    char[] chars = Character.toChars(codePoint);
    CharBuffer cb = CharBuffer.wrap(chars);
    ByteBuffer buff = charset.encode(cb);
    byte[] bytes = new byte[buff.remaining()];
    buff.get(bytes);
    return bytes;
}
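
As a quick check of the method above, here is a minimal sketch that runs it on a few code points and prints the bytes in hex. The class name EncodeCodePointDemo and the hex helper are made up for this illustration; only encodeCodePoint comes from the answer.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodeCodePointDemo {
    // Same approach as the answer: code point -> UTF-16 chars -> charset bytes, no String.
    static byte[] encodeCodePoint(int codePoint, Charset charset) {
        char[] chars = Character.toChars(codePoint);
        ByteBuffer buff = charset.encode(CharBuffer.wrap(chars));
        byte[] bytes = new byte[buff.remaining()];
        buff.get(bytes);
        return bytes;
    }

    // Hypothetical helper for display only.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(hex(encodeCodePoint(0x41, StandardCharsets.UTF_8)));       // 41
        System.out.println(hex(encodeCodePoint(0x1F612, StandardCharsets.UTF_8)));    // F0 9F 98 92
        System.out.println(hex(encodeCodePoint(0x1F612, StandardCharsets.UTF_16BE))); // D8 3D DE 12
    }
}
```

Note that Charset.encode silently substitutes the charset's replacement bytes for anything it cannot map, and that Character.toChars throws IllegalArgumentException for values that are not valid code points. If you need to detect unmappable input, a CharsetEncoder configured with CodingErrorAction.REPORT is one option.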

0 votes

If you want to encode an emoji as four bytes in UTF-8:

out.write(0xf0 | ((codePoint >> 18)));
out.write(0x80 | ((codePoint >> 12) & 0x3f));
out.write(0x80 | ((codePoint >>  6) & 0x3f));
out.write(0x80 | (codePoint & 0x3f));
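
The four lines above only cover code points from U+10000 upward. If UTF-8 is the only target, the same bit-shifting extends to every length, which skips UTF-16 entirely. The following is a sketch rather than a drop-in replacement: the class and method names (Utf8Direct, encodeUtf8) are invented here, and it produces standard UTF-8, so U+0000 becomes a single 0x00 byte.

```java
class Utf8Direct {
    // Encodes one Unicode scalar value as standard UTF-8 (1 to 4 bytes).
    static byte[] encodeUtf8(int cp) {
        // Reject values outside U+0000..U+10FFFF and the surrogate range.
        if (!Character.isValidCodePoint(cp) || (cp >= 0xD800 && cp <= 0xDFFF)) {
            throw new IllegalArgumentException(
                    "Not a Unicode scalar value: 0x" + Integer.toHexString(cp));
        }
        if (cp < 0x80) {            // 0xxxxxxx
            return new byte[] { (byte) cp };
        } else if (cp < 0x800) {    // 110xxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F)) };
        } else if (cp < 0x10000) {  // 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F)) };
        } else {                    // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xF0 | (cp >> 18)),
                (byte) (0x80 | ((cp >> 12) & 0x3F)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F)) };
        }
    }
}
```

For example, encodeUtf8(0x1F612) yields F0 9F 98 92 and encodeUtf8(0x20AC) yields E2 82 AC. This only helps for UTF-8; for an arbitrary Charset you still need the encoder machinery, which works on UTF-16 chars.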

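Here is the whole function for converting chars to bytes. I needed to write the bytes to a stream, preceded by their count. You can change it to build a byte array instead.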

void writeStringBytes(DataOutput out, char[] chars,final int off,final int strlen) throws IOException {
    
    int utflen = strlen; // optimized for ASCII

    // Count the bytes needed: 1 per char by default, plus the extras added below.

    for (int i = 0; i < strlen; i++) {
        int c = chars[off+i];            
        if (c >= 0x80 || c == 0){
            if((c>=Character.MIN_HIGH_SURROGATE && c<=Character.MAX_HIGH_SURROGATE) ||
                    (c>=Character.MIN_LOW_SURROGATE && c<=Character.MAX_LOW_SURROGATE)){
                utflen += 1;
            } else {
                utflen += (c >= 0x800) ? 2 : 1;
            }
        }
    }
    
    
    out.writeInt(utflen); // the byte count goes first; to build an array instead, allocate new byte[utflen] here
    
    
    if(utflen==strlen){// only ascii chars
        for (int i = 0; i < strlen; i++) {
            out.write(chars[off+i]);
        }
        return;
    }
    
    for (int i=0; i < strlen; i++) {
        int c = chars[off+i];
        if (c < 0x80 && c != 0) {
            out.write(c);
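        } else if (Character.isSurrogate((char) c)) {
            // codePointAt returns the surrogate itself when it has no partner inside the range
            int uc = Character.codePointAt(chars, off + i, off + strlen);
            if (!Character.isSupplementaryCodePoint(uc)) { // unpaired surrogate: 2 bytes, as counted above
                out.write('?');
                out.write('?');
            } else {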
                out.write(0xf0 | ((uc >> 18)));
                out.write(0x80 | ((uc >> 12) & 0x3f));
                out.write(0x80 | ((uc >>  6) & 0x3f));
                out.write(0x80 | (uc & 0x3f));
                i++;
            }                                
        } else if (c >= 0x800) {
            out.write(0xE0 | ((c >> 12) & 0x0F));
            out.write(0x80 | ((c >>  6) & 0x3F));
            out.write(0x80 | ((c >>  0) & 0x3F));
        } else {
            out.write(0xC0 | ((c >>  6) & 0x1F));
            out.write(0x80 | ((c >>  0) & 0x3F));
        }
    }
}    
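
If you want a byte array rather than a stream, one way to adapt this without touching the function is to point a DataOutputStream at a ByteArrayOutputStream. The sketch below assumes that setup. writeLengthPrefixed is a hypothetical stand-in for writeStringBytes so the example is self-contained: it uses the JDK's standard UTF-8 encoder, so unlike the function above it writes U+0000 as a single 0x00 byte.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class ByteArrayCapture {
    // Hypothetical stand-in for writeStringBytes: 4-byte big-endian length, then the bytes.
    static void writeLengthPrefixed(DataOutput out, String s) throws IOException {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        out.writeInt(utf8.length);
        out.write(utf8);
    }

    // Collects everything written to the DataOutput into an in-memory array.
    static byte[] toByteArray(String s) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            writeLengthPrefixed(out, s);
        } catch (IOException e) {
            // Not expected for an in-memory stream, but the interface declares it.
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        byte[] bytes = toByteArray("A");
        System.out.println(bytes.length); // 5: four length bytes plus one data byte
    }
}
```

Since DataOutputStream implements DataOutput, the same wrapper works with writeStringBytes(out, chars, off, strlen) in place of the stand-in.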