Encoding a code point

Question

I have a Unicode code point, and it could be anything: ASCII, something in the BMP, or an exotic emoji such as U+1F612.

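I expected there to be a simple way to take a code point and encode it as a byte array, but I can't find one. I can turn it into a string and then encode that, but that is a roundabout route: it first encodes the code point as UTF-16 and then re-encodes it into the desired encoding. I would like to encode it directly to bytes.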

public static byte[] encodeCodePoint(int codePoint, Charset charset) {
    // Surely there's got to be a better way than this:
    return new StringBuilder().appendCodePoint(codePoint).toString().getBytes(charset);
}

Answer 1

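There is really no way to avoid UTF-16, because Java uses UTF-16 for text data and that is what the charset converters are designed to work with. That does not mean, however, that you have to use a String to hold the UTF-16 data: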

public static byte[] encodeCodePoint(int codePoint, Charset charset) {
    char[] chars = Character.toChars(codePoint);
    CharBuffer cb = CharBuffer.wrap(chars);
    ByteBuffer buff = charset.encode(cb);
    byte[] bytes = new byte[buff.remaining()];
    buff.get(bytes);
    return bytes;
}
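As a quick sanity check of the approach above, here is a minimal sketch that wraps the same method in a class together with a small hex formatter. The class name CodePointDemo and the toHex helper are made up for illustration; only the JDK calls (Character.toChars, CharBuffer.wrap, Charset.encode) are real.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;

class CodePointDemo {
    // Same route as the answer above: code point -> UTF-16 chars -> target charset.
    static byte[] encodeCodePoint(int codePoint, Charset charset) {
        ByteBuffer buff = charset.encode(CharBuffer.wrap(Character.toChars(codePoint)));
        byte[] bytes = new byte[buff.remaining()];
        buff.get(bytes);
        return bytes;
    }

    // Hypothetical helper: formats bytes as space-separated upper-case hex.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }
}
```

For U+1F612, CodePointDemo.toHex(CodePointDemo.encodeCodePoint(0x1F612, StandardCharsets.UTF_8)) should give F0 9F 98 92, and StandardCharsets.UTF_16BE should give D8 3D DE 12. Note that Charset.encode silently substitutes the charset's replacement bytes for characters it cannot map; if you need an error instead, a CharsetEncoder configured with CodingErrorAction.REPORT is probably the better fit.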

Answer 2

If you want to encode an emoji as four bytes in UTF-8:

out.write(0xf0 | ((codePoint >> 18)));
out.write(0x80 | ((codePoint >> 12) & 0x3f));
out.write(0x80 | ((codePoint >>  6) & 0x3f));
out.write(0x80 | (codePoint & 0x3f));
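The four-byte case above extends naturally to the one-, two- and three-byte forms. The following is a rough sketch of encoding a single code point straight to a UTF-8 byte array with no UTF-16 step; the class and method names (Utf8Sketch, encodeUtf8) are hypothetical, and it handles UTF-8 only, not arbitrary charsets. It emits standard UTF-8, so U+0000 becomes a single zero byte, unlike the full function in this answer, which writes NUL as two bytes.

```java
class Utf8Sketch {
    // Encodes one Unicode code point as standard UTF-8 (1 to 4 bytes).
    static byte[] encodeUtf8(int cp) {
        if (!Character.isValidCodePoint(cp) || (cp >= 0xD800 && cp <= 0xDFFF)) {
            // Out-of-range values and lone surrogates have no UTF-8 form.
            throw new IllegalArgumentException("not a Unicode scalar value: " + cp);
        }
        if (cp < 0x80) {
            // 0xxxxxxx
            return new byte[] { (byte) cp };
        }
        if (cp < 0x800) {
            // 110xxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F))
            };
        }
        if (cp < 0x10000) {
            // 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F))
            };
        }
        // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return new byte[] {
            (byte) (0xF0 | (cp >> 18)),
            (byte) (0x80 | ((cp >> 12) & 0x3F)),
            (byte) (0x80 | ((cp >> 6) & 0x3F)),
            (byte) (0x80 | (cp & 0x3F))
        };
    }
}
```

Calling Utf8Sketch.encodeUtf8(0x1F612) should return the bytes F0 9F 98 92, the same result as new String(Character.toChars(0x1F612)).getBytes(StandardCharsets.UTF_8).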

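Here is the whole function for converting chars to bytes. I needed to write the bytes to a stream, with the byte count at the start. You can change it to build a byte array instead.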

void writeStringBytes(DataOutput out, char[] chars,final int off,final int strlen) throws IOException {
    
    int utflen = strlen; // optimized for ASCII

    // count the number of bytes needed

    for (int i = 0; i < strlen; i++) {
        int c = chars[off+i];            
        if (c >= 0x80 || c == 0){
            if((c>=Character.MIN_HIGH_SURROGATE && c<=Character.MAX_HIGH_SURROGATE) ||
                    (c>=Character.MIN_LOW_SURROGATE && c<=Character.MAX_LOW_SURROGATE)){
                utflen += 1;
            } else {
                utflen += (c >= 0x800) ? 2 : 1;
            }
        }
    }
    
    
    out.writeInt(utflen); // the byte count goes first; to build an array instead, allocate new byte[utflen] here
    
    
    if(utflen==strlen){// only ascii chars
        for (int i = 0; i < strlen; i++) {
            out.write(chars[off+i]);
        }
        return;
    }
    
    for (int i=0; i < strlen; i++) {
        int c = chars[off+i];
        if (c < 0x80 && c != 0) {
            out.write(c);
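        } else if (Character.isSurrogate((char) c)) {
            // codePointAt never returns a negative value; an unpaired surrogate comes back unchanged.
            // The limit argument keeps the lookahead inside [off, off+strlen).
            int uc = Character.codePointAt(chars, off + i, off + strlen);
            if (!Character.isSupplementaryCodePoint(uc)) { // unpaired high or low surrogate
                out.write('?'); // two bytes, matching the count above
                out.write('?');
            } else {
                out.write(0xf0 | ((uc >> 18)));
                out.write(0x80 | ((uc >> 12) & 0x3f));
                out.write(0x80 | ((uc >>  6) & 0x3f));
                out.write(0x80 | (uc & 0x3f));
                i++; // skip the low surrogate that was just consumed
            }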
        } else if (c >= 0x800) {
            out.write(0xE0 | ((c >> 12) & 0x0F));
            out.write(0x80 | ((c >>  6) & 0x3F));
            out.write(0x80 | ((c >>  0) & 0x3F));
        } else {
            out.write(0xC0 | ((c >>  6) & 0x1F));
            out.write(0x80 | ((c >>  0) & 0x3F));
        }
    }
}
