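I have a Unicode code point that could be anything: it might be ASCII, something in the BMP, or an exotic emoji such as U+1F612.
I expected there to be a simple way to take a code point and encode it into a byte array, but I can't find one. I can convert it to a String and then encode that, but that is a roundabout route: it first encodes the code point as UTF-16 and then re-encodes it into the target encoding. I'd like to encode it directly to bytes.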
public static byte[] encodeCodePoint(int codePoint, Charset charset) {
    // Surely there's got to be a better way than this:
    return new StringBuilder().appendCodePoint(codePoint).toString().getBytes(charset);
}
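There is really no way to avoid UTF-16, because Java uses UTF-16 for text data and that is what the charset converters are designed to work with. That doesn't mean you have to use a String for the UTF-16 data, though: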
public static byte[] encodeCodePoint(int codePoint, Charset charset) {
    char[] chars = Character.toChars(codePoint); // one char, or a surrogate pair
    CharBuffer cb = CharBuffer.wrap(chars);
    ByteBuffer buff = charset.encode(cb);
    byte[] bytes = new byte[buff.remaining()];
    buff.get(bytes);
    return bytes;
}
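To see what that gives you in practice, here is a small self-contained sketch. The class name CodePointEncodeDemo and the hex helper are made up for illustration; only the encodeCodePoint logic comes from the answer above. Note that Charset.encode silently substitutes the charset's replacement bytes for anything it cannot map (for example '?' in US-ASCII), so this is not a validating encoder.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CodePointEncodeDemo {
    // Same approach as above: code point -> char[] (UTF-16) -> bytes, no String involved.
    public static byte[] encodeCodePoint(int codePoint, Charset charset) {
        CharBuffer cb = CharBuffer.wrap(Character.toChars(codePoint));
        ByteBuffer buff = charset.encode(cb);
        byte[] bytes = new byte[buff.remaining()];
        buff.get(bytes);
        return bytes;
    }

    // Hypothetical helper, only here to print the result readably.
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        int face = 0x1F612;
        System.out.println(hex(encodeCodePoint(face, StandardCharsets.UTF_8)));    // F0 9F 98 92
        System.out.println(hex(encodeCodePoint(face, StandardCharsets.UTF_16BE))); // D8 3D DE 12
        System.out.println(hex(encodeCodePoint('A', StandardCharsets.US_ASCII)));  // 41
    }
}
```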
If you want to encode an emoji (any code point from U+10000 upward) as four bytes in UTF-8:
out.write(0xf0 | ((codePoint >> 18)));
out.write(0x80 | ((codePoint >> 12) & 0x3f));
out.write(0x80 | ((codePoint >> 6) & 0x3f));
out.write(0x80 | (codePoint & 0x3f));
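Those four lines only cover the supplementary range. If you want the same idea for any single code point, a minimal sketch could look like the following. The class and method names (Utf8CodePoint, encodeUtf8) are my own, and rejecting surrogates with an exception is an assumption about what you want; it produces standard UTF-8 only, not any other charset.

```java
class Utf8CodePoint {
    // Hypothetical helper: encodes one Unicode scalar value as standard UTF-8.
    public static byte[] encodeUtf8(int cp) {
        // Surrogates (U+D800..U+DFFF) are not encodable in well-formed UTF-8.
        if (!Character.isValidCodePoint(cp) || (cp >= 0xD800 && cp <= 0xDFFF)) {
            throw new IllegalArgumentException(
                    String.format("not a Unicode scalar value: 0x%X", cp));
        }
        if (cp < 0x80) {            // 1 byte: 0xxxxxxx
            return new byte[] { (byte) cp };
        } else if (cp < 0x800) {    // 2 bytes: 110xxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F))
            };
        } else if (cp < 0x10000) {  // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F))
            };
        } else {                    // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xF0 | (cp >> 18)),
                (byte) (0x80 | ((cp >> 12) & 0x3F)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F))
            };
        }
    }
}
```

For example, Utf8CodePoint.encodeUtf8(0x1F612) yields the bytes F0 9F 98 92, the same as going through a String and getBytes(StandardCharsets.UTF_8).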
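Here is the whole function for converting chars to bytes. I needed to write the bytes to a stream, preceded by the byte count. You can change it to build a byte array instead.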
void writeStringBytes(DataOutput out, char[] chars, final int off, final int strlen) throws IOException {
    final int end = off + strlen;
    int utflen = strlen; // optimized for ASCII: start at one byte per char
    // First pass: count the bytes we need.
    for (int i = off; i < end; i++) {
        int c = chars[i];
        if (c >= 0x80 || c == 0) {
            if (Character.isSurrogate((char) c)) {
                // 2 bytes per surrogate: 4 for a valid pair, "??" for an unpaired one
                utflen += 1;
            } else {
                // 3 bytes from U+0800 up, otherwise 2 (NUL is written as 0xC0 0x80)
                utflen += (c >= 0x800) ? 2 : 1;
            }
        }
    }
    // I need the number of bytes first. You could create the array here instead: new byte[utflen]
    out.writeInt(utflen);
    if (utflen == strlen) { // only non-NUL ASCII chars
        for (int i = off; i < end; i++) {
            out.write(chars[i]);
        }
        return;
    }
    // Second pass: write the bytes.
    for (int i = off; i < end; i++) {
        int c = chars[i];
        if (c < 0x80 && c != 0) {
            out.write(c);
        } else if (Character.isSurrogate((char) c)) {
            // codePointAt returns the surrogate itself when it is not the start of a valid pair
            int uc = Character.codePointAt(chars, i, end);
            if (uc < Character.MIN_SUPPLEMENTARY_CODE_POINT) { // unpaired surrogate
                out.write('?');
                out.write('?');
            } else {
                out.write(0xF0 | (uc >> 18));
                out.write(0x80 | ((uc >> 12) & 0x3F));
                out.write(0x80 | ((uc >> 6) & 0x3F));
                out.write(0x80 | (uc & 0x3F));
                i++; // skip the low surrogate of the pair
            }
        } else if (c >= 0x800) {
            out.write(0xE0 | ((c >> 12) & 0x0F));
            out.write(0x80 | ((c >> 6) & 0x3F));
            out.write(0x80 | (c & 0x3F));
        } else {
            // two-byte form, also used for NUL
            out.write(0xC0 | ((c >> 6) & 0x1F));
            out.write(0x80 | (c & 0x3F));
        }
    }
}
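If you only want a byte array and no length prefix, a shorter variant is possible by walking the input one code point at a time. This is just a sketch under a few assumptions: the names CharsToUtf8 and toUtf8 are invented here, it emits standard UTF-8 (so NUL becomes a single 0x00 byte, unlike the function above), and an unpaired surrogate is replaced by a single '?'.

```java
import java.io.ByteArrayOutputStream;

class CharsToUtf8 {
    // Hypothetical helper: encodes chars[off .. off+len) as standard UTF-8.
    public static byte[] toUtf8(char[] chars, int off, int len) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(len);
        int end = off + len;
        for (int i = off; i < end; ) {
            // Reads a surrogate pair as one code point; never looks past 'end'.
            int cp = Character.codePointAt(chars, i, end);
            i += Character.charCount(cp);
            if (cp < 0x80) {
                out.write(cp);
            } else if (cp < 0x800) {
                out.write(0xC0 | (cp >> 6));
                out.write(0x80 | (cp & 0x3F));
            } else if (cp >= 0xD800 && cp <= 0xDFFF) {
                out.write('?'); // unpaired surrogate: assumed replacement policy
            } else if (cp < 0x10000) {
                out.write(0xE0 | (cp >> 12));
                out.write(0x80 | ((cp >> 6) & 0x3F));
                out.write(0x80 | (cp & 0x3F));
            } else {
                out.write(0xF0 | (cp >> 18));
                out.write(0x80 | ((cp >> 12) & 0x3F));
                out.write(0x80 | ((cp >> 6) & 0x3F));
                out.write(0x80 | (cp & 0x3F));
            }
        }
        return out.toByteArray();
    }
}
```

ByteArrayOutputStream grows as needed, so there is no counting pass. If you need the exact size up front, keep the two-pass approach from the function above.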