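Can you explain to me what exactly mbstate_t is? I have read the cppreference description, but I still don't understand its purpose. What I do understand is that mbstate_t is some static struct visible to a limited set of functions such as mbtowc(), wctomb(), etc., but I am still confused about how to use it. I can see in the cppreference examples that this struct should be reset before calling some functions. Suppose I want to count the number of characters in a multilingual string like this: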
std::string str = "Hello! Привет!";
Obviously, str.size() cannot be used in this example, because it only returns the number of bytes in the string. But something like this does the job:
std::locale::global(std::locale("")); // Linux, UTF-8
std::string str = "Hello! Привет!";
std::string::size_type stringSize = str.size();
std::string::size_type nCharacters = 0;
std::string::size_type nextByte = 0;
std::string::size_type nBytesRead = 0;
std::mbtowc(nullptr, 0, 0); // What does it do, and why is it needed?
while (
    (nBytesRead = std::mbtowc(nullptr, &str[nextByte], stringSize - nextByte))
    != 0)
{
    ++nCharacters;
    nextByte += nBytesRead;
}
std::cout << nCharacters << '\n';
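According to the cppreference example, the mbstate_t structure should be reset before entering the while loop by calling mbtowc() with all arguments set to zero. What is the purpose of this?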
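As a hedged pointer on that last question: per the C standard, calling mbtowc() with a null string pointer puts the function's hidden internal conversion state back into the initial state, and the return value is nonzero only if the current encoding is state-dependent. The sketch below wraps that call; the name reset_mbtowc_state is hypothetical and not part of any library.

```c
#include <stdlib.h>

/* Hypothetical wrapper (not a library function).
   Passing a null string pointer makes mbtowc() reset its hidden internal
   conversion state. The return value reports whether the current LC_CTYPE
   encoding is state-dependent (nonzero) or not (zero). */
int reset_mbtowc_state(void)
{
    return mbtowc(NULL, NULL, 0) != 0;
}
```

For stateless encodings such as UTF-8 the result is 0, so the reset mostly matters after an earlier call stopped in the middle of a character or failed with an error.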
#include <stdio.h>
#include <wchar.h>
#include <stdlib.h>
#include <string.h>
#include <locale.h>
int main(){
    setlocale(LC_CTYPE, ""); // use the environment's locale (UTF-8 here)
    // test string
    const char * s1 = "привет";
    const char * s1end = s1 + strlen(s1);
    // letter positions (byte offset of each character)
    const char * p = s1;
    mbtowc(NULL, 0, 0); // reset mbtowc's internal conversion state
    while(*p) {
        printf("%td ", p - s1);
        p += mbtowc(NULL, p, s1end - p);
    }
    printf("%td\n", p - s1);
    // split in the middle of the third letter
    char s2[100];
    char s3[100];
    strncpy(s2, s1, 5); // copies 5 bytes without a terminator; s2end marks the end
    const char * s2end = s2 + 5;
    strcpy(s3, s1 + 5);
    const char * s3end = s3 + strlen(s3);
    // conversion state; also print its size
    mbstate_t state;
    memset(&state, 0, sizeof(state));
    printf("state size = %zu\n", sizeof(state));
    // print first part
    wchar_t wc;
    int rc;
    p = s2;
    while((rc = mbrtowc(&wc, p, s2end - p, &state)) > 0)
    {
        p += rc;
        // wide character, bytes read, raw state (its size is 8 here, so dump it as a pointer)
        printf("%lc %d %p\n", wc, rc, *(void**)&state);
    }
    // state after the incomplete third letter
    printf("%p\n", *(void**)&state);
    // print second part, continuing from the saved state
    p = s3;
    while((rc = mbrtowc(&wc, p, s3end - p, &state)) > 0)
    {
        p += rc;
        // wide character, bytes read, raw state
        printf("%lc %d %p\n", wc, rc, *(void**)&state);
    }
}
Output when run on Ubuntu:
0 2 4 6 8 10 12
state size = 8
п 2 (nil)
р 2 (nil)
0x40000000201
и 1 0x40000000000
в 2 0x40000000000
е 2 0x40000000000
т 2 0x40000000000
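The output above shows the state object remembering the first byte of the third letter between the two buffers. Applied to the original character-counting task, the same idea could look like the sketch below. It uses the standard mbrlen() with a caller-owned mbstate_t; the helper name count_chars is hypothetical, and treating a null character as the end of input is a simplifying assumption.

```c
#include <assert.h>
#include <locale.h>   /* setlocale(), which the caller needs */
#include <stddef.h>
#include <string.h>   /* memset(), used by the caller to zero the state */
#include <wchar.h>

/* Hypothetical helper (not a library function).
   Counts the complete multibyte characters in buf[0..n) using the current
   LC_CTYPE locale. A character cut off at the end of the buffer is kept in
   *st and is counted by the later call that supplies its remaining bytes.
   Counting stops at a null character (simplifying assumption).
   Returns (size_t)-1 on an invalid byte sequence. */
size_t count_chars(const char *buf, size_t n, mbstate_t *st)
{
    assert(st != NULL);
    size_t count = 0;
    while (n > 0) {
        size_t rc = mbrlen(buf, n, st);
        if (rc == (size_t)-1)
            return (size_t)-1;  /* invalid sequence; *st is unspecified now */
        if (rc == (size_t)-2)
            break;              /* incomplete: all n bytes were absorbed into *st */
        if (rc == 0)
            break;              /* null character: treated as end of input */
        buf += rc;
        n -= rc;
        ++count;
    }
    return count;
}
```

A caller would first select a UTF-8 locale with setlocale(LC_CTYPE, ""), zero the state with memset(&st, 0, sizeof st), and then feed the chunks in order. With "привет" split after 5 bytes, the first call should report 2 characters and the second 4, matching the split seen in the output above.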