How can I split Devanagari bi-, tri-, and tetra-conjunct consonants from a string as whole units?

Question

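I am new to Rust, and I am trying to split Devanagari (vowels and) bi-, tri-, and tetra-conjunct consonants as whole units while keeping the vowel signs and the virama, and then map them to other Indic scripts. I first tried Rust's chars(), which did not work. Then I came across grapheme clusters. I have been googling Unicode and UTF-8, grapheme clusters, and complex scripts.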

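My current code uses grapheme clusters, but it does not give me the desired output. I understand that this approach may not work for complex scripts such as Devanagari or other Indic scripts.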

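How can I get the desired output? I also have another piece of code in which I tried to build simple clusters based on a Stack Overflow answer, converting it from Python to Rust, but I have had no luck so far. I have been stuck on this problem for two weeks.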

Here are the Wikipedia pages on the Devanagari script and its conjuncts:

Devanagari script: https://en.wikipedia.org/wiki/Devanagari
Devanagari conjuncts: https://en.wikipedia.org/wiki/Devanagari_conjuncts

Here is the code I wrote to do the split:

use unicode_segmentation::UnicodeSegmentation;

fn main() {
    let hs = "हिन्दी मुख्यमंत्री हिमंत";
    // `true` requests extended grapheme clusters.
    let hsi = hs.graphemes(true).collect::<Vec<&str>>();
    for i in hsi {
        print!("{}  ", i); // two spaces between clusters for readability
    }
}

Current output:

हि  न्  दी   मु  ख्  य  मं  त्  री    हि  मं  त

Desired output:

हि न्दी  मु ख्य मं त्री  हि मं त

My other attempt:

I also tried building simple grapheme clusters following this answer: https://stackoverflow.com/a/6806203/2724286

fn split_conjuncts(text: &str) -> Vec<String> {
    let mut result = vec![];
    let mut temp = String::new();

    for c in text.chars() {
        if (c as u32) >= 0x0300 && (c as u32) <= 0x036F {
            temp.push(c);
        } else {
            temp.push(c);
            if !temp.is_empty() {
                result.push(temp.clone());
                temp.clear();
            }
        }
    }
    if !temp.is_empty() {
        result.push(temp);
    }
    result
}

fn main() {
    let text = "संस्कृतम्";
    let split_tokens = split_conjuncts(text);
    println!("{:?}", split_tokens);

}

Output:

["स", "\u{902}", "स", "\u{94d}", "क", "\u{943}", "त", "म", "\u{94d}"]

So, how can I get the desired output?

Desired output:

हि न्दी  मु ख्य मं त्री  हि मं त

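I also checked other SO answers dealing with Unicode, graphemes, and UTF-8 issues (listed below), but still no luck.

Combining diacritics not normalized with unicodedata.normalize (PYTHON)

What is the difference between combining characters and grapheme extenders

Extended grapheme clusters stop combining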

Answer 1

Here is how I did it in Python:

    import unicodedata
    
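    def is_consonant(ch):
        # Matches any Devanagari letter of category Lo. Despite the name,
        # this also includes independent vowels such as 'अ'.
        return (unicodedata.category(ch) == 'Lo' and
                'DEVANAGARI LETTER' in unicodedata.name(ch, ''))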
    
    def is_combining_mark(ch):
        return unicodedata.category(ch) in ['Mn', 'Mc']
    
    def split_hindi_text(text):
        clusters = []
        cluster = ''
        i = 0
        n = len(text)
        while i < n:
            ch = text[i]
            cluster += ch
            if is_consonant(ch):
                # Handle Virama (Halant)
                while i + 1 < n and text[i + 1] == '\u094D':
                    i += 1
                    cluster += text[i]  # Add Virama
                    if i + 1 < n and is_consonant(text[i + 1]):
                        i += 1
                        cluster += text[i]  # Add next consonant
                    else:
                        break
                # Collect combining marks (vowel signs, nasalization, etc.)
                while i + 1 < n and is_combining_mark(text[i + 1]):
                    i += 1
                    cluster += text[i]
            else:
                # Collect combining marks for non-consonant characters
                while i + 1 < n and is_combining_mark(text[i + 1]):
                    i += 1
                    cluster += text[i]
            clusters.append(cluster)
            cluster = ''
            i += 1
        return ' '.join(clusters)
    
    # Example usage:
    text = 'हिन्दी मुख्यमंत्री हिमंत'
    
    # Split the text into words
    words = text.split()
    
    # Split each word into its characters and add spaces between them
    with open("test.txt", 'w', encoding='utf-8') as f:
        for word in words:
            spaced_word = split_hindi_text(word)  # Apply the split_hindi_text function to each word
            f.write(spaced_word + '\n')  # Write each word on a new line

Output:

हि न्दी
मु ख्य मं त्री
हि मं त
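
Since the question is about Rust, here is a rough port of the same idea using only the standard library. Treat it as a sketch under some assumptions, not a complete solution: Rust's standard library does not expose Unicode general categories, so the two helpers (`is_devanagari_letter` and `is_devanagari_mark`, names I made up for this example) hard-code code point ranges from the Devanagari block (U+0900 to U+097F) instead of calling something like `unicodedata`. Other Indic scripts would need their own ranges, and a consonant followed by a nukta before the virama is not joined, just as in the Python version.

```rust
// Virama (halant), which joins consonants into a conjunct.
const VIRAMA: char = '\u{094D}';

// Assumption: hard-coded ranges for Devanagari letters (category Lo with
// "DEVANAGARI LETTER" in the name). Like the Python helper, this also
// matches independent vowels, not only consonants.
fn is_devanagari_letter(c: char) -> bool {
    matches!(c,
        '\u{0904}'..='\u{0939}' | '\u{0958}'..='\u{0961}' | '\u{0972}'..='\u{097F}')
}

// Assumption: hard-coded ranges for Devanagari combining marks (Mn/Mc):
// vowel signs, anusvara, visarga, nukta, virama, stress marks.
fn is_devanagari_mark(c: char) -> bool {
    matches!(c,
        '\u{0900}'..='\u{0903}' | '\u{093A}'..='\u{093C}' | '\u{093E}'..='\u{094F}'
        | '\u{0951}'..='\u{0957}' | '\u{0962}'..='\u{0963}')
}

// Splits `text` into clusters: letter (+ virama + letter)* followed by any
// trailing combining marks. Any other character starts its own cluster.
fn split_clusters(text: &str) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    let mut clusters = Vec::new();
    let mut i = 0;
    while i < chars.len() {
        let mut cluster = String::new();
        cluster.push(chars[i]);
        if is_devanagari_letter(chars[i]) {
            // Follow virama + letter chains to keep conjuncts together.
            while i + 1 < chars.len() && chars[i + 1] == VIRAMA {
                i += 1;
                cluster.push(chars[i]); // the virama
                if i + 1 < chars.len() && is_devanagari_letter(chars[i + 1]) {
                    i += 1;
                    cluster.push(chars[i]); // the next letter of the conjunct
                } else {
                    break; // trailing virama with no letter after it
                }
            }
        }
        // Attach vowel signs, anusvara, etc. to the current cluster.
        while i + 1 < chars.len() && is_devanagari_mark(chars[i + 1]) {
            i += 1;
            cluster.push(chars[i]);
        }
        clusters.push(cluster);
        i += 1;
    }
    clusters
}

fn main() {
    let text = "हिन्दी मुख्यमंत्री हिमंत";
    // Split into words first, then print each word's clusters on one line.
    for word in text.split_whitespace() {
        println!("{}", split_clusters(word).join(" "));
    }
}
```

With the sample text this should print the same three lines as the Python version above. If you need this for more than Devanagari, it is probably better to replace the hard-coded ranges with a crate that provides Unicode general categories than to extend them by hand.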
