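I'm new to Rust. I'm trying to treat Devanagari vowels and two-, three-, and four-consonant conjuncts as single units while keeping their vowel signs and virama, and then map them to other Indic scripts. I first tried Rust's chars(), but that didn't work. Then I came across grapheme clusters. I have been googling Unicode and UTF-8, grapheme clusters, and complex scripts.
My current code uses grapheme clusters, but it does not give me the output I want. I understand that this approach may not work for complex scripts such as Devanagari and other Indic scripts.
How can I get the desired output? I have another piece of code in which I tried to build simple clusters based on a Stack Overflow answer, converting it from Python to Rust, but I haven't had any luck yet. I have been stuck on this problem for two weeks.
Here are the Wikipedia articles on the Devanagari script and its conjuncts:
Devanagari script: https://en.wikipedia.org/wiki/Devanagari
Devanagari conjuncts: https://en.wikipedia.org/wiki/Devanagari_conjuncts
Here is what I wrote to split the text: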
extern crate unicode_segmentation;
use unicode_segmentation::UnicodeSegmentation;

fn main() {
    let hs = "हिन्दी मुख्यमंत्री हिमंत";
    let hsi = hs.graphemes(true).collect::<Vec<&str>>();
    for i in hsi {
        print!("{} ", i); // double space eye comfort
    }
}
Current output:
हि न् दी मु ख् य मं त् री हि मं त
Desired output:
हि न्दी मु ख्य मं त्री हि मं त
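The gap between the two outputs is that each cluster ending in a virama (U+094D) should stay attached to the consonant that follows it. As a minimal sketch, assuming the clusters arrive as a slice of string slices (for example, collected from graphemes(true)), a post-processing pass could merge them. The names merge_conjuncts and is_devanagari_letter are hypothetical, and the letter ranges are an approximation rather than a full Unicode property lookup.

```rust
const VIRAMA: char = '\u{094D}';

/// Rough check for a Devanagari letter. Hypothetical helper: the ranges
/// (U+0904..=U+0939 and U+0958..=U+0961) are an approximation.
fn is_devanagari_letter(c: char) -> bool {
    matches!(c, '\u{0904}'..='\u{0939}' | '\u{0958}'..='\u{0961}')
}

/// Joins a cluster onto the previous one when the previous cluster ends in
/// a virama and the current cluster starts with a Devanagari letter.
fn merge_conjuncts(clusters: &[&str]) -> Vec<String> {
    let mut merged: Vec<String> = Vec::new();
    for &cluster in clusters {
        let starts_with_letter = cluster
            .chars()
            .next()
            .map_or(false, is_devanagari_letter);
        let joins_previous = starts_with_letter
            && merged.last().map_or(false, |prev| prev.ends_with(VIRAMA));
        if joins_previous {
            if let Some(prev) = merged.last_mut() {
                prev.push_str(cluster);
            }
        } else {
            merged.push(cluster.to_string());
        }
    }
    merged
}

fn main() {
    // Clusters for one word, as they appear in the current output above.
    let clusters = ["मु", "ख्", "य", "मं", "त्", "री"];
    println!("{}", merge_conjuncts(&clusters).join(" "));
}
```

A virama at the end of a word is left alone, because the next cluster (if any) does not start with a letter.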
My other attempt:
I also tried to build simple grapheme clusters following this answer: https://stackoverflow.com/a/6806203/2724286
fn split_conjuncts(text: &str) -> Vec<String> {
    let mut result = vec![];
    let mut temp = String::new();
    for c in text.chars() {
        if (c as u32) >= 0x0300 && (c as u32) <= 0x036F {
            temp.push(c);
        } else {
            temp.push(c);
            if !temp.is_empty() {
                result.push(temp.clone());
                temp.clear();
            }
        }
    }
    if !temp.is_empty() {
        result.push(temp);
    }
    result
}

fn main() {
    let text = "संस्कृतम्";
    let split_tokens = split_conjuncts(text);
    println!("{:?}", split_tokens);
}
Output:
["स", "\u{902}", "स", "\u{94d}", "क", "\u{943}", "त", "म", "\u{94d}"]
Desired output:
हि न्दी मु ख्य मं त्री हि मं त
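One likely reason this attempt emits every code point separately: the range it tests, U+0300 to U+036F, is the Combining Diacritical Marks block, while the signs in the output above (U+0902, U+094D, U+0943) sit in the Devanagari block, U+0900 to U+097F. The sketch below only illustrates that mismatch; the two helper names are hypothetical.

```rust
/// The same condition that split_conjuncts uses.
fn in_combining_diacriticals(c: char) -> bool {
    (c as u32) >= 0x0300 && (c as u32) <= 0x036F
}

/// True if `c` lies in the Devanagari block (U+0900..=U+097F).
fn in_devanagari_block(c: char) -> bool {
    ('\u{0900}'..='\u{097F}').contains(&c)
}

fn main() {
    let text = "संस्कृतम्";
    for c in text.chars() {
        println!(
            "U+{:04X} combining_diacritical={} devanagari={}",
            c as u32,
            in_combining_diacriticals(c),
            in_devanagari_block(c)
        );
    }
}
```

Since no character in the input falls in the tested range, the else branch runs every time and each code point is pushed as its own token.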
I have also looked at other SO answers dealing with Unicode, graphemes, and UTF-8 problems (links below), but no luck yet.
Here is how I did it in Python:
import unicodedata

def is_consonant(ch):
    return (unicodedata.category(ch) == 'Lo' and
            'DEVANAGARI LETTER' in unicodedata.name(ch))

def is_combining_mark(ch):
    return unicodedata.category(ch) in ['Mn', 'Mc']

def split_hindi_text(text):
    clusters = []
    cluster = ''
    i = 0
    n = len(text)
    while i < n:
        ch = text[i]
        cluster += ch
        if is_consonant(ch):
            # Handle Virama (Halant)
            while i + 1 < n and text[i + 1] == '\u094D':
                i += 1
                cluster += text[i]  # Add Virama
                if i + 1 < n and is_consonant(text[i + 1]):
                    i += 1
                    cluster += text[i]  # Add next consonant
                else:
                    break
            # Collect combining marks (vowel signs, nasalization, etc.)
            while i + 1 < n and is_combining_mark(text[i + 1]):
                i += 1
                cluster += text[i]
        else:
            # Collect combining marks for non-consonant characters
            while i + 1 < n and is_combining_mark(text[i + 1]):
                i += 1
                cluster += text[i]
        clusters.append(cluster)
        cluster = ''
        i += 1
    return ' '.join(clusters)

# Example usage:
text = 'हिन्दी मुख्यमंत्री हिमंत'
# Split the text into words
words = text.split()
# Split each word into its clusters and add spaces between them
with open("test.txt", 'w') as f:
    for word in words:
        spaced_word = split_hindi_text(word)  # Apply the split_hindi_text function to each word
        f.write(spaced_word + '\n')  # Write each word on a new line
Output:
हि न्दी
मु ख्य मं त्री
हि मं त
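The Python logic above can be carried over to Rust fairly directly. The following is a minimal sketch using only the standard library. Since std has no Unicode category lookup, the helpers is_letter and is_mark (hypothetical names) approximate the Lo and Mn/Mc checks with hard-coded Devanagari ranges; those ranges are an assumption and would need adjusting for other Indic scripts.

```rust
const VIRAMA: char = '\u{094D}';

/// Approximates the Python is_consonant check with fixed Devanagari ranges.
/// Like the Python version, it also accepts independent vowels.
fn is_letter(c: char) -> bool {
    matches!(c,
        '\u{0904}'..='\u{0939}' | '\u{0958}'..='\u{0961}' | '\u{0972}'..='\u{097F}')
}

/// Approximates is_combining_mark (Mn/Mc), limited to the Devanagari block.
fn is_mark(c: char) -> bool {
    matches!(c,
        '\u{0900}'..='\u{0903}' | '\u{093A}'..='\u{093C}' | '\u{093E}'..='\u{094F}'
        | '\u{0951}'..='\u{0957}' | '\u{0962}'..='\u{0963}')
}

/// Splits one word into clusters: letter, then any (virama + letter) pairs,
/// then trailing combining marks.
fn split_clusters(text: &str) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    let n = chars.len();
    let mut clusters = Vec::new();
    let mut i = 0;
    while i < n {
        let mut cluster = String::new();
        cluster.push(chars[i]);
        if is_letter(chars[i]) {
            // Follow virama links to build the conjunct.
            while i + 1 < n && chars[i + 1] == VIRAMA {
                i += 1;
                cluster.push(chars[i]); // virama
                if i + 1 < n && is_letter(chars[i + 1]) {
                    i += 1;
                    cluster.push(chars[i]); // next consonant
                } else {
                    break;
                }
            }
        }
        // Collect vowel signs, anusvara, and other marks.
        while i + 1 < n && is_mark(chars[i + 1]) {
            i += 1;
            cluster.push(chars[i]);
        }
        clusters.push(cluster);
        i += 1;
    }
    clusters
}

fn main() {
    for word in "हिन्दी मुख्यमंत्री हिमंत".split_whitespace() {
        println!("{}", split_clusters(word).join(" "));
    }
}
```

Collecting the input into a Vec<char> keeps the index-based lookahead of the Python version; a peekable iterator would avoid the allocation at the cost of a less direct translation.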