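I am building a browser-based TMX (translation memory) editor. My content extraction script breaks when the source and/or target segments contain tags. Source/target strings that contain tags look like this: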
<tuv xml:lang="BG">
<seg><bpt x="1" i="1" type="italic"/>Формантът -лар/-лар- (от тюрк. -lar) е слабо представен в българската словообразувателна система.<ept i="1"/></seg>
</tuv>
<tuv xml:lang="EN-GB">
<seg><bpt x="1" i="1" type="italic"/>The -lar/-lar- formant (from the Turkic -lar) is sparsely represented in the Bulgarian word-building system.<ept i="1"/></seg>
</tuv>
When the source/target segments contain no tags, extraction is easy:
$sourceText = $TU['tuv'][0]['seg'];
$targetText = $TU['tuv'][1]['seg'];
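But when tags are present (in initial position), I get stuck, especially because the tags are treated as arrays.
I don't know how to proceed: I can check whether the source/target string contains or is an array, but I'm not sure what to do next. Ultimately, I need to print the text together with the tags, so that users can edit the text and move or delete tags where necessary.
Here is my test code: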
$uploadedFile = "user_files/sample_tmx.tmx";
$xmlStr = file_get_contents($uploadedFile);
$xmlObj = simplexml_load_string($xmlStr);
$arrXml = $util->objectsIntoArray($xmlObj);
$TUs = $arrXml['body']['tu'];
$pattern = '~<.*?>~';
foreach($TUs as $TU) {
    if(is_array($TU['tuv'][0]['seg'])) {
        var_dump($TU['tuv'][0]['seg']);
        $pattern = '~<.*?>~';
        $uncleanSource = $TU['tuv'][0]['seg'];
        $uncleanTarget = $TU['tuv'][1]['seg'];
        foreach($uncleanSource as $unclean) {
            //var_dump($unclean);
        }
        //$sourceText = preg_replace($pattern, "<TAG>", $uncleanSource);
        //$targetText = preg_replace($pattern, "<TAG>", $uncleanTarget);
    }
    else {
        $sourceText = $TU['tuv'][0]['seg'];
        $targetText = $TU['tuv'][1]['seg'];
    }
    echo "<p>".$sourceText." = ".$targetText."</p>";
}
Here is the content of the sample_tmx.tmx file:
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE tmx SYSTEM "tmx14.dtd" >
<tmx version="1.4">
<header creationtool="TDC Analysis Package" creationtoolversion="org.gs4tr.tm3.tmx.Version" segtype="sentence" o-tmf="unknown" adminlang="EN-US" srclang="BG" datatype="unknown" creationdate="20221006T184234Z" >
</header>
<body>
<tu creationdate="20201101T133734Z" creationid="Cheeseus" changedate="20201103T151745Z" changeid="Cheeseus" usagecount="0">
<prop type="nextMd5Checksum">e82d0ed6d711aa59310d1e8f4478537e</prop>
<prop type="previousMd5Checksum">39b279e324e9f6cd27351287502eefcb</prop>
<tuv xml:lang="BG">
<seg><bpt x="1" i="1" type="italic"/>Формантът -лар/-лар- (от тюрк. -lar) е слабо представен в българската словообразувателна система.<ept i="1"/></seg>
</tuv>
<tuv xml:lang="EN-GB">
<seg><bpt x="1" i="1" type="italic"/>The -lar/-lar- formant (from the Turkic -lar) is sparsely represented in the Bulgarian word-building system.<ept i="1"/></seg>
</tuv>
</tu>
<tu creationdate="20080812T111221Z" creationid="Cheeseus" changedate="20190825T065920Z" changeid="Cheeseus" usagecount="0">
<tuv xml:lang="BG">
<seg>ПАРТНЬОРИ</seg>
</tuv>
<tuv xml:lang="EN-GB">
<seg>PARTNERS</seg>
</tuv>
</tu>
</body>
</tmx>
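The problem seems to come from objectsIntoArray(), which is hard to fix without seeing its code.
If you drop that call and use the SimpleXML elements the way they are meant to be used, you can access the segments like this (other code removed to focus on this issue):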
$xmlObj = simplexml_load_file($uploadedFile);
foreach($xmlObj->body->tu as $TU) {
    if(isset($TU->tuv[0]->seg)) {
        echo $TU->tuv[0]->seg->asXML();
    }
}
The asXML() method reproduces the content as it is, serializing all child elements back into markup. This gives:
<seg>
<bpt x="1" i="1" type="italic"/>
Формантът -лар/-лар- (от тюрк. -lar) е слабо представен в българската словообразувателна система.<ept i="1"/>
</seg>
<seg>ПАРТНЬОРИ</seg>
You may need to do more work to produce the result you want, but hopefully this shows how it can be done.
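As one possible direction for that extra work, here is a minimal sketch that returns the inner markup of each <seg> (text plus inline tags, without the surrounding <seg> wrapper) and escapes it for display in the browser. The helper name segInnerXml() and the shortened inline sample data are assumptions made for illustration and are not part of the original code. The sketch relies on dom_import_simplexml() and DOMDocument::saveXML() with a node argument.

```php
<?php
// Hypothetical helper (an assumption, not from the original post): returns
// the text and inline tags of a <seg> without the <seg>...</seg> wrapper.
function segInnerXml(SimpleXMLElement $seg): string
{
    $node = dom_import_simplexml($seg);
    $inner = '';
    foreach ($node->childNodes as $child) {
        // saveXML($child) serializes one child node (text or element).
        $inner .= $node->ownerDocument->saveXML($child);
    }
    return $inner;
}

// Shortened stand-in for sample_tmx.tmx so the sketch is self-contained.
$xml = <<<'XML'
<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4"><body>
<tu><tuv xml:lang="BG"><seg><bpt x="1" i="1" type="italic"/>Текст<ept i="1"/></seg></tuv>
<tuv xml:lang="EN-GB"><seg><bpt x="1" i="1" type="italic"/>Text<ept i="1"/></seg></tuv></tu>
<tu><tuv xml:lang="BG"><seg>ПАРТНЬОРИ</seg></tuv>
<tuv xml:lang="EN-GB"><seg>PARTNERS</seg></tuv></tu>
</body></tmx>
XML;

$xmlObj = simplexml_load_string($xml);

// Collect [source, target] pairs; each string keeps its inline tags.
$pairs = [];
foreach ($xmlObj->body->tu as $TU) {
    if (isset($TU->tuv[0]->seg, $TU->tuv[1]->seg)) {
        $pairs[] = [segInnerXml($TU->tuv[0]->seg), segInnerXml($TU->tuv[1]->seg)];
    }
}

// Escape before printing so the browser shows the tags as editable text
// instead of trying to interpret them as HTML.
foreach ($pairs as [$source, $target]) {
    echo '<p>' . htmlspecialchars($source) . ' = ' . htmlspecialchars($target) . "</p>\n";
}
```

Because tagged and untagged segments go through the same function, the is_array() branch from the question is no longer needed. Whether the tags should be shown as raw markup or replaced by placeholders in the editor is a separate design decision.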