我有一个自然语言处理解析树为
(S
(NP I)
(VP
(VP (V shot) (NP (Det an) (N elephant)))
(PP (P in) (NP (Det my) (N pajamas)))))
并且我想将其存储在关联数组中,但是PHP中没有函数,因为NLP通常在python中完成。
因此,我应该解析左括号和右括号以构建树形关联数组。我可以想到两个选择
我认为第一种方法是非标准的,正则表达式模式在复杂情况下可能会中断。
您能建议一个可靠的方法吗?
关联数组可以具有任何形式,因为操作起来并不困难(我需要循环使用它,但是可以类似]
Array (
[0] = > word => ROOT, tag => S, children => Array (
[0] word => I, tag = > NP, children => Array()
[1] word => ROOT, tag => VP, children => Array (
[0] => word => ROOT, tag => VP, children => Array ( .... )
[1] => word => ROOT, tag => PP, children => Array ( .... )
)
)
)
或者可以是
Array (
[0] = > Array([0] => S, [1] => Array (
[0] Array([0] => NP, [1] => 'I') // child array is replaced by a string
[1] Array([0] => VP, [1] => Array (
[0] => Array([0] => VP, [1] => Array ( .... )
[1] => Array([0] => PP, [1] => Array ( .... )
)
)
[使用像bison或flex之类的词法分析器生成器,或只用手工编写自己的词法分析器,此answer具有您需要的一些有用信息。
这里是用PHP编写的快速而又肮脏的POC代码段,它将按预期输出一个关联数组。
$data =<<<EOL
(S
(NP I)
(VP
(VP (V shot) (NP (Det an) (N elephant)))
(PP (P in) (NP (Det my) (N pajamas)))))
EOL;
$lexer = new Lexer($data);
$array = buildTree($lexer, 0);
print_r($array);
function buildTree($lexer, $level)
{
$subtrees = [];
$markers = [];
while (($token = $lexer->nextToken()) !== false) {
if ($token == '(') {
$subtrees[] = buildTree($lexer, $level);
} elseif ($token == ')') {
return buildNode($markers, $subtrees);
} else {
$markers[] = $token;
}
}
return buildNode($markers, $subtrees);
}
function buildNode($markers, $subtrees)
{
if (count($markers) && count($subtrees)) {
return [$markers[0], $subtrees];
} elseif (count($subtrees)) {
return $subtrees;
} else {
return $markers;
}
}
class Lexer
{
private $data;
private $matches;
private $index = -1;
public function __construct($data)
{
$this->data = $data;
preg_match_all('/[\w]+|\(|\)/', $data, $matches);
$this->matches = $matches[0];
}
public function nextToken()
{
$index = ++$this->index;
if (isset($this->matches[$index]) === false) {
return false;
}
return $this->matches[$index];
}
}
输出
Array
(
[0] => Array
(
[0] => S
[1] => Array
(
[0] => Array
(
[0] => NP
[1] => I
)
[1] => Array
(
[0] => VP
[1] => Array
(
[0] => Array
(
[0] => VP
[1] => Array
(
[0] => Array
(
[0] => V
[1] => shot
)
[1] => Array
(
[0] => NP
[1] => Array
(
[0] => Array
(
[0] => Det
[1] => an
)
[1] => Array
(
[0] => N
[1] => elephant
)
)
)
)
)
[1] => Array
(
[0] => PP
[1] => Array
(
[0] => Array
(
[0] => P
[1] => in
)
[1] => Array
(
[0] => NP
[1] => Array
(
[0] => Array
(
[0] => Det
[1] => my
)
[1] => Array
(
[0] => N
[1] => pajamas
)
)
)
)
)
)
)
)
)
)