How do I convert open/close brackets into an associative array in PHP?


I have a natural language processing parse tree:

(S
  (NP I)
  (VP
    (VP (V shot) (NP (Det an) (N elephant)))
    (PP (P in) (NP (Det my) (N pajamas)))))


I want to store it in an associative array, but PHP has no built-in function for this, since NLP is usually done in Python.

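So I have to parse the opening and closing brackets myself to build a tree-shaped associative array. I can think of two options:

  1. Replace the brackets with arbitrary XML or HTML tags and parse the result as an XML or HTML document.
  2. Use regular expressions.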

I think the first approach is non-standard, and a regex pattern may break on complex (deeply nested) input.

Can you suggest a reliable method?

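The associative array can take any form, since it is not hard to manipulate afterwards (I need to loop over it), but it could look like this: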

Array (
[0] => word => ROOT, tag => S, children => Array (
    [0] => word => I, tag => NP, children => Array()
    [1] => word => ROOT, tag => VP, children => Array (
        [0] => word => ROOT, tag => VP, children => Array ( .... )
        [1] => word => ROOT, tag => PP, children => Array ( .... )
    )
)
)

Or it could be:

Array (
[0] => Array([0] => S, [1] => Array (
    [0] => Array([0] => NP, [1] => 'I') // child array is replaced by a string
    [1] => Array([0] => VP, [1] => Array (
        [0] => Array([0] => VP, [1] => Array ( .... ))
        [1] => Array([0] => PP, [1] => Array ( .... ))
    ))
))
)
php regex recursion xml-parsing preg-match
1 Answer

Use a lexer/parser generator such as bison/flex, or just write your own lexer by hand; this answer has some useful information you will need.
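
To illustrate the hand-written option, a lexer for this bracket notation can be a single pass over the characters. The following is only a minimal sketch and not code from any linked answer: the function name `tokenize` is hypothetical, and it assumes that tokens are separated by whitespace or brackets and that brackets never appear inside a word.

```php
<?php
// Hypothetical helper (not part of the original answer): split a
// bracketed parse tree into tokens without regular expressions.
function tokenize(string $input): array
{
    $tokens = [];
    $buffer = ''; // characters of the word currently being read
    $length = strlen($input);
    for ($i = 0; $i < $length; $i++) {
        $char = $input[$i];
        if ($char === '(' || $char === ')' || ctype_space($char)) {
            // A bracket or whitespace ends the word collected so far.
            if ($buffer !== '') {
                $tokens[] = $buffer;
                $buffer = '';
            }
            if (!ctype_space($char)) {
                $tokens[] = $char; // brackets are tokens themselves
            }
        } else {
            $buffer .= $char;
        }
    }
    if ($buffer !== '') {
        $tokens[] = $buffer; // trailing word with no bracket after it
    }
    return $tokens;
}

print_r(tokenize('(NP (Det an) (N elephant))'));
```

For the sample tree this should yield the same kind of flat token list that a regex-based lexer produces, so either approach can feed the same recursive tree builder.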

Here is a quick-and-dirty proof-of-concept snippet in PHP that outputs the tree as a nested array, as expected.

$data = <<<EOL
(S
  (NP I)
  (VP
    (VP (V shot) (NP (Det an) (N elephant)))
    (PP (P in) (NP (Det my) (N pajamas)))))
EOL;

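$lexer = new Lexer($data);
$array = buildTree($lexer);
print_r($array);

// Recursive descent: each '(' starts a subtree and each ')' ends the current one.
function buildTree($lexer)
{
    $subtrees = []; // child nodes found inside the current brackets
    $markers = [];  // plain tokens (the tag and, for leaves, the word)
    while (($token = $lexer->nextToken()) !== false) {
        if ($token === '(') {
            $subtrees[] = buildTree($lexer);
        } elseif ($token === ')') {
            return buildNode($markers, $subtrees);
        } else {
            $markers[] = $token;
        }
    }

    // End of input: wrap whatever was collected at the top level.
    return buildNode($markers, $subtrees);
}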

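function buildNode($markers, $subtrees)
{
    if (count($markers) && count($subtrees)) {
        // Inner node: [tag, [children]]
        return [$markers[0], $subtrees];
    } elseif (count($subtrees)) {
        // No tag (the top level): just the list of subtrees
        return $subtrees;
    } else {
        // Leaf: [tag, word]
        return $markers;
    }
}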

class Lexer
{
    private $matches;

    private $index = -1;

    public function __construct($data)
    {
        // A token is a bracket or any run of characters that are neither
        // whitespace nor brackets, so tags such as NP-SBJ stay in one piece.
        preg_match_all('/\(|\)|[^\s()]+/', $data, $matches);
        $this->matches = $matches[0];
    }

    // Returns the next token, or false once the input is exhausted.
    public function nextToken()
    {
        $index = ++$this->index;
        if (isset($this->matches[$index]) === false) {
            return false;
        }
        return $this->matches[$index];
    }
}

Output

Array
(
    [0] => Array
        (
            [0] => S
            [1] => Array
                (
                    [0] => Array
                        (
                            [0] => NP
                            [1] => I
                        )

                    [1] => Array
                        (
                            [0] => VP
                            [1] => Array
                                (
                                    [0] => Array
                                        (
                                            [0] => VP
                                            [1] => Array
                                                (
                                                    [0] => Array
                                                        (
                                                            [0] => V
                                                            [1] => shot
                                                        )

                                                    [1] => Array
                                                        (
                                                            [0] => NP
                                                            [1] => Array
                                                                (
                                                                    [0] => Array
                                                                        (
                                                                            [0] => Det
                                                                            [1] => an
                                                                        )

                                                                    [1] => Array
                                                                        (
                                                                            [0] => N
                                                                            [1] => elephant
                                                                        )

                                                                )

                                                        )

                                                )

                                        )

                                    [1] => Array
                                        (
                                            [0] => PP
                                            [1] => Array
                                                (
                                                    [0] => Array
                                                        (
                                                            [0] => P
                                                            [1] => in
                                                        )

                                                    [1] => Array
                                                        (
                                                            [0] => NP
                                                            [1] => Array
                                                                (
                                                                    [0] => Array
                                                                        (
                                                                            [0] => Det
                                                                            [1] => my
                                                                        )

                                                                    [1] => Array
                                                                        (
                                                                            [0] => N
                                                                            [1] => pajamas
                                                                        )

                                                                )

                                                        )

                                                )

                                        )

                                )

                        )

                )

        )

)
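
The output above uses the second, list-style form from the question. If the keyed form with `tag`, `word` and `children` is preferred, the pairs can be converted in a second pass. The sketch below rests on a few assumptions: the helper names `toAssoc` and `leafWords` are hypothetical, every node is a `[tag, payload]` pair as in the output above, and inner nodes get `null` as their word rather than the `ROOT` placeholder from the question.

```php
<?php
// Hypothetical helper: turn one [tag, payload] pair into an
// associative node with 'tag', 'word' and 'children' keys.
function toAssoc(array $node): array
{
    [$tag, $payload] = $node;
    if (is_array($payload)) {
        // Inner node: the payload is a list of child pairs.
        return [
            'tag' => $tag,
            'word' => null,
            'children' => array_map('toAssoc', $payload),
        ];
    }
    // Leaf node: the payload is the word itself.
    return ['tag' => $tag, 'word' => $payload, 'children' => []];
}

// Hypothetical helper: collect the words of the leaves, left to right.
function leafWords(array $node): array
{
    if ($node['children'] === []) {
        return [$node['word']];
    }
    $words = [];
    foreach ($node['children'] as $child) {
        $words = array_merge($words, leafWords($child));
    }
    return $words;
}

// A fragment of the output above, written out by hand.
$pairs = ['NP', [['Det', 'an'], ['N', 'elephant']]];
$tree = toAssoc($pairs);
echo implode(' ', leafWords($tree)), PHP_EOL; // an elephant
```

With the snippet from the answer, the root node is `$array[0]`, so `toAssoc($array[0])` would convert the whole tree, and `leafWords` shows one way to loop over the result.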