Finding keywords in a text file with PHP: arsort()

Question · Votes: 0 · Answers: 2

I am trying to use a script that searches a text file and returns the words that meet certain criteria:

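* The word appears only once
* It is not one of the words in the ignore list
* It is among the longest 10% of words
* It does not consist of repeated letters
* The final list is a random ten of the words that meet the criteria above
* If any of the above is false, the reported text is empty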

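I have put the following together, but the script dies at arsort(), saying it expects an array. Can anyone suggest a change that would make arsort() work? Or suggest an alternative (simpler) script for finding the metadata? (I realize the second question may be better suited to another StackExchange site.)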

<?php
  $fn="../story_link";
  $str=readfile($fn);
    function top_words($str, $limit=10, $ignore=""){
        if(!$ignore) $ignore = "the of to and a in for is The that on said with be was by"; 
        $ignore_arr = explode(" ", $ignore);
        $str = trim($str);
        $str = preg_replace("#[&].{2,7}[;]#sim", " ", $str);
        $str = preg_replace("#[()°^!\"§\$%&/{(\[)\]=}?´`,;.:\-_\#'~+*]#", " ", $str);
        $str = preg_replace("#\s+#sim", " ", $str);
        $arraw = explode(" ", $str);
        foreach($arraw as $v){
            $v = trim($v);
            if(strlen($v)<3 || in_array($v, $ignore_arr)) continue;
            $arr[$v]++;
        }
        arsort($arr);   
        return array_keys( array_slice($arr, 0, $limit) );
    }
    $meta_keywords = implode(", ", top_words( strip_tags( $html_content ) ) );
?>
php sorting search keyword
2 Answers

2 votes

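The problem is that when your loop never executes $arr[$v]++, $arr is never defined. That is what causes the error: at that point arsort() is given null as its argument rather than an array.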

The solution is to define $arr as an array before the loop, so that it exists even when $arr[$v]++; is never executed.

function top_words($str, $limit=10, $ignore=""){
    if(!$ignore) $ignore = "the of to and a in for is The that on said with be was by"; 
    $ignore_arr = explode(" ", $ignore);
    $str = trim($str);
    $str = preg_replace("#[&].{2,7}[;]#sim", " ", $str);
    $str = preg_replace("#[()°^!\"§\$%&/{(\[)\]=}?´`,;.:\-_\#'~+*]#", " ", $str);
    $str = preg_replace("#\s+#sim", " ", $str);
    $arraw = explode(" ", $str);
    $arr = array(); // Defined $arr here.
    foreach($arraw as $v){
        $v = trim($v);
        if(strlen($v)<3 || in_array($v, $ignore_arr)) continue;
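        $arr[$v] = ($arr[$v] ?? 0) + 1; // count the word; ?? avoids an undefined-key warning the first time it is seen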
    }
    arsort($arr);   
    return array_keys( array_slice($arr, 0, $limit) );
}
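
Defining $arr removes the error, but in the script as posted the loop still has nothing to count: readfile() prints the file and returns only the number of bytes read, and top_words() is called on $html_content, which is never assigned. Below is a minimal sketch of loading the text as a string instead. It assumes the file holds HTML or plain text, and load_text() is a hypothetical helper name, not part of the original code.

```php
<?php
// Hypothetical helper (not from the original thread): return a file's text
// with HTML tags removed, or an empty string if the file cannot be read.
function load_text(string $path): string
{
    // file_get_contents() returns the contents as a string, or false on failure.
    // readfile() would instead echo the file and return the byte count.
    $raw = @file_get_contents($path);
    if ($raw === false) {
        return '';
    }
    return strip_tags($raw);
}
```

The returned string can then be passed to top_words() in place of strip_tags($html_content).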

0 votes

I came across a nice piece of code that does this well:

    <?php
    function extract_keywords($str, $minWordLen = 3, $minWordOccurrences = 2, $asArray = false, $maxWords = 5, $restrict = true)
    {
        $str = str_replace(array("?","!",";","(",")",":","[","]"), " ", $str);
        $str = str_replace(array("\n","\r","  "), " ", $str);
        $str = strtolower($str); // strtolower() returns the result; it does not change $str in place

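        // Declare the comparator only once: an unguarded nested function would be
        // redeclared (a fatal error) on a second call to extract_keywords().
        if (!function_exists('keyword_count_sort')) {
            function keyword_count_sort($first, $sec)
            {
                return $sec[1] - $first[1];
            }
        }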
        $str = preg_replace('/[^\p{L}0-9 ]/', ' ', $str);
        $str = trim(preg_replace('/\s+/', ' ', $str));

        $words = explode(' ', $str);

        // If we don't restrict tag usage, we'll remove common words from array
        if ($restrict == false) {
            $commonWords = array('a','able','about','above', 'get a list here http://www.wordfrequency.info','you\'ve','z','zero');
            $words = array_udiff($words, $commonWords,'strcasecmp');
        }

        // Restrict Keywords based on values in the $allowedWords array
        // Use if you want to limit available tags
        if ($restrict == true) {
            $allowedWords =  array('engine','boeing','electrical','pneumatic','ice','pressurisation');
            $words = array_uintersect($words, $allowedWords,'strcasecmp');
        }

        $keywords = array();

        while(($c_word = array_shift($words)) !== null)
        {
            if(strlen($c_word) < $minWordLen) continue;

            $c_word = strtolower($c_word);
            if(array_key_exists($c_word, $keywords)) $keywords[$c_word][1]++;
            else $keywords[$c_word] = array($c_word, 1);
        }
        usort($keywords, 'keyword_count_sort');

        $final_keywords = array();
        foreach($keywords as $keyword_det)
        {
            if($keyword_det[1] < $minWordOccurrences) break;
            array_push($final_keywords, $keyword_det[0]);
        }
        $final_keywords = array_slice($final_keywords, 0, $maxWords);
        return $asArray ? $final_keywords : implode(', ', $final_keywords);
    }


    $text = "Many systems that traditionally had a reliance on the pneumatic system have been transitioned to the electrical architecture. They include engine start, API start, wing ice protection, hydraulic pumps and cabin pressurisation. The only remaining bleed system on the 787 is the anti-ice system for the engine inlets. In fact, Boeing claims that the move to electrical systems has reduced the load on engines (from pneumatic hungry systems) by up to 35 percent (not unlike today’s electrically power flight simulators that use 20% of the electricity consumed by the older hydraulically actuated flight sims).";

    echo extract_keywords($text);

    // Advanced Usage
    // $exampletext = "The quick brown fox jumped over the lazy dog. The quick brown fox jumped over the lazy dog. The quick brown fox jumped over the lazy dog.";
    // echo extract_keywords($exampletext, 3, 1, false, 5, false);
    ?>
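
Neither snippet in this thread applies the selection rules listed in the question (occurs once, not ignored, among the longest 10%, not a repeated letter, random ten). Here is a rough sketch of those rules under some assumptions: "longest 10%" is measured over distinct words with ties at the cutoff kept, "repeated letters" means a word made of one letter repeated, and length is counted in bytes. The name pick_keywords() and its parameters are hypothetical.

```php
<?php
// Hypothetical function (not from the original thread): pick up to $howMany
// random words that satisfy the question's criteria; returns [] if none do.
function pick_keywords(string $text, array $ignore = [], int $howMany = 10): array
{
    // Split on anything that is not a letter; drop empty pieces.
    $words = preg_split('/[^\p{L}]+/u', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    if (!$words) {
        return []; // empty input, or preg_split() failed on invalid UTF-8
    }

    $ignore = array_map('strtolower', $ignore);
    $counts = array_count_values($words); // word => number of occurrences

    // Length cutoff: the shortest length still inside the longest 10% of
    // distinct words (byte length; mb_strlen() would suit multibyte text).
    $lengths = array_map('strlen', array_keys($counts));
    rsort($lengths);
    $topN   = intdiv(count($lengths) + 9, 10); // ceil(n / 10), at least 1
    $cutoff = $lengths[$topN - 1];

    $candidates = [];
    foreach ($counts as $word => $n) {
        $word = (string) $word;
        if ($n !== 1) continue;                         // must occur exactly once
        if (in_array($word, $ignore, true)) continue;   // must not be on the ignore list
        if (strlen($word) < $cutoff) continue;          // must be among the longest 10%
        if (preg_match('/^(.)\1+$/u', $word)) continue; // must not be one letter repeated
        $candidates[] = $word;
    }

    shuffle($candidates); // random order, then keep the first $howMany
    return array_slice($candidates, 0, $howMany);
}
```

An empty array here corresponds to the "reported text is empty" case, so implode(', ', pick_keywords($text)) yields an empty string when nothing qualifies.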