Scraping text from a file for display - non-standard format - PHP

Question  Votes: 0  Answers: 2

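OK, I have a text file that changes regularly, and I need to scrape it so I can display it on screen and possibly insert it into a database. The text is formatted like this: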

"Stranglehold"
Written by Ted Nugent
Performed by Ted Nugent
Courtesy of Epic Records
By Arrangement with
Sony Music Licensing
"Chateau Lafltte '59 Boogie"
Written by David Peverett
and Rod Price
Performed by Foghat
Courtesy of Rhino Entertainment
Company and Bearsville Records
By Arrangement with
Warner Special Products

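I only need the song title (the text between the quotes), the writer, and the performer. As you can see, the "Written by" entry can span more than one line.

I've searched the existing questions, and this one is similar: "Scraping a plain text file with no HTML?" I was able to adapt the solution at https://stackoverflow.com/a/8432563/827449, as shown below, so that it at least finds the text between the quotes and puts it into the array. However, I can't work out where and how to place the next preg_match statements for "Written by" and "Performed by" so that they add the right information to the array, assuming of course that I have the right regular expressions. Here is the modified code.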

<?php
$in_name = 'in.txt';
$in = fopen($in_name, 'r') or die("Cannot open $in_name");

function dump_record($r) {
    print_r($r);
}

$current = array();
while (($line = fgets($in)) !== false) {

    /* Skip empty lines (any amount of whitespace counts as 'empty') */
    if (preg_match('/^\s*$/', $line)) continue;

    /* Search for 'things between quotes' stanzas */
    if (preg_match('/(?<=\")(.*?)(?=\")/', $line, $start)) {
        /* If we already parsed a record, this is the time to dump it */
        if (!empty($current)) dump_record($current);

        /* Let's start the new record */
        $current = array( 'id' => $start[1] );
    }
    else if (preg_match('/^(.*):\s+(.*)\s*/', $line, $keyval)) {
        /* Otherwise parse a plain 'key: value' stanza */
        $current[ $keyval[1] ] = $keyval[2];
    }
    else {
        error_log("parsing error: '$line'");
    }
}
/* Don't forget to dump the last parsed record, a situation
 * we only detect at EOF (end of file) */
if (!empty($current)) dump_record($current);

fclose($in);

Any help would be great, since my PHP and regex knowledge is too limited right now to figure this out.

php regex web-scraping
2 Answers
1
vote

How about:

$str =<<<EOD
"Stranglehold"
Written by Ted Nugent
Performed by Ted Nugent
Courtesy of Epic Records
By Arrangement with
Sony Music Licensing
"Chateau Lafltte '59 Boogie"
Written by David Peverett
and Rod Price
Performed by Foghat
Courtesy of Rhino Entertainment
Company and Bearsville Records
By Arrangement with
Warner Special Products

EOD;

preg_match_all('/"([^"]+)".*?Written by (.*?)Performed by (.*?)Courtesy/s', $str, $m, PREG_SET_ORDER);
print_r($m);

Output:

Array
(
    [0] => Array
        (
            [0] => "Stranglehold"
Written by Ted Nugent
Performed by Ted Nugent
Courtesy
            [1] => Stranglehold
            [2] => Ted Nugent

            [3] => Ted Nugent

        )

    [1] => Array
        (
            [0] => "Chateau Lafltte '59 Boogie"
Written by David Peverett
and Rod Price
Performed by Foghat
Courtesy
            [1] => Chateau Lafltte '59 Boogie
            [2] => David Peverett
and Rod Price

            [3] => Foghat

        )

)
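
The captures in this output still carry trailing newlines, and the two-line writer credit keeps its embedded line break. A minimal sketch of a cleanup pass, assuming the same pattern and a shortened copy of the sample text; the helper name `clean_credit` and the output keys are illustrative only and are not part of the answer:

```php
<?php
// Hypothetical helper: collapse any run of whitespace (including newlines)
// into a single space and strip leading/trailing whitespace.
function clean_credit($text) {
    return trim(preg_replace('/\s+/', ' ', $text));
}

// Shortened sample: the "By Arrangement with" lines are left out here.
$str = "\"Stranglehold\"\nWritten by Ted Nugent\nPerformed by Ted Nugent\n"
     . "Courtesy of Epic Records\n\"Chateau Lafltte '59 Boogie\"\n"
     . "Written by David Peverett\nand Rod Price\nPerformed by Foghat\n"
     . "Courtesy of Rhino Entertainment\n";

// Same pattern as in the answer above.
$pattern = '/"([^"]+)".*?Written by (.*?)Performed by (.*?)Courtesy/s';
preg_match_all($pattern, $str, $m, PREG_SET_ORDER);

$songs = array();
foreach ($m as $match) {
    $songs[] = array(
        'title'     => clean_credit($match[1]),
        'written'   => clean_credit($match[2]),
        'performed' => clean_credit($match[3]),
    );
}
print_r($songs);
```

Each record then holds single-line strings, which is usually easier to display or store than the raw captures.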

1
vote

Here is a regex solution to the problem. Keep in mind that regular expressions aren't really necessary here; see the second option below.

<?php

$string = '"Stranglehold"
Written by Ted Nugent
Performed by Ted Nugent
Courtesy of Epic Records
By Arrangement with
Sony Music Licensing
"Chateau Lafltte \'59 Boogie"
Written by David Peverett
and Rod Price
Performed by Foghat
Courtesy of Rhino Entertainment
Company and Bearsville Records
By Arrangement with
Warner Special Products';

// Titles delimit a record
$title_pattern = '#"(?<title>[^\n]+)"\n(?<meta>.*?)(?=\n"|$)#s';
// From the meta section we need these tokens
$meta_keys = array(
    'Written by ' => 'written',
    'Performed by ' => 'performed',
    'Courtesy of ' => 'courtesy',
    "By Arrangement with\n" => 'arranged',
);
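// implode() with the separator first; the reversed argument order is no longer accepted in PHP 8
$meta_pattern = '#(?<key>' . implode('|', array_keys($meta_keys)) . ')(?<value>[^\n$]+)(?:\n|$)#ims';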


$songs = array();
if (preg_match_all($title_pattern, $string, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $match) {
        $t = array(
            'title' => $match['title'],
        );

        if (preg_match_all($meta_pattern, $match['meta'], $_matches, PREG_SET_ORDER)) {
            foreach ($_matches as $_match) {
                $k = $meta_keys[$_match['key']];
                $t[$k] = $_match['value'];
            }
        }

        $songs[] = $t;
    }
}

This results in:

$songs = array (
  array (
    'title'     => 'Stranglehold',
    'written'   => 'Ted Nugent',
    'performed' => 'Ted Nugent',
    'courtesy'  => 'Epic Records',
    'arranged'  => 'Sony Music Licensing',
  ),
  array (
    'title'     => 'Chateau Lafltte \'59 Boogie',
    'written'   => 'David Peverett',
    'performed' => 'Foghat',
    'courtesy'  => 'Rhino Entertainment',
    'arranged'  => 'Warner Special Products',
  ),
);
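
Note that in this result the second song loses its continuation lines: `written` holds only `David Peverett` and `courtesy` only `Rhino Entertainment`, because the value group stops at the first newline. One possible variation, offered as a sketch and not as the answer's own method, lets a value run until the next line that starts with a known key (or the end of the section) and then folds line breaks into spaces. The variable names and the single-song sample section are assumptions for illustration:

```php
<?php
// Sample meta section for one song (the text that follows the title line).
$meta = "Written by David Peverett\nand Rod Price\nPerformed by Foghat\n"
      . "Courtesy of Rhino Entertainment\nCompany and Bearsville Records\n"
      . "By Arrangement with\nWarner Special Products";

// Keys are listed without trailing whitespace; the \s+ after the key then
// covers both "Written by Ted" and "By Arrangement with\nSony".
$meta_keys = array(
    'Written by'          => 'written',
    'Performed by'        => 'performed',
    'Courtesy of'         => 'courtesy',
    'By Arrangement with' => 'arranged',
);
$quoted = array();
foreach (array_keys($meta_keys) as $key) {
    $quoted[] = preg_quote($key, '#');
}
$keys = implode('|', $quoted);

// A key must start a line (m flag); its value runs lazily up to the next
// line that starts with a key, or to the end of the section (\z).
$pattern = '#^(?<key>' . $keys . ')\s+(?<value>.*?)(?=\n(?:' . $keys . ')|\z)#ms';

$song = array();
if (preg_match_all($pattern, $meta, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $match) {
        // Fold line breaks inside a value into single spaces.
        $value = trim(preg_replace('/\s+/', ' ', $match['value']));
        $song[$meta_keys[$match['key']]] = $value;
    }
}
print_r($song);
```

The trade-off is that any line not starting with a known key is treated as part of the previous value, so a new kind of credit line would be absorbed until its key is added to the list.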

A solution without regular expressions is also possible, though slightly more verbose:

<?php

$string = '"Stranglehold"
Written by Ted Nugent
Performed by Ted Nugent
Courtesy of Epic Records
By Arrangement with
Sony Music Licensing
"Chateau Lafltte \'59 Boogie"
Written by David Peverett
and Rod Price
Performed by Foghat
Courtesy of Rhino Entertainment
Company and Bearsville Records
By Arrangement with
Warner Special Products';

$songs = array();
$current = array();
$lines = explode("\n", $string);
// can't use foreach if we want to extract "By Arrangement"
// cause it spans two lines
for ($i = 0, $_length = count($lines); $i < $_length; $i++) {
    $line = $lines[$i];
    $length = strlen($line); // might want to use mb_strlen()

    // if line is enclosed in " it's a title
    // (the length check avoids an offset error on empty lines)
    if ($length >= 2 && $line[0] == '"' && $line[$length - 1] == '"') {
        if ($current) {
            $songs[] = $current;
        }

        $current = array(
            'title' => substr($line, 1, $length - 2),
        );

        continue;
    }

    $meta_keys = array(
        'By Arrangement with' => 'arranged', 
    );

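    foreach ($meta_keys as $key => $k) {
        if ($key == $line) {
            // the value is on the following line, so consume it as well
            $i++;
            $current[$k] = isset($lines[$i]) ? $lines[$i] : '';
            // leave both loops and move on to the next line
            continue 2;
        }
    }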

    $meta_keys = array(
        'Written by ' => 'written', 
        'Performed by ' => 'performed', 
        'Courtesy of ' => 'courtesy',
    );

    foreach ($meta_keys as $key => $k) {
        if (strpos($line, $key) === 0) {
            $current[$k] = substr($line, strlen($key));
            continue 2;
        }
    }    
}

if ($current) {
    $songs[] = $current;
}
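
The loop above stores only the first line of each credit, so continuation lines such as `and Rod Price` are dropped. A minimal sketch of a line-by-line variant that appends any unrecognised line to the field opened most recently; the function name `parse_credits` is hypothetical, and the prefix list would need adjusting for other kinds of credit lines:

```php
<?php
// Hypothetical helper: turns the lines of a credits file into song records.
function parse_credits(array $lines) {
    $prefixes = array(
        'Written by'          => 'written',
        'Performed by'        => 'performed',
        'Courtesy of'         => 'courtesy',
        'By Arrangement with' => 'arranged',
    );
    $songs = array();
    $current = null;   // record being built
    $field = null;     // field that continuation lines belong to

    foreach ($lines as $line) {
        $line = trim($line);
        if ($line === '') {
            continue;
        }
        // A line wrapped in double quotes starts a new record.
        if (strlen($line) >= 2 && $line[0] === '"' && substr($line, -1) === '"') {
            if ($current !== null) {
                $songs[] = $current;
            }
            $current = array('title' => substr($line, 1, -1));
            $field = null;
            continue;
        }
        if ($current === null) {
            continue; // ignore anything before the first title
        }
        // A known prefix opens a field; the rest of the line is its value.
        foreach ($prefixes as $prefix => $name) {
            if (strpos($line, $prefix) === 0) {
                $field = $name;
                $current[$field] = trim(substr($line, strlen($prefix)));
                continue 2;
            }
        }
        // Otherwise the line continues the field opened last.
        if ($field !== null) {
            $current[$field] = trim($current[$field] . ' ' . $line);
        }
    }
    if ($current !== null) {
        $songs[] = $current;
    }
    return $songs;
}

$sample = "\"Chateau Lafltte '59 Boogie\"\nWritten by David Peverett\n"
        . "and Rod Price\nPerformed by Foghat\nCourtesy of Rhino Entertainment\n"
        . "Company and Bearsville Records\nBy Arrangement with\nWarner Special Products";
print_r(parse_credits(explode("\n", $sample)));
```

To read from the file in the question instead of a string, the lines could come from `file('in.txt', FILE_IGNORE_NEW_LINES)` and be passed to the same function.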