I have a very large text file in which some entries are missing. The pattern is consistent: the first line of each "section" carries the correct entries, and every line after that initial line is missing them. I'm trying to fill in those missing entries on each line using the information from the initial line, until a new "initial information line" appears; from there on I continue with the newly found data.
I built a solution in bash with the help of sed, but the process is very, very slow and takes hours to complete. I suspect the delay comes from reading line by line, processing each line in bash, and writing it out to a new file. My guess is that a sed script with variables, run from a file (-f), could speed this up significantly, but I'm no expert in such advanced uses of sed. I'm also open to other suggestions or tools, as long as they can be called from a bash script, since this is part of an automation.
Example input file:
{"Initial line with more information like headers, unimportant, really only one line"
"Alpha","OldTheme","Some more text"
"","","Another rest text"
"","","Yet another text"
"Yadda","NewTheme","Crazy Text"
"","","More crazy text"
Expected result:
"Alpha","OldTheme","Some more text"
"Alpha","OldTheme","Another rest text"
"Alpha","OldTheme","Yet another text"
"Yadda","NewTheme","Crazy Text"
"Yadda","NewTheme","More crazy text"
Here is my working (but very slow) bash script:
#!/bin/bash
first=0
cat inputfile | \
while read line; do
    if [ ${first} -eq 0 ]; then
        first=1; continue
    fi
    partline=$(echo "${line}" | grep -o '","\(.*\)')
    newinitial=$(echo "${line}" | sed 's/",".*//; s/^"//')
    if [ ! -z "${newinitial}" ]; then
        initial=${newinitial}
    fi
    newtheme=$(echo "${partline}" | sed 's/^","//; s/",".*//')
    if [ ! -z "${newtheme}" ]; then
        theme=${newtheme}
    fi
    restline=$(echo ${partline} | sed 's/^","//' | grep -o '","\(.*\)')
    echo "\"${initial}\",\"${theme}${restline}"
done >outputfile
I would use Perl with the Text::CSV_XS module, which you can install through your operating system's package manager (perl-Text-CSV_XS on OpenSUSE and RedHat, libtext-csv-xs-perl on the Debian family, etc.) or through your favorite CPAN client:
% perl -MText::CSV_XS -e '
    print scalar <>;   # Print the header line unchanged
    my @saved;
    my $csv = Text::CSV_XS->new({binary => 1, always_quote => 1, empty_is_undef => 1});
    while (my $r = $csv->getline(STDIN)) {
        for my $i (0 .. $#$r) {
            if ($r->[$i]) {
                $saved[$i] = $r->[$i];
            } else {
                $r->[$i] = $saved[$i];
            }
        }
        $csv->say(STDOUT, $r);
    }' < input.csv
This produces:
"Initial line with more information like headers, unimportant, really only one line"
"Alpha","OldTheme","Some more text"
"Alpha","OldTheme","Another rest text"
"Alpha","OldTheme","Yet another text"
"Yadda","NewTheme","Crazy Text"
"Yadda","NewTheme","More crazy text"
It works by saving every non-empty field in an array and, whenever it sees an empty field, substituting the saved value. Note that, unlike your expected output, this keeps the header line; if you don't want it, read the header without printing it (change `print scalar <>;` to just `scalar <>;`).
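If you'd rather avoid the CPAN dependency, the same fill-down technique can be sketched in plain awk. This is only a sketch under the assumption that no field ever contains an embedded `","` sequence (awk does no real CSV parsing), and the `input.csv`/`output.csv` names are placeholders; it also skips the header line, as your bash script does:

```shell
#!/bin/sh
# Sample input taken from the question, written to a scratch file:
cat > input.csv <<'EOF'
{"Initial line with more information like headers, unimportant, really only one line"
"Alpha","OldTheme","Some more text"
"","","Another rest text"
"","","Yet another text"
"Yadda","NewTheme","Crazy Text"
"","","More crazy text"
EOF

# Fill-down sketch: split on the "," separator, remember the last
# non-empty value per column, and reuse it for empty fields.
awk -F'","' '
NR == 1 { next }                            # skip the header line
{
    sub(/^"/, "", $1); sub(/"$/, "", $NF)   # strip the outer quotes
    for (i = 1; i <= NF; i++) {
        if ($i != "")
            saved[i] = $i                   # remember the last non-empty value
        else
            $i = saved[i]                   # fill an empty field from it
    }
    out = "\"" $1 "\""                      # re-quote and re-join the fields
    for (i = 2; i <= NF; i++)
        out = out ",\"" $i "\""
    print out
}' input.csv > output.csv

cat output.csv
```

Since awk reads the whole file in one process, this avoids the per-line subshells (`grep`, `sed`, command substitution) that make the bash loop so slow.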