Python subprocess - saving output in a new file

Question

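I use the following command to reformat a file, writing the result to a new file:

sed -e '1s/^/[/' -e 's/$/,/' -e '$s/,$/]/' toto > toto.json

It works fine on the command line.

I tried to use it from a Python script, but it does not create the new file.

I tried: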

subprocess.call(["sed", "-e","1s/^/[/","-e", "s/$/,/","-e","$s/,$/]/ ",sys.argv[1], " > ",sys.argv[2]])

The problem is that it prints the output to stdout and raises an error:

sed: can't read >: No such file or directory
Traceback (most recent call last):
  File "test.py", line 14, in <module>
    subprocess.call(["sed", "-e","1s/^/[/","-e", "s/$/,/","-e","$s/,$/]/",
    sys.argv[1], ">",sys.argv[2])
  File "C:\Users\Anaconda3\lib\subprocess.py", line 291, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['sed', '-e', '1s/^/[/', '-e', 's/$/,/', '-e', '$s/,$/]/', 'toto.txt', '>', 'toto.json']' returned non-zero exit status 2.

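I read other questions about subprocess and tried other commands with the shell=True option, but that did not work either. I am using Python 3.6.

For information, the command adds an opening bracket to the first line, a closing bracket to the last line, and a comma at the end of every line except the last. So it goes from:

a
b
c

to: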

[a,
b,
c]

Answer 1

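On Linux and other Unix systems, the redirection characters are not part of the command; they are interpreted by the shell. Passing them as arguments to a subprocess therefore makes no sense: sed receives ">" as just another file name to read.

Fortunately, subprocess.call allows the stdout argument to be a file object. So you should do: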

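import subprocess
import sys
with open(sys.argv[2], "w") as out:  # the file object takes the place of the shell's "> file"
    subprocess.call(["sed", "-e", "1s/^/[/", "-e", "s/$/,/", "-e", "$s/,$/]/", sys.argv[1]], stdout=out)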
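
To see the difference without needing sed installed, here is a minimal sketch that uses the Python interpreter itself as a stand-in child process. The helper name run_to_file and the file name out.txt are illustrative assumptions, not part of the question. The child simply prints the arguments it received, which shows that ">" arrives as an ordinary argument while the real redirection is done by the stdout file object.

```python
import os
import subprocess
import sys
import tempfile


def run_to_file(cmd, out_path):
    """Run cmd (a list of arguments) and write its stdout to out_path.

    Hypothetical helper for illustration: the file object passed as
    stdout plays the role of the shell's ">" redirection.
    """
    with open(out_path, "w") as out:
        return subprocess.call(cmd, stdout=out)


# Stand-in child process: prints the arguments it was given.
child = [sys.executable, "-c", "import sys; print(sys.argv[1:])"]

out_path = os.path.join(tempfile.mkdtemp(), "out.txt")

# ">" is passed as a plain argument here; no shell is involved to interpret it.
status = run_to_file(child + ["a", ">", "b"], out_path)

with open(out_path) as f:
    captured = f.read().strip()

print(status, captured)
```

The captured file contains the child's view of its arguments, including the literal ">", which is the same thing sed saw when it complained that it could not read a file named ">".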

Answer 2

Don't do that. If you can avoid it, don't make any OS calls.

If you are using Python, write a Pythonic Python script.

Something like:

input_filename = 'toto'
output_filename = 'toto.json'

with open(input_filename, 'r') as inputf:
    # Drop the newline from each line; separators are added when joining
    lines = [line.rstrip('\n') for line in inputf]

with open(output_filename, 'w') as outputf:
    # Joining with ',\n' leaves no trailing comma after the last line
    outputf.write('[' + ',\n'.join(lines) + ']\n')

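It does basically the same thing as the command line.

Admittedly this code is a bit dirty and is meant as an example only. I advise you to write your own version and to avoid one-liners like the ones I used.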
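
For a large input, the same transformation can be done line by line without keeping every line in memory. The following is only a sketch under that assumption; the function name lines_to_array is made up for illustration, and io.StringIO stands in for real files so the example is self-contained.

```python
import io


def lines_to_array(src, dst):
    """Copy lines from src to dst, wrapped as "[a,\\nb,\\nc]".

    Hypothetical helper: src and dst are any text file objects.
    """
    dst.write("[")
    first = True
    for line in src:
        if not first:
            # Write the separator before every line except the first,
            # so the last line never gets a trailing comma.
            dst.write(",\n")
        dst.write(line.rstrip("\n"))
        first = False
    dst.write("]\n")


src = io.StringIO("a\nb\nc\n")
dst = io.StringIO()
lines_to_array(src, dst)
result = dst.getvalue()
print(result)
```

With real files, src and dst would be the objects returned by open(input_filename) and open(output_filename, 'w').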

Answer 3

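I had a hunch that Python could do this faster than sed, but I didn't have time to check until now, so... based on your comment on Arount's answer:

My real file is actually very large, and the command line is faster than a Python script

That is not necessarily true. In fact, in your case I suspect Python can do it many, many times faster than sed, because with Python you are not limited to iterating over your file through a line buffer, nor do you need a full-blown regex engine just to find the line separators.

I'm not sure how big your file is, but I generated my test example with: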

with open("example.txt", "w") as f:
    for i in range(10**8):  # I would consider 100M lines as "big" enough for testing
        print(i, file=f)

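This creates a file 100M lines long (888.9MB), with a different number on each line.

Now, timing just your sed command on its own, run at the highest priority (chrt -f 99), results in: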

[zwer@testbed ~]$ sudo chrt -f 99 /usr/bin/time --verbose \
> sed -e '1s/^/[/' -e 's/$/,/' -e '$s/,$/]/' example.txt > output.txt
    Command being timed: "sed -e 1s/^/[/ -e s/$/,/ -e $s/,$/]/ example.txt"
    User time (seconds): 56.89
    System time (seconds): 1.74
    Percent of CPU this job got: 98%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:59.28
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 1044
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 1
    Minor (reclaiming a frame) page faults: 313
    Voluntary context switches: 7
    Involuntary context switches: 29
    Swaps: 0
    File system inputs: 1140560
    File system outputs: 1931424
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0

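If you were actually calling it from Python, the result would be even worse, since it would also carry the subprocess and STDOUT redirection overhead.

However, if we leave it to Python to do all the work instead of sed: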

import sys

CHUNK_SIZE = 1024 * 64  # 64k, tune this to the FS block size / platform for best performance

with open(sys.argv[2], "w") as f_out:  # open the file from second argument for writing
    f_out.write("[")  # start the JSON array
    with open(sys.argv[1], "r") as f_in:  # open the file from the first argument for reading
        chunk = None
        last_chunk = ''  # keep a track of the last chunk so we can remove the trailing comma
        while True:
            chunk = f_in.read(CHUNK_SIZE)  # read the next chunk
            if chunk:
                f_out.write(last_chunk)  # write out the last chunk
                last_chunk = chunk.replace("\n", ",\n")  # process the new chunk
            else:  # EOF
                break
    last_chunk = last_chunk.rstrip()  # clear out the trailing whitespace
    if last_chunk.endswith(","):  # clear out the trailing comma (also safe on empty input)
        last_chunk = last_chunk[:-1]
    f_out.write(last_chunk)  # write the last chunk
    f_out.write("]")  # end the JSON array

Which, without ever touching the shell, results in:

[zwer@testbed ~]$ sudo chrt -f 99 /usr/bin/time --verbose \
> python process_file.py example.txt output.txt
    Command being timed: "python process_file.py example.txt output.txt"
    User time (seconds): 1.75
    System time (seconds): 0.72
    Percent of CPU this job got: 93%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:02.65
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 4716
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 3
    Minor (reclaiming a frame) page faults: 14835
    Voluntary context switches: 16
    Involuntary context switches: 0
    Swaps: 0
    File system inputs: 3120
    File system outputs: 1931424
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0

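And given the utilization, the bottleneck is actually I/O. Left to its own devices (or working from very fast storage instead of a virtualized HDD like the one on my testbed), Python could do it even faster.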

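So sed used 32.5 times as much user CPU time as Python to accomplish the same task (56.89 s versus 1.75 s). Even if you optimized the sed command a bit, Python would still be faster, because sed is limited to a line buffer, so a lot of time is wasted on input I/O (compare the numbers in the benchmarks above), and there is no (easy) way around that.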
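
As a quick check of those ratios, here is the arithmetic using only the figures reported by /usr/bin/time in the two runs above. The variable names are illustrative.

```python
# Figures copied from the two /usr/bin/time reports above.
sed_user, sed_wall, sed_fs_inputs = 56.89, 59.28, 1140560
py_user, py_wall, py_fs_inputs = 1.75, 2.65, 3120

user_ratio = sed_user / py_user            # user CPU time
wall_ratio = sed_wall / py_wall            # elapsed wall-clock time
input_ratio = sed_fs_inputs / py_fs_inputs  # "File system inputs" counter

print(round(user_ratio, 1))   # 32.5
print(round(wall_ratio, 1))   # 22.4
print(round(input_ratio, 1))  # 365.6
```

The 32.5 figure is the user CPU time ratio; measured by wall clock the gap is about 22 times, and the file system input counters differ by a much larger factor.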

Conclusion: for this particular task, Python is faster than sed.
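
If you want to sanity-check the chunked approach before pointing it at a large file, one option is to wrap it in a function and run it on a tiny input with several chunk sizes, so that chunk boundaries fall in different places. This is only a sketch: the names chunked_to_array and convert are made up for illustration, io.StringIO stands in for real files, and the input is assumed to end with at most one newline.

```python
import io


def chunked_to_array(f_in, f_out, chunk_size=64 * 1024):
    """Chunked variant of the line-wrapping transformation.

    Reads fixed-size chunks instead of lines. chunk_size is a tunable
    assumption, not a measured optimum.
    """
    f_out.write("[")  # start the array
    last_chunk = ""   # held back so the trailing comma can be removed
    while True:
        chunk = f_in.read(chunk_size)
        if not chunk:  # EOF
            break
        f_out.write(last_chunk)
        last_chunk = chunk.replace("\n", ",\n")
    last_chunk = last_chunk.rstrip()  # drop trailing whitespace
    if last_chunk.endswith(","):      # drop the trailing comma, if any
        last_chunk = last_chunk[:-1]
    f_out.write(last_chunk)
    f_out.write("]")  # end the array


def convert(text, chunk_size):
    """Run chunked_to_array on a string and return the output string."""
    out = io.StringIO()
    chunked_to_array(io.StringIO(text), out, chunk_size)
    return out.getvalue()


# The same input with chunk boundaries in different places.
results = {size: convert("a\nb\nc\n", size) for size in (1, 2, 3, 64)}
print(results)
```

Every chunk size should give the same output for this input. Inputs that end in several blank lines are outside the stated assumption, because the trailing whitespace is only stripped from the final chunk.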
