Concurrent output problem in GNU Parallel: lines are still merged or truncated despite --line-buffer


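I have a function that uses awk to process secondary structure data and appends the result to a final output file. Even though I use --line-buffer in GNU Parallel, I still occasionally get lines like this in the output file: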

4          GLN        A           447       C 1          GLN        A             1       T

Or sometimes:

4          GLN        A           447

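Multiple processes seem to be writing to the file at the same time, which causes lines to be merged or cut off. Here is the relevant part of my code: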

calculate_secondary_structure() {
    frame_counter=$1
    # process only chain A, as it is polyQ
    ${stride_path}/stride ${pdbs_dir}/pdb_for_cluster${frame_counter}.pdb -ca -fsecondary_structure${frame_counter}.txt
    rm ${pdbs_dir}/pdb_for_cluster${frame_counter}.pdb

    awk -v frame_counter="$frame_counter" '
    BEGIN { OFS="          " } # 10 spaces as a separator
    /^ASG/ {
        residue_name = substr($0, 6, 3)
        chain_name = substr($0, 10, 1)
        residue_number = substr($0, 17, 4)
        ss_code = substr($0, 25, 1)

        # Print the frame number followed by the extracted fields with 10 spaces between them
        printf "%-10s %-10s %-10s %-10s %-10s\n", frame_counter, residue_name, chain_name, residue_number, ss_code
    }' "secondary_structure${frame_counter}.txt" >> ${final_output_file} # Append directly to the final file

    rm secondary_structure${frame_counter}.txt
}

export -f calculate_secondary_structure
seq 1 ${number_of_frames} | parallel --bar --line-buffer --block 1k --round-robin -j192 calculate_secondary_structure {}

GNU Parallel version:

GNU parallel 20240822
1 Answer

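The problem is that you are not letting GNU Parallel serialize the output. Instead, every job appends directly to $final_output_file behind GNU Parallel's back, so --line-buffer never sees those writes. Let the function print to stdout and redirect the output of GNU Parallel itself. So you probably want: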

calculate_secondary_structure() {
    frame_counter=$1
    # process only chain A, as it is polyQ
    ${stride_path}/stride ${pdbs_dir}/pdb_for_cluster${frame_counter}.pdb -ca -fsecondary_structure${frame_counter}.txt
    rm ${pdbs_dir}/pdb_for_cluster${frame_counter}.pdb

    awk -v frame_counter="$frame_counter" '
    BEGIN { OFS="          " } # 10 spaces as a separator
    /^ASG/ {
        residue_name = substr($0, 6, 3)
        chain_name = substr($0, 10, 1)
        residue_number = substr($0, 17, 4)
        ss_code = substr($0, 25, 1)

        # Print the frame number followed by the extracted fields with 10 spaces between them
        printf "%-10s %-10s %-10s %-10s %-10s\n", frame_counter, residue_name, chain_name, residue_number, ss_code
    }' "secondary_structure${frame_counter}.txt"

    rm secondary_structure${frame_counter}.txt
}

export -f calculate_secondary_structure
seq 1 ${number_of_frames} |
  parallel --bar --line-buffer -j192 calculate_secondary_structure {} >> ${final_output_file}
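
One way to confirm that the merged and truncated lines are gone is to count the fields on each line. The sketch below assumes that every record should contain exactly five whitespace-separated fields, as the printf format above produces, and that no field is ever blank. count_bad_lines is a hypothetical helper name used only for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print how many lines do NOT have exactly 5 fields.
count_bad_lines() {
    awk 'NF != 5 { bad++ } END { print bad + 0 }' "$1"
}

# Self-contained demo using the three line shapes shown in the question:
# one intact record, one merged record, one truncated record.
demo_file=$(mktemp)
cat > "$demo_file" <<'EOF'
4          GLN        A           447       C
4          GLN        A           447       C 1          GLN        A             1       T
4          GLN        A           447
EOF

count_bad_lines "$demo_file"   # prints 2 (the merged and the truncated line)
rm -f "$demo_file"
```

After a real run, count_bad_lines "$final_output_file" should print 0 if the assumption of five non-blank fields holds for your data.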

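--block 1k --round-robin only make sense when you use --pipe / --pipepart. Otherwise they do nothing. If you are running on 192 cores, also consider using -j100% (or simply leaving it out, since that is the default). That way you will not need to change the 192 when you get a new 512-core server.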
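
To illustrate the difference, here is a minimal sketch of the two input modes. It assumes GNU Parallel is on PATH (it does nothing otherwise); the function names args_mode and pipe_mode and the toy inputs are made up for the example and are not taken from the question.

```shell
#!/usr/bin/env bash
# Sketch only: requires GNU Parallel on PATH.

# Without --pipe, each input line becomes the argument {} of one job,
# so --block and --round-robin have nothing to act on.
args_mode() {
    seq 1 4 | parallel -j2 echo frame {}
}

# With --pipe, stdin is cut into blocks of roughly 1k that are handed
# round-robin to the stdin of up to 4 jobs.
pipe_mode() {
    seq 1 1000 | parallel --pipe --block 1k --round-robin -j4 wc -l
}

# Run the demo only if the installed "parallel" really is GNU Parallel.
if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
    args_mode   # four lines "frame N", in whatever order the jobs finish
    pipe_mode   # one line count per job; the counts add up to 1000
fi
```

The function in the question takes its frame number as an argument, which corresponds to the first mode, so dropping --block and --round-robin changes nothing about how the jobs run.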
