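I have a function that uses awk to process secondary-structure data and appends the results to a final output file. Even though I use --line-buffer with GNU Parallel, I still occasionally get lines like this in the output file: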
4 GLN A 447 C 1 GLN A 1 T
or sometimes:
4 GLN A 447
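It seems that multiple processes write to the file at the same time, so lines get merged or cut off. Here is the relevant part of my code: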
calculate_secondary_structure() {
    frame_counter=$1
    # process only chain A, as it is polyQ
    ${stride_path}/stride ${pdbs_dir}/pdb_for_cluster${frame_counter}.pdb -ca -fsecondary_structure${frame_counter}.txt
    rm ${pdbs_dir}/pdb_for_cluster${frame_counter}.pdb
    awk -v frame_counter="$frame_counter" '
    BEGIN { OFS=" " } # 10 spaces as a separator
    /^ASG/ {
        residue_name = substr($0, 6, 3)
        chain_name = substr($0, 10, 1)
        residue_number = substr($0, 17, 4)
        ss_code = substr($0, 25, 1)
        # Print the frame number followed by the extracted fields with 10 spaces between them
        printf "%-10s %-10s %-10s %-10s %-10s\n", frame_counter, residue_name, chain_name, residue_number, ss_code
    }' "secondary_structure${frame_counter}.txt" >> ${final_output_file} # Append directly to the final file
    rm secondary_structure${frame_counter}.txt
}
export -f calculate_secondary_structure
seq 1 ${number_of_frames} | parallel --bar --line-buffer --block 1k --round-robin -j192 calculate_secondary_structure {}
Parallel version:
GNU parallel 20240822
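The problem is that you do not let GNU Parallel serialize the output; instead you append directly to $final_output_file behind GNU Parallel's back. --line-buffer only governs output that passes through GNU Parallel's own stdout, so it has no effect on appends the jobs make themselves.
So you probably want: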
calculate_secondary_structure() {
    frame_counter=$1
    # process only chain A, as it is polyQ
    ${stride_path}/stride ${pdbs_dir}/pdb_for_cluster${frame_counter}.pdb -ca -fsecondary_structure${frame_counter}.txt
    rm ${pdbs_dir}/pdb_for_cluster${frame_counter}.pdb
    awk -v frame_counter="$frame_counter" '
    BEGIN { OFS=" " } # 10 spaces as a separator
    /^ASG/ {
        residue_name = substr($0, 6, 3)
        chain_name = substr($0, 10, 1)
        residue_number = substr($0, 17, 4)
        ss_code = substr($0, 25, 1)
        # Print the frame number followed by the extracted fields with 10 spaces between them
        printf "%-10s %-10s %-10s %-10s %-10s\n", frame_counter, residue_name, chain_name, residue_number, ss_code
    }' "secondary_structure${frame_counter}.txt"
    rm secondary_structure${frame_counter}.txt
}
export -f calculate_secondary_structure
seq 1 ${number_of_frames} |
    parallel --bar --line-buffer -j192 calculate_secondary_structure {} >> ${final_output_file}
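(--block 1k --round-robin only make sense if you use --pipe/--pipepart; otherwise they do nothing. If you are running on a 192-core machine, also consider using -j100% (or simply leaving it out, since that is the default), so that you do not need to change the 192 when you get your new 512-core server.)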
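To check whether an output file still contains merged or truncated lines like the ones in the question, something along these lines might help. It is only a sketch: count_malformed_lines is a hypothetical helper name, and it assumes every valid record has exactly five whitespace-separated, non-blank fields (frame, residue name, chain, residue number, secondary-structure code), as the printf in the function suggests.

```bash
#!/usr/bin/env bash
# Hypothetical helper, not part of the original post.
# Assumption: a valid record has exactly 5 whitespace-separated fields.
# Prints the number of lines that do not match that shape.
count_malformed_lines() {
    awk 'NF != 5 { bad++ } END { print bad + 0 }' "$1"
}

# Usage (variable name taken from the question):
#   count_malformed_lines "$final_output_file"
```

A result of 0 is consistent with every line having been written whole; a non-zero count means some lines have too many or too few fields.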