Thanks for viewing this post. I will try to be clear and comprehensive in return!
Here is the situation:
Hundreds of .gz archives, each roughly a GB in size
A list of wanted data, given as identifiers. Each identifier is associated with the name of the single archive that contains its data.
Data structure of a .gz archive:
zcat archive.gz
...
identifier_nth
...
END_BLOCK
...
...
...
identifier_1
...
END_BLOCK
...
...
...
identifier_1
...
...
END_BLOCK
...
...
identifier_nth
...
END_BLOCK
...
...
...
identifier_1
...
END_BLOCK
...
identifier_nth
...
END_BLOCK
I currently do:
start="$wanted_identifier_of_list" # I cat | while read through a list of thousands of identifiers for the process (here $wanted_identifier_of_list = identifier_1)
end="END_BLOCK"
zcat nth_archive.gz | sed -n "/${start}/,/${end}/p" > "${start}.dat"
It works, but it is slow and it extracts too many blocks for each identifier. I only need a fraction of them, from the first to the Nth occurrence.
So I would like to:
1) limit the number of blocks I retrieve to an arbitrary number (here N = 2, for example)
2) quit both zcat and sed
Any help would be greatly appreciated.
Many thanks,
Florian
Something like this should work, with an early exit. It is untested, though.
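$ zcat ... | awk -v start="identifier_1" -v end="END_BLOCK" -v n=2 '
!f && $0 ~ start { f = 1 }                    # a wanted block begins
f                                             # print while inside a wanted block
f && $0 ~ end { f = 0; if (++c == n) exit }'  # block ends; quit after n blocks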
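To run this over the whole list, the same idea can be wrapped in a function and driven by a loop. The following is only a minimal sketch: the function name `extract_blocks`, the file name `wanted.list`, and its two-column layout (identifier, then archive name) are assumptions of mine, not something given in the question. It matches with `index()` rather than a regex so that special characters in an identifier are taken literally, and it uses `gzip -dc`, which does the same job as `zcat`.

```shell
#!/usr/bin/env bash
# extract_blocks ID ARCHIVE [N]: print the first N blocks of ID (default 2).
# Hypothetical helper; its name and argument order are assumptions.
extract_blocks() {
    local id="$1" archive="$2" n="${3:-2}"
    # awk exits after the n-th completed block; gzip then receives
    # SIGPIPE on its next write and stops decompressing.
    gzip -dc -- "$archive" | awk -v start="$id" -v end="END_BLOCK" -v n="$n" '
        !f && index($0, start) { f = 1 }                    # wanted block begins
        f                                                   # print while inside it
        f && index($0, end) { f = 0; if (++c >= n) exit }   # count finished blocks
    '
}

# wanted.list (assumed layout): "<identifier> <archive.gz>" on each line.
if [ -r wanted.list ]; then
    while read -r id archive; do
        extract_blocks "$id" "$archive" 2 > "${id}.dat"
    done < wanted.list
fi
```

Note that this is a substring match, so `identifier_1` would also match a line containing `identifier_10`; if that can happen in your data, compare the whole line or a specific field instead.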
Some more input below: I use "########## Name: ZINC000005215379" as the start and "########## Name:" as the current stop.
...
########## Name: ZINC000005215379
...
@<TRIPOS>MOLECULE
ZINC000005215379 none
58 62 1 0 0
...
@<TRIPOS>ATOM
1 C1 -1.3168 -6.3293 -6.1200 C.3 1 LIG1 -0.1600
2 C2 -0.1404 -5.3624 -5.9715 C.3 1 LIG1 0.0700
...
@<TRIPOS>BOND
1 1 2 1
2 1 41 1
...
########## Name: ZINC000005215379
...
@<TRIPOS>MOLECULE
ZINC000005215379 none
58 62 1 0 0
...
@<TRIPOS>ATOM
1 C1 -1.3168 -6.3293 -6.1200 C.3 1 LIG1 -0.1600
2 C2 -0.1404 -5.3624 -5.9715 C.3 1 LIG1 0.0700
...
@<TRIPOS>BOND
1 1 2 1
2 1 41 1
...
########## Name: ZINC000004473749
...
@<TRIPOS>MOLECULE
ZINC000004473749 none
...
@<TRIPOS>ATOM
...
@<TRIPOS>BOND
1 1 2 1
2 1 41 1
...
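With this layout the stop pattern "########## Name:" also matches the start line itself, so a plain start/end range closes as soon as it opens. One way around that, sketched here under the assumption that every record begins at its own "########## Name:" line, is to treat each such line as a record header and decide there whether the record is wanted. The function name `first_n_records` is hypothetical, and the header is compared field by field so the amount of spacing in the real files does not matter.

```shell
#!/usr/bin/env bash
# first_n_records ID ARCHIVE [N]: print the first N records named ID (default 2).
# Hypothetical helper; assumes each record starts at a "########## Name: <ID>" line.
first_n_records() {
    local id="$1" archive="$2" n="${3:-2}"
    gzip -dc -- "$archive" | awk -v id="$id" -v n="$n" '
        $1 == "##########" && $2 == "Name:" {   # a header opens a new record
            if (c >= n) exit                    # n wanted records already printed
            f = ($3 == id)                      # wanted only on an exact name match
            if (f) c++
        }
        f                                       # print lines of wanted records
    '
}
```

Called as `first_n_records ZINC000005215379 nth_archive.gz 2 > ZINC000005215379.dat`, it stops reading at the first header after the second wanted record, so the rest of the archive is never decompressed.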