sed, awk, perl)

Question

Thanks for viewing this post. I will try to be clear and comprehensive in return!

Here is the situation:

Hundreds of .gz archives, each roughly a gigabyte in size.
A list of wanted data, given as identifiers. Each identifier is associated with the name of the single archive that contains its data.

Data structure of a .gz archive:

zcat archive.gz

    ...
    identifier_nth
    ...
    END_BLOCK
    ...
    ...
    ...
    identifier_1
    ...
    END_BLOCK
    ...
    ...
    ...
    identifier_1
    ...
    ...
    END_BLOCK
    ...
    ...
    identifier_nth
    ...
    END_BLOCK
    ...
    ...
    ...
    identifier_1
    ...
    END_BLOCK
    ...
    identifier_nth
    ...
    END_BLOCK

I currently do:

start="$wanted_identifier_of_list"  # set inside a `cat | while read` loop over a list of thousands of identifiers (here: identifier_1)
end="END_BLOCK"

zcat nth_archive.gz | sed -n "/${start}/,/${end}/p" > "${start}.dat"

It works, but it is slow and extracts too many blocks for each identifier. I only need a fraction of them: the first through the Nth occurrence.

So I would like to:

1) limit the number of blocks I retrieve to an arbitrary number (here N = 2, for example)
2) quit both zcat and sed as soon as that number of blocks has been retrieved

Any help would be greatly appreciated.

Thanks a lot.

Florian

Answer 1

Something like this should work, with an early exit. Untested, though.

$ zcat ... | awk -v start="identifier_1" -v end="END_BLOCK" -v n=2 '
                     !f && $0~start{f=1}                 # a wanted block starts
                     f                                   # print while inside it
                     f && $0~end{f=0; if(++c==n) exit}'  # block done; quit after n blocks
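
A self-contained sketch of the same early-exit idea, which can be tried on a throwaway archive. The function name `first_n_blocks`, the file `sample.gz`, and the sample lines are made up for illustration; `gzip -dc` is used here as a stand-in for `zcat`.

```shell
# Hypothetical helper: print the first n blocks that begin at a line matching
# the start pattern and end at the next END_BLOCK line, then stop reading.
first_n_blocks() {
    # $1 = archive, $2 = start pattern, $3 = number of blocks wanted
    gzip -dc "$1" | awk -v start="$2" -v end="END_BLOCK" -v n="$3" '
        !f && $0 ~ start { f = 1 }                      # a wanted block begins
        f                                               # print lines inside it
        f && $0 ~ end    { f = 0; if (++c == n) exit }  # block done; quit after n
    '
}

# Small made-up archive mimicking the layout shown in the question.
printf '%s\n' \
    identifier_2 a END_BLOCK \
    identifier_1 b END_BLOCK \
    identifier_2 c END_BLOCK \
    identifier_1 d END_BLOCK \
    identifier_1 e END_BLOCK | gzip -c > sample.gz

first_n_blocks sample.gz identifier_1 2 > identifier_1.dat
```

Once awk exits, the decompressor should be terminated by a broken pipe on its next write, so the rest of the archive is normally not decompressed. Note that `$0 ~ start` is a regular-expression match, so `identifier_1` would also match a line containing `identifier_10`; anchoring the pattern or comparing with `==` avoids that if the identifier sits alone on its line.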

Answer 2

Here is some more input: I use "########## Name: ZINC000005215379" as the start and "########## Name:" as the current stop.

...
##########                 Name:     ZINC000005215379
...

@<TRIPOS>MOLECULE
 ZINC000005215379      none
   58    62     1     0     0
...
@<TRIPOS>ATOM
      1 C1         -1.3168    -6.3293    -6.1200 C.3        1  LIG1  -0.1600
      2 C2         -0.1404    -5.3624    -5.9715 C.3        1  LIG1   0.0700
...
@<TRIPOS>BOND
     1    1    2 1
     2    1   41 1
...
##########                 Name:     ZINC000005215379
...

@<TRIPOS>MOLECULE
 ZINC000005215379      none
   58    62     1     0     0
...
@<TRIPOS>ATOM
      1 C1         -1.3168    -6.3293    -6.1200 C.3        1  LIG1  -0.1600
      2 C2         -0.1404    -5.3624    -5.9715 C.3        1  LIG1   0.0700
...
@<TRIPOS>BOND
     1    1    2 1
     2    1   41 1
...
##########                 Name:     ZINC000004473749
...

@<TRIPOS>MOLECULE
 ZINC000004473749      none
...
@<TRIPOS>ATOM

...
@<TRIPOS>BOND
     1    1    2 1
     2    1   41 1
...
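
With this layout the stop marker is also the start marker of the next record, so counting end lines does not apply directly. One possible sketch, assuming each record opens with a `########## Name:` line whose last whitespace-separated field is the identifier (as in the sample above): decide at every header line whether the record is wanted, and exit at the first header seen after n wanted records. The function name `first_n_records`, the file `poses.gz`, and the `pose N` lines are hypothetical.

```shell
# Hypothetical helper: print the first n records whose "Name:" header carries
# the given identifier, then stop reading the archive.
first_n_records() {
    # $1 = archive, $2 = identifier, $3 = number of records wanted
    gzip -dc "$1" | awk -v id="$2" -v n="$3" '
        /^#+ +Name:/ {            # a header line opens a new record
            if (c == n) exit      # already have n wanted records: stop
            f = ($NF == id)       # wanted only if the last field is the id
            if (f) c++
        }
        f                         # print lines of wanted records
    '
}

# Small made-up archive following the header convention shown above.
printf '%s\n' \
    '##########                 Name:     ZINC000005215379' \
    '@<TRIPOS>MOLECULE' 'pose 1' \
    '##########                 Name:     ZINC000005215379' \
    '@<TRIPOS>MOLECULE' 'pose 2' \
    '##########                 Name:     ZINC000005215379' \
    '@<TRIPOS>MOLECULE' 'pose 3' \
    '##########                 Name:     ZINC000004473749' \
    '@<TRIPOS>MOLECULE' 'other' | gzip -c > poses.gz

first_n_records poses.gz ZINC000005215379 2 > ZINC000005215379.dat
```

Comparing `$NF` with `==` rather than using a regular expression keeps an identifier from matching a longer one that merely contains it.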
