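I've seen many versions of similar questions, but no good one that uses bash (unless I missed it, in which case please point me to it).
I need to append values by matching the key in file1.txt to the same key in keyvalue.txt.
The file formats are as follows.
file1.txt, which contains the "taxid" KEY in column 10 (just before the species name):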
1e470705074f483368a70ad18a7,gi|296972128|gb|HM278533.1|,249,1309,100,237,98.394,1.37e-118,438,0,Uncultured,bacterium clone ncd557a11c1 16S ribosomal RNA gene, partial sequence
29ec61e470705074f483368a70ad18a7,gi|2259146620|emb|OW925276.1|,249,363,100,237,98.394,1.37e-118,438,0,uncultured,Anaerolineae bacterium DNA containing 16S-23S intergenic spacer region, clone 43
29ec61e470705074f483368a70ad18a7,gi|1771871051|emb|LR646018.1|,249,1412,100,228,97.189,1.38e-113,422,253,uncultured,bacterium partial 16S rRNA gene
29ec61e470705074f483368a70ad18a7,gi|1271843868|gb|MG277305.1|,249,363,100,225,96.787,6.40e-112,416,253,Uncultured,bacterium clone OTU_3084 16S ribosomal RNA gene, partial sequence
29ec61e470705074f483368a70ad18a7,gi|1271845089|gb|MG278526.1|,249,363,98,192,93.004,1.42e-93,355,253,Uncultured,bacterium clone OTU_4378 16S ribosomal RNA gene, partial sequence
29ec61e470705074f483368a70ad18a7,gi|378812876|gb|JQ408060.1|,249,1299,100,192,92.369,1.42e-93,355,43,Uncultured,bacterium clone SEV1BE061 16S ribosomal RNA gene, partial sequence
29ec61e470705074f483368a70ad18a7,gi|151655702|gb|EU012237.1|,249,1030,100,189,91.968,6.59e-92,350,43,Uncultured,Chloroflexi bacterium clone 251 16S ribosomal RNA gene, partial sequence
29ec61e470705074f483368a70ad18a7,gi|440507165|gb|KC001159.1|,249,1308,100,189,92.000,6.59e-92,350,0,Unidentified,marine bacterioplankton clone P3-1B_21 16S ribosomal RNA gene, partial sequence
29ec61e470705074f483368a70ad18a7,gi|2259561892|emb|OW932746.1|,249,363,100,186,91.566,3.07e-90,344,391,uncultured,Anaerolineae bacterium DNA containing 16S-23S intergenic spacer region, clone 43
29ec61e470705074f483368a70ad18a7,gi|165972207|dbj|AB257651.1|,249,1468,100,186,91.700,3.07e-90,344,391,Uncultured,Chloroflexus sp. gene for 16S rRNA, partial sequence, clone: Dolo_29
keyvalue.txt:
0,Bacteria,Pseudomonadota,Gammaproteobacteria,Enterobacterales,Enterobacteriaceae,Escherichia,Escherichia coli
21,Bacteria,Pseudomonadota,Alphaproteobacteria,Caulobacterales,Caulobacteraceae,Phenylobacterium,Phenylobacterium immobile
33,Bacteria,Myxococcota,Myxococcia,Myxococcales,Myxococcaceae,Myxococcus,Myxococcus fulvus
35,Bacteria,Myxococcota,Myxococcia,Myxococcales,Myxococcaceae,Corallococcus,Corallococcus macrosporus
41,Bacteria,Myxococcota,Myxococcia,Myxococcales,Archangiaceae,Stigmatella,Stigmatella aurantiaca
43,Bacteria,Myxococcota,Myxococcia,Myxococcales,Archangiaceae,Cystobacter,Cystobacter fuscus
253,Bacteria,Bacteroidota,Flavobacteriia,Flavobacteriales,Weeksellaceae,Chryseobacterium,Chryseobacterium indologenes
254,Bacteria,Bacteroidota,Flavobacteriia,Flavobacteriales,Weeksellaceae,Chryseobacterium,Chryseobacterium indoltheticum
391,Bacteria,Pseudomonadota,Alphaproteobacteria,Hyphomicrobiales,Rhizobiaceae,Rhizobium,Rhizobium sp.
396,Bacteria,Pseudomonadota,Alphaproteobacteria,Hyphomicrobiales,Rhizobiaceae,Rhizobium,Rhizobium phaseoli
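The desired result is simply to append the entire matching line from keyvalue.txt (key and values) to the end of each file1.txt line that has the same key ID in column 10. I've been trying to get the key from file1.txt, but for some reason I capture extra values, so I can't extract the right values from keyvalue.txt.
This is what I've been trying, but it doesn't work well: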
for line in $(cat file1.txt); do key=$( echo $line | cut -d"," -f10); \
if grep $key keyvalue.txt; then grep $key keyvalue.txt; fi; done
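But I get a lot of strange output: it greps for the wrong keys and basically just returns my entire keyvalue.txt with lots of odd whitespace, which is a real mess. I think the species name (everything after column 10) contains extra commas and that causes this kind of problem; even just echoing the lines produces odd results.
For example (just testing on another random line in the file):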
for line in $(cat file1.txt); do echo $line; done
Output:
29ec61e470705074f483368a70ad18a7,gi|165972207|dbj|AB257651.1|,249,1468,100,186,91.700,3.07e-90,344,391,Uncultured,Chloroflexus
sp.
gene
for
16S
rRNA,
partial
sequence,
clone:
Dolo_29
When it should be:
29ec61e470705074f483368a70ad18a7,gi|165972207|dbj|AB257651.1|,249,1468,100,186,91.700,3.07e-90,344,391,Uncultured,Chloroflexus sp. gene for 16S rRNA, partial sequence, clone: Dolo_29
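The broken echo output is consistent with shell word splitting: an unquoted $(cat file1.txt) is split on spaces, tabs and newlines, so the loop runs once per word rather than once per line. Below is a minimal sketch of the difference, using a made-up one-line sample file (the id1 row and the variable names are illustrative, not taken from the real data):

```shell
#!/usr/bin/env bash
# Hypothetical one-line sample file, for illustration only.
tmp=$(mktemp)
printf '%s\n' 'id1,gi|1|,249,1468,100,186,91.700,3.07e-90,344,391,Uncultured,Chloroflexus sp. gene for 16S rRNA' > "$tmp"

# for + $(cat ...): the file contents are split on whitespace,
# so each space-separated word becomes its own iteration.
for_count=0
for word in $(cat "$tmp"); do
    for_count=$((for_count + 1))
done

# while + read: one iteration per line; IFS= and -r keep the line verbatim.
read_count=0
while IFS= read -r line; do
    read_count=$((read_count + 1))
    # Quote "$line" so the spaces inside it survive until cut sees them.
    key=$(printf '%s\n' "$line" | cut -d',' -f10)
done < "$tmp"

echo "for loop iterations:   $for_count"
echo "while loop iterations: $read_count"
echo "key from column 10:    $key"
rm -f "$tmp"
```

On this sample the for loop iterates 6 times, the while loop once, and the extracted key is 391. Because the key sits before the species name, the extra commas in the name do not affect cut -f10 once the line is read whole.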
My ideal output is just the following (it can be a separate file; it doesn't even need to be appended to file1.txt, since I can append it myself later):
0,Bacteria,Pseudomonadota,Gammaproteobacteria,Enterobacterales,Enterobacteriaceae,Escherichia,Escherichia coli
0,Bacteria,Pseudomonadota,Gammaproteobacteria,Enterobacterales,Enterobacteriaceae,Escherichia,Escherichia coli
253,Bacteria,Bacteroidota,Flavobacteriia,Flavobacteriales,Weeksellaceae,Chryseobacterium,Chryseobacterium indologenes
253,Bacteria,Bacteroidota,Flavobacteriia,Flavobacteriales,Weeksellaceae,Chryseobacterium,Chryseobacterium indologenes
253,Bacteria,Bacteroidota,Flavobacteriia,Flavobacteriales,Weeksellaceae,Chryseobacterium,Chryseobacterium indologenes
43,Bacteria,Myxococcota,Myxococcia,Myxococcales,Archangiaceae,Cystobacter,Cystobacter fuscus
43,Bacteria,Myxococcota,Myxococcia,Myxococcales,Archangiaceae,Cystobacter,Cystobacter fuscus
0,Bacteria,Pseudomonadota,Gammaproteobacteria,Enterobacterales,Enterobacteriaceae,Escherichia,Escherichia coli
391,Bacteria,Pseudomonadota,Alphaproteobacteria,Hyphomicrobiales,Rhizobiaceae,Rhizobium,Rhizobium sp.
391,Bacteria,Pseudomonadota,Alphaproteobacteria,Hyphomicrobiales,Rhizobiaceae,Rhizobium,Rhizobium sp.
With help from @markp-fuso, here is a solution that worked for me:
while IFS= read -r line; do key=$(echo "$line" | cut -d"," -f10); if grep -wq "^${key}" keyvalue.txt; then grep -w "^${key}" keyvalue.txt; fi; done < file1.txt
Output:
0,Bacteria,Pseudomonadota,Gammaproteobacteria,Enterobacterales,Enterobacteriaceae,Escherichia,Escherichia coli
0,Bacteria,Pseudomonadota,Gammaproteobacteria,Enterobacterales,Enterobacteriaceae,Escherichia,Escherichia coli
253,Bacteria,Bacteroidota,Flavobacteriia,Flavobacteriales,Weeksellaceae,Chryseobacterium,Chryseobacterium indologenes
253,Bacteria,Bacteroidota,Flavobacteriia,Flavobacteriales,Weeksellaceae,Chryseobacterium,Chryseobacterium indologenes
253,Bacteria,Bacteroidota,Flavobacteriia,Flavobacteriales,Weeksellaceae,Chryseobacterium,Chryseobacterium indologenes
43,Bacteria,Myxococcota,Myxococcia,Myxococcales,Archangiaceae,Cystobacter,Cystobacter fuscus
43,Bacteria,Myxococcota,Myxococcia,Myxococcales,Archangiaceae,Cystobacter,Cystobacter fuscus
0,Bacteria,Pseudomonadota,Gammaproteobacteria,Enterobacterales,Enterobacteriaceae,Escherichia,Escherichia coli
391,Bacteria,Pseudomonadota,Alphaproteobacteria,Hyphomicrobiales,Rhizobiaceae,Rhizobium,Rhizobium sp.
391,Bacteria,Pseudomonadota,Alphaproteobacteria,Hyphomicrobiales,Rhizobiaceae,Rhizobium,Rhizobium sp.
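A note on why the ^ anchor matters here: a bare key pattern matches anywhere on a line, so a short key can hit longer keys or digits inside the taxonomy text. The sketch below uses a made-up lookup table (the rows and the key 39 are hypothetical, not from keyvalue.txt) to compare an unanchored pattern with one tied to the start of the line and the comma delimiter:

```shell
#!/usr/bin/env bash
# Small made-up lookup table in the same key,value layout as keyvalue.txt.
kv=$(mktemp)
printf '%s\n' \
    '39,Bacteria,Example one' \
    '391,Bacteria,Example two' \
    '396,Bacteria,Example three' \
    '43,Bacteria,Example 39' > "$kv"

key=39

# Unanchored: counts every line that contains "39" anywhere.
loose=$(grep -c -- "$key" "$kv")

# Anchored: the key must start the line and be followed by the comma delimiter.
strict=$(grep -c -- "^${key}," "$kv")

echo "unanchored matches: $loose"
echo "anchored matches:   $strict"
rm -f "$kv"
```

With these four rows the unanchored pattern counts 4 lines and the anchored one counts 1. Adding the trailing comma to the pattern is one way to get the same effect as combining ^ with grep -w.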
Hope this helps others with a similar problem.
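As a postscript: running grep once per input line rescans keyvalue.txt every time, which may get slow on large files. A commonly used alternative is a single awk pass that stores keyvalue.txt in an array keyed on its first field and then looks up column 10 of each file1.txt line. This is only a sketch, assuming the key is always field 10, that no field before it contains a comma, and that keys in keyvalue.txt are unique; the miniature sample rows and the names row and joined are made up for illustration.

```shell
#!/usr/bin/env bash
# Made-up miniature versions of the two files.
f1=$(mktemp); kv=$(mktemp)
printf '%s\n' \
    'q1,gi|1|,249,1309,100,237,98.394,1.37e-118,438,0,Uncultured,bacterium clone, partial sequence' \
    'q1,gi|2|,249,1412,100,228,97.189,1.38e-113,422,253,uncultured,bacterium' \
    'q1,gi|3|,249,1299,100,192,92.369,1.42e-93,355,999,Uncultured,bacterium' > "$f1"
printf '%s\n' \
    '0,Bacteria,Escherichia coli' \
    '253,Bacteria,Chryseobacterium indologenes' > "$kv"

# First file (NR == FNR): remember each whole line under its key (field 1).
# Second file: if field 10 is a known key, print the line plus the stored row.
joined=$(awk -F',' '
    NR == FNR  { row[$1] = $0; next }
    $10 in row { print $0 "," row[$10] }
' "$kv" "$f1")

printf '%s\n' "$joined"
rm -f "$f1" "$kv"
```

Lines whose key is missing from the lookup table (999 in this sample) are skipped. To print only the matching keyvalue.txt rows, as in the output above, print row[$10] on its own instead of appending it to $0.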