Regex loop produces duplicate matches because of similar words. How can I avoid it?


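I have a regex problem. Although I wrote the code in MATLAB, I would like to keep the question as general as possible.

Information

LipidData is a 68x2 table with a name column and a Short column; the Short column holds strings such as LPC, PC, AC4PIM2, SHexCer and SQDG. The LipidData table does not change, whereas foundpattern may vary depending on the real input data it comes from.

foundpattern is an N×4 table; in my example N is 7. The only relevant column here is the first one, called ISDs, which contains the strings to check (for reproducibility, you can simply copy that column as a cell array). Here are the two MATLAB tables:

Input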

>> LipidData

LipidData =

 68×2 table

                Lipid subclass name                       Short   
___________________________________________________    ___________

{'Diacylated phosphatidylinositol monomannoside'                  }    {'Ac2PIM1'    }
{'Diacylated phosphatidylinositol dimannoside'                    }    {'Ac2PIM2'    }
{'Triacylated phosphatidylinositol dinomannoside'                 }    {'Ac3PIM2'    }
{'Tetraaacylated phosphatidylinositol dimannoside'                }    {'AC4PIM2'    }
{'Anacardic Acid'                                                 }    {'ACar'       }
{'Acetylglucose andrographolide'                                  }    {'AcylGlcADG' }
{'Bis[monoacylglycero]phosphates'                                 }    {'BMP'        }
{'Cholesteryl esters'                                             }    {'CE'         }
{'Ceramide'                                                       }    {'Cer'        }
{'Ceramide alpha-hydroxy fatty acid-dihydrosphingosine'           }    {'CerADS'     }
{'Ceramide alpha-hydroxy fatty acid-phytospingosine'              }    {'CerAP'      }
{'Ceramide beta-hydroxy fatty acid-sphingosine'                   }    {'CerAS'      }
{'Ceramide beta-hydroxy fatty acid-dihydrosphingosine'            }    {'CerBDS'     }
{'Ceramide beta-hydroxy fatty acid-sphingosine'                   }    {'CerBS'      }
{'Ceramide Esterified omega-hydroxy fatty acid-dihydrosphingosine'}    {'CerEODS'    }
{'Ceramide Esterified omega-hydroxy fatty acid-sphingosine'       }    {'CerEOS'     }
{'Ceramide non-hydroxyfatty acid-dihydrosphingosine'              }    {'CerNDS'     }
{'Ceramide non-hydroxyfatty acid-phytospingosine'                 }    {'CerNP'      }
{'Ceramide non-hydroxyfatty acid-sphingosine'                     }    {'Cer_NS'     }
{'Ceramide phosphate'                                             }    {'CerP'       }
{'Cholesterol'                                                    }    {'Cholesterol'}
{'Cardiolipins'                                                   }    {'CL'         }
{'Diacyl/alkylglycerides'                                         }    {'DG'         }
{'Digalactosyldiacylglycerols'                                    }    {'DGDG'       }
{'1,2-diacylglyceryl-3-O-4'-(N,N,N-trimethyl)-homoserine'         }    {'DGTS'       }
{'Ether Oxygenated Phosphatidylcholines'                          }    {'EtherOxPC'  }
{'Ether Oxygenated Phosphatidylethanolamines'                     }    {'EtherOxPE'  }
{'Ether-linked Phosphatidylcoline'                                }    {'EtherPC'    }
{'Ether-linked Phosphatidylethanolamine'                          }    {'EtherPE'    }
{'Fatty Acids'                                                    }    {'FA'         }
{'Fatty acid ester of hydroxyl fatty acid'                        }    {'FAHFA'      }
{'Glucuronosyldiacylglycerol'                                     }    {'GlcADG'     }
{'GM3 Ganglioside'                                                }    {'GM3'        }
{'Hidroxy Bis[monoacylglycero]phosphates'                         }    {'HBMP'       }
{'Hexosylceramide alpha-hydroxy fatty acid-phytospingosine'       }    {'HexCerAP'   }
{'Hexosylceramide non-hydroxyfatty acid-dihydrosphingosine'       }    {'HexCerNDS'  }
{'Hexosylceramide non-hydroxyfatty acid-sphingosine'              }    {'HexCer_NS'  }
{'Lyso 1,2-diacylglyceryl-3-O-4'-(N,N,N-trimethyl)-homoserine'    }    {'DGTS'       }
{'Lyso Phosphatidic acids'                                        }    {'LPA'        }
{'Lyso Phosphatidylcholines'                                      }    {'LPC'        }
{'Lyso Phosphatidylethanolamines'                                 }    {'LPE'        }
{'Lyso Phosphatidylglycerols'                                     }    {'LPG'        }
{'Lyso Phosphatidylinositols'                                     }    {'LPI'        }
{'Lyso Phosphatidylserines'                                       }    {'LPS'        }
{'Monoacyl/alkylglycerides'                                       }    {'MG'         }
{'Monogalactosyldiacylglycerols'                                  }    {'MGDG'       }
{'Oxygenated Cardiolipins'                                        }    {'OxCL'       }
{'Oxygenated Fatty Acids'                                         }    {'OxFA'       }
{'Oxygenated Phosphatidic acids'                                  }    {'OxPA'       }
{'Oxygenated Phosphatidylcholines'                                }    {'OxPC'       }
{'Oxygenated Phosphatidylethanolamines'                           }    {'OxPE'       }
{'Oxygenated Phosphatidylglycerols'                               }    {'OxPG'       }
{'Oxygenated Phosphatidylinositols'                               }    {'OxPI'       }
{'Oxygenated Phosphatidylserines'                                 }    {'OxPS'       }
{'Oxygenated Triacyl/alkylglycerides'                             }    {'OxTG'       }
{'Phosphatidic acids'                                             }    {'PA'         }
{'Phosphatidylbutyl alcohol'                                      }    {'PBtOH'      }
{'Phosphatidylcholines'                                           }    {'PC'         }
{'Phosphatidylethanolamines'                                      }    {'PE'         }
{'Phosphatidyletanol'                                             }    {'PEtOH'      }
{'Phosphatidylglycerols'                                          }    {'PG'         }
{'Phosphatidylinositols'                                          }    {'PI'         }
{'Phosphatidylmethanol'                                           }    {'PMeOH'      }
{'Phosphatidylserines'                                            }    {'PS'         }
{'Sulfatides hexosyl ceramide'                                    }    {'SHexCer'    }
{'Sphingomyelines'                                                }    {'SM'         }
{'Sulfoquinovosyl diacylglycerols'                                }    {'SQDG'       }
{'Triacyl/alkylglycerides'                                        }    {'TG'         }


>> foundpattern

foundpattern =

7×4 table

           ISDs                 tR      Standard desv      RSD  
__________________________    ______    _____________    _______

{'18:1 (d7) MG'          }      1.34       0.020418       1.5238
{'18:1(d7) LPC'          }    1.5868      0.0056024      0.35305
{'18:1 (d9) SM'          }    6.8999        0.08336       1.2081
{'15:0-18:1(d7) PC'      }     7.989       0.072533      0.90791
{'15:0-18:1(d7) DG'      }    12.085       0.097445      0.80631
{'15:0-18:1 (d7)-15:0 TG'}    17.487       0.029701      0.16984
{'Cholesterol (d7)'      }    18.247       0.032275      0.17687

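The problem arises when the regex built from the LipidData value PC is compared with the foundpattern value {'18:1(d7) LPC'}: this produces a "match", and I don't know how to avoid it. I only need to find the foundpattern.ISDs entries that contain exactly the same value as in Short. Another example of the same problem would presumably occur if Cer_NS were present in foundpattern: it would match not only its own LipidData value Cer_NS but also Cer.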
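To make the unwanted match concrete, here is a minimal sketch that uses only the two ISDs strings mentioned above. The variable names (idxPC, idxLPC) are made up for illustration and are not part of my actual code.

```matlab
% Unanchored pattern, built the same way as in the loop: a group around the Short code.
expression = '(PC)';

% regexp returns the start index of every match, or [] when there is none.
idxPC  = regexp('15:0-18:1(d7) PC', expression);   % intended match
idxLPC = regexp('18:1(d7) LPC', expression);       % unintended: 'PC' is found inside 'LPC'

disp(~isempty(idxPC))    % the pattern matches here, as wanted
disp(~isempty(idxLPC))   % the pattern also matches here, which causes the duplicate
```

The parentheses only capture the text; they do not stop the pattern from matching in the middle of a longer token.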

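I believed that putting the value into a group (a regex with parentheses), as you can see in the code, would be a solution, but the groups still match "slightly modified" strings, which leads to the duplicates. I know I am missing something, but I don't know what.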

Is there any way to avoid the duplicate matches? As you can see in the output below, the Codes cell array should have only 7 entries, not 8.

Code

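Codes = {};
% Outer loop: one regex per Short code in LipidData (column 2)
for j = 1:size(LipidData,1)
  expression = strcat("(", char(LipidData{j,2}), ")");
  % Inner loop: test every ISDs string in foundpattern (column 1)
  for i = 1:size(foundpattern,1)
    if regexp(char(foundpattern{i,1}), expression) ~= 0
      disp(foundpattern{i,1})
      disp(LipidData{j,2})
      Codes{end+1} = LipidData{j,2};
    end
  end
end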

Output

>> Codes

Codes =

1×8 cell array

Columns 1 through 6

{1×1 cell}    {1×1 cell}    {1×1 cell}    {1×1 cell}    {1×1 cell}    {1×1 cell}

Columns 7 through 8

{1×1 cell}    {1×1 cell}

>> for i=1:size(Codes,2)
Codes{i}
end

ans =

  1×1 cell array

  {'Cholesterol'}


ans =

  1×1 cell array

  {'DG'}


ans =

  1×1 cell array

  {'LPC'}


ans =

  1×1 cell array

  {'MG'}


ans =

  1×1 cell array

  {'PC'}


ans =

  1×1 cell array

  {'PC'}


ans =

  1×1 cell array

  {'SM'}


ans =

  1×1 cell array

  {'TG'}

>> 
regex matlab regex-lookarounds regex-group
1 answer

You need

expression = strcat('(?:[_\W]|^)(', regexptranslate('escape', char(LipidData{j,2})), ...
    ')(?:[_\W]|$)');

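The (?:[_\W]|^) part matches a non-word character, a _ character, or the start of the string. regexptranslate('escape', char(LipidData{j,2})) escapes any special regex metacharacters in the text so that it is used literally in the pattern. And (?:[_\W]|$) matches a _, a non-word character, or the end of the string.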
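As a rough sketch of how this behaves, the snippet below applies the pattern to the seven ISDs strings from the question and a subset of the Short codes, both copied into plain cell arrays. The names shortCodes, isds and hit are illustrative assumptions, and the string '18:1 (d7) Cer_NS' is a hypothetical input that does not appear in the question's data.

```matlab
% Subset of the Short codes and the full ISDs column, copied from the question's tables.
shortCodes = {'Cer', 'Cholesterol', 'DG', 'LPC', 'MG', 'PC', 'SM', 'TG'};
isds = {'18:1 (d7) MG', '18:1(d7) LPC', '18:1 (d9) SM', '15:0-18:1(d7) PC', ...
        '15:0-18:1(d7) DG', '15:0-18:1 (d7)-15:0 TG', 'Cholesterol (d7)'};

Codes = {};
for j = 1:numel(shortCodes)
    % Escape regex metacharacters, then require a delimiter (or the start/end
    % of the string) on both sides of the code.
    expression = ['(?:[_\W]|^)(', regexptranslate('escape', shortCodes{j}), ')(?:[_\W]|$)'];
    for i = 1:numel(isds)
        if ~isempty(regexp(isds{i}, expression, 'once'))
            Codes{end+1} = shortCodes{j};
        end
    end
end
disp(numel(Codes))   % one code per ISDs string; 'PC' no longer matches inside 'LPC'

% Caveat check on a hypothetical string: because '_' counts as a delimiter,
% 'Cer' still matches inside 'Cer_NS'.
hit = regexp('18:1 (d7) Cer_NS', '(?:[_\W]|^)(Cer)(?:[_\W]|$)', 'once');
disp(~isempty(hit))
```

If a code such as Cer should not match inside Cer_NS, one option is to drop the _ from both classes and use \W alone, since _ is itself a word character. Whether that is appropriate depends on how the real ISDs strings are delimited.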
