Case-insensitive regex search, but with output that matches the capitalization in the regex

Question

I recently rewrote my program that finds words made up of chemical element symbols, such as "HErSHeY". I came up with this dynamic regular expression:

grep -Pi "^($(paste -s -d'|' element_symbols.txt))+$" /usr/share/dict/words

The regular expression produced by the paste substitution expands to something like

(H|He|Li| ... )

where element_symbols.txt is a file that looks like this:

H
He
Li

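The -i flag makes the search case-insensitive, so the output includes words such as "Hershey". Is there a way to keep the capitalization of the letters as they appear in the regex, so that the output looks like "HErSHeY" instead? That would mean replacing letters in the words file, so perhaps sed could be used instead.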

Answer 1

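If you want to preserve the capitalization, you will need something more elaborate than a plain grep. The idea is to match each part of the word against the corresponding chemical symbol and replace it with that symbol's own capitalization.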

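sed and the other standard Unix tools may not be flexible enough on their own for this kind of replacement, which involves conditional logic such as keeping specific letters capitalized. Consider awk or a scripting language such as Python or Perl, which handle these transformations more robustly.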

Here is how this could be implemented in Python:

import re

# Load chemical symbols, skipping blank lines
with open('element_symbols.txt', 'r') as file:
    symbols = [line.strip() for line in file if line.strip()]

# Sort symbols by length in descending order to prefer longer matches (e.g., 'He' over 'H')
symbols.sort(key=len, reverse=True)

# Map each lowercased symbol to its canonical capitalization (e.g., 'he' -> 'He')
canonical = {s.lower(): s for s in symbols}

alternation = '|'.join(map(re.escape, symbols))
# Matches words made up entirely of symbols, like the grep pattern in the question
whole_word = re.compile(r'(?:' + alternation + r')+', re.IGNORECASE)
# Matches one symbol, but only if the rest of the word can also be split into symbols
one_symbol = re.compile(r'(' + alternation + r')(?=(?:' + alternation + r')*\Z)', re.IGNORECASE)

# Function to replace a match with its correctly capitalized form
def match_replacer(match):
    text = match.group(0)
    return canonical.get(text.lower(), text)  # fallback, shouldn't happen

# Process each word in the dictionary
with open('/usr/share/dict/words', 'r') as words:
    for word in words:
        word = word.strip()
        # Print the recapitalized word only if it consists entirely of symbols
        if whole_word.fullmatch(word):
            print(one_symbol.sub(match_replacer, word))
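
For a quick check without the two input files, here is a minimal sketch of the same idea written as a plain backtracking function rather than a regular expression. The function name respell and the short hard-coded symbol list are assumptions made for illustration only; in practice the list would be read from element_symbols.txt.

```python
from functools import lru_cache

# Hypothetical sample; the real list would be read from element_symbols.txt
SYMBOLS = ["H", "He", "Er", "S", "Y", "N", "Na", "Ag"]

# Map each lowercased symbol to its capitalized form, e.g. 'he' -> 'He'
CANONICAL = {s.lower(): s for s in SYMBOLS}
MAX_LEN = max(len(s) for s in SYMBOLS)


@lru_cache(maxsize=None)
def respell(word):
    """Return word rewritten with element-symbol capitalization, or None
    if it cannot be split entirely into symbols (illustrative helper)."""
    word = word.lower()
    if not word:
        return ""
    # Try longer symbols first and backtrack if the rest cannot be split
    for n in range(min(MAX_LEN, len(word)), 0, -1):
        symbol = CANONICAL.get(word[:n])
        if symbol is not None:
            rest = respell(word[n:])
            if rest is not None:
                return symbol + rest
    return None


print(respell("Hershey"))  # HErSHeY
print(respell("nag"))      # NAg (backtracks from 'Na' to 'N' + 'Ag')
print(respell("hello"))    # None with this sample list
```

The backtracking matters for words like "nag": taking the longest symbol first ("Na") leaves a "g" that is not a symbol, so the function falls back to "N" followed by "Ag". The lru_cache decorator only avoids recomputing shared suffixes and could be dropped for dictionary-sized inputs.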
