How to find all occurrences of a substring in a string in Python while ignoring certain characters?

Question

I want to find all occurrences of a substring while ignoring certain characters. How can I do this in Python?

Example:

long_string = 'this is a t`es"t. Does the test work?'
small_string = "test"
chars_to_ignore = ['"', '`']
print(find_occurrences(long_string, small_string, chars_to_ignore))

This should return [(10, 16), (27, 31)] because we want to ignore the presence of the characters ` and ".

(10, 16) are the start and end indices of t`es"t in long_string, and (27, 31) are the start and end indices of test in long_string.

Using re.finditer does not work, because it does not ignore the presence of the characters ` and ":

import re
long_string = 'this is a t`es"t. Does the test work?'
small_string = "test"
matches = [(m.start(), m.end()) for m in re.finditer(small_string, long_string)]
print(matches)

This outputs [(27, 31)], missing (10, 16).

Answer 1

One solution:

import re

def substring_search(text, value, chars_to_ignore=None):
    """Return (start, end, matched_text) for every occurrence of value in text."""
    if not text or not value:
        return []

    # The same regex path handles both cases; with no ignored chars
    # the pattern is just the escaped value
    pattern = build_substring_search_regex(value, chars_to_ignore)
    return [(m.start(), m.end(), m.group()) for m in pattern.finditer(text)]

def build_substring_search_regex(value, chars_to_ignore):
    # Lazily match any run of ignored characters after each character of value
    ignore_string = ""
    if chars_to_ignore:
        ignore_pattern = "[{0}]*?"
        ignore_string = ignore_pattern.format("".join(map(sanitize_char_for_regex, chars_to_ignore)))

    regex_string = [sanitize_char_for_regex(character) + ignore_string for character in value]

    # Matching is case-insensitive; drop re.IGNORECASE for a case-sensitive search
    return re.compile("".join(regex_string), re.IGNORECASE)

def sanitize_char_for_regex(character):
    # Escape the dash so it cannot form a range inside the character class
    if character == '-':
        return r"\-"
    return re.escape(character)

long_string = 'this is a t`es"t. Does the test work?'
small_string = "test"
chars_to_ignore = ['"', '`']

print(substring_search(long_string, small_string, chars_to_ignore=chars_to_ignore))

Output:

[(10, 16, 't`es"t'), (27, 31, 'test')]

How it works:

Build a regex based on the substring (I took Fredy Treboux's code and converted it from C# to Python).
Use re.finditer on that regex.
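To make the generated pattern easier to see, here is a minimal sketch of the same idea. It is a simplified variant, not the answer's exact code: the helper name build_pattern is hypothetical, the ignored-character class is placed only between the characters of the search value (rather than after each one), and matching is case-sensitive.

```python
import re

def build_pattern(value, chars_to_ignore):
    # Hypothetical helper: a character class that matches any run of ignored characters
    ignore_class = "[" + "".join(re.escape(c) for c in chars_to_ignore) + "]*"
    # Put the class between consecutive characters of value
    return re.compile(ignore_class.join(re.escape(c) for c in value))

long_string = 'this is a t`es"t. Does the test work?'
pattern = build_pattern("test", ['"', '`'])

# On Python 3.7+ this should print something like: t["`]*e["`]*s["`]*t
print(pattern.pattern)

matches = [(m.start(), m.end()) for m in pattern.finditer(long_string)]
print(matches)  # [(10, 16), (27, 31)]
```

Because the class only sits between characters, a match never ends on an ignored character, which keeps the end index tight around the last matched letter.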

Tested with Python 3.11 on Windows 10.
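For comparison, a regex-free approach is also possible. The sketch below uses a different technique from the answer above: it strips the ignored characters, remembers where each kept character came from, searches the stripped text with str.find, and maps the hit back to the original indices. The name find_occurrences is taken from the question; the implementation is only an assumption about what the asker wanted, and unlike re.finditer it also reports overlapping matches.

```python
def find_occurrences(text, value, chars_to_ignore=()):
    if not value:
        return []
    ignored = set(chars_to_ignore)
    # positions[i] is the index in text of the i-th kept character
    positions = [i for i, ch in enumerate(text) if ch not in ignored]
    stripped = "".join(text[i] for i in positions)

    results = []
    start = stripped.find(value)
    while start != -1:
        last = start + len(value) - 1  # last matched character in stripped
        # Map back to the original string; the end index is exclusive
        results.append((positions[start], positions[last] + 1))
        start = stripped.find(value, start + 1)
    return results

long_string = 'this is a t`es"t. Does the test work?'
small_string = "test"
chars_to_ignore = ['"', '`']
print(find_occurrences(long_string, small_string, chars_to_ignore))  # [(10, 16), (27, 31)]
```

This version is case-sensitive and avoids any regex escaping concerns, at the cost of building a stripped copy of the text.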
