Removing email addresses that contain whitespace

Question

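I am working with call-centre transcripts. Ideally, the speech-to-text software would transcribe an email address as follows: [email protected]. That is not always the case, so I am looking for a regular expression (regex) solution that tolerates whitespace inside email addresses, e.g. Maya [email protected], maya.lucco @proton.me, or maya-lucco@pro ton.me.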

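I tried extending this solution using regex101, without success. Compiling a re object (pattern) as suggested in this solution seems overly complicated for the task. I also looked at posts about validating email addresses, but they describe a different problem. My code so far: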

import re
import pandas as pd

#creating some data 
test = ['some random text maya @ proton.me with some more text [email protected]',
        '[email protected] with another address [email protected]',
        'some text maya.lucco @proton.me with some more bla [email protected]',
        '[email protected] more text maya@ proton.me '
        ]
        
test = pd.DataFrame(test, columns = ['words'])

#creating a function because I like to add some other data cleaning to it later on
def anonymiseEmail(text):
    
    text = str(text) #make text as string variable
    text = text.strip() #remove any leading, and trailing whitespaces
    text = re.sub(r'\S*@\S*\s?', '{e-mail}', text) #remove e-mail address
    
    return text

# applying the function
test['noEmail'] = test.words.apply(anonymiseEmail)

#checking the results
print(test.noEmail[0])

Output: some random text maya {e-mail}proton.me with some more text {e-mail}

The first email address is not fully removed. The name maya is still there, which is a problem for the project.
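To see why the output looks like this, it can help to list what the pattern actually matches. The following is a minimal sketch (the variable names are illustrative and not part of the original code). Since `\S*` cannot cross whitespace, a space on both sides of the @ shrinks the match to the @ sign plus one trailing space.

```python
import re

# Same pattern as in anonymiseEmail above
pattern = re.compile(r'\S*@\S*\s?')
text = 'some random text maya @ proton.me with some more text'

# List every substring the pattern matches
matches = pattern.findall(text)
print(matches)  # ['@ ']

# Only that substring is replaced; the name and the domain stay behind
result = pattern.sub('{e-mail}', text)
print(result)  # some random text maya {e-mail}proton.me with some more text
```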

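How can I extend the code so that the whole email address, no matter how many spaces it contains, is replaced with a placeholder or removed?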

Update following the comments:

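I have looked into regex lookahead and lookbehind, i.e. (?=@) and (?<=@), but cannot get them to match the words before or after the @ sign. I am looking at a snippet that Wiktor Stribiżew provided on another occasion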

\b(?:Dear|H(?:ello|i))(?:[^\S\r\n]+[A-Z]\w*(?:[’'-]\w+)*\.?)+''', '', text

and thought I could update it to

(?i)\b(?<=@)(?:[^\S\r\n]+[A-Z]\w*(?:[’'-]\w+)*\.?)+

but according to regex101 it does not match any email address. Could the snippet (?i)\b(?<=@) (or any other regex) be modified to match the preceding and/or following words?
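One reading of why the adapted pattern never matches, offered as a sketch rather than a definitive diagnosis: lookarounds are zero-width, and `\b(?<=@)` requires a word character immediately after the @, while the next token `[^\S\r\n]+` requires whitespace at that same position. Both cannot hold at once. Dropping `\b` lets lookarounds pick up the neighbouring words. The names `attempt`, `after_at` and `before_at` are illustrative.

```python
import re

# The adapted pattern from the question
attempt = re.compile(r"(?i)\b(?<=@)(?:[^\S\r\n]+[A-Z]\w*(?:[’'-]\w+)*\.?)+")

samples = ["maya @ proton.me", "maya@ proton.me", "maya@proton.me"]
hits = [bool(attempt.search(s)) for s in samples]
print(hits)  # [False, False, False]

# Lookbehind without \b: optional spaces, then the word following the @
after_at = re.compile(r"(?<=@)[^\S\r\n]*[\w.-]+")
# Lookahead: the word preceding the @, then optional spaces
before_at = re.compile(r"[\w.+-]+[^\S\r\n]*(?=@)")

print(repr(after_at.search("maya @ proton.me").group()))   # ' proton.me'
print(repr(before_at.search("maya @ proton.me").group()))  # 'maya '
```

Both matches include the whitespace between the word and the @, so they would need to be combined (or the @ consumed directly, without lookarounds) before a substitution removes the whole address.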

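Another possible solution I thought of is to select the 5 words before and after each @ sign, put them into a separate variable, and check whether there is whitespace within 4 letters/characters before or after the @ sign. If so, put them in a queue for manual review. My open questions about this solution are a) computing power, b) technical implementation, and c) general feasibility. But I thought I should share it in the spirit of trying to find a solution.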
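The manual-review idea can be sketched roughly as follows. This is a simplified variant under stated assumptions, not a tested production approach: instead of counting characters around the @, it flags any whitespace-separated token that contains an @ but does not look like a complete address. The names `WELL_FORMED` and `review_queue`, and the default window of 5 words, are hypothetical.

```python
import re

# Assumption: a token counts as "complete" if it has the rough shape local@domain.tld
WELL_FORMED = re.compile(r"^[\w.%+-]+@[\w-]+(?:\.[\w-]+)+$")

def review_queue(lines, n_words=5):
    """Collect (line number, context window) pairs for suspicious @ tokens."""
    queue = []
    for line_no, line in enumerate(lines):
        tokens = str(line).split()
        for i, tok in enumerate(tokens):
            if "@" in tok and not WELL_FORMED.match(tok):
                # Keep up to n_words tokens on each side for the human reviewer
                window = tokens[max(0, i - n_words): i + n_words + 1]
                queue.append((line_no, " ".join(window)))
    return queue

lines = [
    "some random text maya @ proton.me with some more text",
    "write to maya.lucco@proton.me today",
]
queue = review_queue(lines)
print(queue)
# [(0, 'some random text maya @ proton.me with some more text')]
```

On computing cost: this is a single pass over the tokens of each line, so it should scale linearly with the amount of text. Whether the resulting queue is small enough to review by hand depends on the data.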

Answer 1

Here is an adjusted regex that recognises email addresses with whitespace around the @ sign, such as maya @ proton.me.

import re
import pandas as pd

def anonymiseEmail(text):
    email_regex = r"\b[a-zA-Z0-9._%+-]+\s*@\s*[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}\b"
    return re.sub(email_regex, "{e-mail}", str(text).strip())


lines = [
    "some random text maya @ proton.me with some more text [email protected]",
    "[email protected] with another address [email protected]",
    "some text maya.lucco @proton.me with some more bla [email protected]",
    "[email protected] more text maya@ proton.me "
    ]

sample = pd.DataFrame(columns=["Lines"], data=lines)
sample["NoEmail"] = sample.Lines.apply(anonymiseEmail)

print(sample.NoEmail)

Output:

0    some random text {e-mail} with some more text ...
1               {e-mail} with another address {e-mail}
2       some text {e-mail} with some more bla {e-mail}
3                          {e-mail} more text {e-mail}
Name: NoEmail, dtype: object
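The pattern above allows whitespace only directly around the @, so an address such as maya-lucco@pro ton.me from the question is not matched. A possible extension, as a sketch only: allow a single whitespace character inside the first domain label. The names `LOCAL`, `LABEL`, `SPACED_EMAIL` and `anonymise_spaced_email` are hypothetical, and the pattern can over-match ordinary text such as "meet @ noon tomorrow.ok", so it trades precision for recall.

```python
import re

LOCAL = r"[a-zA-Z0-9._%+-]+"   # part before the @
LABEL = r"[a-zA-Z0-9-]+"       # one dot-free piece of the domain

SPACED_EMAIL = re.compile(
    r"\b" + LOCAL + r"\s*@\s*"      # whitespace allowed around the @
    + LABEL + r"(?:\s" + LABEL + r")?"  # optionally one space inside the first label
    + r"(?:\." + LABEL + r")*"      # further labels, e.g. mail.proton
    + r"\.[a-zA-Z]{2,}\b"           # top-level domain
)

def anonymise_spaced_email(text):
    return SPACED_EMAIL.sub("{e-mail}", str(text).strip())

print(anonymise_spaced_email("reach me at maya-lucco@pro ton.me please"))
# reach me at {e-mail} please
print(anonymise_spaced_email("maya @ proton.me with some more text"))
# {e-mail} with some more text
```

A space inside the local part (for example a first name split off from the rest of the address) is not handled here, because the regex has no way to tell the detached fragment from ordinary words.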
