How can I optimize the performance of a regex pattern for extracting messages in Python?

Question

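I need a regex pattern to build a Pandas DataFrame with three separate columns: date, user, and message. I believe regex is the best option here, but if another approach reaches the same result faster, I would like to know about it.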

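The pattern must:

Capture the whole message, including multi-line messages.
Detect when there is no user and classify the entry as a group notification in the user column.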

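I have already written one, but I feel it is not really optimized, and it takes several seconds to run on large datasets. Here is the pattern I currently have, along with a sample of the data I need to extract and its format.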

import re

import pandas as pd

data = """
9/6/22, 11:28 - Group creator created group "Example"
9/6/22, 11:39 - User1: This is some text
9/6/22, 11:58 - Group creator changed group name to "Example2"
9/6/22, 12:13 - User2: This is
some text
with multiple lines
9/6/22, 13:13 - Admin changed group profile photo
9/6/22, 14:45 - User3: Hi StackOverflow
"""


pattern = r'\d{1,2}/\d{1,2}/\d{2,4},\s\d{1,2}:\d{2}\s-\s'

#pass the pattern and data to split it to get the list of messages with users
messages = re.split(pattern, data)[1:]

#extract all dates
dates = re.findall(pattern, data)

#create dataframe
df = pd.DataFrame({'user_message': messages, 'date': dates})

#separate Users and Message
users = []
messages = []
for message in df['user_message']:
    entry = re.split(r'([\w\W]+?):\s', message, maxsplit=1)  # split only on the first ": "
    if entry[1:]:  # user name
        users.append(entry[1])
        messages.append(" ".join(entry[2:]))
    else:
        users.append('group_notification')
        messages.append(entry[0])

df['user'] = users
df['message'] = messages
df.drop(columns=['user_message'], inplace=True)

Answer 1

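Assuming users cannot game the format by crafting a multi-line message that mimics a new entry, I would probably parse the text manually and then load the data into pandas if needed.

The trick here is to keep a handle on the "prior" record so that, when a line continues the previous message, we can extend that prior message.

Here is a concept you can try: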

import io
import datetime
import pandas

data = """
9/6/22, 11:28 - Group creator created group "Example"
9/6/22, 11:39 - User1: This is some text
9/6/22, 11:58 - Group creator changed group name to "Example2"
9/6/22, 12:13 - User2: This is
some text
with multiple lines
9/6/22, 13:13 - Admin changed group profile photo
9/6/22, 14:45 - User3: Hi StackOverflow
""".strip()

results = []
prior = {}
with io.StringIO(data) as file_in:
    for row in file_in:
        parts = row.split(" - ", 1)  # split once so " - " inside a message is kept

        ## ------------------------
        ## If this row does not start with a date
        ## assume it is part of the prior message
        ## ------------------------
        try:
            current = {
                "date": datetime.datetime.strptime(parts[0], "%m/%d/%y, %H:%M"),
                "user": "group_notification",
                "message": ""
            }
        except ValueError:
            prior["message"] += row
            continue
        ## ------------------------

        ## ------------------------
        ## The prior record is complete
        ## ------------------------
        if prior:
            results.append(prior)
        ## ------------------------

        ## ------------------------
        ## Break the remaining text into user and message
        ## ------------------------
        sub_parts = parts[1].split(" ")
        if sub_parts[0].endswith(":"):
            current["user"] = sub_parts[0][:-1]
            current["message"] = " ".join(sub_parts[1:])
        else:
            current["message"] = parts[1]
        ## ------------------------

        prior = current

    ## ------------------------
    ## The prior record is complete
    ## ------------------------
    results.append(prior)
    ## ------------------------

print(pandas.DataFrame(results))

This should give you:

                 date                user                                           message
0 2022-09-06 11:28:00  group_notification           Group creator created group "Example"\n
1 2022-09-06 11:39:00               User1                               This is some text\n
2 2022-09-06 11:58:00  group_notification  Group creator changed group name to "Example2"\n
3 2022-09-06 12:13:00               User2         This is\nsome text\nwith multiple lines\n
4 2022-09-06 13:13:00  group_notification               Admin changed group profile photo\n
5 2022-09-06 14:45:00               User3                                  Hi StackOverflow
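
If you would rather stay with regex, the three fields can also be captured in a single pass with one compiled pattern and `re.finditer`, instead of a split, a findall, and a per-message split. The following is only a sketch under some assumptions that are mine rather than the question's: every entry starts a line with "M/D/YY, H:MM - " using a two-digit year, and user names contain neither ":" nor a newline. The names `DATE`, `ENTRY`, and `parse_chat` are made up for illustration. Whether this is actually faster is something to time on your own data.

```python
import re
from datetime import datetime

# Hypothetical building block: the timestamp that opens every entry.
# Assumes a two-digit year so that "%y" can parse it below.
DATE = r"\d{1,2}/\d{1,2}/\d{2}, \d{1,2}:\d{2}"

ENTRY = re.compile(
    r"^(?P<date>" + DATE + r") - "     # timestamp prefix at the start of a line
    r"(?:(?P<user>[^:\n]+): )?"        # optional "User: " part (no ":" in names)
    # first message line, then any following lines that do not open a new entry
    r"(?P<message>.*(?:\n(?!" + DATE + r" - ).*)*)",
    re.MULTILINE,
)


def parse_chat(text):
    """Return one dict per entry with the keys date, user, and message."""
    records = []
    for match in ENTRY.finditer(text):
        records.append({
            "date": datetime.strptime(match["date"], "%m/%d/%y, %H:%M"),
            # The user group is None when the line has no "Name: " prefix.
            "user": match["user"] or "group_notification",
            "message": match["message"].rstrip("\n"),
        })
    return records


sample = """
9/6/22, 11:28 - Group creator created group "Example"
9/6/22, 12:13 - User2: This is
some text
with multiple lines
9/6/22, 14:45 - User3: Hi StackOverflow
"""

for record in parse_chat(sample):
    print(record["date"], record["user"], repr(record["message"]))
```

The list returned by `parse_chat` can be passed to `pandas.DataFrame(...)` in the same way as `results` above. Note that this sketch shares a limitation with the pattern in the question: a group notification whose first line happens to contain ": " would be read as a user message.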
