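I need a regex pattern to build a Pandas DataFrame with three separate columns: date, user, and message. I believe regex is the best option here, but if there is another way to get the same result faster, I'd like to know.
The pattern must:
Capture the entire message, including multi-line messages.
Detect when there is no user and classify the entry as a group notification in the user column.
I've already created one, but I don't feel it is really optimized, and it takes several seconds to run on large datasets. Here is the pattern I currently have, along with a sample of the data and the format I need to extract.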
import re
import pandas as pd

data = """
9/6/22, 11:28 - Group creator created group "Example"
9/6/22, 11:39 - User1: This is some text
9/6/22, 11:58 - Group creator changed group name to "Example2"
9/6/22, 12:13 - User2: This is
some text
with multiple lines
9/6/22, 13:13 - Admin changed group profile photo
9/6/22, 14:45 - User3: Hi StackOverflow
"""
pattern = r'\d{1,2}/\d{1,2}/\d{2,4},\s\d{1,2}:\d{2}\s-\s'

# Pass the pattern and data to split to get the list of messages with users
messages = re.split(pattern, data)[1:]

# Extract all dates
dates = re.findall(pattern, data)

# Create the dataframe
df = pd.DataFrame({'user_message': messages, 'date': dates})

# Separate users and messages
users = []
messages = []
for message in df['user_message']:
    entry = re.split(r'([\w\W]+?):\s', message)
    if entry[1:]:  # user name
        users.append(entry[1])
        messages.append(" ".join(entry[2:]))
    else:
        users.append('group_notification')
        messages.append(entry[0])

df['user'] = users
df['message'] = messages
df.drop(columns=['user_message'], inplace=True)
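Assuming a user can't mess around and craft a multi-line message that mimics a new entry, I would probably parse it by hand and then load the data into pandas if needed.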
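To make that assumption concrete, here is a small sketch of the check that decides whether a line opens a new entry. The helper name `starts_new_entry` and the sample lines are mine, for illustration only, and the date format is the one I am assuming from the sample data (month/day/two-digit year).

```python
import datetime


def starts_new_entry(line: str) -> bool:
    """Return True if the text before the first " - " parses as a date header.

    Hypothetical helper; the format string is an assumption based on the sample.
    """
    prefix = line.split(" - ", 1)[0]
    try:
        datetime.datetime.strptime(prefix, "%m/%d/%y, %H:%M")
    except ValueError:
        return False
    return True


sample_lines = [
    "9/6/22, 12:13 - User2: This is",     # a genuine new entry
    "some text",                          # a continuation line
    "9/6/22, 12:14 - Fake: pasted text",  # could be a new entry OR text typed inside a message
]

for line in sample_lines:
    print(starts_new_entry(line), repr(line))
```

The third line is the problem case: it is flagged as a new entry even if someone typed or pasted it inside a longer message, and no line-based parser (or regex) can tell the difference from the text alone.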
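The trick here is to keep a handle on the "prior" record so that, when we hit a continuation line, we can extend the prior message.
Here is a concept you can try: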
import io
import datetime
import pandas
data = """
9/6/22, 11:28 - Group creator created group "Example"
9/6/22, 11:39 - User1: This is some text
9/6/22, 11:58 - Group creator changed group name to "Example2"
9/6/22, 12:13 - User2: This is
some text
with multiple lines
9/6/22, 13:13 - Admin changed group profile photo
9/6/22, 14:45 - User3: Hi StackOverflow
""".strip()
results = []
prior = {}

with io.StringIO(data) as file_in:
    for row in file_in:
        # Split only on the first " - " so a message containing " - " stays intact
        parts = row.split(" - ", 1)

        ## ------------------------
        ## If this row does not start with a date
        ## assume it is part of the prior message
        ## ------------------------
        try:
            current = {
                "date": datetime.datetime.strptime(parts[0], "%m/%d/%y, %H:%M"),
                "user": "group_notification",
                "message": ""
            }
        except ValueError:
            if prior:  # ignore stray text before the first dated row
                prior["message"] += row
            continue
        ## ------------------------

        ## ------------------------
        ## The prior record is complete
        ## ------------------------
        if prior:
            results.append(prior)
        ## ------------------------

        ## ------------------------
        ## Break the remaining text into user and message
        ## (this only recognizes single-word user names)
        ## ------------------------
        sub_parts = parts[1].split(" ")
        if sub_parts[0].endswith(":"):
            current["user"] = sub_parts[0][:-1]
            current["message"] = " ".join(sub_parts[1:])
        else:
            current["message"] = parts[1]
        ## ------------------------

        prior = current

## ------------------------
## The last record is complete
## ------------------------
if prior:
    results.append(prior)
## ------------------------

print(pandas.DataFrame(results))
That should give you:
   date                 user                message
0  2022-09-06 11:28:00  group_notification  Group creator created group "Example"\n
1  2022-09-06 11:39:00  User1               This is some text\n
2  2022-09-06 11:58:00  group_notification  Group creator changed group name to "Example2"\n
3  2022-09-06 12:13:00  User2               This is\nsome text\nwith multiple lines\n
4  2022-09-06 13:13:00  group_notification  Admin changed group profile photo\n
5  2022-09-06 14:45:00  User3               Hi StackOverflow
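Since the question asked for a regex, here is a rough sketch of a single-pass alternative for comparison. It is not the asker's pattern: `DATE`, `ENTRY`, and `records` are names I made up. It assumes two-digit years and month/day order as in the sample, and it assumes a user name is whatever precedes the first ": " on the header line, so a group notification that itself contains ": " would be misread as a user message. Whether it beats the loop above is something to measure on your own data.

```python
import datetime
import re

data = """
9/6/22, 11:28 - Group creator created group "Example"
9/6/22, 11:39 - User1: This is some text
9/6/22, 11:58 - Group creator changed group name to "Example2"
9/6/22, 12:13 - User2: This is
some text
with multiple lines
9/6/22, 13:13 - Admin changed group profile photo
9/6/22, 14:45 - User3: Hi StackOverflow
""".strip()

# Assumed header shape: "M/D/YY, H:MM" (two-digit year, as in the sample).
DATE = r"\d{1,2}/\d{1,2}/\d{2}, \d{1,2}:\d{2}"

ENTRY = re.compile(
    r"^(?P<date>" + DATE + r") - "   # date header at the start of a line
    r"(?:(?P<user>[^:\n]+): )?"      # optional "User: " prefix on the same line
    r"(?P<message>.*?)"              # message text, may span several lines
    r"(?=\n" + DATE + r" - |\Z)",    # stop before the next header or at the end
    re.MULTILINE | re.DOTALL,
)

records = [
    {
        "date": datetime.datetime.strptime(m["date"], "%m/%d/%y, %H:%M"),
        # The user group is None when the optional prefix did not match.
        "user": m["user"] or "group_notification",
        "message": m["message"],
    }
    for m in ENTRY.finditer(data)
]

for record in records:
    print(record)
```

`records` is a list of dicts with the same keys as `results` in the loop version, so it can be passed to `pandas.DataFrame(...)` in the same way. One difference: the messages here do not keep the trailing newline, because the lookahead stops before it.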