Fastest way to replace words in a pandas Series if the string contains a word from a list

Question

I have a large dataset, all_transcripts, with almost 3 million rows. One of its columns, msgText, contains written messages.

>>> all_transcripts['msgText']

['this is my first message']
['second message is here']
['this is my third message']

In addition, I have a list of more than 200 words called gemeentes.

>>> gemeentes
['first','second','third' ... ]

If a word from this list occurs in msgText, I want to replace it with another word. To do this, I wrote the following function:

def replaceCity(text):
    newText = text.replace(plaatsnaam, 'woonplaats')
    return str(newText)

So my desired output looks like this:

['this is my woonplaats message']
['woonplaats message is here']
['this is my woonplaats message']

Currently, I loop over the list and apply the replaceCity function for each item in it.

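all_transcripts['filtered_text'] = all_transcripts['msgText']
for plaatsnaam in gemeentes:
    # replaceCity reads the module-level name plaatsnaam set by this loop
    all_transcripts['filtered_text'] = all_transcripts['filtered_text'].apply(replaceCity)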

However, this takes a very long time, so it does not seem efficient. Is there a faster way to perform this task?

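This post (Algorithm to find multiple string matches) is similar, but my problem is different because:

That post deals with a single piece of text, whereas I have a dataset with many different rows.
I want to replace the words, not just find them.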

Answer 1

Assuming all_transcripts is a pandas DataFrame:

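# regex=True is needed on pandas 2.0+, where str.replace treats the pattern literally by default
all_transcripts['msgText'].str.replace('|'.join(gemeentes), 'woonplaats', regex=True)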

Example:

import pandas as pd

all_transcripts = pd.DataFrame([['this is my first message'],
                                ['second message is here'],
                                ['this is my third message']],
                               columns=['msgText'])
gemeentes = ['first','second','third']

# Join the words into one alternation pattern and replace them in a single pass
all_transcripts['msgText'].str.replace('|'.join(gemeentes), 'woonplaats', regex=True)

Output:

0    this is my woonplaats message
1       woonplaats message is here
2    this is my woonplaats message
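
A plain '|'.join pattern also matches inside longer words and treats characters such as "." as regex syntax. The sketch below is one possible refinement, not part of the original answer: it assumes whole-word matching is wanted, and the third sample message and the filtered_text column are made up for illustration.

```python
import re
import pandas as pd

all_transcripts = pd.DataFrame(
    {"msgText": ["this is my first message",
                 "second message is here",
                 "firsthand news from the third floor"]}  # hypothetical sample rows
)
gemeentes = ["first", "second", "third"]

# Escape each name so regex metacharacters are matched literally, and put
# longer names first so they win over shorter names they may contain.
escaped = [re.escape(name) for name in sorted(gemeentes, key=len, reverse=True)]

# \b anchors the alternation at word boundaries, so "firsthand" is left alone.
pattern = r"\b(?:" + "|".join(escaped) + r")\b"

all_transcripts["filtered_text"] = all_transcripts["msgText"].str.replace(
    pattern, "woonplaats", regex=True
)
print(all_transcripts["filtered_text"].tolist())
```

Note that \b only behaves as expected when each name starts and ends with a letter, digit, or underscore; names that begin or end with punctuation would need a different boundary check.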
