I have a list of tokenized text (list_of_words) that looks like this:
list_of_words =
['08/20/2014',
'10:04:27',
'pm',
'complet',
'vendor',
'per',
'mfg/recommend',
'08/20/2014',
'10:04:27',
'pm',
'complet',
...]
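I am trying to remove every instance of a date or time from this list. I tried the .remove() function, to no avail. I also tried adding wildcard entries (such as '../../....') to the stop-word list I filter against, but that did not work. Finally, I tried writing the following code: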
for line in list_of_words:
    if re.search('[0-9]{2}/[09]{2}/[0-9]{4}',line):
        list_of_words.remove(line)
But this does not work either. How can I remove everything formatted as a date or time from the list?
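Two things probably keep the loop in the question from working. The character class [09] matches only the characters '0' and '9', so a day such as '20' never matches. Separately, calling .remove() on a list while iterating over it makes the loop skip the element that follows each removal. The sketch below illustrates both effects; the names broken, fixed, words and safe_words are hypothetical and used only for this illustration.

```python
import re

# The pattern from the question: [09] matches only '0' or '9', not any digit.
broken = re.compile(r'[0-9]{2}/[09]{2}/[0-9]{4}')
# The pattern that was probably intended.
fixed = re.compile(r'[0-9]{2}/[0-9]{2}/[0-9]{4}')

broken_hit = bool(broken.search('08/20/2014'))  # False: '20' is not made of 0s and 9s
fixed_hit = bool(fixed.search('08/20/2014'))    # True

# Removing while iterating skips the item that slides into the freed slot.
words = ['01/01/2014', '02/02/2014', 'pm']
for word in words:
    if fixed.search(word):
        words.remove(word)
# '02/02/2014' survives because the loop never visits it.

# Building a new list avoids the problem.
safe_words = ['01/01/2014', '02/02/2014', 'pm']
safe_words = [w for w in safe_words if not fixed.search(w)]

print(broken_hit, fixed_hit, words, safe_words)
```

Note that this pattern only covers dates; times and am/pm markers need the broader patterns discussed in the answers.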
^(?:(?:[0-9]{2}[:\/,]){2}[0-9]{2,4}|am|pm)$
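This regular expression will do the following:
- find strings that look like dates, such as 12/23/2016, and times, such as 12:34:56
- find the strings am or pm, which were probably part of the preceding time in the source list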
Sample list
08/20/2014
10:04:27
pm
complete
vendor
per
mfg/recommend
08/20/2014
10:04:27
pm
complete
List after processing
complete
vendor
per
mfg/recommend
complete
Example Python script
import re
SourceList = ['08/20/2014',
'10:04:27',
'pm',
'complete',
'vendor',
'per',
'mfg/recommend',
'08/20/2014',
'10:04:27',
'pm',
'complete']
# Keep only the words that are not a date, a time, or an am/pm marker.
OutputList = filter(
    lambda ThisWord: not re.match(r'^(?:(?:[0-9]{2}[:\/,]){2}[0-9]{2,4}|am|pm)$', ThisWord),
    SourceList)
for ThisValue in OutputList:
    print(ThisValue)
NODE                     EXPLANATION
----------------------------------------------------------------------
  ^                        the beginning of the string
----------------------------------------------------------------------
  (?:                      group, but do not capture:
----------------------------------------------------------------------
    (?:                      group, but do not capture (2 times):
----------------------------------------------------------------------
      [0-9]{2}                 any character of: '0' to '9' (2 times)
----------------------------------------------------------------------
      [:\/,]                   any character of: ':', '\/', ','
----------------------------------------------------------------------
    ){2}                     end of grouping
----------------------------------------------------------------------
    [0-9]{2,4}               any character of: '0' to '9' (between 2
                             and 4 times (matching the most amount
                             possible))
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    am                       'am'
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    pm                       'pm'
----------------------------------------------------------------------
  )                        end of grouping
----------------------------------------------------------------------
  $                        before an optional \n, and the end of the
                           string
----------------------------------------------------------------------
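The same pattern can also be written with re.VERBOSE so that each piece carries its explanation inline. This is only a sketch of that idea; the names DATE_TIME_TOKEN, tokens and results are hypothetical, and the token '8/20/2014' is an added test case that is not in the original list.

```python
import re

# re.VERBOSE ignores whitespace in the pattern and allows # comments.
DATE_TIME_TOKEN = re.compile(r"""
    ^                          # beginning of the string
    (?:
        (?:[0-9]{2}[:/,]){2}   # two digits and a separator, twice
        [0-9]{2,4}             # then two to four digits
      | am                     # or the literal token 'am'
      | pm                     # or the literal token 'pm'
    )
    $                          # end of the string
""", re.VERBOSE)

tokens = ['08/20/2014', '10:04:27', 'pm', 'complete', 'mfg/recommend', '8/20/2014']
results = {token: bool(DATE_TIME_TOKEN.match(token)) for token in tokens}
for token, is_match in results.items():
    print(token, is_match)
```

Because every numeric field requires at least two digits, a date written with a single-digit month such as '8/20/2014' is not matched and would stay in the list.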
If you want to match the time and date strings in the list, you could try the regular expression below:
[0-9]{2}[\/,:][0-9]{2}[\/,:][0-9]{2,4}
Adding the Python code:
import re
list_of_words = [
'08/20/2014',
'10:04:27',
'pm',
'complet',
'vendor',
'per',
'mfg/recommend',
'08/20/2014',
'10:04:27',
'pm',
'complet'
]
new_list = [item for item in list_of_words if not re.search(r'[0-9]{2}[\/,:][0-9]{2}[\/,:][0-9]{2,4}', item)]
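This pattern covers dates and times only, so the 'pm' tokens stay in new_list. If those should go as well, one possible follow-up is a second filter against a small set of markers. The sketch below assumes that; DATE_OR_TIME, MERIDIEM, without_dates and without_meridiem are hypothetical names.

```python
import re

DATE_OR_TIME = re.compile(r'[0-9]{2}[/,:][0-9]{2}[/,:][0-9]{2,4}')
MERIDIEM = {'am', 'pm'}  # assumption: these tokens only ever accompany a time

list_of_words = ['08/20/2014', '10:04:27', 'pm', 'complet', 'vendor', 'per',
                 'mfg/recommend', '08/20/2014', '10:04:27', 'pm', 'complet']

# Step 1: drop anything containing a date-like or time-like substring.
without_dates = [w for w in list_of_words if not DATE_OR_TIME.search(w)]

# Step 2: drop the leftover am/pm markers.
without_meridiem = [w for w in without_dates if w not in MERIDIEM]

print(without_dates)
print(without_meridiem)
```

Dropping every 'am' may be too aggressive if the text contains the ordinary English word "am", so whether step 2 is appropriate depends on the data.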
Try this:
import re
list_of_words = ['08/20/2014',
'10:04:27',
'pm',
'complet',
'vendor',
'per',
'mfg/recommend',
'08/20/2014',
'10:04:27',
'pm', 'complet']
# filter() returns a lazy iterator in Python 3, so wrap it in list().
list_of_words = list(filter(
    lambda x: not re.match(r'[0-9]{2}[\/,:][0-9]{2}[\/,:][0-9]{2,4}', x),
    list_of_words))
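The answers in this thread use re.match and re.search with the same unanchored pattern, and the two behave differently on tokens that have extra text around the date. A small sketch of the difference, with re.fullmatch added for comparison; PATTERN, samples and outcome are hypothetical names, and the sample tokens are made up for illustration.

```python
import re

PATTERN = re.compile(r'[0-9]{2}[/,:][0-9]{2}[/,:][0-9]{2,4}')

samples = ['08/20/2014', 'due08/20/2014', '08/20/2014x']

outcome = {
    s: {
        'match': bool(PATTERN.match(s)),          # must start at position 0
        'search': bool(PATTERN.search(s)),        # may start anywhere
        'fullmatch': bool(PATTERN.fullmatch(s)),  # must cover the whole string
    }
    for s in samples
}

for s, flags in outcome.items():
    print(s, flags)
```

For a list of clean tokens like the one in the question all three give the same result; the choice only matters if a token can carry a prefix or suffix.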