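I have a predefined character -> type dictionary. For example, 'a' is a lower_case letter, 1 is a digit, ')' is a punctuation mark, and so on. With the following script, I label every character in a given string: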
labels = ''
for ch in list(example):
    try:
        l = character_type_dict[ch]
        print(l)
        labels = labels + l
    except KeyError:
        labels = labels + 'o'
        print('o')
labels
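For example, given "1,234.45kg (in metric system)" as input, the code produces dpdddpddwllwpllwllllllwllllllp as output.
Now I want to split the string by group. The output should look like this:
['1',',','234','.','45','kg',' ','(','in',' ','metric',' ','system',')']
That is, the string should be split at character-type boundaries. Any ideas on how to do this efficiently?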
You can compute the labels more concisely (and probably faster):
labels = ''.join(character_type_dict.get(ch, 'o') for ch in example)
Or, using a helper function:
character_type = lambda ch: character_type_dict.get(ch, 'o')
labels = ''.join(map(character_type, example))
But you don't need the labels to split the string; with the help of itertools.groupby, you can split it directly:
import itertools
splits = list(''.join(g)
              for _, g in itertools.groupby(example, key=character_type))
A possibly more interesting result is a list of tuples pairing each group with its type:
>>> list((''.join(g), code)
...      for code, g in itertools.groupby(example, key=character_type))
[('1', 'd'), (',', 'p'), ('234', 'd'), ('.', 'p'), ('45', 'd'), ('kg', 'l'),
 (' ', 'w'), ('(', 'p'), ('in', 'l'), (' ', 'w'), ('metric', 'l'), (' ', 'w'),
 ('system', 'l'), (')', 'p')]
I computed character_type_dict as follows:
import string
character_type_dict = {}
for code, chars in (('w', string.whitespace),
                    ('l', string.ascii_letters),
                    ('d', string.digits),
                    ('p', string.punctuation)):
    for char in chars:
        character_type_dict[char] = code
But I could also have done it this way (which I thought of later):
import string
from collections import ChainMap
# One small dict per character class, merged into a single dict via ChainMap.
character_type_dict = dict(ChainMap(*({c: t for c in getattr(string, n)}
                                      for t, n in (('w', 'whitespace'),
                                                   ('d', 'digits'),
                                                   ('l', 'ascii_letters'),
                                                   ('p', 'punctuation')))))
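Putting the pieces of this answer together, a minimal self-contained sketch might look like the following. The function name split_by_type is only an illustrative choice and does not appear in the answer above; the mapping and the 'o' fallback follow the snippets shown earlier.

```python
import itertools
import string

# Build the character -> type-code mapping from the string module constants.
character_type_dict = {}
for code, chars in (('w', string.whitespace),
                    ('l', string.ascii_letters),
                    ('d', string.digits),
                    ('p', string.punctuation)):
    for char in chars:
        character_type_dict[char] = code


def character_type(ch):
    # Characters missing from the mapping fall back to 'o' (other).
    return character_type_dict.get(ch, 'o')


def split_by_type(text):
    # Hypothetical helper: groupby starts a new group whenever the key changes.
    return [''.join(g) for _, g in itertools.groupby(text, key=character_type)]


example = "1,234.45kg (in metric system)"
print(split_by_type(example))
```

Because groupby only compares consecutive keys, the string is traversed once and no intermediate labels string is needed.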
labels is wrong (in your example it is 'dpdddpddwllwpllwllllllwllllllp', but I believe it should be 'dpdddpddllwpllwllllllwllllllp').
Either way, you can (ab)use itertools.groupby:
from itertools import groupby
example = "1,234.45kg (in metric system)"
labels = 'dpdddpddllwpllwllllllwllllllp'
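# example.index(ch) returns the first position of ch; this works because a
# character's label depends only on the character, but it rescans the string.
output = [''.join(group)
          for _, group in groupby(example, key=lambda ch: labels[example.index(ch)])]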
print(output)
# ['1', ',', '234', '.', '45', 'kg', ' ', '(', 'in', ' ', 'metric', ' ', 'system', ')']
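Looking up each label through example.index(ch) searches the string again for every character. One possible variation, sketched here under the assumption that labels has exactly one label per character of example, pairs the two strings by position with zip and groups on the label:

```python
from itertools import groupby

example = "1,234.45kg (in metric system)"
labels = 'dpdddpddllwpllwllllllwllllllp'

# Pair each character with the label at the same position, then group the
# pairs on the label (the second element of each pair).
output = [''.join(ch for ch, _ in group)
          for _, group in groupby(zip(example, labels), key=lambda pair: pair[1])]
print(output)
```

This keeps the groupby idea unchanged and only replaces the index lookup with positional pairing.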
Remember the class of the previous character:
import string
character_type = {c: "l" for c in string.ascii_letters}
character_type.update({c: "d" for c in string.digits})
character_type.update({c: "p" for c in string.punctuation})
character_type.update({c: "w" for c in string.whitespace})
example = "1,234.45kg (in metric system)"
x = []
prev = None
for ch in example:
    try:
        l = character_type[ch]
        if l == prev:
            # Same class as the previous character: extend the current group.
            x[-1].append(ch)
        else:
            # Class changed: start a new group.
            x.append([ch])
    except KeyError:
        # Character has no known class: print it and leave it out of the result.
        print(ch)
    else:
        prev = l
x = map(''.join, x)
print(list(x))
# ['1', ',', '234', '.', '45', 'kg', ' ', '(', 'in', ' ', 'metric', ' ', 'system', ')']
Another algorithmic approach. Instead of try: except:, it is better to use the dictionary's get(key, default_value) method.
import string
character_type_dict = {}
for ch in string.ascii_lowercase:
    character_type_dict[ch] = 'l'
for ch in string.digits:
    character_type_dict[ch] = 'd'
for ch in string.punctuation:
    character_type_dict[ch] = 'p'
for ch in string.whitespace:
    character_type_dict[ch] = 'w'
example = "1,234.45kg (in metric system)"
split_list = []
split_start = 0
for i in range(len(example) - 1):
    if character_type_dict.get(example[i], 'o') != character_type_dict.get(example[i + 1], 'o'):
        split_list.append(example[split_start: i + 1])
        split_start = i + 1
split_list.append(example[split_start:])
print(split_list)
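The same neighbour comparison can be wrapped in a reusable function. The sketch below is only an illustration: the names split_on_type_change and type_of are hypothetical, and returning an empty list for an empty string is my own assumption about the desired behaviour.

```python
import string


def split_on_type_change(text, type_of, default='o'):
    """Split text wherever two neighbouring characters have different types."""
    if not text:
        return []  # assumption: an empty input yields no pieces
    pieces = []
    start = 0
    for i in range(len(text) - 1):
        # get() supplies the default type for characters missing from the dict.
        if type_of.get(text[i], default) != type_of.get(text[i + 1], default):
            pieces.append(text[start:i + 1])
            start = i + 1
    pieces.append(text[start:])
    return pieces


# Same mapping as in the answer above (lowercase letters only).
type_of = {}
for chars, code in ((string.ascii_lowercase, 'l'), (string.digits, 'd'),
                    (string.punctuation, 'p'), (string.whitespace, 'w')):
    for ch in chars:
        type_of[ch] = code

print(split_on_type_change("1,234.45kg (in metric system)", type_of))
```

Characters that are not in the mapping are kept in the output and grouped together under the default type, rather than being dropped.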
Treating this as an algorithmic puzzle:
import string
# dummy mapping
character_type_dict = {**{c: "l" for c in string.ascii_letters},
                       **{c: "d" for c in string.digits},
                       **{c: "p" for c in string.punctuation},
                       **{c: "w" for c in string.whitespace}}
example = "1,234.45kg (in metric system)"
# last holds the type of the previous character, not the character itself
last = character_type_dict.get(example[0], 'o')
temp = example[0]
res = []
for ch in example[1:]:
    try:
        cur = character_type_dict[ch]
        if cur != last:
            # type changed: close the current group and start a new one
            res.append(temp)
            temp = ''
        temp += ch
        last = cur
    except KeyError:
        # unknown character: it is skipped and the type is reset to 'o'
        last = 'o'
res.append(temp)
The result is:
['1', ',', '234', '.', '45', 'kg', ' ', '(', 'in', ' ', 'metric', ' ', 'system', ')']