Suppose I have lists of tuples like this:
a = [('shape', 'rectangle'), ('fill', 'no'), ('size', 'huge')]
b = [('shape', 'rectangle'), ('fill', 'yes'), ('size', 'large')]
I am trying to convert these lists into numeric vectors, with each dimension representing one feature.
So the expected output would be:
amod = [1, 0, 1] # or [1, 1, 1]
bmod = [1, 1, 2] # or [1, 2, 2]
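So the vector to create depends on what has been seen before (i.e. 'rectangle' is still encoded as 1, but the new value 'large' is encoded as the next step, 2).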
I thought I could use some combination of yield and a memoize function to help me. This is what I have tried so far:
def memoize(f):
    memo = {}
    def helper(x):
        if x not in memo:
            memo[x] = f(x)
        return memo[x]
    return helper
@memoize
def verbal_to_value(tup):
    u = 1
    if tup[0] == 'shape':
        yield u
        u += 1
    if tup[0] == 'fill':
        yield u
        u += 1
    if tup[0] == 'size':
        yield u
        u += 1
But I still get this error:
TypeError: 'NoneType' object is not callable
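Is there a way to write this function so that it remembers what it has already seen? Bonus points if it can add keys dynamically, so that I don't have to hard-code things like 'shape' or 'fill'.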
First: this is my preferred implementation of a memoize decorator, mainly because of speed...
def memoize(f):
    class memodict(dict):
        __slots__ = ()
        def __missing__(self, key):
            self[key] = ret = f(key)
            return ret
    return memodict().__getitem__
Apart from a few edge cases, it has the same effect as yours:
def memoize(f):
    memo = {}
    def helper(x):
        if x not in memo:
            memo[x] = f(x)
        #else:
        #    pass
        return memo[x]
    return helper
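but it is faster, because the "if x not in memo:" check happens in native code rather than in Python code. To understand it, you only need to know that under normal circumstances, to evaluate adict[item], Python calls adict.__getitem__(item), and if adict does not contain the key, __getitem__() calls adict.__missing__(item) (for dict subclasses that define __missing__). So we can use the Python magic-method protocol to our advantage.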
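As a small illustration of the lookup path described above, here is a minimal sketch. The class name CountingDict and its call counter are made up for this example; only the __missing__ hook itself is standard dict behavior.

```python
class CountingDict(dict):
    """Hypothetical dict subclass, used only to show when __missing__ runs."""
    calls = 0  # how many times __missing__ has been invoked

    def __missing__(self, key):
        CountingDict.calls += 1
        value = key.upper()   # stand-in for an expensive f(key)
        self[key] = value     # store it so the next lookup is a plain hit
        return value

d = CountingDict()
first = d["shape"]    # key absent: __getitem__ falls back to __missing__
second = d["shape"]   # key present: __missing__ is not called again
probe = d.get("fill") # .get() does not go through __missing__

print(first, second, CountingDict.calls, probe)
```

Note that only the d[key] form triggers the hook; .get() and the in operator leave the dictionary untouched.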
# This is the first idea I had for how I would implement your
# verbal_to_value() using memoization:
from collections import defaultdict
work = defaultdict(set)
@memoize
def verbal_to_value(kv):
    k, v = kv
    aset = work[k]  # work creates a new set, if not already created.
    aset.add(v)     # add value if not already added
    return len(aset)
Including the memoize decorator, that is 15 lines of code...
# test suite:
def vectorize(alist):
    return [verbal_to_value(kv) for kv in alist]
a = [('shape', 'rectangle'), ('fill', 'no'), ('size', 'huge')]
b = [('shape', 'rectangle'), ('fill', 'yes'), ('size', 'large')]
print(vectorize(a))  # shows [1,1,1]
print(vectorize(b))  # shows [1,2,2]
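defaultdict is a powerful object with almost the same logic as memoize: it is a standard dictionary in every way, except that when a lookup fails it runs a callback function to create the missing value. In our case, that callback is set().
Unfortunately, this problem needs access either to the key being looked up or to the state of the dictionary itself. As a result, we cannot just write a simple function for .default_factory, because it is called with no arguments. But we can write a new object based on the memoize/defaultdict pattern.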
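To make that limitation concrete, here is a minimal sketch. The function names factory and key_aware_factory are made up for this example; the behavior shown (default_factory being called without arguments) is standard for collections.defaultdict.

```python
from collections import defaultdict

def factory():
    # Called with no arguments, so it cannot see the missing key.
    return "default"

d = defaultdict(factory)
value = d["size"]  # factory() runs and its result is stored under "size"

def key_aware_factory(key):
    # A factory that wants the key: defaultdict will not pass it.
    return key.upper()

broken = defaultdict(key_aware_factory)
error = None
try:
    broken["size"]
except TypeError as exc:
    # default_factory is invoked as key_aware_factory(), which fails.
    error = exc

print(value, type(error).__name__)
```

This is why a dict subclass with its own __missing__, which does receive the key and can inspect self, fits this problem better than a bare default_factory.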
# This is how I would implement your verbal_to_value without
# memoization, though the worker class is so similar to @memoize
# that it's easy to see why memoize is a good pattern to work from:
class sloter(dict):
    __slots__ = ()
    def __missing__(self, key):
        self[key] = ret = len(self) + 1
        # this + 1 bothers me, why can't these vectors be 0 based? ;)
        return ret
from collections import defaultdict
work2 = defaultdict(sloter)
def verbal_to_value2(kv):
    k, v = kv
    return work2[k][v]
# ~10 lines of code?
# test suite2:
def vectorize2(alist):
    return [verbal_to_value2(kv) for kv in alist]
print(vectorize2(a))  # shows [1,1,1]
print(vectorize2(b))  # shows [1,2,2]
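You may have seen something like sloter before, because it is sometimes used for exactly this situation: converting member names to numbers and back. That gives us the advantage of being able to reverse the process like this: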
def unvectorize2(a_vector, pattern=('shape','fill','size')):
    reverser = [{v:k2 for k2,v in work2[k].items()} for k in pattern]
    for index, vect in enumerate(a_vector):
        yield pattern[index], reverser[index][vect]
print(list(unvectorize2(vectorize2(a))))
print(list(unvectorize2(vectorize2(b))))
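But I saw those yields in your original post, and they got me thinking... what if there were a memoize/defaultdict-like object that could take a generator instead of a function, and knew to advance the generator instead of calling it? Then I realized... yes, generators come with a callable named __next__(), which means we don't need a new defaultdict implementation, just a careful extraction of the right member function...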
def count(start=0):  # same as: from itertools import count
    while True:
        yield start
        start += 1
# so we could get the exact same behavior as above (except faster)
# by saying:
sloter3 = lambda: defaultdict(count(1).__next__)
# and then
work3 = defaultdict(sloter3)
# or just:
work3 = defaultdict(lambda: defaultdict(count(1).__next__))
# which yes, is a bit of a mindwarp if you've never needed to do that
# before.
# the outer defaultdict interprets the first item. Every time a new
# first item is received, the lambda is called, which creates a new
# count() generator (starting from 1), and passes its .__next__ method
# to a new inner defaultdict.
def verbal_to_value3(kv):
    k, v = kv
    return work3[k][v]
# you *could* call that 8 lines of code, but we managed to use
# defaultdict twice, and didn't need to define it, so I wouldn't call
# it 'less complex' or anything.
# test suite3:
def vectorize3(alist):
    return [verbal_to_value3(kv) for kv in alist]
print(vectorize3(a))  # shows [1,1,1]
print(vectorize3(b))  # shows [1,2,2]
# so yes, that can also work.
# and since the internal state in `work3` is stored in the exact same
# format, it can be accessed the same way as `work2` to reconstruct
# input from output.
def unvectorize3(a_vector, pattern=('shape','fill','size')):
    reverser = [{v:k2 for k2,v in work3[k].items()} for k in pattern]
    for index, vect in enumerate(a_vector):
        yield pattern[index], reverser[index][vect]
print(list(unvectorize3(vectorize3(a))))
print(list(unvectorize3(vectorize3(b))))
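One side effect of the nested-defaultdict approach is worth keeping in mind: indexing with an unseen pair registers it. The sketch below illustrates that, assuming you sometimes want to look a pair up without assigning a code. The names codes, encode and peek are made up for this example and are not part of the code above.

```python
from collections import defaultdict
from itertools import count

# Same shape as work3: feature name -> {value -> code}, codes start at 1.
codes = defaultdict(lambda: defaultdict(count(1).__next__))

def encode(kv):
    k, v = kv
    return codes[k][v]  # assigns a new code if (k, v) is unseen

def peek(kv):
    """Hypothetical read-only lookup: returns None for unseen pairs."""
    k, v = kv
    # 'in' and .get() do not trigger default_factory, so nothing is added.
    return codes[k].get(v) if k in codes else None

encode(("size", "huge"))
before = peek(("size", "tiny"))   # not seen yet, and still not registered
after = encode(("size", "tiny"))  # plain indexing assigns the next code
print(before, after, dict(codes["size"]))
```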
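Final comments:
Each of these implementations stores its state in global variables. I find that unaesthetic, but depending on what you intend to do with the vectors later, it might be a feature, as I demonstrated.
Edit: After meditating on this for a day, and on the various situations where I might need it, I think I would encapsulate this functionality like this: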
from collections import defaultdict
from itertools import count
class slotter4:
    def __init__(self):
        # keep track of what order we expect to see keys
        self.pattern = defaultdict(count(1).__next__)
        # keep track of what values we've seen and what number we've assigned to mean them.
        self.work = defaultdict(lambda: defaultdict(count(1).__next__))
    def slot(self, kv, i=False):
        """used to be named verbal_to_value"""
        k, v = kv
        if i and i != self.pattern[k]:  # keep track of order we saw initial keys
            raise ValueError("Input fields out of order")
            # in theory we could ignore this error, and just know
            # that we're going to default to the field order we saw
            # first. Or we could just not keep track, which might be
            # required if our code runs too slow, but then we cannot
            # make pattern optional in .unvectorize()
        return self.work[k][v]
    def vectorize(self, alist):
        return [self.slot(kv, i) for i, kv in enumerate(alist, 1)]
        # if we're not keeping track of field pattern, we could do this instead
        # return [self.work[k][v] for k, v in alist]
    def unvectorize(self, a_vector, pattern=None):
        if pattern is None:
            pattern = [k for k, v in sorted(self.pattern.items(), key=lambda a: a[1])]
        # use this instance's own state (self.work), not the global work3
        reverser = [{v: k2 for k2, v in self.work[k].items()} for k in pattern]
        return [(pattern[index], reverser[index][vect])
                for index, vect in enumerate(a_vector)]
# test suite4:
s = slotter4()
if __name__ == '__main__':
    Av = s.vectorize(a)
    Bv = s.vectorize(b)
    print(Av)  # shows [1,1,1]
    print(Bv)  # shows [1,2,2]
    print(s.unvectorize(Av))  # shows a
    print(s.unvectorize(Bv))  # shows b
else:
    # run the test silently, and only complain if something has broken
    assert s.unvectorize(s.vectorize(a)) == a
    assert s.unvectorize(s.vectorize(b)) == b
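The practical difference from the global versions is that each instance carries its own history. Here is a condensed sketch of that point, using a simplified stand-in class (the name FeatureEncoder is made up, and it deliberately omits the field-order check from slotter4):

```python
from collections import defaultdict
from itertools import count

class FeatureEncoder:
    """Hypothetical, simplified stand-in for slotter4 (no field-order check)."""
    def __init__(self):
        # feature name -> {value -> code}, codes start at 1 per feature
        self.work = defaultdict(lambda: defaultdict(count(1).__next__))

    def vectorize(self, pairs):
        return [self.work[k][v] for k, v in pairs]

a = [('shape', 'rectangle'), ('fill', 'no'), ('size', 'huge')]
b = [('shape', 'rectangle'), ('fill', 'yes'), ('size', 'large')]

first = FeatureEncoder()
second = FeatureEncoder()

first.vectorize(a)
seen_a_first = first.vectorize(b)  # b encoded after a has been seen
fresh = second.vectorize(b)        # b encoded by an encoder with no history

print(seen_a_first, fresh)
```

The same input gets different codes depending on what that particular encoder has already seen, so vectors from different instances should not be mixed.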
Good luck!
Not the best approach, but it may help you find a better solution:
class Shape:
    counter = {}  # class attribute: shared by every Shape instance
    def to_tuple(self, tuples):
        self.tuples = tuples
        self._add()
        l = []
        for i, v in self.tuples:
            l.append(self.counter[i][v])
        return l
    def _add(self):
        for i, v in self.tuples:
            if i in self.counter.keys():
                if v not in self.counter[i]:
                    self.counter[i][v] = max(self.counter[i].values()) + 1
            else:
                self.counter[i] = {v: 0}
a = [('shape', 'rectangle'), ('fill', 'no'), ('size', 'huge')]
b = [('shape', 'rectangle'), ('fill', 'yes'), ('size', 'large')]
s = Shape()
s.to_tuple(a)  # returns [0, 0, 0]
s.to_tuple(b)  # returns [0, 1, 1]
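Because counter is a class attribute, every Shape instance shares the same mapping, and the codes start at 0 rather than 1. If that is not what you want, a possible variation is sketched below. The class name ShapeEncoder and the method name to_vector are made up for this example, and the 1-based numbering is an assumption taken from the output the question asks for.

```python
class ShapeEncoder:
    """Hypothetical variant of Shape with per-instance state and 1-based codes."""
    def __init__(self):
        self.counter = {}  # feature name -> {value -> code}, per instance

    def to_vector(self, tuples):
        vector = []
        for key, value in tuples:
            seen = self.counter.setdefault(key, {})
            # setdefault keeps an existing code, or stores the next one
            # (number of values seen so far for this feature, plus one).
            vector.append(seen.setdefault(value, len(seen) + 1))
        return vector

a = [('shape', 'rectangle'), ('fill', 'no'), ('size', 'huge')]
b = [('shape', 'rectangle'), ('fill', 'yes'), ('size', 'large')]

enc = ShapeEncoder()
vec_a = enc.to_vector(a)
vec_b = enc.to_vector(b)
print(vec_a, vec_b)
```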