I have a pandas DataFrame like this:
import pandas as pd

d={'gen':['A','A','A','A','B','B','B','B','C','D','D','D','D','D','D','D','D','D','D'], 'diff':pd.Series([1,1,1,1,2,1,1,1,1,1,1,1,1,2,2,1,1,1], index=[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17])}
wk = pd.DataFrame(data=d, index=[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18])
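My goal is to count how many times each `gen` occurs according to some criteria: count `diff` at index i when `diff` at index i equals 1 and `gen` at index i+1 equals `gen` at index i. If there are consecutive 1s, count them as follows: if (number of consecutive 1s) % 2 == 0, then count = (number of consecutive 1s) / 2; otherwise count = (number of consecutive 1s - 1) / 2.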
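As a small sketch of that rule in plain Python (the helper name `count_for_run` is made up for illustration and is not part of my actual code), both branches give the same value as integer floor division of the run length by 2:

```python
def count_for_run(n):
    """Count contributed by a run of n consecutive 1s, following the rule above.

    Integer division (//) is used here so the result stays an int;
    the rule as written uses plain division.
    """
    if n % 2 == 0:
        return n // 2
    return (n - 1) // 2


# Both branches agree with floor division by 2 for every run length tried.
for n in range(1, 7):
    assert count_for_run(n) == n // 2
    print(n, count_for_run(n))
```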
k=0
j=0
z={}
for i in range(wk.shape[0]):
    if wk['diff'][i] == 1:
        if wk['gen'][i] == wk['gen'][i+1]:
            if j == 0:
                j+=2
            if j%2==0:
                k+=1
            if j>=2:
                j+=1
            z[wk['gen'][i]] = k
        if wk['gen'][i] != wk['gen'][i+1]:
            j=0
            k=0
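The result is a dictionary `z`:
{'A': 2, 'B': 1, 'D': 4}
But when I use larger data (more than 410,000 records), the counter does not always start from 0 when `gen` at index i+1 is not equal to `gen` at index i. Is there something wrong with my code?

Compute a `groupby.count` per group of consecutive 1s, apply `floordiv` by 2 (equivalent to your `x/2 if x%2==0 else (x-1)/2`), and aggregate again with `groupby.sum` before converting with `to_dict`: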
group = wk['diff'].ne(wk.groupby('gen')['diff'].shift()).cumsum()
m = wk['diff'].eq(1)
out = (wk[m].groupby(['gen', group])     # keep only 1s and group
       ['diff'].count().floordiv(2)      # count and floor division
       .groupby(level='gen').sum()       # sum per "gen" group
       .loc[lambda x: x>0].to_dict()     # only counts > 0 and convert to dict
      )
Output:
{'A': 2, 'B': 1, 'D': 3}
Intermediates `group` and `m`:
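As a minimal sketch for reproducing these intermediates yourself (the data is rebuilt from the question, and the column labels `group` and `m` passed to `assign` are just illustrative names), the two helper Series can be printed next to the original columns:

```python
import pandas as pd

# Rebuild the example frame from the question: 19 'gen' values but only
# 18 'diff' values, so the last row of 'diff' is NaN.
gen = list('AAAABBBBC') + ['D'] * 10
diff = [1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1]
wk = pd.DataFrame({'gen': gen, 'diff': pd.Series(diff)}, index=range(19))

# A new run id starts whenever 'diff' differs from the previous row
# of the same 'gen' (the first row of each 'gen' always starts a run).
group = wk['diff'].ne(wk.groupby('gen')['diff'].shift()).cumsum()

# Boolean mask of the rows where 'diff' is exactly 1.
m = wk['diff'].eq(1)

# Show both intermediates side by side with the original data.
print(wk.assign(group=group, m=m))
```

Rows that share both a `gen` and a `group` value form one run; the mask `m` then keeps only the runs made of 1s before counting.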