I have this DataFrame:
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'Client': np.random.choice(['Customer_A', 'Customer_B'], 1000),
    'Product': np.random.choice(['Guns', 'Ammo', 'Armour'], 1000),
    'Value': np.random.randn(1000),
})
Categoricals = ['Client', 'Product']
df[Categoricals] = df[Categoricals].astype('category')
df = df.drop_duplicates()
df
I want this result:
# Non-anonymous function for Anomaly limit
def Anomaly(x):
    Q3 = np.nanpercentile(x, q=75)
    Q1 = np.nanpercentile(x, q=25)
    IQR = Q3 - Q1
    return Q3 + (IQR * 2.0)

# Non-anonymous function for CriticalAnomaly limit
def CriticalAnomaly(x):
    Q3 = np.nanpercentile(x, q=75)
    Q1 = np.nanpercentile(x, q=25)
    IQR = Q3 - Q1
    return Q3 + (IQR * 3.0)
# Define metrics
Metrics = {'Value':['count', Anomaly, CriticalAnomaly]}
# Groupby has more than 1 grouping column, so agg can only accept non-anonymous functions
Limits = df.groupby(['Client', 'Product']).agg(Metrics)
Limits
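But this is slow on large datasets, because the two functions "Anomaly" and "CriticalAnomaly" each recompute Q1, Q3 and the IQR, so the quartiles are calculated twice rather than once. Combining the two functions would be much faster, but then the result is output into 1 column instead of 2.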
# Combined anomaly functions
def CombinedAnom(x):
    Q3 = np.nanpercentile(x, q=75)
    Q1 = np.nanpercentile(x, q=25)
    IQR = Q3 - Q1
    Anomaly = Q3 + (IQR * 2.0)
    CriticalAnomaly = Q3 + (IQR * 3.0)
    return (Anomaly, CriticalAnomaly)
# Define metrics
Metrics = {'Value':['count', CombinedAnom]}
# Groupby has more than 1 grouping column, so agg can only accept non-anonymous functions
Limits = df.groupby(['Client', 'Product']).agg(Metrics)
Limits
How can I get the results of the combined function into 2 columns?
If you use apply instead of agg, you can return a Series, which gets unpacked into columns.
def f(g):
    return pd.Series({
        'c1': np.sum(g.b),
        'c2': np.prod(g.b)
    })
df = pd.DataFrame({'a': list('aabbcc'), 'b': [1,2,3,4,5,6]})
df.groupby('a').apply(f)
This goes from:

   a  b
0  a  1
1  a  2
2  b  3
3  b  4
4  c  5
5  c  6

to:

   c1  c2
a
a   3   2
b   7  12
c  11  30
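
Applied to the data in the question, the same idea might look like the sketch below. It is only an illustration under a few assumptions: the function name combined_limits and the seeded synthetic data are made up here, and selecting [['Value']] before apply is a choice made so the function only sees the value column. The quartiles are computed once per group and both limits come back in a single Series.

```python
import numpy as np
import pandas as pd

# Synthetic data shaped like the question's frame (seeded so runs are repeatable).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'Client': rng.choice(['Customer_A', 'Customer_B'], 1000),
    'Product': rng.choice(['Guns', 'Ammo', 'Armour'], 1000),
    'Value': rng.standard_normal(1000),
})

def combined_limits(g):
    """Hypothetical helper: compute Q1/Q3 once and return every metric as a Series."""
    values = g['Value']
    q1, q3 = np.nanpercentile(values, [25, 75])
    iqr = q3 - q1
    return pd.Series({
        'count': values.count(),
        'Anomaly': q3 + iqr * 2.0,
        'CriticalAnomaly': q3 + iqr * 3.0,
    })

# Each key of the returned Series becomes a column in the result.
limits = df.groupby(['Client', 'Product'])[['Value']].apply(combined_limits)
print(limits)
```

Here limits should have one row per (Client, Product) pair and the columns count, Anomaly and CriticalAnomaly. Because all three values travel in one Series, count typically comes back as a float; limits['count'].astype(int) converts it back if that matters.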