Are there built-in pandas cell-level functions that are index/column aware?


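I am cleaning historical data for exponential smoothing forecasts. I have data at the US county level (i.e., second-level administrative divisions), but it contains many zeros (because of low volumes), which causes problems for the forecasting model.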

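Because the data is highly seasonal, I check each county's data for each year. If a given year contains a zero, I replace that county's data for the year with an adjusted dataset that applies the state-level seasonality to the county's annual volume.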

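After a lot of trial and error trying to avoid iteration, which led me down the path of nested apply functions with reset indices (e.g., df.apply(lambda x: x.reset_index().apply(lambda y: [calculation]))), I ended up writing the data cleaning as a loop that computes the seasonality, then multiplying the seasonality data by a dataframe that stores the annual volume in each monthly column: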

# Initialize empty seasonality df with the index and column values from the source data
cty_season = pd.DataFrame(index=cty_data.index, columns=cty_data.columns)

# Iterate through the index and columns to populate each value
for idx in cty_season.index:
   for col in cty_season.columns:
      cty_season.loc[idx,col] = [calculation referring to helper dfs with identical indices and columns]

# Combine seasonality data with sales totals to get revised dataset
cty_adj = cty_season * cty_annual

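Is there a way to do this more efficiently, or in a more "pandic" way (or whatever the pandas equivalent of pythonic is)? The only thing that comes to mind is splitting the columns so that each year is a separate row, which might allow a simpler apply statement, since the replacement is done year by year.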
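For illustration, the reshaping idea mentioned above could look roughly like the sketch below. It is not part of the original question: it uses the example data from the edit further down, and the names by_year, share and has_zero are hypothetical. The assumption is that every column label has the form 'Y<n>Q<n>', so the first two characters identify the year.

```python
import pandas as pd

data = [[73,  0,  0, 22,  0, 34,  5, 46],
        [51, 12, 77,  0, 19,  3,  0, 34],
        [73, 44,  1, 72,  0, 56, 21,  3],
        [ 3, 74,  2, 24,  4, 60,  8, 39],
        [70,  0, 36, 50,  3,  1, 59,  1],
        [14, 37, 26, 27, 87, 58, 95,  2],
        [ 4,  1, 17, 34, 25,  1,  1,  2],
        [ 0,  0,  0,  4, 18,  1,  8,  0],
        [42, 27, 41, 15, 67,  2, 25,  6]]

df = pd.DataFrame(
    data,
    index=pd.MultiIndex.from_product(
        [['County 1', 'County 2', 'County 3'], ['A', 'B', 'C']],
        names=['County', 'Product']),
    columns=pd.Index(['Y1Q1', 'Y1Q2', 'Y1Q3', 'Y1Q4',
                      'Y2Q1', 'Y2Q2', 'Y2Q3', 'Y2Q4'], name='Quarter'))

# Split each label such as 'Y1Q3' into a (Year, Q) pair (assumed label format)
by_year = df.copy()
by_year.columns = pd.MultiIndex.from_arrays(
    [df.columns.str[:2], df.columns.str[2:]], names=['Year', 'Q'])

# Move the Year level into the row index: one row per county/product/year
by_year = by_year.stack('Year')

# Row-wise operations are now simple, with no nested apply
share = by_year.div(by_year.sum(axis=1), axis=0)  # share of the year per quarter
has_zero = by_year.eq(0).any(axis=1)              # years that need replacing
```

With one row per county, product and year, the "does this year contain a zero" test and the seasonality shares become plain row-wise operations.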


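Edit: Here is an example of the data cleaning process. As I suggested above, for this particular use case the answer may be to split each year into a separate row. However, I have run into this situation in other use cases that may not have the same solution. One difference from my usual workflow is that I normally pivot sales records to create the dataframe, so the NaN values are already in the df and I don't have to replace 0 with NaN as in this example.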

import pandas as pd
import numpy as np

data = [[73,  0,  0, 22,  0, 34,  5, 46],
        [51, 12, 77,  0, 19,  3,  0, 34],
        [73, 44,  1, 72,  0, 56, 21,  3],
        [ 3, 74,  2, 24,  4, 60,  8, 39],
        [70,  0, 36, 50,  3,  1, 59,  1],
        [14, 37, 26, 27, 87, 58, 95,  2],
        [ 4,  1, 17, 34, 25,  1,  1,  2],
        [ 0,  0,  0,  4, 18,  1,  8,  0],
        [42, 27, 41, 15, 67,  2, 25,  6]]

df = pd.DataFrame(data,
                  index=pd.MultiIndex.from_product([['County 1','County 2','County 3'],['A','B','C']],names=['County','Product']),
                  columns=pd.Series(['Y1Q1','Y1Q2','Y1Q3','Y1Q4','Y2Q1','Y2Q2','Y2Q3','Y2Q4'],name='Quarter'))

# Roll up totals by product
tot_df = df.groupby('Product').sum()

# Find out how many non-zero data points there should be per year
# (this is done to allow for YTD analysis instead of assuming each year should have 4 quarterly points or 12 monthly points)
# There is an assumption that the data doesn't have any zeroes at the total level
tot_values = tot_df.apply(lambda x: x.groupby(x.index.str[:2]).count(),axis=1)

# Calculate seasonality/share of year for each product, each year
tot_season = tot_df.apply(lambda x: x.reset_index().apply(lambda y: y[x.name]/x[x.index.str[:2]==y.Quarter[:2]].sum(),axis=1),axis=1)
tot_season.columns = tot_df.columns

# Look for zeroes to determine if the data for a particular product and year can be used
cty_valid = df.replace({0:np.nan}).apply(lambda x: x.groupby(x.index.str[:2]).count().eq(tot_values.loc[x.name[-1]]),axis=1)

# Total up annual numbers by county/product.
# These numbers are repeated at the quarterly level so that the annual data
# can be directly multiplied with the county seasonality to be generated
cty_annual = df.apply(lambda x: x.groupby(x.index.str[:2]).sum(),axis=1)
cty_annual.columns = [x + 'Q1' for x in cty_annual.columns]
cty_annual = cty_annual.reindex(columns=df.columns).ffill(axis=1)

# Create a dataframe with the needed index and columns
cty_season = pd.DataFrame(index=df.index,columns=df.columns)

# Iterate through each county/product and period combination to populate the dataframe
for idx in cty_season.index:
    for col in cty_season.columns:
        # Use the actual seasonality (period sales / annual sales) if the year has non-zero values for that product/county.
        # If not, use the seasonality calculated at the total level for that product
        cty_season.loc[idx,col] = df.loc[idx,col]/cty_annual.loc[idx,col] if cty_valid.loc[idx,col[:2]] else tot_season.loc[idx[-1],col]

# Multiply the seasonality df with the annual sales df to get an adjusted sales history.
cty_adj = cty_season * cty_annual

Tags: python pandas dataframe data-cleaning forecasting

1 Answer

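The calculation is a bit easier if we first rearrange the data to be long instead of wide. Next, instead of creating a separate df at each step, we create a new column for each new element.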

# wide to long
df2 = df.stack().rename('values').reset_index()
# recreate tot_df in a new column
df2['prod_quart_tot'] = df2.groupby(['Product', 'Quarter'])['values'].transform('sum')
# create year column
df2['year'] = df2['Quarter'].str[:2]
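# recreate tot_values (number of non-zero quarters per year);
# sum() counts the True values, whereas count() would count every row
df2['tot_values'] = df2.groupby(['year', 'Product', 'County'])['prod_quart_tot'].transform(lambda x: x.gt(0).sum())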
# in between step needed for tot_seas
df2['prod_year_tot'] = df2.groupby(['Product', 'year'])['values'].transform('sum')
# recreate tot_seas
df2['tot_seas'] = df2['prod_quart_tot']/df2['prod_year_tot']
# recreate cty_valid
df2['cty_valid'] = df2.groupby(['County', 'Product', 'year'])['values'].transform(lambda x: 0 not in x.values)
# recreate cty_annual
df2['cty_annual'] = df2.groupby(['County', 'Product', 'year'])['values'].transform('sum')
# recreate cty_season
df2['cty_season'] = np.where(df2['cty_valid'], df2['values'].div(df2['cty_annual']), df2['tot_seas'])
# recreate cty_adj
df2['cty_adj'] = df2['cty_season'].mul(df2['cty_annual'])

# final values in original format
df_out = df2.set_index(['County', 'Product', 'Quarter'])['cty_adj'].unstack()

#check if df_out matches cty_adj
print(df_out.eq(cty_adj).all().all())

True
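
If the wide layout has to be kept, the double loop from the question can probably also be replaced by whole-frame operations. The sketch below is one possible way, not the answerer's method: it groups the transposed frame by a year key and picks between the two seasonality sources with DataFrame.where. The helper per_year_total and the names years and tot_season_b are hypothetical, and the year key assumes column labels of the form 'Y<n>Q<n>'.

```python
import numpy as np
import pandas as pd

data = [[73,  0,  0, 22,  0, 34,  5, 46],
        [51, 12, 77,  0, 19,  3,  0, 34],
        [73, 44,  1, 72,  0, 56, 21,  3],
        [ 3, 74,  2, 24,  4, 60,  8, 39],
        [70,  0, 36, 50,  3,  1, 59,  1],
        [14, 37, 26, 27, 87, 58, 95,  2],
        [ 4,  1, 17, 34, 25,  1,  1,  2],
        [ 0,  0,  0,  4, 18,  1,  8,  0],
        [42, 27, 41, 15, 67,  2, 25,  6]]

df = pd.DataFrame(
    data,
    index=pd.MultiIndex.from_product(
        [['County 1', 'County 2', 'County 3'], ['A', 'B', 'C']],
        names=['County', 'Product']),
    columns=pd.Index(['Y1Q1', 'Y1Q2', 'Y1Q3', 'Y1Q4',
                      'Y2Q1', 'Y2Q2', 'Y2Q3', 'Y2Q4'], name='Quarter'))

# Year key for every column (assumes labels like 'Y1Q3')
years = df.columns.str[:2].to_numpy()

def per_year_total(frame):
    """Hypothetical helper: repeat each row's annual total in every quarter of that year."""
    return frame.T.groupby(years).transform('sum').T

# Product-level seasonality: share of the year for each quarter
tot_df = df.groupby(level='Product').sum()
tot_season = tot_df / per_year_total(tot_df)

# County-level annual totals, repeated at the quarterly level
cty_annual = per_year_total(df)

# A county/product/year is usable only if it contains no zero quarter
cty_valid = per_year_total(df.eq(0).astype(int)).eq(0)

# Repeat the product-level seasonality for every county so the shapes match df
products = df.index.get_level_values('Product')
tot_season_b = tot_season.loc[products].set_axis(df.index, axis=0)

# Keep the county's own seasonality where valid, otherwise fall back to the product level
cty_season = (df / cty_annual).where(cty_valid, tot_season_b)
cty_adj = cty_season * cty_annual
```

Because DataFrame.where aligns on both index and columns, the index/column awareness asked about in the title comes from label alignment between the frames rather than from a cell-level function.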