Pandas apply is turning a matrix into nan/None


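I am running the following code on a dataset, trying to count the rows that match each of a set of different conditions. I store the counts in a matrix using apply, inside which I call a helper function whose only job is to change a single value of the matrix. For reasons that are not at all obvious to me, this matrix column sometimes becomes a float holding nan, or, on other iterations of the code, NoneType, which causes the error "type ___ not indexable". (In this case, the first 91 runs of the loop succeed and the 92nd run raises the error.)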

I am copying the entire relevant section of code because I cannot figure out where this error comes from.

The function causing the error:

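import datetime

import numpy as np
import pandas as pd

# firms and brokers are DataFrames defined elsewhere (not shown here)

#######################################
# takes in year and month as two-digit strings, ex: ("12", "01")
# loads that month's data and creates a daily time series holding a matrix of shared rides
#######################################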
def processMonth(year, month):
    print("Processing " + year + "-" + month)
    
    #read in correct month data
    currentMonth = pd.read_parquet("blocked/20" + year + "/data_wBlocks_20" + year + "-" + month + ".parquet")

    #Cleaning Data:
    #remove whitespace from columns
    currentMonth.columns = currentMonth.columns.str.replace(' ', '')
    #drop rows with no match in either side
    currentMonth.dropna(inplace=True)
    #convert datetime to date
    currentMonth["pickup_datetime"] = currentMonth["pickup_datetime"].dt.date
    
    #create new df as a time series with one row per calendar day of the month, for daily ride vols
    MonthTS = pd.DataFrame({'date' : pd.Series(pd.date_range(datetime.date(2000 + int(year), int(month), 1), end=datetime.date(2000 + int(year), int(month), 1) + pd.DateOffset(months=1) - pd.DateOffset(days=1), freq='D'))})
    #for each day of the month, create empty matrix to initialize with ride volumes
    MonthTS["matrix"] = [np.zeros((len(firms.index), len(brokers.index))) for x in range(len(MonthTS))]
    MonthTS['date'] = MonthTS['date'].dt.date
    MonthTS = MonthTS.set_index(['date'])

    #iterate over combinations of firm and brokerage, and write each count into the matrix
    for idx1, firm in firms.iterrows():
        #MonthTS["matrix"] = MonthTS["matrix"].apply(lambda x : x.append([]))
        for idx2, broker in brokers.iterrows():
            #tallying taxi rides on the given day in either direction between firms and brokers
            tmp = currentMonth[(currentMonth["pu_block"] == str(firm.bctcb2010)) & (currentMonth["do_block"] == (broker.bctcb2010)) | ((currentMonth["do_block"] == str(firm.bctcb2010)) & (currentMonth["pu_block"] == str(broker.bctcb2010)))]
            tmp = tmp.groupby("pickup_datetime").count()
            MonthTS = pd.concat([MonthTS, tmp], axis = 1)
            #print(MonthTS.head(5))
            MonthTS["matrix"] = MonthTS.apply(lambda x: replaceVal(x, idx1, idx2), axis=1)
            MonthTS = MonthTS["matrix"]

    #at this point, for the index on a given day, matrix[i][j] represents the ride between firm i and brokerage j on the given day
    return MonthTS

The helper function it calls:

def replaceVal(x, idx1, idx2):
    if (x.pu_block == x.pu_block):
        (x.matrix)[idx1][idx2] = x.pu_block
    return x.matrix

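I tried to reproduce this error by running with a single combination of rows from the other two sets, but could not. When I print x.matrix at the point of failure, it prints nan.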
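One way to narrow this down is to make the helper fail loudly as soon as it sees a cell that is not an array, so the offending date shows up in the error. The following is only a sketch: `replace_val_checked` and `make_row` are hypothetical names that do not appear in the original code, and the rows are hand-built stand-ins for what `apply(axis=1)` would pass in.

```python
import numpy as np
import pandas as pd


def replace_val_checked(row, i, j):
    """Hypothetical variant of replaceVal that fails loudly on a bad matrix cell."""
    matrix = row["matrix"]
    if not isinstance(matrix, np.ndarray):
        raise TypeError(
            f"row {row.name!r}: matrix is {type(matrix).__name__}, expected ndarray"
        )
    # pd.notna is False for NaN and None, so missing counts are skipped
    if pd.notna(row["pu_block"]):
        matrix[i, j] = row["pu_block"]
    return matrix


def make_row(matrix_cell, count, label):
    """Build a stand-in for one row as apply(axis=1) would pass it (assumed layout)."""
    values = np.empty(2, dtype=object)
    values[0] = matrix_cell
    values[1] = count
    return pd.Series(values, index=["matrix", "pu_block"], name=label)


# A healthy row: the count is written into the matrix in place
good = replace_val_checked(make_row(np.zeros((2, 2)), 5.0, "2012-02-02"), 0, 1)

# A row whose matrix cell is already nan: the guard reports which row it was
try:
    replace_val_checked(make_row(float("nan"), 5.0, "2012-02-03"), 0, 1)
    error_message = None
except TypeError as exc:
    error_message = str(exc)
```

The guard does not fix anything. It only moves the failure to the first row where the matrix cell has already been replaced, which can help identify the iteration that introduced it.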

To summarize, the df looks like this before the apply:

            matrix
date
2012-02-01 [[0, 1, 2, ...
2012-02-02 [[0, 1, 2, ...
2012-02-03 [[0, 1, 2, ...
...
2012-02-27 [[0, 1, 2, ...
2012-02-28 [[0, 1, 2, ...
2012-02-29 [[0, 1, 2, ...

And like this afterwards:

            matrix
date
2012-02-01 [[0, 1, 2, ...
2012-02-02 [[0, 1, 2, ...
2012-02-03 nan
...
2012-02-27 nan
2012-02-28 nan
2012-02-29 nan
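For reference, here is one general way nan can end up in an object column like this. `pd.concat(..., axis=1)` aligns on the index with an outer join, so any index label that exists in only one input gets nan in the other input's columns. The sketch below uses made-up data and an out-of-month date as an assumption. The post does not establish that this is what happens in the code above, and it does not by itself explain in-month dates turning into nan.

```python
import datetime

import numpy as np
import pandas as pd

# Three in-month days, each holding a small zero matrix (object column)
days = [datetime.date(2012, 2, d) for d in (1, 2, 3)]
cells = np.empty(len(days), dtype=object)
for pos in range(len(days)):
    cells[pos] = np.zeros((2, 2))
month_ts = pd.DataFrame({"matrix": pd.Series(cells, index=days)})

# Hypothetical daily counts; 2012-03-01 is deliberately outside the month
counts = pd.DataFrame(
    {"pu_block": [4, 7]},
    index=[datetime.date(2012, 2, 2), datetime.date(2012, 3, 1)],
)

# Outer join on the index: the extra date gets a row whose matrix cell is nan
combined = pd.concat([month_ts, counts], axis=1)
missing_matrix = combined[combined["matrix"].isna()].index.tolist()
```

Here `missing_matrix` lists the labels whose matrix cell is nan after the concat, which is the kind of check that could be run on `MonthTS` right before the apply.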
python pandas numpy apply nan
1 Answer

(I would have asked this in a comment, but I don't have that privilege yet.)

Out of curiosity, is there a reason your helper function checks whether x.pu_block is equal to itself?
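For context, comparing a value with itself is a common idiom for detecting NaN, because NaN is the one float value that is not equal to itself. Whether that was the asker's intent is not stated in the post. The sketch below, with the hypothetical function name `passes_self_compare`, shows how the idiom behaves next to `pd.isna`. Note that `None` passes the self-comparison, so the idiom does not filter it out.

```python
import pandas as pd


def passes_self_compare(value):
    """Mirror the helper's test: True unless value is NaN."""
    return value == value


nan = float("nan")

# Self-comparison: only NaN fails it; None compares equal to itself
results = {
    "nan": passes_self_compare(nan),
    "number": passes_self_compare(3.0),
    "none": passes_self_compare(None),
}

# pd.isna treats both NaN and None as missing
isna_results = {
    "nan": bool(pd.isna(nan)),
    "number": bool(pd.isna(3.0)),
    "none": bool(pd.isna(None)),
}
```

If the goal is simply to skip missing counts, `pd.notna(x.pu_block)` states the intent more directly and also covers `None`.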
