I have the code shown below, which uses PySpark.
test_truth_value = RDD
test_predictor_rdd = RDD
valuesAndPred = test_truth_value.zip(lasso_model.predict(test_predictor_rdd)).map(lambda x: ((x[0]), (x[1])))
metrics = RegressionMetrics(valuesAndPred)
When I run the code, I get the following error:
pyspark.errors.exceptions.base.PySparkTypeError: [CANNOT_ACCEPT_OBJECT_IN_TYPE] `DoubleType()` can not accept object `-44604.288415296396` in type `float64`.
It is raised at the following line:
metrics = RegressionMetrics(valuesAndPred)
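Normally I would fix the element types of the RDD by following the linked answer: "Pyspark map from RDD of strings to RDD of list of doubles".
But... I now have three questions.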
double_cast_list = ['price','bed','bath','acre_lot','house_size']
for cast_item in double_cast_list:
    top_zip_df = top_zip_df.withColumn(cast_item, col(cast_item).cast(DoubleType()))
lasso_df = top_zip_df.select('price','bed','bath','acre_lot','house_size')
train_df, test_df = lasso_df.randomSplit(weights = [0.7,0.3], seed = 100)
def scaled_rdd_generation(df):
    # The first column (price) is the label; the remaining columns are the features.
    rdd = df.rdd.map(lambda row: LabeledPoint(row[0], row[1:]))
    # Separate the features from the labels; only the features need to be standardized.
    # This is possible because LabeledPoint already exposes label and features attributes.
    features_rdd = rdd.map(lambda row: row.features)
    scaler = StandardScaler(withMean=True, withStd=True)
    # For the standard scaler, you need to fit the scaler and then transform the data.
    # scaler.fit(rdd) computes the mean and variance and stores them as a model to be used later.
    scaler_model = scaler.fit(features_rdd)
    scaled_feature_rdd = scaler_model.transform(features_rdd)
    # RDD zip method: zips this RDD with another one and returns key-value pairs.
    scaled_rdd = rdd.zip(scaled_feature_rdd).map(lambda x: LabeledPoint(x[0].label, x[1]))
    return scaled_rdd
model_save_path = r'C:\Users\ra064640\OneDrive - Honda\Desktop\Spark\Real Estate Linear Regression'
train_scaled_rdd = scaled_rdd_generation(train_df)
test_scaled_rdd = scaled_rdd_generation(test_df)
test_predictor_rdd = test_scaled_rdd.map(lambda x: x.features)
test_truth_value = test_scaled_rdd.map(lambda x: x.label)
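Where does the double get converted to float64?
First, as stated in the Spark documentation, this is the difference between the float type and the double type:
- FloatType: represents 4-byte single-precision floating point numbers.
- DoubleType: represents 8-byte double-precision floating point numbers.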
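To make the size difference concrete, here is a small sketch in plain Python (no Spark needed). It uses the standard struct module to round-trip the value from the error message through a 4-byte and an 8-byte encoding. The variable names are only for illustration, and the sketch shows IEEE 754 storage in general rather than anything specific to Spark's implementation.

```python
import struct

# Value taken from the error message in the question.
value = -44604.288415296396

# Round-trip through an 8-byte double. Python floats are already doubles,
# so the value comes back unchanged.
as_double = struct.unpack("<d", struct.pack("<d", value))[0]

# Round-trip through a 4-byte single. Only about 7 significant decimal
# digits survive, so the value comes back slightly rounded.
as_single = struct.unpack("<f", struct.pack("<f", value))[0]

print("bytes per single:", struct.calcsize("<f"))
print("bytes per double:", struct.calcsize("<d"))
print("double round-trip exact:", as_double == value)
print("single round-trip exact:", as_single == value)
print("single round-trip error:", abs(as_single - value))
```

Note that the value in the error message keeps all of its digits, so it is already a double-precision number; the precision itself is not what is being rejected.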
Second, as you mentioned, the error occurs here:
valuesAndPred = test_truth_value.zip(lasso_model.predict(test_predictor_rdd)).map(lambda x: ((x[0]), (x[1])))
metrics = RegressionMetrics(valuesAndPred)
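More specifically, the problem is probably caused by this part:
lasso_model.predict(test_predictor_rdd)
The float64 in the error message suggests that these predictions come back as NumPy float64 values, which DoubleType() does not accept in place of a plain Python float.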
Finally, to fix the problem, you can try casting the predictions to plain Python floats:
lasso_model.predict(test_predictor_rdd).map(float)
Modified code:
valuesAndPred = test_truth_value.zip(lasso_model.predict(test_predictor_rdd).map(float)).map(lambda x: ((x[0]), (x[1])))
metrics = RegressionMetrics(valuesAndPred)
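Why .map(float) helps can be sketched without Spark or NumPy. The example below assumes that the schema check compares exact types, which is what the error message suggests. Float64Like is a hypothetical stand-in for numpy.float64 (which, like it, subclasses Python's float), and accepts_as_double is a hypothetical helper written for this illustration, not a PySpark function.

```python
# Hypothetical stand-in for numpy.float64, which also subclasses Python's float.
class Float64Like(float):
    pass


def accepts_as_double(obj):
    """Exact-type check, assumed here to mirror how DoubleType() is verified."""
    return type(obj) is float


# A prediction as the model might return it (value taken from the error message).
prediction = Float64Like(-44604.288415296396)

# The subclass relationship holds, so isinstance() would not complain.
print("isinstance check:", isinstance(prediction, float))

# An exact type comparison rejects the subclass instance.
print("accepted before cast:", accepts_as_double(prediction))

# float() returns a plain Python float with the same value.
print("accepted after cast:", accepts_as_double(float(prediction)))
```

The cast changes only the Python type of each prediction, not its numeric value, so the metrics are computed on the same numbers.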