Converting an RDD to a different type (from float64 to double)

Question

I have the following code, which uses PySpark.

test_truth_value = RDD.

test_predictor_rdd = RDD.

valuesAndPred = test_truth_value.zip(lasso_model.predict(test_predictor_rdd)).map(lambda x: ((x[0]), (x[1])))
metrics = RegressionMetrics(valuesAndPred)

When I run the code, I get the following error:

pyspark.errors.exceptions.base.PySparkTypeError: [CANNOT_ACCEPT_OBJECT_IN_TYPE] `DoubleType()` can not accept object `-44604.288415296396` in type `float64`.

It is raised by the following line:

metrics = RegressionMetrics(valuesAndPred)

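Normally I would fix the RDD's type by following the linked answer, "Pyspark map from RDD of strings to RDD of list of doubles".

But... I now have three questions.

What is the difference between float64 and double? Going by the linked question "Swift Difference Between Double and Float64", it looks like PySpark is distinguishing between float64 and double?
When I created the earlier RDDs, I had already cast the columns to double, as shown below.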

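from pyspark.sql.functions import col
from pyspark.sql.types import DoubleType
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.feature import StandardScaler

double_cast_list = ['price','bed','bath','acre_lot','house_size']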
for cast_item in double_cast_list:
    top_zip_df = top_zip_df.withColumn(cast_item, col(cast_item).cast(DoubleType()))

lasso_df = top_zip_df.select('price','bed','bath','acre_lot','house_size')
train_df, test_df = lasso_df.randomSplit(weights = [0.7,0.3], seed = 100)

def scaled_rdd_generation(df):
    # The first column is the label; the remaining columns are the features.
    rdd = df.rdd.map(lambda row: LabeledPoint(row[0], row[1:]))

    # Separate the features from the labels; only the features need to be standardized.
    features_rdd = rdd.map(lambda row: row.features)  # LabeledPoint exposes .label and .features
    scaler = StandardScaler(withMean=True, withStd=True)
    # Fit the scaler first (computes the mean and variance and stores them in a model),
    # then use that model to transform the features.
    scaler_model = scaler.fit(features_rdd)
    scaled_feature_rdd = scaler_model.transform(features_rdd)
    # RDD.zip pairs each element of this RDD with the corresponding element of the other.
    scaled_rdd = rdd.zip(scaled_feature_rdd).map(lambda x: LabeledPoint(x[0].label, x[1]))
    return scaled_rdd


model_save_path = r'C:\Users\ra064640\OneDrive - Honda\Desktop\Spark\Real Estate Linear Regression'
train_scaled_rdd = scaled_rdd_generation(train_df)
test_scaled_rdd = scaled_rdd_generation(test_df)
test_predictor_rdd = test_scaled_rdd.map(lambda x: x.features)
test_truth_value = test_scaled_rdd.map(lambda x: x.label)

Where is the double being converted to float64?

How should I fix this? I don't see a double(x[0]) function analogous to the float(x[0]) suggested in the earlier link. Thanks!

Answer 1

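First, as stated in the Spark documentation, this is the difference between the float and double types:

FloatType: represents 4-byte single-precision floating point numbers.

DoubleType: represents 8-byte double-precision floating point numbers.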
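The size difference can be checked without Spark. The sketch below uses only Python's standard `struct` module; the helper name `round_trip` is made up for illustration. It packs the value from the error message into a 4-byte and an 8-byte IEEE 754 representation and reads it back. On CPython the built-in `float` is already a double-precision value, which is likely why there is no separate `double()` function to call.

```python
import struct


def round_trip(value: float, fmt: str) -> float:
    """Pack a Python float into the given struct format and unpack it again."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


value = -44604.288415296396  # the value quoted in the error message

single_size = struct.calcsize("<f")  # bytes used by a single-precision float
double_size = struct.calcsize("<d")  # bytes used by a double-precision float

as_single = round_trip(value, "<f")  # squeezed through 4 bytes, so digits are lost
as_double = round_trip(value, "<d")  # 8 bytes, same as a Python float

print(single_size, double_size)
print(as_single == value, as_double == value)
```

The single-precision round trip changes the value while the double-precision one does not, so a float64 and a Spark double carry the same precision. The error in the question is therefore about the Python object type, not about numeric range.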

Second, as you mentioned, the error occurs here:

valuesAndPred = test_truth_value.zip(lasso_model.predict(test_predictor_rdd)).map(lambda x: ((x[0]), (x[1])))
metrics = RegressionMetrics(valuesAndPred)

More specifically, the problem is probably caused by this part:

lasso_model.predict(test_predictor_rdd)

Finally, to fix this, you can try casting the predictions as well:

lasso_model.predict(test_predictor_rdd).map(float)

Modified code:

valuesAndPred = test_truth_value.zip(lasso_model.predict(test_predictor_rdd).map(float)).map(lambda x: ((x[0]), (x[1])))
metrics = RegressionMetrics(valuesAndPred)
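Why `float` helps can be illustrated without Spark or NumPy. The error message reports the object's type as `float64`, which suggests the prediction is a NumPy scalar. `numpy.float64` subclasses Python's `float`, so it passes `isinstance` but fails an exact-type comparison. The sketch below assumes the validation behaves like such an exact-type check. `Float64Like`, `is_exact_float` and `to_plain_floats` are hypothetical names introduced only for this illustration.

```python
class Float64Like(float):
    """Hypothetical stand-in for numpy.float64, which also subclasses float."""


def is_exact_float(obj) -> bool:
    # Exact-type check (not isinstance); assumed to resemble the failing validation.
    return type(obj) is float


def to_plain_floats(pair):
    """Convert a (label, prediction) pair to built-in Python floats."""
    label, prediction = pair
    return float(label), float(prediction)


pair = (1.5, Float64Like(-44604.288415296396))
converted = to_plain_floats(pair)

# The stand-in is a float by inheritance, but its exact type is not float.
print(isinstance(pair[1], float), is_exact_float(pair[1]))
# After conversion the value is unchanged and the type is the built-in float.
print(converted[1] == pair[1], is_exact_float(converted[1]))
```

In the pipeline from the question, a helper like this could replace the identity lambda, for example `.map(to_plain_floats)` after the `zip`, so that both the label and the prediction reach `RegressionMetrics` as plain floats.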
