pyspark 1.6.3 linear regression error: float() argument must be a string or a number

Problem description · Votes: 0 · Answers: 1

I am using pyspark's linear regression. Here is my code:

from pyspark.ml.regression import LabeledPoint,LinearRegressionWithSGD
from pyspark import SparkContext, SparkConf 
from pyspark.sql import SQLContext
from pyspark.ml.evaluation import RegressionEvaluator
import time
import csv

start_time = time.time()

conf = SparkConf().setAppName("project_spark").setMaster("local")
sc = SparkContext(conf=conf)
sqlc = SQLContext(sc)

X_train = sc.textFile('C:\Users\WINDOWS 8.1\Desktop\BoW_Train_int_1k.csv')
X_test = sc.textFile('C:\Users\WINDOWS 8.1\Desktop\BoW_Test_int_1k.csv')
y_train = sc.textFile('C:\Users\WINDOWS 8.1\Desktop\Train_Tags81_1k.csv')
y_test = sc.textFile('C:\Users\WINDOWS 8.1\Desktop\Test_Tags81_1k.csv')

X_train = X_train.map(lambda line: line.split(","))
X_test = X_test.map(lambda line: line.split(","))
y_train = y_train.map(lambda line: line.split(","))
y_test = y_test.map(lambda line: line.split(","))

training = LabeledPoint(y_train, X_train)
testing = LabeledPoint(y_test, X_test)

model = LinearRegressionWithSGD.train(training)
valuesAndPreds = (testing.map(lambda p: (p.label, model.predict(p.features))))

evaluator = RegressionEvaluator(metricName="rmse")
RMSE = evaluator.evaluate(valuesAndPreds)

print("Root Mean Squared Error = " + str(RMSE))
Time = time.time() - start_time
print("--- %s seconds ---" % Time)
spark.stop()

But this code fails with the error "float() argument must be a string or a number" at the line:

training = LabeledPoint(y_train, X_train)

So, what should I do?

python pyspark linear-regression
1 Answer
0 votes

Without seeing the whole picture, my guess is that you are passing LabeledPoint the wrong argument types. More specifically, your y_train and y_test get their values from:

...
y_train.map(lambda line: line.split(","))
y_test.map(lambda line: line.split(","))

Each of these returns a list, which is not compatible with LabeledPoint's label argument.

So: training = LabeledPoint(y_train, X_train) -> training = LabeledPoint([some, values], [some, other, values])
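
You can reproduce the message outside your pipeline with a minimal, purely illustrative example (the values are hypothetical; note that in Spark 1.6 LabeledPoint lives in pyspark.mllib.regression, not pyspark.ml.regression):

from pyspark.mllib.regression import LabeledPoint

# A list cannot be cast by float(), so this raises the error from the question
LabeledPoint([1.0, 2.0], [0.0, 1.0])
# TypeError: float() argument must be a string or a number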

However, from the docs/source, LabeledPoint expects its first argument, the label, to be something that can be cast to a float:

class LabeledPoint(object):

    """
    Class that represents the features and labels of a data point.

    :param label:
      Label for this data point.
    :param features:
      Vector of features for this point (NumPy array, list,
      pyspark.mllib.linalg.SparseVector, or scipy.sparse column matrix).

    .. note:: 'label' and 'features' are accessible as class attributes.

    .. versionadded:: 1.0.0
    """

    def __init__(self, label, features):
        self.label = float(label)
        self.features = _convert_to_vector(features)
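
For reference, a minimal well-formed LabeledPoint wraps a single numeric label and one feature vector per data point (illustrative values, not from the question's data):

from pyspark.mllib.regression import LabeledPoint

point = LabeledPoint(1.0, [0.0, 2.0, 3.5])
print(point.label)     # 1.0
print(point.features)  # stored as an mllib DenseVector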

So, depending on what your rows look like, you might change your code to something like the following:

...
y_train.map(lambda line: line.split(",")[0])
...
y_test.map(lambda line: line.split(",")[0])
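
Putting this together, a rough sketch of building one LabeledPoint per row and then training might look like the following. It assumes each feature line is a comma-separated row of numbers, each tag line starts with a numeric label, and the feature and label RDDs line up row for row; adapt it to your actual files:

from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD  # mllib, not ml, in Spark 1.6

# X_train / y_train here are the raw RDDs from sc.textFile, before any map()
X_train_parsed = X_train.map(lambda line: [float(x) for x in line.split(",")])
y_train_parsed = y_train.map(lambda line: float(line.split(",")[0]))

# zip() assumes both RDDs have the same length and partitioning
training = y_train_parsed.zip(X_train_parsed) \
                         .map(lambda pair: LabeledPoint(pair[0], pair[1]))

model = LinearRegressionWithSGD.train(training)

The test set would be assembled the same way before calling model.predict on each point's features.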

Hope that helps, and good luck!
