pyspark 1.6.3 linear regression error: float() argument must be a string or a number

Problem description · Votes: 0 · Answers: 1

I am using pyspark's linear regression. Here is my code:

from pyspark.ml.regression import LabeledPoint,LinearRegressionWithSGD
from pyspark import SparkContext, SparkConf 
from pyspark.sql import SQLContext
from pyspark.ml.evaluation import RegressionEvaluator
import time
import csv

start_time = time.time()

conf = SparkConf().setAppName("project_spark").setMaster("local")
sc = SparkContext(conf=conf)
sqlc = SQLContext(sc)

X_train = sc.textFile('C:\Users\WINDOWS 8.1\Desktop\BoW_Train_int_1k.csv')
X_test = sc.textFile('C:\Users\WINDOWS 8.1\Desktop\BoW_Test_int_1k.csv')
y_train = sc.textFile('C:\Users\WINDOWS 8.1\Desktop\Train_Tags81_1k.csv')
y_test = sc.textFile('C:\Users\WINDOWS 8.1\Desktop\Test_Tags81_1k.csv')

X_train = X_train.map(lambda line: line.split(","))
X_test = X_test.map(lambda line: line.split(","))
y_train = y_train.map(lambda line: line.split(","))
y_test = y_test.map(lambda line: line.split(","))

training = LabeledPoint(y_train, X_train)
testing = LabeledPoint(y_test, X_test)

model = LinearRegressionWithSGD.train(training)
valuesAndPreds = (testing.map(lambda p: (p.label, model.predict(p.features))))

evaluator = RegressionEvaluator(metricName="rmse")
RMSE = evaluator.evaluate(valuesAndPreds)

print("Root Mean Squared Error = " + str(RMSE))
Time = time.time() - start_time
print("--- %s seconds ---" % Time)
spark.stop()

But this code fails with the error "float() argument must be a string or a number" at the line:

training = LabeledPoint(y_train, X_train)

So, what should I do?

python pyspark linear-regression
1 Answer
0 votes

Without seeing the whole picture, my guess is that you are passing LabeledPoint the wrong argument types. More specifically, your y_train and y_test get their values from:

...
y_train.map(lambda line: line.split(","))
y_test.map(lambda line: line.split(","))

Each of these returns a list, which is not compatible with LabeledPoint's label argument.

So: training = LabeledPoint(y_train, X_train) -> training = LabeledPoint([some, values], [some, other, values])
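
You can reproduce the message outside your pipeline with a minimal, purely illustrative example (the values are hypothetical; note that in Spark 1.6 LabeledPoint lives in pyspark.mllib.regression, not pyspark.ml.regression):

from pyspark.mllib.regression import LabeledPoint

# A list cannot be cast by float(), so this raises the error from the question
LabeledPoint([1.0, 2.0], [0.0, 1.0])
# TypeError: float() argument must be a string or a number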

However, from the docs/source, LabeledPoint expects its first argument, the label, to be something that can be cast to a float:

class LabeledPoint(object):

    """
    Class that represents the features and labels of a data point.

    :param label:
      Label for this data point.
    :param features:
      Vector of features for this point (NumPy array, list,
      pyspark.mllib.linalg.SparseVector, or scipy.sparse column matrix).

    .. note:: 'label' and 'features' are accessible as class attributes.

    .. versionadded:: 1.0.0
    """

    def __init__(self, label, features):
        self.label = float(label)
        self.features = _convert_to_vector(features)
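
For reference, a minimal well-formed LabeledPoint wraps a single numeric label and one feature vector per data point (illustrative values, not from the question's data):

from pyspark.mllib.regression import LabeledPoint

point = LabeledPoint(1.0, [0.0, 2.0, 3.5])
print(point.label)     # 1.0
print(point.features)  # stored as an mllib DenseVector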

So, depending on what your rows look like, you might change your code to something like the following:

...
y_train.map(lambda line: line.split(",")[0])
...
y_test.map(lambda line: line.split(",")[0])
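
Putting this together, a rough sketch of building one LabeledPoint per row and then training might look like the following. It assumes each feature line is a comma-separated row of numbers, each tag line starts with a numeric label, and the feature and label RDDs line up row for row; adapt it to your actual files:

from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD  # mllib, not ml, in Spark 1.6

# X_train / y_train here are the raw RDDs from sc.textFile, before any map()
X_train_parsed = X_train.map(lambda line: [float(x) for x in line.split(",")])
y_train_parsed = y_train.map(lambda line: float(line.split(",")[0]))

# zip() assumes both RDDs have the same length and partitioning
training = y_train_parsed.zip(X_train_parsed) \
                         .map(lambda pair: LabeledPoint(pair[0], pair[1]))

model = LinearRegressionWithSGD.train(training)

The test set would be assembled the same way before calling model.predict on each point's features.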

Hope that helps, and good luck!
