Details
- Type: Bug
- Status: Closed
- Priority: Major
- Resolution: Fixed
- Fix Version: 0.5.0
- Environment: Spark 1.4 on MapR Hadoop, running on CentOS 7.0
Description
The pyspark interpreter seems to have an issue accessing a Python RDD.
import numpy as np
from sklearn.cross_validation import train_test_split
from pyspark.mllib.classification import NaiveBayes
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint

X = np.random.rand(100, 3)
y = np.random.randint(5, size=100)
trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.2)
training = sc.parallelize([LabeledPoint(ylabel, Vectors.dense(xrow))
                           for (xrow, ylabel) in zip(trainX, trainy)])
testing = sc.parallelize([LabeledPoint(ylabel, Vectors.dense(xrow))
                          for (xrow, ylabel) in zip(testX, testy)])
model = NaiveBayes.train(training, 0.1)
The code above errors out at the last line.
Error:
(<type 'exceptions.AttributeError'>, AttributeError("'list' object has no attribute '_get_object_id'",), <traceback object at 0x392b638>)
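For context on what that error usually indicates: Spark's Python API talks to the JVM through a py4j-style gateway, which resolves Python-side wrappers to Java objects via an `_get_object_id` attribute. If a plain Python list reaches a code path that expects such a wrapper, this exact AttributeError is raised. The snippet below is only an illustrative sketch of that failure mode with a hypothetical stub class (`JavaObjectStub` and `to_java_argument` are not real py4j or Zeppelin APIs):

```python
class JavaObjectStub:
    """Hypothetical stand-in for a py4j-style JavaObject wrapper."""

    def __init__(self, object_id):
        self._object_id = object_id

    def _get_object_id(self):
        # Real gateways use an id like this to reference the JVM-side object.
        return self._object_id


def to_java_argument(obj):
    # Sketch of how a gateway might marshal an argument: it assumes the
    # object carries _get_object_id, which a plain Python list does not.
    return obj._get_object_id()


print(to_java_argument(JavaObjectStub("o123")))  # a wrapper marshals fine

try:
    to_java_argument([1, 2, 3])  # a bare list, as in the reported traceback
except AttributeError as e:
    print(e)  # 'list' object has no attribute '_get_object_id'
```

This is consistent with the error appearing only inside the Zeppelin pyspark interpreter: if its gateway or serialization path differs from the stock pyspark shell's, the same user code can hand a raw list to the JVM bridge where the shell would hand a proper wrapper.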
The code above runs fine from the pyspark shell. Other features, such as data frames, were also tested from the Zeppelin pyspark interpreter and seem to work fine as well.