[ZEPPELIN-97] pyspark issue with mllib api - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 0.5.0
Fix Version/s: 0.5.0
Component/s: Interpreters
Labels:
- interpreter
- pyspark
Environment: Spark 1.4 on MapR Hadoop, running on CentOS 7.0

Description

The pyspark interpreter seems to have an issue accessing a Python RDD when it is passed to the MLlib API:

import numpy as np
from sklearn.cross_validation import train_test_split
from pyspark.mllib.classification import NaiveBayes
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint 

X = np.random.rand(100,3)
y = np.random.randint(5,size=100)

trainX,testX,trainy,testy = train_test_split(X,y,test_size=0.2)

training = sc.parallelize([LabeledPoint(ylabel,Vectors.dense(xrow)) for (xrow,ylabel) in zip(trainX,trainy)])
testing = sc.parallelize([LabeledPoint(ylabel,Vectors.dense(xrow)) for (xrow,ylabel) in zip(testX,testy)])

model = NaiveBayes.train(training, 0.1)

The code above fails on the last line, the NaiveBayes.train call.

Error:

(<type 'exceptions.AttributeError'>, AttributeError("'list' object has no attribute '_get_object_id'",), <traceback object at 0x392b638>)

The same code runs fine from the pyspark shell. Other features, such as DataFrames, were also tested from the Zeppelin pyspark interpreter and seem to work fine.
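One plausible reading of the error text is that a plain Python list reached a code path that expected a JVM-backed proxy object, which would expose a _get_object_id method. The toy model below only illustrates how that exact message can arise. It is not py4j, Spark, or Zeppelin code, and the names FakeJavaObject and encode_argument are hypothetical.

```python
class FakeJavaObject:
    """Hypothetical stand-in for a JVM-backed proxy that knows its own object id."""

    def __init__(self, object_id):
        self._object_id = object_id

    def _get_object_id(self):
        return self._object_id


def encode_argument(arg):
    """Hypothetical encoder: primitives go by value, everything else by reference id."""
    if isinstance(arg, (bool, int, float, str)):
        return "value:" + repr(arg)
    # Anything that is not a primitive is assumed to be a proxy object.
    # A plain list has no _get_object_id, so this line raises AttributeError.
    return "ref:" + arg._get_object_id()


try:
    encode_argument([0.1, 0.2])
except AttributeError as exc:
    message = str(exc)
    print(message)
```

If this reading is right, the difference between the pyspark shell and the Zeppelin interpreter would lie in how Python collections are converted before they cross into the JVM, not in the user code itself.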

People

Assignee: Lee Moon Soo

Reporter: Bobby Chowdary

Votes: 0

Watchers: 3

Dates

Created: 09/Jun/15 22:58

Updated: 30/Jun/15 17:41

Resolved: 30/Jun/15 17:41