Zeppelin / ZEPPELIN-97

PySpark issue with MLlib API

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.5.0
    • Fix Version/s: 0.5.0
    • Component/s: Interpreters
    • Environment:

      Spark 1.4 on MapR Hadoop, running on CentOS 7.0

      Description

      The pyspark interpreter seems to have an issue accessing a Python RDD when calling the MLlib API:

      import numpy as np
      from sklearn.cross_validation import train_test_split
      from pyspark.mllib.classification import NaiveBayes
      from pyspark.mllib.linalg import Vectors
      from pyspark.mllib.regression import LabeledPoint 
      
      X = np.random.rand(100,3)
      y = np.random.randint(5,size=100)
      
      trainX,testX,trainy,testy = train_test_split(X,y,test_size=0.2)
      
      training = sc.parallelize([LabeledPoint(ylabel,Vectors.dense(xrow)) for (xrow,ylabel) in zip(trainX,trainy)])
      testing = sc.parallelize([LabeledPoint(ylabel,Vectors.dense(xrow)) for (xrow,ylabel) in zip(testX,testy)])
      
      model = NaiveBayes.train(training, 0.1)
      

      The code above fails on the last line, the call to NaiveBayes.train.

      Error:

      (<type 'exceptions.AttributeError'>, AttributeError("'list' object has no attribute '_get_object_id'",), <traceback object at 0x392b638>)
      

      The same code runs fine from the pyspark shell. I also tested other features, such as DataFrames, from the Zeppelin pyspark interpreter, and they seem to work fine as well.
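
The error text suggests that a plain Python list reached code that expected a Py4J proxy for a JVM object. Py4J proxies expose a `_get_object_id()` method, and a list does not. One possible explanation, which is an assumption and not confirmed in this report, is that the interpreter's Py4J gateway was created without `auto_convert=True`, whereas the pyspark shell enables it. The sketch below reproduces only the shape of the error in plain Python, without Spark; `FakeJavaProxy`, `to_reference`, and `describe_failure` are hypothetical names used for illustration.

```python
class FakeJavaProxy(object):
    """Hypothetical stand-in for a Py4J proxy that wraps a JVM object."""

    def __init__(self, object_id):
        self._object_id = object_id

    def _get_object_id(self):
        # Real Py4J proxies return the id of the JVM-side object here.
        return self._object_id


def to_reference(arg):
    """Hypothetical helper: encode an argument as a JVM object reference.

    It assumes the argument is already a proxy, which is what happens
    when no automatic Python-to-Java conversion is applied first.
    """
    return "r" + arg._get_object_id()


def describe_failure(arg):
    """Return the encoded reference, or the error message if encoding fails."""
    try:
        return to_reference(arg)
    except AttributeError as exc:
        return str(exc)


if __name__ == "__main__":
    print(describe_failure(FakeJavaProxy("o42")))  # a proxy encodes cleanly
    print(describe_failure([0.2, 0.2, 0.2, 0.2, 0.2]))  # a plain list does not
```

If this assumption holds, any MLlib call that passes a Python list across the bridge would fail the same way, while calls that pass only proxies or primitives would keep working, which matches the observation that DataFrames were unaffected.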

            People

            • Assignee: Lee Moon Soo (moon)
            • Reporter: Bobby Chowdary (bobbych03)
            • Votes: 0
            • Watchers: 3
