Spark / SPARK-12006

GaussianMixture.train crashes if an initial model is not None


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.4.0, 1.5.0, 1.6.0
    • Fix Version/s: 1.4.2, 1.5.3, 1.6.1, 2.0.0
    • Component/s: MLlib, PySpark
    • Labels: None

    Description

      Steps to reproduce:

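      from pyspark.mllib.clustering import GaussianMixture
      from numpy import array

      # `sc` is the SparkContext provided by the pyspark shell.
      data = sc.textFile("data/mllib/gmm_data.txt")
      parsedData = data.map(lambda line: array([float(x) for x in line.strip().split(' ')]))

      # Training without an initial model succeeds.
      gmm = GaussianMixture.train(parsedData, 2)
      # Passing the trained model back as initialModel triggers the crash.
      GaussianMixture.train(parsedData, 2, initialModel=gmm)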
      

      The source of the problem appears to be initialModelWeights, which is passed to the JVM as a NumPy array. In 1.5 and 1.6 this raises net.razorvine.pickle.PickleException. In 1.4 the call fails with: Method trainGaussianMixture([..., class org.apache.spark.mllib.linalg.DenseVector, class java.util.ArrayList, class java.util.ArrayList]) does not exist
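
The suspected cause can be illustrated without Spark. The sketch below assumes the weights array really is the culprit, and shows the kind of conversion that would hand the JVM side a plain list of built-in floats instead of a NumPy array. `to_plain_weights` is a hypothetical helper, not part of PySpark, and the issue text does not say whether the actual fix works this way.

```python
def to_plain_weights(weights):
    """Return weights as a plain list of built-in Python floats.

    Hypothetical helper for illustration only (not a PySpark API).
    Accepts any iterable of numbers, including a one-dimensional
    NumPy array such as the one returned by a model's weights.
    """
    return [float(w) for w in weights]


# Stand-in values; with a trained model this would be gmm.weights.
plain = to_plain_weights((0.25, 0.75))
print(type(plain).__name__, [type(w).__name__ for w in plain])
```

Because the helper only iterates and calls `float`, it behaves the same for tuples, lists, and NumPy arrays, so it needs no NumPy import of its own.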


          People

            Assignee: Maciej Szymkiewicz (zero323)
            Reporter: Maciej Szymkiewicz (zero323)
            Shepherd: Joseph K. Bradley
            Votes: 0
            Watchers: 4

