  1. Spark
  2. SPARK-12006

GaussianMixture.train crashes if an initial model is not None

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.4.0, 1.5.0, 1.6.0
    • Fix Version/s: 1.4.2, 1.5.3, 1.6.1, 2.0.0
    • Component/s: MLlib, PySpark
    • Labels:
      None

      Description

      Steps to reproduce:

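      from pyspark.mllib.clustering import GaussianMixture
      from numpy import array
      
      # `sc` is an existing SparkContext (for example, the one provided by the pyspark shell)
      data = sc.textFile("data/mllib/gmm_data.txt")
      parsedData = data.map(lambda line: array([float(x) for x in line.strip().split(' ')]))
      
      # The first call succeeds and returns a fitted model
      gmm = GaussianMixture.train(parsedData, 2)
      # Passing that model back as initialModel triggers the crash
      GaussianMixture.train(parsedData, 2, initialModel=gmm)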
      

      It looks like the source of the problem is that initialModelWeights is a NumPy array. In 1.5 / 1.6 this leads to net.razorvine.pickle.PickleException; in 1.4 the call fails with: Method trainGaussianMixture([..., class org.apache.spark.mllib.linalg.DenseVector, class java.util.ArrayList, class java.util.ArrayList]) does not exist
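
If the diagnosis above is right, converting the weights to a plain Python list before they cross into the JVM might avoid both errors. The following is only a minimal sketch of that conversion, not the fix that was applied: `to_plain_weights` is a hypothetical helper, and a standard-library `array.array` stands in for the NumPy array so the snippet runs without Spark or NumPy.

```python
from array import array as typed_array  # stdlib stand-in for a NumPy array (assumption)


def to_plain_weights(weights):
    """Hypothetical helper: copy any array-like of numbers into a plain
    list of built-in Python floats."""
    return [float(w) for w in weights]


# Stand-in for the array-like value held by a fitted model's weights
weights = typed_array("d", [0.25, 0.75])

plain = to_plain_weights(weights)
print(type(plain).__name__, plain)
```

The same list comprehension accepts a NumPy array, since iterating over a one-dimensional array yields its scalar elements.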

            People

            • Assignee:
              zero323 Maciej Szymkiewicz
              Reporter:
              zero323 Maciej Szymkiewicz
              Shepherd:
              Joseph K. Bradley
            • Votes:
              0
              Watchers:
              4
