[SPARK-5089] Vector conversion broken for non-float64 arrays


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.2.0
    • Fix Version/s: 1.2.1, 1.3.0
    • Component/s: MLlib, PySpark
    • Labels: None

    Description

      Prior to performing many MLlib operations in PySpark (e.g. KMeans), data are automatically converted to DenseVectors. If the data are numpy arrays of dtype float64, this works. If the data are numpy arrays of lower precision (e.g. float16 or float32), they should be upcast to float64, but a small bug in the following line currently prevents that (the cast does not happen in place):

      if ar.dtype != np.float64:
          ar.astype(np.float64)  # bug: astype returns a new array, which is discarded here
      
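      numpy's ndarray.astype returns a new, upcast array rather than casting in place, so the result must be assigned back. The actual patch may differ in detail, but the essential change is along these lines:

      if ar.dtype != np.float64:
          ar = ar.astype(np.float64)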

      Non-float64 values are in turn mangled during SerDe (serialization/deserialization between Python and the JVM). This can have significant consequences. For example, the following yields confusing and erroneous results:

      # run in the pyspark shell, where `sc` is the active SparkContext
      from numpy import random
      from pyspark.mllib.clustering import KMeans
      data = sc.parallelize(random.randn(100, 10).astype('float32'))
      model = KMeans.train(data, k=3)
      len(model.centers[0])
      >> 5  # should be 10!
      
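      The halved length is consistent with the raw bytes simply being reinterpreted: ten float32 values occupy 40 bytes, the same as five float64 values. A rough numpy-only sketch of that effect (an illustration of the byte-level reinterpretation, not the actual SerDe path):

      import numpy as np

      row = np.random.randn(10).astype(np.float32)              # 10 values, 40 bytes
      reparsed = np.frombuffer(row.tobytes(), dtype=np.float64)
      len(reparsed)
      >> 5  # the same 40 bytes parsed as five 8-byte doubles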

      But this works fine:

      data = sc.parallelize(random.randn(100, 10).astype('float64'))
      model = KMeans.train(data, k=3)
      len(model.centers[0])
      >> 10  # this is correct
      

      The fix is trivial; I'll submit a PR shortly.

          People

            Assignee: Jeremy Freeman (freeman-lab)
            Reporter: Jeremy Freeman (freeman-lab)
            Votes: 0
            Watchers: 3
