[SPARK-5089] Vector conversion broken for non-float64 arrays


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.2.0
    • Fix Version/s: 1.2.1, 1.3.0
    • Component/s: MLlib, PySpark
    • Labels: None

      Description

      Prior to performing many MLlib operations in PySpark (e.g. KMeans), data are automatically converted to DenseVectors. If the data are NumPy arrays with dtype float64, this works. If the data are NumPy arrays with lower precision (e.g. float16 or float32), they should be upcast to float64, but a small bug in the line below prevents this: astype returns a new array rather than casting in place, so the result is silently discarded.

      if ar.dtype != np.float64:
          ar.astype(np.float64)
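          # bug: astype() returns a new copy, which is immediately discarded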
      

      Because the cast never actually happens, non-float64 values are mangled during SerDe, which can have significant consequences. For example, the following yields confusing and erroneous results:

      from numpy import random
      from pyspark.mllib.clustering import KMeans
      data = sc.parallelize(random.randn(100,10).astype('float32'))
      model = KMeans.train(data, k=3)
      len(model.centers[0])
      >> 5 # should be 10!
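
      The length of 5 is consistent with each row's raw float32 buffer being read back as float64: 10 float32 values occupy 40 bytes, which reinterpreted as 8-byte float64 values yields only 5. A minimal sketch of that arithmetic in plain NumPy (np.frombuffer here merely stands in for whatever the SerDe layer does with the raw bytes):

      import numpy as np

      row = np.random.randn(10).astype(np.float32)  # 10 values x 4 bytes = 40 bytes
      # reading the same 40 bytes as 8-byte float64 values yields only 5 of them
      mangled = np.frombuffer(row.tobytes(), dtype=np.float64)
      print(len(mangled))  # 5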
      

      But this works fine:

      data = sc.parallelize(random.randn(100,10).astype('float64'))
      model = KMeans.train(data, k=3)
      len(model.centers[0])
      >> 10 # this is correct
      

      The fix is trivial; I'll submit a PR shortly.
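
      A minimal sketch of the corrected cast (the _to_float64 wrapper is illustrative, not the actual PySpark helper; the only change that matters is assigning the result of astype back):

      import numpy as np

      def _to_float64(ar):
          # astype() returns a new array rather than casting in place,
          # so the result must be assigned back
          if ar.dtype != np.float64:
              ar = ar.astype(np.float64)
          return ar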



            People

            • Assignee: Jeremy Freeman (freeman-lab)
            • Reporter: Jeremy Freeman (freeman-lab)
            • Votes: 0
            • Watchers: 3
