Prior to performing many MLlib operations in PySpark (e.g. KMeans), data are automatically converted to DenseVectors. If the data are numpy arrays with dtype float64, this works. If the data are numpy arrays with lower precision (e.g. float16 or float32), they should be upcast to float64, but due to a small bug in this line that currently doesn't happen: the result of the cast is discarded, because casting is not performed in place.
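A minimal sketch of the bug pattern in plain numpy (not the actual Spark source): `ndarray.astype` returns a new array rather than modifying the array in place, so calling it without binding the result is a no-op.

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0], dtype=np.float32)

# Buggy pattern: the cast result is discarded, so `data` stays float32
if data.dtype != np.float64:
    data.astype(np.float64)
assert data.dtype == np.float32  # unchanged -- the call was a no-op

# Correct pattern: rebind the name to the new array astype returns
if data.dtype != np.float64:
    data = data.astype(np.float64)
assert data.dtype == np.float64
```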
Non-float64 values are in turn mangled during SerDe. This can have significant consequences. For example, the following yields confusing and erroneous results:
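The original report's PySpark example is not reproduced here. As a local illustration of the mangling mechanism (assuming, as a sketch, that the serializer writes the array's raw bytes and the deserializer reads them back as float64), a float32 array comes back as half as many garbage doubles:

```python
import numpy as np

a32 = np.arange(6, dtype=np.float32)  # 6 values, 24 bytes

# A float64-assuming deserializer reinterprets those 24 bytes
# as 3 doubles, yielding the wrong length and garbage values
mangled = np.frombuffer(a32.tobytes(), dtype=np.float64)

assert mangled.shape == (3,)                 # wrong length
assert not np.allclose(mangled, a32[:3])     # wrong values
```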
But this works fine:
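Presumably the working example supplied float64 input. Under the same bytes-level round trip sketched above, a float64 array survives intact because the dtypes on both sides agree:

```python
import numpy as np

a64 = np.arange(6, dtype=np.float64)

# Same serialize/deserialize round trip, but dtypes now match
restored = np.frombuffer(a64.tobytes(), dtype=np.float64)

assert np.array_equal(restored, a64)
```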
The fix is trivial; I'll submit a PR shortly.