
[SPARK-30154] PySpark UDF to convert MLlib vectors to dense arrays


    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.0.0
    • Fix Version/s: 3.0.0
    • Component/s: ML, MLlib, PySpark
    • Labels: None
    • Target Version/s:

      Description

      If a PySpark user wants to convert MLlib sparse/dense vectors in a DataFrame into dense arrays, an efficient approach is to do the conversion in the JVM. However, that requires the PySpark user to write Scala code and register it as a UDF, which is often infeasible for a pure Python project.
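
      For context, the status-quo workaround in a pure-Python project is a row-at-a-time Python UDF, which deserializes every vector in the Python workers. A minimal sketch, assuming a DataFrame df with an MLlib vector column named "features" (the column names here are illustrative):

      from pyspark.sql.functions import udf
      from pyspark.sql.types import ArrayType, DoubleType

      # Row-at-a-time Python UDF: each Vector is converted via
      # Vector.toArray() in the Python worker; this is the slow path
      # that a JVM-side converter would avoid.
      to_array = udf(lambda v: v.toArray().tolist(), ArrayType(DoubleType()))
      df = df.withColumn("features_arr", to_array("features"))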

      What we can do is predefine those converters in Scala and expose them in PySpark, e.g.:

      from pyspark.ml.functions import vector_to_dense_array
      from pyspark.sql.functions import col

      df.select(vector_to_dense_array(col("features")))
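
      For completeness, an end-to-end sketch of the proposed usage. The name vector_to_dense_array above is the proposal's placeholder; the converter that pyspark.ml.functions exposes as of Spark 3.0 (this ticket's fix version) is vector_to_array, which the snippet below uses. The column and alias names are illustrative:

      from pyspark.ml.functions import vector_to_array
      from pyspark.ml.linalg import Vectors
      from pyspark.sql import SparkSession
      from pyspark.sql.functions import col

      spark = SparkSession.builder.getOrCreate()
      df = spark.createDataFrame(
          [(Vectors.dense([1.0, 2.0, 3.0]),),
           (Vectors.sparse(3, {0: 4.0}),)],
          ["features"])

      # The conversion runs in the JVM; no per-row Python UDF round trip.
      df.select(vector_to_array(col("features")).alias("features_arr")).show()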
      

      cc: Weichen Xu


              People

              • Assignee: Weichen Xu (weichenxu123)
              • Reporter: Xiangrui Meng (mengxr)
              • Votes: 0
              • Watchers: 3
