  Spark / SPARK-17161

Add PySpark-ML JavaWrapper convenience function to create py4j JavaArrays


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.2.0
    • Fix Version/s: 2.2.0
    • Component/s: ML, PySpark
    • Labels: None

    Description

      Spark ML often has classes that take a Scala Array in their constructor. To expose the same API in Python, a Java-friendly alternate constructor currently has to be added, because PySpark's _py2java converts a Python list into a java.util.ArrayList, which py4j cannot match against the Array parameter, as shown in this error message:

      Py4JError: An error occurred while calling None.org.apache.spark.ml.feature.CountVectorizerModel. Trace:
      py4j.Py4JException: Constructor org.apache.spark.ml.feature.CountVectorizerModel([class java.util.ArrayList]) does not exist
      	at py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:179)
      	at py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:196)
      	at py4j.Gateway.invoke(Gateway.java:235)
      

      Adding an alternate constructor can be avoided by instead building a py4j JavaArray with the gateway's new_array method. That type is compatible with the Scala Array parameters currently used in classes like CountVectorizerModel and StringIndexerModel.
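
      As a minimal sketch (assuming an active SparkContext and an illustrative vocabulary list), the JavaArray can be built by hand through the py4j gateway and passed straight to the existing constructor:

      from pyspark import SparkContext

      sc = SparkContext.getOrCreate()
      gateway = sc._gateway

      # Illustrative vocabulary; in practice this comes from the caller
      vocab = ["a", "b", "c"]

      # Allocate a Java String[] of the right length and copy the elements in
      java_array = gateway.new_array(gateway.jvm.java.lang.String, len(vocab))
      for i, term in enumerate(vocab):
          java_array[i] = term

      # The String[] matches the Scala Array[String] constructor parameter,
      # so the ArrayList mismatch above does not occur
      java_model = sc._jvm.org.apache.spark.ml.feature.CountVectorizerModel(java_array)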

      Most of the boilerplate Python code to do this can be put in a convenience function inside ml.JavaWrapper, giving a clean way to construct these ML objects without adding special constructors.
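
      A rough sketch of what such a helper could look like (the name _new_java_array and its signature here are illustrative, not the final API):

      from pyspark import SparkContext


      class JavaWrapper(object):
          # ... existing wrapper methods omitted ...

          @staticmethod
          def _new_java_array(pylist, java_class):
              """Create a py4j JavaArray of java_class from a Python list."""
              sc = SparkContext._active_spark_context
              java_array = sc._gateway.new_array(java_class, len(pylist))
              for i, value in enumerate(pylist):
                  java_array[i] = value
              return java_array

      A Python wrapper could then call, e.g., JavaWrapper._new_java_array(vocab, sc._gateway.jvm.java.lang.String) and hand the result directly to the CountVectorizerModel constructor shown above.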

            People

              Assignee: Bryan Cutler (bryanc)
              Reporter: Bryan Cutler (bryanc)
              Votes: 0
              Watchers: 2
