Description
Several Spark ML classes take a Scala Array in a constructor. To expose the same API in Python, a Java-friendly alternate constructor currently has to exist so that py4j can find a matching signature when a Python list is passed. This is because the current conversion in PySpark's _py2java turns the list into a java.util.ArrayList, as shown in this error message:
Py4JError: An error occurred while calling None.org.apache.spark.ml.feature.CountVectorizerModel. Trace:
py4j.Py4JException: Constructor org.apache.spark.ml.feature.CountVectorizerModel([class java.util.ArrayList]) does not exist
    at py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:179)
    at py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:196)
    at py4j.Gateway.invoke(Gateway.java:235)
Creating an alternate constructor can be avoided by creating a py4j JavaArray using new_array. This type is compatible with the Scala Array currently used in classes like CountVectorizerModel and StringIndexerModel.
Most of the boilerplate Python code needed for this can be put in a convenience function in ml.JavaWrapper, giving a clean way to construct ML objects without adding special constructors.
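A minimal sketch of what such a helper might look like is given here. The function name to_java_string_array is hypothetical and is not the name proposed in this issue. It assumes the caller passes in a py4j JavaGateway (in PySpark, the SparkContext's _gateway attribute) and that the elements are strings.

```python
def to_java_string_array(gateway, values):
    """Copy a Python list of str into a py4j JavaArray (java.lang.String[]).

    `gateway` is assumed to be a py4j JavaGateway; the helper name itself
    is hypothetical and only illustrates the approach described above.
    """
    # new_array(java_class, *dimensions) allocates a Java array on the JVM side.
    java_array = gateway.new_array(gateway.jvm.java.lang.String, len(values))
    # JavaArray supports item assignment, so fill it element by element.
    for i, value in enumerate(values):
        java_array[i] = value
    return java_array
```

With a running SparkContext sc, the result could then be passed straight to the Scala constructor, for example sc._jvm.org.apache.spark.ml.feature.CountVectorizerModel(to_java_string_array(sc._gateway, ["a", "b", "c"])), since py4j sees a String[] rather than a java.util.ArrayList.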
Issue Links
- is related to SPARK-15009 PySpark CountVectorizerModel should be able to construct from vocabulary list (Resolved)