[SPARK-17161] Add PySpark-ML JavaWrapper convenience function to create py4j JavaArrays - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: 2.2.0
Fix Version/s: 2.2.0
Component/s: ML, PySpark
Labels: None

Description

Spark ML often has classes whose constructors take a Scala Array. To expose the same API in Python, a Java-friendly alternate constructor currently has to exist so that py4j can call it when converting from a Python list. This is because the current conversion in PySpark's _py2java turns a list into a java.util.ArrayList, as shown in this error message:

Py4JError: An error occurred while calling None.org.apache.spark.ml.feature.CountVectorizerModel. Trace:
py4j.Py4JException: Constructor org.apache.spark.ml.feature.CountVectorizerModel([class java.util.ArrayList]) does not exist
	at py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:179)
	at py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:196)
	at py4j.Gateway.invoke(Gateway.java:235)

The alternate constructor can be avoided by creating a py4j JavaArray with new_array. This type is compatible with the Scala Array currently used in classes like CountVectorizerModel and StringIndexerModel.

Most of the boilerplate Python code for this can be put in a convenience function in ml.JavaWrapper, giving a clean way to construct ML objects without adding special constructors.
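As a rough illustration of the idea described above, a helper along these lines could allocate the array and copy the list into it. This is a minimal sketch, not the code from the linked pull request: the function name `new_java_array` and its parameter order are hypothetical, and the gateway is passed in explicitly so the helper does not depend on an active SparkContext. The only py4j call it relies on is `JavaGateway.new_array(java_class, *dimensions)`.

```python
def new_java_array(gateway, java_class, pylist):
    """Copy a Python list into a newly allocated py4j JavaArray.

    Hypothetical helper for illustration only.

    gateway    -- a py4j JavaGateway (in PySpark, typically sc._gateway)
    java_class -- the JVM element class of the array
    pylist     -- the Python list whose items are copied into the array
    """
    # Allocate a one-dimensional JVM array sized to the input list.
    java_array = gateway.new_array(java_class, len(pylist))
    # Copy element by element; py4j converts each item on assignment.
    for i, item in enumerate(pylist):
        java_array[i] = item
    return java_array
```

Assuming an active SparkContext `sc`, a call might look like `new_java_array(sc._gateway, sc._gateway.jvm.java.lang.String, ["a", "b", "c"])`. The result could then be passed to a constructor that expects an Array of String rather than a java.util.ArrayList.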

Issue Links

is related to: SPARK-15009 PySpark CountVectorizerModel should be able to construct from vocabulary list (Resolved)

links to: [GitHub] Pull Request #14725 (BryanCutler)

People

Assignee: Bryan Cutler

Reporter: Bryan Cutler

Votes: 0

Watchers: 2

Dates

Created: 19/Aug/16 23:10

Updated: 03/Feb/17 21:41

Resolved: 03/Feb/17 13:22