[SPARK-24632] Allow 3rd-party libraries to use pyspark.ml abstractions for Java wrappers for persistence - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 3.1.0
Fix Version/s: None
Component/s: ML, PySpark
Labels: None

Description

This is a follow-up to SPARK-17025, which allowed users to implement Python PipelineStages in 3rd-party libraries, include them in Pipelines, and use Pipeline persistence. This task is to make it easier for 3rd-party libraries to write PipelineStages in Java and then use pyspark.ml abstractions to create Python wrappers around those Java classes. This is currently possible, except that users hit bugs around persistence.

I spent a bit of time thinking about this and wrote up my thoughts and a proposal in the doc linked below. Summary of the proposal:

Require that 3rd-party libraries with Java classes with Python wrappers implement a trait which provides the corresponding Python classpath in some field:

trait PythonWrappable {
  def pythonClassPath: String = …
}
class MyJavaType extends PythonWrappable

This will not be required for MLlib wrappers, which we can handle specially.
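
To illustrate how the Python side might consume such a field, here is a minimal sketch, not the proposed implementation. The helper name resolve_python_class and its parameters are hypothetical. It assumes the Java stage reports its wrapper as a dotted "module.ClassName" string, and that MLlib classes, which do not carry the field, are handled by mapping the org.apache.spark package prefix to pyspark.

```python
import importlib
from typing import Optional


def resolve_python_class(java_class_name: str,
                         python_class_path: Optional[str] = None) -> type:
    """Return the Python wrapper class for a Java stage.

    Hypothetical helper: `python_class_path` stands in for the value a
    PythonWrappable Java class would report through `pythonClassPath`.
    """
    if python_class_path is None:
        # Assumed special case for MLlib: derive the Python path from the
        # Java class name by swapping the package prefix.
        python_class_path = java_class_name.replace("org.apache.spark", "pyspark")

    # Split "package.module.ClassName" into its module and class parts.
    module_name, _, class_name = python_class_path.rpartition(".")
    if not module_name:
        raise ValueError(
            "Expected a dotted 'module.ClassName' path, got %r" % python_class_path
        )

    # Import the module and look up the class by name.
    module = importlib.import_module(module_name)
    return getattr(module, class_name)
```

For example, resolve_python_class("com.example.MyJavaType", "collections.OrderedDict") returns the OrderedDict class; the standard-library class stands in here for a real 3rd-party wrapper so the sketch runs without Spark installed.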

One issue for this task will be that we may have trouble writing unit tests. They would ideally test a Java class + Python wrapper class pair sitting outside of pyspark.

Issue Links

relates to

SPARK-17025 Cannot persist PySpark ML Pipeline model that includes custom Transformer (Resolved)

links to

Design sketch

People

Assignee: Unassigned

Reporter: Joseph K. Bradley

Votes: 1

Watchers: 9

Dates

Created: 22/Jun/18 18:18

Updated: 28/Dec/20 20:12