Spark / SPARK-20099

Add transformSchema to pyspark.ml


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 2.1.0
    • Fix Version/s: None
    • Component/s: ML, PySpark

    Description

      Python's ML API currently lacks the PipelineStage abstraction. Its main purpose is to provide transformSchema(), which checks a Pipeline's schemas up front so that failures surface early.

      As mentioned in https://github.com/apache/spark/pull/17218, it would also be useful in Python for checking Params in Python wrappers for Scala implementations; in these, transformSchema() would pass the Params from Python to Scala, where the Param values could be validated. This could prevent late failures from bad Param settings during Pipeline execution, while still keeping Param validation on the Scala side only.

      This issue is for adding transformSchema() to pyspark.ml. If it is reasonable, we could create a PipelineStage abstraction, but it would probably be fine to add transformSchema() directly to Transformer and Estimator rather than creating PipelineStage.
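      To illustrate the idea, here is a minimal sketch of what an early schema check could look like on the Python side. HypotheticalTokenizer, its constructor arguments, and its validation logic are made up for illustration and are not the actual pyspark.ml API; only the pyspark.sql.types classes are real.

      {code:python}
      # Sketch only: a hypothetical transformSchema() that lets a Pipeline
      # validate column names and types before any data is processed.
      from pyspark.sql.types import ArrayType, StringType, StructField, StructType


      class HypotheticalTokenizer:
          """Toy stand-in for a pyspark.ml Transformer with transformSchema()."""

          def __init__(self, inputCol, outputCol):
              self.inputCol = inputCol
              self.outputCol = outputCol

          def transformSchema(self, schema):
              # Fail early if the input column is missing or has the wrong type.
              if self.inputCol not in schema.fieldNames():
                  raise ValueError(
                      "Input column '{}' not found in schema {}".format(
                          self.inputCol, schema.simpleString()))
              inputType = schema[self.inputCol].dataType
              if not isinstance(inputType, StringType):
                  raise TypeError(
                      "Column '{}' must be StringType, got {}".format(
                          self.inputCol, inputType.simpleString()))
              # Return the schema this stage would produce, so a Pipeline could
              # chain transformSchema() across all stages before running a job.
              return StructType(
                  schema.fields
                  + [StructField(self.outputCol, ArrayType(StringType()), True)])


      schema = StructType([StructField("text", StringType(), True)])
      stage = HypotheticalTokenizer(inputCol="text", outputCol="words")
      print(stage.transformSchema(schema).simpleString())
      # struct<text:string,words:array<string>>
      {code}

      With a misconfigured Param (e.g. inputCol="txt"), the check above would raise before any Spark job runs, which is the kind of early failure transformSchema() is meant to provide.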

People

    Assignee: Unassigned
    Reporter: Joseph K. Bradley (josephkb)
    Votes: 1
    Watchers: 2

Dates

    Created:
    Updated:
    Resolved: