Details
- Type: Umbrella
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- 2.2.0
- None
Description
We've run into a few cases where ML components do not work with streaming DataFrames (for prediction). This ticket aggregates the known cases in one place and provides a place to discuss possible fixes.
Failing cases:
1) VectorAssembler where one of the inputs is a VectorUDT column with no metadata.
Possible fixes:
See SPARK-22346 for details.
2) OneHotEncoder where the input is a column with no metadata.
Possible fixes:
a) Make OneHotEncoder an estimator (SPARK-13030).
b) Allow user to set the cardinality of OneHotEncoder.
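Both failing cases appear to share a root cause: without column metadata, the output vector size can only be found by inspecting the data, which a streaming query cannot do up front. The plain-Python sketch below illustrates the idea behind fix 2b and the size-declaration approach for case 1. It is not Spark code: `one_hot`, `assemble`, and `transform_stream` are hypothetical names, and the encoding keeps every category slot, which is a simplification.

```python
from typing import Iterable, Iterator, List, Sequence, Tuple


def one_hot(index: int, cardinality: int) -> List[float]:
    """Encode a category index using a cardinality supplied by the user."""
    if not 0 <= index < cardinality:
        raise ValueError(f"index {index} is outside [0, {cardinality})")
    vec = [0.0] * cardinality
    vec[index] = 1.0
    return vec


def assemble(columns: Sequence[Sequence[float]], sizes: Sequence[int]) -> List[float]:
    """Concatenate vector columns, checking each against its declared size."""
    if len(columns) != len(sizes):
        raise ValueError("one declared size is required per input column")
    out: List[float] = []
    for vec, size in zip(columns, sizes):
        if len(vec) != size:
            raise ValueError(f"expected a vector of size {size}, got {len(vec)}")
        out.extend(vec)
    return out


def transform_stream(
    records: Iterable[Tuple[int, Sequence[float]]],
    cardinality: int,
    feature_size: int,
) -> Iterator[List[float]]:
    """Transform records one at a time; no pass over the full input is needed."""
    for category_index, features in records:
        encoded = one_hot(category_index, cardinality)
        yield assemble([encoded, features], [cardinality, feature_size])


if __name__ == "__main__":
    # A generator stands in for an unbounded stream of (category, features) rows.
    stream = iter([(0, [1.5, 2.0]), (2, [0.0, 3.0])])
    for row in transform_stream(stream, cardinality=3, feature_size=2):
        print(row)
```

Because the sizes are declared rather than inferred, each record can be transformed independently, and a record that violates the declared size fails immediately instead of silently changing the output width.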
Issue Links
- contains
  - SPARK-22888 OneVsRestModel does not work with Structured Streaming (Resolved)
  - SPARK-24465 LSHModel should support Structured Streaming for transform (Resolved)
  - SPARK-22644 Make ML testsuite support StructuredStreaming test (Resolved)
- is related to
  - SPARK-23037 RFormula should not use deprecated OneHotEncoder and should include VectorSizeHint in pipeline (Resolved)
  - SPARK-22346 Update VectorAssembler to work with Structured Streaming (Resolved)
  - SPARK-21748 Migrate the implementation of HashingTF from MLlib to ML (Resolved)
- relates to
  - SPARK-19141 VectorAssembler metadata causing memory issues (Resolved)
  - SPARK-13030 Change OneHotEncoder to Estimator (Resolved)
  - SPARK-22735 Add VectorSizeHint to ML features documentation (Resolved)
  - SPARK-23048 Update mllib docs to replace OneHotEncoder with OneHotEncoderEstimator (Resolved)