In batch mode, VectorAssembler can take multiple columns of VectorType and assemble them into a new column of VectorType containing the concatenated vectors. In streaming mode, this transformation can fail because VectorAssembler does not have enough information to produce metadata (an AttributeGroup) for the new column. Because VectorAssembler is such a ubiquitous part of MLlib pipelines, this issue effectively means Spark Structured Streaming does not support prediction using MLlib pipelines.
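To make the failure mode concrete, here is an illustrative sketch in plain Python (not the Spark API) of what the assembly step needs: the per-row concatenation itself is trivial, but producing the output metadata requires every input column's size up front.

```python
# Per-row behavior of the assembler in batch mode: concatenate the input
# vectors into one output vector.
def assemble(inputs):
    """Concatenate a list of per-column vectors into one output vector."""
    return [x for vec in inputs for x in vec]

# Producing the output AttributeGroup requires the size of every input
# column. In batch mode Spark can get sizes from column metadata or by
# looking at the first row of data; in streaming mode neither source is
# available before execution, which is where the failure occurs. None
# stands in for an unknown size here.
def output_size(input_sizes):
    """Return the assembled size, or None if any input size is unknown."""
    if any(s is None for s in input_sizes):
        return None
    return sum(input_sizes)
```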
I've created this ticket so we can discuss ways to potentially improve VectorAssembler. Please let me know if there are any issues I have not considered or potential fixes I haven't outlined. I'm happy to submit a patch once I know which strategy is the best approach.
1) Replace VectorAssembler with an estimator/model pair, as was recently done with OneHotEncoder (SPARK-13030). The estimator can "learn" the size of the input vectors during training and save it for use during prediction.
- Possibly the simplest of the potential fixes.
- We'll need to deprecate the current VectorAssembler.
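The estimator/model split could look something like the following plain-Python sketch (the names SizedAssembler and SizedAssemblerModel are hypothetical, not Spark classes): fit() learns the input vector sizes from the batch training data and bakes them into the model, so transform-time metadata never depends on the data.

```python
class SizedAssemblerModel:
    """Model produced by fitting: input sizes are fixed at fit time."""
    def __init__(self, input_sizes):
        self.input_sizes = input_sizes
        # Known at construction time, so output metadata can be produced
        # even for a streaming DataFrame.
        self.output_size = sum(input_sizes)

    def transform(self, row):
        # row is a list of per-column vectors for one record.
        if [len(v) for v in row] != self.input_sizes:
            raise ValueError("input vector sizes do not match fitted sizes")
        return [x for vec in row for x in vec]


class SizedAssembler:
    """Estimator: inspects the (batch) training data once to learn sizes."""
    @staticmethod
    def fit(training_rows):
        return SizedAssemblerModel([len(v) for v in training_rows[0]])
```

This mirrors the OneHotEncoder change, where the estimator learns category counts at fit time instead of inferring them at transform time.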
2) Drop the metadata (ML Attributes) from Vector columns. This is a pretty major change, but it could be done in stages. We could first ensure that metadata is not used during prediction and allow VectorAssembler to drop metadata for streaming DataFrames. Going forward, it would be important not to use any metadata on Vector columns for any prediction tasks.
- Potentially an easy short-term fix for VectorAssembler (drop metadata for vector columns in streaming).
- The current Attributes implementation is also causing other issues, e.g. SPARK-19141.
- Fully removing ML Attributes would be a major refactor of MLlib and would most likely require breaking changes.
- A partial removal of ML Attributes (e.g. ensure ML Attributes are not used during transform, only during fit) might be tricky. This would require tests or some other enforcement mechanism to prevent regressions.
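The short-term variant of this idea can be sketched in plain Python as follows. Here attr_groups is a hypothetical stand-in for per-input-column AttributeGroups (a list of attribute names per column, with None for missing metadata); the point is that the streaming path drops metadata instead of raising.

```python
def assemble_metadata(attr_groups, streaming):
    """Return the output column's attribute names, or None to drop metadata.

    attr_groups: one entry per input column; a list of attribute names,
    or None when that column has no metadata (the streaming case).
    """
    if streaming or any(g is None for g in attr_groups):
        return None  # drop metadata rather than failing the query
    # Batch path with full metadata: concatenate the attribute lists,
    # matching the concatenation of the vectors themselves.
    return [name for group in attr_groups for name in group]
```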
3) Require Vector columns to have fixed-length vectors. Most MLlib transformers that produce vectors already include the size of the vector in the column metadata. This change would deprecate APIs that allow creating a vector column of unknown length and replace them with equivalents that enforce a fixed size.
- We already treat vectors as fixed-size; for example, VectorAssembler assumes the input & output columns are fixed-size vectors and creates metadata accordingly. In the spirit of "explicit is better than implicit," we would be codifying something we already assume.
- This could potentially enable performance optimizations that are only possible if the Vector size of a column is fixed & known.
- This would require breaking changes.
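As a rough illustration of option 3 (hypothetical, not a Spark API), a vector column that must be created with an explicit size can validate every row and always report its size without scanning the data; the unknown-length construction path would be deprecated rather than extended:

```python
class FixedSizeVectorColumn:
    """A vector column whose length is declared up front and enforced."""
    def __init__(self, size, rows):
        # Every row must match the declared size, so downstream code
        # (e.g. an assembler) can trust the size without reading data.
        if any(len(r) != size for r in rows):
            raise ValueError(f"all vectors must have length {size}")
        self.size = size
        self.rows = list(rows)
```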