Update VectorAssembler to work with Structured Streaming



      The issue
      In batch mode, VectorAssembler can take multiple columns of VectorType and assemble them into a new column of VectorType containing the concatenated vectors. In streaming mode, this transformation can fail because VectorAssembler does not have enough information to produce metadata (AttributeGroup) for the new column. Because VectorAssembler is such a ubiquitous part of MLlib pipelines, this issue effectively means that Spark Structured Streaming does not support prediction using MLlib pipelines.

      I've created this ticket so we can discuss ways to potentially improve VectorAssembler. Please let me know if there are any issues I have not considered or potential fixes I haven't outlined. I'm happy to submit a patch once I know which strategy is the best approach.

      Potential fixes
      1) Replace VectorAssembler with an estimator/model pair, as was recently done with OneHotEncoder (SPARK-13030). The Estimator can "learn" the sizes of the input vectors during training and save them for use during prediction.


      • Possibly simplest of the potential fixes


      • We'll need to deprecate the current VectorAssembler
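
      To make option 1 concrete, here is a minimal sketch of the estimator/model split in plain Python. It is not Spark code: the names AssemblerEstimator and AssemblerModel are hypothetical, and rows are modeled as plain dicts of lists. It only illustrates the idea that sizes are learned once at fit time, so the output size is known at transform time without inspecting the data.

```python
from typing import Dict, List, Sequence

# Assumption: a "row" is modeled as a dict mapping column name -> vector values.
Row = Dict[str, Sequence[float]]


class AssemblerModel:
    """Hypothetical fitted model holding the vector size learned per input column."""

    def __init__(self, input_cols: List[str], sizes: Dict[str, int]) -> None:
        self.input_cols = input_cols
        self.sizes = sizes

    @property
    def output_size(self) -> int:
        # Known up front, without looking at any row being transformed.
        return sum(self.sizes[col] for col in self.input_cols)

    def transform(self, row: Row) -> List[float]:
        # Concatenate the input vectors, rejecting any with an unexpected size.
        out: List[float] = []
        for col in self.input_cols:
            vec = row[col]
            if len(vec) != self.sizes[col]:
                raise ValueError(
                    f"column {col!r}: expected size {self.sizes[col]}, got {len(vec)}"
                )
            out.extend(vec)
        return out


class AssemblerEstimator:
    """Hypothetical estimator that learns input vector sizes from training rows."""

    def __init__(self, input_cols: List[str]) -> None:
        self.input_cols = input_cols

    def fit(self, rows: Sequence[Row]) -> AssemblerModel:
        if not rows:
            raise ValueError("cannot learn vector sizes from an empty dataset")
        # Take the sizes from the first row, then check every row agrees.
        sizes = {col: len(rows[0][col]) for col in self.input_cols}
        for row in rows:
            for col in self.input_cols:
                if len(row[col]) != sizes[col]:
                    raise ValueError(f"column {col!r} has inconsistent vector sizes")
        return AssemblerModel(self.input_cols, sizes)
```

      In this sketch, fit() is the only step that needs to see data, which is what would let the fitted model run on a streaming input.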

      2) Drop the metadata (ML Attributes) from Vector columns. This is a pretty major change, but it could be done in stages. We could first ensure that metadata is not used during prediction and allow the VectorAssembler to drop metadata for streaming dataframes. Going forward, it would be important not to use any metadata on Vector columns for any prediction tasks.


      • Potentially an easy short-term fix for VectorAssembler
        (drop metadata for vector columns in streaming).
      • The current Attributes implementation is also causing other issues, e.g. SPARK-19141.


      • Fully removing ML Attributes would be a major refactor of MLlib and would most likely require breaking changes.
      • A partial removal of ML Attributes (e.g. ensuring ML Attributes are not used during transform, only during fit) might be tricky. This would require testing or some other enforcement mechanism to prevent regressions.

      3) Require Vector columns to have fixed length vectors. Most mllib transformers that produce vectors already include the size of the vector in the column metadata. This change would be to deprecate APIs that allow creating a vector column of unknown length and replace those APIs with equivalents that enforce a fixed size.


      • We already treat vectors as fixed size; for example, VectorAssembler assumes the input & output cols are fixed size vectors and creates metadata accordingly. In the spirit of explicit is better than implicit, we would be codifying something we already assume.
      • This could potentially enable performance optimizations that are only possible if the Vector size of a column is fixed & known.


      • This would require breaking changes.
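
      As a rough illustration of option 3, the sketch below models a vector column whose size is declared when the column is created and enforced on every write. It is plain Python, not an existing Spark API: FixedSizeVectorColumn and assembled_size are hypothetical names used only for this example.

```python
from dataclasses import dataclass, field
from typing import List, Sequence


@dataclass
class FixedSizeVectorColumn:
    """Hypothetical column whose vector size is declared at creation time."""

    name: str
    size: int
    values: List[List[float]] = field(default_factory=list)

    def append(self, vec: Sequence[float]) -> None:
        # Enforce the declared size instead of assuming it implicitly.
        if len(vec) != self.size:
            raise ValueError(
                f"column {self.name!r}: expected size {self.size}, got {len(vec)}"
            )
        self.values.append(list(vec))


def assembled_size(columns: Sequence[FixedSizeVectorColumn]) -> int:
    # The size of the concatenated vector follows from the declared sizes alone,
    # so it can be computed before any data arrives.
    return sum(col.size for col in columns)
```

      Under this assumption, an assembler would never need to inspect rows to build its output metadata, since every input column carries its size explicitly.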


