Spark / SPARK-24467

VectorAssemblerEstimator

Details

    • Type: New Feature
    • Status: In Progress
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.0.0
    • Fix Version/s: None
    • Component/s: ML
    • Labels: None

    Description

      In SPARK-22346, I believe I made a wrong API decision: I recommended adding `VectorSizeHint` instead of making `VectorAssembler` into an Estimator, since I thought the latter option would break most workflows. However, I should have proposed:

      • Add a Param to VectorAssembler for specifying the sizes of Vectors in the inputCols. This Param can be optional. If not given, then VectorAssembler will behave as it does now. If given, then VectorAssembler can use that info instead of figuring out the Vector sizes via metadata or examining Rows in the data (though it could do consistency checks).
      • Add a VectorAssemblerEstimator which gets the Vector lengths from data and produces a VectorAssembler with the vector lengths Param specified.

      This will not break existing workflows. Migrating to VectorAssemblerEstimator will be easier than adding VectorSizeHint since it will not require users to manually input Vector lengths.
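
      The two-part proposal above can be illustrated with a minimal, Spark-free sketch in plain Python. Everything in it is hypothetical: the class names `SketchVectorAssembler` and `SketchVectorAssemblerEstimator`, the `input_sizes` parameter, and the choice to read lengths from the first row only are assumptions made for illustration, not the Spark API or the eventual design.

```python
from typing import Dict, List, Optional, Sequence, Union

# A row maps a column name to either a scalar or a vector (list of floats).
Value = Union[float, List[float]]
Row = Dict[str, Value]


def _as_list(value: Value) -> List[float]:
    """Treat a scalar as a length-1 vector; copy vectors as-is."""
    if isinstance(value, (list, tuple)):
        return list(value)
    return [float(value)]


class SketchVectorAssembler:
    """Hypothetical stand-in for VectorAssembler (not the Spark class)."""

    def __init__(self, input_cols: Sequence[str],
                 input_sizes: Optional[Sequence[int]] = None):
        self.input_cols = list(input_cols)
        # Assumed optional Param: one expected size per input column.
        # When None, the assembler behaves as it did before the Param existed.
        self.input_sizes = None if input_sizes is None else list(input_sizes)
        if self.input_sizes is not None and len(self.input_sizes) != len(self.input_cols):
            raise ValueError("input_sizes must have one entry per input column")

    def transform_row(self, row: Row) -> List[float]:
        out: List[float] = []
        for i, col in enumerate(self.input_cols):
            values = _as_list(row[col])
            # Consistency check, only possible when sizes were given.
            if self.input_sizes is not None and len(values) != self.input_sizes[i]:
                raise ValueError(
                    f"column {col!r}: expected size {self.input_sizes[i]}, got {len(values)}")
            out.extend(values)
        return out


class SketchVectorAssemblerEstimator:
    """Hypothetical Estimator: learns sizes from data, returns an assembler."""

    def __init__(self, input_cols: Sequence[str]):
        self.input_cols = list(input_cols)

    def fit(self, rows: Sequence[Row]) -> SketchVectorAssembler:
        if not rows:
            raise ValueError("cannot infer vector sizes from an empty dataset")
        # Assumption: the first row is representative of every row.
        sizes = [len(_as_list(rows[0][col])) for col in self.input_cols]
        return SketchVectorAssembler(self.input_cols, sizes)


rows = [
    {"age": 31.0, "features": [0.1, 0.2, 0.3]},
    {"age": 45.0, "features": [0.4, 0.5, 0.6]},
]
assembler = SketchVectorAssemblerEstimator(["age", "features"]).fit(rows)
print(assembler.input_sizes)             # [1, 3]
print(assembler.transform_row(rows[1]))  # [45.0, 0.4, 0.5, 0.6]
```

      In this sketch, constructing `SketchVectorAssembler` without `input_sizes` keeps the old behavior, while the estimator fills the sizes in automatically, so the user never types them by hand.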

      Note: Even with this Estimator, VectorSizeHint might prove useful for other things in the future which require vector length metadata, so we could consider keeping it rather than deprecating it.

            People

              Assignee: Unassigned
              Reporter: Joseph K. Bradley (josephkb)
              Votes: 0
              Watchers: 8
