SPARK-30202: Implement QuantileTransform

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 3.0.0
    • Fix Version/s: None
    • Component/s: ML, PySpark
    • Labels:
      None

      Description

      Recently, I encountered some practical scenarios that required mapping data to another distribution.

      I found that QuantileTransformer in sklearn was what I needed, so I fitted a model locally on a sampled dataset and broadcast it to transform the whole dataset in PySpark.
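      The sample-then-broadcast workaround described above might look roughly like the following. This is a minimal sketch, not the code actually used: the synthetic data, the sample size, the `n_quantiles` value, and the helper `transform_partition` are all assumptions made for illustration, and the Spark wiring is shown only in comments.

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

# Synthetic stand-in for the full dataset (assumption: a single numeric column).
rng = np.random.default_rng(0)
full = rng.exponential(size=(10_000, 1))

# Fit locally on a small sample rather than on the whole dataset.
sample = full[rng.choice(len(full), size=1_000, replace=False)]
qt = QuantileTransformer(n_quantiles=100, output_distribution="uniform", random_state=0)
qt.fit(sample)


def transform_partition(rows, model):
    """Hypothetical helper: apply a fitted transformer to one partition of floats."""
    batch = np.asarray(list(rows), dtype=float).reshape(-1, 1)
    if batch.size == 0:
        return iter([])
    return iter(model.transform(batch)[:, 0].tolist())


# In PySpark, the fitted model would be broadcast once and reused per partition:
#   bc = sc.broadcast(qt)
#   transformed = rdd.mapPartitions(lambda it: transform_partition(it, bc.value))
```

      The fitted model is small (only the quantile arrays), which is what makes broadcasting it to every executor practical.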

      After that, I implemented QuantileTransform as a new Estimator atop Spark. The implementation follows scikit-learn's, but there are still several differences:

      1. It uses QuantileSummaries for approximation, regardless of the size of the dataset;

      2. It uses linear interpolation, with logic similar to the existing IsotonicRegression, while scikit-learn uses a bi-directional interpolation;

      3. When skipZero=true, it treats sparse vectors just like dense ones, while scikit-learn has two different code paths for sparse and dense datasets.
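      To make the interpolation difference in point 2 concrete, here is a minimal pure-Python sketch of a quantile transform to a uniform distribution using one-directional linear interpolation. It is not the proposed Spark implementation: exact quantiles of a sorted list stand in for the approximate QuantileSummaries, the names `fit_quantiles` and `transform` are hypothetical, and skipZero handling is left out.

```python
from bisect import bisect_right


def fit_quantiles(values, num_quantiles=5):
    """Return (refs, quantiles): evenly spaced probabilities and the matching values.

    Assumption: exact quantiles from a sorted copy, as a stand-in for an
    approximate quantile summary.
    """
    ordered = sorted(values)
    n = len(ordered)
    refs = [i / (num_quantiles - 1) for i in range(num_quantiles)]
    quantiles = []
    for p in refs:
        pos = p * (n - 1)          # fractional index into the sorted data
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        quantiles.append(ordered[lo] + frac * (ordered[hi] - ordered[lo]))
    return refs, quantiles


def transform(x, refs, quantiles):
    """Map x into [0, 1] by linear interpolation between neighbouring quantiles."""
    if x <= quantiles[0]:
        return refs[0]
    if x >= quantiles[-1]:
        return refs[-1]
    i = bisect_right(quantiles, x)  # quantiles[i - 1] <= x < quantiles[i]
    x0, x1 = quantiles[i - 1], quantiles[i]
    y0, y1 = refs[i - 1], refs[i]
    return y0 + (x - x0) / (x1 - x0) * (y1 - y0)


refs, quantiles = fit_quantiles(range(101), num_quantiles=5)
mapped = [transform(x, refs, quantiles) for x in (-5, 10, 50, 200)]
```

      When several quantiles share the same value, this one-directional lookup always resolves to one end of the tied range. As far as I understand it, scikit-learn's bi-directional variant averages an ascending and a descending interpolation so that ties land in the middle instead.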

              People

              • Assignee: Unassigned
              • Reporter: podongfeng zhengruifeng