Recently, I ran into some practical scenarios that required mapping data to another distribution.
I found that QuantileTransformer in scikit-learn was exactly what I needed: I fitted a model locally on a sampled dataset, then broadcast it to transform the whole dataset in PySpark.
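The fit-on-a-sample, transform-everything pattern can be sketched in plain NumPy (this emulates what QuantileTransformer does in uniform-output mode; the variable names and the sample size are illustrative, not from the original setup):

```python
import numpy as np

rng = np.random.default_rng(0)
full = rng.exponential(size=100_000)        # stands in for the whole dataset
sample = rng.choice(full, size=5_000)       # a locally drawn sample

# Fit: estimate quantile landmarks on the sample only.
n_quantiles = 100
refs = np.linspace(0.0, 1.0, n_quantiles)
landmarks = np.quantile(sample, refs)

# Transform: map every value to its approximate CDF position by
# linear interpolation between the fitted landmarks. In PySpark this
# step would run inside a UDF over the broadcast landmarks.
transformed = np.interp(full, landmarks, refs)
```

Because only the small `landmarks` array needs to be shipped to executors, broadcasting the fitted model is cheap regardless of the full dataset's size.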
After that, I implemented QuantileTransformer as a new Estimator on top of Spark. The implementation follows scikit-learn's, but there are still several differences:
1. it uses QuantileSummaries for approximate quantiles, regardless of the dataset size;
2. it uses linear interpolation, with logic similar to the existing IsotonicRegression, while scikit-learn uses a bi-directional interpolation;
3. when skipZero=true, it treats sparse vectors just like dense ones, while scikit-learn has two different code paths for sparse and dense datasets.
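The third difference can be illustrated with a minimal sketch: with zero-skipping enabled, zeros are dropped before fitting, so a dense vector and the non-zero values of its sparse representation produce identical quantile landmarks (the function and parameter names here are hypothetical, not the actual Estimator API):

```python
import numpy as np

def fit_quantiles(values, n_quantiles=5, skip_zero=False):
    """Fit quantile landmarks on a 1-D collection of values.

    With skip_zero=True, zeros are filtered out first, so it does not
    matter whether the input came from a dense vector or from the
    explicit entries of a sparse one.
    """
    v = np.asarray(values, dtype=float)
    if skip_zero:
        v = v[v != 0.0]
    refs = np.linspace(0.0, 1.0, n_quantiles)
    return np.quantile(v, refs)

dense = [0.0, 0.0, 1.0, 2.0, 3.0, 4.0]      # dense vector with stored zeros
sparse_values = [1.0, 2.0, 3.0, 4.0]        # non-zero entries of the same vector

q_dense = fit_quantiles(dense, skip_zero=True)
q_sparse = fit_quantiles(sparse_values, skip_zero=True)
# Both representations yield the same landmarks under skip_zero=True.
```

This is what "treat sparse vectors just like dense ones" means in practice: a single code path, rather than scikit-learn's separate dense and sparse branches.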