Recently, I ran into some practical scenarios that required mapping data to another distribution.
I found that QuantileTransformer in scikit-learn was exactly what I needed: I fitted a model locally on a sampled dataset, then broadcast it to transform the whole dataset in PySpark.
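The fit-on-a-sample, transform-everything pattern can be sketched in plain NumPy (this emulates what QuantileTransformer does in uniform-output mode; the variable names and the sample size are illustrative, not from the original setup):

```python
import numpy as np

rng = np.random.default_rng(0)
full = rng.exponential(size=100_000)        # stands in for the whole dataset
sample = rng.choice(full, size=5_000)       # a locally drawn sample

# Fit: estimate quantile landmarks on the sample only.
n_quantiles = 100
refs = np.linspace(0.0, 1.0, n_quantiles)
landmarks = np.quantile(sample, refs)

# Transform: map every value to its approximate CDF position by
# linear interpolation between the fitted landmarks. In PySpark this
# step would run inside a UDF over the broadcast landmarks.
transformed = np.interp(full, landmarks, refs)
```

Because only the small `landmarks` array needs to be shipped to executors, broadcasting the fitted model is cheap regardless of the full dataset's size.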
After that, I implemented QuantileTransformer as a new Estimator on top of Spark. The implementation follows scikit-learn's, but there are still several differences:
1. it uses QuantileSummaries for approximate quantiles, regardless of the dataset size;
2. it uses linear interpolation, with logic similar to the existing IsotonicRegression, while scikit-learn uses a bi-directional interpolation;
3. when skipZero=true, it treats sparse vectors just like dense ones, while scikit-learn has two different code paths for sparse and dense datasets.
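The third difference can be illustrated with a minimal sketch: with zero-skipping enabled, zeros are dropped before fitting, so a dense vector and the non-zero values of its sparse representation produce identical quantile landmarks (the function and parameter names here are hypothetical, not the actual Estimator API):

```python
import numpy as np

def fit_quantiles(values, n_quantiles=5, skip_zero=False):
    """Fit quantile landmarks on a 1-D collection of values.

    With skip_zero=True, zeros are filtered out first, so it does not
    matter whether the input came from a dense vector or from the
    explicit entries of a sparse one.
    """
    v = np.asarray(values, dtype=float)
    if skip_zero:
        v = v[v != 0.0]
    refs = np.linspace(0.0, 1.0, n_quantiles)
    return np.quantile(v, refs)

dense = [0.0, 0.0, 1.0, 2.0, 3.0, 4.0]      # dense vector with stored zeros
sparse_values = [1.0, 2.0, 3.0, 4.0]        # non-zero entries of the same vector

q_dense = fit_quantiles(dense, skip_zero=True)
q_sparse = fit_quantiles(sparse_values, skip_zero=True)
# Both representations yield the same landmarks under skip_zero=True.
```

This is what "treat sparse vectors just like dense ones" means in practice: a single code path, rather than scikit-learn's separate dense and sparse branches.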