[SPARK-25782] Add PCA Aggregator to support grouping - ASF JIRA

Details

Type: New Feature
Status: Resolved
Priority: Minor
Resolution: Incomplete
Affects Version/s: 2.3.2
Fix Version/s: None
Component/s: ML, MLlib
Labels:
- bulk-closed

Description

I built an Aggregator that computes PCA on grouped datasets. I wanted to use the PCA functions provided by MLlib, but they only work on a full dataset, and I needed to run PCA separately for each group of a grouped dataset (such as a RelationalGroupedDataset).

The Aggregator I built does exactly that. Here's an example of how it's called:

val pcaAggregation = new PCAAggregator(vectorColumnName).toColumn

// For each grouping, compute a PCA matrix/vector
val pcaModels = inputData
  .groupBy(keys:_*)
  .agg(pcaAggregation.as(pcaOutput))

I used the same algorithms under the hood as RowMatrix.computePrincipalComponentsAndExplainedVariance, but this works directly on Datasets without converting to an RDD first.
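
To illustrate the idea, here is a minimal sketch in plain Scala (no Spark dependency) of the kind of mergeable buffer such an aggregator could keep per group: a row count, per-feature sums, and sums of outer products, from which a covariance matrix is derived at the end. This is not the PCAAggregator from this issue; every name below (GroupedPcaSketch, CovBuffer, pcaPerGroup, and so on) is hypothetical. To stay self-contained it also swaps in power iteration and returns only the top principal component, which is an assumption made for brevity and not the decomposition RowMatrix uses.

```scala
// Plain-Scala sketch; all names are hypothetical and nothing here is Spark API.
object GroupedPcaSketch {
  // Mergeable buffer: row count, per-feature sums, and sums of outer products.
  final case class CovBuffer(n: Long, sum: Vector[Double], outer: Vector[Vector[Double]])

  def zero(d: Int): CovBuffer =
    CovBuffer(0L, Vector.fill(d)(0.0), Vector.fill(d, d)(0.0))

  // Fold one feature vector into the buffer (the "reduce" step of an aggregator).
  def reduce(b: CovBuffer, x: Vector[Double]): CovBuffer =
    CovBuffer(
      b.n + 1,
      b.sum.zip(x).map { case (s, v) => s + v },
      b.outer.zipWithIndex.map { case (row, i) =>
        row.zipWithIndex.map { case (c, j) => c + x(i) * x(j) }
      }
    )

  // Combine two partial buffers (the "merge" step across partitions).
  def merge(a: CovBuffer, b: CovBuffer): CovBuffer =
    CovBuffer(
      a.n + b.n,
      a.sum.zip(b.sum).map { case (p, q) => p + q },
      a.outer.zip(b.outer).map { case (r1, r2) => r1.zip(r2).map { case (p, q) => p + q } }
    )

  // Sample covariance: (sum(x x^T) - n * mean * mean^T) / (n - 1).
  def covariance(b: CovBuffer): Vector[Vector[Double]] = {
    require(b.n > 1, "need at least two rows per group")
    val mean = b.sum.map(_ / b.n)
    Vector.tabulate(mean.length, mean.length) { (i, j) =>
      (b.outer(i)(j) - b.n * mean(i) * mean(j)) / (b.n - 1)
    }
  }

  // Top principal component via power iteration (assumes a non-zero covariance
  // matrix; the sign of the returned unit vector is arbitrary).
  def topComponent(cov: Vector[Vector[Double]], iterations: Int = 200): Vector[Double] = {
    def normalize(v: Vector[Double]): Vector[Double] = {
      val norm = math.sqrt(v.map(a => a * a).sum)
      v.map(_ / norm)
    }
    val start = normalize(Vector.tabulate(cov.length)(i => 1.0 + i))
    (1 to iterations).foldLeft(start) { (v, _) =>
      normalize(cov.map(row => row.zip(v).map { case (a, c) => a * c }.sum))
    }
  }

  // Group rows by key and compute one top component per group.
  def pcaPerGroup[K](rows: Seq[(K, Vector[Double])]): Map[K, Vector[Double]] =
    rows.groupBy(_._1).map { case (key, group) =>
      val d = group.head._2.length
      val buffer = group.map(_._2).foldLeft(zero(d))(reduce)
      key -> topComponent(covariance(buffer))
    }
}
```

Because the buffer only holds sums, partial buffers from different partitions can be merged in any order, which is what makes this shape a fit for a grouped aggregation. A fuller version would return k components and their explained variance, not a single vector.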

I saw this as a missing feature in the existing PCA implementation in MLlib. I suspect the use case is a common one: you have data from different entities (could be different users, different locations, or different products, for example) and you need to model them separately since they behave differently--perhaps their features run in different ranges, or perhaps they have completely different features.

For example, suppose you were modeling the weather in different parts of the world for a given time period, with features such as temperature, humidity, wind speed, and pressure. With the current PCA/RowMatrix options, you can only calculate PCA on the entire dataset, when you really want to model the weather in New York separately from the weather in Buenos Aires. Today your options are to collect the data for each city and calculate PCA using some other library such as Breeze, or to use the PCA implementation from MLlib on only one key at a time.

I hope this will make the PCA offering in MLlib useful to more people. As it stands today, I wasn't able to use it for much and I suspect others had the same experience, for example:
https://stackoverflow.com/questions/45240556/perform-pca-on-each-group-of-a-groupby-in-pyspark

People

Assignee: Unassigned
Reporter: Matt Saunders
Votes: 1
Watchers: 2

Dates

Created: 19/Oct/18 19:26
Updated: 25/May/21 01:55
Resolved: 25/May/21 01:43