  1. Spark
  2. SPARK-25782

Add PCA Aggregator to support grouping

Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Minor
    • Resolution: Incomplete
    • Affects Version/s: 2.3.2
    • Fix Version/s: None
    • Component/s: ML, MLlib

    Description

      I built an Aggregator that computes PCA on grouped datasets. I wanted to use the PCA functions provided by MLlib, but they only work on a full dataset, and I needed to do it on a grouped dataset (like a RelationalGroupedDataset). 

      So I built a small Aggregator that does this. Here is an example of how it is called:

      val pcaAggregation = new PCAAggregator(vectorColumnName).toColumn
      
      // For each grouping, compute a PCA matrix/vector
      val pcaModels = inputData
        .groupBy(keys:_*)
        .agg(pcaAggregation.as(pcaOutput))

      I used the same algorithms under the hood as RowMatrix.computePrincipalComponentsAndExplainedVariance, though this works directly on Datasets without converting to RDD first.
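To illustrate the aggregation pattern, here is a minimal sketch in plain Scala of a mergeable buffer that accumulates what a per-group covariance needs. It is not the PCAAggregator from this ticket: the names `CovBuffer` and `GroupedCovariance` are hypothetical, and the `zero`/`reduce`/`merge`/`finish` methods only mirror the shape of Spark's `Aggregator` contract without depending on Spark. As I understand it, RowMatrix.computePrincipalComponentsAndExplainedVariance first builds the covariance matrix and then decomposes it with an SVD. The decomposition step is left out here and would need a linear algebra library such as Breeze.

```scala
// Hypothetical per-group buffer: row count, per-feature sums, and the sum of outer products x * x^T.
final case class CovBuffer(n: Long, sum: Array[Double], outer: Array[Array[Double]])

object GroupedCovariance {
  // Empty buffer for d-dimensional feature vectors.
  def zero(d: Int): CovBuffer =
    CovBuffer(0L, Array.fill(d)(0.0), Array.fill(d, d)(0.0))

  // Fold one feature vector into the buffer.
  def reduce(b: CovBuffer, x: Array[Double]): CovBuffer = {
    val d = b.sum.length
    require(x.length == d, "feature vector has the wrong dimension")
    CovBuffer(
      b.n + 1,
      Array.tabulate(d)(i => b.sum(i) + x(i)),
      Array.tabulate(d, d)((i, j) => b.outer(i)(j) + x(i) * x(j))
    )
  }

  // Combine two partial buffers, e.g. from different partitions of the same group.
  def merge(a: CovBuffer, b: CovBuffer): CovBuffer = {
    val d = a.sum.length
    require(b.sum.length == d, "buffers have different dimensions")
    CovBuffer(
      a.n + b.n,
      Array.tabulate(d)(i => a.sum(i) + b.sum(i)),
      Array.tabulate(d, d)((i, j) => a.outer(i)(j) + b.outer(i)(j))
    )
  }

  // Sample covariance: (sum of x * x^T - n * mean * mean^T) / (n - 1).
  def finish(b: CovBuffer): Array[Array[Double]] = {
    require(b.n > 1, "need at least two rows to compute a sample covariance")
    val d = b.sum.length
    val mean = b.sum.map(_ / b.n)
    Array.tabulate(d, d)((i, j) => (b.outer(i)(j) - b.n * mean(i) * mean(j)) / (b.n - 1))
  }
}
```

Because the buffer only holds sums, partial results from different partitions can be merged in any order, which is what lets the computation run under `groupBy(...).agg(...)` without collecting each group's rows. The sum-of-outer-products formula is simple but can lose precision when feature values are large relative to their variance.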

      I saw this as a missing feature in the existing PCA implementation in MLlib. I suspect the use case is a common one: you have data from different entities (different users, locations, or products, for example) and you need to model them separately because they behave differently. Perhaps their features span different ranges, or perhaps they have entirely different features.
       
      For example, suppose you are modeling the weather in different parts of the world for a given time period, with features such as temperature, humidity, wind speed, and pressure. With the current PCA/RowMatrix options, you can only calculate PCA on the entire dataset, when you really want to model the weather in New York separately from the weather in Buenos Aires. Today your options are to collect the data for each city and calculate PCA using some other library such as Breeze, or to use the PCA implementation from MLlib on only one key at a time.
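For comparison, the one-key-at-a-time workaround might look like the sketch below. It assumes Spark's `org.apache.spark.ml.feature.PCA` estimator and a features column of ML vectors. The object and method names (`PerKeyPca`, `fitPerKey`) and the output column name `pca_out` are made up for illustration.

```scala
import org.apache.spark.ml.feature.{PCA, PCAModel}
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object PerKeyPca {
  // Fits one PCA model per distinct key by filtering the data once per key.
  // Assumes keyCol has no nulls and that no column named "pca_out" already exists.
  def fitPerKey(data: DataFrame, keyCol: String, featuresCol: String, k: Int): Map[String, PCAModel] = {
    // Pull the distinct keys to the driver, rendered as strings.
    val keys = data.select(col(keyCol).cast("string")).distinct().collect().map(_.getString(0))
    keys.map { key =>
      val subset = data.filter(col(keyCol) === key)
      val model = new PCA()
        .setInputCol(featuresCol)
        .setOutputCol("pca_out") // hypothetical output column name
        .setK(k)
        .fit(subset) // launches a separate Spark job for this key
      key -> model
    }.toMap
  }
}
```

Each key triggers its own pass over the data, so the cost grows with the number of keys. That is the overhead a grouped aggregator is meant to avoid.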
       
      I hope this will make the PCA offering in MLlib useful to more people. As it stands today, I wasn't able to use it for much, and I suspect others have had the same experience. For example:
      https://stackoverflow.com/questions/45240556/perform-pca-on-each-group-of-a-groupby-in-pyspark

          People

            Assignee: Unassigned
            Reporter: Matt Saunders (mttsndrs)
            Votes: 1
            Watchers: 2
