Details

Type: New Feature

Status: Closed

Priority: Major

Resolution: Fixed

Affects Version/s: 0.6

Fix Version/s: 0.7

Component/s: None

Labels: None
Description
It seems that a simple solution should exist to integrate PCA mean subtraction into the SSVD algorithm without making it a prerequisite step, while also avoiding densifying the big input.
Several approaches were suggested:
1) subtract mean off B
2) propagate mean vector deeper into algorithm algebraically where the data is already collapsed to smaller matrices
3) ?
It needs some math done first. I'll take a stab at 1 and 2, but thoughts and math are welcome.
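A minimal sketch of why approach 2 can avoid densifying the input, assuming the first projection step has the form Y = A * Omega. Since (A - 1 xi^T) * Omega = A * Omega - 1 * (xi^T Omega), the sparse product can be computed over the stored non-zeros only, and the mean enters as one small k-vector subtracted from every row of Y. The function names and the list-of-dicts sparse layout are illustrative assumptions, not Mahout code.

```python
import random

def column_means(rows, n):
    """Column-wise mean of a sparse matrix stored as a list of {col: value} dicts."""
    sums = [0.0] * n
    for row in rows:
        for j, v in row.items():
            sums[j] += v
    return [s / len(rows) for s in sums]

def sparse_times_dense(rows, omega):
    """A * Omega, touching only the stored non-zeros of A."""
    k = len(omega[0])
    out = []
    for row in rows:
        y = [0.0] * k
        for j, v in row.items():
            for c in range(k):
                y[c] += v * omega[j][c]
        out.append(y)
    return out

def centered_projection(rows, omega, xi):
    """(A - 1 xi^T) * Omega computed as A * Omega - 1 * (xi^T Omega)."""
    k = len(omega[0])
    # xi^T Omega is a single k-vector shared by all rows (hypothetical name).
    s_omega = [sum(xi[j] * omega[j][c] for j in range(len(xi))) for c in range(k)]
    y = sparse_times_dense(rows, omega)
    return [[y_rc - s_omega[c] for c, y_rc in enumerate(y_row)] for y_row in y]

def dense_reference(rows, omega, xi):
    """Brute force: densify each row, subtract the mean, then multiply."""
    n, k = len(xi), len(omega[0])
    out = []
    for row in rows:
        dense = [row.get(j, 0.0) - xi[j] for j in range(n)]
        out.append([sum(dense[j] * omega[j][c] for j in range(n)) for c in range(k)])
    return out

rng = random.Random(0)
m, n, k = 6, 8, 3
A = [{j: rng.uniform(1, 5) for j in rng.sample(range(n), 2)} for _ in range(m)]
Omega = [[rng.gauss(0, 1) for _ in range(k)] for _ in range(n)]
xi = column_means(A, n)
Y_fast = centered_projection(A, Omega, xi)
Y_ref = dense_reference(A, Omega, xi)
max_diff = max(abs(a - b) for ra, rb in zip(Y_fast, Y_ref) for a, b in zip(ra, rb))
```

Both routes give the same Y up to rounding, but only the reference route ever materializes a dense row of A.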

 SSVDCLI.pdf
 406 kB
 Dmitriy Lyubimov

 SSVDPCA options.pdf
 369 kB
 Dmitriy Lyubimov

 MAHOUT817RC1.patch
 140 kB
 Dmitriy Lyubimov

 MAHOUT817.patch
 120 kB
 Dmitriy Lyubimov

 MAHOUT817.patch
 120 kB
 Dmitriy Lyubimov

 MAHOUT817.patch
 120 kB
 Dmitriy Lyubimov

 ssvd.R
 2 kB
 Dmitriy Lyubimov

 ssvdtests.R
 0.9 kB
 Dmitriy Lyubimov

 ssvd.m
 2 kB
 Raphael Cendrillon
Activity
Integrated in MahoutQuality #1361 (See https://builds.apache.org/job/MahoutQuality/1361/)
MAHOUT-817 PCA options for SSVD (RC1) (Revision 1292532)
Result = SUCCESS
dlyubimov : http://svn.apache.org/viewcvs.cgi/?root=ApacheSVN&view=rev&rev=1292532
Files :
 /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/DistributedRowMatrix.java
 /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/MatrixColumnMeansJob.java
 /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/ABtDenseOutJob.java
 /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/BtJob.java
 /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/Omega.java
 /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/PartialRowEmitter.java
 /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/QJob.java
 /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/SSVDCli.java
 /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/SSVDHelper.java
 /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/SSVDPrototype.java
 /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/SSVDSolver.java
 /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/UJob.java
 /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/VJob.java
 /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/YtYJob.java
 /mahout/trunk/core/src/test/java/org/apache/mahout/math/hadoop/TestDistributedRowMatrix.java
 /mahout/trunk/core/src/test/java/org/apache/mahout/math/hadoop/stochasticsvd/LocalSSVDPCADenseTest.java
 /mahout/trunk/core/src/test/java/org/apache/mahout/math/hadoop/stochasticsvd/LocalSSVDSolverDenseTest.java
 /mahout/trunk/core/src/test/java/org/apache/mahout/math/hadoop/stochasticsvd/LocalSSVDSolverSparseSequentialTest.java
 /mahout/trunk/core/src/test/java/org/apache/mahout/math/hadoop/stochasticsvd/SSVDCommonTest.java
 /mahout/trunk/core/src/test/java/org/apache/mahout/math/hadoop/stochasticsvd/SSVDPrototypeTest.java
 /mahout/trunk/core/src/test/java/org/apache/mahout/math/hadoop/stochasticsvd/SSVDTestsHelper.java
refreshing the attached patch (called RC1) to correspond to what was posted on review board.

This is an automatically generated email. To reply, visit:
https://reviews.apache.org/r/3863/

(Updated 2012-02-17 20:50:01.339012)
Review request for mahout.
Changes

commit 95d5934405d1ca51e13439a43e0fc793418e5d37
Author: Dmitriy Lyubimov <dlyubimov@inadco.com>
Date: Fri Feb 17 12:48:37 2012 -0800
Fixing option recovery based on new api changes
Summary

2d542fd4dfcc6e01577bddc28600632a88e358ee Merge remote-tracking branch 'apache/trunk' into MAHOUT-817
1f245bb5cc1354e7495ec62fbc5f41ed6d590210 Merge branch 'trunk' into MAHOUT-817
458d8112de180c93d5194d67ccfc00442ed1d460 Merge remote-tracking branch 'apache/trunk' into MAHOUT-817
3fea9bd981043e268dd003d4c6c3943bb570c0f7 added test, bug fixes
2725c1061c167126238d288039f0f68baafa7dc8 adding pca and pcaOffset options, minor fixes
48c7b425241afff42ce52d3bb005a87aeb68386d fixing front end to factor in the median data.
4e072615ac2b8a256d037aaf00db21820abb91e2 tweaking B' job to produce necessary correctors s_q and s_b
b10fefd8d4aa5a0ed2f60902904d551afbbdf57e cosmetic fixes
849171d3af75117a2ee1115e6d5fc8e4a1fff5ce comment
6c196ea9606b3ca05d401fa1474ee9262a6c0303 retrofitting V job to do pca correction
e6fbe7cdb606698f180127302c33d30fffc6c4d7 adding pca options to Q,ABt jobs. still need to work on B'job, Vjob and frontend pca corrections.
ecf5dd21c5d5805d70715a78abd07246d171536c Computing s_b0
b9b33cf72af85ade16fcfbf4e13a036877489afb comments
9bb6e971c68e0674b087b8c5d64f4967878f1834 More cleanup in favor of standard functions, unit tests pass but need to verify the 2G benchmark.
39faa70158b52e50d31aca2abc4006874a9ea8fd cleanup I
780b291eb902e0e832d41748d45bf6d2163f9537 cosmetic changes, adding api with out redundant parameters
02daf0024489305032320c578ac546c16bda31c1 current MAHOUT-923 patch from Raphael
This addresses bug MAHOUT-817.
https://issues.apache.org/jira/browse/MAHOUT-817
Diffs (updated)
core/src/main/java/org/apache/mahout/math/hadoop/DistributedRowMatrix.java 3e0dd5e
core/src/main/java/org/apache/mahout/math/hadoop/MatrixColumnMeansJob.java PRECREATION
core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/ABtDenseOutJob.java c52fe2a
core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/BtJob.java 0c3a996
core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/Omega.java 0fa8707
core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/PartialRowEmitter.java 59bdedb
core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/QJob.java 703c420
core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/SSVDCli.java d314186
core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/SSVDHelper.java PRECREATION
core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/SSVDPrototype.java 98c8c59
core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/SSVDSolver.java b1a8b56
core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/UJob.java 53f26f4
core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/VJob.java d58789e
core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/YtYJob.java bd8c6b1
core/src/test/java/org/apache/mahout/math/hadoop/TestDistributedRowMatrix.java 0ef8622
core/src/test/java/org/apache/mahout/math/hadoop/stochasticsvd/LocalSSVDPCADenseTest.java PRECREATION
core/src/test/java/org/apache/mahout/math/hadoop/stochasticsvd/LocalSSVDSolverDenseTest.java 59f79c5
core/src/test/java/org/apache/mahout/math/hadoop/stochasticsvd/LocalSSVDSolverSparseSequentialTest.java beb0102
core/src/test/java/org/apache/mahout/math/hadoop/stochasticsvd/SSVDCommonTest.java PRECREATION
core/src/test/java/org/apache/mahout/math/hadoop/stochasticsvd/SSVDPrototypeTest.java 503433f
core/src/test/java/org/apache/mahout/math/hadoop/stochasticsvd/SSVDTestsHelper.java 32342c1
Diff: https://reviews.apache.org/r/3863/diff
Testing

Additional unit tests for PCA
Thanks,
Dmitriy

This is an automatically generated email. To reply, visit:
https://reviews.apache.org/r/3863/

(Updated 2012-02-17 20:43:22.593328)
Review request for mahout.
Changes

commit 996464eb600400745baf25498606aca115cb7e96
Merge: cd48627 aa7e1d8
Author: Dmitriy Lyubimov <dlyubimov@inadco.com>
Date: Fri Feb 17 12:40:26 2012 -0800
Merge remote-tracking branch 'apache/trunk' into MAHOUT-817
Conflicts:
core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/SSVDCli.java
Testing

Additional unit tests for PCA
Thanks,
Dmitriy

This is an automatically generated email. To reply, visit:
https://reviews.apache.org/r/3863/

(Updated 2012-02-17 20:38:49.925577)
Review request for mahout.
Changes

commit cd4862738fb74f01114e0e4c2fee8a737a009c13
Author: Dmitriy Lyubimov <dlyubimov@inadco.com>
Date: Fri Feb 17 12:35:47 2012 -0800
Getting rid of prototype code; styling round
:100644 100644 d61210f... ebf087d... M core/src/main/java/org/apache/mahout/math/hadoop/MatrixColumnMeansJob.java
:100644 100644 254887a... d9c03cb... M core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/BtJob.java
:100644 100644 959d491... 8be8df1... M core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/Omega.java
:100644 000000 59bdedb... 0000000... D core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/PartialRowEmitter.java
:100644 100644 d247af4... 59f64ba... M core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/SSVDCli.java
:100644 100644 96fe5e1... 1127f6a... M core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/SSVDHelper.java
:100644 000000 09f05d1... 0000000... D core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/SSVDPrototype.java
:100644 100644 915fce5... 4168e98... M core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/SSVDSolver.java
:100644 100644 885f5fa... 1346d71... M core/src/test/java/org/apache/mahout/math/hadoop/stochasticsvd/LocalSSVDPCADenseTest.j
:100644 100644 760c715... 280e10a... M core/src/test/java/org/apache/mahout/math/hadoop/stochasticsvd/LocalSSVDSolverDenseTes
:100644 100644 7015283... 0e34568... M core/src/test/java/org/apache/mahout/math/hadoop/stochasticsvd/LocalSSVDSolverSparseSe
:000000 100644 0000000... 5bb5706... A core/src/test/java/org/apache/mahout/math/hadoop/stochasticsvd/SSVDCommonTest.java
:100644 000000 503433f... 0000000... D core/src/test/java/org/apache/mahout/math/hadoop/stochasticsvd/SSVDPrototypeTest.java
:100644 100644 32342c1... d6605c1... M core/src/test/java/org/apache/mahout/math/hadoop/stochasticsvd/SSVDTestsHelper.java
Diffs (updated)
core/src/main/java/org/apache/mahout/cf/taste/hadoop/als/DatasetSplitter.java c9003ad
core/src/main/java/org/apache/mahout/cf/taste/hadoop/als/FactorizationEvaluator.java 0c6e3f7
core/src/main/java/org/apache/mahout/cf/taste/hadoop/als/ParallelALSFactorizationJob.java 7dc3b79
core/src/main/java/org/apache/mahout/cf/taste/hadoop/als/RecommenderJob.java 9ca0b16
core/src/main/java/org/apache/mahout/cf/taste/hadoop/item/RecommenderJob.java 1feaa03
core/src/main/java/org/apache/mahout/cf/taste/hadoop/preparation/PreparePreferenceMatrixJob.java fbe8914
core/src/main/java/org/apache/mahout/cf/taste/hadoop/pseudo/RecommenderJob.java 02d1ba6
core/src/main/java/org/apache/mahout/cf/taste/hadoop/similarity/item/ItemSimilarityJob.java 951c860
core/src/main/java/org/apache/mahout/cf/taste/hadoop/slopeone/SlopeOneAverageDiffsJob.java 57fa036
core/src/main/java/org/apache/mahout/cf/taste/impl/model/PlusAnonymousConcurrentUserDataModel.java 11eb295
core/src/main/java/org/apache/mahout/cf/taste/impl/model/PlusAnonymousUserDataModel.java 7f9cfd4
core/src/main/java/org/apache/mahout/classifier/naivebayes/test/TestNaiveBayesDriver.java 15da502
core/src/main/java/org/apache/mahout/classifier/naivebayes/training/TrainNaiveBayesJob.java 4da6426
core/src/main/java/org/apache/mahout/clustering/AbstractCluster.java 2ceb01b
core/src/main/java/org/apache/mahout/clustering/CIMapper.java 5f25f4f
core/src/main/java/org/apache/mahout/clustering/CIReducer.java 726363e
core/src/main/java/org/apache/mahout/clustering/Cluster.java 2f8d4dd
core/src/main/java/org/apache/mahout/clustering/ClusterIterator.java e39c71e
core/src/main/java/org/apache/mahout/clustering/ClusterWritable.java dba8c37
core/src/main/java/org/apache/mahout/clustering/ClusteringPolicy.java b07b649
core/src/main/java/org/apache/mahout/clustering/ClusteringPolicyWritable.java 8c148a8
core/src/main/java/org/apache/mahout/clustering/DirichletClusteringPolicy.java 116973f
core/src/main/java/org/apache/mahout/clustering/FuzzyKMeansClusteringPolicy.java 6c39d94
core/src/main/java/org/apache/mahout/clustering/KMeansClusteringPolicy.java 7b0d874
core/src/main/java/org/apache/mahout/clustering/Model.java 79dab30
core/src/main/java/org/apache/mahout/clustering/WeightedPropertyVectorWritable.java 92373eb
core/src/main/java/org/apache/mahout/clustering/canopy/CanopyDriver.java 7147015
core/src/main/java/org/apache/mahout/clustering/canopy/CanopyMapper.java 52fe865
core/src/main/java/org/apache/mahout/clustering/canopy/CanopyReducer.java ca814f9
core/src/main/java/org/apache/mahout/clustering/classify/ClusterClassificationConfigKeys.java 366ec3c
core/src/main/java/org/apache/mahout/clustering/classify/ClusterClassificationDriver.java 49a9cfc
core/src/main/java/org/apache/mahout/clustering/classify/ClusterClassificationMapper.java 09be170
core/src/main/java/org/apache/mahout/clustering/dirichlet/DirichletCluster.java 7293479
core/src/main/java/org/apache/mahout/clustering/dirichlet/DirichletClusterer.java 3cf25bc
core/src/main/java/org/apache/mahout/clustering/dirichlet/DirichletState.java d19842f
core/src/main/java/org/apache/mahout/clustering/fuzzykmeans/FuzzyKMeansClusterer.java 2d882b0
core/src/main/java/org/apache/mahout/clustering/fuzzykmeans/FuzzyKMeansDriver.java aa7389f
core/src/main/java/org/apache/mahout/clustering/fuzzykmeans/FuzzyKMeansUtil.java 5f6cb47
core/src/main/java/org/apache/mahout/clustering/fuzzykmeans/SoftCluster.java 52fd764
core/src/main/java/org/apache/mahout/clustering/kmeans/Cluster.java PRECREATION
core/src/main/java/org/apache/mahout/clustering/kmeans/KMeansClusterMapper.java 3cf41ec
core/src/main/java/org/apache/mahout/clustering/kmeans/KMeansClusterer.java 9471e74
core/src/main/java/org/apache/mahout/clustering/kmeans/KMeansCombiner.java eb086d8
core/src/main/java/org/apache/mahout/clustering/kmeans/KMeansDriver.java 1099206
core/src/main/java/org/apache/mahout/clustering/kmeans/KMeansMapper.java 0945dcb
core/src/main/java/org/apache/mahout/clustering/kmeans/KMeansReducer.java bb777a4
core/src/main/java/org/apache/mahout/clustering/kmeans/KMeansUtil.java 1c84f87
core/src/main/java/org/apache/mahout/clustering/kmeans/Kluster.java 8b22709
core/src/main/java/org/apache/mahout/clustering/kmeans/RandomSeedGenerator.java 4a725e7
core/src/main/java/org/apache/mahout/clustering/meanshift/MeanShiftCanopy.java 28fc43b
core/src/main/java/org/apache/mahout/clustering/meanshift/MeanShiftCanopyDriver.java a33f1ca
core/src/main/java/org/apache/mahout/clustering/spectral/eigencuts/EigencutsDriver.java 06e0549
core/src/main/java/org/apache/mahout/clustering/spectral/kmeans/SpectralKMeansDriver.java 82daa5b
core/src/main/java/org/apache/mahout/clustering/topdown/postprocessor/ClusterCountReader.java 11c4d88
core/src/main/java/org/apache/mahout/common/AbstractJob.java 55040f6
core/src/main/java/org/apache/mahout/common/commandline/DefaultOptionCreator.java 868d82f
core/src/main/java/org/apache/mahout/common/iterator/sequencefile/PathFilters.java 19f78b5
core/src/main/java/org/apache/mahout/graph/AdjacencyMatrixJob.java ae419f6
core/src/main/java/org/apache/mahout/graph/linkanalysis/RandomWalk.java 5727a77
core/src/main/java/org/apache/mahout/graph/linkanalysis/RandomWalkWithRestartJob.java fcf4549
core/src/main/java/org/apache/mahout/math/hadoop/DistributedRowMatrix.java 3e0dd5e
core/src/main/java/org/apache/mahout/math/hadoop/MatrixColumnMeansJob.java PRECREATION
core/src/main/java/org/apache/mahout/math/hadoop/MatrixMultiplicationJob.java e907a6d
core/src/main/java/org/apache/mahout/math/hadoop/TransposeJob.java a046b41
core/src/main/java/org/apache/mahout/math/hadoop/decomposer/DistributedLanczosSolver.java c81ef71
core/src/main/java/org/apache/mahout/math/hadoop/decomposer/EigenVerificationJob.java 2e152c4
core/src/main/java/org/apache/mahout/math/hadoop/similarity/SeedVectorUtil.java 4d63f46
core/src/main/java/org/apache/mahout/math/hadoop/similarity/cooccurrence/RowSimilarityJob.java ff517dc
core/src/main/java/org/apache/mahout/math/hadoop/solver/DistributedConjugateGradientSolver.java eba6d2a
core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/ABtDenseOutJob.java c52fe2a
core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/BtJob.java 0c3a996
core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/Omega.java 0fa8707
core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/PartialRowEmitter.java 59bdedb
core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/QJob.java 703c420
core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/SSVDCli.java d314186
core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/SSVDHelper.java PRECREATION
core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/SSVDPrototype.java 98c8c59
core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/SSVDSolver.java b1a8b56
core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/UJob.java 53f26f4
core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/VJob.java d58789e
core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/YtYJob.java bd8c6b1
core/src/main/java/org/apache/mahout/math/stats/entropy/Entropy.java 4a8078e
core/src/main/java/org/apache/mahout/vectorizer/collocations/llr/CollocDriver.java 7a0c639
core/src/test/java/org/apache/mahout/cf/taste/impl/model/PlusAnonymousConcurrentUserDataModelTest.java 984ef6c
core/src/test/java/org/apache/mahout/clustering/TestClusterClassifier.java 391bdf6
core/src/test/java/org/apache/mahout/clustering/TestClusterInterface.java d9f06ec
core/src/test/java/org/apache/mahout/clustering/canopy/TestCanopyCreation.java 0b70339
core/src/test/java/org/apache/mahout/clustering/classify/ClusterClassificationDriverTest.java 8a5e1ea
core/src/test/java/org/apache/mahout/clustering/dirichlet/TestDirichletClustering.java d87c3e3
core/src/test/java/org/apache/mahout/clustering/dirichlet/TestMapReduce.java c996d97
core/src/test/java/org/apache/mahout/clustering/kmeans/TestKmeansClustering.java aa32112
core/src/test/java/org/apache/mahout/clustering/meanshift/TestMeanShift.java 8dd9d41
core/src/test/java/org/apache/mahout/common/AbstractJobTest.java 4feae91
core/src/test/java/org/apache/mahout/math/hadoop/TestDistributedRowMatrix.java 0ef8622
core/src/test/java/org/apache/mahout/math/hadoop/stochasticsvd/LocalSSVDPCADenseTest.java PRECREATION
core/src/test/java/org/apache/mahout/math/hadoop/stochasticsvd/LocalSSVDSolverDenseTest.java 59f79c5
core/src/test/java/org/apache/mahout/math/hadoop/stochasticsvd/LocalSSVDSolverSparseSequentialTest.java beb0102
core/src/test/java/org/apache/mahout/math/hadoop/stochasticsvd/SSVDCommonTest.java PRECREATION
core/src/test/java/org/apache/mahout/math/hadoop/stochasticsvd/SSVDPrototypeTest.java 503433f
core/src/test/java/org/apache/mahout/math/hadoop/stochasticsvd/SSVDTestsHelper.java 32342c1
examples/src/main/java/org/apache/mahout/cf/taste/example/email/MailToPrefsDriver.java 1781481
examples/src/main/java/org/apache/mahout/classifier/email/PrepEmailVectorsDriver.java 4d4836f
examples/src/main/java/org/apache/mahout/clustering/display/DisplayClustering.java 7faf92e
examples/src/main/java/org/apache/mahout/clustering/display/DisplayDirichlet.java 2edadf1
examples/src/main/java/org/apache/mahout/clustering/display/DisplayFuzzyKMeans.java a5ef4d0
examples/src/main/java/org/apache/mahout/clustering/display/DisplayKMeans.java bc5c2ea
examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/canopy/Job.java 3833932
examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/dirichlet/Job.java 32b9efe
examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/fuzzykmeans/Job.java 3ac3cca
examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/kmeans/Job.java d63ac9e
examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/meanshift/Job.java ef69827
integration/pom.xml b751b98
integration/src/main/java/org/apache/mahout/classifier/ConfusionMatrixDumper.java 5958ce8
integration/src/main/java/org/apache/mahout/utils/MatrixDumper.java b71cb95
integration/src/main/java/org/apache/mahout/utils/SequenceFileDumper.java e108aa4
integration/src/main/java/org/apache/mahout/utils/clustering/ClusterDumper.java 3bc72ab
integration/src/main/java/org/apache/mahout/utils/vectors/RowIdJob.java 11769b1
integration/src/main/java/org/apache/mahout/utils/vectors/VectorDumper.java 5a9d0f2
integration/src/main/java/org/apache/mahout/utils/vectors/VectorHelper.java 716aaf9
integration/src/test/java/org/apache/mahout/clustering/dirichlet/TestL1ModelClustering.java eef9551
pom.xml 7485994
Diff: https://reviews.apache.org/r/3863/diff
Testing

Additional unit tests for PCA
Thanks,
Dmitriy

This is an automatically generated email. To reply, visit:
https://reviews.apache.org/r/3863/

(Updated 2012-02-11 03:15:25.803911)
Review request for mahout.
Diffs
core/src/main/java/org/apache/mahout/math/hadoop/DistributedRowMatrix.java 3e0dd5e
core/src/main/java/org/apache/mahout/math/hadoop/MatrixColumnMeansJob.java PRECREATION
core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/ABtDenseOutJob.java c52fe2a
core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/BtJob.java 0c3a996
core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/Omega.java 0fa8707
core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/QJob.java 703c420
core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/SSVDCli.java 0d81ccd
core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/SSVDHelper.java PRECREATION
core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/SSVDPrototype.java 98c8c59
core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/SSVDSolver.java b1a8b56
core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/UJob.java 53f26f4
core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/VJob.java d58789e
core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/YtYJob.java bd8c6b1
core/src/test/java/org/apache/mahout/math/hadoop/TestDistributedRowMatrix.java 0ef8622
core/src/test/java/org/apache/mahout/math/hadoop/stochasticsvd/LocalSSVDPCADenseTest.java PRECREATION
core/src/test/java/org/apache/mahout/math/hadoop/stochasticsvd/LocalSSVDSolverDenseTest.java 59f79c5
core/src/test/java/org/apache/mahout/math/hadoop/stochasticsvd/LocalSSVDSolverSparseSequentialTest.java beb0102
Diff: https://reviews.apache.org/r/3863/diff
Testing

Additional unit tests for PCA
Thanks,
Dmitriy
Brought the patch in sync with the current post-release trunk.
Rebasing on current trunk.
BTW, this patch doesn't address the use cases of "folding in" and "folding out", which are basically special cases of SVD fold-in adjusted to row-wise input and the PCA offset.
Do we want to leave it out of scope? It usually doesn't make sense to do this stuff in a batch, but rather in real time, which requires some indexing mechanism for V (and U). Other than that, it is a simple multiplication operation, so perhaps we could just engineer a fold-in using regular distributed matrix operations? I never investigated the issue of a batch fold-in with Mahout.
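For illustration, a row-wise fold-in with a PCA offset can be sketched as the textbook identity u = (a - xi)^T V Sigma^{-1}, which follows from (A - 1 xi^T) ~ U Sigma V^T. This is an assumption about what "folding in" would look like here, not something the patch implements; `fold_in_row` and the toy data are hypothetical.

```python
def fold_in_row(a, V, sigma, xi=None):
    """Project one new sparse row `a` ({col: value}) into the U space.

    V is n x k (list of rows), sigma holds the k singular values, and xi is
    the optional PCA offset (column-mean vector of length n).
    """
    n, k = len(V), len(sigma)
    proj = [0.0] * k
    # a^T V over the stored non-zeros only.
    for j, v in a.items():
        for c in range(k):
            proj[c] += v * V[j][c]
    if xi is not None:
        # xi^T V is the same for every row; a real implementation would
        # compute it once and reuse it.
        for c in range(k):
            proj[c] -= sum(xi[j] * V[j][c] for j in range(n))
    return [proj[c] / sigma[c] for c in range(k)]

# Toy example: V holds the first two unit vectors of R^3.
V = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
sigma = [2.0, 0.5]
xi = [1.0, 1.0, 1.0]
u_new = fold_in_row({0: 3.0, 1: 1.5, 2: 1.0}, V, sigma, xi)
```

Each row is independent, so the same function could serve a real-time lookup or be mapped over a batch of rows.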
Thanks for merging Dmitriy. Is there anything you need from me at this point?
I would always appreciate it if you could poke at the CLI version and verify it independently with a Matlab test for the precision of the computed singular values and the V output on a larger input.
(I am still working on reading Mahout files into R and merging with RHadoop; when that's done I will be able to verify larger tests with R.)
d
First round. The unit test seems to pass, although it is debatable how off-centered the data in it is. Also put in CLI options for PCA (pca=true, pcaOffset=location, to override the default computation of row means).
Thanks for merging Dmitriy. Is there anything you need from me at this point?
I merged with MAHOUT923 and started some initial cleanup and work in MAHOUT817 branch in my github on this.
Mostly cleanup so far: removing old kludgy code and replacing stuff with standard vector framework functions.
Updated R code to match working notes more closely.
Updated math document.
Yeah. It looks like this will indeed be necessary.
By the way, could you take a look through the column-wise mean job in MAHOUT-880?
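As a rough picture of what a column-wise mean job has to do over sparse row-wise input, here is a sketch in map/merge style: each split emits partial column sums plus a row count, and the partials are merged before the final division. This is only an assumed shape for such a job, not the MAHOUT-880 implementation; all names are hypothetical.

```python
def partial_column_sums(rows, n):
    """Map side: column sums and row count for one split of sparse rows."""
    sums = [0.0] * n
    count = 0
    for row in rows:
        count += 1
        for j, v in row.items():
            sums[j] += v
    return sums, count

def merge_partials(partials, n):
    """Reduce side: add up the per-split sums and row counts."""
    total = [0.0] * n
    count = 0
    for sums, c in partials:
        count += c
        for j in range(n):
            total[j] += sums[j]
    return total, count

def column_means(splits, n):
    """Column-wise mean over all rows of all splits."""
    total, count = merge_partials([partial_column_sums(s, n) for s in splits], n)
    if count == 0:
        raise ValueError("no rows")
    return [t / count for t in total]

# Two splits holding three sparse rows in total.
splits = [[{0: 2.0}, {1: 4.0}], [{0: 4.0, 2: 6.0}]]
xi = column_means(splits, 3)
```

Carrying the row count alongside the sums matters because missing entries count as zeros: the divisor is the number of rows, not the number of stored values per column.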
OK, found a case that affects the Y fix. As soon as I move the random generator off zero mean for the simulated orthonormal matrices in the test input, a difference between the versions with and without the Y fix appears in the output.
The first printout is for the PCA routine with the Y fix, the second is for the PCA routine without the Y fix, and the third is SSVD over the A-mean matrix (mean already subtracted).
reattached the newest R files.
> ## PCActest
> # compute median xi
>
> xfixed=matrix(nrow=m,ncol=n)
> for ( i in 1:m) xfixed[i,]=x[i,]-xi
>
>
> respca=ssvd.cpca(x,k,qiter=qi)
fixing Y...
Warning message:
In sqrt(e$values) : NaNs produced
> # compare also with results when Y fix is ignored
> respca1=ssvd.cpca(x,k,qiter=qi,fixY=F)
Warning message:
In sqrt(e$values) : NaNs produced
>
> ressvd=ssvd.svd(xfixed,k,qiter=qi)
>
> # compare 3 sets of singular values
> respca$svalues
[1] 9.0584987 8.0500343 7.0271257 6.0267613 5.0266239 4.0221945 3.0428140
[8] 2.0328541 1.1788628 0.8524032
> respca1$svalues
[1] 9.0504971 8.0487910 7.0238114 6.0246926 5.0250013 4.0221219 3.0371404
[8] 2.0306501 1.0668975 0.3805301
> ressvd$svalues
[1] 9.0584987 8.0500343 7.0271257 6.0267613 5.0266239 4.0221945 3.0428140
[8] 2.0328541 1.1788628 0.8524032
>
> #compare first rows of singular vectors
> respca$v[1,]
[1] 0.010705297 0.002515335 0.015630454 0.023178851 0.022406230
[6] 0.023602299 0.016234821 0.045020972 0.084333758 0.053624133
> respca1$v[1,]
[1] 0.010691547 0.002485415 0.015705498 0.023117058 0.022482137
[6] 0.023557896 0.015686873 0.046335615 0.061378867 0.226028214
> ressvd$v[1,]
[1] 0.010705297 0.002515335 0.015630454 0.023178851 0.022406230
[6] 0.023602299 0.016234821 0.045020972 0.084333758 0.053624133
>
and i also don't see any difference for small 100x200 inputs between pci and svd on a fixed (mean-subtracted) input, even if i bypass the Y correction for the mean in both B_0 and the power iterations!
perhaps it has to do with the way i generate the input. that also may not necessarily hold for extremely sparse cases.
But i think the first patch could bypass the Y fix.
> respci$svalues
[1] 9.9013440 8.9980801 7.9936265 6.9882617 5.9982148 4.9935232 3.9848657
[8] 2.9811621 1.9891654 0.9977757
> ressvd$svalues
[1] 9.9013440 8.9980801 7.9936265 6.9882617 5.9982148 4.9935232 3.9848657
[8] 2.9811621 1.9891654 0.9977757
>
So i did an R simulation of columnwise mean and it seems to work, so i think this verifies the math.
I still need to finish the doc (it also has a little typo in it), i will be finishing it from home as i don't seem to have the doc source on me here.
I guess it clears the implementation on existing ssvd solver.
test results comparing "brute forced" svd with "median propagated" version:
> respci$svalues
[1] 9.9995227 8.9992220 7.9907894 6.9860235 5.9786348 4.9866553 3.9853651
[8] 2.9735904 1.9999941 0.9971344
> ressvd$svalues
[1] 9.9995227 8.9992220 7.9907894 6.9860235 5.9786348 4.9866553 3.9853651
[8] 2.9735904 1.9999941 0.9971344
>
fixed
rolling back solution for now. There are errors.
Actually, propagating median thru power iterations is not yet quite finished. I will finish it a tad later.
It seems to be OK in the examples I've looked at. This may be quite dependent on m, n,k, p etc. though.
ok. that's what i suspected. but i think the variance is going to depend a lot on variance in the input (between different rows). Can you try and test how it is going to be affected if you increase the variances of the input such that deviation >> mean?
Here's a little snippet of Matlab code which evaluates the performance of SSVD with and without mean subtraction on A.
At first glance it seems that Q is relatively insensitive to the mean of A, so that reasonable performance can be achieved even if A is not normalized.
I'm not sure if there are corner cases where this may not hold. It probably requires further study.
minor edits
Another problem i identified with the scheme is that Q is produced in blocks and formation of entire row sum vector is not available at the point of B' and BB' computation. There's one more step further in this.
Ok i think i see how to fix BB' computation as well as power iterations.
One issue still remains as far as estimate of m*Omega term is concerned. See attached.
I am posting a first stub at bringing all the ideas together, please review. It doesn't contain the detailed modification plan though, just the algebra.
BTW is there a formal name of a vector product of a and b in a form of a new vector (a_1 * b_1, a2 * b_2, ... a_n * b_n)?
Elementwise product.
Yes, the expectation is zero, but the variance is going to be big regardless of the input size, I think, unfortunately. So the m Omega term is still a problem. For my problems its brute force computation will actually take more than e.g. squaring my input. So it was a first thought but I don't think it is valid enough. So I withdraw this for now.
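To make the variance concern concrete, here is a short worked calculation. It is only a sketch: it assumes (the thread does not state this) that the entries of \Omega are i.i.d. with zero mean and variance \sigma^2, and that the troublesome term is the product of a fixed mean vector with \Omega. The symbols \xi (the mean vector, length n) and \sigma are notation introduced here.

```latex
\mathbb{E}\big[(\xi^{\top}\Omega)_j\big]
  = \sum_{i=1}^{n} \xi_i\,\mathbb{E}[\Omega_{ij}] = 0,
\qquad
\operatorname{Var}\big[(\xi^{\top}\Omega)_j\big]
  = \sum_{i=1}^{n} \xi_i^{2}\,\operatorname{Var}[\Omega_{ij}]
  = \sigma^{2}\,\lVert \xi \rVert_2^{2}.
```

Under these assumptions the spread of each entry is set by the norm of the mean vector and does not shrink as the number of rows of A grows, so a zero expectation alone would not justify dropping the term.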
But we may not have a choice for the big data though. And then again there's a connection with power iterations. The basis doesn't have to be perfect and in practice it never is, but power iterations improve it a lot. Power iterations flow is here: https://github.com/dlyubimov/mahoutcommits/blob/ssvddocs/Power%20Iterations.pdf?raw=true. Now the question is whether this assumption is going to render the power iteration flow useless.
I noticed the same thing with some quick matlab tests. It seems that the orthogonal basis (Q) of Y does not change too much even if mean subtraction is not applied to A. This seems to be true even when the mean of A is not zero. I still need to think some more about this to understand if it is always the case or not.
Still need a bit of thought how it all works with power iterations, there need to be changes there as well
And it seems that when the mean of rows is used then indeed, as Raphael is saying, the output of Q has to produce the sum of rows as a single vector, and with the mean of columns the output of Q will have to produce the sum of columns as a blocked vector. Then this vector must be incorporated into the Bt job to produce offsets there. Got it.
OK so that's what I called the brute force approach. Assuming we somehow know the median, just adjust the input as we go. For column wise median we will know the median right away. For row wise median, which I think the majority of use cases would want to do, we will have to precompute it with one more pass. Good thing about it is that at least it would have very little shuffle and sort pressure, so it would practically run almost as fast as a map-only job.
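The extra pass described above can be sketched as follows. This is a minimal illustration, not the Mahout job: it assumes the quantity to precompute is the mean over all rows of A (one value per column), that rows arrive as sparse {column: value} maps, and that each input block emits a single small partial result. The function names and the toy data are hypothetical.

```python
def partial_column_sums(rows, n):
    """'Mapper' side: one pass over a block of sparse rows ({col: value} dicts)."""
    sums = [0.0] * n
    for row in rows:
        for j, v in row.items():
            sums[j] += v
    return sums, len(rows)


def merge_partials(partials, n):
    """'Reducer' side: combine one (sums, row_count) pair per block into the mean."""
    total = [0.0] * n
    count = 0
    for sums, c in partials:
        total = [t + s for t, s in zip(total, sums)]
        count += c
    return [t / count for t in total]


# Hypothetical input: 5 sparse rows of width 4, split into two blocks.
n = 4
block1 = [{0: 2.0, 3: 4.0}, {1: 6.0}]
block2 = [{0: 1.0}, {2: 5.0, 3: 1.0}, {1: 4.0}]

partials = [partial_column_sums(b, n) for b in (block1, block2)]
mean_row = merge_partials(partials, n)
print(mean_row)
```

Each block contributes only one vector of length n plus a count, which is why such a pass would put very little pressure on shuffle and sort.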
I think this is a very easy change.
For the SSVD and PCA, what I had in mind was that forming an offset Y was easy if you have the row means because you can compute
Y = (A - m) \Omega = A \Omega - m \Omega
That is, each row of Y can be adjusted on the fly as it is computed. The computation of Q in the next step will be unchanged, but the definition of B must include the mean subtraction as well:
B = Q' (A - m) = Q' A - Q' m
Other than this, the actual decomposition should be nearly good to go.
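The identity Y = (A - m) \Omega = A \Omega - m \Omega can be checked numerically on a small example. The sketch below is an illustration only and makes one assumption that the comment leaves open: m is read as the matrix whose rows all equal the mean row of A (the per-column mean), so the correction is the same k-vector for every row of Y. The helper names and the sizes are hypothetical.

```python
import random


def matmul(a, b):
    """Dense product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def column_means(a):
    """Mean of each column of A, i.e. the mean row."""
    return [sum(col) / len(a) for col in zip(*a)]


def centered_y_brute_force(a, omega):
    """Densify: subtract the mean row from every row of A, then multiply."""
    xi = column_means(a)
    a_centered = [[v - mu for v, mu in zip(row, xi)] for row in a]
    return matmul(a_centered, omega)


def centered_y_on_the_fly(a, omega):
    """Multiply the untouched (sparse) A, then fix each row of Y as it is produced."""
    xi = column_means(a)
    offset = matmul([xi], omega)[0]  # xi' * Omega, a single k-vector
    return [[v - o for v, o in zip(row, offset)] for row in matmul(a, omega)]


random.seed(0)
m_rows, n, k = 6, 5, 3
a = [[random.random() if random.random() < 0.4 else 0.0 for _ in range(n)]
     for _ in range(m_rows)]
omega = [[random.uniform(-1.0, 1.0) for _ in range(k)] for _ in range(n)]

y_ref = centered_y_brute_force(a, omega)
y_fly = centered_y_on_the_fly(a, omega)
max_diff = max(abs(p - q) for r1, r2 in zip(y_ref, y_fly) for p, q in zip(r1, r2))
print(max_diff)  # should be at floating-point rounding level
```

The second variant never forms A - m, which is the point of the approach: the sparse input stays sparse and only the small k-vector offset is carried along.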
situation gets even more hairy if you factor in power iterations and future option with Cholesky route, unless you assume already modified input. So i am dubious about everything except brute force from every angle of it so far.
The way i understood original idea from Ted, since we are performing projection into B, then the center of original data would also project onto center of projected data (in this case, data are column vectors).
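This observation can be written out explicitly. The lines below are a sketch assuming B = Q'A and treating the n columns a_j of A as the data points; \bar{a} and \bar{b} are notation introduced here for the centers of the columns of A and B.

```latex
b_j = Q^{\top} a_j
\;\;\Longrightarrow\;\;
\bar{b} = \frac{1}{n}\sum_{j=1}^{n} b_j
        = Q^{\top}\Big(\frac{1}{n}\sum_{j=1}^{n} a_j\Big)
        = Q^{\top}\bar{a},
\qquad
B - \bar{b}\,\mathbf{1}^{\top} = Q^{\top}\big(A - \bar{a}\,\mathbf{1}^{\top}\big).
```

For this orientation, centering B after the projection gives the same matrix as projecting the centered A. The identity says nothing about whether a Q computed from uncentered data is still a good basis for the centered data.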
if row vectors are implied as pca items that means subtraction of row mean but i am not 100% sure how this works, but it seems that this case can be solved by finding the row mean of Y and proceeding with Y - M_y instead of Y. However, i am not sure at all how it plays out esp. with power iterations. It would seem to me that random projection of centered vs. noncentered data may not be the same in the context of this method. I don't immediately see this.
Even subtraction of median in B may affect the accuracy because random projection captured the action of the original data, but not necessarily the centered data. Once data is centered, the optimal subspace capturing variances might be quite different from original subspace produced in Q. That's why i say maybe brute force approach is the right one. At least i can easily convince myself it is what PCA defines.
In addition, the main difficulty is that to know mean of A, we need one separate pass over A (at least with a row mean), and the whole idea is that probably we can do it on the fly somewhere else with already projected data.
One question: is it necessary to do mean subtraction of A before computing the QR decomposition, or will the columns of Q still
form a good basis even without mean subtraction?
That's exactly my concern. i think this breaks the fundamental premise of the method (unless it somehow magically appears to be just as good, but it would seem to me it is not, at least i can construct a visual counterexample in my head).
So assume we need to do subtraction before attempting to find a good basis for projection. Then for the case of columnwise mean it is easy, we can do it on the fly and we need just one pass over data while doing the Y and Q stuff. If we want a rowwise mean, the brute force requires one more pass to acquire the mean.
It seems there are two jobs that need to be modified: BBTjob and Vjob. Since they both work column wise it should
be straightforward to pass in the vector qs and the scalar a_mean( i ).
BBt job is now obsolete. BBt is now produced in reducers of Bt job as a bonus and finalized in the front end.
Could you expand on this a little?
If I understand correctly we need to implicitly do mean subtraction of A whenever we work with B.
It seems this is equivalent to subtracting qs'*a_mean from B, where qs is the sum of the rows of Q
and a_mean is the mean of the rows of A. So if bi is the ith column of B then the column with
implicit mean subtraction of A is
bi - qs'*a_mean( i )
where a_mean( i ) is the ith element of a_mean.
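The column-wise correction of B stated above can be verified on a toy example. This is a sketch under stated assumptions: B = Q'A, a_mean is the mean of the rows of A (length n), and qs is the sum of the rows of Q (length k). Q here is just a random m-by-k stand-in, since the identity is plain linear algebra and does not need Q to be orthonormal; all names and sizes are hypothetical.

```python
import random


def transpose(a):
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    """Dense product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


random.seed(1)
m, n, k = 7, 4, 3
a = [[random.uniform(0.0, 2.0) for _ in range(n)] for _ in range(m)]
q = [[random.uniform(-1.0, 1.0) for _ in range(k)] for _ in range(m)]  # stand-in for Q

a_mean = [sum(col) / m for col in zip(*a)]  # mean of the rows of A
qs = [sum(col) for col in zip(*q)]          # sum of the rows of Q

# Reference: explicitly mean-subtract A, then project.
a_centered = [[v - mu for v, mu in zip(row, a_mean)] for row in a]
b_ref = matmul(transpose(q), a_centered)

# Implicit version: column i of B minus qs' * a_mean[i].
b = matmul(transpose(q), a)
b_fixed = [[b[r][i] - qs[r] * a_mean[i] for i in range(n)] for r in range(k)]

max_diff = max(abs(x - y) for r1, r2 in zip(b_ref, b_fixed) for x, y in zip(r1, r2))
print(max_diff)  # should be at floating-point rounding level
```

Because the correction for column i needs only qs and the scalar a_mean[i], it fits jobs that process B one column at a time, provided the full qs is available at that point.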
Could you explain what the 'column mean' is? I thought that each data point corresponds to a row in A, so that subtraction of row means
would be more appropriate?
For the column mean the brute-force approach is probably the simplest: we'd have to decorate the input of A with mean subtraction.
I don't think we want to have an explicit step to compile either Y or B means.
We can construct them and even output them on the fly, albeit in a blocked form.
But we probably do need A means in the final output to enable back and forward fold ins of the new items, right?
Dmitriy, what's the current state of this? I'll start looking into this if it suits.
removed from 0.6 roadmap per conversation on the list.
why would we want to support both row and column mean subtraction? I need to reread the motivation of this.
I think a lot also rests on the question of whether we actually also want to output the mean.
And the next question is whether we want to spend one additional pass just to find the mean. if yes, then the rest is easy. we will just be doing mean subtraction as part of the Y computation. should be ok flops-wise.
but if we think we shouldn't be waiting for mean computation as a separate pass, and we don't want to output it either, then that's where it becomes a little tricky.
1 & 2 sound comprehensive to me. Option 1 (subtracting the mean from B) seems like a great approach except that it seems to be focused on column or global subtraction of means. If you want to subtract row means then working on Y might be applicable. As you say, this requires a bit of thinking.
reorganized SSVDCLI manual.