Mahout
  1. Mahout
  2. MAHOUT-638

Stochastic svd's is not handling well all cases of sparse vectors

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.5
    • Fix Version/s: 0.5
    • Component/s: Math
    • Labels:
      None

      Description

      The Mahout patch of the algorithm is not handling all types of sparse input efficiently. BtJob doesn't handle SequentialSparseVector in a way to pick only non-zero elements from initial input and QJob doesn't iterate over RandomAccessSparseVector correctly. With extremely sparse inputs (0.05% non-zero elements) that leads to a terrible inefficiency in the aforementioned jobs (QJob, BtJob).

      1. MAHOUT-638-2.patch
        30 kB
        Dmitriy Lyubimov
      2. MAHOUT-638-2.patch
        34 kB
        Dmitriy Lyubimov
      3. MAHOUT-638.patch
        3 kB
        Dmitriy Lyubimov
      4. MAHOUT-622-2.patch
        12 kB
        Dmitriy Lyubimov

        Activity

        Hide
        Dmitriy Lyubimov added a comment -

        is that the best way to do row-wise outer product computation (rows A and Q)?

              if ( (aRow instanceof SequentialAccessSparseVector) ||
                  (aRow instanceof RandomAccessSparseVector )) {
                for ( Vector.Element el:aRow ) { 
                  double mul=el.get();
                  for ( int j =0; j < kp; j++ ) 
                    btRow.setQuick(j, mul * qRow.getQuick(j));
                  btKey.set(el.index());
                  context.write(btKey, btValue);
                }
              } else { 
                int n = aRow.size();
                for (int i = 0; i < n; i++) {
                  double mul = aRow.getQuick(i);
                  for (int j = 0; j < kp; j++)
                    btRow.setQuick(j, mul * qRow.getQuick(j));
                  btKey.set(i);
                  context.write(btKey, btValue);
                }
              }
        
        
        Show
        Dmitriy Lyubimov added a comment - is that the best way to do row-wise outer product computation (rows A and Q)? if ( (aRow instanceof SequentialAccessSparseVector) || (aRow instanceof RandomAccessSparseVector )) { for ( Vector.Element el:aRow ) { double mul=el.get(); for ( int j =0; j < kp; j++ ) btRow.setQuick(j, mul * qRow.getQuick(j)); btKey.set(el.index()); context.write(btKey, btValue); } } else { int n = aRow.size(); for ( int i = 0; i < n; i++) { double mul = aRow.getQuick(i); for ( int j = 0; j < kp; j++) btRow.setQuick(j, mul * qRow.getQuick(j)); btKey.set(i); context.write(btKey, btValue); } }
        Hide
        Dmitriy Lyubimov added a comment -

        adding missing support for sparse vectors during projection and multiplication (partial outer product computation)

        Show
        Dmitriy Lyubimov added a comment - adding missing support for sparse vectors during projection and multiplication (partial outer product computation)
        Hide
        Sean Owen added a comment -

        Vector has properties isDense() and isSequentialAccess() which should prevent us from ever needing instanceof. The patch doesn't quite match code style, but, that's minor enough. It'll get cleaned up eventually by me, though might have a glance at the rest of the code for spacing / brace conventions and such.

        Otherwise I'm sure you're welcome to proceed.

        Show
        Sean Owen added a comment - Vector has properties isDense() and isSequentialAccess() which should prevent us from ever needing instanceof. The patch doesn't quite match code style, but, that's minor enough. It'll get cleaned up eventually by me, though might have a glance at the rest of the code for spacing / brace conventions and such. Otherwise I'm sure you're welcome to proceed.
        Hide
        Dmitriy Lyubimov added a comment -

        Yeah. I looked the instanceof use somewhere in DRM code, it would seem either vector contract deficient, or it was not used properly. I'll patch it.

        Strange. I committed that patch yesterday since a user was waiting to validate this code. Jenkins did not comment on it, although i am pretty sure it had the issue number in the commit comment.

        Show
        Dmitriy Lyubimov added a comment - Yeah. I looked the instanceof use somewhere in DRM code, it would seem either vector contract deficient, or it was not used properly. I'll patch it. Strange. I committed that patch yesterday since a user was waiting to validate this code. Jenkins did not comment on it, although i am pretty sure it had the issue number in the commit comment.
        Hide
        Hudson added a comment -

        Integrated in Mahout-Quality #708 (See https://hudson.apache.org/hudson/job/Mahout-Quality/708/)

        Show
        Hudson added a comment - Integrated in Mahout-Quality #708 (See https://hudson.apache.org/hudson/job/Mahout-Quality/708/ )
        Hide
        Dmitriy Lyubimov added a comment -

        Let me do one more round on this, please. Please don't close it yet. Thanks.

        Show
        Dmitriy Lyubimov added a comment - Let me do one more round on this, please. Please don't close it yet. Thanks.
        Hide
        Dmitriy Lyubimov added a comment -

        Second installment:

        • reworked to use Vector#isDense().
        • isolating code with valid warnings and marking them with @SuppressWarnings().
        • adding test for sparse matrix.
        Show
        Dmitriy Lyubimov added a comment - Second installment: reworked to use Vector#isDense(). isolating code with valid warnings and marking them with @SuppressWarnings(). adding test for sparse matrix.
        Hide
        Dmitriy Lyubimov added a comment -

        Fixing checkstyle complaints.

        I will let this patch sit for another day and commit it if no objections. This is a small change.

        Show
        Dmitriy Lyubimov added a comment - Fixing checkstyle complaints. I will let this patch sit for another day and commit it if no objections. This is a small change.
        Hide
        Dmitriy Lyubimov added a comment -

        Fixing checkstyle complaints.
        This is a small change. i will let it sit here for another day before committing. I will commit if no objections in a day.

        Show
        Dmitriy Lyubimov added a comment - Fixing checkstyle complaints. This is a small change. i will let it sit here for another day before committing. I will commit if no objections in a day.
        Hide
        Dmitriy Lyubimov added a comment -

        Ooops. please ignore -622 patch, it does not belong here.

        Second installment patch is MAHOUT-638-2.patch.

        Show
        Dmitriy Lyubimov added a comment - Ooops. please ignore -622 patch, it does not belong here. Second installment patch is MAHOUT-638 -2.patch.
        Hide
        Dmitriy Lyubimov added a comment -

        Github url for this issue: https://github.com/dlyubimov/mahout-commits/tree/MAHOUT-638 (pull requests are welcome).

        Show
        Dmitriy Lyubimov added a comment - Github url for this issue: https://github.com/dlyubimov/mahout-commits/tree/MAHOUT-638 (pull requests are welcome).
        Hide
        Hudson added a comment -

        Integrated in Mahout-Quality #761 (See https://hudson.apache.org/hudson/job/Mahout-Quality/761/)
        MAHOUT-638: second installment of fixes, sequential sparse source test

        Show
        Hudson added a comment - Integrated in Mahout-Quality #761 (See https://hudson.apache.org/hudson/job/Mahout-Quality/761/ ) MAHOUT-638 : second installment of fixes, sequential sparse source test
        Hide
        Hudson added a comment -

        Integrated in Mahout-Quality #762 (See https://hudson.apache.org/hudson/job/Mahout-Quality/762/)
        MAHOUT-638: fixing a row multiplication for sparse input

        Show
        Hudson added a comment - Integrated in Mahout-Quality #762 (See https://hudson.apache.org/hudson/job/Mahout-Quality/762/ ) MAHOUT-638 : fixing a row multiplication for sparse input
        Hide
        Dmitriy Lyubimov added a comment -

        This of course is still wrong. I looked at the implementAtion of the full iterator in sparse vectors and it does sequential iteration, whereas I assumed it is equivAlent to iteratenonzero. So this fix is not the right one and sparse inputs are still handled deficiently.

        I cannot reopen this one, so I will commit the quick fix for that in one of the other re lated issues.

        Show
        Dmitriy Lyubimov added a comment - This of course is still wrong. I looked at the implementAtion of the full iterator in sparse vectors and it does sequential iteration, whereas I assumed it is equivAlent to iteratenonzero. So this fix is not the right one and sparse inputs are still handled deficiently. I cannot reopen this one, so I will commit the quick fix for that in one of the other re lated issues.
        Hide
        Ted Dunning added a comment -

        Don't even assume that iteratorNonZero will never return zeros.

        Show
        Ted Dunning added a comment - Don't even assume that iteratorNonZero will never return zeros.
        Hide
        Dmitriy Lyubimov added a comment -

        but it doesn't show all zeros, right? otherwise, what's the best way to iterate thru non-zero elements of a sparse vector?

        i just want not to iterate thru millions of zeros, but i am ok to iterate thru few of them.

        Show
        Dmitriy Lyubimov added a comment - but it doesn't show all zeros, right? otherwise, what's the best way to iterate thru non-zero elements of a sparse vector? i just want not to iterate thru millions of zeros, but i am ok to iterate thru few of them.
        Hide
        Ted Dunning added a comment -

        Right. IterateNonZero is allowed to not returned zeros, but it is not guaranteed to never return a zero.

        If you want to go faster, then it is the thing to use. There is no contract, however, that says that it will be faster and for a dense matrix, it may well not be. It would be good to have the dense iterator avoid showing zeros if only to avoid constructing the little Element structure unnecessarily.

        Show
        Ted Dunning added a comment - Right. IterateNonZero is allowed to not returned zeros, but it is not guaranteed to never return a zero. If you want to go faster, then it is the thing to use. There is no contract, however, that says that it will be faster and for a dense matrix, it may well not be. It would be good to have the dense iterator avoid showing zeros if only to avoid constructing the little Element structure unnecessarily.
        Hide
        Dmitriy Lyubimov added a comment -

        Ok... my latest version to compute Y buffer (which also may be sparse, but i handled it separately looks like this :

        
          public void computeYRow(Vector aRow, double[] yRow) {
            // assert yRow.length == kp;
            Arrays.fill(yRow, 0.0);
            if (aRow.isDense()) {
              int n = aRow.size();
              for (int j = 0; j < n; j++) {
                accumDots(j, aRow.getQuick(j), yRow);
              }
            } else {
              for (Iterator<Element> iter = aRow.iterateNonZero(); iter.hasNext();) {
                Element el = iter.next();
                accumDots(el.index(), el.get(), yRow);
              }
            }
          }
        

        Does it look good to you? Do you think Y should be handled as RandomAcessSparseVector instead?

        Show
        Dmitriy Lyubimov added a comment - Ok... my latest version to compute Y buffer (which also may be sparse, but i handled it separately looks like this : public void computeYRow(Vector aRow, double [] yRow) { // assert yRow.length == kp; Arrays.fill(yRow, 0.0); if (aRow.isDense()) { int n = aRow.size(); for ( int j = 0; j < n; j++) { accumDots(j, aRow.getQuick(j), yRow); } } else { for (Iterator<Element> iter = aRow.iterateNonZero(); iter.hasNext();) { Element el = iter.next(); accumDots(el.index(), el.get(), yRow); } } } Does it look good to you? Do you think Y should be handled as RandomAcessSparseVector instead?
        Hide
        Ted Dunning added a comment -

        I don't know the context well enough here.

        Why is simple multiplication not good enough here?

        Show
        Ted Dunning added a comment - I don't know the context well enough here. Why is simple multiplication not good enough here?
        Hide
        Dmitriy Lyubimov added a comment - - edited

        it's random projection multiplication Y_

        {i*}=A_{i*}

        \Omega (part of the Omega class). Since that's the only operation we ever do with Omega, i did not feel like I needed to implement the whole Matrix api for Omega; and even if i did, it would've looked as something similar except here it is optimized to re-use the same holder object (yRow).

        Show
        Dmitriy Lyubimov added a comment - - edited it's random projection multiplication Y_ {i*}=A_{i*} \Omega (part of the Omega class). Since that's the only operation we ever do with Omega, i did not feel like I needed to implement the whole Matrix api for Omega; and even if i did, it would've looked as something similar except here it is optimized to re-use the same holder object (yRow).
        Hide
        Ted Dunning added a comment -

        Ahh...

        I went ahead and implemented a random matrix to make things like this easier.

        Show
        Ted Dunning added a comment - Ahh... I went ahead and implemented a random matrix to make things like this easier.

          People

          • Assignee:
            Dmitriy Lyubimov
            Reporter:
            Dmitriy Lyubimov
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development