Uploaded image for project: 'Mahout'
  1. Mahout
  2. MAHOUT-639

Need special case to handle creating a new SequentialAccessSparseVector from a large (> 1M dims) random/hashed vector

    Details

      Description

      When trying to transpose a matrix of tfidf vectors created from text documents (ASF mail archives in this case), there is a bottleneck in the TransposeJob's reducer when Mahout creates a new SequentialAccessSparseVector from a RandomAccessSparseVector after the while loop completes:

      SequentialAccessSparseVector outVector = new SequentialAccessSparseVector(tmp);

      For high-frequency terms (some of which occur over ~1M times in my data), the code to create a SequentialAccessSparseVector from a RandomAccessSparseVector bogs down completely ....

      From Jake Mannix:
      "Suspicion confirmed:

      public SequentialAccessSparseVector(Vector other) {
      this(other.size(), other.getNumNondefaultElements());
      Iterator<Element> it = other.iterateNonZero();
      Element e;
      while (it.hasNext() && (e = it.next()) != null)

      { set(e.index(), e.get()); }

      }

      we iterate over the other vector (which is in random/hashed order), adding it to the sequential access vector (which always tries to stay in sequential order). So actually, this may be worse than O(n^2), but I'd prefer to just not know how much worse, and instead we should fix it.

      Should be fairly straightforward: make an array of structs (essentially) with the index and the double, of size other.getNumNonDefaultElements() (what a horrible method name), fill it up on one iteration over the other vector, sort it in place, then make your new OrderedIntDoubleMapping out of the indexes and values (unless someone has a cleverer idea to sort a pair of two arrays at the same time, shuffling one based on the ordering criterion of the other)."

      1. MAHOUT-639.patch
        7 kB
        Timothy Potter

        Activity

        Hide
        jake.mannix Jake Mannix added a comment -

        Tim, did you ever check the correctness of the version you wrote for this?

        Show
        jake.mannix Jake Mannix added a comment - Tim, did you ever check the correctness of the version you wrote for this?
        Hide
        thelabdude Timothy Potter added a comment -

        Not the "optimal" way suggested by Jake, but definitely gets passed the problem by using a new internal class OrderedElement which sorts Elements by their index, thus allowing us to use Arrays.sort(). However, it does mean that we create many temporary OrderedElement objects during the copy process. Also, I'm not entirely sure my test case is not going to cause the community issues since it's based on a JUnit timeout.

        Lastly, I also changed a few things around to make the matrixmult job more efficient which also bogged down in multiplying these large vectors in the mappers after I got passed the transpose issue.

        Show
        thelabdude Timothy Potter added a comment - Not the "optimal" way suggested by Jake, but definitely gets passed the problem by using a new internal class OrderedElement which sorts Elements by their index, thus allowing us to use Arrays.sort(). However, it does mean that we create many temporary OrderedElement objects during the copy process. Also, I'm not entirely sure my test case is not going to cause the community issues since it's based on a JUnit timeout. Lastly, I also changed a few things around to make the matrixmult job more efficient which also bogged down in multiplying these large vectors in the mappers after I got passed the transpose issue.
        Hide
        thelabdude Timothy Potter added a comment -

        see previous comment

        Show
        thelabdude Timothy Potter added a comment - see previous comment
        Hide
        jake.mannix Jake Mannix added a comment -

        Whenever this comes up, it kills performance. This is the root cause of the issue raised today in the mailing list ("LanczosSolver Very Slow"). So I'm going to get this (or equivalent) committed before 0.6 goes out.

        Show
        jake.mannix Jake Mannix added a comment - Whenever this comes up, it kills performance. This is the root cause of the issue raised today in the mailing list ("LanczosSolver Very Slow"). So I'm going to get this (or equivalent) committed before 0.6 goes out.
        Hide
        jake.mannix Jake Mannix added a comment -

        Ok, Tim I think the idea of using auxiliary structs is fine. It's not perfectly optimal, but the time complexity is the same big-O as the optimal solution (different constant, probably, and requires extra short-lived objects, true). I was able to get it about 10% faster by not using the set() method on OrderedIntDoubleMapping, and instead making the constructor for that class package private and using it with the already-ordered int[] and double[] which can be created directly in copySortedRandomAccessSparseVector().

        I don't think we want a timeout based nondeterministic unit test for this, however. That's just asking for trouble. I think verifying all other tests work, and then actually checking this actually improves performance is enough.

        I'll clean this up and get it committed this weekend.

        Show
        jake.mannix Jake Mannix added a comment - Ok, Tim I think the idea of using auxiliary structs is fine. It's not perfectly optimal, but the time complexity is the same big-O as the optimal solution (different constant, probably, and requires extra short-lived objects, true). I was able to get it about 10% faster by not using the set() method on OrderedIntDoubleMapping, and instead making the constructor for that class package private and using it with the already-ordered int[] and double[] which can be created directly in copySortedRandomAccessSparseVector(). I don't think we want a timeout based nondeterministic unit test for this, however. That's just asking for trouble. I think verifying all other tests work, and then actually checking this actually improves performance is enough. I'll clean this up and get it committed this weekend.
        Hide
        hudson Hudson added a comment -

        Integrated in Mahout-Quality #786 (See https://builds.apache.org/hudson/job/Mahout-Quality/786/)
        Fixes MAHOUT-639

        Show
        hudson Hudson added a comment - Integrated in Mahout-Quality #786 (See https://builds.apache.org/hudson/job/Mahout-Quality/786/ ) Fixes MAHOUT-639
        Hide
        thelabdude Timothy Potter added a comment -

        Hi Jake,

        Agreed on the unit test timeout not being a good approach. Thanks for cleaning up the code I submitted.

        Show
        thelabdude Timothy Potter added a comment - Hi Jake, Agreed on the unit test timeout not being a good approach. Thanks for cleaning up the code I submitted.
        Hide
        jake.mannix Jake Mannix added a comment -

        Hi Tim,

        Thanks for having that timeout in there, actually - for developing / verification purposes it was a good idea. We just can't have it in there for our continuous integration etc. Good stuff, thanks for getting this in!

        Show
        jake.mannix Jake Mannix added a comment - Hi Tim, Thanks for having that timeout in there, actually - for developing / verification purposes it was a good idea. We just can't have it in there for our continuous integration etc. Good stuff, thanks for getting this in!

          People

          • Assignee:
            jake.mannix Jake Mannix
            Reporter:
            thelabdude Timothy Potter
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development