Mahout
  MAHOUT-881

Refactor TopItems to use Lucene's PriorityQueue and remove excessive sorting

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Later
    • Affects Version/s: 0.6
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      TopItems.getTop*() all do a fair number of excessive operations that can be replaced by switching to using Lucene's PriorityQueue implementation, which is more efficient and faster than Java's built in PQ implementation.

      1. Call_Tree_2.html
        8 kB
        Sean Owen
      2. Call_Tree.html
        7 kB
        Sean Owen
      3. MAHOUT-881.patch
        20 kB
        Grant Ingersoll
      4. MAHOUT-881.patch
        19 kB
        Grant Ingersoll
      5. MAHOUT-881.patch
        15 kB
        Grant Ingersoll

        Activity

        Grant Ingersoll added a comment -

        This patch converts TopItems to use Lucene's PQ and removes unnecessary sorts, etc. It fails one test because the old code does not provide a total ordering on the SimilarUser object (it does not honor the suggestion in Comparable that x.compareTo(y) == 0 imply x.equals(y)): GenericUserBasedRecommenderTest.testMostSimilar() asserts that the returned item is #4, but my version returns #3, which has the same score as #4, because I break ties using the id.
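The total-ordering point can be illustrated with a comparator that is consistent with equals. This is a minimal sketch, not Mahout code: ScoredUser is a hypothetical stand-in for SimilarUser.

```java
import java.util.Comparator;

// Hypothetical stand-in for Mahout's SimilarUser: a user id plus a similarity score.
final class ScoredUser {
    final long id;
    final double score;
    ScoredUser(long id, double score) { this.id = id; this.score = score; }
}

public class TieBreakDemo {
    // Order by descending score; break ties by ascending id, so that
    // compare(a, b) == 0 only when the two entries are truly interchangeable.
    static final Comparator<ScoredUser> BY_SCORE_THEN_ID =
        Comparator.comparingDouble((ScoredUser u) -> u.score)
                  .reversed()
                  .thenComparingLong(u -> u.id);

    public static void main(String[] args) {
        ScoredUser u3 = new ScoredUser(3, 0.9);
        ScoredUser u4 = new ScoredUser(4, 0.9);
        // Same score, so the id decides: user 3 sorts ahead of user 4.
        assert BY_SCORE_THEN_ID.compare(u3, u4) < 0;
    }
}
```

With such a comparator, a ties-broken-by-id result is deterministic, which is exactly why the patched code returns #3 where the old test expected #4.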

        Sean Owen added a comment -

        -1. Grant, I thought we discussed this on the mailing list? I don't see that this achieves anything. You are still doing an implicit n log n heap sort to attain an ordered result, and you still allocate a container object for the result. This seems like swapping some code for more complex (but equally working) code, and adding a Lucene dependency.

        Do you have a load test that shows it would be notably faster? The patch doesn't include such a thing.

        The comment about SimilarUser is right though. It would just affect tie breaking, which doesn't really matter, but should be fixed.

        Hudson added a comment -

        Integrated in Mahout-Quality #1170 (See https://builds.apache.org/job/Mahout-Quality/1170/)
        MAHOUT-881: add a unit test for TopItems also fix minor code formatting issue

        gsingers : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1201222
        Files :

        • /mahout/trunk/core/src/main/java/org/apache/mahout/cf/taste/impl/recommender/TopItems.java
        • /mahout/trunk/core/src/test/java/org/apache/mahout/cf/taste/impl/recommender/TopItemsTest.java
        Hudson added a comment -

        Integrated in Mahout-Quality #1171 (See https://builds.apache.org/job/Mahout-Quality/1171/)
        MAHOUT-882 Actually used rescored value in one TopItems method. Also, fix SimilarUser issue along the way from MAHOUT-881

        srowen : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1201248
        Files :

        • /mahout/trunk/core/src/main/java/org/apache/mahout/cf/taste/impl/recommender/SimilarUser.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/cf/taste/impl/recommender/TopItems.java
        • /mahout/trunk/core/src/test/java/org/apache/mahout/cf/taste/impl/recommender/GenericUserBasedRecommenderTest.java
        Yonik Seeley added a comment -

        Grant's patch seems to halve the number of PQ operations while collecting the top N. I don't know enough about possible Mahout usage patterns, or the relative cost of this against everything else, to know if it matters, but it does for Lucene. One Lucene example is sorting by date descending when documents have been added in roughly ascending order: the number of PQ operations is very high, so cutting it in half is a nice win.

        As far as the sort, the patch essentially implements a heap sort, taking advantage of the fact that we have already done the first half of that (building the heap). Not sure how to compare that to other sort algorithms...

        Memory usage also seems improved. The existing code:

            List<RecommendedItem> result = Lists.newArrayListWithCapacity(size);
            result.addAll(topItems);
            Collections.sort(result, ByValueRecommendedItemComparator.getInstance());
        
        • Object[size] allocated for the ArrayList
        • result.addAll calls toArray which creates another Object[size] and then copies it to the first
        • Collections.sort creates another Object[size] while doing the mergesort, then iterates the original collection, setting every item

        I think Grant's might be improved by using Arrays.asList which would avoid an extra Object[size] allocation + copy.

        So the patch also improves on the memory use - but again, I don't know enough about Mahout to know if it matters in this context.
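The Arrays.asList() suggestion above can be sketched as follows. This is an illustrative version using plain Doubles in place of RecommendedItem: toArray() makes the only copy, sort() works in place on that array, and Arrays.asList() wraps the array without copying again.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

public class AsListDemo {
    // Drain the collected top items into a sorted result list using a single
    // backing array, instead of the ArrayList + addAll + Collections.sort
    // sequence, which allocates (and copies between) several Object[size] arrays.
    static List<Double> toSortedList(Collection<Double> topItems) {
        Double[] arr = topItems.toArray(new Double[0]); // the one copy
        Arrays.sort(arr, Collections.reverseOrder());   // sorts in place
        return Arrays.asList(arr);                      // fixed-size view, no extra copy
    }

    public static void main(String[] args) {
        PriorityQueue<Double> q = new PriorityQueue<>(List.of(0.3, 0.9, 0.5));
        assert toSortedList(q).equals(List.of(0.9, 0.5, 0.3));
    }
}
```

One caveat: Arrays.asList() returns a fixed-size list, which is fine if callers only read the result.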

        Sean Owen added a comment -

        (See my comments on dev@ too)

        Why are there fewer operations? Both seem to do an insert if the new item is known to be among the new top-N, and not otherwise.

        Both versions allocate the ArrayList. I think you're quite right about the addAll() and sort() though. Even if that's an implementation detail, it's real. Those are a few extra arrays of size 10-ish on average, so I would imagine it's not significant. But I also don't like the idea.

        On the other hand, I have an unfounded suspicion that the implicit heap sort here is slower in practice than a simple quicksort, since that is generally the case.

        I would be surprised if this were not all but identical in performance. We could also make some micro-wins here in other ways to avoid these couple of extra allocations, or we could use the Lucene-based queue and avoid the heap sort. I have a very slight aversion to changing to something equal but different, replacing the very standard java.util.PriorityQueue with the merely "standard" Lucene PriorityQueue.

        If I'm wrong and this isn't merely equal, I support it. If we all conclude it's equal, or barely slower, I'm -0 on the change and really not fussed about it.

        Yonik Seeley added a comment -

        Why are there fewer operations? Both seem to do an insert if the new item is known to be among the new top-N, and not otherwise.

        Assuming a full priority queue, you have an add() and a poll() per competitive entry (I don't think there's a way to get around that using Java's PQ). Lucene's PQ can rebalance the heap once per competitive entry.

        I have an unfounded suspicion that the implicit heap sort here is slower in practice here than a simple quicksort, since that is generally the case.

        Heapsort is generally slower than quicksort... but since we already have a built heap, we're only doing half-a-heapsort?
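The operation-count point can be sketched with java.util.PriorityQueue (Lucene's class is not shown here). In the pattern below, every competitive entry costs two heap operations, an offer() plus a poll(), whereas a Lucene-style insert-with-overflow replaces the bottom entry and re-balances the heap once.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

public class TopNDemo {
    // Collect the n largest values. java.util.PriorityQueue is a min-heap,
    // so the queue head is the smallest (i.e. least competitive) retained value.
    static List<Double> topN(double[] values, int n) {
        PriorityQueue<Double> heap = new PriorityQueue<>(n);
        for (double v : values) {
            if (heap.size() < n) {
                heap.offer(v);
            } else if (v > heap.peek()) {   // competitive entry
                heap.offer(v);              // heap operation #1
                heap.poll();                // heap operation #2: evict old minimum
            }
        }
        List<Double> result = new ArrayList<>(heap);
        Collections.sort(result, Collections.reverseOrder());
        return result;
    }

    public static void main(String[] args) {
        assert topN(new double[]{0.1, 0.9, 0.5, 0.7, 0.3}, 2)
                .equals(List.of(0.9, 0.7));
    }
}
```

There is no public single-operation replace-and-rebalance on java.util.PriorityQueue, which is the half of the work a Lucene-style queue saves per competitive entry.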

        Yonik Seeley added a comment -

        Grant's patch should probably also continue to track lowestTopValue to avoid creating unnecessary GenericRecommendedItem objects.

        Sean Owen added a comment -

        Both good points, I get you now. For me, these last few are valid, and tally up to enough of a theoretical win that I support the patch. I'm guessing that it's still virtually identically fast in practice – for example the expected number of add/poll operations when finding the top 10 is just about 20 anyway. I know from profiling that virtually no time is spent here. Even if it adds some code, I think it's nice.

        Grant Ingersoll added a comment -

        Grant's patch should probably also continue to track lowestTopValue to avoid creating unnecessary GenericRecommendedItem objects

        I'll add that.

        Grant Ingersoll added a comment -

        replacing very standard java.util.PriorityQueue with merely "standard" Lucene PriorityQueue.

        I think you will find that many of Java's implementations are subpar for high-performance, specialized cases, which is what we have here. So, yeah, they have standard APIs that everyone knows, but the downside is that they are usually slower.

        Yonik Seeley added a comment -

        One last note: if the number of competitive entries is ever expected to be high (as is the case with the lucene example I gave above) then you can still cap the number of GenericRecommendedItem objects created by making them mutable. Then on a competitive entry and a full PQ, you modify the bottom entry and call adjustTop() instead of creating a new entry.
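Lucene's adjustTop() re-sifts the mutated heap root in place; java.util.PriorityQueue has no in-place equivalent, so the sketch below approximates the idea by recycling the evicted object (poll, overwrite, offer). MutableItem is a hypothetical stand-in for GenericRecommendedItem; the point is that allocations are capped at the queue size no matter how many entries are competitive.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

public class ReuseDemo {
    // Hypothetical mutable stand-in for GenericRecommendedItem.
    static final class MutableItem {
        long id;
        double value;
        MutableItem(long id, double value) { this.id = id; this.value = value; }
    }

    // Collect the top n values, recycling evicted MutableItems instead of
    // allocating a new object per competitive entry. Returns the number of
    // MutableItem objects created, which never exceeds n.
    static int topWithReuse(double[] vals, int n, PriorityQueue<MutableItem> heap) {
        int allocations = 0;
        for (int k = 0; k < vals.length; k++) {
            if (heap.size() < n) {
                heap.offer(new MutableItem(k, vals[k]));
                allocations++;
            } else if (vals[k] > heap.peek().value) {
                MutableItem recycled = heap.poll(); // evict the bottom entry...
                recycled.id = k;                    // ...overwrite it in place...
                recycled.value = vals[k];
                heap.offer(recycled);               // ...and re-insert: no new object
            }
        }
        return allocations;
    }

    public static void main(String[] args) {
        PriorityQueue<MutableItem> heap =
            new PriorityQueue<>(2, Comparator.comparingDouble((MutableItem i) -> i.value));
        int allocs = topWithReuse(new double[]{0.1, 0.2, 0.3, 0.4}, 2, heap);
        assert allocs == 2;               // only the initial n objects were created
        assert heap.peek().value == 0.3;  // bottom of the surviving top-2 {0.3, 0.4}
    }
}
```

With Lucene's queue the mutate-and-adjustTop() variant would avoid even the poll/offer pair; the allocation cap is the same in both versions.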

        Grant Ingersoll added a comment -

        Adds back in the checks for fullness and lowest value.

        Still working on benchmarking. I think it would be nice if we could do both macro and micro benchmarks. For macro, Lucene has a nice benchmark toolkit that is pretty flexible/pluggable that we could maybe leverage.

        Sebastian Schelter added a comment -

        You should also include org.apache.mahout.cf.taste.common.FixedSizePriorityQueue in your optimizations if possible, as it implements the same functionality.

        Grant Ingersoll added a comment -

        Here's an updated patch which has some basic benchmarking. Still need to validate on a real machine and double-check some more things.

        Sean Owen added a comment -

        Since it's easy, I just used JProfiler to observe the exact difference. See the attached excerpt from the call graph, before and after. Out of about 23 minutes of CPU time spent in the getTopItems() method, the cost of queue operations did in fact drop, from 90 microseconds to 47 microseconds. That's 0.0065% of runtime before, or about 1 part in 15,000, so I don't think you are observing any actual difference in runtime.

        I'm not against this; but if it's not adding what we thought and is introducing slightly more complexity and change, I'd be very slightly predisposed against it. Or what about directly addressing some of those allocations Yonik mentioned? Those are one-liners.

        In any event, I would rather not also change AbstractAverageDifferenceEvaluator – was that just a temp change? Or I could work in reporting those figures differently for you. To answer your TODO: returning an array would change the API, and cause some other difficult breakage (IIRC). It's a List on purpose. I think you'll find however that Arrays.asList() is the better method call there, probably avoids overhead.

        Grant Ingersoll added a comment -

        Yeah, I did some profiling too and came to the same conclusion. I'm going to shelve this. We can come back to it after improving some of the other things.

        Sean Owen added a comment -

        I think the tests should at least be committed. I'd also like to make some 1-liner improvements here that are obvious, if tiny, wins.

        Grant Ingersoll added a comment -

        Yeah, tests are good and the evaluator.

        Ted Dunning added a comment -

        from 90 microseconds to 47 microseconds

        This is consistent with my experience in other efforts. The priority queue is rarely the problem if you avoid inserting most elements, and even if you do insert most elements due to pathological ordering of the original data, it isn't a big deal, since the cost is n log k, where n is the number of documents and k is the size of the queue.

        One big difference that we can probably make, however, is to multi-thread some of these sequential programs. This isn't very hard with the Executors in Java. This doesn't make things more efficient, but it does make them 10x faster on commonly available servers. That is an effort for a different JIRA in any case.
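The Executors idea above can be sketched as follows. Everything here is illustrative rather than Mahout code: score() stands in for whatever per-user work a recommender does, and the chunking scheme is just one way to partition the id range across a fixed pool.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelDemo {
    // Stand-in for independent per-user work (e.g. scoring recommendations).
    static double score(int userId) { return userId * 0.5; }

    // Split the [0, numUsers) id range into contiguous chunks, run each chunk
    // as a Callable on a fixed thread pool, and merge the per-chunk results.
    static double scoreAll(int numUsers, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            int chunk = (numUsers + threads - 1) / threads;
            List<Callable<Double>> tasks = new ArrayList<>();
            for (int t = 0; t < threads; t++) {
                int lo = t * chunk;
                int hi = Math.min(numUsers, lo + chunk);
                tasks.add(() -> {
                    double sum = 0;
                    for (int u = lo; u < hi; u++) sum += score(u);
                    return sum;
                });
            }
            double total = 0;
            for (Future<Double> f : pool.invokeAll(tasks)) total += f.get();
            return total;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Sum of 0.5 * u for u in [0, 100) = 0.5 * 4950 = 2475.0
        assert scoreAll(100, 4) == 2475.0;
    }
}
```

This only helps when the per-chunk work is independent, which is the case for scoring users against a read-only data model.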

        Grant Ingersoll added a comment -

        I see some other things we can do, too. For instance, I'm benchmarking whether we can avoid allocating the arrays for storing the similarity results so often.

        Ted Dunning added a comment -

        avoid allocating the arrays

        There is a somewhat pathological interest in the Lucene community about avoiding allocation. The standard approach to doing this is to re-use data structures.

        In fact, this often has a perverse effect on the GC that makes programs slower overall and definitely makes the code far more error prone.

        The problem with performance is rarely allocation and is far more commonly the cost of copying. If the idiom that you are using to avoid allocation still involves as much copying, then you are unlikely to save anything at all by avoiding an allocation and it may cost you quite a bit since you are making an array live longer than its natural life which can, in the worst situations, even trigger a full GC if the array survives too long.

        For most uses of arrays such as score accumulators, the copying is inherent in the algorithm being used and is not something to be avoided because having the array be collected as a short-lived object is usually the most efficient way to go.

        Mutation and re-use also introduce complexities of storage management that are roughly equivalent to the cognitive load of malloc/free, which, particularly if not associated with any real optimization, should be avoided like the plague.

        One common idiom that used to cause performance issues had to do with gratuitous boxing and unboxing of data in order to package it for passing between different parts of code. This is much less of a problem than it used to be because lots of these uses are inlined and the structure creation is optimized away. You still have to watch for it with collections because of the memory pressure that it creates.

        Grant Ingersoll added a comment -

        I think the tests should at least be committed. I'd also like to make some 1-liner improvements here that are obvious, if tiny, wins.

        Were you referring to the tests already committed? Also, I'd like to see something like the LoadEvaluationRunner in, but an earlier comment suggested not changing the AbstractDiffRecommenderEvaluator. Sean, do you have other suggestions for getting those stats out?

        Sean Owen added a comment -

        I'm referring to TopItemsTest and anything already committed. All of that is of course useful.
        Sketch out what kind of stats you'd like out of this (in another JIRA if you like) and I'll implement it. More data is OK; I just thought I might be able to set up the API in a way that accommodates it all more easily.

        Grant Ingersoll added a comment -

        Once PQ operations become a bottleneck (b/c everything else is fast) we can revisit.


          People

          • Assignee:
            Grant Ingersoll
            Reporter:
            Grant Ingersoll
          • Votes:
            0
            Watchers:
            0
