Lucene - Core
  LUCENE-3972

Improve AllGroupsCollector implementations

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: modules/grouping
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      I think that the performance of TermAllGroupsCollector, DVAllGroupsCollector.BR and DVAllGroupsCollector.SortedBR can be improved by using a BytesRefHash to store the groups instead of an ArrayList.
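
      A minimal sketch of the idea, assuming the BytesRefHash API of that era (add(BytesRef) copies the bytes into an internal pool and returns a negative value for an already-seen entry); the class and method names below are illustrative, not taken from the patch:

          import org.apache.lucene.util.BytesRef;
          import org.apache.lucene.util.BytesRefHash;

          class GroupSetSketch {
              private final BytesRefHash groups = new BytesRefHash();

              // 'term' may be a reused scratch instance: BytesRefHash copies its
              // bytes internally, so no per-group BytesRef allocation is needed.
              void collectGroup(BytesRef term) {
                  groups.add(term); // negative return means the group was already seen
              }

              int groupCount() {
                  return groups.size();
              }
          }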

      1. LUCENE-3972.patch (4 kB) – Martijn van Groningen
      2. LUCENE-3972.patch (3 kB) – Martijn van Groningen

        Activity

        Robert Muir added a comment -

        Yes, you are correct. It's currently pretty ugly. But I don't think we should really cache this anywhere
        other than a slow wrapper (it would be wrong to do so)...

        Martijn van Groningen added a comment -

        I think so as well. What would be the best way to use the OrdinalMap? Just create an OrdinalMap from a top-level reader via SlowCompositeReaderWrapper#getSortedSetDocValues()? That seems to be the only place where OrdinalMaps are cached.

        Robert Muir added a comment -

        I think it would be better if we changed these to use OrdinalMap, so that the O(#terms) work happens only once per reopen instead of this rebasing, which is O(#terms) per query.
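
        A conceptual sketch of that once-per-reopen mapping; this is a simplified stand-in for what Lucene's OrdinalMap provides, not its actual API (which has varied between versions):

            // Built once per reopen (the O(#terms) cost), then reused by every query.
            final class GlobalOrdsSketch {
                // globalOrds[segment][segmentOrd] -> index-wide (global) ord
                private final long[][] globalOrds;

                GlobalOrdsSketch(long[][] perSegmentMappings) {
                    this.globalOrds = perSegmentMappings;
                }

                // O(1) per collected doc; no per-query binary search needed.
                long globalOrd(int segment, int segmentOrd) {
                    return globalOrds[segment][segmentOrd];
                }
            }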

        Martijn van Groningen added a comment -

        I noticed that the real cost was in the setNextReader method: there, the collected group values are re-based into ordinals that are valid for the upcoming segment (a binary search for each collected term). But this is what makes the ordinal comparison in the collect method actually work.

        Reducing the number of segments reduces the number of setNextReader invocations and makes this feature faster. If you have only one segment (an optimized index), the rebasing of the group values doesn't occur, and the AllGroupsCollector should be much faster.

        I think a hybrid solution that changes the impl once a predefined number of groups has been found would make the current approach better, but it would never be faster than just having one segment (or global/index-level ordinals).
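
        For reference, a sketch of the per-segment rebasing described above, assuming the Lucene 4.x SortedDocValues/BytesRefHash APIs (lookupTerm performs a binary search over the segment's term dictionary); the class and method are illustrative:

            import org.apache.lucene.index.SortedDocValues;
            import org.apache.lucene.util.BytesRef;
            import org.apache.lucene.util.BytesRefHash;

            class RebaseSketch {
                // On each new segment, translate every collected group value back
                // into that segment's ord space so collect() can keep comparing
                // cheap segment-local ordinals.
                static void rebase(SortedDocValues segmentTermsIndex,
                                   BytesRefHash collectedGroups, int[] segmentOrds) {
                    BytesRef scratch = new BytesRef();
                    for (int i = 0; i < collectedGroups.size(); i++) {
                        collectedGroups.get(i, scratch);
                        // One binary search per collected group, per segment: the
                        // cost that dominates once there are many unique groups.
                        // (Negative result: group absent from this segment.)
                        segmentOrds[i] = segmentTermsIndex.lookupTerm(scratch);
                    }
                }
            }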

        Paul Masurel added a comment -

        (e-commerce Solr user here)

        We hit the very same performance problem with pathological queries that have
        1M+ unique groups, and we need to solve this issue for our business.

        Would a hybrid approach that switches implementation partway through, once
        the number of unique groups detected gets too high, be welcome?

        I also wonder whether the number of segments plays a big role in this.
        Did you observe that in your benchmarking?

        Martijn van Groningen added a comment -

        > If you have fewer unique groups (and as the number of docs collected goes up), I think the current impl should be faster...?

        This is true. I ran a few tests on an index containing 10M docs:

        Run | Num unique groups | Perf. collector in patch | Perf. committed collector
        ----+-------------------+--------------------------+--------------------------
          1 | ~65000            |   892 ms                 |   132 ms
          2 | ~645000           |  1183 ms                 |   878 ms
          3 | ~953000           |  1291 ms                 |  1517 ms
          4 | ~1819000          |  1783 ms                 |  3762 ms
          5 | ~3332000          |  2703 ms                 |  4882 ms
          6 | ~6668000          |  4840 ms                 | 18989 ms

        All the times are the average over 10 executions with a match all query.

        > the time is likely dominated by re-ord'ing for each segment?

        During run 4 I noticed that 3470 ms of the total 3762 ms was spent on re-ord'ing groups for segments.

        It seems that the implementation in the patch is only usable if a search yields many unique groups in its result.

        Michael McCandless added a comment -

        Actually, we are storing term ords here, not docIDs.

        I think the high number of unique groups explains why the new patch is
        faster: the time is likely dominated by re-ord'ing for each segment?

        If you have fewer unique groups (and as the number of docs collected goes up),
        I think the current impl should be faster...?

        Dawid Weiss added a comment -

        Hmmm... it's not collisions then; it was worth a try. I still find the difference puzzling – I can't justify your version being 3x faster. Curious what it might be.

        > But we know a lot about docids, and extra hashing should just lead to an average-case slowdown.

        Ok.

        Yonik Seeley added a comment -

        > I would also patch it for the future in either case.

        Hmmm, if you don't know anything about the integer keys, then a little extra hashing by default is a good idea.
        But we know a lot about docids, and extra hashing should just lead to an average-case slowdown.
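
        A tiny illustration of that point (not from any patch): dense, roughly sequential integer keys already spread evenly over a power-of-two table under identity hashing, so extra bit-mixing only adds work:

            final class DocIdBuckets {
                // Keys 0..size-1 map to distinct buckets: (k & (size - 1)) is a
                // bijection on that range, so collisions are already zero and a
                // mixing step like MurmurHash3's finalizer can only cost cycles.
                static int bucketOf(int docId, int tableSize) { // tableSize: power of two
                    return docId & (tableSize - 1);
                }
            }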

        Martijn van Groningen added a comment -

        I ran a test with the changes to SentinelIntSet and one without. The performance difference between the two tests is minimal: the test with the changes is somewhat slower (15.7 s) than the one without (15.3 s).

        Dawid Weiss added a comment -

        Yes, sorry – hash of course. The hash method that should redistribute the key space into buckets (but currently doesn't).

        As for BytesRefHash vs. BytesRef instances – maybe that's the source of the speedup, who knows. I would try the hash method though, if nothing else then out of curiosity. I would also patch it for the future in either case. Not rehashing input keys is a flaw in my opinion (again – backed by real-life experience from HPPC).

        Martijn van Groningen added a comment -

        The major difference between the committed TermAllGroupsCollector and the one in the patch is that the committed TermAllGroupsCollector creates a BytesRef instance for each unique group and puts it in an ArrayList (5M instances in my case). The version in the patch reuses a single BytesRef instance; its bytes are copied into BytesRefHash's big internal byte array via the BytesRefHash#add(...) method. I think the many BytesRef instances are what make the committed TermAllGroupsCollector slow.
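
        A schematic contrast of the two collection paths described above; the class and method names are illustrative, not taken from the patch:

            import java.util.ArrayList;
            import java.util.List;
            import org.apache.lucene.util.BytesRef;
            import org.apache.lucene.util.BytesRefHash;

            class CollectPaths {
                // Committed approach: one heap-allocated BytesRef per unique group.
                private final List<BytesRef> groupsList = new ArrayList<BytesRef>();

                void collectCommitted(BytesRef term) {
                    // Allocates per group (dedup check via segment ords omitted).
                    groupsList.add(BytesRef.deepCopyOf(term));
                }

                // Patched approach: the caller reuses one scratch BytesRef; the
                // bytes land in the hash's shared pool, so no per-group objects.
                private final BytesRefHash groupsHash = new BytesRefHash();

                void collectPatched(BytesRef term) {
                    groupsHash.add(term); // copies bytes internally; dedupes
                }
            }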

        Martijn van Groningen added a comment -

        Dawid – do you mean the hash method instead of the rehash method in SentinelIntSet? The rehash method doesn't have any parameters.

        Dawid Weiss added a comment - edited

        This is curious indeed. One thing to check would be this: SentinelIntSet uses no key rehashing (rehash simply returns the key). This resulted in very poor performance for certain regular integer sets (my experience from implementing HPPC). So while rehashing may seem like an additional overhead, it actually boosts performance.

        Martijn – could you patch the trunk's SentinelIntSet#rehash with, for example, this (the MurmurHash3 finalizer):

            public static int rehash(int k)
            {
                // MurmurHash3 fmix32 finalization: alternating xor-shifts and
                // multiplications spread nearby keys across distant buckets.
                k ^= k >>> 16;
                k *= 0x85ebca6b;
                k ^= k >>> 13;
                k *= 0xc2b2ae35;
                k ^= k >>> 16;
                return k;
            }

        and retry your test? Btw. I'm not saying it'll be faster.

        Martijn van Groningen added a comment -

        Tested the current patch on an index containing 10M documents with ~5M unique groups. The existing implementation needs ~15.3 seconds to get the total group count; the new implementation in the patch needs ~4.4 seconds.

        Michael McCandless added a comment -

        Curious that it's so much faster ... BytesRefHash operates on the byte[] term while the current approach operates on int ord.

        How large was the index? If it was smallish, maybe the time was dominated by re-ord'ing after each reader...?

        Martijn van Groningen added a comment -

        Attached initial patch. The TermAllGroupsCollector variant in this patch seems to be about two times faster than the one that is already in Subversion.


          People

          • Assignee: Unassigned
          • Reporter: Martijn van Groningen
          • Votes: 0
          • Watchers: 5
