Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.4
    • Component/s: None
    • Labels:
      None

      Description

      A DocSet type that can skip to support SOLR-1165

      1. SOLR-1169.patch
        21 kB
        Yonik Seeley

        Issue Links

          Activity

          Yonik Seeley added a comment -

          We need skipping in our DocSets in order to be able to pass them as filters and improve search performance.
          One could sort a HashDocSet every time it's used... but that's not desirable.

          The way Lucene now uses filters, random access performance is no longer important, but being able to efficiently skip is.
          Intersection performance is very important to Solr of course, so if we can get SortedIntDocSet performance fast enough then it would make sense for it to replace HashDocSet.

          I've already started working on this, and the results look promising. I'll post a draft soon.
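
          The skipping described above is in the style of Lucene's DocIdSetIterator.advance(target). A minimal sketch of what skipping over a sorted int array looks like (the class and field names here are illustrative, not Solr's actual API):

          ```java
          // Hypothetical sketch: skipping within a sorted int[] of doc ids.
          public class SortedIntSkipper {
              private final int[] docs; // sorted ascending doc ids
              private int pos = 0;

              public SortedIntSkipper(int[] sortedDocs) { this.docs = sortedDocs; }

              /** Advance to the first doc >= target, or Integer.MAX_VALUE if exhausted. */
              public int advance(int target) {
                  while (pos < docs.length && docs[pos] < target) pos++;
                  return pos < docs.length ? docs[pos] : Integer.MAX_VALUE;
              }
          }
          ```

          A HashDocSet can't support advance() without sorting first, which is exactly the cost this issue avoids.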

          Mike Klaas added a comment -

          Sweet. Intersecting sorted int sets should be faster in the general case. HashSet will of course win when one set is very small, but I expect this to still be pretty fast anyway.

          Yonik Seeley added a comment -

          Draft patch attached. It's not "hooked in" to Solr yet... just the performance tests, which I'm doing now.

          Yonik Seeley added a comment -

          The first results are in, and it's looking good.

          Algorithm:
          For intersectionSize (and intersection), if the sets are near in size, we do a linear scan. A little micro-optimization there got us an extra 13% speedup... just for rearranging the order of comparisons based on the fact that one was more likely to be true than the other (based on the set sizes). When the set sizes differ by a factor of 8 or more, we use a modified binary search. The first modification keeps track of the lower bound rather than resetting to 0... a big win (an obvious optimization... I didn't measure the benefit). The second modification probes closer to the lower bound (rather than using the midpoint) based on the relative size difference of the sets, and leads to a 40% performance increase.
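
          The two strategies above can be sketched roughly as follows. This is an illustrative reconstruction, not the committed code, and it omits the biased-probe refinement (it uses a plain midpoint, keeping only the carried lower bound):

          ```java
          // Illustrative sketch of the hybrid intersection-count strategy.
          public class IntersectSketch {
              /** Count of common elements of two sorted ascending arrays. */
              public static int intersectionSize(int[] a, int[] b) {
                  if (a.length > b.length) return intersectionSize(b, a);
                  // Sets near in size: simple linear merge scan.
                  if (b.length < a.length * 8) {
                      int count = 0, i = 0, j = 0;
                      while (i < a.length && j < b.length) {
                          if (a[i] < b[j]) i++;
                          else if (a[i] > b[j]) j++;
                          else { count++; i++; j++; }
                      }
                      return count;
                  }
                  // Sizes differ by 8x or more: binary-search the big array for each
                  // small element, keeping the lower bound from the previous search
                  // instead of resetting to 0.
                  int count = 0, low = 0;
                  for (int target : a) {
                      int lo = low, hi = b.length - 1;
                      while (lo <= hi) {
                          int mid = (lo + hi) >>> 1;
                          if (b[mid] < target) lo = mid + 1;
                          else if (b[mid] > target) hi = mid - 1;
                          else { count++; lo = mid + 1; break; }
                      }
                      low = lo; // next search starts where this one ended
                  }
                  return count;
              }
          }
          ```

          Carrying the lower bound works because both arrays are sorted: once a small-set element has been located (or placed) in the big set, no later element can land before that position.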

          Performance results of intersectionSize:

          DocSet sizes                Advantage  Percent  Comments
          1-200 x 1-200               Int        53%      random sized DocSets from 1-200 in size intersected with other DocSets of size 1-200
          1-3000 x 1-3000             Int        33%      3000 is the current default upper bound of HashDocSet
          1-3000 x 1-3000             Int        130%     -client instead of -server
          2000-3000 x 2000-3000       Int        60%      only big sets
          20000-30000 x 20000-30000   Int        54%      only bigger sets
          100-200 x 1000-2000         Hash       87%      small sets with big sets
          1-10000 x 20000-30000       Int        74%      smaller sets intersected with BitDocSets
          1-30000 x 1-30000           Int        80%      docsets over maxDoc/64 are BitDocSets (maxDoc=1M)

          So to sum up, only small sets intersected with big sets are slower... but given that big sets intersected with big sets take a majority of the time, we get a nice net win. It gets more dramatic when intersecting a small set with a BitDocSet... these effects are probably due to nicer treatment of the memory subsystem when accessing memory in order. I think these intersections tend to be bound by memory bandwidth.

          The improvements will also allow us to bump up the max size of the "small set" implementation. From a memory consumption point of view, the break-even point is maxDoc/32. When I tested using SortedIntDocSets with maxDoc/64, there was always a net speedup over maxDoc/32 and maxDoc/100, so this seems to be a good balance between performance and saving memory.
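
          The maxDoc/32 break-even point above is simple arithmetic: a bit set costs maxDoc/8 bytes regardless of how many docs it holds, while a sorted int[] of n docs costs 4*n bytes, and the two are equal at n = maxDoc/32. A quick back-of-envelope check (the class name is just for illustration):

          ```java
          // Memory arithmetic behind the maxDoc/32 break-even point.
          public class BreakEven {
              public static long bitSetBytes(long maxDoc) { return maxDoc / 8; }
              public static long intArrayBytes(long numDocs) { return 4 * numDocs; }

              public static void main(String[] args) {
                  long maxDoc = 1_000_000;
                  System.out.println(bitSetBytes(maxDoc));        // 125000
                  System.out.println(intArrayBytes(maxDoc / 32)); // 125000 -> break-even
                  System.out.println(intArrayBytes(maxDoc / 64)); // 62500  -> half the memory
              }
          }
          ```

          Capping the int-array representation at maxDoc/64 thus guarantees it never uses more than half the memory of the equivalent bit set.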

          Memory savings: SortedIntDocSet is more efficient than HashDocSet at storing the same amount of data, and it can be used at larger sizes (relative to maxDoc) before performance decreases (another memory win).

          Other savings: Faster set creation - Lucene currently delivers docs in order, hence no sorting step is needed after collection.

          Other notes: I tried a cool partitioning algorithm that I thought would be superior - take the middle element of the big set and use it to partition the small set. Say you have set sizes of 100 and 10000... you do a single binary search on the small set, and now for all 100 elements you have half the big set size to search. Recurse on the corresponding lower and upper partitions until they are small enough to use a different method such as a modified binary search. This approach worked, but it wasn't able to beat the modified binary search alone once I put in all the optimizations... shrugs
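
          The partitioning idea described above can be sketched like this (an illustrative reconstruction with hypothetical names, using ranges as half-open [lo, hi) intervals):

          ```java
          // Sketch of the recursive partitioning intersection idea.
          import java.util.Arrays;

          public class PartitionIntersect {
              /** Intersection size of small[sLo,sHi) and big[bLo,bHi), both sorted. */
              public static int size(int[] small, int sLo, int sHi,
                                     int[] big, int bLo, int bHi) {
                  if (sLo >= sHi || bLo >= bHi) return 0;
                  if (sHi - sLo <= 4) { // base case: fall back to a linear merge scan
                      int count = 0, i = sLo, j = bLo;
                      while (i < sHi && j < bHi) {
                          if (small[i] < big[j]) i++;
                          else if (small[i] > big[j]) j++;
                          else { count++; i++; j++; }
                      }
                      return count;
                  }
                  // Partition the small set by the big set's middle element,
                  // then recurse on the matching halves of both sets.
                  int bMid = (bLo + bHi) >>> 1;
                  int pivot = big[bMid];
                  int split = Arrays.binarySearch(small, sLo, sHi, pivot);
                  int match = 0;
                  if (split < 0) split = -split - 1; else match = 1;
                  return match
                       + size(small, sLo, split, big, bLo, bMid)
                       + size(small, split + match, sHi, big, bMid + 1, bHi);
              }
          }
          ```

          Each level of recursion halves the big-set range, so every small-set element ends up searching a progressively smaller slice, which is why the approach looked promising on paper.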

          Yonik Seeley added a comment -

          committed.

          Grant Ingersoll added a comment -

          Bulk close for Solr 1.4


            People

            • Assignee:
              Yonik Seeley
              Reporter:
              Yonik Seeley
            • Votes:
              2
              Watchers:
              1
