SOLR-6581

Efficient DocValues support and numeric collapse field implementations for Collapse and Expand

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.0, 6.0
    • Component/s: None
    • Labels:
      None

      Description

      The 4x implementations of the CollapsingQParserPlugin and the ExpandComponent are optimized to work with a top-level FieldCache. Top-level FieldCaches have a very fast docID to top-level ordinal lookup. Fast access to the top-level ordinals allows for very high performance field collapsing on high cardinality fields.

      LUCENE-5666 unified the DocValues and FieldCache APIs so that the top-level FieldCache is no longer in regular use. Instead all top-level caches are accessed through MultiDocValues.
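
      To make the distinction concrete, here is a minimal Python sketch (not Lucene code) of the idea behind top-level ordinals: each segment numbers its own sorted terms, and a merged view maps every segment-local ordinal to a global one. The function and variable names are hypothetical and only illustrate the mapping.

```python
def build_global_ordinals(segment_terms):
    """Map each segment's local term ordinals to ordinals in the merged term list.

    segment_terms: list of per-segment term lists, each sorted and duplicate-free.
    Returns (global_terms, ord_maps) where ord_maps[s][local_ord] == global_ord.
    """
    # Merge all segment dictionaries into one sorted, de-duplicated term list.
    global_terms = sorted(set(t for terms in segment_terms for t in terms))
    global_ord = {term: i for i, term in enumerate(global_terms)}
    # For every segment, translate local ordinal -> global ordinal.
    ord_maps = [[global_ord[t] for t in terms] for terms in segment_terms]
    return global_terms, ord_maps


# Two hypothetical segments with overlapping values.
segments = [["apple", "cherry"], ["banana", "cherry", "date"]]
global_terms, ord_maps = build_global_ordinals(segments)
```

      With a merged view like this, a per-document lookup has to find the document's segment and then translate its local ordinal, whereas a single top-level structure can answer the same question in one step. That extra indirection is the kind of cost the ticket is weighing.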

      This ticket does the following:

      1) Optimizes Collapse and Expand to use MultiDocValues and makes this the default approach when collapsing on String fields

      2) Provides an option to use a top level FieldCache if the performance of MultiDocValues is a blocker. The mechanism for switching to the FieldCache is a new "hint" parameter. If the hint parameter is set to "top_fc" then the top-level FieldCache would be used for both Collapse and Expand.

      Example syntax:

      fq={!collapse field=x hint=TOP_FC}
      

      3) Adds numeric collapse field implementations.

      4) Resolves issue SOLR-6066
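
      As an illustration of the hint syntax in item 2, the following Python sketch assembles a select request that collapses on field x with the hint and turns on the ExpandComponent. The host, collection name, and helper function are hypothetical; only the fq local-params syntax comes from this ticket.

```python
from urllib.parse import urlencode


def collapse_filter(field, hint=None):
    """Build a collapse filter query in local-params syntax."""
    parts = ["field=" + field]
    if hint:
        parts.append("hint=" + hint)
    return "{!collapse " + " ".join(parts) + "}"


# Hypothetical host and collection name.
base_url = "http://localhost:8983/solr/mycollection/select"
params = {
    "q": "*:*",
    "fq": collapse_filter("x", hint="TOP_FC"),
    "expand": "true",  # ask the ExpandComponent to return the collapsed groups
}
request_url = base_url + "?" + urlencode(params)
```

      Leaving out the hint argument produces the plain form, {!collapse field=x}, which uses the default MultiDocValues approach described in item 1.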

      1. renames.diff
        17 kB
        David Smiley
      2. SOLR-6581.patch
        125 kB
        Joel Bernstein
      3. SOLR-6581.patch
        124 kB
        Joel Bernstein
      4. SOLR-6581.patch
        122 kB
        Joel Bernstein
      5. SOLR-6581.patch
        122 kB
        Joel Bernstein
      6. SOLR-6581.patch
        122 kB
        Joel Bernstein
      7. SOLR-6581.patch
        118 kB
        Joel Bernstein
      8. SOLR-6581.patch
        117 kB
        Joel Bernstein
      9. SOLR-6581.patch
        102 kB
        Joel Bernstein
      10. SOLR-6581.patch
        92 kB
        Joel Bernstein
      11. SOLR-6581.patch
        87 kB
        Joel Bernstein
      12. SOLR-6581.patch
        84 kB
        Joel Bernstein
      13. SOLR-6581.patch
        59 kB
        Joel Bernstein
      14. SOLR-6581.patch
        54 kB
        Joel Bernstein
      15. SOLR-6581.patch
        9 kB
        Joel Bernstein
      16. SOLR-6581.patch
        4 kB
        Joel Bernstein

          Activity

          Joel Bernstein added a comment - - edited

          The latest patch includes changes for both the CollapsingQParserPlugin and the ExpandComponent. These changes are designed to provide the 5.0 version with the same performance characteristics as the 4.0 versions.

          This is somewhat of an interesting ticket as it involved exploring the tradeoffs of using the new Lucene DocValues class versus the UninvertedReader class which can be used to hook into the FieldCache. I'll update the description of this ticket with the details.

          Joel Bernstein added a comment -

          A patch with the current state of the work being done. Patch includes implementations for top level String FieldCache and String MultiDocValues. The patch also includes initial implementations for collapsing on an integer field.

          David Smiley added a comment -

          While you are improving CollapsingQParserPlugin, I suggest doing some variable renames. In my work I needed something like this code so I forked it to be modified, but found my first task was to do a bunch of renames so that it was clear what variable was for what. The attached patch is a redacted version from my code and includes a tad bit of other stuff to be ignored, but see the change in variable names, and a getter rename or two. As a random example, "docId" is unclear; is this a global doc ID or is it segment local? Likewise for ordinals. Arguably most of Lucene/Solr is guilty of this but this one source file I found hard to penetrate until I did the renames to decipher what's going on.

          Joel Bernstein added a comment -

          Yeah, I see what you mean. I'm knee deep into a large refactoring of this code and I'll work on clarifying variables and naming conventions.

          I'm also adding specific implementations for collapsing on numeric fields which are a little slower at query time, but very real-time indexing friendly. Hope to wrap all this up this week.
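
          One plausible reading of this trade-off: a numeric value can key a hash map directly, so no top-level ordinal structure has to be rebuilt after each commit, at the cost of a hash lookup per document. The Python sketch below assumes that design and is not taken from the patch; all names are hypothetical.

```python
def collapse_numeric(docs):
    """Keep the highest-scoring document for each numeric collapse value.

    docs: iterable of (doc_id, collapse_value, score) tuples.
    Returns the surviving doc ids in ascending order.
    """
    heads = {}  # collapse_value -> (score, doc_id) of the current group head
    for doc_id, value, score in docs:
        best = heads.get(value)
        if best is None or score > best[0]:
            heads[value] = (score, doc_id)
    return sorted(doc_id for _, doc_id in heads.values())


# Hypothetical result set: two groups (42 and 7), two docs each.
results = [(0, 42, 1.0), (1, 42, 2.5), (2, 7, 0.3), (3, 7, 0.1)]
survivors = collapse_numeric(results)
```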

          Joel Bernstein added a comment -

          Latest work, including test cases for collapsing on a numeric field. Not all tests are passing yet.

          Joel Bernstein added a comment -

          Getting much closer. The numeric collapse field tests are now passing and variables have been renamed for clarity.

          Joel Bernstein added a comment -

          Added implementation to the ExpandComponent to handle expanding on numeric fields. Test cases have not yet been updated to include numeric field expansion.

          Joel Bernstein added a comment -

          New patch with working tests for the new ExpandComponent implementations. Test for expanding on Numeric fields is included.

          Joel Bernstein added a comment -

          New patch adding testing for FAST_QUERY "hint" to the ExpandComponent tests.

          Joel Bernstein added a comment -

          Patch with performance improvements for the ExpandComponent. Tests still need to be re-worked to support the performance enhancements.

          Joel Bernstein added a comment -

          Patch with all updated trunk code and all tests passing.

          Joel Bernstein added a comment -

          Added more error handling and removed all debugging/timing code.

          Joel Bernstein added a comment -

          Very close to committing this now. I'll do some more manual testing and if all looks good I plan to commit in the next day or two.

          Joel Bernstein added a comment -

          Patch with a bugfix for collapsing on numeric fields; the bug turned up during manual testing. Test cases have been updated to catch this bug as well.

          Joel Bernstein added a comment -

          Unit tests are passing, manual testing looks good, pre-commit passes.

          Yonik Seeley added a comment -

          FAST_QUERY doesn't give much of an idea of what's going on under the covers, and a more descriptive name would probably be better if more methods/optimizations may be added in the future. Maybe something like "top_fc"? Probably best to stick to lower case too...

          Joel Bernstein added a comment -

          "top_fc" sounds good to me. I'll make the change.

          ASF subversion and git services added a comment -

          Commit 1651087 from Joel Bernstein in branch 'dev/trunk'
          [ https://svn.apache.org/r1651087 ]

          SOLR-6581: Efficient DocValues support and numeric collapse field implementations for Collapse and Expand

          ASF subversion and git services added a comment -

          Commit 1651109 from Joel Bernstein in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1651109 ]

          SOLR-6581: Efficient DocValues support and numeric collapse field implementations for Collapse and Expand

          ASF subversion and git services added a comment -

          Commit 1651685 from Joel Bernstein in branch 'dev/trunk'
          [ https://svn.apache.org/r1651685 ]

          SOLR-6581: Additional test for Collapse and fixed problem with Expand float tests

          ASF subversion and git services added a comment -

          Commit 1651693 from Joel Bernstein in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1651693 ]

          SOLR-6581: Additional test for Collapse and fixed problem with Expand float tests

          ASF subversion and git services added a comment -

          Commit 1651736 from Joel Bernstein in branch 'dev/branches/lucene_solr_5_0'
          [ https://svn.apache.org/r1651736 ]

          SOLR-6581: Additional test for Collapse and fixed problem with Expand float tests

          Joel Bernstein added a comment -

          The hint in the code is still upper case TOP_FC. This was meant to be lower case. I'll open another issue for this and have it accept both cases. 5.0 will go out with the upper case syntax though so I'll update the documentation.

          Anshum Gupta added a comment -

          Bulk close after 5.0 release.

          Dallan Quass added a comment -

          Any ideas how much slower numeric collapse/expand implementation is than string collapse/expand with the top_fc hint? I'm trying to decide if I should re-index my int collapse field as a string. (I don't care about real-time performance.)

          David Smiley added a comment -

          Dallan:

          Any ideas how much slower numeric collapse/expand implementation is than string collapse/expand with the top_fc hint?

          I would guess a numeric impl would beat a string impl every time.

          Hey FYI everyone... this is weird, but having just finished a major Solr 4.10 -> Solr 6.1.0 upgrade, I found that top_fc for the collapse had quite the opposite effect in a test environment with no indexing/commits. top_fc took twice as long as without. Shrug; no clue. So folks, don't go setting this blindly without actually measuring before & after in your own environment.
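
          A before/after measurement of this kind can be as simple as the Python sketch below. It assumes you supply two callables that each issue one query (one with the hint, one without); the helper names are hypothetical, and the median is used only to damp outliers such as cache warm-up.

```python
import statistics
import time


def median_millis(run_query, repeats=5):
    """Return the median wall-clock time in milliseconds of run_query()."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)


def compare(run_with_hint, run_without_hint, repeats=5):
    """Time both query variants and report which one was faster."""
    with_hint = median_millis(run_with_hint, repeats)
    without_hint = median_millis(run_without_hint, repeats)
    return {
        "with_hint_ms": with_hint,
        "without_hint_ms": without_hint,
        "hint_is_faster": with_hint < without_hint,
    }
```

          Run it against a quiet index (no concurrent indexing or commits), since background indexing can distort the numbers.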

          Joel Bernstein added a comment -

          Were you using the sort param or max/min param? All the top_fc performance testing was done with max/min param. The sort param came later and I'm not sure how it performs with top_fc.

          The top_fc param should always be the fastest approach when using min/max. If it's not then something odd is going on.

          Joel Bernstein added a comment -

          David Smiley, can you post the query you were using? And also some specs on the data set and result set: things like how many unique values are in the collapse field and how many documents are in the result set pre-collapse. Thanks!

          Joel Bernstein added a comment -

          I just ran a test with collapse on trunk and validated David Smiley's findings. Collapse at this point is so slow both with and without top_fc that it's really not fit for purpose.

          I'll dig a little to see if I can see what's happening.

          David Smiley added a comment -

          Joel Bernstein It's a single index (shard) of 20.8M docs spanning 180 segments (kinda a lot I know) containing a string parent-document-id field with docValues. The field is always populated and has almost as many distinct values as there are documents – 18.9M. The collapse usage has max=(aBooleanField) and no sort. I do see a field cache (UninvertedReader) entry on this field via looking at the admin screen (says it takes up ~403MB).

          Joel Bernstein added a comment -

          Ok, false alarm. My initial tests were faulty. I thought I had loaded 5,000,000 docs, but I had actually set the job to load 50,000,000 docs. So the test was running while indexing was still in progress.

          After running a proper test I found that things are as expected. I'm seeing queries run almost 3 times faster with the top_fc hint.

          I was running with these simple queries:

          {!collapse field=test_s hint=top_fc}
          
          and
          
          {!collapse field=test_s}
          

          I had an index of 5 million docs and the test_s field had 1.8 million unique values.

          With the top_fc hint the query was taking around 160 millis.
          Without the top_fc hint the query was taking around 440 millis.

          Joel Bernstein added a comment -

          Wow that's a hard case.

          The string collapse is done into a fixed array the size of the unique values in the field. Similar to faceting on a string field. So we're talking about a huge amount of memory for a single query. Still not sure why MultiDocValues would outperform the top level field cache in this scenario.

          In this scenario sharding would be very useful, but you would have to shard on the collapse field.
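
          A rough Python sketch of the array-based approach described above, with a back-of-the-envelope memory estimate. This is an illustration, not the Solr implementation: it assumes two per-query arrays (group-head doc id and score) with 4 bytes per slot, and all names are hypothetical.

```python
def collapse_by_ordinal(doc_ords, scores, num_unique_values):
    """Keep the highest-scoring doc per ordinal using arrays sized to the field's cardinality."""
    head_doc = [-1] * num_unique_values
    head_score = [float("-inf")] * num_unique_values
    for doc_id, (ordinal, score) in enumerate(zip(doc_ords, scores)):
        if score > head_score[ordinal]:
            head_score[ordinal] = score
            head_doc[ordinal] = doc_id
    # Ordinals that no document in the result set hit keep the -1 sentinel.
    return [d for d in head_doc if d != -1]


def array_bytes(num_unique_values, bytes_per_slot=4, arrays=2):
    """Estimate per-query memory for the fixed-size arrays (assumed layout)."""
    return num_unique_values * bytes_per_slot * arrays


# 18.9M distinct values, as in the index described earlier in this thread.
estimate = array_bytes(18_900_000)
```

          Under those assumptions the 18.9M-value field works out to about 151 MB allocated per query, regardless of how many documents actually match.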

          David Smiley added a comment -

          This is just one shard; the actual total collection is sharded beyond the levels most people shard.

          In any case, maybe some time I'll peek into profiling what's going on to see if I can find any insights. But that's low priority as our Solr 6 upgrade without the hint=top_fc is already a huge improvement overall.

          Jeff Wartes added a comment -

          For what it's worth, I recall having a bad experience with that hint in a Solr 5.4 cluster late last year. I never did dig into why though.
          I had a similar case where I was collapsing on a highly distinct field, and as Joel indicates, the memory allocation rate was bad enough I had to give up on the whole thing. Joel and I discussed this a little in SOLR-9125 if you're curious.


            People

            • Assignee: Joel Bernstein
            • Reporter: Joel Bernstein
            • Votes: 0
            • Watchers: 9