Lucene - Core
  1. Lucene - Core
  2. LUCENE-4819

move Sorted[Set]DocValuesTermsEnum to codec

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.3, 4.2.1, 6.0
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Currently a user can instantiate a SortedDocValuesTermsEnum(SortedDocValues).

      This is a generic termsenum, implementing all operations by lookupOrd().

      I think instead this should be the default implementation, and we should have e.g. SortedDocValues.termsEnum() that returns it (codec can implement something fancier).

      For example the default codec implements lookupOrd as an FST binary search, which means next() on this termsenum is much slower than it needs to be for the places where this enum is actually used (segment merging, OrdinalMap used for faceting in SOLR-4490 and LUCENE-4795)

      So instead, it can override this method and use an FSTEnum, and these operations are significantly faster (3x faster for me with a simple benchmark with 10M terms).

      1. LUCENE-4819.patch
        24 kB
        Robert Muir
      2. LUCENE-4819.patch
        16 kB
        Robert Muir

        Activity

        Hide
        Robert Muir added a comment -

        patch. there are some things I don't like about it though.

        Show
        Robert Muir added a comment - patch. there are some things I don't like about it though.
        Hide
        Michael McCandless added a comment -

        +1

        Show
        Michael McCandless added a comment - +1
        Hide
        David Smiley added a comment -

        +1

        Show
        David Smiley added a comment - +1
        Hide
        Commit Tag Bot added a comment -

        [trunk commit] Robert Muir
        http://svn.apache.org/viewvc?view=revision&revision=1454968

        LUCENE-4819: move Sorted(set)DocValuesTermsEnum to codec

        Show
        Commit Tag Bot added a comment - [trunk commit] Robert Muir http://svn.apache.org/viewvc?view=revision&revision=1454968 LUCENE-4819 : move Sorted(set)DocValuesTermsEnum to codec
        Hide
        Commit Tag Bot added a comment -

        [branch_4x commit] Robert Muir
        http://svn.apache.org/viewvc?view=revision&revision=1454969

        LUCENE-4819: move Sorted(set)DocValuesTermsEnum to codec

        Show
        Commit Tag Bot added a comment - [branch_4x commit] Robert Muir http://svn.apache.org/viewvc?view=revision&revision=1454969 LUCENE-4819 : move Sorted(set)DocValuesTermsEnum to codec
        Hide
        Robert Muir added a comment -

        This caused solr test failures.

        I've reverted for now. I'll figure out whats wrong (and why lucene tests didnt catch it). But for now I'm trying to release.

        Show
        Robert Muir added a comment - This caused solr test failures. I've reverted for now. I'll figure out whats wrong (and why lucene tests didnt catch it). But for now I'm trying to release.
        Hide
        Commit Tag Bot added a comment -
        Show
        Commit Tag Bot added a comment - [branch_4x commit] Robert Muir http://svn.apache.org/viewvc?view=revision&revision=1454995 LUCENE-4819 : revert
        Hide
        Commit Tag Bot added a comment -
        Show
        Commit Tag Bot added a comment - [trunk commit] Robert Muir http://svn.apache.org/viewvc?view=revision&revision=1454996 LUCENE-4819 : revert
        Hide
        Robert Muir added a comment -

        I see my bug:

            @Override
            public SeekStatus seekCeil(BytesRef text, boolean useCache) throws IOException {
              if (in.seekCeil(text) == null) {
                return SeekStatus.NOT_FOUND; /* <-- bug: This should be SeekStatus.END */
              } else if (term().equals(text)) {
                // TODO: add SeekStatus to FSTEnum like in https://issues.apache.org/jira/browse/LUCENE-3729
                // to remove this comparision?
                return SeekStatus.FOUND;
              } else {
                return SeekStatus.NOT_FOUND;
              }
            }
        
        Show
        Robert Muir added a comment - I see my bug: @Override public SeekStatus seekCeil(BytesRef text, boolean useCache) throws IOException { if (in.seekCeil(text) == null ) { return SeekStatus.NOT_FOUND; /* <-- bug: This should be SeekStatus.END */ } else if (term().equals(text)) { // TODO: add SeekStatus to FSTEnum like in https://issues.apache.org/jira/browse/LUCENE-3729 // to remove this comparision? return SeekStatus.FOUND; } else { return SeekStatus.NOT_FOUND; } }
        Hide
        Robert Muir added a comment -

        Updated patch with the bugfix. Additionally i added some simple tests so hopefully no codec makes this mistake (debugging the random test would not be fun).

        This found another bug, with the existing default implementation's seekExact(BytesRef). Fortunately nothing is yet using this method.

        I'll beast for a while and then commit.

        Show
        Robert Muir added a comment - Updated patch with the bugfix. Additionally i added some simple tests so hopefully no codec makes this mistake (debugging the random test would not be fun). This found another bug, with the existing default implementation's seekExact(BytesRef). Fortunately nothing is yet using this method. I'll beast for a while and then commit.
        Hide
        Commit Tag Bot added a comment -

        [branch_4x commit] Robert Muir
        http://svn.apache.org/viewvc?view=revision&revision=1455394

        LUCENE-4819: move Sorted(set)DocValuesTermsEnum to codec

        Show
        Commit Tag Bot added a comment - [branch_4x commit] Robert Muir http://svn.apache.org/viewvc?view=revision&revision=1455394 LUCENE-4819 : move Sorted(set)DocValuesTermsEnum to codec
        Hide
        Commit Tag Bot added a comment -

        [trunk commit] Robert Muir
        http://svn.apache.org/viewvc?view=revision&revision=1455391

        LUCENE-4819: move Sorted(set)DocValuesTermsEnum to codec

        Show
        Commit Tag Bot added a comment - [trunk commit] Robert Muir http://svn.apache.org/viewvc?view=revision&revision=1455391 LUCENE-4819 : move Sorted(set)DocValuesTermsEnum to codec
        Hide
        Robert Muir added a comment -

        I can backport this to 4.2.1 if we want, or we can just wait for 4.3

        The thing i hate about it: the stupid binary-search next() in the default codec
        makes for long reopen times (if you use e.g. DV faceting in solr) and slower merging.

        But things are working correctly without it, so I can go either way.

        Show
        Robert Muir added a comment - I can backport this to 4.2.1 if we want, or we can just wait for 4.3 The thing i hate about it: the stupid binary-search next() in the default codec makes for long reopen times (if you use e.g. DV faceting in solr) and slower merging. But things are working correctly without it, so I can go either way.
        Hide
        David Smiley added a comment -

        I vote for inclusion Rob. I lamented it's absence in 4.2.

        Show
        David Smiley added a comment - I vote for inclusion Rob. I lamented it's absence in 4.2.
        Hide
        Mark Miller added a comment -

        +1 for 4.2.1

        Show
        Mark Miller added a comment - +1 for 4.2.1
        Hide
        Uwe Schindler added a comment -

        Closed after release.

        Show
        Uwe Schindler added a comment - Closed after release.

          People

          • Assignee:
            Unassigned
            Reporter:
            Robert Muir
          • Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development