Lucene - Core
  1. Lucene - Core
  2. LUCENE-3473

CheckIndex should verify numUniqueTerms == recomputedNumUniqueTerms

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 3.4, 4.0-ALPHA
    • Fix Version/s: 3.5, 4.0-ALPHA
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Just glancing at the code it seems to sorta do this check, but only in the hasOrd==true case maybe (which seems to be testing something else)?

      It would be nice to verify this also for terms dicts that dont support ord.

      we should add explicit checks per-field in 4.x, and for-all-fields in 3.x and preflex

      1. LUCENE-3473.patch
        17 kB
        Robert Muir
      2. LUCENE-3473.patch
        17 kB
        Robert Muir
      3. LUCENE-3473.patch
        13 kB
        Robert Muir
      4. LUCENE-3473.patch
        12 kB
        Robert Muir

        Issue Links

          Activity

          Hide
          Uwe Schindler added a comment -

          Bulk close after release of 3.5

          Show
          Uwe Schindler added a comment - Bulk close after release of 3.5
          Hide
          Robert Muir added a comment -

          updated patch, now that LUCENE-3526 is fixed, all tests passed.

          • removed the useless TestBackwardsCompatibility test (i was just debugging)
          • fixed TestRollingUpdates to not combine PreFlexCodec and MemoryCodec in PerFieldCodecWrapper (this is stupid, and causes my assert to trip)
          Show
          Robert Muir added a comment - updated patch, now that LUCENE-3526 is fixed, all tests passed. removed the useless TestBackwardsCompatibility test (i was just debugging) fixed TestRollingUpdates to not combine PreFlexCodec and MemoryCodec in PerFieldCodecWrapper (this is stupid, and causes my assert to trip)
          Hide
          Robert Muir added a comment -

          Uwe yes: i was actually adding this test only for debugging... I'll remove it (it does not give us any additional test coverage)

          Show
          Robert Muir added a comment - Uwe yes: i was actually adding this test only for debugging... I'll remove it (it does not give us any additional test coverage)
          Hide
          Uwe Schindler added a comment -

          Robert: In your patch is an additional test for CheckIndex on the old indexes. This is implicitely already done by: testSearchOldIndex, which calls Testutil's checkindex as first step. So this test is duplicate and slows down, right?

          Show
          Uwe Schindler added a comment - Robert: In your patch is an additional test for CheckIndex on the old indexes. This is implicitely already done by: testSearchOldIndex, which calls Testutil's checkindex as first step. So this test is duplicate and slows down, right?
          Hide
          Robert Muir added a comment -

          I committed the trivial patch to enable this check on 3.x branch... so something else is up with trunk... ill investigate.

          Show
          Robert Muir added a comment - I committed the trivial patch to enable this check on 3.x branch... so something else is up with trunk... ill investigate.
          Hide
          Robert Muir added a comment -

          Hmm, I noticed i left a s.o.p in the previous patch for preflex, but it wasn't being called from CheckIndex.

          This is because we always wrap PreFlex inside PerFieldCodecWrapper... even if its a 3.x index! This is a problem as it still perpetuates the loss of IR.numUniqueTerms.

          So in this patch we no longer do this, which means I'm able to remove the assume as well.

          But, now that preflex is being tested I think I've found an off-by-one with this statistic when the field name is the empty string.

          I'm gonna see if i can make a testcase/issue against 3.x separately for this... because this patch is already too big.

          NOTE: reproduce with: ant test -Dtestcase=TestIndexWriter -Dtestmethod=testEmptyFieldName -Dtests.seed=57fd2807ecfb5a2b:5556d32d3a1f68b7:469f7ed779c63825 -Dtests.codec=PreFlex

          Show
          Robert Muir added a comment - Hmm, I noticed i left a s.o.p in the previous patch for preflex, but it wasn't being called from CheckIndex. This is because we always wrap PreFlex inside PerFieldCodecWrapper... even if its a 3.x index! This is a problem as it still perpetuates the loss of IR.numUniqueTerms. So in this patch we no longer do this, which means I'm able to remove the assume as well. But, now that preflex is being tested I think I've found an off-by-one with this statistic when the field name is the empty string. I'm gonna see if i can make a testcase/issue against 3.x separately for this... because this patch is already too big. NOTE: reproduce with: ant test -Dtestcase=TestIndexWriter -Dtestmethod=testEmptyFieldName -Dtests.seed=57fd2807ecfb5a2b:5556d32d3a1f68b7:469f7ed779c63825 -Dtests.codec=PreFlex
          Hide
          Robert Muir added a comment -

          More fixes:

          • catch(UOE)'s are removed in checkindex because now this stat just returns -1 like other stats when it isnt available.
          • DR.getUniqueTermCount returns -1 like MR now, and I adjusted the test to expect this rather than the exception.
          • Preflex is disabled in the test: only because we currently 'write' preflex indexes actually as PerFieldCodecWrapper... which is bogus but a larger issue. Still, the preflex method is being tested for the older indexes when checkIndex runs. once we fix PreFlexRW to be a "real codec" that writes actual valid 3.x indexes we can remove the assume()
          Show
          Robert Muir added a comment - More fixes: catch(UOE)'s are removed in checkindex because now this stat just returns -1 like other stats when it isnt available. DR.getUniqueTermCount returns -1 like MR now, and I adjusted the test to expect this rather than the exception. Preflex is disabled in the test: only because we currently 'write' preflex indexes actually as PerFieldCodecWrapper... which is bogus but a larger issue. Still, the preflex method is being tested for the older indexes when checkIndex runs. once we fix PreFlexRW to be a "real codec" that writes actual valid 3.x indexes we can remove the assume()
          Hide
          Robert Muir added a comment -

          Patch adding the checks to checkindex.

          there were some problems:

          • IndexReader.getUniqueTermCount doesn't work in trunk, but works fine in 3.x. This is because it sums per-field across the Terms api, but PreFlex codec doesn't know this information per-field
          • If a field has no postings (but exists in fieldinfos), then IR.getUniqueTermCount hits an NPE (ant test-core -Dtestcase=TestNorms -Dtestmethod=testCustomEncoder -Dtests.seed=-6a2248fc7313e45:c41a685f840f6ed:-5a3fd5b8ec315508)
          • MemoryCodec didn't implement Fields.getUniqueTermCount, probably just forgotten because its not abstract (instead throwing UOE by default).

          So, i fixed MemoryCodec to impl Terms.getUniqueTermCount, changed Terms.getUniqueTermCount to be abstract (throw -1 if you cannot implement it), and added Fields.getUniqueTermCount, called by IR.getUniqueTermCount: default implementation sums across fields, but PreFlex overrides so that its IR.getUniqueTermCount works again.

          we might want to deprecate the latter method when 3.x indexes no longer need to be supported, or maybe its just fine as-is (you have to do the summing somewhere).

          Show
          Robert Muir added a comment - Patch adding the checks to checkindex. there were some problems: IndexReader.getUniqueTermCount doesn't work in trunk, but works fine in 3.x. This is because it sums per-field across the Terms api, but PreFlex codec doesn't know this information per-field If a field has no postings (but exists in fieldinfos), then IR.getUniqueTermCount hits an NPE (ant test-core -Dtestcase=TestNorms -Dtestmethod=testCustomEncoder -Dtests.seed=-6a2248fc7313e45:c41a685f840f6ed:-5a3fd5b8ec315508) MemoryCodec didn't implement Fields.getUniqueTermCount, probably just forgotten because its not abstract (instead throwing UOE by default). So, i fixed MemoryCodec to impl Terms.getUniqueTermCount, changed Terms.getUniqueTermCount to be abstract (throw -1 if you cannot implement it), and added Fields.getUniqueTermCount, called by IR.getUniqueTermCount: default implementation sums across fields, but PreFlex overrides so that its IR.getUniqueTermCount works again. we might want to deprecate the latter method when 3.x indexes no longer need to be supported, or maybe its just fine as-is (you have to do the summing somewhere).

            People

            • Assignee:
              Robert Muir
              Reporter:
              Robert Muir
            • Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development