Lucene - Core / LUCENE-6233

CheckIndex is dog slow when checking term vectors

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.1, 6.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      I'm working on a test that creates a biggish index and I noticed that CheckIndex takes a surprisingly long time to check term vectors.

      I profiled it and found that we spend a lot of time (not sure this explains all of it) in Terms.getMin/getMax. Since CompressingTermVectorsReader doesn't implement these methods efficiently (which is fine), we fall back to the superclass's implementation, which does a digit-by-digit binary search using seekCeil.

      But for term vectors (TVs) this sometimes results in a linear scan.

      I think CheckIndex should not check Terms.getMin/Max for TVs?
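
      To illustrate why this fallback is costly, here is a rough, self-contained model of a digit-by-digit binary search for the max term. It is a sketch, not Lucene's actual code: the class `MaxTermSketch`, the counter `termsScanned`, and the list-based `seekCeil` are hypothetical stand-ins, and `seekCeil` is deliberately written as a linear scan to mimic a terms enum that has no term index to seek with.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class MaxTermSketch {
    /** Counts how many terms the linear-scan seekCeil has stepped over. */
    public static long termsScanned = 0;

    /**
     * Hypothetical stand-in for a seekCeil without a term index: scans the
     * sorted terms from the start and returns the first term >= target, or
     * null when every term sorts before the target (the "END" case).
     */
    public static String seekCeil(List<String> sortedTerms, String target) {
        for (String term : sortedTerms) {
            termsScanned++;
            if (term.compareTo(target) >= 0) {
                return term;
            }
        }
        return null;
    }

    /** Finds the largest term by binary-searching one digit (char 0..255) at a time. */
    public static String getMax(List<String> sortedTerms) {
        if (sortedTerms.isEmpty()) {
            return null;
        }
        StringBuilder scratch = new StringBuilder("\0");
        while (true) {
            int low = 0;
            int high = 256;
            // Binary search the current digit for the highest value that still has a term at or after it.
            while (low != high) {
                int mid = (low + high) >>> 1;
                scratch.setCharAt(scratch.length() - 1, (char) mid);
                if (seekCeil(sortedTerms, scratch.toString()) == null) {
                    // scratch sorts after every term
                    if (mid == 0) {
                        // Even the smallest digit is too high, so the prefix is the max term.
                        scratch.setLength(scratch.length() - 1);
                        return scratch.toString();
                    }
                    high = mid;
                } else {
                    // At least one term is still at or after scratch.
                    if (low == mid) {
                        break;
                    }
                    low = mid;
                }
            }
            // Move on to the next digit.
            scratch.append('\0');
        }
    }

    public static void main(String[] args) {
        List<String> terms = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            terms.add(String.format(Locale.ROOT, "term%04d", i));
        }
        System.out.println("max term = " + getMax(terms));
        System.out.println("terms stepped over by seekCeil = " + termsScanned);
    }
}
```

      Each digit of the answer costs several seekCeil calls, so when every call scans from the start of the term list, the total work grows with both the term length and the number of terms.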

      1. LUCENE-6223.patch
        33 kB
        Robert Muir
      2. LUCENE-6233.patch
        14 kB
        Robert Muir
      3. LUCENE-6233.patch
        13 kB
        Michael McCandless

        Activity

        Michael McCandless added a comment -

        This was introduced with LUCENE-5610.

        I'll fix the nightly Lucene benchmark to plot CheckIndex time ... with that in place we could have spotted this performance regression.

        Robert Muir added a comment -

        "I think CheckIndex should not check Terms.getMin/Max for TVs?"

        +1

        Michael McCandless added a comment -

        Patch.

        I disabled Terms.getMin/Max checking for TVs, fixed the "test with the
        one doc deleted" to only run on the first doc, and only test 1
        "advance" doc.

        I also added time taken to each part we test, e.g.:

          1 of 24: name=_1b docCount=10309
            version=6.0.0
            id=cd308kthf553d7dl049vw982u
            codec=Asserting(Lucene50)
            compound=true
            numFiles=3
            size (MB)=30.358
            diagnostics = {os=Linux, java.vendor=Oracle Corporation, java.version=1.8.0_25, lucene.version=6.0.0, mergeMaxNumSegments=-1, os.arch=amd64, source=merge, mergeFactor=3, os.version=3.13.0-37-generic, timestamp=1423588030806}
            no deletions
            test: open reader.........OK
            test: check integrity.....OK
            test: check live docs.....OK [took 0.000 sec]
            test: field infos.........OK [8 fields] [took 0.000 sec]
            test: field norms.........OK [2 fields] [took 0.005 sec]
            test: terms, freq, prox...OK [381010 terms; 1154763 terms/docs paris; 1883324 tokens] [took 0.550 sec]
            test: stored fields.......OK [41236 total field count; avg 4.0 fields per doc] [took 0.323 sec]
            test: term vectors........OK [20617 total term vector count; avg 2.0 term/freq vector fields per doc] [took 1.257 sec]
            test: docvalues...........OK [2 docvalues fields; 0 BINARY; 1 NUMERIC; 1 SORTED; 0 SORTED_NUMERIC; 0 SORTED_SET] [took 0.020 sec]
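
        The per-step "[took N sec]" suffix shown above can be produced with a small timing wrapper. The sketch below is only an illustration, assuming a nanosecond clock and three decimal places. The patch's real code is not shown in this thread, and the names `StepTimerSketch`, `tookSuffix`, and `runStep` are hypothetical.

```java
import java.util.Locale;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

public class StepTimerSketch {
    /** Formats elapsed nanoseconds as a " [took 0.550 sec]" style suffix. */
    public static String tookSuffix(long elapsedNanos) {
        double seconds = elapsedNanos / (double) TimeUnit.SECONDS.toNanos(1);
        return String.format(Locale.ROOT, " [took %.3f sec]", seconds);
    }

    /** Runs one check and returns a report line with its detail text and elapsed time. */
    public static String runStep(String label, Supplier<String> check) {
        long start = System.nanoTime();
        String detail = check.get(); // the check reports its own summary, e.g. "[8 fields]"
        long elapsed = System.nanoTime() - start;
        return "test: " + label + "OK " + detail + tookSuffix(elapsed);
    }

    public static void main(String[] args) {
        // A stand-in check that does no real work and just returns a summary string.
        System.out.println(runStep("field infos.........", () -> "[8 fields]"));
    }
}
```

        Using System.nanoTime rather than wall-clock time keeps the measurement monotonic, and Locale.ROOT keeps the decimal separator stable across machines.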
        

        Term vectors checking is still slowish, but at least it's faster: on
        my smallish test index the total CheckIndex time improves from 33.6
        seconds to 12.5 seconds.

        I also plotted the time to CheckIndex in the nightly benchmark: https://people.apache.org/~mikemccand/lucenebench/checkIndexTime.html

        However that index doesn't have term vectors so this issue shouldn't
        affect it ...

        Robert Muir added a comment -

        I added two more timings to the patch. Here is the output on one of my wiki10m segments:

            size (MB)=624.591
            diagnostics = {os=Linux, java.vendor=Oracle Corporation, java.version=1.8.0_40-ea, lucene.version=6.0.0, mergeMaxNumSegments=-1, os.arch=amd64, source=merge, mergeFactor=10, os.version=3.13.0-43-generic, timestamp=1423097209630}
            has deletions [delGen=6]
            test: open reader.........OK [took 0.075 sec]
            test: check integrity.....OK [took 1.515 sec]
            test: check live docs.....OK [90031 deleted docs]
            test: field infos.........OK [8 fields] [took 0.000 sec]
            test: field norms.........OK [2 fields] [took 0.046 sec]
            test: terms, freq, prox...OK [6844227 terms; 170452948 terms/docs pairs; 240913350 tokens] [took 13.171 sec]
            test (ignoring deletes): terms, freq, prox...OK [7105194 terms; 179422787 terms/docs pairs; 253586353 tokens] [took 9.632 sec]
            test: stored fields.......OK [5135307 total field count; avg 3.0 fields per doc] [took 4.648 sec]
            test: term vectors........OK [0 total term vector count; avg 0.0 term/freq vector fields per doc] [took 0.036 sec]
            test: docvalues...........OK [2 docvalues fields; 0 BINARY; 1 NUMERIC; 1 SORTED; 0 SORTED_NUMERIC; 0 SORTED_SET] [took 0.206 sec]
        

        Maybe CheckIndex should have an integrity-check-only option as a followup. It would just be sugar for the user, but it would always be pretty fast.

        Robert Muir added a comment -

        Sorry, here is the correct patch.

        Michael McCandless added a comment -

        OK, I noticed one case where live docs didn't confess how long it took.

        I'll fix that and commit.

        ASF subversion and git services added a comment -

        Commit 1658831 from Michael McCandless in branch 'dev/trunk'
        [ https://svn.apache.org/r1658831 ]

        LUCENE-6233: speed up CheckIndex when the index has term vectors

        ASF subversion and git services added a comment -

        Commit 1658832 from Michael McCandless in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1658832 ]

        LUCENE-6233: speed up CheckIndex when the index has term vectors

        Timothy Potter added a comment -

        Bulk close after 5.1 release


          People

          • Assignee:
            Michael McCandless
            Reporter:
            Michael McCandless
          • Votes:
            0
            Watchers:
            3

            Dates

            • Created:
              Updated:
              Resolved:
