Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.1, 6.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      This is fairly slow today, very ram intensive, and has some buggy stuff (e.g. postingsenum reuse bugs). We can do better...

      1. LUCENE-6320.patch
        45 kB
        Robert Muir
      2. LUCENE-6320.patch
        42 kB
        Robert Muir

        Activity

        Hide
        rcmuir Robert Muir added a comment -

        Here is a patch. We use codec apis to do these checks, so the optimizations we already worked on for merge help a lot (esp. stored fields, norms, docvalues).

        When we check postings without deletes, we weren't reusing postingsenum and were clone()'ing for every term.

        FieldInfos.get(int) is a cpu hog for stored fields and vectors, since its called for every field in the doc and we do O(log N) lookup each time. Its wasteful in memory usually too (using a treemap always when in most cases a simple array is smaller and faster).

        Show
        rcmuir Robert Muir added a comment - Here is a patch. We use codec apis to do these checks, so the optimizations we already worked on for merge help a lot (esp. stored fields, norms, docvalues). When we check postings without deletes, we weren't reusing postingsenum and were clone()'ing for every term. FieldInfos.get(int) is a cpu hog for stored fields and vectors, since its called for every field in the doc and we do O(log N) lookup each time. Its wasteful in memory usually too (using a treemap always when in most cases a simple array is smaller and faster).
        Hide
        rcmuir Robert Muir added a comment -

        I also thought about moving the integrity check into the different pieces just like merge does. We can still do that, but I think its less important at the moment. this is big enough for chew on for now.

        Show
        rcmuir Robert Muir added a comment - I also thought about moving the integrity check into the different pieces just like merge does. We can still do that, but I think its less important at the moment. this is big enough for chew on for now.
        Hide
        jpountz Adrien Grand added a comment -

        This looks good to me. Do we already have tests that exercise lookup by number on sparse field infos?

        Show
        jpountz Adrien Grand added a comment - This looks good to me. Do we already have tests that exercise lookup by number on sparse field infos?
        Hide
        rcmuir Robert Muir added a comment -

        we have lots of field number tests like that, otherwise we have a bigger problem

        retrieving stored fields or vectors is enough to do lookup by number (this is why i made teh change)

        Show
        rcmuir Robert Muir added a comment - we have lots of field number tests like that, otherwise we have a bigger problem retrieving stored fields or vectors is enough to do lookup by number (this is why i made teh change)
        Hide
        rcmuir Robert Muir added a comment -

        We can split FieldInfos change out as a sub-task for LUCENE-6199 and just fix CI to not be stupid?

        The reason i did it here was for performance reasons, it does impact checkindex (and probably merging time at least in some cases) significantly even if the index has only 10 fields. This was like a 10-15% improvement, but the other changes here to CI are more important.

        Show
        rcmuir Robert Muir added a comment - We can split FieldInfos change out as a sub-task for LUCENE-6199 and just fix CI to not be stupid? The reason i did it here was for performance reasons, it does impact checkindex (and probably merging time at least in some cases) significantly even if the index has only 10 fields. This was like a 10-15% improvement, but the other changes here to CI are more important.
        Hide
        jpountz Adrien Grand added a comment -

        We can split FieldInfos change out as a sub-task for LUCENE-6199 and just fix CI to not be stupid?

        +1

        Show
        jpountz Adrien Grand added a comment - We can split FieldInfos change out as a sub-task for LUCENE-6199 and just fix CI to not be stupid? +1
        Hide
        rcmuir Robert Muir added a comment -

        Here is a patch without the FIS change (I just reverted that file). I'll open another issue for that.

        Show
        rcmuir Robert Muir added a comment - Here is a patch without the FIS change (I just reverted that file). I'll open another issue for that.
        Hide
        jpountz Adrien Grand added a comment -

        +1

        Show
        jpountz Adrien Grand added a comment - +1
        Hide
        mikemccand Michael McCandless added a comment -

        +1

        Show
        mikemccand Michael McCandless added a comment - +1
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit 1663505 from Robert Muir in branch 'dev/trunk'
        [ https://svn.apache.org/r1663505 ]

        LUCENE-6320: speed up checkindex

        Show
        jira-bot ASF subversion and git services added a comment - Commit 1663505 from Robert Muir in branch 'dev/trunk' [ https://svn.apache.org/r1663505 ] LUCENE-6320 : speed up checkindex
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit 1663510 from Robert Muir in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1663510 ]

        LUCENE-6320: speed up checkindex

        Show
        jira-bot ASF subversion and git services added a comment - Commit 1663510 from Robert Muir in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1663510 ] LUCENE-6320 : speed up checkindex
        Hide
        mikemccand Michael McCandless added a comment -

        This was a nice speedup (~31%) in the CheckIndex time in the nightly benchmarks: https://people.apache.org/~mikemccand/lucenebench/checkIndexTime.html

        See annotation BC.

        Should be even faster for smaller docs...

        Show
        mikemccand Michael McCandless added a comment - This was a nice speedup (~31%) in the CheckIndex time in the nightly benchmarks: https://people.apache.org/~mikemccand/lucenebench/checkIndexTime.html See annotation BC. Should be even faster for smaller docs...
        Hide
        thelabdude Timothy Potter added a comment -

        Bulk close after 5.1 release

        Show
        thelabdude Timothy Potter added a comment - Bulk close after 5.1 release

          People

          • Assignee:
            Unassigned
            Reporter:
            rcmuir Robert Muir
          • Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development