Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.1, 6.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      This is fairly slow today, very ram intensive, and has some buggy stuff (e.g. postingsenum reuse bugs). We can do better...

      1. LUCENE-6320.patch
        42 kB
        Robert Muir
      2. LUCENE-6320.patch
        45 kB
        Robert Muir

        Activity

        Hide
        Robert Muir added a comment -

        Here is a patch. We use codec apis to do these checks, so the optimizations we already worked on for merge help a lot (esp. stored fields, norms, docvalues).

        When we check postings without deletes, we weren't reusing postingsenum and were clone()'ing for every term.

        FieldInfos.get(int) is a cpu hog for stored fields and vectors, since its called for every field in the doc and we do O(log N) lookup each time. Its wasteful in memory usually too (using a treemap always when in most cases a simple array is smaller and faster).

        Show
        Robert Muir added a comment - Here is a patch. We use codec apis to do these checks, so the optimizations we already worked on for merge help a lot (esp. stored fields, norms, docvalues). When we check postings without deletes, we weren't reusing postingsenum and were clone()'ing for every term. FieldInfos.get(int) is a cpu hog for stored fields and vectors, since its called for every field in the doc and we do O(log N) lookup each time. Its wasteful in memory usually too (using a treemap always when in most cases a simple array is smaller and faster).
        Hide
        Robert Muir added a comment -

        I also thought about moving the integrity check into the different pieces just like merge does. We can still do that, but I think its less important at the moment. this is big enough for chew on for now.

        Show
        Robert Muir added a comment - I also thought about moving the integrity check into the different pieces just like merge does. We can still do that, but I think its less important at the moment. this is big enough for chew on for now.
        Hide
        Adrien Grand added a comment -

        This looks good to me. Do we already have tests that exercise lookup by number on sparse field infos?

        Show
        Adrien Grand added a comment - This looks good to me. Do we already have tests that exercise lookup by number on sparse field infos?
        Hide
        Robert Muir added a comment -

        we have lots of field number tests like that, otherwise we have a bigger problem

        retrieving stored fields or vectors is enough to do lookup by number (this is why i made teh change)

        Show
        Robert Muir added a comment - we have lots of field number tests like that, otherwise we have a bigger problem retrieving stored fields or vectors is enough to do lookup by number (this is why i made teh change)
        Hide
        Robert Muir added a comment -

        We can split FieldInfos change out as a sub-task for LUCENE-6199 and just fix CI to not be stupid?

        The reason i did it here was for performance reasons, it does impact checkindex (and probably merging time at least in some cases) significantly even if the index has only 10 fields. This was like a 10-15% improvement, but the other changes here to CI are more important.

        Show
        Robert Muir added a comment - We can split FieldInfos change out as a sub-task for LUCENE-6199 and just fix CI to not be stupid? The reason i did it here was for performance reasons, it does impact checkindex (and probably merging time at least in some cases) significantly even if the index has only 10 fields. This was like a 10-15% improvement, but the other changes here to CI are more important.
        Hide
        Adrien Grand added a comment -

        We can split FieldInfos change out as a sub-task for LUCENE-6199 and just fix CI to not be stupid?

        +1

        Show
        Adrien Grand added a comment - We can split FieldInfos change out as a sub-task for LUCENE-6199 and just fix CI to not be stupid? +1
        Hide
        Robert Muir added a comment -

        Here is a patch without the FIS change (I just reverted that file). I'll open another issue for that.

        Show
        Robert Muir added a comment - Here is a patch without the FIS change (I just reverted that file). I'll open another issue for that.
        Hide
        Adrien Grand added a comment -

        +1

        Show
        Adrien Grand added a comment - +1
        Hide
        Michael McCandless added a comment -

        +1

        Show
        Michael McCandless added a comment - +1
        Hide
        ASF subversion and git services added a comment -

        Commit 1663505 from Robert Muir in branch 'dev/trunk'
        [ https://svn.apache.org/r1663505 ]

        LUCENE-6320: speed up checkindex

        Show
        ASF subversion and git services added a comment - Commit 1663505 from Robert Muir in branch 'dev/trunk' [ https://svn.apache.org/r1663505 ] LUCENE-6320 : speed up checkindex
        Hide
        ASF subversion and git services added a comment -

        Commit 1663510 from Robert Muir in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1663510 ]

        LUCENE-6320: speed up checkindex

        Show
        ASF subversion and git services added a comment - Commit 1663510 from Robert Muir in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1663510 ] LUCENE-6320 : speed up checkindex
        Hide
        Michael McCandless added a comment -

        This was a nice speedup (~31%) in the CheckIndex time in the nightly benchmarks: https://people.apache.org/~mikemccand/lucenebench/checkIndexTime.html

        See annotation BC.

        Should be even faster for smaller docs...

        Show
        Michael McCandless added a comment - This was a nice speedup (~31%) in the CheckIndex time in the nightly benchmarks: https://people.apache.org/~mikemccand/lucenebench/checkIndexTime.html See annotation BC. Should be even faster for smaller docs...
        Hide
        Timothy Potter added a comment -

        Bulk close after 5.1 release

        Show
        Timothy Potter added a comment - Bulk close after 5.1 release

          People

          • Assignee:
            Unassigned
            Reporter:
            Robert Muir
          • Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development