Details

    • Type: Sub-task Sub-task
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.3, 6.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      FieldInfos.fieldInfo(int) looks up a field by number and returns its FieldInfo.
      This method is called per-field-per-doc in things like stored fields and vectors readers.

      Unfortunately, today this method is always backed by a TreeMap. In most cases a simple array is better, its faster and uses less memory.

      These changes made significant difference in stored fields checkindex time with my test index (had only 10 fields). Maybe it helps merge as well.

      1. LUCENE-6325.patch
        3 kB
        Michael McCandless
      2. LUCENE-6325.patch
        3 kB
        Robert Muir

        Activity

        Hide
        Robert Muir added a comment -

        Here was my first patch (moved from LUCENE-6320).

        Maybe we can add some fieldinfos tests to exercise this directly, and maybe there is a way to simplify the scary logic too.

        Show
        Robert Muir added a comment - Here was my first patch (moved from LUCENE-6320 ). Maybe we can add some fieldinfos tests to exercise this directly, and maybe there is a way to simplify the scary logic too.
        Hide
        Robert Muir added a comment -

        btw, as far as whether the "sparse" case is tested, i added "assert false" to this and ran core tests (no @Slow, no @Nightly):

           [junit4] Tests with failures (first 10 out of 37):
           [junit4]   - org.apache.lucene.index.TestStressIndexing2.testRandom
           [junit4]   - org.apache.lucene.index.TestStressIndexing2.testRandomIWReader
           [junit4]   - org.apache.lucene.index.TestStressIndexing2.testMultiConfig
           [junit4]   - org.apache.lucene.codecs.asserting.TestAssertingStoredFieldsFormat.testMerge
           [junit4]   - org.apache.lucene.search.TestSearcherManager.testSearcherManager
           [junit4]   - org.apache.lucene.index.TestBinaryDocValuesUpdates.testDeleteUnusedUpdatesFiles
           [junit4]   - org.apache.lucene.index.TestBinaryDocValuesUpdates.testManyReopensAndFields
           [junit4]   - org.apache.lucene.index.TestBinaryDocValuesUpdates.testSegmentMerges
           [junit4]   - org.apache.lucene.index.TestBinaryDocValuesUpdates.testMultipleDocValuesTypes
           [junit4]   - org.apache.lucene.index.TestBinaryDocValuesUpdates.testUpdatesOrder
        

        So I think its tested (there are several tests cranking fields like TestManyFields). But if we can do it better, I am all for that.

        Show
        Robert Muir added a comment - btw, as far as whether the "sparse" case is tested, i added "assert false" to this and ran core tests (no @Slow, no @Nightly): [junit4] Tests with failures (first 10 out of 37): [junit4] - org.apache.lucene.index.TestStressIndexing2.testRandom [junit4] - org.apache.lucene.index.TestStressIndexing2.testRandomIWReader [junit4] - org.apache.lucene.index.TestStressIndexing2.testMultiConfig [junit4] - org.apache.lucene.codecs.asserting.TestAssertingStoredFieldsFormat.testMerge [junit4] - org.apache.lucene.search.TestSearcherManager.testSearcherManager [junit4] - org.apache.lucene.index.TestBinaryDocValuesUpdates.testDeleteUnusedUpdatesFiles [junit4] - org.apache.lucene.index.TestBinaryDocValuesUpdates.testManyReopensAndFields [junit4] - org.apache.lucene.index.TestBinaryDocValuesUpdates.testSegmentMerges [junit4] - org.apache.lucene.index.TestBinaryDocValuesUpdates.testMultipleDocValuesTypes [junit4] - org.apache.lucene.index.TestBinaryDocValuesUpdates.testUpdatesOrder So I think its tested (there are several tests cranking fields like TestManyFields). But if we can do it better, I am all for that.
        Hide
        Michael McCandless added a comment -

        +1, this looks like an easy, possibly high-impact win.

        I think we can be more aggressive about using the array: TreeMap has much more than 50% overhead, since it also needs Integer key and pointer to that key, and object overhead holding that key pointer and value pointer. I think we can safely do this opto when it's > 10% of the space?

        Show
        Michael McCandless added a comment - +1, this looks like an easy, possibly high-impact win. I think we can be more aggressive about using the array: TreeMap has much more than 50% overhead, since it also needs Integer key and pointer to that key, and object overhead holding that key pointer and value pointer. I think we can safely do this opto when it's > 10% of the space?
        Hide
        Robert Muir added a comment -

        Mike, if you have time, can you do calculations and adjust the patch? You can take the issue too, i had basically given up on this.

        Show
        Robert Muir added a comment - Mike, if you have time, can you do calculations and adjust the patch? You can take the issue too, i had basically given up on this.
        Hide
        Michael McCandless added a comment -

        OK I can try to update this & commit...

        Show
        Michael McCandless added a comment - OK I can try to update this & commit...
        Hide
        Michael McCandless added a comment -

        New patch, using dense array when > 1/16th of the numbers are used:

        Each TreeMap$Entry has object header (8 or 16 bytes), 5 pointers (4 or
        8 bytes), and a boolean (likely rounded up to 4 bytes), times 2 for
        all the inner nodes of the tree, plus the overhead of Integer (object
        header, int), so net/net each entry in the TreeMap costs 68 - 124 bytes.

        The array is 4 or 8 bytes per int.

        Show
        Michael McCandless added a comment - New patch, using dense array when > 1/16th of the numbers are used: Each TreeMap$Entry has object header (8 or 16 bytes), 5 pointers (4 or 8 bytes), and a boolean (likely rounded up to 4 bytes), times 2 for all the inner nodes of the tree, plus the overhead of Integer (object header, int), so net/net each entry in the TreeMap costs 68 - 124 bytes. The array is 4 or 8 bytes per int.
        Hide
        Michael McCandless added a comment -

        I think the patch is ready...

        Show
        Michael McCandless added a comment - I think the patch is ready...
        Hide
        Robert Muir added a comment -

        Thanks, can you change 16 to 16L? I don't want to think about overflowing.

        Show
        Robert Muir added a comment - Thanks, can you change 16 to 16L? I don't want to think about overflowing.
        Hide
        Michael McCandless added a comment -

        Thanks, can you change 16 to 16L?

        Oh, good catch! I'll fix ...

        Show
        Michael McCandless added a comment - Thanks, can you change 16 to 16L? Oh, good catch! I'll fix ...
        Hide
        Robert Muir added a comment -

        +1 to commit with that modification, thanks for doing the calculations here.

        Show
        Robert Muir added a comment - +1 to commit with that modification, thanks for doing the calculations here.
        Hide
        ASF subversion and git services added a comment -

        Commit 1687789 from Michael McCandless in branch 'dev/trunk'
        [ https://svn.apache.org/r1687789 ]

        LUCENE-6325: use array for number -> FieldInfo lookup, except in very sparse cases

        Show
        ASF subversion and git services added a comment - Commit 1687789 from Michael McCandless in branch 'dev/trunk' [ https://svn.apache.org/r1687789 ] LUCENE-6325 : use array for number -> FieldInfo lookup, except in very sparse cases
        Hide
        ASF subversion and git services added a comment -

        Commit 1687792 from Michael McCandless in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1687792 ]

        LUCENE-6325: use array for number -> FieldInfo lookup, except in very sparse cases

        Show
        ASF subversion and git services added a comment - Commit 1687792 from Michael McCandless in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1687792 ] LUCENE-6325 : use array for number -> FieldInfo lookup, except in very sparse cases
        Hide
        Michael McCandless added a comment -

        Thanks Robert Muir!

        Show
        Michael McCandless added a comment - Thanks Robert Muir !
        Hide
        Shalin Shekhar Mangar added a comment -

        Bulk close for 5.3.0 release

        Show
        Shalin Shekhar Mangar added a comment - Bulk close for 5.3.0 release

          People

          • Assignee:
            Michael McCandless
            Reporter:
            Robert Muir
          • Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development