Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.4, 6.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Currently we still load monotonic blocks into memory to map doc ids to an offset on disk. Since these data structures are usually consumed sequentially, I would like to investigate leaving them on disk.

      Attachments

      1. LUCENE-6840.patch
        404 kB
        Adrien Grand
      2. LUCENE-6840.patch
        402 kB
        Adrien Grand
      3. LUCENE-6840.patch
        48 kB
        Adrien Grand
      4. LUCENE-6840.patch
        34 kB
        Adrien Grand

        Activity

        Adrien Grand added a comment -

        Here is a patch that should improve memory usage for:

        • variable-length binary fields
        • multi-valued sorted numeric fields
        • multi-valued sorted set fields

        On the other hand, the BINARY_PREFIX_COMPRESSED format still uses MonotonicBlockPackedReader/Writer.

        I wrote the patch by changing Lucene50DocValuesFormat to make it easier to review, but when it's ready I plan to make it a whole new format (with a new Lucene54Codec, etc.).

        Compared to before, only per-block metadata is kept in memory; the data itself is written to disk using the DirectWriter/slice APIs. Out of curiosity, I tried to write all entries of my /usr/share/dict/words file into a binary dv field to see how it compares to trunk:

        trunk
          .dvd: 992334 bytes
          .dvm: 128 bytes
          memory usage: 153124 bytes
        patch
          .dvd: 1038100 bytes
          .dvm: 165 bytes
          memory usage: 232 bytes
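        For reference, here is a minimal sketch of the write side using the DirectWriter API mentioned above; the helper name is made up, and the actual patch additionally delta-encodes the monotonic values before packing them, which this sketch skips:

        import java.io.IOException;
        import org.apache.lucene.store.IndexOutput;
        import org.apache.lucene.util.packed.DirectWriter;

        // Hypothetical helper: pack one block of monotonically increasing offsets
        // straight into the data file instead of keeping them on the heap.
        static void writeOffsetBlock(IndexOutput data, long[] offsets) throws IOException {
          // offsets are monotonic, so the last one is the largest
          int bitsPerValue = DirectWriter.bitsRequired(offsets[offsets.length - 1]);
          DirectWriter writer = DirectWriter.getInstance(data, offsets.length, bitsPerValue);
          for (long offset : offsets) {
            writer.add(offset);
          }
          writer.finish(); // pads the block so it can later be random-accessed with DirectReader
        }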
        

        One important thing is that I had to introduce some per-thread memory usage: each thread needs its own array of DirectReader instances (one per block). This is why I raised the block size from 16K to 64K, in order to have fewer blocks. Maybe the block size would need to be increased even further (though that would also hurt compression a bit). In the worst case of a segment with 2B documents, there would be 32k blocks of 64k values, so each thread would need about 1.2MB of memory. In my opinion that is ok, since apps should query their Lucene indices from a reasonable number of threads, and it would probably still be much better than today: even a single bit of memory per document (which today's MonotonicBlockPackedReader easily requires) would use 256MB of memory.
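        A quick back-of-the-envelope check of that worst case (jshell-style); the ~40 bytes of overhead per block-level DirectReader slot is an assumption, the other numbers come from the comment above:

        long maxDoc = 2_000_000_000L;            // worst case: ~2B values in a single segment
        long blockSize = 64 * 1024;              // block size after raising it from 16K to 64K
        long numBlocks = (maxDoc + blockSize - 1) / blockSize;  // ~30.5k blocks, i.e. "32k"
        long bytesPerBlock = 40;                 // assumed cost of one DirectReader array entry
        System.out.println((numBlocks * bytesPerBlock) / (1024.0 * 1024.0) + " MB per thread");  // ~1.2 MB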

        Adrien Grand added a comment -

        I did a quick benchmark with the geonames dataset (10740477 documents), with only two doc values fields:

        • name as a binary dv field
        • alternatenames as a sorted set dv field

        Then I measured disk/memory usage and tried to sort on both fields (using SortedSetSelector.Type.MIN for the multi-valued field) with a MatchAllDocsQuery:

        branch   index size (MB)   memory usage (MB)   time to sort on "name" (ms)   time to sort on "alternatenames" (ms)
        trunk    311               33.1                4750                          3570
        patch    319 (+2%)         1.4 (-96%)          5390 (+13%)                   4000 (+12%)

        Maybe there are things we could still optimize in the patch, but even with these numbers I think it offers the better trade-off: I am not very happy that the current format takes more than 3 bytes of memory per document for only two doc values fields.
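        The sort setup looked roughly like the sketch below; the field names match the benchmark above, while the index path handling and timing are assumed harness details, not the actual benchmark code:

        import java.nio.file.Paths;
        import org.apache.lucene.index.DirectoryReader;
        import org.apache.lucene.search.IndexSearcher;
        import org.apache.lucene.search.MatchAllDocsQuery;
        import org.apache.lucene.search.Sort;
        import org.apache.lucene.search.SortField;
        import org.apache.lucene.search.SortedSetSelector;
        import org.apache.lucene.search.SortedSetSortField;
        import org.apache.lucene.store.Directory;
        import org.apache.lucene.store.FSDirectory;

        public class GeonamesSortBench {
          public static void main(String[] args) throws Exception {
            try (Directory dir = FSDirectory.open(Paths.get(args[0]));
                 DirectoryReader reader = DirectoryReader.open(dir)) {
              IndexSearcher searcher = new IndexSearcher(reader);

              // sort on the single-valued binary dv field
              Sort byName = new Sort(new SortField("name", SortField.Type.STRING_VAL));
              long t0 = System.nanoTime();
              searcher.search(new MatchAllDocsQuery(), 10, byName);
              System.out.println("name: " + (System.nanoTime() - t0) / 1_000_000 + " ms");

              // sort on the multi-valued sorted set dv field, picking the minimum ord per doc
              Sort byAlt = new Sort(
                  new SortedSetSortField("alternatenames", false, SortedSetSelector.Type.MIN));
              long t1 = System.nanoTime();
              searcher.search(new MatchAllDocsQuery(), 10, byAlt);
              System.out.println("alternatenames: " + (System.nanoTime() - t1) / 1_000_000 + " ms");
            }
          }
        }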

        Adrien Grand added a comment -

        Here is a new patch that folds the ability to read after a given offset into DirectReader directly, so that we don't have to introduce a new implementation of RandomAccessInput, which seemed to confuse HotSpot a bit.

        branch   index size (MB)   memory usage (MB)   time to sort on "name" (ms)   time to sort on "alternatenames" (ms)
        trunk    311               33.1                4780                          3580
        patch    319 (+2%)         1.4 (-96%)          4960 (+4%)                    3790 (+6%)
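        The new offset-aware lookup can be used roughly as in this sketch; the helper name and parameters are made up, only the randomAccessSlice and DirectReader.getInstance calls are the real API:

        import java.io.IOException;
        import org.apache.lucene.store.IndexInput;
        import org.apache.lucene.store.RandomAccessInput;
        import org.apache.lucene.util.LongValues;
        import org.apache.lucene.util.packed.DirectReader;

        // Hypothetical reader-side helper: all blocks share one random-access slice of the
        // data file, and each block gets a LongValues view starting at its own byte offset.
        static LongValues openBlock(IndexInput data, long sliceStart, long sliceLength,
                                    int bitsPerValue, long blockStartOffset) throws IOException {
          RandomAccessInput slice = data.randomAccessSlice(sliceStart, sliceLength);
          // the third argument is the new part: where this block's packed values begin,
          // so no extra RandomAccessInput wrapper is needed just to shift the origin
          return DirectReader.getInstance(slice, bitsPerValue, blockStartOffset);
        }
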
        Adrien Grand added a comment -

        Here is a patch that makes this a new DocValuesFormat with a new associated Codec, and moves Lucene53Codec and Lucene50DocValuesFormat to lucene/backward-codecs. I would like to commit soon if there are no objections.

        Robert Muir added a comment -

        Can we add back a helper like this to DirectReader? I think it helps the API for the general case, where an offset into a slice is kinda silly (but I get why it helps for this case).

        public static LongValues getInstance(RandomAccessInput slice, int bitsPerValue) {
          return getInstance(slice, bitsPerValue, 0);
        }
        

        Thanks for adding the offset tests to the DirectReader tests.

        Should we move LongValues.EMPTY to DirectMonotonicReader since it's the only one using it right now?

        Adrien Grand added a comment -

        Here is an updated patch that should address your comments. Thank you for the review, I'll commit soon.

        ASF subversion and git services added a comment -

        Commit 1710876 from Adrien Grand in branch 'dev/trunk'
        [ https://svn.apache.org/r1710876 ]

        LUCENE-6840: Put ord indexes on disk.

        ASF subversion and git services added a comment -

        Commit 1710880 from Adrien Grand in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1710880 ]

        LUCENE-6840: Put ord indexes on disk.


          People

          • Assignee: Adrien Grand
          • Reporter: Adrien Grand
          • Votes: 1
          • Watchers: 7
