Lucene - Core
LUCENE-5591

ReaderAndUpdates should create a proper IOContext when writing DV updates

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.9, 6.0
    • Component/s: None
    • Labels: None
    • Lucene Fields: New, Patch Available

      Description

      Today we pass IOContext.DEFAULT. If DV updates are used in conjunction with NRTCachingDirectory, the latter will attempt to write the entire DV field in its RAMDirectory, which could lead to OOM.

      It would be good if we could build our own FlushInfo, estimating the number of bytes we're about to write. I didn't see an obvious quick way to guesstimate that; I thought of using the current DV's sizeInBytes as an approximation, but I don't see a way to get it, at least not a direct way.

      Maybe we can use the size of the in-memory updates to guesstimate that amount? Something like sizeOfInMemUpdates * (maxDoc/numUpdatedDocs)? Is that too wild a guess?
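      A rough sketch of that proposal, not the committed patch: the method and variable names (forUpdate, sizeOfInMemUpdates, numUpdatedDocs) are made up for illustration. FlushInfo carries the estimated segment size that NRTCachingDirectory consults when deciding whether to cache a file in RAM.

      import org.apache.lucene.store.FlushInfo;
      import org.apache.lucene.store.IOContext;

      // Illustrative sketch of the estimate proposed above (names are placeholders).
      class DVUpdateIOContextSketch {
        static IOContext forUpdate(long sizeOfInMemUpdates, int numUpdatedDocs, int maxDoc) {
          // Scale the in-memory size of the updates up to the whole segment, assuming
          // the updated docs are representative of all docs.
          long estimatedBytes =
              (long) Math.ceil(sizeOfInMemUpdates * ((double) maxDoc / numUpdatedDocs));
          return new IOContext(new FlushInfo(maxDoc, estimatedBytes));
        }
      }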

      Attachments

      1. LUCENE-5591.patch (14 kB) - Shai Erera
      2. LUCENE-5591.patch (14 kB) - Shai Erera
      3. LUCENE-5591.patch (10 kB) - Shai Erera

        Activity

        Michael McCandless added a comment -

        +1, good catch.

        I think that guesstimate is a good start. It likely over-estimates wildly, though, since in-RAM structures are usually much more costly than the on-disk format; maybe try it out and see how much it over-estimates?

        Shai Erera added a comment -

        I will. I think over-estimating is better than under-estimating in that case, since in the worst case the files will be flushed to disk, rather than the app hitting OOM.

        Shai Erera added a comment -

        The patch adds an approximate flush size:

        • DocValuesFieldUpdates.avgUpdateSize, implemented by both Numeric and Binary.
        • ReaderAndUpdates creates a FlushInfo with the total avgUpdateSize (over all fields) x maxDoc as an approximation (see the sketch below). I also approximated the size of the FIS.
        • Added two tests to TestNumeric/BinaryDVUpdates using NRTCachingDirectory, making sure that we don't pass IOContext.DEFAULT (ensuring there are no cached files after applying an update).

        I think it's ready.
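        A rough sketch of that per-field aggregation: FieldUpdates and avgUpdateSize() are stand-ins for the patch's package-private DocValuesFieldUpdates API, and approxFieldInfosBytes is an illustrative parameter, so treat this as an outline rather than the patch itself.

        import org.apache.lucene.store.FlushInfo;
        import org.apache.lucene.store.IOContext;

        // Illustrative aggregation: sum the per-doc estimate over all updated fields,
        // multiply by maxDoc, and add the estimated FieldInfos size.
        class FlushSizeEstimatorSketch {
          interface FieldUpdates {
            long avgUpdateSize(); // approximate bytes one field's updates contribute per document
          }

          static IOContext newUpdatesContext(Iterable<FieldUpdates> updatesPerField,
                                             int maxDoc, long approxFieldInfosBytes) {
            long bytesPerDoc = 0;
            for (FieldUpdates updates : updatesPerField) {
              bytesPerDoc += updates.avgUpdateSize();
            }
            long estimatedBytes = bytesPerDoc * maxDoc + approxFieldInfosBytes;
            return new IOContext(new FlushInfo(maxDoc, estimatedBytes));
          }
        }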

        Shai Erera added a comment -

        BTW, I started by adding ramBytesUsed() to DocValuesFieldUpdates, but that was a big over-estimate, especially when the number of updates is small. That's due to the buffers used by these classes, e.g. GrowableWriter with pageSize=1024. I don't think the RAM representation should be used as the estimate; the average update size is closer to what will eventually be written to disk.

        Michael McCandless added a comment -

        Looks great Shai, thanks!

        Is avgUpdateSize supposed to be "bytes per doc"? If so, instead of bitsPerValue, shouldn't we return bitsPerValue/8, maybe rounded up to the nearest byte? Should we rename the method ... maybe ramBytesPerDoc or something?

        Shouldn't BinaryDocValuesFieldUpdates.avgUpdateSize also include the docs/offsets/lengths RAM used too?

        Separately, I noticed BinaryDocValuesFieldUpdates's add method is doing a BytesRef.append of each added value ... isn't this slowish (O(N^2) where N = number of docs that have been updated)? BytesRef.append doesn't use ArrayUtil.grow to size the array on overflow...
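        For reference, the bits-to-bytes rounding suggested here is just a ceiling division; a minimal sketch, with an illustrative class and method name:

        // Round a per-value bit width up to whole bytes (illustrative helper, not Lucene API).
        class BitsToBytesSketch {
          static long bytesPerValue(int bitsPerValue) {
            return (bitsPerValue + 7) / 8; // 1..8 bits -> 1 byte, 9..16 bits -> 2 bytes, ...
          }
        }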

        Shai Erera added a comment -

        Thanks Mike. I renamed the method to ramBytesPerDoc and fixed Numeric to return a proper approximation (I had neglected to factor in the values themselves!). Also, I now estimate the amount of RAM per document used by the PagedGrowableWriters.

        I don't call BytesRef.append(); I call grow() and arraycopy(). I could have used bytesRef.grow() followed by bytesRef.append(), but that double-checks the capacity...

        Tests pass.
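        The append pattern described here, sketched in isolation (the values buffer and field names are illustrative, not the patch's code): grow the backing array with ArrayUtil.grow, which over-allocates for amortized appends, then copy the bytes in with System.arraycopy.

        import org.apache.lucene.util.ArrayUtil;
        import org.apache.lucene.util.BytesRef;

        // Illustrative append: amortized growth via ArrayUtil.grow + System.arraycopy,
        // instead of resizing exactly to the needed length on every call.
        class AppendSketch {
          byte[] values = new byte[16];
          int length = 0;

          void append(BytesRef value) {
            values = ArrayUtil.grow(values, length + value.length); // over-allocates
            System.arraycopy(value.bytes, value.offset, values, length, value.length);
            length += value.length;
          }
        }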

        Michael McCandless added a comment -

        Looks good, thanks Shai.

        Maybe we can just do the floating-point math and take the ceil at the end? The ceil2 is sort of confusing...

        Shai Erera added a comment -

        OK, I moved to Math.ceil(). I thought ceil2() was quite cool, but this isn't hot code; it's called once per flush, and it's better to have readable code.
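        In other words (a minimal illustration with made-up names, not the patch's code): compute the estimate in doubles and round up once at the end.

        // Illustrative only: one Math.ceil at the end instead of chained integer-ceiling helpers.
        class FlushSizeMathSketch {
          static long estimatedBytes(double bytesPerDoc, int maxDoc) {
            return (long) Math.ceil(bytesPerDoc * maxDoc);
          }
        }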

        Michael McCandless added a comment -

        +1, thanks Shai!

        ASF subversion and git services added a comment -

        Commit 1591469 from Shai Erera in branch 'dev/trunk'
        [ https://svn.apache.org/r1591469 ]

        LUCENE-5591: pass proper IOContext when writing DocValues updates

        ASF subversion and git services added a comment -

        Commit 1591474 from Shai Erera in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1591474 ]

        LUCENE-5591: pass proper IOContext when writing DocValues updates

        Shai Erera added a comment -

        Thanks Mike. Committed to trunk and 4x.


          People

          • Assignee: Shai Erera
          • Reporter: Shai Erera
          • Votes: 0
          • Watchers: 2
