Lucene - Core
  1. Lucene - Core
  2. LUCENE-5373

Lucene42DocValuesProducer.ramBytesUsed is over-estimated

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.6.1
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Lucene42DocValuesProducer.ramBytesUsed uses RamUsageEstimator.sizeOf(this) to return an estimation of the memory usage. One of the issues (there might be other ones) is that this class has a reference to an IndexInput that might link to other data-structures that we wouldn't want to take into account. For example, index inputs of a RAMDirectory all point to the directory itself, so Lucene42DocValuesProducer.ramBytesUsed would return the amount of memory used by the whole directory.

      1. LUCENE-5373.patch
        32 kB
        Adrien Grand
      2. LUCENE-5373.patch
        12 kB
        Adrien Grand
      3. LUCENE-5373.patch
        11 kB
        Adrien Grand
      4. LUCENE-5373.patch
        4 kB
        Adrien Grand

        Activity

        Hide
        Shay Banon added a comment -

        as someone who found this issue, on top of the wrong computation, its also very expensive. This call should be lightweight and hopefully not use sizeOf at all... . At the very least, if possible, the result of it should be cached? Maybe even introduce size caching on a higher level (calling code) if possible.

        Show
        Shay Banon added a comment - as someone who found this issue, on top of the wrong computation, its also very expensive. This call should be lightweight and hopefully not use sizeOf at all... . At the very least, if possible, the result of it should be cached? Maybe even introduce size caching on a higher level (calling code) if possible.
        Hide
        Adrien Grand added a comment -

        Here is a patch. Lucene42DocValuesProducer no more relies on RamUsageEstimator.sizeOf(Object) but instead has a member that stores its memory usage which is incremented every time we load doc values on a new field. This should be both faster and more accurate.

        I didn't take into account object alignment, the numeric/binary/fst entries and the size of some small hash tables on purpose to keep size estimation simple. These should be very small compared to the structures that actually store doc values anyway.

        Show
        Adrien Grand added a comment - Here is a patch. Lucene42DocValuesProducer no more relies on RamUsageEstimator.sizeOf(Object) but instead has a member that stores its memory usage which is incremented every time we load doc values on a new field. This should be both faster and more accurate. I didn't take into account object alignment, the numeric/binary/fst entries and the size of some small hash tables on purpose to keep size estimation simple. These should be very small compared to the structures that actually store doc values anyway.
        Hide
        Adrien Grand added a comment -

        New patch that adds an CHANGES.txt entry and also fixes Memory and DirectDocValuesProducer.

        Show
        Adrien Grand added a comment - New patch that adds an CHANGES.txt entry and also fixes Memory and DirectDocValuesProducer.
        Hide
        Robert Muir added a comment -

        can we please use AtomicLong rather than volatile?

        Show
        Robert Muir added a comment - can we please use AtomicLong rather than volatile?
        Hide
        Adrien Grand added a comment -

        OK for AtomicLong, here is a new patch. I'll commit soon if there is no objection.

        Show
        Adrien Grand added a comment - OK for AtomicLong, here is a new patch. I'll commit soon if there is no objection.
        Hide
        Adrien Grand added a comment -

        New version of the patch with Lucene40DocValuesProducer as well. Unless I missed something, RamUsageEstimator.sizeOf(Object) is not called by SegmentReader.ramBytesUsed() anymore.

        You may still see calls to RamUsageEstimator.sizeOf(native array) but these methods are actually both accurate and fast: they basically just compute NUM_BYTES_ARRAY_HEADER + length * NUM_BYTES_element and return an aligned size.

        Show
        Adrien Grand added a comment - New version of the patch with Lucene40DocValuesProducer as well. Unless I missed something, RamUsageEstimator.sizeOf(Object) is not called by SegmentReader.ramBytesUsed() anymore. You may still see calls to RamUsageEstimator.sizeOf(native array) but these methods are actually both accurate and fast: they basically just compute NUM_BYTES_ARRAY_HEADER + length * NUM_BYTES_element and return an aligned size.
        Hide
        Simon Willnauer added a comment -

        Thanks for fixing this as well (we spoke offline about this and adrien fixed it before I was able to comment). I actually wonder if RUE.sizeOf(Object) should be test only, it could be in a different patch but this seems very dangerous IMO. There are also classes like AnalyzingSuggester etc. that uses it and I wonder if that can have bad impact on perf. I can already see users calling it in a loop

        Show
        Simon Willnauer added a comment - Thanks for fixing this as well (we spoke offline about this and adrien fixed it before I was able to comment). I actually wonder if RUE.sizeOf(Object) should be test only, it could be in a different patch but this seems very dangerous IMO. There are also classes like AnalyzingSuggester etc. that uses it and I wonder if that can have bad impact on perf. I can already see users calling it in a loop
        Hide
        Adrien Grand added a comment -

        It might be tricky in some cases. For example, if we want to keep CachingWrapperFilter.sizeInBytes(), we would probably need to add a ramBytesUsed() method on cacheable filters. Let's open another issue to review usage of RamUsageEstimator?

        Show
        Adrien Grand added a comment - It might be tricky in some cases. For example, if we want to keep CachingWrapperFilter.sizeInBytes() , we would probably need to add a ramBytesUsed() method on cacheable filters. Let's open another issue to review usage of RamUsageEstimator?
        Hide
        Simon Willnauer added a comment -

        +1 to another issue

        Show
        Simon Willnauer added a comment - +1 to another issue
        Hide
        ASF subversion and git services added a comment -

        Commit 1552751 from Adrien Grand in branch 'dev/trunk'
        [ https://svn.apache.org/r1552751 ]

        LUCENE-5373: Fix memory usage estimation for [Lucene40/Lucene42/Memory/Direct]DocValuesProducer.

        Show
        ASF subversion and git services added a comment - Commit 1552751 from Adrien Grand in branch 'dev/trunk' [ https://svn.apache.org/r1552751 ] LUCENE-5373 : Fix memory usage estimation for [Lucene40/Lucene42/Memory/Direct] DocValuesProducer.
        Hide
        ASF subversion and git services added a comment -

        Commit 1552753 from Adrien Grand in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1552753 ]

        LUCENE-5373: Fix memory usage estimation for [Lucene40/Lucene42/Memory/Direct]DocValuesProducer.

        Show
        ASF subversion and git services added a comment - Commit 1552753 from Adrien Grand in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1552753 ] LUCENE-5373 : Fix memory usage estimation for [Lucene40/Lucene42/Memory/Direct] DocValuesProducer.
        Hide
        ASF subversion and git services added a comment -

        Commit 1552759 from Adrien Grand in branch 'dev/branches/lucene_solr_4_6'
        [ https://svn.apache.org/r1552759 ]

        LUCENE-5373: Fix memory usage estimation for [Lucene40/Lucene42/Memory/Direct]DocValuesProducer.

        Show
        ASF subversion and git services added a comment - Commit 1552759 from Adrien Grand in branch 'dev/branches/lucene_solr_4_6' [ https://svn.apache.org/r1552759 ] LUCENE-5373 : Fix memory usage estimation for [Lucene40/Lucene42/Memory/Direct] DocValuesProducer.
        Hide
        Adrien Grand added a comment -

        Thanks for the reviews!

        Show
        Adrien Grand added a comment - Thanks for the reviews!

          People

          • Assignee:
            Adrien Grand
            Reporter:
            Adrien Grand
          • Votes:
            0 Vote for this issue
            Watchers:
            5 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development