Lucene - Core
  1. Lucene - Core
  2. LUCENE-1717

IndexWriter does not properly account for the RAM consumed by pending deletes

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 2.4, 2.4.1
    • Fix Version/s: 2.9
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      IndexWriter, with autoCommit false, is able to carry buffered deletes for quite some time before materializing them to docIDs (thus freeing up RAM used).

      It's only on triggering a merge (or, commit/close) that the deletes are materialized and the RAM is freed.

      I expect this in practice is a smallish amount of RAM, but we should still fix it.

      I don't have a patch yet so if someone wants to grab this, feel free!!

      1. BufferedDeletes_beautification.patch
        0.6 kB
        Simon Willnauer
      2. LUCENE-1717.patch
        12 kB
        Michael McCandless

        Activity

        Hide
        Simon Willnauer added a comment -

        Maybe I miss something but IndexWriter#setMaxBufferedDeleteTerms can be used to set an upper bound for those terms. Once you hit the upper bound BufferedDeletes should be flushed to disc by calling IndexWriter#flush(). This can happend with either a add or a delete. Maybe I do not completely understand what you mean by materialized.

        Show
        Simon Willnauer added a comment - Maybe I miss something but IndexWriter#setMaxBufferedDeleteTerms can be used to set an upper bound for those terms. Once you hit the upper bound BufferedDeletes should be flushed to disc by calling IndexWriter#flush(). This can happend with either a add or a delete. Maybe I do not completely understand what you mean by materialized.
        Hide
        Michael McCandless added a comment -

        Sorry, you are correct – that's the obvious workaround here.

        But we default that to unlimited (meaning, flush when RAM limit is hit), which I think is a good default once we fix the accounting in IndexWriter to properly account for buffered delete's RAM usage.

        Show
        Michael McCandless added a comment - Sorry, you are correct – that's the obvious workaround here. But we default that to unlimited (meaning, flush when RAM limit is hit), which I think is a good default once we fix the accounting in IndexWriter to properly account for buffered delete's RAM usage.
        Hide
        Simon Willnauer added a comment -

        But we default that to unlimited (meaning, flush when RAM limit is hit), which I think is a good default once we fix the accounting in IndexWriter to properly account for buffered delete's RAM usage.

        I agree we should track the ram usage of BufferedDeletes too.
        One other thing I wonder about is why deletesInRam is only cleared on abort. Once IndexWriter#doFlushInternal is executed the DocWriter pushes deletes from deletesInRam to deletesFlushed. Shouldn't this call deletesInRam#clear() to free the memory in this instance of BufferedDeletes. That could at least help a bit if I do not miss anything important.

        Show
        Simon Willnauer added a comment - But we default that to unlimited (meaning, flush when RAM limit is hit), which I think is a good default once we fix the accounting in IndexWriter to properly account for buffered delete's RAM usage. I agree we should track the ram usage of BufferedDeletes too. One other thing I wonder about is why deletesInRam is only cleared on abort. Once IndexWriter#doFlushInternal is executed the DocWriter pushes deletes from deletesInRam to deletesFlushed. Shouldn't this call deletesInRam#clear() to free the memory in this instance of BufferedDeletes. That could at least help a bit if I do not miss anything important.
        Hide
        Michael McCandless added a comment -

        deletesInRAM is in fact cleared, on calling deletesFlushed.update(deletesInRAM). Ie, that call "transfers" the deletesInRAM to deletesFlushed.

        Then, when applyDeletes is called, we clear deletesFlushed.

        Show
        Michael McCandless added a comment - deletesInRAM is in fact cleared, on calling deletesFlushed.update(deletesInRAM). Ie, that call "transfers" the deletesInRAM to deletesFlushed. Then, when applyDeletes is called, we clear deletesFlushed.
        Hide
        Simon Willnauer added a comment -

        HA, true! I missed that. Nevermind!
        I did not catch it as it calls the clear methods directly instead of using BufferedDocs#clear(). Might be easier to catch if people look into that if we would call clear() instead. I will attach a patch for this beautification.

        Show
        Simon Willnauer added a comment - HA, true! I missed that. Nevermind! I did not catch it as it calls the clear methods directly instead of using BufferedDocs#clear(). Might be easier to catch if people look into that if we would call clear() instead. I will attach a patch for this beautification.
        Hide
        Simon Willnauer added a comment -

        Regarding buffered delete's RAM usage, accounting an exact number is quite difficult in this case as there are many strings involved (Terms with field and value) . BufferedDeletes#terms stores <Term, Num> and BufferedDeletes#queries stores <Query, Num> in both cases the value part is easy to account while especially for query the memory consumption is hard to guess similarly the amount of memory a Term takes.

        On the other hand I would like to have a notion of memory consumption os BufferedDeletes but the IndexWriters#setRAMBufferSizeMB javaDoc clearly says that this does not include the memory used by buffered deletes. I would rather tend to leave it as it is and make it clear in javadoc / wiki that setMaxBufferedDeleteTerms is the way to go if you run into memory problems. Feels quite ambiguous to estimate the memory of buffered deletes.

        I think is a good default once we fix the accounting in IndexWriter to properly account for buffered delete's RAM usage.

        is there already an issue to fix the RAM usage?

        Show
        Simon Willnauer added a comment - Regarding buffered delete's RAM usage, accounting an exact number is quite difficult in this case as there are many strings involved (Terms with field and value) . BufferedDeletes#terms stores <Term, Num> and BufferedDeletes#queries stores <Query, Num> in both cases the value part is easy to account while especially for query the memory consumption is hard to guess similarly the amount of memory a Term takes. On the other hand I would like to have a notion of memory consumption os BufferedDeletes but the IndexWriters#setRAMBufferSizeMB javaDoc clearly says that this does not include the memory used by buffered deletes. I would rather tend to leave it as it is and make it clear in javadoc / wiki that setMaxBufferedDeleteTerms is the way to go if you run into memory problems. Feels quite ambiguous to estimate the memory of buffered deletes. I think is a good default once we fix the accounting in IndexWriter to properly account for buffered delete's RAM usage. is there already an issue to fix the RAM usage?
        Hide
        Michael McCandless added a comment -

        is there already an issue to fix the RAM usage?

        This is the issue.

        I think even if our measure is not perfect we should try to take a stab at accounting for RAM usage of deletes; it's a trap, now. We shouldn't set traps.

        Show
        Michael McCandless added a comment - is there already an issue to fix the RAM usage? This is the issue. I think even if our measure is not perfect we should try to take a stab at accounting for RAM usage of deletes; it's a trap, now. We shouldn't set traps.
        Hide
        Michael McCandless added a comment -

        Attached patch.

        I made an estimate of RAM usage for buffered delete terms and docIDs that I think should be fairly close. Buffered delete Query instances, however, are undercounted (I say this in the javadocs) since measuring that would be rather challenging.

        Show
        Michael McCandless added a comment - Attached patch. I made an estimate of RAM usage for buffered delete terms and docIDs that I think should be fairly close. Buffered delete Query instances, however, are undercounted (I say this in the javadocs) since measuring that would be rather challenging.

          People

          • Assignee:
            Michael McCandless
            Reporter:
            Michael McCandless
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development