Lucene - Core / LUCENE-2762

Don't leak deleted open file handles with pooled readers

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.9.4, 3.0.3, 3.1, 4.0-ALPHA
    • Fix Version/s: 2.9.4, 3.0.3, 3.1, 4.0-ALPHA
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      If you have CFS enabled today, and pooling is enabled (either directly
      or because you've pulled an NRT reader), IndexWriter will hold open
      SegmentReaders against the non-CFS format of each merged segment.

      So even if you close all NRT readers you've pulled from the writer,
      you'll still see file handles open against files that have been
      deleted.

      This count will not grow unbounded, since it's limited by the number
      of segments in the index, but it's still a serious problem since the
      app had turned off CFS in the first place presumably to avoid risk of
      too-many-open-files. It's also bad because it ties up disk space
      since these files would otherwise be deleted.
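
      For concreteness, here is a minimal sketch of the usage pattern that exhibits
      the symptom (hedged: it assumes Lucene 3.0.x-era APIs and a placeholder index
      path). Pooling is enabled by pulling an NRT reader from the writer, and even
      after that reader is closed, lsof could show the process holding pre-CFS files
      of merged segments that have since been deleted:

          import java.io.File;

          import org.apache.lucene.analysis.standard.StandardAnalyzer;
          import org.apache.lucene.document.Document;
          import org.apache.lucene.document.Field;
          import org.apache.lucene.index.IndexReader;
          import org.apache.lucene.index.IndexWriter;
          import org.apache.lucene.store.FSDirectory;
          import org.apache.lucene.util.Version;

          public class NrtHandleLeakDemo {
            public static void main(String[] args) throws Exception {
              FSDirectory dir = FSDirectory.open(new File("/tmp/nrt-demo"));
              // CFS is on by default, so merged segments get repacked into .cfs files
              IndexWriter writer = new IndexWriter(dir,
                  new StandardAnalyzer(Version.LUCENE_30),
                  IndexWriter.MaxFieldLength.UNLIMITED);

              for (int i = 0; i < 200000; i++) {
                Document doc = new Document();
                doc.add(new Field("body", "text " + i, Field.Store.NO, Field.Index.ANALYZED));
                writer.addDocument(doc);

                if (i % 20000 == 0) {
                  IndexReader nrt = writer.getReader(); // enables reader pooling in the writer
                  // ... run searches against the NRT reader ...
                  nrt.close(); // pre-fix, pooled SegmentReaders could keep pre-CFS files open
                }
              }

              writer.close();
              dir.close();
            }
          }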

      1. LUCENE-2762.patch
        37 kB
        Michael McCandless

        Activity

        Michael McCandless added a comment -

        Patch.

        I made a new test case showing the problem (and also modified a
        pre-existing test case to also show the problem). Patch includes the
        fix (I don't commit the merged segment until after the CFS is built)
        and both tests now pass.

        I also added a new core test utility class (LineFileDocs) that's able
        to read line files produced by contrib/benchmark, and the new test
        uses this to index lines from a Wikipedia line file. I think we should
        somehow mark this test to run only at night on Hudson, e.g. with a new
        @nightly tag or something... it's already disabled if it can't find the
        line file, though it's now hardwired to the line file location in my env.

        In general I think our tests should sometimes use "real" data, so if we
        can get this Wikipedia line file in a standard place on Hudson then
        longer-running @nightly tests can use it...

        Michael McCandless added a comment -

        So with this patch, we now build the CFS for a merged segment before
        adding that segment to the segment infos.

        This is important, to prevent an NRT reader from opening the pre-CFS
        version, thus tying open the files, using up extra disk space, and
        leaking deleted-but-open files even once all NRT readers are closed.

        But, unfortunately, this means the worst-case temporary peak free disk
        space required when using CFS has gone up... this worst case is hit if
        you 1) open an existing index, 2) call optimize on it, 3) the index
        needs more than 1 merge to become optimized, and 4) on the final merge
        of that optimize just after it's built the CFS but hasn't yet
        committed it to the segment infos. At that point you have 1X due to
        starting segments (which cannot be deleted until commit), another 1X
        due to the segments created by the prior merge (now being merged),
        another 1X by the newly merged single segment, and a final 1X from the
        final CFS. In this worst case that means we require 3X of your index
        size in temporary space.

        In other cases we use less disk space (the NRT case).

        And of course if CFS is off there's no change to the temp disk space.
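
        To make the arithmetic concrete (with hypothetical numbers): for a 10 GB
        index whose optimize needs more than one merge, the peak just after the
        final merge has built its CFS but before it is committed is roughly
        10 GB (starting segments) + 10 GB (segments from the prior merge, now
        being merged) + 10 GB (newly merged single segment) + 10 GB (its new
        CFS) = 40 GB on disk, i.e. 3X the index size beyond the index itself.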

        I've noted this in the javadocs and will add to CHANGES...

        But... I think we should improve our default MP (merge policy). First,
        maybe we should set a maxMergeMB by default? Because immense merges
        cause all sorts of problems, and likely are not going to impact search
        perf. Second, if a newly merged segment will be more than X% of the
        index, I think we should leave it in non-compound-file format even if
        "useCompoundFile" is enabled... I think there's a separate issue open
        somewhere for that 2nd one.

        Jason Rutherglen added a comment -

        I think we should improve our default MP. First, maybe we should set a maxMergeMB by default?

        That's a good idea. However, would we set an absolute size or a size relative to the aggregate size of the index? I'm using 5 GB in production, as otherwise I'm not sure the merge cost is worth the potential performance improvement, i.e. long merges adversely affect indexing performance.
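
        As a reference point, a hedged sketch of capping merge size with the
        3.0.x-era API (in 3.1+ the policy is configured via IndexWriterConfig
        instead; the 5 GB figure just mirrors the production setting mentioned
        above and is only an example value):

            LogByteSizeMergePolicy mp = new LogByteSizeMergePolicy(writer);
            mp.setMaxMergeMB(5 * 1024);  // skip merging segments larger than ~5 GB
            writer.setMergePolicy(mp);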

        Michael McCandless added a comment -

        Fixed.

        I'll open a new issue to make cfs selection dependent on segment size...

        Josef added a comment -

        Since I am still observing file references on deleted index files, I have a quick question regarding this issue:

        We are using Lucene 3.0.3 with standard configuration and near real-time reader, as the following pseudo code shows:

        ...
                // initially obtaining and referencing a near-real-time searcher
                IndexSearcher currentSearcher = new IndexSearcher(writer.getReader());
        ...

                // subsequently obtaining a new near-real-time searcher if necessary
                if (!currentSearcher.getIndexReader().isCurrent()) {
                    IndexReader newReader = currentSearcher.getIndexReader().reopen();
                    IndexSearcher newSearcher = new IndexSearcher(newReader);

                    // release old searcher (by decreasing reference)
                    currentSearcher.getIndexReader().decRef();

                    currentSearcher = newSearcher;
                }
        ...

        After running the application for a while with index updates and doing new searches using the newly obtained IndexSearcher, we noticed that the JVM still holds references to already deleted index files.
        For example:

        java    20742 xxx  394r   REG                8,6      366 3412376 /home/xxx/Desktop/mp.home/dev/index/Artist/_fc.cfs (deleted)
        java    20742 xxx  398r   REG                8,6      366 3412375 /home/xxx/Desktop/mp.home/dev/index/Artist/_fq.cfs
        java    20742 xxx  401uw  REG                8,6        0 3412333 /home/xxx/Desktop/mp.home/dev/index/Artist/write.lock
        java    20742 xxx  415r   REG                8,6      128 3412349 /home/xxx/Desktop/mp.home/dev/index/Artist/_fp.tis
        java    20742 xxx  416r   REG                8,6      366 3412341 /home/xxx/Desktop/mp.home/dev/index/Artist/_fd.cfs (deleted)
        java    20742 xxx  417r   REG                8,6      366 3412344 /home/xxx/Desktop/mp.home/dev/index/Artist/_fe.cfs (deleted)
        java    20742 xxx  418r   REG                8,6       71 3412356 /home/xxx/Desktop/mp.home/dev/index/Artist/_fb.tis (deleted)
        java    20742 xxx  424r   REG                8,6        7 3412362 /home/xxx/Desktop/mp.home/dev/index/Artist/_fb.frq (deleted)
        java    20742 xxx  425r   REG                8,6        7 3412363 /home/xxx/Desktop/mp.home/dev/index/Artist/_fb.prx (deleted)
        java    20742 xxx  426r   REG                8,6       23 3412351 /home/xxx/Desktop/mp.home/dev/index/Artist/_fb.fdt (deleted)
        java    20742 xxx  427r   REG                8,6       12 3412352 /home/xxx/Desktop/mp.home/dev/index/Artist/_fb.fdx (deleted)
        java    20742 xxx  428r   REG                8,6       10 3412365 /home/xxx/Desktop/mp.home/dev/index/Artist/_fb.nrm (deleted)
        java    20742 xxx  429r   REG                8,6       21 3412357 /home/xxx/Desktop/mp.home/dev/index/Artist/_fp.frq
        java    20742 xxx  432r   REG                8,6       21 3412358 /home/xxx/Desktop/mp.home/dev/index/Artist/_fp.prx
        java    20742 xxx  433r   REG                8,6       61 3412347 /home/xxx/Desktop/mp.home/dev/index/Artist/_fp.fdt
        java    20742 xxx  434r   REG                8,6       28 3412348 /home/xxx/Desktop/mp.home/dev/index/Artist/_fp.fdx
        java    20742 xxx  445r   REG                8,6       22 3412360 /home/xxx/Desktop/mp.home/dev/index/Artist/_fp.nrm
        

        The application eventually hits the limit on the maximum number of open files and then stops.

        Are we doing anything wrong here or does the bug still exist in version 3.0.3?
        Any advice is welcome.
        Thanks!

        Michael McCandless added a comment -

        Hmmm... the fix here was backported to 3.0.x, so it should be fixed in 3.0.3.

        The code fragment looks correct, but your usage of decRef (not close) likely means you have other places that also incRef the reader? Are you sure all incRefs are balanced by a decRef?

        Are you sure you really have leaking file handles and not just too many open files? EG what mergeFactor are you using...?

        Josef added a comment -

        Thanks for the quick response.

        You are right, we just increase the reference for the current Searcher when it is obtained and used. Decreasing it again is part of a finally block. I have triple-checked this code to make sure we are not doing anything wrong, and I am very sure we correctly decrease the reference in every case.

        The output of the 'lsof' command above shows handles to files already marked deleted. The number of files marked deleted increases over time until we get a Java 'IOException: Too many open files'.
        If we don't hold the reference to the reader contained in our searcher, who is?

        Regarding merge factor etc: we are using standard configurations, i.e. no tweaking (yet).
        Besides, we do not call IndexWriter.optimize() in our code yet.

        Michael McCandless added a comment -

        We have a test, TestNRTThreads, that stress-tests NRT to try to catch issues like this.

        I've backported to 3.0.x, and was able to run it for 30 minutes with max 1024 file handles, successfully... so if there is a file handle leak, this test failed to uncover it.

        Not sure what's up... can you try to boil down your test case to a small test showing the problem?

        Note that when you incRef the searcher you have to do that sync'd somehow with your reopen logic to prevent the final decRef from running at the same time. (I don't think that's related to the file handle issue but it's important to get right else you have a thread hazard).
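
        A hedged sketch of the acquire/release pattern being described (the
        class shape and method names are illustrative, not from this issue):

            // All swaps of currentSearcher and all acquires are synchronized so the
            // final decRef of an old reader cannot race an in-flight acquire().
            private IndexSearcher currentSearcher;

            public synchronized IndexSearcher acquire() {
              currentSearcher.getIndexReader().incRef();  // one incRef per search
              return currentSearcher;
            }

            public void release(IndexSearcher s) throws IOException {
              s.getIndexReader().decRef();                // balanced in a finally block
            }

            public synchronized void maybeReopen() throws IOException {
              IndexReader current = currentSearcher.getIndexReader();
              if (!current.isCurrent()) {
                IndexReader newReader = current.reopen();
                if (newReader != current) {
                  currentSearcher = new IndexSearcher(newReader);
                  current.decRef();  // drop the reference the old searcher held
                }
              }
            }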

        Josef added a comment -

        Thanks for your response.
        I'll try to extract a small test mimicking the behavior.

        And thanks for pointing out the incRef / decRef threading issue. I am aware of that, and the relevant code is synchronized.

        Josef added a comment -

        I have extracted the relevant code into a small Java application, but I am not able to reproduce the observed behavior with it.

        It might have to do with our integration and usage of Lucene in the Glassfish application server.
        I'll go hunting there now.
        Any advice on such an integration?

        Thanks for your help anyway!

        Michael McCandless added a comment -

        Hmm I don't know anything about Glassfish... but it's interesting your separate tool couldn't repro the problem. Maybe re-scrutinize to verify your standalone tool truly matches how the app is using Lucene?

        Shai Erera added a comment -

        Could this be related to all the leaked handles we fixed at LUCENE-3147?

        Michael McCandless added a comment -

        Could this be related to all the leaked handles we fixed at LUCENE-3147?

        I think that's unlikely? I.e., the leaks we are fixing in LUCENE-3147 are all cases where an exception was hit at some point?

        Unless: Josef, are you regularly hitting exceptions somewhere....?

        Josef added a comment -

        Unless: Josef, are you regularly hitting exceptions somewhere....?

        No, we don't get any exceptions. I'll do some testing with our Glassfish integration and also make sure my test scenario matches the concrete application integration. Hopefully this will isolate our problem. I'll report back here.


          People

          • Assignee:
            Michael McCandless
          • Reporter:
            Michael McCandless
          • Votes:
            0
          • Watchers:
            2
