HBase / HBASE-4387

Error while syncing: DFSOutputStream is closed

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 0.92.0
    • Fix Version/s: 0.90.5
    • Component/s: wal
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      In a billion-row load on ~25 servers, I see "error while syncing" reasonably often with the error "DFSOutputStream is closed" around a roll. We have some race where a roll at the same time as heavy inserts causes a problem.

      1. errors-with-context.txt
        350 kB
        Todd Lipcon
      2. 4387.txt
        0.9 kB
        Lars Hofhansl

        Activity

        Todd Lipcon created issue -
        Todd Lipcon added a comment -

        Here are the logs with 100 lines of context around all the ERROR lines

        Todd Lipcon made changes -
        Attachment: errors-with-context.txt [ 12494253 ]
        Jean-Daniel Cryans added a comment -

        HLog.syncer() syncs outside of the updateLock and has the following comment:

        // Done in parallel for all writer threads, thanks to HDFS-895

        So we don't need to synchronize for sync'ing but we do need something when closing the file.

        Lars Hofhansl added a comment -

        The writer is closed with the updateLock held.

        So the failure scenario is that we sync the wrong (closed) writer and if the RS dies, we end up with WAL entries that were not sync'ed.

        A simple fix, then, might be to redo the sync with updateLock held if it failed the first time, and only throw the exception when it failed the 2nd time.

        Lars Hofhansl added a comment -

        Something like this.
        Since the writer is rolled with updateLock held the 2nd invocation will get the correct writer.

        We still get the benefit of sync'ing without the lock held most of the time.

        (This is a variation of double-checked locking, but with a 32-bit primitive, so it should work reliably.)

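        The retry pattern described in the comment above can be sketched in a simplified model like the following. This is a hypothetical illustration, not the actual HLog.java code; the class and method names (WalModel, Writer, rollWriter, syncer) are invented for the example:

```java
// Simplified model of the sync-retry-under-lock pattern from the patch
// discussion. NOT the real HLog code; names here are illustrative only.
import java.io.IOException;

class WalModel {
    private final Object updateLock = new Object();
    private volatile Writer writer = new Writer();

    static class Writer {
        volatile boolean closed = false;
        void sync() throws IOException {
            if (closed) throw new IOException("DFSOutputStream is closed");
        }
        void close() { closed = true; }
    }

    // Rolls hold updateLock while closing the old writer and installing
    // a new one, so anyone re-reading `writer` under the same lock is
    // guaranteed to see the current (open) writer.
    void rollWriter() {
        synchronized (updateLock) {
            writer.close();
            writer = new Writer();
        }
    }

    void syncer() throws IOException {
        Writer w = writer;        // snapshot read, no lock (fast path)
        try {
            w.sync();
        } catch (IOException first) {
            // A concurrent roll may have closed the snapshot's stream.
            // Retry once under updateLock; only a second failure
            // propagates to the caller.
            synchronized (updateLock) {
                writer.sync();
            }
        }
    }
}
```

        Note how the common case still syncs without holding the lock; the lock is only taken on the (rare) failure path, which is what preserves the HDFS-895 parallelism mentioned earlier.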
        Lars Hofhansl made changes -
        Attachment 4387.txt [ 12495170 ]
        Lars Hofhansl added a comment -

        @Todd Do you think you could retry your test with this change?
        (Maybe 100m rows would do too.)

        Lars Hofhansl made changes -
        Assignee Lars Hofhansl [ lhofhansl ]
        Lars Hofhansl added a comment -

        Assigned to me so this has an owner.

        Todd Lipcon added a comment -

        I'm traveling for the next week or so, so probably won't have a chance to do so. I'll be continuing to work on 0.92 stabilization over the next couple of months, though - so I'll certainly be running this test again in the relatively near future.

        Lars Hofhansl added a comment -

        Until then... anybody opposed to this change?

        J-D diagnosed the problem correctly, I think. And my change should fix it. As this is a race, there is no definitive proof, of course.

        The only question is whether the first attempt should be logged if it failed, maybe even at info level so that we can see these cases even without debug logging.

        stack added a comment -

        I think we should try the patch. Seems like the incidence is low. If so, should we continue to log its happening? Or, I suppose, there is no need since the second call will throw an IOE if it fails?

        Let me commit for now ahead of testing. We can reopen if it's still a problem or has a new manifestation.

        stack added a comment -

        Applied to two branches and trunk. Thanks for the patch Lars.

        stack made changes -
        Status Open [ 1 ] Resolved [ 5 ]
        Hadoop Flags [Reviewed]
        Fix Version/s 0.90.5 [ 12317145 ]
        Fix Version/s 0.92.0 [ 12314223 ]
        Resolution Fixed [ 1 ]
        Hudson added a comment -

        Integrated in HBase-TRUNK #2246 (See https://builds.apache.org/job/HBase-TRUNK/2246/)
        HBASE-4387 Error while syncing: DFSOutputStream is closed
        HBASE-4387 Error while syncing: DFSOutputStream is closed

        stack :
        Files :

        • /hbase/trunk/CHANGES.txt

        stack :
        Files :

        • /hbase/trunk/CHANGES.txt
        • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/wal/HLog.java
        stack added a comment -

        Does this cause TestLogRolling to fail? https://builds.apache.org/job/HBase-TRUNK/2246/testReport/org.apache.hadoop.hbase.regionserver.wal/TestLogRolling/testLogRollOnPipelineRestart/

        I need to look at it....
        Hudson added a comment -

        Integrated in HBase-0.92 #17 (See https://builds.apache.org/job/HBase-0.92/17/)
        HBASE-4387 Error while syncing: DFSOutputStream is closed

        stack :
        Files :

        • /hbase/branches/0.92/CHANGES.txt
        • /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/regionserver/wal/HLog.java
        Lars Hofhansl added a comment -

        Don't see how it could. I'll be on a computer soon to check it out.

        Gary Helmling added a comment -

        The TestLogRolling failure is on the assertion:

            assertTrue(preLogRolledCalled.size() == 1);
        

        So I would suspect it's more due to HBASE-4132. Unless there's some interaction between this change and that.

        stack made changes -
        Fix Version/s 0.90.6 [ 12319200 ]
        Fix Version/s 0.90.5 [ 12317145 ]
        stack made changes -
        Fix Version/s 0.90.5 [ 12317145 ]
        Fix Version/s 0.90.6 [ 12319200 ]

          People

          • Assignee: Lars Hofhansl
          • Reporter: Todd Lipcon
          • Votes: 0
          • Watchers: 6
