HBase
HBASE-2234

Roll Hlog if any datanode in the write pipeline dies

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.90.0
    • Component/s: regionserver
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      HDFS does not replicate the last block of a file that is being written to. This means that if datanodes in the write pipeline die, the data blocks in the transaction log experience reduced redundancy. It would be good if the region server could detect datanode death in the write pipeline while writing to the transaction log and, if this happens, close the current log and open a new one. This depends on HDFS-826.
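The fix described here boils down to comparing the number of replicas the DFS client currently reports for the block being written against the replication the log expects, and rolling the log when the count drops. A minimal self-contained sketch of that decision (class and method names are illustrative, not the patch's actual code):

```java
// Hypothetical sketch of the roll-on-pipeline-error decision described in
// this issue; names (PipelineCheck, shouldRollLog) are illustrative only.
public class PipelineCheck {
    /**
     * A datanode died in the write pipeline when the number of replicas
     * currently receiving the block drops below what the log expects.
     */
    public static boolean shouldRollLog(int numCurrentReplicas, int expectedReplicas) {
        return numCurrentReplicas < expectedReplicas;
    }

    public static void main(String[] args) {
        // One of three pipeline datanodes died: request a log roll.
        System.out.println(shouldRollLog(2, 3)); // true
        // Healthy pipeline: keep writing.
        System.out.println(shouldRollLog(3, 3)); // false
    }
}
```

In the actual patch the current-replica count would come from the DFS client (via the HDFS-826 API) rather than being passed in directly.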

      1. HBASE-2234-20.4-1.patch
        16 kB
        Nicolas Spiegelberg
      2. HBASE-2234-20.4.patch
        15 kB
        Nicolas Spiegelberg


          Activity

          stack added a comment -

          I forward-ported this patch to TRUNK (some had been done as part of hbase-2544; the tests were ported forward as part of hbase-2551 part 1).

          stack added a comment -

          Marking these as fixed against 0.21.0 rather than against 0.20.5.

          Jean-Daniel Cryans added a comment -

          I opened HBASE-2382 and assigned it to Nicolas so that he can take a look.

          dhruba borthakur added a comment -

          I agree with JD. Can we use <hlogpath>.getFileStatus().getReplication() instead of fs.getDefaultReplication()? This will ensure that we look at the repl factor of the precise file we are interested in, rather than the system-wide default value.
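This suggestion amounts to preferring the hlog file's own replication factor over the client's system-wide default when deciding what "expected" means. A hedged, standalone sketch of that fallback (in real code the per-file value would come from fs.getFileStatus(hlogPath).getReplication() and the fallback from fs.getDefaultReplication(); this version just models the preference):

```java
// Illustrative only: models preferring a per-file replication factor over
// the cluster-wide default, as suggested in this thread. The Hadoop calls
// named in the comment thread are not used here so the sketch stays
// self-contained.
public class ExpectedReplicas {
    public static short expectedReplicas(Short fileReplication, short defaultReplication) {
        // Prefer the precise file's factor; fall back to the client-side
        // default only when the per-file value is unavailable.
        return (fileReplication != null) ? fileReplication : defaultReplication;
    }

    public static void main(String[] args) {
        // Single-node cluster writing the hlog with replication 1, while the
        // client-side default (no hdfs-site.xml on the CP) is still 3:
        System.out.println(expectedReplicas((short) 1, (short) 3)); // 1, so no spurious roll
    }
}
```

Using the per-file value would avoid the spurious log rolls described in the next comment, where the client's default disagrees with the cluster's configuration.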

          Jean-Daniel Cryans added a comment -

          I saw a minor issue with this patch: if HBase doesn't have hdfs-site.xml in its CP or isn't configured with the same dfs.replication, this call will not return the right value:

          +                numCurrentReplicas < fs.getDefaultReplication()) {  
          

          For example, I was testing HBASE-2337 on a single node and didn't put hadoop's conf dir in HBase's CP, but I did configure rep=1; I ended up rolling the logs for every edit:

          WARN org.apache.hadoop.hbase.regionserver.HLog: HDFS pipeline error detected. Found 1 replicas but expecting 3 replicas.  Requesting close of hlog.
          WARN org.apache.hadoop.hbase.regionserver.HLog: HDFS pipeline error detected. Found 1 replicas but expecting 3 replicas.  Requesting close of hlog.
          WARN org.apache.hadoop.hbase.regionserver.HLog: HDFS pipeline error detected. Found 1 replicas but expecting 3 replicas.  Requesting close of hlog.
          ...
          

          I can see hordes of new users coming with the same issue. Is there a better way to ask HDFS about the default replication setting and store it?

          dhruba borthakur added a comment -

          > I presume syncFs forces dfsclient to flush all local data out to the datanode buffers?

          That's right. It goes to the OS buffers on the datanodes.

          stack added a comment -

          Resolving. Thanks for the patch Nicolas.

          stack added a comment -

          I committed this patch to branch. All tests pass but TestHeapSize; I'll fix that in a sec (HBASE-2307). I opened hbase-2306 to forward-port the patch. It's a little awkward at the moment since TRUNK has hadoop 0.21 in it, but that might change soon, making the forward-port straightforward.

          On the buffering note above, dumb question: I presume syncFs forces dfsclient to flush all local data out to the datanode buffers?

          dhruba borthakur added a comment -

          The DFSClient has a sliding window protocol to send data to datanodes. The default size of the window is 80 packets, each of size 64K.
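At those defaults the client can have roughly 5 MB of unacknowledged data in flight before the window fills. A quick check of that arithmetic:

```java
// Back-of-the-envelope check of the DFSClient sliding-window capacity
// quoted above: 80 packets of 64 KB each.
public class WindowSize {
    public static long windowBytes(int packets, int packetBytes) {
        return (long) packets * packetBytes;
    }

    public static void main(String[] args) {
        long bytes = windowBytes(80, 64 * 1024);
        System.out.println(bytes);                 // 5242880
        System.out.println(bytes / (1024 * 1024)); // 5 (MB)
    }
}
```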

          Nicolas Spiegelberg added a comment -

          Go ahead and apply the existing patch with the comment changes.

          I spent a little bit of time yesterday trying to understand all the layers of buffering between the SequenceFile.Writer & actually having the pipeline opened and content sent to the datanodes. I figured I'd pass that information along since 0.20.2 currently does not support syncFs(). Without syncFs, the pipeline seems to be created every 64k, which is 'dfs.write.packet.size'. The stack trace, with associated buffering, that I was following:

          1. SequenceFile.Writer.append()
           2. FSOutputSummer.write() --> buffers to maxChunkSize. An HDFS chunk is the amount of data in between checksums (default: 512 bytes).
           3. FSOutputSummer.flushBuffer()
           4. FSOutputSummer.writeChecksumChunk()
           5. DFSOutputStream.writeChunk() --> buffers to currentPacket.maxChunk. This is the maximum HDFS chunk count that can be placed in a Packet. Approx byte count is min("dfs.block.size" (default: 64MB), "hbase.regionserver.hlog.blocksize" (default: "dfs.block.size"), "dfs.write.packet.size" (default: 64k)).
           6. DataStreamer.run() <-- creates the pipeline
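With stock defaults, the min() in the stack trace above works out to the 64 KB packet size, which matches the observation that the pipeline gets created every 64k without syncFs. A small check of that calculation (the constants below just encode the config keys and defaults quoted in the comment):

```java
// Evaluates the approximate per-packet byte count described above:
// min(dfs.block.size, hbase.regionserver.hlog.blocksize, dfs.write.packet.size),
// using the default values quoted in this thread.
public class PacketSize {
    static final long DFS_BLOCK_SIZE = 64L * 1024 * 1024;  // dfs.block.size default: 64 MB
    static final long HLOG_BLOCK_SIZE = DFS_BLOCK_SIZE;    // hbase.regionserver.hlog.blocksize defaults to dfs.block.size
    static final long DFS_WRITE_PACKET_SIZE = 64L * 1024;  // dfs.write.packet.size default: 64 KB

    public static long approxPacketBytes() {
        return Math.min(DFS_BLOCK_SIZE, Math.min(HLOG_BLOCK_SIZE, DFS_WRITE_PACKET_SIZE));
    }

    public static void main(String[] args) {
        System.out.println(approxPacketBytes()); // 65536, i.e. 64 KB
    }
}
```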

          stack added a comment -

          @Nicolas And I think this patch is fine as is, unless you want to add more comments (I can make my small suggested changes above on commit).

          stack added a comment -

          OK, I agree with you. Let me go build a hadoop 0.20.2 that has hdfs-826 and hdfs-200 applied. Any others? And then commit to the 0.20 branch.

          Nicolas Spiegelberg added a comment -

          Will correct the comment/messaging issues. The actual patch should work fine without append support; however, I would lean towards applying HDFS-200 just to exercise this test. In the unit test, we really just use append support for syncFs(), since it's a deterministic way to initialize the pipeline for reading [I'll clarify in code comments]. I've experimented with modifying "io.file.buffer.size", but it would take a lot more introspection and other nastiness to test it that way. My biggest worry about skipping the test is that it's easy to accidentally replace the default with an un-patched JAR and not know about it, since HBase has hundreds of tests. We've been regularly applying new HDFS client-side patches over here for append support, and I messed up once or twice when I was task-switching between this issue and other testing.

          stack added a comment -

          Assigning Nicolas since he's doing the work (Made N a contributor).

          @Nicolas, a few comments and a solicitation of opinion.

          + We need to update the hadoop we bundle. We'll want to ship with hadoop 0.20.2; it has, at a minimum, the hdfs-127 fix. We should probably apply hdfs-826 to the hadoop we ship too, since it's a client-side-only change. If we included hdfs-200, that'd make it so this test you've included actually gets exercised, so we should apply it too?

          + In fact, it looks like this test fails if 826 and 200 are not in place, is that right? You probably don't want that. Maybe skip the test if they are not in place, but don't fail, I'd say.

          + Your test is great.

          + FYI, we try not to reference log4j (i.e. the logger implementation) explicitly, but I think in this case you have no choice, going by the commons dictum that the logger config is outside of its scope (I was reading under "Configuring the Underlying Logging System" in http://commons.apache.org/logging/apidocs/org/apache/commons/logging/package-summary.html).

          + I like the comments you've added to HLog.java.

          + The log message says hadoop-4379 if hdfs-200 is found... maybe add or change that to mention hdfs-200.

          Patch looks good otherwise.

          Nicolas Spiegelberg added a comment -

          ...and then I clicked the wrong button on the Attachment License

          Nicolas Spiegelberg added a comment -

          Updates to address comments from this jira & internal review. Notable changes:
          1. added checks to ensure clients with HDFS-826 or append support would not be negatively affected
          2. simplified rebindWriterFunc(). Only happens inside rollWriter()
          3. asserts in TestLogRolling to discover the presence of HDFS-826 & append support. You can easily change the asserts to LOG.info() depending upon your default HDFS jar distro.

          Nicolas Spiegelberg added a comment -

          @ryan I could go a couple of ways with rebindWriterFunc(). I either need the "this" pointer to set "this.hdfs_out", in which case I don't need to explicitly pass in any params since it's always "this.writer". Or I could return an OutputStream and make this function static. I'm new to Java from an embedded C++ background, but to me it looks like this is six of one, half a dozen of the other. I have a couple of internal comments about this diff I may need to apply as well, so I'll roll that change into my next diff.

          ryan rawson added a comment -

          would it be necessary or possible to move this function to a static context:
          + private void rebindWriterFunc(SequenceFile.Writer writer) throws IOException {

          I guess since there is only one hlog instance per RS, maybe it's ok?

          stack added a comment -

          @ Nicolas – Sweet. Patch looks good.

          Nicolas Spiegelberg added a comment -

          The patch that I supplied should merge properly if the HBASE-1939-0.24 patch is applied first.

          Nicolas Spiegelberg added a comment -

          Here is a patch with associated unit test. I was trying to figure out whether to test at the HLog or the HRegionServer level. I wrote the same tests at both levels, but I submitted the HRegionServer one. Let me know if you need the other. Note that this also includes our group-commit code and syncFs() modifications, so it won't work straight off the trunk.

          stack added a comment -

          Adding into 0.20.4 and 0.21. Also assigned it to myself. Take it back if you want to do it Dhruba but I figure you have enough going on hdfs-side.


            People

            • Assignee:
              Nicolas Spiegelberg
              Reporter:
              dhruba borthakur
            • Votes:
              0
              Watchers:
              1
