Hadoop HDFS
  HDFS-822

Appends to already-finalized blocks can rename across volumes

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.21.0, 0.22.0
    • Fix Version/s: 0.21.0
    • Component/s: datanode
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      This is a performance thing. As I understand the code in FSDataset.append, if the block is already finalized, it needs to move it into the RBW directory so it can go back into a "being written" state. This is done using volumes.getNextVolume without preference to the volume that the block currently exists on. It seems to me that this could cause a lot of slow cross-volume copies on applications that periodically append/close/append/close a file. Instead, getNextVolume could provide an alternate form that gives preference to a particular volume, so the rename stays on the same disk.
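
      For illustration, a minimal sketch of the alternate getNextVolume form suggested above. The VolumeChooser and FsVolume classes below are simplified stand-ins for the FSDataset internals, not the actual Hadoop code; the only point is the "prefer the replica's current volume" idea.

      import java.io.File;
      import java.io.IOException;
      import java.util.List;

      public class VolumeChooser {

        /** Minimal stand-in for a DataNode storage volume. */
        public static final class FsVolume {
          private final File root;
          public FsVolume(File root) { this.root = root; }
          public long getAvailable() { return root.getUsableSpace(); }
        }

        private final List<FsVolume> volumes;
        private int nextIndex = 0;

        public VolumeChooser(List<FsVolume> volumes) { this.volumes = volumes; }

        /** Current behavior: plain round-robin over volumes with enough space. */
        public synchronized FsVolume getNextVolume(long blockSize) throws IOException {
          for (int i = 0; i < volumes.size(); i++) {
            FsVolume v = volumes.get(nextIndex);
            nextIndex = (nextIndex + 1) % volumes.size();
            if (v.getAvailable() >= blockSize) {
              return v;
            }
          }
          throw new IOException("No volume has " + blockSize + " bytes available");
        }

        /**
         * Proposed alternate form: prefer the volume the finalized replica already
         * lives on, so the move into the rbw directory stays a same-disk rename
         * rather than a cross-volume copy.
         */
        public synchronized FsVolume getNextVolume(long blockSize, FsVolume preferred)
            throws IOException {
          if (preferred != null && preferred.getAvailable() >= blockSize) {
            return preferred;
          }
          return getNextVolume(blockSize);  // fall back to round-robin
        }
      }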

      Attachments

      1. HDFS-822.patch (3 kB) - Hairong Kuang
      2. HDFS-822.patch (0.7 kB) - Hairong Kuang
      3. HDFS-822.patch (0.8 kB) - Hairong Kuang

        Activity

        Hudson added a comment -

        Integrated in Hdfs-Patch-h5.grid.sp2.yahoo.net #205 (See http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/205/)

        Hudson added a comment -

        Integrated in Hdfs-Patch-h2.grid.sp2.yahoo.net #101 (See http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/101/)
        HDFS-822. Appends to already-finalized blocks can rename across volumes. Contributed by Hairong Kuang.

        Hudson added a comment -

        Integrated in Hadoop-Hdfs-trunk #206 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Hdfs-trunk/206/)
        HDFS-822. Appends to already-finalized blocks can rename across volumes. Contributed by Hairong Kuang.

        Hudson added a comment -

        Integrated in Hadoop-Hdfs-trunk-Commit #173 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Hdfs-trunk-Commit/173/)
        HDFS-822. Appends to already-finalized blocks can rename across volumes. Contributed by Hairong Kuang.

        Hairong Kuang added a comment -

        I've just committed this.

        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12430802/HDFS-822.patch
        against trunk revision 899747.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/196/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/196/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/196/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/196/console

        This message is automatically generated.

        Todd Lipcon added a comment -

        Looks good to me. +1

        Hairong Kuang added a comment -

        Todd's comment makes sense. This patch incorporates his comment and adds a unit test.

        Todd Lipcon added a comment -

        > So basically this patch delays the decision to check disk space as opposed to Todd's proposal.

        Only problem with this is that it now ignores the dfs.du.reserved setting, since that's only set at allocation-time. Maybe that's OK, since there are plenty of other ways to overrun the reserved space - just wanted to bring it up.

        Hairong Kuang added a comment -

        This new patch simply does what I proposed and it sets up a pipeline without checking the available space.

        If the disk does not have space when actual writes happen, the write will fail and the out-of-space datanode will be taken out of the pipeline. So basically this patch delays the decision to check disk space as opposed to Todd's proposal.
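
        As a generic illustration of the delayed check (not the DataNode pipeline code itself): rather than verifying free space before setting up the write, just attempt the write and treat a disk-full IOException as the failure signal.

        import java.io.IOException;
        import java.io.OutputStream;
        import java.nio.file.Files;
        import java.nio.file.Path;

        public final class DelayedSpaceCheck {

          /** Write without a preflight free-space check; a full disk surfaces as an IOException. */
          public static void write(Path target, byte[] data) throws IOException {
            try (OutputStream out = Files.newOutputStream(target)) {
              out.write(data);  // fails here (e.g. "No space left on device") if the volume is full
            }
            // In the HDFS pipeline, such a failure causes the out-of-space datanode
            // to be taken out of the pipeline rather than aborting the append.
          }
        }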

        dhruba borthakur added a comment -

        > But anyway my proposal is to disallow the across volume rename for now. Does this sound OK to everybody?

        sounds ok to me.

        Todd Lipcon added a comment -

        In that case, shouldn't the current patch throw an exception when the current volume's full, rather than calling getNextVolume?

        Hairong Kuang added a comment -

        > Current code handles this. It removes the partial file in the destination if rename fails.
        I checked the code again. Unfortunately the code does not handle this.

        But anyway my proposal is to disallow the across volume rename for now. Does this sound OK to everybody?

        Hairong Kuang added a comment -

        > if the datanode dies in the middle of a rename, we can have a partial file in the destination as well as the complete file in the source, isn't it?
        Current code handles this. It removes the partial file in the destination if rename fails.

        Sorry that I did not make my proposal clear. What I am thinking of doing in this jira is to disallow cross-volume renames. So appending to a block simply moves the block to the rbw directory in the same volume. This is definitely an improvement to 0.21 & trunk, because 0.21 & trunk may move the block to a different volume even if the current volume has space. This is a critical bug that should be fixed. As to whether we should rename across volumes when the current volume is full, let's do it in a different jira.
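
        A minimal sketch of the same-volume move described above, assuming a simplified finalized/rbw directory layout under the volume root; the real DataNode path and naming logic is more involved.

        import java.io.File;
        import java.io.IOException;

        public final class SameVolumeAppend {

          /**
           * Move a finalized block file and its meta file into the rbw directory of
           * the volume they already live on. Because source and destination are on
           * the same file system, each move is a metadata-only rename, not a data copy.
           */
          public static void moveToRbw(File volumeRoot, String blockFile, String metaFile)
              throws IOException {
            File finalizedDir = new File(volumeRoot, "current/finalized");
            File rbwDir = new File(volumeRoot, "current/rbw");
            if (!rbwDir.isDirectory() && !rbwDir.mkdirs()) {
              throw new IOException("Cannot create " + rbwDir);
            }
            rename(new File(finalizedDir, blockFile), new File(rbwDir, blockFile));
            rename(new File(finalizedDir, metaFile), new File(rbwDir, metaFile));
          }

          private static void rename(File src, File dst) throws IOException {
            if (!src.renameTo(dst)) {
              throw new IOException("Failed to rename " + src + " to " + dst);
            }
          }
        }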

        Todd Lipcon added a comment -

        > Moreover, rename across two mount points is not atomic, ie. if the datanode dies in the middle of a rename, we can have a partial file in the destination as well as the complete file in the source, isn't it?

        Yes, that's true. But wouldn't that just present as a corrupt replica when the block scanner catches up to it, or a client tries to read it?

        dhruba borthakur added a comment -

        If there is space in the same volume, then the rename occurs within the same volume. In this case, holding the FSDataset lock across the rename is ok.

        If there is no space, then the rename is actually a datacopy. In this case, we should not hold the FSDataset lock across the rename. Moreover, rename across two mount points is not atomic, ie. if the datanode dies in the middle of a rename, we can have a partial file in the destination as well as the complete file in the source, isn't it?
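
        A minimal sketch of the atomicity point, using plain java.nio.file calls rather than DataNode code: an atomic move only works within a single file store (volume); across volumes the "rename" degrades to copy plus delete, which is not atomic, and a failure mid-copy can leave a partial destination file that has to be cleaned up.

        import java.io.IOException;
        import java.nio.file.AtomicMoveNotSupportedException;
        import java.nio.file.Files;
        import java.nio.file.Path;
        import java.nio.file.StandardCopyOption;

        public final class CrossVolumeMove {

          public static void move(Path src, Path dst) throws IOException {
            try {
              // Same volume: a metadata-only rename that either fully happens or not at all.
              Files.move(src, dst, StandardCopyOption.ATOMIC_MOVE);
            } catch (AtomicMoveNotSupportedException e) {
              // Different volumes: fall back to copy + delete, which is not atomic.
              try {
                Files.copy(src, dst, StandardCopyOption.REPLACE_EXISTING);
              } catch (IOException copyFailure) {
                Files.deleteIfExists(dst);  // remove a possibly partial destination file
                throw copyFailure;
              }
              Files.delete(src);
            }
          }
        }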

        Todd Lipcon added a comment -

        Sounds reasonable, so long as we're confident that holding the FSDataset lock for 10-20 seconds isn't a bug (e.g. appending to a 250MB block will take 10 seconds at a 25MB/sec cross-volume copy rate, and could very well be slower if either disk is stressed).

        Hairong Kuang added a comment -

        It seems to me that moving a block to be appended to a different volume when the current volume has space is a bug, which I'd like to fix in this jira. However, moving a block to be appended to a different volume when the current volume does not have enough space for the estimated new bytes is an optimization, which we should probably work on in a different jira.

        Todd Lipcon added a comment -

        Patch looks like an improvement over the current code, but is there still a concern about holding the FSDataset lock in the case that this volume is full?

        Given that we don't have an intra-datanode balancer, it seems reasonably common that some disks are entirely full while others have free space (e.g. on an old cluster where 500G disks are replaced piecemeal with TB-sized ones).

        Hairong Kuang added a comment -

        This patch makes sure that the replica to be appended is moved to the same volume if the volume has enough space available.

        Hairong Kuang added a comment -

        I am looking at this issue. If the datanode releases the lock while copying replicas, I am not sure how we can ensure data integrity. In append, there are two potential file copies: one is the copy-on-write (the unlink operation) and the other is the cross-volume copy Todd pointed out.

        Todd Lipcon added a comment -

        I think in that case it should do a cross-volume block rename. However, it should avoid holding the FSDataset lock while doing so.

        dhruba borthakur added a comment -

        This looks like a bug that might be good to fix in 0.21 itself.

        On a related note, suppose the last partial block of a file has three replicas on three different volumes on three different datanodes. And let's suppose that these volumes (disks) have no more free space on them whereas there is plenty of free space on other datanodes and disks. Now, if an application re-opens the file to append to it, the append will fail because there isn't any free space on those devices (even though the cluster has tons of free space). Is it possible to avoid this inconsistent behaviour?

        Hairong Kuang added a comment -

        This is a valid point. Let's get it fixed in 0.21.

        Todd Lipcon added a comment -

        I haven't tried to actually produce this bug yet, but it may be reasonably important, since this is also a synchronized method, and a full cross-volume block copy could take several seconds at least, during which the DN can't do a whole lot.


          People

          • Assignee: Hairong Kuang
          • Reporter: Todd Lipcon
          • Votes: 0
          • Watchers: 8
