Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.0.0-alpha2
    • Fix Version/s: 3.0.0-alpha2
    • Component/s: balancer & mover
    • Labels:
      None

      Description

      In HDFS-9671, Arpit Agarwal commented that we should explore the possibility of removing references to FsVolumeSpi at any point and dealing only with storage IDs. We are not sure whether this is possible; this JIRA is to explore whether it can be done without issues.

      1. HDFS-9850.001.patch
        27 kB
        Manoj Govindassamy
      2. HDFS-9850.002.patch
        27 kB
        Manoj Govindassamy
      3. HDFS-9850.003.patch
        22 kB
        Manoj Govindassamy
      4. HDFS-9850.004.patch
        22 kB
        Manoj Govindassamy

        Activity

        manojg Manoj Govindassamy added a comment -

        Taking up this task after discussing with Lei (Eddy) Xu and Anu Engineer.

        manojg Manoj Govindassamy added a comment -

        Attached v01 patch which addresses the following:

        • Replaced FsVolumeSpi object references in VolumePair and getStorageIDToVolumeMap with volume storage UUIDs.
          • This avoids holding unnecessary object references to FsVolume.
          • Whenever needed, the FsVolumeSpi object is retrieved by looking up the stored UUID.
          • This way, any volumes removed after plan creation or during the disk balancing activity can be easily detected and appropriate error messages returned.
        • Added a unit test TestDiskBalancer#testDiskBalancerWhenRemovingVolumes to verify the new functionality.
          • The test executes a computed disk balancing plan between 2 storage volumes.
          • While DiskBalancer#copyBlocks is running, the destination storage volume is removed.
          • DiskBalancer detects this and returns gracefully with a proper error message set in the WorkItem.
          • Without the HDFS-9850 fix, the test fails with ClosedChannelException when obtaining the reference for the deleted volume.
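        The UUID-indexed lookup described above can be sketched with simplified stand-ins. Note that FsVolumeSpi, VolumePair, and the live-volume map below are illustrative simplifications, not the actual HDFS classes or the patch code:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of keeping storage UUIDs in the plan instead of volume references,
// and resolving them against the live volume set on demand.
public class UuidLookupSketch {
  interface FsVolumeSpi {
    String getStorageID();
  }

  // The plan stores only UUIDs, never FsVolumeSpi objects.
  static class VolumePair {
    final String sourceUuid;
    final String destUuid;
    VolumePair(String sourceUuid, String destUuid) {
      this.sourceUuid = sourceUuid;
      this.destUuid = destUuid;
    }
  }

  // Returns null when the volume was removed after plan creation, letting
  // the caller surface a proper error instead of using a stale reference.
  static FsVolumeSpi lookup(Map<String, FsVolumeSpi> liveVolumes, String uuid) {
    return liveVolumes.get(uuid);
  }

  public static void main(String[] args) {
    Map<String, FsVolumeSpi> live = new HashMap<>();
    live.put("DS-1", () -> "DS-1");
    live.put("DS-2", () -> "DS-2");
    VolumePair pair = new VolumePair("DS-1", "DS-2");

    // Destination volume removed between plan creation and execution:
    live.remove("DS-2");
    System.out.println(lookup(live, pair.sourceUuid) != null); // true
    System.out.println(lookup(live, pair.destUuid) == null);   // true
  }
}
```

        Because the map lookup is repeated per use, a removed volume shows up as a null result at the next access rather than as a stale object that throws later.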

        Lei (Eddy) Xu, Anu Engineer, please take a look at the patch and let me know your comments.

        hadoopqa Hadoop QA added a comment -
        +1 overall



        Vote Subsystem Runtime Comment
        0 reexec 0m 23s Docker mode activated.
        +1 @author 0m 0s The patch does not contain any @author tags.
        +1 test4tests 0m 0s The patch appears to include 3 new or modified test files.
        +1 mvninstall 9m 16s trunk passed
        +1 compile 1m 1s trunk passed
        +1 checkstyle 0m 32s trunk passed
        +1 mvnsite 1m 0s trunk passed
        +1 mvneclipse 0m 14s trunk passed
        +1 findbugs 1m 54s trunk passed
        +1 javadoc 0m 57s trunk passed
        +1 mvninstall 0m 48s the patch passed
        +1 compile 0m 45s the patch passed
        +1 javac 0m 45s the patch passed
        +1 checkstyle 0m 25s the patch passed
        +1 mvnsite 0m 53s the patch passed
        +1 mvneclipse 0m 10s the patch passed
        +1 whitespace 0m 0s The patch has no whitespace issues.
        +1 findbugs 2m 14s the patch passed
        +1 javadoc 0m 54s the patch passed
        +1 unit 60m 37s hadoop-hdfs in the patch passed.
        +1 asflicense 0m 20s The patch does not generate ASF License warnings.
        83m 45s



        Subsystem Report/Notes
        Docker Image:yetus/hadoop:9560f25
        JIRA Issue HDFS-9850
        JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12829977/HDFS-9850.001.patch
        Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
        uname Linux 987d9a04477d 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
        Build tool maven
        Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
        git revision trunk / d85d9b2
        Default Java 1.8.0_101
        findbugs v3.0.0
        Test Results https://builds.apache.org/job/PreCommit-HDFS-Build/16838/testReport/
        modules C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs
        Console output https://builds.apache.org/job/PreCommit-HDFS-Build/16838/console
        Powered by Apache Yetus 0.4.0-SNAPSHOT http://yetus.apache.org

        This message was automatically generated.

        eddyxu Lei (Eddy) Xu added a comment -

        Hi, Manoj Govindassamy. Thanks for working on this issue.

        public FsVolumeReference getFsVolumeReference(String volUuid) {
          for (FsVolumeImpl fsVolume : volumes.getVolumes()) {
            if (fsVolume.getStorageID().equals(volUuid)) {
              try {
                return fsVolume.obtainReference();
              } catch (ClosedChannelException e) {
                LOG.warn("Unable to get a reference for the Volume: " + volUuid, e);
              }
            }
          }
          return null;
        }

        I think it'd be better to throw ClosedChannelException instead of swallowing it. This function has two expected errors: 1) the volume with the given volUuid does not exist, in which case it should return null; or 2) the volume exists but has already been closed, in which case it is better to throw the exception. Similarly, FsDatasetSpi#getFsVolumeReference should throw CCE as well.
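        The distinction drawn here (null for a missing volume, propagated ClosedChannelException for a closed one) could look roughly like this. FsVolumeImpl and FsVolumeReference below are minimal stand-ins, not the actual FsDatasetImpl code:

```java
import java.nio.channels.ClosedChannelException;
import java.util.List;

public class GetReferenceSketch {
  interface FsVolumeReference {}

  // Toy volume: has a storage ID and may already be closed.
  static class FsVolumeImpl {
    final String storageId;
    final boolean closed;
    FsVolumeImpl(String storageId, boolean closed) {
      this.storageId = storageId;
      this.closed = closed;
    }
    String getStorageID() { return storageId; }
    FsVolumeReference obtainReference() throws ClosedChannelException {
      if (closed) {
        throw new ClosedChannelException(); // volume exists but was shut down
      }
      return new FsVolumeReference() {};
    }
  }

  // Case 1: no volume with this UUID -> return null.
  // Case 2: volume found but closed -> let CCE propagate to the caller.
  static FsVolumeReference getFsVolumeReference(
      List<FsVolumeImpl> volumes, String volUuid)
      throws ClosedChannelException {
    for (FsVolumeImpl v : volumes) {
      if (v.getStorageID().equals(volUuid)) {
        return v.obtainReference();
      }
    }
    return null;
  }
}
```

        The caller can then decide for itself whether a closed volume should be treated the same as a missing one, which the swallowing version decides unilaterally.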

        • In testDiskBalancerWhenRemovingVolumes(), do you need to start the MiniDFSCluster by calling waitActive?
        • In DiskBalancer#copyBlocks, I think you need to hold the volume reference while you do the block movement.

        Thanks.

        manojg Manoj Govindassamy added a comment - edited

        Thanks for the review Lei (Eddy) Xu.

        • That's right, there are two cases in getFsVolumeReference(). But for the caller hitting case 2 – CCE – should that be treated as volume existence or non-existence? I assumed non-existence for the real use, so I logged it and eventually returned null. But I am also fine with throwing CCE and letting the caller handle it however it wants.
        • The test uses ClusterBuilder#build, which already calls cluster.waitActive().
        • DiskBalancer#copyBlocks already holds the volume reference under the FsDatasetImpl#moveBlockAcrossVolumes call, and the reference is released once a block move is completed. Holding the reference only for the current block move and re-grabbing it for the next one plays well with other operations competing for the reference; otherwise we might starve them.
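        The per-move reference handling described here can be sketched with a toy reference-counted volume. Volume and copyBlocks below are hypothetical stand-ins, not the actual DiskBalancer or FsDatasetImpl code:

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class PerBlockReferenceSketch {
  // Toy reference-counted volume; illustrative, not FsVolumeImpl.
  static class Volume {
    final AtomicInteger refCount = new AtomicInteger();

    AutoCloseable obtainReference() {
      refCount.incrementAndGet();
      return refCount::decrementAndGet; // closing the reference releases it
    }
  }

  // Grab the reference per block move and release it right after, instead of
  // pinning the volume for the whole plan; competing operations (such as
  // volume removal) get a chance to run between moves.
  static void copyBlocks(Volume dest, List<String> blocks) throws Exception {
    for (String block : blocks) {
      try (AutoCloseable ref = dest.obtainReference()) {
        // move one block while the reference is held
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Volume dest = new Volume();
    copyBlocks(dest, List.of("blk_1", "blk_2", "blk_3"));
    System.out.println(dest.refCount.get()); // prints 0: released after each move
  }
}
```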
        manojg Manoj Govindassamy added a comment -

        Attached v02 patch to address the previous review comment:
        – FsDatasetSpi#getFsVolumeReference now throws ClosedChannelException, and the callers handle the exception.

        Lei (Eddy) Xu, kindly take a look at v02 patch.

        hadoopqa Hadoop QA added a comment -
        -1 overall



        Vote Subsystem Runtime Comment
        0 reexec 0m 15s Docker mode activated.
        +1 @author 0m 0s The patch does not contain any @author tags.
        +1 test4tests 0m 0s The patch appears to include 3 new or modified test files.
        +1 mvninstall 7m 9s trunk passed
        +1 compile 0m 46s trunk passed
        +1 checkstyle 0m 27s trunk passed
        +1 mvnsite 0m 57s trunk passed
        +1 mvneclipse 0m 12s trunk passed
        +1 findbugs 1m 58s trunk passed
        +1 javadoc 0m 57s trunk passed
        +1 mvninstall 0m 50s the patch passed
        +1 compile 0m 44s the patch passed
        +1 javac 0m 44s the patch passed
        -0 checkstyle 0m 25s hadoop-hdfs-project/hadoop-hdfs: The patch generated 1 new + 171 unchanged - 0 fixed = 172 total (was 171)
        +1 mvnsite 0m 51s the patch passed
        +1 mvneclipse 0m 10s the patch passed
        +1 whitespace 0m 0s The patch has no whitespace issues.
        +1 findbugs 1m 54s the patch passed
        +1 javadoc 0m 55s the patch passed
        -1 unit 61m 47s hadoop-hdfs in the patch failed.
        +1 asflicense 0m 26s The patch does not generate ASF License warnings.
        82m 0s



        Reason Tests
        Failed junit tests hadoop.hdfs.server.namenode.ha.TestFailureToReadEdits



        Subsystem Report/Notes
        Docker Image:yetus/hadoop:9560f25
        JIRA Issue HDFS-9850
        JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12830135/HDFS-9850.002.patch
        Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
        uname Linux e50305efef11 3.13.0-92-generic #139-Ubuntu SMP Tue Jun 28 20:42:26 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
        Build tool maven
        Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
        git revision trunk / 6eb700e
        Default Java 1.8.0_101
        findbugs v3.0.0
        checkstyle https://builds.apache.org/job/PreCommit-HDFS-Build/16850/artifact/patchprocess/diff-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt
        unit https://builds.apache.org/job/PreCommit-HDFS-Build/16850/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
        Test Results https://builds.apache.org/job/PreCommit-HDFS-Build/16850/testReport/
        modules C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs
        Console output https://builds.apache.org/job/PreCommit-HDFS-Build/16850/console
        Powered by Apache Yetus 0.4.0-SNAPSHOT http://yetus.apache.org

        This message was automatically generated.

        anu Anu Engineer added a comment -

        Manoj Govindassamy Thanks for updating the patch as well as providing a fix for this issue. A couple of minor comments.

        In DiskBalancer.java

        Changes in queryWorkStatus can be pulled into a function and invoked twice, for source and destination.

        createWorkPlan – We left the "Disk Balancer" prefix in on the datanode side so it is easy to grep for error messages in datanode logs. If you don't mind, can you please put it back.

        createWorkPlan – In these error messages it is easier for people to puzzle out the volume from the path than from the UUID. I think we should leave the path in the error message. Please scan the file for removals of "Disk Balancer" from logging, and also add that string to newly added LOG messages.

        nit: Line 655: the function comment says we are returning the volume, but in reality we are returning the volume UUID.

        copyBlocks#1013, 1021: this.setExitFlag is spurious, since we return from the function on the next line; that flag never gets evaluated. Also, could you set the volume path in the error string instead of the volume UUID?

        FsDatasetImpl#getFsVolumeReference – Do we really need to add a function to the FsDataset interface? Can't we do this via getVolumeReferences and a helper function in DiskBalancer.java itself? I don't think we should add this to FsDatasetSpi.

        TestDiskBalancer.java#577 – You might want to assert that the test failed, along with logging an info message for the reconfig error.

        TestDiskBalancer.java#597 – We verify that we are getting DiskBalancerException, but not the payload inside the exception. Would it make sense to verify that the error is indeed some kind of volume error, and to check the string to see whether it refers to the source or the destination?

        manojg Manoj Govindassamy added a comment -

        Thanks for the review and comments Anu Engineer. Will upload a patch incorporating the comments soon.
        1. Done. Made use of the private method getFsVolume(), just like other places.
        2. Done. Restored "Disk Balancer -" in the removed logs and added it to the new logs. Even without this explicit signature, the log lines already carry the file name they come from – here DiskBalancer.java – which can also be used for grepping.
        3. Done. Restored the volume path in the error message. The volume path, when available, is indeed better for error logging as it is much easier to read.
        4. Done. Corrected the method comments.
        5. Done. Removed the unnecessary setExitFlag() calls.
        6. Done. Removed the newly introduced interface method; instead implemented it as a new private method in DiskBalancer using the already available getVolumeReferences().
        7. Done. Added the assert for the reconfigure error.
        8. Done. The test now checks the error message prefix as well.

        manojg Manoj Govindassamy added a comment -

        Attaching v03 patch to address checkstyle and review comments from Lei (Eddy) Xu and Anu Engineer. Kindly take a look.

        anu Anu Engineer added a comment - edited

        Manoj Govindassamy Thanks for the quick turnaround. I was reading the code and something struck me. Could you please help me understand what would happen in the following scenario?

        Let us say a plan was executed, then either the source or destination disk was removed/reconfigured, and then I call queryStatus.
        Would it work or would it fail?

        hadoopqa Hadoop QA added a comment -
        -1 overall



        Vote Subsystem Runtime Comment
        0 reexec 0m 15s Docker mode activated.
        +1 @author 0m 0s The patch does not contain any @author tags.
        +1 test4tests 0m 0s The patch appears to include 1 new or modified test files.
        +1 mvninstall 7m 3s trunk passed
        +1 compile 0m 48s trunk passed
        +1 checkstyle 0m 26s trunk passed
        +1 mvnsite 1m 0s trunk passed
        +1 mvneclipse 0m 18s trunk passed
        +1 findbugs 1m 59s trunk passed
        +1 javadoc 0m 54s trunk passed
        +1 mvninstall 0m 48s the patch passed
        +1 compile 0m 42s the patch passed
        +1 javac 0m 42s the patch passed
        +1 checkstyle 0m 23s the patch passed
        +1 mvnsite 0m 48s the patch passed
        +1 mvneclipse 0m 9s the patch passed
        +1 whitespace 0m 0s The patch has no whitespace issues.
        +1 findbugs 1m 50s the patch passed
        +1 javadoc 0m 53s the patch passed
        -1 unit 62m 44s hadoop-hdfs in the patch failed.
        +1 asflicense 0m 20s The patch does not generate ASF License warnings.
        82m 32s



        Reason Tests
        Failed junit tests hadoop.hdfs.server.mover.TestStorageMover
          hadoop.hdfs.TestDecommissionWithStriped



        Subsystem Report/Notes
        Docker Image:yetus/hadoop:9560f25
        JIRA Issue HDFS-9850
        JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12830432/HDFS-9850.003.patch
        Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
        uname Linux 6ecaa6ee4eb6 3.13.0-92-generic #139-Ubuntu SMP Tue Jun 28 20:42:26 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
        Build tool maven
        Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
        git revision trunk / edf0d0f
        Default Java 1.8.0_101
        findbugs v3.0.0
        unit https://builds.apache.org/job/PreCommit-HDFS-Build/16874/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
        Test Results https://builds.apache.org/job/PreCommit-HDFS-Build/16874/testReport/
        modules C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs
        Console output https://builds.apache.org/job/PreCommit-HDFS-Build/16874/console
        Powered by Apache Yetus 0.4.0-SNAPSHOT http://yetus.apache.org

        This message was automatically generated.

        manojg Manoj Govindassamy added a comment -

        With the proposed fix in this jira issue HDFS-9850, both DiskBalancerMover#copyBlocks and DiskBalancer#queryWorkStatus handle the case where any of the storage volumes involved in the plan is no longer available.

        • copyBlocks logs the error about the missing volume and returns without crashing or throwing an exception.
          • That particular DiskBalancerWorkItem thus gets skipped, and execution moves on to the next one.
          • But if the step happens to be the last one, the currentResult gets stuck at PLAN_UNDER_PROGRESS even though it returned with an error.
        • queryWorkStatus also logs the error about the missing volume and throws DiskBalancerException with INTERNAL_ERROR.
          • QueryCommand, which invokes queryWorkStatus, gets the exception and logs the "query plan failed" error.
          • Maybe we should also print something for the user so that they see a meaningful error message.
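        The queryWorkStatus behavior described above can be sketched as follows. DiskBalancerException, the live-volume map, and the method shape here are simplified stand-ins, not the actual DiskBalancer code:

```java
import java.util.Map;

public class QueryStatusSketch {
  // Simplified stand-in for DiskBalancerException (INTERNAL_ERROR case).
  static class DiskBalancerException extends Exception {
    DiskBalancerException(String msg) { super(msg); }
  }

  // If either volume UUID in the plan no longer resolves to a live volume,
  // surface an error instead of dereferencing a stale volume object.
  static String queryWorkStatus(Map<String, String> liveVolumes,
                                String sourceUuid, String destUuid)
      throws DiskBalancerException {
    if (!liveVolumes.containsKey(sourceUuid)) {
      throw new DiskBalancerException(
          "Disk Balancer - source volume missing: " + sourceUuid);
    }
    if (!liveVolumes.containsKey(destUuid)) {
      throw new DiskBalancerException(
          "Disk Balancer - destination volume missing: " + destUuid);
    }
    return "PLAN_UNDER_PROGRESS";
  }
}
```

        A caller like QueryCommand would catch the exception and report it; improving what the user ultimately sees is the follow-up work tracked separately.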

        As part of fixing this jira, I have also added a unit test case, TestDiskBalancer#testDiskBalancerWhenRemovingVolumes, for exactly the above case. For now, the test code verifies that DiskBalancer is not crashing or throwing unexpected exceptions.

        To attack both of the above cases, jira HDFS-10904 has been filed so that the user sees a proper error message and a proper balancing operation state even when the involved volumes are removed. And by the way, this corner case already existed and was not introduced by this jira; this jira addresses the unnecessary FsVolumeSpi object references issue. Please let me know if you need more clarification.

        anu Anu Engineer added a comment -

        This corner case was already there and not introduced by this jira

        I am not convinced that this statement is correct. Let us enumerate the cases.

        Old behaviour
        1. If a source or destination disk is removed while the disk balancer is running – there is a failure (a crash perhaps? I will check that by running the test case) and the plan status flag will be stuck in PLAN_UNDER_PROGRESS.
        2. If a source or destination disk is removed after the disk balancer has completely run, a queryStatus would return the status that it remembers.

        As far as I understand (Please correct me if I am wrong), the new behaviour is as follows.

        New behaviour
        1. If a source or destination disk is removed while the disk balancer is running – there is a failure and the error message is set appropriately, but the plan status flag will be stuck in PLAN_UNDER_PROGRESS, with the details carried in the error message.
        2. If a source or destination disk is removed after the disk balancer has completely run, a queryStatus would return an error saying that the disk is missing.

        The fact that some disk got removed after the disk balancer was run should be of no consequence to the disk balancer. I am not arguing about the first case at all. My concern is the second point – the behaviour of queryStatus after the disk balancer has finished running. I am arguing that it should not fail just because a disk is missing when queryStatus is issued.

        manojg Manoj Govindassamy added a comment -

        Thanks for the comments.

        • Old behavior:
          1. In DiskBalancerMover#copyBlocks(), this.dataset.moveBlockAcrossVolumes(block, dest) fails with ClosedChannelException when it tries to get the volume reference.
          2. Yes, queryStatus() gets the volume base path as it already has the FsVolumeSpi reference stored.
        • New behavior:
          1. Yes, the new fix sets the error message when the volume is not available. But the fix does not take care of the currentResult state, and it will still be in PLAN_UNDER_PROGRESS.
          2. Yes, this is different from the old behavior, as queryStatus would return an error instead of the stored paths.

        Got your point for getQueryStatus(). Since FsVolumeSpi references are no longer stored and only storage volume UUIDs are available, it's not possible to construct the path for a non-existing volume.

        ConcurrentHashMap<VolumePair, DiskBalancerWorkItem> workMap;
        
          public static class VolumePair {
            private final String sourceVolUuid;
            private final String destVolUuid;
        

        Here are a few other options for fixing queryWorkStatus:

        1. Instead of throwing an error, maybe we can report the volume path as <volume missing> or a related error message whenever we find the volume removed.
        2. Or, persist the volume paths along with the UUIDs, so that query will always work as before.

        Your thoughts please?
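        As a sketch of option 1 above (hypothetical names, not the actual HDFS code): keying the work map by storage UUIDs and resolving volumes on demand means a removed volume is detected at use time and reported, instead of holding a stale FsVolumeSpi reference that crashes the mover.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

/** Sketch only: UUID-based volume tracking with graceful missing-volume handling. */
public class UuidVolumeMapSketch {

  /** Stand-in for FsVolumeSpi; only the storage UUID and base path matter here. */
  static final class Volume {
    final String uuid;
    final String basePath;
    Volume(String uuid, String basePath) { this.uuid = uuid; this.basePath = basePath; }
  }

  // UUID -> volume map; in real code this would be re-derived from the dataset.
  static final Map<String, Volume> storageIdToVolume = new ConcurrentHashMap<>();

  /** Resolve a UUID to a live volume; empty means the volume was removed. */
  static Optional<Volume> getVolume(String uuid) {
    return Optional.ofNullable(storageIdToVolume.get(uuid));
  }

  /** Mimics a copy step: skip with an error message instead of crashing. */
  static String copyStep(String srcUuid, String destUuid) {
    Optional<Volume> src = getVolume(srcUuid);
    Optional<Volume> dest = getVolume(destUuid);
    if (!src.isPresent() || !dest.isPresent()) {
      // Report which side went missing; the work item records this error.
      return "ERROR: volume removed: " + (src.isPresent() ? destUuid : srcUuid);
    }
    return "copied " + src.get().basePath + " -> " + dest.get().basePath;
  }

  public static void main(String[] args) {
    storageIdToVolume.put("uuid-1", new Volume("uuid-1", "/data/disk1"));
    storageIdToVolume.put("uuid-2", new Volume("uuid-2", "/data/disk2"));
    System.out.println(copyStep("uuid-1", "uuid-2"));
    storageIdToVolume.remove("uuid-2"); // simulate hot-removal of the disk
    System.out.println(copyStep("uuid-1", "uuid-2"));
  }
}
```

        The trade-off discussed in the thread is visible here: once the volume is gone, nothing in the map can reconstruct its path, which is what motivates persisting the paths as well.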

        anu Anu Engineer added a comment -

        Thanks for your comments. I think adding paths to VolumePair is an easier fix and not too memory intensive. As a side effect, it will eliminate any calls into FsVolumeSpi during query.

        manojg Manoj Govindassamy added a comment -

        Attaching v04 patch to address the following. Anu Engineer, kindly take a look.

        • Restored the old behavior of getQueryStatus to return volume paths even if the volume is removed. Made VolumePair hold the volume paths.
        • Accordingly changed the expectation in TestDiskBalancer. The test no longer expects DiskBalancerException and instead waits for the status to reach PLAN_DONE.
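        The v04 approach could look roughly like this (a sketch with hypothetical field and method names modeled on the VolumePair snippet earlier in the thread, not the committed patch): the pair captures both UUIDs and base paths at plan time, so a status query never needs a live FsVolumeSpi.

```java
/** Sketch only: VolumePair that persists base paths alongside storage UUIDs. */
public class VolumePairSketch {

  static final class VolumePair {
    private final String sourceVolUuid;
    private final String sourceVolBasePath;
    private final String destVolUuid;
    private final String destVolBasePath;

    VolumePair(String srcUuid, String srcPath, String dstUuid, String dstPath) {
      this.sourceVolUuid = srcUuid;
      this.sourceVolBasePath = srcPath;
      this.destVolUuid = dstUuid;
      this.destVolBasePath = dstPath;
    }

    /** Paths come from the pair itself; no volume lookup is needed, so this
     *  works even after the underlying disk has been removed. */
    String describe() {
      return sourceVolBasePath + " (" + sourceVolUuid + ") -> "
          + destVolBasePath + " (" + destVolUuid + ")";
    }
  }

  public static void main(String[] args) {
    VolumePair pair = new VolumePair("uuid-1", "/data/disk1", "uuid-2", "/data/disk2");
    // Even if uuid-2's disk is later removed, the recorded path is still reportable.
    System.out.println(pair.describe());
  }
}
```

        The memory cost is two extra strings per pair, which matches the "not too memory intensive" assessment above.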
        anu Anu Engineer added a comment -

        +1, pending Jenkins. The changes in the v4 patch look very good. Thanks for getting this done. I will commit this as soon as we have a Jenkins run.

        manojg Manoj Govindassamy added a comment -

        Thanks a lot for the quick review.

        hadoopqa Hadoop QA added a comment -
        -1 overall



        Vote Subsystem Runtime Comment
        0 reexec 0m 17s Docker mode activated.
        +1 @author 0m 0s The patch does not contain any @author tags.
        +1 test4tests 0m 0s The patch appears to include 1 new or modified test files.
        +1 mvninstall 7m 14s trunk passed
        +1 compile 0m 45s trunk passed
        +1 checkstyle 0m 25s trunk passed
        +1 mvnsite 0m 51s trunk passed
        +1 mvneclipse 0m 12s trunk passed
        +1 findbugs 1m 41s trunk passed
        +1 javadoc 0m 55s trunk passed
        +1 mvninstall 0m 48s the patch passed
        +1 compile 0m 42s the patch passed
        +1 javac 0m 42s the patch passed
        +1 checkstyle 0m 24s the patch passed
        +1 mvnsite 0m 50s the patch passed
        +1 mvneclipse 0m 10s the patch passed
        +1 whitespace 0m 0s The patch has no whitespace issues.
        +1 findbugs 1m 51s the patch passed
        +1 javadoc 0m 53s the patch passed
        -1 unit 57m 48s hadoop-hdfs in the patch failed.
        +1 asflicense 0m 18s The patch does not generate ASF License warnings.
        77m 23s



        Reason Tests
        Failed junit tests hadoop.tracing.TestTracing
          hadoop.hdfs.server.namenode.TestDiskspaceQuotaUpdate



        Subsystem Report/Notes
        Docker Image:yetus/hadoop:9560f25
        JIRA Issue HDFS-9850
        JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12830605/HDFS-9850.004.patch
        Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
        uname Linux a779b26c5db0 3.13.0-92-generic #139-Ubuntu SMP Tue Jun 28 20:42:26 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
        Build tool maven
        Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
        git revision trunk / d144398
        Default Java 1.8.0_101
        findbugs v3.0.0
        unit https://builds.apache.org/job/PreCommit-HDFS-Build/16892/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
        Test Results https://builds.apache.org/job/PreCommit-HDFS-Build/16892/testReport/
        modules C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs
        Console output https://builds.apache.org/job/PreCommit-HDFS-Build/16892/console
        Powered by Apache Yetus 0.4.0-SNAPSHOT http://yetus.apache.org

        This message was automatically generated.

        manojg Manoj Govindassamy added a comment -

        Unit test failures are not related to the attached patch.

        anu Anu Engineer added a comment -

        Manoj Govindassamy Thanks for fixing this issue. I have committed this to trunk.

        hudson Hudson added a comment -

        SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #10505 (See https://builds.apache.org/job/Hadoop-trunk-Commit/10505/)
        HDFS-9850. DiskBalancer: Explore removing references to FsVolumeSpi. (aengineer: rev 03f519a757ce83d76e7fc9f6aadf271e38bb9f6d)

        • (edit) hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DiskBalancer.java
        • (edit) hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/diskbalancer/TestDiskBalancer.java
        manojg Manoj Govindassamy added a comment -

        Thanks Anu Engineer, Lei (Eddy) Xu for the review and commit help.


          People

          • Assignee:
            manojg Manoj Govindassamy
            Reporter:
            anu Anu Engineer
          • Votes:
            0
            Watchers:
            8

            Dates

            • Created:
              Updated:
              Resolved:

              Development