Details
- Type: Sub-task
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Fix Version: 3.0.0-alpha2
- Labels: None
- Reviewed
Description
In HDFS-9671, arpitagarwal commented that we should explore the possibility of removing references to FsVolumeSpi at any point and only dealing with storage IDs. We are not sure if this is possible; this JIRA is to explore whether that can be done without issues.
Attachments
- HDFS-9850.001.patch (27 kB, Manoj Govindassamy)
- HDFS-9850.002.patch (27 kB, Manoj Govindassamy)
- HDFS-9850.003.patch (22 kB, Manoj Govindassamy)
- HDFS-9850.004.patch (22 kB, Manoj Govindassamy)
Activity
Attached v01 patch which addresses the following:
- Replaced FsVolumeSpi object references in VolumePair and getStorageIDToVolumeMap with volume storage UUIDs, so that unnecessary object references to FsVolume are avoided.
- Whenever needed, the FsVolumeSpi object is retrieved by looking it up with the stored UUID.
- This way, any volumes removed after plan creation or during the disk balancing activity can be easily tracked, and appropriate error messages can be returned.
- Added a unit test, TestDiskBalancer#testDiskBalancerWhenRemovingVolumes, to verify the new functionality:
  - The test executes a computed disk balancing plan between 2 storage volumes.
  - While DiskBalancer#copyBlocks is in progress, the destination storage volume is removed.
  - DiskBalancer detects this and returns gracefully with a proper error message set in the WorkItem.
  - Without the HDFS-9850 fix, the test fails with a ClosedChannelException when obtaining the reference for the deleted volume.
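The UUID-keyed bookkeeping described above can be sketched in simplified form as follows. This is an illustrative stand-in, not the actual patch code: the Volume interface, the UuidVolumeMap class, and its method names are hypothetical, with only the VolumePair field names taken from the discussion on this JIRA.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Sketch: track disk-balancer work by storage UUID instead of holding
// FsVolumeSpi object references, so a volume removed after plan creation
// is detected lazily at lookup time instead of being pinned in memory.
public class UuidVolumeMap {
    // Stands in for the real FsVolumeSpi; only the storage ID matters here.
    public interface Volume { String getStorageID(); }

    private final Map<String, Volume> volumesById = new HashMap<>();

    public void addVolume(Volume v) { volumesById.put(v.getStorageID(), v); }
    public void removeVolume(String uuid) { volumesById.remove(uuid); }

    // Look the volume up by UUID at use time; null means it was removed
    // after the plan was created, so the caller can report a clean error.
    public Volume lookup(String uuid) { return volumesById.get(uuid); }

    // Analogue of the patch's VolumePair: two storage UUIDs, no object refs.
    public static final class VolumePair {
        final String sourceVolUuid;
        final String destVolUuid;
        public VolumePair(String src, String dest) {
            this.sourceVolUuid = Objects.requireNonNull(src);
            this.destVolUuid = Objects.requireNonNull(dest);
        }
    }
}
```

With this shape, a removed volume simply stops resolving, which is what lets copyBlocks fail with an error message instead of holding a stale reference.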
eddyxu, anu, please take a look at the patch and let me know your comments.
+1 overall
Vote | Subsystem | Runtime | Comment |
---|---|---|---|
0 | reexec | 0m 23s | Docker mode activated. |
+1 | @author | 0m 0s | The patch does not contain any @author tags. |
+1 | test4tests | 0m 0s | The patch appears to include 3 new or modified test files. |
+1 | mvninstall | 9m 16s | trunk passed |
+1 | compile | 1m 1s | trunk passed |
+1 | checkstyle | 0m 32s | trunk passed |
+1 | mvnsite | 1m 0s | trunk passed |
+1 | mvneclipse | 0m 14s | trunk passed |
+1 | findbugs | 1m 54s | trunk passed |
+1 | javadoc | 0m 57s | trunk passed |
+1 | mvninstall | 0m 48s | the patch passed |
+1 | compile | 0m 45s | the patch passed |
+1 | javac | 0m 45s | the patch passed |
+1 | checkstyle | 0m 25s | the patch passed |
+1 | mvnsite | 0m 53s | the patch passed |
+1 | mvneclipse | 0m 10s | the patch passed |
+1 | whitespace | 0m 0s | The patch has no whitespace issues. |
+1 | findbugs | 2m 14s | the patch passed |
+1 | javadoc | 0m 54s | the patch passed |
+1 | unit | 60m 37s | hadoop-hdfs in the patch passed. |
+1 | asflicense | 0m 20s | The patch does not generate ASF License warnings. |
83m 45s |
Subsystem | Report/Notes |
---|---|
Docker | Image:yetus/hadoop:9560f25 |
JIRA Issue | |
JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12829977/HDFS-9850.001.patch |
Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle |
uname | Linux 987d9a04477d 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
Build tool | maven |
Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh |
git revision | trunk / d85d9b2 |
Default Java | 1.8.0_101 |
findbugs | v3.0.0 |
Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/16838/testReport/ |
modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs |
Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/16838/console |
Powered by | Apache Yetus 0.4.0-SNAPSHOT http://yetus.apache.org |
This message was automatically generated.
Hi manojg, thanks for working on this issue.
```java
public FsVolumeReference getFsVolumeReference(String volUuid) {
  for (FsVolumeImpl fsVolume : volumes.getVolumes()) {
    if (fsVolume.getStorageID().equals(volUuid)) {
      try {
        return fsVolume.obtainReference();
      } catch (ClosedChannelException e) {
        LOG.warn("Unable to get a reference for the Volume: " + volUuid, e);
      }
    }
  }
  return null;
}
```
I think it'd be better to throw ClosedChannelException instead of swallowing it. This function has two expected error cases: 1) the volume with the given volUuid does not exist, in which case it should return null; or 2) the volume exists but has already been closed, in which case it is better to throw the exception. Similarly, FsDatasetSpi#getFsVolumeReference should throw CCE as well.
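The suggested semantics (null for a missing UUID, ClosedChannelException for a closed volume) can be sketched as follows. All type names below are simplified stand-ins for FsVolumeImpl and FsVolumeReference, not the real HDFS classes:

```java
import java.nio.channels.ClosedChannelException;
import java.util.List;

// Sketch of the lookup semantics suggested in the review: return null when no
// volume matches the UUID, but propagate ClosedChannelException when the
// matching volume exists and has already been closed.
public class VolumeLookup {
    public interface VolumeRef { }
    public interface Volume {
        String getStorageID();
        // Throws if the volume is closed, mirroring obtainReference().
        VolumeRef obtainReference() throws ClosedChannelException;
    }

    public static VolumeRef getFsVolumeReference(List<Volume> volumes, String volUuid)
            throws ClosedChannelException {
        for (Volume v : volumes) {
            if (v.getStorageID().equals(volUuid)) {
                // Do not swallow CCE here: the volume exists but is closed,
                // and the caller should decide how to handle that.
                return v.obtainReference();
            }
        }
        return null; // no volume with this UUID
    }
}
```

The design point is that "not found" and "found but closed" carry different information, so flattening both into null would hide the second case from the caller.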
- In testDiskBalancerWhenRemovingVolumes(), do you need to start MiniDFSCluster by calling waitActive?
- In DiskBalancer#copyBlocks, I think you need to hold the volume reference while you do the block movement.
Thanks.
Thanks for the review eddyxu.
- That's right, there are 2 cases in getFsVolumeReference(). But for the caller hitting case 2 (CCE), should that be treated as volume existence or non-existence? I assumed non-existence for the real use case, so I logged it and eventually returned null. But I am OK with throwing CCE as well and letting the caller handle it however they want.
- The test is using ClusterBuilder#build, which already calls cluster.waitActive().
- DiskBalancer#copyBlocks already holds the volume reference via the FsDatasetImpl#moveBlockAcrossVolumes call, and the reference is released once a block move is completed. Holding the reference only for the current block move and re-grabbing it for the next one plays well with other operations competing for the reference; otherwise we might starve them.
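The per-block reference pattern described above can be sketched with try-with-resources. Everything below is an illustrative simplification; Hadoop's FsVolumeReference is Closeable in the same spirit, but the types and method bodies here are hypothetical:

```java
import java.util.List;

// Sketch of the per-block reference pattern: grab the volume reference only
// for the duration of one block move and release it before the next, so other
// operations competing for the reference are not starved.
public class PerBlockRefs {
    public interface VolumeReference extends AutoCloseable {
        @Override void close(); // release the reference
    }
    public interface Volume { VolumeReference obtainReference(); }
    public interface Block { }

    public static int copyBlocks(List<Block> blocks, Volume dest) {
        int moved = 0;
        for (Block b : blocks) {
            // Reference is held only across this one move, then auto-released
            // when the try block exits, even if the move throws.
            try (VolumeReference ref = dest.obtainReference()) {
                moved++; // the real code would move the block here
            }
        }
        return moved;
    }
}
```

Releasing between moves is the fairness trade-off Manoj mentions: each iteration re-competes for the reference rather than pinning the volume for the whole plan.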
Attached v02 patch to address the previous review comment.
- Letting FsDatasetSpi#getFsVolumeReference throw ClosedChannelException and making the callers handle the exception.
eddyxu, kindly take a look at the v02 patch.
-1 overall
Vote | Subsystem | Runtime | Comment |
---|---|---|---|
0 | reexec | 0m 15s | Docker mode activated. |
+1 | @author | 0m 0s | The patch does not contain any @author tags. |
+1 | test4tests | 0m 0s | The patch appears to include 3 new or modified test files. |
+1 | mvninstall | 7m 9s | trunk passed |
+1 | compile | 0m 46s | trunk passed |
+1 | checkstyle | 0m 27s | trunk passed |
+1 | mvnsite | 0m 57s | trunk passed |
+1 | mvneclipse | 0m 12s | trunk passed |
+1 | findbugs | 1m 58s | trunk passed |
+1 | javadoc | 0m 57s | trunk passed |
+1 | mvninstall | 0m 50s | the patch passed |
+1 | compile | 0m 44s | the patch passed |
+1 | javac | 0m 44s | the patch passed |
-0 | checkstyle | 0m 25s | hadoop-hdfs-project/hadoop-hdfs: The patch generated 1 new + 171 unchanged - 0 fixed = 172 total (was 171) |
+1 | mvnsite | 0m 51s | the patch passed |
+1 | mvneclipse | 0m 10s | the patch passed |
+1 | whitespace | 0m 0s | The patch has no whitespace issues. |
+1 | findbugs | 1m 54s | the patch passed |
+1 | javadoc | 0m 55s | the patch passed |
-1 | unit | 61m 47s | hadoop-hdfs in the patch failed. |
+1 | asflicense | 0m 26s | The patch does not generate ASF License warnings. |
82m 0s |
Reason | Tests |
---|---|
Failed junit tests | hadoop.hdfs.server.namenode.ha.TestFailureToReadEdits |
Subsystem | Report/Notes |
---|---|
Docker | Image:yetus/hadoop:9560f25 |
JIRA Issue | |
JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12830135/HDFS-9850.002.patch |
Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle |
uname | Linux e50305efef11 3.13.0-92-generic #139-Ubuntu SMP Tue Jun 28 20:42:26 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux |
Build tool | maven |
Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh |
git revision | trunk / 6eb700e |
Default Java | 1.8.0_101 |
findbugs | v3.0.0 |
checkstyle | https://builds.apache.org/job/PreCommit-HDFS-Build/16850/artifact/patchprocess/diff-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt |
unit | https://builds.apache.org/job/PreCommit-HDFS-Build/16850/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt |
Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/16850/testReport/ |
modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs |
Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/16850/console |
Powered by | Apache Yetus 0.4.0-SNAPSHOT http://yetus.apache.org |
This message was automatically generated.
manojg Thanks for updating the patch as well as providing a fix for this issue. A couple of minor comments.
In DiskBalancer.java
The changes in queryWorkStatus can be pulled into a function and invoked twice, once for source and once for destination.
createWorkPlan – We left the "Disk Balancer" prefix in on the datanode side so it is easy to grep for error messages in datanode logs. If you don't mind, can you please put it back.
createWorkPlan – In these error messages it is easier for people to puzzle out the volume from the path than from the UUID. I think we should leave the path in the error message. Please scan the file for removals of "Disk Balancer" from logging, and also add that string to newly added LOG messages.
nit: Line 655 – the function comment says we are returning a volume, but in reality we are returning a volume UUID.
copyBlocks#1013, 1021: this.setExitFlag is spurious, since we return from the function on the next line and that flag never gets evaluated. Also, would you be able to set the path of the volume in the error string instead of the UUID?
FsDataSetImpl.java#getFsVolumeReference – Do we really need to add a function to the FsDataSet interface? Can't we do this via getVolumeReferences and a helper function in DiskBalancer.java itself? I don't think we should add this to FsDatasetSpi.
TestDiskBalancer.java#577 – You might want to assert that the test failed, along with logging an info message for the reconfig error.
TestDiskBalancer.java#597 – We verify that we are getting a DiskBalancerException, but not the payload inside the exception. Would it make sense to verify that the error is indeed some kind of volume error, and do a string check that verifies whether it is the source or the destination?
Thanks for the review and comments anu. Will upload a patch incorporating the comments soon.
1. Done. Made use of a private method getFsVolume(), just like in other places.
2. Done. Restored "Disk Balancer -" in the removed logs and in the new logs. Even without this explicit signature, the log lines already carry the file name from which the logging happens (here, DiskBalancer.java), which can also be used for grepping.
3. Done. Restored the volume path in the error message. True, the volume path, when available, should be used in error logging since it is much easier to read.
4. Done. Corrected the method comments.
5. Done. Removed the unnecessary setExitFlag() calls.
6. Done. Removed the newly introduced interface method. Instead, implemented it as a new private method in DiskBalancer using the already available getVolumeReferences().
7. Done. Added the assert for the reconfigure error.
8. Done. The test now checks for the error message prefix as well.
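The approach in item 6 above (a private helper in DiskBalancer instead of a new FsDatasetSpi interface method) can be sketched roughly like this, with simplified stand-in types rather than the real Hadoop classes:

```java
import java.util.List;

// Sketch: resolve a volume by storage UUID from the dataset's existing
// volume references, so no new method needs to be added to the
// FsDatasetSpi interface.
public class VolumeResolver {
    // Stand-in for FsVolumeSpi; only the storage ID is needed here.
    public interface Volume { String getStorageID(); }

    // Analogue of a private DiskBalancer#getFsVolume helper built on
    // getVolumeReferences(): scan the current volumes for a matching UUID.
    public static Volume getFsVolume(List<Volume> volumeReferences, String uuid) {
        for (Volume v : volumeReferences) {
            if (uuid.equals(v.getStorageID())) {
                return v;
            }
        }
        return null; // volume was removed after the plan was created
    }
}
```

Keeping the helper private to DiskBalancer avoids widening the dataset interface for a single caller, which was the concern raised in the review.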
Attaching v03 patch to address checkstyle and review comments from eddyxu and anu. Kindly take a look.
manojg Thanks for the quick turnaround. I was reading the code and something struck me. Could you please help me understand what would happen in the following scenario?
Let us say a plan was executed, then either the source or destination disk was removed/reconfigured, and then I call into queryStatus.
Would it work, or would it fail?
-1 overall
Vote | Subsystem | Runtime | Comment |
---|---|---|---|
0 | reexec | 0m 15s | Docker mode activated. |
+1 | @author | 0m 0s | The patch does not contain any @author tags. |
+1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
+1 | mvninstall | 7m 3s | trunk passed |
+1 | compile | 0m 48s | trunk passed |
+1 | checkstyle | 0m 26s | trunk passed |
+1 | mvnsite | 1m 0s | trunk passed |
+1 | mvneclipse | 0m 18s | trunk passed |
+1 | findbugs | 1m 59s | trunk passed |
+1 | javadoc | 0m 54s | trunk passed |
+1 | mvninstall | 0m 48s | the patch passed |
+1 | compile | 0m 42s | the patch passed |
+1 | javac | 0m 42s | the patch passed |
+1 | checkstyle | 0m 23s | the patch passed |
+1 | mvnsite | 0m 48s | the patch passed |
+1 | mvneclipse | 0m 9s | the patch passed |
+1 | whitespace | 0m 0s | The patch has no whitespace issues. |
+1 | findbugs | 1m 50s | the patch passed |
+1 | javadoc | 0m 53s | the patch passed |
-1 | unit | 62m 44s | hadoop-hdfs in the patch failed. |
+1 | asflicense | 0m 20s | The patch does not generate ASF License warnings. |
82m 32s |
Reason | Tests |
---|---|
Failed junit tests | hadoop.hdfs.server.mover.TestStorageMover |
hadoop.hdfs.TestDecommissionWithStriped |
Subsystem | Report/Notes |
---|---|
Docker | Image:yetus/hadoop:9560f25 |
JIRA Issue | |
JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12830432/HDFS-9850.003.patch |
Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle |
uname | Linux 6ecaa6ee4eb6 3.13.0-92-generic #139-Ubuntu SMP Tue Jun 28 20:42:26 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux |
Build tool | maven |
Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh |
git revision | trunk / edf0d0f |
Default Java | 1.8.0_101 |
findbugs | v3.0.0 |
unit | https://builds.apache.org/job/PreCommit-HDFS-Build/16874/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt |
Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/16874/testReport/ |
modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs |
Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/16874/console |
Powered by | Apache Yetus 0.4.0-SNAPSHOT http://yetus.apache.org |
This message was automatically generated.
With the proposed fix in this JIRA issue, HDFS-9850, both DiskBalancerMover#copyBlocks and DiskBalancer#queryWorkStatus handle the case where any of the storage volumes involved in the plan are no longer available.
- copyBlocks logs the error about the missing volume and returns without crashing or throwing an exception.
  - That particular DiskBalancerWorkItem thus gets skipped, and copyBlocks moves on to the next one.
  - But if the step happens to be the last one, then currentResult gets stuck at PLAN_UNDER_PROGRESS even though copyBlocks returned with an error.
- queryWorkStatus also logs the error about the missing volume and throws DiskBalancerException with INTERNAL_ERROR.
  - QueryCommand, which invokes queryWorkStatus, gets the exception and logs the error "query plan failed".
  - Maybe we should also print something out for the user so that they get to see a meaningful error message.
As part of fixing this JIRA, I have also added a unit test, TestDiskBalancer#testDiskBalancerWhenRemovingVolumes, for exactly the above case. For now, the test code verifies that DiskBalancer is not crashing or throwing unexpected exceptions.
To attack both of the above cases, JIRA HDFS-10904 has been filed so that the user gets a proper error message and a proper balancing operation state even when the involved volumes are removed. And by the way, this corner case was already there and not introduced by this JIRA; this JIRA addresses the unnecessary FsVolumeSpi object reference issue. Please let me know if you need more clarification.
This corner case was already there and not introduced by this jira
I am not convinced that this statement is correct. Let us enumerate the cases.
- If a source or destination disk is removed while the disk balancer is running – there is a failure, perhaps a crash (I will check that by running the test case), and the plan status flag will be stuck in PLAN_UNDER_PROGRESS.
- If a source or destination disk is removed after the disk balancer has completely run, a queryStatus would return the status that it remembers.
As far as I understand (please correct me if I am wrong), the new behaviour is as follows.
- If a source or destination disk is removed while the disk balancer is running – there is a failure and the error message is set appropriately, but the plan status flag will be stuck in PLAN_UNDER_PROGRESS, with the details of the failure in the error message.
- If a source or destination disk is removed after the disk balancer has completely run, a queryStatus would return an error saying that the disk is missing.
The fact that some disk got removed after the disk balancer was run should not be of any consequence to the disk balancer. I am not arguing about the first case at all. My concern is the second point – the behaviour of queryStatus after the disk balancer has finished running. I am arguing that queryStatus should not fail just because a disk is missing.
Thanks for the comments.
- Old behavior:
  - In DiskBalancerMover#copyBlocks(), this.dataset.moveBlockAcrossVolumes(block, dest) fails with a ClosedChannelException when it tries to get the volume reference.
  - Yes, queryStatus() gets the volume base path since it already has the FsVolumeSpi reference stored.
- New behavior:
  - Yes, the new fix sets the error message when the volume is not available. But the fix does not take care of the currentResult state, and it will still be PLAN_UNDER_PROGRESS.
  - Yes, this is different from the old behavior, as queryStatus would return an error instead of the stored paths.
Got your point about getQueryStatus(). Since FsVolumeSpi references are no longer stored and only storage volume UUIDs are available, it's not possible to construct the path for a non-existent volume.
```java
ConcurrentHashMap<VolumePair, DiskBalancerWorkItem> workMap;

public static class VolumePair {
  private final String sourceVolUuid;
  private final String destVolUuid;
```
Here are a few other options for fixing queryWorkStatus:
- Instead of throwing an error, maybe we can report the volume path as <volume missing> or a related error message whenever we find the volume removed.
- Or, persist the volume paths along with the UUIDs, so that query will always work as before.
Your thoughts, please?
Thanks for your comments. I think adding paths to VolumePair is the easier fix and not too memory intensive. As a side effect, it will eliminate any calls into FsVolumeSpi during query.
Attaching v04 patch to address the following. anu, kindly take a look.
- Restored the old behavior of getQueryStatus to return volume paths even if the volume is removed. Made VolumePair hold the volume paths.
- Accordingly changed the expectations in TestDiskBalancer. The test no longer expects DiskBalancerException and instead waits for the status to reach PLAN_DONE.
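The v04 shape of VolumePair, holding base paths alongside the UUIDs, might look roughly like this. The UUID field names follow the snippet quoted earlier in this thread; the path fields and getters are this sketch's assumption, not the actual patch code:

```java
import java.util.Objects;

// Sketch of the v04 approach: VolumePair keeps both the storage UUIDs and
// the volume base paths, so queryWorkStatus can report paths even after a
// volume is removed, without touching FsVolumeSpi at query time.
public class VolumePair {
    private final String sourceVolUuid;
    private final String sourceVolBasePath;
    private final String destVolUuid;
    private final String destVolBasePath;

    public VolumePair(String sourceVolUuid, String sourceVolBasePath,
                      String destVolUuid, String destVolBasePath) {
        this.sourceVolUuid = Objects.requireNonNull(sourceVolUuid);
        this.sourceVolBasePath = Objects.requireNonNull(sourceVolBasePath);
        this.destVolUuid = Objects.requireNonNull(destVolUuid);
        this.destVolBasePath = Objects.requireNonNull(destVolBasePath);
    }

    // Paths stay available for status reporting even if the volume is gone.
    public String getSourceVolUuid() { return sourceVolUuid; }
    public String getSourceVolBasePath() { return sourceVolBasePath; }
    public String getDestVolUuid() { return destVolUuid; }
    public String getDestVolBasePath() { return destVolBasePath; }
}
```

Storing two short strings per pair is the "not too memory intensive" trade-off anu mentions, and it restores the old queryStatus behavior without re-introducing object references.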
+1, pending Jenkins. The changes in the v4 patch look very good. Thanks for getting this done. I will commit this as soon as we have a Jenkins run.
-1 overall
Vote | Subsystem | Runtime | Comment |
---|---|---|---|
0 | reexec | 0m 17s | Docker mode activated. |
+1 | @author | 0m 0s | The patch does not contain any @author tags. |
+1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
+1 | mvninstall | 7m 14s | trunk passed |
+1 | compile | 0m 45s | trunk passed |
+1 | checkstyle | 0m 25s | trunk passed |
+1 | mvnsite | 0m 51s | trunk passed |
+1 | mvneclipse | 0m 12s | trunk passed |
+1 | findbugs | 1m 41s | trunk passed |
+1 | javadoc | 0m 55s | trunk passed |
+1 | mvninstall | 0m 48s | the patch passed |
+1 | compile | 0m 42s | the patch passed |
+1 | javac | 0m 42s | the patch passed |
+1 | checkstyle | 0m 24s | the patch passed |
+1 | mvnsite | 0m 50s | the patch passed |
+1 | mvneclipse | 0m 10s | the patch passed |
+1 | whitespace | 0m 0s | The patch has no whitespace issues. |
+1 | findbugs | 1m 51s | the patch passed |
+1 | javadoc | 0m 53s | the patch passed |
-1 | unit | 57m 48s | hadoop-hdfs in the patch failed. |
+1 | asflicense | 0m 18s | The patch does not generate ASF License warnings. |
77m 23s |
Reason | Tests |
---|---|
Failed junit tests | hadoop.tracing.TestTracing |
hadoop.hdfs.server.namenode.TestDiskspaceQuotaUpdate |
Subsystem | Report/Notes |
---|---|
Docker | Image:yetus/hadoop:9560f25 |
JIRA Issue | |
JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12830605/HDFS-9850.004.patch |
Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle |
uname | Linux a779b26c5db0 3.13.0-92-generic #139-Ubuntu SMP Tue Jun 28 20:42:26 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux |
Build tool | maven |
Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh |
git revision | trunk / d144398 |
Default Java | 1.8.0_101 |
findbugs | v3.0.0 |
unit | https://builds.apache.org/job/PreCommit-HDFS-Build/16892/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt |
Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/16892/testReport/ |
modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs |
Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/16892/console |
Powered by | Apache Yetus 0.4.0-SNAPSHOT http://yetus.apache.org |
This message was automatically generated.
SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #10505 (See https://builds.apache.org/job/Hadoop-trunk-Commit/10505/)
HDFS-9850. DiskBalancer: Explore removing references to FsVolumeSpi. (aengineer: rev 03f519a757ce83d76e7fc9f6aadf271e38bb9f6d)
- (edit) hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DiskBalancer.java
- (edit) hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/diskbalancer/TestDiskBalancer.java
Taking up this task after discussing with eddyxu and anu.