Hadoop HDFS / HDFS-14986

ReplicaCachingGetSpaceUsed throws ConcurrentModificationException

Details

    • Reviewed

    Description

      Running DU across lots of disks is very expensive. We applied the HDFS-14313 patch to get the used space from the ReplicaInfo objects in memory. However, the new du threads throw the following exception:

      2019-11-08 18:07:13,858 ERROR [refreshUsed-/home/vipshop/hard_disk/7/dfs/dn/current/BP-1203969992-XXXX-1450855658517] org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.ReplicaCachingGetSpaceUsed: ReplicaCachingGetSpaceUsed refresh error
      java.util.ConcurrentModificationException: Tree has been modified outside of iterator
          at org.apache.hadoop.hdfs.util.FoldedTreeSet$TreeSetIterator.checkForModification(FoldedTreeSet.java:311)
          at org.apache.hadoop.hdfs.util.FoldedTreeSet$TreeSetIterator.hasNext(FoldedTreeSet.java:256)
          at java.util.AbstractCollection.addAll(AbstractCollection.java:343)
          at java.util.HashSet.<init>(HashSet.java:120)
          at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.deepCopyReplica(FsDatasetImpl.java:1052)
          at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.ReplicaCachingGetSpaceUsed.refresh(ReplicaCachingGetSpaceUsed.java:73)
          at org.apache.hadoop.fs.CachingGetSpaceUsed$RefreshThread.run(CachingGetSpaceUsed.java:178)
          at java.lang.Thread.run(Thread.java:748)
      

      Attachments

        1. HDFS-14986.006.patch
          9 kB
          Mingxiang Li
        2. HDFS-14986.005.patch
          9 kB
          Mingxiang Li
        3. HDFS-14986.004.patch
          9 kB
          Mingxiang Li
        4. HDFS-14986.003.patch
          8 kB
          Mingxiang Li
        5. HDFS-14986.002.patch
          8 kB
          Mingxiang Li
        6. HDFS-14986.001.patch
          3 kB
          Mingxiang Li

        Issue Links

          Activity

            jianliang.wu Ryan Wu added a comment -

            The deepCopyReplica method gets the FoldedTreeSet<ReplicaInfo> from the ReplicaMap; at the same time, the FoldedTreeSet<ReplicaInfo> can be changed by addBlock or removeBlock. That is when this ConcurrentModificationException appears.
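            The race described above can be sketched in miniature with a plain java.util.TreeSet, whose iterator is fail-fast in the same spirit as FoldedTreeSet's (a standalone illustration with hypothetical names, not Hadoop code):

            ```java
            import java.util.ConcurrentModificationException;
            import java.util.HashSet;
            import java.util.Iterator;
            import java.util.TreeSet;

            public class CmeDemo {
                /** Returns true if copying the set while it is mutated mid-copy raises CME. */
                static boolean copyWithConcurrentAdd() {
                    TreeSet<String> replicaSet = new TreeSet<>();
                    for (int i = 0; i < 10; i++) {
                        replicaSet.add("blk_" + i);
                    }
                    try {
                        // Simulates deepCopyReplica: new HashSet<>(...) iterates the source set.
                        Iterator<String> it = replicaSet.iterator();
                        HashSet<String> copy = new HashSet<>();
                        while (it.hasNext()) {
                            copy.add(it.next());
                            // Simulates a concurrent addBlock() structurally modifying the set.
                            replicaSet.add("blk_added_during_copy");
                        }
                        return false;
                    } catch (ConcurrentModificationException e) {
                        return true; // the fail-fast iterator detected the modification
                    }
                }

                public static void main(String[] args) {
                    System.out.println("CME raised during copy: " + copyWithConcurrentAdd());
                }
            }
            ```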

            hexiaoqiao Xiaoqiao He added a comment -

            We hit a very similar case (ConcurrentModificationException or deadlock) when backporting HDFS-14313 to branch-2.7, and I believe there is still some room to improve the current implementation.
            Aiphag0 may have more information. Would you like to offer our patch here?

            linyiqun Yiqun Lin added a comment -

            +leosun08 to have a look at this as well.

            we meet very similar case (ConcurrentModificationException or DeadLock)...

            hexiaoqiao, I wonder how you ran into the deadlock case. I remember leosun08 and I discussed the deadlock problem in HDFS-14313 and concluded it should not happen. It would be great if you could provide the stack info for the deadlock.

            Aiphag0 Mingxiang Li added a comment -

            In HDFS-14313 we found the same concurrency problem before, based on our Hadoop version, so we solved it by taking the dataset lock in FsDatasetImpl#deepCopyReplica(). But this can cause a deadlock when restarting the DataNode: when a BlockPoolSlice is initialized for the first time, BlockPoolSlice#loadDfsUsed() may return -1, and then ReplicaCachingGetSpaceUsed#init() blocks, because FsDatasetImpl#deepCopyReplica() cannot get the dataset lock while the initializing thread holds it and cannot release it at the same time.
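            The join-while-holding-a-lock pattern behind this deadlock can be reduced to a few lines (a hypothetical sketch using a plain ReentrantLock, not the actual FsDatasetImpl code):

            ```java
            import java.util.concurrent.locks.ReentrantLock;

            public class JoinDeadlockDemo {
                static final ReentrantLock datasetLock = new ReentrantLock();

                /** Returns true if the child is still blocked after a timed join, i.e. deadlocked. */
                static boolean demoDeadlock() throws InterruptedException {
                    // Stands in for ReplicaCachingGetSpaceUsed#init() -> deepCopyReplica(),
                    // which needs the dataset lock.
                    Thread refresher = new Thread(() -> {
                        datasetLock.lock();
                        datasetLock.unlock();
                    });

                    datasetLock.lock(); // stands in for addBlockPool() holding the dataset lock
                    boolean stuck;
                    try {
                        refresher.start();
                        refresher.join(500); // addBlockPool() joins the child while still holding the lock
                        stuck = refresher.isAlive();
                    } finally {
                        datasetLock.unlock(); // releasing the lock finally lets the child finish
                    }
                    refresher.join();
                    return stuck;
                }

                public static void main(String[] args) throws InterruptedException {
                    System.out.println("deadlocked while lock held: " + demoDeadlock());
                }
            }
            ```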

            Aiphag0 Mingxiang Li added a comment -

            I would like to submit a patch to try to fix this bug later.

            Aiphag0 Mingxiang Li added a comment -

            "DataNode: [....... #283 daemon prio=5 os_prio=0 tid=0x00007fd9826a7800 nid=0x7463 in Object.wait() [0x00007fd949616000]
               java.lang.Thread.State: WAITING (on object monitor)
                    at java.lang.Object.wait(Native Method)
                    - waiting on <0x00000006b375a1d0> (a org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeList$2)
                    at java.lang.Thread.join(Thread.java:1249)
                    - locked <0x00000006b375a1d0> (a org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeList$2)
                    at java.lang.Thread.join(Thread.java:1323)
                    at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeList.addBlockPool(FsVolumeList.java:423)
                    at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.addBlockPool(FsDatasetImpl.java:2509)
                    - locked <0x00000006b33a3b10> (a org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl)
                    at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1388)
                    at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:311)
                    at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:232)
                    at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:720)
            hexiaoqiao Xiaoqiao He added a comment -

            linyiqun, we made some modifications when backporting HDFS-14313, since there are many differences between branch-2.7 and trunk.
            As Aiphag0 said above, with no synchronization control it throws the CME, but it hits a deadlock if we add synchronization, when restarting the DataNode. The root cause is that `BlockPoolSlice#loadDfsUsed() may return -1, and ReplicaCachingGetSpaceUsed#init() then blocks` at startup, as mentioned above.

            leosun08 Lisheng Sun added a comment -

            FsDatasetImpl#deepCopyReplica was synchronized before branch-2.8; HDFS-10682 "Replace FsDatasetImpl object lock with a separate lock object" changed that in branch-2.8.

            public Set<? extends Replica> deepCopyReplica(String bpid){
            ...
            }
            

             

            leosun08 Lisheng Sun added a comment -

            Aiphag0, which branch do you use?

            Aiphag0 Mingxiang Li added a comment -

            We use a branch before 2.8, with the synchronized method, and we fixed the deadlock problem. I think it's better to also take the dataset lock in FsDatasetImpl#deepCopyReplica in trunk; the method to solve the deadlock problem is the same.
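            The approach described above — guarding both the mutation and the deep copy with the same lock, so the copy never observes a mid-mutation set — can be sketched standalone (hypothetical names, not the actual patch):

            ```java
            import java.util.HashSet;
            import java.util.Set;
            import java.util.TreeSet;

            public class LockedCopyDemo {
                private final Object datasetLock = new Object();
                private final TreeSet<String> replicaMap = new TreeSet<>();

                /** Mutation path, analogous to addBlock, guarded by the dataset lock. */
                void addBlock(String id) {
                    synchronized (datasetLock) { replicaMap.add(id); }
                }

                /** Copy path, analogous to deepCopyReplica, guarded by the same lock. */
                Set<String> deepCopyReplica() {
                    synchronized (datasetLock) { return new HashSet<>(replicaMap); }
                }

                public static void main(String[] args) throws InterruptedException {
                    LockedCopyDemo ds = new LockedCopyDemo();
                    Thread writer = new Thread(() -> {
                        for (int i = 0; i < 100_000; i++) ds.addBlock("blk_" + i);
                    });
                    writer.start();
                    // Concurrent copies hold the lock for the whole iteration, so no CME.
                    for (int i = 0; i < 1_000; i++) ds.deepCopyReplica();
                    writer.join();
                    System.out.println("copies completed without CME; final size = "
                        + ds.deepCopyReplica().size());
                }
            }
            ```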

            leosun08 Lisheng Sun added a comment -

            Aiphag0

            If we take the dataset lock in FsDatasetImpl#deepCopyReplica in trunk, a deadlock will happen.

            I didn't use FsDatasetImpl#datasetLock, because FsDatasetImpl#addBlockPool, while holding the datasetLock, calls FsDatasetImpl#deepCopyReplica from another thread when the DataNode starts.

            https://issues.apache.org/jira/browse/HDFS-14313?focusedCommentId=16887859&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16887859

            Aiphag0 Mingxiang Li added a comment -

            I mean I can fix the deadlock problem while adding the dataset lock; my first comment describes the solution.

            Aiphag0 Mingxiang Li added a comment -

            Here is the patch for trunk. Can leosun08, hexiaoqiao, and linyiqun review this one? Thanks very much. HDFS-14986.001.patch

            linyiqun Yiqun Lin added a comment -

            Thanks for providing the patch, Aiphag0. I like the approach. Only minor comments:

            RefreshThread(CachingGetSpaceUsed spaceUsed,boolean runImmediately)

            Can you add one space after the comma in spaceUsed,boolean?

            Can you add the comment for the method deepCopyReplica in FsDatasetSpi?

              /**
               * Deep copy the replica info belonging to given block pool.
               * @param bpid Specified block pool id.
               * @return A set of replica info.
               * @throws IOException
               */
              Set<? extends Replica> deepCopyReplica(String bpid) throws IOException;
            

            And then remove the following outdated comment in FsDatasetImpl?

              /**
               * The deepCopyReplica call doesn't use the datasetock since it will lead the
               * potential deadlock with the {@link FsVolumeList#addBlockPool} call.
               */
            

            Additionally, can you add a unit test for this? It's a good habit, when we fix a bug, to add a corresponding unit test case.

            Aiphag0 Mingxiang Li added a comment -

            Thanks for your advice; I'll fix it later.

            Aiphag0 Mingxiang Li added a comment -

            Hi linyiqun, I updated the patch. Can you review it again? HDFS-14986.002.patch

            hadoopqa Hadoop QA added a comment -
            -1 overall



            Vote Subsystem Runtime Comment
            0 reexec 0m 54s Docker mode activated.
                  Prechecks
            +1 @author 0m 0s The patch does not contain any @author tags.
            +1 test4tests 0m 0s The patch appears to include 1 new or modified test files.
                  trunk Compile Tests
            0 mvndep 0m 26s Maven dependency ordering for branch
            +1 mvninstall 20m 19s trunk passed
            +1 compile 20m 35s trunk passed
            +1 checkstyle 3m 3s trunk passed
            +1 mvnsite 3m 6s trunk passed
            +1 shadedclient 21m 50s branch has no errors when building and testing our client artifacts.
            +1 findbugs 4m 45s trunk passed
            +1 javadoc 3m 11s trunk passed
                  Patch Compile Tests
            0 mvndep 0m 28s Maven dependency ordering for patch
            +1 mvninstall 2m 0s the patch passed
            +1 compile 20m 31s the patch passed
            +1 javac 20m 31s the patch passed
            -0 checkstyle 3m 7s root: The patch generated 1 new + 117 unchanged - 0 fixed = 118 total (was 117)
            +1 mvnsite 3m 16s the patch passed
            +1 whitespace 0m 0s The patch has no whitespace issues.
            +1 shadedclient 15m 28s patch has no errors when building and testing our client artifacts.
            +1 findbugs 5m 15s the patch passed
            +1 javadoc 3m 9s the patch passed
                  Other Tests
            -1 unit 8m 44s hadoop-common in the patch failed.
            -1 unit 106m 30s hadoop-hdfs in the patch failed.
            +1 asflicense 0m 45s The patch does not generate ASF License warnings.
            244m 52s



            Reason Tests
            Failed junit tests hadoop.fs.TestDFCachingGetSpaceUsed
              hadoop.security.TestFixKerberosTicketOrder
              hadoop.fs.TestDU
              hadoop.hdfs.tools.offlineEditsViewer.TestOfflineEditsViewer
              hadoop.hdfs.server.balancer.TestBalancer



            Subsystem Report/Notes
            Docker Client=19.03.5 Server=19.03.5 Image:yetus/hadoop:104ccca9169
            JIRA Issue HDFS-14986
            JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12986219/HDFS-14986.002.patch
            Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle
            uname Linux 70d35e564673 4.15.0-66-generic #75-Ubuntu SMP Tue Oct 1 05:24:09 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
            Build tool maven
            Personality /testptch/patchprocess/precommit/personality/provided.sh
            git revision trunk / 9fbfe6c
            maven version: Apache Maven 3.3.9
            Default Java 1.8.0_222
            findbugs v3.1.0-RC1
            checkstyle https://builds.apache.org/job/PreCommit-HDFS-Build/28341/artifact/out/diff-checkstyle-root.txt
            unit https://builds.apache.org/job/PreCommit-HDFS-Build/28341/artifact/out/patch-unit-hadoop-common-project_hadoop-common.txt
            unit https://builds.apache.org/job/PreCommit-HDFS-Build/28341/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
            Test Results https://builds.apache.org/job/PreCommit-HDFS-Build/28341/testReport/
            Max. process+thread count 2839 (vs. ulimit of 5500)
            modules C: hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs U: .
            Console output https://builds.apache.org/job/PreCommit-HDFS-Build/28341/console
            Powered by Apache Yetus 0.8.0 http://yetus.apache.org

            This message was automatically generated.

            linyiqun Yiqun Lin added a comment - - edited

            Hi Aiphag0, the patch almost looks good to me. Only some minor comments on the unit test; I think it's enough that we just verify the deepCopyReplica call doesn't throw a CME.

            So the while loop in the unit test can be simplified to this:

                int retryTimes = 3;
                while (retryTimes > 0) {
                  try {
                    Set<? extends Replica> replicas = spyDataset.deepCopyReplica(bpid);
                    if (replicas.size() > 0) {
                      retryTimes--;
                      // deepCopyReplica should not throw ConcurrentModificationException
                      modifyThread.setShouldRun(false);
                    }
                  } catch (IOException e) {
                    modifyThread.setShouldRun(false);
                    Assert.fail("Encounter IOException when deep copy replica.");
                  }
                }
                modifyThread.setShouldRun(false);
            

             
            In addition, we need to close the fs output stream and delete the created test files in ModifyThread. Can you update it as follows?

            public void run() {
              FSDataOutputStream os = null;
              while (shouldRun) {
                try {
                  int id = RandomUtils.nextInt();
                  os = fs.create(new Path("/testFsDatasetImplDeepCopyReplica/" + id));
                  byte[] bytes = new byte[2048];
                  InputStream is = new ByteArrayInputStream(bytes);
                  IOUtils.copyBytes(is, os, bytes.length);
                  os.hsync();
                } catch (IOException e) {
                  // ignored exception
                }
              }

              try {
                if (os != null) {
                  os.close();
                }
                fs.delete(new Path("/testFsDatasetImplDeepCopyReplica"), true);
              } catch (IOException e) {
                // ignored exception
              }
            }

            Also, please check whether the following two failed unit tests are related:

            • TestDU
            • TestDFCachingGetSpaceUsed
            Aiphag0 Mingxiang Li added a comment -

            Hi linyiqun, thank you for your valuable advice. In the previous patch I modified CachingGetSpaceUsed#init(), but this influences the subclasses of CachingGetSpaceUsed, such as DU. So I added a filter, and now the related unit tests pass. HDFS-14986.003.patch

            Aiphag0 Mingxiang Li added a comment -

            Hi jianliang.wu, I just assigned this Jira to myself; please feel free to assign it back if you would also like to work on this.

            jianliang.wu Ryan Wu added a comment -

            Hi Aiphag0, thanks for providing the patch; it fixes my case too. I will backport it.

            linyiqun Yiqun Lin added a comment -

            Hi Aiphag0,

            +      // Prevent ReplicaCachingGetSpaceUsed dead lock at FsDatasetImpl#deepCopyReplica
            +      if (this.getClass().getSimpleName().equals("ReplicaCachingGetSpaceUsed")) {
            +        initRefeshThread(true);
            +        return;
            +      }
            

            Checking for the subclass inside the parent class is a little tricky; we can add a new flag and reset it in the subclass. I found a way: remove runImmediately from RefreshThread and add shouldInitRefresh = true in CachingGetSpaceUsed. By default, we will do the initial refresh.

                if (used.get() < 0) {
                  used.set(0);
                  if (shouldInitRefresh) {
                    refresh();
                  }
                }
            

            And we need to define a protected method to override this only in FSCachingGetSpaceUsed.

              /**
               * Reset whether we need to do the initial refresh.
               * @param shouldInitRefresh The flag value to set.
               */
              protected void setShouldInitRefresh(boolean shouldInitRefresh) {
                this.shouldInitRefresh = shouldInitRefresh;
              }

              public FSCachingGetSpaceUsed(Builder builder) throws IOException {
                super(builder);
                this.setShouldInitRefresh(false);
              }
            

             

            hadoopqa Hadoop QA added a comment -
            -1 overall



            Vote Subsystem Runtime Comment
            0 reexec 0m 44s Docker mode activated.
                  Prechecks
            +1 @author 0m 0s The patch does not contain any @author tags.
            +1 test4tests 0m 0s The patch appears to include 1 new or modified test files.
                  trunk Compile Tests
            0 mvndep 0m 59s Maven dependency ordering for branch
            +1 mvninstall 19m 18s trunk passed
            +1 compile 16m 8s trunk passed
            +1 checkstyle 2m 34s trunk passed
            +1 mvnsite 2m 28s trunk passed
            +1 shadedclient 18m 36s branch has no errors when building and testing our client artifacts.
            +1 findbugs 4m 0s trunk passed
            +1 javadoc 2m 36s trunk passed
                  Patch Compile Tests
            0 mvndep 0m 20s Maven dependency ordering for patch
            +1 mvninstall 1m 44s the patch passed
            +1 compile 15m 27s the patch passed
            +1 javac 15m 27s the patch passed
            -0 checkstyle 2m 34s root: The patch generated 3 new + 117 unchanged - 0 fixed = 120 total (was 117)
            +1 mvnsite 2m 20s the patch passed
            +1 whitespace 0m 0s The patch has no whitespace issues.
            +1 shadedclient 12m 49s patch has no errors when building and testing our client artifacts.
            +1 findbugs 4m 19s the patch passed
            +1 javadoc 2m 38s the patch passed
                  Other Tests
            -1 unit 8m 30s hadoop-common in the patch failed.
            -1 unit 106m 55s hadoop-hdfs in the patch failed.
            +1 asflicense 0m 56s The patch does not generate ASF License warnings.
            224m 0s



            Reason Tests
            Failed junit tests hadoop.security.TestFixKerberosTicketOrder
              hadoop.security.TestRaceWhenRelogin
              hadoop.hdfs.TestMultipleNNPortQOP
              hadoop.hdfs.server.namenode.TestPersistentStoragePolicySatisfier



            Subsystem Report/Notes
            Docker Client=19.03.5 Server=19.03.5 Image:yetus/hadoop:104ccca9169
            JIRA Issue HDFS-14986
            JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12986378/HDFS-14986.003.patch
            Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle
            uname Linux 7a09c1a83624 4.15.0-66-generic #75-Ubuntu SMP Tue Oct 1 05:24:09 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
            Build tool maven
            Personality /testptch/patchprocess/precommit/personality/provided.sh
            git revision trunk / de38045
            maven version: Apache Maven 3.3.9
            Default Java 1.8.0_222
            findbugs v3.1.0-RC1
            checkstyle https://builds.apache.org/job/PreCommit-HDFS-Build/28359/artifact/out/diff-checkstyle-root.txt
            unit https://builds.apache.org/job/PreCommit-HDFS-Build/28359/artifact/out/patch-unit-hadoop-common-project_hadoop-common.txt
            unit https://builds.apache.org/job/PreCommit-HDFS-Build/28359/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
            Test Results https://builds.apache.org/job/PreCommit-HDFS-Build/28359/testReport/
            Max. process+thread count 2706 (vs. ulimit of 5500)
            modules C: hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs U: .
            Console output https://builds.apache.org/job/PreCommit-HDFS-Build/28359/console
            Powered by Apache Yetus 0.8.0 http://yetus.apache.org

            This message was automatically generated.

            Aiphag0 Mingxiang Li added a comment -

            Hi linyiqun, thank you for your valuable advice. Now setShouldInitRefresh() is only invoked in ReplicaCachingGetSpaceUsed. HDFS-14986.004.patch

            hadoopqa Hadoop QA added a comment -
            -1 overall



            Vote Subsystem Runtime Comment
            0 reexec 1m 11s Docker mode activated.
                  Prechecks
            +1 @author 0m 0s The patch does not contain any @author tags.
            +1 test4tests 0m 0s The patch appears to include 1 new or modified test files.
                  trunk Compile Tests
            0 mvndep 0m 25s Maven dependency ordering for branch
            +1 mvninstall 23m 16s trunk passed
            +1 compile 22m 15s trunk passed
            +1 checkstyle 3m 21s trunk passed
            +1 mvnsite 3m 13s trunk passed
            +1 shadedclient 20m 36s branch has no errors when building and testing our client artifacts.
            +1 findbugs 4m 2s trunk passed
            +1 javadoc 2m 42s trunk passed
                  Patch Compile Tests
            0 mvndep 0m 23s Maven dependency ordering for patch
            +1 mvninstall 1m 48s the patch passed
            +1 compile 15m 38s the patch passed
            +1 javac 15m 38s the patch passed
            -0 checkstyle 2m 37s root: The patch generated 2 new + 118 unchanged - 0 fixed = 120 total (was 118)
            +1 mvnsite 2m 24s the patch passed
            +1 whitespace 0m 0s The patch has no whitespace issues.
            +1 shadedclient 12m 53s patch has no errors when building and testing our client artifacts.
            +1 findbugs 4m 18s the patch passed
            +1 javadoc 2m 36s the patch passed
                  Other Tests
            -1 unit 8m 31s hadoop-common in the patch failed.
            -1 unit 104m 14s hadoop-hdfs in the patch failed.
            +1 asflicense 0m 44s The patch does not generate ASF License warnings.
            233m 40s



            Reason Tests
            Failed junit tests hadoop.security.TestFixKerberosTicketOrder
              hadoop.security.TestRaceWhenRelogin
              hadoop.hdfs.tools.TestDFSZKFailoverController



            Subsystem Report/Notes
            Docker Client=19.03.5 Server=19.03.5 Image:yetus/hadoop:104ccca9169
            JIRA Issue HDFS-14986
            JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12986493/HDFS-14986.004.patch
            Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle
            uname Linux ae856fa84a00 4.15.0-66-generic #75-Ubuntu SMP Tue Oct 1 05:24:09 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
            Build tool maven
            Personality /testptch/patchprocess/precommit/personality/provided.sh
            git revision trunk / b25e94c
            maven version: Apache Maven 3.3.9
            Default Java 1.8.0_222
            findbugs v3.1.0-RC1
            checkstyle https://builds.apache.org/job/PreCommit-HDFS-Build/28374/artifact/out/diff-checkstyle-root.txt
            unit https://builds.apache.org/job/PreCommit-HDFS-Build/28374/artifact/out/patch-unit-hadoop-common-project_hadoop-common.txt
            unit https://builds.apache.org/job/PreCommit-HDFS-Build/28374/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
            Test Results https://builds.apache.org/job/PreCommit-HDFS-Build/28374/testReport/
            Max. process+thread count 2809 (vs. ulimit of 5500)
            modules C: hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs U: .
            Console output https://builds.apache.org/job/PreCommit-HDFS-Build/28374/console
            Powered by Apache Yetus 0.8.0 http://yetus.apache.org

            This message was automatically generated.

            linyiqun Yiqun Lin added a comment - - edited

            Hi Aiphag0, the latest patch can still be simplified. Actually we don't need to add runImmediately for RefreshThread. The only thing we need to do is skip the init refresh in FSCachingGetSpaceUsed. Here "init refresh" just means the first refresh operation. Hope you've got my idea.
            So the change can be simplified to this:

            --- a/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/CachingGetSpaceUsed.java
            +++ b/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/CachingGetSpaceUsed.java
            @@ -47,6 +47,7 @@
               private final long jitter;
               private final String dirPath;
               private Thread refreshUsed;
            +  private boolean shouldInitRefresh;
            
               /**
                * This is the constructor used by the builder.
            @@ -79,12 +80,15 @@ public CachingGetSpaceUsed(CachingGetSpaceUsed.Builder builder)
                 this.refreshInterval = interval;
                 this.jitter = jitter;
                 this.used.set(initialUsed);
            +    this.shouldInitRefresh = true;
               }
            
               void init() {
                 if (used.get() < 0) {
                   used.set(0);
            -      refresh();
            +      if (shouldInitRefresh) {
            +       refresh();
            +      }
                 }
            
                 if (refreshInterval > 0) {
            @@ -145,6 +149,14 @@ protected void setUsed(long usedValue) {
                 this.used.set(usedValue);
               }
            
            +  /**
            +   * Reset that if we need to do the initial refresh.
            +   * @param shouldInitRefresh The flag value to set.
            +   */
            +  protected void setShouldInitRefresh(boolean shouldInitRefresh) {
            +    this.shouldInitRefresh = shouldInitRefresh;
            +  }
            +
               @Override
               public void close() throws IOException {
                 running.set(false);
            diff --git a/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/FSCachingGetSpaceUsed.java b/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/FSCachingGetSpaceUsed.java
            index a5f350860f3..0a922b2d22f 100644
            --- a/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/FSCachingGetSpaceUsed.java
            +++ b/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/FSCachingGetSpaceUsed.java
            @@ -39,6 +39,7 @@
            
               public FSCachingGetSpaceUsed(Builder builder) throws IOException {
                 super(builder);
            +    this.setShouldInitRefresh(false);
               }
            

             

            Aiphag0 Mingxiang Li added a comment -

            Hi linyiqun, I get your idea, but there may be a small problem. When FSCachingGetSpaceUsed invokes init() with used < 0, it skips the first refresh(), so used stays 0 until the next refresh() (10 minutes later by default). So I added runImmediately to solve this problem, and to prevent other subclasses from calling refresh() twice on the first du. That is my idea in patch 004.

            linyiqun Yiqun Lin added a comment - - edited

            Comment makes sense to me. I have a minor suggestion for shouldInitRefresh.

            The meaning of shouldInitRefresh should be "should do the init refresh operation", so the default value should be true rather than false. I prefer to change this and add a comment:

               void init() {
                 if (used.get() < 0) {
                   used.set(0);
                    // Skip initial refresh operation, so we need to do first refresh 
                    // operation immediately in refresh thread.
            +      if (!shouldInitRefresh) {
            +        initRefeshThread(true);
            +        return;
            +      }
                   refresh();
                 }
            +    initRefeshThread(false);
            +  }
            

            Can we move this method into CachingGetSpaceUsed? Since the variable lives in CachingGetSpaceUsed, it would be better to put the set operation in this class as well and let subclasses use it.

            +  /**
            +   * Reset that if we need to do the initial refresh.
            +   * @param shouldInitRefresh The flag value to set.
            +   */
            +  protected void setShouldInitRefresh(boolean shouldInitRefresh) {
            +    this.shouldInitRefresh = shouldInitRefresh;
            +  }
            

            Can you update the method name and the comment for this method?

            void initRefeshThread (boolean runImmediately)
            

            to

            // add comment here.
            private void initRefeshThread (boolean runImmediately)
            
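            To make the discussion above concrete, here is a minimal hedged sketch of the init flow being proposed: a subclass opts out of the synchronous first refresh, and the refresh thread then performs one refresh immediately instead. The names (shouldFirstRefresh, initRefreshThread, setUsed) follow the comments above but this is an illustrative sketch, not the committed patch.

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch of the CachingGetSpaceUsed init flow under discussion; hypothetical,
// not the actual Hadoop class.
abstract class SketchCachingGetSpaceUsed {
  private final AtomicLong used = new AtomicLong(-1);
  private boolean shouldFirstRefresh = true;

  void init() {
    if (used.get() < 0) {
      used.set(0);
      if (!shouldFirstRefresh) {
        // Skip the synchronous initial refresh; the refresh thread will
        // perform the first refresh immediately instead.
        initRefreshThread(true);
        return;
      }
      refresh();
    }
    initRefreshThread(false);
  }

  // Start the periodic refresh thread; when runImmediately is true the thread
  // refreshes once before entering its periodic sleep/refresh loop (elided).
  private void initRefreshThread(boolean runImmediately) {
    Thread t = new Thread(() -> {
      if (runImmediately) {
        refresh();
      }
      // ... periodic sleep/refresh loop elided ...
    });
    t.setDaemon(true);
    t.start();
  }

  // Subclasses such as a hypothetical FSCachingGetSpaceUsed would call this
  // in their constructor to defer the first refresh to the refresh thread.
  protected void setShouldFirstRefresh(boolean value) {
    this.shouldFirstRefresh = value;
  }

  protected void setUsed(long value) {
    used.set(value);
  }

  protected abstract void refresh();

  long getUsed() {
    return used.get();
  }
}
```

            With the default shouldFirstRefresh = true, init() refreshes synchronously exactly once, matching the existing CachingGetSpaceUsed behavior.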
            Aiphag0 Mingxiang Li added a comment -

            Hi linyiqun, thank you for your valuable advice. I improved the patch per your comments. In addition, I renamed shouldInitRefresh to shouldFirstRefresh to make it easier to distinguish. HDFS-14986.005.patch

            linyiqun Yiqun Lin added a comment -

            Aiphag0, the change looks good to me. I reran the unit test locally and found it hard to reproduce the CME based on 3 retry attempts. Can you increase the retries from 3 to 10? With 10 attempts I can see the CME error without the lock change.

            int retryTimes = 10;
            

            In addition, there is a resource leak that I missed before. Can you move the close operation into the while loop?

                public void run() {
                  FSDataOutputStream os = null;
                  while (shouldRun) {
                    try {
                      int id = RandomUtils.nextInt();
                      os = fs.create(new Path("/testFsDatasetImplDeepCopyReplica/" + id));
                      byte[] bytes = new byte[2048];
                      InputStream is = new ByteArrayInputStream(bytes);
                      IOUtils.copyBytes(is, os, bytes.length);
                      os.hsync();
                      os.close();   <=== move here
                    } catch (IOException e) {}
                  }
            
                  try {
                    fs.delete(new Path("/testFsDatasetImplDeepCopyReplica"), true);
                  } catch (IOException e) {}
                }
            

             
            Everything else looks good to me.

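            As an aside on the leak fix above: beyond moving os.close() inside the loop, Java 7+ try-with-resources closes the per-iteration stream even when an exception is thrown mid-write. The sketch below shows the pattern with plain java.io streams standing in for the HDFS FileSystem calls; the names and structure are illustrative, not the committed test code.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.function.Supplier;

// Sketch of the per-iteration stream handling discussed above; hypothetical,
// not the actual TestFsDatasetImpl code.
class StreamLoopSketch {
  // Copy a fixed 2048-byte payload into the given stream, mirroring the
  // IOUtils.copyBytes step in the test snippet.
  static void writeOnce(OutputStream os) throws IOException {
    byte[] bytes = new byte[2048];
    try (InputStream is = new ByteArrayInputStream(bytes)) {
      byte[] buf = new byte[512];
      int n;
      while ((n = is.read(buf)) > 0) {
        os.write(buf, 0, n);
      }
    }
  }

  static void run(Supplier<OutputStream> creator, int iterations) {
    for (int i = 0; i < iterations; i++) {
      // try-with-resources guarantees each iteration's stream is closed,
      // which is the leak the review comment points out: with close()
      // outside the loop, only the last stream was ever closed.
      try (OutputStream os = creator.get()) {
        writeOnce(os);
      } catch (IOException e) {
        // ignore, as in the original test snippet
      }
    }
  }
}
```

            In the real test, the same effect is achieved by calling os.close() inside the loop body as suggested; try-with-resources is simply the more idiomatic equivalent.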
            hadoopqa Hadoop QA added a comment -
            -1 overall



            Vote Subsystem Runtime Comment
            0 reexec 32m 7s Docker mode activated.
                  Prechecks
            +1 @author 0m 0s The patch does not contain any @author tags.
            +1 test4tests 0m 0s The patch appears to include 1 new or modified test files.
                  trunk Compile Tests
            0 mvndep 1m 17s Maven dependency ordering for branch
            +1 mvninstall 21m 49s trunk passed
            +1 compile 17m 24s trunk passed
            +1 checkstyle 2m 50s trunk passed
            +1 mvnsite 2m 45s trunk passed
            +1 shadedclient 19m 16s branch has no errors when building and testing our client artifacts.
            +1 findbugs 4m 26s trunk passed
            +1 javadoc 3m 4s trunk passed
                  Patch Compile Tests
            0 mvndep 0m 25s Maven dependency ordering for patch
            +1 mvninstall 2m 10s the patch passed
            +1 compile 20m 9s the patch passed
            +1 javac 20m 9s the patch passed
            -0 checkstyle 3m 17s root: The patch generated 1 new + 117 unchanged - 0 fixed = 118 total (was 117)
            +1 mvnsite 3m 14s the patch passed
            +1 whitespace 0m 0s The patch has no whitespace issues.
            +1 shadedclient 14m 30s patch has no errors when building and testing our client artifacts.
            +1 findbugs 5m 5s the patch passed
            +1 javadoc 3m 7s the patch passed
                  Other Tests
            -1 unit 9m 48s hadoop-common in the patch failed.
            -1 unit 110m 0s hadoop-hdfs in the patch failed.
            +1 asflicense 1m 5s The patch does not generate ASF License warnings.
            275m 52s



            Reason Tests
            Failed junit tests hadoop.security.TestFixKerberosTicketOrder
              hadoop.hdfs.server.namenode.ha.TestFailureToReadEdits



            Subsystem Report/Notes
            Docker Client=19.03.5 Server=19.03.5 Image:yetus/hadoop:104ccca9169
            JIRA Issue HDFS-14986
            JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12986884/HDFS-14986.005.patch
            Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle
            uname Linux 1b4d8f61cf6c 4.15.0-66-generic #75-Ubuntu SMP Tue Oct 1 05:24:09 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
            Build tool maven
            Personality /testptch/patchprocess/precommit/personality/provided.sh
            git revision trunk / c8bef4d
            maven version: Apache Maven 3.3.9
            Default Java 1.8.0_222
            findbugs v3.1.0-RC1
            checkstyle https://builds.apache.org/job/PreCommit-HDFS-Build/28404/artifact/out/diff-checkstyle-root.txt
            unit https://builds.apache.org/job/PreCommit-HDFS-Build/28404/artifact/out/patch-unit-hadoop-common-project_hadoop-common.txt
            unit https://builds.apache.org/job/PreCommit-HDFS-Build/28404/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
            Test Results https://builds.apache.org/job/PreCommit-HDFS-Build/28404/testReport/
            Max. process+thread count 2986 (vs. ulimit of 5500)
            modules C: hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs U: .
            Console output https://builds.apache.org/job/PreCommit-HDFS-Build/28404/console
            Powered by Apache Yetus 0.8.0 http://yetus.apache.org

            This message was automatically generated.

            Aiphag0 Mingxiang Li added a comment -

            Good advice. I changed the retry count to 10 and close the stream inside the while loop. HDFS-14986.006.patch

            hadoopqa Hadoop QA added a comment -
            -1 overall



            Vote Subsystem Runtime Comment
            0 reexec 0m 17s Docker mode activated.
                  Prechecks
            +1 @author 0m 0s The patch does not contain any @author tags.
            +1 test4tests 0m 0s The patch appears to include 1 new or modified test files.
                  trunk Compile Tests
            0 mvndep 1m 4s Maven dependency ordering for branch
            +1 mvninstall 19m 9s trunk passed
            +1 compile 15m 46s trunk passed
            +1 checkstyle 2m 32s trunk passed
            +1 mvnsite 2m 26s trunk passed
            +1 shadedclient 18m 37s branch has no errors when building and testing our client artifacts.
            +1 findbugs 3m 59s trunk passed
            +1 javadoc 2m 36s trunk passed
                  Patch Compile Tests
            0 mvndep 0m 22s Maven dependency ordering for patch
            +1 mvninstall 1m 42s the patch passed
            +1 compile 15m 6s the patch passed
            +1 javac 15m 6s the patch passed
            -0 checkstyle 2m 33s root: The patch generated 1 new + 117 unchanged - 0 fixed = 118 total (was 117)
            +1 mvnsite 2m 21s the patch passed
            +1 whitespace 0m 0s The patch has no whitespace issues.
            +1 shadedclient 12m 45s patch has no errors when building and testing our client artifacts.
            +1 findbugs 4m 16s the patch passed
            +1 javadoc 2m 40s the patch passed
                  Other Tests
            +1 unit 8m 49s hadoop-common in the patch passed.
            -1 unit 84m 30s hadoop-hdfs in the patch failed.
            +1 asflicense 0m 45s The patch does not generate ASF License warnings.
            200m 20s



            Reason Tests
            Failed junit tests hadoop.hdfs.server.namenode.ha.TestHASafeMode



            Subsystem Report/Notes
            Docker Client=19.03.4 Server=19.03.4 Image:yetus/hadoop:104ccca9169
            JIRA Issue HDFS-14986
            JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12986950/HDFS-14986.006.patch
            Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle
            uname Linux 7e4938caed7c 4.15.0-66-generic #75-Ubuntu SMP Tue Oct 1 05:24:09 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
            Build tool maven
            Personality /testptch/patchprocess/precommit/personality/provided.sh
            git revision trunk / 7f2ea2a
            maven version: Apache Maven 3.3.9
            Default Java 1.8.0_222
            findbugs v3.1.0-RC1
            checkstyle https://builds.apache.org/job/PreCommit-HDFS-Build/28411/artifact/out/diff-checkstyle-root.txt
            unit https://builds.apache.org/job/PreCommit-HDFS-Build/28411/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
            Test Results https://builds.apache.org/job/PreCommit-HDFS-Build/28411/testReport/
            Max. process+thread count 3825 (vs. ulimit of 5500)
            modules C: hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs U: .
            Console output https://builds.apache.org/job/PreCommit-HDFS-Build/28411/console
            Powered by Apache Yetus 0.8.0 http://yetus.apache.org

            This message was automatically generated.

            linyiqun Yiqun Lin added a comment -

            LGTM, +1. Committing this.

            linyiqun Yiqun Lin added a comment -

            Committed this to trunk, branch-2 and branch-2.10.
            Thanks Aiphag0 for the contribution and thanks jianliang.wu for reporting this.

            hudson Hudson added a comment -

            SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #17705 (See https://builds.apache.org/job/Hadoop-trunk-Commit/17705/)
            HDFS-14986. ReplicaCachingGetSpaceUsed throws (yqlin: rev 2b452b4e6063072b2bec491edd3f412eb7ac21f3)

            • (edit) hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/FsDatasetSpi.java
            • (edit) hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/CachingGetSpaceUsed.java
            • (edit) hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/TestReplicaCachingGetSpaceUsed.java
            • (edit) hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/FsDatasetImpl.java
            • (edit) hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/ReplicaCachingGetSpaceUsed.java
            Aiphag0 Mingxiang Li added a comment -

            Thanks a lot for the review linyiqun.

            fanghanyun fanghanyun added a comment - - edited

            Hi Aiphag0,

            We are on Hadoop version 2.6.0-cdh5.13.1. When backporting this patch, the following does not compile:

            public Set<? extends Replica> deepCopyReplica(String bpid) throws IOException {
              // Set<? extends Replica> replicas = new HashSet<>(volumeMap.replicas(bpid) == null
              //     ? Collections.EMPTY_SET : volumeMap.replicas(bpid));
              Set<? extends Replica> replicas = null;
              try (AutoCloseableLock lock = datasetLock.acquire()) {
                replicas = new HashSet<>(volumeMap.replicas(bpid) == null
                    ? Collections.EMPTY_SET : volumeMap.replicas(bpid));
              }

            The compiler reports: Cannot solve symbol 'datasetLock'

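            The idea behind the fix discussed above can be shown in a minimal, self-contained sketch. This is not the actual FsDatasetImpl code: the names DeepCopyDemo, datasetLock, and replicas are hypothetical stand-ins (in older branches without AutoCloseableLock, a plain ReentrantLock or synchronized block plays the same role). The point is that snapshotting the set while holding the lock gives the refresh thread a private copy, so concurrent writers can no longer trigger a ConcurrentModificationException mid-iteration.

            ```java
            import java.util.Arrays;
            import java.util.HashSet;
            import java.util.Set;
            import java.util.concurrent.locks.ReentrantLock;

            public class DeepCopyDemo {
                // Stand-ins for FsDatasetImpl's volumeMap contents and dataset lock.
                private final Set<String> replicas =
                        new HashSet<>(Arrays.asList("blk_1", "blk_2"));
                private final ReentrantLock datasetLock = new ReentrantLock();

                // Snapshot the replica set under the lock, so writers cannot
                // mutate it while the copy constructor iterates it.
                public Set<String> deepCopyReplica() {
                    datasetLock.lock();
                    try {
                        return new HashSet<>(replicas);
                    } finally {
                        datasetLock.unlock();
                    }
                }

                public static void main(String[] args) {
                    DeepCopyDemo d = new DeepCopyDemo();
                    Set<String> copy = d.deepCopyReplica();
                    d.replicas.add("blk_3");          // mutate the live set after the copy
                    System.out.println(copy.size());  // the snapshot is unaffected: prints 2
                }
            }
            ```

            The refresh thread then computes disk usage from its snapshot, trading a short lock hold for iteration safety.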
            jhung Jonathan Hung added a comment -

            Removing 2.11.0 fix version after branch-2 -> branch-2.10 rename

            ebadger Eric Badger added a comment -

            weichiu, cherry-picking this patch to branch-3.2 broke compilation. I have reverted it from branch-3.2. I see that you cherry-picked several other patches to branch-3.2 after this one. Please compile to make sure that you don't unintentionally break the build and cause other developers to spend time fixing it. cliang, you also committed a patch to branch-3.2 (albeit in YARN, not HDFS) after this patch had broken compilation. I know it's annoying to compile every little change, but it's pretty frustrating having to track down the patch that broke compilation, revert it, and update the relevant JIRAs.


            People

              Assignee: Aiphag0 Mingxiang Li
              Reporter: jianliang.wu Ryan Wu
              Votes: 0
              Watchers: 12
