Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Affects Version/s: 2.10.0
- Labels: None
- Hadoop Flags: Reviewed
Description
Running DU across lots of disks is very expensive. We applied the patch HDFS-14313 to get used space from ReplicaInfo in memory. However, the new du threads throw the following exception:
2019-11-08 18:07:13,858 ERROR [refreshUsed-/home/vipshop/hard_disk/7/dfs/dn/current/BP-1203969992-XXXX-1450855658517] org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.ReplicaCachingGetSpaceUsed: ReplicaCachingGetSpaceUsed refresh error
java.util.ConcurrentModificationException: Tree has been modified outside of iterator
    at org.apache.hadoop.hdfs.util.FoldedTreeSet$TreeSetIterator.checkForModification(FoldedTreeSet.java:311)
    at org.apache.hadoop.hdfs.util.FoldedTreeSet$TreeSetIterator.hasNext(FoldedTreeSet.java:256)
    at java.util.AbstractCollection.addAll(AbstractCollection.java:343)
    at java.util.HashSet.<init>(HashSet.java:120)
    at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.deepCopyReplica(FsDatasetImpl.java:1052)
    at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.ReplicaCachingGetSpaceUsed.refresh(ReplicaCachingGetSpaceUsed.java:73)
    at org.apache.hadoop.fs.CachingGetSpaceUsed$RefreshThread.run(CachingGetSpaceUsed.java:178)
    at java.lang.Thread.run(Thread.java:748)
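For context on the failure mode: deepCopyReplica copies the per-block-pool replica set with new HashSet<>(...) while writer threads may still mutate it, and FoldedTreeSet's iterator is fail-fast. The following self-contained sketch (plain JDK collections and illustrative names only, not Hadoop code) reproduces the same exception pattern:

import java.util.ConcurrentModificationException;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

public class DeepCopyCmeDemo {
  public static void main(String[] args) throws InterruptedException {
    // Stand-in for the per-block-pool replica set in volumeMap;
    // TreeSet's iterator is fail-fast, like FoldedTreeSet's.
    Set<Integer> replicas = new TreeSet<>();
    for (int i = 0; i < 200_000; i++) {
      replicas.add(i);
    }

    // Stand-in for addBlock/removeBlock mutating the set concurrently.
    Thread writer = new Thread(() -> {
      for (int i = 200_000; i < 400_000; i++) {
        replicas.add(i);
      }
    });
    writer.start();

    try {
      // Stand-in for deepCopyReplica's new HashSet<>(replicas) done without
      // the dataset lock: it iterates while the writer mutates, so the
      // fail-fast iterator may (and here usually does) throw a CME.
      Set<Integer> copy = new HashSet<>(replicas);
      System.out.println("copied " + copy.size() + " replicas");
    } catch (ConcurrentModificationException e) {
      System.out.println("same failure as the refresh thread: " + e);
    }
    writer.join();
  }
}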
Attachments
- HDFS-14986.006.patch
- 9 kB
- Mingxiang Li
- HDFS-14986.005.patch
- 9 kB
- Mingxiang Li
- HDFS-14986.004.patch
- 9 kB
- Mingxiang Li
- HDFS-14986.003.patch
- 8 kB
- Mingxiang Li
- HDFS-14986.002.patch
- 8 kB
- Mingxiang Li
- HDFS-14986.001.patch
- 3 kB
- Mingxiang Li
Issue Links
- is caused by
-
HDFS-14313 Get hdfs used space from FsDatasetImpl#volumeMap#ReplicaInfo in memory instead of df/du
- Resolved
Activity
We met a very similar case (ConcurrentModificationException or deadlock) when backporting HDFS-14313 to branch-2.7, and I believe there is still some room to improve the current implementation.
Aiphag0 may have some more information. Would you like to offer our patch here?
+leosun08 to have a look at this as well.
We met a very similar case (ConcurrentModificationException or deadlock)...
hexiaoqiao, I wonder how you found the deadlock case. I remember leosun08 and I discussed the deadlock problem in HDFS-14313 and concluded it should not happen. It would be great if you could provide the stack info for the deadlock.
In HDFS-14313 we found the same concurrency problem before, based on our Hadoop version, so we solved it by adding the dataset lock in FsDatasetImpl#deepCopyReplica(). But this may cause a deadlock when restarting the DataNode: when a BlockPoolSlice is initialized for the first time, BlockPoolSlice#loadDfsUsed() may return -1, and ReplicaCachingGetSpaceUsed#init() then blocks because FsDatasetImpl#deepCopyReplica() can't acquire the dataset lock, while the thread holding it can't release it at the same time.
"DataNode: [....... #283 daemon prio=5 os_prio=0 tid=0x00007fd9826a7800 nid=0x7463 in Object.wait() [0x00007fd949616000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x00000006b375a1d0> (a org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeList$2)
at java.lang.Thread.join(Thread.java:1249)
- locked <0x00000006b375a1d0> (a org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeList$2)
at java.lang.Thread.join(Thread.java:1323)
at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeList.addBlockPool(FsVolumeList.java:423)
at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.addBlockPool(FsDatasetImpl.java:2509)
- locked <0x00000006b33a3b10> (a org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl)
at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1388)
at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:311)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:232)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:720)
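The shape of this hang is a lock-plus-join cycle: addBlockPool holds the dataset lock and joins a helper thread, while that helper needs the same lock inside deepCopyReplica. Below is a minimal, self-contained sketch of that structure (illustrative names, not Hadoop code; a join timeout keeps the demo from hanging forever):

import java.util.concurrent.locks.ReentrantLock;

public class AddBlockPoolDeadlockDemo {
  public static void main(String[] args) throws InterruptedException {
    ReentrantLock datasetLock = new ReentrantLock(); // illustrative stand-in

    // Stand-in for the thread running ReplicaCachingGetSpaceUsed#init(),
    // whose deepCopyReplica call needs the dataset lock.
    Thread initThread = new Thread(() -> {
      datasetLock.lock();
      try {
        // refresh()/deepCopyReplica work would happen here
      } finally {
        datasetLock.unlock();
      }
    });

    datasetLock.lock(); // stand-in for addBlockPool taking the dataset lock
    try {
      initThread.start();
      initThread.join(2000); // addBlockPool waits on the helper thread...
      if (initThread.isAlive()) {
        // ...which can never finish while the lock is held: the deadlock.
        System.out.println("deadlock: init thread blocked on datasetLock");
      }
    } finally {
      datasetLock.unlock(); // releasing the lock lets the demo exit cleanly
    }
    initThread.join();
  }
}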
linyiqun, we made some modifications when backporting HDFS-14313, since there are significant differences between branch-2.7 and trunk.
As Aiphag0 said above, without synchronization it throws a CME, but adding synchronization causes a deadlock when restarting the DataNode. The root cause is that BlockPoolSlice#loadDfsUsed() may return -1, and ReplicaCachingGetSpaceUsed#init() then blocks at startup, as mentioned above.
FsDatasetImpl#deepCopyReplica was synchronized before branch-2.8; HDFS-10682 (Replace FsDatasetImpl object lock with a separate lock object) changed that in branch-2.8.
public Set<? extends Replica> deepCopyReplica(String bpid) { ... }
We use a branch before 2.8 with the synchronized method, and we fixed the deadlock problem there. I think it's better to add the dataset lock in FsDatasetImpl#deepCopyReplica in trunk as well; the method to solve the deadlock problem is the same.
If we add the dataset lock in FsDatasetImpl#deepCopyReplica in trunk, a deadlock will happen.
I don't use FsDatasetImpl#datasetLock, because FsDatasetImpl#addBlockPool, while holding the datasetLock, calls FsDatasetImpl#deepCopyReplica in another thread when the DataNode starts.
I mean I can fix the deadlock problem while adding the dataset lock; my first comment describes the solution.
Here is the patch for trunk (HDFS-14986.001.patch). Can leosun08, hexiaoqiao, linyiqun review this one? Thanks very much.
Thanks for providing the patch, Aiphag0. I like the fix approach. Only minor comments:
RefreshThread(CachingGetSpaceUsed spaceUsed,boolean runImmediately)
Can you add a space after the comma in spaceUsed,boolean?
Can you add the comment for the method deepCopyReplica in FsDatasetSpi?
/**
 * Deep copy the replica info belonging to given block pool.
 * @param bpid Specified block pool id.
 * @return A set of replica info.
 * @throws IOException
 */
Set<? extends Replica> deepCopyReplica(String bpid) throws IOException;
And then remove the following outdated comment in FsDatasetImpl?
/**
 * The deepCopyReplica call doesn't use the datasetock since it will lead the
 * potential deadlock with the {@link FsVolumeList#addBlockPool} call.
 */
Additionally, can you add a unit test for this? It is a good habit to provide a corresponding unit test case whenever we fix a bug.
Hi linyiqun, I updated the patch (HDFS-14986.002.patch). Can you review it again?
-1 overall |
Vote | Subsystem | Runtime | Comment |
---|---|---|---|
0 | reexec | 0m 54s | Docker mode activated. |
Prechecks | |||
+1 | @author | 0m 0s | The patch does not contain any @author tags. |
+1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
trunk Compile Tests | |||
0 | mvndep | 0m 26s | Maven dependency ordering for branch |
+1 | mvninstall | 20m 19s | trunk passed |
+1 | compile | 20m 35s | trunk passed |
+1 | checkstyle | 3m 3s | trunk passed |
+1 | mvnsite | 3m 6s | trunk passed |
+1 | shadedclient | 21m 50s | branch has no errors when building and testing our client artifacts. |
+1 | findbugs | 4m 45s | trunk passed |
+1 | javadoc | 3m 11s | trunk passed |
Patch Compile Tests | |||
0 | mvndep | 0m 28s | Maven dependency ordering for patch |
+1 | mvninstall | 2m 0s | the patch passed |
+1 | compile | 20m 31s | the patch passed |
+1 | javac | 20m 31s | the patch passed |
-0 | checkstyle | 3m 7s | root: The patch generated 1 new + 117 unchanged - 0 fixed = 118 total (was 117) |
+1 | mvnsite | 3m 16s | the patch passed |
+1 | whitespace | 0m 0s | The patch has no whitespace issues. |
+1 | shadedclient | 15m 28s | patch has no errors when building and testing our client artifacts. |
+1 | findbugs | 5m 15s | the patch passed |
+1 | javadoc | 3m 9s | the patch passed |
Other Tests | |||
-1 | unit | 8m 44s | hadoop-common in the patch failed. |
-1 | unit | 106m 30s | hadoop-hdfs in the patch failed. |
+1 | asflicense | 0m 45s | The patch does not generate ASF License warnings. |
244m 52s |
Reason | Tests |
---|---|
Failed junit tests | hadoop.fs.TestDFCachingGetSpaceUsed
 | hadoop.security.TestFixKerberosTicketOrder
 | hadoop.fs.TestDU
 | hadoop.hdfs.tools.offlineEditsViewer.TestOfflineEditsViewer
 | hadoop.hdfs.server.balancer.TestBalancer
Subsystem | Report/Notes |
---|---|
Docker | Client=19.03.5 Server=19.03.5 Image:yetus/hadoop:104ccca9169 |
JIRA Issue | HDFS-14986
JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12986219/HDFS-14986.002.patch |
Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle |
uname | Linux 70d35e564673 4.15.0-66-generic #75-Ubuntu SMP Tue Oct 1 05:24:09 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
Build tool | maven |
Personality | /testptch/patchprocess/precommit/personality/provided.sh |
git revision | trunk / 9fbfe6c |
maven | version: Apache Maven 3.3.9 |
Default Java | 1.8.0_222 |
findbugs | v3.1.0-RC1 |
checkstyle | https://builds.apache.org/job/PreCommit-HDFS-Build/28341/artifact/out/diff-checkstyle-root.txt |
unit | https://builds.apache.org/job/PreCommit-HDFS-Build/28341/artifact/out/patch-unit-hadoop-common-project_hadoop-common.txt |
unit | https://builds.apache.org/job/PreCommit-HDFS-Build/28341/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt |
Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/28341/testReport/ |
Max. process+thread count | 2839 (vs. ulimit of 5500) |
modules | C: hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs U: . |
Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/28341/console |
Powered by | Apache Yetus 0.8.0 http://yetus.apache.org |
This message was automatically generated.
Hi Aiphag0, the patch almost looks good to me. Only some minor comments for the unit test: I think it's enough that we just verify that the deepCopyReplica call won't throw a CME.
So the while loop in the unit test can be simplified to this:
int retryTimes = 3;
while (retryTimes > 0) {
  try {
    Set<? extends Replica> replicas = spyDataset.deepCopyReplica(bpid);
    if (replicas.size() > 0) {
      retryTimes--;
      // DeepCopyReplica should not throw ConcurrentModificationException
      modifyThread.setShouldRun(false);
    }
  } catch (IOException e) {
    modifyThread.setShouldRun(false);
    Assert.fail("Encounter IOException when deep copy replica.");
  }
}
modifyThread.setShouldRun(false);
In addition, we need to close the fs output stream and delete the created test files in ModifyThread. Can you update it as follows?
public void run() {
  FSDataOutputStream os = null;
  while (shouldRun) {
    try {
      int id = RandomUtils.nextInt();
      os = fs.create(new Path("/testFsDatasetImplDeepCopyReplica/" + id));
      byte[] bytes = new byte[2048];
      InputStream is = new ByteArrayInputStream(bytes);
      IOUtils.copyBytes(is, os, bytes.length);
      os.hsync();
    } catch (IOException e) {
      // ignored exception
    }
  }

  try {
    if (os != null) {
      os.close();
    }
    fs.delete(new Path("/testFsDatasetImplDeepCopyReplica"), true);
  } catch (IOException e) {
  }
}
Also, please check whether the following two failed unit tests are related:
- TestDU
- TestDFCachingGetSpaceUsed
Hi linyiqun, thank you for your valuable advice. In the previous patch I modified CachingGetSpaceUsed#init(), but this influenced subclasses of CachingGetSpaceUsed such as DU. So I added a filter, and now the related unit tests pass. HDFS-14986.003.patch
Hi jianliang.wu, I just assigned this JIRA to myself; please feel free to assign it back if you would also like to work on it.
Hi Aiphago, thanks for providing the patch; it fixes my case too. I will backport it.
Hi Aiphag0,
+ // Prevent ReplicaCachingGetSpaceUsed dead lock at FsDatasetImpl#deepCopyReplica
+ if (this.getClass().getSimpleName().equals("ReplicaCachingGetSpaceUsed")) {
+   initRefeshThread(true);
+   return;
+ }
Judging the subclass from the parent class is a little tricky; we can add a new flag and reset it in the subclass. I found a way to remove runImmediately from RefreshThread and add shouldInitRefresh = true in CachingGetSpaceUsed. By default, we will do the init refresh.
if (used.get() < 0) {
  used.set(0);
  if (shouldInitRefresh) {
    refresh();
  }
}
And we need to define a protected method so that this behavior is overridden only in FSCachingGetSpaceUsed.
/**
 * Reset that if we need to do the initial refresh.
 * @param shouldInitRefresh The flag value to set.
 */
protected void setShouldInitRefresh(boolean shouldInitRefresh) {
  this.shouldInitRefresh = shouldInitRefresh;
}

public FSCachingGetSpaceUsed(Builder builder) throws IOException {
  super(builder);
  this.setShouldInitRefresh(false);
}
-1 overall |
Vote | Subsystem | Runtime | Comment |
---|---|---|---|
0 | reexec | 0m 44s | Docker mode activated. |
Prechecks | |||
+1 | @author | 0m 0s | The patch does not contain any @author tags. |
+1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
trunk Compile Tests | |||
0 | mvndep | 0m 59s | Maven dependency ordering for branch |
+1 | mvninstall | 19m 18s | trunk passed |
+1 | compile | 16m 8s | trunk passed |
+1 | checkstyle | 2m 34s | trunk passed |
+1 | mvnsite | 2m 28s | trunk passed |
+1 | shadedclient | 18m 36s | branch has no errors when building and testing our client artifacts. |
+1 | findbugs | 4m 0s | trunk passed |
+1 | javadoc | 2m 36s | trunk passed |
Patch Compile Tests | |||
0 | mvndep | 0m 20s | Maven dependency ordering for patch |
+1 | mvninstall | 1m 44s | the patch passed |
+1 | compile | 15m 27s | the patch passed |
+1 | javac | 15m 27s | the patch passed |
-0 | checkstyle | 2m 34s | root: The patch generated 3 new + 117 unchanged - 0 fixed = 120 total (was 117) |
+1 | mvnsite | 2m 20s | the patch passed |
+1 | whitespace | 0m 0s | The patch has no whitespace issues. |
+1 | shadedclient | 12m 49s | patch has no errors when building and testing our client artifacts. |
+1 | findbugs | 4m 19s | the patch passed |
+1 | javadoc | 2m 38s | the patch passed |
Other Tests | |||
-1 | unit | 8m 30s | hadoop-common in the patch failed. |
-1 | unit | 106m 55s | hadoop-hdfs in the patch failed. |
+1 | asflicense | 0m 56s | The patch does not generate ASF License warnings. |
224m 0s |
Reason | Tests |
---|---|
Failed junit tests | hadoop.security.TestFixKerberosTicketOrder
 | hadoop.security.TestRaceWhenRelogin
 | hadoop.hdfs.TestMultipleNNPortQOP
 | hadoop.hdfs.server.namenode.TestPersistentStoragePolicySatisfier
Subsystem | Report/Notes |
---|---|
Docker | Client=19.03.5 Server=19.03.5 Image:yetus/hadoop:104ccca9169 |
JIRA Issue | HDFS-14986
JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12986378/HDFS-14986.003.patch |
Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle |
uname | Linux 7a09c1a83624 4.15.0-66-generic #75-Ubuntu SMP Tue Oct 1 05:24:09 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
Build tool | maven |
Personality | /testptch/patchprocess/precommit/personality/provided.sh |
git revision | trunk / de38045 |
maven | version: Apache Maven 3.3.9 |
Default Java | 1.8.0_222 |
findbugs | v3.1.0-RC1 |
checkstyle | https://builds.apache.org/job/PreCommit-HDFS-Build/28359/artifact/out/diff-checkstyle-root.txt |
unit | https://builds.apache.org/job/PreCommit-HDFS-Build/28359/artifact/out/patch-unit-hadoop-common-project_hadoop-common.txt |
unit | https://builds.apache.org/job/PreCommit-HDFS-Build/28359/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt |
Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/28359/testReport/ |
Max. process+thread count | 2706 (vs. ulimit of 5500) |
modules | C: hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs U: . |
Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/28359/console |
Powered by | Apache Yetus 0.8.0 http://yetus.apache.org |
This message was automatically generated.
Hi linyiqun, thank you for your valuable advice. Now setShouldInitRefresh() is only invoked in ReplicaCachingGetSpaceUsed. HDFS-14986.004.patch
-1 overall |
Vote | Subsystem | Runtime | Comment |
---|---|---|---|
0 | reexec | 1m 11s | Docker mode activated. |
Prechecks | |||
+1 | @author | 0m 0s | The patch does not contain any @author tags. |
+1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
trunk Compile Tests | |||
0 | mvndep | 0m 25s | Maven dependency ordering for branch |
+1 | mvninstall | 23m 16s | trunk passed |
+1 | compile | 22m 15s | trunk passed |
+1 | checkstyle | 3m 21s | trunk passed |
+1 | mvnsite | 3m 13s | trunk passed |
+1 | shadedclient | 20m 36s | branch has no errors when building and testing our client artifacts. |
+1 | findbugs | 4m 2s | trunk passed |
+1 | javadoc | 2m 42s | trunk passed |
Patch Compile Tests | |||
0 | mvndep | 0m 23s | Maven dependency ordering for patch |
+1 | mvninstall | 1m 48s | the patch passed |
+1 | compile | 15m 38s | the patch passed |
+1 | javac | 15m 38s | the patch passed |
-0 | checkstyle | 2m 37s | root: The patch generated 2 new + 118 unchanged - 0 fixed = 120 total (was 118) |
+1 | mvnsite | 2m 24s | the patch passed |
+1 | whitespace | 0m 0s | The patch has no whitespace issues. |
+1 | shadedclient | 12m 53s | patch has no errors when building and testing our client artifacts. |
+1 | findbugs | 4m 18s | the patch passed |
+1 | javadoc | 2m 36s | the patch passed |
Other Tests | |||
-1 | unit | 8m 31s | hadoop-common in the patch failed. |
-1 | unit | 104m 14s | hadoop-hdfs in the patch failed. |
+1 | asflicense | 0m 44s | The patch does not generate ASF License warnings. |
233m 40s |
Reason | Tests |
---|---|
Failed junit tests | hadoop.security.TestFixKerberosTicketOrder
 | hadoop.security.TestRaceWhenRelogin
 | hadoop.hdfs.tools.TestDFSZKFailoverController
Subsystem | Report/Notes |
---|---|
Docker | Client=19.03.5 Server=19.03.5 Image:yetus/hadoop:104ccca9169 |
JIRA Issue | HDFS-14986
JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12986493/HDFS-14986.004.patch |
Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle |
uname | Linux ae856fa84a00 4.15.0-66-generic #75-Ubuntu SMP Tue Oct 1 05:24:09 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
Build tool | maven |
Personality | /testptch/patchprocess/precommit/personality/provided.sh |
git revision | trunk / b25e94c |
maven | version: Apache Maven 3.3.9 |
Default Java | 1.8.0_222 |
findbugs | v3.1.0-RC1 |
checkstyle | https://builds.apache.org/job/PreCommit-HDFS-Build/28374/artifact/out/diff-checkstyle-root.txt |
unit | https://builds.apache.org/job/PreCommit-HDFS-Build/28374/artifact/out/patch-unit-hadoop-common-project_hadoop-common.txt |
unit | https://builds.apache.org/job/PreCommit-HDFS-Build/28374/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt |
Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/28374/testReport/ |
Max. process+thread count | 2809 (vs. ulimit of 5500) |
modules | C: hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs U: . |
Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/28374/console |
Powered by | Apache Yetus 0.8.0 http://yetus.apache.org |
This message was automatically generated.
Hi Aiphag0, the latest patch can still be simplified. Actually we don't need to add runImmediately for RefreshThread. The only thing we need to do is skip the init refresh in FSCachingGetSpaceUsed. Here, init refresh just means the first refresh operation. Hope you get my idea.
So the change can be simplified to this:
--- a/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/CachingGetSpaceUsed.java
+++ b/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/CachingGetSpaceUsed.java
@@ -47,6 +47,7 @@
   private final long jitter;
   private final String dirPath;
   private Thread refreshUsed;
+  private boolean shouldInitRefresh;

   /**
    * This is the constructor used by the builder.
@@ -79,12 +80,15 @@ public CachingGetSpaceUsed(CachingGetSpaceUsed.Builder builder)
     this.refreshInterval = interval;
     this.jitter = jitter;
     this.used.set(initialUsed);
+    this.shouldInitRefresh = true;
   }

   void init() {
     if (used.get() < 0) {
       used.set(0);
-      refresh();
+      if (shouldInitRefresh) {
+        refresh();
+      }
     }

     if (refreshInterval > 0) {
@@ -145,6 +149,14 @@ protected void setUsed(long usedValue) {
     this.used.set(usedValue);
   }

+  /**
+   * Reset that if we need to do the initial refresh.
+   * @param shouldInitRefresh The flag value to set.
+   */
+  protected void setShouldInitRefresh(boolean shouldInitRefresh) {
+    this.shouldInitRefresh = shouldInitRefresh;
+  }
+
   @Override
   public void close() throws IOException {
     running.set(false);
diff --git a/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/FSCachingGetSpaceUsed.java b/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/FSCachingGetSpaceUsed.java
index a5f350860f3..0a922b2d22f 100644
--- a/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/FSCachingGetSpaceUsed.java
+++ b/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/FSCachingGetSpaceUsed.java
@@ -39,6 +39,7 @@
   public FSCachingGetSpaceUsed(Builder builder) throws IOException {
     super(builder);
+    this.setShouldInitRefresh(false);
   }
Hi linyiqun, I get your idea, but this may have a small problem. When FSCachingGetSpaceUsed invokes init() with used < 0, it skips the first refresh(). This leaves used = 0 until the next refresh(), which by default happens 10 minutes later. So I added runImmediately to solve this problem, and also to prevent other subclasses from calling refresh() twice at the first du. This is my idea in patch 004.
The comment makes sense to me. I have a minor suggestion for shouldInitRefresh.
The meaning of shouldInitRefresh should be 'should do the init refresh operation', so the default value should be true rather than false. So I prefer to change this and add a comment:
void init() {
  if (used.get() < 0) {
    used.set(0);
    // Skip initial refresh operation, so we need to do first refresh
    // operation immediately in refresh thread.
+   if (!shouldInitRefresh) {
+     initRefeshThread(true);
+     return;
+   }
    refresh();
  }
+ initRefeshThread(false);
}
Can we move this method into CachingGetSpaceUsed? The variable lives in CachingGetSpaceUsed, so it would be better to also put the set operation in that class and let subclasses use it.
+ /**
+  * Reset that if we need to do the initial refresh.
+  * @param shouldInitRefresh The flag value to set.
+  */
+ protected void setShouldInitRefresh(boolean shouldInitRefresh) {
+   this.shouldInitRefresh = shouldInitRefresh;
+ }
Can you update the method signature and add a comment for this method? From:

void initRefeshThread(boolean runImmediately)

to:

// Add comment here.
private void initRefeshThread(boolean runImmediately)
Hi linyiqun, thank you for your valuable advice. I improved the patch per your comments. In addition, I renamed shouldInitRefresh to shouldFirstRefresh to make it easier to distinguish. HDFS-14986.005.patch
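To summarize where the review landed, here is a consolidated sketch of the init() flow (paraphrased from the comments above, using the renamed shouldFirstRefresh flag; a sketch, not a verbatim excerpt of the patch):

void init() {
  if (used.get() < 0) {
    used.set(0);
    // FS-backed implementations skip the inline refresh() here, since it
    // would need the dataset lock during startup; the refresh thread
    // performs the first refresh immediately instead.
    if (!shouldFirstRefresh) {
      initRefeshThread(true);
      return;
    }
    refresh();
  }
  initRefeshThread(false);
}

This keeps DU and the other subclasses refreshing once in init() as before, while ReplicaCachingGetSpaceUsed defers its first refresh to the refresh thread and so avoids the startup deadlock.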
Aiphag0, the change looks good to me. I reran the unit test locally and found it's hard to reproduce the CME based on 3 retry attempts. Can you increase the count from 3 to 10? With 10 attempts I can see the CME error when the lock change is not applied.
int retryTimes = 10;
In addition, there is a resource leak that I missed before. Can you move the close operation into the while loop?
public void run() {
  FSDataOutputStream os = null;
  while (shouldRun) {
    try {
      int id = RandomUtils.nextInt();
      os = fs.create(new Path("/testFsDatasetImplDeepCopyReplica/" + id));
      byte[] bytes = new byte[2048];
      InputStream is = new ByteArrayInputStream(bytes);
      IOUtils.copyBytes(is, os, bytes.length);
      os.hsync();
      os.close();  // <=== move here
    } catch (IOException e) {
    }
  }

  try {
    fs.delete(new Path("/testFsDatasetImplDeepCopyReplica"), true);
  } catch (IOException e) {
  }
}
Everything else looks good to me.
-1 overall |
Vote | Subsystem | Runtime | Comment |
---|---|---|---|
0 | reexec | 32m 7s | Docker mode activated. |
Prechecks | |||
+1 | @author | 0m 0s | The patch does not contain any @author tags. |
+1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
trunk Compile Tests | |||
0 | mvndep | 1m 17s | Maven dependency ordering for branch |
+1 | mvninstall | 21m 49s | trunk passed |
+1 | compile | 17m 24s | trunk passed |
+1 | checkstyle | 2m 50s | trunk passed |
+1 | mvnsite | 2m 45s | trunk passed |
+1 | shadedclient | 19m 16s | branch has no errors when building and testing our client artifacts. |
+1 | findbugs | 4m 26s | trunk passed |
+1 | javadoc | 3m 4s | trunk passed |
Patch Compile Tests | |||
0 | mvndep | 0m 25s | Maven dependency ordering for patch |
+1 | mvninstall | 2m 10s | the patch passed |
+1 | compile | 20m 9s | the patch passed |
+1 | javac | 20m 9s | the patch passed |
-0 | checkstyle | 3m 17s | root: The patch generated 1 new + 117 unchanged - 0 fixed = 118 total (was 117) |
+1 | mvnsite | 3m 14s | the patch passed |
+1 | whitespace | 0m 0s | The patch has no whitespace issues. |
+1 | shadedclient | 14m 30s | patch has no errors when building and testing our client artifacts. |
+1 | findbugs | 5m 5s | the patch passed |
+1 | javadoc | 3m 7s | the patch passed |
Other Tests | |||
-1 | unit | 9m 48s | hadoop-common in the patch failed. |
-1 | unit | 110m 0s | hadoop-hdfs in the patch failed. |
+1 | asflicense | 1m 5s | The patch does not generate ASF License warnings. |
275m 52s |
Reason | Tests |
---|---|
Failed junit tests | hadoop.security.TestFixKerberosTicketOrder
 | hadoop.hdfs.server.namenode.ha.TestFailureToReadEdits
Subsystem | Report/Notes |
---|---|
Docker | Client=19.03.5 Server=19.03.5 Image:yetus/hadoop:104ccca9169 |
JIRA Issue | HDFS-14986
JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12986884/HDFS-14986.005.patch |
Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle |
uname | Linux 1b4d8f61cf6c 4.15.0-66-generic #75-Ubuntu SMP Tue Oct 1 05:24:09 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
Build tool | maven |
Personality | /testptch/patchprocess/precommit/personality/provided.sh |
git revision | trunk / c8bef4d |
maven | version: Apache Maven 3.3.9 |
Default Java | 1.8.0_222 |
findbugs | v3.1.0-RC1 |
checkstyle | https://builds.apache.org/job/PreCommit-HDFS-Build/28404/artifact/out/diff-checkstyle-root.txt |
unit | https://builds.apache.org/job/PreCommit-HDFS-Build/28404/artifact/out/patch-unit-hadoop-common-project_hadoop-common.txt |
unit | https://builds.apache.org/job/PreCommit-HDFS-Build/28404/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt |
Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/28404/testReport/ |
Max. process+thread count | 2986 (vs. ulimit of 5500) |
modules | C: hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs U: . |
Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/28404/console |
Powered by | Apache Yetus 0.8.0 http://yetus.apache.org |
This message was automatically generated.
Good advice. I changed the retry times to 10 and close the stream in the while loop. HDFS-14986.006.patch
-1 overall |
Vote | Subsystem | Runtime | Comment |
---|---|---|---|
0 | reexec | 0m 17s | Docker mode activated. |
Prechecks | |||
+1 | @author | 0m 0s | The patch does not contain any @author tags. |
+1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
trunk Compile Tests | |||
0 | mvndep | 1m 4s | Maven dependency ordering for branch |
+1 | mvninstall | 19m 9s | trunk passed |
+1 | compile | 15m 46s | trunk passed |
+1 | checkstyle | 2m 32s | trunk passed |
+1 | mvnsite | 2m 26s | trunk passed |
+1 | shadedclient | 18m 37s | branch has no errors when building and testing our client artifacts. |
+1 | findbugs | 3m 59s | trunk passed |
+1 | javadoc | 2m 36s | trunk passed |
Patch Compile Tests | |||
0 | mvndep | 0m 22s | Maven dependency ordering for patch |
+1 | mvninstall | 1m 42s | the patch passed |
+1 | compile | 15m 6s | the patch passed |
+1 | javac | 15m 6s | the patch passed |
-0 | checkstyle | 2m 33s | root: The patch generated 1 new + 117 unchanged - 0 fixed = 118 total (was 117) |
+1 | mvnsite | 2m 21s | the patch passed |
+1 | whitespace | 0m 0s | The patch has no whitespace issues. |
+1 | shadedclient | 12m 45s | patch has no errors when building and testing our client artifacts. |
+1 | findbugs | 4m 16s | the patch passed |
+1 | javadoc | 2m 40s | the patch passed |
Other Tests | |||
+1 | unit | 8m 49s | hadoop-common in the patch passed. |
-1 | unit | 84m 30s | hadoop-hdfs in the patch failed. |
+1 | asflicense | 0m 45s | The patch does not generate ASF License warnings. |
200m 20s |
Reason | Tests |
---|---|
Failed junit tests | hadoop.hdfs.server.namenode.ha.TestHASafeMode |
Subsystem | Report/Notes |
---|---|
Docker | Client=19.03.4 Server=19.03.4 Image:yetus/hadoop:104ccca9169 |
JIRA Issue | HDFS-14986
JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12986950/HDFS-14986.006.patch |
Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle |
uname | Linux 7e4938caed7c 4.15.0-66-generic #75-Ubuntu SMP Tue Oct 1 05:24:09 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
Build tool | maven |
Personality | /testptch/patchprocess/precommit/personality/provided.sh |
git revision | trunk / 7f2ea2a |
maven | version: Apache Maven 3.3.9 |
Default Java | 1.8.0_222 |
findbugs | v3.1.0-RC1 |
checkstyle | https://builds.apache.org/job/PreCommit-HDFS-Build/28411/artifact/out/diff-checkstyle-root.txt |
unit | https://builds.apache.org/job/PreCommit-HDFS-Build/28411/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt |
Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/28411/testReport/ |
Max. process+thread count | 3825 (vs. ulimit of 5500) |
modules | C: hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs U: . |
Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/28411/console |
Powered by | Apache Yetus 0.8.0 http://yetus.apache.org |
This message was automatically generated.
Committed this to trunk, branch-2 and branch-2.10.
Thanks Aiphag0 for the contribution and thanks jianliang.wu for reporting this.
SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #17705 (See https://builds.apache.org/job/Hadoop-trunk-Commit/17705/)
HDFS-14986. ReplicaCachingGetSpaceUsed throws ConcurrentModificationException (yqlin: rev 2b452b4e6063072b2bec491edd3f412eb7ac21f3)
- (edit) hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/FsDatasetSpi.java
- (edit) hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/CachingGetSpaceUsed.java
- (edit) hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/TestReplicaCachingGetSpaceUsed.java
- (edit) hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/FsDatasetImpl.java
- (edit) hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/ReplicaCachingGetSpaceUsed.java
Hi Aiphago,
Hadoop version: 2.6.0-cdh5.13.1
public Set<? extends Replica> deepCopyReplica(String bpid) throws IOException {
  // Set<? extends Replica> replicas = new HashSet<>(volumeMap.replicas(bpid) == null
  //     ? Collections.EMPTY_SET : volumeMap.replicas(bpid));
  Set<? extends Replica> replicas = null;
  try (AutoCloseableLock lock = datasetLock.acquire())
Compilation fails with: Cannot resolve symbol 'datasetLock'
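Since 2.6.0-cdh5.13.1 predates the HDFS-10682 lock refactor, the datasetLock field doesn't exist on that branch. As discussed earlier in this thread, a sketch of the equivalent guard there would use the FsDatasetImpl object lock instead (illustrative only, not a tested backport; the first-refresh deferral from this patch is still needed to avoid the startup deadlock):

@Override
public synchronized Set<? extends Replica> deepCopyReplica(String bpid)
    throws IOException {
  // synchronized stands in for the dataset lock on pre-HDFS-10682 branches.
  Set<? extends Replica> replicas =
      new HashSet<>(volumeMap.replicas(bpid) == null
          ? Collections.EMPTY_SET : volumeMap.replicas(bpid));
  return replicas;
}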
weichiu, cherry-picking this patch to branch-3.2 broke compilation. I have reverted it from branch-3.2. I see that you cherry-picked several other patches to branch-3.2 after this one. Please compile to make sure that you don't unintentionally break the build and cause other developers to spend time fixing it. cliang, you also committed a patch to branch-3.2 (albeit in YARN, not HDFS) after this patch had broken compilation. I know it's annoying to compile every little change, but it's pretty frustrating having to track down the patch that broke compilation, revert it, and update the relevant JIRAs.
The deepCopyReplica method gets the FoldedTreeSet<ReplicaInfo> from the ReplicaMap; at the same time, the FoldedTreeSet<ReplicaInfo> can be changed by addBlock or removeBlock, and this is when the ConcurrentModificationException appears.
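A hedged sketch of the locked copy along the lines of this fix (paraphrased from the discussion above; not a verbatim excerpt of the committed patch). Taking the copy while holding the dataset lock means addBlock/removeBlock cannot invalidate the iterator mid-copy:

@Override
public Set<? extends Replica> deepCopyReplica(String bpid) throws IOException {
  try (AutoCloseableLock lock = datasetLock.acquire()) {
    // Under the dataset lock the FoldedTreeSet cannot be modified while
    // the HashSet constructor iterates it, so no CME can occur.
    Set<? extends Replica> replicas =
        new HashSet<>(volumeMap.replicas(bpid) == null
            ? Collections.EMPTY_SET : volumeMap.replicas(bpid));
    return replicas;
  }
}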