Details
-
Bug
-
Status: Resolved
-
Major
-
Resolution: Fixed
-
1.0.0
Description
I am using the Ozone APIs to create containers, and the run occasionally aborts due to a data race in accessing the RDBMetrics instance:
2021-01-09 02:39:36,944 [pool-1-thread-4] INFO keyvalue.KeyValueContainer: Container 318054 is closed with bcsId 0.
2021-01-09 02:39:36,988 [pool-1-thread-17] ERROR freon.BaseFreonGenerator: Error on executing task 318048
com.google.common.util.concurrent.UncheckedExecutionException: org.apache.hadoop.metrics2.MetricsException: Metrics source RDBMetrics already exists!
    at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2051)
    at com.google.common.cache.LocalCache.get(LocalCache.java:3951)
    at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3974)
    at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4958)
    at org.apache.hadoop.ozone.freon.ContainerGenerator.lambda$writeContainer$1(ContainerGenerator.java:489)
    at com.codahale.metrics.Timer.time(Timer.java:101)
    at org.apache.hadoop.ozone.freon.ContainerGenerator.writeContainer(ContainerGenerator.java:485)
    at org.apache.hadoop.ozone.freon.BaseFreonGenerator.tryNextTask(BaseFreonGenerator.java:189)
    at org.apache.hadoop.ozone.freon.BaseFreonGenerator.taskLoop(BaseFreonGenerator.java:169)
    at org.apache.hadoop.ozone.freon.BaseFreonGenerator.lambda$startTaskRunners$0(BaseFreonGenerator.java:152)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: org.apache.hadoop.metrics2.MetricsException: Metrics source RDBMetrics already exists!
    at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.newSourceName(DefaultMetricsSystem.java:152)
    at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.sourceName(DefaultMetricsSystem.java:125)
    at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.register(MetricsSystemImpl.java:229)
    at org.apache.hadoop.hdds.utils.db.RDBMetrics.create(RDBMetrics.java:47)
    at org.apache.hadoop.hdds.utils.db.RDBStore.<init>(RDBStore.java:152)
    at org.apache.hadoop.hdds.utils.db.DBStoreBuilder.build(DBStoreBuilder.java:191)
    at org.apache.hadoop.ozone.container.metadata.AbstractDatanodeStore.start(AbstractDatanodeStore.java:128)
    at org.apache.hadoop.ozone.container.metadata.AbstractDatanodeStore.<init>(AbstractDatanodeStore.java:103)
    at org.apache.hadoop.ozone.container.metadata.DatanodeStoreSchemaTwoImpl.<init>(DatanodeStoreSchemaTwoImpl.java:48)
    at org.apache.hadoop.ozone.container.keyvalue.helpers.KeyValueContainerUtil.createContainerMetaData(KeyValueContainerUtil.java:112)
    at org.apache.hadoop.ozone.container.keyvalue.KeyValueContainer.create(KeyValueContainer.java:133)
    at org.apache.hadoop.ozone.freon.ContainerGenerator.createContainer(ContainerGenerator.java:463)
    at org.apache.hadoop.ozone.freon.ContainerGenerator.access$100(ContainerGenerator.java:109)
    at org.apache.hadoop.ozone.freon.ContainerGenerator$ContainerCreator.load(ContainerGenerator.java:357)
    at org.apache.hadoop.ozone.freon.ContainerGenerator$ContainerCreator.load(ContainerGenerator.java:353)
    at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3529)
    at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2278)
Looking at the code, I believe RDBMetrics#unRegister() should be made synchronized. Otherwise, concurrently creating and closing RDBStore objects can race on the shared RDBMetrics instance.
After making RDBMetrics#unRegister() synchronized, the tool no longer aborts due to the race.
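For illustration, here is a minimal, self-contained sketch of the locking pattern described above. It does not use the Ozone or Hadoop classes: FakeMetricsSystem, SketchMetrics and SketchMetricsDemo are hypothetical stand-ins, and the sketch assumes a lazily created singleton whose create() is already synchronized. With unRegister() also synchronized on the same class lock, one thread can no longer observe a cleared instance while the source name is still registered.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a metrics system that rejects duplicate source names.
final class FakeMetricsSystem {
    private static final Set<String> SOURCES = new HashSet<>();

    static synchronized void register(String name) {
        if (!SOURCES.add(name)) {
            throw new IllegalStateException("Metrics source " + name + " already exists!");
        }
    }

    static synchronized void unregister(String name) {
        SOURCES.remove(name);
    }
}

// Hypothetical singleton mirroring the create/unRegister pairing in this report.
final class SketchMetrics {
    private static final String SOURCE_NAME = "SketchMetrics";
    private static SketchMetrics instance;

    private SketchMetrics() { }

    // Both static methods lock on SketchMetrics.class, so the check-and-register
    // in create() cannot interleave with the clear-and-unregister in unRegister().
    static synchronized SketchMetrics create() {
        if (instance == null) {
            FakeMetricsSystem.register(SOURCE_NAME);
            instance = new SketchMetrics();
        }
        return instance;
    }

    static synchronized void unRegister() {
        instance = null;
        FakeMetricsSystem.unregister(SOURCE_NAME);
    }
}

// Hypothetical stress helper: many threads create and unregister in a loop.
final class SketchMetricsDemo {
    // Returns the number of "already exists" failures observed.
    static int stress(int threads, int iterations) {
        AtomicInteger failures = new AtomicInteger();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < iterations; j++) {
                    try {
                        SketchMetrics.create();
                        SketchMetrics.unRegister();
                    } catch (IllegalStateException e) {
                        failures.incrementAndGet();
                    }
                }
            });
            workers[i].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return failures.get();
    }
}
```

Calling SketchMetricsDemo.stress(8, 10000) should report zero failures. Removing synchronized from unRegister() in this sketch reopens the window in which create() sees a null instance before the old source name has been removed, which is the failure mode in the stack trace above.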