Uploaded image for project: 'Hadoop Common'
  1. Hadoop Common
  2. HADOOP-13439

Fix race between TestMetricsSystemImpl and TestGangliaMetrics

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.8.0, 3.0.0-alpha1
    • Component/s: test
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      TestGangliaMetrics#testGangliaMetrics2 set *.period to 120 but 8 was used.

      2016-06-27 15:21:31,480 INFO  impl.MetricsSystemImpl (MetricsSystemImpl.java:startTimer(375)) - Scheduled snapshot period at 8 second(s).
      
      1. HADOOP-13439.001.patch
        4 kB
        Masatake Iwasaki
      2. HADOOP-13439.001.patch
        4 kB
        Chen Liang
      3. HADOOP-13439.002.patch
        4 kB
        Chen Liang

        Issue Links

          Activity

          Hide
          vagarychen Chen Liang added a comment - - edited

          Hi Masatake Iwasaki,

          Any updates on this issue? As I found this JIRA and HADOOP-12588 interesting, do you remind I assign this to myself and work on this?

          Show
          vagarychen Chen Liang added a comment - - edited Hi Masatake Iwasaki , Any updates on this issue? As I found this JIRA and HADOOP-12588 interesting, do you remind I assign this to myself and work on this?
          Hide
          iwasakims Masatake Iwasaki added a comment -

          Chen Liang, I assigned this to you. Thanks for your help.

          Show
          iwasakims Masatake Iwasaki added a comment - Chen Liang , I assigned this to you. Thanks for your help.
          Hide
          vagarychen Chen Liang added a comment - - edited

          Hi Masatake Iwasaki,

          Thanks for assigning this to me. One question though, I was not able to reproduce this error in my environment, the test report of failed tests in HADOOP-13323 seems no longer accessible. I will leave a comment there. But do you mind elaborate a little bit more on the error and your thoughts there? Thanks.

          Show
          vagarychen Chen Liang added a comment - - edited Hi Masatake Iwasaki , Thanks for assigning this to me. One question though, I was not able to reproduce this error in my environment, the test report of failed tests in HADOOP-13323 seems no longer accessible. I will leave a comment there. But do you mind elaborate a little bit more on the error and your thoughts there? Thanks.
          Show
          yuanbo Yuanbo Liu added a comment - Chen Liang I hit a similar error, this is a link for your reference: https://issues.apache.org/jira/browse/HADOOP-13441?focusedCommentId=15401700&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15401700
          Hide
          vagarychen Chen Liang added a comment -

          Hi Yuanbo Liu,

          Hmm...this does seem to be the same issue at the first glance, but I will dig into it a bit more. Thanks for the information!

          Show
          vagarychen Chen Liang added a comment - Hi Yuanbo Liu , Hmm...this does seem to be the same issue at the first glance, but I will dig into it a bit more. Thanks for the information!
          Hide
          iwasakims Masatake Iwasaki added a comment -

          But do you mind elaborate a little bit more on the error and your thoughts there? Thanks.

          Both TestGangliaMetrics and TestMetricsSystemImpl write configuration file to set up metrics system in the unit tests.

              ConfigBuilder cb = new ConfigBuilder().add("*.period", 120)
          	.add("test.sink.gsink30.context", "test") // filter out only "test"
          	.add("test.sink.gsink31.context", "test") // filter out only "test"
          	.save(TestMetricsConfig.getTestFilename("hadoop-metrics2-test"));
          
              new ConfigBuilder().add("*.period", 8)
                  //.add("test.sink.plugin.urls", getPluginUrlsAsString())
                  .add("test.sink.test.class", TestSink.class.getName())
                  .add("test.*.source.filter.exclude", "s0")
                  .add("test.source.s1.metric.filter.exclude", "X*")
                  .add("test.sink.sink1.metric.filter.exclude", "Y*")
                  .add("test.sink.sink2.metric.filter.exclude", "Y*")
                  .save(TestMetricsConfig.getTestFilename("hadoop-metrics2-test"));
          

          They seem to race and read the configuration file created by another test when you run tests in parallel (e.g. by mvn test -Dtest=TestGangliaMetrics,TestMetricsSystemImpl -Pparallel-tests). I thought maven configuration in the parallel-tests profile takes care of test data dirs but we may need additional fix.

          Show
          iwasakims Masatake Iwasaki added a comment - But do you mind elaborate a little bit more on the error and your thoughts there? Thanks. Both TestGangliaMetrics and TestMetricsSystemImpl write configuration file to set up metrics system in the unit tests. ConfigBuilder cb = new ConfigBuilder().add("*.period", 120) .add("test.sink.gsink30.context", "test") // filter out only "test" .add("test.sink.gsink31.context", "test") // filter out only "test" .save(TestMetricsConfig.getTestFilename("hadoop-metrics2-test")); new ConfigBuilder().add("*.period", 8) //.add("test.sink.plugin.urls", getPluginUrlsAsString()) .add("test.sink.test.class", TestSink.class.getName()) .add("test.*.source.filter.exclude", "s0") .add("test.source.s1.metric.filter.exclude", "X*") .add("test.sink.sink1.metric.filter.exclude", "Y*") .add("test.sink.sink2.metric.filter.exclude", "Y*") .save(TestMetricsConfig.getTestFilename("hadoop-metrics2-test")); They seem to race and read the configuration file created by another test when you run tests in parallel (e.g. by mvn test -Dtest=TestGangliaMetrics,TestMetricsSystemImpl -Pparallel-tests ). I thought maven configuration in the parallel-tests profile takes care of test data dirs but we may need additional fix.
          Hide
          vagarychen Chen Liang added a comment - - edited

          Thanks for such detailed feedbacks Masatake Iwasaki, it helped a lot!

          Given what you mentioned here, I did some more investigation, and I think I verified that it is possible for one test to read config created by another.

          The two classes both have this line:

          .save(TestMetricsConfig.getTestFilename("hadoop-metrics2-test"));

          Which writes the config to the exactly same file "hadoop-metrics2-test.properties" first. Then both classes have the call:

          ms.start();

          Which reads the file. So there can be race condition from the write to the read. This error can be reproduced by adding delay between .save() and ms.start() call, and run the tests at the same time.

          Another problem with this race condition is the missing metric error of checkMetrics() as mentioned in HADOOP-12588. For example, one class might have the following in its config

          .add("test.source.s1.metric.filter.exclude", "X*")

          which configures to remove Xxx from the metric sources. If the other class is expecting this metric, we would end up getting this error:

          java.lang.AssertionError: Missing metrics: test.s1rec.Xxx

          One solution I'm thinking of is to separate the config file they are writing to by introducing a unique prefix for each test case, such that each test uses a different config file, similar to RollingFileSystemSinkTestBase#initMetricsSystem. The pros is that we are still able to run test cases in parallel without race condition, and the cons is that there might be a couple tens more .properties generated. What do you think about this approach? Or do you have any other suggestions?

          Show
          vagarychen Chen Liang added a comment - - edited Thanks for such detailed feedbacks Masatake Iwasaki , it helped a lot! Given what you mentioned here, I did some more investigation, and I think I verified that it is possible for one test to read config created by another. The two classes both have this line: .save(TestMetricsConfig.getTestFilename( "hadoop-metrics2-test" )); Which writes the config to the exactly same file "hadoop-metrics2-test.properties" first. Then both classes have the call: ms.start(); Which reads the file. So there can be race condition from the write to the read. This error can be reproduced by adding delay between .save() and ms.start() call, and run the tests at the same time. Another problem with this race condition is the missing metric error of checkMetrics() as mentioned in HADOOP-12588 . For example, one class might have the following in its config .add( "test.source.s1.metric.filter.exclude" , "X*" ) which configures to remove Xxx from the metric sources. If the other class is expecting this metric, we would end up getting this error: java.lang.AssertionError: Missing metrics: test.s1rec.Xxx One solution I'm thinking of is to separate the config file they are writing to by introducing a unique prefix for each test case, such that each test uses a different config file, similar to RollingFileSystemSinkTestBase#initMetricsSystem. The pros is that we are still able to run test cases in parallel without race condition, and the cons is that there might be a couple tens more .properties generated. What do you think about this approach? Or do you have any other suggestions?
          Hide
          vagarychen Chen Liang added a comment -

          Since it seems TestGangliaMetrics is the test cases that has been causing trouble mostly, this patch changes it's config file to another one to avoid sharing same config with other test cases. Please leave a comment if you think it is necessary to do this for all potentially competing test cases.

          Show
          vagarychen Chen Liang added a comment - Since it seems TestGangliaMetrics is the test cases that has been causing trouble mostly, this patch changes it's config file to another one to avoid sharing same config with other test cases. Please leave a comment if you think it is necessary to do this for all potentially competing test cases.
          Hide
          iwasakims Masatake Iwasaki added a comment -

          Thanks for the patch, Chen Liang. +1, pending jenkins. I saw no failure in 100 runs of mvn test -Dtest=TestGangliaMetrics,TestMetricsSystemImpl -Pparallel-tests on my local.

          Please leave a comment if you think it is necessary to do this for all potentially competing test cases.

          I think writing config file specific for TestGangliaMetrics is enough here.

          Show
          iwasakims Masatake Iwasaki added a comment - Thanks for the patch, Chen Liang . +1, pending jenkins. I saw no failure in 100 runs of mvn test -Dtest=TestGangliaMetrics,TestMetricsSystemImpl -Pparallel-tests on my local. Please leave a comment if you think it is necessary to do this for all potentially competing test cases. I think writing config file specific for TestGangliaMetrics is enough here.
          Hide
          hadoopqa Hadoop QA added a comment -
          +1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 11s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 1 new or modified test files.
          +1 mvninstall 8m 50s trunk passed
          +1 compile 8m 0s trunk passed
          +1 checkstyle 0m 24s trunk passed
          +1 mvnsite 1m 3s trunk passed
          +1 mvneclipse 0m 13s trunk passed
          +1 findbugs 1m 33s trunk passed
          +1 javadoc 0m 47s trunk passed
          +1 mvninstall 0m 43s the patch passed
          +1 compile 8m 18s the patch passed
          +1 javac 8m 18s the patch passed
          -0 checkstyle 0m 26s hadoop-common-project/hadoop-common: The patch generated 9 new + 9 unchanged - 5 fixed = 18 total (was 14)
          +1 mvnsite 1m 1s the patch passed
          +1 mvneclipse 0m 14s the patch passed
          +1 whitespace 0m 0s The patch has no whitespace issues.
          +1 findbugs 1m 46s the patch passed
          +1 javadoc 0m 52s the patch passed
          +1 unit 8m 40s hadoop-common in the patch passed.
          +1 asflicense 0m 22s The patch does not generate ASF License warnings.
          44m 49s



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:9560f25
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12822682/HADOOP-13439.001.patch
          JIRA Issue HADOOP-13439
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux c1ee284b3bce 3.13.0-92-generic #139-Ubuntu SMP Tue Jun 28 20:42:26 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / 0705489
          Default Java 1.8.0_101
          findbugs v3.0.0
          checkstyle https://builds.apache.org/job/PreCommit-HADOOP-Build/10205/artifact/patchprocess/diff-checkstyle-hadoop-common-project_hadoop-common.txt
          Test Results https://builds.apache.org/job/PreCommit-HADOOP-Build/10205/testReport/
          modules C: hadoop-common-project/hadoop-common U: hadoop-common-project/hadoop-common
          Console output https://builds.apache.org/job/PreCommit-HADOOP-Build/10205/console
          Powered by Apache Yetus 0.4.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          Show
          hadoopqa Hadoop QA added a comment - +1 overall Vote Subsystem Runtime Comment 0 reexec 0m 11s Docker mode activated. +1 @author 0m 0s The patch does not contain any @author tags. +1 test4tests 0m 0s The patch appears to include 1 new or modified test files. +1 mvninstall 8m 50s trunk passed +1 compile 8m 0s trunk passed +1 checkstyle 0m 24s trunk passed +1 mvnsite 1m 3s trunk passed +1 mvneclipse 0m 13s trunk passed +1 findbugs 1m 33s trunk passed +1 javadoc 0m 47s trunk passed +1 mvninstall 0m 43s the patch passed +1 compile 8m 18s the patch passed +1 javac 8m 18s the patch passed -0 checkstyle 0m 26s hadoop-common-project/hadoop-common: The patch generated 9 new + 9 unchanged - 5 fixed = 18 total (was 14) +1 mvnsite 1m 1s the patch passed +1 mvneclipse 0m 14s the patch passed +1 whitespace 0m 0s The patch has no whitespace issues. +1 findbugs 1m 46s the patch passed +1 javadoc 0m 52s the patch passed +1 unit 8m 40s hadoop-common in the patch passed. +1 asflicense 0m 22s The patch does not generate ASF License warnings. 44m 49s Subsystem Report/Notes Docker Image:yetus/hadoop:9560f25 JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12822682/HADOOP-13439.001.patch JIRA Issue HADOOP-13439 Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle uname Linux c1ee284b3bce 3.13.0-92-generic #139-Ubuntu SMP Tue Jun 28 20:42:26 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux Build tool maven Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh git revision trunk / 0705489 Default Java 1.8.0_101 findbugs v3.0.0 checkstyle https://builds.apache.org/job/PreCommit-HADOOP-Build/10205/artifact/patchprocess/diff-checkstyle-hadoop-common-project_hadoop-common.txt Test Results https://builds.apache.org/job/PreCommit-HADOOP-Build/10205/testReport/ modules C: hadoop-common-project/hadoop-common U: hadoop-common-project/hadoop-common Console output https://builds.apache.org/job/PreCommit-HADOOP-Build/10205/console Powered by Apache Yetus 0.4.0-SNAPSHOT http://yetus.apache.org This message was automatically generated.
          Hide
          vagarychen Chen Liang added a comment -

          Fix for checkstyle

          Show
          vagarychen Chen Liang added a comment - Fix for checkstyle
          Hide
          hadoopqa Hadoop QA added a comment -
          +1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 10s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 1 new or modified test files.
          +1 mvninstall 6m 54s trunk passed
          +1 compile 6m 49s trunk passed
          +1 checkstyle 0m 23s trunk passed
          +1 mvnsite 0m 55s trunk passed
          +1 mvneclipse 0m 12s trunk passed
          +1 findbugs 1m 17s trunk passed
          +1 javadoc 0m 45s trunk passed
          +1 mvninstall 0m 37s the patch passed
          +1 compile 6m 41s the patch passed
          +1 javac 6m 41s the patch passed
          +1 checkstyle 0m 22s hadoop-common-project/hadoop-common: The patch generated 0 new + 9 unchanged - 5 fixed = 9 total (was 14)
          +1 mvnsite 0m 51s the patch passed
          +1 mvneclipse 0m 12s the patch passed
          +1 whitespace 0m 0s The patch has no whitespace issues.
          +1 findbugs 1m 25s the patch passed
          +1 javadoc 0m 46s the patch passed
          +1 unit 7m 43s hadoop-common in the patch passed.
          +1 asflicense 0m 22s The patch does not generate ASF License warnings.
          37m 49s



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:9560f25
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12822701/HADOOP-13439.002.patch
          JIRA Issue HADOOP-13439
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux 3e4efc68a0e2 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / 0ad48aa
          Default Java 1.8.0_101
          findbugs v3.0.0
          Test Results https://builds.apache.org/job/PreCommit-HADOOP-Build/10207/testReport/
          modules C: hadoop-common-project/hadoop-common U: hadoop-common-project/hadoop-common
          Console output https://builds.apache.org/job/PreCommit-HADOOP-Build/10207/console
          Powered by Apache Yetus 0.4.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          Show
          hadoopqa Hadoop QA added a comment - +1 overall Vote Subsystem Runtime Comment 0 reexec 0m 10s Docker mode activated. +1 @author 0m 0s The patch does not contain any @author tags. +1 test4tests 0m 0s The patch appears to include 1 new or modified test files. +1 mvninstall 6m 54s trunk passed +1 compile 6m 49s trunk passed +1 checkstyle 0m 23s trunk passed +1 mvnsite 0m 55s trunk passed +1 mvneclipse 0m 12s trunk passed +1 findbugs 1m 17s trunk passed +1 javadoc 0m 45s trunk passed +1 mvninstall 0m 37s the patch passed +1 compile 6m 41s the patch passed +1 javac 6m 41s the patch passed +1 checkstyle 0m 22s hadoop-common-project/hadoop-common: The patch generated 0 new + 9 unchanged - 5 fixed = 9 total (was 14) +1 mvnsite 0m 51s the patch passed +1 mvneclipse 0m 12s the patch passed +1 whitespace 0m 0s The patch has no whitespace issues. +1 findbugs 1m 25s the patch passed +1 javadoc 0m 46s the patch passed +1 unit 7m 43s hadoop-common in the patch passed. +1 asflicense 0m 22s The patch does not generate ASF License warnings. 37m 49s Subsystem Report/Notes Docker Image:yetus/hadoop:9560f25 JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12822701/HADOOP-13439.002.patch JIRA Issue HADOOP-13439 Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle uname Linux 3e4efc68a0e2 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux Build tool maven Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh git revision trunk / 0ad48aa Default Java 1.8.0_101 findbugs v3.0.0 Test Results https://builds.apache.org/job/PreCommit-HADOOP-Build/10207/testReport/ modules C: hadoop-common-project/hadoop-common U: hadoop-common-project/hadoop-common Console output https://builds.apache.org/job/PreCommit-HADOOP-Build/10207/console Powered by Apache Yetus 0.4.0-SNAPSHOT http://yetus.apache.org This message was automatically generated.
          Hide
          iwasakims Masatake Iwasaki added a comment -

          +1 on 002.

          Show
          iwasakims Masatake Iwasaki added a comment - +1 on 002.
          Hide
          hudson Hudson added a comment -

          SUCCESS: Integrated in Hadoop-trunk-Commit #10240 (See https://builds.apache.org/job/Hadoop-trunk-Commit/10240/)
          HADOOP-13439. Fix race between TestMetricsSystemImpl and (iwasakims: rev 8f9b61852bf6600b65e49875fec172bac9e0a85d)

          • hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/metrics2/impl/TestGangliaMetrics.java
          Show
          hudson Hudson added a comment - SUCCESS: Integrated in Hadoop-trunk-Commit #10240 (See https://builds.apache.org/job/Hadoop-trunk-Commit/10240/ ) HADOOP-13439 . Fix race between TestMetricsSystemImpl and (iwasakims: rev 8f9b61852bf6600b65e49875fec172bac9e0a85d) hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/metrics2/impl/TestGangliaMetrics.java
          Hide
          iwasakims Masatake Iwasaki added a comment -

          Committed to branch-2.8, branch-2 and trunk. Thanks, Chen Liang.

          Show
          iwasakims Masatake Iwasaki added a comment - Committed to branch-2.8, branch-2 and trunk. Thanks, Chen Liang .

            People

            • Assignee:
              vagarychen Chen Liang
              Reporter:
              iwasakims Masatake Iwasaki
            • Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development