Hive / HIVE-4808

WebHCat job submission is killed by TaskTracker since it's not sending a heartbeat properly

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.11.0
    • Fix Version/s: 0.12.0
    • Component/s: HCatalog
    • Labels:
      None

      Description

      (set mapred.task.timeout=70000)
      curl -i -d user.name=ekoifman \
      -d jar=/user/ekoifman/webhcate2e/hexamples.jar \
      -d class=sleep \
      -d arg="-mt" \
      -d arg="50000" \
      -d statusdir=/tmp \
      'http://localhost:50111/templeton/v1/mapreduce/jar'
      The TempletonControllerJob gets retried 4 times (thus there are 4 SleepJob invocations), each time with a message that it was killed due to inactivity.

      hexamples.jar = hadoop-examples-*.jar
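The failure mode described above (a controller task killed for inactivity while it waits on a long-running child job) is usually avoided by reporting progress from a background thread. The following is a minimal sketch of that idea, not the actual TempletonControllerJob code: the class name `KeepAliveSketch`, the method `runWithHeartbeat`, and the `reportProgress` callback are hypothetical names introduced for illustration, and the intervals are arbitrary.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class KeepAliveSketch {

    /**
     * Runs longTask on the calling thread while a background daemon thread
     * invokes reportProgress every intervalMs milliseconds (first call is
     * scheduled immediately). The pinger is stopped when longTask returns.
     * Hypothetical helper: in a real MapReduce task, reportProgress would
     * wrap whatever progress call the framework provides.
     */
    public static void runWithHeartbeat(Runnable longTask,
                                        Runnable reportProgress,
                                        long intervalMs) {
        ScheduledExecutorService pinger =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "keep-alive");
                t.setDaemon(true); // do not keep the JVM alive for the pinger
                return t;
            });
        pinger.scheduleAtFixedRate(reportProgress, 0, intervalMs, TimeUnit.MILLISECONDS);
        try {
            longTask.run();
        } finally {
            pinger.shutdownNow();
        }
    }

    public static void main(String[] args) {
        AtomicInteger pings = new AtomicInteger();
        // Stand-in for the child job: sleep for half a second.
        Runnable child = () -> {
            try {
                Thread.sleep(500);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        runWithHeartbeat(child, pings::incrementAndGet, 100);
        System.out.println("pings sent: " + pings.get());
    }
}
```

The point of the design is that the heartbeat does not depend on the child making progress: as long as the ping interval is shorter than the task timeout, the controller is not considered inactive.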

      1. HIVE-4808.1.patch
        9 kB
        Eugene Koifman
      2. HIVE-4808.patch
        3 kB
        Eugene Koifman

          Activity

          Ashutosh Chauhan added a comment -

          This issue has been fixed and released as part of the 0.12 release. If you find further issues, please create a new JIRA and link it to this one.

          Alan Gates added a comment -

          Patch checked in. Thanks Eugene.

          Alan Gates added a comment -

          Never mind, my mistake. I had my test harness configured incorrectly. Tests pass, I'll check this in shortly.

          Alan Gates added a comment -

          I'm not sure I'm running the tests properly. When I run the new test TestHeartbeat_2 it fails with:

          ./test_harness.pl::TestDriverCurl::checkResStatusCode INFO Check failed: status_code 200 expected, test returned <400>
          

          Do I need to do something to set it up properly?

          Hive QA added a comment -

          Overall: -1 at least one tests failed

          Here are the results of testing the latest attachment:
          https://issues.apache.org/jira/secure/attachment/12593808/HIVE-4808.1.patch

          ERROR: -1 due to 1 failed/errored test(s), 2649 tests executed
          Failed tests:

          org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_view_cast
          

          Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/164/testReport
          Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/164/console

          Messages:

          Executing org.apache.hive.ptest.execution.CleanupPhase
          Executing org.apache.hive.ptest.execution.PrepPhase
          Executing org.apache.hive.ptest.execution.ExecutionPhase
          Executing org.apache.hive.ptest.execution.ReportingPhase
          Tests failed with: TestsFailedException: 1 tests failed
          

          This message is automatically generated.

          Hive QA added a comment -

          Overall: -1 no tests executed

          Here are the results of testing the latest attachment:
          https://issues.apache.org/jira/secure/attachment/12593808/HIVE-4808.1.patch

          Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/159/testReport
          Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/159/console

          Messages:

          Executing org.apache.hive.ptest.execution.CleanupPhase
          Executing org.apache.hive.ptest.execution.PrepPhase
          Executing org.apache.hive.ptest.execution.ExecutionPhase
          Tests failed with: IllegalStateException: Too many bad hosts: 1.0% (10 / 10) is greater than threshold of 50%
          

          This message is automatically generated.

          Eugene Koifman added a comment -

          Added a test for this case.
          Ran Templeton e2e tests.
          With fork.factor.group=3 and fork.factor.conf.file=6, the suite runs in 11 minutes.

          Added support for a timeout_seconds property in .conf files to specify a custom timeout.

          Eugene Koifman added a comment -

          A couple of suggestions from the Hadoop user list on setting the timeout programmatically:

          Yes, you can set it into your Job configuration object in code. If
          your driver uses the Tool framework, then you can also pass a
          -Dmapred.task.timeout=value CLI argument when invoking your program.

          and

          'mapred.task.timeout' is a deprecated configuration. You can use the 'mapreduce.task.timeout' property to do the same.
          You could set this configuration while submitting the Job using the org.apache.hadoop.conf.Configuration.setLong(String name, long value) API from conf or JobConf.

          Eugene Koifman added a comment -

          3 things to keep in mind when testing:
          1. TempletonControllerJob is hardcoded to send a ping every 1 second.
          2. Set mapred.task.timeout to 70+ seconds, i.e. larger than the interval in #1.
          3. When running the test case (in the bug description), set -mt 90000 (to be larger than #2).
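The three points above amount to an ordering constraint: ping interval < task timeout < child job duration. A small sketch of that check follows; the class `ReproTimingCheck` and method `isValidRepro` are hypothetical names used only for illustration, and all values are assumed to be in milliseconds.

```java
public class ReproTimingCheck {

    /**
     * Returns true when the three intervals are ordered as the checklist
     * requires: ping interval < task timeout < child job duration.
     * Only with this ordering does the child job outlive the timeout, so
     * that the controller survives only if its pings are being registered.
     */
    public static boolean isValidRepro(long pingIntervalMs,
                                       long taskTimeoutMs,
                                       long childJobMs) {
        return pingIntervalMs < taskTimeoutMs && taskTimeoutMs < childJobMs;
    }

    public static void main(String[] args) {
        // Values from the checklist: 1 s ping, 70 s timeout, -mt 90000.
        System.out.println(isValidRepro(1_000L, 70_000L, 90_000L));
        // -mt 50000 from the original description is shorter than a 70 s timeout.
        System.out.println(isValidRepro(1_000L, 70_000L, 50_000L));
    }
}
```

With the checklist values the first call reports a valid setup; with -mt 50000 the child job would finish before the timeout could fire, so the second call reports an invalid one.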

          Thejas M Nair added a comment -

          It would be good to have an automated test for this as well. I understand that this will add 10+ minutes to the system tests (effectively almost doubling the run time). Maybe we can have a separate ant target that runs the shorter-running system tests, for use by developers before committing changes.

          Eugene Koifman added a comment -

          HIVE-4808.patch has the diffs.
          Tested manually.
          Ran WebHCat e2e tests.


            People

            • Assignee:
              Eugene Koifman
              Reporter:
              Eugene Koifman
            • Votes:
              0
              Watchers:
              3
