Uploaded image for project: 'Apache Tez'
  1. Apache Tez
  2. TEZ-2024

TaskFinishedEvent may not be logged in recovery

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.6.1
    • Component/s: None
    • Labels:
      None
    • Target Version/s:
    1. TEZ-2024-1.patch
      0.6 kB
      Jeff Zhang
    2. TEZ-2024-2.patch
      10 kB
      Jeff Zhang
    3. TEZ-2024-3.patch
      5 kB
      Jeff Zhang

      Activity

      Hide
      zjffdu Jeff Zhang added a comment -

      Attach the patch, Hitesh Shah please help review it.

      Show
      zjffdu Jeff Zhang added a comment - Attach the patch, Hitesh Shah please help review it.
      Hide
      hadoopqa Hadoop QA added a comment -

      -1 overall. Here are the results of testing the latest attachment
      http://issues.apache.org/jira/secure/attachment/12696152/TEZ-2024-1.patch
      against master revision cfa637a.

      +1 @author. The patch does not contain any @author tags.

      -1 tests included. The patch doesn't appear to include any new or modified tests.
      Please justify why no new tests are needed for this patch.
      Also please list what manual steps were performed to verify this patch.

      +1 javac. The applied patch does not increase the total number of javac compiler warnings.

      +1 javadoc. There were no new javadoc warning messages.

      +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

      +1 release audit. The applied patch does not increase the total number of release audit warnings.

      +1 core tests. The patch passed unit tests in .

      Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/114//testReport/
      Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/114//console

      This message is automatically generated.

      Show
      hadoopqa Hadoop QA added a comment - -1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12696152/TEZ-2024-1.patch against master revision cfa637a. +1 @author . The patch does not contain any @author tags. -1 tests included . The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javac . The applied patch does not increase the total number of javac compiler warnings. +1 javadoc . There were no new javadoc warning messages. +1 findbugs . The patch does not introduce any new Findbugs (version 2.0.3) warnings. +1 release audit . The applied patch does not increase the total number of release audit warnings. +1 core tests . The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/114//testReport/ Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/114//console This message is automatically generated.
      Hide
      hitesh Hitesh Shah added a comment -

      Is there a unit test to catch such issues in the future?

      Show
      hitesh Hitesh Shah added a comment - Is there a unit test to catch such issues in the future?
      Hide
      hitesh Hitesh Shah added a comment -

      Also why is "historyTaskStartGenerated" needed? Shouldn't the end event always be logged to timeline/history even if a start event was not generated?

      Show
      hitesh Hitesh Shah added a comment - Also why is "historyTaskStartGenerated" needed? Shouldn't the end event always be logged to timeline/history even if a start event was not generated?
      Hide
      zjffdu Jeff Zhang added a comment -

      Hitesh Shah Will add unit test.

      Shouldn't the end event always be logged to timeline/history even if a start event was not generated?

      It looks like the current logic is that one FinishedEvent must be associated with one StartedEvent, the same as TaskAttempt. I also think it doesn't make sense to do that, Task can finish before it is started. How about extending the scope of this jira to fix this ?

              if (!recoveryStartEventSeen) {
                throw new RuntimeException("Finished Event seen but"
                    + " no Started Event was encountered earlier");
              }
      
      Show
      zjffdu Jeff Zhang added a comment - Hitesh Shah Will add unit test. Shouldn't the end event always be logged to timeline/history even if a start event was not generated? It looks like the current logic is that one FinishedEvent must be associated with one StartedEvent, the same as TaskAttempt. I also think it doesn't make sense to do that, Task can finish before it is started. How about extending the scope of this jira to fix this ? if (!recoveryStartEventSeen) { throw new RuntimeException( "Finished Event seen but" + " no Started Event was encountered earlier" ); }
      Hide
      hitesh Hitesh Shah added a comment - - edited

      I think this issue is probably due to how we mingle both history and recovery. From a recovery point of view, it might need a start before a finish. I think the check was added for safety/correctness but you may be right that it might not be needed after all. Are there known cases where an end is recorded even though a start did not happen where the end state is not KILLED?

      From a history point of view, in some cases, a finish would help even if a start is not seen.

      Show
      hitesh Hitesh Shah added a comment - - edited I think this issue is probably due to how we mingle both history and recovery. From a recovery point of view, it might need a start before a finish. I think the check was added for safety/correctness but you may be right that it might not be needed after all. Are there known cases where an end is recorded even though a start did not happen where the end state is not KILLED? From a history point of view, in some cases, a finish would help even if a start is not seen.
      Hide
      zjffdu Jeff Zhang added a comment -

      From a recovery point of view, it might need a start before a finish. Are there known cases where an end is recorded even though a start did not happen where the end state is not KILLED?

      It is possible finish happens before it is started. Here's one case that TaskAttempt is terminated before it is started. I think we can just verify that the status must be KILLED/FAILURE if there's no StartedEvent before FinishedEvent.

            try {
              remoteTaskSpec = ta.createRemoteTaskSpec();
              LOG.info("remoteTaskSpec:" + remoteTaskSpec);
            } catch (AMUserCodeException e) {
              String msg = "Exception in " + e.getSource() + ", taskAttempt=" + ta;
              LOG.error(msg, e);
              String diag = msg + ", " + e.getMessage() + ", " + ExceptionUtils.getStackTrace(e.getCause());
              new TerminateTransition(FAILED_HELPER).transition(ta,
                  new TaskAttemptEventAttemptFailed(ta.getID(), TaskAttemptEventType.TA_FAILED, diag,
                      TaskAttemptTerminationCause.APPLICATION_ERROR));
              return TaskAttemptStateInternal.FAILED;
            }
      
      Show
      zjffdu Jeff Zhang added a comment - From a recovery point of view, it might need a start before a finish. Are there known cases where an end is recorded even though a start did not happen where the end state is not KILLED? It is possible finish happens before it is started. Here's one case that TaskAttempt is terminated before it is started. I think we can just verify that the status must be KILLED/FAILURE if there's no StartedEvent before FinishedEvent. try { remoteTaskSpec = ta.createRemoteTaskSpec(); LOG.info( "remoteTaskSpec:" + remoteTaskSpec); } catch (AMUserCodeException e) { String msg = "Exception in " + e.getSource() + ", taskAttempt=" + ta; LOG.error(msg, e); String diag = msg + ", " + e.getMessage() + ", " + ExceptionUtils.getStackTrace(e.getCause()); new TerminateTransition(FAILED_HELPER).transition(ta, new TaskAttemptEventAttemptFailed(ta.getID(), TaskAttemptEventType.TA_FAILED, diag, TaskAttemptTerminationCause.APPLICATION_ERROR)); return TaskAttemptStateInternal.FAILED; }
      Hide
      zjffdu Jeff Zhang added a comment -

      Attach new patch both resolve TEZ-2024 & TEZ-2037.

      Hitesh Shah, Still use the simple way to fix the issue, for the other issues we discussed above, I think can put it in another jira to keep this patch small and safe.
      I have verified this patch through tez-ui, please help review.

      Show
      zjffdu Jeff Zhang added a comment - Attach new patch both resolve TEZ-2024 & TEZ-2037 . Hitesh Shah , Still use the simple way to fix the issue, for the other issues we discussed above, I think can put it in another jira to keep this patch small and safe. I have verified this patch through tez-ui, please help review.
      Hide
      hitesh Hitesh Shah added a comment - - edited

      +1 pending pre-commit build results though might be better to split out TEZ-2037 related fix into a separate patch as that might be needed in branch-0.5.

      Show
      hitesh Hitesh Shah added a comment - - edited +1 pending pre-commit build results though might be better to split out TEZ-2037 related fix into a separate patch as that might be needed in branch-0.5.
      Hide
      hadoopqa Hadoop QA added a comment -

      -1 overall. Here are the results of testing the latest attachment
      http://issues.apache.org/jira/secure/attachment/12698336/TEZ-2024-2.patch
      against master revision 54bd104.

      +1 @author. The patch does not contain any @author tags.

      +1 tests included. The patch appears to include 2 new or modified test files.

      +1 javac. The applied patch does not increase the total number of javac compiler warnings.

      +1 javadoc. There were no new javadoc warning messages.

      -1 findbugs. The patch appears to introduce 1 new Findbugs (version 2.0.3) warnings.

      +1 release audit. The applied patch does not increase the total number of release audit warnings.

      +1 core tests. The patch passed unit tests in .

      Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/178//testReport/
      Findbugs warnings: https://builds.apache.org/job/PreCommit-TEZ-Build/178//artifact/patchprocess/newPatchFindbugsWarningstez-dag.html
      Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/178//console

      This message is automatically generated.

      Show
      hadoopqa Hadoop QA added a comment - -1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12698336/TEZ-2024-2.patch against master revision 54bd104. +1 @author . The patch does not contain any @author tags. +1 tests included . The patch appears to include 2 new or modified test files. +1 javac . The applied patch does not increase the total number of javac compiler warnings. +1 javadoc . There were no new javadoc warning messages. -1 findbugs . The patch appears to introduce 1 new Findbugs (version 2.0.3) warnings. +1 release audit . The applied patch does not increase the total number of release audit warnings. +1 core tests . The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/178//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-TEZ-Build/178//artifact/patchprocess/newPatchFindbugsWarningstez-dag.html Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/178//console This message is automatically generated.
      Hide
      zjffdu Jeff Zhang added a comment - - edited

      Hitesh Shah Should both of TEZ-2024 and TEZ-2037 go to 0.5 ? Since both of them are not ui related

      Show
      zjffdu Jeff Zhang added a comment - - edited Hitesh Shah Should both of TEZ-2024 and TEZ-2037 go to 0.5 ? Since both of them are not ui related
      Hide
      hitesh Hitesh Shah added a comment -

      I think the main question for 0.5 is whether they affect any recovery related functionality? If yes, it makes sense for 0.5. If it is only for UI related historical data, then it makes sense only for 0.6

      Show
      hitesh Hitesh Shah added a comment - I think the main question for 0.5 is whether they affect any recovery related functionality? If yes, it makes sense for 0.5. If it is only for UI related historical data, then it makes sense only for 0.6
      Hide
      zjffdu Jeff Zhang added a comment -

      Hitesh Shah Then TEZ-2037 would affect task_attempt's recovery functionality. TEZ-2024 may affect the second recovery, but it should be able to recover task to the right state as long as task attempt recovery event is completed.

      I will commit TEZ-2037 to 0.5 and TEZ-2024 to 0.6.

      Show
      zjffdu Jeff Zhang added a comment - Hitesh Shah Then TEZ-2037 would affect task_attempt's recovery functionality. TEZ-2024 may affect the second recovery, but it should be able to recover task to the right state as long as task attempt recovery event is completed. I will commit TEZ-2037 to 0.5 and TEZ-2024 to 0.6.
      Hide
      zjffdu Jeff Zhang added a comment -

      Upload new patch, exclude the changes for TEZ-2037.

      Will commit it after prebuild it done.

      Show
      zjffdu Jeff Zhang added a comment - Upload new patch, exclude the changes for TEZ-2037 . Will commit it after prebuild it done.
      Hide
      hadoopqa Hadoop QA added a comment -

      -1 overall. Here are the results of testing the latest attachment
      http://issues.apache.org/jira/secure/attachment/12698603/TEZ-2024-3.patch
      against master revision c266289.

      +1 @author. The patch does not contain any @author tags.

      +1 tests included. The patch appears to include 2 new or modified test files.

      +1 javac. The applied patch does not increase the total number of javac compiler warnings.

      +1 javadoc. There were no new javadoc warning messages.

      -1 findbugs. The patch appears to introduce 1 new Findbugs (version 2.0.3) warnings.

      +1 release audit. The applied patch does not increase the total number of release audit warnings.

      +1 core tests. The patch passed unit tests in .

      Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/183//testReport/
      Findbugs warnings: https://builds.apache.org/job/PreCommit-TEZ-Build/183//artifact/patchprocess/newPatchFindbugsWarningstez-dag.html
      Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/183//console

      This message is automatically generated.

      Show
      hadoopqa Hadoop QA added a comment - -1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12698603/TEZ-2024-3.patch against master revision c266289. +1 @author . The patch does not contain any @author tags. +1 tests included . The patch appears to include 2 new or modified test files. +1 javac . The applied patch does not increase the total number of javac compiler warnings. +1 javadoc . There were no new javadoc warning messages. -1 findbugs . The patch appears to introduce 1 new Findbugs (version 2.0.3) warnings. +1 release audit . The applied patch does not increase the total number of release audit warnings. +1 core tests . The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/183//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-TEZ-Build/183//artifact/patchprocess/newPatchFindbugsWarningstez-dag.html Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/183//console This message is automatically generated.
      Hide
      zjffdu Jeff Zhang added a comment -

      Committed to master & branch-0.6

      commit b74bab471649021cf9e8d4f9087bfce0fa9bf757 (HEAD, origin/master, origin/HEAD, master, TEZ-2024)
      Author: Jeff Zhang <zjffdu@apache.org>
      Date: Fri Feb 13 11:06:38 2015 +0800

      TEZ-2024. TaskFinishedEvent may not be logged in recovery (zjffdu)

      commit d6126aacaff4361a1e2229222e0285b0f23b8e40 (HEAD, origin/branch-0.6, branch-0.6)
      Author: Jeff Zhang <zjffdu@apache.org>
      Date: Fri Feb 13 11:06:38 2015 +0800

      TEZ-2024. TaskFinishedEvent may not be logged in recovery (zjffdu)

      (cherry picked from commit b74bab471649021cf9e8d4f9087bfce0fa9bf757)

      Show
      zjffdu Jeff Zhang added a comment - Committed to master & branch-0.6 commit b74bab471649021cf9e8d4f9087bfce0fa9bf757 (HEAD, origin/master, origin/HEAD, master, TEZ-2024 ) Author: Jeff Zhang <zjffdu@apache.org> Date: Fri Feb 13 11:06:38 2015 +0800 TEZ-2024 . TaskFinishedEvent may not be logged in recovery (zjffdu) commit d6126aacaff4361a1e2229222e0285b0f23b8e40 (HEAD, origin/branch-0.6, branch-0.6) Author: Jeff Zhang <zjffdu@apache.org> Date: Fri Feb 13 11:06:38 2015 +0800 TEZ-2024 . TaskFinishedEvent may not be logged in recovery (zjffdu) (cherry picked from commit b74bab471649021cf9e8d4f9087bfce0fa9bf757)
      Hide
      hitesh Hitesh Shah added a comment -

      Closing issue as 0.5.4, 0.6.1 and 0.7.0 have been released.

      Show
      hitesh Hitesh Shah added a comment - Closing issue as 0.5.4, 0.6.1 and 0.7.0 have been released.

        People

        • Assignee:
          zjffdu Jeff Zhang
          Reporter:
          zjffdu Jeff Zhang
        • Votes:
          0 Vote for this issue
          Watchers:
          3 Start watching this issue

          Dates

          • Created:
            Updated:
            Resolved:

            Development