Hive / HIVE-480

allow option to retry map-reduce tasks

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Query Processor
    • Labels: None

      Description

      For long-running queries with multiple map-reduce jobs, this should help in dealing with transient cluster failures without having to re-run all the tasks.

      Ideally, the entire plan could be serialized out and the actual execution of the workflow left to a pluggable workflow execution engine, since this is a problem that has been solved many times already.

          Activity

          zshao Zheng Shao added a comment -

          I will add a hive option to control the number of retries.

          zshao Zheng Shao added a comment -

          As a side note, the Hadoop conf "mapred.max.tracker.failures" controls the maximum number of permitted failures for each task.

          zshao Zheng Shao added a comment -

          This patch adds an additional config, "hive.exec.retries.max" (default: 1), to HiveConf and hive-default.xml.

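A minimal sketch of how such a setting would typically be overridden in hive-site.xml, assuming the property name from the patch; the value 3 and the description text are illustrative only:

```xml
<!-- Illustrative override of the retry count added by this patch.
     The property name comes from the patch; the value is an example. -->
<property>
  <name>hive.exec.retries.max</name>
  <value>3</value>
  <description>Maximum number of times a failed map-reduce job is
  retried before the query is failed.</description>
</property>
```

It could equally be set per session from the CLI with `set hive.exec.retries.max=3;`.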
          namit Namit Jain added a comment -

          The changes look good. I had a question about the usage: when will the default be greater than 1?
          If a long job gets retried after running for 5 hours, it may really increase the load on the cluster.
          So, if a cluster is unhealthy for some random reason, retries may inflict further pain on it.

          That said, this is moot while max retries is 1, where the current behavior is preserved.

          namit Namit Jain added a comment -

          A lot of tests failed. Can you fix them and resubmit the patch?

          jsensarma Joydeep Sen Sarma added a comment -

          One concern I have is that if the cluster goes down temporarily, the retries will fail promptly and this fix would serve no purpose.

          On the other hand, if the failure is due to genuine problems with the job (like problems in user scripts or bad input), we will retry unnecessarily and cause excess load.

          We need to think about how to distinguish these cases. In some cases (an interactive CLI session), it may be better to leave the decision to the user: give a prompt and ask whether they want to retry the job.

          Ideally, we should be able to do something like this for a non-interactive session as well, but that seems much more complicated (suspending and resuming a query given a query id).

          prasadc Prasad Chakka added a comment -

          If all queries go through Hive Server, it can figure out when to start queuing queries to be executed once the cluster comes back up.

          Can we integrate the code that Pete wrote to figure out whether the cluster is up into the Hive CLI, so that we can display a clear message to the user that the cluster is not available?

          namit Namit Jain added a comment -

          We should wait on this until we have a Hive Server.

          If we have a Hive Server, then we can keep a cache (query -> job file (which contains all map-reduce tasks) + base dependencies with their latest timestamps); we could use jdbm for that.

          The metastore needs to keep track of the latest modification time of a base object (table/partition), if it does not do so already.

          Then we don't need retries: the results will automatically get shared, even across multiple users.

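The cache-validity idea above can be sketched as follows. This is a hypothetical illustration, not code from Hive: `QueryCacheSketch`, `Entry`, and `lookup` are all invented names, and a real implementation would persist entries (e.g. via jdbm) and read modification times from the metastore rather than take them as a parameter.

```java
// Hypothetical sketch: a cached entry is reusable only if none of its
// base tables/partitions changed since the entry was stored.
import java.util.HashMap;
import java.util.Map;

public class QueryCacheSketch {
    static class Entry {
        final String jobFile;                  // serialized map-reduce plan/result
        final Map<String, Long> depTimestamps; // base object -> last-modified time
        Entry(String jobFile, Map<String, Long> depTimestamps) {
            this.jobFile = jobFile;
            this.depTimestamps = depTimestamps;
        }
    }

    final Map<String, Entry> cache = new HashMap<>();

    // currentTimestamps stands in for the metastore's modification times.
    String lookup(String query, Map<String, Long> currentTimestamps) {
        Entry e = cache.get(query);
        if (e == null) return null;
        for (Map.Entry<String, Long> dep : e.depTimestamps.entrySet()) {
            Long now = currentTimestamps.get(dep.getKey());
            if (now == null || !now.equals(dep.getValue())) return null; // stale
        }
        return e.jobFile; // fresh: the cached result can be shared
    }

    public static void main(String[] args) {
        QueryCacheSketch c = new QueryCacheSketch();
        Map<String, Long> deps = new HashMap<>();
        deps.put("tbl", 100L);
        c.cache.put("select count(1) from tbl", new Entry("job1.xml", deps));
        System.out.println(c.lookup("select count(1) from tbl", Map.of("tbl", 100L))); // job1.xml
        System.out.println(c.lookup("select count(1) from tbl", Map.of("tbl", 200L))); // null (stale)
    }
}
```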
          jsensarma Joydeep Sen Sarma added a comment -

          Unfortunately, the Hive Server route would almost mean that this JIRA is dead. Even FB hasn't standardized on using it; other installs may never use it.

          Is there something short-term we can do? For example, taking the current patch and adding a user prompt (unless prompting is disabled, and for the case of '-f' execution) would provide a short-term solution that may help some subset of users.

          Another practical solution could be to distinguish between communication failures to the JT (which scream for sleep/retry) and failures of the job due to task failure (which means we shouldn't retry automatically). Is it not possible to make this distinction at all? (If not, perhaps we can do something on the Hadoop side to enable it.)

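A minimal sketch of the distinction Joydeep describes, assuming the transient case surfaces as an IOException from the JobTracker client while a genuine job failure returns normally with a failure status. All names here are invented for illustration; this is not the code from the patch.

```java
// Hypothetical sketch: retry transient JT communication failures with
// backoff, but do not retry a job that ran and genuinely failed
// (bad user script, bad input, etc.).
import java.io.IOException;

public class RetrySketch {
    static final int MAX_RETRIES = 3;   // would come from hive.exec.retries.max
    static final long BACKOFF_MS = 10;  // kept short here; real code would wait longer

    interface JobRunner {               // stand-in for the real job submission
        boolean run() throws IOException;
    }

    static boolean runWithRetries(JobRunner job) throws InterruptedException {
        for (int attempt = 1; attempt <= MAX_RETRIES; attempt++) {
            try {
                // Returning false means the job ran but its tasks failed:
                // a genuine failure, so do not retry.
                return job.run();
            } catch (IOException e) {
                // Communication failure with the JT: transient, so retry.
                if (attempt == MAX_RETRIES) return false;
                Thread.sleep(BACKOFF_MS * attempt);
            }
        }
        return false;
    }

    public static void main(String[] args) throws InterruptedException {
        final int[] calls = {0};
        // Fails twice with IOException, then succeeds.
        boolean ok = runWithRetries(() -> {
            if (++calls[0] < 3) throw new IOException("JT unreachable");
            return true;
        });
        System.out.println(ok + " after " + calls[0] + " attempts"); // true after 3 attempts
    }
}
```

The key design point is that only the exception path loops; a clean-but-failed return exits immediately, which addresses the excess-load concern for genuine job failures.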

            People

            • Assignee: Unassigned
            • Reporter: jsensarma Joydeep Sen Sarma
            • Votes: 3
            • Watchers: 6
