Hadoop Map/Reduce
  1. Hadoop Map/Reduce
  2. MAPREDUCE-43

Retired jobs are not present in the job list returned to the job-client.

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Invalid
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      After mapred.jobtracker.retirejob.interval elapses, completed jobs are no longer maintained by the JT, but instead when job-client ask for job status, counters or task completion events, the relevant information is picked up from completed job store. But a retired job is not listed in the output of "hadoop job -list all", without which other information from completed job store isn't quite useful unless a job-id is known from elsewhere.

        Activity

        Hide
        Owen O'Malley added a comment -

        In the future this list will get very long. It probably makes sense to add a parameter that specifies the earliest date of jobs they are interested in seeing and defaulting it to 1 day ago.

        Show
        Owen O'Malley added a comment - In the future this list will get very long. It probably makes sense to add a parameter that specifies the earliest date of jobs they are interested in seeing and defaulting it to 1 day ago.
        Hide
        Vinod Kumar Vavilapalli added a comment -

        This list might get long depending upon how long job-status information is persisted in DFS, which is given by the value of mapred.job.tracker.persist.jobstatus.hours specified in hours. So, in similar lines, we can have a "-n NoOfHours" option to -list option, defaulting it to 24 hours.

        Show
        Vinod Kumar Vavilapalli added a comment - This list might get long depending upon how long job-status information is persisted in DFS, which is given by the value of mapred.job.tracker.persist.jobstatus.hours specified in hours. So, in similar lines, we can have a "-n NoOfHours" option to -list option, defaulting it to 24 hours.
        Hide
        Vinod Kumar Vavilapalli added a comment -

        May be we better also have a parameter N saying we only need the list of last N jobs.

        I am seeing a common pattern here - get the list of jobs that are yet to complete (JobSubmissionProtocol.jobsToComplete), get the list of running jobs, queued jobs, jobs that have started in the last N hours, last N jobs etc. We have two options here:

        • Get all jobs list from the JT and prune them on the client side.
        • Pass a filter to the JT asking for only those jobs that we need.

        When the number of jobs could get very long, the second option seems better - less data from JT to client, accomodates all the above types of listing jobs in a single RPC, for e.g., we then won't need a separate jobsToComplete. Thoughts?

        Show
        Vinod Kumar Vavilapalli added a comment - May be we better also have a parameter N saying we only need the list of last N jobs. I am seeing a common pattern here - get the list of jobs that are yet to complete (JobSubmissionProtocol.jobsToComplete), get the list of running jobs, queued jobs, jobs that have started in the last N hours, last N jobs etc. We have two options here: Get all jobs list from the JT and prune them on the client side. Pass a filter to the JT asking for only those jobs that we need. When the number of jobs could get very long, the second option seems better - less data from JT to client, accomodates all the above types of listing jobs in a single RPC, for e.g., we then won't need a separate jobsToComplete. Thoughts?
        Hide
        Owen O'Malley added a comment -

        When the number of jobs could get very long, the second option seems better - less data from JT to client, accomodates all the above types of listing jobs in a single RPC, for e.g., we then won't need a separate jobsToComplete. Thoughts?

        We need to design all of the protocols so that it is easy to page through them without duplicates. In the task events we did it via:

        TaskCompletionEvent[] getTaskCompletionEvents(JobID jobid, int fromEventId,
               int maxEvents) throws IOException;
        

        for job status, we need to use time:

        JobStatus[] getJobStatus(long minSubmitDate, int maxJobs) throws IOException;
        

        Note that we'd also need to add the submit date to the JobStatus so that you can page through them with the client. Thoughts?

        Show
        Owen O'Malley added a comment - When the number of jobs could get very long, the second option seems better - less data from JT to client, accomodates all the above types of listing jobs in a single RPC, for e.g., we then won't need a separate jobsToComplete. Thoughts? We need to design all of the protocols so that it is easy to page through them without duplicates. In the task events we did it via: TaskCompletionEvent[] getTaskCompletionEvents(JobID jobid, int fromEventId, int maxEvents) throws IOException; for job status, we need to use time: JobStatus[] getJobStatus( long minSubmitDate, int maxJobs) throws IOException; Note that we'd also need to add the submit date to the JobStatus so that you can page through them with the client. Thoughts?
        Hide
        Vinod Kumar Vavilapalli added a comment -

        As I mentioned earlier, instead of just passing minSubmitDate and maxJobs, having a filter helps easy extensibility -

          interface JobStatusFilter {
            public JobStatus[] filter(JobStatus[] jobStatusList);
          }
        

        Then, getJobStatus API will look like

        JobStatus[] getJobStatus(List<JobStatusFilter> jobStatusFilters) throws IOException;
        

        Specific filters -

          /** Replaces
          * public JobStatus[] jobsToComplete() throws IOException;
          */
          class CompleteJobStatusFilter implements JobStatusFilter {
            public JobStatus[] filter(JobStatus[] statusList) {
              ArrayList<JobStatus> retList = new ArrayList<JobStatus>();
              for(JobStatus js : statusList) {
                if (js.getRunState() == JobStatus.RUNNING || js.getRunState() == JobStatus.PREP) {
                  retList.add(js);
                }
              return retList.toArray(new JobStatus[retList.size()]);
              }
            }
          }
        
          /**
          * Return last N jobs.
          */
          class LastNJobsFilter implements JobStatusFilter {
            private int N;
            // getters and setters for N
            public JobStatus[] filter(JobStatus[] statusList) {
              ArrayList<JobStatus> retList = new ArrayList<JobStatus>();
              for(JobStatus js : statusList) {
                if (N-- != 0) {
                  retList.add(js);
                }
              return retList.toArray(new JobStatus[retList.size()]);
              }
            }
          }
        

        If we need a new kind of list, we would just add a new filter, instead of changing RPC signature or adding a new api. Features to obtain queued-jobs list, running-jobs list, list of jobs within last N hours can be easily added.

        Note that we'd also need to add the submit date to the JobStatus so that you can page through them with the client.

        +1.

        Show
        Vinod Kumar Vavilapalli added a comment - As I mentioned earlier, instead of just passing minSubmitDate and maxJobs, having a filter helps easy extensibility - interface JobStatusFilter { public JobStatus[] filter(JobStatus[] jobStatusList); } Then, getJobStatus API will look like JobStatus[] getJobStatus(List<JobStatusFilter> jobStatusFilters) throws IOException; Specific filters - /** Replaces * public JobStatus[] jobsToComplete() throws IOException; */ class CompleteJobStatusFilter implements JobStatusFilter { public JobStatus[] filter(JobStatus[] statusList) { ArrayList<JobStatus> retList = new ArrayList<JobStatus>(); for(JobStatus js : statusList) { if (js.getRunState() == JobStatus.RUNNING || js.getRunState() == JobStatus.PREP) { retList.add(js); } return retList.toArray(new JobStatus[retList.size()]); } } } /** * Return last N jobs. */ class LastNJobsFilter implements JobStatusFilter { private int N; // getters and setters for N public JobStatus[] filter(JobStatus[] statusList) { ArrayList<JobStatus> retList = new ArrayList<JobStatus>(); for(JobStatus js : statusList) { if (N-- != 0) { retList.add(js); } return retList.toArray(new JobStatus[retList.size()]); } } } If we need a new kind of list, we would just add a new filter, instead of changing RPC signature or adding a new api. Features to obtain queued-jobs list, running-jobs list, list of jobs within last N hours can be easily added. Note that we'd also need to add the submit date to the JobStatus so that you can page through them with the client. +1.
        Hide
        Vinod Kumar Vavilapalli added a comment -

        Also, now that getTaskCompletionEvents is mentioned, I see some problems with that too. First, we don't have the current total number of task-events, after observing which only we can clearly say we are seeking task-events from id1 to id2. I often find myself asking for events from 0 to 100 (large value) to view all task-events. We could simply add a parameter "all" for this. Or better, use a filter like mechanism mentioned above, so that we can say things like return all the task-events, only the first/last N task-events etc.

        Show
        Vinod Kumar Vavilapalli added a comment - Also, now that getTaskCompletionEvents is mentioned, I see some problems with that too. First, we don't have the current total number of task-events, after observing which only we can clearly say we are seeking task-events from id1 to id2. I often find myself asking for events from 0 to 100 (large value) to view all task-events. We could simply add a parameter "all" for this. Or better, use a filter like mechanism mentioned above, so that we can say things like return all the task-events, only the first/last N task-events etc.
        Hide
        Owen O'Malley added a comment -

        Passing Filters across RPC is a bad plan. I'd suggest something like:

        enum JobState {PREP, RUNNING,SUCCEEDED,FAILED};
        
        JobStatus[] getJobStatus(long minSubmitDate, int maxJobs, JobState filter) throws IOException;
        

        and handle the filter = null as no filter. That would let you page through the results and do the server side filtering. You may also want to add a filter for the user, since that will be another common operation.

        Show
        Owen O'Malley added a comment - Passing Filters across RPC is a bad plan. I'd suggest something like: enum JobState {PREP, RUNNING,SUCCEEDED,FAILED}; JobStatus[] getJobStatus( long minSubmitDate, int maxJobs, JobState filter) throws IOException; and handle the filter = null as no filter. That would let you page through the results and do the server side filtering. You may also want to add a filter for the user, since that will be another common operation.
        Hide
        Vinod Kumar Vavilapalli added a comment -

        Discussed about this with Owen. Some conclusions -

        • Passing a filter through rpc doesn't seem to be a good idea, because of version related problems - different versions of client and server may have different sets of filters available, and even same filter class available on both client and server may be of different versions.
        • Instead, we can introduce job-submit-date for paging and user-name for basic filtering. These two will let the client get most of the relevant information without huge requests.
          • The RPC, then, will look like -
             JobStatus[] getJobStatus(long minSubmitDate, long maxJobs, String username) throws IOException; 
          • The usage looks like -
            minSubmitDate maxJobs usage
            =0 <=0 Last 25 days(default) from last one day(default)
            =0 N Last N jobs from last one day(default)
            Long..MAX_VALUE N Last N jobs
            M N Last N jobs from last M days
            M <=0 Last 25(default) Jobs from last M days
            Long.MIN_VALUE N First N jobs
            -M N First N jobs from first M days
            -M <=0 First 25 jobs(default) from first M days

            user = null will give list for all users.

          • For this, we need to add job-submit-time into JobStatus.
        • JobStatus filters can further cut down data flow from server to client. For now, we will just let state filters be on the client side, and add JobState filters later, if we see that we need them even on server side, perhaps as a part of another JIRA,. Side note: We should define a JobState enum instead of using the int finals.

        Regarding the problem that I mentioned w.r.t getTaskCompletionEvents, we can get all information by giving a range of 0-MAX_INT. That would still be bad, but for now we continue trusting on the client to do it in a while loop getting say 1000 items at a time. If we wish to be fool-proof here too, may be we should limit, on the server side, the max number of items we send at a time.

        Show
        Vinod Kumar Vavilapalli added a comment - Discussed about this with Owen. Some conclusions - Passing a filter through rpc doesn't seem to be a good idea, because of version related problems - different versions of client and server may have different sets of filters available, and even same filter class available on both client and server may be of different versions. Instead, we can introduce job-submit-date for paging and user-name for basic filtering. These two will let the client get most of the relevant information without huge requests. The RPC, then, will look like - JobStatus[] getJobStatus( long minSubmitDate, long maxJobs, String username) throws IOException; The usage looks like - minSubmitDate maxJobs usage =0 <=0 Last 25 days(default) from last one day(default) =0 N Last N jobs from last one day(default) Long..MAX_VALUE N Last N jobs M N Last N jobs from last M days M <=0 Last 25(default) Jobs from last M days Long.MIN_VALUE N First N jobs -M N First N jobs from first M days -M <=0 First 25 jobs(default) from first M days user = null will give list for all users. For this, we need to add job-submit-time into JobStatus. JobStatus filters can further cut down data flow from server to client. For now, we will just let state filters be on the client side, and add JobState filters later, if we see that we need them even on server side, perhaps as a part of another JIRA,. Side note: We should define a JobState enum instead of using the int finals. Regarding the problem that I mentioned w.r.t getTaskCompletionEvents, we can get all information by giving a range of 0-MAX_INT. That would still be bad, but for now we continue trusting on the client to do it in a while loop getting say 1000 items at a time. If we wish to be fool-proof here too, may be we should limit, on the server side, the max number of items we send at a time.
        Hide
        Harsh J added a comment -

        I think it is OK to not show retired to users. Retired should mean GC'd (or summat), and the user should go and specifically look for it if he wants it again.

        Resolving as invalid but reopen if there's still merit I do not see.

        Show
        Harsh J added a comment - I think it is OK to not show retired to users. Retired should mean GC'd (or summat), and the user should go and specifically look for it if he wants it again. Resolving as invalid but reopen if there's still merit I do not see.

          People

          • Assignee:
            Vinod Kumar Vavilapalli
            Reporter:
            Vinod Kumar Vavilapalli
          • Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development