Hadoop Map/Reduce
MAPREDUCE-323

Improve the way job history files are managed

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 0.21.0, 0.22.0
    • Fix Version/s: 0.20.203.0
    • Component/s: jobtracker
    • Labels:
      None
    • Release Note:
      This patch does four things:

          * it changes the directory structure of the done directory that holds history logs for jobs that are completed,
          * it builds toy databases for completed jobs, so we no longer have to scan 2N files on DFS to find out facts about the N jobs that have completed since the job tracker started [which can be hundreds of thousands of files in practical cases],
          * it changes the job history browser to display more information and allow more filtering criteria, and
          * it creates a new programmatic interface for finding files matching user-chosen criteria. This allows users to no longer be concerned with our methods of storing them, in turn allowing us to change those at will.

      The new API referred to above, which can be used to programmatically obtain history file Paths given search criteria, is sketched below:

          package org.apache.hadoop.mapreduce.jobhistory;
          ...

          // this interface is within O.A.H.mapreduce.jobhistory.JobHistory:

          // holds information about one job history log in the done
          // job history logs
          public static class JobHistoryJobRecord {
             public Path getPath() { ... }
             public String getJobIDString() { ... }
             public long getSubmitTime() { ... }
             public String getUserName() { ... }
             public String getJobName() { ... }
          }

          public class JobHistoryRecordRetriever implements Iterator<JobHistoryJobRecord> {
             // usual Interface methods -- remove() throws UnsupportedOperationException
             // returns the number of calls to next() that will succeed
             public int numMatches() { ... }
          }

          // returns a JobHistoryRecordRetriever that delivers the Paths of all matching job history files,
          // in no particular order. Any criterion that is null or the empty string does not constrain.
          // All criteria that are specified are applied conjunctively, except that if there is more than
          // one date, you retrieve all Paths matching ANY of the dates.
          // soughtUser and soughtJobid must match exactly.
          // soughtJobName can match the entire job name or any substring.
          // dates must be exactly in the format MM/DD/YYYY.
          // Dates' leading digits must be 2's; we're incubating a Y3K problem.
          public JobHistoryRecordRetriever getMatchingJob
              (String soughtUser, String soughtJobName, String[] dateStrings, String soughtJobid)
            throws IOException
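
      For illustration, a hypothetical caller of this API might look like the
      following sketch. The jobHistory instance and the search values are
      assumptions for this example; obtaining a JobHistory instance is not
      covered by this note.

          // hypothetical usage sketch -- "jobHistory" is an assumed JobHistory
          // instance; the user, job name substring, and date are example values.
          // getMatchingJob throws IOException, so callers must handle it.
          JobHistoryRecordRetriever retriever = jobHistory.getMatchingJob(
              "amar", "sort", new String[] {"08/25/2010"}, null);
          System.out.println(retriever.numMatches() + " matching history files");
          while (retriever.hasNext()) {
            JobHistoryJobRecord record = retriever.next();
            System.out.println(record.getJobIDString() + " -> " + record.getPath());
          }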


      Description

      Today all the job history files are dumped in one job-history folder. This can cause problems when there is a need to search the history folder (job recovery etc.). It would be nice if we grouped all the jobs under a user folder, so all the jobs for user amar would go in history-folder/amar/. Jobs can be categorized using various features like job id, date, and job name, but using the username will make the search much more efficient and will not result in a namespace explosion.

      1. MR323--2010-08-20--1533.patch
        75 kB
        Dick King
      2. MR323--2010-08-25--1632.patch
        75 kB
        Dick King
      3. MR323--2010-08-27--1359.patch
        79 kB
        Dick King
      4. MR323--2010-08-27--1613.patch
        81 kB
        Dick King
      5. MR323--2010-09-07--1636.patch
        79 kB
        Dick King

        Issue Links

          Activity

          Arun C Murthy added a comment -

          Part of 0.20.203, no point in doing this for MRv2

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12454057/MR323--2010-09-07--1636.patch
          against trunk revision 1139400.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 13 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these core unit tests:
          org.apache.hadoop.cli.TestMRCLI
          org.apache.hadoop.fs.TestFileSystem

          -1 contrib tests. The patch failed contrib unit tests.

          -1 system test framework. The patch failed system test framework compile.

          Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/423//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/423//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/423//console

          This message is automatically generated.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12454057/MR323--2010-09-07--1636.patch
          against trunk revision 1075422.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 13 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          -1 system test framework. The patch failed system test framework compile.

          Test results: https://hudson.apache.org/hudson/job/PreCommit-MAPREDUCE-Build/101//testReport/
          Findbugs warnings: https://hudson.apache.org/hudson/job/PreCommit-MAPREDUCE-Build/101//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://hudson.apache.org/hudson/job/PreCommit-MAPREDUCE-Build/101//console

          This message is automatically generated.

          Ted Yu added a comment -

          In 0.20.6, we first obtain the list of failed tasks from jobfailures.jsp.
          Here is a sample URL:

          http://sjc1.ciq.com:50030/jobfailures.jsp?jobid=job_201012031753_0116&cause=failed
          

          Then we select N tasks from this list and log the last 4KB of task logs in our own flow log.

          Dick King added a comment -

          This is a new patch, exactly like the old one except that I removed the TestJobCleanup mod. That mod is now on MAPREDUCE-2032.

          Dick King added a comment -

          I split out a piece, which I modified and then attached to MAPREDUCE-2032.

          Dick King added a comment -

          The requirement is only that if you run this patch in its current state, a complete test run will fail in TestJobOutputCommitter – as would the null patch.

          Amareshwari Sriramadasu added a comment -

          TestJobCleanup was leaving behind files that were causing TestJobOutputCommitter to fail. I fixed that.

          Dick, can you put this patch on MAPREDUCE-2032? We can get it in faster there, as this jira will take more time for reviews.

          Dick King added a comment -

          I also fixed a problem with TestJobCleanup, which without this fix leaves files in a temp directory, trashing a subsequent TestJobOutputCommitter run if there is one before the temp directory is cleared. It's very annoying to have tests that fail in a full unit test run but not in isolation.

          Dick King added a comment -

          The only tests that fail are TestTaskTrackerLocalization and TestTaskLauncher, which fail in trunk as well.

          Dick King added a comment -

          There was a testing problem.

          TestJobCleanup was leaving behind files that were causing TestJobOutputCommitter to fail.

          I fixed that.

          Dick King added a comment -

          Problems fixed. ant test in progress.

          Dick King added a comment -

          Cancelling, pending investigation of some unit test failures:

          TestRumenJobTraces – the test case should use the new API

          TestJobOutputCommitter
          TestTaskLauncher

          I expect to fix these in a few hours.

          Dick King added a comment -

          This file is a new patch incorporating some of Krishna's review comments.

          What follows is a suggested release note:

          This patch does four things:

          • it changes the directory structure of the done directory that holds history logs for jobs that are completed,
          • it builds toy databases for completed jobs, so we no longer have to scan 2N files on DFS to find out facts about the N jobs that have completed since the job tracker started [which can be hundreds of thousands of files in practical cases],
          • it changes the job history browser to display more information and allow more filtering criteria, and
          • it creates a new programmatic interface for finding files matching user-chosen criteria. This allows users to no longer be concerned with our methods of storing them, in turn allowing us to change those at will.

          The new API referred to above, which can be used to programmatically obtain history file Paths given search criteria, is sketched below:

              package org.apache.hadoop.mapreduce.jobhistory;
              ...
          
              // within JobHistory:
          
              // holds information about one job history log in the done
              //   job history logs
              public static class JobHistoryJobRecord {
                 public Path getPath() { ... }
                 public String getJobIDString() { ... }
                 public long getSubmitTime() { ... }
                 public String getUserName() { ... }
                 public String getJobName() { ... }
              }
          
              public class JobHistoryRecordRetriever implements Iterator<JobHistoryJobRecord> {
                 // usual Interface methods -- remove() throws UnsupportedOperationException
                 // returns the number of calls to next() that will succeed
                 public int numMatches() { ... }
              }
          
              // returns a JobHistoryRecordRetriever that delivers the Paths of all matching job history files,
              // in no particular order.  Any criterion that is null or the empty string does not constrain.
              // All criteria that are specified are applied conjunctively, except that if there is more than
              // one date, you retrieve all Paths matching ANY of the dates.
              // soughtUser and soughtJobid must match exactly.
              // soughtJobName can match the entire job name or any substring.
              // dates must be exactly in the format MM/DD/YYYY.
              // Dates' leading digits must be 2's; we're incubating a Y3K problem.
              public JobHistoryRecordRetriever getMatchingJob
                  (String soughtUser, String soughtJobName, String[] dateStrings, String soughtJobid)
                throws IOException 
          
          Dick King added a comment -

          It is not used and I'll remove it.

          Kind of a minor point ... saves about four executed lines on an error
          path that normally never happens ... but I'll make the change.

          Because when we create a new subdirectory within the new runnable within
          moveToDone(JobID), the thread waits until enough time passes that that
          subdirectory will never have any entries added again, and then it writes
          the index. That ties up a thread, so we need an additional one to move
          the mail.

          Indeed it isn't.

          I left it in the parameter chain because future code changes may use it.
          In particular we might place a ceiling on how many jobs there could
          ever come to be in one subdirectory, and that would take a JobID to enforce.

          Actually, I have it backwards. We're indexing on every call, whether it
          needs it or not, which is bad. I'll fix this.

          Yeah, when I abstracted out buildIndex I didn't delete enough code from
          the inline.

          I make the 5-minute checkpoints because there is a small exposure:
          some history logs might not get indexed after a job tracker crash.
          This measure reduces that exposure.

          I made the busy wait loop 30 seconds, rather than one second, and on
          every pass to reduce the load and to make this code run only as often as
          it needs to. However, I therefore increased the thread pool size to THREE:

          1 to be in the loop waiting for the hour to end,

          1 to be obsolete because the hour already ended during its half minute
          but it doesn't realize it yet, and

          1 to copy a history file.

          That's three. If we're in the usual case where there is only one
          instance busy-waiting, then two instances might flow into the copying
          code. This is harmless but not useful [since the whole copying code is
          run with the lock taken].
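
          For concreteness, here is a minimal sketch of the sizing argument
          above. The field name and construction are assumptions for
          illustration; the patch's actual wiring may differ:

              import java.util.concurrent.ExecutorService;
              import java.util.concurrent.Executors;

              // assumed construction: three threads cover (1) the waiter for the
              // end of the hour, (2) a stale waiter that has not yet noticed the
              // hour ended, and (3) one thread copying a history file to DONE
              ExecutorService historyMoveExecutor = Executors.newFixedThreadPool(3);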

          Krishna Ramachandran added a comment -

          A few comments to start with:

          JobHistory.java:

              private static final SortedMap<Long, String> jobToDirectoryMap
                  = new TreeMap<Long, String>();

          how is this used?

              public String getConfFilePath(JobID jobId) {
                MovedFileInfo info = jobHistoryFileMap.get(jobId);
                if (info == null) {
                  return null;
                }
                final Path historyFileDir
                    = (new Path(getHistoryFilePath(jobId))).getParent();
                return getConfFile(historyFileDir, jobId).toString();
              }

          instead, "info" already has this data? info.historyFile?

          I suggest a simple modification to setupEventWriter:

              public void setupEventWriter(JobID jobId, JobConf jobConf)
                  throws IOException {
                if (logDir == null) {
                  LOG.info("Log Directory is null, returning");
                  throw new IOException("Missing Log Directory for History");
                }

                MetaInfo oldFi = fileMap.get(jobId);

                long submitTime = (oldFi == null ? System.currentTimeMillis() : oldFi.submitTime);

                String user = getUserName(jobConf);
                String jobName = getJobName(jobConf);
                ....

          On ThreadPoolExecutor - why the increased pool size?

          canonicalHistoryLogDir(JobId, ...)

          jobId is not used in the following:

              canonicalHistoryLogDir(

          In this block:

              synchronized (ueState) {
                ......
              + iShouldMonitor = true;
              +
              + ueState.unindexedElements = new LinkedList<JobHistoryIndexElement>();
              + ueState.currentDoneSubdirectory = resultDir;
              +
              + ueState.monitoredDirectory = resultDir;
                .....
              + ueState.unindexedElements.
              +     add(new JobHistoryIndexElement(millisecondTime, id, metaInfo));

          This code is not entirely clear.
          Should we increment the count here? unindexedElementCount++
          unindexedElements

          Related item:
          get/addUnindexedElements() - who calls these?

          In class UnindexedElementsState.closeCurrentDirectory(),

              OutputStream newIndexOStream = null;
              PrintStream newIndexPStream = null;

          are unused.

              + // time, because iShouldMonitor is only set true when
              + // ueState.monitoredDirectory changes, which will force the
              + // current incumbent to abend at the earliest opportunity.
              + while (iShouldMonitor) {
              +   int roundCounter = 0;
              +
              +   int interruptionsToAbort = 2;
              +
              +   try {
              +     Thread.sleep(1000);
              +   } catch (InterruptedException e) {
              +     if (--interruptionsToAbort == 0) {
              +       return;
              +     }
              +   }
              +
              +   synchronized (ueState) {
              +     if (ueState.monitoredDirectory != resultDir) {
              +       // someone else closed out the directory I was monitoring
              +       iShouldMonitor = false;
              +     } else if (++roundCounter % 30 == 0) {
              +       interruptionsToAbort = 2;
              +

          is in a busy wait loop with an arbitrary 1 sec sleep. This check can go up to a maximum of 1 hour?
          The 5 minute checkpoint does not set anything?

              } else if (++roundCounter % 300 == 0) {
                // called for side effect – a 5 minute checkpoint
                try {
                  ueState.getACurrentIndex(ueState.currentDoneSubdirectory); // why?
                } catch (IOException e) {
                  LOG.warn("Couldn't build an interim Job History index for "
                      + ueState.currentDoneSubdirectory);
                }

          Dick King added a comment -

          The API created by the patch differs from the one I described on July 30 in a few minor ways:

          • The name PathCow is undignified. I'm using something more reasonable.
          • You can search for specific job IDs.
          • We do not define the order in which the jobs' Paths are output.
          Dick King added a comment -

          I still have some testing to do, but I think this patch does the job.

          I invite comment.

          Note there is a programmatic API to get this information. See JobHistory.JobHistoryRecordRetriever, JobHistory.JobHistoryJobRecord, and JobHistory.getMatchingJobs.

          Dick King added a comment -

          Unfortunately, since the job sequence numbers will be assigned at job start time but the DONE directory is built as jobs complete, it's not the case that we'll only be filling one serial number block at once. They will, however, be closed when the day is complete.

          Dick King added a comment -

          I need to modify getMatchingJob(String, String, String[]) in my comment of 28/Jul/10 03:09 PM as follows:

              class PathCow implements Iterator<Path> {
                  // Iterator<Path> methods
          
                  int numberMatches();
                  // returns number of matches you could get if you drive the Iterator to
                  // the end.  Might be an approximation.        
              }
          
              PathCow getMatchingJob
                       (String user, String jobnameSubstring, String[] dateStrings, boolean backwards)
                    throws IOException
                // has no remove() method
                // any criterion can be null
                // filtering is conjunctive
                // dates are MM/DD/YYYY 
                // results happen approximately oldest first [or newest first,
                //    if backwards is true]
                // a new file that gets added after the iterator is created can either be
                //    or not be delivered by the result
                // dates are approximations of completion time
          
          Sharad Agarwal added a comment -

          If we go with index files, would it be useful to have some basic but most-accessed information, like completion state, start time, end time, and job name, available in the index file? This avoids the need to parse the history files and will give rich browsing on web UIs. Maybe a follow-up JIRA for this.

          Dick King added a comment -

          Sorry, I left out some context for my previous comment.

          1: People sometimes ask for accessors. This is especially likely as we change directory formats and may change them again in the future.

          2: I hope that no code has to scan the directories at all with this API.

          Dick King added a comment -

          Here are the new APIs I propose:

          All are static public member functions of JobHistory .

          All methods return only items from the done directory. Techniques for

              Path getJobHistoryPath(JobID id) throws IOException
          
              Path jobPathToConfPath(Path jobPath) throws IOException 
                // works in memory at computer speed.  Pledges to not read the file.
                // for a syntactically legal Path that doesn't correspond to an actual
                // job, can either return the corresponding conf Path that also won't
                // exist, or throw an exception.
          
              Iterator<Path> getMatchingJob
                       (String user, String jobnameSubstring, String[] dateStrings)
                    throws IOException
                // has no remove() method
                // any criterion can be null
                // filtering is conjunctive
                // dates are MM/DD/YYYY 
                // results happen in an arbitrary order
                // a new file that gets added after the iterator is created can either be
                //   or not be delivered by the result
                // dates are approximations of completion time
          
          
          Allen Wittenauer added a comment -

          Since I don't have an army of programmers building a metrics system like Simon, I'll likely just continue doing what I'm doing now: using a perl script to find the log files using a regex over the directory structure and manipulate them that way. As long as I don't have to have Java and all the information that is currently available remains available, then I probably don't care.

          It might be helpful, however, if you put up a diagram of your directory structure.

          Doug Cutting added a comment -

          Are the index files required? They reduce the amount of directory enumeration on the namenode, but is that a system bottleneck? An optimization that also might be important is packing these into archives, to preserve namespace. But I question whether the initial implementation should contain either optimization.

          Dick King added a comment -

          I'm seeking comments from the community, on https://issues.apache.org/jira/browse/MAPREDUCE-323?focusedCommentId=12891928&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12891928. Please? I would like to start bending metal soon.
          Dick King added a comment -

          I would like to retract the suggestion that job history log file names be shortened during the creation of the database; see 5a in https://issues.apache.org/jira/browse/MAPREDUCE-323?focusedCommentId=12891928&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12891928.

          That's a race condition waiting to happen, and not really all that beneficial.

          Dick King added a comment -

          A PROPOSAL

          introduction

          The way the completed job history file system now works is that when a job is started, an empty history file is created by the job tracker. The name of the file contains enough information about the job to let an application tell whether the file documents a job that satisfies a search criterion. In particular, it includes the job tracker instance ID, the job ID, the user name, and the job name.

          As the job progresses, records get added to the file, and when the job is finished [whether it succeeded or failed] the file is moved to another directory, the completed job history files directory [the "DONE directory"]. Currently this directory has a simple flat structure. If an application [in particular, the job history browser] wants some job histories, it reads this directory and chooses the files with names that indicate that the files will meet the criteria. In practical cases this can include hundreds of thousands or even a million files. Note that each job is represented by two files, the history file and the config file, doubling the burden on the name node.

          proposal

          I would like to implement a simple database to solve this problem. My proposal has the following features:

          1: The DONE directory will contain subdirectories, each containing a few hundred or a thousand files.

          2: At any time, the job tracker will be filling one of the DONE directory's subdirectories. All the rest are closed out, never to be added to again.

          3: The subdirectories have a naming scheme so they're created in lexicographical order. We would like to use subdirectory names like 2010-07-23--0000, etc. [the four digits are a serial number, not an HHMM field].

          4: When the job tracker decides to bind off a subdirectory and start a new one, it creates a new index file in the subdirectory it's closing out. That index is a simple list of the history files the directory contains.

          4a: The job tracker starts a new subdirectory whenever the first history file is copied on a given day, and whenever the current subdirectory would otherwise contain more than a certain number of files.

          4b: Perhaps the files can be renamed? These files' names are a few dozen characters each, and in a system that has run a half million jobs the names collectively occupy 100+ megabytes in the name node. Significant, but not decisive.

          4b1: 4b would require that rumen understand indices.

          5: The processing is:

          5a: [optional] create a new short name for every file in the subdirectory that's being closed out

          5a1: The job tracker keeps this information in memory. It doesn't need to read the directory.

          5b: Write out the index file in a temporary location temp-index within the directory it's indexing.

          5b1: The index contains all of the names in text form [if 5a is not used], or all pairs of { long name, short name } in text form if we are shortening the names.

          5c: rename the temp-index file to index when it's done [a sketch of steps 5b and 5c appears after this list]

          5d: [optional] If we choose file renaming, delete all of the long names.

          6: When doing a search, we

          6a: determine all subdirectories of the DONE directory

          6b: see which ones have an index

          6c: read each index that exists, and

          6d: read all of the files, for the subdirectories that don't have indices yet.

          7: To aid retirement of old job history files, the job tracker always binds off the current subdirectory when the date changes, even if it doesn't have very many files, and we retire files on date boundaries, a subdirectory at a time. The relevant date is the date that the file is being moved, which is normally a short time after the job is completed.

          8: [optional] We may want to consolidate the indices of a completed day in a per-day index written as a file directly under the done directory.
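
          Here is a minimal sketch of the write-then-rename in steps 5b and 5c.
          The class and method names, the list argument, and the exact index
          contents are assumptions for illustration, not the eventual patch:

              import java.io.IOException;
              import java.util.List;
              import org.apache.hadoop.conf.Configuration;
              import org.apache.hadoop.fs.FSDataOutputStream;
              import org.apache.hadoop.fs.FileSystem;
              import org.apache.hadoop.fs.Path;

              class HistoryIndexWriter {
                // write the index under a temporary name, then rename it, so a
                // reader never observes a partially written index file
                static void writeIndex(Configuration conf, Path subdir, List<String> names)
                    throws IOException {
                  FileSystem fs = subdir.getFileSystem(conf);
                  Path tempIndex = new Path(subdir, "temp-index");
                  FSDataOutputStream out = fs.create(tempIndex, true);
                  try {
                    for (String name : names) {
                      out.writeBytes(name + "\n");  // one history file name per line
                    }
                  } finally {
                    out.close();
                  }
                  fs.rename(tempIndex, new Path(subdir, "index"));  // step 5c
                }
              }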

          Dick King added a comment -

          On a separate issue, I received a suggestion that we have a distinguished file extension such as .jhist for job history files. FileSystem.globStatus(Path) can select the config .xml files, but

          1: This is not perfect, as the text ".xml" can in principle occur at the end of a job history file name, and

          2: the globbing language does not have a "give me all files except" operator.
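
          For illustration, here is a hedged sketch of the glob approach; the
          done-directory location and the "*.xml" pattern are assumptions for
          this example:

              import org.apache.hadoop.conf.Configuration;
              import org.apache.hadoop.fs.FileStatus;
              import org.apache.hadoop.fs.FileSystem;
              import org.apache.hadoop.fs.Path;

              // illustration only: select config files by their .xml suffix; the
              // glob language offers no "everything except *.xml", which is the
              // limitation in point 2 above
              Configuration conf = new Configuration();
              Path doneDir = new Path("/mapred/history/done");  // assumed location
              FileSystem fs = doneDir.getFileSystem(conf);
              FileStatus[] confFiles = fs.globStatus(new Path(doneDir, "*.xml"));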

          Dick King added a comment -

          I believe that it is agreed that we need a directory structure other than a single directory holding all of the history files.

          That being said, the question is how the directory tree should be organized.

          The use cases are:

          1: There is a job history web API, implemented by jobhistory.jsp, that allows users to search the job history files to retrieve information on single or multiple jobs meeting certain criteria. In particular, web users can search for jobs with a certain user, and jobs whose job name contains a certain substring.

          After a search, the current API allows the user to page through the data. They get told the total number of matching jobs, and they can browse pages of data, with 100 jobs per page. They can access the first and last page from any page, and from any page they can access any of the previous or following five pages [if there are that many].

          2: During restart, we perform searches for specific quadruples of jobtracker ID, job ID, username, and jobname. This may be redundant, but that's what we do in the current code base.

          3: I understand that some installations archive tranches of job history files periodically, usually by date.

          Here is how I support the claim that these use cases are covered, with considerable scaling and responsiveness improvements:

          1: If I use a subdirectory structure based on jobtracker IDs, then dates, then high-order digits of the jobid serial number, the performance of each of these three usage cases can be improved. I described potential improvements for use case 1 on 14/Jun/10 at 09:38 PM. To summarize, you will be able to browse by dates and time ranges as well as by the other criteria, and performance will be improved because we only search the subset of the directories we need to satisfy the query or to present the first page of the results.

          If we make changes along these lines we will no longer present to the user the total number of matching jobs. One of the complaints that led to this jira is, after all, the possibility of a scaling problem if there are too many jobs.

          2: Because of directory restrictions, the namenode will have to generate a lot less data, and there will be a lot less client-side filtering as well if you have directories consisting of only 1000 jobs [2000 files].

          3: We could archive a day's results by harchiving a date subdirectory.

          Hide
          Doug Cutting added a comment -

          > After some discussions, we've come to some decisions.

          Do you mean the discussion above, or some other discussion? Decisions are made in public. Do you mean you have a new proposal?

          Dick King added a comment -

          After some discussions, we've come to some decisions.

          1: We'll store the completed jobs' history files in the DFS done history files tree, in the following fixed format:

          DONE/job-tracker-instance-ID/YYYY/MM/DD/987654/

          The job tracker instance ID includes both the job tracker machine name and the epoch time of the instance start. There won't be very many directories on this level.

          YYYY/MM/DD documents the date of completion [actually, the date that the history file is copied to DFS].

          987654 stands for the leading six digits of the job serial number, considered as a nine-digit integer. The leading zeros ARE included, so the directories can be enumerated correctly in lexicographical order. Therefore, no directory will have more than 2000 files, except in the unlikely case that there are more than 2 million jobs in one day.
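
          As an illustrative sketch [not the committed code; names are hypothetical], deriving the done-directory subpath from a job's serial number could look like this:

              import java.util.Calendar;

              // derive DONE/<jt-instance>/YYYY/MM/DD/<leading six digits of serial>,
              // treating the serial number as a nine-digit, zero-padded integer
              static String doneSubdirectory(String jtInstanceId, Calendar completed, int jobSerial) {
                String serial9 = String.format("%09d", jobSerial);     // keep leading zeros
                return String.format("DONE/%s/%04d/%02d/%02d/%s",
                    jtInstanceId,
                    completed.get(Calendar.YEAR),
                    completed.get(Calendar.MONTH) + 1,                 // Calendar months are 0-based
                    completed.get(Calendar.DAY_OF_MONTH),
                    serial9.substring(0, 6));                          // leaf covers 1000 serials
              }

          Each leaf directory then covers 1000 consecutive serial numbers, which is where the 2000-file bound [two files per job] comes from.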

          2: We will modify the web application, jobhistory.jsp , in the following ways:

          2a: We will decide how many jobs to filter based on the following criteria:

          2a1: We stop at 11 tranches of serial numbers [the tenth boundary] or a day boundary, whichever comes first [but that page delivers buttons inviting you to ask for previous days, or more tranches]. Of course, as now, we stop at 100 items if we get that many items before crossing the directory boundary, but in the new code we will remember where to continue. However, in the new codebase we won't ls the files we don't present, improving responsiveness accordingly.

          2b: We will present the job history links, newest first.

          2b1: To make this coherent, we will remember where we left off for pagination.

          To summarize how the code will work, the pagination controls will look like this:

          Available Jobs in History (displaying 100 jobs from 1 to 100) [show all] [show 1000 per page] [show entire day] [first page][last page]

          < golem-jt1.megacorp.com-2010-05-18 golem-jt1.megacorp.com-2010-04-18 > [current JT instance, previous and/or following. This line of pagination controls is omitted if there is only one.]

          < newest 2010/06/14 2010/06/13 2010/06/12 2010/06/11 2010/06/10 oldest > [current day, two days previous, two days succeeding -- only within the current JT instance]

          < oldest 1 2 3 4 5 next newest > [directional words change when the search direction changes]

          2c: There is a notion of search direction. Currently we display oldest first, but I'm thinking of changing that because I judge "most recent first" to be the better default, especially as uptimes increase as the product becomes more mature. What do you think?

          Users can change direction by going to "last page", "oldest/newest date", or "oldest/newest task tracker". When you've done that, the navigation cursors change so you're going in the right direction.

          Chris Douglas added a comment -

          The scope of this issue has not been well defined. The designs are arguing about the correct subset of a database to implement for JobHistory, leaving a wide range of known (and as Allen points out, unknown) use cases ill served. This will not converge quickly.

          For purposes of consensus, this issue is a bug; the existing functionality is not handled efficiently. It should go without saying that the design should not be over-specific to today's use cases, but the issue's focus should remain on solving the problems cited and servicing the use cases already in the system. This is a misbehaving component, not a project implementing a small database in HDFS. Perhaps the title should change to reflect this.

          There are 3 operations to support (please amend as necessary):

          1. Lookup by JobID. This should not be worse than O(log n) (and should be O(1)), as it is a frequent operation.
          2. Find a set of jobs run by a particular user
          3. Find a set of jobs with names matching a regex

          (2) and (3) can require a scan, but the cost should be bounded. If there are common operator activities (like archiving old history, etc) then the layout should support that, but arbitrary queries are out of scope.

          The problems with the flat hierarchy are, obviously, the cost of listing files both in the JobTracker and NameNode. This can be ameliorated, somewhat, by HDFS-1091 and HDFS-985, but further optimizations/caching are possible if one can assume that recent entries are more relevant.

          Dick/Doug's format looks sound to me. Amar identified many complexities in implementing the configurable-schema, mini-database proposal and in my opinion: while the solutions are feasible, the virtues of a simpler fix for this issue outweigh the costs of solving those problems.

          I particularly like the idea of bounding scans of JobHistory to n entries, unless the user requests a deeper search. Caching recent entries, metadata about which subdirectories are sufficient for n entries, etc. are all reasonable optimizations, but adopting the new layout should be sufficient for this issue. Agreed?

          Allen Wittenauer added a comment -

          > 7: Perhaps there needs to be a programmatic API as well, reducing the need for people to read directories.

          In fact, I worry we are building something to make a faster UI but making it impossible to use in any other manner.

          Dick King added a comment -

          My proposal does present the display faster when the history files are numerous, but we do lose the ability to display the total count of matching jobs or to jump to the last page.

          Dick King added a comment -

          I've given this some more thought and I've devised a new design.

          I don't think that the subdirectory structure per se is the important issue, except to keep the directory sizes manageable. However, the important operations should be supported, with good performance, preferably in the jobhistory.jsp interface. We have to support reasonable searches in the jsp. To that end, I would like to do the following:

          1: Let the done jobs' directory structure be DONE/jobtracker-timestamp/123/456/789 where 123456789 is the job ID serial number. Leading zeros are depicted in the directory even if they're not in the serial number. Perhaps jobtracker-timestamp should be jobtracker-id?

          2: In the jsp, we could present the newest jobs first. This is probably what people want, and in common cases it speeds up the presentation when the user displays an early page. With the current naming convention, these are the jobs with the lexicographically latest file names.

          3: All the URLs in the jsp pages [including those behind forms] would have a starting job tracker ID and serial number encoded, so we can continue from where we left off, even though we keep adding new jobs to the beginning because of point 2. Subsequent pages will not overlap previous pages just because new jobs have been added at the beginning.

          4: When we do searches, we work back through the directories in reverse order, so we can stop when we populate a page rather than reading all of the history files' names.

          5: For low-yield searches we'll consider offering to stop after, say, 10K non-matching jobs have been ignored. This lets us process mistyped queries in a reasonable time.

          6: The start time is of interest. Inside the JobHistory code, as the cached history files are being copied to the DONE directory, an approximation of the start time is available in the modification time of the conf.xml file. We can copy that, either to the modification time of the new job history file [using setTime], or encode it into the filename in some manner [as we do with the job name]. Either way, we can then present it in the jsp result, or filter based on time ranges. What does the community think?
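
          A minimal sketch of the first alternative [the helper name and paths are illustrative; FileSystem.setTimes is the relevant call in the FileSystem API]:

              import java.io.IOException;
              import org.apache.hadoop.fs.FileSystem;
              import org.apache.hadoop.fs.Path;

              // carry the conf.xml modification time [an approximation of the job
              // start time] onto the history file just copied to the DONE directory
              static void copyApproxStartTime(FileSystem fs, Path confXml, Path doneHistoryFile)
                  throws IOException {
                long approxStartTime = fs.getFileStatus(confXml).getModificationTime();
                fs.setTimes(doneHistoryFile, approxStartTime, -1);   // -1 leaves the access time alone
              }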

          7: Perhaps there needs to be a programmatic API as well, reducing the need for people to read directories.

          Allen Wittenauer added a comment -

          This should work just like core dump configuration options. A pattern is provided via an option and the system replaces the pattern's parameters with the job's unique values. This way everyone can get what they want in a very simple interface. Hard-coding a log file name is something that we shouldn't be doing anyway.

          > -1. There is no point supporting configuration options which are clearly infeasible in several cases.

          If we stop hard coding log file names and use pattern substitution, then this isn't the case anymore.
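
          A toy sketch of the substitution idea [the tokens and values here are hypothetical, not a proposed set]:

              // expand %-tokens in a configured pattern with the job's unique values
              static String expandPattern(String pattern, String user, String jobId, String date) {
                return pattern.replace("%u", user)
                              .replace("%j", jobId)
                              .replace("%d", date);
              }
              // expandPattern("%d/%u/%j", "alice", "job_201006140938_0042", "2010/06/14")
              //   -> "2010/06/14/alice/job_201006140938_0042"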

          Arun C Murthy added a comment -

          > I would prefer username to be one of the configuration options. Since it's configurable, it can be turned off for clusters having lots of users.

          -1. There is no point supporting configuration options which are clearly infeasible in several cases.

          Amar Kamat added a comment -

          A few comments:

          1. W.r.t. your comment, we could very well use the finishtime of the job. This is already published in the job summary, stored in the job status cache within the jobtracker, and later archived to the completed-job-status-store. Maybe we can reuse these features (i.e. the job status cache and status store).
          2. We should log jobhistory activities like
            1. the jobhistory folder regex used
            2. jobid to foldername mappings
              Logging will help in debugging and post-mortem analysis.
          3. Formats can change across runs. How do we plan to take care of that? One thing we can do is to have a unique folder per pattern for storing the files. The (unique) folder-name should be based on the jobhistory structure pattern. This mapping of jobhistory folder regex to foldername should be logged.
            Clients that need really old jobhistory files analyzed will dig up the jobhistory folder format, map it to the folder, and provide the username, jobid and finishtime to get the file. The client can get the username and finishtime by querying the JobTracker for the job status (via the completed-jobstatus-store). See Future Steps #1.
          4. How about keeping N items in the top-level directory and moving them to the appropriate place only when the total item count crosses N?
            Example (assume /done/%user/%jobid as the format and N=5)
            1. The first job gets added to /done/job1
            2. The 5th job gets added to /done/job5
            3. The 6th job gets added to /done/job6 and /done/job1 gets moved to /done/user1/job1
            4. and so on
              So the movement happens only on overflow. The benefit of this change is that without any indexing, we can show the most recent N jobs on the jobhistory webui. This pattern can be enabled for all subfolders also. So if the jobhistory format specified is %user/ then queries like 'give the recent 5 items for all the users' can also be answered quickly.
          5. The webui should provide 2 views
            1. top/recent few (show jobs from the topmost-level folder)
            2. a browse-able view where YYYY/MM/DD etc. is shown as is. This can be configurable and turned off for complicated structures like 00/00/00-99 etc., which users might not be able to make sense of. Also there should be some kind of widget in JobHistory that, given username, jobid and finishtime, provides the complete jobhistory filename. See Future Steps #2.
          6. .... He raised the issue that a practical cluster has more distinct users than we would want to create DFS directories for, especially if the directory structure is further split on timestamps.

            I would prefer username to be one of the configuration options. Since it's configurable, it can be turned off for clusters having lots of users.

          Future steps :

          1. As of today, we have jobhistory files directly dumped in the done folder. We might want to move these files into the format we want (for a good user experience). Maybe some kind of offline admin tool can help here (maybe under mradmin?). It might make sense to name the final jobhistory file (leaf-level) as $username_$jobid_$finishtime. This will enable us to restructure job history files across various formats.
          2. There should be some way to find out which regex/format was used given the jobtracker start time (which is one of the components in the jobid). To make it easier for clients, maybe the log files related to jobhistory updates can be published, or the JobTracker should be in a position to answer this.
            Thoughts?
          Dick King added a comment -

          A coworker has suggested that I remove the "%u" user option [from my previous comment, 28/May/10 06:01 PM].

          He raised the issue that a practical cluster has more distinct users than we would want to create DFS directories for, especially if the directory structure is further split on timestamps.

          Dick King added a comment -

          If the cluster configuration codes any time stamps, we have to create them. We'll do this the first time we make a filename for a given job.

          Having done that, we'll have a map mapping job serial numbers to directory segments [which we will intern; there will be many duplicates].

          Having done that, we'll keep 250K of these; we'll drop the oldest one when we add a new one that would otherwise put us over that limit. We'll therefore use a TreeMap. I expect about 20-40 bytes per entry: 16 bytes for each tree node, and 8 or 16 for the key, which would be an Integer. Recall that the directory segments are interned and would essentially vanish.

          This table only exists if there is a time stamp operator in the format string.
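
          A sketch of that bounded map [sizes and names illustrative; this shows the idea, not the patch]:

              import java.util.TreeMap;

              class SerialToSegmentMap {
                private static final int MAX_ENTRIES = 250000;
                private final TreeMap<Integer, String> map = new TreeMap<Integer, String>();

                synchronized void put(int jobSerial, String directorySegment) {
                  map.put(jobSerial, directorySegment.intern());   // segments repeat heavily; intern them
                  if (map.size() > MAX_ENTRIES) {
                    map.remove(map.firstKey());   // serials are assigned in order, so first == oldest
                  }
                }

                synchronized String get(int jobSerial) {
                  return map.get(jobSerial);
                }
              }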

          Dick King added a comment -

          It has been correctly pointed out to me that the syntax, %xi,j, is just weird.

          In keeping with Java data format conventions, I will use %leftmost_index$width.precision x to name a segment of the jobID index.

          leftmost_index names the leftmost digit and width names the number of digits; width defaults to 1. If you name a digit that doesn't exist, the output gets the empty string in the corresponding position [except as specified in precision, below]. It is an error for width to exceed leftmost_index.

          precision is a minimum number of digits to output; it defaults to 0. It is an error for precision to exceed width. If precision requires more digits than exist in the index, we supply zeroes.

          It is an error to omit leftmost_index. It is an error to code a $ if there is no width. It is an error to code a . if there is no precision. It is an error to omit width if there is a precision.

          This configuration variable lives in mapreduce.jobhistory.completed.subdirectory.format. The default is the empty string [which gives the behavior that we get now: no subdirectories].
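
          A sketch of how one such segment might be applied to a serial number, under my reading of the rules above [the method is hypothetical]:

              // %leftmost_index$width.precision x applied to a job serial number;
              // positions count from the right, 1-based; nonexistent digits
              // contribute nothing, except that precision zero-fills the output
              static String applySegment(int serial, int leftmostIndex, int width, int precision) {
                String digits = Integer.toString(serial);
                StringBuilder out = new StringBuilder();
                for (int pos = leftmostIndex; pos > leftmostIndex - width; pos--) {
                  int i = digits.length() - pos;
                  if (i >= 0) out.append(digits.charAt(i));
                }
                while (out.length() < precision) out.insert(0, '0');   // zero-fill to precision
                return out.toString();
              }
              // applySegment(987654321, 9, 6, 0) -> "987654"
              // applySegment(54321,     9, 6, 6) -> "000054"   [missing digits zero-filled]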

          Dick King added a comment -

          The user and jobID index can both be obtained from arguments to calls to the API JobHistory.getJobHistoryFile(...). The time stamp cannot.

          I'll have to store maps from jobID indices to time of first JobHistory.getJobHistoryFile(...) call to support the functionality if the cluster owner specifies time-stamp based directory structure. This map lives in the job tracker and creates a practical limit of perhaps a half million jobs, if this feature is used. Does this seem reasonable?

          Dick King added a comment -

          3: jobhistory.jsp has to be convinced to recursively process subdirectories within historyLogDir.

          Dick King added a comment -

          Okay...

          1: I will have to fix rumen to recursively descend into a directory of directories to make it capable of swallowing a history directory [a sketch of this descent follows this list].

          1a: I would like to still process the job IDs in lexicographical order [which is almost always chronological order] for compatibility with applications that expect approximately chronological order.

          1b: This creates a memory footprint of about 200b/entry, which may impose a limit of one million jobs or so.

          2: I will make the directories configurable. How about the following controls?

          locution   meaning
          %y         year [four digits] [The Y10K problem will be someone else's problem :-) ]
          %m         month [two digits, leading zeros present]
          %d         day [two digits, leading zeros present]
          %h         hour [two digits, leading zeros present]
          %i         mInute [two digits, leading zeros present]
          %u         user
          %xi-j      the digits from the jobID index whose positions run from i through j, downwards, numbered from the right, 1-based. If you choose any digits that don't exist you get no characters in the output for those digits. %x9-3 will give you directories holding logs for at most 100 jobs, unless you omit timestamp selection controls.
          /          directory component separator [even on platforms with a different separator character] -- if there are two or more slashes in a row we swallow all but one, and note that there's an implicit leading and trailing separator character
          any other character: itself

          Did I leave anything out?
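
          Regarding point 1, a hedged sketch of the recursive descent [names illustrative], with callers sorting the result afterwards to get the approximately chronological order of 1a:

              import java.io.IOException;
              import java.util.List;
              import org.apache.hadoop.fs.FileStatus;
              import org.apache.hadoop.fs.FileSystem;
              import org.apache.hadoop.fs.Path;

              // collect every history file under dir, descending into subdirectories
              static void collectHistoryFiles(FileSystem fs, Path dir, List<Path> out)
                  throws IOException {
                FileStatus[] entries = fs.listStatus(dir);
                if (entries == null) return;                       // dir may not exist
                for (FileStatus entry : entries) {
                  if (entry.isDir()) {
                    collectHistoryFiles(fs, entry.getPath(), out);
                  } else {
                    out.add(entry.getPath());
                  }
                }
              }
              // afterwards, sort the collected paths lexicographically, which is
              // almost always chronological order for history file names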

          Edward Capriolo added a comment -

          Being able to control the structure better is definitely a nice feature. Practically, dividing the job folders by mm/dd/yy would solve the immediate problem of having to clean up and restart your JobTracker when you hit the ext3 limit. Introducing a variable into the jobtracker, mapred.jobhistory.maxjobhistory, and a FIFO queue might be helpful as well. As things stand now, downtime and cleanup are needed to keep the JobTracker running well, which is less than optimal.

          Rajiv Chittajallu added a comment -

          Each site might have different requirements. Some might run thousands of jobs as a single user and others might have a lot of users. Can we make the directory and file name configurable too? Something like "%Y/%m/%d/%j"

          Doug Cutting added a comment -

          > Yes. I agree that breaking by job-ids make things a lot simpler.

          Great! Does anyone else have concerns with that approach?

          > admins will have to create per-user directories in history folder where JobTracker can write to

          Another possibility might be that jobs can specify a group permitted to read its logs. The jobtracker would chgrp the logs to that group. The jobtracker's uid would need to be a member of that group. The difference is that, rather than having to configure each filesystem for each user, one can just configure the user/groups database. Another difference is that this would permit logs to be readable by more than the single user who submitted the job. But this is all stuff for later...

          Amareshwari Sriramadasu added a comment -

          > But, as I stated later, I think breaking directories by job-id makes lookup simpler and gives us more explicit limits over directory sizes. So I'd prefer that to time-based directories.

          Yes. I agree that breaking by job-ids makes things a lot simpler.

          > So, if we have all job history files in a single tree, then we'd want the directories in that tree to be world readable, but the log files to be owned and readable by the job's submitter.

          To achieve this, the JobTracker would need super-user privileges on HDFS, to do the chown. If we assume the JT would have super-user privileges on HDFS, then we can go with the job-id based directory structure.

          > Or, if we have per-user directories, we could make those readable only by that user, providing greater privacy. Is this what you mean?

          Yes, I meant this. In this case, admins will have to create per-user directories in the history folder where the JobTracker can write to. The JT will not need super-user privileges here.

          Doug Cutting added a comment -

          > Per-hour directories look like over-kill. On the average, For each user, there would be 10 jobs finished in an hour.

          The maximum is more important than the average, no? Couldn't there be a user that submits a job every minute or more? But, as I stated later, I think breaking directories by job-id makes lookup simpler and gives us more explicit limits over directory sizes. So I'd prefer that to time-based directories.

          > This looks fine. But when we have permissions in place, inserting user becomes difficult.

          I'm not sure what you mean by "permissions in place" and "inserting user". It seems that the intent is for users to be able to directly read their own job history files from HDFS. It also seems like we generally don't want users to be able to read others' job history files. So, if we have all job history files in a single tree, then we'd want the directories in that tree to be world readable, but the log files to be owned and readable by the job's submitter. Or, if we have per-user directories, we could make those readable only by that user, providing greater privacy. Is this what you mean?

          When we call Cluster.getJobHistoryUrl() we'll know the user's ID, so I don't see a top-level directory per user changing things much, if that's what you're worried about. The more important decision, it seems to me, is how we break things into directories within that. Using job ids seems more scalable than using time-of-day. Do you agree?

          Amareshwari Sriramadasu added a comment -

          > Nick Rettinghouse, Tim Williamson, and Rajiv Chittajallu all suggested a preference for per-hour directories, in particular, USER/YYYY/MM/DD/HH, an option you did not list. Should we perhaps err on the side of a deeper structure, to ensure that we don't have to re-structure things again?

          Per-hour directories look like overkill. On average, for each user, there would be 10 jobs finished in an hour.

          > However implementing Cluster.getJobHistoryUrl() would be expensive for archived jobs, since the jobtracker must search the entire directory tree.

          Here, the JobTracker need not search the entire directory tree. If the JobTracker does not have it in the cache, the Job Client itself can do the search.

          > Perhaps the directory structure should instead be based purely on the job ID? E.g., something like: jobtrackerstarttime/00/00/00

          This looks fine. But when we have permissions in place, inserting the user becomes difficult.

          Doug Cutting added a comment -

          > Options for the directory structure of the history files are

          Nick Rettinghouse, Tim Williamson, and Rajiv Chittajallu all suggested a preference for per-hour directories, in particular, USER/YYYY/MM/DD/HH, an option you did not list. Should we perhaps err on the side of a deeper structure, to ensure that we don't have to re-structure things again?

          I like the idea of a cache of recent jobs in the JobTracker. This can be initialized by walking this directory tree, then maintained incrementally. However implementing Cluster.getJobHistoryUrl() would be expensive for archived jobs, since the jobtracker must search the entire directory tree.

          Perhaps the directory structure should instead be based purely on the job ID? E.g., something like:
          jobtrackerstarttime/00/00/00
          jobtrackerstarttime/00/00/01
          ...
          jobtrackerstarttime/00/00/99
          jobtrackerstarttime/00/01/00
          etc.

          Only if a jobtracker ran more than 1M jobs would its top-level directory have more than 100 entries. Constructing the cache of recent jobs would be fast, as would Cluster.getJobHistoryUrl(JobID). Access to jobs in the cache by user id, date, etc. could be fast, since the cache is in memory. Access to older jobs by user id, date, etc. would not be supported.
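
          As an illustrative sketch of this layout [assuming the serial number is zero-padded to six digits and split into two-digit levels]:

              // jobtrackerstarttime/00/00/00-style path for a job's serial number
              static String historyDir(String jobtrackerStartTime, int jobSerial) {
                String s = String.format("%06d", jobSerial);
                return jobtrackerStartTime + "/" + s.substring(0, 2)
                     + "/" + s.substring(2, 4) + "/" + s.substring(4, 6);
              }
              // historyDir("1276500000", 1234) -> "1276500000/00/12/34"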

          As an enhancement, we might later place index files in the higher-level directories, listing job ids sorted by username, date, etc. These might be written to leaf directories after the 100th job is added to a directory, and to non-leaves after the 10,000th job is added, etc. They could be generated from the cache. With such indexes, user and time-based queries to the archives could be resolved in log n time.

          Amareshwari Sriramadasu added a comment -

          Summarizing the requirements, this issue tries to solve:
          1. Maintaining the history folder becomes hard for administrators, since a huge number of job history files in a single directory cannot be listed.
          2. User experience (response time and searchability) with the history UI should not suffer because of a huge number of files in the history directory.
          3. Given a jobid, a user should be able to find out what the history url is.

          Amareshwari Sriramadasu added a comment -

          > I would also request for removing jobname from the history filename.

          This is done as part of MAPREDUCE-157. I will port the change to Yahoo! distribution with this patch.

          Options for the directory structure of the history files are

          1. {$hadoop.log.dir}/history/done/YYYY-MM-DD/
          2. {$hadoop.log.dir}/history/done/YYYY-MM-DD/USER
          3. {$hadoop.log.dir}/history/done/USER/YYYY-MM-DD/
          4. {$hadoop.log.dir}/history/done/USER/YYYY/MM/DD
          5. {$hadoop.log.dir}/history/done/YYYY/MM/DD/USER

          For the directory structure, I would go with option #1, because it is easy to maintain. We can add more when needed.

          We can have a cache in the JobTracker to look up the history location for each jobid (this can be moved to the HistoryServer when we move history to a separate server). We can have the JT maintain the cache for the last 20 days of history (configurable).
          Now, the file name of the history log file is <jobid>_<user>.log. The job id is about 20 characters long, and if the user name is about 25 characters, the jobhistory file name is about 50 bytes long. For a given jobid, the cache entry in the JT will be at most about 100 bytes. 50,000 such entries would make it 5MB.
          We can have a configuration to limit the number of entries in the cache, the default value being 50,000.
          Thus, the cache is controlled by the number of days for which the cache is maintained and is also capped by the number of entries in the cache.
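
          One simple way to cap such a cache [a sketch, not the proposed implementation; the 50,000 default is from the description above, and the day-based expiry would be layered on top]:

              import java.util.LinkedHashMap;
              import java.util.Map;

              // jobid string -> history file location, evicting least-recently-used entries
              class HistoryLocationCache extends LinkedHashMap<String, String> {
                private final int maxEntries;

                HistoryLocationCache(int maxEntries) {          // e.g. 50000
                  super(16, 0.75f, true);                       // true: iterate in access order
                  this.maxEntries = maxEntries;
                }

                @Override
                protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                  return size() > maxEntries;                   // drop the LRU entry on overflow
                }
              }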

          If the history location is not present in the JT cache, the JT history web ui does not show the page.
          An interested user can call the api Cluster.getJobHistoryUrl(JobID, boolean getFromDFS) to get the url from the DFS if it is not present in the JT.
          We can add bin/hadoop job -historyurl <jobid> to get the historyurl for the jobid from the JT cache. We can add another argument to the command to get the history url from DFS if it is not present in the JT cache.
          Then, HistoryViewer can be used to view the history on the command line.

          Thoughts?

          Hong Tang added a comment -

          I am changing this to "Bug" instead of "Improvement" because we see that the hadoop script runs out of heap space when there are several hundred thousand job history files.

          Amar Kamat added a comment -

          I think it's high time we got this rolling. Since the job history files can be moved to HDFS, efficient searching and jobhistory file management become critical.

          Rajiv Chittajallu added a comment -

          +1 for Tim Williamson's suggestion.

          I would also request for removing jobname from the history filename.

          Tim Williamson added a comment -

          It would be nice if whatever scheme is adopted ensured some upper bound on the number of logs in any single directory. The YYYY/MM/DD/HH scheme would do that in practice. And there's no reason it couldn't be:
          user/YYYY/MM/DD/HH
          which would have the best of both worlds.

          dhruba borthakur added a comment -

          The most common case is when a user is looking for the logs of a job that he had submitted earlier. So, your proposal looks good to me. +1

          On a general note, it appears that what we are trying to do is to index the metadata of completed jobs for efficient retrieval. Is there any way that Apache Derby http://db.apache.org/derby/ might help in this regard?

          Nick Rettinghouse added a comment -

          We have an extremely high job rate. Sorting by YYYY/MM/DD/HH would be a great help. (We could live with YYYY/MM/DD.)

          Amar Kamat added a comment -

          I had an offline discussion with Devaraj, Hemanth and Sharad. It seems like the following structure should solve this issue:

1. old history files: path-to-job-history/
2. history files for the jobtracker on host hostname: path-to-job-history/hostname
3. history files for user username using the jobtracker running on hostname: path-to-job-history/hostname/username
4. job history file format: <start-time><jobid><jobname>

Structuring it further by year, month and day might prove useful, but for now it looks like a premature step; if needed, we can add it later. Users who submit jobs at a very high rate will be affected more than users who submit jobs less frequently. Searching will be easier per-user (a sketch of this layout follows below).

Future steps:
1) Add date-level info in the structure, or at least in the display
2) Add indexing info for faster access/display
3) Provide various views, like recent jobs; sorting by day/week/month/year; job name (sorting and structuring); etc.
4) Secure access
5) Faster access and analysis (involves changes/tweaks to JobHistory and parsing).

          Thoughts?
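
A minimal sketch of the layout proposed above; the underscore separator in the filename and the helper name are assumptions, since the proposal only gives the field order:

    import org.apache.hadoop.fs.Path;

    public class ProposedLayoutSketch {
      // Hypothetical helper reflecting the structure proposed above:
      //   <history-root>/<jobtracker-host>/<user>/<start-time>_<jobid>_<jobname>
      static Path historyFileFor(Path historyRoot, String jtHost, String user,
                                 long startTime, String jobId, String jobName) {
        Path userDir = new Path(new Path(historyRoot, jtHost), user);
        return new Path(userDir, startTime + "_" + jobId + "_" + jobName);
      }

      public static void main(String[] args) {
        Path p = historyFileFor(new Path("/mapred/history"), "jt-host-01",
            "alice", 1247650000000L, "job_200907151430_0001", "wordcount");
        // prints /mapred/history/jt-host-01/alice/1247650000000_job_200907151430_0001_wordcount
        System.out.println(p);
      }
    }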

          Amar Kamat added a comment -

HADOOP-4017 tries to unify the way job-history filenames are handled. It would then be easy to modify them in one place instead of three.

          Doug Cutting added a comment -

> But then I thought it's better to fix/add that when needed.

          That's fine with me!

          Amar Kamat added a comment -

I too had thought of having a second-level categorization based on date/time, but then I thought it's better to fix/add that when needed. For now, fixing the search by user should help us overcome the issue. Let me know if we should also categorize by date in this issue.

          Doug Cutting added a comment -

          > The type of search we do there is given a jobtracker-hostname, job-id, username and job-name [...]

Thanks for the explanation. In that case, a directory per username probably does make sense. Really big directories are generally cumbersome, so you might still slice things by date as well, either above or below the username, so that even a user who has run lots of jobs won't cause, e.g., huge RPCs for listings.

          Amar Kamat added a comment -

Doug, the search is w.r.t. job recovery. The type of search we do there is: given a jobtracker hostname, job id, username and job name, find the job-history file. The way we do it now is:

• construct a regex from the jobtracker hostname, job id, username and job name
• construct a path filter that accepts files matching the pattern and rejects the rest
• use the DFS listing API to find the files matching the pattern

This is a costly operation, as all the files are scanned linearly (a sketch of this lookup follows below). Over time the history folder can grow large, leading to longer search times, and every user is hit by this cost. With the above-mentioned optimization we can reduce the search time for most users.
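
A minimal sketch of that lookup pattern; the history directory and the regex below are placeholders, not the actual pattern the JobTracker builds:

    import java.io.IOException;
    import java.util.regex.Pattern;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.PathFilter;

    public class HistorySearchSketch {
      public static void main(String[] args) throws IOException {
        FileSystem fs = FileSystem.get(new Configuration());

        // Placeholder regex; the real one is built from the jobtracker
        // hostname, job id, username and job name.
        final Pattern pattern =
            Pattern.compile("jt-host-01_.*_job_200907151430_0001_alice_.*");

        // Accept files whose names match the pattern, reject the rest.
        PathFilter filter = new PathFilter() {
          public boolean accept(Path path) {
            return pattern.matcher(path.getName()).matches();
          }
        };

        // The DFS listing API applies the filter, but every file in the
        // directory is still tested one by one -- this is the costly step.
        FileStatus[] matches = fs.listStatus(new Path("/mapred/history"), filter);
        for (FileStatus status : matches) {
          System.out.println(status.getPath());
        }
      }
    }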

          Doug Cutting added a comment -

          > using username will make the search much more efficient

          Why is that? Are most operations that touch the job history user-specific? I would have guessed that most were rather time-specific, that the most frequent operation would be to browse through the job history by time. Is that not the case?


People

• Assignee: Dick King
• Reporter: Amar Kamat
• Votes: 2
• Watchers: 21
