Hadoop Common
  1. Hadoop Common
  2. HADOOP-52

mapred input and output dirs must be absolute

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.1.0
    • Fix Version/s: 0.1.0
    • Component/s: None
    • Labels:
      None

      Description

      DFS converts relative pathnames to be under /user/$USER. But MapReduce jobs may be submitted by a different user than is running the jobtracker and tasktracker. Thus relative paths must be resolved before a job is submitted, so that only absolute paths are seen on the job tracker and tasktracker. I think the simplest way to fix this is to make JobConf.setInputDir(), setOutputDir(), etc. resolve relative pathnames.

      1. cwd.patch
        30 kB
        Owen O'Malley
      2. cwd2.patch
        31 kB
        Owen O'Malley
      3. cwd3.patch
        30 kB
        Owen O'Malley

        Issue Links

          Activity

          Hide
          chris added a comment -

          I just update to the lately, start all the namenode, datanode, jobtracker and tasktracker

          run nutch command (crawl / readdb etc.), and jobtracker thrown exception,

          060325 001911 Property 'file.separator' is \
          060325 001911 Property 'java.vendor.url.bug' is http://java.sun.com/cgi-bin/bugr
          eport.cgi
          060325 001911 Property 'sun.io.unicode.encoding' is UnicodeLittle
          060325 001911 Property 'sun.cpu.endian' is little
          060325 001911 Property 'sun.desktop' is windows
          060325 001911 Property 'sun.cpu.isalist' is
          060325 001911 Version Jetty/5.1.4
          060325 001911 Checking Resource aliases
          060325 001912 Started org.mortbay.jetty.servlet.WebApplicationHandler@1d7fbfb
          060325 001912 Started WebApplicationContext[/,/]
          060325 001912 Started SocketListener on 0.0.0.0:50030
          060325 001912 Started org.mortbay.jetty.Server@c88440
          060325 001921 Server connection on port 50020 from 192.168.1.10: starting
          060325 001927 Server connection on port 50020 from 192.168.1.11: starting
          060325 001933 Server connection on port 50020 from 192.168.1.12: starting
          060325 002240 parsing file:/D:/jobcall_trunk/hadoop/conf/hadoop-default.xml
          060325 002240 parsing file:/D:/jobcall_trunk/hadoop/conf/mapred-default.xml
          060325 002240 parsing file:/D:/jobcall_trunk/hadoop/conf/hadoop-site.xml
          060325 002240 parsing file:/D:/jobcall_trunk/hadoop/conf/hadoop-default.xml
          060325 002240 parsing file:/D:/jobcall_trunk/hadoop/conf/mapred-default.xml
          060325 002240 parsing \tmp\hadoop\mapred\local\job_3dgnm3.xml\jobTracker
          060325 002240 parsing file:/D:/jobcall_trunk/hadoop/conf/hadoop-site.xml
          060325 002258 parsing file:/D:/jobcall_trunk/hadoop/conf/hadoop-default.xml
          060325 002258 parsing file:/D:/jobcall_trunk/hadoop/conf/mapred-default.xml
          060325 002258 parsing \tmp\hadoop\mapred\local\job_3dgnm3.xml\jobTracker
          060325 002258 parsing file:/D:/jobcall_trunk/hadoop/conf/hadoop-site.xml
          060325 002258 job init failed
          java.io.IOException: No input directories specified in: Configuration: defaults:
          hadoop-default.xml , mapred-default.xml , \tmp\hadoop\mapred\local\job_3dgnm3.x
          ml\jobTrackerfinal: hadoop-site.xml
          at org.apache.hadoop.mapred.InputFormatBase.listFiles(InputFormatBase.ja
          va:90)
          at org.apache.hadoop.mapred.InputFormatBase.getSplits(InputFormatBase.ja
          va:100)
          at org.apache.hadoop.mapred.JobInProgress.initTasks(JobInProgress.java:1
          30)
          at org.apache.hadoop.mapred.JobTracker$JobInitThread.run(JobTracker.java
          :204)
          at java.lang.Thread.run(Thread.java:595)
          060325 002300 Server connection on port 50020 from 192.168.1.12: exiting

          Show
          chris added a comment - I just update to the lately, start all the namenode, datanode, jobtracker and tasktracker run nutch command (crawl / readdb etc.), and jobtracker thrown exception, 060325 001911 Property 'file.separator' is \ 060325 001911 Property 'java.vendor.url.bug' is http://java.sun.com/cgi-bin/bugr eport.cgi 060325 001911 Property 'sun.io.unicode.encoding' is UnicodeLittle 060325 001911 Property 'sun.cpu.endian' is little 060325 001911 Property 'sun.desktop' is windows 060325 001911 Property 'sun.cpu.isalist' is 060325 001911 Version Jetty/5.1.4 060325 001911 Checking Resource aliases 060325 001912 Started org.mortbay.jetty.servlet.WebApplicationHandler@1d7fbfb 060325 001912 Started WebApplicationContext [/,/] 060325 001912 Started SocketListener on 0.0.0.0:50030 060325 001912 Started org.mortbay.jetty.Server@c88440 060325 001921 Server connection on port 50020 from 192.168.1.10: starting 060325 001927 Server connection on port 50020 from 192.168.1.11: starting 060325 001933 Server connection on port 50020 from 192.168.1.12: starting 060325 002240 parsing file:/D:/jobcall_trunk/hadoop/conf/hadoop-default.xml 060325 002240 parsing file:/D:/jobcall_trunk/hadoop/conf/mapred-default.xml 060325 002240 parsing file:/D:/jobcall_trunk/hadoop/conf/hadoop-site.xml 060325 002240 parsing file:/D:/jobcall_trunk/hadoop/conf/hadoop-default.xml 060325 002240 parsing file:/D:/jobcall_trunk/hadoop/conf/mapred-default.xml 060325 002240 parsing \tmp\hadoop\mapred\local\job_3dgnm3.xml\jobTracker 060325 002240 parsing file:/D:/jobcall_trunk/hadoop/conf/hadoop-site.xml 060325 002258 parsing file:/D:/jobcall_trunk/hadoop/conf/hadoop-default.xml 060325 002258 parsing file:/D:/jobcall_trunk/hadoop/conf/mapred-default.xml 060325 002258 parsing \tmp\hadoop\mapred\local\job_3dgnm3.xml\jobTracker 060325 002258 parsing file:/D:/jobcall_trunk/hadoop/conf/hadoop-site.xml 060325 002258 job init failed java.io.IOException: No input directories specified in: Configuration: defaults: hadoop-default.xml , mapred-default.xml , \tmp\hadoop\mapred\local\job_3dgnm3.x ml\jobTrackerfinal: hadoop-site.xml at org.apache.hadoop.mapred.InputFormatBase.listFiles(InputFormatBase.ja va:90) at org.apache.hadoop.mapred.InputFormatBase.getSplits(InputFormatBase.ja va:100) at org.apache.hadoop.mapred.JobInProgress.initTasks(JobInProgress.java:1 30) at org.apache.hadoop.mapred.JobTracker$JobInitThread.run(JobTracker.java :204) at java.lang.Thread.run(Thread.java:595) 060325 002300 Server connection on port 50020 from 192.168.1.12: exiting
          Hide
          Owen O'Malley added a comment -

          Can you please include the text from the exception and the stack trace? Depending on where the exception was thrown, it might be in the server log files.

          Show
          Owen O'Malley added a comment - Can you please include the text from the exception and the stack trace? Depending on where the exception was thrown, it might be in the server log files.
          Hide
          chris added a comment -

          Sorry
          but It seems this code will thown IOException in windows cygwin env......

          it can't get file path correctly.

          Show
          chris added a comment - Sorry but It seems this code will thown IOException in windows cygwin env...... it can't get file path correctly.
          Hide
          Doug Cutting added a comment -

          I have committed this. Thanks for adding unit tests! I had to add a Thread.sleep(1000) to the dfs test code before the tests would pass, to give the datanode a chance to introduce itself to the namenode before the client code started using the fs. Perhaps this is just masking a bug. Perhaps DFSClient should sleep and retry when NameNode.create() throws an exception?

          Show
          Doug Cutting added a comment - I have committed this. Thanks for adding unit tests! I had to add a Thread.sleep(1000) to the dfs test code before the tests would pass, to give the datanode a chance to introduce itself to the namenode before the client code started using the fs. Perhaps this is just masking a bug. Perhaps DFSClient should sleep and retry when NameNode.create() throws an exception?
          Hide
          Owen O'Malley added a comment -

          Ok, this patch includes cwd2.patch with the synchronization in LocalJobRunner rolled back.

          Show
          Owen O'Malley added a comment - Ok, this patch includes cwd2.patch with the synchronization in LocalJobRunner rolled back.
          Hide
          Owen O'Malley added a comment -

          By the way, i didn't see your comment from 10:17 when I submitted my patch at 10:18.

          I'll go ahead and unsynchronize LocalJobRunner and roll a new patch.

          I guess it does make sense to make FileSystem's clonable, even if LocalFileSystem sets the global system property. We can do that later.

          Show
          Owen O'Malley added a comment - By the way, i didn't see your comment from 10:17 when I submitted my patch at 10:18. I'll go ahead and unsynchronize LocalJobRunner and roll a new patch. I guess it does make sense to make FileSystem's clonable, even if LocalFileSystem sets the global system property. We can do that later.
          Hide
          Doug Cutting added a comment -

          The synchronization fixes the problem, but causes everything to run sequentially. I think with the same amount of code we could add a clone() method to FileSystem and provide a better fix. Currently one cannot pass a FileSystem instance to JobClient, but if one could, then we would be changing the CWD of that filesystem underneath other processes that might be using it w/o synchronization.

          If we want to be totally safe, we could make setWorkingDirectory return a clone, with no side-effects. Cloning an FS should be cheap enough.

          Show
          Doug Cutting added a comment - The synchronization fixes the problem, but causes everything to run sequentially. I think with the same amount of code we could add a clone() method to FileSystem and provide a better fix. Currently one cannot pass a FileSystem instance to JobClient, but if one could, then we would be changing the CWD of that filesystem underneath other processes that might be using it w/o synchronization. If we want to be totally safe, we could make setWorkingDirectory return a clone, with no side-effects. Cloning an FS should be cheap enough.
          Hide
          Owen O'Malley added a comment -

          The synchronized (fs) will help the case where the application submits two jobs to the LocalJobRunner with different working directories. Because the LocalJobRunner obeys the synchronization with respect to the file system object, it should work. (I don't have any parallel map/reduce job applications to test it with and testing for missing synchronization is almost impossible anyways.)

          Show
          Owen O'Malley added a comment - The synchronized (fs) will help the case where the application submits two jobs to the LocalJobRunner with different working directories. Because the LocalJobRunner obeys the synchronization with respect to the file system object, it should work. (I don't have any parallel map/reduce job applications to test it with and testing for missing synchronization is almost impossible anyways.)
          Hide
          Owen O'Malley added a comment -

          Ok, here is an updated patch that uses camelCase everywhere for variable names.

          I also put in synchronization between the setWorkingDirectory and running the
          application code in the LocalJobRunner.

          Show
          Owen O'Malley added a comment - Ok, here is an updated patch that uses camelCase everywhere for variable names. I also put in synchronization between the setWorkingDirectory and running the application code in the LocalJobRunner.
          Hide
          Doug Cutting added a comment -

          The 'synchronize (fs) ...' is worthless unless everyone does it, which they don't, so I wouldn't bother.

          I think you're right that my concern about LocalJobRunner is misplaced. It is not a public class, and a new instance is created for each job, so there's currently no way to have multiple jobs sharing a LocalJobRunner.

          Longer term, if this becomes an issue, I think making FileSystem cloneable is preferable to synchronization.

          Show
          Doug Cutting added a comment - The 'synchronize (fs) ...' is worthless unless everyone does it, which they don't, so I wouldn't bother. I think you're right that my concern about LocalJobRunner is misplaced. It is not a public class, and a new instance is created for each job, so there's currently no way to have multiple jobs sharing a LocalJobRunner. Longer term, if this becomes an issue, I think making FileSystem cloneable is preferable to synchronization.
          Hide
          Owen O'Malley added a comment -

          I'll go ahead and update the variable names to camelCast.

          The LocalJobRunner only has a problem when the user is submitting multiple jobs at the same time, right?

          I knew I was opening a can of worms with local directories with respect to threads. There are lots of potential solutions including making the current directory thread local. (Although the LocalFileSystem would have discontinuities, because it also sets the system property, which is by definition global.) I think for now that we are better off following the unix semantics of treating the working directory as global.

          For now, I'm proposing synchronizing on the file system, like:

          synchronize (fs)

          { setWorkingDirectory(job, fs); MapTask map = new MapTask(file, (String)mapIds.get(i), splits[i]); }

          around each of the places where the LocalJobRunner sets the working directory.

          On a side note, if we are worried about local runners with multiple jobs, we should put synchronization in around the updates of the jobs list and other fields of the LocalJobRunner.

          Show
          Owen O'Malley added a comment - I'll go ahead and update the variable names to camelCast. The LocalJobRunner only has a problem when the user is submitting multiple jobs at the same time, right? I knew I was opening a can of worms with local directories with respect to threads. There are lots of potential solutions including making the current directory thread local. (Although the LocalFileSystem would have discontinuities, because it also sets the system property, which is by definition global.) I think for now that we are better off following the unix semantics of treating the working directory as global. For now, I'm proposing synchronizing on the file system, like: synchronize (fs) { setWorkingDirectory(job, fs); MapTask map = new MapTask(file, (String)mapIds.get(i), splits[i]); } around each of the places where the LocalJobRunner sets the working directory. On a side note, if we are worried about local runners with multiple jobs, we should put synchronization in around the updates of the jobs list and other fields of the LocalJobRunner.
          Hide
          Doug Cutting added a comment -

          Overall this looks great! A few minor comments:

          . your new local variables and non-static fields use underscores as word separators, rather than camelCase, as is both the Java convention and the convention used in the files you're changing. Can you please switch these to use camelCase?

          . In LocalJobRunner there are potentially multiple threads changing the working directory of a single fs. Perhaps we should instead have a different fs instance for each job thread? Should we make FileSystem support clone()?

          Thanks!

          Show
          Doug Cutting added a comment - Overall this looks great! A few minor comments: . your new local variables and non-static fields use underscores as word separators, rather than camelCase, as is both the Java convention and the convention used in the files you're changing. Can you please switch these to use camelCase? . In LocalJobRunner there are potentially multiple threads changing the working directory of a single fs. Perhaps we should instead have a different fs instance for each job thread? Should we make FileSystem support clone()? Thanks!
          Hide
          Owen O'Malley added a comment -

          Here is a patch that fixes the problem.

          It does:
          1. It adds

          {set,get}

          WorkingDirectory to FileSystem.
          2. It implements them in both LocalFileSystem and DFS.
          3. The LocalFileSystem implementation both sets the System property user.dir and does an explicit
          conversion to absolute filenames at the API.
          4. Added new junit test cases to test the WorkingDirectory functionality.
          5. Added a utility class in the test directory to create a single-process DFS cluster for junit tests.
          6. Added the user name into the JobConf.
          7. Added the user name into the JobProfile.
          8. Added the user name into the webapp, so you can see who ran the job.
          9. Added the working directory in the default file system to the JobConf.
          10. Set the job's working directory before starting the user's Map or Reduce code. (The input splitter is given an absolute pathname
          for the input directory, but the working directory is not set, since it is done in the context of the JobTracker.)
          11. Changed the format of the percentage complete in the webapp to be ##0.00 so that you don't get 16 digits of meaningless precision
          about your job status.

          Show
          Owen O'Malley added a comment - Here is a patch that fixes the problem. It does: 1. It adds {set,get} WorkingDirectory to FileSystem. 2. It implements them in both LocalFileSystem and DFS. 3. The LocalFileSystem implementation both sets the System property user.dir and does an explicit conversion to absolute filenames at the API. 4. Added new junit test cases to test the WorkingDirectory functionality. 5. Added a utility class in the test directory to create a single-process DFS cluster for junit tests. 6. Added the user name into the JobConf. 7. Added the user name into the JobProfile. 8. Added the user name into the webapp, so you can see who ran the job. 9. Added the working directory in the default file system to the JobConf. 10. Set the job's working directory before starting the user's Map or Reduce code. (The input splitter is given an absolute pathname for the input directory, but the working directory is not set, since it is done in the context of the JobTracker.) 11. Changed the format of the percentage complete in the webapp to be ##0.00 so that you don't get 16 digits of meaningless precision about your job status.

            People

            • Assignee:
              Owen O'Malley
              Reporter:
              Doug Cutting
            • Votes:
              2 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development