
Key: HADOOP-2116
Type: Improvement
Status: Closed
Resolution: Fixed
Priority: Major
Assignee: Amareshwari Sriramadasu
Reporter: Milind Bhandarkar
Votes: 0
Watchers: 0
Hadoop Common

Job.local.dir to be exposed to tasks

Created: 28/Oct/07 07:38 PM   Updated: 08/Jul/09 04:52 PM
Component/s: None
Affects Version/s: 0.14.3
Fix Version/s: 0.17.0

Time Tracking:
Not Specified

File Attachments (all text files, licensed for inclusion in ASF works):
patch-2116.txt   2008-03-19 12:46 PM   Amareshwari Sriramadasu   33 kB
patch-2116.txt   2008-02-26 10:56 AM   Amareshwari Sriramadasu   29 kB
patch-2116.txt   2008-01-18 06:46 AM   Amareshwari Sriramadasu   6 kB
patch-2116.txt   2008-01-17 06:00 AM   Amareshwari Sriramadasu   6 kB
patch-2116.txt   2008-01-10 11:04 AM   Amareshwari Sriramadasu   7 kB
patch-2116.txt   2008-01-09 12:31 PM   Amareshwari Sriramadasu   6 kB
Environment: All
Issue Links:
Reference
 

Hadoop Flags: Incompatible change
Release Note:
This issue restructures the local job directory on the tasktracker.
Users are provided with a job-specific shared directory (mapred-local/taskTracker/jobcache/$jobid/work) for use as scratch space, exposed through the configuration property and system property "job.local.dir". The directory "../work" is no longer available from the task's cwd.
Resolution Date: 20/Mar/08 03:36 PM


Description
Currently, since all task cwds are created under a jobcache directory, users that need a job-specific shared directory for use as scratch space, create ../work. This is hacky, and will break when HADOOP-2115 is addressed. For such jobs, hadoop mapred should expose job.local.dir via localized configuration.
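
A minimal sketch of how task-side code might locate the scratch directory once it is exposed, assuming (as the release note states) that the value is published under the system property "job.local.dir". The class name JobScratchDir and the fallback argument are hypothetical and not part of Hadoop.

```java
import java.io.File;

/** Hypothetical helper for illustration; not part of Hadoop. */
final class JobScratchDir {
    private JobScratchDir() {}

    /**
     * Returns the job-specific scratch directory named by the
     * "job.local.dir" system property, or the given fallback when the
     * property is unset or empty (for example when run outside a task).
     */
    static File resolve(File fallback) {
        String dir = System.getProperty("job.local.dir");
        if (dir == null || dir.isEmpty()) {
            return fallback;
        }
        return new File(dir);
    }

    public static void main(String[] args) {
        File scratch = resolve(new File(System.getProperty("java.io.tmpdir")));
        System.out.println("scratch space: " + scratch.getPath());
    }
}
```

Reading the property in one place keeps user code from hard-coding a relative path such as ../work, which is the dependency this issue sets out to remove.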

Konstantin Shvachko added a comment - 21/Dec/07 07:30 PM - edited
This is also practically fixed by HADOOP-2227. The only thing left is to expose the shared directory through the configuration.
JobConf now has a property "mapred.jar" accessible through getJar() method, which points to the jar file located in the jobcache
directory, which in fact is in the common shared directory for the job tasks.
Namely,
"mapred.jar" = "mapred.local.dir"[i]/taskTracker/jobcache/<job_id>/job.jar

So we can replace configuration parameter "mapred.jar" by "job.local.dir", which will point to the parent of "mapred.jar".
JobConf.getJar() can be implemented then as

String getJar() {
    return get("job.local.dir") + "/job.jar";
}

Will that work?

With respect to all of the above, I wonder why we need to use LocalDirAllocator in TaskRunner.run()
if the job cache directory (jobCacheDir) can be obtained directly from TaskRunner.conf:

File jobCacheDir = new File(new File(conf.getJar()).getParentFile(), "work");

Amareshwari Sriramadasu added a comment - 26/Dec/07 12:12 PM

So we can replace configuration parameter "mapred.jar" by "job.local.dir", which will point to the parent of "mapred.jar".

We cannot replace mapred.jar by job.local.dir because mapred.jar can be set and get by setJar() and getJar() from client side. For example, launchWordCount in TestMiniMRClassPath gives a different path for jar file.

To expose the shared directory through the configuration, we can set
localJobConf.set("job.local.dir", jobDir) in localizeJob(),
and the job cache directory can then be obtained as
File jobCacheDir = new File(new File(conf.get("job.local.dir")), "work");
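
To make the difference between the two derivations concrete, here is a small sketch. It uses java.util.Properties as a stand-in for the localized JobConf, and the class name and all paths are hypothetical. It only illustrates the point made above: deriving the directory from "mapred.jar" follows wherever the jar happens to be, while "job.local.dir" does not.

```java
import java.io.File;
import java.util.Properties;

/** Illustration only; Properties stands in for the localized JobConf. */
final class JobCacheDirDerivation {
    private JobCacheDirDerivation() {}

    /** Derivation from the jar location: parent of "mapred.jar" plus "work". */
    static File fromJar(Properties conf) {
        File jar = new File(conf.getProperty("mapred.jar"));
        return new File(jar.getParentFile(), "work");
    }

    /** Derivation from the proposed property: "job.local.dir" plus "work". */
    static File fromJobLocalDir(Properties conf) {
        return new File(new File(conf.getProperty("job.local.dir")), "work");
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        // Hypothetical values: the jar path was set by the client,
        // so it does not live in the job's local directory.
        conf.setProperty("job.local.dir", "/local/taskTracker/jobcache/job_1");
        conf.setProperty("mapred.jar", "/home/user/wordcount.jar");
        System.out.println("from mapred.jar:    " + fromJar(conf));
        System.out.println("from job.local.dir: " + fromJobLocalDir(conf));
    }
}
```

The two methods agree only while the jar sits directly inside the job directory. Once a client supplies its own jar path, the jar-based derivation points somewhere unrelated.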


Milind Bhandarkar added a comment - 27/Dec/07 07:38 PM
I would prefer separating the two. I.e. where job.jar goes, versus where the job.local.dir goes. Especially for streaming, where side-effect tasks are common, the mapper and reducer commands would need to have a clean directory (empty) where they can cache job-specific data (dictionaries downloaded off the network etc, that cannot be packaged as distributed archives). If job.jar also lives there, it might someday clash with the files downloaded, and cause issues.

So, mapred.jar, jobCacheDir, and job.local.dir all need to be different locations.

Is jobCacheDir available via a config variable ?


Amareshwari Sriramadasu added a comment - 02/Jan/08 11:42 AM
Currently, jobCacheDir is "mapred/local/taskTracker/jobcache/<job_id>/work".
As far as I understand, this needs to be accessible as "job.local.dir", a job-specific shared directory for use as scratch space.

So, mapred.jar, jobCacheDir, and job.local.dir all need to be different locations.

Here, the existing jobCacheDir would become job.local.dir.

If you want jobCacheDir to point to "mapred/local/taskTracker/jobcache/<job_id>/", it cannot be made available via a config variable, because it cannot take a unique value: the directory can be present on more than one disk. For example, we can have a task directory (mapred/local/taskTracker/jobcache/<job_id>/<taskid>) on a disk other than the one holding job.local.dir.

Finally, we will have mapred.jar and job.local.dir (earlier jobCacheDir), both at different locations.

Thoughts?


Milind Bhandarkar added a comment - 08/Jan/08 06:11 PM
There are several advantages to have job.local.dir to be empty when the first task from that job starts on a tasktracker. (It would simplify the logic for user code to populate it with job-specific cached data that cannot use jobCache functionality.)

That is why I suggest that mapred.jar, jobCacheDir, and job.local.dir all need to be different locations.


Amareshwari Sriramadasu added a comment - 09/Jan/08 06:24 AM
I propose the following job cache directory structure to address the above needs:
mapred/local/tasktracker/jobcache/<jobid>/
    job_jar_xml/
        job.jar
        job.xml
        unJarred directory
    work/
    <taskdir>/

And we can have the directories job_jar_xml, work and taskdir on different disks.
mapred/local/tasktracker/jobcache/<jobid>/job_jar_xml/job.jar is available via mapred.jar
and mapred/local/tasktracker/jobcache/<jobid>/work is available via job.local.dir , which is an empty directory.

Thoughts?


Amareshwari Sriramadasu added a comment - 09/Jan/08 12:37 PM
Submitting the patch with the proposed approach.

Milind Bhandarkar added a comment - 09/Jan/08 08:32 PM
Does this mean that all the taskdir will again use the same partition ?
It would be opposite of HADOOP-2227, right ?
That's not good performance-wise either, since all tasks will be using the same spindle.

Devaraj Das added a comment - 09/Jan/08 08:39 PM

Does this mean that all the taskdir will again use the same partition ?

No, the taskdir will be on different disks (using the LocalDirAllocator). The common directories for all tasks of a given job will be the job.local.dir/jobCacheDir and the job_jar_xml (they will be configured/set up once per job using the LocalDirAllocator).
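
As a rough illustration of spreading task directories over several local disks, the sketch below rotates through a list of roots. This is a hypothetical round-robin picker written for this example. It is not the LocalDirAllocator API, and the real allocator may use a different selection policy.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical round-robin picker; not Hadoop's LocalDirAllocator. */
final class RoundRobinDirs {
    private final List<File> roots;
    private int next = 0;

    RoundRobinDirs(List<File> roots) {
        if (roots.isEmpty()) {
            throw new IllegalArgumentException("need at least one local dir");
        }
        this.roots = new ArrayList<>(roots);
    }

    /** Returns the given relative path under the next root in rotation. */
    synchronized File pathFor(String relative) {
        File root = roots.get(next);
        next = (next + 1) % roots.size();
        return new File(root, relative);
    }

    public static void main(String[] args) {
        // Hypothetical disk mount points and task names.
        RoundRobinDirs dirs = new RoundRobinDirs(
                Arrays.asList(new File("/disk1/mapred/local"), new File("/disk2/mapred/local")));
        for (String task : new String[] {"task_0", "task_1", "task_2"}) {
            System.out.println(dirs.pathFor("taskTracker/jobcache/job_1/" + task));
        }
    }
}
```

In this model the job-level directories would be chosen once per job, while each task directory gets its own pick, so tasks of one job need not share a spindle.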


Milind Bhandarkar added a comment - 09/Jan/08 08:43 PM
In that case, +1 for this approach !

Amareshwari Sriramadasu added a comment - 10/Jan/08 06:11 AM
Cancelling the patch; the streaming jobCacheDir still has to be fixed.

Amareshwari Sriramadasu added a comment - 10/Jan/08 11:06 AM
Submitting again with fixes for streaming and the isolation runner.

Lohit Vijayarenu added a comment - 10/Jan/08 09:37 PM
Hi Amareshwari,

I tested this patch against trunk for resolution of HADOOP-2570. This solves the problem mentioned in HADOOP-2570. Should this patch be marked to go in 0.15.3 ?

Thanks,
Lohit


Arun C Murthy added a comment - 11/Jan/08 05:38 AM
In light of HADOOP-2570, I'm cancelling this patch.

Reasoning:

The -file option works by putting the script into the job's jar file by unjar-ing, copying and then jar-ing it again. (yuck!)

This means that on the TaskTracker the script has moved from jobCache/work to jobCache/job_jar_xml (I propose we rename that to private, heh). Clearly user-scripts which rely on "../work/<script_name>" will break again...

Having said that, we need to debate whether this feature is an incompatible change. What do folks think?

If people say otherwise, we need to ensure all files in jobCache/private are symlinked into jobCache/work... ugh!


I'd like to take this opportunity to take a hard look at streaming's -file option too. The unjar/jar way is completely backwards! We should rework the -file option to use the DistributedCache and the symlink option it provides.
So, user-scripts can simply be "./<script>" rather than "../work/<script>". Yes, the way to maintain compatibility (if we want) is to use the previous option of symlinking files into jobCache/work also. I'd strongly vote for this option.

Thoughts?
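
The compatibility option mentioned above, symlinking files from one directory into another, could be sketched as follows. It uses only java.nio.file from the JDK. The class and method names are hypothetical, and the directory names "private" and "work" in the usage are taken from the comment above rather than from any committed layout.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.LinkOption;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Illustration only: link every entry of one directory into another. */
final class SymlinkInto {
    private SymlinkInto() {}

    /**
     * Creates, in targetDir, one symbolic link per entry of sourceDir,
     * skipping names that already exist in targetDir.
     * Returns the number of links created.
     */
    static int linkAll(Path sourceDir, Path targetDir) throws IOException {
        Files.createDirectories(targetDir);
        int created = 0;
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(sourceDir)) {
            for (Path entry : entries) {
                Path link = targetDir.resolve(entry.getFileName());
                // NOFOLLOW_LINKS so that an existing (even dangling) link counts as present.
                if (Files.notExists(link, LinkOption.NOFOLLOW_LINKS)) {
                    Files.createSymbolicLink(link, entry.toAbsolutePath());
                    created++;
                }
            }
        }
        return created;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java SymlinkInto <jobCache>/private <jobCache>/work
        System.out.println(linkAll(Paths.get(args[0]), Paths.get(args[1])) + " link(s) created");
    }
}
```

Symbolic link creation can fail on file systems or platforms that do not permit it, so a real implementation would need a fallback for that case.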


Owen O'Malley added a comment - 11/Jan/08 06:41 AM - edited
Ugh is right.

I'd propose some better names:

$local/work/$jobid/
       cache/               -- file cache
       jars/                    -- expanded jar
       job.xml               -- the generic job conf
       $taskid/
             job.xml        -- task localized job conf
             output/         -- map outputs
             work/            -- cwd for task

with each of the leaf directories being placed independently on the partitions.

We should define localized attributes to point to where each of the leaf directories is.

I agree with Arun that we should re-work the -file option to use the file cache with symlinks.


Devaraj Das added a comment - 11/Jan/08 06:02 PM
The only problem with this is the set of incompatible changes (like ../work and ../work/script); code, especially scripts that assume paths, will break. So, is everyone okay with this for 0.16? Should we do the symlink stuff to maintain backward compatibility? As an aside, in the directory organization Owen suggested, one thing that needs to be added is the common scratch space for all tasks (like the file cache).

Another thing IMO is that we should probably just do the basic dir organization as was proposed by Amareshwari earlier and the streaming fix. The magnitude of the change required by the dir organization proposed by Owen seems pretty significant and seems aggressive for 0.16. Maybe we can do the remaining for 0.17. Thoughts?


Milind Bhandarkar added a comment - 11/Jan/08 10:14 PM
Since this bug is scheduled for 0.16, having incompatible changes in that release is fine (of course, as long as it is flagged as such in the release notes).

Hadoop QA added a comment - 12/Jan/08 08:13 AM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12372896/patch-2116.txt
against trunk revision r611361.

@author +1. The patch does not contain any @author tags.

javadoc +1. The javadoc tool did not generate any warning messages.

javac +1. The applied patch does not generate any new compiler warnings.

findbugs -1. The patch appears to introduce 1 new Findbugs warnings.

core tests +1. The patch passed core unit tests.

contrib tests -1. The patch failed contrib unit tests.

Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1552/testReport/
Findbugs warnings: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1552/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1552/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1552/console

This message is automatically generated.


Amareshwari Sriramadasu added a comment - 16/Jan/08 12:42 PM

I'd like to take this opportunity to take a hard look at streaming's -file option too. The unjar/jar way is completely backwards! We should rework the -file option to use the DistributedCache and the symlink option it provides.

I created HADOOP-2622 to look at -file option.
For 0.16.0, this issue will address the directory structure proposed earlier rather than the more elaborate structure proposed later.


Amareshwari Sriramadasu added a comment - 18/Jan/08 06:48 AM
This patch has the empty work directory available as scratch space through environment variable "job.local.dir".
The directory layout is as described earlier.
I did thorough testing: I tested wordcount, sort, and a streaming job.

Arun C Murthy added a comment - 19/Jan/08 12:18 AM
This patch sets a system property 'job.local.dir'; I'm assuming that it is inherited by the children?
+        System.setProperty("job.local.dir", workDir.toString());

Even so, I think we should set a property in the JobConf to be consistent.


Overall, I'm a little concerned that this is quite late (w.r.t. 0.16.0) to be getting this in. I spoke to Milind and he is happy with the HADOOP-2570 fix (the symlink to ../work), especially given the number of changes we need to make wherever we use something.getParent(). Hence I propose we push this to 0.17.0 and also make it a bigger change, incorporating the wider changes to the task's local directories proposed by Owen. Thoughts?


Milind Bhandarkar added a comment - 19/Jan/08 12:22 AM
As long as ../work works currently from task cwd to a shared job-specific directory, I am okay with punting this.
So, +1.

Hadoop QA added a comment - 19/Jan/08 01:43 AM
+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12373478/patch-2116.txt
against trunk revision r613115.

@author +1. The patch does not contain any @author tags.

javadoc +1. The javadoc tool did not generate any warning messages.

javac +1. The applied patch does not generate any new compiler warnings.

findbugs +1. The patch does not introduce any new Findbugs warnings.

core tests +1. The patch passed core unit tests.

contrib tests +1. The patch passed contrib unit tests.

Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1644/testReport/
Findbugs warnings: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1644/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1644/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1644/console

This message is automatically generated.


Amareshwari Sriramadasu added a comment - 22/Jan/08 06:46 AM
A clarification regarding distributed cache:
The current behavior of the distributed cache is that it is shared among jobs. The cache is localized under mapred/local/tasktracker/archive, i.e. if two jobs want to localize files with the same name, they actually share them unless the files have different timestamps. Whenever a task releases a cache, it decrements the reference count for the cache-id. The cache is cleaned up only when its size exceeds the allowed size (local.cache.size).
Is it the intended behavior, or should the cache be job specific? With the directory structure that Owen has suggested, it seems like cache should be job specific.
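
The sharing and cleanup rules described in this comment can be modelled in a few lines. The sketch below is a toy model written for illustration, not Hadoop's DistributedCache code. The class name, the name-plus-timestamp key format, and the choice to run cleanup at release time are all assumptions.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

/** Toy model of a shared, reference-counted cache; not Hadoop's DistributedCache. */
final class RefCountedCache {
    private static final class Entry {
        final long size;
        int refCount;
        Entry(long size) { this.size = size; }
    }

    private final Map<String, Entry> entries = new HashMap<>();
    private final long maxSize; // plays the role of local.cache.size
    private long totalSize = 0;

    RefCountedCache(long maxSize) { this.maxSize = maxSize; }

    /** Same name and same timestamp map to the same entry, so jobs share it. */
    private static String key(String name, long timestamp) {
        return name + "@" + timestamp;
    }

    /** Localizes (or reuses) an entry and takes a reference on it. */
    synchronized void acquire(String name, long timestamp, long size) {
        String k = key(name, timestamp);
        Entry e = entries.get(k);
        if (e == null) {
            e = new Entry(size);
            entries.put(k, e);
            totalSize += size;
        }
        e.refCount++;
    }

    /** Drops a reference; unreferenced entries are purged only when over the limit. */
    synchronized void release(String name, long timestamp) {
        Entry e = entries.get(key(name, timestamp));
        if (e != null && e.refCount > 0) {
            e.refCount--;
        }
        if (totalSize > maxSize) {
            purgeUnreferenced();
        }
    }

    private void purgeUnreferenced() {
        Iterator<Entry> it = entries.values().iterator();
        while (it.hasNext()) {
            Entry e = it.next();
            if (e.refCount == 0) {
                totalSize -= e.size;
                it.remove();
            }
        }
    }

    synchronized int entryCount() { return entries.size(); }

    synchronized long totalSize() { return totalSize; }
}
```

In this model an unreferenced entry stays on disk for as long as the cache is under its size limit, which is what lets a later job reuse a file that an earlier job localized.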

Owen O'Malley added a comment - 07/Feb/08 07:37 AM
You are right that the file cache is shared between jobs, and that is the desired behavior. (Although it is fair to ask whether that is the right policy once we have permissions. In general, probably not, since it wouldn't be hard to create a file that looks like the desired one and get access to a file that you should not have access to.)

So what would you suggest for a layout?


Amareshwari Sriramadasu added a comment - 20/Feb/08 09:01 AM
I feel that even with permissions, DistributedCache behavior should stay the same. If the same user wants to share files across jobs, that should be allowed. And if the user wants to share files with another user who has permission to access them, that should be allowed too. The TaskTracker need not worry about user permissions when localizing the cache; those should be taken care of in DistributedCache itself. The permissions aspect of DistributedCache has to be handled in a different JIRA.
I propose that the new layout be the same as Owen suggested, but without the file cache as part of the job cache.
So, it is
mapred/local/taskTracker/jobcache/$jobid/
    work/                -- the scratch space
    jars/                -- expanded jar
    job.xml              -- the generic job conf
    $taskid/
        job.xml          -- task localized job conf
        output/          -- map outputs
        work/            -- cwd for task
mapred/local/taskTracker/archive/   -- distributed cache

Thoughts?
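
A sketch of what materializing this layout might look like, using only java.nio.file. For simplicity it places everything under a single root, whereas the discussion above intends the leaf directories to be placed independently across disks. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Illustration only: create the proposed per-job layout under one root. */
final class JobLayout {
    private JobLayout() {}

    private static Path jobDir(Path localRoot, String jobId) {
        return localRoot.resolve("taskTracker").resolve("jobcache").resolve(jobId);
    }

    /** Creates $jobid/jars and $jobid/work; returns the shared scratch directory. */
    static Path createJobDirs(Path localRoot, String jobId) throws IOException {
        Path job = jobDir(localRoot, jobId);
        Files.createDirectories(job.resolve("jars"));
        return Files.createDirectories(job.resolve("work"));
    }

    /** Creates $jobid/$taskid/output and $jobid/$taskid/work; returns the task's cwd. */
    static Path createTaskDirs(Path localRoot, String jobId, String taskId) throws IOException {
        Path task = jobDir(localRoot, jobId).resolve(taskId);
        Files.createDirectories(task.resolve("output"));
        return Files.createDirectories(task.resolve("work"));
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("mapred-local");
        Path shared = createJobDirs(root, "job_1");
        Path cwd = createTaskDirs(root, "job_1", "task_1");
        System.out.println("job scratch space: " + shared);
        System.out.println("task cwd:          " + cwd);
        // From the task cwd, "../work" now names the cwd itself, not the shared directory.
        System.out.println("../work resolves to: " + cwd.resolve("../work").normalize());
    }
}
```

Because the task cwd is now $taskid/work, the relative path ../work no longer reaches the job-level scratch directory, which is why the location needs to be published through job.local.dir.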


Amareshwari Sriramadasu added a comment - 26/Feb/08 10:59 AM
Here is a patch with the proposed design.
I ran sort on 500 nodes and also ran a streaming application on 10 nodes.
Lohit, can you also run your streaming application and verify that this patch is fine?

Hadoop QA added a comment - 27/Feb/08 12:21 PM
+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12376475/patch-2116.txt
against trunk revision 619744.

@author +1. The patch does not contain any @author tags.

tests included +1. The patch appears to include 6 new or modified tests.

javadoc +1. The javadoc tool did not generate any warning messages.

javac +1. The applied patch does not generate any new javac compiler warnings.

release audit +1. The applied patch does not generate any new release audit warnings.

findbugs +1. The patch does not introduce any new Findbugs warnings.

core tests +1. The patch passed core unit tests.

contrib tests +1. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1844/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1844/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1844/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1844/console

This message is automatically generated.


Devaraj Das added a comment - 19/Mar/08 09:55 AM
Please add some documentation around job.local.dir.

Amareshwari Sriramadasu added a comment - 19/Mar/08 12:46 PM
Added an API, getJobLocalDir(), in JobConf to get job.local.dir. Added javadoc.
Added documentation in mapred_tutorial.xml.

Hadoop QA added a comment - 20/Mar/08 02:45 AM
+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12378226/patch-2116.txt
against trunk revision 619744.

@author +1. The patch does not contain any @author tags.

tests included +1. The patch appears to include 6 new or modified tests.

javadoc +1. The javadoc tool did not generate any warning messages.

javac +1. The applied patch does not generate any new javac compiler warnings.

release audit +1. The applied patch does not generate any new release audit warnings.

findbugs +1. The patch does not introduce any new Findbugs warnings.

core tests +1. The patch passed core unit tests.

contrib tests +1. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2004/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2004/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2004/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2004/console

This message is automatically generated.


Devaraj Das added a comment - 20/Mar/08 11:18 AM
I just committed this. Thanks, Amareshwari!

Hudson added a comment - 20/Mar/08 01:13 PM

Devaraj Das added a comment - 20/Mar/08 03:36 PM
I committed this. Thanks, Amareshwari!

Robert Chansler added a comment - 14/Apr/08 04:30 PM
Noted as incompatible in changes.txt