Hadoop Map/Reduce
MAPREDUCE-1288

DistributedCache localizes only once per cache URI

    Details

      Description

      As part of file localization, the distributed cache localizer creates a copy of the file in the corresponding user's private directory. The localization in DistributedCache uses the URI of the cache file as the key, and if that key already exists in the map, the localization is not done again. This means that another user cannot access the same distributed cache file. We should change the key to include the username so that localization is done for every user.
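      A minimal sketch of the proposed keying change, assuming a simplified cache map (the class, field, and method names below are illustrative, not the actual TaskTracker or DistributedCache code):

        // Illustrative sketch only: key the localized-cache map on (user, URI)
        // instead of URI alone, so each user gets an independent localization.
        import java.net.URI;
        import java.util.HashMap;
        import java.util.Map;

        class CacheKeySketch {
          // Maps a cache key to the local path of the localized copy.
          private final Map<String, String> localizedFiles = new HashMap<String, String>();

          // Current behavior: the key is just the URI, shared across all users.
          String oldKey(URI cacheUri) {
            return cacheUri.toString();
          }

          // Proposed behavior: the key also carries the submitting user.
          String newKey(URI cacheUri, String user) {
            return user + "^" + cacheUri.toString();
          }

          boolean alreadyLocalized(URI cacheUri, String user) {
            return localizedFiles.containsKey(newKey(cacheUri, user));
          }
        }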

      1. MR-1288-trunk-1.patch
        13 kB
        Devaraj Das
      2. MR-1288-bp20-3.patch
        9 kB
        Devaraj Das
      3. MR-1288-bp20-2.patch
        7 kB
        Devaraj Das
      4. MR-1288-bp20-1.patch
        7 kB
        Devaraj Das

        Activity

        Arun C Murthy added a comment -

        I'm assuming this is done only for 'private' cache files? i.e. public cache files should probably use the 'username' of the TT itself?

        Devaraj Das added a comment -

        This issue is talking about how localization happens in 0.21/trunk. There are no "public" or "private" cache files currently; that is getting introduced in MAPREDUCE-744.

        Vinod Kumar Vavilapalli added a comment -

        +1 for including the username as part of the key as well.

        Hemanth Yamijala added a comment -

        I suppose one could argue that if two different users can access the same set of files on the DFS for localization, they are 'public'. But then, you could theoretically construct a use case where there's 'group' access for some files on DFS and these are localized per user on the task tracker. Is that what we're trying to address?

        Devaraj Das added a comment -

        Look at this scenario: the URI is hdfs://<host>:<port>/foo/bar/file.txt. Even if the entire path were accessible to everyone, the TaskTracker would localize it exactly once, and in a user's private directory. A second job wishing to access the same file wouldn't be able to do so since the TT wouldn't localize it again. Does that make sense?

        Hemanth Yamijala added a comment -

        Even if the entire path were accessible to everyone,

        If the entire path were accessible to everyone on DFS, there's really no great security for that file. I was just trying to point out that such a case may not even be valid in the context of how MAPREDUCE-856 was approached (i.e. we wanted to secure localized files for users). But I concur that one could theoretically construct a case where the URI is accessible to a group of users on DFS, and since there's no way to securely localize that per group on the TT, this bug is still valid.

        Devaraj Das added a comment -

        All I am saying is that irrespective of the file being public or not, in the current codebase, we localize the file exactly once per TaskTracker. On a given TaskTracker, users cannot share the same HDFS file as a distributed cache file.
        What I thought earlier was that the same file would be localized twice in such a case (in their respective private directories).

        Hemanth Yamijala added a comment -

        Just to be clear, I am not disagreeing at all that there's a bug, or with the assessment that this is a blocker. +1 on both. :)

        Allen Wittenauer added a comment -

        What happens in the case that the archive file changes in flight? For example, I submit a job using that archive. While my job is running, I notice a bug, remove the old cache file, push a new one to HDFS, and then launch a new invocation of my job. Would the new job get the old cache file because the old job is still running?

        Devaraj Das added a comment -

        If I am reading the code right, the tasks of the new job would fail on those nodes that localized the old archive and still have a copy of it (the TaskTracker would detect that the archive has changed and, assuming the change happened while the new job was running, would fail the tasks). This will continue until the archive is purged from the cache and re-localized.

        Hemanth Yamijala added a comment -

        I am trying to revisit if this bug continues to be a blocker, in light of the decision we have taken to rebase trunk as the next new release.

        For the default configuration (DefaultTaskController), even though the files are localized only once irrespective of the number of users accessing them, all the files are localized as the TT user and can be accessed by the TT user, so I suppose this is not a problem. For the LinuxTaskController, it seems like a problem. However, since we are rebasing from trunk, maybe the feature of public distributed cache files in MAPREDUCE-744 will cover cases that are otherwise not covered due to this bug. To explain more: either users want to share files, or they don't. If they do, they can use the public distributed cache feature, which works fine. If they don't want to share, this bug becomes a non-issue.

        Based on this, I am tending to think this is not a blocker. Thoughts?

        Allen Wittenauer added a comment -

        I think the 'in flight' archive is still a problem without this fixed. Correct?

        Devaraj Das added a comment -

        In my earlier analysis, I missed a code path. What I said is not true.

        Hemanth Yamijala added a comment -

        What happens in the case that the archive file changes in flight? For example, I submit a job using that archive. While my job is running, I notice a bug, remove the old cache file, push a new one to HDFS, and then launch a new invocation of my job. Would the new job get the old cache file because the old job is still running?

        Allen, the key that identifies a cache file on a TaskTracker node is a combination of the URL and the DFS timestamp that is determined when the job is submitted. Hence, the new job would get a new key and be localized afresh. This is irrespective of whether the old file was ever localized on the same node or not. I am assuming here that a file upload to the same URL on DFS would modify the timestamp.

        Further, when this happens, new tasks of the old job that are running on nodes where the localization of the invalid file has already happened, will fail because the localization process for the new tasks will detect the file has changed in-flight.

        Hope this is correct.
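        A rough illustration of the keying described above, assuming the timestamp is the file's DFS modification time recorded at job submission (the class and method names are made up):

          // Illustrative only: the per-node cache key combines the URI with the
          // DFS modification time captured at job-submission time.
          import java.net.URI;

          class TimestampedKeySketch {
            // A re-uploaded file gets a new modification time, hence a new key,
            // so it is localized afresh for the new job.
            static String cacheKey(URI cacheUri, long dfsModificationTime) {
              return cacheUri.toString() + "#" + dfsModificationTime;
            }
          }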

        Allen Wittenauer added a comment -

        That sounds like really bad behavior.

        Why should an old job fail because of what is, essentially, an external event?

        This still sounds like a blocker to me.

        Hemanth Yamijala added a comment -

        Why should an old job fail because of what is, essentially, an external event?

        The job failing is unlikely, right? Please note that I said tasks fail. I hope someone can clarify (given the two statuses we have, i.e. tasks failed vs. killed) whether this condition can lead Hadoop to abort the job after a sufficient number of failures. Even if it does, at least one task's attempts would have to get scheduled on 4 such nodes and fail on all four. I am thinking this is unlikely. But let's hope someone (alias Amarsri :) ) can clarify this.

        Amareshwari Sriramadasu added a comment -

        Why should an old job fail because of what is, essentially, an external event?

        The current behavior is that the job will fail if a distributed cache file (in use) gets modified on the DFS. Failing the task even when it is localized on a new tracker is being done through MAPREDUCE-1225.

        Allen, what should be the behavior for the following use case?
        For the case "some tasks have downloaded a version-0 file and ran successfully; some other tasks cannot find version-0 and can find only the version-1 file, so they use the version-1 file and run successfully", the final job output would be undefined, right? The output generated from the tasks that used the version-0 file would be entirely different from the output generated from the tasks that used the version-1 file.
        So I think the current behavior of failing the job, if a file added to the distributed cache gets changed after job submission, is correct.

        Allen Wittenauer added a comment -

        Why would a task from an already running job not be able to find version-0? Why is the task tracker removing content from the cache of a running job? If the content moved/is different, shouldn't the job tracker be able to reschedule tasks onto a task tracker that has a copy? Why can't the task tracker copy the dcache from another task tracker that does have a copy?

        That said, I'm not convinced that in the majority of cases version-0 vs. version-1 is undefined. From what I've seen, most of the time different versions of a dcache are downwardly compatible. As much as folks hate tunables, perhaps that is the answer here: mapred.job.dcache.failonupdate.
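
        If such a knob existed, setting it per job might look like the sketch below; note that mapred.job.dcache.failonupdate is only the name suggested in the comment above, not an existing Hadoop property:

          // Hypothetical only: this property is a proposal, not an existing key.
          import org.apache.hadoop.mapred.JobConf;

          public class TunableSketch {
            public static void main(String[] args) {
              JobConf conf = new JobConf();
              // Opt out of failing tasks when a cached file changes in flight.
              conf.setBoolean("mapred.job.dcache.failonupdate", false);
            }
          }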

        Amareshwari Sriramadasu added a comment -

        Allen, I understand your use case and I agree that it should be tunable from the job. But I think this is not a blocker anymore, because the distributed cache has always had this behavior of failing the job if a cache file gets modified on the fly.

        Amareshwari Sriramadasu added a comment -

        I raised MAPREDUCE-1288 for the discussion on "in flight" archive.

        Amareshwari Sriramadasu added a comment -

        I raised MAPREDUCE-1288 for the discussion on "in flight" archive.

        Sorry, I meant MAPREDUCE-1729.

        Hemanth Yamijala added a comment -

        The original intent of this JIRA is certainly different from what we ended up discussing. So, I request we move discussion about the in-flight change to MAPREDUCE-1729.

        And I ask the original question again: with the availability of public and private distributed cache options, can we assume this issue is not a blocker any more?

        Vinod Kumar Vavilapalli added a comment -

        The only corner case that isn't handled yet is when a file on DFS is originally a private file, gets localized in private dirs on the TT, and then subsequently becomes a public file on DFS itself. Though it is still a possibility, it is a pretty narrow corner case. I am +1 for moving this one out of the blocker queue for 0.21.

        Tom White added a comment -

        Downgrading from blocker, since this seems to be the consensus here.

        Devaraj Das added a comment -

        This bug surfaced on one of the secure Yahoo clusters. This is the scenario:
        1. There is a file "/a/b/c/1.txt" on HDFS which is private (one of the directories in the path leading up to the file does not have EXECUTE permission for OTHERS).
        2. A user "foo" uses this file in his job as a DistributedCache file, and the TTs localize this file in a location owned by user "foo" (since this file is private, it ends up in the protected place).
        3. A second user "bar" also tries to use the same file in his job. Both users belong to the same unix group.
        4. Assume some TT that localized the "/a/b/c/1.txt" file before, while running foo's task, gets a task of bar's job. It concludes the file was already localized since the mapping has an entry for /a/b/c/1.txt (the mapping between cache URIs and CacheStatus objects, maintained by the TT).
        5. The TT doesn't localize this file again. It instead points the tasks to the file that was localized in step (2). Since the directory where the file was localized is not readable by anyone other than "foo", the tasks of "bar"'s job fail.

        I guess this issue didn't arise earlier (pre-security) since distributed cache files, even if they were private, were getting localized in directories that were readable by all users.

        Attaching a patch for Y20S that addresses the issue.
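
        A simplified sketch of the 'private' condition in step 1, namely that some ancestor directory denies EXECUTE to others, written against the Hadoop FileSystem API (the helper name is made up):

          // Illustrative only: a file is effectively private if any directory on
          // the path from the root down to it denies EXECUTE to "other" users.
          import java.io.IOException;
          import org.apache.hadoop.fs.FileSystem;
          import org.apache.hadoop.fs.Path;
          import org.apache.hadoop.fs.permission.FsAction;

          class PrivateCheckSketch {
            static boolean isEffectivelyPrivate(FileSystem fs, Path file) throws IOException {
              for (Path dir = file.getParent(); dir != null; dir = dir.getParent()) {
                FsAction others = fs.getFileStatus(dir).getPermission().getOtherAction();
                if (!others.implies(FsAction.EXECUTE)) {
                  return true;  // traversal blocked for users outside owner/group
                }
              }
              return false;
            }
          }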

        Devaraj Das added a comment -

        Attaching the correct patch.

        Vinod Kumar Vavilapalli added a comment -

        Devaraj, this corner case is exactly what Hemanth was trying to explain earlier on this ticket, starting with comment #4 above.

        As for the approach, we have two options:
        (1) (this seems to be what the patch is doing) for group-shared files, localize them separately for each user. This is a simple solution, but sacrifices the optimization (may not be too bad?).
        (2) introduce the concept of group sharing of distributed cache files so as to avoid repetitive downloads for group-shared files as well. This may be a complex solution after all.

        Owen O'Malley added a comment -

        It looks good. I'd suggest:
        1. Change DistributedCache.releaseCache to pass in the current user to TrackerDistributedCacheManager.releaseCache rather than creating a new method.
        2. It looks like the constructor for CacheFile can easily throw IOException instead of wrapping it in a RuntimeException.

        Devaraj Das added a comment -

        Patch addressing Owen's comments.

        Devaraj Das added a comment -

        Devaraj, this corner case is exactly what Hemanth was trying to explain earlier on this ticket, starting with comment #4 above.

        Yeah, I realized that. That's the reason I stuck to this JIRA rather than opening a new one.

        As for the approach, we have two options: (1) (this seems to be what the patch is doing) for group-shared files, localize them separately for each user. This is a simple solution, but sacrifices the optimization (may not be too bad?).

        Yes, I am going with this for now. If needed (after we deploy this patch on our clusters and observe), we can look at proposal (2) in your comment.

        Owen O'Malley added a comment -

        (2) introduce the concept of group sharing of distributed cache files so as to avoid repetitive downloads for group-shared files as well. This may be a complex solution after all.

        This would be quite complex to get right. In particular, it is difficult to determine which group should have access. If we want to improve it, I'd suggest that we use hardlinks to give each user access to a single copy of the file. Of course, you need to ensure that they do in fact have read access to the original file. :)
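
        A minimal sketch of the hardlink idea, assuming the file is already localized on the node's local disk (the paths and helper name are made up, and a real implementation would also verify DFS read access first):

          // Illustrative only: share one localized copy with another user by
          // hard-linking it into that user's private cache directory instead of
          // downloading the file from DFS again.
          import java.io.IOException;
          import java.nio.file.Files;
          import java.nio.file.Path;

          class HardLinkSketch {
            static Path linkIntoUserDir(Path localizedCopy, Path userPrivateDir) throws IOException {
              Files.createDirectories(userPrivateDir);
              Path link = userPrivateDir.resolve(localizedCopy.getFileName());
              return Files.createLink(link, localizedCopy);  // both names share one inode
            }
          }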

        Devaraj Das added a comment -

        Patch for trunk.

        Owen O'Malley added a comment -

        Looks good. +1

        Devaraj Das added a comment -

        I just committed this.

        Hudson added a comment -

        Integrated in Hadoop-Mapreduce-trunk-Commit #523 (See https://hudson.apache.org/hudson/job/Hadoop-Mapreduce-trunk-Commit/523/)


          People

          • Assignee: Devaraj Das
          • Reporter: Devaraj Das
          • Votes: 0
          • Watchers: 8
