Uploaded image for project: 'Pig'
  1. Pig
  2. PIG-5290

User Cache upload contention can cause job failures

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.13.0
    • Fix Version/s: 0.18.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      We recently enabled the User Cache (PIG-2672) feature and found that occasionally jobs would fail because of contention when uploading JARs into the cache. Although the cache is designed to be fail-safe, i.e. to fall back to normal behavior if anything goes wrong by catching all IOException, the portion of code which closes the output stream is not wrapped within a try statement and thus an exception during the closing of that stream causes the entire job to fail. If multiple jobs are attempting to upload the same JAR failure simultaneously, the contention can cause this close statement to fail.

      The current strategy also has two other flaws. First, consider the scenario where job A begins uploading jar X. Job B also needs jar X, sees that the file exists, and launches its tasks. Yet, job A has not yet finished uploading jar X (perhaps it is large). So, the tasks are localizing a half-completed version of jar X. Second, the original design allowed for the same JAR (identical contents) to be shared between jobs even if a different name was used. In PIG-3815, however, this ability was removed, and now JARs are only shared if they have the same name.

      I propose we solve all of these issues simultaneously by returning to the listStatus based behavior (used prior to PIG-3815), but filter out entries ending in .tmp. When uploading, upload to randomNumber.tmp, then once the file is completed, do a rename to the original name of the JAR file. This ensures that incomplete files are never in a location that would be accessed by other jobs, and the only write operation accessing a shared path is a single rename operation.

      An alternative design is to use a single canonicalized name for all JAR files (they will still be unique since they are inside of directories based on their SHA1). Upload to a tmp file as previously described, then rename to the canonical name. This removes the need to do a listStatus call; however it will result in classpaths that are human unreadable since the name of the JAR file has been lost. I think it's worth it from a debugging standpoint to go with the first design.

      1. PIG-5290.patch
        5 kB
        Erik Krogen
      2. PIG-5290-1.patch
        5 kB
        Erik Krogen

        Issue Links

          Activity

          Hide
          xkrogen Erik Krogen added a comment - - edited

          Ping Rohini Palaniswamy, Koji Noguchi from previous involvement with the two JIRAs mentioned in the description.

          Show
          xkrogen Erik Krogen added a comment - - edited Ping Rohini Palaniswamy , Koji Noguchi from previous involvement with the two JIRAs mentioned in the description.
          Hide
          rohini Rohini Palaniswamy added a comment -

          Erik Krogen,
          Using randomNumber.tmp is simple and effective. +1 on the idea. Have added you to the contributors list. You can now assign the jira to yourself if you are going to work on the patch.

          Show
          rohini Rohini Palaniswamy added a comment - Erik Krogen , Using randomNumber.tmp is simple and effective. +1 on the idea. Have added you to the contributors list. You can now assign the jira to yourself if you are going to work on the patch.
          Hide
          xkrogen Erik Krogen added a comment -

          Will do, thanks Rohini Palaniswamy!

          Show
          xkrogen Erik Krogen added a comment - Will do, thanks Rohini Palaniswamy !
          Hide
          xkrogen Erik Krogen added a comment - - edited

          Attaching initial patch implementing the ideas I discussed in the JIRA. Uploads at a random file name, then moves it to the original filename. Uses listStatus to be able to share identical JAR files with nonidentical names. Makes an attempt at cleaning up hanging temp files if an issue occurs while uploading / renaming.

          I did not add any unit tests as it was not very clear to me how to do so / what the policy on that would be. Let me know if I should add some and if so, where they would go. I did manually test the patch on our Hadoop 2.6 cluster and it worked as expected, populating the cache when empty and reusing files when present (even at different names).

          Rohini Palaniswamy, please let me know if you have a chance to review. Thanks!

          Show
          xkrogen Erik Krogen added a comment - - edited Attaching initial patch implementing the ideas I discussed in the JIRA. Uploads at a random file name, then moves it to the original filename. Uses listStatus to be able to share identical JAR files with nonidentical names. Makes an attempt at cleaning up hanging temp files if an issue occurs while uploading / renaming. I did not add any unit tests as it was not very clear to me how to do so / what the policy on that would be. Let me know if I should add some and if so, where they would go. I did manually test the patch on our Hadoop 2.6 cluster and it worked as expected, populating the cache when empty and reusing files when present (even at different names). Rohini Palaniswamy , please let me know if you have a chance to review. Thanks!
          Hide
          rohini Rohini Palaniswamy added a comment -

          Few comments:
          1) Can you compile the Pattern and reuse it instead of calling matches every time? That will be more cost effective.
          2) If there is still a collision during writing temp file due to random number collision (we have run into the rare scenario where different jobs had random numbers collide during temp dir creation) or renaming to final file name (more likely than first scenario) you will still end up with an error. First case is really rare and so we can skip for now. But we should handle the rename problem by checking for FileAlreadyExistsException in a separate try/catch block and verify there if the existing file size is same as the temp file and if not thrown an error.
          3) Regarding reverting to the listStatus behavior, I checked PIG-3815-3.patch and before the change looks like it still did check to see if filename matched.
          https://github.com/apache/pig/blob/c2aedcc66486ddc721a32dc4984547f049aa5541/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/JobControlCompiler.java#L1638
          I don't think we should be doing the listStatus check you are doing and pick any non-tmp file in that directory. Might end up with wrong file in case of checksum collisions. Comment 1) would be not necessary if listStatus behavior is reverted.
          4) We generally expect unit tests to be added with every patch. But in this case simulating a race condition is hard, so I would say we can skip and rely on the older tests. But on checking realized that there is no actually no test for this feature. Can you add -Dpig.user.cache.enabled=true to java_params of UdfDistributedCache and MonitoredUDF tests in nightly.conf?

          Show
          rohini Rohini Palaniswamy added a comment - Few comments: 1) Can you compile the Pattern and reuse it instead of calling matches every time? That will be more cost effective. 2) If there is still a collision during writing temp file due to random number collision (we have run into the rare scenario where different jobs had random numbers collide during temp dir creation) or renaming to final file name (more likely than first scenario) you will still end up with an error. First case is really rare and so we can skip for now. But we should handle the rename problem by checking for FileAlreadyExistsException in a separate try/catch block and verify there if the existing file size is same as the temp file and if not thrown an error. 3) Regarding reverting to the listStatus behavior, I checked PIG-3815 -3.patch and before the change looks like it still did check to see if filename matched. https://github.com/apache/pig/blob/c2aedcc66486ddc721a32dc4984547f049aa5541/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/JobControlCompiler.java#L1638 I don't think we should be doing the listStatus check you are doing and pick any non-tmp file in that directory. Might end up with wrong file in case of checksum collisions. Comment 1) would be not necessary if listStatus behavior is reverted. 4) We generally expect unit tests to be added with every patch. But in this case simulating a race condition is hard, so I would say we can skip and rely on the older tests. But on checking realized that there is no actually no test for this feature. Can you add -Dpig.user.cache.enabled=true to java_params of UdfDistributedCache and MonitoredUDF tests in nightly.conf?
          Hide
          xkrogen Erik Krogen added a comment -

          Thank you for the review Rohini Palaniswamy!

          I see you are right about (3); I missed the line you linked. I can keep it at the old behavior.

          For (2), I agree that there is potential for collision during writing temp file, but figured that since it will fall back to default behavior if it had an error while uploading, this is not too much of an issue (just an extra JAR upload rather than using cache). I don't understand the "renaming to final file name" issue. Renames on HDFS overwrite, so if two users upload the same simultaneously, one of them will simply win (whoever comes second). Let me know if I am misunderstanding.

          Sounds good on (4), thanks for the pointer.

          Attaching v1 patch addressing your comments in (3), (4), comment in (1) is no longer relevant. Will wait on your response about (2).

          Show
          xkrogen Erik Krogen added a comment - Thank you for the review Rohini Palaniswamy ! I see you are right about (3); I missed the line you linked. I can keep it at the old behavior. For (2), I agree that there is potential for collision during writing temp file, but figured that since it will fall back to default behavior if it had an error while uploading, this is not too much of an issue (just an extra JAR upload rather than using cache). I don't understand the "renaming to final file name" issue. Renames on HDFS overwrite, so if two users upload the same simultaneously, one of them will simply win (whoever comes second). Let me know if I am misunderstanding. Sounds good on (4), thanks for the pointer. Attaching v1 patch addressing your comments in (3), (4), comment in (1) is no longer relevant. Will wait on your response about (2).
          Hide
          rohini Rohini Palaniswamy added a comment -

          +1. Looks good. Will run the e2e tests and commit later in the day.

          Renames on HDFS overwrite, so if two users upload the same simultaneously, one of them will simply win (whoever comes second). Let me know if I am misunderstanding.

          It was a misunderstanding on my part. I could not exactly remember whether rename wins on overwrite or fails with FileAlreadyExistsException. To confirm, I ended up looking at the javadoc of new rename API which said rename fails if overwrite option is not passed and got mislead by that.
          https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileSystem.java#L1417-L1451
          Verified that the old API (without the options) behavior is that last one wins and there are no failures on rename if file exists.

          Show
          rohini Rohini Palaniswamy added a comment - +1. Looks good. Will run the e2e tests and commit later in the day. Renames on HDFS overwrite, so if two users upload the same simultaneously, one of them will simply win (whoever comes second). Let me know if I am misunderstanding. It was a misunderstanding on my part. I could not exactly remember whether rename wins on overwrite or fails with FileAlreadyExistsException. To confirm, I ended up looking at the javadoc of new rename API which said rename fails if overwrite option is not passed and got mislead by that. https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileSystem.java#L1417-L1451 Verified that the old API (without the options) behavior is that last one wins and there are no failures on rename if file exists.
          Hide
          rohini Rohini Palaniswamy added a comment - - edited

          +1. e2e on both MR and Tez good and verified that files are being used from cache.

          Committed to trunk. Thanks for fixing this Erik Krogen.

          Show
          rohini Rohini Palaniswamy added a comment - - edited +1. e2e on both MR and Tez good and verified that files are being used from cache. Committed to trunk. Thanks for fixing this Erik Krogen .
          Hide
          xkrogen Erik Krogen added a comment -

          Fantastic, thank you Rohini!

          Show
          xkrogen Erik Krogen added a comment - Fantastic, thank you Rohini!

            People

            • Assignee:
              xkrogen Erik Krogen
              Reporter:
              xkrogen Erik Krogen
            • Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development