Hadoop Map/Reduce
  1. Hadoop Map/Reduce
  2. MAPREDUCE-714

JobConf.findContainingJar unescapes unnecessarily on Linux

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.22.0
    • Fix Version/s: 0.22.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      In JobConf.findContainingJar, the path name is decoded using URLDecoder.decode(...). This was done by Doug in r381794 (commit msg "Un-escape containing jar's path, which is URL-encoded. This fixes things primarily on Windows, where paths are likely to contain spaces.") Unfortunately, jar paths do not appear to be URL encoded on Linux. If you try to use "hadoop jar" on a jar with a "+" in it, this function decodes it to a space and then the job cannot be submitted.

        Activity

        Hide
        Todd Lipcon added a comment -

        I just confirmed this on Windows vs Linux. On Windows, the URL that you get back from ClassLoader.getResource has spaces encoded as "%20". On Linux it doesn't.

        Anyone have a creative solution to deal with this? We'd like to have +s in our version numbers due to standards in RPM and Debian land, but this is blocking that.

        Show
        Todd Lipcon added a comment - I just confirmed this on Windows vs Linux. On Windows, the URL that you get back from ClassLoader.getResource has spaces encoded as "%20". On Linux it doesn't. Anyone have a creative solution to deal with this? We'd like to have +s in our version numbers due to standards in RPM and Debian land, but this is blocking that.
        Hide
        Todd Lipcon added a comment -

        Here's a potential fix, which is really klugey. If no one objects to this I'll upload it as a real patch in the next couple of days:

        diff --git a/src/mapred/org/apache/hadoop/mapred/JobConf.java b/src/mapred/org/apache/hadoop/mapred/JobConf.java
        index 11be95a..4794eab 100644
        --- a/src/mapred/org/apache/hadoop/mapred/JobConf.java
        +++ b/src/mapred/org/apache/hadoop/mapred/JobConf.java
        @@ -1452,6 +1452,13 @@ public class JobConf extends Configuration {
                   if (toReturn.startsWith("file:")) {
                     toReturn = toReturn.substring("file:".length());
                   }
        +          // URLDecoder is a misnamed class, since it actually decodes
        +          // x-www-form-urlencoded MIME type rather than actual
        +          // URL encoding (which the file path has). Therefore it would
        +          // decode +s to ' 's which is incorrect (spaces are actually
        +          // either unencoded or encoded as "%20"). Replace +s first, so
        +          // that they are kept sacred during the decoding process.
        +          toReturn = toReturn.replaceAll("\\+", "%2B");
                   toReturn = URLDecoder.decode(toReturn, "UTF-8");
                   return toReturn.replaceAll("!.*$", "");
                 }
        
        Show
        Todd Lipcon added a comment - Here's a potential fix, which is really klugey. If no one objects to this I'll upload it as a real patch in the next couple of days: diff --git a/src/mapred/org/apache/hadoop/mapred/JobConf.java b/src/mapred/org/apache/hadoop/mapred/JobConf.java index 11be95a..4794eab 100644 --- a/src/mapred/org/apache/hadoop/mapred/JobConf.java +++ b/src/mapred/org/apache/hadoop/mapred/JobConf.java @@ -1452,6 +1452,13 @@ public class JobConf extends Configuration { if (toReturn.startsWith( "file:" )) { toReturn = toReturn.substring( "file:" .length()); } + // URLDecoder is a misnamed class, since it actually decodes + // x-www-form-urlencoded MIME type rather than actual + // URL encoding (which the file path has). Therefore it would + // decode +s to ' 's which is incorrect (spaces are actually + // either unencoded or encoded as "%20" ). Replace +s first, so + // that they are kept sacred during the decoding process. + toReturn = toReturn.replaceAll( "\\+" , "%2B" ); toReturn = URLDecoder.decode(toReturn, "UTF-8" ); return toReturn.replaceAll( "!.*$" , ""); }
        Hide
        Owen O'Malley added a comment -

        Are there any other characters that need to be fixed? I'm worried that this is a point solution for one character rather than the right solution.

        Show
        Owen O'Malley added a comment - Are there any other characters that need to be fixed? I'm worried that this is a point solution for one character rather than the right solution.
        Hide
        Todd Lipcon added a comment -

        Owen: that's the only difference I know of that's mentioned in the spec: http://www.w3.org/MarkUp/html-spec/html-spec_8.html#SEC8.2.1

        I spent about 4 hours trying to find a portable non-klugey fix. The trouble is that the behavior is different on Windows compared to Linux. On a Windows JVM, it encodes spaces as "%20" and +s as "%2B" and on Linux it does neither, best I can tell.

        I definitely agree that this fix is not optimal, but I think '+' is the most common case for a "weird" character in a JAR name. In Debian and RPM packages, the non-alphanumeric characters allowed in version numbers are [+-~.:] and I think all of those should be fine after this patch.

        Show
        Todd Lipcon added a comment - Owen: that's the only difference I know of that's mentioned in the spec: http://www.w3.org/MarkUp/html-spec/html-spec_8.html#SEC8.2.1 I spent about 4 hours trying to find a portable non-klugey fix. The trouble is that the behavior is different on Windows compared to Linux. On a Windows JVM, it encodes spaces as "%20" and +s as "%2B" and on Linux it does neither, best I can tell. I definitely agree that this fix is not optimal, but I think '+' is the most common case for a "weird" character in a JAR name. In Debian and RPM packages, the non-alphanumeric characters allowed in version numbers are [+-~.:] and I think all of those should be fine after this patch.
        Hide
        Owen O'Malley added a comment -

        Of course, you could detect the operating system like Path does.

        Look at the definition of Path.WINDOWS. Of course, we probably should make a method that checks that boolean...

        That said, I'm ok with the quoting as long as it works on both operating systems. (I agree, it is kludgy, but...)

        Show
        Owen O'Malley added a comment - Of course, you could detect the operating system like Path does. Look at the definition of Path.WINDOWS. Of course, we probably should make a method that checks that boolean... That said, I'm ok with the quoting as long as it works on both operating systems. (I agree, it is kludgy, but...)
        Hide
        Todd Lipcon added a comment -

        Here's a patch including proper unit test.

        Show
        Todd Lipcon added a comment - Here's a patch including proper unit test.
        Hide
        Todd Lipcon added a comment -

        [exec] -1 overall.
        [exec]
        [exec] +1 @author. The patch does not contain any @author tags.
        [exec]
        [exec] +1 tests included. The patch appears to include 3 new or modified tests.
        [exec]
        [exec] +1 javadoc. The javadoc tool did not generate any warning messages.
        [exec]
        [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
        [exec]
        [exec] -1 findbugs. The patch appears to introduce 13 new Findbugs warnings.
        [exec]
        [exec] -1 release audit. The applied patch generated 2 release audit warnings (more than the trunk's current 1 warnings).
        [exec]
        [exec] +1 system test framework. The patch passed system test framework compile.

        The findbugs warnings are bogus (none related to this patch). The release audit warnings are also unrelated ("smoke-tests" file and "robots.txt" file). See MAPREDUCE-2172.

        Show
        Todd Lipcon added a comment - [exec] -1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 3 new or modified tests. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] -1 findbugs. The patch appears to introduce 13 new Findbugs warnings. [exec] [exec] -1 release audit. The applied patch generated 2 release audit warnings (more than the trunk's current 1 warnings). [exec] [exec] +1 system test framework. The patch passed system test framework compile. The findbugs warnings are bogus (none related to this patch). The release audit warnings are also unrelated ("smoke-tests" file and "robots.txt" file). See MAPREDUCE-2172 .
        Hide
        Eli Collins added a comment -

        +1

        This patch looks good to me. Looks like there may be other potential incorrect uses of URLDecode in the HarFileSystem, minding filling a jira to fix those?

        Show
        Eli Collins added a comment - +1 This patch looks good to me. Looks like there may be other potential incorrect uses of URLDecode in the HarFileSystem, minding filling a jira to fix those?
        Hide
        Todd Lipcon added a comment -

        Looks like there may be other potential incorrect uses of URLDecode in the HarFileSystem, minding filling a jira to fix those?

        Looking at HarFileSystem, it seems like it's actually safe - it's using URLDecoder to decode something that was encoded using URLEncoder in HadoopArchives.java. Using these as a pair is OK:

        groovy:000> URLDecoder.decode(URLEncoder.encode("foo+bar baz 100%"))
        ===> foo+bar baz 100%
        
        Show
        Todd Lipcon added a comment - Looks like there may be other potential incorrect uses of URLDecode in the HarFileSystem, minding filling a jira to fix those? Looking at HarFileSystem, it seems like it's actually safe - it's using URLDecoder to decode something that was encoded using URLEncoder in HadoopArchives.java. Using these as a pair is OK: groovy:000> URLDecoder.decode(URLEncoder.encode( "foo+bar baz 100%" )) ===> foo+bar baz 100%
        Hide
        Todd Lipcon added a comment -

        Committed to branch and trunk, thanks Eli.

        Show
        Todd Lipcon added a comment - Committed to branch and trunk, thanks Eli.
        Hide
        Hudson added a comment -

        Integrated in Hadoop-Mapreduce-trunk-Commit #565 (See https://hudson.apache.org/hudson/job/Hadoop-Mapreduce-trunk-Commit/565/)
        MAPREDUCE-714. JobConf.findContainingJar unescapes unnecessarily on linux. Contributed by Todd Lipcon

        Show
        Hudson added a comment - Integrated in Hadoop-Mapreduce-trunk-Commit #565 (See https://hudson.apache.org/hudson/job/Hadoop-Mapreduce-trunk-Commit/565/ ) MAPREDUCE-714 . JobConf.findContainingJar unescapes unnecessarily on linux. Contributed by Todd Lipcon
        Hide
        Hudson added a comment -

        Integrated in Hadoop-Mapreduce-22-branch #33 (See https://hudson.apache.org/hudson/job/Hadoop-Mapreduce-22-branch/33/)

        Show
        Hudson added a comment - Integrated in Hadoop-Mapreduce-22-branch #33 (See https://hudson.apache.org/hudson/job/Hadoop-Mapreduce-22-branch/33/ )
        Hide
        Hudson added a comment -

        Integrated in Hadoop-Mapreduce-trunk #643 (See https://hudson.apache.org/hudson/job/Hadoop-Mapreduce-trunk/643/)

        Show
        Hudson added a comment - Integrated in Hadoop-Mapreduce-trunk #643 (See https://hudson.apache.org/hudson/job/Hadoop-Mapreduce-trunk/643/ )

          People

          • Assignee:
            Todd Lipcon
            Reporter:
            Todd Lipcon
          • Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development