Hadoop Common / HADOOP-6591

HarFileSystem cannot handle paths with the space character

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.21.0
    • Component/s: fs
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      Since HarFileSystem uses the space character (" ") as a field separator in its index files, it cannot handle paths that contain spaces.
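
      For illustration, a minimal sketch of the failure mode; the index-line layout below (a path followed by space-separated metadata fields) is hypothetical, not the actual HarFileSystem index format:

        // Hypothetical space-delimited index lines: <path> <type> <offset> <length>
        String goodLine = "/user/tsz/aaaa file 0 1024";
        String badLine  = "/user/tsz/aaaa bbb file 0 1024";

        // Splitting on spaces recovers the first path correctly...
        String path = goodLine.split(" ")[0];      // "/user/tsz/aaaa"

        // ...but truncates the second: "aaaa bbb" parses as "aaaa",
        // and "bbb" is misread as the next metadata field.
        String truncated = badLine.split(" ")[0];  // "/user/tsz/aaaa"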

      Attachments

      1. HADOOP-6591.patch (3 kB, Rodrigo Schmidt)

          Activity

          Rodrigo Schmidt added a comment -

          I have a simple patch for this JIRA using URL encoding and decoding of filenames, but it also involves modifying HadoopArchives.java, which is in MAPREDUCE.

          The modifications to HarFileSystem and TestHarFileSystem will fail the unit tests if not used with the new version of HadoopArchives.

          How can I submit such a patch?
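
          For context, a minimal sketch of the URL-encoding idea using the standard JDK classes; the method names here are illustrative, not the actual patch:

            import java.io.UnsupportedEncodingException;
            import java.net.URLDecoder;
            import java.net.URLEncoder;

            // Encode on write so no raw space ever reaches an index line;
            // decode on read to recover the original file name.
            static String encodeFileName(String name)
                throws UnsupportedEncodingException {
              return URLEncoder.encode(name, "UTF-8");    // "aaaa bbb" -> "aaaa+bbb"
            }

            static String decodeFileName(String encoded)
                throws UnsupportedEncodingException {
              return URLDecoder.decode(encoded, "UTF-8"); // "aaaa+bbb" -> "aaaa bbb"
            }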

          Tsz Wo Nicholas Sze added a comment -

          > ... using URL encoding and decoding of filenames ...
          Then, is it an incompatible change? Does it work with the previously created har files?

          Rodrigo Schmidt added a comment -

          Yes, it's incompatible. I changed the protocol version, as Mahadev suggested in his comment on MAPREDUCE-1548.

          Mahadev konar added a comment -

          Rodrigo,
          I think what Nicholas meant was: how would you keep it backwards compatible? Would you be supporting both versions?

          Rodrigo Schmidt added a comment -

          I see...

          Since we are just starting to use hadoop archives, I didn't care to make my patch backwards compatible. Does it have to be backwards compatible to make it into the Hadoop trunk?

          I could easily adapt my patch so that HarFileSystem could read the version 1 protocol, but that doesn't solve the unit test problem. The new unit tests have files/directories with spaces in their names, and HadoopArchives.java currently can't handle these cases.

          Rodrigo Schmidt added a comment -

          The other problem I see is that HadoopArchives uses HarFileSystem.VERSION when it creates files. So, if we commit HarFileSystem.java first, HadoopArchives will generate old-style files with VERSION = 2. And if we commit HadoopArchives.java first, new-style files will be generated with VERSION = 1 (although the old HadoopArchives won't be able to read them properly). It's a chicken-and-egg problem created because two Java files that depend on each other ended up in two different Hadoop projects (common and mapreduce).

          Honestly, I don't understand why HarFileSystem is part of Common.

          For instance, in Raid, we made DistributedRaidFileSystem part of MAPREDUCE as well, partly to avoid this type of problem.

          Tsz Wo Nicholas Sze added a comment -

          Tested current implementation:

          • Suppose we have a file "k20d3f3/aaaa bbb". The archive job failed with FileNotFoundException.
          • If another file k20d3f3/aaaa exists, then the archive job succeeded, but the resulting .har is corrupted and "fs -ls" failed.
            -bash-3.1$ hadoop fs -ls  har://hdfs-namenode:8020/user/tsz/k20d3f3.2.har/k20d3f3
            ls: could not get get listing for 'har://hdfs-namenode:8020/user/tsz/k20d3f3.2.har/k20d3f3' :
             File: har://hdfs-namenode:8020/user/tsz/k20d3f3.2.har/k20d3f3/bbb does not exist in har://hdfs-namenode:8020/user/tsz/k20d3f3.2.har
            
          • If file k20d3f3/bbb also exists, the archive job succeeded. "fs -ls" also succeeded, but with duplicate entries.
            -bash-3.1$ hadoop fs -ls  har://hdfs-namenode:8020/user/tsz/k20d3f3.3.har/k20d3f3
            Found 69 items
            -rw-------   5 tsz users          0 2010-03-06 00:29 /user/tsz/k20d3f3.3.har/k20d3f3/aaaa
            -rw-------   5 tsz users          0 2010-03-06 00:29 /user/tsz/k20d3f3.3.har/k20d3f3/aaaa
            -rw-------   5 tsz users          0 2010-03-06 00:29 /user/tsz/k20d3f3.3.har/k20d3f3/bbb
            -rw-------   5 tsz users          0 2010-03-06 00:29 /user/tsz/k20d3f3.3.har/k20d3f3/bbb
            ...
            
          Rodrigo Schmidt added a comment -

          Thanks for the examples, Nicholas!

          My patch actually solves all these issues, but I don't know how to submit it, given that some of the files I changed are in Hadoop Common and the others are in MAPREDUCE. Any recommendations?

          Rodrigo Schmidt added a comment -

          Submitting my patch without unit tests or changes to HadoopArchives.java, since those belong in the MAPREDUCE project.

          I made it backwards-compatible though.

          I welcome any comments.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12438075/HADOOP-6591.patch
          against trunk revision 918880.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h1.grid.sp2.yahoo.net/33/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h1.grid.sp2.yahoo.net/33/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h1.grid.sp2.yahoo.net/33/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h1.grid.sp2.yahoo.net/33/console

          This message is automatically generated.

          Rodrigo Schmidt added a comment -

          There is no way I can add a contrib test to this patch if the HAR creation is performed by a new version of HadoopArchives.java that must be committed to MAPREDUCE. This is another instance of the chicken-and-egg dilemma.

          IMO, the best solution is to commit this patch, which is harmless to previous versions of Hadoop Archives, and then submit my new patch to HadoopArchives.java, with new unit tests.

          Rodrigo Schmidt added a comment -

          Nicholas, Mahadev,

          What do you think of this patch? Is it "committable"?

          Thanks,
          Rodrigo

          Tsz Wo Nicholas Sze added a comment -

          Hi Rodrigo, the patch is not backward compatible since it always decodes the path stored in the index files. If there are special character sequences in the existing version 1 har, they will be incorrectly decoded.
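
          For example (illustrative, using a hypothetical file name): a version-1 entry whose literal name happens to contain a percent sequence would be corrupted by unconditional decoding:

            // A version-1 name stored verbatim in the index:
            String stored  = "report%20final.txt";               // literal '%20' in the name
            String decoded = URLDecoder.decode(stored, "UTF-8"); // "report final.txt"
            // The decoded name no longer matches the archived file.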

          Tsz Wo Nicholas Sze added a comment -

          BTW, the version 1 archive code should handle spaces in a better way rather than silently creating corrupted har files. So I filed MAPREDUCE-1579.

          Rodrigo Schmidt added a comment -

          Hi Nicholas,

          Thanks for reviewing the patch!

          I don't understand the incompatibility problem you mentioned. On decodeFileName, I check version (lowercase, which is the version of the archive generated, not the protocol version) and return the original filename if it's not 2 (which means it's 1 since the protocol rejects higher versions). That should make this patch backward compatible.

          I think you might have confused version (lowercase - version of the archive being read) with VERSION (uppercase - version of the protocol). Let me know if that's indeed the case or I missed something else.

          -Rodrigo
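
          For reference, a simplified sketch of the guard being described, assuming a version field parsed from the archive's index (not the exact patch code):

            // 'version' is the version of the archive being read; only
            // version-2 archives store URL-encoded names.
            private String decodeFileName(String fname)
                throws UnsupportedEncodingException {
              if (version == 2) {
                return URLDecoder.decode(fname, "UTF-8");
              }
              return fname; // version 1: names stored verbatim, no decoding
            }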

          Tsz Wo Nicholas Sze added a comment -

          > ... On decodeFileName, I check version (lowercase, which is the version of the archive generated, not the protocol version) and return the original filename if it's not 2 (which means it's 1 since the protocol rejects higher versions). That should make this patch backward compatible.

          Yes, it should work fine. Thanks, Rodrigo.

          Mahadev konar added a comment -
          -  private static class HarStatus {
          +  private class HarStatus {
          

          Rodrigo, is there a reason to make it non-static?

          Rodrigo Schmidt added a comment -

          Mahadev, the reason is to use the version property of the HarFileSystem class. Without that, the code would have looked more cumbersome. Making it non-static creates a link between HarStatus and the HarFileSystem object that uses it. I see no problem with that, since it's a private class anyway.
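
          A toy example of that Java design point (not the real classes): a non-static inner class holds an implicit reference to its enclosing instance, so it can read the outer object's fields directly:

            class Enclosing {
              private int version = 2;

              // Non-static: each Inner is bound to an Enclosing instance,
              // so it can read 'version' without it being passed in.
              private class Inner {
                boolean usesEncodedNames() {
                  return version == 2;   // reads Enclosing.this.version
                }
              }

              // A static nested class would have no enclosing instance; the
              // same method would need an Enclosing reference passed to it.
            }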

          Rodrigo Schmidt added a comment -

          I've created MAPREDUCE-1585 to cope with the creation of version 2 archives. I'm uploading the rest of my current patch there, but MAPREDUCE-1585 really depends on this JIRA in order not to break the unit tests.

          Mahadev konar added a comment -

          +1, the patch looks good, Rodrigo.

          Rodrigo Schmidt added a comment -

          Mahadev, thanks for reviewing it. The tests will be added with the new patch I'll submit to MAPREDUCE-1585.

          dhruba borthakur added a comment -

          I just committed this. Thanks Rodrigo!

          Hudson added a comment -

          Integrated in Hadoop-Common-trunk-Commit #200 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Common-trunk-Commit/200/)
          HADOOP-6591. HarFileSystem can handle paths with the whitespace characters.
          (Rodrigo Schmidt via dhruba)

          Hudson added a comment -

          Integrated in Hadoop-Common-trunk #278 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Common-trunk/278/)
          HADOOP-6591. HarFileSystem can handle paths with the whitespace characters.
          (Rodrigo Schmidt via dhruba)

          Rodrigo Schmidt added a comment -

          The unit tests will be added in MAPREDUCE-1585.


            People

            • Assignee: Rodrigo Schmidt
            • Reporter: Tsz Wo Nicholas Sze
            • Votes: 0
            • Watchers: 6
