Hadoop Map/Reduce
  1. Hadoop Map/Reduce
  2. MAPREDUCE-1579

archive: check and possibly replace the space charater in paths

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Blocker Blocker
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.21.0
    • Component/s: harchive
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      Since the space character is used as a separator in the index files, it won't work if there are spaces in the path (see also HADOOP-6591). The archive tools should

      1. detect if there are spaces in the paths and
      2. provide an option to replace it with some other characters.
      1. m1579_20100310.patch
        15 kB
        Tsz Wo Nicholas Sze
      2. m1579_20100310b.patch
        13 kB
        Tsz Wo Nicholas Sze
      3. m1579_20100311_y0.20.patch
        23 kB
        Tsz Wo Nicholas Sze
      4. m1579_20100311.patch
        23 kB
        Tsz Wo Nicholas Sze

        Issue Links

          Activity

          Tom White made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          Tom White made changes -
          Fix Version/s 0.21.0 [ 12314045 ]
          Fix Version/s 0.22.0 [ 12314184 ]
          Tsz Wo Nicholas Sze made changes -
          Attachment m1579_20100311_y0.20.patch [ 12438859 ]
          Hide
          Tsz Wo Nicholas Sze added a comment -

          m1579_20100311_y0.20.patch: for y0.20

          Show
          Tsz Wo Nicholas Sze added a comment - m1579_20100311_y0.20.patch: for y0.20
          Hide
          Hudson added a comment -

          Integrated in Hadoop-Mapreduce-trunk #258 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Mapreduce-trunk/258/)

          Show
          Hudson added a comment - Integrated in Hadoop-Mapreduce-trunk #258 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Mapreduce-trunk/258/ )
          Hide
          Hudson added a comment -

          Integrated in Hadoop-Mapreduce-trunk-Commit #276 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Mapreduce-trunk-Commit/276/)
          Move from 0.21 to trunk in CHANGES.txt.

          Show
          Hudson added a comment - Integrated in Hadoop-Mapreduce-trunk-Commit #276 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Mapreduce-trunk-Commit/276/ ) Move from 0.21 to trunk in CHANGES.txt.
          Tsz Wo Nicholas Sze made changes -
          Status Patch Available [ 10002 ] Resolved [ 5 ]
          Hadoop Flags [Reviewed]
          Assignee Mahadev konar [ mahadev ] Tsz Wo (Nicholas), SZE [ szetszwo ]
          Fix Version/s 0.21.0 [ 12314045 ]
          Fix Version/s 0.20.3 [ 12314813 ]
          Resolution Fixed [ 1 ]
          Hide
          Tsz Wo Nicholas Sze added a comment -

          I have committed this.

          Show
          Tsz Wo Nicholas Sze added a comment - I have committed this.
          Hide
          Hudson added a comment -

          Integrated in Hadoop-Mapreduce-trunk-Commit #275 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Mapreduce-trunk-Commit/275/)
          . archive: check and possibly replace the space charater in source paths.

          Show
          Hudson added a comment - Integrated in Hadoop-Mapreduce-trunk-Commit #275 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Mapreduce-trunk-Commit/275/ ) . archive: check and possibly replace the space charater in source paths.
          Tsz Wo Nicholas Sze made changes -
          Fix Version/s 0.20.3 [ 12314813 ]
          Fix Version/s 0.21.0 [ 12314045 ]
          Fix Version/s 0.22.0 [ 12314184 ]
          Priority Major [ 3 ] Blocker [ 1 ]
          Hide
          Rodrigo Schmidt added a comment -

          +1

          Everything looks good to me.

          Show
          Rodrigo Schmidt added a comment - +1 Everything looks good to me.
          Hide
          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12438566/m1579_20100311.patch
          against trunk revision 922047.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/524/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/524/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/524/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/524/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12438566/m1579_20100311.patch against trunk revision 922047. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/524/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/524/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/524/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/524/console This message is automatically generated.
          Hide
          Tsz Wo Nicholas Sze added a comment -

          > One minor thing: Do you really want to create a new test source, instead of using the existing TestHarFileSystem.java. Wouldn't it be better to merge all Har-related tests on the same .java file?

          I actually began with add the new tests to TestHarFileSystem. However, there are subtle difference in the new tests compared with the existing tests. So I created a new file.

          Show
          Tsz Wo Nicholas Sze added a comment - > One minor thing: Do you really want to create a new test source, instead of using the existing TestHarFileSystem.java. Wouldn't it be better to merge all Har-related tests on the same .java file? I actually began with add the new tests to TestHarFileSystem. However, there are subtle difference in the new tests compared with the existing tests. So I created a new file.
          Hide
          Rodrigo Schmidt added a comment -

          Your patch looks good, Nicholas.

          One minor thing: Do you really want to create a new test source, instead of using the existing TestHarFileSystem.java. Wouldn't it be better to merge all Har-related tests on the same .java file?

          Show
          Rodrigo Schmidt added a comment - Your patch looks good, Nicholas. One minor thing: Do you really want to create a new test source, instead of using the existing TestHarFileSystem.java. Wouldn't it be better to merge all Har-related tests on the same .java file?
          Tsz Wo Nicholas Sze made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Tsz Wo Nicholas Sze made changes -
          Attachment m1579_20100311.patch [ 12438566 ]
          Hide
          Tsz Wo Nicholas Sze added a comment -

          m1579_20100311.patch: added unit tests.

          Show
          Tsz Wo Nicholas Sze added a comment - m1579_20100311.patch: added unit tests.
          Hide
          Tsz Wo Nicholas Sze added a comment -

          Yes, my previous post is actually copied from the output of a new unit test. I will add some more tests before posting the new patch.

          Show
          Tsz Wo Nicholas Sze added a comment - Yes, my previous post is actually copied from the output of a new unit test. I will add some more tests before posting the new patch.
          Hide
          Rodrigo Schmidt added a comment -

          Tests look good, Nicholas! Do you intend to add unit tests to your patch?

          Show
          Rodrigo Schmidt added a comment - Tests look good, Nicholas! Do you intend to add unit tests to your patch?
          Hide
          Tsz Wo Nicholas Sze added a comment -

          Tested the space replacement option:

          • Original paths
            -rw-r--r--   2 tsz supergroup          1 2010-03-11 18:43 /user/tsz/test/a
            -rw-r--r--   2 tsz supergroup          1 2010-03-11 18:43 /user/tsz/test/b
            -rw-r--r--   2 tsz supergroup          1 2010-03-11 18:43 /user/tsz/test/c
            -rw-r--r--   2 tsz supergroup          3 2010-03-11 18:43 /user/tsz/test/c c
            drwxr-xr-x   - tsz supergroup          0 2010-03-11 18:43 /user/tsz/test/sub 1
            -rw-r--r--   2 tsz supergroup          4 2010-03-11 18:43 /user/tsz/test/sub 1/file
            -rw-r--r--   2 tsz supergroup         10 2010-03-11 18:43 /user/tsz/test/sub 1/file x y z
            -rw-r--r--   2 tsz supergroup          1 2010-03-11 18:43 /user/tsz/test/sub 1/x
            -rw-r--r--   2 tsz supergroup          1 2010-03-11 18:43 /user/tsz/test/sub 1/y
            -rw-r--r--   2 tsz supergroup          1 2010-03-11 18:43 /user/tsz/test/sub 1/z
            drwxr-xr-x   - tsz supergroup          0 2010-03-11 18:43 /user/tsz/test/sub 2
            
          • Replaced paths
            drw-r--r--   - tsz supergroup          0 2010-03-11 18:43 /user/tsz/tmp/bar.har/test
            -rw-r--r--   5 tsz supergroup          1 2010-03-11 18:43 /user/tsz/tmp/bar.har/test/a
            -rw-r--r--   5 tsz supergroup          1 2010-03-11 18:43 /user/tsz/tmp/bar.har/test/b
            -rw-r--r--   5 tsz supergroup          1 2010-03-11 18:43 /user/tsz/tmp/bar.har/test/c
            drw-r--r--   - tsz supergroup          0 2010-03-11 18:43 /user/tsz/tmp/bar.har/test/sub_1
            -rw-r--r--   5 tsz supergroup         10 2010-03-11 18:43 /user/tsz/tmp/bar.har/test/sub_1/file_x_y_z
            -rw-r--r--   5 tsz supergroup          1 2010-03-11 18:43 /user/tsz/tmp/bar.har/test/sub_1/x
            -rw-r--r--   5 tsz supergroup          1 2010-03-11 18:43 /user/tsz/tmp/bar.har/test/sub_1/y
            -rw-r--r--   5 tsz supergroup          1 2010-03-11 18:43 /user/tsz/tmp/bar.har/test/sub_1/z
            -rw-r--r--   5 tsz supergroup          4 2010-03-11 18:43 /user/tsz/tmp/bar.har/test/sub_1/file
            drw-r--r--   - tsz supergroup          0 2010-03-11 18:43 /user/tsz/tmp/bar.har/test/sub_2
            -rw-r--r--   5 tsz supergroup          3 2010-03-11 18:43 /user/tsz/tmp/bar.har/test/c_c
            
          Show
          Tsz Wo Nicholas Sze added a comment - Tested the space replacement option: Original paths -rw-r--r-- 2 tsz supergroup 1 2010-03-11 18:43 /user/tsz/test/a -rw-r--r-- 2 tsz supergroup 1 2010-03-11 18:43 /user/tsz/test/b -rw-r--r-- 2 tsz supergroup 1 2010-03-11 18:43 /user/tsz/test/c -rw-r--r-- 2 tsz supergroup 3 2010-03-11 18:43 /user/tsz/test/c c drwxr-xr-x - tsz supergroup 0 2010-03-11 18:43 /user/tsz/test/sub 1 -rw-r--r-- 2 tsz supergroup 4 2010-03-11 18:43 /user/tsz/test/sub 1/file -rw-r--r-- 2 tsz supergroup 10 2010-03-11 18:43 /user/tsz/test/sub 1/file x y z -rw-r--r-- 2 tsz supergroup 1 2010-03-11 18:43 /user/tsz/test/sub 1/x -rw-r--r-- 2 tsz supergroup 1 2010-03-11 18:43 /user/tsz/test/sub 1/y -rw-r--r-- 2 tsz supergroup 1 2010-03-11 18:43 /user/tsz/test/sub 1/z drwxr-xr-x - tsz supergroup 0 2010-03-11 18:43 /user/tsz/test/sub 2 Replaced paths drw-r--r-- - tsz supergroup 0 2010-03-11 18:43 /user/tsz/tmp/bar.har/test -rw-r--r-- 5 tsz supergroup 1 2010-03-11 18:43 /user/tsz/tmp/bar.har/test/a -rw-r--r-- 5 tsz supergroup 1 2010-03-11 18:43 /user/tsz/tmp/bar.har/test/b -rw-r--r-- 5 tsz supergroup 1 2010-03-11 18:43 /user/tsz/tmp/bar.har/test/c drw-r--r-- - tsz supergroup 0 2010-03-11 18:43 /user/tsz/tmp/bar.har/test/sub_1 -rw-r--r-- 5 tsz supergroup 10 2010-03-11 18:43 /user/tsz/tmp/bar.har/test/sub_1/file_x_y_z -rw-r--r-- 5 tsz supergroup 1 2010-03-11 18:43 /user/tsz/tmp/bar.har/test/sub_1/x -rw-r--r-- 5 tsz supergroup 1 2010-03-11 18:43 /user/tsz/tmp/bar.har/test/sub_1/y -rw-r--r-- 5 tsz supergroup 1 2010-03-11 18:43 /user/tsz/tmp/bar.har/test/sub_1/z -rw-r--r-- 5 tsz supergroup 4 2010-03-11 18:43 /user/tsz/tmp/bar.har/test/sub_1/file drw-r--r-- - tsz supergroup 0 2010-03-11 18:43 /user/tsz/tmp/bar.har/test/sub_2 -rw-r--r-- 5 tsz supergroup 3 2010-03-11 18:43 /user/tsz/tmp/bar.har/test/c_c
          Hide
          Rodrigo Schmidt added a comment -

          Looks great to me! A couple of unit tests would be good.

          Show
          Rodrigo Schmidt added a comment - Looks great to me! A couple of unit tests would be good.
          Hide
          Tsz Wo Nicholas Sze added a comment -

          Need some tests ...

          Show
          Tsz Wo Nicholas Sze added a comment - Need some tests ...
          Tsz Wo Nicholas Sze made changes -
          Attachment m1579_20100310b.patch [ 12438457 ]
          Hide
          Tsz Wo Nicholas Sze added a comment -

          m1579_20100310b.patch:

          • removed the changes from MAPREDUCE-1578
          • removed this first LOG.info() calls
          • checkSpace() right before writing to the SequenceFile.
          Show
          Tsz Wo Nicholas Sze added a comment - m1579_20100310b.patch: removed the changes from MAPREDUCE-1578 removed this first LOG.info() calls checkSpace() right before writing to the SequenceFile.
          Hide
          Rodrigo Schmidt added a comment -

          Yes, that's exactly what I mean. The code looks much better with HarEntry.

          Another thing I just noticed in your path is that you are calling checkSpace inside relPathToRoot() and outside it on writeTopLevelDirs().

          I think it is better to check for spaces outside relPathToRoot() since checking for spaces is something that we would like to do in the beginning of the execution only (and relPathToRoot() is an auxiliary function we might want to use several times throughout the execution).

          If you agree with that, you can just remove checkSpace from inside relPathToRoot() and place it also after the relPathToRoot() call on method archive().

          If you prefer to leave the checkSpace() call inside relPathToRoot(), then you can probably remove the checkSpace(relPath) call on writeTopLevelDirs().

          Show
          Rodrigo Schmidt added a comment - Yes, that's exactly what I mean. The code looks much better with HarEntry. Another thing I just noticed in your path is that you are calling checkSpace inside relPathToRoot() and outside it on writeTopLevelDirs(). I think it is better to check for spaces outside relPathToRoot() since checking for spaces is something that we would like to do in the beginning of the execution only (and relPathToRoot() is an auxiliary function we might want to use several times throughout the execution). If you agree with that, you can just remove checkSpace from inside relPathToRoot() and place it also after the relPathToRoot() call on method archive(). If you prefer to leave the checkSpace() call inside relPathToRoot(), then you can probably remove the checkSpace(relPath) call on writeTopLevelDirs().
          Hide
          Tsz Wo Nicholas Sze added a comment -

          > Nicholas, the patch looks good and you did some great refactoring also. ...
          Thanks, Rodrigo for taking a look. For the refactoring, you probably mean the change of the map input vale class from Text to HarEntry (and removing MapStat). It is necessary because the original codes uses Text to store paths in a SequenceFile, which again has the space problem.

          Show
          Tsz Wo Nicholas Sze added a comment - > Nicholas, the patch looks good and you did some great refactoring also. ... Thanks, Rodrigo for taking a look. For the refactoring, you probably mean the change of the map input vale class from Text to HarEntry (and removing MapStat). It is necessary because the original codes uses Text to store paths in a SequenceFile, which again has the space problem.
          Hide
          Rodrigo Schmidt added a comment -

          Nicholas, the patch looks good and you did some great refactoring also. Just two minor things:

          1) I think your patch file also contains the changes from MAPREDUCE-1578 and it probably won't merge well. You might want to merge it with trunk and submit it again.

          2) The LOG.info() calls might not help much, specially the first one.

          Show
          Rodrigo Schmidt added a comment - Nicholas, the patch looks good and you did some great refactoring also. Just two minor things: 1) I think your patch file also contains the changes from MAPREDUCE-1578 and it probably won't merge well. You might want to merge it with trunk and submit it again. 2) The LOG.info() calls might not help much, specially the first one.
          Tsz Wo Nicholas Sze made changes -
          Attachment m1579_20100310.patch [ 12438441 ]
          Hide
          Tsz Wo Nicholas Sze added a comment -

          m1579_20100310.patch: first patch for reviewing.

          Show
          Tsz Wo Nicholas Sze added a comment - m1579_20100310.patch: first patch for reviewing.
          Rodrigo Schmidt made changes -
          Link This issue relates to MAPREDUCE-1585 [ MAPREDUCE-1585 ]
          Hide
          Rodrigo Schmidt added a comment -

          +1 to Nicholas' suggestion.

          Show
          Rodrigo Schmidt added a comment - +1 to Nicholas' suggestion.
          Hide
          Tsz Wo Nicholas Sze added a comment -

          I suggest to have two config properties for the version 1 HadoopArchives.

          • har.space.replace.enable = <true|false>
          • har.space.replacement = <REPLACEMENT_STRING>
            If space replacement is not enabled, an exception will be thrown if there are spaces in the paths.
          Show
          Tsz Wo Nicholas Sze added a comment - I suggest to have two config properties for the version 1 HadoopArchives. har.space.replace.enable = <true|false> har.space.replacement = <REPLACEMENT_STRING> If space replacement is not enabled, an exception will be thrown if there are spaces in the paths.
          Rodrigo Schmidt made changes -
          Link This issue is related to MAPREDUCE-1578 [ MAPREDUCE-1578 ]
          Hide
          Rodrigo Schmidt added a comment -

          I agree, but I really think MAPREDUCE-1578 should be solved first. The way it is now, if we ever change the protocol on HarFileSystem, HadoopArchives will start generating old files with new version numbers, which is much more dangerous in my opinion.

          Show
          Rodrigo Schmidt added a comment - I agree, but I really think MAPREDUCE-1578 should be solved first. The way it is now, if we ever change the protocol on HarFileSystem, HadoopArchives will start generating old files with new version numbers, which is much more dangerous in my opinion.
          Hide
          Tsz Wo Nicholas Sze added a comment -

          The current behavior is undesirable. The archive tool may return successfully with the resulted har being corrupted. See some examples here.

          Show
          Tsz Wo Nicholas Sze added a comment - The current behavior is undesirable. The archive tool may return successfully with the resulted har being corrupted. See some examples here .
          Tsz Wo Nicholas Sze made changes -
          Field Original Value New Value
          Link This issue is related to HADOOP-6591 [ HADOOP-6591 ]
          Tsz Wo Nicholas Sze created issue -

            People

            • Assignee:
              Tsz Wo Nicholas Sze
              Reporter:
              Tsz Wo Nicholas Sze
            • Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development