Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.18.0, 0.18.1, 0.18.2, 0.18.3, 0.19.0, 0.19.1, 0.19.2, 0.20.0, 0.20.1
    • Fix Version/s: 0.20.2
    • Component/s: fs
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      Bugs fixed for Hadoop archives: character escaping in paths, LineReader and file system caching.

Description

      Found and fixed several bugs involving Hadoop archives:

    • In makeQualified(), the sloppy conversion from Path to URI and back mangles the path if it contains an escape-worthy character (see the sketch after this list).
      • It's possible that fileStatusInIndex() may have to read more than one segment of the index. The LineReader and count of bytes read need to be reset for each block.
      • har:// connections cannot be indexed by (scheme, authority, username) – the path is significant as well. Caching them in this way limits a hadoop client to opening one archive per filesystem. It seems to be safe not to cache them, since they wrap another connection that does the actual networking.
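
To make the first bullet concrete: below is a minimal, hypothetical sketch using plain java.net.URI (not the actual Hadoop Path code) showing how a sloppy string round trip double-escapes a path. The class name is invented for illustration.

    import java.net.URI;

    public class PathManglingSketch {
      public static void main(String[] args) throws Exception {
        // A path containing a character that needs URI escaping.
        URI first = new URI("file", null, "/user/test/file with spaces", null);
        System.out.println(first); // file:/user/test/file%20with%20spaces

        // Sloppy round trip: feed the already-escaped string form back in
        // as a raw path. The multi-argument URI constructor always quotes
        // '%', so the path gets escaped a second time and the original
        // name is lost.
        URI second = new URI("file", null, first.toString().substring(5), null);
        System.out.println(second); // file:/user/test/file%2520with%2520spaces
      }
    }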
Attachments

      1. HADOOP-6097.patch
        2 kB
        Ben Slusky
      2. HADOOP-6097-0.20.patch
        4 kB
        Mahadev konar
      3. HADOOP-6097-0.20.patch
        4 kB
        Mahadev konar
      4. HADOOP-6097-0.20.patch
        1 kB
        Mahadev konar
      5. HADOOP-6097-v2.patch
        1 kB
        Tom White

        Issue Links

          Activity

          Ben Slusky created issue -
          Ben Slusky made changes -
          Field Original Value New Value
          Attachment HADOOP-6097.patch [ 12411434 ]
          Ben Slusky made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Vladimir Klimontovich added a comment -

          The patch contains:

          +    /** Hadoop Archive connections cannot be cached by (scheme, authority,
          +     * username) – the path is significant as well. Come to think of it, they
          +     * probably don't need to be cached at all, since they wrap another
          +     * connection that does the actual networking.
          +     */
          +    if (scheme.equals("har")) {
          +      return createFileSystem(uri, conf);
          +    }
          +
               return CACHE.get(uri, conf);
             }

          I don't think this is a good approach.

          I'd suggest introducing a property fs.[fs-name].impl.disable.cache, and not using the cache when this property is present and set to true.
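
          For reference, a minimal sketch of how the suggested check might replace the har special case in the quoted patch (scheme, uri, conf, createFileSystem, and CACHE are the identifiers from the patch above; Configuration.getBoolean() is the existing Hadoop API):

              // Sketch only, not the committed code: consult a per-scheme
              // property before touching the cache.
              String disableCacheName = "fs." + scheme + ".impl.disable.cache";
              if (conf.getBoolean(disableCacheName, false)) {
                // Caching disabled for this scheme; always create a fresh instance.
                return createFileSystem(uri, conf);
              }
              return CACHE.get(uri, conf);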

          Koji Noguchi added a comment -

          Ben, for the caching part,
          HarFileSystem.java has a comment

              the uri of Har is
              har://underlyingfsscheme-host:port/archivepath.
              or
              har:///archivepath.

          So it already creates a cache for each harpath?

          Ben Slusky added a comment -

          Koji, I'm not entirely sure what you're saying. When using the har filesystem, you create a connection for each harpath. This is not a network connection itself, but it wraps a network connection to the underlying filesystem, which is cached. I think this caching is sufficient, and it is not necessary to cache the wrapping har filesystem "connection".

          Ben Slusky added a comment -

          Vladimir, I will implement that and submit a new patch.

          Mahadev konar added a comment -

          Ben,
          Koji is right. The caching here is just filesystem caching. The filesystem cache keeps an entry per scheme, so for the har filesystem it caches on the scheme and harpath, meaning that a har filesystem is uniquely identified by har:///archivepath. The connection caching has nothing to do with this filesystem cache; connection caching is done in the RPC layer, and you would not be able to cache connections at the har filesystem layer.

          Ben Slusky added a comment -

          Mahadev,

          Yes, we are talking about the filesystem cache, not the connection cache. Sorry for the confusion. As you say, har filesystems are identified by har://authority/archivepath, but the filesystem cache does not index filesystems by path – only by scheme, authority, and username. This leads directly to the misbehavior I described above. I worked around it by not caching har filesystems. Another option is to cache by scheme, authority, username, AND path. I could submit a patch to do that if anyone has a strong preference for it.
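
          A hypothetical, self-contained illustration of the collision Ben describes (names invented; the real cache is FileSystem.CACHE, keyed on exactly these fields):

              import java.net.URI;
              import java.util.HashMap;
              import java.util.Map;

              public class CacheKeyCollisionSketch {
                // The existing key shape: (scheme, authority, username) -- no path.
                record Key(String scheme, String authority, String user) {}

                public static void main(String[] args) {
                  Map<Key, String> cache = new HashMap<>();
                  URI a = URI.create("har:///user/knoguchi/test.har");
                  URI b = URI.create("har:///user/knoguchi/test2.har");
                  cache.put(new Key(a.getScheme(), a.getAuthority(), "knoguchi"), "fs for test.har");
                  cache.put(new Key(b.getScheme(), b.getAuthority(), "knoguchi"), "fs for test2.har");
                  // Both archives collapse to one key, so the second put evicts
                  // the first: a client can effectively open only one archive
                  // per filesystem cache entry.
                  System.out.println(cache.size()); // prints 1
                }
              }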

          Koji Noguchi added a comment -

          Ben, sorry about that. I was wrong.

          I got confused since HarFileSystem.getUri() returns har://archivepath,
          I mistakenly thought FileSystem.CACHE would use the path as part of the hash key.

          $ hadoop dfs -ls har:///user/knoguchi/test.har har:///user/knoguchi/test2.har                                
          Found 1 items
          drw-r--r--   - knoguchi users          0 2009-08-18 18:52 /user/knoguchi/test.har/user
          ls: Invalid file name: /user/knoguchi/test2.har in har:///user/knoguchi/test.har
          
          
          $ hadoop dfs -ls har:///user/knoguchi/test2.har har:///user/knoguchi/test.har
          Found 1 items
          drw-------   - knoguchi users          0 2009-08-17 19:15 /user/knoguchi/test2.har/user
          ls: Invalid file name: /user/knoguchi/test.har in har:///user/knoguchi/test2.har
          $ 
          
          Chris Douglas added a comment -

          I agree with Vladimir; checking the scheme for "har" and dodging the cache is not a sufficiently general solution for FileSystem. As noted in the patch comment, the har filesystem isn't a compelling enough use case to justify a four-level cache, either.

          introduce a property fs.[fs-name].impl.disable.cache, and don't use the cache when this property is present and set to true

          +1 However, I'd suggest that it be part of a separate issue.

          Chris Douglas made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Assignee Ben Slusky [ sluskyb ]
          Tom White added a comment -

          I'd suggest that it be part of a separate issue.

          I've opened HADOOP-6097 to add support for disabling the cache on a per-filesystem basis.

          I've also regenerated the patch for this issue. It could really do with a unit test.

          Tom White made changes -
          Attachment HADOOP-6097-v2.patch [ 12418348 ]
          Tom White added a comment -

          I meant HADOOP-6231 for the cache issue.

          Vladimir Klimontovich made changes -
          Link This issue is cloned as HADOOP-6231 [ HADOOP-6231 ]
          Owen O'Malley made changes -
          Fix Version/s 0.20.2 [ 12314203 ]
          Fix Version/s 0.20.1 [ 12313866 ]
          Mahadev konar added a comment -

          Tom, Ben,
          this jira has been marked for a fix in the 0.20 branch. HADOOP-6231 has been committed only to trunk (i.e. 0.21)... should this jira be moved to 0.21?

          Tom White added a comment -

          The latest patch that I generated was actually for trunk, so it would need a 0.20 version if folks want a fix in that version.

          Mahadev konar added a comment -

          Tom, the issue is that we would also need a backport of HADOOP-6231, which I don't think qualifies to get into the 0.20 branch, since it's not a bug fix. We can attach a patch on HADOOP-6231 for the 0.20 branch (it wouldn't be committed to the Apache branch, but others can download it if they want to include it). I can do that for now...

          Ben Slusky added a comment -

          Mahadev, HADOOP-6231 is a bug fix – see Koji's last comment above. I attached a patch for the 0.20 branch there.

          Ben Slusky made changes -
          Affects Version/s 0.20.1 [ 12313866 ]
          Affects Version/s 0.19.2 [ 12313650 ]
          Mahadev konar added a comment -

          Ben, I did read through the comments... I think you are right... HADOOP-6231 can be treated as a bug fix for 0.20.*. I will try committing it to the 0.20 branch after running the tests and such...

          I'll generate a 0.20 patch for this jira as well... it probably needs a test case...

          Mahadev konar added a comment -

          ant test passes with this patch, except for TestHdfsProxy in contrib, which fails for the following unrelated reason:

          Testcase: testHdfsProxyInterface took 6.078 sec
                  Caused an ERROR
          org/apache/commons/cli/ParseException
          java.lang.NoClassDefFoundError: org/apache/commons/cli/ParseException
                  at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:59)
                  at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
                  at org.apache.hadoop.hdfsproxy.TestHdfsProxy.testHdfsProxyInterface(TestHdfsProxy.java:222)
          
          Mahadev konar added a comment -

          Filed MAPREDUCE-1010 for adding the testcase.

          Mahadev konar made changes -
          Link This issue incorporates MAPREDUCE-1010 [ MAPREDUCE-1010 ]
          Mahadev konar made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12418348/HADOOP-6097-v2.patch
          against trunk revision 816794.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h1.grid.sp2.yahoo.net/14/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h1.grid.sp2.yahoo.net/14/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h1.grid.sp2.yahoo.net/14/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h1.grid.sp2.yahoo.net/14/console

          This message is automatically generated.

          Mahadev konar added a comment -

          patch for the 0.20 branch..

          Mahadev konar made changes -
          Attachment HADOOP-6097-0.20.patch [ 12420209 ]
          Mahadev konar added a comment -

          this patch includes the test in MAPREDUCE-1010.

          Mahadev konar made changes -
          Attachment HADOOP-6097-0.20.patch [ 12420224 ]
          Koji Noguchi added a comment -

          Do we want a test case of handling two har files at once?
          Other than that, +1.

          Mahadev konar added a comment -

          since the core problem was fixed in HADOOP-6231, which already adds a test for it, I don't think we need specific tests for har...

          Koji Noguchi added a comment -

          since the core problem was fixed in HADOOP-6231, which already adds a test for it, I don't think we need specific tests for har...

          I wanted a testcase so that it would catch it if someone takes out

          <name>fs.har.impl.disable.cache</name>
          <value>true</value>
          

          from core-default.xml
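
          A minimal sketch of such a guard test (hypothetical class and method names; Hadoop's Configuration loads core-default.xml by default, so the assertion fails if the property is removed or flipped to false):

              import junit.framework.TestCase;
              import org.apache.hadoop.conf.Configuration;

              public class TestHarCacheDisabledSketch extends TestCase {
                public void testHarCacheDisabledInDefaults() {
                  Configuration conf = new Configuration();
                  // getBoolean falls back to false, so this fails if someone
                  // drops fs.har.impl.disable.cache from core-default.xml.
                  assertTrue(conf.getBoolean("fs.har.impl.disable.cache", false));
                }
              }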

          Mahadev konar added a comment -

          not a bad test to add at all... will add it ....

          Mahadev konar added a comment -

          Patch with the changes corresponding to MAPREDUCE-1010 for the hadoop-0.20 branch.

          Mahadev konar made changes -
          Attachment HADOOP-6097-0.20.patch [ 12422737 ]
          Chris Douglas added a comment -

          +1

          I committed this. Thanks, Ben, Tom, and Mahadev!

          As part of the commit, I created a new 0.20.2 section and moved issues committed to 0.20.2 from trunk into that section.

          Chris Douglas made changes -
          Status Patch Available [ 10002 ] Resolved [ 5 ]
          Hadoop Flags [Reviewed]
          Resolution Fixed [ 1 ]
          Hudson added a comment -

          Integrated in Hadoop-Common-trunk-Commit #64 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Common-trunk-Commit/64/)
          HADOOP-6097. Fix Path conversion in makeQualified and reset LineReader byte
          count at the start of each block in Hadoop archives. Contributed by Ben Slusky,
          Tom White, and Mahadev Konar

          Hudson added a comment -

          Integrated in Hadoop-Mapreduce-trunk-Commit #88 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Mapreduce-trunk-Commit/88/)
          HADOOP-6097. Fix Path conversion in makeQualified and reset LineReader byte
          count at the start of each block in Hadoop archives. Contributed by Ben Slusky,
          Tom White, and Mahadev Konar

          Hudson added a comment -

          Integrated in Hadoop-Common-trunk #134 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Common-trunk/134/)
          HADOOP-6097. Fix Path conversion in makeQualified and reset LineReader byte
          count at the start of each block in Hadoop archives. Contributed by Ben Slusky,
          Tom White, and Mahadev Konar

          Hudson added a comment -

          Integrated in Hadoop-Mapreduce-trunk #120 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Mapreduce-trunk/120/)

          Robert Chansler made changes -
          Release Note Bugs fixed for Hadoop archives: character escaping in paths, LineReader and file system caching.

            People

             • Assignee:
               Ben Slusky
             • Reporter:
               Ben Slusky
             • Votes:
               0
             • Watchers:
               8

              Dates

               • Created:
               • Updated:
               • Resolved:
