Hadoop Common
  1. Hadoop Common
  2. HADOOP-6678

Remove FileContext#isFile, isDirectory and exists

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.21.0
    • Component/s: fs
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      1. Add a method Iterator<FileStatus> listStatus(Path), which allows HDFS client not to have the whole listing in the memory, benefit more from the iterative listing added in HDFS-985. Move the current FileStatus[] listStatus(Path) to be a utility method.
      2. Remove methods isFile(Path), isDirectory(Path), and exists.
        All these methods are implemented by calling getFileStatus(Path).But most users are not aware of this. They would write code as below:
          FileContext fc = ..;
          if (fc.exists(path)) {
            if (fc.isFile(path)) {
             ...
            } else {
            ...
            }
          }
        

        The above code adds unnecessary getFileInfo RPC to NameNode. In our production clusters, we often see that the number of getFileStatus calls is multiple times of the open calls. If we remove isFile, isDirectory, and exists from FileContext, users have to explicitly call getFileStatus first, it is more likely that they will write more efficient code as follow:

          FileContext fc = ...;
          FileStatus fstatus = fc.getFileStatus(path);
          if (fstatus.isFile() {
            ...
          } else {
            ...
          }
        
      1. hadoop-6678-2.patch
        46 kB
        Eli Collins
      2. hadoop-6678-1.patch
        67 kB
        Eli Collins

        Issue Links

          Activity

          Hide
          Hairong Kuang added a comment -

          I've just committed this. Thanks, Eli!

          Show
          Hairong Kuang added a comment - I've just committed this. Thanks, Eli!
          Hide
          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12443114/hadoop-6678-2.patch
          against trunk revision 939026.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 24 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h1.grid.sp2.yahoo.net/61/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h1.grid.sp2.yahoo.net/61/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h1.grid.sp2.yahoo.net/61/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h1.grid.sp2.yahoo.net/61/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12443114/hadoop-6678-2.patch against trunk revision 939026. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 24 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h1.grid.sp2.yahoo.net/61/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h1.grid.sp2.yahoo.net/61/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h1.grid.sp2.yahoo.net/61/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h1.grid.sp2.yahoo.net/61/console This message is automatically generated.
          Hide
          Hairong Kuang added a comment -

          +1. The patch looks good to me.

          Show
          Hairong Kuang added a comment - +1. The patch looks good to me.
          Hide
          Eli Collins added a comment -

          Thanks Hairong. Patch attached.

          • Removes FC#isFile and FC#isDirectory, and uses thereof.
          • Refactors FC#exists per previous comment and moves to Util
          • Refactors FC#checkDest. Note that it is now passed overwrite instead of false. It (and it's callers) needs unit tests. Filed HADOOP-6730 for that.
          • Adds Path#getName tests to TestPath to sanity check my understanding of FC#checkDest.
          • Modifies test classes to use the new helper methods

          Still needs to be applied with HDFS-1089 (easy 4 line patch). ant test-core passes.

          Show
          Eli Collins added a comment - Thanks Hairong. Patch attached. Removes FC#isFile and FC#isDirectory, and uses thereof. Refactors FC#exists per previous comment and moves to Util Refactors FC#checkDest. Note that it is now passed overwrite instead of false. It (and it's callers) needs unit tests. Filed HADOOP-6730 for that. Adds Path#getName tests to TestPath to sanity check my understanding of FC#checkDest. Modifies test classes to use the new helper methods Still needs to be applied with HDFS-1089 (easy 4 line patch). ant test-core passes.
          Hide
          Hairong Kuang added a comment -

          > I think we can write FC#exists as..
          +1. Thank you, Eli!

          Show
          Hairong Kuang added a comment - > I think we can write FC#exists as.. +1. Thank you, Eli!
          Hide
          Eli Collins added a comment -

          Hey Hairong, no worries - looking at the implementation of FileContext#exists I think it can be simplified:

            public boolean exists(final Path f) throws AccessControlException,  IOException {
              try {
                return getFileStatus(f) != null;            
              } catch (FileNotFoundException e) {
                return false;
              } catch (UnsupportedFileSystemException e) {
                return false;
              } catch (UnresolvedLinkException e) {
                return false;
              }
            }
          
          1. AFS#getFileStatus should never return null, eg see Hdfs#getFileStatus. Per the spec if should throw a FileNotFoundException if f does not exist.
          2. Seems like the client should have to deal with UnsupportedFileSystemException rather than have FC#exists swallow it
          3. UnresolvedLinkException will never be thrown (see HADOOP-6727)

          Therefore I think we can write FC#exists as

            public boolean exists(final Path f) throws AccessControlException, UnsupportedFileSystemException,  IOException {
              try {
                FileStatus fs =  getFileStatus(f);
                assert fs != null;
                return true;
              } catch (FileNotFoundException e) {
                return false;
              }
            }
          

          Since this is still not a one-liner putting it in FileContext#util seems reasonable to me. I'll send out new patches.

          Show
          Eli Collins added a comment - Hey Hairong, no worries - looking at the implementation of FileContext#exists I think it can be simplified: public boolean exists( final Path f) throws AccessControlException, IOException { try { return getFileStatus(f) != null ; } catch (FileNotFoundException e) { return false ; } catch (UnsupportedFileSystemException e) { return false ; } catch (UnresolvedLinkException e) { return false ; } } AFS#getFileStatus should never return null, eg see Hdfs#getFileStatus. Per the spec if should throw a FileNotFoundException if f does not exist. Seems like the client should have to deal with UnsupportedFileSystemException rather than have FC#exists swallow it UnresolvedLinkException will never be thrown (see HADOOP-6727 ) Therefore I think we can write FC#exists as public boolean exists( final Path f) throws AccessControlException, UnsupportedFileSystemException, IOException { try { FileStatus fs = getFileStatus(f); assert fs != null ; return true ; } catch (FileNotFoundException e) { return false ; } } Since this is still not a one-liner putting it in FileContext#util seems reasonable to me. I'll send out new patches.
          Hide
          Hairong Kuang added a comment -

          Hi Eli, sorry for the delay for getting back to this issue. I was on vacation & worked on a cluster emergency after I came back.

          But I got some time to think more about this proposed FileContext change. I am quite concerned about removing FileContext#exists. This is a useful often-used method that's not easy for a user to write. Do you think if it is a good idea that we still keep this method but put it in FileContext#Util with a comment saying that "be cautious about using this method." and list a few scenarios that could avoid calling this method?

          Show
          Hairong Kuang added a comment - Hi Eli, sorry for the delay for getting back to this issue. I was on vacation & worked on a cluster emergency after I came back. But I got some time to think more about this proposed FileContext change. I am quite concerned about removing FileContext#exists. This is a useful often-used method that's not easy for a user to write. Do you think if it is a good idea that we still keep this method but put it in FileContext#Util with a comment saying that "be cautious about using this method." and list a few scenarios that could avoid calling this method?
          Hide
          Eli Collins added a comment -

          Thanks for the review Hairong. Could you check out HDFS-1089 as well? It's simple, just removes uses of FileContext#isFile, isDirectory and exists from HDFS, which needs to be checked in before this patch.

          Show
          Eli Collins added a comment - Thanks for the review Hairong. Could you check out HDFS-1089 as well? It's simple, just removes uses of FileContext#isFile, isDirectory and exists from HDFS, which needs to be checked in before this patch.
          Hide
          Hairong Kuang added a comment -

          +1 on Eli's patch.

          Show
          Hairong Kuang added a comment - +1 on Eli's patch.
          Hide
          Hairong Kuang added a comment -

          Thanks Eli for working on part 2. I will review it.

          Since Eli submitted the patch to part 2 to this jira, I created HADOOP-6692 for the listStatus change.

          Show
          Hairong Kuang added a comment - Thanks Eli for working on part 2. I will review it. Since Eli submitted the patch to part 2 to this jira, I created HADOOP-6692 for the listStatus change.
          Hide
          Eli Collins added a comment -

          Right, the patch requires HADOOP-6689 and HADOOP-6563, sorry for the noise.

          Show
          Eli Collins added a comment - Right, the patch requires HADOOP-6689 and HADOOP-6563 , sorry for the noise.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12441111/hadoop-6678-1.patch
          against trunk revision 931226.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 21 new or modified tests.

          -1 patch. The patch command could not apply the patch.

          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/448/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12441111/hadoop-6678-1.patch against trunk revision 931226. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 21 new or modified tests. -1 patch. The patch command could not apply the patch. Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/448/console This message is automatically generated.
          Hide
          Eli Collins added a comment -

          Forgot to mention, this patch also adds FileStatus#isFile, currently it just returns !isDir so that it's clear this change is mostly just changing syntax. Further FileStatus changes will be made in HADOOP-6585.

          Show
          Eli Collins added a comment - Forgot to mention, this patch also adds FileStatus#isFile, currently it just returns !isDir so that it's clear this change is mostly just changing syntax. Further FileStatus changes will be made in HADOOP-6585 .
          Hide
          Eli Collins added a comment -

          Here's a patch for part 2, removes FileContext#isFile, isDirectory, and exists.

          Notes:

          • We no longer call fixRelativePart on the path passed to getFileStatus since that method already calls fixRelativePart on the argument (so this redundant call is now gone)
          • deleteOnExit now throws the FileNotFoundException like delete rather than return false
          • Removed UnresolvedLinkException from the throws clause of the public FileContext APIs since they'll never throw this exception (FileContext resolves links).
          • Patch applies to trunk plus HADOOP-6689 and HADOOP-6563
          Show
          Eli Collins added a comment - Here's a patch for part 2, removes FileContext#isFile, isDirectory, and exists. Notes: We no longer call fixRelativePart on the path passed to getFileStatus since that method already calls fixRelativePart on the argument (so this redundant call is now gone) deleteOnExit now throws the FileNotFoundException like delete rather than return false Removed UnresolvedLinkException from the throws clause of the public FileContext APIs since they'll never throw this exception (FileContext resolves links). Patch applies to trunk plus HADOOP-6689 and HADOOP-6563
          Hide
          Hairong Kuang added a comment -

          > I cold do part 2 in lieu of HADOOP-6585 (Add FileStatus#isDirectory and isFile) and HDFS-995, MR-1535.
          Eli, thanks for volunteering to do part 2. Your help would be great!

          Show
          Hairong Kuang added a comment - > I cold do part 2 in lieu of HADOOP-6585 (Add FileStatus#isDirectory and isFile) and HDFS-995 , MR-1535. Eli, thanks for volunteering to do part 2. Your help would be great!
          Hide
          Todd Lipcon added a comment -

          OK, sounds good. Thanks for the info, Hairong.

          Show
          Todd Lipcon added a comment - OK, sounds good. Thanks for the info, Hairong.
          Hide
          Hairong Kuang added a comment -

          We thought about client side stat cache too. Actually right before I filed the jira, we went through the two options one more time. Although I am excited about the cache solution, we feel that the proposed solution for #2 is less controversy, straightforward, and also exposes more information to for users to write more efficient code.

          Show
          Hairong Kuang added a comment - We thought about client side stat cache too. Actually right before I filed the jira, we went through the two options one more time. Although I am excited about the cache solution, we feel that the proposed solution for #2 is less controversy, straightforward, and also exposes more information to for users to write more efficient code.
          Hide
          Todd Lipcon added a comment -

          The other option for #2 is a stat cache, no? Of course we're trading off one set of problems for another, but I think for a lot of use cases a 3 second stat cache (a la linux nfs default) is totally reasonable, so long as it's easily disabled.

          Show
          Todd Lipcon added a comment - The other option for #2 is a stat cache, no? Of course we're trading off one set of problems for another, but I think for a lot of use cases a 3 second stat cache (a la linux nfs default) is totally reasonable, so long as it's easily disabled.
          Hide
          Eli Collins added a comment -

          Accidentally submitted the last comment..

          +1

          I cold do part 2 in lieu of HADOOP-6585 (Add FileStatus#isDirectory and isFile) and HDFS-995, MR-1535 (Replace usage of FileStatus#isDir()) which I just started working on today.

          Thanks,
          Eli

          Show
          Eli Collins added a comment - Accidentally submitted the last comment.. +1 I cold do part 2 in lieu of HADOOP-6585 (Add FileStatus#isDirectory and isFile) and HDFS-995 , MR-1535 (Replace usage of FileStatus#isDir()) which I just started working on today. Thanks, Eli
          Hide
          Eli Collins added a comment -

          +1

          HADOOP-6585 (Add FileStatus#isDirectory and isFile)

          (Replace usage of FileStatus#isDir())

          Show
          Eli Collins added a comment - +1 HADOOP-6585 (Add FileStatus#isDirectory and isFile) (Replace usage of FileStatus#isDir())

            People

            • Assignee:
              Eli Collins
              Reporter:
              Hairong Kuang
            • Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development