Hadoop HDFS
  1. Hadoop HDFS
  2. HDFS-946

NameNode should not return full path name when lisitng a diretory or getting the status of a file

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.21.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Incompatible change, Reviewed

      Description

      FSDirectory#getListring(String src) has the following code:
      int i = 0;
      for (INode cur : contents)

      { listing[i] = createFileStatus(srcs+cur.getLocalName(), cur); i++; }

      So listing a directory will return an array of FileStatus. Each FileStatus element has the full path name. This increases the return message size and adds non-negligible CPU time to the operation.

      FSDirectory#getFileInfo(String) does not need to return the file name either.

      Another optimization is that in the version of FileStatus that's used in the wire protocol, the field path does not need to be Path; It could be a String or a byte array ideally. This could avoid unnecessary creation of the Path objects at NameNode, thus help reduce the GC problem observed when a large number of getFileInfo or getListing operations hit NameNode.

      1. HdfsFileStatus-yahoo20.patch
        51 kB
        Hairong Kuang
      2. HdfsFileStatusProxy-Yahoo20.patch
        1 kB
        Hairong Kuang
      3. HdfsFileStatus4.patch
        59 kB
        Hairong Kuang
      4. HdfsFileStatus3.patch
        59 kB
        Hairong Kuang
      5. HDFSFileStatus1.patch
        58 kB
        Hairong Kuang
      6. HDFSFileStatus.patch
        57 kB
        Hairong Kuang

        Issue Links

          Activity

          Hide
          Doug Cutting added a comment -

          This sounds reasonable. But the client would still return fully-qualified paths, no?

          Show
          Doug Cutting added a comment - This sounds reasonable. But the client would still return fully-qualified paths, no?
          Hide
          Hairong Kuang added a comment -

          For this jira, the client will still return fully-qualified paths.

          But I am thinking that even at the FileContext level it is not necessary to return fully-qualified paths. However this is a user-facing incompatible change. I would prefer to discuss it in a different jira.

          Show
          Hairong Kuang added a comment - For this jira, the client will still return fully-qualified paths. But I am thinking that even at the FileContext level it is not necessary to return fully-qualified paths. However this is a user-facing incompatible change. I would prefer to discuss it in a different jira.
          Hide
          dhruba borthakur added a comment -

          Client's should continue to get the full path name in a FileStatus object, isn't it? Otherwise many many existing client applications will break.

          If you are proposing that the object that is sent over-the-wire is different from FileStatus. If so, please consider the requirement of HDFS-878 too.

          Show
          dhruba borthakur added a comment - Client's should continue to get the full path name in a FileStatus object, isn't it? Otherwise many many existing client applications will break. If you are proposing that the object that is sent over-the-wire is different from FileStatus. If so, please consider the requirement of HDFS-878 too.
          Hide
          Hairong Kuang added a comment -

          This patch does the following:
          1. create a new file status in HDFS, called HDFSFileStatus, for over-the-wire transfer, in which the path contains only the local name of a path, not the full path. Also the path is represented as is a byte array in Java UTF8 encoding just the same as the one stored in each inode.
          2. change ClientProtocol getListing to return HDFSFileStatus[] and getFileInfo to return HDFSFileStatus.
          a. The path in the return value of getFileInfo is always an empty byte array.
          b. If listStatus is called on a file, the path in the only HDFSFileStatus returned also is an empty byte array;
          c. If listStatus is called on a directory, the path in each HDFSFileStatus in the returned array contains the local name of the directory entry.
          3. FileSystem#getFileStatus and FileSystem#listStatus still see FileStatus which contains the full path name.
          4. Unit tests are added to TestFileStatus to verify the change.

          Show
          Hairong Kuang added a comment - This patch does the following: 1. create a new file status in HDFS, called HDFSFileStatus, for over-the-wire transfer, in which the path contains only the local name of a path, not the full path. Also the path is represented as is a byte array in Java UTF8 encoding just the same as the one stored in each inode. 2. change ClientProtocol getListing to return HDFSFileStatus[] and getFileInfo to return HDFSFileStatus. a. The path in the return value of getFileInfo is always an empty byte array. b. If listStatus is called on a file, the path in the only HDFSFileStatus returned also is an empty byte array; c. If listStatus is called on a directory, the path in each HDFSFileStatus in the returned array contains the local name of the directory entry. 3. FileSystem#getFileStatus and FileSystem#listStatus still see FileStatus which contains the full path name. 4. Unit tests are added to TestFileStatus to verify the change.
          Hide
          Hairong Kuang added a comment -

          This patch does the following:
          1. create a new file status in HDFS, called HDFSFileStatus, for over-the-wire transfer, in which the path contains only the local name of a path, not the full path. Also the path is represented as is a byte array in Java UTF8 encoding just the same as the one stored in each inode.
          2. change ClientProtocol getListing to return HDFSFileStatus[] and getFileInfo to return HDFSFileStatus.
          a. The path in the return value of getFileInfo is always an empty byte array.
          b. If listStatus is called on a file, the path in the only HDFSFileStatus returned also is an empty byte array;
          c. If listStatus is called on a directory, the path in each HDFSFileStatus in the returned array contains the local name of the directory entry.
          3. FileSystem#getFileStatus and FileSystem#listStatus still see FileStatus which contains the full path name.
          4. Unit tests are added to TestFileStatus to verify the change.

          Show
          Hairong Kuang added a comment - This patch does the following: 1. create a new file status in HDFS, called HDFSFileStatus, for over-the-wire transfer, in which the path contains only the local name of a path, not the full path. Also the path is represented as is a byte array in Java UTF8 encoding just the same as the one stored in each inode. 2. change ClientProtocol getListing to return HDFSFileStatus[] and getFileInfo to return HDFSFileStatus. a. The path in the return value of getFileInfo is always an empty byte array. b. If listStatus is called on a file, the path in the only HDFSFileStatus returned also is an empty byte array; c. If listStatus is called on a directory, the path in each HDFSFileStatus in the returned array contains the local name of the directory entry. 3. FileSystem#getFileStatus and FileSystem#listStatus still see FileStatus which contains the full path name. 4. Unit tests are added to TestFileStatus to verify the change.
          Hide
          Hairong Kuang added a comment -

          This fixed a bug in the listing of the root.

          Show
          Hairong Kuang added a comment - This fixed a bug in the listing of the root.
          Hide
          Hairong Kuang added a comment -

          > If you are proposing that the object that is sent over-the-wire is different from FileStatus. If so, please consider the requirement of HDFS-878 too.
          This jira tries to reduce the cost of getFileInfo and listing a directory, where HDFS-878 adds cost to these two operations.. So I will not implement HDFS-878 in this jira. Since we are having so many problems with getFileInfo and list a directory, we should be very cautious about adding anything to FileStatus in hdfs unless it is absolutely necessary.

          I have conducted some experiments with my patch. I write an application that spawns 100 threads, each of which lists a directory of size 1300 for 200 times. I use yourKit to profile the NameNode while the application is running. Without the patch, NameNode's CPU utilization is 20~26% and time spent on GC is 3~5%. With the patch, NameNode's CPU utilization drops to 12~17% and the time spent on GS is mostly 0% but occasionally becomes 1 or 2%.

          Show
          Hairong Kuang added a comment - > If you are proposing that the object that is sent over-the-wire is different from FileStatus. If so, please consider the requirement of HDFS-878 too. This jira tries to reduce the cost of getFileInfo and listing a directory, where HDFS-878 adds cost to these two operations.. So I will not implement HDFS-878 in this jira. Since we are having so many problems with getFileInfo and list a directory, we should be very cautious about adding anything to FileStatus in hdfs unless it is absolutely necessary. I have conducted some experiments with my patch. I write an application that spawns 100 threads, each of which lists a directory of size 1300 for 200 times. I use yourKit to profile the NameNode while the application is running. Without the patch, NameNode's CPU utilization is 20~26% and time spent on GC is 3~5%. With the patch, NameNode's CPU utilization drops to 12~17% and the time spent on GS is mostly 0% but occasionally becomes 1 or 2%.
          Hide
          Suresh Srinivas added a comment -
          1. TestDFSShell.java not sure why the methods are named starting with caps. Also is the change to this file needed?
          2. FSDirectory.createFileStatus - consider moving isDirectory check outside. Also current code extends beyond 80 columns.
          3. HDFSFileStatus
            • consider naming it HdfsFileStatus
            • final static public should public static final
            • since this if for HDFS, comments in the code about different notions in the FS is not required in methods getPermission(), getOwner(), getGroup(),
            • Some of the method parameters and other variables could be declared final
          4. getFulName() - without unnecessary else code is more readable. Same for getFullPath()
          Show
          Suresh Srinivas added a comment - TestDFSShell.java not sure why the methods are named starting with caps. Also is the change to this file needed? FSDirectory.createFileStatus - consider moving isDirectory check outside. Also current code extends beyond 80 columns. HDFSFileStatus consider naming it HdfsFileStatus final static public should public static final since this if for HDFS, comments in the code about different notions in the FS is not required in methods getPermission(), getOwner(), getGroup(), Some of the method parameters and other variables could be declared final getFulName() - without unnecessary else code is more readable. Same for getFullPath()
          Hide
          Hairong Kuang added a comment -

          This patch incorporates all review comments except for "2. consider move isDirectory outside". In addition, it
          1. removes the method getPath(), which returns the byte array, from HdfsFileStatus. This makes the byte array immutable, which makes the change safer.
          2. fixes a few more bugs in Jsp code which misuses getPath().
          3. adds comments to INode#name that reminds that if this encoding is changed, the ClientProtocol is changed and the decoding of HdfsFileStatus#name should change too.

          Show
          Hairong Kuang added a comment - This patch incorporates all review comments except for "2. consider move isDirectory outside". In addition, it 1. removes the method getPath(), which returns the byte array, from HdfsFileStatus. This makes the byte array immutable, which makes the change safer. 2. fixes a few more bugs in Jsp code which misuses getPath(). 3. adds comments to INode#name that reminds that if this encoding is changed, the ClientProtocol is changed and the decoding of HdfsFileStatus#name should change too.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12436085/HdfsFileStatus3.patch
          against trunk revision 910760.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 15 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/234/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/234/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/234/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/234/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12436085/HdfsFileStatus3.patch against trunk revision 910760. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 15 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. -1 contrib tests. The patch failed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/234/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/234/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/234/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/234/console This message is automatically generated.
          Hide
          Suresh Srinivas added a comment -

          +1 the patch looks good.

          Show
          Suresh Srinivas added a comment - +1 the patch looks good.
          Hide
          Hairong Kuang added a comment -

          Looks that contribute tests did not get to run. Resubmitting the patch...

          Show
          Hairong Kuang added a comment - Looks that contribute tests did not get to run. Resubmitting the patch...
          Hide
          Hairong Kuang added a comment -

          This patch synced with the trunk and fixed a bug in ListPathsServlet.

          Show
          Hairong Kuang added a comment - This patch synced with the trunk and fixed a bug in ListPathsServlet.
          Hide
          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk-Commit #197 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Hdfs-trunk-Commit/197/)
          . NameNode should not return full path name when lisitng a diretory or getting the status of a file. Contributed by Hairong Kuang.

          Show
          Hudson added a comment - Integrated in Hadoop-Hdfs-trunk-Commit #197 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Hdfs-trunk-Commit/197/ ) . NameNode should not return full path name when lisitng a diretory or getting the status of a file. Contributed by Hairong Kuang.
          Hide
          Hairong Kuang added a comment -

          Here is the patch that will be committed to Yahoo! Hadoop 20.

          Show
          Hairong Kuang added a comment - Here is the patch that will be committed to Yahoo! Hadoop 20.
          Hide
          Hairong Kuang added a comment -

          Hudson seems to have problem running the test. I ran "ant test-core" locally and tests were passed.

          I've committed this!

          Show
          Hairong Kuang added a comment - Hudson seems to have problem running the test. I ran "ant test-core" locally and tests were passed. I've committed this!
          Hide
          Hairong Kuang added a comment -

          This patch additionally has the change to hdfsproxy in Yahoo!'s security branch.

          Show
          Hairong Kuang added a comment - This patch additionally has the change to hdfsproxy in Yahoo!'s security branch.
          Hide
          Hudson added a comment -

          Integrated in Hdfs-Patch-h2.grid.sp2.yahoo.net #146 (See http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/146/)

          Show
          Hudson added a comment - Integrated in Hdfs-Patch-h2.grid.sp2.yahoo.net #146 (See http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/146/ )
          Hide
          Hudson added a comment -

          Integrated in Hdfs-Patch-h5.grid.sp2.yahoo.net #302 (See http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/302/)

          Show
          Hudson added a comment - Integrated in Hdfs-Patch-h5.grid.sp2.yahoo.net #302 (See http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/302/ )
          Hide
          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk #275 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Hdfs-trunk/275/)

          Show
          Hudson added a comment - Integrated in Hadoop-Hdfs-trunk #275 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Hdfs-trunk/275/ )

            People

            • Assignee:
              Hairong Kuang
              Reporter:
              Hairong Kuang
            • Votes:
              0 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development