Hadoop HDFS
  1. Hadoop HDFS
  2. HDFS-814

Add an api to get the visible length of a DFSDataInputStream.

    Details

    • Type: New Feature New Feature
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.21.0
    • Component/s: hdfs-client
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      Add an api to get the visible length of a DFSDataInputStream.

      Description

      Hflush guarantees that the bytes written before are visible to the new readers. However, there is no way to get the length of the visible bytes. The visible length is useful in some applications like SequenceFile.

      1. privateInputStream.patch
        0.7 kB
        Hairong Kuang
      2. h814_20091221.patch
        3 kB
        Tsz Wo Nicholas Sze
      3. h814_20091221_0.21.patch
        3 kB
        Tsz Wo Nicholas Sze
      4. getLength-yahoo-0.20.patch
        3 kB
        Hairong Kuang

        Issue Links

          Activity

          Hide
          dhruba borthakur added a comment -

          The "visible length" of a file should be the same as the "length" of a file, isn't it? If so, isn't it appropriate to return that length in the getFileStatus() call (even for files that are being currently written to)?

          Show
          dhruba borthakur added a comment - The "visible length" of a file should be the same as the "length" of a file, isn't it? If so, isn't it appropriate to return that length in the getFileStatus() call (even for files that are being currently written to)?
          Hide
          Tsz Wo Nicholas Sze added a comment -

          > The "visible length" of a file should be the same as the "length" of a file, isn't it?

          For closed files, the visible length and the length of a file are the same.

          For files being written, the length of a file is unknown to the NameNode. The individual replica lengths in each datanode may be different. The visible length is the length that all datanodes in the pipeline contain at least such amount of data. Therefore, these data are visible to the readers.

          The implementation of the new API should be similar to DFSInputStream.available() except that it returns a long. See also HDFS-691.

          Show
          Tsz Wo Nicholas Sze added a comment - > The "visible length" of a file should be the same as the "length" of a file, isn't it? For closed files, the visible length and the length of a file are the same. For files being written, the length of a file is unknown to the NameNode. The individual replica lengths in each datanode may be different. The visible length is the length that all datanodes in the pipeline contain at least such amount of data. Therefore, these data are visible to the readers. The implementation of the new API should be similar to DFSInputStream.available() except that it returns a long. See also HDFS-691 .
          Hide
          Tsz Wo Nicholas Sze added a comment -

          h814_20091221.patch: added DFSDataInputStream.getVisibleLength().

          Show
          Tsz Wo Nicholas Sze added a comment - h814_20091221.patch: added DFSDataInputStream.getVisibleLength().
          Hide
          dhruba borthakur added a comment -

          Can this be added to 0.21 as well?

          Show
          dhruba borthakur added a comment - Can this be added to 0.21 as well?
          Hide
          Tsz Wo Nicholas Sze added a comment -

          > Can this be added to 0.21 as well?

          Sure. This is a part of hflush.

          How does the patch look to you?

          Show
          Tsz Wo Nicholas Sze added a comment - > Can this be added to 0.21 as well? Sure. This is a part of hflush. How does the patch look to you?
          Hide
          dhruba borthakur added a comment -

          Code looks good. Can we also enhance DFSClient.getFileInfo() to return the current length of a file (for a file that is being written into)... something like this:

          public FileStatus getFileInfo(String src) throws IOException {
          checkOpen();
          try {
          FileStatus stat = namenode.getFileInfo(src);
          if (stat.isUnderConstruction())

          Unknown macro: { stat.length = DFSClient.open(src).getFileLength(); }

          } catch(RemoteException re)

          Unknown macro: { throw re.unwrapRemoteException(AccessControlException.class); }

          }

          The benefit of this approach might reduce confusion to users...especially if DFSClient.getFileInfo() and DfsClient.getFileLength() returns different file sizes for the same file. Also, I am guessing that this will not introduce any new performance impact.

          Show
          dhruba borthakur added a comment - Code looks good. Can we also enhance DFSClient.getFileInfo() to return the current length of a file (for a file that is being written into)... something like this: public FileStatus getFileInfo(String src) throws IOException { checkOpen(); try { FileStatus stat = namenode.getFileInfo(src); if (stat.isUnderConstruction()) Unknown macro: { stat.length = DFSClient.open(src).getFileLength(); } } catch(RemoteException re) Unknown macro: { throw re.unwrapRemoteException(AccessControlException.class); } } The benefit of this approach might reduce confusion to users...especially if DFSClient.getFileInfo() and DfsClient.getFileLength() returns different file sizes for the same file. Also, I am guessing that this will not introduce any new performance impact.
          Hide
          Arun C Murthy added a comment -

          +1, this will be useful for SequenceFile etc.

          Related thought: should we move DFSDataInputStream outside of DFSClient since it will have some significant functionality i.e. getVisibleLength?

          Show
          Arun C Murthy added a comment - +1, this will be useful for SequenceFile etc. Related thought: should we move DFSDataInputStream outside of DFSClient since it will have some significant functionality i.e. getVisibleLength?
          Hide
          Arun C Murthy added a comment -

          Related thought: should we move DFSDataInputStream outside of DFSClient since it will have some significant functionality i.e. getVisibleLength?

          Let me elaborate: My thinking is that we need an HDFS specific input-stream which is the input-stream with features such as getVisibleLength etc. (possibly even getCurrentDatanode, getCurrentBlock ?)

          Show
          Arun C Murthy added a comment - Related thought: should we move DFSDataInputStream outside of DFSClient since it will have some significant functionality i.e. getVisibleLength? Let me elaborate: My thinking is that we need an HDFS specific input-stream which is the input-stream with features such as getVisibleLength etc. (possibly even getCurrentDatanode, getCurrentBlock ?)
          Hide
          Tsz Wo Nicholas Sze added a comment -

          > an we also enhance DFSClient.getFileInfo() to return the current length of a file (for a file that is being written into)...

          I think it may be a good idea to change DFSClient.getFileInfo(). But we cannot easily update stat.length since there is no api to update the length in FileStatus.

          > The benefit of this approach might reduce confusion to users...especially if DFSClient.getFileInfo() and DfsClient.getFileLength() returns different file sizes for the same file. Also, I am guessing that this will not introduce any new performance impact.

          There is no such method called DfsClient.getFileLength(). Do you mean DfsClient.DFSInputStream.getFileLength()? This method is not visible since DfsClient.DFSInputStream is package private (and I change it to private in my patch).

          For the performance, it is hard to estimate since there are two additional round trips (one to the NN and one to a DN) for DFSClient.open(..).

          Show
          Tsz Wo Nicholas Sze added a comment - > an we also enhance DFSClient.getFileInfo() to return the current length of a file (for a file that is being written into)... I think it may be a good idea to change DFSClient.getFileInfo(). But we cannot easily update stat.length since there is no api to update the length in FileStatus. > The benefit of this approach might reduce confusion to users...especially if DFSClient.getFileInfo() and DfsClient.getFileLength() returns different file sizes for the same file. Also, I am guessing that this will not introduce any new performance impact. There is no such method called DfsClient.getFileLength(). Do you mean DfsClient.DFSInputStream.getFileLength()? This method is not visible since DfsClient.DFSInputStream is package private (and I change it to private in my patch). For the performance, it is hard to estimate since there are two additional round trips (one to the NN and one to a DN) for DFSClient.open(..).
          Hide
          Sanjay Radia added a comment -

          I like Dhruba's suggestion because of the transparent behaviour,
          but it has the following issue: Applications that don't care about very accurate file lengths will pay the cost for files
          that happen to be open for writing. Cost of ls -r of a dir (say MR output dir) can go up when some of the files in the subtree are open for writing.

          Isn't it acceptable to say that listStatus returns the last known file size. DFSDataInputStream.getVisibleLen() gives a more accurate result?

          Show
          Sanjay Radia added a comment - I like Dhruba's suggestion because of the transparent behaviour, but it has the following issue: Applications that don't care about very accurate file lengths will pay the cost for files that happen to be open for writing. Cost of ls -r of a dir (say MR output dir) can go up when some of the files in the subtree are open for writing. Isn't it acceptable to say that listStatus returns the last known file size. DFSDataInputStream.getVisibleLen() gives a more accurate result?
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12428660/h814_20091221.patch
          against trunk revision 893066.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/157/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/157/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/157/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/157/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12428660/h814_20091221.patch against trunk revision 893066. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/157/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/157/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/157/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/157/console This message is automatically generated.
          Hide
          Tsz Wo Nicholas Sze added a comment -

          h814_20091221_0.21.patch: for 0.21

          Show
          Tsz Wo Nicholas Sze added a comment - h814_20091221_0.21.patch: for 0.21
          Hide
          Tsz Wo Nicholas Sze added a comment -

          I have committed this to 0.21 and above.

          Show
          Tsz Wo Nicholas Sze added a comment - I have committed this to 0.21 and above.
          Hide
          Tsz Wo Nicholas Sze added a comment -

          I forgot to say that the failure of TestFiDataTransferProtocol2 is not related to this. See HDFS-849.

          Show
          Tsz Wo Nicholas Sze added a comment - I forgot to say that the failure of TestFiDataTransferProtocol2 is not related to this. See HDFS-849 .
          Hide
          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk-Commit #155 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Hdfs-trunk-Commit/155/)
          . Add an api to get the visible length of a DFSDataInputStream.

          Show
          Hudson added a comment - Integrated in Hadoop-Hdfs-trunk-Commit #155 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Hdfs-trunk-Commit/155/ ) . Add an api to get the visible length of a DFSDataInputStream.
          Hide
          Hudson added a comment -

          Integrated in Hdfs-Patch-h5.grid.sp2.yahoo.net #159 (See http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/159/)
          . Add an api to get the visible length of a DFSDataInputStream.

          Show
          Hudson added a comment - Integrated in Hdfs-Patch-h5.grid.sp2.yahoo.net #159 (See http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/159/ ) . Add an api to get the visible length of a DFSDataInputStream.
          Hide
          dhruba borthakur added a comment -

          > but it has the following issue: Applications that don't care about very accurate file lengths will pay the cost for files

          This will happen only if the file is being written to when somebody else does a getFileStatus on the file. This should never happen for the most typical app that runs on HDFS... a map-reduce job.

          >Cost of ls -r of a dir (say MR output dir) can go up when some of the files in the subtree are open for writing.

          I suspect that this is not a typical use-case. The MR-job output directory will typically be empty until the job is committed and all files get renamed into the out directory (from the tmp directory).

          I am good for this patch because this does not introduce a FileSystem/FileContext API.

          Show
          dhruba borthakur added a comment - > but it has the following issue: Applications that don't care about very accurate file lengths will pay the cost for files This will happen only if the file is being written to when somebody else does a getFileStatus on the file. This should never happen for the most typical app that runs on HDFS... a map-reduce job. >Cost of ls -r of a dir (say MR output dir) can go up when some of the files in the subtree are open for writing. I suspect that this is not a typical use-case. The MR-job output directory will typically be empty until the job is committed and all files get renamed into the out directory (from the tmp directory). I am good for this patch because this does not introduce a FileSystem/FileContext API.
          Hide
          Tsz Wo Nicholas Sze added a comment -

          Dhruba, Arun, Hairong and Sanjay, thank you all for the helpful comments on this issue!

          Show
          Tsz Wo Nicholas Sze added a comment - Dhruba, Arun, Hairong and Sanjay, thank you all for the helpful comments on this issue!
          Hide
          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk #182 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Hdfs-trunk/182/)

          Show
          Hudson added a comment - Integrated in Hadoop-Hdfs-trunk #182 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Hdfs-trunk/182/ )
          Hide
          Hudson added a comment -

          Integrated in Hdfs-Patch-h2.grid.sp2.yahoo.net #94 (See http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/94/)

          Show
          Hudson added a comment - Integrated in Hdfs-Patch-h2.grid.sp2.yahoo.net #94 (See http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/94/ )
          Hide
          Hairong Kuang added a comment -

          This patch allows a user to get a file's length from DFSOutputStream in 0.20. If a user needs to open a file and fetch its length, it could do it in one RPC with this patch so reducing one getFileStatus call. Different from the same API in 0.20, The return value of getVisibleLength() always the same as the meta info stored in NameNode's memory while the file is opened.

          Show
          Hairong Kuang added a comment - This patch allows a user to get a file's length from DFSOutputStream in 0.20. If a user needs to open a file and fetch its length, it could do it in one RPC with this patch so reducing one getFileStatus call. Different from the same API in 0.20, The return value of getVisibleLength() always the same as the meta info stored in NameNode's memory while the file is opened.
          Hide
          Tsz Wo Nicholas Sze added a comment -

          getLength-yahoo-0.20.patch looks good. Thanks Hairong.

          Show
          Tsz Wo Nicholas Sze added a comment - getLength-yahoo-0.20.patch looks good. Thanks Hairong.
          Hide
          Hairong Kuang added a comment -

          This patch reverts the change that makes DFSInputStream private in the previous patch.

          Show
          Hairong Kuang added a comment - This patch reverts the change that makes DFSInputStream private in the previous patch.
          Hide
          Qi Liu added a comment -

          Please take a look at HDFS-246. Is there any reason why HDFS-246 is not applied? It seems to me that this patch is only a subset of HDFS-246.

          Show
          Qi Liu added a comment - Please take a look at HDFS-246 . Is there any reason why HDFS-246 is not applied? It seems to me that this patch is only a subset of HDFS-246 .

            People

            • Assignee:
              Tsz Wo Nicholas Sze
              Reporter:
              Tsz Wo Nicholas Sze
            • Votes:
              0 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development