Hadoop HDFS
HDFS-246

Add a method to get file length for Seekable, FSDataInputStream and libhdfs

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      When opening any seekable file, the caller should be able to get the length of the file via the Seekable interface, since the seek method must already be able to detect seeks beyond the end of the file. Such an interface can benefit distributed file systems by saving a network round-trip of FileSystem.getFileStatus(Path).getLen() for any open file.
      In libhdfs, this interface should also be exposed so that native programs can take advantage of the change.
      I have the changes locally for all FSInputStream concrete classes. The change can be considered trivial, since some of the FSInputStream classes already have a method named getFileLength(), or a member field named size/length/end.
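      A minimal sketch of the proposed addition (class names here are invented for illustration and are not from the attached patches; the real signatures may differ):

```java
import java.io.InputStream;

// Hypothetical sketch: expose the length the stream implementation already
// tracks, so callers can avoid the FileSystem.getFileStatus(path).getLen()
// round-trip. Throws clauses are omitted to keep the sketch minimal.
abstract class SketchInputStream extends InputStream {
    // The proposed method: total length of the underlying file, as a long.
    public abstract long getFileLength();
}

class InMemoryStream extends SketchInputStream {
    private final byte[] data;
    private int pos = 0;

    InMemoryStream(byte[] data) { this.data = data; }

    @Override public int read() {
        return pos < data.length ? (data[pos++] & 0xff) : -1;
    }

    // Many FSInputStream subclasses already hold a size/length/end field,
    // so exposing it costs nothing extra.
    @Override public long getFileLength() { return data.length; }
}
```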

      1. HADOOP-5143-2.patch
        10 kB
        Qi Liu
      2. HADOOP-5143.patch
        10 kB
        Qi Liu
      3. hadoop.patch
        11 kB
        Qi Liu

          Activity

          Qi Liu added a comment -

          Attaching a patch to make the getFileLength interface public in both Java and libhdfs.

          dhruba borthakur added a comment -

          This is a good change!

          Can you please merge this patch with trunk and attach a new diff file? Also, it would be nice if you could generate the patch from the base of the workspace (as described in http://wiki.apache.org/hadoop/HowToContribute). If you can add a unit test (possibly to TestFileCreation.java), that would be great.

          Qi Liu added a comment -

          The patch against Hadoop 0.21-dev trunk

          Raghu Angadi added a comment -

          I don't see any need to add more not-so-related methods in interfaces. getLength() is already available through various other calls. Seekable just implies users can call seek(). Adding other utility stuff here does not seem very useful.

          What is the use case?

          Qi Liu added a comment -

          Simple. What if I want to seek relative to the end of a file? Also, it is reasonable to have a method which gives the boundaries within which calling seek will not cause exceptions.
          If the file size is less than 2G, available() would do the job. However, in many FSInputStream implementations, available() does not work properly, and can even return negative values if the file size exceeds 2G.
          What I really want is an available() that can return a value larger than 2G (a long). If such an interface existed, the file length could be obtained via seek(0); availableLong();
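          The negative values come from available() returning an int: a remaining-byte count for a file over 2 GB cannot fit. A small demonstration of that overflow (the method here is a naive stand-in, not actual Hadoop code):

```java
// Demonstrates the overflow described above: truncating a >2 GB
// remaining-byte count (a long) into available()'s int return type
// wraps around to a negative number.
public class AvailableOverflow {
    // What a naive available() effectively does: truncate long to int.
    static int naiveAvailable(long fileLength, long pos) {
        return (int) (fileLength - pos);
    }

    public static void main(String[] args) {
        long threeGB = 3L * 1024 * 1024 * 1024; // 3221225472
        System.out.println(naiveAvailable(threeGB, 0)); // prints -1073741824
    }
}
```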

          dhruba borthakur added a comment -

          Another use case: if one opens a file for reading (via FSDataInputStream) and then wants to find the length of the file (without making a separate FileStatus RPC to the namenode).

          Raghu Angadi added a comment -

          I am not saying getLength() is not useful. It is just that it does not need to be part of Seekable.

          dhruba borthakur added a comment -

          Hi Qi, is it possible for you to add the new method getFileLength() only to FSInputStream and libhdfs (and not to the Seekable interface)? As Raghu points out, the getFileLength() API does not seem to match the goals of the Seekable interface.

          You mentioned the shortcomings of the available() interface for files greater than 2GB. Is this something that has been fixed in later releases of the JDK?

          Qi Liu added a comment -

          Moved getFileLength() out of Seekable, into FSInputStream and FSDataInputStream.

          Qi Liu added a comment -

          available() in Hadoop 0.18.3 will report negative numbers if the file size is over 2GB, which is obviously a bug. available() should always return a number greater than or equal to 0, agreed?

          Hong Tang added a comment -

          Agreed. Let's return min(Integer.MAX_VALUE, length).
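          The suggested fix can be sketched as a one-line clamp (method name hypothetical; per a later comment, HDFS-691 implemented this in Hadoop itself):

```java
// Clamp the (long) number of remaining bytes into the int range that
// available() is required to return, instead of letting it wrap negative.
public class ClampAvailable {
    static int clampedAvailable(long remaining) {
        return (int) Math.min(Integer.MAX_VALUE, remaining);
    }

    public static void main(String[] args) {
        System.out.println(clampedAvailable(3L * 1024 * 1024 * 1024)); // prints 2147483647
        System.out.println(clampedAvailable(100L));                    // prints 100
    }
}
```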

          Tsz Wo Nicholas Sze added a comment -

          Hi Qi, HDFS-814 added an API to get the visible length of a DFSDataInputStream. The visible length is the same as the file length for closed files. Do you think we still need this?

          Tsz Wo Nicholas Sze added a comment -

          > ... Let's return min(Integer.MAX_VALUE, length).

          This was already done by HDFS-691.

          Qi Liu added a comment -

          I believe HDFS-814 is not good for general application use. DFSDataInputStream is a concrete subclass, while FSDataInputStream is an abstract parent class. FileSystem.open returns an instance of FSDataInputStream, not DFSDataInputStream, so explicitly casting FSDataInputStream to DFSDataInputStream is simply not safe. What if, in the future or in some cases, FileSystem.open returns an instance other than DFSDataInputStream? I still believe the proper way to implement this is to add a public method to the FSDataInputStream abstract class, and add that method to all implementations.
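          A plain-Java analogue of the hazard described above (all class names here are hypothetical stand-ins, not Hadoop classes): the factory is typed to return the abstract parent, so downcasting to one concrete subclass fails whenever a different implementation is returned.

```java
// ParentStream plays the role of the abstract FSDataInputStream;
// DfsStream plays the concrete DFSDataInputStream with a subclass-only API.
abstract class ParentStream { }

class DfsStream extends ParentStream {
    long getVisibleLength() { return 0L; } // subclass-only method, as in HDFS-814
}

class OtherStream extends ParentStream { } // e.g. a wrapping or non-DFS stream

public class CastHazard {
    // Typed like FileSystem.open: callers only see the abstract parent.
    static ParentStream open(boolean dfs) {
        return dfs ? new DfsStream() : new OtherStream();
    }

    public static void main(String[] args) {
        ParentStream s = open(false);
        // ((DfsStream) s).getVisibleLength(); // would throw ClassCastException
        System.out.println(s instanceof DfsStream); // prints false
    }
}
```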

          Cyril Briquet added a comment -

          Here's another use case:
          reading a file that contains an unknown number of UTF-encoded Strings

          FSDataInputStream file = openFile();
          long len = file.length();
          while (file.getPos() < len) {
              String s = file.readUTF();
              processString(s);
          }

          I'd be glad to learn how to implement this pattern with the current HDFS API,
          other than reading the file length through FileSystem.getFileStatus().getLen().

          Cyril Briquet added a comment -

          Another related use case:

          Let's assume we intend to implement two (somewhat complex) record-reading routines:
          one for an HDFS filesystem (through the HDFS API),
          the other for a local filesystem (i.e. through the java.io API).
          (In practice, this use case extends to more than two filesystems,
          but let's assume two for the sake of simplicity.)

          To promote code reuse, the core of the two implementations
          can be abstracted into a base class.
          This base class is inherited by two subclasses
          that provide the concrete implementation of low-level I/O.

          The low-level I/O relies on:

          org.apache.hadoop.fs.FSDataInputStream:

          public void seek(long pos) throws IOException; // org.apache.hadoop.fs.Seekable interface
          public long getPos() throws IOException; // org.apache.hadoop.fs.Seekable interface
          public long length() throws IOException; // TODO
          public String readUTF() throws IOException; // java.io.DataInput interface
          public int readInt() throws IOException; // java.io.DataInput interface
          public void close() throws IOException; // java.io.Closeable

          java.io.RandomAccessFile:

          public void seek(long pos) throws IOException; // no interface
          public long getFilePointer() throws IOException; // no interface
          public long length() throws IOException; // no interface
          public String readUTF() throws IOException; // java.io.DataInput interface
          public int readInt() throws IOException; // java.io.DataInput interface
          public void close() throws IOException; // java.io.Closeable

          When considering this use case, the patch proposed by Qi makes a lot of sense (to me, at least).
          This would bring to FSDataInputStream the same semantics
          that are available from RandomAccessFile.
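          The reuse pattern above can be sketched in plain Java (all names hypothetical; an in-memory reader stands in for the RandomAccessFile- and FSDataInputStream-backed subclasses):

```java
import java.io.IOException;

// Base class holds the shared record-reading logic, coded only against the
// abstract low-level operations listed above.
abstract class RecordReaderBase {
    protected abstract long getPos() throws IOException;
    protected abstract long length() throws IOException;
    protected abstract int readInt() throws IOException;

    // Shared routine: sum every int record until end of file.
    // Note that it needs length(), which is exactly what the patch adds.
    public long sumInts() throws IOException {
        long sum = 0;
        while (getPos() < length()) {
            sum += readInt();
        }
        return sum;
    }
}

// In-memory stand-in for a concrete subclass (the real ones would wrap
// java.io.RandomAccessFile and org.apache.hadoop.fs.FSDataInputStream).
class InMemoryRecordReader extends RecordReaderBase {
    private final byte[] data;
    private int pos = 0;

    InMemoryRecordReader(byte[] data) { this.data = data; }

    @Override protected long getPos() { return pos; }
    @Override protected long length() { return data.length; }

    @Override protected int readInt() { // big-endian, like DataInput
        int v = ((data[pos] & 0xff) << 24) | ((data[pos + 1] & 0xff) << 16)
              | ((data[pos + 2] & 0xff) << 8) | (data[pos + 3] & 0xff);
        pos += 4;
        return v;
    }
}
```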


            People

            • Assignee:
              Qi Liu
              Reporter:
              Qi Liu
            • Votes:
              0
              Watchers:
              6
