Hadoop Common / HADOOP-6307

Support reading on un-closed SequenceFile

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.21.0
    • Component/s: io
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      When a SequenceFile.Reader is constructed, it calls fs.getFileStatus(file).getLen(). However, fs.getFileStatus(file).getLen() does not return the hflushed length for an un-closed file, since the NameNode does not know the hflushed length. The DFSClient has to ask a datanode for the length of the last block, which is still being written; see also HDFS-570.
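To illustrate the mismatch the description refers to, here is a minimal, self-contained sketch in plain java.io (not Hadoop; the class and field names are hypothetical). A reader that trusts a stale metadata length stops early, while a reader that asks the stream itself sees all of the already-flushed bytes — which is exactly the gap between fs.getFileStatus(file).getLen() and the hflushed length:

```java
import java.io.ByteArrayInputStream;

public class StaleLengthDemo {
    // Simulates an un-closed file: 'metadataLen' is what a stat call
    // would report, while 'data' holds everything already hflushed.
    static final byte[] data = new byte[]{1, 2, 3, 4, 5, 6, 7, 8};
    static final long metadataLen = 5;  // the NameNode's stale view

    // A reader that trusts the metadata length stops early.
    public static int bytesSeenViaMetadata() {
        ByteArrayInputStream in = new ByteArrayInputStream(data);
        int count = 0;
        while (count < metadataLen && in.read() != -1) {
            count++;
        }
        return count;
    }

    // A reader that asks the stream itself sees all flushed bytes.
    public static int bytesSeenViaStream() {
        ByteArrayInputStream in = new ByteArrayInputStream(data);
        int count = 0;
        while (in.read() != -1) {
            count++;
        }
        return count;
    }
}
```

In HDFS the "stream" side of this comparison is itself non-trivial, because the hflushed length has to be obtained from a datanode (HDFS-570); the sketch only shows why the two views can disagree.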

      1. c6307_20091201.patch
        7 kB
        Tsz Wo Nicholas Sze
      2. c6307_20091130.patch
        6 kB
        Tsz Wo Nicholas Sze
      3. c6307_20091124.patch
        0.8 kB
        Tsz Wo Nicholas Sze


          Activity

          Tsz Wo Nicholas Sze added a comment -

          For an unclosed file, the hflushed length can be obtained by calling InputStream.available(), i.e.

          fs.open(file).available()
          
          Tsz Wo Nicholas Sze added a comment -

          Oops, there is a limitation on java.io.InputStream.available(). See HDFS-691.

          Hong Tang added a comment -

          Not sure why this issue only hits SequenceFile. The problem applies equally to TFile (although this was pushed to the caller).

          I have previously asked around why we cannot do "seek(Long.MAX_VALUE); return getPos();" to determine the length of the file, and I was told that you could actually seek beyond the end of the file. TRUE or FALSE?

          Another question is why we cannot support seek like POSIX does, where we can seek relative to the {begin, current, end} position of the file?

          dhruba borthakur added a comment -

          Isn't it true that fs.getFileStatus(file).getLen() requires read access on the parent directory, whereas fs.open(file).available() requires read access on the file itself?

          Many map-reduce programs use SequenceFiles to store data, and they do not need the facility to process files that are currently being written. In that case, isn't the additional overhead of fetching block locations via fs.open(file) wasteful?

          Tsz Wo Nicholas Sze added a comment -

          > Isn't it true that fs.getFileStatus(file).getLen(). requires read access on the parent directory whereas fs.open(file).available() required read access on the file itself?

          Actually, fs.getFileStatus(file).getLen() requires only "x" access on the parent directory but not "r".

          SequenceFile.Reader opens the file for read, so we must have (and already have) "InputStream in = fs.open(file);" in the code. My previous suggestion was to call "in.available()" to get the number of available bytes. If we replace "fs.open(file).available()" with "in.available()", it saves an RPC to the NameNode and does not introduce any additional overhead. (However, it currently does not work because of HDFS-691.)

          For reference, the related code segment, which includes all SequenceFile.Reader constructors, is copied below.

          //line 1438, SequenceFile.java
              /** Open the named file. */
              public Reader(FileSystem fs, Path file, Configuration conf)
                throws IOException {
                this(fs, file, conf.getInt("io.file.buffer.size", 4096), conf, false);
              }
          
              private Reader(FileSystem fs, Path file, int bufferSize,
                             Configuration conf, boolean tempReader) throws IOException {
                this(fs, file, bufferSize, 0, fs.getFileStatus(file).getLen(), conf, tempReader);
              }
              
              private Reader(FileSystem fs, Path file, int bufferSize, long start,
                             long length, Configuration conf, boolean tempReader) 
              throws IOException {
                this.file = file;
                this.in = openFile(fs, file, bufferSize, length);
                this.conf = conf;
                boolean succeeded = false;
                try {
                  seek(start);
                  this.end = in.getPos() + length;
                  init(tempReader);
                  succeeded = true;
                } finally {
                  if (!succeeded) {
                    IOUtils.cleanup(LOG, in);
                  }
                }
              }
          
              /**
               * Override this method to specialize the type of
               * {@link FSDataInputStream} returned.
               */
              protected FSDataInputStream openFile(FileSystem fs, Path file,
                  int bufferSize, long length) throws IOException {
                return fs.open(file, bufferSize);
              }
          
          Tsz Wo Nicholas Sze added a comment -

          > Not sure why this issue only hits SequenceFile. The problem applies equally to TFile (although this was pushed to the caller).

          This problem applies to any implementation that gets the un-closed file length by calling fs.getFileStatus(file).getLen(). (By "problem", I mean that the reader may not see all hflushed bytes; it sees only part of the file. This is the same behavior as before append.) I had not checked TFile before. TFile does not have this problem if the caller manages to get the correct length and pass it to the TFile.Reader constructor.

          Tsz Wo Nicholas Sze added a comment -

          SequenceFile.Reader actually does not need the file length. It compares the current position with the file length only to check for EOF. Instead, we could try-catch EOFException.
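The try-catch idea above can be sketched in plain java.io (not Hadoop code; the class and method names are hypothetical): the reader keeps pulling records until the stream signals EOFException, so it never needs to know the file length up front. One caveat, noted in the comments, is that a partial record at the end of an hflushed file also surfaces as EOFException:

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class EofTerminatedReader {
    // Read 4-byte int "records" until the stream runs out, treating
    // EOFException as the end-of-data signal instead of comparing the
    // current position to a precomputed file length.
    public static List<Integer> readAll(byte[] data) {
        List<Integer> values = new ArrayList<>();
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        try {
            while (true) {
                values.add(in.readInt());  // throws EOFException at end
            }
        } catch (EOFException end) {
            // End of available data reached; stop cleanly. Note that a
            // record truncated mid-write lands here too, so a real
            // reader would need to distinguish "clean end" from
            // "partial last record".
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        return values;
    }
}
```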

          Tsz Wo Nicholas Sze added a comment -

          > SequenceFile.Reader actually do not need the file length. ...

          Thanks to Chris and Arun for pointing out that the file length (i.e. SequenceFile.Reader.end) cannot be removed; otherwise, SequenceFile.Sorter won't work.

          I guess we have to introduce a new public constructor which takes the length as a parameter, so that user applications can pass the correct length when creating a new Reader.

          Tsz Wo Nicholas Sze added a comment -

          c6307_20091124.patch: introduces a new public constructor.

          Tsz Wo Nicholas Sze added a comment -

          c6307_20091130.patch: adds a new constructor with FSDataInputStream and length so that if the file is already opened, it won't be opened again inside the SequenceFile.Reader constructor.
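The design point of this patch — accept an already-open stream plus an explicit length, so the reader never re-opens the file or asks the filesystem for its size — can be sketched in plain java.io (not the actual patch code; the class and method names are hypothetical):

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class StreamLengthReader {
    // Read 4-byte int "records" from an already-open stream, but only
    // up to 'length' bytes. The caller supplies both the stream and the
    // length, so the reader does no filesystem calls of its own -- the
    // caller can pass the hflushed length it obtained however it likes.
    public static List<Integer> read(DataInputStream in, long length) {
        List<Integer> values = new ArrayList<>();
        try {
            long pos = 0;
            while (pos + 4 <= length) {  // stop at the caller-supplied end
                values.add(in.readInt());
                pos += 4;
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        return values;
    }
}
```

A caller would open the file once, determine the visible length, and hand both to the reader — the same usage pattern the new SequenceFile.Reader constructor enables.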

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12426467/c6307_20091130.patch
          against trunk revision 885534.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/153/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/153/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/153/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/153/console

          This message is automatically generated.

          Arun C Murthy added a comment -

          Only nit:

          + public Reader(Path file, FSDataInputStream in, int buffersize,
          + long start, long length, Configuration conf) throws IOException

          should be

          + public Reader(FSDataInputStream in, int buffersize,
          + long start, long length, Configuration conf) throws IOException

          Having the 'file' there is useless since we do not use it in the constructor; people might get confused about usage, or worse, assume that we will open the file again. The proposed alternative forces people to think in the right direction, i.e. they open the file and hand us the input stream and the start/length.

          Tsz Wo Nicholas Sze added a comment -

          Thanks, Arun!

          c6307_20091201.patch: updated with the review comment.

          Arun C Murthy added a comment -

          +1

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12426592/c6307_20091201.patch
          against trunk revision 885888.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/157/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/157/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/157/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/157/console

          This message is automatically generated.

          Tsz Wo Nicholas Sze added a comment -

          I have committed this to 0.21 and above.

          Hudson added a comment -

          Integrated in Hadoop-Common-trunk-Commit #95 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Common-trunk-Commit/95/)
          HADOOP-6307. Add a new SequenceFile.Reader constructor in order to support reading on un-closed file.

          Hudson added a comment -

          Integrated in Hadoop-Common-trunk #175 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Common-trunk/175/)
          HADOOP-6307. Add a new SequenceFile.Reader constructor in order to support reading on un-closed file.


            People

            • Assignee: Tsz Wo Nicholas Sze
            • Reporter: Tsz Wo Nicholas Sze
            • Votes: 0
            • Watchers: 5
