Hadoop HDFS
HDFS-200

In HDFS, sync() does not yet guarantee that data is available to new readers

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.20-append
    • Fix Version/s: 0.20-append, 0.20.205.0
    • Component/s: None
    • Labels:
      None
    • Tags:
      hbase

      Description

      In the append design doc (https://issues.apache.org/jira/secure/attachment/12370562/Appends.doc), it says

      • A reader is guaranteed to be able to read data that was 'flushed' before the reader opened the file

      However, this feature is not yet implemented. Note that the operation 'flushed' is now called "sync".

      1. 4379_20081010TC3.java
        4 kB
        Tsz Wo Nicholas Sze
      2. checkLeases-fix-1.txt
        0.7 kB
        sam rash
      3. checkLeases-fix-unit-test-1.txt
        4 kB
        sam rash
      4. fsyncConcurrentReaders.txt
        14 kB
        dhruba borthakur
      5. fsyncConcurrentReaders11_20.txt
        14 kB
        dhruba borthakur
      6. fsyncConcurrentReaders12_20.txt
        14 kB
        dhruba borthakur
      7. fsyncConcurrentReaders13_20.txt
        15 kB
        dhruba borthakur
      8. fsyncConcurrentReaders14_20.txt
        17 kB
        dhruba borthakur
      9. fsyncConcurrentReaders15_20.txt
        18 kB
        dhruba borthakur
      10. fsyncConcurrentReaders16_20.txt
        18 kB
        dhruba borthakur
      11. fsyncConcurrentReaders3.patch
        14 kB
        dhruba borthakur
      12. fsyncConcurrentReaders4.patch
        15 kB
        dhruba borthakur
      13. fsyncConcurrentReaders5.txt
        15 kB
        dhruba borthakur
      14. fsyncConcurrentReaders6.patch
        15 kB
        dhruba borthakur
      15. fsyncConcurrentReaders9.patch
        19 kB
        dhruba borthakur
      16. hadoop-stack-namenode-aa0-000-12.u.powerset.com.log.gz
        86 kB
        stack
      17. HDFS-200.20-security.1.patch
        18 kB
        Jitendra Nath Pandey
      18. hdfs-200-ryan-existing-file-fail.txt
        13 kB
        ryan rawson
      19. hypertable-namenode.log.gz
        33 kB
        Doug Judd
      20. namenode.log
        2.51 MB
        Jim Kellerman
      21. namenode.log
        95 kB
        Jim Kellerman
      22. Reader.java
        2 kB
        dhruba borthakur
      23. Reader.java
        1 kB
        Jim Kellerman
      24. reopen_test.sh
        0.3 kB
        Doug Judd
      25. ReopenProblem.java
        7 kB
        Doug Judd
      26. Writer.java
        2 kB
        dhruba borthakur
      27. Writer.java
        2 kB
        Jim Kellerman


          Activity

          Tsz Wo Nicholas Sze added a comment -

          In particular, the following will fail.

          3. TC3: fsync.
          a. On machine M1, create a file. Write half a block of data. Invoke (DFSOutputStream).fsync() on the DFS file handle. Do not close the file yet.
          b. On another machine M2, open the file and verify that the half block of data can be read successfully.
          c. On M1, append another half block of data. Close the file on M1.
          d. On M2, open the file and read one block of data from it. Close the file.
          

          See also https://issues.apache.org/jira/secure/attachment/12391788/20081008testplan.txt
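
          (Illustrative sketch, not the attached 4379_20081010TC3.java: TC3 as a single-client test against a running HDFS, using the 0.19/0.20-era FSDataOutputStream.sync() API. The path is hypothetical, and a fresh open() on the same client stands in for the read from machine M2.)

          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.fs.FSDataInputStream;
          import org.apache.hadoop.fs.FSDataOutputStream;
          import org.apache.hadoop.fs.FileSystem;
          import org.apache.hadoop.fs.Path;

          public class TC3Sketch {
            public static void main(String[] args) throws Exception {
              Configuration conf = new Configuration();
              FileSystem fs = FileSystem.get(conf);
              long blockSize = conf.getLong("dfs.block.size", 64L * 1024 * 1024);
              byte[] halfBlock = new byte[(int) (blockSize / 2)];
              Path p = new Path("/tmp/tc3-file");          // hypothetical test path

              // Step a ("M1"): create the file, write half a block, sync, do not close.
              FSDataOutputStream out = fs.create(p);
              out.write(halfBlock);
              out.sync();                                  // the 'flushed' operation from the design doc

              // Step b ("M2" in the real test): a new reader should see the synced half block.
              FSDataInputStream in = fs.open(p);
              in.readFully(0, new byte[halfBlock.length]); // fails before this issue is fixed
              in.close();

              // Step c ("M1"): append the second half block and close the file.
              out.write(halfBlock);
              out.close();

              // Step d ("M2"): a new reader should now see one full block.
              in = fs.open(p);
              in.readFully(0, new byte[(int) blockSize]);
              in.close();
            }
          }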

          Tsz Wo Nicholas Sze added a comment -

          4379_20081010TC3.java: a unit test implementing TC3 described in my previous comment.

          dhruba borthakur added a comment -

          When a fsync occurs, the client informs the namenode of the amount of data that it has written to the last block. The NameNode updates its metadata with this value. This means that a new getBlockLocation call will automatically retrieve the last good size of the last block.
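
          (Illustrative client-side sketch of the behavior described above, assuming the 0.19/0.20-era API; the path and data are hypothetical, and whether the reported lengths include the synced bytes depends on the patch under discussion.)

          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.fs.BlockLocation;
          import org.apache.hadoop.fs.FSDataOutputStream;
          import org.apache.hadoop.fs.FileStatus;
          import org.apache.hadoop.fs.FileSystem;
          import org.apache.hadoop.fs.Path;

          public class SyncLengthSketch {
            public static void main(String[] args) throws Exception {
              Configuration conf = new Configuration();
              FileSystem fs = FileSystem.get(conf);
              Path p = new Path("/tmp/sync-length-check");   // hypothetical path

              FSDataOutputStream out = fs.create(p);
              out.write(new byte[4096]);                     // some partial-block data
              out.sync();  // client reports the last block's length to the namenode

              // A getBlockLocations-backed call should now reflect the synced bytes.
              FileStatus st = fs.getFileStatus(p);
              for (BlockLocation b : fs.getFileBlockLocations(st, 0, Long.MAX_VALUE)) {
                System.out.println("block at " + b.getOffset() + ", length " + b.getLength());
              }
              out.close();
            }
          }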

          Luke Lu added a comment -

          If TC3 is too hard in general, a TC2.5, i.e. being able to open the file on the same machine and see the half block (after the fsync), would be nice. It would go a long way toward making the transaction log work and making things restartable.

          Luke Lu added a comment -

          @dhruba: this is the current (0.19.0) behavior: if a client writes a half block to a log in HDFS, calls fsync, and is then killed, the half block is not there when the client restarts (on the same node) to read the log, until you restart the entire Hadoop cluster and wait a few minutes.

          dhruba borthakur added a comment -

          @Luke: the attached patch should solve your problem. If you have the time, would it be possible for you to apply this patch and then retry your test case?

          Luke Lu added a comment -

          @dhruba: unfortunately the patch (applied to the stock 0.19.0) didn't work. The files did show a non-zero size after the sync, but they're all 7 bytes. Restarting HDFS makes the correct content show up.

          Doug Judd added a comment -

          Hi Dhruba,

          I've been working with Luke a little on this. Here are more details. The log that gets written in the test is very small. The first thing the software does when it creates the log is write a 7-byte header. Then later, as the test proceeds, the system appends a small entry and then does a sync. We use the FSDataOutputStream class. The sequence of operations looks something like this:

          out_stream.write(data, 0, amount);
          out_stream.sync();
          [...]

          When the test completes, all of the logs are exactly 7 bytes long. It remains this way even if I wait 10 minutes or kill the Hypertable java process and wait several minutes as well. Here is the listing:

          [doug@motherlode000 aol-basic]$ hadoop fs -ls /hypertable/servers/10.0.30.1*_38060/log/range_txn
          -rw-r--r-- 3 doug supergroup 7 2009-01-17 19:52 /hypertable/servers/10.0.30.102_38060/log/range_txn/0.log
          -rw-r--r-- 3 doug supergroup 7 2009-01-17 19:52 /hypertable/servers/10.0.30.104_38060/log/range_txn/0.log
          -rw-r--r-- 3 doug supergroup 7 2009-01-17 19:52 /hypertable/servers/10.0.30.106_38060/log/range_txn/0.log
          -rw-r--r-- 3 doug supergroup 7 2009-01-17 19:52 /hypertable/servers/10.0.30.108_38060/log/range_txn/0.log
          -rw-r--r-- 3 doug supergroup 7 2009-01-17 19:52 /hypertable/servers/10.0.30.110_38060/log/range_txn/0.log
          -rw-r--r-- 3 doug supergroup 7 2009-01-17 19:52 /hypertable/servers/10.0.30.112_38060/log/range_txn/0.log
          -rw-r--r-- 3 doug supergroup 7 2009-01-17 19:52 /hypertable/servers/10.0.30.114_38060/log/range_txn/0.log
          -rw-r--r-- 3 doug supergroup 7 2009-01-17 19:52 /hypertable/servers/10.0.30.116_38060/log/range_txn/0.log

          After shutting down HDFS and restarting it again, the listing looks like this:

          [doug@motherlode000 aol-basic]$ hadoop fs -ls /hypertable/servers/10.0.30.1*_38060/log/range_txn
          -rw-r--r-- 3 doug supergroup 564 2009-01-17 19:52 /hypertable/servers/10.0.30.102_38060/log/range_txn/0.log
          -rw-r--r-- 3 doug supergroup 84 2009-01-17 19:52 /hypertable/servers/10.0.30.104_38060/log/range_txn/0.log
          -rw-r--r-- 3 doug supergroup 1063 2009-01-17 19:52 /hypertable/servers/10.0.30.106_38060/log/range_txn/0.log
          -rw-r--r-- 3 doug supergroup 634 2009-01-17 19:52 /hypertable/servers/10.0.30.108_38060/log/range_txn/0.log
          -rw-r--r-- 3 doug supergroup 217 2009-01-17 19:52 /hypertable/servers/10.0.30.110_38060/log/range_txn/0.log
          -rw-r--r-- 3 doug supergroup 1943 2009-01-17 19:52 /hypertable/servers/10.0.30.112_38060/log/range_txn/0.log
          -rw-r--r-- 3 doug supergroup 1072 2009-01-17 19:52 /hypertable/servers/10.0.30.114_38060/log/range_txn/0.log
          -rw-r--r-- 3 doug supergroup 525 2009-01-17 19:52 /hypertable/servers/10.0.30.116_38060/log/range_txn/0.log

          The last time I ran this test I encountered a problem where it appeared that some of our commits were lost. Here's what I did:

          1. ran tests (which create a table with 75274825 cells)
          2. kill Hypertable
          3. shutdown HDFS
          4. restart HDFS
          5. restart Hypertable (which re-plays the commit logs)
          6. dumped the table

          The table dump in #6 came up short (e.g. 72M entries). It appears that some of the commit logs (different log than the range_txn log) came back up incomplete.

          Let us know if you want us to run an instrumented version or anything. We can send you the Hadoop log files if that helps. Thanks!

          • Doug
          Luke Lu added a comment -

          On the design of TC3: I think that the client informing the namenode on each sync is too expensive and not scalable. Hypertable can easily generate 1M transactions/s from a small cluster, which the namenode cannot possibly handle. Since restarting HDFS incurs no data loss, I think the basic logic is probably already fine. For most cases, I think that making new readers a bit more expensive to create is probably the right trade-off:

          1. We could append and sync a file and only inform the namenode on the first sync, hinting that we need to do something special for the new readers of the file.
          2. Upon creating a new reader for the file, the namenode would request a block report (an incremental one, if we try to be clever) from the involved datanodes and return the right info to the reader.

          I think that this would cover most of the use cases for append and sync reasonably.

          Jim Kellerman added a comment -

          HBase really needs visibility to partial blocks and restarting HDFS is not an option for a 100+ node cluster.

          We don't sync after every record, but sync after a configurable number of writes or after a configurable amount of time has passed and there is unsync'd data.

          This really needs to be addressed in 0.19.1, 0.20.1 and trunk.
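
          (A sketch of the deferred-sync policy described above; this is a hypothetical helper, not HBase's actual implementation, and the time-based check here only fires on the next write.)

          import java.io.IOException;
          import org.apache.hadoop.fs.FSDataOutputStream;

          /** Hypothetical helper: sync after N writes, or when unsynced data is older than T ms. */
          public class DeferredSync {
            private final FSDataOutputStream out;
            private final int maxUnsyncedWrites;
            private final long maxUnsyncedMillis;
            private int unsyncedWrites = 0;
            private long lastSync = System.currentTimeMillis();

            public DeferredSync(FSDataOutputStream out, int maxWrites, long maxMillis) {
              this.out = out;
              this.maxUnsyncedWrites = maxWrites;
              this.maxUnsyncedMillis = maxMillis;
            }

            public void write(byte[] record) throws IOException {
              out.write(record);
              unsyncedWrites++;
              if (unsyncedWrites >= maxUnsyncedWrites
                  || System.currentTimeMillis() - lastSync >= maxUnsyncedMillis) {
                out.sync();                        // FSDataOutputStream.sync() in 0.19/0.20
                unsyncedWrites = 0;
                lastSync = System.currentTimeMillis();
              }
            }
          }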

          Jim Kellerman added a comment -

          First of two test files demonstrating the problem

          Jim Kellerman added a comment -

          Second of two test programs that demonstrate the problem.

          Jim Kellerman added a comment -

          I have attached two test programs that demonstrate the problem.

          Writer.java writes 100 records of approximately 1 KB each per sync. Its command-line argument is the number of syncs to perform before aborting.
          Reader.java is run after Writer and tries to read records from the file created by Writer.
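
          (A guess at the shape of such a writer, for readers without the attachments; the attached Writer.java is the authoritative program. Assumes SequenceFile with hypothetical LongWritable/Text records and a hypothetical path.)

          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.fs.FileSystem;
          import org.apache.hadoop.fs.Path;
          import org.apache.hadoop.io.LongWritable;
          import org.apache.hadoop.io.SequenceFile;
          import org.apache.hadoop.io.Text;

          public class WriterSketch {
            public static void main(String[] args) throws Exception {
              int numSyncs = Integer.parseInt(args[0]);      // e.g. 4096 or 8192
              Configuration conf = new Configuration();
              FileSystem fs = FileSystem.get(conf);
              Path p = new Path("/tmp/sync-test.seq");       // hypothetical path

              SequenceFile.Writer writer = SequenceFile.createWriter(
                  fs, conf, p, LongWritable.class, Text.class);
              byte[] payload = new byte[1024];               // ~1 KB record body
              long key = 0;
              for (int s = 0; s < numSyncs; s++) {
                for (int r = 0; r < 100; r++) {              // 100 records per sync
                  writer.append(new LongWritable(key++), new Text(payload));
                }
                writer.sync();  // note: as discussed further down, this only writes a sync
                                // marker; it does not call FSDataOutputStream.sync()
              }
              Runtime.getRuntime().halt(1);  // abort without close(), as the test requires
            }
          }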

          If you run Writer with 4096 syncs:

          bin/hadoop Writer 4096

          and then run Reader, it gets an EOFException without having read any records (note that the exception occurs in the constructor):

          Exception in thread "main" java.io.EOFException
                  at java.io.DataInputStream.readFully(Unknown Source)
                  at java.io.DataInputStream.readFully(Unknown Source)
                  at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1450)
                  at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1428)
                  at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1417)
                  at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1412)
                  at Reader.main(Reader.java:34)
          

          If you run Writer with 8192 syncs:

          bin/hadoop Writer 8192

          and then run Reader, it reads 684783 records before hitting an EOFException (in next())

          Read 684783 lines
          java.io.EOFException
                  at java.io.DataInputStream.readFully(Unknown Source)
                  at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:63)
                  at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
                  at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1930)
                  at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1830)
                  at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1876)
                  at Reader.main(Reader.java:41)
          

          684783 records that are approximately 1KB in length is approximately 64MB (a hadoop block).

          In this test the partial block containing 134417 records is lost (even though a considerable
          number of syncs took place between filling the first block and the end of Writer).

          So you can read from an improperly closed file but only up to the last full block, confirming
          Luke's observation.

          I will look into the patch shortly.

          dhruba borthakur added a comment -

          Thanks Doug and Luke for the comments and Jim for providing a good test case. I will incorporate your comments and post a new patch. (Jim, you need not test the existing patch associated with this JIRA)

          dhruba borthakur added a comment -

          This patch implements the following (as specified in the design doc for Appends, HADOOP-1700):
          A reader checks to see if the file is being written to by another writer. If so, it fetches the size of the last block from the primary datanode. This does not have any performance impact for non-concurrent readers, nor does it have any impact on sync().

          Jim Kellerman added a comment -

          @Dhruba

          I tried applying fsyncConcurrentReaders3.patch to 0.19.0, the 0.19 branch (latest), the 0.20 branch (latest) and trunk. It did not apply cleanly to any, but seemed to have fewer problems with trunk. Building it and running the Writer and Reader programs I attached above, they behaved the same way, i.e., the Reader could not see any sync'd data until the first complete block had been written, and then it could only see the data in the first complete block and none of the data in the second partial block.

          dhruba borthakur added a comment -

          @Jim: SequenceFile.Writer.sync() does not actually invoke FSDataOutputStream.sync().

          FSDataOutputStream.sync() is a heavyweight operation, and SequenceFile.Writer.sync() is typically invoked with small chunks of data; that's the reason why the SequenceFile class does not automatically invoke the fs sync call.

          I changed your test case to invoke the FSDataOutputStream.sync call but still see some problems. Will post an update as soon as I get to the bottom of this.

          Jim Kellerman added a comment -

          @Dhruba

          If I have a SequenceFile.Writer, how do I get the FSDataOutputStream? It is a protected member of SequenceFile.Writer.
          Are you planning to add an accessor?

          Thanks.

          Doug Cutting added a comment -

          > how do I get the FSDataOutputStream?

          I think it would be better to add a SequenceFile.Writer method that flushes and calls FSDataOutputStream.sync. The problem is that SequenceFile.Writer#sync() should probably be called flush(), not sync(), but that'd be hard to change. So perhaps we should add a SequenceFile.Writer#syncFs() method?
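
          (A sketch of what such a SequenceFile.Writer#syncFs() method might look like, assuming the writer's protected FSDataOutputStream field is named 'out'; the actual patch may differ.)

          /** Flush buffered data and sync the underlying file-system stream so that
           *  the data becomes visible to new readers. */
          public void syncFs() throws IOException {
            if (out != null) {
              out.sync();   // FSDataOutputStream.sync(): push data to the datanodes
            }
          }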

          dhruba borthakur added a comment -

          I will add a new method SequenceFile.Writer.syncFs(), test it, and upload a new patch. Thanks Doug and Jim.

          Doug Judd added a comment -

          I tried the test again and still no luck. To recap, here's how the log file is created:

          out_stream.write(header, 0, 7);
          out_stream.sync();
          out_stream.write(data, 0, amount);
          out_stream.sync();
          [...]

          After the test finished, I shut down the Hypertable servers. This time the listing shows the files to be 0 bytes in length (as opposed to 7 bytes with the previous patch):

          [doug@motherlode000 aol-basic]$ hadoop fs -ls /hypertable/servers/10.0.30.1*_38060/log/range_txn
          -rw-r--r-- 3 doug supergroup 0 2009-01-26 11:40 /hypertable/servers/10.0.30.102_38060/log/range_txn/0.log
          -rw-r--r-- 3 doug supergroup 0 2009-01-26 11:40 /hypertable/servers/10.0.30.104_38060/log/range_txn/0.log
          -rw-r--r-- 3 doug supergroup 0 2009-01-26 11:40 /hypertable/servers/10.0.30.106_38060/log/range_txn/0.log
          -rw-r--r-- 3 doug supergroup 0 2009-01-26 11:40 /hypertable/servers/10.0.30.108_38060/log/range_txn/0.log
          -rw-r--r-- 3 doug supergroup 0 2009-01-26 11:40 /hypertable/servers/10.0.30.110_38060/log/range_txn/0.log
          -rw-r--r-- 3 doug supergroup 0 2009-01-26 11:40 /hypertable/servers/10.0.30.112_38060/log/range_txn/0.log
          -rw-r--r-- 3 doug supergroup 0 2009-01-26 11:40 /hypertable/servers/10.0.30.114_38060/log/range_txn/0.log
          -rw-r--r-- 3 doug supergroup 0 2009-01-26 11:40 /hypertable/servers/10.0.30.116_38060/log/range_txn/0.log

          When the RangeServer starts up again, it discovers that the log file (range_txn/0.log) does exist, so it starts the recovery process. However, it only sees the 7 byte header. All of the subsequent log appends do not appear in the log file. So the system starts up without recovering any of the data.

          BTW, in this particular circumstance, there is no other writer writing to the file when the range server comes up and reads it. Here's the high level of what's going on:

          RangeServer opens an FSDataOutputStream to the log and starts appending to it
          RangeServer is killed with 'kill -9'
          RangeServer comes up again and reads the log

          In your above note you said, "A reader checks to see if the file is being written to by another writer. If so, it fetches the size of the last block from the primary datanode." This is not the case with our test; there is no writer writing to the log when we try to read it.

          • Doug
          Doug Judd added a comment -

          This may be a problem on our end. When I shut down Hypertable and then copy each range_txn log file to the local filesystem, the files seem to contain more than just the 7-byte header.

          [doug@motherlode000 aol-basic]$ ls -l rsml/
          total 32
          -rw-rw-r-- 1 doug doug 564 Jan 26 15:35 102.log
          -rw-rw-r-- 1 doug doug 84 Jan 26 15:35 104.log
          -rw-rw-r-- 1 doug doug 1206 Jan 26 15:35 106.log
          -rw-rw-r-- 1 doug doug 634 Jan 26 15:35 108.log
          -rw-rw-r-- 1 doug doug 84 Jan 26 15:35 110.log
          -rw-rw-r-- 1 doug doug 1942 Jan 26 15:36 112.log
          -rw-rw-r-- 1 doug doug 1074 Jan 26 15:36 114.log
          -rw-rw-r-- 1 doug doug 525 Jan 26 15:36 116.log

          Doug Judd added a comment -

          It looks like our code depends on the length, as returned by getFileStatus:

          length = mFilesystem.getFileStatus(new Path(fileName)).getLen();

          It appears that the length returned by this API call is not correct. Can you take a look at it and maybe patch this one as well?

          dhruba borthakur added a comment -

          Hi Doug,

          The file length (as returned by getFileStatus) will not change at every write from the client to the datanode. Similarly, every fsync call from the client does not reach the namenode (only the first one per block reaches the namenode). That means the namenode has no good way to know the size of a block when the block is being written to by a writer.

          In your case, the writer has died. The namenode has a timeout of 1 hour before it starts lease recovery for this file. The lease recovery process will set the correct file size in the namenode metadata. If you do not want to wait for one hour, then you can manually trigger lease recovery from your application by trying to reopen the file for append (please use FileSystem.append(pathname)). Lease recovery will update the true length of the file in the namenode metadata.
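
          (One way an application might trigger lease recovery as described, as a sketch; it assumes appends are enabled on the cluster and that append() keeps failing with AlreadyBeingCreatedException until recovery completes. The attached Reader.java is the authoritative example.)

          import java.io.IOException;
          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.fs.FSDataOutputStream;
          import org.apache.hadoop.fs.FileSystem;
          import org.apache.hadoop.fs.Path;

          public class LeaseRecoverySketch {
            /** Retry append() until the namenode has recovered the dead writer's lease,
             *  then release the stream immediately; getFileStatus() is accurate afterwards. */
            public static void recoverLease(FileSystem fs, Path p) throws Exception {
              while (true) {
                try {
                  FSDataOutputStream out = fs.append(p);   // triggers lease recovery
                  out.close();                             // we only wanted the recovery
                  return;
                } catch (IOException e) {
                  Thread.sleep(1000);                      // typically AlreadyBeingCreatedException
                }
              }
            }

            public static void main(String[] args) throws Exception {
              Configuration conf = new Configuration();
              FileSystem fs = FileSystem.get(conf);
              Path p = new Path(args[0]);
              recoverLease(fs, p);
              System.out.println("recovered length: " + fs.getFileStatus(p).getLen());
            }
          }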

          dhruba borthakur added a comment -

          An updated version of the writer that invokes SequenceFile.Writer.syncFs().

          dhruba borthakur added a comment -

          This reader waits for lease recovery to complete before reading data.

          dhruba borthakur added a comment -

          This patch should merge smoothly with trunk.

          dhruba borthakur added a comment -

          @Jim: I updated your test cases Reader.java and Writer.java and have successfully tested them at my site. Please let me know if you still see problems.

          @Doug: Please use the latest patch that I uploaded. This has the SequenceFile.Writer.syncFs() method.

          (Also, I suggest that you should pull in the patch available at HADOOP-5027)

          Doug Judd added a comment -

          Hi Dhruba,

          I tried your suggestion, but got the following exception when trying to open the file with the 'append' method:

          SEVERE: I/O exception while getting length of file '/hypertable/servers/10.0.30.102_38060/log/range_txn/0.log' - org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create file /hypertable/servers/10.0.30.102_38060/log/range_txn/0.log for DFSClient_2003773208 on client 10.0.30.102, because this file is already being created by DFSClient_423127459 on 10.0.30.102
          at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:1088)
          at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:1177)
          at org.apache.hadoop.hdfs.server.namenode.NameNode.append(NameNode.java:321)
          at sun.reflect.GeneratedMethodAccessor50.invoke(Unknown Source)
          at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
          at java.lang.reflect.Method.invoke(Method.java:597)
          at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452)
          at org.apache.hadoop.ipc.Server$Handler.run(Server.java:892)

          I could rewrite the whole thing to not depend on knowing the log length; however, it seems like it ought to be possible to obtain the actual file length in this situation. The semantics of getFileStatus() seem a little odd. Sometimes it returns the actual length of the file and sometimes it returns a stale version of the length. I suppose this is ok as long as it is well documented. But it should be possible to obtain the actual length of a file. Would it be possible to add a FileSystem::length(Path path) method that returns the accurate file length by fetching the size of the last block from the primary datanode?

          • Doug
          dhruba borthakur added a comment -

          Hi Doug,

          The exception is expected. Please look at the test file Reader.java that I attached to this JIRA. It shows how the reader waits for lease recovery to end (to ensure that the correct file size is updated on the namenode). Please let me know if this approach is suitable for your application.

          We could enhance FileSystem.getFileStatus() to contact the datanode and retrieve and return the most current file length (only for files that have a concurrent writer). This will not have a performance impact for most map-reduce applications.

          Jim Kellerman added a comment -

          Tested this patch in conjunction with HADOOP-5027. Works!

          I would like to see this in 0.19.1, 0.20.0 and trunk.

          +++++1

          dhruba borthakur added a comment -

          Thanks Jim. I will wait to hear from Doug Judd on whether it solves his problem.

          Doug Judd added a comment -

          Now when I apply your latest patch and the one from 5027 to the 0.19.0 source, the datanodes seem to be going into an infinite loop of NullPointerExceptions. At the top of the hadoop-zvents-datanode-motherlode007.admin.zvents.com.log I'm seeing this:

          [doug@motherlode007 logs]$ more hadoop-zvents-datanode-motherlode007.admin.zvents.com.log
          2009-01-27 15:32:55,828 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: STARTUP_MSG:
          /************************************************************
          STARTUP_MSG: Starting DataNode
          STARTUP_MSG: host = motherlode007.admin.zvents.com/10.0.30.114
          STARTUP_MSG: args = []
          STARTUP_MSG: version = 0.19.1-dev
          STARTUP_MSG: build = -r ; compiled by 'doug' on Tue Jan 27 15:04:06 PST 2009
          ************************************************************/
          2009-01-27 15:32:57,041 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: motherlode000/10.0.30.100:9000. Already tried 0 time(s).
          2009-01-27 15:33:00,505 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Registered FSDatasetStatusMBean
          2009-01-27 15:33:00,507 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Opened info server at 50010
          2009-01-27 15:33:00,510 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Balancing bandwith is 1048576 bytes/s
          2009-01-27 15:33:00,783 INFO org.mortbay.http.HttpServer: Version Jetty/5.1.4
          2009-01-27 15:33:00,792 INFO org.mortbay.util.Credential: Checking Resource aliases
          2009-01-27 15:33:01,839 INFO org.mortbay.util.Container: Started org.mortbay.jetty.servlet.WebApplicationHandler@319c0bd6
          2009-01-27 15:33:01,878 INFO org.mortbay.util.Container: Started WebApplicationContext[/static,/static]
          2009-01-27 15:33:02,048 INFO org.mortbay.util.Container: Started org.mortbay.jetty.servlet.WebApplicationHandler@5a943dc4
          2009-01-27 15:33:02,049 INFO org.mortbay.util.Container: Started WebApplicationContext[/logs,/logs]
          2009-01-27 15:33:02,754 INFO org.mortbay.util.Container: Started org.mortbay.jetty.servlet.WebApplicationHandler@6d581e80
          2009-01-27 15:33:02,760 INFO org.mortbay.util.Container: Started WebApplicationContext[/,/]
          2009-01-27 15:33:02,763 INFO org.mortbay.http.SocketListener: Started SocketListener on 0.0.0.0:50075
          2009-01-27 15:33:02,764 INFO org.mortbay.util.Container: Started org.mortbay.jetty.Server@5c435a3a
          2009-01-27 15:33:02,769 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=DataNode, sessionId=null
          2009-01-27 15:33:02,825 INFO org.apache.hadoop.ipc.metrics.RpcMetrics: Initializing RPC Metrics with hostName=DataNode, port=50020
          2009-01-27 15:33:02,831 INFO org.apache.hadoop.ipc.Server: IPC Server Responder: starting
          2009-01-27 15:33:02,834 INFO org.apache.hadoop.ipc.Server: IPC Server handler 0 on 50020: starting
          2009-01-27 15:33:02,834 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 50020: starting
          2009-01-27 15:33:02,836 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: dnRegistration = DatanodeRegistration(motherlode007.admin.zvents.com:50010, storageID=DS-745224472-10.0.30.114-50010-1230665635246, infoPort=50075, ipcPort=50020)
          2009-01-27 15:33:02,837 INFO org.apache.hadoop.ipc.Server: IPC Server handler 1 on 50020: starting
          2009-01-27 15:33:02,837 INFO org.apache.hadoop.ipc.Server: IPC Server handler 2 on 50020: starting
          2009-01-27 15:33:02,839 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.0.30.114:50010, storageID=DS-745224472-10.0.30.114-50010-1230665635246, infoPort=50075, ipcPort=50020) In DataNode.run, data = FSDataset{dirpath='/data1/hadoop/dfs/data/current,/data2/hadoop/dfs/data/current,/data3/hadoop/dfs/data/current'}
          2009-01-27 15:33:02,840 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: using BLOCKREPORT_INTERVAL of 3600000msec Initial delay: 0msec
          2009-01-27 15:33:02,932 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: org.apache.hadoop.ipc.RemoteException: java.io.IOException: java.lang.NullPointerException
          at org.apache.hadoop.hdfs.server.namenode.DatanodeDescriptor.reportDiff(DatanodeDescriptor.java:396)
          at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.processReport(FSNamesystem.java:2803)
          at org.apache.hadoop.hdfs.server.namenode.NameNode.blockReport(NameNode.java:636)
          at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
          at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
          at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
          at java.lang.reflect.Method.invoke(Method.java:597)
          at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452)
          at org.apache.hadoop.ipc.Server$Handler.run(Server.java:892)

          at org.apache.hadoop.ipc.Client.call(Client.java:696)
          at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
          at $Proxy4.blockReport(Unknown Source)
          at org.apache.hadoop.hdfs.server.datanode.DataNode.offerService(DataNode.java:723)
          at org.apache.hadoop.hdfs.server.datanode.DataNode.run(DataNode.java:1100)
          at java.lang.Thread.run(Thread.java:619)

          2009-01-27 15:33:02,953 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: org.apache.hadoop.ipc.RemoteException: java.io.IOException: java.lang.NullPointerException
          at org.apache.hadoop.hdfs.server.namenode.DatanodeDescriptor.reportDiff(DatanodeDescriptor.java:396)
          [...]

          And the file is growing rapidly with the following exceptions tacked on to the end:

          2009-01-27 16:10:55,973 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: org.apache.hadoop.ipc.RemoteException: java.io.IOException: java.lang.NullPointerException

          at org.apache.hadoop.ipc.Client.call(Client.java:696)
          at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
          at $Proxy4.blockReport(Unknown Source)
          at org.apache.hadoop.hdfs.server.datanode.DataNode.offerService(DataNode.java:723)
          at org.apache.hadoop.hdfs.server.datanode.DataNode.run(DataNode.java:1100)
          at java.lang.Thread.run(Thread.java:619)

          It appears that these exceptions started happening within about 5 seconds after startup, so it doesn't look like it has anything to do with Hypertable. Is it ok to apply these patches to the 0.19.0 source? Or should I be applying them to the trunk?

          • Doug
          dhruba borthakur added a comment -

          I think these exceptions are related to the patch in HADOOP-5027. Let me check.

          Doug Judd added a comment -

          Hi Dhruba,

          When I drop the patch for 5027, the NullPointerExceptions go away. However, I'm still having trouble. It appears that for some of the files, your trick for obtaining the real length does not work. The waitForLease spins indefinitely. For example, here's the log line for our HdfsBroker process where it is hanging:

          Jan 27, 2009 5:08:53 PM org.hypertable.DfsBroker.hadoop.HdfsBroker Length
          INFO: Getting length of file '/hypertable/servers/10.0.30.106_38060/log/metadata/0

          Then in the namenode log file, I see things like this (I've attached the compressed namenode log):

          2009-01-27 17:08:53,173 WARN org.apache.hadoop.hdfs.StateChange: DIR* NameSystem.startFile: failed to create file /hypertable/servers/10.0.30.106_38060/log/metadata/0 for DFSClient_-1860021252 on client 10.0.30.106 because current leaseholder is trying to recreate file.
          2009-01-27 17:08:53,174 INFO org.apache.hadoop.ipc.Server: IPC Server handler 4 on 9000, call append(/hypertable/servers/10.0.30.106_38060/log/metadata/0, DFSClient_-1860021252) from 10.0.30.106:43507: error: org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create file /hypertable/servers/10.0.30.106_38060/log/metadata/0 for DFSClient_-1860021252 on client 10.0.30.106 because current leaseholder is trying to recreate file.
          org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create file /hypertable/servers/10.0.30.106_38060/log/metadata/0 for DFSClient_-1860021252 on client 10.0.30.106 because current leaseholder is trying to recreate file.
          at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:1065)
          at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:1177)
          at org.apache.hadoop.hdfs.server.namenode.NameNode.append(NameNode.java:321)
          at sun.reflect.GeneratedMethodAccessor14.invoke(Unknown Source)
          at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
          at java.lang.reflect.Method.invoke(Method.java:597)
          at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452)
          at org.apache.hadoop.ipc.Server$Handler.run(Server.java:892)

          Tsz Wo Nicholas Sze added a comment -

          > I think these exceptions are related to the patch in HADOOP-5027.

          Please do not use my patch in HADOOP-5027. It is not clear whether the fix is good.

          dhruba borthakur added a comment -

          @Doug: waiting for lease recovery to reclaim the length of the file will work only if the original writer has died and a new reader now wants to read all the data that was written by the writer before it died. This was the use case for Jim Kellerman.

          In your case, you have a FileSystem object that was used to write to the file. Then your application tried to reopen the same file by invoking fs.append() on the same FileSystem object. The namenode log clearly shows that the same dfsclient (i.e. filesystem object) was used to try appending to the file. In this case, the namenode rejects the append call because it knows that the original writer is alive and is still writing to it. If your aim is to have a concurrent reader along with a concurrent writer, then the best that this patch can do is to allow the concurrent reader to see the new file length only when the block is full. On the other hand, if you can make your application not depend on the file length, then you can see all the data in the file. Another alternative would be to implement a new call FileSystem.length() that can retrieve the latest length from the datanode, but that can be done as part of a separate JIRA. Please let us know how difficult it is for you to change your app to not depend on the file length and just read till end-of-file.
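
          For illustration, a reader that does not depend on the namenode-reported length and simply reads until end-of-file might look roughly like the sketch below (a sketch only; the path argument and buffer size are placeholders, not part of any attached patch):

          import java.io.IOException;

          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.fs.FSDataInputStream;
          import org.apache.hadoop.fs.FileSystem;
          import org.apache.hadoop.fs.Path;

          public class ReadToEof {
            public static void main(String[] args) throws IOException {
              FileSystem fs = FileSystem.get(new Configuration());
              Path path = new Path(args[0]);      // file still being written/synced by another client
              FSDataInputStream in = fs.open(path);
              try {
                // Read until EOF instead of trusting the reported file length,
                // which may lag behind data that has been sync'ed but not closed.
                byte[] buf = new byte[64 * 1024];
                long total = 0;
                int n;
                while ((n = in.read(buf)) != -1) {
                  total += n;
                }
                System.out.println("Read " + total + " bytes from " + path);
              } finally {
                in.close();
              }
            }
          }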

          Doug Judd added a comment -

          Dhruba,

          The application is not trying to reopen a file that it already has open. It appears that HDFS is getting confused and thinks that this is the case. One thing that is slightly different in this case is that nothing gets written to the file by the original writer. Here is the sequence of relevant operations:

          File gets created
          [nothing gets appended to it]
          process gets killed
          process starts up again
          File is reopened with append to obtain length

          Can you verify that this particular usage pattern is handled properly? I'll try to come up with a stripped down test case tomorrow.

          • Doug
          Konstantin Shvachko added a comment -

          See related comment here.
          Doug Judd added a comment -

          Ok, I've uploaded a reproducible test case (see attachments ReopenProblem.java and reopen_test.sh). This test consistently hangs on both my local Mac HDFS installation and our 10-node Linux HDFS cluster. It's not as simple as I suggested in my previous comment, as you can see from the code. Here's the output I see when I run it:

          [doug@motherlode000 0.9.2.1]$ ./reopen_test.sh
          Deleted hdfs://motherlode000:9000/hypertable/servers
          rmr: cannot remove /hypertable/tables: No such file or directory.
          Waiting for lease recovery ...
          Waiting for lease recovery ...
          Waiting for lease recovery ...
          Waiting for lease recovery ...
          Waiting for lease recovery ...
          Waiting for lease recovery ...
          Waiting for lease recovery ...
          Waiting for lease recovery ...
          Waiting for lease recovery ...
          Waiting for lease recovery ...
          Waiting for lease recovery ...
          Waiting for lease recovery ...
          Waiting for lease recovery ...
          length("/hypertable/servers/10.0.30.114_38060/log/root/0") = 163, obtained in 65258 milliseconds
          length("/hypertable/servers/10.0.30.114_38060/log/root/0") = 163, obtained in 113 milliseconds
          Read 163 bytes from root fragment 0
          Read 0 bytes from root fragment 0
          Waiting for lease recovery ...
          Waiting for lease recovery ...
          Waiting for lease recovery ...
          Waiting for lease recovery ...
          Waiting for lease recovery ...
          Waiting for lease recovery ...
          Waiting for lease recovery ...
          Waiting for lease recovery ...
          Waiting for lease recovery ...
          Waiting for lease recovery ...
          Waiting for lease recovery ...
          Waiting for lease recovery ...
          Waiting for lease recovery ...
          Waiting for lease recovery ...
          Waiting for lease recovery ...
          Waiting for lease recovery ...
          Waiting for lease recovery ...
          Waiting for lease recovery ...
          Waiting for lease recovery ...
          Waiting for lease recovery ...
          Waiting for lease recovery ...
          Waiting for lease recovery ...
          Waiting for lease recovery ...
          Waiting for lease recovery ...
          Waiting for lease recovery ...
          Waiting for lease recovery ...
          Waiting for lease recovery ...
          Waiting for lease recovery ...
          Waiting for lease recovery ...
          Waiting for lease recovery ...
          Waiting for lease recovery ...
          Waiting for lease recovery ...
          Waiting for lease recovery ...
          [...]

          dhruba borthakur added a comment -

          The HBase folks have requested that this be fixed in the next release from the 0.19 branch.

          dhruba borthakur added a comment -

          Hi Doug,

          ReopenProblem.java is guaranteed to hang. At line 189, you create a file and get a handle on it. Then, at line 197, you use the same file system object and invoke waitForLeaseRecovery(), which tries to open the same file again for appending. It fails as expected. The namenode log correctly shows that the "same leaseholder is trying to reopen the file for appends"!

          Does that make sense? Thanks, dhruba

          Doug Judd added a comment -

          Hi Dhruba,

          The file that is opened on line 189 (/hypertable/servers/10.0.30.114_38060/log/root/1) is different from the one passed into waitForLeaseRecovery on line 197. On line 194, the file object gets reset to a different path (/hypertable/servers/10.0.30.114_38060/log/metadata/0). So, this program should not hang.

          • Doug
          Nigel Daley added a comment -

          As discussed on core-dev@ (http://www.nabble.com/Hadoop-0.19.1-td21739202.html) we will disable append in 0.19.1. Moving these append related issues to 0.19.2.

          Jim Kellerman added a comment - - edited

          Applying this patch to hadoop-0.19.0, it does work, but slowly:

          It takes almost an hour to recover the lease of a file when the hdfs cluster is under load:

          2009-02-25 21:39:16,843 DEBUG org.apache.hadoop.hbase.regionserver.HLog: Splitting 3 of 3: hdfs:/x.y.com:8100/hbase/log_10.76.44.139_1235597506284_8020/hlog.dat.1235597820662
          2009-02-25 21:39:16,847 DEBUG org.apache.hadoop.hbase.regionserver.HLog: Triggering lease recovery.
          ...
          2009-02-25 22:37:12,755 INFO org.apache.hadoop.hbase.regionserver.HLog: log file splitting completed for hdfs://x.y.com:8100/hbase/log_10.76.44.139_1235597506284_8020
          
          Jim Kellerman added a comment -

          Is there something we can change to make lease recovery faster?
          config parameter?
          block size?
          max file size?

          Thanks.

          Doug Judd added a comment -

          There should be a way to forcibly take over a lease. In our application, ownership of the various files is controlled at a higher level (i.e. Chubby). A server will not come up and attempt to recover a file (or check its length) unless it is the sole owner of the file (i.e. it will have obtained a Chubby lock).

          Can we add some sort of API that allows a client to forcibly take over a lease immediately?
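
          What such a call might look like from the client side is sketched below; recoverLease() here is hypothetical at the time of this discussion and is shown only to illustrate the requested behaviour:

          import java.io.IOException;

          import org.apache.hadoop.fs.Path;
          import org.apache.hadoop.hdfs.DistributedFileSystem;

          public class ForceLeaseTakeover {
            // Hypothetical API: ask the namenode to revoke the previous writer's lease
            // immediately, skipping the soft-limit wait, because a higher-level lock
            // service (Chubby, in Hypertable's case) already guarantees single ownership.
            public static void takeOver(DistributedFileSystem dfs, Path file) throws IOException {
              boolean recovered = dfs.recoverLease(file);   // hypothetical call, not in 0.19/0.20
              if (!recovered) {
                throw new IOException("Lease on " + file + " not yet recovered");
              }
            }
          }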

          dhruba borthakur added a comment -

          There is ReopenProblem.java attached to this JIRA.

          @Doug: Please use ReopenProblem.java:waitForLeaseRecovery(). This method triggers near-immediate lease recovery and can be used by any application code to take ownership of the file. Does this work for you?

          @Jim: Were you using ReopenProblem.java:waitForLeaseRecovery() and still seeing that it takes up to an hour for the waitForLeaseRecovery() method to return success? This can happen if the original client that was writing to the file was alive and was in communication with the namenode. In this case, the new append-er will not be able to recover the original lease.

          Jim Kellerman added a comment -

          > dhruba borthakur - 26/Feb/09 11:35 AM
          > @Jim: Were u using ReopenProblem.java:waitForLeaseRecovery() and still seeing that it takes upto an hour for the
          > waitForLeaseRecovery() method to return success?

          Well, sort of. My code is similar, but inlined in the module that uses it:

          // Recover the file's lease if necessary
          boolean recovered = false;
          while (!recovered) {
            try {
              FSDataOutputStream out = fs.append(logfiles[i].getPath());
              out.close();
              recovered = true;
            } catch (IOException e) {
              if (LOG.isDebugEnabled()) {
                LOG.debug("Triggering lease recovery.");
              }
              try {
                Thread.sleep(leaseRecoveryPeriod);
              } catch (InterruptedException ex) {
                // ignore it and try again
              }
            }
          }
          

          > This can happen if the original client that was writing to the file was alive and was in communication with the namenode.
          > In this case, the new append-er will not be able to recover the original lease.

          This was definitely not the case. The original writing process was killed with -9.

          Like Hypertable, HBase has a mechanism to detect when the original writing process is dead.

          dhruba borthakur added a comment -

          What is the value of leaseRecoveryPeriod in the above piece of code?

          The fact that the lease recovery took 1 hour is very interesting. The namenode has two limits: the soft limit of 1 minute and the hard limit of 1 hour. When there is no ping from a writer for the soft limit period, then a new writer can recover the lease and append to the file. When there is no ping from a writer for the hard limit period, the namenode itself initiates lease recovery.
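
          For reference, a minimal sketch of the two limits described above; the constant names and placement are assumptions for illustration, not necessarily the NameNode's own:

          public final class LeaseLimits {
            // Soft limit: after 1 minute without a ping from the writer,
            // another client may recover the lease and append to the file.
            public static final long LEASE_SOFTLIMIT_PERIOD = 60 * 1000L;

            // Hard limit: after 1 hour without a ping, the namenode itself
            // initiates lease recovery.
            public static final long LEASE_HARDLIMIT_PERIOD = 60 * 60 * 1000L;

            private LeaseLimits() {}
          }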

          Is it possible for you to give me access to the namenode log along with the relevant filename in question?

          Doug Judd added a comment -

          Hi Dhruba,

          When you say "near-immediate" what does that mean? Is it the 1 minute soft limit? If so, that's way too long. If a node bounces in our cluster, we can't have the database locked up for an entire minute waiting for the lease to expire when we know for a 100% fact that the original leaseholder is dead. A one minute database lock-up would make Hypertable unsuitable for serving live data.

          • Doug
          Jim Kellerman added a comment -

          leaseRecoveryPeriod is 10000ms

          file name is hdfs://aa0-000-12.u.powerset.com:8100/hbase/log_208.76.44.139_1235681989771_8020/hlog.dat.1235682444379

          I will attach the namenode log.

          Jim Kellerman added a comment -

          Namenode log. Note that this run was done using the waitForLeaseRecovery() from ReopenProblem.java

          Doug Judd added a comment -

          We would like to see the soft lease limit configurable and settable to 0. Our application ensures that there is only one writer to each file. We need to be able to recover immediately.

          Jim Kellerman added a comment -

          We too would like to be able to recover the lease immediately as we know when we go to read the file that the
          writer is dead.

          Is there anything that we can do to speed this up? We can't contribute to Hadoop, but we are certainly willing to
          test and stress test any patch that is available.

          dhruba borthakur added a comment -

          Hi Jim,

          In the namenode log, I see the following statement:

          NameSystem.startFile: failed to create file /hbase/log_208.76.44.139_1235681989771_8020/hlog.dat.1235682444379 for DFSClient_1908447348 on client 208.76.44.139 because current leaseholder is trying to recreate file.

          This means that the same client that originally created the file is trying to re-open the file. Is this possible? It started at time 21:14 and continued all the way to 22:07. An attempt was made to recreate this file every 10 seconds (which matches the periodicity that you set).

          Jim Kellerman added a comment -

          No, the original writer was killed at 21:08 and never restarted.

          Lease recovery started at 21:14 by another process on the same machine.

          dhruba borthakur added a comment -

          Hi Jim, I think the namenode log that you posted is a partial one. It does not have calls to commitBlockSync (which should have occurred when the new writer started to append to the file). If so, can you please post the entire namenode log?

          Jim Kellerman added a comment -

          @Dhruba

          You were correct. I had INFO logging turned off. Here is a new name node log file with INFO enabled.

          Jim Kellerman added a comment -

          More info. When I generated the new namenode.log, the writer was killed at 16:54. The first attempt to grab the lease was at 16:56
          and it was finally granted at 17:54.

          Jim Kellerman added a comment -

          Dhruba,

          What branch/revision did you test this patch against?

          I've tried several revisions in the 0.19 branch (0.19.0, and revisions just before 0.19.1 and before HADOOP-5225 was committed), but none are able to
          recover the lease.

          It appears that 5225 was not committed to 0.20.0 or to trunk. Is that correct? If so, should I try a revision of Hadoop off of 0.20 or trunk?

          Jim Kellerman added a comment -

          Also, were you just testing Reader and Writer on an unloaded cluster, or using some other application
          against a cluster that was under load?

          dhruba borthakur added a comment -

          HADOOP-5225 has been applied only to 0.19.

          I have tested it only on the 0.19 branch (not on trunk), and only with very light load. I will be testing it at scale on 0.19 very shortly (I am waiting for HADOOP-5332 to get committed).

          stack added a comment -

          @Dhruba Any luck testing at scale? (Looks like HADOOP-5332 was committed). Thanks.

          dhruba borthakur added a comment -

          Hi Stack, I will post a new merged patch for this one. I am hoping that I get some file system folks to review it once I post the patch.

          dhruba borthakur added a comment -

          Merged patch with latest trunk.

          stack added a comment -

          Tried the patch on loaded cluster. First attempt got file lease after about 30 seconds or so. Second test fell into a loop:

          2009-05-12 00:51:16,031 [HMaster] INFO org.apache.hadoop.hdfs.DFSClient: Could not complete file /hbasetrunk2/.logs/aa0-000-14.u.powerset.com_1242088970041_60021/hlog.dat.1242089388713 retrying...
          2009-05-12 00:51:16,441 [HMaster] INFO org.apache.hadoop.hdfs.DFSClient: Could not complete file /hbasetrunk2/.logs/aa0-000-14.u.powerset.com_1242088970041_60021/hlog.dat.1242089388713 retrying...
          ...
          

          Sending Dhruba hdfs logs.

          dhruba borthakur added a comment -

          Fixed block reports and blockReceived to handle change in generation stamp of a block.

          stack added a comment -

          v9 works well in basic testing. If I crash the writer application, a subsequent reader can pick up edits from the crashed writer if the reader process does append, close and then reopen.

          I've been trying to break the patch but it seems robust (v7 and v8 had issues that Dhruba fixed).

          The append/close process can take from 10-60 seconds in my basic testing. Usually it's about 20 seconds. It would be better if there were no pause. Can this be addressed?
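
          For illustration, the recovery sequence described above (append to trigger lease recovery, close, then reopen for reading) might be sketched as follows; the retry interval and path are placeholders, and this is not the actual HBase code:

          import java.io.IOException;

          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.fs.FSDataInputStream;
          import org.apache.hadoop.fs.FileSystem;
          import org.apache.hadoop.fs.Path;

          public class RecoverThenRead {
            public static void main(String[] args) throws Exception {
              FileSystem fs = FileSystem.get(new Configuration());
              Path log = new Path(args[0]);     // log file left behind by the crashed writer

              // Step 1: open for append and close immediately to trigger lease recovery.
              // Keep retrying until the namenode is willing to hand over the lease.
              while (true) {
                try {
                  fs.append(log).close();
                  break;
                } catch (IOException e) {
                  Thread.sleep(1000);           // placeholder retry interval
                }
              }

              // Step 2: reopen for reading; the recovered length should now cover
              // everything the dead writer managed to sync before it crashed.
              FSDataInputStream in = fs.open(log);
              try {
                byte[] buf = new byte[8192];
                long total = 0;
                int n;
                while ((n = in.read(buf)) != -1) {
                  total += n;
                }
                System.out.println("Recovered " + total + " bytes from " + log);
              } finally {
                in.close();
              }
            }
          }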

          stack added a comment -

          My testing was done with hadoop 0.20.0 (I amended patch to work with 0.20 branch).

          Patch itself looks good. Minor comments are:

          + Is there anything that can be done about the duplicated test condition in BlockManager.java and DatanodeDescriptor.java? Can the test be grouped up into a method that is used in both places?
          + The if part of the if/else in DatanodeDescriptor.java is empty. Put a comment here? As is, it's hard to read.

          I haven't played much with killing the writer application and the local datanode. Will try that now. Will also try to confirm that we're reading up to the last sync before the kill. Will report back.

          Hairong Kuang added a comment -

          If one of the datanodes in the pipeline has a very, very slow disk, it is possible that when sync returns, not all of the synced data has been flushed to the disk yet. If this datanode dies and restarts before the lease recovery starts, I think the lease recovery will remove the synced data.

          stack added a comment -

          After a chat with Hairong and Dhruba, it was thought that the wait on a successful append, close, then open to read was not necessary in this patch; that a new reader should be able to read up to the dead writer's last sync. I tried it and all I get are EOFs when the new reader tries to read the dead writer's file.

          stack added a comment -

          ... and the read finds a zero-length file (nothing read from the file).

          Hairong Kuang added a comment -

          It seems to me that the current implementation of sync does not guarantee that synced data become visible to any new reader.

          stack added a comment -

          I tested some more and the new reader sees 0 data and gets an EOF each time.

          I went back to testing the wait-on-successful-append, close, and new open for read and ran into the following issue again (I'd seen this in an earlier patch) where we're spinning waiting on the append to succeed:

          2009-05-27 22:16:53,186 [HMaster] INFO org.apache.hadoop.hbase.regionserver.HLog: Failed open for append, waiting on lease recovery: hdfs://Y/hbase/.logs/X,1243460581044/hlog.dat.1243462406950
          org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create file /hbase/.logs/X,1243460581044/hlog.dat.1243462406950 for DFSClient_-843497834 on client X.X.X.X, because this file is already being created by DFSClient_-181950545 on Y.Y.Y.Y
              at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:1058)
              at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:1146)
              at org.apache.hadoop.hdfs.server.namenode.NameNode.append(NameNode.java:392)
              at sun.reflect.GeneratedMethodAccessor18.invoke(Unknown Source)
              at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
              at java.lang.reflect.Method.invoke(Method.java:597)
              at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
              at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959)
              at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955)
              at java.security.AccessController.doPrivileged(Native Method)
              at javax.security.auth.Subject.doAs(Subject.java:396)
              at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953)
          
              at org.apache.hadoop.ipc.Client.call(Client.java:739)
              at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
              at $Proxy0.append(Unknown Source)
              at sun.reflect.GeneratedMethodAccessor11.invoke(Unknown Source)
              at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
              at java.lang.reflect.Method.invoke(Method.java:597)
              at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
              at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
              at $Proxy0.append(Unknown Source)
              at org.apache.hadoop.hdfs.DFSClient.append(DFSClient.java:487)
              at org.apache.hadoop.hdfs.DistributedFileSystem.append(DistributedFileSystem.java:186)
              at org.apache.hadoop.fs.FileSystem.append(FileSystem.java:525)
              at org.apache.hadoop.hbase.regionserver.HLog.recoverLog(HLog.java:956)
              at org.apache.hadoop.hbase.regionserver.HLog.splitLog(HLog.java:800)
              at org.apache.hadoop.hbase.regionserver.HLog.splitLog(HLog.java:753)
              at org.apache.hadoop.hbase.master.ProcessServerShutdown.process(ProcessServerShutdown.java:248)
              at org.apache.hadoop.hbase.master.HMaster.processToDoQueue(HMaster.java:459)
              at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:395)
          

          ... and then this started up and won't stop:

          2009-05-27 22:16:59,753 [HMaster] INFO org.apache.hadoop.hdfs.DFSClient: Could not complete file /hbase/.logs/X,60021,1243460581044/hlog.dat.1243462406950 retrying...
          2009-05-27 22:17:00,163 [HMaster] INFO org.apache.hadoop.hdfs.DFSClient: Could not complete file /hbase/.logs/X,1243460581044/hlog.dat.1243462406950 retrying...
          2009-05-27 22:17:00,573 [HMaster] INFO org.apache.hadoop.hdfs.DFSClient: Could not complete file /hbase/.logs/X,1243460581044/hlog.dat.1243462406950 retrying...
          .....
          

          Namenode log is here: http://www.duboce.net/~stack/could_not_complete_file_nn.log.gz

          stack added a comment -

          In this test run, the append never succeeds.... or at least, after 25 minutes it still has not successfully done the append open. We try the append, fail with an AlreadyBeingCreatedException, sleep a second, and then cycle. Usually it takes well under a minute to successfully open-to-append. Namenode log is here: www.duboce.net:~stack/wontstop_namenode.log.gz. In this case, I killed datanode and the hbase regionserver simulating a machine falling off the cluster (Previous, I was mostly just killing the server process and not the datanode).

          dhruba borthakur added a comment -

          Patch for 0.20 branch. Fixes a ConcurrentModificationException.

          stack added a comment -

          I can manufacture a situation where we're stuck trying to open-for-append and we never seem to succeed (It ran all night last night cycling trying to open-for-append).

          4 servers. All running datanodes and hbase regionservers. Running head of the hadoop 0.20.0 branch with fsyncConcurrentReaders11_20.txt applied. In hbase, I have dfs.append set to true.

          Start up an upload. After a few minutes, kill datanode on ..44.142. I leave the hbase regionserver on ..44.142 running.

          After a little while, the regionserver on ..44.139 fails because of the following DFSClient exception:

          2009-06-18 18:24:28,514 [regionserver/0:0:0:0:0:0:0:0:60021.cacheFlusher] FATAL org.apache.hadoop.hbase.regionserver.MemcacheFlusher: Replay of hlog required. Forcing server shutdown
          org.apache.hadoop.hbase.DroppedSnapshotException: region: TestTable,0452187774,1245349374706
              at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:936)
              at org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:829)
              at org.apache.hadoop.hbase.regionserver.MemcacheFlusher.flushRegion(MemcacheFlusher.java:268)
              at org.apache.hadoop.hbase.regionserver.MemcacheFlusher.run(MemcacheFlusher.java:149)
          Caused by: java.io.IOException: Bad connect ack with firstBadLink *.*.44.142:51010
              at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2871)
              at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2794)
              at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2078)
              at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2264)
          

          (Why does this fail? Why does one missing datanode cause this to fail?)

          The hbase master, also on ..44.139, then goes to recover the files of the crashed regionserver process. It gets stuck looping trying to open-for-append:

          2009-06-18 18:25:04,587 [HMaster] INFO org.apache.hadoop.hbase.regionserver.HLog: Failed open for append, waiting on lease recovery: hdfs://aa0-000-12.u.powerset.com:9002/hbase/.logs/aa0-000-12.u.powerset.com,60021,1245348995684/hlog.dat.1245349434556
          org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create file /hbase/.logs/aa0-000-12.u.powerset.com,60021,1245348995684/hlog.dat.1245349434556 for DFSClient_2060803895 on client *.*.44.139, because this file is already being created by DFSClient_-937484673 on *.*.44.139
              at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:1058)
              at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:1146)
              at org.apache.hadoop.hdfs.server.namenode.NameNode.append(NameNode.java:392)
              at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
              at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
              at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
              at java.lang.reflect.Method.invoke(Method.java:597)
              at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
              at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959)
              at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955)
              at java.security.AccessController.doPrivileged(Native Method)
              at javax.security.auth.Subject.doAs(Subject.java:396)
              at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953)
          
              at org.apache.hadoop.ipc.Client.call(Client.java:739)
              at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
              at $Proxy0.append(Unknown Source)
              at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
              at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
              at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
              at java.lang.reflect.Method.invoke(Method.java:597)
              at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
              at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
              at $Proxy0.append(Unknown Source)
              at org.apache.hadoop.hdfs.DFSClient.append(DFSClient.java:487)
              at org.apache.hadoop.hdfs.DistributedFileSystem.append(DistributedFileSystem.java:186)
              at org.apache.hadoop.fs.FileSystem.append(FileSystem.java:525)
              at org.apache.hadoop.hbase.regionserver.HLog.recoverLog(HLog.java:1037)
              at org.apache.hadoop.hbase.regionserver.HLog.splitLog(HLog.java:816)
              at org.apache.hadoop.hbase.regionserver.HLog.splitLog(HLog.java:756)
              at org.apache.hadoop.hbase.master.ProcessServerShutdown.process(ProcessServerShutdown.java:274)
              at org.apache.hadoop.hbase.master.HMaster.processToDoQueue(HMaster.java:467)
              at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:401)
          

          Let me attach the namenode log.

          Show
          stack added a comment - I can manufacture a situation where we're stuck trying to open-for-append and we never seem to succeed (It ran all night last night cycling trying to open-for-append). 4 servers. All running datanodes and hbase regionservers. Running head of the hadoop 0.20.0 branch with fsyncConcurrentReaders11_20.txt applied. In hbase, I have dfs.append set to true. Start up an upload. After a few minutes, kill datanode on . .44.142. I leave the hbase regionserver on . .44.142 running. After a little while, the regionserver on . .44.139 fails because of the following DFSClient exception: 2009-06-18 18:24:28,514 [regionserver/0:0:0:0:0:0:0:0:60021.cacheFlusher] FATAL org.apache.hadoop.hbase.regionserver.MemcacheFlusher: Replay of hlog required. Forcing server shutdown org.apache.hadoop.hbase.DroppedSnapshotException: region: TestTable,0452187774,1245349374706 at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:936) at org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:829) at org.apache.hadoop.hbase.regionserver.MemcacheFlusher.flushRegion(MemcacheFlusher.java:268) at org.apache.hadoop.hbase.regionserver.MemcacheFlusher.run(MemcacheFlusher.java:149) Caused by: java.io.IOException: Bad connect ack with firstBadLink *.*.44.142:51010 at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2871) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2794) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2078) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2264) (Why does this fail? Why does one datanode missing cause this fail?) The hbase master, also on . .44.139, then goes to recover the files of the crashed regionserver process. 
It gets stuck looping trying to open-for-append: 2009-06-18 18:25:04,587 [HMaster] INFO org.apache.hadoop.hbase.regionserver.HLog: Failed open for append, waiting on lease recovery: hdfs: //aa0-000-12.u.powerset.com:9002/hbase/.logs/aa0-000-12.u.powerset.com,60021,1245348995684/hlog.dat.1245349434556 org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create file /hbase/.logs/aa0-000-12.u.powerset.com,60021,1245348995684/hlog.dat.1245349434556 for DFSClient_2060803895 on client *.*.44.139, because this file is already being created by DFSClient_-937484673 on *.*.44.139 at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:1058) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:1146) at org.apache.hadoop.hdfs.server.namenode.NameNode.append(NameNode.java:392) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953) at org.apache.hadoop.ipc.Client.call(Client.java:739) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220) at $Proxy0.append(Unknown Source) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59) at $Proxy0.append(Unknown Source) at org.apache.hadoop.hdfs.DFSClient.append(DFSClient.java:487) at org.apache.hadoop.hdfs.DistributedFileSystem.append(DistributedFileSystem.java:186) at org.apache.hadoop.fs.FileSystem.append(FileSystem.java:525) at org.apache.hadoop.hbase.regionserver.HLog.recoverLog(HLog.java:1037) at org.apache.hadoop.hbase.regionserver.HLog.splitLog(HLog.java:816) at org.apache.hadoop.hbase.regionserver.HLog.splitLog(HLog.java:756) at org.apache.hadoop.hbase.master.ProcessServerShutdown.process(ProcessServerShutdown.java:274) at org.apache.hadoop.hbase.master.HMaster.processToDoQueue(HMaster.java:467) at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:401) Let me attach the namenode log.
          stack added a comment -

          Namenode log that covers the forever cycle on open-for-append

          Philip Zeyliger added a comment -

          I'm traveling overseas and will return on June 29th. For urgent
          matters, please contact Amr Awadallah.

          dhruba borthakur added a comment -

          This is not related to HDFS-4379. Let me explain why.

          The problem is actually related to HDFS-xxx. The namenode waits for 10 minutes after losing heartbeats from a datanode to declare it dead. During this 10 minutes, the NN is free to choose the dead datanode as a possible replica for a newly allocated block.

          If during a write the dfsclient sees that a block replica location for a newly allocated block is not connectable, it re-requests the NN for a fresh set of replica locations for the block. It tries this dfs.client.block.write.retries times (default 3), sleeping 6 seconds between each retry (see DFSClient.nextBlockOutputStream). This setting works well when you have a reasonably sized cluster; if you have only 4 datanodes in the cluster, every retry picks the dead datanode and the above logic bails out.

          One solution is to change the value of dfs.client.block.write.retries to a much larger value, say 200 or so. Better still, increase the number of nodes in your cluster.
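
          A minimal sketch of that workaround on the client side, assuming the retry count is simply raised in the client's Configuration (the class name and path below are made up for illustration, and 200 is only the illustrative value mentioned above, not a recommendation):

            import org.apache.hadoop.conf.Configuration;
            import org.apache.hadoop.fs.FSDataOutputStream;
            import org.apache.hadoop.fs.FileSystem;
            import org.apache.hadoop.fs.Path;

            public class RetryConfigSketch {
              public static void main(String[] args) throws Exception {
                Configuration conf = new Configuration();
                // Allow many more attempts to pick a usable datanode for a newly
                // allocated block; tune the value for your cluster size.
                conf.setInt("dfs.client.block.write.retries", 200);

                FileSystem fs = FileSystem.get(conf);
                FSDataOutputStream out = fs.create(new Path("/tmp/retry-demo"));
                out.write("hello".getBytes());
                out.close();
                fs.close();
              }
            }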

          Konstantin Shvachko added a comment -

          I reviewed the patch. It looks good, nothing obviously wrong.
          I do not recommend it for commit to 0.20, mostly because it introduces changes that affect the behavior of the system even when appends are not in use.
          Main concerns are:

          1. There are new APIs like new rpc call getBlockInfo() to data-nodes and public methods syncFs().
          2. Regular reads will do additional steps before reading.
          3. Lease recovery changed substantially: touches non-append related paths.
          4. Block report processing plays tricks with generation stamps.
          5. Also, I am not sure this will be compatible with the ongoing design HDFS-265.

          Taking all of the above into account, if there is enough interest in the feature it makes sense to create a branch of 0.20 with this patch applied.
          This will require some maintenance, but shouldn't be too hard.

          stack added a comment -

          (Thanks for review Konstantin)

          In my last few test runs, NameNode has shut itself down with the below:

          ...
          009-07-14 00:17:46,586 DEBUG org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.removeStoredBlock: blk_-9156287469566772234_2527 from XX.XX.XX.142:51010
          2009-07-14 00:17:46,586 DEBUG org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.removeStoredBlock: blk_-9181830129071396520_2355 from XX.XX.XX.142:51010
          2009-07-14 00:17:46,586 DEBUG org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.removeStoredBlock: blk_-9205119721509648294_2410 from XX.XX.XX.142:51010
          2009-07-14 00:17:46,586 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.processReport: block blk_-7011715647341740217_1 on XX.XX.XX.142:51010 size 47027149 does not belong to any file.
          2009-07-14 00:17:46,586 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addToInvalidates: blk_-7011715647341740217 is added to invalidSet of XX.XX.XX.142:51010
          2009-07-14 00:17:46,586 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.processReport: block blk_-280166356715716926_1 on XX.XX.XX.142:51010 size 6487 does not belong to any file.
          2009-07-14 00:17:46,586 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addToInvalidates: blk_-280166356715716926 is added to invalidSet of XX.XX.XX.142:51010
          2009-07-14 00:17:46,586 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.processReport: block blk_1532053033915429278_1 on XX.XX.XX.142:51010 size 3869 does not belong to any file.
          2009-07-14 00:17:46,586 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addToInvalidates: blk_1532053033915429278 is added to invalidSet of XX.XX.XX.142:51010
          2009-07-14 00:17:47,303 WARN org.apache.hadoop.hdfs.server.namenode.FSNamesystem: ReplicationMonitor thread received Runtime exception. java.lang.IllegalStateException: generationStamp (=1) == GenerationStamp.WILDCARD_STAMP
          2009-07-14 00:17:47,304 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
          /************************************************************
          SHUTDOWN_MSG: Shutting down NameNode at aa0-000-12.u.powerset.com/XX.XX.XX.139
          ************************************************************/
          

          My guess is this is a bug only folks with dfs.support.append=true set run into?

          Here is code from ReplicationMonitor:

                  } catch (Throwable t) {
                    LOG.warn("ReplicationMonitor thread received Runtime exception. " + t);
                    Runtime.getRuntime().exit(-1);
                  }
          

          That's a rough call, I'd say?

          There are no more detailed exceptions in NN log.

          Dig in more and stick what I find in another issue?

          dhruba borthakur added a comment -

          Here is another patch that prints out the exception stack trace when the ReplicationMonitor encounters an exception. If you can reproduce this failure case with this patch, that will be great.
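
          Presumably the change amounts to handing the Throwable to the logger instead of concatenating it, so the full stack trace is emitted before the exit. A rough, hedged sketch of that idea (the wrapper class below is made up for illustration and is not the actual patch):

            import org.apache.commons.logging.Log;
            import org.apache.commons.logging.LogFactory;

            public class ReplicationMonitorLoggingSketch {
              private static final Log LOG =
                  LogFactory.getLog(ReplicationMonitorLoggingSketch.class);

              // Run a monitor body and, on any Throwable, log it together with its
              // stack trace before exiting, instead of logging only t.toString().
              public static void runLoop(Runnable body) {
                try {
                  body.run();
                } catch (Throwable t) {
                  LOG.warn("ReplicationMonitor thread received Runtime exception. ", t);
                  Runtime.getRuntime().exit(-1);
                }
              }
            }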

          Ruyue Ma added a comment -

          to: dhruba borthakur

          > This is not related to HDFS-4379. Let me explain why.
          > The problem is actually related to HDFS-xxx. The namenode waits for 10 minutes after losing heartbeats from a datanode to declare it dead. During this 10 minutes, the NN is free to choose the dead datanode as a possible replica for a newly allocated block.

          > If during a write the dfsclient sees that a block replica location for a newly allocated block is not connectable, it re-requests the NN for a fresh set of replica locations for the block. It tries this dfs.client.block.write.retries times (default 3), sleeping 6 seconds between each retry (see DFSClient.nextBlockOutputStream). This setting works well when you have a reasonably sized cluster; if you have only 4 datanodes in the cluster, every retry picks the dead datanode and the above logic bails out.

          > One solution is to change the value of dfs.client.block.write.retries to a much larger value, say 200 or so. Better still, increase the number of nodes in your cluster.

          Our modification: when getting a block location from the namenode, we give the NN the excluded datanodes. The list of dead datanodes is only used for one block allocation.

          +++ hadoop-new/src/hdfs/org/apache/hadoop/hdfs/DFSClient.java	2009-07-20 00:19:03.000000000 +0800
          @@ -2734,6 +2734,7 @@
               LocatedBlock lb = null;
               boolean retry = false;
               DatanodeInfo[] nodes;
          +    DatanodeInfo[] exludedNodes = null;
               int count = conf.getInt("dfs.client.block.write.retries", 3);
               boolean success;
               do {
          @@ -2745,7 +2746,7 @@
                 success = false;

                 long startTime = System.currentTimeMillis();
          -      lb = locateFollowingBlock(startTime);
          +      lb = locateFollowingBlock(startTime, exludedNodes);
                 block = lb.getBlock();
                 nodes = lb.getLocations();

          @@ -2755,6 +2756,19 @@
                 success = createBlockOutputStream(nodes, clientName, false);

                 if (!success) {
          +
          +        LOG.info("Excluding node: " + nodes[errorIndex]);
          +        // Mark datanode as excluded
          +        DatanodeInfo errorNode = nodes[errorIndex];
          +        if (exludedNodes != null) {
          +          DatanodeInfo[] newExcludedNodes = new DatanodeInfo[exludedNodes.length + 1];
          +          System.arraycopy(exludedNodes, 0, newExcludedNodes, 0, exludedNodes.length);
          +          newExcludedNodes[exludedNodes.length] = errorNode;
          +          exludedNodes = newExcludedNodes;
          +        } else {
          +          exludedNodes = new DatanodeInfo[] { errorNode };
          +        }
          +
                   LOG.info("Abandoning block " + block);
                   namenode.abandonBlock(block, src, clientName);
          dhruba borthakur added a comment -

          Hi Ruyue, your option of excluding specific datanodes (specified by the client) sounds reasonable. This might help in the case of network partitioning, where a specific client loses access to a set of datanodes while those datanodes are alive and well and able to send heartbeats to the namenode. Can you please create a separate JIRA for your proposed fix and attach your patch there? Thanks.

          dhruba borthakur added a comment -

          This patch fixes the exception "Runtime exception. java.lang.IllegalStateException: generationStamp (=1) == GenerationStamp.WILDCARD_STAMP" reported by stack earlier.

          stack added a comment -

          Just to say that I've been testing the last patch over the last 3 or 4 days and it's holding up for me. I'll keep at it. I also wrote to Ruyue Ma about the server-side of his patch above but no answer as yet. Good stuff.

          dhruba borthakur added a comment -

          This patch avoids deleting blocks during block report processing if the file is underConstruction. Updated the test case to test this corner case.

          ryan rawson added a comment -

          I've been testing this on my 20-node cluster here. Lease recovery can take a long time, which is a bit of an issue. The sync seems to be pretty good overall; we are recovering most of the edits up until the last flush, and it's pretty responsive.

          However, I have discovered a new bug, the scenario is like so:

          • we roll the logs every < 1MB (block size).
          • we now have 18 logs to recover. The first 17 were closed properly, only the last one was in mid-write.
          • during log recovery, the hbase master calls fs.append(f); out.close();
          • But the master gets stuck at the out.close(); It can't seem to progress. Investigating the logs, it looks like the namenode 'forgets' about the other 2 replicas for the block (file is 1 block), and thus we are stuck
            until another replica comes back.

          I've attached logs, hadoop fsck, stack traces from hbase.
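
          For context, the append-then-close trick used to force lease recovery boils down to roughly the following (a hedged sketch, not the actual HLog.recoverLog code; the retry loop, sleep interval, and class name are made up for illustration):

            import java.io.IOException;

            import org.apache.hadoop.fs.FSDataOutputStream;
            import org.apache.hadoop.fs.FileSystem;
            import org.apache.hadoop.fs.Path;

            public class RecoverLogSketch {
              // Re-open a write-ahead log for append and immediately close it so the
              // namenode recovers the lease held by the dead writer.
              static void recoverLog(FileSystem fs, Path log) throws Exception {
                while (true) {
                  try {
                    FSDataOutputStream out = fs.append(log);
                    out.close();   // in the scenario above, this close() never returns
                    return;
                  } catch (IOException e) {
                    // e.g. AlreadyBeingCreatedException: lease not yet recovered, retry
                    Thread.sleep(1000);
                  }
                }
              }
            }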

          stack added a comment -

          @dhruba Did you get a chance to checkout Ryan's postings? Good stuff...

          dhruba borthakur added a comment -

          @ryan, stack: thanks for the info. I will follow it up very soon. will keep you posted.

          dhruba borthakur added a comment - - edited

          I think this is the scenario that you are facing:

          1. The file is written for the first time. The writer closed the file, but at this time only one of the three replicas has checked in with the namenode.
          2. The new writer invoked append() to write more data into the file. The new writer found the one remaining replica of the block, stamped a new generation stamp on this replica, and made it ready to receive new data; the file is now open for "append". The stamping of the new generation stamp essentially invalidated the other two replicas of this block, so this block now has only one valid replica. The namenode won't start replicating this block till the block is full. If this sole datanode now dies, then the file will be "missing a block". This is what you folks encountered.

          One option is to set dfs.replication.min to 2. This will ensure that closing a file (step 1) will be successful only when at least two replicas of the block have checked in with the namenode. This should reduce the probability of this problem occurring. Along with that, you could set the replication factor of the hbase log file(s) to be greater than 3.
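
          A hedged sketch of those two suggestions: as far as the 0.20 code goes, dfs.replication.min is read by the namenode, so it belongs in the namenode's configuration, while the higher replication factor can be requested by the client when the log file is created (the class name, file path, and values below are illustrative only):

            import org.apache.hadoop.conf.Configuration;
            import org.apache.hadoop.fs.FSDataOutputStream;
            import org.apache.hadoop.fs.FileSystem;
            import org.apache.hadoop.fs.Path;

            public class LogReplicationSketch {
              public static void main(String[] args) throws Exception {
                Configuration conf = new Configuration();
                FileSystem fs = FileSystem.get(conf);

                // Ask for more than the default number of replicas for the log file;
                // (short) 4 is only an illustrative value.
                Path log = new Path("/hbase/.logs/example-hlog");
                FSDataOutputStream out = fs.create(log, true, 4096, (short) 4,
                                                   fs.getDefaultBlockSize());
                out.write("edit".getBytes());
                out.close();

                // dfs.replication.min=2 would go into the namenode's hdfs-site.xml, not
                // here; close() then completes only once two replicas have checked in.
                fs.close();
              }
            }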

          ryan rawson added a comment -

          I'm not sure exactly what I'm seeing; here is the flow of events.

          The namenode says this when the regionserver closes a log:

          2009-09-23 17:21:05,128 DEBUG org.apache.hadoop.hdfs.StateChange: BLOCK NameNode.blockReceived: from 10.10.21.38:50010 1 blocks.
          2009-09-23 17:21:05,128 DEBUG org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.blockReceived: blk_8594965619504827451_4351 is received from 10.10.21.38:50010
          2009-09-23 17:21:05,128 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.10.21.38:50010 is added to blk_859496561950482745
          1_4351 size 573866
          2009-09-23 17:21:05,130 DEBUG org.apache.hadoop.hdfs.StateChange: BLOCK NameNode.blockReceived: from 10.10.21.45:50010 1 blocks.
          2009-09-23 17:21:05,130 DEBUG org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.blockReceived: blk_8594965619504827451_4351 is received from 10.10.21.45:50010
          2009-09-23 17:21:05,130 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.10.21.45:50010 is added to blk_859496561950482745
          1_4351 size 573866
          2009-09-23 17:21:05,131 DEBUG org.apache.hadoop.hdfs.StateChange: BLOCK NameNode.blockReceived: from 10.10.21.32:50010 1 blocks.
          2009-09-23 17:21:05,131 DEBUG org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.blockReceived: blk_8594965619504827451_4351 is received from 10.10.21.32:50010
          2009-09-23 17:21:05,131 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.10.21.32:50010 is added to blk_859496561950482745
          1_4351 size 573866
          2009-09-23 17:21:05,131 DEBUG org.apache.hadoop.hdfs.StateChange: DIR NameNode.complete: /hbase/.logs/sv4borg32,60020,1253751520085/hlog.dat.1253751663228 for DFSClien
          t_-2129099062
          2009-09-23 17:21:05,131 DEBUG org.apache.hadoop.hdfs.StateChange: DIR* NameSystem.completeFile: /hbase/.logs/sv4borg32,60020,1253751520085/hlog.dat.1253751663228 for DFS
          Client_-2129099062
          2009-09-23 17:21:05,132 DEBUG org.apache.hadoop.hdfs.StateChange: DIR* FSDirectory.closeFile: /hbase/.logs/sv4borg32,60020,1253751520085/hlog.dat.1253751663228 with 1 bl
          ocks is persisted to the file system
          2009-09-23 17:21:05,132 DEBUG org.apache.hadoop.hdfs.StateChange: DIR* NameSystem.completeFile: /hbase/.logs/sv4borg32,60020,1253751520085/hlog.dat.1253751663228 blockli
          st persisted

          So at this point we have 3 replicas of the block, and they have all checked in, right?

          Then, during logfile recovery, we see:

          2009-09-23 17:21:45,997 DEBUG org.apache.hadoop.hdfs.StateChange: DIR NameNode.append: file /hbase/.logs/sv4borg32,60020,1253751520085/hlog.dat.1253751663228 for DFSCl
          ient_-828773542 at 10.10.21.29
          2009-09-23 17:21:45,997 DEBUG org.apache.hadoop.hdfs.StateChange: DIR* NameSystem.startFile: src=/hbase/.logs/sv4borg32,60020,1253751520085/hlog.dat.1253751663228, holde
          r=DFSClient_-828773542, clientMachine=10.10.21.29, replication=512, overwrite=false, append=true
          2009-09-23 17:21:45,997 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Number of transactions: 567 Total time for transactions(ms): 9Number of transactions ba
          tched in Syncs: 54 Number of syncs: 374 SyncTimes(ms): 12023 4148 3690 7663
          2009-09-23 17:21:45,997 DEBUG org.apache.hadoop.hdfs.StateChange: UnderReplicationBlocks.update blk_8594965619504827451_4351 curReplicas 0 curExpectedReplicas 3 oldReplicas 0 oldExpectedReplicas 3 curPri 2 oldPri 2
          2009-09-23 17:21:45,997 DEBUG org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.UnderReplicationBlock.update:blk_8594965619504827451_4351 has only 0 replicas and need 3 replicas so is added to neededReplications at priority level 2
          2009-09-23 17:21:45,997 DEBUG org.apache.hadoop.hdfs.StateChange: DIR* NameSystem.appendFile: file /hbase/.logs/sv4borg32,60020,1253751520085/hlog.dat.1253751663228 for DFSClient_-828773542 at 10.10.21.29 block blk_8594965619504827451_4351 block size 573866
          2009-09-23 17:21:45,997 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=hadoop,hadoop ip=/10.10.21.29 cmd=append src=/hbase/.logs/sv4borg32,60020,1253751520085/hlog.dat.1253751663228 dst=null perm=null
          2009-09-23 17:21:47,265 DEBUG org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.UnderReplicationBlock.remove: Removing block blk_8594965619504827451_4351 from priority queue 2
          2009-09-23 17:21:56,016 DEBUG org.apache.hadoop.hdfs.StateChange: BLOCK NameNode.addBlock: file /hbase/.logs/sv4borg32,60020,1253751520085/hlog.dat.1253751663228 for DFSClient_-828773542
          2009-09-23 17:21:56,016 DEBUG org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.getAdditionalBlock: file /hbase/.logs/sv4borg32,60020,1253751520085/hlog.dat.1253751663228 for DFSClient_-828773542
          2009-09-23 17:21:56,016 DEBUG org.apache.hadoop.hdfs.StateChange: DIR* FSDirectory.addFile: /hbase/.logs/sv4borg32,60020,1253751520085/hlog.dat.1253751663228 with blk_-7773763727920913762_4364 block is added to the in-memory file system
          2009-09-23 17:21:56,016 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.allocateBlock: /hbase/.logs/sv4borg32,60020,1253751520085/hlog.dat.1253751663228. blk_-7773763727920913762_4364

          At this point the client is stuck waiting on a block recovery that won't ever really happen.

          I set this config at your suggestion:
          <property>
          <name>dfs.replication.min</name>
          <value>2</value>
          </property>

          in the hbase-site.xml so it should be in the HBase dfs client config.

          From my reading of the logfile, the namenode seems to "forget" the other 2 locations of the block in question. Am I reading it wrong?

          ryan rawson added a comment -

          Scratch the last; I was having some environment/library version problems.

          Tsz Wo Nicholas Sze added a comment -

          After HDFS-659, new readers are able to read hflushed data in the Append Branch. We have committed a test, TestReadWhileWriting, which is similar to TC3.

          Alex Newman added a comment -

          Anything going on here? Can we expect this to be fixed by 0.21.0?

          Tsz Wo Nicholas Sze added a comment -

          Hi Alex, it was already fixed in 0.21 by the new hflush implementation. We have already done a lot of testing on hflush.
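
          For anyone reading along, the 0.21 behaviour being described looks roughly like the sketch below (hedged; the path, buffer size, and class name are made up): the writer calls hflush(), and a reader that opens the file afterwards can see the flushed bytes even though the file is still open for write.

            import org.apache.hadoop.conf.Configuration;
            import org.apache.hadoop.fs.FSDataInputStream;
            import org.apache.hadoop.fs.FSDataOutputStream;
            import org.apache.hadoop.fs.FileSystem;
            import org.apache.hadoop.fs.Path;

            public class HflushVisibilitySketch {
              public static void main(String[] args) throws Exception {
                FileSystem fs = FileSystem.get(new Configuration());
                Path p = new Path("/tmp/hflush-demo");

                FSDataOutputStream out = fs.create(p);
                out.write("half a block of data".getBytes());
                out.hflush();   // flushed bytes become visible to readers that open the file now

                // A new reader (possibly on another machine) opened after the hflush
                // should see the bytes above before the writer closes the file.
                FSDataInputStream in = fs.open(p);
                byte[] buf = new byte[64];
                int n = in.read(buf);
                System.out.println("read " + n + " bytes");
                in.close();

                out.close();
                fs.close();
              }
            }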

          Tsz Wo Nicholas Sze added a comment -

          This duplicates HDFS-659.

          dhruba borthakur added a comment -

          Merged with latest trunk

          Todd Lipcon added a comment -

          Hey Dhruba,

          In the client-side length-fetching code, you check the LocatedBlock last = newInfo.get(newInfo.locatedBlockCount()-1) to fetch the size.
          However, this is only the last block in the file for cases where the file has fewer than prefetchSize blocks, right? Has anyone tried a test
          after setting block size to 1MB?

          The trunk append code solves this issue by adding a lastBlock field to LocatedBlocks. I imagine you avoided this to keep wire compatibility
          in this patch.

          -Todd

          Todd Lipcon added a comment -

          Regarding above comment - I suppose it doesn't entirely matter, since you only care about getting the true file length if you're reading
          at the end of the file. So, when you get near the end (10 blocks away) you'll get the updated length. However, you'll still do useless
          getBlockInfo calls on earlier blocks in the file. Perhaps some check against the LocatedBlocks structure to see if the last block
          in the block list is actually the same as the last block of the file is in order?

          dhruba borthakur added a comment -

          Hi Todd, I think your point of concern deserves investigation. I will post a new patch pretty soon. Thanks for looking over this one.

          One option is to check if the size of the lastBlock in newInfo is less than the size of the previous block; if so, it means that we have reached the last block of the file, and only then trigger the code to retrieve the last block size from the datanode. (For a HDFS file, all blocks except the last one can be partial.)

          dhruba borthakur added a comment -

          I meant "For a HDFS file, only the last block can be partial in size".

          Todd Lipcon added a comment -

          That seems like a bit of a tenuous assumption. I agree that it's currently true, but it just seems like "coincidence". What about checking:

          LocatedBlock lastLocated = newInfo.get(newInfo.locatedBlockCount() - 1);
          if (lastLocated.getStartOffset() + lastLocated.getBlockSize() == newInfo.getFileLength()) {
            // the last block we located is the last block in the file
            ...
          }
          
          dhruba borthakur added a comment -

          Hi Todd, I think your proposal might not work. The newInfo.getFileLength() that you are using is the length of the file as returned by the namenode. We are interested in a file that is currently being written into, thus its length is constantly changing at the datanode.

          Todd Lipcon added a comment -

          Hey Dhruba. newInfo.getFileLength is indeed the NN's view of the length, which is why I think it should be
          consistent with lastLocated.getBlockSize - the NN's view of the last block. Would the NN ever return an
          'inconsistent' LocatedBlocks such that the file size has changed but the last block size hasn't?

          dhruba borthakur added a comment -

          Incorporated Todd's review comments. Thanks Todd. The change you proposed is an optimization, isn't it? The earlier patch should have done the job too.

          Todd Lipcon added a comment -

          The change you proposed is an optimization, isn't it? The earlier patch should have done the job too.

          Yes, I think so

          I'll continue to look over the patch this week to see if I see anything else. Thanks again for posting this on the jira.

          dhruba borthakur added a comment -

          This patch is needed to make Hbase work correctly on Hadoop 0.20.

          stack added a comment -

          It's not important, but the last patch seems to have some dross on the end: patches for content under src/docs/cn*

          sam rash added a comment -

          1-line fix

          Todd Lipcon added a comment -

          Hey Sam, any unit test for that last fix? Or description of what was wrong with it?

          sam rash added a comment -

          Sorry, it should be easy to make a unit test, but I didn't have time. The case is that any lease that expires and has more than one path associated with it will throw exceptions. The leaseManager
          calls internalRelease(lease, src) and expects only the lease for src to be removed by the call (hence it does its own loop over all the paths for a lease). However, with HDFS-200, this function removes
          all paths for the lease, not just the specified one. internalReleaseLeaseOne does just the single path.

          If I get a minute, I can make such a test case quickly and update the patch with it.

          Todd Lipcon added a comment -

          cool, I agree with your assessment. Unit test should be good whenever you have a chance, thanks!

          sam rash added a comment -

          Unit test for the previous fix (a little bit whitebox, but otherwise the error is benign except for exceptions that get logged--probably why it wasn't noticed for a while).

          dhruba borthakur added a comment -

          This has to be pulled into the branch-0.20-append branch.

          dhruba borthakur added a comment -

          I committed this into 0.20-append branch.

          Jitendra Nath Pandey added a comment -

          Patch for 20-security branch.

          Suresh Srinivas added a comment -

          +1 for the patch.

          Suresh Srinivas added a comment -

          I committed this change to 0.20-security.

          Matt Foley added a comment -

          Closed upon release of 0.20.205.0

          Uma Maheswara Rao G added a comment -

          Hi Dhruba,

          It looks like the following code creates the problem in one special condition.

           BlockInfo storedBlock = blocksMap.getStoredBlock(block);
          +    if (storedBlock == null) {
          +      // if the block with a WILDCARD generation stamp matches and the
          +      // corresponding file is under construction, then accept this block.
          +      // This block has a diferent generation stamp on the datanode 
          +      // because of a lease-recovery-attempt.
          +      Block nblk = new Block(block.getBlockId());
          +      storedBlock = blocksMap.getStoredBlock(nblk);
          +      if (storedBlock != null && storedBlock.getINode() != null &&
          +          (storedBlock.getGenerationStamp() <= block.getGenerationStamp() ||
          +           storedBlock.getINode().isUnderConstruction())) {
          +        NameNode.stateChangeLog.info("BLOCK* NameSystem.addStoredBlock: "
          +          + "addStoredBlock request received for " + block + " on "
          +          + node.getName() + " size " + block.getNumBytes()
          +          + " and it belongs to a file under construction. ");
          +      } else {
          +        storedBlock = null;
          +      }
          
          

          Events are as follows.

          1) DN1->DN2->DN3 are in pipeline with genstamp 1
          2) Client completed writing and closed the file.
          3) Now DN3 killed.
          4) Now reopened the file in append.
          5) Now Pipeline contains DN1->DN2 with genstamp 2
          6) Client continues writing some more data.
          7) Now DN3 started. The replica is present in its current directory, as it was already finalized before.
          8) DN3 triggered blockReport.
          9) Since this block with genstamp 1 is not in the BlocksMap, the namenode tries to get the block with the WILDCARD stamp, and it finds the block, which now carries the newer genstamp (2).
          10) Since the file is UnderConstruction, it will just accept the block and update it in the BlocksMap.

          The problem is, if the client gets DN3 for a read, it will fail because the NN may give out the block ID with the latest genstamp (2), and DN3 does not contain the block with genstamp 2. Of course, the data is also inconsistent.

          Thanks
          Uma


            People

            • Assignee:
              dhruba borthakur
            • Reporter:
              Tsz Wo Nicholas Sze
            • Votes:
              8
            • Watchers:
              45
