Issue Details (XML | Word | Printable)

Key: HDFS-200
Type: New Feature New Feature
Status: Open Open
Priority: Blocker Blocker
Assignee: dhruba borthakur
Reporter: Tsz Wo (Nicholas), SZE
Votes: 8
Watchers: 34
Operations

If you were logged in you would be able to see more operations.
Hadoop HDFS

In HDFS, sync() not yet guarantees data available to the new readers

Created: 09/Oct/08 01:03 AM   Updated: 29/Sep/09 08:38 PM
Return to search
Component/s: None
Affects Version/s: None
Fix Version/s: None

Time Tracking:
Not Specified

File Attachments:
  Size
Java Source File Licensed for inclusion in ASF works 4379_20081010TC3.java 2008-10-10 06:26 PM Tsz Wo (Nicholas), SZE 4 kB
Text File Licensed for inclusion in ASF works fsyncConcurrentReaders.txt 2009-01-16 08:26 PM dhruba borthakur 14 kB
Text File Licensed for inclusion in ASF works fsyncConcurrentReaders11_20.txt 2009-06-09 06:48 PM dhruba borthakur 14 kB
Text File Licensed for inclusion in ASF works fsyncConcurrentReaders12_20.txt 2009-07-16 07:34 AM dhruba borthakur 14 kB
Text File Licensed for inclusion in ASF works fsyncConcurrentReaders13_20.txt 2009-07-31 01:24 AM dhruba borthakur 15 kB
Text File Licensed for inclusion in ASF works fsyncConcurrentReaders14_20.txt 2009-08-13 06:50 AM dhruba borthakur 17 kB
Text File Licensed for inclusion in ASF works fsyncConcurrentReaders3.patch 2009-01-23 10:26 AM dhruba borthakur 14 kB
Text File Licensed for inclusion in ASF works fsyncConcurrentReaders4.patch 2009-01-27 02:08 AM dhruba borthakur 15 kB
Text File Licensed for inclusion in ASF works fsyncConcurrentReaders5.txt 2009-05-04 04:59 AM dhruba borthakur 15 kB
Text File Licensed for inclusion in ASF works fsyncConcurrentReaders6.patch 2009-05-10 10:55 AM dhruba borthakur 15 kB
Text File Licensed for inclusion in ASF works fsyncConcurrentReaders9.patch 2009-05-18 08:39 PM dhruba borthakur 19 kB
GZip Archive Licensed for inclusion in ASF works hadoop-stack-namenode-aa0-000-12.u.powerset.com.log.gz 2009-06-18 06:52 PM stack 86 kB
Text File hdfs-200-ryan-existing-file-fail.txt 2009-08-27 02:08 AM ryan rawson 13 kB
GZip Archive Licensed for inclusion in ASF works hypertable-namenode.log.gz 2009-01-28 01:22 AM Doug Judd 33 kB
Text File Licensed for inclusion in ASF works namenode.log 2009-03-04 06:05 PM Jim Kellerman 2.51 MB
Text File Licensed for inclusion in ASF works namenode.log 2009-02-26 10:17 PM Jim Kellerman 95 kB
Java Source File Licensed for inclusion in ASF works Reader.java 2009-01-27 02:08 AM dhruba borthakur 2 kB
Java Source File Licensed for inclusion in ASF works Reader.java 2009-01-20 01:30 AM Jim Kellerman 1 kB
File Licensed for inclusion in ASF works reopen_test.sh 2009-01-28 10:04 PM Doug Judd 0.3 kB
Java Source File Licensed for inclusion in ASF works ReopenProblem.java 2009-01-28 10:03 PM Doug Judd 7 kB
Java Source File Licensed for inclusion in ASF works Writer.java 2009-01-27 02:07 AM dhruba borthakur 2 kB
Java Source File Licensed for inclusion in ASF works Writer.java 2009-01-20 01:29 AM Jim Kellerman 2 kB
Issue Links:
Blocker
 
Reference


 Description  « Hide
In the append design doc (https://issues.apache.org/jira/secure/attachment/12370562/Appends.doc), it says
  • A reader is guaranteed to be able to read data that was 'flushed' before the reader opened the file

However, this feature is not yet implemented. Note that the operation 'flushed' is now called "sync".



 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Tsz Wo (Nicholas), SZE made changes - 09/Oct/08 01:03 AM
Field Original Value New Value
Link This issue relates to HADOOP-1700 [ HADOOP-1700 ]
Tsz Wo (Nicholas), SZE made changes - 09/Oct/08 01:06 AM
Link This issue relates to HADOOP-2658 [ HADOOP-2658 ]
Tsz Wo (Nicholas), SZE added a comment - 09/Oct/08 01:06 AM
In particular, the following will fail.
3.	TC3: fsync.
a.	On Machine M1, Create file. Write half block of data. Invoke (DFSOutputStream).fsync() on the dfs file handle. Do not close file yet.
b.	On another machine M2, open file and verify that the half-block of data can be read successfully.
c.	On M1, append another half block of data.  Close file on M1.
d.	On M2, open file and read 1 block of data from it. Close file.

See also https://issues.apache.org/jira/secure/attachment/12391788/20081008testplan.txt


Tsz Wo (Nicholas), SZE made changes - 09/Oct/08 01:10 AM
Description In the append design doc in https://issues.apache.org/jira/secure/attachment/12370562/Appends.doc, it says
* A reader is guaranteed to be able to read data that was 'flushed' before the reader opened the file

However, this feature is not yet implemented.
In the append design doc (https://issues.apache.org/jira/secure/attachment/12370562/Appends.doc), it says
* A reader is guaranteed to be able to read data that was 'flushed' before the reader opened the file

However, this feature is not yet implemented. Note that the operation 'flushed' is now called "sync".
Tsz Wo (Nicholas), SZE added a comment - 10/Oct/08 06:26 PM
4379_20081010TC3.java: a unit test implementing TC3 described in my previous comment.

Tsz Wo (Nicholas), SZE made changes - 10/Oct/08 06:26 PM
Attachment 4379_20081010TC3.java [ 12391895 ]
dhruba borthakur made changes - 16/Jan/09 08:22 PM
Assignee dhruba borthakur [ dhruba ]
dhruba borthakur made changes - 16/Jan/09 08:22 PM
Fix Version/s 0.19.1 [ 12313473 ]
dhruba borthakur added a comment - 16/Jan/09 08:26 PM
When a fsync occurs, the client informs the namenode the amount of data that it has written to the last block. The NameNode updates its meatadata with tis value.This means that a new getBlockLocation call will automatically retrieve the last good size of the last block.

dhruba borthakur made changes - 16/Jan/09 08:26 PM
Attachment fsyncConcurrentReaders.txt [ 12398105 ]
Luke Lu added a comment - 16/Jan/09 11:48 PM
If TC3 is too hard in general. A TC2.5, like be able to open the file on the same machine to see the half block (after the fsync) would be nice. It would go a long way to make the transaction log work and make things restartable.

Luke Lu added a comment - 17/Jan/09 12:33 AM
@dhruba: this is the current (0.19.0) behavior: if a client writes a half block to a log in hdfs and calls fsync and then got killed. When the client restarts (on the same node) to read the log, the half block is not there, until you restart the entire hadoop cluster and wait for a few minutes.

dhruba borthakur added a comment - 17/Jan/09 12:47 AM
@Luke: the attached patch should solve your problem. If you have the time, would it be possible for you to apply this patch and then retry your test case?

Luke Lu added a comment - 18/Jan/09 04:50 PM
@dhruba: unfortunately the patch (applied to the stock 0.19.0) didn't work. The files did show non-zero size after the sync but they're all 7 bytes. Restarting HDFS make the correct content show up.

Doug Judd added a comment - 18/Jan/09 05:31 PM
Hi Dhruba,

I've been working with Luke a little on this. Here are more details. The log that gets written in the test is very small. The first thing the software does when it creates the log is it writes a 7-byte header. Then later as the test proceeds, the system will append a small entry and the do a sync. We use the FSDataOutputStream class. The sequence of operations looks something like this:

out_stream.write(data, 0, amount);
out_stream.sync()
[...]

When the test completes, all of the logs are exactly 7 bytes long. It remains this way even if I wait 10 minutes or kill the Hypertable java process and wait several minutes as well. Here is the listing:

[doug@motherlode000 aol-basic]$ hadoop fs -ls /hypertable/servers/10.0.30.1*_38060/log/range_txn
rw-rr- 3 doug supergroup 7 2009-01-17 19:52 /hypertable/servers/10.0.30.102_38060/log/range_txn/0.log
rw-rr- 3 doug supergroup 7 2009-01-17 19:52 /hypertable/servers/10.0.30.104_38060/log/range_txn/0.log
rw-rr- 3 doug supergroup 7 2009-01-17 19:52 /hypertable/servers/10.0.30.106_38060/log/range_txn/0.log
rw-rr- 3 doug supergroup 7 2009-01-17 19:52 /hypertable/servers/10.0.30.108_38060/log/range_txn/0.log
rw-rr- 3 doug supergroup 7 2009-01-17 19:52 /hypertable/servers/10.0.30.110_38060/log/range_txn/0.log
rw-rr- 3 doug supergroup 7 2009-01-17 19:52 /hypertable/servers/10.0.30.112_38060/log/range_txn/0.log
rw-rr- 3 doug supergroup 7 2009-01-17 19:52 /hypertable/servers/10.0.30.114_38060/log/range_txn/0.log
rw-rr- 3 doug supergroup 7 2009-01-17 19:52 /hypertable/servers/10.0.30.116_38060/log/range_txn/0.log

After shutting down HDFS and restarting it again, the listing looks like this:

[doug@motherlode000 aol-basic]$ hadoop fs -ls /hypertable/servers/10.0.30.1*_38060/log/range_txn
rw-rr- 3 doug supergroup 564 2009-01-17 19:52 /hypertable/servers/10.0.30.102_38060/log/range_txn/0.log
rw-rr- 3 doug supergroup 84 2009-01-17 19:52 /hypertable/servers/10.0.30.104_38060/log/range_txn/0.log
rw-rr- 3 doug supergroup 1063 2009-01-17 19:52 /hypertable/servers/10.0.30.106_38060/log/range_txn/0.log
rw-rr- 3 doug supergroup 634 2009-01-17 19:52 /hypertable/servers/10.0.30.108_38060/log/range_txn/0.log
rw-rr- 3 doug supergroup 217 2009-01-17 19:52 /hypertable/servers/10.0.30.110_38060/log/range_txn/0.log
rw-rr- 3 doug supergroup 1943 2009-01-17 19:52 /hypertable/servers/10.0.30.112_38060/log/range_txn/0.log
rw-rr- 3 doug supergroup 1072 2009-01-17 19:52 /hypertable/servers/10.0.30.114_38060/log/range_txn/0.log
rw-rr- 3 doug supergroup 525 2009-01-17 19:52 /hypertable/servers/10.0.30.116_38060/log/range_txn/0.log

The last time I ran this test I encountered a problem where it appeared that some of our commits were lost. Here's what I did:

1. ran tests (which create a table with 75274825 cells)
2. kill Hypertable
3. shutdown HDFS
4. restart HDFS
5. restart Hypertable (which re-plays the commit logs)
6. dumped the table

The table dump in #6 came up short (e.g. 72M entries). It appears that some of the commit logs (different log than the range_txn log) came back up incomplete.

Let us know if you want us to run an instrumented version or anything. We can send you the Hadoop log files if that helps. Thanks!

  • Doug

Luke Lu added a comment - 19/Jan/09 07:15 PM
On the design of the TC3: I think that client informing namenode on each sync is too expensive and not scalable. Hypertable can easily generate 1M transactions/s from a small cluster, which the namenode cannot possibly handle. Since restarting HDFS incurs no data loss, I think the basic logic is probably already fine. For most cases, I think that making new readers a bit more expensive to create is probably the right trade-off:

1. We could append and sync a file and only inform the namenode on the first sync, hinting that we need to do something special for the new readers of the file.
2. Upon creating a new reader for the file, namenode would request a block report (an incremental one, if we try to be clever) from the involved data nodes and return the right info the reader.

I think that this would cover most of the use cases for append and sync reasonably.


Jim Kellerman added a comment - 19/Jan/09 10:31 PM
HBase really needs visibility to partial blocks and restarting HDFS is not an option for a 100+ node cluster.

We don't sync after every record, but sync after a configurable number of writes or after a configurable amount of time has passed and
there is unsync'd data.

This really needs to be addressed in 0.19.1, 0.20.1 and trunk.


Jim Kellerman made changes - 19/Jan/09 10:35 PM
Link This issue blocks HBASE-1138 [ HBASE-1138 ]
Jim Kellerman added a comment - 20/Jan/09 01:29 AM
First of two test files demonstrating the problem

Jim Kellerman made changes - 20/Jan/09 01:29 AM
Attachment Writer.java [ 12398276 ]
Jim Kellerman added a comment - 20/Jan/09 01:30 AM
Second of two test programs that demonstrate the problem.

Jim Kellerman made changes - 20/Jan/09 01:30 AM
Attachment Reader.java [ 12398277 ]
Jim Kellerman added a comment - 20/Jan/09 01:47 AM
I have attached two test programs that demonstrate the problem.

Writer.java writes 100 records of approx 1KB each per sync. Its command line argument is the number of sync's to perform before aborting.
Reader.java is run after Writer and tries to read records from the file created by Writer.

If you run Writer with 4096 syncs:

bin/hadoop Writer 4096

and then run Reader, it gets an EOFException without having read any records (note that exception occurs in constructor:

Exception in thread "main" java.io.EOFException
        at java.io.DataInputStream.readFully(Unknown Source)
        at java.io.DataInputStream.readFully(Unknown Source)
        at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1450)
        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1428)
        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1417)
        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1412)
        at Reader.main(Reader.java:34)

If you run Writer with 8192 syncs:

bin/hadoop Writer 8192

and then run Reader, it reads 684783 records before hitting an EOFException (in next())

Read 684783 lines
java.io.EOFException
        at java.io.DataInputStream.readFully(Unknown Source)
        at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:63)
        at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1930)
        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1830)
        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1876)
        at Reader.main(Reader.java:41)

684783 records that are approximately 1KB in length is approximately 64MB (a hadoop block).

In this test the partial block containing 134417 records is lost (even though a considerable
number of syncs took place between filling the first block and the end of Writer).

So you can read from an improperly closed file but only up to the last full block, confirming
Luke's observation.

I will look into the patch shortly.


dhruba borthakur added a comment - 20/Jan/09 08:57 AM
Thanks Doug and Luke for the comments and Jim for providing a good test case. I will incorporate your comments and post a new patch. (Jim, you need not test the existing patch associated with this JIRA)

Repository Revision Date User Message
ASF #736144 Tue Jan 20 22:33:08 UTC 2009 jimk HBASE-1138 Test that readers opened after a sync can see all data up to the sync (temporary until HADOOP-4379 is resolved)
Files Changed
MODIFY /hadoop/hbase/trunk/CHANGES.txt
MODIFY /hadoop/hbase/trunk/src/java/org/apache/hadoop/hbase/regionserver/HLog.java

Jim Kellerman made changes - 23/Jan/09 01:50 AM
Link This issue blocks HBASE-1148 [ HBASE-1148 ]
dhruba borthakur added a comment - 23/Jan/09 10:26 AM
This patch implements the following (as specified in the design doc for Appends, HADOOP-1700):
A reader checks to see if the file is being written to by another writer. if so, it fetches the size of the last block from the primary datanode. This does not have any performance impact for non-concurrent readers not does it have any impact on sync().

dhruba borthakur made changes - 23/Jan/09 10:26 AM
Attachment fsyncConcurrentReaders3.patch [ 12398542 ]
Jim Kellerman added a comment - 25/Jan/09 11:49 PM
@Dhruba

I tried applying fsyncConcurrentReaders3.patch to 0.19.0, 0.19 branch (latest), 0.20 branch (latest) and trunk. It did not
apply cleanly to any, but seemed to have less problems with trunk. Building it, and running the Writer and Reader
programs I attached above, they behaved the same way, i.e., the Reader could not see any sync'd data until the first
complete block had been written, and then it could only see the data in the first complete block and none of the data
in the second partial block.


dhruba borthakur added a comment - 26/Jan/09 10:39 AM
@Jim: The SequenceFile.Writer.sync() does not actually invoke the FSDataOutputStream.sync.

The FSDataOutputStream.sync() is a heavy weight operation and the SequenceFile.Writer.sync() is typically invoked with small chunks of data, that's the reason why the SequenceFile class does not automatically invoke the fs sync call.

I changed your test case to invoke the FSDaOutputStream.sync call but still see some problems. Will post an update as soon as I get to the bottom of this.


Jim Kellerman added a comment - 26/Jan/09 06:55 PM
@Dhruba

If I have a SequenceFile.Writer, how do I get the FSDataOutputStream? It is a protected member of SequenceFile.Writer.
Are you planning to add an accessor?

Thanks.


Jim Kellerman made changes - 26/Jan/09 06:55 PM
Link This issue blocks HBASE-1138 [ HBASE-1138 ]
Jim Kellerman made changes - 26/Jan/09 07:03 PM
Link This issue blocks HBASE-1155 [ HBASE-1155 ]
Doug Cutting added a comment - 26/Jan/09 07:20 PM
> how do I get the FSDataOutputStream?

I think it would be better to add a SequenceFile.Writer method that flushes and calls FSDataOutputStream.sync. The problem is that SequenceFile.Writer#sync() should probably be called flush(), not sync(), but that'd be hard to change. So perhaps we should add a SequenceFile.Writer#syncFs() method?


dhruba borthakur added a comment - 26/Jan/09 07:29 PM
I will add a new method SequenceFile.Writer.syncFS(), test it, and upload a new patch. Thanks Doug and Jim.

Doug Judd added a comment - 26/Jan/09 10:12 PM
I tried the test again and still no luck. To recap, here's how the log file is created:

out_stream.write(header, 0, 7);
out_stream.sync()
out_stream.write(data, 0, amount);
out_stream.sync()
[...]

After the test finished, I shut down the Hypertable servers. This time the listing shows the files to be 0 bytes in length (as opposed to 7 bytes with the previous patch):

[doug@motherlode000 aol-basic]$ hadoop fs -ls /hypertable/servers/10.0.30.1*_38060/log/range_txn
rw-rr- 3 doug supergroup 0 2009-01-26 11:40 /hypertable/servers/10.0.30.102_38060/log/range_txn/0.log
rw-rr- 3 doug supergroup 0 2009-01-26 11:40 /hypertable/servers/10.0.30.104_38060/log/range_txn/0.log
rw-rr- 3 doug supergroup 0 2009-01-26 11:40 /hypertable/servers/10.0.30.106_38060/log/range_txn/0.log
rw-rr- 3 doug supergroup 0 2009-01-26 11:40 /hypertable/servers/10.0.30.108_38060/log/range_txn/0.log
rw-rr- 3 doug supergroup 0 2009-01-26 11:40 /hypertable/servers/10.0.30.110_38060/log/range_txn/0.log
rw-rr- 3 doug supergroup 0 2009-01-26 11:40 /hypertable/servers/10.0.30.112_38060/log/range_txn/0.log
rw-rr- 3 doug supergroup 0 2009-01-26 11:40 /hypertable/servers/10.0.30.114_38060/log/range_txn/0.log
rw-rr- 3 doug supergroup 0 2009-01-26 11:40 /hypertable/servers/10.0.30.116_38060/log/range_txn/0.log

When the RangeServer starts up again, it discovers that the log file (range_txn/0.log) does exist, so it starts the recovery process. However, it only sees the 7 byte header. All of the subsequent log appends do not appear in the log file. So the system starts up without recovering any of the data.

BTW, in this particular circumstance, there no other writer writing to the file when the range server comes up and reads it. Here's the high-level of what's going on:

RangeServer opens an FSDataOutputStream to the log and starts appending to it
RangeServer is killed with 'kill -9"
RangeServer comes up again and reads the log

In your above note you said, "A reader checks to see if the file is being written to by another writer. if so, it fetches the size of the last block from the primary datanode." This is not the case with our test, there is no writer writing to the log when we try to read it.

  • Doug

Doug Judd added a comment - 26/Jan/09 11:47 PM
This may be a problem on our end. When I shut down hypertable and then copy each range_txn log file to the local filesystem, the files seem to contain more than just the 7-byte header.

[doug@motherlode000 aol-basic]$ ls -l rsml/
total 32
rw-rw-r- 1 doug doug 564 Jan 26 15:35 102.log
rw-rw-r- 1 doug doug 84 Jan 26 15:35 104.log
rw-rw-r- 1 doug doug 1206 Jan 26 15:35 106.log
rw-rw-r- 1 doug doug 634 Jan 26 15:35 108.log
rw-rw-r- 1 doug doug 84 Jan 26 15:35 110.log
rw-rw-r- 1 doug doug 1942 Jan 26 15:36 112.log
rw-rw-r- 1 doug doug 1074 Jan 26 15:36 114.log
rw-rw-r- 1 doug doug 525 Jan 26 15:36 116.log


Doug Judd added a comment - 27/Jan/09 12:30 AM
It looks like our code depends on the length, as returned by getFileStatus:

length = mFilesystem.getFileStatus(new Path(fileName)).getLen();

It appears that the length returned by this API call is not correct. Can you take a look at it and maybe patch this one as well?


dhruba borthakur added a comment - 27/Jan/09 01:09 AM
Hi Doug,

The file length (as returned by getFileStatus) will not change at every write from the client to the datanode. Similarly, every fsync call from the client does not reach the namenode (only the first one per block reaches the namenode). That means the namenode has no good way to know the size of a block when the block is being written to by a writer.

In your case, the writer has died. The namenode has a timeout of 1 hour before it starts lease recovery for this file. The lease recovery process will set the correct file size on the namenode metadata. If you do not want to wait for one hour, then you can manually trigger lease recovery from your application by trying to reopen the file for append(please use FileSystem.append(pathname)). Lease recovery will update the true length of the file in the namenode metadata.


dhruba borthakur added a comment - 27/Jan/09 02:07 AM
An updated version of the writer than invokes SequenceFile.Writer.syncFs().

dhruba borthakur made changes - 27/Jan/09 02:07 AM
Attachment Writer.java [ 12398789 ]
dhruba borthakur added a comment - 27/Jan/09 02:08 AM
This read waits for lease recovery to complete before reading data.

dhruba borthakur made changes - 27/Jan/09 02:08 AM
Attachment Reader.java [ 12398790 ]
dhruba borthakur added a comment - 27/Jan/09 02:08 AM
This patch should merge smoothly with trunk.

dhruba borthakur made changes - 27/Jan/09 02:08 AM
Attachment fsyncConcurrentReaders4.patch [ 12398791 ]
dhruba borthakur added a comment - 27/Jan/09 02:12 AM
@Jim: I updated your test cases Reader.java and Writer.java and have successfully tested them at my site. Please let me know if you still see problems.

@Doug: Please use the latest patch that I uploaded. This has the SequenceFile.Writer.syncFS() method.

(Also, I suggest that you should pull in the patch available at HADOOP-5027)


Doug Judd added a comment - 27/Jan/09 07:10 AM
Hi Dhruba,

I tried your suggestion, but got the following exception when trying to open the file with the 'append' method:

SEVERE: I/O exception while getting length of file '/hypertable/servers/10.0.30.102_38060/log/range_txn/0.log' - org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create file /hypertable/servers/10.0.30.102_38060/log/range_txn/0.log for DFSClient_2003773208 on client 10.0.30.102, because this file is already being created by DFSClient_423127459 on 10.0.30.102
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:1088)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:1177)
at org.apache.hadoop.hdfs.server.namenode.NameNode.append(NameNode.java:321)
at sun.reflect.GeneratedMethodAccessor50.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:892)

I could re-write the whole thing to not be dependent on knowing log length, however, it seems like it ought to be possible to obtain the actual file length in this situation. The semantics of getFileStatus() seem a little odd. Sometimes it returns the actual length of the file and sometimes it returns a stale version of the length. I suppose this is ok as long as it is well documented. But it should be possible to obtain the actual length of a file. Would it be possible to add a FileSystem::length(Path path) method that returns the accurate file length by fetching the size of the last block from the primary datanode?

  • Doug

dhruba borthakur added a comment - 27/Jan/09 07:17 PM
Hi Doug,

the exception is expected. Please look at the test file Reader.java that I attached to this JIRA. It shows how the reader waits for the lease recovery to end (to ensure that correct file size is updated on namenode). Please let me know if this approach is suitable for your application.

We could enhance FileSystem.getFileStatus() to contact the datanode and retrieve and return the most-current file length (only for files that have a concurrent writer). This will not have have performance impact for most map-reduce applications.


Jim Kellerman added a comment - 27/Jan/09 11:54 PM
Tested this patch in conjunction with HADOOP-5027, Works!!!!!!!!!!!!!!!

I would like to see this in 0.19.1, 0.20.0 and trunk.

+++++1


dhruba borthakur added a comment - 27/Jan/09 11:58 PM
Thanks Jim. I will wait to hear from Doug Judd on whether it solves his problem.

Doug Judd added a comment - 28/Jan/09 12:13 AM
Now when I apply your latest patch and the one from 5027 to the 0.19.0 source, the datanodes seem to be going into an infinite loop of NullPointerExceptions. At the top of the hadoop-zvents-datanode-motherlode007.admin.zvents.com.log I'm seeing this:

[doug@motherlode007 logs]$ more hadoop-zvents-datanode-motherlode007.admin.zvents.com.log
2009-01-27 15:32:55,828 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting DataNode
STARTUP_MSG: host = motherlode007.admin.zvents.com/10.0.30.114
STARTUP_MSG: args = []
STARTUP_MSG: version = 0.19.1-dev
STARTUP_MSG: build = -r ; compiled by 'doug' on Tue Jan 27 15:04:06 PST 2009
************************************************************/
2009-01-27 15:32:57,041 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: motherlode000/10.0.30.100:9000. Already tried 0 time(s).
2009-01-27 15:33:00,505 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Registered FSDatasetStatusMBean
2009-01-27 15:33:00,507 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Opened info server at 50010
2009-01-27 15:33:00,510 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Balancing bandwith is 1048576 bytes/s
2009-01-27 15:33:00,783 INFO org.mortbay.http.HttpServer: Version Jetty/5.1.4
2009-01-27 15:33:00,792 INFO org.mortbay.util.Credential: Checking Resource aliases
2009-01-27 15:33:01,839 INFO org.mortbay.util.Container: Started org.mortbay.jetty.servlet.WebApplicationHandler@319c0bd6
2009-01-27 15:33:01,878 INFO org.mortbay.util.Container: Started WebApplicationContext[/static,/static]
2009-01-27 15:33:02,048 INFO org.mortbay.util.Container: Started org.mortbay.jetty.servlet.WebApplicationHandler@5a943dc4
2009-01-27 15:33:02,049 INFO org.mortbay.util.Container: Started WebApplicationContext[/logs,/logs]
2009-01-27 15:33:02,754 INFO org.mortbay.util.Container: Started org.mortbay.jetty.servlet.WebApplicationHandler@6d581e80
2009-01-27 15:33:02,760 INFO org.mortbay.util.Container: Started WebApplicationContext[/,/]
2009-01-27 15:33:02,763 INFO org.mortbay.http.SocketListener: Started SocketListener on 0.0.0.0:50075
2009-01-27 15:33:02,764 INFO org.mortbay.util.Container: Started org.mortbay.jetty.Server@5c435a3a
2009-01-27 15:33:02,769 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=DataNode, sessionId=null
2009-01-27 15:33:02,825 INFO org.apache.hadoop.ipc.metrics.RpcMetrics: Initializing RPC Metrics with hostName=DataNode, port=50020
2009-01-27 15:33:02,831 INFO org.apache.hadoop.ipc.Server: IPC Server Responder: starting
2009-01-27 15:33:02,834 INFO org.apache.hadoop.ipc.Server: IPC Server handler 0 on 50020: starting
2009-01-27 15:33:02,834 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 50020: starting
2009-01-27 15:33:02,836 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: dnRegistration = DatanodeRegistration(motherlode007.admin.zvents.com:50010, storageID
=DS-745224472-10.0.30.114-50010-1230665635246, infoPort=50075, ipcPort=50020)
2009-01-27 15:33:02,837 INFO org.apache.hadoop.ipc.Server: IPC Server handler 1 on 50020: starting
2009-01-27 15:33:02,837 INFO org.apache.hadoop.ipc.Server: IPC Server handler 2 on 50020: starting
2009-01-27 15:33:02,839 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.0.30.114:50010, storageID=DS-745224472-10.0.30.114-50010-1230
665635246, infoPort=50075, ipcPort=50020)In DataNode.run, data = FSDataset{dirpath='/data1/hadoop/dfs/data/current,/data2/hadoop/dfs/data/current,/data3/hadoop/dfs /data/current'}
2009-01-27 15:33:02,840 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: using BLOCKREPORT_INTERVAL of 3600000msec Initial delay: 0msec
2009-01-27 15:33:02,932 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: org.apache.hadoop.ipc.RemoteException: java.io.IOException: java.lang.NullPointerException
at org.apache.hadoop.hdfs.server.namenode.DatanodeDescriptor.reportDiff(DatanodeDescriptor.java:396)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.processReport(FSNamesystem.java:2803)
at org.apache.hadoop.hdfs.server.namenode.NameNode.blockReport(NameNode.java:636)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:892)

at org.apache.hadoop.ipc.Client.call(Client.java:696)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
at $Proxy4.blockReport(Unknown Source)
at org.apache.hadoop.hdfs.server.datanode.DataNode.offerService(DataNode.java:723)
at org.apache.hadoop.hdfs.server.datanode.DataNode.run(DataNode.java:1100)
at java.lang.Thread.run(Thread.java:619)

2009-01-27 15:33:02,953 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: org.apache.hadoop.ipc.RemoteException: java.io.IOException: java.lang.NullPointerExce
ption
at org.apache.hadoop.hdfs.server.namenode.DatanodeDescriptor.reportDiff(DatanodeDescriptor.java:396)
[...]

And the file is growing rapidly with the following exceptions tacked on to the end:

2009-01-27 16:10:55,973 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: org.apache.hadoop.ipc.RemoteException: java.io.IOException: java.lang.NullPointerException

at org.apache.hadoop.ipc.Client.call(Client.java:696)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
at $Proxy4.blockReport(Unknown Source)
at org.apache.hadoop.hdfs.server.datanode.DataNode.offerService(DataNode.java:723)
at org.apache.hadoop.hdfs.server.datanode.DataNode.run(DataNode.java:1100)
at java.lang.Thread.run(Thread.java:619)

It appears that these exceptions started happening within about 5 seconds after startup, so it doesn't look like it has anything to do with Hypertable. Is it ok to apply these patches to the 0.19.0 source? Or should I be applying them to the trunk?

  • Doug

dhruba borthakur added a comment - 28/Jan/09 12:28 AM
I think these exceptions are related to the patch in HADOOP-5027. let me check.

Doug Judd added a comment - 28/Jan/09 01:20 AM
Hi Dhruba,

When I drop the patch for 5027, the NullPointerExceptions go away. However, I'm still having troubles. It appears for some of the files, your trick for obtaining the real length does not work. The waitForLease spins indefinitely. For example, here's the log line for our HdfsBroker process where it is hanging:

Jan 27, 2009 5:08:53 PM org.hypertable.DfsBroker.hadoop.HdfsBroker Length
INFO: Getting length of file '/hypertable/servers/10.0.30.106_38060/log/metadata/0

Then in the namenode log file, I see things like this (I've attached the compressed namenode log):

2009-01-27 17:08:53,173 WARN org.apache.hadoop.hdfs.StateChange: DIR* NameSystem.startFile: failed to create file /hypertable/servers/10.0.30.106_38060/log/metadat
a/0 for DFSClient_-1860021252 on client 10.0.30.106 because current leaseholder is trying to recreate file.
2009-01-27 17:08:53,174 INFO org.apache.hadoop.ipc.Server: IPC Server handler 4 on 9000, call append(/hypertable/servers/10.0.30.106_38060/log/metadata/0, DFSClien
t_-1860021252) from 10.0.30.106:43507: error: org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create file /hypertable/servers/10.0.30.106_3
8060/log/metadata/0 for DFSClient_-1860021252 on client 10.0.30.106 because current leaseholder is trying to recreate file.
org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create file /hypertable/servers/10.0.30.106_38060/log/metadata/0 for DFSClient_-1860021252
on client 10.0.30.106 because current leaseholder is trying to recreate file.
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:1065)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:1177)
at org.apache.hadoop.hdfs.server.namenode.NameNode.append(NameNode.java:321)
at sun.reflect.GeneratedMethodAccessor14.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:892)


Doug Judd made changes - 28/Jan/09 01:22 AM
Attachment hypertable-namenode.log.gz [ 12398865 ]
Tsz Wo (Nicholas), SZE added a comment - 28/Jan/09 01:32 AM
> I think these exceptions are related to the patch in HADOOP-5027.

Please do not use my patch in HADOOP-5027. It is not clear that whether the fix is good.


dhruba borthakur added a comment - 28/Jan/09 07:20 AM
@Doug: the waiting for lease recovery to reclaim the length of the jar file will work only if the original writer has dies and now a new reader wants to read all the data that was written by the writer before it died. This was the use-case for Jim Kellerman.

In your case, you have a FileSystem object that was used to write to the file. Then your application tried to reopen the same file by invoking fs.apend() on the same FileSystem object. The namenode logs clearly shows that the same dfsclient (i.e. filesystem object) was used to try appending to the file. In this case, the namenode rejects the append call because it knows that the original writer is alive and is still writing to it. If your aim is to have a concurrent reader along with a concurrent writer, then the best that this patch can do is to allow the concurent reader to see the new file length only when the block is full. On the other hand, if you can make your application ot depend on the file length, then you can see all the data in the file. Another alternative would be to implement a new call FileSystem.length() than can retrieve the latest length from the datanode, but that can be done as part of a separate JIRA. Please let us know how difficult is it for you to change your app to not depend on the file length and just read till end-of-file?


Doug Judd added a comment - 28/Jan/09 09:08 AM
Dhruba,

The application is not trying to reopen a file that it already has open. It appears that HDFS is getting confused and thinks that this is the case. One thing that is slightly different in this case is that nothing gets written to the file by the original writer. Here is the sequence of relevant operations:

File gets created
[nothing gets appended to it]
process gets killed
process starts up again
File is reopened with append to obtain length

Can you verify that this particular usage pattern is handled properly? I'll try to come up with a stripped down test case tomorrow.

  • Doug

Konstantin Shvachko added a comment - 28/Jan/09 08:13 PM

Doug Judd made changes - 28/Jan/09 10:03 PM
Attachment ReopenProblem.java [ 12398932 ]
Doug Judd made changes - 28/Jan/09 10:04 PM
Attachment reopen_test.sh [ 12398933 ]
Doug Judd added a comment - 28/Jan/09 10:10 PM
Ok, I've uploaded a reproducible test case (see attachments ReopenProblem.java and reopen_test.sh). This test consistently hangs on both my local mac HDFS installation as well as our 10 node Linux HDFS cluster. It's not as simple as I suggested in my previous comment as you can see by the code. Here's the output I see when I run it:

[doug@motherlode000 0.9.2.1]$ ./reopen_test.sh
Deleted hdfs://motherlode000:9000/hypertable/servers
rmr: cannot remove /hypertable/tables: No such file or directory.
Waiting for lease recovery ...
Waiting for lease recovery ...
Waiting for lease recovery ...
Waiting for lease recovery ...
Waiting for lease recovery ...
Waiting for lease recovery ...
Waiting for lease recovery ...
Waiting for lease recovery ...
Waiting for lease recovery ...
Waiting for lease recovery ...
Waiting for lease recovery ...
Waiting for lease recovery ...
Waiting for lease recovery ...
length("/hypertable/servers/10.0.30.114_38060/log/root/0") = 163, obtained in 65258 milliseconds
length("/hypertable/servers/10.0.30.114_38060/log/root/0") = 163, obtained in 113 milliseconds
Read 163 bytes from root fragment 0
Read 0 bytes from root fragment 0
Waiting for lease recovery ...
Waiting for lease recovery ...
Waiting for lease recovery ...
Waiting for lease recovery ...
Waiting for lease recovery ...
Waiting for lease recovery ...
Waiting for lease recovery ...
Waiting for lease recovery ...
Waiting for lease recovery ...
Waiting for lease recovery ...
Waiting for lease recovery ...
Waiting for lease recovery ...
Waiting for lease recovery ...
Waiting for lease recovery ...
Waiting for lease recovery ...
Waiting for lease recovery ...
Waiting for lease recovery ...
Waiting for lease recovery ...
Waiting for lease recovery ...
Waiting for lease recovery ...
Waiting for lease recovery ...
Waiting for lease recovery ...
Waiting for lease recovery ...
Waiting for lease recovery ...
Waiting for lease recovery ...
Waiting for lease recovery ...
Waiting for lease recovery ...
Waiting for lease recovery ...
Waiting for lease recovery ...
Waiting for lease recovery ...
Waiting for lease recovery ...
Waiting for lease recovery ...
Waiting for lease recovery ...
[...]


dhruba borthakur added a comment - 30/Jan/09 06:28 PM
The Hbase folks have requested that this be fixed in the next release form the 0.19 branch.

dhruba borthakur made changes - 30/Jan/09 06:28 PM
Priority Major [ 3 ] Blocker [ 1 ]
dhruba borthakur added a comment - 04/Feb/09 09:48 AM
Hi Doug,

ReopenProblem.java is guaranteed to hang. At line 189, you create an file and get an handle on this file. Then , at line 197, you use the same file system object and invoke waitForLeaseRecovery(). waitForLeaseRecovery tries to open the same file again for appending to it. It fails as expected. The namenode log correctly shows that the "same leaseholder is trying to reopen the file for appends"!

does it make sense? Thanks, dhruba


Doug Judd added a comment - 04/Feb/09 03:25 PM
Hi Dhruba,

The file that is opened on line 189 ( /hypertable/servers/10.0.30.114_38060/log/root/1 ) is different than the one passed into waitForLeaseRecovery on line 197. On line 194, the file object gets reset to a different path ( /hypertable/servers/10.0.30.114_38060/log/metadata/0 ). So, this program should not hang.

  • Doug

Nigel Daley added a comment - 11/Feb/09 08:59 PM
As discussed on core-dev@ (http://www.nabble.com/Hadoop-0.19.1-td21739202.html) we will disable append in 0.19.1. Moving these append related issues to 0.19.2.

Nigel Daley made changes - 11/Feb/09 08:59 PM
Fix Version/s 0.19.1 [ 12313473 ]
Fix Version/s 0.19.2 [ 12313650 ]
Jim Kellerman made changes - 11/Feb/09 09:06 PM
Fix Version/s 0.20.0 [ 12313438 ]
Jim Kellerman added a comment - 25/Feb/09 11:31 PM - edited
Applying this patch to hadoop-0.19.0, it does work, but slowly:

It takes almost an hour to recover the lease of a file when the hdfs cluster is under load:

2009-02-25 21:39:16,843 DEBUG org.apache.hadoop.hbase.regionserver.HLog: Splitting 3 of 3: hdfs:/x.y.com:8100/hbase/log_10.76.44.139_1235597506284_8020/hlog.dat.1235597820662
2009-02-25 21:39:16,847 DEBUG org.apache.hadoop.hbase.regionserver.HLog: Triggering lease recovery.
...
2009-02-25 22:37:12,755 INFO org.apache.hadoop.hbase.regionserver.HLog: log file splitting completed for hdfs://x.y.com:8100/hbase/log_10.76.44.139_1235597506284_8020

Jim Kellerman added a comment - 26/Feb/09 06:28 PM
Is there something we can change to make lease recovery faster?
config parameter?
block size?
max file size?

Thanks.


Doug Judd added a comment - 26/Feb/09 06:47 PM
There should be a way to forcibly take over a lease. In our application, ownership of the various files is controlled at a higher level (i.e. Chubby). A server will not come up and attempt to recover a file (or check its length) unless it is the sole owner of the file (i.e. it will have obtained a Chubby lock).

Can we add some sort of API that allows a client to forcibly take over a lease immediately?


dhruba borthakur added a comment - 26/Feb/09 07:35 PM
There is ReopenProblem.java attched to this JIRA.

@Doug: Please use ReopenProblem.java:waitForLeaseRecovery(). This method triggers near-immediate lease recovery and can be used by any application code to take ownership of the file. Does this work for you?

@Jim: Were u using ReopenProblem.java:waitForLeaseRecovery() and still seeing that it takes upto an hour for the waitForLeaseRecovery() method to return success? This can happen if the original client that was writing to the file was alive and was in communication with the namenode. In this case, the new append-er will not be able to recover the original lease.


Jim Kellerman added a comment - 26/Feb/09 08:33 PM
> dhruba borthakur - 26/Feb/09 11:35 AM
> @Jim: Were u using ReopenProblem.java:waitForLeaseRecovery() and still seeing that it takes upto an hour for the
> waitForLeaseRecovery() method to return success?

Well, sort of. my code is similar but inline in the module that uses it:

// Recover the files lease if necessary
boolean recovered = false;
while (!recovered) {
  try {
    FSDataOutputStream out = fs.append(logfiles[i].getPath());
    out.close();
    recovered = true;
  } catch (IOException e) {
    if (LOG.isDebugEnabled()) {
      LOG.debug("Triggering lease recovery.");
    }
    try {
      Thread.sleep(leaseRecoveryPeriod);
    } catch (InterruptedException ex) {
      // ignore it and try again
    }
  }
}

> This can happen if the original client that was writing to the file was alive and was in communication with the namenode.
> In this case, the new append-er will not be able to recover the original lease.

This was definitely not the case. The original writing process was kill'ed -9

Like Hypertable, HBase has a mechanism to detect when the original writing process is dead.


dhruba borthakur added a comment - 26/Feb/09 09:38 PM
What is the value of leaseRecoveryPeriod in the above piece of code?

The fact that the lease recovery took 1 hour is very interesting. The namenode has two limits: the soft limit of 1 minute and the hard limit of 1 hour. When there is no ping from a writer for the soft limit period, then a new writer can recover the lease and append to the file. When there is no ping from a writer for the hard limit period, the namenode itself itself initiates lease recovery.

Is it possible for you to give me access to the namenode log along with the relevant filename in question?


Doug Judd added a comment - 26/Feb/09 10:10 PM
Hi Dhruba,

When you say "near-immediate" what does that mean? Is it the 1 minute soft limit? If so, that's way too long. If a node bounces in our cluster, we can't have the database locked up for an entire minute waiting for the lease to expire when we know for a 100% fact that the original leaseholder is dead. A one minute database lock-up would make Hypertable unsuitable for serving live data.

  • Doug

Jim Kellerman added a comment - 26/Feb/09 10:12 PM
leaseRecoveryPeriod is 10000ms

file name is hdfs://aa0-000-12.u.powerset.com:8100/hbase/log_208.76.44.139_1235681989771_8020/hlog.dat.1235682444379

I will attach the namenode log.


Jim Kellerman added a comment - 26/Feb/09 10:17 PM
Namenode log. Note that this run was done using the waitForLeaseRecovery() from ReopenProblem.java

Jim Kellerman made changes - 26/Feb/09 10:17 PM
Attachment namenode.log [ 12401070 ]
Doug Judd added a comment - 27/Feb/09 07:41 PM
We would like to see the soft lease limit configurable and settable to 0. Our application ensures that there is only one writer to each file. We need to be able to recover immediately.

Doug Judd made changes - 28/Feb/09 12:22 AM
Link This issue relates to HADOOP-5359 [ HADOOP-5359 ]
Jim Kellerman added a comment - 03/Mar/09 06:59 PM
We too would like to be able to recover the lease immediately as we know when we go to read the file that the
writer is dead.

Is there anything that we can do to speed this up? We can't contribute to Hadoop, but we are certainly willing to
test and stress test any patch that is available.


dhruba borthakur added a comment - 03/Mar/09 09:57 PM
Hi Jim,

In the namenode log, I see the followign statement:

NameSystem.startFile: failed to create file /hbase/log_208.76.44.139_1235681989771_8020/hlog.dat.1235682444379 for DFSClient_1908447348 on client 208.76.44.139 because current leaseholder is trying to recreate file.

This means that the same client that originally created the file is trying to re-open the file. Is this possible? It started at time 21:14 and continued all the way to 22:07. An attempt was made to recreate thias file every 10 seconds (which matches the periodicity that you set)


Jim Kellerman added a comment - 04/Mar/09 01:58 AM
No, the original writer was killed at 21:08 and never restarted.

Lease recovery started at 21:14 by another process on the same machine.


dhruba borthakur added a comment - 04/Mar/09 06:36 AM
Hi Jim, I think the namenode log that you posted is a partial one. It does not have calles to commitBlockSync (that should have occured when the new writer started to append to the file). if so, can you pl post the entire namenode log?

Jim Kellerman added a comment - 04/Mar/09 06:05 PM
@Dhruba

You were correct. I had INFO logging turned off. Here is a new name node log file with INFO enabled.


Jim Kellerman made changes - 04/Mar/09 06:05 PM
Attachment namenode.log [ 12401427 ]
Jim Kellerman added a comment - 04/Mar/09 06:11 PM
More info. When I generated the new namenode.log, the writer was killed at 16:54. The first attempt to grab the lease was at 16:56
and it was finally granted at 17:54.

Jim Kellerman added a comment - 09/Mar/09 07:54 PM
Dhruba,

What branch/revision did you test this patch against?

I've tried several in 0.19 branch (0.19.0, and revisions just before 0.19.1 (and before HADOOP-5225 was committed, but none are able to
recover the lease.

It appears that 5225 was not committed to 0.20.0 or to trunk. Is that correct? If so should I try a revision of Hadoop off of 0.20 or trunk?


Jim Kellerman added a comment - 09/Mar/09 08:56 PM
Also, were you just testing Reader and Writer on an unloaded cluster, or using some other application
against a cluster that was under load?

dhruba borthakur added a comment - 09/Mar/09 09:12 PM
HADOOP-5225 has been applied only to 0.19.

I have tested it only on 0.19 branch (not on trunk). Also with very light load. I will be testing it in scale on 0.19 very veru shortly (I am waiting for HADOOP-5332 to get committed.


dhruba borthakur made changes - 16/Apr/09 05:02 AM
Link This issue is related to HADOOP-4751 [ HADOOP-4751 ]
stack added a comment - 22/Apr/09 05:44 AM
@Dhruba Any luck testing at scale? (Looks like HADOOP-5332 was committed). Thanks.

dhruba borthakur added a comment - 22/Apr/09 02:49 PM
Hi Stack, I will post a new merged patch for this one. I am hoping that I get some file system folks to review it once I post the patch.

Nigel Daley made changes - 23/Apr/09 07:07 PM
Fix Version/s 0.20.0 [ 12313438 ]
dhruba borthakur added a comment - 04/May/09 04:59 AM
Merged patch with latest trunk.

dhruba borthakur made changes - 04/May/09 04:59 AM
Attachment fsyncConcurrentReaders5.txt [ 12407127 ]
dhruba borthakur made changes - 10/May/09 10:55 AM
Attachment fsyncConcurrentReaders6.patch [ 12407718 ]
stack added a comment - 12/May/09 12:54 AM
Tried the patch on loaded cluster. First attempt got file lease after about 30 seconds or so. Second test fell into a loop:
2009-05-12 00:51:16,031 [HMaster] INFO org.apache.hadoop.hdfs.DFSClient: Could not complete file /hbasetrunk2/.logs/aa0-000-14.u.powerset.com_1242088970041_60021/hlog.dat.1242089388713 retrying...
2009-05-12 00:51:16,441 [HMaster] INFO org.apache.hadoop.hdfs.DFSClient: Could not complete file /hbasetrunk2/.logs/aa0-000-14.u.powerset.com_1242088970041_60021/hlog.dat.1242089388713 retrying...
...

Sending Dhruba hdfs logs.


dhruba borthakur added a comment - 18/May/09 08:39 PM
Fixed block reports and blockReceived to handle change in generation stamp of a block.

dhruba borthakur made changes - 18/May/09 08:39 PM
Attachment fsyncConcurrentReaders9.patch [ 12408411 ]
stack added a comment - 18/May/09 11:56 PM
v9 works well in basic testing. If I crash the writer application, a subsequent reader can pick up edits from the crashed writer if the reader process does append, close and then reopen.

I've been trying to break the patch but it seems robust (v7 and v8 had issues that Dhruba fixed).

The append/close process can take from 10-60 seconds in my basic testing. Usually its about 20 seconds. It would be better if there were no pause. Can this be addressed?


stack added a comment - 19/May/09 12:06 AM
My testing was done with hadoop 0.20.0 (I amended patch to work with 0.20 branch).

Patch itself looks good. Minor comments are

+ Is there anything that can be done about duplicated test condition in BlockManager.java and DatanodeDescriptor.java. Can the test be grouped up into a method that is used in both places?
+ The if part of the if/else in DatanodeDescriptor.java is empty. Put a comment here? As is, its hard to read.

I haven't played much with killing writer application and the local datanode. Will try that now. Will also try confirm that we're reading up to the last sync before the kill. Will report back.


Hairong Kuang added a comment - 19/May/09 12:22 AM
If one of the datanodes in the pipeline has a very very slow disk, it is possible that when sync returns, not all the syned data get flushed to the disk yet. If this datanode dies and restarts before the lease recovery starts, I think the lease recovery will remove the synced data.

stack added a comment - 26/May/09 11:02 PM
After a chat with Hairong and Dhruba, it was thought that the wait on a successful append, close, then open to read was not necessary in this patch; that a new reader should be able to read up to the dead writers last sync. I tried it and all I get are EOFs when the new reader tries to read the dead writers file.

stack added a comment - 26/May/09 11:32 PM
... and the read finds a zero-length file (nothing read from the file).

Hairong Kuang added a comment - 27/May/09 06:10 PM
It seems to me that the current implementation of sync does not guarantee that synced data become visible to any new reader.

stack added a comment - 27/May/09 10:29 PM
I tested some more and new reader sees 0 data and gets an EOF each time.

I went back to testing the wait-on-successful-append, close, and new open for read and ran into the following issue again (I'd seen this in earlier patch) where we're spinnning waiting on the append to succeed:

2009-05-27 22:16:53,186 [HMaster] INFO org.apache.hadoop.hbase.regionserver.HLog: Failed open for append, waiting on lease recovery: hdfs://Y/hbase/.logs/X,1243460581044/hlog.dat.1243462406950
org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create file /hbase/.logs/X,1243460581044/hlog.dat.1243462406950 for DFSClient_-843497834 on client X.X.X.X, because this file is already being created by DFSClient_-181950545 on Y.Y.Y.Y
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:1058)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:1146)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.append(NameNode.java:392)
    at sun.reflect.GeneratedMethodAccessor18.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953)

    at org.apache.hadoop.ipc.Client.call(Client.java:739)
    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
    at $Proxy0.append(Unknown Source)
    at sun.reflect.GeneratedMethodAccessor11.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
    at $Proxy0.append(Unknown Source)
    at org.apache.hadoop.hdfs.DFSClient.append(DFSClient.java:487)
    at org.apache.hadoop.hdfs.DistributedFileSystem.append(DistributedFileSystem.java:186)
    at org.apache.hadoop.fs.FileSystem.append(FileSystem.java:525)
    at org.apache.hadoop.hbase.regionserver.HLog.recoverLog(HLog.java:956)
    at org.apache.hadoop.hbase.regionserver.HLog.splitLog(HLog.java:800)
    at org.apache.hadoop.hbase.regionserver.HLog.splitLog(HLog.java:753)
    at org.apache.hadoop.hbase.master.ProcessServerShutdown.process(ProcessServerShutdown.java:248)
    at org.apache.hadoop.hbase.master.HMaster.processToDoQueue(HMaster.java:459)
    at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:395)

... and then this started up and won't stop:

2009-05-27 22:16:59,753 [HMaster] INFO org.apache.hadoop.hdfs.DFSClient: Could not complete file /hbase/.logs/X,60021,1243460581044/hlog.dat.1243462406950 retrying...
2009-05-27 22:17:00,163 [HMaster] INFO org.apache.hadoop.hdfs.DFSClient: Could not complete file /hbase/.logs/X,1243460581044/hlog.dat.1243462406950 retrying...
2009-05-27 22:17:00,573 [HMaster] INFO org.apache.hadoop.hdfs.DFSClient: Could not complete file /hbase/.logs/X,1243460581044/hlog.dat.1243462406950 retrying...
.....

Namenode log is here: http://www.duboce.net/~stack/could_not_complete_file_nn.log.gz


stack added a comment - 28/May/09 08:52 PM
In this test run, the append never succeeds.... or at least, after 25 minutes it still has not successfully done the append open. We try the append, fail with an AlreadyBeingCreatedException, sleep a second, and then cycle. Usually it takes well under a minute to successfully open-to-append. Namenode log is here: www.duboce.net:~stack/wontstop_namenode.log.gz. In this case, I killed datanode and the hbase regionserver simulating a machine falling off the cluster (Previous, I was mostly just killing the server process and not the datanode).

dhruba borthakur added a comment - 09/Jun/09 06:48 PM
Patch for 0.20 branch. Fixes a concurrentmodificationexception.

dhruba borthakur made changes - 09/Jun/09 06:48 PM
Attachment fsyncConcurrentReaders11_20.txt [ 12410241 ]
stack added a comment - 18/Jun/09 06:48 PM
I can manufacture a situation where we're stuck trying to open-for-append and we never seem to succeed (It ran all night last night cycling trying to open-for-append).

4 servers. All running datanodes and hbase regionservers. Running head of the hadoop 0.20.0 branch with fsyncConcurrentReaders11_20.txt applied. In hbase, I have dfs.append set to true.

Start up an upload. After a few minutes, kill datanode on ..44.142. I leave the hbase regionserver on ..44.142 running.

After a little while, the regionserver on ..44.139 fails because of the following DFSClient exception:

2009-06-18 18:24:28,514 [regionserver/0:0:0:0:0:0:0:0:60021.cacheFlusher] FATAL org.apache.hadoop.hbase.regionserver.MemcacheFlusher: Replay of hlog required. Forcing server shutdown
org.apache.hadoop.hbase.DroppedSnapshotException: region: TestTable,0452187774,1245349374706
    at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:936)
    at org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:829)
    at org.apache.hadoop.hbase.regionserver.MemcacheFlusher.flushRegion(MemcacheFlusher.java:268)
    at org.apache.hadoop.hbase.regionserver.MemcacheFlusher.run(MemcacheFlusher.java:149)
Caused by: java.io.IOException: Bad connect ack with firstBadLink *.*.44.142:51010
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2871)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2794)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2078)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2264)

(Why does this fail? Why does one datanode missing cause this fail?)

The hbase master, also on ..44.139, then goes to recover the files of the crashed regionserver process. It gets stuck looping trying to open-for-append:

2009-06-18 18:25:04,587 [HMaster] INFO org.apache.hadoop.hbase.regionserver.HLog: Failed open for append, waiting on lease recovery: hdfs://aa0-000-12.u.powerset.com:9002/hbase/.logs/aa0-000-12.u.powerset.com,60021,1245348995684/hlog.dat.1245349434556
org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create file /hbase/.logs/aa0-000-12.u.powerset.com,60021,1245348995684/hlog.dat.1245349434556 for DFSClient_2060803895 on client *.*.44.139, because this file is already being created by DFSClient_-937484673 on *.*.44.139
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:1058)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:1146)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.append(NameNode.java:392)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953)

    at org.apache.hadoop.ipc.Client.call(Client.java:739)
    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
    at $Proxy0.append(Unknown Source)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
    at $Proxy0.append(Unknown Source)
    at org.apache.hadoop.hdfs.DFSClient.append(DFSClient.java:487)
    at org.apache.hadoop.hdfs.DistributedFileSystem.append(DistributedFileSystem.java:186)
    at org.apache.hadoop.fs.FileSystem.append(FileSystem.java:525)
    at org.apache.hadoop.hbase.regionserver.HLog.recoverLog(HLog.java:1037)
    at org.apache.hadoop.hbase.regionserver.HLog.splitLog(HLog.java:816)
    at org.apache.hadoop.hbase.regionserver.HLog.splitLog(HLog.java:756)
    at org.apache.hadoop.hbase.master.ProcessServerShutdown.process(ProcessServerShutdown.java:274)
    at org.apache.hadoop.hbase.master.HMaster.processToDoQueue(HMaster.java:467)
    at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:401)

Let me attach the namenode log.


stack added a comment - 18/Jun/09 06:52 PM
Namenode log that covers the forever cycle on open-for-append

stack made changes - 18/Jun/09 06:52 PM
Philip Zeyliger added a comment - 18/Jun/09 07:05 PM
I'm traveling overseas and will return on June 29th. For urgent
matters, please contact Amr Awadallah.

Owen O'Malley made changes - 20/Jun/09 07:42 AM
Fix Version/s 0.19.2 [ 12313650 ]
Component/s dfs [ 12310710 ]
Key HADOOP-4379 HDFS-200
Project Hadoop Common [ 12310240 ] HDFS [ 12310942 ]
dhruba borthakur added a comment - 25/Jun/09 08:22 AM
This is not related to HDFS-4379. let me explain why.

The problem is actually related to HDFS-xxx. The namenode waits for 10 minutes after losing heartbeats from a datanode to declare it dead. During this 10 minutes, the NN is free to choose the dead datanode as a possible replica for a newly allocated block.

If during a write, the dfsclient sees that a block replica location for a newly allocated block is not-connectable, it re-requests the NN to get a fresh set of replica locations of the block. It tries this dfs.client.block.write.retries times (default 3), sleeping 6 seconds between each retry ( see DFSClient.nextBlockOutputStream). This setting works well when you have a reasonable size cluster; if u have only 4 datanodes in the cluster, every retry picks the dead-datanode and the above logic bails out.

One solution is to change the value of dfs.client.block.write.retries to a much much larger value, say 200 or so. Better still, increase the number of nodes in ur cluster.


Konstantin Shvachko added a comment - 26/Jun/09 11:55 PM
I reviewed the patch. It looks good, nothing obviously wrong.
Do not recommend it for commit to 0.20 mostly because it introduces changes, which effect behavior of the system when appends are not present.
Main concerns are:
  1. There are new APIs like new rpc call getBlockInfo() to data-nodes and public methods syncFs().
  2. Regular reads will do additional steps before reading.
  3. Lease recovery changed substantially: touches non-append related paths.
  4. Block report processing plays tricks with generation stamps.
  5. Also, I am not sure this will be compatible with the ongoing design HDFS-265.

Taking into account all the above and if there is enough interest in the feature it makes sense to create a branch of 0.20 with this patch applied.
This will require some maintenance, but shouldn't be too hard.


Repository Revision Date User Message
ASF #793145 Fri Jul 10 23:18:15 UTC 2009 stack HBASE-1470 hbase and HADOOP-4379, dhruba's flush/sync
Files Changed
MODIFY /hadoop/hbase/trunk/CHANGES.txt
MODIFY /hadoop/hbase/trunk/src/java/org/apache/hadoop/hbase/regionserver/HLog.java

stack added a comment - 14/Jul/09 03:22 AM
(Thanks for review Konstantin)

In my last few test runs, NameNode has shut itself down with the below:

...
009-07-14 00:17:46,586 DEBUG org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.removeStoredBlock: blk_-9156287469566772234_2527 from XX.XX.XX.142:51010
2009-07-14 00:17:46,586 DEBUG org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.removeStoredBlock: blk_-9181830129071396520_2355 from XX.XX.XX.142:51010
2009-07-14 00:17:46,586 DEBUG org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.removeStoredBlock: blk_-9205119721509648294_2410 from XX.XX.XX.142:51010
2009-07-14 00:17:46,586 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.processReport: block blk_-7011715647341740217_1 on XX.XX.XX.142:51010 size 47027149 does not belong to any file.
2009-07-14 00:17:46,586 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addToInvalidates: blk_-7011715647341740217 is added to invalidSet of XX.XX.XX.142:51010
2009-07-14 00:17:46,586 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.processReport: block blk_-280166356715716926_1 on XX.XX.XX.142:51010 size 6487 does not belong to any file.
2009-07-14 00:17:46,586 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addToInvalidates: blk_-280166356715716926 is added to invalidSet of XX.XX.XX.142:51010
2009-07-14 00:17:46,586 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.processReport: block blk_1532053033915429278_1 on XX.XX.XX.142:51010 size 3869 does not belong to any file.
2009-07-14 00:17:46,586 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addToInvalidates: blk_1532053033915429278 is added to invalidSet of XX.XX.XX.142:51010
2009-07-14 00:17:47,303 WARN org.apache.hadoop.hdfs.server.namenode.FSNamesystem: ReplicationMonitor thread received Runtime exception. java.lang.IllegalStateException: generationStamp (=1) == GenerationStamp.WILDCARD_STAMP
2009-07-14 00:17:47,304 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at aa0-000-12.u.powerset.com/XX.XX.XX.139
************************************************************/

My guess this is a bug only fellas with dfs.support.append=true set run in to?

Here is code from ReplicationMonitor:

} catch (Throwable t) {
          LOG.warn("ReplicationMonitor thread received Runtime exception. " + t);
          Runtime.getRuntime().exit(-1);
        }

Thats a rough call I'd say?

There are no more detailed exceptions in NN log.

Dig in more and stick what I find in another issue?


dhruba borthakur added a comment - 16/Jul/09 07:34 AM
Here is another patch that prints out the exception stack trace when the ReplicationMonitor encounters an exception. If you can reproduce this failure case with this patch, that will be great.

dhruba borthakur made changes - 16/Jul/09 07:34 AM
Attachment fsyncConcurrentReaders12_20.txt [ 12413654 ]
dhruba borthakur made changes - 20/Jul/09 10:42 PM
Link This issue blocks HDFS-495 [ HDFS-495 ]
Ruyue Ma added a comment - 21/Jul/09 06:32 AM
to: dhruba borthakur

> This is not related to HDFS-4379. let me explain why.
> The problem is actually related to HDFS-xxx. The namenode waits for 10 minutes after losing heartbeats from a datanode to declare it dead. During this 10 minutes, the NN is free to choose the dead datanode as a possible replica for a newly allocated block.

> If during a write, the dfsclient sees that a block replica location for a newly allocated block is not-connectable, it re-requests the NN to get a fresh set of replica locations of the block. It tries this dfs.client.block.write.retries times (default 3), sleeping 6 seconds between each retry ( see DFSClient.nextBlockOutputStream). > This setting works well when you have a reasonable size cluster; if u have only 4 datanodes in the cluster, every retry picks the dead-datanode and the above logic bails out.

> One solution is to change the value of dfs.client.block.write.retries to a much much larger value, say 200 or so. Better still, increase the number of nodes in ur cluster.

Our modification: when getting block location from namenode, we give nn the excluded datanodes. The list of dead datanodes is only for one block allocation.

+++ hadoop-new/src/hdfs/org/apache/hadoop/hdfs/DFSClient.java 2009-07-20 00:19:03.000000000 +0800
@@ -2734,6 +2734,7 @@
LocatedBlock lb = null;
boolean retry = false;
DatanodeInfo[] nodes;
+ DatanodeInfo[] exludedNodes = null;
int count = conf.getInt("dfs.client.block.write.retries", 3);
boolean success;
do {
@@ -2745,7 +2746,7 @@
success = false;

long startTime = System.currentTimeMillis();

  • lb = locateFollowingBlock(startTime);
    + lb = locateFollowingBlock(startTime, exludedNodes);
    block = lb.getBlock();
    nodes = lb.getLocations();

@@ -2755,6 +2756,19 @@
success = createBlockOutputStream(nodes, clientName, false);

if (!success) {
+
+ LOG.info("Excluding node: " + nodes[errorIndex]);
+ // Mark datanode as excluded
+ DatanodeInfo errorNode = nodes[errorIndex];
+ if (exludedNodes != null) { + DatanodeInfo[] newExcludedNodes = new DatanodeInfo[exludedNodes.length + 1]; + System.arraycopy(exludedNodes, 0, newExcludedNodes, 0, exludedNodes.length); + newExcludedNodes[exludedNodes.length] = errorNode; + exludedNodes = newExcludedNodes; + } else {
+ exludedNodes = new DatanodeInfo[] { errorNode };
+ }
+
LOG.info("Abandoning block " + block);
namenode.abandonBlock(block, src, clientName);


dhruba borthakur added a comment - 22/Jul/09 02:14 PM
Hi Ruyue, your option of excluding specific datanodes (specified by the client) sounds reasonable. This might help in the case of network partitioning where a specific client loses access to a set of datanodes while the datanode is alive and well and is able to send heartbeats to the namenode. Can you pl create a separate JIRA for your prosposed fix and attach your patch there? Thanks.

dhruba borthakur added a comment - 31/Jul/09 01:24 AM
This patch fixes the exception "Runtime exception. java.lang.IllegalStateException: generationStamp (=1) == GenerationStamp.WILDCARD_STAMP" reported by stack earlier.

dhruba borthakur made changes - 31/Jul/09 01:24 AM
Attachment fsyncConcurrentReaders13_20.txt [ 12415083 ]
stack added a comment - 05/Aug/09 04:57 AM
Just to say that I've been testing last patch over last 3 or 4 days and its holding up for me. I'll keep at it. I also wrote Ruyue Ma for the server-side of his patch above but no answer as yet. Good stuff.

dhruba borthakur added a comment - 13/Aug/09 06:50 AM
This patch ignores deleting blocks during block report processing if the file is underConstruction. Updated test case to tets this corner case.

dhruba borthakur made changes - 13/Aug/09 06:50 AM
Attachment fsyncConcurrentReaders14_20.txt [ 12416406 ]
ryan rawson added a comment - 27/Aug/09 02:07 AM
I've been testing this on my 20 node cluster here. Lease recovery can take a long time which is a bit of an issue. The sync seems to be pretty good overall, we are recovering most of the edits up until the last flush, and it's pretty responsive.

However, I have discovered a new bug, the scenario is like so:

  • we roll the logs every < 1MB (block size).
  • we now have 18 logs to recover. The first 17 were closed properly, only the last one was in mid-write.
  • during log recovery, the hbase master calls fs.append(f); out.close();
  • But the master gets stuck at the out.close(); It can't seem to progress. Investigating the logs, it looks like the namenode 'forgets' about the other 2 replicas for the block (file is 1 block), and thus we are stuck
    until another replica comes back.

I've attached logs, hadoop fsck, stack traces from hbase.


ryan rawson made changes - 27/Aug/09 02:08 AM
Attachment hdfs-200-ryan-existing-file-fail.txt [ 12417840 ]
stack added a comment - 10/Sep/09 06:03 AM
@dhruba Did you get a chance to checkout Ryan's postings? Good stuff...

dhruba borthakur added a comment - 11/Sep/09 08:23 AM
@ryan, stack: thanks for the info. I will follow it up very soon. will keep you posted.

dhruba borthakur added a comment - 11/Sep/09 08:52 AM - edited
I think this is the scenario that you are facing:

1. The file is written for the first time. The writer closed the file but at this time only one of the three replicas have checked in with the namenode.
2. The new writer invoked append() to write more data into the file. The new writer found the one remaining replica of the block, stamped a new generation stamp on this replica, made it ready to receive new data for this file and file is not open for "append". The stamping of the new generation stamp essentially invaidated the other two replicas of this block....this block now has only one valid replica. The namenode won't start replicating this block till the block is full. If this sole datanode now dies, then the file will be "missing a block". This is what you folks encountered.

One option is to set dfs.replication.min to 2. This will ensure that closing a file (step 1) will be successful only when at least two replicas of the block have checked in with the namenode. This should reduce the probability of this problem occuring. Along with that you could set the replication factor of the hbase log file(s) to be greater than 3.


Ruyue Ma made changes - 18/Sep/09 07:07 AM
Link This issue relates to HDFS-630 [ HDFS-630 ]
ryan rawson added a comment - 24/Sep/09 12:34 AM
I'm not sure exactly what I'm seeing, here is some flow of events:

namenode says when the regionserver closes a log:

2009-09-23 17:21:05,128 DEBUG org.apache.hadoop.hdfs.StateChange: BLOCK NameNode.blockReceived: from 10.10.21.38:50010 1 blocks.
2009-09-23 17:21:05,128 DEBUG org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.blockReceived: blk_8594965619504827451_4351 is received from 10.10.21.38:50010
2009-09-23 17:21:05,128 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.10.21.38:50010 is added to blk_859496561950482745
1_4351 size 573866
2009-09-23 17:21:05,130 DEBUG org.apache.hadoop.hdfs.StateChange: BLOCK NameNode.blockReceived: from 10.10.21.45:50010 1 blocks.
2009-09-23 17:21:05,130 DEBUG org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.blockReceived: blk_8594965619504827451_4351 is received from 10.10.21.45:50010
2009-09-23 17:21:05,130 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.10.21.45:50010 is added to blk_859496561950482745
1_4351 size 573866
2009-09-23 17:21:05,131 DEBUG org.apache.hadoop.hdfs.StateChange: BLOCK NameNode.blockReceived: from 10.10.21.32:50010 1 blocks.
2009-09-23 17:21:05,131 DEBUG org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.blockReceived: blk_8594965619504827451_4351 is received from 10.10.21.32:50010
2009-09-23 17:21:05,131 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.10.21.32:50010 is added to blk_859496561950482745
1_4351 size 573866
2009-09-23 17:21:05,131 DEBUG org.apache.hadoop.hdfs.StateChange: DIR NameNode.complete: /hbase/.logs/sv4borg32,60020,1253751520085/hlog.dat.1253751663228 for DFSClien
t_-2129099062
2009-09-23 17:21:05,131 DEBUG org.apache.hadoop.hdfs.StateChange: DIR* NameSystem.completeFile: /hbase/.logs/sv4borg32,60020,1253751520085/hlog.dat.1253751663228 for DFS
Client_-2129099062
2009-09-23 17:21:05,132 DEBUG org.apache.hadoop.hdfs.StateChange: DIR* FSDirectory.closeFile: /hbase/.logs/sv4borg32,60020,1253751520085/hlog.dat.1253751663228 with 1 bl
ocks is persisted to the file system
2009-09-23 17:21:05,132 DEBUG org.apache.hadoop.hdfs.StateChange: DIR* NameSystem.completeFile: /hbase/.logs/sv4borg32,60020,1253751520085/hlog.dat.1253751663228 blockli
st persisted

So at this point we have 3 blocks, they have all checked in, right?

then during logfile recovery we see:

2009-09-23 17:21:45,997 DEBUG org.apache.hadoop.hdfs.StateChange: DIR NameNode.append: file /hbase/.logs/sv4borg32,60020,1253751520085/hlog.dat.1253751663228 for DFSCl
ient_-828773542 at 10.10.21.29
2009-09-23 17:21:45,997 DEBUG org.apache.hadoop.hdfs.StateChange: DIR* NameSystem.startFile: src=/hbase/.logs/sv4borg32,60020,1253751520085/hlog.dat.1253751663228, holde
r=DFSClient_-828773542, clientMachine=10.10.21.29, replication=512, overwrite=false, append=true
2009-09-23 17:21:45,997 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Number of transactions: 567 Total time for transactions(ms): 9Number of transactions ba
tched in Syncs: 54 Number of syncs: 374 SyncTimes(ms): 12023 4148 3690 7663
2009-09-23 17:21:45,997 DEBUG org.apache.hadoop.hdfs.StateChange: UnderReplicationBlocks.update blk_8594965619504827451_4351 curReplicas 0 curExpectedReplicas 3 oldReplicas 0 oldExpectedReplicas 3 curPri 2 oldPri 2
2009-09-23 17:21:45,997 DEBUG org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.UnderReplicationBlock.update:blk_8594965619504827451_4351 has only 0 replicas and need 3 replicas so is added to neededReplications at priority level 2
2009-09-23 17:21:45,997 DEBUG org.apache.hadoop.hdfs.StateChange: DIR* NameSystem.appendFile: file /hbase/.logs/sv4borg32,60020,1253751520085/hlog.dat.1253751663228 for DFSClient_-828773542 at 10.10.21.29 block blk_8594965619504827451_4351 block size 573866
2009-09-23 17:21:45,997 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=hadoop,hadoop ip=/10.10.21.29 cmd=append src=/hbase/.logs/sv4borg32,60020,1253751520085/hlog.dat.1253751663228 dst=null perm=null
2009-09-23 17:21:47,265 DEBUG org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.UnderReplicationBlock.remove: Removing block blk_8594965619504827451_4351 from priority queue 2
2009-09-23 17:21:56,016 DEBUG org.apache.hadoop.hdfs.StateChange: BLOCK NameNode.addBlock: file /hbase/.logs/sv4borg32,60020,1253751520085/hlog.dat.1253751663228 for DFSClient_-828773542
2009-09-23 17:21:56,016 DEBUG org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.getAdditionalBlock: file /hbase/.logs/sv4borg32,60020,1253751520085/hlog.dat.1253751663228 for DFSClient_-828773542
2009-09-23 17:21:56,016 DEBUG org.apache.hadoop.hdfs.StateChange: DIR* FSDirectory.addFile: /hbase/.logs/sv4borg32,60020,1253751520085/hlog.dat.1253751663228 with blk_-7773763727920913762_4364 block is added to the in-memory file system
2009-09-23 17:21:56,016 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.allocateBlock: /hbase/.logs/sv4borg32,60020,1253751520085/hlog.dat.1253751663228. blk_-7773763727920913762_4364

At this point the client is stuck waiting on a block recovery that wont ever really happen.

I set this config at your suggestion:
<property>
<name>dfs.replication.min</name>
<value>2</value>
</property>

in the hbase-site.xml so it should be in the HBase dfs client config.

from my reading of the logfile, apparently the namenode seems to "forget" the other 2 locations of the block in question. Am I reading it wrong?


ryan rawson added a comment - 24/Sep/09 12:41 AM
scratch the last, i was having some environment/library version problems.

Tsz Wo (Nicholas), SZE made changes - 29/Sep/09 08:38 PM
Link This issue relates to HDFS-659 [ HDFS-659 ]
Tsz Wo (Nicholas), SZE added a comment - 29/Sep/09 08:38 PM
After HDFS-659, new readers are able to read hflushed data in the Append Branch. We have committed a test, TestReadWhileWriting, which is similar to TC3.