|
Tsz Wo (Nicholas), SZE made changes - 09/Oct/08 01:03 AM
Tsz Wo (Nicholas), SZE made changes - 09/Oct/08 01:06 AM
Tsz Wo (Nicholas), SZE made changes - 09/Oct/08 01:10 AM
4379_20081010TC3.java: a unit test implementing TC3 described in my previous comment
Tsz Wo (Nicholas), SZE made changes - 10/Oct/08 06:26 PM
dhruba borthakur made changes - 16/Jan/09 08:22 PM
dhruba borthakur made changes - 16/Jan/09 08:22 PM
When a fsync occurs, the client informs the namenode the amount of data that it has written to the last block. The NameNode updates its meatadata with tis value.This means that a new getBlockLocation call will automatically retrieve the last good size of the last block.
dhruba borthakur made changes - 16/Jan/09 08:26 PM
@dhruba: this is the current (0.19.0) behavior: if a client writes a half block to a log in hdfs and calls fsync and then got killed. When the client restarts (on the same node) to read the log, the half block is not there, until you restart the entire hadoop cluster and wait for a few minutes.
@Luke: the attached patch should solve your problem. If you have the time, would it be possible for you to apply this patch and then retry your test case?
Hi Dhruba,
I've been working with Luke a little on this. Here are more details. The log that gets written in the test is very small. The first thing the software does when it creates the log is it writes a 7-byte header. Then later as the test proceeds, the system will append a small entry and the do a sync. We use the FSDataOutputStream class. The sequence of operations looks something like this: out_stream.write(data, 0, amount); When the test completes, all of the logs are exactly 7 bytes long. It remains this way even if I wait 10 minutes or kill the Hypertable java process and wait several minutes as well. Here is the listing: [doug@motherlode000 aol-basic]$ hadoop fs -ls /hypertable/servers/10.0.30.1*_38060/log/range_txn After shutting down HDFS and restarting it again, the listing looks like this: [doug@motherlode000 aol-basic]$ hadoop fs -ls /hypertable/servers/10.0.30.1*_38060/log/range_txn The last time I ran this test I encountered a problem where it appeared that some of our commits were lost. Here's what I did: 1. ran tests (which create a table with 75274825 cells) The table dump in #6 came up short (e.g. 72M entries). It appears that some of the commit logs (different log than the range_txn log) came back up incomplete. Let us know if you want us to run an instrumented version or anything. We can send you the Hadoop log files if that helps. Thanks!
On the design of the TC3: I think that client informing namenode on each sync is too expensive and not scalable. Hypertable can easily generate 1M transactions/s from a small cluster, which the namenode cannot possibly handle. Since restarting HDFS incurs no data loss, I think the basic logic is probably already fine. For most cases, I think that making new readers a bit more expensive to create is probably the right trade-off:
1. We could append and sync a file and only inform the namenode on the first sync, hinting that we need to do something special for the new readers of the file. I think that this would cover most of the use cases for append and sync reasonably. HBase really needs visibility to partial blocks and restarting HDFS is not an option for a 100+ node cluster.
We don't sync after every record, but sync after a configurable number of writes or after a configurable amount of time has passed and This really needs to be addressed in 0.19.1, 0.20.1 and trunk.
Jim Kellerman made changes - 19/Jan/09 10:35 PM
First of two test files demonstrating the problem
Jim Kellerman made changes - 20/Jan/09 01:29 AM
Second of two test programs that demonstrate the problem.
Jim Kellerman made changes - 20/Jan/09 01:30 AM
I have attached two test programs that demonstrate the problem.
Writer.java writes 100 records of approx 1KB each per sync. Its command line argument is the number of sync's to perform before aborting. If you run Writer with 4096 syncs: bin/hadoop Writer 4096 and then run Reader, it gets an EOFException without having read any records (note that exception occurs in constructor: Exception in thread "main" java.io.EOFException
at java.io.DataInputStream.readFully(Unknown Source)
at java.io.DataInputStream.readFully(Unknown Source)
at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1450)
at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1428)
at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1417)
at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1412)
at Reader.main(Reader.java:34)
If you run Writer with 8192 syncs: bin/hadoop Writer 8192 and then run Reader, it reads 684783 records before hitting an EOFException (in next()) Read 684783 lines
java.io.EOFException
at java.io.DataInputStream.readFully(Unknown Source)
at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:63)
at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1930)
at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1830)
at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1876)
at Reader.main(Reader.java:41)
684783 records that are approximately 1KB in length is approximately 64MB (a hadoop block). In this test the partial block containing 134417 records is lost (even though a considerable So you can read from an improperly closed file but only up to the last full block, confirming I will look into the patch shortly. Thanks Doug and Luke for the comments and Jim for providing a good test case. I will incorporate your comments and post a new patch. (Jim, you need not test the existing patch associated with this JIRA)
Jim Kellerman made changes - 23/Jan/09 01:50 AM
This patch implements the following (as specified in the design doc for Appends,
A reader checks to see if the file is being written to by another writer. if so, it fetches the size of the last block from the primary datanode. This does not have any performance impact for non-concurrent readers not does it have any impact on sync().
dhruba borthakur made changes - 23/Jan/09 10:26 AM
@Dhruba
I tried applying fsyncConcurrentReaders3.patch to 0.19.0, 0.19 branch (latest), 0.20 branch (latest) and trunk. It did not @Jim: The SequenceFile.Writer.sync() does not actually invoke the FSDataOutputStream.sync.
The FSDataOutputStream.sync() is a heavy weight operation and the SequenceFile.Writer.sync() is typically invoked with small chunks of data, that's the reason why the SequenceFile class does not automatically invoke the fs sync call. I changed your test case to invoke the FSDaOutputStream.sync call but still see some problems. Will post an update as soon as I get to the bottom of this. @Dhruba
If I have a SequenceFile.Writer, how do I get the FSDataOutputStream? It is a protected member of SequenceFile.Writer. Thanks.
Jim Kellerman made changes - 26/Jan/09 06:55 PM
Jim Kellerman made changes - 26/Jan/09 07:03 PM
> how do I get the FSDataOutputStream?
I think it would be better to add a SequenceFile.Writer method that flushes and calls FSDataOutputStream.sync. The problem is that SequenceFile.Writer#sync() should probably be called flush(), not sync(), but that'd be hard to change. So perhaps we should add a SequenceFile.Writer#syncFs() method? I will add a new method SequenceFile.Writer.syncFS(), test it, and upload a new patch. Thanks Doug and Jim.
I tried the test again and still no luck. To recap, here's how the log file is created:
out_stream.write(header, 0, 7); After the test finished, I shut down the Hypertable servers. This time the listing shows the files to be 0 bytes in length (as opposed to 7 bytes with the previous patch): [doug@motherlode000 aol-basic]$ hadoop fs -ls /hypertable/servers/10.0.30.1*_38060/log/range_txn When the RangeServer starts up again, it discovers that the log file (range_txn/0.log) does exist, so it starts the recovery process. However, it only sees the 7 byte header. All of the subsequent log appends do not appear in the log file. So the system starts up without recovering any of the data. BTW, in this particular circumstance, there no other writer writing to the file when the range server comes up and reads it. Here's the high-level of what's going on: RangeServer opens an FSDataOutputStream to the log and starts appending to it In your above note you said, "A reader checks to see if the file is being written to by another writer. if so, it fetches the size of the last block from the primary datanode." This is not the case with our test, there is no writer writing to the log when we try to read it.
This may be a problem on our end. When I shut down hypertable and then copy each range_txn log file to the local filesystem, the files seem to contain more than just the 7-byte header.
[doug@motherlode000 aol-basic]$ ls -l rsml/ It looks like our code depends on the length, as returned by getFileStatus:
length = mFilesystem.getFileStatus(new Path(fileName)).getLen(); It appears that the length returned by this API call is not correct. Can you take a look at it and maybe patch this one as well? Hi Doug,
The file length (as returned by getFileStatus) will not change at every write from the client to the datanode. Similarly, every fsync call from the client does not reach the namenode (only the first one per block reaches the namenode). That means the namenode has no good way to know the size of a block when the block is being written to by a writer. In your case, the writer has died. The namenode has a timeout of 1 hour before it starts lease recovery for this file. The lease recovery process will set the correct file size on the namenode metadata. If you do not want to wait for one hour, then you can manually trigger lease recovery from your application by trying to reopen the file for append(please use FileSystem.append(pathname)). Lease recovery will update the true length of the file in the namenode metadata. An updated version of the writer than invokes SequenceFile.Writer.syncFs().
dhruba borthakur made changes - 27/Jan/09 02:07 AM
This read waits for lease recovery to complete before reading data.
dhruba borthakur made changes - 27/Jan/09 02:08 AM
This patch should merge smoothly with trunk.
dhruba borthakur made changes - 27/Jan/09 02:08 AM
@Jim: I updated your test cases Reader.java and Writer.java and have successfully tested them at my site. Please let me know if you still see problems.
@Doug: Please use the latest patch that I uploaded. This has the SequenceFile.Writer.syncFS() method. (Also, I suggest that you should pull in the patch available at HADOOP-5027) Hi Dhruba,
I tried your suggestion, but got the following exception when trying to open the file with the 'append' method: SEVERE: I/O exception while getting length of file '/hypertable/servers/10.0.30.102_38060/log/range_txn/0.log' - org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create file /hypertable/servers/10.0.30.102_38060/log/range_txn/0.log for DFSClient_2003773208 on client 10.0.30.102, because this file is already being created by DFSClient_423127459 on 10.0.30.102 I could re-write the whole thing to not be dependent on knowing log length, however, it seems like it ought to be possible to obtain the actual file length in this situation. The semantics of getFileStatus() seem a little odd. Sometimes it returns the actual length of the file and sometimes it returns a stale version of the length. I suppose this is ok as long as it is well documented. But it should be possible to obtain the actual length of a file. Would it be possible to add a FileSystem::length(Path path) method that returns the accurate file length by fetching the size of the last block from the primary datanode?
Hi Doug,
the exception is expected. Please look at the test file Reader.java that I attached to this JIRA. It shows how the reader waits for the lease recovery to end (to ensure that correct file size is updated on namenode). Please let me know if this approach is suitable for your application. We could enhance FileSystem.getFileStatus() to contact the datanode and retrieve and return the most-current file length (only for files that have a concurrent writer). This will not have have performance impact for most map-reduce applications. Tested this patch in conjunction with HADOOP-5027, Works!!!!!!!!!!!!!!!
I would like to see this in 0.19.1, 0.20.0 and trunk. +++++1 Thanks Jim. I will wait to hear from Doug Judd on whether it solves his problem.
Now when I apply your latest patch and the one from 5027 to the 0.19.0 source, the datanodes seem to be going into an infinite loop of NullPointerExceptions. At the top of the hadoop-zvents-datanode-motherlode007.admin.zvents.com.log I'm seeing this:
[doug@motherlode007 logs]$ more hadoop-zvents-datanode-motherlode007.admin.zvents.com.log at org.apache.hadoop.ipc.Client.call(Client.java:696) 2009-01-27 15:33:02,953 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: org.apache.hadoop.ipc.RemoteException: java.io.IOException: java.lang.NullPointerExce And the file is growing rapidly with the following exceptions tacked on to the end: 2009-01-27 16:10:55,973 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: org.apache.hadoop.ipc.RemoteException: java.io.IOException: java.lang.NullPointerException at org.apache.hadoop.ipc.Client.call(Client.java:696) It appears that these exceptions started happening within about 5 seconds after startup, so it doesn't look like it has anything to do with Hypertable. Is it ok to apply these patches to the 0.19.0 source? Or should I be applying them to the trunk?
I think these exceptions are related to the patch in HADOOP-5027. let me check.
Hi Dhruba,
When I drop the patch for 5027, the NullPointerExceptions go away. However, I'm still having troubles. It appears for some of the files, your trick for obtaining the real length does not work. The waitForLease spins indefinitely. For example, here's the log line for our HdfsBroker process where it is hanging: Jan 27, 2009 5:08:53 PM org.hypertable.DfsBroker.hadoop.HdfsBroker Length Then in the namenode log file, I see things like this (I've attached the compressed namenode log): 2009-01-27 17:08:53,173 WARN org.apache.hadoop.hdfs.StateChange: DIR* NameSystem.startFile: failed to create file /hypertable/servers/10.0.30.106_38060/log/metadat
Doug Judd made changes - 28/Jan/09 01:22 AM
> I think these exceptions are related to the patch in HADOOP-5027.
Please do not use my patch in HADOOP-5027. It is not clear that whether the fix is good. @Doug: the waiting for lease recovery to reclaim the length of the jar file will work only if the original writer has dies and now a new reader wants to read all the data that was written by the writer before it died. This was the use-case for Jim Kellerman.
In your case, you have a FileSystem object that was used to write to the file. Then your application tried to reopen the same file by invoking fs.apend() on the same FileSystem object. The namenode logs clearly shows that the same dfsclient (i.e. filesystem object) was used to try appending to the file. In this case, the namenode rejects the append call because it knows that the original writer is alive and is still writing to it. If your aim is to have a concurrent reader along with a concurrent writer, then the best that this patch can do is to allow the concurent reader to see the new file length only when the block is full. On the other hand, if you can make your application ot depend on the file length, then you can see all the data in the file. Another alternative would be to implement a new call FileSystem.length() than can retrieve the latest length from the datanode, but that can be done as part of a separate JIRA. Please let us know how difficult is it for you to change your app to not depend on the file length and just read till end-of-file? Dhruba,
The application is not trying to reopen a file that it already has open. It appears that HDFS is getting confused and thinks that this is the case. One thing that is slightly different in this case is that nothing gets written to the file by the original writer. Here is the sequence of relevant operations: File gets created Can you verify that this particular usage pattern is handled properly? I'll try to come up with a stripped down test case tomorrow.
Doug Judd made changes - 28/Jan/09 10:03 PM
Doug Judd made changes - 28/Jan/09 10:04 PM
Ok, I've uploaded a reproducible test case (see attachments ReopenProblem.java and reopen_test.sh). This test consistently hangs on both my local mac HDFS installation as well as our 10 node Linux HDFS cluster. It's not as simple as I suggested in my previous comment as you can see by the code. Here's the output I see when I run it:
[doug@motherlode000 0.9.2.1]$ ./reopen_test.sh The Hbase folks have requested that this be fixed in the next release form the 0.19 branch.
dhruba borthakur made changes - 30/Jan/09 06:28 PM
Hi Doug,
ReopenProblem.java is guaranteed to hang. At line 189, you create an file and get an handle on this file. Then , at line 197, you use the same file system object and invoke waitForLeaseRecovery(). waitForLeaseRecovery tries to open the same file again for appending to it. It fails as expected. The namenode log correctly shows that the "same leaseholder is trying to reopen the file for appends"! does it make sense? Thanks, dhruba Hi Dhruba,
The file that is opened on line 189 ( /hypertable/servers/10.0.30.114_38060/log/root/1 ) is different than the one passed into waitForLeaseRecovery on line 197. On line 194, the file object gets reset to a different path ( /hypertable/servers/10.0.30.114_38060/log/metadata/0 ). So, this program should not hang.
As discussed on core-dev@ (http://www.nabble.com/Hadoop-0.19.1-td21739202.html
Nigel Daley made changes - 11/Feb/09 08:59 PM
Jim Kellerman made changes - 11/Feb/09 09:06 PM
Applying this patch to hadoop-0.19.0, it does work, but slowly:
It takes almost an hour to recover the lease of a file when the hdfs cluster is under load: 2009-02-25 21:39:16,843 DEBUG org.apache.hadoop.hbase.regionserver.HLog: Splitting 3 of 3: hdfs:/x.y.com:8100/hbase/log_10.76.44.139_1235597506284_8020/hlog.dat.1235597820662 2009-02-25 21:39:16,847 DEBUG org.apache.hadoop.hbase.regionserver.HLog: Triggering lease recovery. ... 2009-02-25 22:37:12,755 INFO org.apache.hadoop.hbase.regionserver.HLog: log file splitting completed for hdfs://x.y.com:8100/hbase/log_10.76.44.139_1235597506284_8020 Is there something we can change to make lease recovery faster?
config parameter? block size? max file size? Thanks. There should be a way to forcibly take over a lease. In our application, ownership of the various files is controlled at a higher level (i.e. Chubby). A server will not come up and attempt to recover a file (or check its length) unless it is the sole owner of the file (i.e. it will have obtained a Chubby lock).
Can we add some sort of API that allows a client to forcibly take over a lease immediately? There is ReopenProblem.java attched to this JIRA.
@Doug: Please use ReopenProblem.java:waitForLeaseRecovery(). This method triggers near-immediate lease recovery and can be used by any application code to take ownership of the file. Does this work for you? @Jim: Were u using ReopenProblem.java:waitForLeaseRecovery() and still seeing that it takes upto an hour for the waitForLeaseRecovery() method to return success? This can happen if the original client that was writing to the file was alive and was in communication with the namenode. In this case, the new append-er will not be able to recover the original lease. > dhruba borthakur - 26/Feb/09 11:35 AM
> @Jim: Were u using ReopenProblem.java:waitForLeaseRecovery() and still seeing that it takes upto an hour for the > waitForLeaseRecovery() method to return success? Well, sort of. my code is similar but inline in the module that uses it: // Recover the files lease if necessary boolean recovered = false; while (!recovered) { try { FSDataOutputStream out = fs.append(logfiles[i].getPath()); out.close(); recovered = true; } catch (IOException e) { if (LOG.isDebugEnabled()) { LOG.debug("Triggering lease recovery."); } try { Thread.sleep(leaseRecoveryPeriod); } catch (InterruptedException ex) { // ignore it and try again } } } > This can happen if the original client that was writing to the file was alive and was in communication with the namenode. This was definitely not the case. The original writing process was kill'ed -9 Like Hypertable, HBase has a mechanism to detect when the original writing process is dead. What is the value of leaseRecoveryPeriod in the above piece of code?
The fact that the lease recovery took 1 hour is very interesting. The namenode has two limits: the soft limit of 1 minute and the hard limit of 1 hour. When there is no ping from a writer for the soft limit period, then a new writer can recover the lease and append to the file. When there is no ping from a writer for the hard limit period, the namenode itself itself initiates lease recovery. Is it possible for you to give me access to the namenode log along with the relevant filename in question? Hi Dhruba,
When you say "near-immediate" what does that mean? Is it the 1 minute soft limit? If so, that's way too long. If a node bounces in our cluster, we can't have the database locked up for an entire minute waiting for the lease to expire when we know for a 100% fact that the original leaseholder is dead. A one minute database lock-up would make Hypertable unsuitable for serving live data.
leaseRecoveryPeriod is 10000ms
file name is hdfs://aa0-000-12.u.powerset.com:8100/hbase/log_208.76.44.139_1235681989771_8020/hlog.dat.1235682444379 I will attach the namenode log. Namenode log. Note that this run was done using the waitForLeaseRecovery() from ReopenProblem.java
Jim Kellerman made changes - 26/Feb/09 10:17 PM
Doug Judd made changes - 28/Feb/09 12:22 AM
We too would like to be able to recover the lease immediately as we know when we go to read the file that the
writer is dead. Is there anything that we can do to speed this up? We can't contribute to Hadoop, but we are certainly willing to Hi Jim,
In the namenode log, I see the followign statement: NameSystem.startFile: failed to create file /hbase/log_208.76.44.139_1235681989771_8020/hlog.dat.1235682444379 for DFSClient_1908447348 on client 208.76.44.139 because current leaseholder is trying to recreate file. This means that the same client that originally created the file is trying to re-open the file. Is this possible? It started at time 21:14 and continued all the way to 22:07. An attempt was made to recreate thias file every 10 seconds (which matches the periodicity that you set) No, the original writer was killed at 21:08 and never restarted.
Lease recovery started at 21:14 by another process on the same machine. Hi Jim, I think the namenode log that you posted is a partial one. It does not have calles to commitBlockSync (that should have occured when the new writer started to append to the file). if so, can you pl post the entire namenode log?
@Dhruba
You were correct. I had INFO logging turned off. Here is a new name node log file with INFO enabled.
Jim Kellerman made changes - 04/Mar/09 06:05 PM
More info. When I generated the new namenode.log, the writer was killed at 16:54. The first attempt to grab the lease was at 16:56
and it was finally granted at 17:54. Dhruba,
What branch/revision did you test this patch against? I've tried several in 0.19 branch (0.19.0, and revisions just before 0.19.1 (and before It appears that 5225 was not committed to 0.20.0 or to trunk. Is that correct? If so should I try a revision of Hadoop off of 0.20 or trunk? Also, were you just testing Reader and Writer on an unloaded cluster, or using some other application
against a cluster that was under load? I have tested it only on 0.19 branch (not on trunk). Also with very light load. I will be testing it in scale on 0.19 very veru shortly (I am waiting for
dhruba borthakur made changes - 16/Apr/09 05:02 AM
@Dhruba Any luck testing at scale? (Looks like
Hi Stack, I will post a new merged patch for this one. I am hoping that I get some file system folks to review it once I post the patch.
Nigel Daley made changes - 23/Apr/09 07:07 PM
Merged patch with latest trunk.
dhruba borthakur made changes - 04/May/09 04:59 AM
dhruba borthakur made changes - 10/May/09 10:55 AM
Tried the patch on loaded cluster. First attempt got file lease after about 30 seconds or so. Second test fell into a loop:
2009-05-12 00:51:16,031 [HMaster] INFO org.apache.hadoop.hdfs.DFSClient: Could not complete file /hbasetrunk2/.logs/aa0-000-14.u.powerset.com_1242088970041_60021/hlog.dat.1242089388713 retrying... 2009-05-12 00:51:16,441 [HMaster] INFO org.apache.hadoop.hdfs.DFSClient: Could not complete file /hbasetrunk2/.logs/aa0-000-14.u.powerset.com_1242088970041_60021/hlog.dat.1242089388713 retrying... ... Sending Dhruba hdfs logs. Fixed block reports and blockReceived to handle change in generation stamp of a block.
dhruba borthakur made changes - 18/May/09 08:39 PM
v9 works well in basic testing. If I crash the writer application, a subsequent reader can pick up edits from the crashed writer if the reader process does append, close and then reopen.
I've been trying to break the patch but it seems robust (v7 and v8 had issues that Dhruba fixed). The append/close process can take from 10-60 seconds in my basic testing. Usually its about 20 seconds. It would be better if there were no pause. Can this be addressed? My testing was done with hadoop 0.20.0 (I amended patch to work with 0.20 branch).
Patch itself looks good. Minor comments are + Is there anything that can be done about duplicated test condition in BlockManager.java and DatanodeDescriptor.java. Can the test be grouped up into a method that is used in both places? I haven't played much with killing writer application and the local datanode. Will try that now. Will also try confirm that we're reading up to the last sync before the kill. Will report back. If one of the datanodes in the pipeline has a very very slow disk, it is possible that when sync returns, not all the syned data get flushed to the disk yet. If this datanode dies and restarts before the lease recovery starts, I think the lease recovery will remove the synced data.
After a chat with Hairong and Dhruba, it was thought that the wait on a successful append, close, then open to read was not necessary in this patch; that a new reader should be able to read up to the dead writers last sync. I tried it and all I get are EOFs when the new reader tries to read the dead writers file.
It seems to me that the current implementation of sync does not guarantee that synced data become visible to any new reader.
I tested some more and new reader sees 0 data and gets an EOF each time.
I went back to testing the wait-on-successful-append, close, and new open for read and ran into the following issue again (I'd seen this in earlier patch) where we're spinnning waiting on the append to succeed: 2009-05-27 22:16:53,186 [HMaster] INFO org.apache.hadoop.hbase.regionserver.HLog: Failed open for append, waiting on lease recovery: hdfs://Y/hbase/.logs/X,1243460581044/hlog.dat.1243462406950 org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create file /hbase/.logs/X,1243460581044/hlog.dat.1243462406950 for DFSClient_-843497834 on client X.X.X.X, because this file is already being created by DFSClient_-181950545 on Y.Y.Y.Y at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:1058) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:1146) at org.apache.hadoop.hdfs.server.namenode.NameNode.append(NameNode.java:392) at sun.reflect.GeneratedMethodAccessor18.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953) at org.apache.hadoop.ipc.Client.call(Client.java:739) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220) at $Proxy0.append(Unknown Source) at sun.reflect.GeneratedMethodAccessor11.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59) at $Proxy0.append(Unknown Source) at org.apache.hadoop.hdfs.DFSClient.append(DFSClient.java:487) at org.apache.hadoop.hdfs.DistributedFileSystem.append(DistributedFileSystem.java:186) at org.apache.hadoop.fs.FileSystem.append(FileSystem.java:525) at org.apache.hadoop.hbase.regionserver.HLog.recoverLog(HLog.java:956) at org.apache.hadoop.hbase.regionserver.HLog.splitLog(HLog.java:800) at org.apache.hadoop.hbase.regionserver.HLog.splitLog(HLog.java:753) at org.apache.hadoop.hbase.master.ProcessServerShutdown.process(ProcessServerShutdown.java:248) at org.apache.hadoop.hbase.master.HMaster.processToDoQueue(HMaster.java:459) at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:395) ... and then this started up and won't stop: 2009-05-27 22:16:59,753 [HMaster] INFO org.apache.hadoop.hdfs.DFSClient: Could not complete file /hbase/.logs/X,60021,1243460581044/hlog.dat.1243462406950 retrying... 2009-05-27 22:17:00,163 [HMaster] INFO org.apache.hadoop.hdfs.DFSClient: Could not complete file /hbase/.logs/X,1243460581044/hlog.dat.1243462406950 retrying... 2009-05-27 22:17:00,573 [HMaster] INFO org.apache.hadoop.hdfs.DFSClient: Could not complete file /hbase/.logs/X,1243460581044/hlog.dat.1243462406950 retrying... ..... Namenode log is here: http://www.duboce.net/~stack/could_not_complete_file_nn.log.gz In this test run, the append never succeeds.... or at least, after 25 minutes it still has not successfully done the append open. We try the append, fail with an AlreadyBeingCreatedException, sleep a second, and then cycle. Usually it takes well under a minute to successfully open-to-append. Namenode log is here: www.duboce.net:~stack/wontstop_namenode.log.gz. In this case, I killed datanode and the hbase regionserver simulating a machine falling off the cluster (Previous, I was mostly just killing the server process and not the datanode).
Patch for 0.20 branch. Fixes a concurrentmodificationexception.
dhruba borthakur made changes - 09/Jun/09 06:48 PM
I can manufacture a situation where we're stuck trying to open-for-append and we never seem to succeed (It ran all night last night cycling trying to open-for-append).
4 servers. All running datanodes and hbase regionservers. Running head of the hadoop 0.20.0 branch with fsyncConcurrentReaders11_20.txt applied. In hbase, I have dfs.append set to true. Start up an upload. After a few minutes, kill datanode on ..44.142. I leave the hbase regionserver on ..44.142 running. After a little while, the regionserver on ..44.139 fails because of the following DFSClient exception: 2009-06-18 18:24:28,514 [regionserver/0:0:0:0:0:0:0:0:60021.cacheFlusher] FATAL org.apache.hadoop.hbase.regionserver.MemcacheFlusher: Replay of hlog required. Forcing server shutdown
org.apache.hadoop.hbase.DroppedSnapshotException: region: TestTable,0452187774,1245349374706
at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:936)
at org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:829)
at org.apache.hadoop.hbase.regionserver.MemcacheFlusher.flushRegion(MemcacheFlusher.java:268)
at org.apache.hadoop.hbase.regionserver.MemcacheFlusher.run(MemcacheFlusher.java:149)
Caused by: java.io.IOException: Bad connect ack with firstBadLink *.*.44.142:51010
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2871)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2794)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2078)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2264)
(Why does this fail? Why does one datanode missing cause this fail?) The hbase master, also on ..44.139, then goes to recover the files of the crashed regionserver process. It gets stuck looping trying to open-for-append: 2009-06-18 18:25:04,587 [HMaster] INFO org.apache.hadoop.hbase.regionserver.HLog: Failed open for append, waiting on lease recovery: hdfs://aa0-000-12.u.powerset.com:9002/hbase/.logs/aa0-000-12.u.powerset.com,60021,1245348995684/hlog.dat.1245349434556 org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create file /hbase/.logs/aa0-000-12.u.powerset.com,60021,1245348995684/hlog.dat.1245349434556 for DFSClient_2060803895 on client *.*.44.139, because this file is already being created by DFSClient_-937484673 on *.*.44.139 at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:1058) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:1146) at org.apache.hadoop.hdfs.server.namenode.NameNode.append(NameNode.java:392) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953) at org.apache.hadoop.ipc.Client.call(Client.java:739) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220) at $Proxy0.append(Unknown Source) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59) at $Proxy0.append(Unknown Source) at org.apache.hadoop.hdfs.DFSClient.append(DFSClient.java:487) at org.apache.hadoop.hdfs.DistributedFileSystem.append(DistributedFileSystem.java:186) at org.apache.hadoop.fs.FileSystem.append(FileSystem.java:525) at org.apache.hadoop.hbase.regionserver.HLog.recoverLog(HLog.java:1037) at org.apache.hadoop.hbase.regionserver.HLog.splitLog(HLog.java:816) at org.apache.hadoop.hbase.regionserver.HLog.splitLog(HLog.java:756) at org.apache.hadoop.hbase.master.ProcessServerShutdown.process(ProcessServerShutdown.java:274) at org.apache.hadoop.hbase.master.HMaster.processToDoQueue(HMaster.java:467) at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:401) Let me attach the namenode log.
stack made changes - 18/Jun/09 06:52 PM
I'm traveling overseas and will return on June 29th. For urgent
matters, please contact Amr Awadallah.
Owen O'Malley made changes - 20/Jun/09 07:42 AM
This is not related to HDFS-4379. let me explain why.
The problem is actually related to HDFS-xxx. The namenode waits for 10 minutes after losing heartbeats from a datanode to declare it dead. During this 10 minutes, the NN is free to choose the dead datanode as a possible replica for a newly allocated block. If during a write, the dfsclient sees that a block replica location for a newly allocated block is not-connectable, it re-requests the NN to get a fresh set of replica locations of the block. It tries this dfs.client.block.write.retries times (default 3), sleeping 6 seconds between each retry ( see DFSClient.nextBlockOutputStream). This setting works well when you have a reasonable size cluster; if u have only 4 datanodes in the cluster, every retry picks the dead-datanode and the above logic bails out. One solution is to change the value of dfs.client.block.write.retries to a much much larger value, say 200 or so. Better still, increase the number of nodes in ur cluster. I reviewed the patch. It looks good, nothing obviously wrong.
Do not recommend it for commit to 0.20 mostly because it introduces changes, which effect behavior of the system when appends are not present. Main concerns are:
Taking into account all the above and if there is enough interest in the feature it makes sense to create a branch of 0.20 with this patch applied.
(Thanks for review Konstantin)
In my last few test runs, NameNode has shut itself down with the below: ...
009-07-14 00:17:46,586 DEBUG org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.removeStoredBlock: blk_-9156287469566772234_2527 from XX.XX.XX.142:51010
2009-07-14 00:17:46,586 DEBUG org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.removeStoredBlock: blk_-9181830129071396520_2355 from XX.XX.XX.142:51010
2009-07-14 00:17:46,586 DEBUG org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.removeStoredBlock: blk_-9205119721509648294_2410 from XX.XX.XX.142:51010
2009-07-14 00:17:46,586 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.processReport: block blk_-7011715647341740217_1 on XX.XX.XX.142:51010 size 47027149 does not belong to any file.
2009-07-14 00:17:46,586 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addToInvalidates: blk_-7011715647341740217 is added to invalidSet of XX.XX.XX.142:51010
2009-07-14 00:17:46,586 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.processReport: block blk_-280166356715716926_1 on XX.XX.XX.142:51010 size 6487 does not belong to any file.
2009-07-14 00:17:46,586 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addToInvalidates: blk_-280166356715716926 is added to invalidSet of XX.XX.XX.142:51010
2009-07-14 00:17:46,586 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.processReport: block blk_1532053033915429278_1 on XX.XX.XX.142:51010 size 3869 does not belong to any file.
2009-07-14 00:17:46,586 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addToInvalidates: blk_1532053033915429278 is added to invalidSet of XX.XX.XX.142:51010
2009-07-14 00:17:47,303 WARN org.apache.hadoop.hdfs.server.namenode.FSNamesystem: ReplicationMonitor thread received Runtime exception. java.lang.IllegalStateException: generationStamp (=1) == GenerationStamp.WILDCARD_STAMP
2009-07-14 00:17:47,304 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at aa0-000-12.u.powerset.com/XX.XX.XX.139
************************************************************/
My guess this is a bug only fellas with dfs.support.append=true set run in to? Here is code from ReplicationMonitor: } catch (Throwable t) { LOG.warn("ReplicationMonitor thread received Runtime exception. " + t); Runtime.getRuntime().exit(-1); } Thats a rough call I'd say? There are no more detailed exceptions in NN log. Dig in more and stick what I find in another issue? Here is another patch that prints out the exception stack trace when the ReplicationMonitor encounters an exception. If you can reproduce this failure case with this patch, that will be great.
dhruba borthakur made changes - 16/Jul/09 07:34 AM
dhruba borthakur made changes - 20/Jul/09 10:42 PM
to: dhruba borthakur
> This is not related to HDFS-4379. let me explain why. > If during a write, the dfsclient sees that a block replica location for a newly allocated block is not-connectable, it re-requests the NN to get a fresh set of replica locations of the block. It tries this dfs.client.block.write.retries times (default 3), sleeping 6 seconds between each retry ( see DFSClient.nextBlockOutputStream). > This setting works well when you have a reasonable size cluster; if u have only 4 datanodes in the cluster, every retry picks the dead-datanode and the above logic bails out. > One solution is to change the value of dfs.client.block.write.retries to a much much larger value, say 200 or so. Better still, increase the number of nodes in ur cluster. Our modification: when getting block location from namenode, we give nn the excluded datanodes. The list of dead datanodes is only for one block allocation. +++ hadoop-new/src/hdfs/org/apache/hadoop/hdfs/DFSClient.java 2009-07-20 00:19:03.000000000 +0800 long startTime = System.currentTimeMillis();
@@ -2755,6 +2756,19 @@ if (!success) { Hi Ruyue, your option of excluding specific datanodes (specified by the client) sounds reasonable. This might help in the case of network partitioning where a specific client loses access to a set of datanodes while the datanode is alive and well and is able to send heartbeats to the namenode. Can you pl create a separate JIRA for your prosposed fix and attach your patch there? Thanks.
This patch fixes the exception "Runtime exception. java.lang.IllegalStateException: generationStamp (=1) == GenerationStamp.WILDCARD_STAMP" reported by stack earlier.
dhruba borthakur made changes - 31/Jul/09 01:24 AM
This patch ignores deleting blocks during block report processing if the file is underConstruction. Updated test case to tets this corner case.
dhruba borthakur made changes - 13/Aug/09 06:50 AM
I've been testing this on my 20 node cluster here. Lease recovery can take a long time which is a bit of an issue. The sync seems to be pretty good overall, we are recovering most of the edits up until the last flush, and it's pretty responsive.
However, I have discovered a new bug, the scenario is like so:
I've attached logs, hadoop fsck, stack traces from hbase.
ryan rawson made changes - 27/Aug/09 02:08 AM
@ryan, stack: thanks for the info. I will follow it up very soon. will keep you posted.
I think this is the scenario that you are facing:
1. The file is written for the first time. The writer closed the file but at this time only one of the three replicas have checked in with the namenode. One option is to set dfs.replication.min to 2. This will ensure that closing a file (step 1) will be successful only when at least two replicas of the block have checked in with the namenode. This should reduce the probability of this problem occuring. Along with that you could set the replication factor of the hbase log file(s) to be greater than 3. I'm not sure exactly what I'm seeing, here is some flow of events:
namenode says when the regionserver closes a log: 2009-09-23 17:21:05,128 DEBUG org.apache.hadoop.hdfs.StateChange: BLOCK NameNode.blockReceived: from 10.10.21.38:50010 1 blocks. So at this point we have 3 blocks, they have all checked in, right? then during logfile recovery we see: 2009-09-23 17:21:45,997 DEBUG org.apache.hadoop.hdfs.StateChange: DIR NameNode.append: file /hbase/.logs/sv4borg32,60020,1253751520085/hlog.dat.1253751663228 for DFSCl At this point the client is stuck waiting on a block recovery that wont ever really happen. I set this config at your suggestion: in the hbase-site.xml so it should be in the HBase dfs client config. from my reading of the logfile, apparently the namenode seems to "forget" the other 2 locations of the block in question. Am I reading it wrong? scratch the last, i was having some environment/library version problems.
Tsz Wo (Nicholas), SZE made changes - 29/Sep/09 08:38 PM
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
See also https://issues.apache.org/jira/secure/attachment/12391788/20081008testplan.txt