Details
Type: Improvement
Status: Closed
Priority: Critical
Resolution: Fixed
0.1.0
None
None
Description
Currently, there's no way to analyze and debug DFS errors where blocks disappear.
The name server should log its decisions that affect data, including block creation, removal, and replication:
- block <b> created, assigned to datanodes A, B, ...
- datanode A dead, block <b> underreplicated(1), replicating to datanode C
- datanode B dead, block <b> underreplicated(2), replicating to datanode D
- datanode A alive, block <b> overreplicated, removing from datanode D
- block <b> removed from datanodes C, D, ...
That will enable me to track down, two weeks later, a block that's missing from a file, and to debug the name server.
extra credit:
- rotate log file, as it might grow large
- make this behaviour optional/configurable
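Both extra-credit items can be sketched with the standard java.util.logging API, assuming that is the logging framework in use. The class name RotatingLogSetup, the method attach, and all of its parameters are hypothetical and exist only for illustration. FileHandler rotates through a fixed number of files once a size limit is reached, and a boolean flag switches the logger off entirely.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.logging.FileHandler;
import java.util.logging.Level;
import java.util.logging.Logger;
import java.util.logging.SimpleFormatter;

// Hypothetical helper; class, method, and parameter names are illustrative.
final class RotatingLogSetup {

    /**
     * Attaches a size-rotated file handler to the given logger, or switches
     * the logger off when enabled is false. Returns the handler, or null
     * when logging was disabled.
     */
    static FileHandler attach(Logger log, String pattern, int maxBytes,
                              int fileCount, boolean enabled) {
        if (!enabled) {
            log.setLevel(Level.OFF);      // the logger drops every record
            return null;
        }
        try {
            // Rotate through fileCount files of roughly maxBytes each;
            // "%g" in the pattern is replaced by the generation number.
            FileHandler handler =
                    new FileHandler(pattern, maxBytes, fileCount, true);
            handler.setFormatter(new SimpleFormatter());
            handler.setLevel(Level.ALL);  // leave filtering to the logger
            log.addHandler(handler);
            log.setLevel(Level.ALL);
            return handler;
        } catch (IOException e) {
            throw new UncheckedIOException(
                    "cannot open log file " + pattern, e);
        }
    }
}
```

A call such as attach(log, "namenode-%g.log", 10000000, 5, true) would keep at most five files of about 10 MB each; the file name and sizes here are made-up values, and in practice the flag would come from configuration.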
The plan is to add a log line for each change in the name space and each change in block placement or replication. The result is effectively a trace of program execution for DFS changes.
The log will go to a new log object, so that this (extensive) logging can be switched on or off.
Name space changes will be logged at level FINE, block commit changes at FINER, and block pending changes at FINEST.
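A minimal sketch of the three levels on a dedicated logger, assuming java.util.logging. The class StateChangeLog, the logger name "dfs.StateChange", the method names, and the message wording are all hypothetical, chosen only to illustrate the idea.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical helper; none of these names come from the actual code base.
final class StateChangeLog {

    // A separate logger, so this tracing can be switched on or off
    // without touching the rest of the name server's logging.
    static final Logger LOG = Logger.getLogger("dfs.StateChange");

    // Name space change (create, rename, delete, mkdirs): level FINE.
    static String nameSpaceChange(String op, String path) {
        String msg = "namespace " + op + " " + path;
        LOG.log(Level.FINE, msg);
        return msg;
    }

    // Block commit change (block stored or removed): level FINER.
    static String blockCommitted(String block, String datanodes) {
        String msg = "block " + block + " committed on " + datanodes;
        LOG.log(Level.FINER, msg);
        return msg;
    }

    // Block pending change (replication or removal scheduled): level FINEST.
    static String blockPending(String block, String action) {
        String msg = "block " + block + " pending " + action;
        LOG.log(Level.FINEST, msg);
        return msg;
    }
}
```

With the logger set to Level.FINE only name space changes pass; Level.FINEST passes all three kinds, and Level.OFF suppresses them all. Handlers filter by their own level as well (ConsoleHandler defaults to INFO), so the handler level has to be lowered too before any of these lines appear.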
To facilitate tracing of multiple concurrent operations, each line will include the thread id of the name server's thread. For that we derive a logging class that places the thread id right after the date/time.
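One way such a class could look, assuming java.util.logging, is a Formatter subclass. The class name ThreadIdFormatter, the date pattern, and the line layout are assumptions made for this sketch, not the format actually adopted.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.logging.Formatter;
import java.util.logging.LogRecord;

// Hypothetical formatter; the name and the line layout are illustrative.
final class ThreadIdFormatter extends Formatter {

    // Assumed timestamp layout; SimpleDateFormat is not thread-safe,
    // which is why format() below is synchronized.
    private final SimpleDateFormat dateFormat =
            new SimpleDateFormat("yyMMdd HHmmss");

    @Override
    public synchronized String format(LogRecord record) {
        String when = dateFormat.format(new Date(record.getMillis()));
        // getThreadID() is the logging framework's own identifier for the
        // thread that created the record (newer JDKs prefer
        // getLongThreadID()); it sits right after the date/time.
        return when + " [" + record.getThreadID() + "] "
                + record.getLevel() + " "
                + formatMessage(record)
                + System.lineSeparator();
    }
}
```

The formatter takes effect once it is installed on a handler with handler.setFormatter(new ThreadIdFormatter()). Lines from concurrent operations can then be separated by filtering on the bracketed id.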
We log in the following methods of class NameNode, and in the methods of class nameSystem called by them (shown in parentheses):
create (startFile)
abandonFileInProgress (abandonFileInProgress)
abandonBlock (abandonBlock)
reportWrittenBlock (blockReceived)
addBlock (getAdditionalBlock)
complete (completeFile)
rename (renameTo)
delete (delete)
mkdirs (mkdirs)
sendHeartbeat (getHeartbeat)
blockReport (processReport)
blockReceived (blockReceived)
errorReport
getBlockWork (pendingTransfer, blocksToInvalidate)