Hadoop Common / HADOOP-96

name server should log decisions that affect data: block creation, removal, replication

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 0.1.0
    • Fix Version/s: 0.2.0
    • Component/s: None
    • Labels: None

    Description

      currently, there's no way to analyze and debug DFS errors where blocks disappear.
      the name server should log its decisions that affect data, including block creation, removal, and replication:

      • block <b> created, assigned to datanodes A, B, ...
      • datanode A dead, block <b> underreplicated(1), replicating to datanode C
      • datanode B dead, block <b> underreplicated(2), replicating to datanode D
      • datanode A alive, block <b> overreplicated, removing from datanode D
      • block <b> removed from datanodes C, D, ...

      that will enable me to track down, two weeks later, a block that's missing from a file, and to debug the name server.

      extra credit:

      • rotate log file, as it might grow large
      • make this behaviour optional/configurable
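
      As a rough illustration of what such state-change logging could look like (a minimal sketch using java.util.logging; the logger name, messages, and levels are assumptions, not the eventual patch):

        import java.util.logging.ConsoleHandler;
        import java.util.logging.Level;
        import java.util.logging.Logger;

        public class StateChangeLogSketch {
          // a dedicated logger keeps these messages separate from general namenode output
          private static final Logger LOG =
              Logger.getLogger("org.apache.hadoop.dfs.StateChange");

          public static void main(String[] args) {
            // FINE messages are suppressed by default; attach a handler that passes them
            ConsoleHandler console = new ConsoleHandler();
            console.setLevel(Level.FINE);
            LOG.addHandler(console);
            LOG.setLevel(Level.FINE);

            String block = "blk_1234";
            LOG.fine("block " + block + " created, assigned to datanodes A, B");
            LOG.fine("datanode A dead, block " + block + " underreplicated(1), replicating to datanode C");
            LOG.fine("datanode A alive, block " + block + " overreplicated, removing from datanode C");
          }
        }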

      Attachments

        1. namenodeLogging.patch
          56 kB
          Hairong Kuang
        2. namenodeLogging.patch
          55 kB
          Hairong Kuang
        3. namenodeLogging.patch
          55 kB
          Hairong Kuang

        Activity

          yarnon Yoram Arnon added a comment -

          the plan is to add a log line for each change in the name space and each change in block placement or replication. What we get is effectively a trace of program execution for DFS changes.
          the log will go to a new log object, to enable switching this (extensive) logging on or off.
          name space changes will be logged at level fine, block commit changes at finer, and block pending changes at finest.
          In order to facilitate tracing of multiple concurrent operations, each line will include the thread id of the name server's thread. For that we derive a logging class that places the thread id right after the date/time.

          we log in the following methods of class NameNode, and in methods of class FSNamesystem called by them:
          create (startFile)
          abandonFileInProgress (abandonFileInProgress)
          AbandonBlock (AbandonBlock)
          reportWrittenBlock (blockReceived)
          addBlock (getAdditionalBlock)
          Complete (completeFile)
          rename (renameTo)
          delete (delete)
          Mkdirs (Mkdirs)
          sendHeartbeat (getHeartbeat)
          blockReport (processReport)
          blockReceived (blockReceived)
          errorReport
          getBlockWork (pendingTransfer, blocksToInvalidate)
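
          One possible shape for such a formatter (a sketch assuming java.util.logging; the class name and exact output layout are assumptions, and the final patch reused Hadoop's existing LogFormatter instead, per the follow-up below):

            import java.util.Date;
            import java.util.logging.Formatter;
            import java.util.logging.LogRecord;

            // places the thread id right after the date/time, before level and message
            public class ThreadIdFormatter extends Formatter {
              public String format(LogRecord record) {
                return new Date(record.getMillis())
                    + " [" + record.getThreadID() + "] "
                    + record.getLevel() + " "
                    + formatMessage(record) + "\n";
              }
            }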

          cutting Doug Cutting added a comment -

          This sounds like a great plan!

          Hadoop's current log formatter has an option to log thread ids. The namenode code can simply turn this on. Also, one can configure the JVM to rotate logs using:

          http://java.sun.com/j2se/1.4.2/docs/api/java/util/logging/FileHandler.html

          In any case, we should add a timestamp to the log file names generated by bin/hadoop-daemon.sh, where standard out and error are logged, regardless of JVM log configuration.
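
          For reference, that FileHandler rotation can also be set up programmatically; a minimal sketch (the file name pattern, size limit, and generation count are assumptions):

            import java.util.logging.FileHandler;
            import java.util.logging.Logger;

            public class RotatingLogSketch {
              public static void main(String[] args) throws Exception {
                // roll over at roughly 1 MB, keeping 5 generations; %g is the generation number
                FileHandler handler = new FileHandler("namenode-%g.log", 1024 * 1024, 5, true);
                Logger log = Logger.getLogger("org.apache.hadoop.dfs.StateChange");
                log.addHandler(handler);
                log.info("namenode state-change logging started");
              }
            }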

          yarnon Yoram Arnon added a comment -

          OK.
          we'll use the current log formatter, and just turn on the output of thread id when we're debugging.


          eric14 Eric Baldeschwieler added a comment -

          One thing that really helped us was to be able to specify the duration to keep log files. So you could configure the system to keep up to N seconds of logs (think one month). This way logs don't grow without bound, but you can be confident how much history will be available. Also, logs gzip pretty well. It would be nice to zip closed logs automatically.

          Don't know how much of this you can get for free from the existing logs packages. Should investigate this.

          Are we just logging on the name node, or are data nodes logging all events too? That seems like it would be desirable as well, using the same mechanisms of course.

          hairong Hairong Kuang added a comment -

          This patch adds the following features
          1. NameNode adds a static field "stateChangeLog" that keeps track of all the namenode state changes
          2. Various logging statements are added to NameNode, FSNamesystem, FSDirectory. Basically namespace (dir) changes are logged at the fine level and block changes are logged at the finer level
          3. initFileHandler is added to LogFormatter. All logs are directed to a log file instead of stdout. All logs are rolled and capped in size. Log file names end with .log.
          4. In hadoop-daemon.sh, stdout is redirected to a .out file.
          5. namenode logging levels, log file max size, and number of generations are configurable.
          6. A JUnit test program is added to test the correctness of namespace change logging.
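
          A rough sketch of how the new field might be used inside the namenode, based on the levels described above (the logger name, message text, prefixes, and method context are assumptions, not the actual patch):

            import java.util.logging.Level;
            import java.util.logging.Logger;

            public class StateChangeUsageSketch {
              // corresponds to the static "stateChangeLog" field from item 1
              static final Logger stateChangeLog =
                  Logger.getLogger("org.apache.hadoop.dfs.StateChange");

              static void logRename(String src, String dst) {
                // namespace (dir) changes go out at FINE
                stateChangeLog.log(Level.FINE, "DIR* renameTo: " + src + " -> " + dst);
              }

              static void logBlockReceived(String block, String datanode) {
                // block changes go out at FINER
                stateChangeLog.log(Level.FINER, "BLOCK* blockReceived: " + block + " from " + datanode);
              }

              public static void main(String[] args) {
                logRename("/user/a/old", "/user/a/new");
                logBlockReceived("blk_1234", "datanode-1");
              }
            }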

          cutting Doug Cutting added a comment -

          Overall I like this. A few nits, however:

          1. It would be nice if, before it switches logging to a file, the name of that file were logged. That way, when folks upgrade they can figure out where their logs are written. This will also help in debugging configuration issues.

          2. When I run bin/start-all.sh on my Linux box, the log files end up in my current directory, in HADOOP_HOME, not in HADOOP_HOME/logs. When I add a print statement, it shows the correct directory for logDir, but that's not where the files are written. I have not tested this on Windows, but it would be good to check, with a simple configuration (i.e., a hadoop-site.xml that only specifies localhost for the jobtracker and namenode), that the files are written where expected on both Windows and Linux. Perhaps we could even add a unit test for this.

          Thanks!

          cutting Doug Cutting added a comment -

          Also, one of Owen's patches has made applying this patch require manual steps. So if you make a new patch, please first update your tree and merge with Owen's changes. Thanks!

          hairong Hairong Kuang added a comment -

          Yes, it makes sense to log the log file name. I will make the change. If HADOOP_LOG_DIR is set, the log directory is set to be HADOOP_LOG_DIR. Otherwise, if HADOOP_HOME is set, the log directory is set to be HADOOP_HOME/logs. Otherwise, it is set to be the user's home directory. I do not know why you saw log files in HADOOP_HOME. I will take a look.
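
          The resolution order described here, as a small sketch (how the values reach the JVM, environment variables versus system properties, is an assumption):

            public class LogDirSketch {
              static String resolveLogDir() {
                // 1. HADOOP_LOG_DIR wins if set
                String dir = System.getenv("HADOOP_LOG_DIR");
                if (dir != null) return dir;
                // 2. otherwise HADOOP_HOME/logs
                String home = System.getenv("HADOOP_HOME");
                if (home != null) return home + "/logs";
                // 3. otherwise the user's home directory
                return System.getProperty("user.home");
              }

              public static void main(String[] args) {
                System.out.println("logs would go to: " + resolveLogDir());
              }
            }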

          hairong Hairong Kuang added a comment -

          Here is a patch that includes the changes that Doug suggested. I tested it on a RH Linux machine and it worked fine. But it seems to have a problem on Cygwin, in that File.exists() returns true for a nonexistent directory. Doug, could you test it? If I am able to figure out what went wrong on Cygwin, I will resubmit the patch tomorrow.

          hairong Hairong Kuang added a comment -

          The Cygwin problem is solved.

          cutting Doug Cutting added a comment -

          I just committed this.

          I made one additional change. You removed the 'cd $HADOOP_HOME' line in hadoop-daemon.sh. I re-added this to hadoop-daemons.sh, so that, when starting remote daemons, they are always run from HADOOP_HOME rather than the user's home directory, which is more likely to be NFS mounted. The CWD of a daemon is used for core dumps, java profiler output, etc., and it is generally best if it is not NFS mounted.

          Thanks, Hairong!

          hairong Hairong Kuang added a comment -

          Hi Doug,

          Thanks for committing this patch.

          The reason that I removed the line 'cd $HADOOP_HOME' is that I had difficulty starting hadoop from any directory except for "HADOOP_HOME". In my configuration, the log dir & pid dir are relative to "HADOOP_HOME". But "HADOOP_HOME" is relative to the current directory ".". If the script changes the current directory, it is not able to get the log dir and pid dir right.

          An alternative fix is to set HADOOP_HOME to be its absolute path.

          cutting Doug Cutting added a comment -

          I'm now having troubles with things under cygwin. 'bin/hadoop-daemon.sh start namenode' works, but 'bin/start-dfs.sh' silently fails to start any daemons.

          cutting Doug Cutting added a comment -

          I figured out the problem under Cygwin and fixed it. HADOOP_HOME wasn't quoted in hadoop-daemon.sh, so things failed when the path contained a space (as my HADOOP_HOME does under Cygwin). When testing under Cygwin, one should always install in a path that contains a space to test these cases.


          People

            Assignee: hairong Hairong Kuang
            Reporter: yarnon Yoram Arnon
            Votes:
            0
            Watchers:
            0
