Hadoop Common / HADOOP-1139

All block transitions should be logged at log level INFO

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Fix Version/s: 0.14.0

    Description

      The namenode records block transitions in its log file. It is seen that some of the block transition messages were being logged at debug level. These should be done at INFO level.

      Attachments

        1. blockTransitionLog3.patch
          4 kB
          Dhruba Borthakur

        Issue Links

          Activity

            org.apache.hadoop.dfs.StateChange is used to record state changes to blocks. All other log messages in the FSNamesystem go to org.apache.hadoop.fs.FSNamesystem. Thus, it is possible to control log message of block state changes without affecting other logging levels.

            A production site could set only org.apache.hadoop.dfs.StateChange to debug level. I am proposing that no code change is necessary.

            dhruba Dhruba Borthakur added a comment
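Dhruba's point about the separate StateChange logger can be illustrated with a log4j configuration fragment (a sketch only; the logger names come from the comment above, but the exact appender setup will vary per installation):

```properties
# Keep general namesystem messages at the usual INFO level.
log4j.logger.org.apache.hadoop.fs.FSNamesystem=INFO
# Raise only block state-change messages to DEBUG, without
# affecting any other logging in the namenode.
log4j.logger.org.apache.hadoop.dfs.StateChange=DEBUG
```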

            The intent here was to make the logging of block events the default. Changing the setting in configuration is easy to overlook and if a block error occurs then it becomes impossible to diagnose. I would still vote for changing the log messages to go out at INFO.

            sameerp Sameer Paranjpye added a comment

            What's the motivation to log these at INFO? Is there any reason I would care about block transition messages if I were not debugging hadoop?

            bien Michael Bieniosek added a comment

            I think the namenode will be filled with block transition messages if we log all of them at INFO level. The following are block transition events:

            1. Allocate a block for a file that is being written to. Log this event.
            2. When the datanodes confirm receipt of a block, log this event.
            3. When the file is closed, these blocks move to a different category. Log this event.
            4. When a file gets deleted, the namenode tells the datanodes to delete those blocks. Log these.
            5. When a datanode dies, replicate blocks. Log these events.

            I am guessing that Items 1, 2 and 3 will flood the namenode log.

            dhruba Dhruba Borthakur added a comment

            > What's the motivation to log these at INFO? Is there any reason I would care about block transition messages if I were not debugging hadoop?

            Block-related errors (blocks getting lost, having too many or too few replicas, etc.) are very hard to diagnose without a history of block events. These bugs are frequently exposed by running systems, and the circumstances in which they occur can be hard to reproduce in a testbed. If you lost a block on your HDFS installation, you would want to know why; that answer can be obtained with much less pain if there is a trail to follow.

            sameerp Sameer Paranjpye added a comment

            I don't have a problem losing blocks on my HDFS installation (possibly because I have a much smaller installation). I don't see why I should pay the price of having my namenode log flooded with these messages. If you need to debug block related errors, why don't you just set the namenode logging to DEBUG on your installation?

            bien Michael Bieniosek added a comment

            I ran an experiment on a cluster that has about 15M blocks. I switched on logging when a block is allocated and when the namenode receives a block-confirmation from the datanode for a block. The log size at the end of an 8-hour random-writer run was about 6.5GB. This shows that logging all block transitions at INFO might flood the logs.

            An alternative is to print log messages at the following events only:

            1. When a file is created and then closed, log all blocks that belong to that file.
            2. When a file gets deleted, log all the blocks that belong to that file.
            3. When the namenode replication engine triggers re-replication of a block, log it.
            4. When a replica is detected to be corrupt, it is deleted. Log it.

            dhruba Dhruba Borthakur added a comment
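For a rough sense of scale, the 6.5 GB figure from the experiment can be turned into a back-of-envelope rate (a sketch; the ~150 bytes-per-line average is an assumption, not something measured in the experiment):

```python
# Back-of-envelope estimate of namenode log volume from the
# experiment above: 6.5 GB of log over an 8-hour run.
log_bytes = 6.5 * 10**9          # total log size from the experiment
run_seconds = 8 * 3600           # 8-hour random-writer run
bytes_per_line = 150             # assumed average log-line length

lines = log_bytes / bytes_per_line              # roughly 43 million log lines
rate_kb_per_s = log_bytes / run_seconds / 1000  # roughly 226 KB of log per second

print(f"~{lines / 1e6:.0f}M lines, ~{rate_kb_per_s:.0f} KB/s sustained")
```

At tens of millions of lines per run, even grepping the log for a single block's history becomes slow, which is the flooding concern being raised here.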

            1. When a file is created and then closed, log all blocks that belong to that file.
            2. When a file gets deleted, log all the blocks that belong to that file.
            3. When the namenode replication engine triggers re-replication of a block, log it.
            4. When a replica is detected to be corrupt, it is deleted. Log it.
            5. When a datanode confirms receipt of a block, log it.

            dhruba Dhruba Borthakur added a comment

            This new patch incorporates a few of Hairong's comments.

            1. There is no need to check LOG.isInfoEnabled() because logs are typically at INFO level anyway.
            2. Change the comment in addStoredBlock() on why we log when safemode is off.

            dhruba Dhruba Borthakur added a comment

            This removes all calls to LOG.isInfoEnabled() except at one place where a big hunk of code is required to log.

            dhruba Dhruba Borthakur added a comment
            hairong Hairong Kuang added a comment -

            +1 Code looks good.

            hadoopqa Hadoop QA added a comment -

            +1 http://issues.apache.org/jira/secure/attachment/12358728/blockTransitionLog3.patch applied and successfully tested against trunk revision r543622.
            Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/235/testReport/
            Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/235/console
            cutting Doug Cutting added a comment -

            I just committed this. Thanks, Dhruba!

            Michael: the idea of this change is that, if you ever do lose a block, you'll wish you had these events logged. I think it has been constrained so that the log is not flooded, but important events are still logged to permit some post-mortem analysis.

            hadoopqa Hadoop QA added a comment -

            Integrated in Hadoop-Nightly #122 (See http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/122/ )

            People

              Assignee: dhruba Dhruba Borthakur
              Reporter: dhruba Dhruba Borthakur
              Votes: 0
              Watchers: 1

              Dates

                Created:
                Updated:
                Resolved: