Hadoop HDFS
  1. Hadoop HDFS
  2. HDFS-3194

DataNode block scanner is running too frequently

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.0-alpha
    • Fix Version/s: 2.0.2-alpha
    • Component/s: datanode
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      Block scanning interval by default should be taken as 21 days(3 weeks) and each block scanning should happen once in 21 days.
      Here the block is being scanned continuosly.

      2012-04-03 10:44:47,056 INFO org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner: Verification succeeded for BP-241703115-xx.xx.xx.55-1333086229434:blk_-2666054955039014473_1003
      2012-04-03 10:45:02,064 INFO org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner: Verification succeeded for BP-241703115-xx.xx.xx.55-1333086229434:blk_-2666054955039014473_1003
      2012-04-03 10:45:17,071 INFO org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner: Verification succeeded for BP-241703115-xx.xx.xx.55-1333086229434:blk_-2666054955039014473_1003
      2012-04-03 10:45:32,079 INFO org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner: Verification succeeded for BP-241703115-xx.xx.xx.55-1333086229434:blk_-2666054955039014473

      1. HDFS-3194_1.patch
        9 kB
        amith
      2. HDFS-3194_2.patch
        14 kB
        amith
      3. HDFS-3194_4.patch
        14 kB
        amith
      4. HDFS-3194_6.patch
        14 kB
        amith
      5. HDFS-3194_7.patch
        15 kB
        amith
      6. HDFS-3194.patch
        8 kB
        amith
      7. hdfs-3194-1.txt
        1 kB
        Andy Isaacson

        Issue Links

          Activity

          suja s created issue -
          Hide
          Harsh J added a comment -

          org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner does not exist in branch-1. It is a branch-2 feature (attached to Federation). What version are you really using and reporting for? Please fix the version field to reflect the right version…

          Show
          Harsh J added a comment - org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner does not exist in branch-1. It is a branch-2 feature (attached to Federation). What version are you really using and reporting for? Please fix the version field to reflect the right version…
          Hide
          Daryn Sharp added a comment -

          I've seen this behavior in branch-2 & branch-3. Last year, I think Matt Foley fixed this or a similar issue with excessive block scanning. This may be a regression.

          Show
          Daryn Sharp added a comment - I've seen this behavior in branch-2 & branch-3. Last year, I think Matt Foley fixed this or a similar issue with excessive block scanning. This may be a regression.
          suja s made changes -
          Field Original Value New Value
          Fix Version/s 3.0.0 [ 12320356 ]
          Fix Version/s 1.0.3 [ 12320249 ]
          Affects Version/s 3.0.0 [ 12320356 ]
          Affects Version/s 1.0.3 [ 12320249 ]
          Hide
          suja s added a comment -

          @Harsh: I have corrected the version field.

          Show
          suja s added a comment - @Harsh: I have corrected the version field.
          amith made changes -
          Attachment HDFS-3194.patch [ 12522585 ]
          amith made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Hide
          amith added a comment -

          Hi Suja,

          I have done analysis on the reported defect and came up with a patch.

          DataBlockScanner thread will scan for the new blocks for every 5 seconds(new period start), but in the scan method of BlockPoolSliceScanner finally we are rolling the logs(irrespective of entries in logs).

          When a new period starts we are loading the number of entries from the logs and non scanned blocks are scanned.
          Since there is log roll for every 5 sec after 10 seconds the previously scanned entries are lost!!!!

          causing the block scan for the next period. Thus we can see the time difference is 15 sec for every consecutive scan.

          In the provided patch I have introduced log roll only after certain number of entries in the logs are reached(which is similar to 20.x version HDFS behaviour), care is taken not to break log interface which already exist.

          Show
          amith added a comment - Hi Suja, I have done analysis on the reported defect and came up with a patch. DataBlockScanner thread will scan for the new blocks for every 5 seconds(new period start), but in the scan method of BlockPoolSliceScanner finally we are rolling the logs(irrespective of entries in logs). When a new period starts we are loading the number of entries from the logs and non scanned blocks are scanned. Since there is log roll for every 5 sec after 10 seconds the previously scanned entries are lost!!!! causing the block scan for the next period. Thus we can see the time difference is 15 sec for every consecutive scan. In the provided patch I have introduced log roll only after certain number of entries in the logs are reached(which is similar to 20.x version HDFS behaviour), care is taken not to break log interface which already exist.
          Hide
          amith added a comment -

          Hi Suja,

          I have done analysis on the reported defect and came up with a patch.

          DataBlockScanner thread will scan for the new blocks for every 5 seconds(new period start), but in the scan method of BlockPoolSliceScanner finally we are rolling the logs(irrespective of entries in logs).

          When a new period starts we are loading the number of entries from the logs and non scanned blocks are scanned.
          Since there is log roll for every 5 sec after 10 seconds the previously scanned entries are lost!!!!

          causing the block scan for the next period. Thus we can see the time difference is 15 sec for every consecutive scan.

          In the provided patch I have introduced log roll only after certain number of entries in the logs are reached(which is similar to 20.x version HDFS behaviour), care is taken not to break log interface which already exist.

          Show
          amith added a comment - Hi Suja, I have done analysis on the reported defect and came up with a patch. DataBlockScanner thread will scan for the new blocks for every 5 seconds(new period start), but in the scan method of BlockPoolSliceScanner finally we are rolling the logs(irrespective of entries in logs). When a new period starts we are loading the number of entries from the logs and non scanned blocks are scanned. Since there is log roll for every 5 sec after 10 seconds the previously scanned entries are lost!!!! causing the block scan for the next period. Thus we can see the time difference is 15 sec for every consecutive scan. In the provided patch I have introduced log roll only after certain number of entries in the logs are reached(which is similar to 20.x version HDFS behaviour), care is taken not to break log interface which already exist.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12522585/HDFS-3194.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 1 new or modified test files.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          -1 findbugs. The patch appears to introduce 2 new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests:
          org.apache.hadoop.hdfs.TestSetrepDecreasing
          org.apache.hadoop.hdfs.server.blockmanagement.TestOverReplicatedBlocks

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/2272//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/2272//artifact/trunk/hadoop-hdfs-project/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/2272//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12522585/HDFS-3194.patch against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 1 new or modified test files. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 eclipse:eclipse. The patch built with eclipse:eclipse. -1 findbugs. The patch appears to introduce 2 new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed these unit tests: org.apache.hadoop.hdfs.TestSetrepDecreasing org.apache.hadoop.hdfs.server.blockmanagement.TestOverReplicatedBlocks +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/2272//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/2272//artifact/trunk/hadoop-hdfs-project/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/2272//console This message is automatically generated.
          amith made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          amith made changes -
          Attachment HDFS-3194_1.patch [ 12523214 ]
          amith made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Target Version/s 3.0.0 [ 12320356 ]
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12523214/HDFS-3194_1.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 2 new or modified test files.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests:
          org.apache.hadoop.hdfs.TestSetrepDecreasing

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/2298//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/2298//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12523214/HDFS-3194_1.patch against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 2 new or modified test files. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 eclipse:eclipse. The patch built with eclipse:eclipse. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed these unit tests: org.apache.hadoop.hdfs.TestSetrepDecreasing +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/2298//testReport/ Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/2298//console This message is automatically generated.
          Uma Maheswara Rao G made changes -
          Assignee Uma Maheswara Rao G [ umamaheswararao ]
          Uma Maheswara Rao G made changes -
          Assignee Uma Maheswara Rao G [ umamaheswararao ] amith [ amithdk ]
          Eli Collins made changes -
          Link This issue is duplicated by HDFS-3410 [ HDFS-3410 ]
          Hide
          Eli Collins added a comment -

          Per HDFS-3410 looks like startNewPeriod is called by scanBlockPoolSlice every 5 seconds.

          Show
          Eli Collins added a comment - Per HDFS-3410 looks like startNewPeriod is called by scanBlockPoolSlice every 5 seconds.
          Eli Collins made changes -
          Fix Version/s 3.0.0 [ 12320356 ]
          Affects Version/s 2.0.0 [ 12320353 ]
          Affects Version/s 3.0.0 [ 12320356 ]
          Target Version/s 3.0.0 [ 12320356 ] 2.0.0 [ 12320353 ]
          Priority Minor [ 4 ] Major [ 3 ]
          Component/s data-node [ 12312927 ]
          Hide
          amith added a comment -

          Corrected TestDatanodeBlockScanner test which was failing due to timeout

          Show
          amith added a comment - Corrected TestDatanodeBlockScanner test which was failing due to timeout
          amith made changes -
          Attachment HDFS-3194_2.patch [ 12529924 ]
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12529924/HDFS-3194_2.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 4 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          -1 findbugs. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/2526//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/2526//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/2526//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12529924/HDFS-3194_2.patch against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 4 new or modified test files. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 javadoc. The javadoc tool did not generate any warning messages. +1 eclipse:eclipse. The patch built with eclipse:eclipse. -1 findbugs. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/2526//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/2526//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/2526//console This message is automatically generated.
          Hide
          amith added a comment -

          corrected the findbug warning

          Show
          amith added a comment - corrected the findbug warning
          amith made changes -
          Attachment HDFS-3194_4.patch [ 12529934 ]
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12529934/HDFS-3194_4.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 4 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          -1 findbugs. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/2527//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/2527//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/2527//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12529934/HDFS-3194_4.patch against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 4 new or modified test files. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 javadoc. The javadoc tool did not generate any warning messages. +1 eclipse:eclipse. The patch built with eclipse:eclipse. -1 findbugs. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/2527//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/2527//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/2527//console This message is automatically generated.
          Hide
          amith added a comment -

          Attaching a patch

          Show
          amith added a comment - Attaching a patch
          amith made changes -
          Attachment HDFS-3194_6.patch [ 12530162 ]
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12530162/HDFS-3194_6.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 4 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          -1 findbugs. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs:

          org.apache.hadoop.hdfs.server.blockmanagement.TestBlocksWithNotEnoughRacks

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/2540//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/2540//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/2540//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12530162/HDFS-3194_6.patch against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 4 new or modified test files. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 javadoc. The javadoc tool did not generate any warning messages. +1 eclipse:eclipse. The patch built with eclipse:eclipse. -1 findbugs. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs: org.apache.hadoop.hdfs.server.blockmanagement.TestBlocksWithNotEnoughRacks +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/2540//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/2540//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/2540//console This message is automatically generated.
          Eli Collins made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Hide
          Eli Collins added a comment -

          Hi Amith,

          I think the current behavior (rolling the log on each period) is intended. I think the bug is that we're starting a new period way too frequently (every 5 seconds instead of every three weeks by default). DataBlockScanner#run calls scanBlockPoolSlice every 5 seconds, and it starts a new period every time. Instead I think we should only start a new period in BlockPoolSliceScanner#scan when the full period is up. Make sense?

          Thanks,
          Eli

          Show
          Eli Collins added a comment - Hi Amith, I think the current behavior (rolling the log on each period) is intended. I think the bug is that we're starting a new period way too frequently (every 5 seconds instead of every three weeks by default). DataBlockScanner#run calls scanBlockPoolSlice every 5 seconds, and it starts a new period every time. Instead I think we should only start a new period in BlockPoolSliceScanner#scan when the full period is up. Make sense? Thanks, Eli
          Eli Collins made changes -
          Summary Continuous block scanning at DN side DataNode block scanner is running too frequently
          Hide
          Eli Collins added a comment -

          Here's the DN log from a recent trunk build with just one block, we start a new period and re-scan the block every 5 seconds:

          2012-05-30 14:53:46,380 INFO org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner: Starting a new period : work left in prev period : 0.00%
          2012-05-30 14:53:46,380 INFO org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner: Verification succeeded for BP-279320499-127.0.1.1-1338414133947:blk_6750362468537267405_1002
          2012-05-30 14:53:51,381 INFO org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner: Starting a new period : work left in prev period : 0.00%
          2012-05-30 14:53:56,383 INFO org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner: Starting a new period : work left in prev period : 0.00%
          2012-05-30 14:53:56,385 INFO org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner: Verification succeeded for BP-279320499-127.0.1.1-1338414133947:blk_6750362468537267405_1002
          2012-05-30 14:54:01,386 INFO org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner: Starting a new period : work left in prev period : 0.00%
          2012-05-30 14:54:06,388 INFO org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner: Starting a new period : work left in prev period : 0.00%
          2012-05-30 14:54:06,389 INFO org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner: Verification succeeded for BP-279320499-127.0.1.1-1338414133947:blk_6750362468537267405_1002

          Show
          Eli Collins added a comment - Here's the DN log from a recent trunk build with just one block, we start a new period and re-scan the block every 5 seconds: 2012-05-30 14:53:46,380 INFO org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner: Starting a new period : work left in prev period : 0.00% 2012-05-30 14:53:46,380 INFO org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner: Verification succeeded for BP-279320499-127.0.1.1-1338414133947:blk_6750362468537267405_1002 2012-05-30 14:53: 51 ,381 INFO org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner: Starting a new period : work left in prev period : 0.00% 2012-05-30 14:53:56,383 INFO org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner: Starting a new period : work left in prev period : 0.00% 2012-05-30 14:53:56,385 INFO org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner: Verification succeeded for BP-279320499-127.0.1.1-1338414133947:blk_6750362468537267405_1002 2012-05-30 14:54:01,386 INFO org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner: Starting a new period : work left in prev period : 0.00% 2012-05-30 14:54:06,388 INFO org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner: Starting a new period : work left in prev period : 0.00% 2012-05-30 14:54:06,389 INFO org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner: Verification succeeded for BP-279320499-127.0.1.1-1338414133947:blk_6750362468537267405_1002
          Hide
          Eli Collins added a comment -

          I see this on branch-2 as well.

          Show
          Eli Collins added a comment - I see this on branch-2 as well.
          Hide
          amith added a comment -

          Hi Eli
          Thanks for looking into this issue

          This issue may be induced due to some federation related changes (I don't know which)

          I think the current behavior (rolling the log on each period) is intended. I think the bug is that we're starting a new period way too frequently (every 5 seconds instead of every three weeks by default). DataBlockScanner#run calls scanBlockPoolSlice every 5 seconds

          I feel that code will be modified some thing like

          //Sleep everytime except in the first interation.
                if (!firstRun) {
                  try {
                    Thread.sleep(21*24*3600);
                  } catch (InterruptedException ex) {
                    // Interrupt itself again to set the interrupt status
                    blockScannerThread.interrupt();
                    continue;
                  }
          

          If I change the code like this then there is a possibly we will not be able to do incremental block scan
          eg: I will write 10 blocks @time 14:53:46
          now block scanner has scanned the blocks
          I wrote another set of blocks in 15:53:46 then this blocks will not be scanned by block scanner for next 3 weeks.

          Now in my change this is same as Hadoop 0.20.X behavior (Even new period is started logs are rolled only after reaching certain limit).
          where logs will be rolled only after certain number of blocks are reached in the logs.

          Please correct me if I am wrong

          Thanks
          Amith

          Show
          amith added a comment - Hi Eli Thanks for looking into this issue This issue may be induced due to some federation related changes (I don't know which) I think the current behavior (rolling the log on each period) is intended. I think the bug is that we're starting a new period way too frequently (every 5 seconds instead of every three weeks by default). DataBlockScanner#run calls scanBlockPoolSlice every 5 seconds I feel that code will be modified some thing like //Sleep everytime except in the first interation. if (!firstRun) { try { Thread .sleep(21*24*3600); } catch (InterruptedException ex) { // Interrupt itself again to set the interrupt status blockScannerThread.interrupt(); continue ; } If I change the code like this then there is a possibly we will not be able to do incremental block scan eg: I will write 10 blocks @time 14:53:46 now block scanner has scanned the blocks I wrote another set of blocks in 15:53:46 then this blocks will not be scanned by block scanner for next 3 weeks. Now in my change this is same as Hadoop 0.20.X behavior (Even new period is started logs are rolled only after reaching certain limit). where logs will be rolled only after certain number of blocks are reached in the logs. Please correct me if I am wrong Thanks Amith
          Hide
          Eli Collins added a comment -

          Would just removing the call to bpScanner.scanBlockPoolSlice() in DataBlockScanner#run do the trick?

          Show
          Eli Collins added a comment - Would just removing the call to bpScanner.scanBlockPoolSlice() in DataBlockScanner#run do the trick?
          Hide
          amith added a comment -

          BlockPoolSliceScanner is not a independent thread which can run itself I am not clear what you are trying to tell, could u please elaborate on it.

          This is my understanding
          Here irrespective of number of blocks scanned in the DN we are rolling the logs for every 5 sec.
          so after some time DN will not know about the blocks scanned due to frequent log rolling, so this problem has occured.

          Thanks
          Amith

          Show
          amith added a comment - BlockPoolSliceScanner is not a independent thread which can run itself I am not clear what you are trying to tell, could u please elaborate on it. This is my understanding Here irrespective of number of blocks scanned in the DN we are rolling the logs for every 5 sec. so after some time DN will not know about the blocks scanned due to frequent log rolling, so this problem has occured. Thanks Amith
          amith made changes -
          Attachment HDFS-3194_7.patch [ 12532892 ]
          amith made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12532892/HDFS-3194_7.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 5 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 javadoc. The javadoc tool appears to have generated 13 warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/2678//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/2678//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12532892/HDFS-3194_7.patch against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 5 new or modified test files. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 javadoc. The javadoc tool appears to have generated 13 warning messages. +1 eclipse:eclipse. The patch built with eclipse:eclipse. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/2678//testReport/ Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/2678//console This message is automatically generated.
          Hide
          Andy Isaacson added a comment -

          I don't understand why we call startNewPeriod in scanBlockPoolSlice. startNewPeriod should be called exactly once per 3-week period. scanBlockPoolSlice gets called every 5 seconds by DataBlockScanner.run(). So we should not be starting a new period when we scan a pool slice.

          So I think this trivial patch fixes the main issue:

          --- a/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockPoolSliceScanner.java
          +++ b/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockPoolSliceScanner.java
          @@ -560,7 +560,6 @@ class BlockPoolSliceScanner {
             }
             
             void scanBlockPoolSlice() {
          -    startNewPeriod();
               // Create a new processedBlocks structure
               processedBlocks = new HashMap<Long, Integer>();
               if (!assignInitialVerificationTimes()) {
          

          There may be another issue that's being addressed in HDFS-3194_7.patch but that patch does not appear to address the core issue that we're restarting the period far too frequently. I don't understand what problem is being addressed in _7 though, so maybe I'm missing something.

          Amith, does my patch look correct to you?

          Show
          Andy Isaacson added a comment - I don't understand why we call startNewPeriod in scanBlockPoolSlice. startNewPeriod should be called exactly once per 3-week period. scanBlockPoolSlice gets called every 5 seconds by DataBlockScanner.run(). So we should not be starting a new period when we scan a pool slice. So I think this trivial patch fixes the main issue: --- a/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockPoolSliceScanner.java +++ b/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockPoolSliceScanner.java @@ -560,7 +560,6 @@ class BlockPoolSliceScanner { } void scanBlockPoolSlice() { - startNewPeriod(); // Create a new processedBlocks structure processedBlocks = new HashMap< Long , Integer >(); if (!assignInitialVerificationTimes()) { There may be another issue that's being addressed in HDFS-3194 _7.patch but that patch does not appear to address the core issue that we're restarting the period far too frequently. I don't understand what problem is being addressed in _7 though, so maybe I'm missing something. Amith, does my patch look correct to you?
          Hide
          Uma Maheswara Rao G added a comment -

          scanBlockPoolSlice gets called every 5 seconds by DataBlockScanner.run().

          verification logs will track the information about prvious scan time. So, while scanning, it will load this info from this logs and will ensure that one block scanned only after 3*weeks time.
          HDFS-3194_7.patch addresses one potential problem. For every 5 sec run, logs are simply rolling. for every 3 slices, all verification logs are getting cleaned as max rolled index is 2 I remember and created new ones. So, the same block will get scanned for every 15secs again and again. I remember, we have verification log size limit in 20.x versions. So, it will not be rolled untill it reaches that size. That has been broken here with some other JIRAs.
          Let me check your point why startNewPeriod(); is required for every 5 sec slice.

          Show
          Uma Maheswara Rao G added a comment - scanBlockPoolSlice gets called every 5 seconds by DataBlockScanner.run(). verification logs will track the information about prvious scan time. So, while scanning, it will load this info from this logs and will ensure that one block scanned only after 3*weeks time. HDFS-3194 _7.patch addresses one potential problem. For every 5 sec run, logs are simply rolling. for every 3 slices, all verification logs are getting cleaned as max rolled index is 2 I remember and created new ones. So, the same block will get scanned for every 15secs again and again. I remember, we have verification log size limit in 20.x versions. So, it will not be rolled untill it reaches that size. That has been broken here with some other JIRAs. Let me check your point why startNewPeriod(); is required for every 5 sec slice.
          Hide
          amith added a comment -

          Hi Andy,

          Thanks for looking into the patch and sorry for late reply

          I accept I didnt address the startNewPeriod for every 5 seconds,
          With my current patch BlockSCanner should work fine functionally.

          But yes I accept still I need to fix startNewPeriod currently I am testing the new patch, soon I will provide a new one.

          With my initial test, I got startNewPeriod log will not be logged in first run.
          Let me test the new patch and provide it soon

          Amith

          Show
          amith added a comment - Hi Andy, Thanks for looking into the patch and sorry for late reply I accept I didnt address the startNewPeriod for every 5 seconds, With my current patch BlockSCanner should work fine functionally. But yes I accept still I need to fix startNewPeriod currently I am testing the new patch, soon I will provide a new one. With my initial test, I got startNewPeriod log will not be logged in first run. Let me test the new patch and provide it soon Amith
          Hide
          Uma Maheswara Rao G added a comment -

          Let me clarify: Here two issues exist

          • One is about frquent logging. HDFS-3194(is mainly talking about that and duplicated that JIRA to this.)
          • Other one is frquent rolling the verfication logs and resulting into ferquent block scanning instead of 3*weeks a block.
          Show
          Uma Maheswara Rao G added a comment - Let me clarify: Here two issues exist One is about frquent logging. HDFS-3194 (is mainly talking about that and duplicated that JIRA to this.) Other one is frquent rolling the verfication logs and resulting into ferquent block scanning instead of 3*weeks a block.
          Suresh Srinivas made changes -
          Link This issue is related to HDFS-1655 [ HDFS-1655 ]
          Uma Maheswara Rao G made changes -
          Link This issue is related to HDFS-3803 [ HDFS-3803 ]
          Hide
          Eli Collins added a comment -

          Andy, want to post a patch for trunk? Per my earlier comment I think we can just remove the call as you commented.

          Show
          Eli Collins added a comment - Andy, want to post a patch for trunk? Per my earlier comment I think we can just remove the call as you commented.
          Eli Collins made changes -
          Assignee amith [ amithdk ] Andy Isaacson [ adi2 ]
          Hide
          Andy Isaacson added a comment -

          Attaching patch to revert HDFS-3803 and avoid starting a new period every time we enter scanBlockPoolSlice.

          Show
          Andy Isaacson added a comment - Attaching patch to revert HDFS-3803 and avoid starting a new period every time we enter scanBlockPoolSlice.
          Andy Isaacson made changes -
          Attachment hdfs-3194-1.txt [ 12541132 ]
          Hide
          Eli Collins added a comment -

          +1 pending jenkins

          Show
          Eli Collins added a comment - +1 pending jenkins
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12541132/hdfs-3194-1.txt
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/3016//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3016//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12541132/hdfs-3194-1.txt against trunk revision . +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 javadoc. The javadoc tool did not generate any warning messages. +1 eclipse:eclipse. The patch built with eclipse:eclipse. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/3016//testReport/ Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3016//console This message is automatically generated.
          Hide
          Eli Collins added a comment -

          I've committed this and merged to branch-2.

          Show
          Eli Collins added a comment - I've committed this and merged to branch-2.
          Eli Collins made changes -
          Status Patch Available [ 10002 ] Resolved [ 5 ]
          Hadoop Flags Reviewed [ 10343 ]
          Target Version/s 2.0.0-alpha [ 12320353 ]
          Fix Version/s 2.2.0-alpha [ 12322472 ]
          Resolution Fixed [ 1 ]
          Hide
          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk-Commit #2653 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk-Commit/2653/)
          HDFS-3194. DataNode block scanner is running too frequently. Contributed by Andy Isaacson (Revision 1373928)

          Result = SUCCESS
          eli : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1373928
          Files :

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockPoolSliceScanner.java
          Show
          Hudson added a comment - Integrated in Hadoop-Hdfs-trunk-Commit #2653 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk-Commit/2653/ ) HDFS-3194 . DataNode block scanner is running too frequently. Contributed by Andy Isaacson (Revision 1373928) Result = SUCCESS eli : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1373928 Files : /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockPoolSliceScanner.java
          Hide
          Hudson added a comment -

          Integrated in Hadoop-Common-trunk-Commit #2588 (See https://builds.apache.org/job/Hadoop-Common-trunk-Commit/2588/)
          HDFS-3194. DataNode block scanner is running too frequently. Contributed by Andy Isaacson (Revision 1373928)

          Result = SUCCESS
          eli : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1373928
          Files :

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockPoolSliceScanner.java
          Show
          Hudson added a comment - Integrated in Hadoop-Common-trunk-Commit #2588 (See https://builds.apache.org/job/Hadoop-Common-trunk-Commit/2588/ ) HDFS-3194 . DataNode block scanner is running too frequently. Contributed by Andy Isaacson (Revision 1373928) Result = SUCCESS eli : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1373928 Files : /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockPoolSliceScanner.java
          Hide
          Uma Maheswara Rao G added a comment -

          Eli, Have seen my previous comment?
          Actually there was another potential issue here and we have raised mainly to address that functional issue.

          DataBlockScanner frequently run the scan by sleeping 5sec. Here we are persisting block scan times in verification log to track the blocks which are really completed 3weeks period or not. The problem what I got is, scan method is cleaning that verification log completely after every run. We have the behavior to take one level backup of verification log.

          so for every 15secs, we will lose the verification log history completely. From then onwards, blocks will be scanned again even though that blocks scanned just before 15sec back. I hope this will clear my scenario.

          I am not sure I am missing some thing here.

          Eli or Andy, have tested this behavior with your patch? your patch solving that issue also?

          Thanks
          Uma

          Show
          Uma Maheswara Rao G added a comment - Eli, Have seen my previous comment? Actually there was another potential issue here and we have raised mainly to address that functional issue. DataBlockScanner frequently run the scan by sleeping 5sec. Here we are persisting block scan times in verification log to track the blocks which are really completed 3weeks period or not. The problem what I got is, scan method is cleaning that verification log completely after every run. We have the behavior to take one level backup of verification log. so for every 15secs, we will lose the verification log history completely. From then onwards, blocks will be scanned again even though that blocks scanned just before 15sec back. I hope this will clear my scenario. I am not sure I am missing some thing here. Eli or Andy, have tested this behavior with your patch? your patch solving that issue also? Thanks Uma
          Hide
          Hudson added a comment -

          Integrated in Hadoop-Mapreduce-trunk-Commit #2618 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Commit/2618/)
          HDFS-3194. DataNode block scanner is running too frequently. Contributed by Andy Isaacson (Revision 1373928)

          Result = FAILURE
          eli : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1373928
          Files :

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockPoolSliceScanner.java
          Show
          Hudson added a comment - Integrated in Hadoop-Mapreduce-trunk-Commit #2618 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Commit/2618/ ) HDFS-3194 . DataNode block scanner is running too frequently. Contributed by Andy Isaacson (Revision 1373928) Result = FAILURE eli : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1373928 Files : /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockPoolSliceScanner.java
          Hide
          Eli Collins added a comment -

          Hey Uma,

          Now that we're starting a new period every 5s we shouldn't be exiting the scan method so frequently which is what causes the verification logs to be rolled right?

          Feel free to file a new jira to close out the issue or submit the tests posted in an earlier patch if you think they should go in.

          Thanks,
          Eli

          Show
          Eli Collins added a comment - Hey Uma, Now that we're starting a new period every 5s we shouldn't be exiting the scan method so frequently which is what causes the verification logs to be rolled right? Feel free to file a new jira to close out the issue or submit the tests posted in an earlier patch if you think they should go in. Thanks, Eli
          Hide
          Uma Maheswara Rao G added a comment -

          Just re-opned this JIRA to check the above mentioned scenario.
          If that scenario works fine, I am fine to close this JIRA.

          Andy, could you please confirm me?

          Also you can observe the grepped logs in description. for the same block id, it is scanning for every 15secs exactly. (infact system has only 1 block while testing I guess for more clear verification on the behavior.)

          Show
          Uma Maheswara Rao G added a comment - Just re-opned this JIRA to check the above mentioned scenario. If that scenario works fine, I am fine to close this JIRA. Andy, could you please confirm me? Also you can observe the grepped logs in description. for the same block id, it is scanning for every 15secs exactly. (infact system has only 1 block while testing I guess for more clear verification on the behavior.)
          Uma Maheswara Rao G made changes -
          Resolution Fixed [ 1 ]
          Status Resolved [ 5 ] Reopened [ 4 ]
          Hide
          Uma Maheswara Rao G added a comment -

          oh, I did not see your comment Eli, without seeing re-opened it.
          I am not sure whether that is reasonable to close this JIRA without satisfying the explained description.
          If you feel that is fine, I will file another JIRA to handle actual functional issue explained in description.

          Show
          Uma Maheswara Rao G added a comment - oh, I did not see your comment Eli, without seeing re-opened it. I am not sure whether that is reasonable to close this JIRA without satisfying the explained description. If you feel that is fine, I will file another JIRA to handle actual functional issue explained in description.
          Hide
          Eli Collins added a comment -

          I thought Andy's explanation in his above comment ("Now in my change this is same as Hadoop 0.20.X behavior (Even new period is started logs are rolled only after reaching certain limit). where logs will be rolled only after certain number of blocks are reached in the logs.") addressed this, am I missing something?

          Show
          Eli Collins added a comment - I thought Andy's explanation in his above comment ("Now in my change this is same as Hadoop 0.20.X behavior (Even new period is started logs are rolled only after reaching certain limit). where logs will be rolled only after certain number of blocks are reached in the logs.") addressed this, am I missing something?
          Hide
          Andy Isaacson added a comment -

          Just re-opned this JIRA to check the above mentioned scenario.

          There is still an issue here even with the patch from hdfs-3194-1.txt. Running on a cluster with just one block we now see

          2012-08-16 15:59:57,884 INFO  datanode.BlockPoolSliceScanner (BlockPoolSliceScanner.java:verifyBlock(391)) - Verification succeeded for BP-2101131164-172.29.122.91-1337906886255:blk_7919273167187535506_4915
          2012-08-16 16:00:07,904 INFO  datanode.BlockPoolSliceScanner (BlockPoolSliceScanner.java:verifyBlock(391)) - Verification succeeded for BP-2101131164-172.29.122.91-1337906886255:blk_7919273167187535506_4915
          2012-08-16 16:00:17,925 INFO  datanode.BlockPoolSliceScanner (BlockPoolSliceScanner.java:verifyBlock(391)) - Verification succeeded for BP-2101131164-172.29.122.91-1337906886255:blk_7919273167187535506_4915
          

          the BP scanner is re-verifying the same replica every 10 seconds.

          We're not restarting a new period though, fetching blockScannerReport?listblocks shows a reasonable amount of "Time left in cur period" although "Progress in this period" is badly wrong:

          Block report for block pool: BP-2101131164-172.29.122.91-1337906886255
          blk_7919273167187535506_4915 : status : ok     type : local  scan time : 1345158298637   2012-08-16 16:04:58,637
          
          Total Blocks                 :      1
          Verified in last hour        :      1
          Verified in last day         :      1
          Verified in last week        :      1
          Verified in last four weeks  :      1
          Verified in SCAN_PERIOD      :      1
          Not yet verified             :      0
          Verified since restart       :     31
          Scans since restart          :     31
          Scan errors since restart    :      0
          Transient scan errors        :      0
          Current scan rate limit KBps :   1024
          Progress this period         :   3200%
          Time left in cur period      :  99.98%
          
          Show
          Andy Isaacson added a comment - Just re-opned this JIRA to check the above mentioned scenario. There is still an issue here even with the patch from hdfs-3194-1.txt. Running on a cluster with just one block we now see 2012-08-16 15:59:57,884 INFO datanode.BlockPoolSliceScanner (BlockPoolSliceScanner.java:verifyBlock(391)) - Verification succeeded for BP-2101131164-172.29.122.91-1337906886255:blk_7919273167187535506_4915 2012-08-16 16:00:07,904 INFO datanode.BlockPoolSliceScanner (BlockPoolSliceScanner.java:verifyBlock(391)) - Verification succeeded for BP-2101131164-172.29.122.91-1337906886255:blk_7919273167187535506_4915 2012-08-16 16:00:17,925 INFO datanode.BlockPoolSliceScanner (BlockPoolSliceScanner.java:verifyBlock(391)) - Verification succeeded for BP-2101131164-172.29.122.91-1337906886255:blk_7919273167187535506_4915 the BP scanner is re-verifying the same replica every 10 seconds. We're not restarting a new period though, fetching blockScannerReport?listblocks shows a reasonable amount of "Time left in cur period" although "Progress in this period" is badly wrong: Block report for block pool: BP-2101131164-172.29.122.91-1337906886255 blk_7919273167187535506_4915 : status : ok type : local scan time : 1345158298637 2012-08-16 16:04:58,637 Total Blocks : 1 Verified in last hour : 1 Verified in last day : 1 Verified in last week : 1 Verified in last four weeks : 1 Verified in SCAN_PERIOD : 1 Not yet verified : 0 Verified since restart : 31 Scans since restart : 31 Scan errors since restart : 0 Transient scan errors : 0 Current scan rate limit KBps : 1024 Progress this period : 3200% Time left in cur period : 99.98%
          Hide
          Uma Maheswara Rao G added a comment -

          Thanks Andy, for your verification. This is what exactly I was trying to explain in last few comments.
          Latest patch from Amith should address this issue.
          As I have helped him some times, I don't want to put directly +1 or review here and push this in. That is the reason, I just left it to get the review from others. There can be a good suggestions from others.

          I will explain how we were addressing similar stuff in older versions:
          before rolling the logs, we were having size check if verification log/max entries check. Till it reaches that many entries, it will not roll the logs immediately.

          Show
          Uma Maheswara Rao G added a comment - Thanks Andy, for your verification. This is what exactly I was trying to explain in last few comments. Latest patch from Amith should address this issue. As I have helped him some times, I don't want to put directly +1 or review here and push this in. That is the reason, I just left it to get the review from others. There can be a good suggestions from others. I will explain how we were addressing similar stuff in older versions: before rolling the logs, we were having size check if verification log/max entries check. Till it reaches that many entries, it will not roll the logs immediately.
          Hide
          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk #1137 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/1137/)
          HDFS-3194. DataNode block scanner is running too frequently. Contributed by Andy Isaacson (Revision 1373928)

          Result = FAILURE
          eli : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1373928
          Files :

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockPoolSliceScanner.java
          Show
          Hudson added a comment - Integrated in Hadoop-Hdfs-trunk #1137 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/1137/ ) HDFS-3194 . DataNode block scanner is running too frequently. Contributed by Andy Isaacson (Revision 1373928) Result = FAILURE eli : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1373928 Files : /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockPoolSliceScanner.java
          Hide
          Hudson added a comment -

          Integrated in Hadoop-Mapreduce-trunk #1169 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1169/)
          HDFS-3194. DataNode block scanner is running too frequently. Contributed by Andy Isaacson (Revision 1373928)

          Result = FAILURE
          eli : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1373928
          Files :

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockPoolSliceScanner.java
          Show
          Hudson added a comment - Integrated in Hadoop-Mapreduce-trunk #1169 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1169/ ) HDFS-3194 . DataNode block scanner is running too frequently. Contributed by Andy Isaacson (Revision 1373928) Result = FAILURE eli : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1373928 Files : /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockPoolSliceScanner.java
          Hide
          Eli Collins added a comment -

          Thanks Andy, please file a new jira to track this issue and link it here.

          Show
          Eli Collins added a comment - Thanks Andy, please file a new jira to track this issue and link it here.
          Hide
          Radim Kolar added a comment -

          can be this merged to 0.23?

          Show
          Radim Kolar added a comment - can be this merged to 0.23?
          Hide
          Andy Isaacson added a comment -

          Latest patch from Amith should address this issue.

          I assume you're referring to HDFS-3194_7.patch .

          I've asked for a description before of how it solves the problem, because that was not obvious from the discussion nor from reading the diff. I'm disappointed that nobody responded to that request, so I've gone and read the patch in detail, and I think I can finally explain the approach 7.patch is using.

          The problem 7.patch is solving is: currently when we finish scanning a BP, we unconditionally rotate the log. We keep two logs, the previous and the current logs. We sometimes re-scan a BP before the scanPeriod completes. If rescan happens twice within a single scanPeriod, the logs will rotate away and we will forget which blocks we previously scanned, so we will scan the first blocks again.

          To fix this, 7.patch delays rotating the logs until the log has reached a predetermined size, rather than rotating when the scan completes.

          +  static final int verficationLogLimit = 5;
          

          What does this constant do? It seems to govern the block verification log size, but I don't understand why we want to keep 5 log entries for every block in blockMap.

          +  private static long BLOCK_SCAN_PERIOD_UNIT = 3600 * 1000;
          ...
          -    this.scanPeriod = hours * 3600 * 1000;
          +    this.scanPeriod = hours * BLOCK_SCAN_PERIOD_UNIT;
          

          I don't think adding a named constant here is an improvement, but if you feel that it helps, please use a more descriptive name for this constant, like MS_PER_HOUR or something similar.

          Uma, Amith – Have you tested 7.patch with multiple block pools and a full cluster restart? I think the changed code will leave a dncp_block_verification_log.prev in multiple BP directories, and I suspect that the BLockPoolSliceScanner might resume from the wrong place if there are multiple verification_logs in the data directories.

          Per Eli's request, I'm going to close this Jira as resolved by my one-line patch which resolves the "Block scanner runs too frequently" bug, and open a new Jira to track the "Block scanner repeatedly rescans blocks" bug which is addressed by HDFS-3194_7.patch.

          Show
          Andy Isaacson added a comment - Latest patch from Amith should address this issue. I assume you're referring to HDFS-3194 _7.patch . I've asked for a description before of how it solves the problem, because that was not obvious from the discussion nor from reading the diff. I'm disappointed that nobody responded to that request, so I've gone and read the patch in detail, and I think I can finally explain the approach 7.patch is using. The problem 7.patch is solving is: currently when we finish scanning a BP, we unconditionally rotate the log. We keep two logs, the previous and the current logs. We sometimes re-scan a BP before the scanPeriod completes. If rescan happens twice within a single scanPeriod, the logs will rotate away and we will forget which blocks we previously scanned, so we will scan the first blocks again. To fix this, 7.patch delays rotating the logs until the log has reached a predetermined size, rather than rotating when the scan completes. + static final int verficationLogLimit = 5; What does this constant do? It seems to govern the block verification log size, but I don't understand why we want to keep 5 log entries for every block in blockMap. + private static long BLOCK_SCAN_PERIOD_UNIT = 3600 * 1000; ... - this .scanPeriod = hours * 3600 * 1000; + this .scanPeriod = hours * BLOCK_SCAN_PERIOD_UNIT; I don't think adding a named constant here is an improvement, but if you feel that it helps, please use a more descriptive name for this constant, like MS_PER_HOUR or something similar. Uma, Amith – Have you tested 7.patch with multiple block pools and a full cluster restart? I think the changed code will leave a dncp_block_verification_log.prev in multiple BP directories, and I suspect that the BLockPoolSliceScanner might resume from the wrong place if there are multiple verification_logs in the data directories. Per Eli's request, I'm going to close this Jira as resolved by my one-line patch which resolves the "Block scanner runs too frequently" bug, and open a new Jira to track the "Block scanner repeatedly rescans blocks" bug which is addressed by HDFS-3194 _7.patch.
          Hide
          Andy Isaacson added a comment -

          Created HDFS-3828 to track the repeated rescans issue.

          Show
          Andy Isaacson added a comment - Created HDFS-3828 to track the repeated rescans issue.
          Andy Isaacson made changes -
          Status Reopened [ 4 ] Resolved [ 5 ]
          Resolution Fixed [ 1 ]
          Andy Isaacson made changes -
          Link This issue relates to HDFS-3828 [ HDFS-3828 ]
          Hide
          Uma Maheswara Rao G added a comment -

          Hi Andy,
          Thanks for handling this issue.
          I am really sorry If you disappointed. I am here to explain you the actual problem. please refer the comment That comment is explaining the same problem what you got after reading the patch.

          Some how I did not get the meaning from your any of the comment that you are asking for the explanation of HDFS-3194_7.patch description.

          If we plan to address the actual issue, I am fine to close this JIRA.
          Please file a separate JIRA or address here, I am ok for either cases.
          I did not ran any test with multiple BPs.
          How logs are cleaned in single bp case? (Also you can observe the behavior of the single dir case in branch-1), as per the fix, there should not be any difference I think.(Not sure if we miss something)

          Also worth editing the title also 'DataNode block scanner is running too frequently'. Because DataBlockScanner will run for every 5secs. The issue is mainly about frequent logging.
          Also description also different than logging issue. Please consider edditing if you plan to file separate JIRA.

          Thanks a lot guys for understanding the actual bug now.

          Show
          Uma Maheswara Rao G added a comment - Hi Andy, Thanks for handling this issue. I am really sorry If you disappointed. I am here to explain you the actual problem. please refer the comment That comment is explaining the same problem what you got after reading the patch. Some how I did not get the meaning from your any of the comment that you are asking for the explanation of HDFS-3194 _7.patch description. If we plan to address the actual issue, I am fine to close this JIRA. Please file a separate JIRA or address here, I am ok for either cases. I did not ran any test with multiple BPs. How logs are cleaned in single bp case? (Also you can observe the behavior of the single dir case in branch-1), as per the fix, there should not be any difference I think.(Not sure if we miss something) Also worth editing the title also 'DataNode block scanner is running too frequently'. Because DataBlockScanner will run for every 5secs. The issue is mainly about frequent logging. Also description also different than logging issue. Please consider edditing if you plan to file separate JIRA. Thanks a lot guys for understanding the actual bug now.
          Hide
          Andy Isaacson added a comment -

          I did not ran any test with multiple BPs.

          OK, thanks for verifying, we will need to test with multiple BPs. It looks like TestMultipleNNDataBlockScanner might test this case, I will check.

          Also worth editing the title also 'DataNode block scanner is running too frequently'. Because DataBlockScanner will run for every 5secs. The issue is mainly about frequent logging.

          Since the patch has already been submitted with this description, I do not think that changing the title now is a benefit.

          Please consider edditing if you plan to file separate JIRA.

          I've filed HDFS-3838, feel free to edit the description there if you feel it's unclear.

          Show
          Andy Isaacson added a comment - I did not ran any test with multiple BPs. OK, thanks for verifying, we will need to test with multiple BPs. It looks like TestMultipleNNDataBlockScanner might test this case, I will check. Also worth editing the title also 'DataNode block scanner is running too frequently'. Because DataBlockScanner will run for every 5secs. The issue is mainly about frequent logging. Since the patch has already been submitted with this description, I do not think that changing the title now is a benefit. Please consider edditing if you plan to file separate JIRA. I've filed HDFS-3838 , feel free to edit the description there if you feel it's unclear.
          Hide
          Andy Isaacson added a comment -

          I've filed HDFS-3838

          Typo, I actually filed HDFS-3828.

          Show
          Andy Isaacson added a comment - I've filed HDFS-3838 Typo, I actually filed HDFS-3828 .
          Arun C Murthy made changes -
          Status Resolved [ 5 ] Closed [ 6 ]

            People

            • Assignee:
              Andy Isaacson
              Reporter:
              suja s
            • Votes:
              0 Vote for this issue
              Watchers:
              16 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development