Hadoop HDFS / HDFS-6110

adding more slow action log in critical write path

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.0.0, 2.3.0
    • Fix Version/s: 2.5.0
    • Component/s: datanode
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      Log slow i/o. Set log thresholds in dfsclient and datanode via the below new configs:

      dfs.client.slow.io.warning.threshold.ms (Default 30 seconds)
      dfs.datanode.slow.io.warning.threshold.ms (Default 300ms)
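A minimal sketch of how the two thresholds and their defaults fit together. It assumes a plain `Map<String, String>` in place of Hadoop's configuration object; the class `SlowIoThresholds` and its methods are hypothetical names used only for illustration, while the keys and defaults come from the release note.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: models the threshold lookup with plain java.util types.
// It is not the patch's code and does not use Hadoop's Configuration class.
final class SlowIoThresholds {
    // Key names and defaults as stated in the release note.
    static final String CLIENT_KEY = "dfs.client.slow.io.warning.threshold.ms";
    static final String DATANODE_KEY = "dfs.datanode.slow.io.warning.threshold.ms";
    static final long CLIENT_DEFAULT_MS = 30_000L; // 30 seconds
    static final long DATANODE_DEFAULT_MS = 300L;  // 300 ms

    private SlowIoThresholds() {}

    // Returns the configured value in milliseconds, or the default when the key is unset.
    static long getLong(Map<String, String> conf, String key, long defaultMs) {
        String raw = conf.get(key);
        return raw == null ? defaultMs : Long.parseLong(raw.trim());
    }

    static long clientThresholdMs(Map<String, String> conf) {
        return getLong(conf, CLIENT_KEY, CLIENT_DEFAULT_MS);
    }

    static long datanodeThresholdMs(Map<String, String> conf) {
        return getLong(conf, DATANODE_KEY, DATANODE_DEFAULT_MS);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        // A latency-sensitive client could lower its own threshold (value chosen for illustration).
        conf.put(CLIENT_KEY, "100");
        System.out.println("client=" + clientThresholdMs(conf) + "ms");
        System.out.println("datanode=" + datanodeThresholdMs(conf) + "ms");
    }
}
```

Keeping the client default high and the datanode default low mirrors the reasoning in the comments: throughput-oriented clients stay quiet, while the datanode side still surfaces slow i/o.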

      Description

      After digging into an HBase write spike issue caused by slow buffered I/O in our cluster, I realized we had better add more abnormal-latency warning logs to the write flow, so that if others hit an HLog sync spike, we can get more detailed information from the HDFS side at the same time.
      Patch will be uploaded soon.

      1. HDFS-6110v6.txt
        13 kB
        stack
      2. HDFS-6110v5.txt
        11 kB
        Liang Xie
      3. HDFS-6110v4.txt
        11 kB
        Liang Xie
      4. HDFS-6110v3.txt
        11 kB
        stack
      5. HDFS-6110-v2.txt
        10 kB
        Liang Xie
      6. HDFS-6110.txt
        6 kB
        Liang Xie

          Activity

          Vinod Kumar Vavilapalli added a comment -

          Closing tickets that are already part of a release.

          Hudson added a comment -

          FAILURE: Integrated in Hadoop-Mapreduce-trunk #1787 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1787/)
          Move HDFS-6110 down to 2.5.0 section of CHANGES.txt (wang: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1598787)

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          Hudson added a comment -

          FAILURE: Integrated in Hadoop-Hdfs-trunk #1760 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/1760/)
          Move HDFS-6110 down to 2.5.0 section of CHANGES.txt (wang: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1598787)

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          Hudson added a comment -

          FAILURE: Integrated in Hadoop-Yarn-trunk #569 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/569/)
          Move HDFS-6110 down to 2.5.0 section of CHANGES.txt (wang: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1598787)

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          Hudson added a comment -

          SUCCESS: Integrated in Hadoop-trunk-Commit #5638 (See https://builds.apache.org/job/Hadoop-trunk-Commit/5638/)
          Move HDFS-6110 down to 2.5.0 section of CHANGES.txt (wang: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1598787)

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          stack added a comment -

          Committed to trunk and branch-2. Thanks for the patch Liang Xie.

          Liang Xie added a comment -

          Ping Stack: there seem to be no objections so far.

          stack added a comment -

          Testing the patch it seems to work nicely. Below are some samples after tuning down the thresholds:

          Datanode-side

          2014-04-28 21:48:33,988 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Slow BlockReceiver write data to disk cost:101ms (threshold=10ms)
          2014-04-28 21:49:28,026 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Slow BlockReceiver write data to disk cost:17ms (threshold=10ms)
          2014-04-28 21:49:39,908 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Slow BlockReceiver write data to disk cost:235ms (threshold=10ms)
          2014-04-28 21:50:44,382 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Slow BlockReceiver write data to disk cost:194ms (threshold=10ms)
          2014-04-28 21:51:54,831 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Slow BlockReceiver write data to disk cost:159ms (threshold=10ms)
          2014-04-28 21:52:34,137 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Slow BlockReceiver write data to disk cost:209ms (threshold=10ms)
          2014-04-28 21:52:40,486 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Slow BlockReceiver write data to disk cost:605ms (threshold=10ms)
          2014-04-28 21:53:38,690 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Slow BlockReceiver write data to disk cost:77ms (threshold=10ms)
          2014-04-28 21:53:43,956 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Slow BlockReceiver write data to disk cost:226ms (threshold=10ms)
          2014-04-28 21:53:59,021 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Slow BlockReceiver write data to disk cost:133ms (threshold=10ms)
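The datanode-side samples above follow a simple pattern: time an operation, then warn when the duration exceeds the configured threshold. Below is a rough sketch of that pattern, not the patch's actual code. The class `SlowIoLogger` is hypothetical, the strictly-greater-than comparison is an assumption, and warnings are collected in a list instead of going through a real logger.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of threshold-based slow i/o warnings; not taken from the patch.
final class SlowIoLogger {
    private final long thresholdMs;
    private final List<String> warnings = new ArrayList<>();

    SlowIoLogger(long thresholdMs) {
        this.thresholdMs = thresholdMs;
    }

    // Records a warning when the measured duration exceeds the threshold.
    // Assumption: the comparison is strictly greater-than.
    boolean check(String action, long durationMs) {
        if (durationMs > thresholdMs) {
            warnings.add("Slow " + action + " cost:" + durationMs
                + "ms (threshold=" + thresholdMs + "ms)");
            return true;
        }
        return false;
    }

    // Times the given action with a monotonic clock, then applies the same check.
    boolean timeAndCheck(String action, Runnable io) {
        long begin = System.nanoTime();
        io.run();
        long durationMs = (System.nanoTime() - begin) / 1_000_000L;
        return check(action, durationMs);
    }

    List<String> warnings() {
        return warnings;
    }

    public static void main(String[] args) {
        SlowIoLogger logger = new SlowIoLogger(10);
        logger.check("BlockReceiver write data to disk", 101); // over threshold, recorded
        logger.check("BlockReceiver write data to disk", 7);   // under threshold, ignored
        logger.warnings().forEach(System.out::println);
    }
}
```

Using `System.nanoTime()` rather than wall-clock time keeps the measurement unaffected by clock adjustments, which matters when the thresholds are only a few milliseconds.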
          

          Client-side

          2014-04-29 16:44:07,572 WARN  [ResponseProcessor for block BP-410607956-10.20.84.26-1391491814882:blk_1074012141_1099511899651] hdfs.DFSClient: Slow ReadProcessor read fields took 3ms (threshold=1ms); ack: seqno: 1114 status: SUCCESS status: SUCCESS status: SUCCESS downstreamAckTimeNanos: 896162, targets: [10.20.84.27:50011, 10.20.84.31:50011, 10.20.84.28:50011]
          2014-04-29 16:44:07,572 WARN  [sync.3] hdfs.DFSClient: Slow waitForAckedSeqno took 2ms (threshold=1ms)
          2014-04-29 16:44:07,575 WARN  [ResponseProcessor for block BP-410607956-10.20.84.26-1391491814882:blk_1074012141_1099511899651] hdfs.DFSClient: Slow ReadProcessor read fields took 2ms (threshold=1ms); ack: seqno: 1115 status: SUCCESS status: SUCCESS status: SUCCESS downstreamAckTimeNanos: 955411, targets: [10.20.84.27:50011, 10.20.84.31:50011, 10.20.84.28:50011]
          2014-04-29 16:44:07,575 WARN  [sync.4] hdfs.DFSClient: Slow waitForAckedSeqno took 2ms (threshold=1ms)
          2014-04-29 16:44:07,578 WARN  [ResponseProcessor for block BP-410607956-10.20.84.26-1391491814882:blk_1074012141_1099511899651] hdfs.DFSClient: Slow ReadProcessor read fields took 3ms (threshold=1ms); ack: seqno: 1116 status: SUCCESS status: SUCCESS status: SUCCESS downstreamAckTimeNanos: 854269, targets: [10.20.84.27:50011, 10.20.84.31:50011, 10.20.84.28:50011]
          2014-04-29 16:44:07,579 WARN  [ResponseProcessor for block BP-410607956-10.20.84.26-1391491814882:blk_1074012141_1099511899651] hdfs.DFSClient: Slow ReadProcessor re
          

          Will commit in next day or so unless objection.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12642004/HDFS-6110v6.txt
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. There were no new javadoc warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/6739//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/6739//console

          This message is automatically generated.

          stack added a comment -

          Liang Xie's latest patch, with offline review feedback I got from our Todd added in (see below): i.e. having one threshold for dfsclient (a higher one so folks MR'ing don't get annoyed by all the WARNings about slow i/o), and then another for the datanode side, which is much lower so we can see bad i/os.

          16:38 < todd> stack: just looked at 6110. had one more thought after commenting on the JIRA
          16:38 < todd> you think we should add a separate config for client vs server?
          16:38 < todd> I'm afraid that the 300ms default may be a little aggressive for the client - people using hadoop fs -put to upload files may get kind of nervous the next time they upgrade if they start
                        seeing warnings
          16:38 < todd> MR jobs too
          16:39 < todd> may be better to have the client default be 10sec or something really long, and then HBase could tune it down for WAL files
          16:39 < stack> todd: thanks boss
          16:39 < todd> you think i'm crazy?
          16:39 < stack> no
          16:39 < stack> Testing it, it is "illuminating" to see how long stuff takes
          16:39 < todd> k. yea
          16:39 < todd> I had a patch like that once on the server side
          16:39 < stack> Was worried though that it'd freak folks out.
          16:40 < stack> Or, rather, they'd ignore what is being said and just consider it 'noise'.
          16:40 < todd> yea
          16:40 < todd> for a throughput app it is kind of noise
          16:40 < todd> but hbase could definitely tune the default inside the RS down
          16:40 < stack> Let me do as you suggest.
          16:40 < todd> k
          16:40 < stack> Thanks for review.
          16:40 < todd> feel free to paste this convo into the jira so it makes sense :)
          16:40 < todd> didn't want to post yet another comment and pollute everyone's mailboxes
          16:41  * stack nod
          
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12641883/HDFS-6110v5.txt
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. There were no new javadoc warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/6730//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/6730//console

          This message is automatically generated.

          Liang Xie added a comment -

          Hi Colin Patrick McCabe, attached v5 should address your comments, thanks

          Colin Patrick McCabe added a comment -
          +  public static final int     DFS_SLOW_IO_WARNING_THRESHOLD_DEFAULT = 300;
          

          It's odd that this is an int, given that we retrieve the threshold as a long later on. This seems likely to lead to confusion-- can we just make this a long everywhere?

          +1 after that's addressed

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12638010/HDFS-6110v4.txt
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. There were no new javadoc warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/6565//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/6565//console

          This message is automatically generated.

          Liang Xie added a comment -

          Attached v4 should address the last comment from Todd

          Todd Lipcon added a comment -

          Actually one thing I just missed – we should add the new config to hdfs-default.xml so that it's documented.

          Todd Lipcon added a comment -

          Patch looks good to me. +1. Please fill in the release note field after committing - I imagine a lot of folks are going to be curious about these new warnings and will want to know how to turn it off (eg by setting the warning level to 60sec or something)

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12637343/HDFS-6110v3.txt
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. There were no new javadoc warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/6542//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/6542//console

          This message is automatically generated.

          Liang Xie added a comment -

          Cool, with this we can dig deeper when lots of threads are stuck in waitForAckedSeqno() or a similar function. With its help and other tools, I found that the write spike issue on one of our HBase clusters was caused by CentOS 6.3's stable page write feature (indeed, as I understand it, online HBase clusters that care about latency should never use RHEL/CentOS 6.3 any more).

          stack added a comment -

          I tried it out. Looks good. Minor formatting of log changes (They all have a 'Slow' prefix...). Here is an example:

          2014-03-27 22:46:19,975 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Slow BlockReceiver write packet to mirror took 986ms (threshold=300ms)
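Since the warnings share a 'Slow' prefix and a `(threshold=...ms)` suffix, they are easy to pull out of a log when troubleshooting. The following is a small sketch of a parser for that shape, assuming the message format matches the samples in this issue (both the `took 986ms` and `cost:101ms` variants). The class `SlowIoLogParser` and its nested `Entry` type are hypothetical.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical parser for the "Slow ... (threshold=...ms)" warnings quoted in this issue.
final class SlowIoLogParser {
    // Accepts both "took 986ms" and "cost:101ms", as seen in the sample logs.
    private static final Pattern SLOW = Pattern.compile(
        "Slow (.+?) (?:took |cost:)(\\d+)ms \\(threshold=(\\d+)ms\\)");

    // One parsed warning: the action name, its duration, and the threshold in force.
    static final class Entry {
        final String action;
        final long durationMs;
        final long thresholdMs;

        Entry(String action, long durationMs, long thresholdMs) {
            this.action = action;
            this.durationMs = durationMs;
            this.thresholdMs = thresholdMs;
        }
    }

    private SlowIoLogParser() {}

    // Returns the parsed entry, or empty when the line is not a slow i/o warning.
    static Optional<Entry> parse(String line) {
        Matcher m = SLOW.matcher(line);
        if (!m.find()) {
            return Optional.empty();
        }
        return Optional.of(new Entry(
            m.group(1), Long.parseLong(m.group(2)), Long.parseLong(m.group(3))));
    }

    public static void main(String[] args) {
        String line = "2014-03-27 22:46:19,975 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: "
            + "Slow BlockReceiver write packet to mirror took 986ms (threshold=300ms)";
        parse(line).ifPresent(e ->
            System.out.println(e.action + " -> " + e.durationMs + "ms over " + e.thresholdMs + "ms"));
    }
}
```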
          

          Was going to commit with the conservative 300ms threshold unless objection.

          Liang Xie added a comment -

          This does not seem to be HBase-only: I just saw HDFS-6139, and MR applications should benefit too if this patch goes in.

          Liang Xie added a comment -

          Yes, we used 100ms internally

          stack added a comment -

          Patch LGTM, Liang Xie. Let me try it here. 300ms is eons. Probably good as a default.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12635463/HDFS-6110-v2.txt
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. There were no new javadoc warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/6433//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/6433//console

          This message is automatically generated.

          Liang Xie added a comment -

          Stack, Todd Lipcon: could you, or other committers who may be interested in this, help review it? It may be just a trivial change for the HDFS codebase, but it is very useful for HBase end users troubleshooting write outliers.

          Liang Xie added a comment -

          Making the threshold configurable in patch v2.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12635246/HDFS-6110.txt
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. There were no new javadoc warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/6418//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/6418//console

          This message is automatically generated.

          Liang Xie added a comment -

          The threshold is set to 300ms in the patch. We probably need to make it configurable; internally we set it to 100ms, which is more aggressive.
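
          For illustration only, here is a minimal sketch of the kind of threshold check discussed in this comment. It is not the code from the attached patch: the class name SlowIoLogger and its methods are hypothetical, the strict "greater than" comparison is an assumption, and a real datanode would read the threshold from configuration (the release note names dfs.datanode.slow.io.warning.threshold.ms) and use its own logger rather than System.err.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical helper for illustration; not part of the HDFS-6110 patch. */
final class SlowIoLogger {
    private final long thresholdMs;   // warn when an operation takes longer than this
    private long slowCount = 0;       // number of warnings emitted so far

    SlowIoLogger(long thresholdMs) {
        this.thresholdMs = thresholdMs;
    }

    /**
     * Records one operation's duration. Logs a warning and returns true
     * only when the duration is strictly greater than the threshold.
     */
    boolean record(String op, long durationMs) {
        if (durationMs > thresholdMs) {
            slowCount++;
            System.err.println("WARN Slow " + op + " took " + durationMs
                + "ms (threshold=" + thresholdMs + "ms)");
            return true;
        }
        return false;
    }

    /** Times the given action with a monotonic clock, then records it. */
    boolean time(String op, Runnable action) {
        long begin = System.nanoTime();
        action.run();
        long durationMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - begin);
        return record(op, durationMs);
    }

    long slowCount() {
        return slowCount;
    }
}
```

          With a threshold of 300, a call such as record("flush", 301) would warn, while record("flush", 300) would not. Lowering the threshold to 100 gives the more aggressive behaviour mentioned above, at the cost of noisier logs.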

          Liang Xie added a comment -

          Stack, here is the patch, extracted from my code. It's pretty simple, but it has been extremely useful for my investigation of HBase write outliers these days.

          DFSOutputStream was modified as well, so that HBase operators can be alerted more easily by the warning log.

          Liang Xie added a comment -

          It seems HDFS-3751 is a superset of the current JIRA? My original intent was just to add logging inside the BlockReceiver (and DFSOutputStream) classes, which seems enough for HBase write outlier investigation, to me at least. Lots of other write code besides BlockReceiver still has lengthy disk I/O; I did not look into that while digging into the HBase write spike issue.

          Todd Lipcon added a comment -

          Is this a duplicate of HDFS-3751?


            People

            • Assignee:
              Liang Xie
              Reporter:
              Liang Xie
            • Votes:
              0
              Watchers:
              12

              Dates

              • Created:
                Updated:
                Resolved:
