ZooKeeper
  1. ZooKeeper
  2. ZOOKEEPER-1239

add logging/stats to identify fsync stalls

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.3.4, 3.4.0, 3.5.0
    • Component/s: server
    • Labels:
      None
    • Release Note:
      committed to 3.3.4, 3.4, trunk rev 1202360

      Description

      We don't have any logging to identify fsync stalls. It's a somewhat common occurrence (after gc/swap issues) when trying to diagnose pipeline stalls - where outstanding requests start piling up and operational latency increases.

      We should have some sort of logging around this. e.g. if the fsync time exceeds some limit then log a warning, something like that.

      It would also be useful to publish "stat" information related to this. min/avg/max latency for fsync.

      This should also be exposed through JMX.

      1. ZOOKEEPER-1239_br33.patch
        3 kB
        Patrick Hunt
      2. ZOOKEEPER-1239_br34.patch
        3 kB
        Patrick Hunt

        Activity

        Hide
        Patrick Hunt added a comment -

        Todd pointed me at this commit in hbase which does similar (sync time tracking)

        HBase does a log message when any sync() call takes longer than a second or so.

        commit 96621d010f9a924e2af338a10b584b7f07caf17d
        Author: Michael Stack <stack@apache.org>
        Date: Mon May 11 17:32:24 2009 +0000

        HBASE-1404 minor edit of regionserver logging messages

        Show
        Patrick Hunt added a comment - Todd pointed me at this commit in hbase which does similar (sync time tracking) — HBase does a log message when any sync() call takes longer than a second or so. commit 96621d010f9a924e2af338a10b584b7f07caf17d Author: Michael Stack <stack@apache.org> Date: Mon May 11 17:32:24 2009 +0000 HBASE-1404 minor edit of regionserver logging messages
        Hide
        Patrick Hunt added a comment -

        This checks the fsync time, if the configured value (defaults to 1second) is exceeded a warning is printed to the log.

        I copied this concept from HBase, except in this case it's override-able if necessary. (note docs change).

        Nearly all existing tests verify this code path, I ran with a much lower configured value and verified this message is output to the log correctly when threshold exceeded.

        There is a patch for 33 and one for 34/trunk.

        I've marked this for 3.4.0 because I think it's both low risk and very useful to have (keep seeing this issue pop up in user reports, which will allow Filesystem diagnosis.

        Show
        Patrick Hunt added a comment - This checks the fsync time, if the configured value (defaults to 1second) is exceeded a warning is printed to the log. I copied this concept from HBase, except in this case it's override-able if necessary. (note docs change). Nearly all existing tests verify this code path, I ran with a much lower configured value and verified this message is output to the log correctly when threshold exceeded. There is a patch for 33 and one for 34/trunk. I've marked this for 3.4.0 because I think it's both low risk and very useful to have (keep seeing this issue pop up in user reports, which will allow Filesystem diagnosis.
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12503499/ZOOKEEPER-1239_br34.patch
        against trunk revision 1201045.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/789//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/789//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/789//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12503499/ZOOKEEPER-1239_br34.patch against trunk revision 1201045. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/789//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/789//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/789//console This message is automatically generated.
        Hide
        Camille Fournier added a comment -

        Are you sure we should be doing this timing using system.nanotime?

        Show
        Camille Fournier added a comment - Are you sure we should be doing this timing using system.nanotime?
        Hide
        Camille Fournier added a comment -

        Eh, I guess the popular consensus has changed on using nanotime for this sort of thing, so disregard my question. I'll put this in shortly.

        Show
        Camille Fournier added a comment - Eh, I guess the popular consensus has changed on using nanotime for this sort of thing, so disregard my question. I'll put this in shortly.
        Hide
        Mahadev konar added a comment -

        Camille,
        Can you please commit this to 3.4 branch as well?

        thanks!

        Show
        Mahadev konar added a comment - Camille, Can you please commit this to 3.4 branch as well? thanks!
        Hide
        Hudson added a comment -

        Integrated in ZooKeeper-trunk #1366 (See https://builds.apache.org/job/ZooKeeper-trunk/1366/)
        ZOOKEEPER-1239. add logging/stats to identify fsync stalls. (phunt via camille)

        camille : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1202360
        Files :

        • /zookeeper/trunk/CHANGES.txt
        • /zookeeper/trunk/src/docs/src/documentation/content/xdocs/zookeeperAdmin.xml
        • /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/persistence/FileTxnLog.java
        Show
        Hudson added a comment - Integrated in ZooKeeper-trunk #1366 (See https://builds.apache.org/job/ZooKeeper-trunk/1366/ ) ZOOKEEPER-1239 . add logging/stats to identify fsync stalls. (phunt via camille) camille : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1202360 Files : /zookeeper/trunk/CHANGES.txt /zookeeper/trunk/src/docs/src/documentation/content/xdocs/zookeeperAdmin.xml /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/persistence/FileTxnLog.java

          People

          • Assignee:
            Patrick Hunt
            Reporter:
            Patrick Hunt
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development