Solr
  1. Solr
  2. SOLR-7381

Improve Debuggability of SolrCloud using MDC

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Critical Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.2, 6.0
    • Component/s: SolrCloud
    • Labels:
      None

      Description

      SOLR-6673 added MDC based logging in a few places but we have a lot of ground to cover.

      1. Threads created via thread pool executors do not inherit MDC values and those are some of the most interesting places to log MDC context.
      2. We must expose node names (in tests) so that we can debug faster
      3. We can expose more information via thread names so that a thread dump has enough context to help debug problems in production

      This is critical to help debug SolrCloud failures.

      1. SOLR-7381.patch
        54 kB
        Shalin Shekhar Mangar
      2. SOLR-7381.patch
        44 kB
        Shalin Shekhar Mangar
      3. SOLR-7381-forbid-threadpoolexecutor.patch
        6 kB
        Shalin Shekhar Mangar
      4. SOLR-7381-submitter-stacktrace.patch
        1 kB
        Shalin Shekhar Mangar
      5. SOLR-7381-thread-names.patch
        17 kB
        Shalin Shekhar Mangar
      6. SOLR-7381-thread-names.patch
        16 kB
        Shalin Shekhar Mangar
      7. SOLR-7381-thread-names.patch
        13 kB
        Shalin Shekhar Mangar

        Issue Links

          Activity

          Hide
          Shalin Shekhar Mangar added a comment -
          1. Added a new ExecutorUtils.MDCAwareThreadPoolExecutor class which sets the MDC context of the caller thread and unsets after the runnable completes
          2. Added some helper methods in ExecutorUtils similar to the ones in Executors
          3. Used the above two throughout the code base
          4. Added the node name in the MDC context when running in cloud mode

          Examples:

          2531 T46 N:127.0.0.1:35418_zx_%2Ft oass.IndexSchema.readSchema default search field in schema is text
          
          2843 T50 N:127.0.0.1:35418_zx_%2Ft C:control_collection S:shard1 oascc.SolrZkClient.makePath makePath: /collections/control_collection/leader_elect/shard1/election
          
          7495 T73 N:127.0.0.1:35418_zx_%2Ft oascc.ZkStateReader.updateWatchedCollection Updating data for collection1 to ver 6
          
          23926 T62 N:127.0.0.1:60362_ oasc.SolrException.log ERROR null:org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /aliases.json
          		at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
          		at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
          

          Still a few places to fix.

          Show
          Shalin Shekhar Mangar added a comment - Added a new ExecutorUtils.MDCAwareThreadPoolExecutor class which sets the MDC context of the caller thread and unsets after the runnable completes Added some helper methods in ExecutorUtils similar to the ones in Executors Used the above two throughout the code base Added the node name in the MDC context when running in cloud mode Examples: 2531 T46 N:127.0.0.1:35418_zx_%2Ft oass.IndexSchema.readSchema default search field in schema is text 2843 T50 N:127.0.0.1:35418_zx_%2Ft C:control_collection S:shard1 oascc.SolrZkClient.makePath makePath: /collections/control_collection/leader_elect/shard1/election 7495 T73 N:127.0.0.1:35418_zx_%2Ft oascc.ZkStateReader.updateWatchedCollection Updating data for collection1 to ver 6 23926 T62 N:127.0.0.1:60362_ oasc.SolrException.log ERROR null :org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /aliases.json at org.apache.zookeeper.KeeperException.create(KeeperException.java:127) at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) Still a few places to fix.
          Hide
          Shalin Shekhar Mangar added a comment - - edited

          Copying over some comments from SOLR-6673:

          Yonik Seeley's comment:
          https://issues.apache.org/jira/browse/SOLR-6673?focusedCommentId=14381847&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14381847

          There is some other redundant info now though: C17 P62757
          That's the core number and the port number. Both where important in the past (along with the thread number) to figuring out chaos monkey test fails. But they don't seem important for production? The port number should be different for diff log files in production, and the collection name we already have.

          The log snippet that I pasted in SOLR-6673 was actually a test log which is why it had C17 and such things. Those are not logged in production logs. There is still merit in logging the core short id in tests because it can be used to differentiate between reloaded cores. Now that I am logging the node name, we don't need the port in the logs so I'll remove it (it is currently logged only in cloud mode).

          Mark Miller's comment at https://issues.apache.org/jira/browse/SOLR-6673?focusedCommentId=14381918&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14381918

          I don't really like using caps for the label distinction. Seems nicer to have like c:collection n:core1 or something. Looks a lot nicer, doesn't overload the c char. I don't really know that it screams "small c, must be core" anyway. We just need a consistent clean format that is easy script parsable. Or c:collection s:core1 for the more proper SolrCore?

          If we use "s:core1" then what should we use for the shard name?

          Show
          Shalin Shekhar Mangar added a comment - - edited Copying over some comments from SOLR-6673 : Yonik Seeley 's comment: https://issues.apache.org/jira/browse/SOLR-6673?focusedCommentId=14381847&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14381847 There is some other redundant info now though: C17 P62757 That's the core number and the port number. Both where important in the past (along with the thread number) to figuring out chaos monkey test fails. But they don't seem important for production? The port number should be different for diff log files in production, and the collection name we already have. The log snippet that I pasted in SOLR-6673 was actually a test log which is why it had C17 and such things. Those are not logged in production logs. There is still merit in logging the core short id in tests because it can be used to differentiate between reloaded cores. Now that I am logging the node name, we don't need the port in the logs so I'll remove it (it is currently logged only in cloud mode). Mark Miller 's comment at https://issues.apache.org/jira/browse/SOLR-6673?focusedCommentId=14381918&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14381918 I don't really like using caps for the label distinction. Seems nicer to have like c:collection n:core1 or something. Looks a lot nicer, doesn't overload the c char. I don't really know that it screams "small c, must be core" anyway. We just need a consistent clean format that is easy script parsable. Or c:collection s:core1 for the more proper SolrCore? If we use "s:core1" then what should we use for the shard name?
          Hide
          Shalin Shekhar Mangar added a comment -
          1. JettySolrRunner takes care to not pollute the caller thread with the context set by Solr's init and destroy
          2. I added the following to forbidden-apis for Solr only so that we don't accidentally create non-MDC aware thread pools:
            java.util.concurrent.Executors#newFixedThreadPool(int,java.util.concurrent.ThreadFactory)
            java.util.concurrent.Executors#newSingleThreadExecutor(java.util.concurrent.ThreadFactory)
            java.util.concurrent.Executors#newCachedThreadPool(java.util.concurrent.ThreadFactory)
            

          I'll commit this shortly.

          Show
          Shalin Shekhar Mangar added a comment - JettySolrRunner takes care to not pollute the caller thread with the context set by Solr's init and destroy I added the following to forbidden-apis for Solr only so that we don't accidentally create non-MDC aware thread pools: java.util.concurrent.Executors#newFixedThreadPool( int ,java.util.concurrent.ThreadFactory) java.util.concurrent.Executors#newSingleThreadExecutor(java.util.concurrent.ThreadFactory) java.util.concurrent.Executors#newCachedThreadPool(java.util.concurrent.ThreadFactory) I'll commit this shortly.
          Hide
          ASF subversion and git services added a comment -

          Commit 1673116 from shalin@apache.org in branch 'dev/trunk'
          [ https://svn.apache.org/r1673116 ]

          SOLR-7381: Improve logging by adding node name in MDC in SolrCloud mode and adding MDC to all thread pools

          Show
          ASF subversion and git services added a comment - Commit 1673116 from shalin@apache.org in branch 'dev/trunk' [ https://svn.apache.org/r1673116 ] SOLR-7381 : Improve logging by adding node name in MDC in SolrCloud mode and adding MDC to all thread pools
          Hide
          ASF subversion and git services added a comment -

          Commit 1673121 from shalin@apache.org in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1673121 ]

          SOLR-7381: Improve logging by adding node name in MDC in SolrCloud mode and adding MDC to all thread pools

          Show
          ASF subversion and git services added a comment - Commit 1673121 from shalin@apache.org in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1673121 ] SOLR-7381 : Improve logging by adding node name in MDC in SolrCloud mode and adding MDC to all thread pools
          Hide
          Shalin Shekhar Mangar added a comment -

          This patch takes us a step further in improving debuggability by exposing MDC values in thread names so that a thread dump can give us a better idea of what was happening at the time.

          For example, here is a stack trace showing a CloudSolrClient update thread which has the URL of the remote host in its name:

          "CloudSolrClient ThreadPool-6-thread-1-processing-{CloudSolrClient.url=http:/127.0.0.1:53410/ollection1/" #185 prio=5 os_prio=0 tid=0x00007f7778097000 nid=0x218a runnable [0x00007f77415d7000]
             java.lang.Thread.State: RUNNABLE
          	at java.net.SocketInputStream.socketRead0(Native Method)
          	at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
          

          Here's another stack trace showing the update executor thread running on core=collection1, replica=core_node2, node_name=127.0.0.1:53410_ and making a call to http://127.0.0.1:57515/collection1:

          "updateExecutor-11-thread-1-processing-{core=collection1, replica=core_node2, node_name=127.0.0.1:53410_, ConcurrentUpdateSolrClient.baseUrl=http:/127.0.0.1:57515/ollection1, shard=shard3, collection=collection1}" #177 prio=5 os_prio=0 tid=0x00007f775400a000 nid=0x2182 runnable [0x00007f7741ddf000]
             java.lang.Thread.State: RUNNABLE
          	at java.net.SocketOutputStream.socketWrite0(Native Method)
          	at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109)
          	at java.net.SocketOutputStream.write(SocketOutputStream.java:153)
          

          Interestingly, the java thread names seem to use the forward slash character as a special one and ignore the character following it so e.g. a url added to the name has "http:/127.0.0.1:57515/ollection1" (notice 'ollection1'!)

          All you need to do to take advantage of this feature is to set (any) MDC values before you submit a task to the thread pool and everything else is taken care for you.

          I should probably add some upper limit to the thread names.

          Show
          Shalin Shekhar Mangar added a comment - This patch takes us a step further in improving debuggability by exposing MDC values in thread names so that a thread dump can give us a better idea of what was happening at the time. For example, here is a stack trace showing a CloudSolrClient update thread which has the URL of the remote host in its name: "CloudSolrClient ThreadPool-6-thread-1-processing-{CloudSolrClient.url=http:/127.0.0.1:53410/ollection1/" #185 prio=5 os_prio=0 tid=0x00007f7778097000 nid=0x218a runnable [0x00007f77415d7000] java.lang. Thread .State: RUNNABLE at java.net.SocketInputStream.socketRead0(Native Method) at java.net.SocketInputStream.socketRead(SocketInputStream.java:116) Here's another stack trace showing the update executor thread running on core=collection1, replica=core_node2, node_name=127.0.0.1:53410_ and making a call to http://127.0.0.1:57515/collection1: "updateExecutor-11-thread-1-processing-{core=collection1, replica=core_node2, node_name=127.0.0.1:53410_, ConcurrentUpdateSolrClient.baseUrl=http:/127.0.0.1:57515/ollection1, shard=shard3, collection=collection1}" #177 prio=5 os_prio=0 tid=0x00007f775400a000 nid=0x2182 runnable [0x00007f7741ddf000] java.lang. Thread .State: RUNNABLE at java.net.SocketOutputStream.socketWrite0(Native Method) at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109) at java.net.SocketOutputStream.write(SocketOutputStream.java:153) Interestingly, the java thread names seem to use the forward slash character as a special one and ignore the character following it so e.g. a url added to the name has "http:/127.0.0.1:57515/ollection1" (notice 'ollection1'!) All you need to do to take advantage of this feature is to set (any) MDC values before you submit a task to the thread pool and everything else is taken care for you. I should probably add some upper limit to the thread names.
          Hide
          Shalin Shekhar Mangar added a comment -

          Here's a patch which forbids the ThreadPoolExecutor from being used in Solr. I had changed the bulk of its usages earlier but I had to change BatchWriter and GoLive in the hadoop contrib to get the checks passing.

          I'm going to commit this patch shortly.

          Show
          Shalin Shekhar Mangar added a comment - Here's a patch which forbids the ThreadPoolExecutor from being used in Solr. I had changed the bulk of its usages earlier but I had to change BatchWriter and GoLive in the hadoop contrib to get the checks passing. I'm going to commit this patch shortly.
          Hide
          ASF subversion and git services added a comment -

          Commit 1673372 from shalin@apache.org in branch 'dev/trunk'
          [ https://svn.apache.org/r1673372 ]

          SOLR-7381: Usages of Executors#newFixedThreadPool, #newSingleThreadExecutor, #newCachedThreadPool as well as ThreadPoolExecutor directly is now forbidden in Solr

          Show
          ASF subversion and git services added a comment - Commit 1673372 from shalin@apache.org in branch 'dev/trunk' [ https://svn.apache.org/r1673372 ] SOLR-7381 : Usages of Executors#newFixedThreadPool, #newSingleThreadExecutor, #newCachedThreadPool as well as ThreadPoolExecutor directly is now forbidden in Solr
          Hide
          ASF subversion and git services added a comment -

          Commit 1673373 from shalin@apache.org in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1673373 ]

          SOLR-7381: Usages of Executors#newFixedThreadPool, #newSingleThreadExecutor, #newCachedThreadPool as well as ThreadPoolExecutor directly is now forbidden in Solr

          Show
          ASF subversion and git services added a comment - Commit 1673373 from shalin@apache.org in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1673373 ] SOLR-7381 : Usages of Executors#newFixedThreadPool, #newSingleThreadExecutor, #newCachedThreadPool as well as ThreadPoolExecutor directly is now forbidden in Solr
          Hide
          ASF subversion and git services added a comment -

          Commit 1673411 from shalin@apache.org in branch 'dev/trunk'
          [ https://svn.apache.org/r1673411 ]

          SOLR-7381: Set node name in MDC inside ZkController as soon as we discover it

          Show
          ASF subversion and git services added a comment - Commit 1673411 from shalin@apache.org in branch 'dev/trunk' [ https://svn.apache.org/r1673411 ] SOLR-7381 : Set node name in MDC inside ZkController as soon as we discover it
          Hide
          ASF subversion and git services added a comment -

          Commit 1673412 from shalin@apache.org in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1673412 ]

          SOLR-7381: Set node name in MDC inside ZkController as soon as we discover it

          Show
          ASF subversion and git services added a comment - Commit 1673412 from shalin@apache.org in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1673412 ] SOLR-7381 : Set node name in MDC inside ZkController as soon as we discover it
          Hide
          Shalin Shekhar Mangar added a comment -

          A better patch for adding thread names in stack traces.

          1. Limited the thread name to 512 characters
          2. Escape forward slash characters in thread names so that they show up correctly in thread dumps.
          3. Added urls to MDC in LbHttpSolrServer and HttpShardHandler
          Show
          Shalin Shekhar Mangar added a comment - A better patch for adding thread names in stack traces. Limited the thread name to 512 characters Escape forward slash characters in thread names so that they show up correctly in thread dumps. Added urls to MDC in LbHttpSolrServer and HttpShardHandler
          Hide
          Shalin Shekhar Mangar added a comment -
          1. Renamed some MDC keys to be consistent with each other
          2. Added a null check in CoreAdminHandler for action (which can be null if a custom action is invoked)

          All tests pass. I'll commit this shortly.

          Show
          Shalin Shekhar Mangar added a comment - Renamed some MDC keys to be consistent with each other Added a null check in CoreAdminHandler for action (which can be null if a custom action is invoked) All tests pass. I'll commit this shortly.
          Hide
          ASF subversion and git services added a comment -

          Commit 1673469 from shalin@apache.org in branch 'dev/trunk'
          [ https://svn.apache.org/r1673469 ]

          SOLR-7381: MDC keys are now exposed in thread names automatically so that a thread dump can give hints on what the thread was doing

          Show
          ASF subversion and git services added a comment - Commit 1673469 from shalin@apache.org in branch 'dev/trunk' [ https://svn.apache.org/r1673469 ] SOLR-7381 : MDC keys are now exposed in thread names automatically so that a thread dump can give hints on what the thread was doing
          Hide
          ASF subversion and git services added a comment -

          Commit 1673471 from shalin@apache.org in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1673471 ]

          SOLR-7381: MDC keys are now exposed in thread names automatically so that a thread dump can give hints on what the thread was doing

          Show
          ASF subversion and git services added a comment - Commit 1673471 from shalin@apache.org in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1673471 ] SOLR-7381 : MDC keys are now exposed in thread names automatically so that a thread dump can give hints on what the thread was doing
          Hide
          Shalin Shekhar Mangar added a comment -

          This patch logs the submitter's stack trace in case there is an uncaught exception from a task submitted to the thread pool.

          It'd be even better if we can find a way to log this for all exceptions inside a task thread but I can't find a clean/efficient way to do it.

          Show
          Shalin Shekhar Mangar added a comment - This patch logs the submitter's stack trace in case there is an uncaught exception from a task submitted to the thread pool. It'd be even better if we can find a way to log this for all exceptions inside a task thread but I can't find a clean/efficient way to do it.
          Hide
          ASF subversion and git services added a comment -

          Commit 1674593 from shalin@apache.org in branch 'dev/trunk'
          [ https://svn.apache.org/r1674593 ]

          SOLR-7381: Uncaught exceptions thrown by tasks in the pool are logged along with submitter's stack trace

          Show
          ASF subversion and git services added a comment - Commit 1674593 from shalin@apache.org in branch 'dev/trunk' [ https://svn.apache.org/r1674593 ] SOLR-7381 : Uncaught exceptions thrown by tasks in the pool are logged along with submitter's stack trace
          Hide
          ASF subversion and git services added a comment -

          Commit 1674594 from shalin@apache.org in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1674594 ]

          SOLR-7381: Uncaught exceptions thrown by tasks in the pool are logged along with submitter's stack trace

          Show
          ASF subversion and git services added a comment - Commit 1674594 from shalin@apache.org in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1674594 ] SOLR-7381 : Uncaught exceptions thrown by tasks in the pool are logged along with submitter's stack trace
          Hide
          Anshum Gupta added a comment -

          Bulk close for 5.2.0.

          Show
          Anshum Gupta added a comment - Bulk close for 5.2.0.

            People

            • Assignee:
              Shalin Shekhar Mangar
              Reporter:
              Shalin Shekhar Mangar
            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development