Solr
  1. Solr
  2. SOLR-5721

ConnectionManager can become stuck in likeExpired

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 4.6.1
    • Fix Version/s: 4.7, Trunk
    • Component/s: SolrCloud
    • Labels:
      None

      Description

      Here are the sequence of events:

      • we disconnect
      • The disconnect timer beings to run (so no longer scheduled), but doesn't set likelyExpired yet
      • We connect, and set likelyExpired = false
      • The disconnect thread runs and sets likelyExpired to true, and it is never set back to false (note that we cancel the disconnect thread but that only cancels scheduled tasks but not running tasks).

      This is pretty difficult to reproduce without doing more work in the disconnect thread. It's easy to reproduce by adding sleeps in various places – I have a test that I'll attach that does that.

      The most straightforward way to solve this would be to grab the synchronization lock on ConnectionManager in the disconnect thread, ensure we aren't actually connected, and only then setting likelyExpired to true. In code:

      synchronized (ConnectionManager.this) {
        if (!connected) likelyExpired = true;
      }
      

      but this is all pretty subtle and error prone. It's easier to just get rid of the disconnect thread and record the last time we disconnected. Then, when we check likelyExpired, we just do a quick calculation to see if we are likelyExpired.

      1. SOLR-5721.patch
        7 kB
        Gregory Chanan
      2. SOLR-5721test.patch
        4 kB
        Gregory Chanan

        Activity

        Hide
        Gregory Chanan added a comment -

        Here's a rough demonstration of a failure with strategic pauses inserted. I don't think we'd want such a test, but I couldn't reproduce the bug with a nicer test. What I did in the nicer test is:

        • disconnect
        • try to reconnect around the time the disconnect thread would run (with some random error)

        if I inserted something small in the disconnect thread, like a System.err, the nicer test would reproduce the issue. I can attach that if people are interested.

        Show
        Gregory Chanan added a comment - Here's a rough demonstration of a failure with strategic pauses inserted. I don't think we'd want such a test, but I couldn't reproduce the bug with a nicer test. What I did in the nicer test is: disconnect try to reconnect around the time the disconnect thread would run (with some random error) if I inserted something small in the disconnect thread, like a System.err, the nicer test would reproduce the issue. I can attach that if people are interested.
        Hide
        Gregory Chanan added a comment -

        Here's a patch that gets rid of the disconnect thread and a test around it.

        Show
        Gregory Chanan added a comment - Here's a patch that gets rid of the disconnect thread and a test around it.
        Hide
        Mark Miller added a comment -

        Nice! Good call.

        Show
        Mark Miller added a comment - Nice! Good call.
        Hide
        ASF subversion and git services added a comment -

        Commit 1568337 from Mark Miller in branch 'dev/trunk'
        [ https://svn.apache.org/r1568337 ]

        SOLR-5721: ConnectionManager can become stuck in likeExpired.

        Show
        ASF subversion and git services added a comment - Commit 1568337 from Mark Miller in branch 'dev/trunk' [ https://svn.apache.org/r1568337 ] SOLR-5721 : ConnectionManager can become stuck in likeExpired.
        Hide
        ASF subversion and git services added a comment -

        Commit 1568338 from Mark Miller in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1568338 ]

        SOLR-5721: ConnectionManager can become stuck in likeExpired.

        Show
        ASF subversion and git services added a comment - Commit 1568338 from Mark Miller in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1568338 ] SOLR-5721 : ConnectionManager can become stuck in likeExpired.
        Hide
        Mark Miller added a comment -

        Thanks Greg!

        Show
        Mark Miller added a comment - Thanks Greg!
        Hide
        Ramkumar Aiyengar added a comment -

        Wouldn't using System.currentTimeMillis for timr deltas lead to errors due to NTP sync or DST? It's not guaranteed to be monotonic. See

        http://stackoverflow.com/questions/2978598/will-sytem-currenttimemillis-always-return-a-value-previous-calls

        System.nanoTime seems to provide a better alternative, at least when the OS supports a monotonic clock.

        Show
        Ramkumar Aiyengar added a comment - Wouldn't using System.currentTimeMillis for timr deltas lead to errors due to NTP sync or DST? It's not guaranteed to be monotonic. See http://stackoverflow.com/questions/2978598/will-sytem-currenttimemillis-always-return-a-value-previous-calls System.nanoTime seems to provide a better alternative, at least when the OS supports a monotonic clock.
        Hide
        Mark Miller added a comment -

        Yeah, sounds like a good improvement.

        Show
        Mark Miller added a comment - Yeah, sounds like a good improvement.
        Hide
        Mark Miller added a comment -

        I filed SOLR-5734 as this kind of spans the system.

        Show
        Mark Miller added a comment - I filed SOLR-5734 as this kind of spans the system.

          People

          • Assignee:
            Mark Miller
            Reporter:
            Gregory Chanan
          • Votes:
            0 Vote for this issue
            Watchers:
            5 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development