ZooKeeper
  1. ZooKeeper
  2. ZOOKEEPER-907

Spurious "KeeperErrorCode = Session moved" messages

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Blocker Blocker
    • Resolution: Fixed
    • Affects Version/s: 3.3.1
    • Fix Version/s: 3.3.2, 3.4.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      The sync request does not set the session owner in Request.

      As a result, the leader keeps printing:
      2010-07-01 10:55:36,733 - INFO [ProcessThread:-1:PrepRequestProcessor@405] - Got user-level KeeperException when processing sessionid:0x298d3b1fa90000 type:sync: cxid:0x6 zxid:0xfffffffffffffffe txntype:unknown reqpath:/ Error Path:null Error:KeeperErrorCode = Session moved

      1. ZOOKEEPER-907.patch
        0.6 kB
        Vishal Kher
      2. ZOOKEEPER-907.patch_v2
        1 kB
        Vishal Kher

        Issue Links

          Activity

          Hide
          Patrick Hunt added a comment -

          sounds like a blocker to me. can this be easily addressed? I'd still like to get 3.3.2 out asap.

          Show
          Patrick Hunt added a comment - sounds like a blocker to me. can this be easily addressed? I'd still like to get 3.3.2 out asap.
          Hide
          Vishal Kher added a comment -

          attaching patch.

          Show
          Vishal Kher added a comment - attaching patch.
          Hide
          Patrick Hunt added a comment -

          Nice. Thanks! Can you include a test? Something like this really should have had a test already... great if you could add one.
          (also hate to mention but the path on the patch is messed up (long prefix), can you address that as well?)

          Show
          Patrick Hunt added a comment - Nice. Thanks! Can you include a test? Something like this really should have had a test already... great if you could add one. (also hate to mention but the path on the patch is messed up (long prefix), can you address that as well?)
          Hide
          Vishal Kher added a comment -

          sure, I will write a test.

          What do you think is the effect of this bug? In PrepRequestProcessor.pRequest(), the leader will not pass sync request to nextProcessor. Does that mean that sync did not succeed?

          Show
          Vishal Kher added a comment - sure, I will write a test. What do you think is the effect of this bug? In PrepRequestProcessor.pRequest(), the leader will not pass sync request to nextProcessor. Does that mean that sync did not succeed?
          Hide
          Benjamin Reed added a comment -

          yes, this will fail the sync. it will not get passed through the pipeline. it will give you a partial sync though

          Show
          Benjamin Reed added a comment - yes, this will fail the sync. it will not get passed through the pipeline. it will give you a partial sync though
          Hide
          Benjamin Reed added a comment -

          wow, that is a really dumb error! vishal, could i ask you to modify your patch to avoid code duplication?
          we should probably have "Request si;" outside of the if, and then just set si inside the if statement, and then do the setOwner and submit after the if block to avoid code duplication. that way, if we make another change in the future, we don't run into this again.

          let me know if you need help with the test.

          Show
          Benjamin Reed added a comment - wow, that is a really dumb error! vishal, could i ask you to modify your patch to avoid code duplication? we should probably have "Request si;" outside of the if, and then just set si inside the if statement, and then do the setOwner and submit after the if block to avoid code duplication. that way, if we make another change in the future, we don't run into this again. let me know if you need help with the test.
          Hide
          Vishal Kher added a comment -

          Ben,

          What do you mean by partial sync? Will the follower still sync with the leader?

          Can you explain how sync works? I was going throught the code flow. I traced the sync call, but I could find the place that performs the data transfer between the leader and follwer.

          I will cleanup the patch. As per the test goes, I am not sure what is the best way to check if a server has received a request with null owner. I was planning to add a test that will initiate all zk client requests. In SessionTrackerImpl.checkesession(), set a flag if it sees a null owner. At the end of the test, fail the test if the flag is set. Any suggestions?

          -Vishal

          Show
          Vishal Kher added a comment - Ben, What do you mean by partial sync? Will the follower still sync with the leader? Can you explain how sync works? I was going throught the code flow. I traced the sync call, but I could find the place that performs the data transfer between the leader and follwer. I will cleanup the patch. As per the test goes, I am not sure what is the best way to check if a server has received a request with null owner. I was planning to add a test that will initiate all zk client requests. In SessionTrackerImpl.checkesession(), set a flag if it sees a null owner. At the end of the test, fail the test if the flag is set. Any suggestions? -Vishal
          Hide
          Benjamin Reed added a comment -

          sync doesn't cause any additional traffic over the atomic broadcast. it just makes sure that the all of the in-process transactions have be sent to the follower. when that error happens, the error will be sent back to the follower ordered after all of the completed transactions. so rather than being able to see the result of all requests initiated before the sync, the follower will see all requests completed before the sync. that is why i referred to it as a partial sync.

          i'm really having problems trying to reproduce this error. can you describe more how it happened? i would like to have an end-to-end test rather than the test of a particular implementation so that this error doesn't pop up if the implementation changes. looking at the code it seems like it should happen everytime the sync request is sent to a follower, but that doesn't seem to be the case.

          Show
          Benjamin Reed added a comment - sync doesn't cause any additional traffic over the atomic broadcast. it just makes sure that the all of the in-process transactions have be sent to the follower. when that error happens, the error will be sent back to the follower ordered after all of the completed transactions. so rather than being able to see the result of all requests initiated before the sync, the follower will see all requests completed before the sync. that is why i referred to it as a partial sync. i'm really having problems trying to reproduce this error. can you describe more how it happened? i would like to have an end-to-end test rather than the test of a particular implementation so that this error doesn't pop up if the implementation changes. looking at the code it seems like it should happen everytime the sync request is sent to a follower, but that doesn't seem to be the case.
          Hide
          Vishal Kher added a comment -

          Let the client connect to the follower. Make sure that sync is your first request.

          To catch this bug you need a test that for each zk client command connects to the follower, issues the request, and disconnects.

          Show
          Vishal Kher added a comment - Let the client connect to the follower. Make sure that sync is your first request. To catch this bug you need a test that for each zk client command connects to the follower, issues the request, and disconnects.
          Hide
          Vishal Kher added a comment -

          Attaching cleaned-up patch.
          Ben, let me know what you think about the test that I suggested.

          Show
          Vishal Kher added a comment - Attaching cleaned-up patch. Ben, let me know what you think about the test that I suggested.
          Hide
          Patrick Hunt added a comment -

          Cancelling the patch - still needs a test.

          Ben could you get back on Vishal's question? (see latest comment)

          Show
          Patrick Hunt added a comment - Cancelling the patch - still needs a test. Ben could you get back on Vishal's question? (see latest comment)
          Hide
          Benjamin Reed added a comment -

          the test case should connect to a follower and do a sync.

          i cannot reproduce this problem. if i connect to the leader or the follower and issue a sync() the return code is always 0.

          Show
          Benjamin Reed added a comment - the test case should connect to a follower and do a sync. i cannot reproduce this problem. if i connect to the leader or the follower and issue a sync() the return code is always 0.
          Hide
          Vishal Kher added a comment -

          Which return code are you referring to? You will see this message in the log file of the reader. It is not passed on to the caller anywhere.

          Show
          Vishal Kher added a comment - Which return code are you referring to? You will see this message in the log file of the reader. It is not passed on to the caller anywhere.
          Hide
          Benjamin Reed added a comment -

          a sorry, i misunderstood the issue. so your client that issues the sync gets a zero return code, but you see that message in the log. right?

          Show
          Benjamin Reed added a comment - a sorry, i misunderstood the issue. so your client that issues the sync gets a zero return code, but you see that message in the log. right?
          Hide
          Benjamin Reed added a comment -

          ah got it. ok i was able to reproduce it: the client connects to the follower, issues a sync, the error message shows up in the log of the leader. so there is an additional bug here – why is the client not getting the session moved error.

          Show
          Benjamin Reed added a comment - ah got it. ok i was able to reproduce it: the client connects to the follower, issues a sync, the error message shows up in the log of the leader. so there is an additional bug here – why is the client not getting the session moved error.
          Hide
          Vishal Kher added a comment -

          It did occur to me. I thought this was by design for sync? sync() is an async call. So the caller never gets any exceptions unless a callback is specified.
          I might be worng here though, I am still reading the code to understand how sync works.

          Show
          Vishal Kher added a comment - It did occur to me. I thought this was by design for sync? sync() is an async call. So the caller never gets any exceptions unless a callback is specified. I might be worng here though, I am still reading the code to understand how sync works.
          Hide
          Benjamin Reed added a comment -

          Ah, I see the problem. There are actually two problems: 1) when sync() get's an error it is not propagated back to the caller. 2) this problem.

          They problem is that 1) is preventing us from writing a test case. We need to fix 1) and then we can write the test for 2).

          Show
          Benjamin Reed added a comment - Ah, I see the problem. There are actually two problems: 1) when sync() get's an error it is not propagated back to the caller. 2) this problem. They problem is that 1) is preventing us from writing a test case. We need to fix 1) and then we can write the test for 2).
          Hide
          Vishal Kher added a comment -

          Hi Ben,

          Do you plan the change the sync() api? Right now sycn() only thorws IllegalArgumentException.

          Show
          Vishal Kher added a comment - Hi Ben, Do you plan the change the sync() api? Right now sycn() only thorws IllegalArgumentException.
          Hide
          Mahadev konar added a comment -

          ben, can is ZOOKEEPER-915 also marked for 3.3.2 then?

          Show
          Mahadev konar added a comment - ben, can is ZOOKEEPER-915 also marked for 3.3.2 then?
          Hide
          Benjamin Reed added a comment -

          may i propose accepting this patch without a test case? (we can see that it fixes the problem.) that way we can get 3.3.2 out. once ZOOKEEPER-915 goes it the tests should cover this issue.

          Show
          Benjamin Reed added a comment - may i propose accepting this patch without a test case? (we can see that it fixes the problem.) that way we can get 3.3.2 out. once ZOOKEEPER-915 goes it the tests should cover this issue.
          Hide
          Mahadev konar added a comment -

          thats ok with me ben. Can you review and +1 the patch it its all good to go?

          Show
          Mahadev konar added a comment - thats ok with me ben. Can you review and +1 the patch it its all good to go?
          Hide
          Benjamin Reed added a comment -

          +1 looks great vishal thanx for the fix!

          Show
          Benjamin Reed added a comment - +1 looks great vishal thanx for the fix!
          Hide
          Benjamin Reed added a comment -

          Committed revision 1031051.
          Committed revision 1031064.

          Show
          Benjamin Reed added a comment - Committed revision 1031051. Committed revision 1031064.
          Hide
          Hudson added a comment -

          Integrated in ZooKeeper-trunk #991 (See https://hudson.apache.org/hudson/job/ZooKeeper-trunk/991/)
          ZOOKEEPER-907. Spurious "KeeperErrorCode = Session moved" messages

          Show
          Hudson added a comment - Integrated in ZooKeeper-trunk #991 (See https://hudson.apache.org/hudson/job/ZooKeeper-trunk/991/ ) ZOOKEEPER-907 . Spurious "KeeperErrorCode = Session moved" messages

            People

            • Assignee:
              Vishal Kher
              Reporter:
              Vishal Kher
            • Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development