ZOOKEEPER-710

permanent ZSESSIONMOVED error after client app reconnects to zookeeper cluster

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 3.2.2
    • Fix Version/s: 3.2.3, 3.3.0
    • Component/s: server
    • Labels: None
    • Environment: debian lenny; ia64; xen virtualization
    • Hadoop Flags: Reviewed

      Description

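      The problem was originally described on the Users mailing list, starting with this post: http://mail-archives.apache.org/mod_mbox/hadoop-zookeeper-user/201003.mbox/<3b910d891003160743k38e2e7c9y830b182d88396d55@mail.gmail.com>
      Below I restate it in a more organized form.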

      We occasionally (a few times a day) observe that our client application disconnects from the ZooKeeper cluster.
      The application is written in C++ and uses the libzookeeper_mt library, version 3.2.2.

      The disconnects are probably related to problems with our network infrastructure: we see periods of heavy packet loss between machines in our DC.

      Sometimes, after the client application (i.e. the zookeeper library) reconnects to the ZooKeeper cluster, all subsequent requests return a ZSESSIONMOVED error. Restarting the client app helps: we always pass 0 as clientid to the zookeeper_init function, so the old session is not reused.
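
      The restart workaround can also be approximated inside the process by discarding the stuck handle and opening a brand-new session. The sketch below only illustrates that control flow in Python; `SessionMovedError`, `FreshSessionClient` and the `connect` factory are hypothetical names, not part of any ZooKeeper client library. It assumes `connect()` returns a client bound to a new session (the analogue of passing 0 as clientid) and that the client has a `close()` method. State tied to the old session, such as ephemeral nodes and watches, does not carry over.

```python
class SessionMovedError(Exception):
    """Hypothetical stand-in for a request that failed with ZSESSIONMOVED."""


class FreshSessionClient:
    """Wraps a client and replaces it with a new session when the old one is stuck.

    `connect` is a caller-supplied factory (an assumption of this sketch) that
    must return a client bound to a brand-new session on every call.
    """

    def __init__(self, connect):
        self._connect = connect
        self._client = connect()

    def call(self, operation):
        """Run operation(client); on SessionMovedError retry once on a new session."""
        try:
            return operation(self._client)
        except SessionMovedError:
            # Drop the stuck session entirely instead of reusing its ID.
            self._client.close()
            self._client = self._connect()
            # A second failure propagates to the caller.
            return operation(self._client)
```

      A caller would route every request through `call`, for example `wrapper.call(lambda c: c.get("/some/path"))`, where `get` is whatever read method the wrapped client offers.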

      On 16-03-2010 we observed a few occurrences of the problem. Examples:

      • 22:08; client IP 10.1.112.60 (app1); sessionID 0x22767e1c9630000
      • 14:21; client IP 10.1.112.61 (app2); sessionID 0x324dcc1ba580085
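
      When reading the logs it can help to know which server originally created a session. The sketch below assumes that the server's session-ID generator stores the creating server's id (myid) in the most significant byte of the 64-bit session ID, which is how the ZooKeeper server builds IDs as far as I know. It also assumes that the node numbers used in this report match the myid values. The function name is made up for this example.

```python
def creating_server_id(session_id: int) -> int:
    """Return the most significant byte of a 64-bit session ID.

    Assumption: that byte holds the id (myid) of the server that created
    the session.
    """
    return (session_id >> 56) & 0xFF


# The two session IDs listed above.
for label, sid in [("app1", 0x22767e1c9630000), ("app2", 0x324dcc1ba580085)]:
    print(f"{label}: {sid:#x} -> server id {creating_server_id(sid)}")
# prints:
# app1: 0x22767e1c9630000 -> server id 2
# app2: 0x324dcc1ba580085 -> server id 3
```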

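      I attach logs of the cluster and application nodes (only the parts concerning zookeeper):

      • zookeeper-node1.log.2010-03-16.gz - logs of zookeeper cluster node 1, 10.1.112.62
      • zookeeper-node2.log.2010-03-16.gz - logs of zookeeper cluster node 2, 10.1.112.63
      • zookeeper-node3.log.2010-03-16.gz - logs of zookeeper cluster node 3, 10.1.112.64
      • app1.log.2010-03-16.gz - application logs of app1, 10.1.112.60
      • app2.log.2010-03-16.gz - application logs of app2, 10.1.112.61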

      I also analyzed the case at 22:08:

      • The network glitch that caused the problem occurred at about 22:08.
      • From what I see, node2 had been the leader since 17:48, and that did not
        change for the rest of that day.
      • The client had been connected to node2 since 17:50.
      • At around 22:09 the client tried to connect to every node (1, 2, 3).
        The connections to node1 and node3 were closed
        with the exception "Exception causing close of session 0x22767e1c9630000
        due to java.io.IOException: Read error".
        The connection to node2 stayed alive.
      • All subsequent operations were refused with a ZSESSIONMOVED error.
        The error is visible on both the client and the server side.
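
      The same kind of analysis can be repeated on other server logs by searching for the close message quoted above. The following is a minimal sketch: the regular expression matches only the message text quoted in this report and makes no assumption about the rest of the log line layout. The function name is hypothetical.

```python
import re
from collections import defaultdict

# Matches the message quoted in the analysis; the surrounding log prefix is ignored.
CLOSE_RE = re.compile(
    r"Exception causing close of session (0x[0-9a-fA-F]+)\s+due to (.+)$"
)


def closed_sessions(lines):
    """Map each session ID to the list of reasons its connections were closed."""
    found = defaultdict(list)
    for line in lines:
        match = CLOSE_RE.search(line)
        if match:
            found[match.group(1).lower()].append(match.group(2).strip())
    return dict(found)
```

      For a compressed log, the lines can be read with `gzip.open(path, "rt")` and passed directly to `closed_sessions`; running it once per node shows which servers dropped a given session's connections.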

      Attachments

      1. zookeeper-node3.log.2010-03-16.gz (215 kB, Lukasz Osipiuk)
      2. zookeeper-node2.log.2010-03-16.gz (602 kB, Lukasz Osipiuk)
      3. zookeeper-node1.log.2010-03-16.gz (178 kB, Lukasz Osipiuk)
      4. ZOOKEEPER-710_3.3.patch (8 kB, Patrick Hunt)
      5. ZOOKEEPER-710_3.2.patch (11 kB, Patrick Hunt)
      6. app2.log.2010-03-16.gz (12 kB, Lukasz Osipiuk)
      7. app1.log.2010-03-16.gz (176 kB, Lukasz Osipiuk)

        Activity

        Patrick Hunt made changes -
          Status: Resolved [ 5 ] → Closed [ 6 ]
        Mahadev konar made changes -
          Status: Reopened [ 4 ] → Resolved [ 5 ]
          Resolution: Fixed [ 1 ]
        Patrick Hunt made changes -
          Resolution: Fixed [ 1 ]
          Status: Resolved [ 5 ] → Reopened [ 4 ]
        Benjamin Reed made changes -
          Status: Patch Available [ 10002 ] → Resolved [ 5 ]
          Resolution: Fixed [ 1 ]
        Patrick Hunt made changes -
          Attachment: ZOOKEEPER-710_3.2.patch [ 12439242 ]
        Benjamin Reed made changes -
          Hadoop Flags: [Reviewed]
        Patrick Hunt made changes -
          Status: Open [ 1 ] → Patch Available [ 10002 ]
        Patrick Hunt made changes -
          Attachment: ZOOKEEPER-710_3.3.patch [ 12439239 ]
        Patrick Hunt made changes -
          Affects Version/s: 3.3.0 [ 12313976 ]
        Patrick Hunt made changes -
          Assignee: Patrick Hunt [ phunt ]
          Fix Version/s: 3.2.3 [ 12314847 ]
          Fix Version/s: 3.3.0 [ 12313976 ]
          Affects Version/s: 3.3.0 [ 12313976 ]
          Priority: Major [ 3 ] → Blocker [ 1 ]
          Component/s: server [ 12312382 ]
        Lukasz Osipiuk made changes -
          Affects Version/s: 3.2.2 [ 12314335 ]
        Lukasz Osipiuk made changes -
          Attachment: app2.log.2010-03-16.gz [ 12439151 ]
        Lukasz Osipiuk made changes -
          Attachment: app1.log.2010-03-16.gz [ 12439150 ]
        Lukasz Osipiuk made changes -
          Attachment: zookeeper-node2.log.2010-03-16.gz [ 12439148 ]
          Attachment: zookeeper-node3.log.2010-03-16.gz [ 12439149 ]
        Lukasz Osipiuk made changes -
          Attachment: zookeeper-node1.log.2010-03-16.gz [ 12439147 ]
        Lukasz Osipiuk created issue -

          People

          • Assignee: Patrick Hunt
          • Reporter: Lukasz Osipiuk
          • Votes: 0
          • Watchers: 1
