  1. Apache Trafodion (Retired)
  2. TRAFODION-2664

Instance goes down when ZooKeeper on the name node is stopped


Details

    • Bug
    • Status: Open
    • Critical
    • Resolution: Unresolved
    • 2.2.0
    • 2.4
    • foundation
    • Important

    Description

      Description: The whole instance goes down after ZooKeeper on the name node is stopped.
      Test Steps:
      Step 1. Start OE and 4 long queries with trafci on the first node, esggy-clu-n010.
      Step 2. Wait several minutes, then stop ZooKeeper on the name node, esggy-clu-n010, from the Cloudera Manager page.
      Step 3. With trafci, run a basic query and the 4 long queries again.

      After Step 3, the whole instance is reported as down after a while. I repeated this scenario several times, and the instance went down every time.

      Timestamps:
      Test Start Time: 20170616132939
      Test End Time: 20170616134350
      ZooKeeper stopped on name node esggy-clu-n010: 20170616133344
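
To make the timeline easier to read, here is a minimal sketch that converts the three timestamps above into elapsed seconds. It assumes the values use a YYYYMMDDHHMMSS layout in a single time zone; the report does not state the format explicitly, and the helper name parse_ts is hypothetical.

```python
from datetime import datetime

# Assumption: the report's timestamps are YYYYMMDDHHMMSS in one time zone.
FMT = "%Y%m%d%H%M%S"


def parse_ts(value: str) -> datetime:
    """Parse one compact timestamp from the report."""
    return datetime.strptime(value, FMT)


# Values copied verbatim from the report.
test_start = parse_ts("20170616132939")
zk_stopped = parse_ts("20170616133344")
test_end = parse_ts("20170616134350")

# Seconds between test start and the ZooKeeper stop, and between the stop and test end.
before_stop = (zk_stopped - test_start).total_seconds()
after_stop = (test_end - zk_stopped).total_seconds()

print(f"ZooKeeper stopped {before_stop:.0f} s after test start")
print(f"Test ended {after_stop:.0f} s after ZooKeeper was stopped")
```

Under that assumption, ZooKeeper was stopped about four minutes into the test, and the test window closed about ten minutes later.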

      Check logs:
      1) Each node logs the following error:
      2017-06-16 13:33:46,276, ERROR, MON, Node Number: 0,, PIN: 5017 , Process Name: $MONITOR,,, TID: 5429, Message ID: 101371801, [CZClient::IsZNodeExpired], zoo_exists() for /trafodion/instance/cluster/esggy-clu-n010.esgyn.cn failed with error ZCONNECTIONLOSS
      2) Listing the cluster znode in ZooKeeper returns an empty list:
      ls /trafodion/instance/cluster
      []
      So it seems the ZooKeeper client registration (zclient) of every node has been lost.
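
When comparing the monitor logs across nodes, it may help to pull out the fields that matter from each error line. The sketch below parses the single line quoted above with a regular expression; the pattern and the helper name parse_monitor_line are hypothetical and were written only against this one sample, so other monitor messages may not match. ZCONNECTIONLOSS is the ZooKeeper C client error code for a lost connection to the server.

```python
import re
from datetime import datetime

# Sample line copied from the report.
LINE = (
    "2017-06-16 13:33:46,276, ERROR, MON, Node Number: 0,, PIN: 5017 , "
    "Process Name: $MONITOR,,, TID: 5429, Message ID: 101371801, "
    "[CZClient::IsZNodeExpired], zoo_exists() for "
    "/trafodion/instance/cluster/esggy-clu-n010.esgyn.cn "
    "failed with error ZCONNECTIONLOSS"
)

# Hypothetical pattern, derived from the one sample line above.
PATTERN = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}), "
    r"(?P<level>\w+), (?P<component>\w+), .*?"
    r"\[(?P<function>[^\]]+)\], "
    r"(?P<call>\w+)\(\) for (?P<znode>\S+) failed with error (?P<error>\w+)$"
)


def parse_monitor_line(line: str):
    """Return the parsed fields as a dict, or None if the line does not match."""
    match = PATTERN.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    # Convert the log timestamp (millisecond precision) to a datetime.
    fields["ts"] = datetime.strptime(fields["ts"], "%Y-%m-%d %H:%M:%S,%f")
    return fields


fields = parse_monitor_line(LINE)
print(fields["ts"], fields["function"], fields["znode"], fields["error"])
```

Running the same parser over each node's log would show whether every node reports the connection loss for its own znode at roughly the same time.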

      Location of logs:
      esggy-clu-n010: /data4/jarek/ha.interactive/trafodion_and_cluster_logs/cluster_logs.20170616134816.tar.gz and trafodion_logs.20170616134816.tar.gz
      The logs exceed the attachment size limit, so I cannot attach them to this JIRA issue.

      How many ZooKeeper quorum servers are in the cluster? 3 in total.
      <property>
      <name>dcs.zookeeper.quorum</name>
      <value>esggy-clu-n010.esgyn.cn,esggy-clu-n011.esgyn.cn,esggy-clu-n012.esgyn.cn</value>
      </property>
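
For context on why this behavior is unexpected, here is a rough sketch of the majority arithmetic for the quorum listed above. It assumes that dcs.zookeeper.quorum names the full ZooKeeper ensemble and that all listed servers are voting members; neither is confirmed in this report. The helper name quorum_hosts is hypothetical.

```python
import xml.etree.ElementTree as ET

# Property copied from the report.
PROPERTY_XML = """<property>
<name>dcs.zookeeper.quorum</name>
<value>esggy-clu-n010.esgyn.cn,esggy-clu-n011.esgyn.cn,esggy-clu-n012.esgyn.cn</value>
</property>"""


def quorum_hosts(xml_text: str) -> list:
    """Return the host names listed in the <value> element of one property."""
    prop = ET.fromstring(xml_text)
    value = prop.findtext("value", default="")
    return [host.strip() for host in value.split(",") if host.strip()]


hosts = quorum_hosts(PROPERTY_XML)
ensemble_size = len(hosts)

# ZooKeeper needs a strict majority of voting servers to stay available.
majority = ensemble_size // 2 + 1
tolerated_failures = ensemble_size - majority

print(f"ensemble size: {ensemble_size}")
print(f"majority needed: {majority}")
print(f"server failures tolerated: {tolerated_failures}")
```

Under those assumptions, a 3-server ensemble should remain available with one server stopped, so clients connected to esggy-clu-n010 would be expected to reconnect to one of the other two servers rather than lose their session.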

      How to access the cluster?
      1. Log in to 10.10.10.8 from a US machine: trafodion/traf123
      2. Log in to 10.10.23.19 from 10.10.10.8: trafodion/traf123

            People

              zcorrea Zalo Correa
              bo.yu@esgyn.cn Jarek
              Votes: 0
              Watchers: 3
