[TRAFODION-2664] Instance will be down when the zookeeper on name node has been down - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Critical
Resolution: Unresolved
Affects Version/s: 2.2.0
Fix Version/s: 2.4
Component/s: foundation
Labels:
- build
Environment:

Test Environment:
CDH5.4.8: 10.10.23.19:7180, total 6 nodes.
HDFS-HA and DCS-HA: enabled
OS: CentOS 6.8, physical machines.
SW Build: R2.2.3 (EsgynDB_Enterprise Release 2.2.3 (Build release [sbroeder], branch 1ce8d39-xdc_nari, date 11Jun17))


Flags:

Important

Description

Description: The whole instance goes down after the ZooKeeper server on the name node has been stopped.
Test Steps:
Step 1. Start OE and 4 long queries with trafci on the first node, esggy-clu-n010.
Step 2. Wait several minutes, then stop the ZooKeeper server on the name node esggy-clu-n010 from the Cloudera Manager page.
Step 3. With trafci, run a basic query and the 4 long queries again.

After Step 3, the whole instance is reported as down after a while. I ran this test scenario several times, and the instance went down every time.

Timestamp:
Test Start Time: 20170616132939
Test End Time: 20170616134350
Stop zookeeper on name node of node esggy-clu-n010: 20170616133344

Check logs:
1) Each node displays the following error:
2017-06-16 13:33:46,276, ERROR, MON, Node Number: 0,, PIN: 5017 , Process Name: $MONITOR,,, TID: 5429, Message ID: 101371801, [CZClient::IsZNodeExpired], zoo_exists() for /trafodion/instance/cluster/esggy-clu-n010.esgyn.cn failed with error ZCONNECTIONLOSS
2) Zookeeper displays:
ls /trafodion/instance/cluster
[]
So it seems that the zclient has been lost on each node.
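As a quick cross-check of the timing, the sketch below computes the gap between the recorded ZooKeeper stop time and the monitor error quoted above. Both timestamps are copied from this report and are assumed to be in the same time zone; the function name seconds_between is only illustrative.

```python
from datetime import datetime

# Values copied from the Timestamp section and the monitor log line in this report.
ZK_STOP = "20170616133344"                 # ZooKeeper stopped on esggy-clu-n010
MONITOR_ERROR = "2017-06-16 13:33:46,276"  # ZCONNECTIONLOSS error logged by $MONITOR


def seconds_between(stop_compact: str, log_stamp: str) -> float:
    """Return the seconds elapsed from the stop time to the log entry.

    Assumes both timestamps use the same (unspecified) time zone.
    """
    stop = datetime.strptime(stop_compact, "%Y%m%d%H%M%S")
    error = datetime.strptime(log_stamp, "%Y-%m-%d %H:%M:%S,%f")
    return (error - stop).total_seconds()


delta = seconds_between(ZK_STOP, MONITOR_ERROR)
print(f"monitor error logged {delta:.3f} s after the ZooKeeper stop")
```

With the values above, the error is logged roughly two seconds after the stop, which is consistent with the connection loss being triggered by the stop itself.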

Location of logs:
esggy-clu-n010: /data4/jarek/ha.interactive/trafodion_and_cluster_logs/cluster_logs.20170616134816.tar.gz and trafodion_logs.20170616134816.tar.gz
The logs exceed the attachment size limit, so I cannot upload them to this JIRA issue.

How many ZooKeeper quorum servers are in the cluster? 3 in total.
<property>
<name>dcs.zookeeper.quorum</name>
<value>esggy-clu-n010.esgyn.cn,esggy-clu-n011.esgyn.cn,esggy-clu-n012.esgyn.cn</value>
</property>
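For reference, here is a rough sketch of the usual ZooKeeper majority rule applied to the quorum string above. It assumes that dcs.zookeeper.quorum lists the entire ensemble; the helper name tolerated_failures is made up for illustration and is not part of Trafodion or ZooKeeper.

```python
def tolerated_failures(ensemble_size: int) -> int:
    """Number of servers that can be down while a strict majority remains."""
    majority = ensemble_size // 2 + 1
    return ensemble_size - majority


# Quorum string copied from the dcs.zookeeper.quorum property in this report.
quorum = "esggy-clu-n010.esgyn.cn,esggy-clu-n011.esgyn.cn,esggy-clu-n012.esgyn.cn"
servers = quorum.split(",")

print(f"{len(servers)} servers, tolerates {tolerated_failures(len(servers))} failure(s)")
```

Under that assumption, a 3-server ensemble should still have a quorum with one server stopped, which is why the whole instance going down in this test is unexpected.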

How to access the cluster?
1. Log in to 10.10.10.8 from a US machine: trafodion/traf123
2. Log in to 10.10.23.19 from 10.10.10.8: trafodion/traf123

Issue Links

links to

GitHub Pull Request #1155

People

Assignee: Zalo Correa

Reporter: Jarek

Votes: 0

Watchers: 3

Dates

Created: 22/Jun/17 02:23

Updated: 16/Apr/20 02:19