Details
- Bug
- Status: Closed
- Major
- Resolution: Fixed
- 2.2.0
- None
- I tried this on an OpenStack cluster, using Hortonworks HDP 5.4. This is the code with the new elasticity feature.
Description
When trying to install Trafodion on a cluster, I ran into various situations where the monitor failed to start, based on how host names were configured and specified. I used three kinds of names:
NN - a "nickname", a name I made up and put into /etc/hosts. Note: I made the mistake of just adding the nickname, not the actual name in the /etc/hosts line.
LN - a local, non-qualified name that is also the OpenStack instance name and the host name.
FQDN - the fully qualified domain host name
Case  Name specified in HDP  hostname command returns  sqconfig contains  What happened
----  ---------------------  ------------------------  -----------------  -------------
1     nickname               local name                nickname           sqstart returned an error, saying that sqstart must be executed on one of the nodes of the cluster
2     local name             local name                FQDN?              monitor core dump (1)
3     local name             FQDN                      FQDN               monitor abends (2)
4     FQDN                   FQDN                      FQDN               install succeeds
Notes: (1) The core dump happened because of the following code in file core/sqf/monitor/linux/cluster.cxx:
    // Build the monitor's configured view of the cluster
    if ( IsRealCluster )
    {
        // Map node name to physical node id
        // (for virtual nodes physical node equals "rank" (previously set))
        MyPNID = clusterConfig->GetPNid( Node_name );
    }
    Nodes->AddNodes( );
    MyNode = Nodes->GetNode(MyPNID);
    Nodes->SetupCluster( &Node, &LNode, &indexToPnid_ );
Node_name is a local name. The name of the nodes in the "Nodes" list is the FQDN, so we don't find the node and MyPNID is set to -1. This leads to dereferencing MyNode, which is a NULL pointer.
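The failure mode described above can be illustrated with a minimal, self-contained sketch. It does not use the real monitor classes: FakeClusterConfig, FakeNodeContainer, FakeNode and FindMyNode are hypothetical stand-ins, and the only behavior taken from this report is that a failed name lookup yields -1 and the subsequent node lookup yields a NULL pointer. The sketch shows one way a caller could detect both conditions and report them instead of dereferencing the pointer.

```cpp
#include <map>
#include <string>

// Hypothetical stand-in for the cluster configuration (not the Trafodion class).
struct FakeClusterConfig {
    std::map<std::string, int> pnidByName;

    // Returns the physical node id, or -1 if the name is not configured.
    int GetPNid(const std::string& name) const {
        auto it = pnidByName.find(name);
        return it == pnidByName.end() ? -1 : it->second;
    }
};

// Hypothetical stand-in for a node object.
struct FakeNode {
    int pnid;
};

// Hypothetical stand-in for the "Nodes" container.
struct FakeNodeContainer {
    std::map<int, FakeNode> nodes;

    // Returns the node for a physical node id, or nullptr if there is none.
    FakeNode* GetNode(int pnid) {
        auto it = nodes.find(pnid);
        return it == nodes.end() ? nullptr : &it->second;
    }
};

// Looks up the local node by name. On failure, returns nullptr and stores a
// diagnostic in 'error' so the caller can log it and exit cleanly.
FakeNode* FindMyNode(const FakeClusterConfig& config,
                     FakeNodeContainer& nodes,
                     const std::string& nodeName,
                     std::string& error)
{
    int pnid = config.GetPNid(nodeName);
    if (pnid < 0) {
        error = "node name '" + nodeName + "' not found in cluster configuration";
        return nullptr;
    }
    FakeNode* node = nodes.GetNode(pnid);
    if (node == nullptr) {
        error = "no node object for physical node id " + std::to_string(pnid);
    }
    return node;
}
```

With a configuration that only knows "mynode1.novalocal", calling FindMyNode with "mynode1" returns nullptr and a readable message, which is the situation of case 2 without the core dump.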
(2) The third case is the same as the second, with two modifications: the "hostname" command was used to set the host name to the FQDN, and /etc/hosts was edited to put the FQDN first in the line and the local name second (case 2 had it the other way round). This time, we get past the problem described in case 2, but we get an error from MPI, which is unable to communicate with all the nodes (sorry, didn't record the exact error message).
This is how the lines in /etc/hosts look (same layout for all nodes of the cluster):
    # case 1
    1.2.3.4 nickname1
    1.2.3.5 nickname2
    # case 2
    1.2.3.4 mynode1 mynode1.novalocal
    1.2.3.5 mynode2 mynode2.novalocal
    # cases 3 and 4
    1.2.3.4 mynode1.novalocal mynode1
    1.2.3.5 mynode2.novalocal mynode2
My suggestion would be to identify the places where we read node names that are provided by the user and where such node names are compared, and to provide a comparison method that tolerates equivalent forms of names.
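As a rough illustration of what such a tolerant comparison might look like, here is a minimal sketch. The function name HostNamesEquivalent and its helpers are hypothetical and not part of the Trafodion code. It assumes two rules only: host names compare case-insensitively, and an unqualified name matches an FQDN whose first label is the same. It does not resolve nicknames such as those in case 1, which would need a resolver lookup, and it does not treat IP address literals specially.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Returns a lower-case copy of the name (host names are case-insensitive).
static std::string ToLower(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return s;
}

// Returns the part of the name before the first dot (the whole name if
// there is no dot).
static std::string ShortName(const std::string& name) {
    return name.substr(0, name.find('.'));
}

// Hypothetical comparison: true if the two names plausibly denote the same
// host under the assumptions stated in the lead-in.
bool HostNamesEquivalent(const std::string& a, const std::string& b) {
    const std::string la = ToLower(a);
    const std::string lb = ToLower(b);
    if (la == lb) {
        return true;
    }
    const bool aQualified = la.find('.') != std::string::npos;
    const bool bQualified = lb.find('.') != std::string::npos;
    // Two different fully qualified names are treated as different hosts,
    // even if their first labels happen to match.
    if (aQualified && bQualified) {
        return false;
    }
    // At most one side is qualified: compare the short forms.
    return ShortName(la) == ShortName(lb);
}
```

Under these rules "mynode1" and "mynode1.novalocal" compare as equivalent, as do "MyNode1" and "mynode1", while "mynode1.novalocal" and "mynode1.example.com" do not. Refusing to match two different FQDNs is a deliberate choice here, so that hosts in different domains are not confused.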
There are related JIRAs: TRAFODION-2480 and TRAFODION-2391.
Issue Links
- is related to
  - TRAFODION-2391 monitor failed to start when hostname contains uppercase. (Closed)
  - TRAFODION-2480 monitor should not check hostname strictly (Closed)