  1. Apache Trafodion (Retired)
  2. TRAFODION-2692

Monitor fails to start when node names are not of the right form

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.2.0
    • Fix Version/s: 2.2.0
    • Component/s: foundation
    • Labels: None
    • Environment: I tried this on an OpenStack cluster, using Hortonworks HDP 5.4. This is the code with the new elasticity feature.

    Description

      When trying to install Trafodion on a cluster, I ran into various situations where the monitor failed to start, based on how host names were configured and specified. I used three kinds of names:

      NN - a "nickname", a name I made up and put into /etc/hosts. Note: I made the mistake of adding only the nickname to the /etc/hosts line, without the actual host name.
      LN - a local, non-qualified name that is also the OpenStack instance name and the host name.
      FQDN - the fully qualified domain host name

      Case  Name specified  hostname command  sqconfig  What happened
            in HDP          returns           contains
      ----  --------------  ----------------  --------  --------------------------
        1   nickname        local name        nickname  sqstart returned an error,
                                                        saying that sqstart must
                                                        be executed on one of the
                                                        nodes of the cluster
        2   local name      local name        FQDN?     monitor core dump (1)
        3   local name      FQDN              FQDN      monitor abends (2)
        4   FQDN            FQDN              FQDN      install succeeds
      

      Notes: (1) The core dump happened because of the following code in file core/sqf/monitor/linux/cluster.cxx:

          // Build the monitor's configured view of the cluster
          if ( IsRealCluster )
          {   // Map node name to physical node id
              // (for virtual nodes physical node equals "rank" (previously set))
              MyPNID = clusterConfig->GetPNid( Node_name );
          }
      
          Nodes->AddNodes( );
          MyNode = Nodes->GetNode(MyPNID);
          Nodes->SetupCluster( &Node, &LNode, &indexToPnid_ );
      

      Node_name is the local name, but the nodes in the "Nodes" list are named by their FQDN, so the lookup does not find the node and MyPNID is set to -1. Nodes->GetNode(MyPNID) then returns a NULL pointer, and dereferencing MyNode causes the core dump.
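
The failure mode above (a lookup that returns -1, followed by use of the result) can be guarded against with an explicit check. The following is a minimal, self-contained sketch, not the monitor's actual code: `lookupPNid` and `resolveLocalPNid` are hypothetical names, and a `std::map` stands in for the cluster configuration.

```cpp
#include <cstdio>
#include <map>
#include <string>

// Hypothetical stand-in for the cluster configuration lookup.
// Returns -1 when the name is not configured, mirroring the behavior
// described in the report.
int lookupPNid(const std::map<std::string, int>& configured,
               const std::string& nodeName)
{
    auto it = configured.find(nodeName);
    return it == configured.end() ? -1 : it->second;
}

// Resolve the local node id and report a readable error instead of
// letting a -1 id flow into later node lookups.
bool resolveLocalPNid(const std::map<std::string, int>& configured,
                      const std::string& nodeName, int& pnid)
{
    pnid = lookupPNid(configured, nodeName);
    if (pnid < 0)
    {
        std::fprintf(stderr,
                     "Node name '%s' is not in the cluster configuration\n",
                     nodeName.c_str());
        return false;
    }
    return true;
}
```

With a configuration keyed by FQDN, calling `resolveLocalPNid` with the short name `mynode1` returns false and prints a message, where the original sequence would go on to use a NULL node pointer.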

      (2) The third case is the same as the second, with two modifications: the "hostname" command was used to set the host name to the FQDN, and /etc/hosts was edited to put the FQDN first on each line and the local name second (case 2 had them the other way round). This gets past the problem described in case 2, but MPI then reports an error because it is unable to communicate with all the nodes (sorry, I didn't record the exact error message).

      This is how the lines in /etc/hosts look (same layout for all nodes of the cluster):

      # case 1
      1.2.3.4 nickname1
      1.2.3.5 nickname2
      
      
      # case 2
      1.2.3.4 mynode1 mynode1.novalocal
      1.2.3.5 mynode2 mynode2.novalocal
      
      # cases 3 and 4
      1.2.3.4 mynode1.novalocal mynode1
      1.2.3.5 mynode2.novalocal mynode2
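
The only difference between case 2 and cases 3 and 4 is the order of the names on each line. In the /etc/hosts format, the first name after the address is the canonical host name and any further names are aliases. The sketch below shows that split for a single line; `HostsEntry` and `parseHostsLine` are hypothetical names used for illustration only, and how a given resolver actually uses the canonical name depends on the system configuration.

```cpp
#include <sstream>
#include <string>
#include <vector>

struct HostsEntry
{
    std::string address;
    std::string canonicalName;         // first name on the line
    std::vector<std::string> aliases;  // any further names
};

// Parse one /etc/hosts line. Returns false for blank lines,
// comment-only lines, and lines that have an address but no name.
bool parseHostsLine(const std::string& line, HostsEntry& entry)
{
    // Drop a trailing comment, if any (find returns npos when there is none,
    // in which case substr keeps the whole line).
    std::istringstream in(line.substr(0, line.find('#')));
    entry = HostsEntry();
    if (!(in >> entry.address >> entry.canonicalName))
        return false;
    std::string alias;
    while (in >> alias)
        entry.aliases.push_back(alias);
    return true;
}
```

For the case 2 line `1.2.3.4 mynode1 mynode1.novalocal`, the canonical name is `mynode1` and the FQDN is only an alias; for the case 3 and 4 layout it is the other way round.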
      

      My suggestion would be to identify the places where we read node names that are provided by the user and where such node names are compared, and to provide a comparison method that tolerates equivalent forms of names.
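
One possible shape for such a comparison is sketched below. It is an illustration under stated assumptions, not the fix that was applied: `sameHost` and its helpers are hypothetical names, and the rule used here (case-insensitive match, with an unqualified name matching the first label of a qualified one) is only one way to define "equivalent forms".

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Lower-case a host name; host names compare case-insensitively.
std::string toLower(std::string s)
{
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return s;
}

// Part of the name before the first dot (the whole name if unqualified).
std::string shortName(const std::string& name)
{
    return name.substr(0, name.find('.'));
}

// Treat two names as the same host when they match exactly (ignoring case),
// or when at least one is unqualified and the first labels match.
bool sameHost(const std::string& a, const std::string& b)
{
    const std::string la = toLower(a);
    const std::string lb = toLower(b);
    if (la == lb)
        return true;
    const bool aQualified = la.find('.') != std::string::npos;
    const bool bQualified = lb.find('.') != std::string::npos;
    if (aQualified && bQualified)
        return false;  // two different fully qualified names
    return shortName(la) == shortName(lb);
}
```

With this rule, `sameHost("mynode1", "mynode1.novalocal")` is true, which covers case 2. A made-up nickname as in case 1 shares no label with the real name, so handling it would need an actual name resolution step (for example comparing resolved addresses), which this sketch does not attempt.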

      There are related JIRAs: TRAFODION-2480 and TRAFODION-2391.

            People

              zcorrea Zalo Correa
              hzeller Hans Zeller