Hadoop HDFS / HDFS-5846

Assigning DEFAULT_RACK in resolveNetworkLocation method can break data resiliency


    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.3.0, 3.0.0-alpha1
    • Fix Version/s: 2.4.0
    • Component/s: namenode
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      The method CachedDNSToSwitchMapping::resolve() can return null, which requires careful handling. Null can be returned in two cases:
      • An error occurred while executing the topology script (the script crashed).
      • The script returned the wrong number of values (different from the number expected).
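The two null cases above can be made concrete with a minimal sketch. The class and method names below are illustrative, not Hadoop's actual ScriptBasedMapping implementation; the point is only that a crashed script or a wrong-sized script output both surface as a null result from resolve():

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch (not Hadoop's real code) of a script-based rack
// mapping: it must return null when the topology script crashes or
// emits a number of values different from the number of queried hosts.
public class ScriptMappingSketch {
    // scriptOutput simulates the tokens printed by the topology script;
    // null simulates a script crash.
    static List<String> resolve(List<String> names, List<String> scriptOutput) {
        if (scriptOutput == null) {
            return null; // case 1: script execution failed
        }
        if (scriptOutput.size() != names.size()) {
            return null; // case 2: wrong number of values
        }
        return new ArrayList<>(scriptOutput);
    }

    public static void main(String[] args) {
        List<String> hosts = Arrays.asList("dn1.example.com", "dn2.example.com");
        // Healthy script: one rack per host.
        System.out.println(resolve(hosts, Arrays.asList("/rack1", "/rack2")));
        // Script crash -> resolve() returns null.
        System.out.println(resolve(hosts, null));
        // Script printed one value for two hosts -> resolve() returns null.
        System.out.println(resolve(hosts, Arrays.asList("/rack1")));
    }
}
```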

      The critical handling is in the DN registration code, which is responsible for assigning a proper topology path to every registered datanode. The existing code handles this null in the following way (resolveNetworkLocation method):

      // resolve its network location
          List<String> rName = dnsToSwitchMapping.resolve(names);
          String networkLocation;
          if (rName == null) {
            LOG.error("The resolve call returned null! Using " + 
                NetworkTopology.DEFAULT_RACK + " for host " + names);
            networkLocation = NetworkTopology.DEFAULT_RACK;
          } else {
            networkLocation = rName.get(0);
          }
          return networkLocation;
      

      The line of code that assigns the default rack:

       networkLocation = NetworkTopology.DEFAULT_RACK; 

      can cause a serious problem. If we somehow get null, the default rack is assigned as the DN's network location and the DN's registration finishes successfully. Under these circumstances, we are able to load data into a cluster that is working with a wrong topology. A wrong topology means that fault domains are not honored.

      For the end user, this means that two data replicas can end up in the same fault domain, and a single failure can cause the loss of two or more replicas. The cluster would be in an inconsistent state but would not be aware of it, and everything would appear to work as if nothing were wrong. We can notice that something went wrong almost only by looking in the log for the error:
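The resiliency loss is easy to quantify. The sketch below (illustrative names, not HDFS code) counts replicas surviving a single rack failure: with a correct topology at least one replica lives on another rack, but when every misregistered DN collapses into the default rack, one fault-domain failure destroys all copies of a block:

```java
import java.util.Arrays;
import java.util.List;

// Illustrates why a wrong topology breaks data resiliency: if all
// replicas land in the same (default) rack, a single rack failure
// destroys every copy of the block.
public class FaultDomainSketch {
    // Count replicas that survive the failure of one rack.
    static long survivors(List<String> replicaRacks, String failedRack) {
        return replicaRacks.stream()
            .filter(rack -> !rack.equals(failedRack))
            .count();
    }

    public static void main(String[] args) {
        // Correct topology: replicas spread across two racks; one survives.
        System.out.println(
            survivors(Arrays.asList("/rack1", "/rack2", "/rack2"), "/rack2"));
        // Broken topology: every DN registered under the default rack; none survive.
        System.out.println(
            survivors(Arrays.asList("/default-rack", "/default-rack", "/default-rack"),
                      "/default-rack"));
    }
}
```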

      LOG.error("The resolve call returned null! Using " + 
      NetworkTopology.DEFAULT_RACK + " for host " + names);
      
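A stricter alternative to the silent fallback is to reject the registration when the topology cannot be resolved, so the operator is forced to fix the script before data is loaded. The sketch below uses illustrative names (the actual patch's classes and exception type may differ):

```java
import java.util.Arrays;
import java.util.List;

// Sketch of stricter handling: fail DN registration instead of silently
// assigning DEFAULT_RACK. Names are illustrative, not necessarily the
// exact code touched by the real patch.
public class StrictResolveSketch {
    static final String DEFAULT_RACK = "/default-rack";

    // Thrown when the topology mapping cannot resolve a host.
    static class UnresolvedTopologyException extends Exception {
        UnresolvedTopologyException(String msg) { super(msg); }
    }

    // rName stands in for the result of dnsToSwitchMapping.resolve(names).
    static String resolveNetworkLocation(List<String> names, List<String> rName)
            throws UnresolvedTopologyException {
        if (rName == null) {
            // Better to fail loudly here than to load data into a
            // cluster that is working with a wrong topology.
            throw new UnresolvedTopologyException(
                "Unresolved topology mapping for host " + names);
        }
        return rName.get(0);
    }

    public static void main(String[] args) throws Exception {
        List<String> names = Arrays.asList("dn1.example.com");
        // Resolved host: registration proceeds with the real rack.
        System.out.println(resolveNetworkLocation(names, Arrays.asList("/rack1")));
        // Unresolved host: registration is rejected.
        try {
            resolveNetworkLocation(names, null);
        } catch (UnresolvedTopologyException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```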

        Attachments

        1. hdfs-5846.patch
          16 kB
          Chris Nauroth
        2. hdfs-5846.patch
          16 kB
          Nikola Vujic
        3. hdfs-5846.patch
          10 kB
          Nikola Vujic


            People

            • Assignee: nikola.vujic (Nikola Vujic)
            • Reporter: nikola.vujic (Nikola Vujic)
            • Votes: 0
            • Watchers: 5
