  Bigtop / BIGTOP-1336

Puppet recipes failed to deploy a Kerberos-enabled Hadoop cluster

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.7.0
    • Fix Version/s: 0.8.0
    • Component/s: deployment
    • Labels: None

      Description

      Our puppet recipes are missing some dependency settings that are needed to get Kerberos enabled on the Hadoop cluster.

      The first is that the Kerberos principal for the hdfs user is not created before the namenode is formatted, which causes the namenode format to fail.

      The second is that /etc/default/hadoop-hdfs-datanode is not in place before the datanodes are started, which causes the datanodes to fail to start up.
      The datanode error log:

      2014-06-16 15:10:10,711 FATAL org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in secureMain
      java.lang.RuntimeException: Cannot start secure cluster without privileged resources.
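
      For illustration, here is a minimal sketch of the missing ordering constraints in Puppet. The resource titles below are assumptions made for this example; the actual resource names in the Bigtop manifests may differ.

      # Sketch only: make the hdfs principal/keytab available before the
      # namenode is formatted, and the secure-datanode defaults file
      # available before the datanode service is started.
      Kerberos::Host_keytab['hdfs']             -> Exec['namenode format']
      File['/etc/default/hadoop-hdfs-datanode'] -> Service['hadoop-hdfs-datanode']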
      

      Here are the steps to reproduce using vagrant-puppet:
      1.) Enable kerberos on the hadoop cluster.

      $ vim bigtop-deploy/vm/vagrant-puppet/provision.sh
      

      Add the Kerberos definitions:

      cat > /bigtop-puppet/config/site.csv << EOF
      hadoop_head_node,$1
      hadoop_storage_dirs,/data/1,/data/2
      bigtop_yumrepo_uri,http://bigtop.s3.amazonaws.com/releases/0.7.0/redhat/6/x86_64
      jdk_package_name,java-1.7.0-openjdk-devel.x86_64
      components,hadoop,hbase,yarn,mapred-app
      hadoop_security,kerberos
      hadoop_kerberos_domain,vagrant
      hadoop_kerberos_realm,BIGTOP.ORG
      hadoop_kerberos_kdc_server,bigtop1.vagrant
      EOF
      

      2.) Spin up the cluster.

      $ ./startup.sh --cluster
      

      3-1.) Get an error while formatting the namenode.

      err: /Stage[main]/Hadoop_head_node/Hadoop::Namenode[namenode]/Exec[namenode format]/returns: change from notrun to 0 failed: /bin/bash -c 'yes Y | hdfs namenode -format >> /var/lib/hadoop-hdfs/nn.format.log 2>&1' returned 1 instead of one of [0] at /tmp/vagrant-puppet-2/modules-0/hadoop/manifests/init.pp:361
      

      3-2.) Get an error while starting up datanodes.

      err: /Stage[main]/Hadoop_worker_node/Hadoop::Datanode[datanode]/Service[hadoop-hdfs-datanode]/ensure: change from stopped to running failed: Could not start Service[hadoop-hdfs-datanode]: Execution of '/sbin/service hadoop-hdfs-datanode start' returned 1:  at /tmp/vagrant-puppet-2/modules-0/hadoop/manifests/init.pp:158
      


          Activity

          Evans Ye added a comment -

          OK, you might encounter an error like the one below when following the reproduce steps to provision a Kerberos-enabled Hadoop cluster on VMs.

          err: /Stage[main]/Kerberos::Kdc/Exec[kdb5_util]/returns: change from notrun to 0 failed: Command exceeded timeout at /tmp/vagrant-puppet-2/modules-0/kerberos/manifests/init.pp:113
          

          It's actually raised by the following shell command, which ran past the Puppet exec timeout during provisioning:

          [root@bigtop1 ~]#  kdb5_util -P cthulhu -r BIGTOP.ORG create -s
          Loading random data
          (hang for a long time)
          

          If you go ahead and look into the entropy of the VM, you'll get a pretty poor number:

          [root@bigtop1 ~]# cat /proc/sys/kernel/random/entropy_avail
          4
          

          So, the root cause of the error is that the Kerberos database initialization step cannot gather enough entropy on the VM.
          Poor entropy is common on virtual machines because their hardware is emulated (VirtualBox ticket #11297).
          A simple solution for this is to use rng-tools.

          Since this issue is not related to our puppet recipes but is an environment-specific problem, I'd prefer to add rng-tools support in vagrant-puppet's provision.sh.
          Suggestions are welcome.
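
          For reference, this is the kind of snippet that could go into provision.sh. It is only a sketch assuming the CentOS 6 guest used above; the package name, the rngd service, and the /etc/sysconfig/rngd setting are assumptions, not the final patch.

          # Sketch only: keep the entropy pool fed so kdb5_util does not block
          yum install -y rng-tools
          # point rngd at /dev/urandom (config path and option assumed; adjust per distro)
          echo 'EXTRAOPTIONS="-r /dev/urandom"' >> /etc/sysconfig/rngd
          service rngd start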

          jay vyas added a comment -

          Thanks Evans, and thanks for catching this.

          So I guess this patch makes the kerberos keytab step precede namenode formatting / starting the namenode service?

          If so, it's clearly an important improvement, because kerberos is CRITICAL for running LinuxContainerExecutors in Hadoop 2.3.

          Evans Ye added a comment - edited

          jay vyas, yes, you're right about the namenode part.
          There's another part of this patch regarding the datanode: if we do not set up /etc/default/hadoop-hdfs-datanode before the datanode is started, the following FATAL error shows up in the datanode's log:

          2014-06-16 15:10:10,711 FATAL org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in secureMain
          java.lang.RuntimeException: Cannot start secure cluster without privileged resources.
          

          Overall, this patch mainly addresses the issues needed to bring a Kerberos-enabled Hadoop cluster up.
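
          For context, the secure datanode refuses to start unless it is configured to run with privileged resources, which is what that file provides. An illustrative /etc/default/hadoop-hdfs-datanode is sketched below; the values are assumptions, not the exact contents written by the recipe.

          # Illustrative values only
          export HADOOP_SECURE_DN_USER=hdfs
          export HADOOP_SECURE_DN_PID_DIR=/var/run/hadoop-hdfs
          export HADOOP_SECURE_DN_LOG_DIR=/var/log/hadoop-hdfs
          export JSVC_HOME=/usr/lib/bigtop-utils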

          Roman Shaposhnik added a comment -

          +1 and committed! Thanks for the patch.


            People

            • Assignee: Unassigned
            • Reporter: Evans Ye
            • Votes: 0
            • Watchers: 4
