Whirr / WHIRR-227

CDH and Hadoop integration tests are failing

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.4.0
    • Component/s: None
    • Labels:
      None

      Description

      I have tried multiple times (even using different internet connections and cloud providers) to run the integration tests for CDH and Hadoop, and they always fail with the same error message:

      -------------------------------------------------------------------------------
      Test set: org.apache.whirr.service.cdh.integration.CdhHadoopServiceTest
      -------------------------------------------------------------------------------
      Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 336.63 sec <<< FAILURE!
      test(org.apache.whirr.service.cdh.integration.CdhHadoopServiceTest)  Time elapsed: 336.53 sec  <<< ERROR!
      java.io.IOException: Call to ec2-50-16-169-138.compute-1.amazonaws.com/50.16.169.138:8021 failed on local exception: java.net.SocketException: Malformed reply from SOCKS server
        at org.apache.hadoop.ipc.Client.wrapException(Client.java:1089)
        at org.apache.hadoop.ipc.Client.call(Client.java:1057)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:226)
        at org.apache.hadoop.mapred.$Proxy76.getProtocolVersion(Unknown Source)
        at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:369)
        at org.apache.hadoop.mapred.JobClient.createRPCProxy(JobClient.java:486)
        at org.apache.hadoop.mapred.JobClient.init(JobClient.java:471)
        at org.apache.hadoop.mapred.JobClient.<init>(JobClient.java:456)
        at org.apache.whirr.service.cdh.integration.CdhHadoopServiceTest.test(CdhHadoopServiceTest.java:87)
      

      I believe this is somehow related to one of the recently committed patches.

        Activity

        Tibor Kiss added a comment - edited

        I am following this bug, and I believe the change to the short role names is the problem, because at the end of
        http://whirr.s3.amazonaws.com/0.4.0-incubating-SNAPSHOT/apache/hadoop/post-configure
        the old short role names are still used!

        
        for role in $(echo "$ROLES" | tr "," "\n"); do
          case $role in
          nn)
            setup_web
            start_namenode
            ;;
          snn)
            start_daemon secondarynamenode
            ;;
          jt)
            start_daemon jobtracker
            ;;
          dn)
            start_daemon datanode
            ;;
          tt)
            start_daemon tasktracker
            ;;
          esac
        done
        

        Here is what I see on the node when I run it manually... it does nothing at the end!

        [root@domU-12-31-39-0E-CD-63 computeserv]# ./post-configure hadoop-namenode,hadoop-jobtracker -n ec2-184-73-150-247.compute-1.amazonaws.com -j ec2-184-73-150-247.compute-1.amazonaws.com -c ec2
        + set -e
        + ROLES=hadoop-namenode,hadoop-jobtracker
        + shift
        + NN_HOST=
        + JT_HOST=
        + CLOUD_PROVIDER=
        + getopts n:j:c: OPTION
        + case $OPTION in
        + NN_HOST=ec2-184-73-150-247.compute-1.amazonaws.com
        + getopts n:j:c: OPTION
        + case $OPTION in
        + JT_HOST=ec2-184-73-150-247.compute-1.amazonaws.com
        + getopts n:j:c: OPTION
        + case $OPTION in
        + CLOUD_PROVIDER=ec2
        + getopts n:j:c: OPTION
        + case $CLOUD_PROVIDER in
        ++ wget -q -O - http://169.254.169.254/latest/meta-data/public-hostname
        + SELF_HOST=ec2-184-73-150-247.compute-1.amazonaws.com
        + HADOOP_VERSION=0.20.2
        + HADOOP_HOME=/usr/local/hadoop-0.20.2
        + HADOOP_CONF_DIR=/usr/local/hadoop-0.20.2/conf
        + configure_hadoop
        + case $CLOUD_PROVIDER in
        + MOUNT=/mnt
        + FIRST_MOUNT=/mnt
        + DFS_NAME_DIR=/mnt/hadoop/hdfs/name
        + FS_CHECKPOINT_DIR=/mnt/hadoop/hdfs/secondary
        + DFS_DATA_DIR=/mnt/hadoop/hdfs/data
        + MAPRED_LOCAL_DIR=/mnt/hadoop/mapred/local
        + MAX_MAP_TASKS=2
        + MAX_REDUCE_TASKS=1
        + CHILD_OPTS=-Xmx550m
        + CHILD_ULIMIT=1126400
        + mkdir -p /mnt/hadoop
        + chown hadoop:hadoop /mnt/hadoop
        + '[' '!' -e /mnt/tmp ']'
        + mkdir /etc/hadoop
        + ln -s /usr/local/hadoop-0.20.2/conf /etc/hadoop/conf
        + cat
        + sed -i -e 's|# export HADOOP_PID_DIR=.*|export HADOOP_PID_DIR=/var/run/hadoop|' /usr/local/hadoop-0.20.2/conf/hadoop-env.sh
        + mkdir -p /var/run/hadoop
        + chown -R hadoop:hadoop /var/run/hadoop
        + sed -i -e 's|# export HADOOP_SSH_OPTS=.*|export HADOOP_SSH_OPTS="-o StrictHostKeyChecking=no"|' /usr/local/hadoop-0.20.2/conf/hadoop-env.sh
        + sed -i -e 's|# export HADOOP_OPTS=.*|export HADOOP_OPTS="-Djava.net.preferIPv4Stack=true"|' /usr/local/hadoop-0.20.2/conf/hadoop-env.sh
        + sed -i -e 's|# export HADOOP_LOG_DIR=.*|export HADOOP_LOG_DIR=/var/log/hadoop/logs|' /usr/local/hadoop-0.20.2/conf/hadoop-env.sh
        + rm -rf /var/log/hadoop
        + mkdir /mnt/hadoop/logs
        + chown hadoop:hadoop /mnt/hadoop/logs
        + ln -s /mnt/hadoop/logs /var/log/hadoop
        + chown -R hadoop:hadoop /var/log/hadoop
        ++ echo hadoop-namenode,hadoop-jobtracker
        ++ tr , '\n'
        + for role in '$(echo "$ROLES" | tr "," "\n")'
        + case $role in
        + for role in '$(echo "$ROLES" | tr "," "\n")'
        + case $role in
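The option-parsing prologue seen in this trace presumably corresponds to something like the following sketch (a reconstruction inferred from the `set -x` output above, not the actual file; the function wrapper is added so the sketch is self-contained):

```shell
# Reconstructed sketch of the post-configure argument handling, inferred
# from the trace above. The role list is the first positional argument;
# the hosts and cloud provider come in via getopts.
parse_args() {
  ROLES=$1            # e.g. hadoop-namenode,hadoop-jobtracker
  shift
  NN_HOST=
  JT_HOST=
  CLOUD_PROVIDER=
  while getopts "n:j:c:" OPTION; do
    case $OPTION in
    n) NN_HOST="$OPTARG" ;;         # namenode hostname
    j) JT_HOST="$OPTARG" ;;         # jobtracker hostname
    c) CLOUD_PROVIDER="$OPTARG" ;;  # e.g. ec2
    esac
  done
}

# Same invocation as in the manual run above:
parse_args hadoop-namenode,hadoop-jobtracker \
  -n ec2-184-73-150-247.compute-1.amazonaws.com \
  -j ec2-184-73-150-247.compute-1.amazonaws.com \
  -c ec2
echo "ROLES=$ROLES CLOUD_PROVIDER=$CLOUD_PROVIDER"
```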
        

        I see that this is fixed in trunk. So the problem is simply that the updated file was never uploaded to S3. That's all.
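The silent no-op at the end of the trace can be reproduced in isolation: the `case` patterns in the S3 copy of the script match only the old short names, so the new long role names fall through without starting anything. A minimal sketch (the daemon starts are replaced by `echo`, and accepting the long names alongside the short ones is an assumption about what the trunk fix looks like):

```shell
# Dispatch roles the way post-configure does; echo stands in for the real
# setup_web/start_namenode/start_daemon calls.
dispatch() {
  ROLES=$1
  for role in $(echo "$ROLES" | tr "," "\n"); do
    case $role in
    nn|hadoop-namenode)              # long-name patterns assumed from the trunk fix
      echo "start namenode" ;;
    snn|hadoop-secondarynamenode)
      echo "start secondarynamenode" ;;
    jt|hadoop-jobtracker)
      echo "start jobtracker" ;;
    dn|hadoop-datanode)
      echo "start datanode" ;;
    tt|hadoop-tasktracker)
      echo "start tasktracker" ;;
    *)
      # With only the short-name patterns, long role names land here
      # silently -- which is exactly the no-op seen in the trace.
      echo "no match for $role" ;;
    esac
  done
}

dispatch hadoop-namenode,hadoop-jobtracker
```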

        Andrei Savu added a comment -

        I'm glad to see that it's something easy to fix. I was re-checking the patch. Thanks Tibor for tracking this down.

        Tibor Kiss added a comment -

        When will the http://whirr.s3.amazonaws.com/0.4.0-incubating-SNAPSHOT/apache/hadoop/post-configure be updated (corrected)?
        Hide
        Andrei Savu added a comment -

        Hopefully as soon as Tom sees this

        Tom White added a comment -

        I've just uploaded the files so the 0.4.0-incubating-SNAPSHOT directory on S3 is in sync with trunk. This was my fault, as I forgot to upload the files at the time of committing WHIRR-199 and WHIRR-183 (and then I was away for a few days without internet access, so it took a while to fix!).

        The policy should be that any committer is able to update the S3 bucket, but in practice I've found it difficult to administer the permissions so that others are granted access (especially to new directories). Given all this, I think we should move away from this model. I opened WHIRR-225 to discuss this, so please have a look at the proposal there and add your comments. The sooner we fix this, the better, IMO.

        Again - sorry for the inconvenience!

        Andrei Savu added a comment -

        I guess that now it's safe to close this issue?

        Tom White added a comment -

        Tests are failing due to WHIRR-124 for me now.

        Andrei Savu added a comment -

        I'm seeing the same issue (I was just about to write a comment). We should reopen the issue and roll back that change.

        Tom White added a comment -

        +1

        Andrei Savu added a comment -

        I have re-run all the tests while re-checking WHIRR-124. I'm seeing only one failure in CDH integration tests:

        -------------------------------------------------------------------------------
        Test set: org.apache.whirr.service.cdh.integration.CdhHadoopServiceTest
        -------------------------------------------------------------------------------
        Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 431.283 sec <<< FAILURE!
        test(org.apache.whirr.service.cdh.integration.CdhHadoopServiceTest)  Time elapsed: 431.11 sec  <<< ERROR!
        java.net.UnknownHostException: unknown host: ip-10-112-221-240.ec2.internal
            at org.apache.hadoop.ipc.Client$Connection.<init>(Client.java:241)
            at org.apache.hadoop.ipc.Client.getConnection(Client.java:1184)
            at org.apache.hadoop.ipc.Client.call(Client.java:1025)
            at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:226)
            at $Proxy82.getProtocolVersion(Unknown Source)
            at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:369)
            at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:111)
            at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:213)
            at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:180)
            at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:89)
            at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1489)
            at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
            at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:1523)
            at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1505)
            at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:227)
            at org.apache.hadoop.fs.Path.getFileSystem(Path.java:175)
            at org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:97)
            at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:799)
            at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:793)
            at java.security.AccessController.doPrivileged(Native Method)
            at javax.security.auth.Subject.doAs(Subject.java:396)
            at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1063)
            at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:793)
            at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:767)
            at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1197)
            at org.apache.whirr.service.cdh.integration.CdhHadoopServiceTest.test(CdhHadoopServiceTest.java:104)
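The UnknownHostException above indicates that the test client, running outside EC2, cannot resolve the cluster's EC2-internal hostname (names like ip-10-112-221-240.ec2.internal only resolve from inside EC2's own network). A quick way to check resolution from the test machine, as a sketch (assumes a Linux host with `getent` available):

```shell
# Check whether a hostname resolves from this machine. EC2-internal names
# such as ip-10-112-221-240.ec2.internal resolve only inside EC2's network,
# so from a developer machine this prints "unknown host: ...".
check_resolves() {
  if getent hosts "$1" >/dev/null 2>&1; then
    echo "resolves: $1"
  else
    echo "unknown host: $1"
  fi
}

check_resolves ip-10-112-221-240.ec2.internal
```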
        
        Andrei Savu added a comment -

        This is no longer an issue. All the unit and integration tests are passing.


          People

          • Assignee: Unassigned
          • Reporter: Andrei Savu
          • Votes: 0
          • Watchers: 2