WHIRR-427: CDH Hadoop integration test fails with malformed reply from SOCKS, may be version issue

    Details

    • Type: Bug
    • Status: Reopened
    • Priority: Blocker
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: 0.7.0
    • Component/s: None
    • Labels: None

      Description

      CdhHadoopServiceTest is failing on trunk.

      java.io.IOException: Call to ec2-174-129-167-130.compute-1.amazonaws.com/174.129.167.130:8021 failed on local exception: java.net.SocketException: Malformed reply from SOCKS server
              at org.apache.hadoop.ipc.Client.wrapException(Client.java:1139)
              at org.apache.hadoop.ipc.Client.call(Client.java:1107)
              at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:226)
              at org.apache.hadoop.mapred.$Proxy85.getProtocolVersion(Unknown Source)
              at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:398)
              at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:384)
              at org.apache.hadoop.mapred.JobClient.createRPCProxy(JobClient.java:501)
              at org.apache.hadoop.mapred.JobClient.init(JobClient.java:486)
              at org.apache.hadoop.mapred.JobClient.<init>(JobClient.java:469)
              at org.apache.whirr.service.cdh.integration.CdhHadoopServiceTest.test(CdhHadoopServiceTest.java:86)
      

      I am running this in services/cdh with:

          mvn -Pintegration install -Dtest=CdhHadoopServiceTest -DargLine="-Dwhirr.test.provider=aws-ec2 -Dwhirr.test.identity=... -Dwhirr.test.credential=..."

      The corresponding HadoopServiceTest is passing.

      Also see WHIRR-227 and WHIRR-415.
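
      For context (not part of the original report): the integration test reaches the cluster through the local SOCKS proxy (an ssh dynamic forward) that Whirr starts, and in this issue the exception went hand in hand with the jobtracker on port 8021 never coming up (see the root cause found in the comments). A rough manual check along those lines, reusing the cluster name and hostname that appear in the logs below; the proxy script location follows Whirr's usual ~/.whirr/<cluster> convention and may differ:

          # Hedged sketch, not taken from the issue.
          sh ~/.whirr/hadoopcdhclustertest/hadoop-proxy.sh &
          # "channel N: open failed: connect failed: Connection refused" from the tunnel
          # (as seen in the comments below) means the remote port is not listening.
          ssh ec2-174-129-167-130.compute-1.amazonaws.com 'sudo netstat -ntlp | grep 8021 || echo "nothing listening on 8021"'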

        Activity

        Alex Heneveld added a comment -

        I have tried changing the Hadoop, HBase and ZooKeeper versions used by the tests in the pom.xml, but no joy. (Running u0 and u1 against an explicit REPO version installed on the servers, and running the u2 versions against the default cdh3, a.k.a. u2 – all give the above problem.)

        I've also noticed various ssh messages:

        Warning: Permanently added 'ec2-107-22-77-118.compute-1.amazonaws.com,107.22.77.118' (RSA) to the list of known hosts.
        channel 2: open failed: connect failed: Connection refused
        channel 2: open failed: connect failed: Connection refused
        channel 2: open failed: connect failed: Connection refused
        

        These are sandwiched between the following timestamped messages (from whirr.log):

        2011-11-09 13:22:44,329 INFO  [org.apache.whirr.service.hadoop.HadoopDataNodeClusterActionHandler] (main) Completed configuration of hadoopcdhclustertest role hadoop-datanode
        2011-11-09 13:22:44,330 INFO  [org.apache.whirr.service.hadoop.HadoopTaskTrackerClusterActionHandler] (main) Completed configuration of hadoopcdhclustertest role hadoop-tasktracker
        2011-11-09 13:22:44,442 INFO  [org.apache.whirr.service.FileClusterStateStore] (main) Wrote instances file /Users/alex/.whirr/hadoopcdhclustertest/instances
        2011-11-09 13:22:56,527 INFO  [org.apache.whirr.actions.ScriptBasedClusterAction] (main) Starting to run scripts on cluster for phase destroyinstances: us-east-1/i-1b20a578
        

        The jclouds-ssh log has the following content at that time; I'm not sure whether the shutdown is expected or premature:

        2011-11-09 13:22:44,318 DEBUG [jclouds.ssh] (user thread 9) << (alex:rsa[fingerprint(d4:28:19:cd:1c:b0:44:bc:4f:8e:8f:b5:12:8a:27:37),sha1(f8:6a:24:ca:ea:9
        b:30:1d:e0:8b:a8:14:6e:6d:75:5d:6a:b5:3c:e8)]@107.22.77.118:22) acquired [output=11/11/09 13:22:06 INFO namenode.FSNamesystem: isPermissionEnabled=true
        11/11/09 13:22:06 INFO namenode.FSNamesystem: dfs.block.invalidate.limit=1000
        11/11/09 13:22:06 INFO namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)
        11/11/09 13:22:06 INFO common.Storage: Image file of size 110 saved in 0 seconds.
        11/11/09 13:22:06 INFO common.Storage: Storage directory /data/hadoop/hdfs/name has been successfully formatted.
        11/11/09 13:22:06 INFO namenode.NameNode: SHUTDOWN_MSG: 
        /************************************************************
        SHUTDOWN_MSG: Shutting down NameNode at ip-10-205-2-215.ec2.internal/10.205.2.215
        ************************************************************/
        11/11/09 13:22:13 INFO ipc.Client: Retrying connect to server: ec2-107-22-77-118.compute1.amazonaws.com/10.205.2.215:8020. Already tried 0 time(s).
        , error=, exitCode=0]
        
        Alex Heneveld added a comment -

        For good measure the CDH pom should probably now look like:

        
            <dependency>
              <!-- look at https://repository.cloudera.com/artifactory/libs-release-local/org/apache/hadoop/hadoop-core/ 
                   to find version info for other cdh REPO versions, same for hbase and zookeeper -->
              <groupId>org.apache.hadoop</groupId>
              <artifactId>hadoop-core</artifactId>
              <!-- <version>0.20.2-cdh3u0</version> -->
              <!-- <version>0.20.2-cdh3u1</version> -->
              <version>0.20.2-cdh3u2</version>
              <scope>test</scope>
            </dependency>
            <dependency>
              <groupId>org.apache.hbase</groupId>
              <artifactId>hbase</artifactId>
              <!-- <version>0.90.1-cdh3u0</version> -->
              <!-- <version>0.90.3-cdh3u1</version> -->
              <version>0.90.4-cdh3u2</version>
              <scope>test</scope>
            </dependency>
            <dependency>
              <groupId>org.apache.zookeeper</groupId>
              <artifactId>zookeeper</artifactId>
              <!-- <version>3.3.3-cdh3u0</version> -->
              <!-- <version>3.3.3-cdh3u1</version> -->
              <version>3.3.3-cdh3u2</version>
              <scope>test</scope>
            </dependency>
        

        Since it is installing cdh3 (a.k.a. u2) by default, I wouldn't expect that to make a difference, as the Hadoop version is unchanged.

        But I'm leaving this bug for someone more familiar with the domain...
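
        (Not part of the original comment: one hedged way to confirm which hadoop-core the test classpath actually resolves is Maven's dependency tree, run from services/cdh:)

        mvn dependency:tree -Dincludes=org.apache.hadoop:hadoop-core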

        Andrei Savu added a comment -

        Thanks Alex for looking into this. I think Tom can help us move things forward. The namenode shutdown message is expected.

        Andrei Savu added a comment -

        Marking this as a blocker for 0.7.0.

        Andrei Savu added a comment -

        Looks like an ordering issue in statement execution: configure_cdh_hadoop is called before the /tmp/(core|hdfs|mapred)-site.xml and /tmp/hadoop-env.sh files are created. It should be easy to fix. Working on a patch now.

        Andrei Savu added a comment -

        Nope, the statements are in the right order but we are creating the files twice - before and after configure_cdh_hadoop.

        Andrei Savu added a comment -

        I have just tried to start a CDH cluster outside the integration tests using a recipe like this:

        whirr.cluster-name=cdh-asavu
        
        # Change the number of machines in the cluster here
        whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker, 1 hadoop-datanode+hadoop-tasktracker
        
        # Uncomment out these lines to run CDH
        whirr.hadoop.install-function=install_cdh_hadoop
        whirr.hadoop.configure-function=configure_cdh_hadoop
        
        # For EC2 set AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables.
        whirr.provider=aws-ec2
        whirr.identity=${env:AWS_ACCESS_KEY_ID}
        whirr.credential=${env:AWS_SECRET_ACCESS_KEY}
        

        It seems to be working fine. The cluster is running Hadoop 0.20.2-cdh3u2.

        I'm now trying to tweak the tests as needed.
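
        (Not part of the original comment: a recipe like the one above is normally driven with the Whirr CLI; the file name below is made up for illustration.)

        bin/whirr launch-cluster --config cdh-asavu.properties
        # ...check HDFS and run a job through the proxy...
        bin/whirr destroy-cluster --config cdh-asavu.properties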

        Andrei Savu added a comment -

        Tried the pom.xml version upgrade Alex suggested - the test is still failing.

        Andrei Savu added a comment -

        The CDH HBase integration test is passing as expected, and that one also covers Hadoop and ZooKeeper. Could this be caused by the fact that services/cdh depends on services/hadoop? A classpath issue? Broken local configs?

        Andrei Savu added a comment -

        Karel, any update? What have you tried so far? (I want to avoid overlap.)

        Andrei Savu added a comment -

        Possible root cause: the job tracker is not starting as expected - HDFS seems to be working fine.
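
        (Not part of the original comment: a quick hedged check on the jobtracker node would be something like the following; <jobtracker-host> is a placeholder.)

        ssh <jobtracker-host> 'sudo jps'           # should list a JobTracker process
        curl -sI http://<jobtracker-host>:50030/   # MapReduce admin UI, if the security group allows it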

        Andrei Savu added a comment -

        Here are some relevant bits from the job tracker log file:

        2011-11-18 12:36:01,364 INFO org.apache.hadoop.mapred.JobTracker: Creating the system directory
        2011-11-18 12:36:01,381 WARN org.apache.hadoop.mapred.JobTracker: Failed to operate on mapred.system.dir (hdfs://ec2-107-20-78-76.compute-1.amazonaws.com/hadoop/system/mapred) 
        because of permissions.
        2011-11-18 12:36:01,381 WARN org.apache.hadoop.mapred.JobTracker: This directory should be owned by the user 'mapred'
        2011-11-18 12:36:01,382 WARN org.apache.hadoop.mapred.JobTracker: Bailing out ...
        org.apache.hadoop.security.AccessControlException: org.apache.hadoop.security.AccessControlException: Permission denied: user=mapred, access=WRITE, inode="":hdfs:supergroup:rwx
        r-xr-x
                at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
                at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
                at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
                at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
                at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:95)
                at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:57)
                at org.apache.hadoop.hdfs.DFSClient.mkdirs(DFSClient.java:1004)
                at org.apache.hadoop.hdfs.DistributedFileSystem.mkdirs(DistributedFileSystem.java:342)
                at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:1226)
                at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:364)
        

        What is strange is that I have been able to start the job tracker as expected over SSH:
        /etc/init.d/hadoop-0.20-jobtracker start

        I'm now reviewing the configuration script on the remote machine.
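
        (Not part of the original comment: the conventional manual workaround for this kind of mapred.system.dir permission error on CDH is to create the directory as the hdfs superuser and hand it to the mapred user, roughly as sketched below; the path comes from the log above.)

        sudo -u hdfs hadoop fs -mkdir /hadoop/system/mapred
        sudo -u hdfs hadoop fs -chown mapred /hadoop/system/mapred
        sudo /etc/init.d/hadoop-0.20-jobtracker restart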

        Andrei Savu added a comment -

        Finally I found the root cause. It seems that in this case the order of the roles is critical, so hadoop-jobtracker+hadoop-namenode != hadoop-namenode+hadoop-jobtracker; this happens because the configure_cdh_hadoop script also installs the services before starting them. I will update the patch in WHIRR-415 and plan a more permanent fix for 0.8.0.
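
        (Not part of the original comment: an illustration of the two orderings contrasted above, expressed as Whirr CLI invocations; the config file name is made up, and which ordering the failing test used is an inference.)

        # ordering that worked in the manual run earlier in this thread
        bin/whirr launch-cluster --config cdh.properties \
          --instance-templates='1 hadoop-namenode+hadoop-jobtracker, 1 hadoop-datanode+hadoop-tasktracker'
        # roles swapped in the first template -- presumably the ordering that tripped configure_cdh_hadoop
        bin/whirr launch-cluster --config cdh.properties \
          --instance-templates='1 hadoop-jobtracker+hadoop-namenode, 1 hadoop-datanode+hadoop-tasktracker'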

        Andrei Savu added a comment -

        Closing this as no longer a problem. The patch attached to WHIRR-415 solves it.

        Alex Heneveld added a comment -

        Excellent. Well found!

        Karel Vervaeke added a comment -

        Great! Sorry about being slow on my responses. I was looking in the wrong direction anyway...

        Andrew Bayer added a comment -

        FYI, this is still happening with 0.8.0 (a.k.a. trunk).


          People

          • Assignee: Andrei Savu
          • Reporter: Alex Heneveld