Whirr
  1. Whirr
  2. WHIRR-441

Precondition failure: IndexOutOfBoundsException on cluster setup

    Details

    • Type: Bug Bug
    • Status: Open
    • Priority: Major Major
    • Resolution: Unresolved
    • Affects Version/s: 0.6.0
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None
    • Environment:

      64 bit Amazon linux AMI w/ Cloudera CDH3U2 hadoop/hive stack

      Description

      I was spinning up a 16 node cluster this morning, and, after a series of errors (not uncommon) there was a precondition assertion failure that left the whirr JVM running, but dormant for about 20 minutes. I haven't seen this before using the same cluster config and whirr version, and I'm trying again to see if it's reproducible.

      Here's the error:

      Starting 15 node(s) with roles [hadoop-datanode, hadoop-tasktracker]
      Starting 1 node(s) with roles [hadoop-jobtracker, hadoop-namenode]
      << problem applying options to node(us-east-1/sir-b61d7212):
      org.jclouds.aws.AWSResponseException: request POST https://ec2.us-east-1.amazonaws.com/ HTTP/1.1 failed with code 400, error: AWSError{requestId='9530b126-fae6-43c8-86a4-b7e2a865c8a1', requestToken='null', code='InternalError', message='An internal error has occurred', context='

      {Response=, Errors=}'}
      at org.jclouds.aws.handlers.ParseAWSErrorFromXmlContent.handleError(ParseAWSErrorFromXmlContent.java:74)
      at org.jclouds.http.handlers.DelegatingErrorHandler.handleError(DelegatingErrorHandler.java:69)
      at org.jclouds.http.internal.BaseHttpCommandExecutorService$HttpResponseCallable.shouldContinue(BaseHttpCommandExecutorService.java:200)
      at org.jclouds.http.internal.BaseHttpCommandExecutorService$HttpResponseCallable.call(BaseHttpCommandExecutorService.java:165)
      at org.jclouds.http.internal.BaseHttpCommandExecutorService$HttpResponseCallable.call(BaseHttpCommandExecutorService.java:134)
      at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
      at java.util.concurrent.FutureTask.run(FutureTask.java:166)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
      at java.lang.Thread.run(Thread.java:636)
      << problem applying options to node(us-east-1/sir-d4907012):
      org.jclouds.aws.AWSResponseException: request POST https://ec2.us-east-1.amazonaws.com/ HTTP/1.1 failed with code 400, error: AWSError{requestId='69215a6b-5455-402f-ae0c-aaaca6245cb6', requestToken='null', code='InternalError', message='An internal error has occurred', context='{Response=, Errors=}

      '}
      at org.jclouds.aws.handlers.ParseAWSErrorFromXmlContent.handleError(ParseAWSErrorFromXmlContent.java:74)
      at org.jclouds.http.handlers.DelegatingErrorHandler.handleError(DelegatingErrorHandler.java:69)
      at org.jclouds.http.internal.BaseHttpCommandExecutorService$HttpResponseCallable.shouldContinue(BaseHttpCommandExecutorService.java:200)
      at org.jclouds.http.internal.BaseHttpCommandExecutorService$HttpResponseCallable.call(BaseHttpCommandExecutorService.java:165)
      at org.jclouds.http.internal.BaseHttpCommandExecutorService$HttpResponseCallable.call(BaseHttpCommandExecutorService.java:134)
      at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
      at java.util.concurrent.FutureTask.run(FutureTask.java:166)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
      at java.lang.Thread.run(Thread.java:636)
      Nodes started: [[id=us-east-1/i-6dcec30e, providerId=i-6dcec30e, group=logs-cluster, name=null, location=[id=us-east-1c, scope=ZONE, description=us-east-1c, parent=us-east-1, iso3166Codes=[US-VA], metadata={}], uri=null, imageId=us-east-1/ami-da0cf8b3, os=[name=null, family=ubuntu, version=10.04, arch=paravirtual, is64Bit=true, description=ubuntu-images-us/ubuntu-lucid-10.04-amd64-server-20101020.manifest.xml], state=RUNNING, loginPort=22, hostname=ip-10-196-130-159, privateAddresses=[10.196.130.159], publicAddresses=[204.236.240.255], hardware=[id=m1.xlarge, providerId=m1.xlarge, name=null, processors=[[cores=4.0, speed=2.0]], ram=15360, volumes=[[id=null, type=LOCAL, size=10.0, device=/dev/sda1, durable=false, isBootDevice=true], [id=null, type=LOCAL, size=420.0, device=/dev/sdb, durable=false, isBootDevice=false], [id=null, type=LOCAL, size=420.0, device=/dev/sdc, durable=false, isBootDevice=false], [id=null, type=LOCAL, size=420.0, device=/dev/sdd, durable=false, isBootDevice=false], [id=null, type=LOCAL, size=420.0, device=/dev/sde, durable=false, isBootDevice=false]], supportsImage=And(ALWAYS_TRUE,Or(isWindows(),requiresVirtualizationType(paravirtual)),ALWAYS_TRUE,is64Bit()), tags=[]], loginUser=ubuntu, userMetadata={}, tags=[]]]
      Starting 2 node(s) with roles [hadoop-datanode, hadoop-tasktracker]
      Dying because - net.schmizz.sshj.transport.TransportException: Broken transport; encountered EOF
      Dying because - net.schmizz.sshj.transport.TransportException: Broken transport; encountered EOF
      <<kex done>> woke to: net.schmizz.sshj.transport.TransportException: Broken transport; encountered EOF
      << (ubuntu@50.16.99.93:22) error acquiring SSHClient(ubuntu@50.16.99.93:22): Broken transport; encountered EOF
      net.schmizz.sshj.transport.TransportException: Broken transport; encountered EOF
      at net.schmizz.sshj.transport.Reader.run(Reader.java:70)
      Nodes started: [[id=us-east-1/i-cbc6cba8, providerId=i-cbc6cba8, group=logs-cluster, name=null, location=[id=us-east-1d, scope=ZONE, description=us-east-1d, parent=us-east-1, iso3166Codes=[US-VA], metadata={}], uri=null, imageId=us-east-1/ami-da0cf8b3, os=[name=null, family=ubuntu, version=10.04, arch=paravirtual, is64Bit=true, description=ubuntu-images-us/ubuntu-lucid-10.04-amd64-server-20101020.manifest.xml], state=RUNNING, loginPort=22, hostname=ip-10-120-239-36, privateAddresses=[10.120.239.36], publicAddresses=[50.16.99.93], hardware=[id=m1.xlarge, providerId=m1.xlarge, name=null, processors=[[cores=4.0, speed=2.0]], ram=15360, volumes=[[id=null, type=LOCAL, size=10.0, device=/dev/sda1, durable=false, isBootDevice=true], [id=null, type=LOCAL, size=420.0, device=/dev/sdb, durable=false, isBootDevice=false], [id=null, type=LOCAL, size=420.0, device=/dev/sdc, durable=false, isBootDevice=false], [id=null, type=LOCAL, size=420.0, device=/dev/sdd, durable=false, isBootDevice=false], [id=null, type=LOCAL, size=420.0, device=/dev/sde, durable=false, isBootDevice=false]], supportsImage=And(ALWAYS_TRUE,Or(isWindows(),requiresVirtualizationType(paravirtual)),ALWAYS_TRUE,is64Bit()), tags=[]], loginUser=ubuntu, userMetadata={}, tags=[]], [id=us-east-1/i-cdc6cbae, providerId=i-cdc6cbae, group=logs-cluster, name=null, location=[id=us-east-1d, scope=ZONE, description=us-east-1d, parent=us-east-1, iso3166Codes=[US-VA], metadata={}], uri=null, imageId=us-east-1/ami-da0cf8b3, os=[name=null, family=ubuntu, version=10.04, arch=paravirtual, is64Bit=true, description=ubuntu-images-us/ubuntu-lucid-10.04-amd64-server-20101020.manifest.xml], state=RUNNING, loginPort=22, hostname=ip-10-123-69-36, privateAddresses=[10.123.69.36], publicAddresses=[107.22.69.237], hardware=[id=m1.xlarge, providerId=m1.xlarge, name=null, processors=[[cores=4.0, speed=2.0]], ram=15360, volumes=[[id=null, type=LOCAL, size=10.0, device=/dev/sda1, durable=false, isBootDevice=true], [id=null, type=LOCAL, size=420.0, device=/dev/sdb, durable=false, isBootDevice=false], [id=null, type=LOCAL, size=420.0, device=/dev/sdc, durable=false, isBootDevice=false], [id=null, type=LOCAL, size=420.0, device=/dev/sdd, durable=false, isBootDevice=false], [id=null, type=LOCAL, size=420.0, device=/dev/sde, durable=false, isBootDevice=false]], supportsImage=And(ALWAYS_TRUE,Or(isWindows(),requiresVirtualizationType(paravirtual)),ALWAYS_TRUE,is64Bit()), tags=[]], loginUser=ubuntu, userMetadata={}, tags=[]]]
      Deleting failed node node us-east-1/sir-d4907012
      Deleting failed node node us-east-1/sir-b61d7212
      Deleting failed node node us-east-1/i-23cec340
      Node deleted: us-east-1/sir-b61d7212
      Node deleted: us-east-1/sir-d4907012
      Node deleted: us-east-1/i-23cec340
      Exception in thread "main" java.lang.IndexOutOfBoundsException: index (0) must be less than size (0)
      at com.google.common.base.Preconditions.checkElementIndex(Preconditions.java:301)
      at com.google.common.base.Preconditions.checkElementIndex(Preconditions.java:280)
      at com.google.common.collect.Iterables.get(Iterables.java:649)
      at org.apache.whirr.actions.BootstrapClusterAction$1.apply(BootstrapClusterAction.java:226)
      at org.apache.whirr.actions.BootstrapClusterAction$1.apply(BootstrapClusterAction.java:223)
      at com.google.common.collect.Iterators$8.next(Iterators.java:765)
      at java.util.AbstractCollection.addAll(AbstractCollection.java:322)
      at java.util.LinkedHashSet.<init>(LinkedHashSet.java:169)
      at com.google.common.collect.Sets.newLinkedHashSet(Sets.java:264)
      at org.apache.whirr.actions.BootstrapClusterAction.getInstances(BootstrapClusterAction.java:222)
      at org.apache.whirr.actions.BootstrapClusterAction.doAction(BootstrapClusterAction.java:141)
      at org.apache.whirr.actions.ScriptBasedClusterAction.execute(ScriptBasedClusterAction.java:80)
      at org.apache.whirr.ClusterController.launchCluster(ClusterController.java:106)
      at org.apache.whirr.cli.command.LaunchClusterCommand.run(LaunchClusterCommand.java:62)
      at org.apache.whirr.cli.Main.run(Main.java:64)
      at org.apache.whirr.cli.Main.main(Main.java:97)

      Here's the config file:
      =======================

      whirr.instance-templates=1 hadoop-jobtracker+hadoop-namenode,15 hadoop-datanode+hadoop-tasktracker
      whirr.hadoop.install-function=install_cdh_hadoop
      whirr.hadoop.configure-function=configure_cdh_hadoop
      whirr.provider=aws-ec2

      whirr.cluster-name=logs-cluster
      whirr.identity=ACCESS KEY ID GOES HERE
      whirr.credential=SECRET ACCESS KEY GOES HERE
      whirr.private-key-file=$

      {sys:user.home}

      /.ssh/id_rsa
      whirr.public-key-file=$

      {sys:user.home}

      /.ssh/id_rsa.pub
      whirr.hardware-id=m1.xlarge

      1. Using 64 bit Ubuntu 10.04 to avoid defect https://issues.apache.org/jira/browse/WHIRR-148
        whirr.image-id=us-east-1/ami-da0cf8b3
        whirr.aws-ec2-spot-price=1.00

        Activity

        Hide
        Andrei Savu added a comment -

        Thanks Evan for reporting this. I think you should use options like whirr.instance-templates-minimum-number-of-instances, whirr.max-startup-retries to improve your odds of starting larger size clusters.

        See the configuration guide:
        http://whirr.apache.org/docs/0.6.0/configuration-guide.html

        Show
        Andrei Savu added a comment - Thanks Evan for reporting this. I think you should use options like whirr.instance-templates-minimum-number-of-instances, whirr.max-startup-retries to improve your odds of starting larger size clusters. See the configuration guide: http://whirr.apache.org/docs/0.6.0/configuration-guide.html
        Hide
        Evan Pollan added a comment -

        Just had this happen again, this time with max-retries set to 2, no spot pricing, and a minimum count of datanodes set to 12 (requested 15).

        It spit out the same precondition assertion failure and continued running. So, I sent a QUIT signal to the VM, and here's the dump:

        "DestroyJavaVM" prio=10 tid=0x00007f016400b800 nid=0xdaa waiting on condition [0x0000000000000000]
           java.lang.Thread.State: RUNNABLE
        
        "i/o thread 12" prio=10 tid=0x00007f0130002000 nid=0xdf9 waiting on condition [0x00007f011e2e1000]
           java.lang.Thread.State: WAITING (parking)
                at sun.misc.Unsafe.park(Native Method)
                - parking to wait for  <0x000000008c132a58> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionOb
        ject)
                at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
                at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2
        043)
                at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:386)
                at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1043)
                at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1103)
                at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
                at java.lang.Thread.run(Thread.java:636)
        
        "com.google.inject.internal.util.$Finalizer" daemon prio=10 tid=0x00007f0164623800 nid=0xdb5 in Object.wait() [0x00007f0
        1605a1000]
           java.lang.Thread.State: WAITING (on object monitor)
                at java.lang.Object.wait(Native Method)
                - waiting on <0x000000008b2d1d08> (a java.lang.ref.ReferenceQueue$Lock)
                at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:133)
                - locked <0x000000008b2d1d08> (a java.lang.ref.ReferenceQueue$Lock)
                at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:149)
                at com.google.inject.internal.util.$Finalizer.run(Finalizer.java:114)
        
        "Low Memory Detector" daemon prio=10 tid=0x00007f016409d000 nid=0xdb3 runnable [0x0000000000000000]
           java.lang.Thread.State: RUNNABLE
        
        "CompilerThread1" daemon prio=10 tid=0x00007f016409a800 nid=0xdb2 waiting on condition [0x0000000000000000]
           java.lang.Thread.State: RUNNABLE
        
        "CompilerThread0" daemon prio=10 tid=0x00007f0164097800 nid=0xdb1 waiting on condition [0x0000000000000000]
           java.lang.Thread.State: RUNNABLE
        
        "Signal Dispatcher" daemon prio=10 tid=0x00007f0164089800 nid=0xdb0 waiting on condition [0x0000000000000000]
           java.lang.Thread.State: RUNNABLE
        
        "Finalizer" daemon prio=10 tid=0x00007f0164077800 nid=0xdaf in Object.wait() [0x00007f0160efd000]
           java.lang.Thread.State: WAITING (on object monitor)
                at java.lang.Object.wait(Native Method)
                - waiting on <0x000000008b55de10> (a java.lang.ref.ReferenceQueue$Lock)
                at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:133)
                - locked <0x000000008b55de10> (a java.lang.ref.ReferenceQueue$Lock)
                at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:149)
                at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:177)
        
        
        "Reference Handler" daemon prio=10 tid=0x00007f0164075800 nid=0xdae in Object.wait() [0x00007f0160ffe000]
           java.lang.Thread.State: WAITING (on object monitor)
                at java.lang.Object.wait(Native Method)
                - waiting on <0x000000008b55dda8> (a java.lang.ref.Reference$Lock)
                at java.lang.Object.wait(Object.java:502)
                at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:133)
                - locked <0x000000008b55dda8> (a java.lang.ref.Reference$Lock)
        
        "VM Thread" prio=10 tid=0x00007f016406e800 nid=0xdad runnable 
        
        "GC task thread#0 (ParallelGC)" prio=10 tid=0x00007f0164016800 nid=0xdab runnable 
        
        "GC task thread#1 (ParallelGC)" prio=10 tid=0x00007f0164018000 nid=0xdac runnable 
        
        "VM Periodic Task Thread" prio=10 tid=0x00007f016409f800 nid=0xdb4 waiting on condition 
        
        JNI global references: 1591
        
        Heap
         PSYoungGen      total 222784K, used 11327K [0x00000000d90b0000, 0x00000000e8760000, 0x0000000100000000)
          eden space 219200K, 3% used [0x00000000d90b0000,0x00000000d984d610,0x00000000e66c0000)
          from space 3584K, 98% used [0x00000000e83e0000,0x00000000e8752610,0x00000000e8760000)
          to   space 12608K, 0% used [0x00000000e6ec0000,0x00000000e6ec0000,0x00000000e7b10000)
         PSOldGen        total 79744K, used 16462K [0x000000008b200000, 0x000000008ffe0000, 0x00000000d90b0000)
          object space 79744K, 20% used [0x000000008b200000,0x000000008c213a98,0x000000008ffe0000)
         PSPermGen       total 27776K, used 27765K [0x0000000080c00000, 0x0000000082720000, 0x000000008b200000)
          object space 27776K, 99% used [0x0000000080c00000,0x000000008271d4d8,0x0000000082720000)
        
        Show
        Evan Pollan added a comment - Just had this happen again, this time with max-retries set to 2, no spot pricing, and a minimum count of datanodes set to 12 (requested 15). It spit out the same precondition assertion failure and continued running. So, I sent a QUIT signal to the VM, and here's the dump: "DestroyJavaVM" prio=10 tid=0x00007f016400b800 nid=0xdaa waiting on condition [0x0000000000000000] java.lang. Thread .State: RUNNABLE "i/o thread 12" prio=10 tid=0x00007f0130002000 nid=0xdf9 waiting on condition [0x00007f011e2e1000] java.lang. Thread .State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for <0x000000008c132a58> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionOb ject) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2 043) at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:386) at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1043) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1103) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang. Thread .run( Thread .java:636) "com.google.inject.internal.util.$Finalizer" daemon prio=10 tid=0x00007f0164623800 nid=0xdb5 in Object .wait() [0x00007f0 1605a1000] java.lang. Thread .State: WAITING (on object monitor) at java.lang. Object .wait(Native Method) - waiting on <0x000000008b2d1d08> (a java.lang.ref.ReferenceQueue$Lock) at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:133) - locked <0x000000008b2d1d08> (a java.lang.ref.ReferenceQueue$Lock) at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:149) at com.google.inject.internal.util.$Finalizer.run(Finalizer.java:114) "Low Memory Detector" daemon prio=10 tid=0x00007f016409d000 nid=0xdb3 runnable [0x0000000000000000] java.lang. Thread .State: RUNNABLE "CompilerThread1" daemon prio=10 tid=0x00007f016409a800 nid=0xdb2 waiting on condition [0x0000000000000000] java.lang. Thread .State: RUNNABLE "CompilerThread0" daemon prio=10 tid=0x00007f0164097800 nid=0xdb1 waiting on condition [0x0000000000000000] java.lang. Thread .State: RUNNABLE "Signal Dispatcher" daemon prio=10 tid=0x00007f0164089800 nid=0xdb0 waiting on condition [0x0000000000000000] java.lang. Thread .State: RUNNABLE "Finalizer" daemon prio=10 tid=0x00007f0164077800 nid=0xdaf in Object .wait() [0x00007f0160efd000] java.lang. Thread .State: WAITING (on object monitor) at java.lang. Object .wait(Native Method) - waiting on <0x000000008b55de10> (a java.lang.ref.ReferenceQueue$Lock) at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:133) - locked <0x000000008b55de10> (a java.lang.ref.ReferenceQueue$Lock) at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:149) at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:177) "Reference Handler" daemon prio=10 tid=0x00007f0164075800 nid=0xdae in Object .wait() [0x00007f0160ffe000] java.lang. Thread .State: WAITING (on object monitor) at java.lang. Object .wait(Native Method) - waiting on <0x000000008b55dda8> (a java.lang.ref.Reference$Lock) at java.lang. Object .wait( Object .java:502) at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:133) - locked <0x000000008b55dda8> (a java.lang.ref.Reference$Lock) "VM Thread " prio=10 tid=0x00007f016406e800 nid=0xdad runnable "GC task thread#0 (ParallelGC)" prio=10 tid=0x00007f0164016800 nid=0xdab runnable "GC task thread#1 (ParallelGC)" prio=10 tid=0x00007f0164018000 nid=0xdac runnable "VM Periodic Task Thread " prio=10 tid=0x00007f016409f800 nid=0xdb4 waiting on condition JNI global references: 1591 Heap PSYoungGen total 222784K, used 11327K [0x00000000d90b0000, 0x00000000e8760000, 0x0000000100000000) eden space 219200K, 3% used [0x00000000d90b0000,0x00000000d984d610,0x00000000e66c0000) from space 3584K, 98% used [0x00000000e83e0000,0x00000000e8752610,0x00000000e8760000) to space 12608K, 0% used [0x00000000e6ec0000,0x00000000e6ec0000,0x00000000e7b10000) PSOldGen total 79744K, used 16462K [0x000000008b200000, 0x000000008ffe0000, 0x00000000d90b0000) object space 79744K, 20% used [0x000000008b200000,0x000000008c213a98,0x000000008ffe0000) PSPermGen total 27776K, used 27765K [0x0000000080c00000, 0x0000000082720000, 0x000000008b200000) object space 27776K, 99% used [0x0000000080c00000,0x000000008271d4d8,0x0000000082720000)

          People

          • Assignee:
            Unassigned
            Reporter:
            Evan Pollan
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:

              Development