WHIRR-488: Whirr hangs in certain cases when creating a spot-priced EC2 cluster

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.6.0
    • Fix Version/s: None
    • Component/s: None
    • Labels: None
    • Environment: EC2, creating a CDH Hadoop cluster

      Description

      In about 1 out of every 5-7 attempts, whirr will hang while creating a spot-priced cluster in EC2. The process just sits there, consuming no system resources and writing nothing to stderr or stdout. In each case, some number of cluster nodes are up and running.

      This happened again to me today, and whirr was hung for about 4 hours. As usual, there were a bunch of errors logged while it was trying to create the instances. About 10 minutes in, though, whirr just went radio silent and stayed that way until I killed it.

      I've attached the output; it looks similar to the other instances where whirr has had this problem.
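
      To make the setup concrete, a spot-priced Whirr launch is driven entirely by the cluster properties file. The sketch below is illustrative only: the cluster name, instance templates, hardware, and bid price are placeholders rather than the configuration actually used here, with the spot bid coming from the whirr.aws-ec2-spot-price property.

          # Illustrative Whirr cluster definition for a spot-priced CDH Hadoop cluster.
          # All values are placeholders; only the property names matter for this issue.
          whirr.cluster-name=cdh-spot-example
          whirr.provider=aws-ec2
          whirr.identity=${env:AWS_ACCESS_KEY_ID}
          whirr.credential=${env:AWS_SECRET_ACCESS_KEY}

          # One master plus five workers, all requested as spot instances.
          whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker,5 hadoop-datanode+hadoop-tasktracker

          # The bid that turns the launch into a spot request; this is the code path
          # that hangs intermittently.
          whirr.aws-ec2-spot-price=0.15

          # CDH install/configure functions rather than the Apache tarballs.
          whirr.hadoop.install-function=install_cdh_hadoop
          whirr.hadoop.configure-function=configure_cdh_hadoop

          whirr.hardware-id=m1.large
          whirr.location-id=us-east-1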

      Attachments

      1. whirr.startup.hang.log (7 kB, Evan Pollan)
      2. whirr.jclouds.spotPriceHang.log (50 kB, Evan Pollan)

        Activity

        Andrei Savu added a comment -

        As far as I can see, you are getting an InternalError from Amazon when requesting a bunch of spot instances:

        org.jclouds.aws.AWSResponseException: request POST https://ec2.us-east-1.amazonaws.com/ HTTP/1.1 failed with code 400, error: AWSError{requestId='19b7d163-0c51-4c1d-8447-947beff61dbc', requestToken='null', code='InternalError', message='An internal error has occurred', context='{Response=, Errors=}'}
        

        There isn't much we can do to work around this except fail faster. Thanks for reporting!

        Andrei Savu added a comment -

        Adrian, have you seen this before? Is there a property in jclouds to decrease the wait time on spot requests?
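
        For reference, one candidate is the set of generic jclouds compute-service timeout properties (values in milliseconds) that bound how long jclouds polls for nodes to reach the running state; whether anything spot-specific exists beyond these is unclear. The sketch below shows how they might be lowered; the values are arbitrary, and whether Whirr forwards jclouds.* overrides from the cluster config or they have to be set on the compute context directly is an assumption to verify.

            # Illustrative jclouds timeout overrides (milliseconds; placeholder values).
            # These are the generic compute-service timeouts, not a spot-specific setting.

            # How long jclouds waits for a node to reach the RUNNING state.
            jclouds.compute.timeout.node-running=600000

            # How long jclouds waits for a port (usually SSH/22) to open on a new node.
            jclouds.compute.timeout.port-open=300000

            # How long jclouds waits for bootstrap scripts to complete.
            jclouds.compute.timeout.script-complete=600000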

        Evan Pollan added a comment -

        Here's how this behavior manifests itself in 0.7.0 and the trunk (see whirr.jclouds.spotPriceHang.log).

        It's worth noting that I have a cron job that checks whether whirr is still running the launch-cluster command 30 minutes after it was started. If so, it kills it, waits 5 minutes, and tries to start the cluster again with the same properties. This works ~75% of the time. If it hangs again, I edit the cluster properties to use on-demand pricing, and that works 100% of the time.

        The hangs now happen more than 50% of the time, BTW. Seems like the problem is worse in 0.7.0+...
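
        For anyone hitting the same thing, the on-demand fallback described above amounts to a one-property change in the cluster definition; a minimal sketch, with a placeholder bid value:

            # Spot-priced launch (the intermittently hanging path described in this issue):
            whirr.aws-ec2-spot-price=0.15
            #
            # On-demand fallback: remove or comment out the bid above and leave the rest
            # of the cluster definition unchanged; Whirr then launches regular instances.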


          People

          • Assignee: Unassigned
          • Reporter: Evan Pollan
          • Votes: 0
          • Watchers: 0
