  1. Apache Whirr (retired)
  2. WHIRR-167

Improve bootstrapping and configuration to be able to isolate and repair or evict failing nodes on EC2

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.4.0
    • Fix Version/s: 0.4.0
    • Component/s: None
    • Labels:
      None
    • Environment:

      Amazon EC2

      Description

      The cluster startup process on Amazon EC2 instances is currently very unstable. As the number of nodes to start increases, startup fails more often, but sometimes even a 2-3 node startup fails. We don't know how many instance startups are in progress at the same time on the Amazon side, either when startup fails or when it succeeds. The only thing I see is that when I start around 10 nodes, the rate of failing nodes is higher than with a smaller number of nodes, and it is not directly proportional to the number of nodes: the probability that some nodes fail looks exponentially higher.
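
      As a rough baseline for the observation above, the sketch below assumes that each node fails to start independently with the same probability p. Under that assumption, a launch that bails out on any single failure fails with probability 1 - (1 - p)^n. The class name, the method name, and the 5% failure rate are illustrative assumptions, not measurements from EC2. With p = 0.05 this gives roughly 14% for 3 nodes and roughly 40% for 10 nodes. If real failures grow faster than this baseline, they are probably not independent.

```java
// Baseline sketch only: assumes every node fails independently with probability p.
final class StartupFailureOdds {

    /** Probability that at least one of n nodes fails to start. */
    static double anyNodeFails(double p, int n) {
        if (p < 0.0 || p > 1.0 || n < 0) {
            throw new IllegalArgumentException("need 0 <= p <= 1 and n >= 0");
        }
        // All n nodes succeed with probability (1 - p)^n; take the complement.
        return 1.0 - Math.pow(1.0 - p, n);
    }

    public static void main(String[] args) {
        double p = 0.05; // hypothetical per-node failure rate, not a measured value
        for (int n : new int[] {3, 10, 20}) {
            System.out.printf("n=%d -> %.3f%n", n, anyNodeFails(p, n));
        }
    }
}
```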

      Looking into BootstrapClusterAction.java, there is a note "// TODO: Check for RunNodesException and don't bail out if only a few " which indicates that the current startup process is unreliable. So we should improve it.

      We could add a "max percent failure" property (per instance template), so that if the number of failures exceeds this value, the whole cluster fails to launch and is shut down. For the master node the value would be 100%, but for datanodes it would be more like 75%. (Tom White also mentioned this in an email.)
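
      The check described above could be sketched as follows. This is a minimal illustration, not Whirr code: the class, its methods, and the template names are hypothetical. It reads the property literally as a ceiling on the share of failed nodes, so the thresholds in the example (0 for a template that must come up completely, 25 for one that can lose a quarter of its nodes) are assumptions chosen for that reading.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; none of these names come from the Whirr code base.
final class MaxPercentFailurePolicy {

    // Allowed failure ceiling per instance template, in percent.
    private final Map<String, Double> maxPercentFailure = new LinkedHashMap<>();

    void allow(String template, double percent) {
        if (percent < 0.0 || percent > 100.0) {
            throw new IllegalArgumentException("percent must be in [0, 100]");
        }
        maxPercentFailure.put(template, percent);
    }

    /** True when the template lost a larger share of its nodes than its ceiling allows. */
    boolean exceeded(String template, int requested, int failed) {
        if (requested <= 0 || failed < 0 || failed > requested) {
            throw new IllegalArgumentException("need 0 <= failed <= requested, requested > 0");
        }
        // Templates without an explicit ceiling tolerate no failures.
        double ceiling = maxPercentFailure.getOrDefault(template, 0.0);
        return 100.0 * failed / requested > ceiling;
    }

    public static void main(String[] args) {
        MaxPercentFailurePolicy policy = new MaxPercentFailurePolicy();
        policy.allow("master", 0.0);     // illustrative: the master must start
        policy.allow("datanode", 25.0);  // illustrative: up to a quarter may fail
        System.out.println(policy.exceeded("master", 1, 0));    // false
        System.out.println(policy.exceeded("datanode", 10, 2)); // false: 20% is within 25%
        System.out.println(policy.exceeded("datanode", 10, 3)); // true: 30% is above 25%
    }
}
```

      A caller would shut the whole cluster down as soon as exceeded(...) returns true for any template, and otherwise continue with the nodes that did start.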

      Let's discuss whether there are any other requirements for this improvement.

        Attachments

        1. whirr.log
          5 kB
          Tibor Kiss
        2. whirr-167-1.patch
          16 kB
          Tibor Kiss
        3. whirr-167-2.patch
          50 kB
          Tibor Kiss
        4. whirr-167-3.patch
          53 kB
          Tibor Kiss
        5. whirr-integrationtest.tar.gz
          5 kB
          Tibor Kiss
        6. whirr-167-4.patch
          58 kB
          Tibor Kiss
        7. whirr-167-5.patch
          60 kB
          Tibor Kiss
        8. whirr-167-7.patch
          64 kB
          Tibor Kiss
        9. whirr.log
          35 kB
          Tibor Kiss
        10. WHIRR-167-cleanup.patch
          3 kB
          Adrian Cole

          Activity

            People

            • Assignee:
              tibor.kiss Tibor Kiss
              Reporter:
              tibor.kiss Tibor Kiss
            • Votes:
              0
              Watchers:
              1

              Dates

              • Created:
                Updated:
                Resolved: