  Kudu / KUDU-1099

Consensus should be more resilient to transient DNS failures

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: Private Beta
    • Fix Version/s: None
    • Component/s: consensus
    • Labels: None

    Description

      When starting up the Kudu cluster on bolt80, I often see a couple of nodes crash with errors like:

      F0902 10:12:03.816699 37993 leader_election.cc:157] Check failed: _s.ok() Bad status: Network error: Unable to resolve address 'e1313.halxg.cloudera.com': Name or service not known
      

      I'm guessing that we produce a DNS storm of some kind at startup, and that this somehow causes spurious "host not found" errors. A transient resolution failure like this shouldn't crash the whole process.
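
      One way to tolerate a transient lookup failure is to retry the resolution a bounded number of times with backoff, and to hand the final error back to the caller rather than aborting on it. The following is a minimal sketch under stated assumptions, not the change that resolved this issue: ResolveResult, Resolver, and ResolveWithRetry are hypothetical names invented for illustration, and the resolver is passed in as a callback so that the retry policy stays independent of any particular DNS API.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <string>
#include <thread>

// Hypothetical result type (not a Kudu class): either a resolved address
// or the last error message, plus the number of attempts that were made.
struct ResolveResult {
  bool ok;
  std::string value;  // Address on success, error text on failure.
  int attempts;
};

// Hypothetical resolver callback: returns true and fills *addr on success,
// otherwise returns false and fills *error.
using Resolver = std::function<bool(const std::string& host,
                                    std::string* addr,
                                    std::string* error)>;

// Retries a failing lookup up to max_attempts times, doubling the sleep
// between attempts. A persistent failure is returned to the caller as a
// non-ok result; it never terminates the process.
ResolveResult ResolveWithRetry(const Resolver& resolve,
                               const std::string& host,
                               int max_attempts,
                               std::chrono::milliseconds initial_backoff) {
  assert(max_attempts > 0);  // Programming error, not a runtime condition.
  std::chrono::milliseconds backoff = initial_backoff;
  std::string error;
  for (int attempt = 1; attempt <= max_attempts; ++attempt) {
    std::string addr;
    if (resolve(host, &addr, &error)) {
      return {true, addr, attempt};
    }
    if (attempt < max_attempts) {
      // Back off before retrying so that many nodes starting at once
      // do not keep hammering the resolver.
      std::this_thread::sleep_for(backoff);
      backoff *= 2;
    }
  }
  return {false, error, max_attempts};
}
```

      The caller decides what a non-ok result means. In a leader election, for example, it could treat the unreachable peer as not having responded instead of failing a fatal check.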

            People

              Assignee: Todd Lipcon (tlipcon)
              Reporter: Todd Lipcon (tlipcon)
              Votes: 0
              Watchers: 1
