Whirr / WHIRR-552

Fix sporadic HBase 0.92 failures

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.9.0
    • Component/s: service/hbase
    • Labels: None

      Description

      Occasionally the HBase 0.92 service fails to start. See WHIRR-525 for a description.

      Attachments

      1. zk.log (13 kB, Tom White)
      2. WHIRR-552.patch (2 kB, Andrew Bayer)
      3. rs.log (33 kB, Tom White)
      4. master.log (37 kB, Tom White)

        Activity

        Andrew Bayer added a comment -

        Switching to HBase 0.92.2 seems to do the trick.

        Amandeep Khurana added a comment -

        I ran into issues where the parent znode was not created because the master had not yet initialized. The RS came up and then died because of that.
        Related issue: https://issues.apache.org/jira/browse/HBASE-5666
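
The startup-ordering problem described above can be sketched as a generic poll-until-ready helper. This is only an illustration, not what Whirr or the attached patch does: `wait_until` and its parameters are hypothetical names, and the readiness check is passed in by the caller.

```python
import time
from typing import Callable


def wait_until(ready: Callable[[], bool], timeout: float = 60.0,
               interval: float = 1.0) -> bool:
    """Poll ready() until it returns True or `timeout` seconds have elapsed.

    Hypothetical helper: `ready` stands in for any check the caller supplies,
    for example "does the HBase parent znode exist yet?".
    """
    deadline = time.monotonic() + timeout
    while True:
        if ready():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A region server wrapper could call `wait_until(parent_znode_exists)` before launching the RS, where `parent_znode_exists` is a caller-provided function (an assumption here) that queries ZooKeeper for the parent znode.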

        Karel Vervaeke added a comment -

        Here's what I got:

        5:17:23 RS starting up, fails to connect to ZK (expected, since ZK had not started yet)
        5:17:42 Master starting up, same kind of failures
        5:17:47 ZK starting
        5:17:47 RS connects to ZK
        5:17:47 Master connects to ZK
        5:17:53 Master: found 1 replicas but expecting no less than 3
        5:17:54 - 5:18:02 Master 'waiting for rs to check in'
        5:17:58 RS: first signs of stopping (aborting, initialization of fs failed, and so on)

        The highlights to me are:
        1) 5:17:53 Master: found 1 replicas but expecting no less than 3
        Wrong value for dfs.replication? Which value does it have on the datanodes?

        2) Why did the RS not check in between 5:17:48 and 5:17:58? There is no activity in the RS logs.

        3) The master logs appear truncated. Is that just part of the log, or does it really end suddenly?
        Memory issues?
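
One way to answer the dfs.replication question on a given node is to read the value straight out of its Hadoop configuration file. The sketch below assumes the standard Hadoop `<configuration>/<property>/<name>/<value>` XML layout; `read_hadoop_property` and the `SAMPLE` document are illustrative, not taken from the attached logs.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def read_hadoop_property(xml_text: str, name: str) -> Optional[str]:
    """Return the value of `name` from Hadoop-style configuration XML, or None if absent."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name", default="").strip() == name:
            value = prop.findtext("value")
            return value.strip() if value is not None else None
    return None


# Made-up sample contents of an hdfs-site.xml, for illustration only.
SAMPLE = """<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>"""

replication = read_hadoop_property(SAMPLE, "dfs.replication")
```

A `None` result means the file does not set the property, in which case Hadoop falls back to its built-in default (3 for dfs.replication).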

        Tom White added a comment -

        Patrick Hunt told me offline that the "Connected to an old server; r-o mode will be unavailable" message is not a problem. Also, it is worth trying telnet rather than nc, since nc sometimes has issues.

        I wonder whether this is a timing issue, in which case simply making the RS wait a bit would help.
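
As an alternative to both telnet and nc, a ZooKeeper liveness probe can be written with a plain socket. This sketch assumes the check in question is ZooKeeper's `ruok` four-letter command (the comment does not say which command was used); `zk_ruok` is a hypothetical name, and 2181 is ZooKeeper's conventional client port.

```python
import socket


def zk_ruok(host: str, port: int = 2181, timeout: float = 5.0) -> bool:
    """Send ZooKeeper's 'ruok' command and return True if the server answers 'imok'."""
    reply = b""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(b"ruok")
            # Read until the 4-byte answer has arrived or the server closes the connection.
            while len(reply) < 4:
                data = sock.recv(4 - len(reply))
                if not data:
                    break
                reply += data
    except OSError:
        # Connection refused, timeout, or reset: treat the server as not ready.
        return False
    return reply == b"imok"
```

A startup script could retry `zk_ruok(host)` for a bounded time before starting the RS. Newer ZooKeeper releases may require four-letter commands to be explicitly enabled, so a `False` result does not always mean the server is down.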

        Tom White added a comment -

        Here are the log files from a run in which the service failed.


          People

          • Assignee: Andrew Bayer
          • Reporter: Tom White