HBase / HBASE-3800

If HMaster is started after the NN without starting a DN in HBase 0.90.2, then HMaster cannot start due to AlreadyBeingCreatedException for /hbase/hbase.version

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.90.2
    • Fix Version/s: 0.90.3, 0.92.0
    • Component/s: master
    • Labels: None

      Description

      This reproduces when HMaster is started for the first time while the NN is up but no DN has been started.

      HMaster logs:
      2011-04-19 16:49:09,208 DEBUG org.apache.hadoop.hbase.master.ActiveMasterManager: A master is now available
      2011-04-19 16:49:09,400 WARN org.apache.hadoop.hbase.util.FSUtils: Version file was empty, odd, will try to set it.
      2011-04-19 16:51:09,674 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception: org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /hbase/hbase.version could only be replicated to 0 nodes, instead of 1
      ...........

      2011-04-19 16:51:09,674 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block null bad datanode[0] nodes == null
      2011-04-19 16:51:09,674 WARN org.apache.hadoop.hdfs.DFSClient: Could not get block locations. Source file "/hbase/hbase.version" - Aborting...
      2011-04-19 16:51:09,674 WARN org.apache.hadoop.hbase.util.FSUtils: Unable to create version file at hdfs://C4C1:9000/hbase, retrying: org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /hbase/hbase.version could only be replicated to 0 nodes, instead of 1
      ...........

      2011-04-19 16:56:19,695 WARN org.apache.hadoop.hbase.util.FSUtils: Unable to create version file at hdfs://C4C1:9000/hbase, retrying: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create file /hbase/hbase.version for DFSClient_hb_m_C4C1.site:60000_1303202948768 on client 157.5.100.1 because current leaseholder is trying to recreate file.
      org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create file /hbase/hbase.version for DFSClient_hb_m_C4C1.site:60000_1303202948768 on client 157.5.100.1 because current leaseholder is trying to recreate file.
      at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:1068)
      ....
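
      For context, the sequence above corresponds to a create-and-retry pattern roughly like the following sketch (illustrative Java, not the actual FSUtils.setVersion code; the method name, version string, and sleep interval are assumptions). The first create() aborts client-side because no DataNode can hold the block, but the NameNode keeps the lease on the half-created file, so every subsequent attempt fails with AlreadyBeingCreatedException:

        import java.io.IOException;
        import org.apache.hadoop.fs.FSDataOutputStream;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        // Sketch of the failing pattern (illustrative only).
        static void setVersionNaive(FileSystem fs, Path rootdir)
            throws InterruptedException {
          Path versionFile = new Path(rootdir, "hbase.version");
          while (true) {
            try {
              FSDataOutputStream out = fs.create(versionFile);
              out.writeUTF("7");   // file system version; "7" is illustrative
              out.close();
              return;
            } catch (IOException e) {
              // BUG: the dangling file and its NameNode lease are never cleaned
              // up, so the next fs.create() throws AlreadyBeingCreatedException.
              Thread.sleep(10 * 1000);
            }
          }
        }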

      Attachments

      1. HBASE-3800-trunk.patch
        2 kB
        Andrew Purtell
      2. HBASE-3800-0.90.patch
        2 kB
        Andrew Purtell

        Activity

        Andrew Purtell added a comment -

        The fix is to delete the file in the IOException handler before retry.
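
        A minimal sketch of that fix, continuing the illustrative retry loop from the description (LOG is assumed to be the class's commons-logging Log; see the attached patches for the actual change). Only the IOException handler changes:

          catch (IOException e) {
            LOG.warn("Unable to create version file at " + rootdir + ", retrying", e);
            // Remove the half-created file so the NameNode releases its lease;
            // otherwise every subsequent create() fails with
            // AlreadyBeingCreatedException.
            try {
              fs.delete(versionFile, false);
            } catch (IOException ioe) {
              // best effort; a persistent problem will surface on the next attempt
            }
            Thread.sleep(10 * 1000);
          }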

        Andrew Purtell added a comment -

        We figured this out after looking at how the JobTracker handles this case. Apologies that the change did not make it upstream.

        Todd Lipcon added a comment -

        Duplicates HBASE-3270 maybe?

        gaojinchao added a comment -

        Why was this code deleted? Can you tell me your use case?

        +
        +    // Are there any data nodes up yet?
        +    // Currently the safe mode check falls through if the namenode is up but no
        +    // datanodes have reported in yet.
        +    try {
        +      while (dfs.getDataNodeStats().length == 0) {
        +        LOG.info("Waiting for dfs to come up...");
        +        try {
        +          Thread.sleep(wait);
        +        } catch (InterruptedException e) {
        +          // continue
        +        }
        +      }
        +    } catch (IOException e) {
        +
        +      // getDataNodeStats can fail if superuser privilege is required to run
        +      // the datanode report, just ignore it
        +    }
        +
             // Make sure dfs is not in safe mode
             while (dfs.setSafeMode(FSConstants.SafeModeAction.SAFEMODE_GET)) {
               LOG.info("Waiting for dfs to exit safe mode...");
        Andrew Purtell added a comment -

        The code in question doesn't work with security. getDataNodeStats is a privileged operation.
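
        For reference, a sketch of that failure mode on a secure cluster (the datanode report behind getDataNodeStats is restricted to the HDFS superuser; dfs is assumed to be the DistributedFileSystem handle from the snippet above):

          import org.apache.hadoop.hdfs.protocol.DatanodeInfo;
          import org.apache.hadoop.security.AccessControlException;

          try {
            // Superuser-only on a secure cluster: the hbase user's call fails,
            // and since the removed code swallowed the IOException, the
            // "wait for datanodes" loop never actually waited there.
            DatanodeInfo[] live = dfs.getDataNodeStats();
            LOG.info("Datanodes reporting: " + live.length);
          } catch (AccessControlException ace) {
            // expected for non-superuser callers when security is enabled
          }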

        gaojinchao added a comment -

        OK, I would have made the same mistake if I had written the patch.
        Thanks.


          People

          • Assignee: Andrew Purtell
          • Reporter: gaojinchao
          • Votes: 0
          • Watchers: 0
