Details

    • Hadoop Flags:
      Reviewed

      Description

      Sometimes the clone operation from the hbase shell can hang. The table has been created (it shows up in the web ui), but does not have any entries in META.

      There don't seem to be any clone, snapshot, enable, or disable operations in the master's jstack.

      Here's a trace from the HBaseAdmin:

      "main" prio=10 tid=0x00007f782800d000 nid=0x25c waiting on condition [0x00007f782f9bf000]
         java.lang.Thread.State: TIMED_WAITING (sleeping)
              at java.lang.Thread.sleep(Native Method)
              at org.apache.hadoop.hbase.client.HBaseAdmin.cloneSnapshot(HBaseAdmin.java:2413)
              at org.apache.hadoop.hbase.client.HBaseAdmin.cloneSnapshot(HBaseAdmin.java:2393)
              at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
              at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
              at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
              at java.lang.reflect.Method.invoke(Method.java:597)
              at org.jruby.javasupport.JavaMethod.invokeDirectWithExceptionHandling(JavaMethod.java:465)
              at org.jruby.javasupport.JavaMethod.invokeDirect(JavaMethod.java:323)
              at org.jruby.java.invokers.InstanceMethodInvoker.call(InstanceMethodInvoker.java:69)
              at org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:201)
              at org.jruby.ast.CallTwoArgNode.interpret(CallTwoArgNode.java:59)
              at org.jruby.ast.NewlineNode.interpret(NewlineNode.java:104)
      ... (more jruby stack) ... 
      
      1. hbase-7352.v3.patch
        2 kB
        Jonathan Hsieh
      2. HBASE-7352-v0.patch
        0.8 kB
        Matteo Bertozzi
      3. HBASE-7352-v1.patch
        2 kB
        Matteo Bertozzi
      4. HBASE-7352-v2.patch
        2 kB
        Matteo Bertozzi

        Activity

        jmhsieh Jonathan Hsieh added a comment - - edited

        The snapshot can be cloned from another shell instance but must write to a different name.

        If you attempt to clone the snapshot (pe-11) to the same target table (pe-11-table), you get:

        hbase(main):006:0> clone_snapshot 'pe-11', 'pe-11-table'
        
        ERROR: org.apache.hadoop.hbase.snapshot.exception.RestoreSnapshotException: org.apache.hadoop.hbase.snapshot.exception.RestoreSnapshotException: Couldn't clone the snapshot=name: "pe-11"
        table: "TestTable" 
        creationTime: 1355441918484
        type: FLUSH
        version: 0
         on table=pe-11-table
                at org.apache.hadoop.hbase.master.snapshot.manage.SnapshotManager.cloneSnapshot(SnapshotManager.java:558)
                at org.apache.hadoop.hbase.master.snapshot.manage.SnapshotManager.restoreSnapshot(SnapshotManager.java:597)
                at org.apache.hadoop.hbase.master.HMaster.restoreSnapshot(HMaster.java:2528)
                at sun.reflect.GeneratedMethodAccessor59.invoke(Unknown Source)
                at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
                at java.lang.reflect.Method.invoke(Method.java:597)
                at org.apache.hadoop.hbase.ipc.ProtobufRpcEngine$Server.call(ProtobufRpcEngine.java:356)
                at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1816)
        Caused by: org.apache.hadoop.hbase.TableExistsException: pe-11-table
                at org.apache.hadoop.hbase.master.handler.CreateTableHandler.<init>(CreateTableHandler.java:96)
                at org.apache.hadoop.hbase.master.snapshot.CloneSnapshotHandler.<init>(CloneSnapshotHandler.java:65)
                at org.apache.hadoop.hbase.master.snapshot.manage.SnapshotManager.cloneSnapshot(SnapshotManager.java:551)
        
                at org.apache.hadoop.hbase.master.snapshot.manage.SnapshotManager.cloneSnapshot(SnapshotManager.java:558)
                at org.apache.hadoop.hbase.master.snapshot.manage.SnapshotManager.restoreSnapshot(SnapshotManager.java:597)
                at org.apache.hadoop.hbase.master.HMaster.restoreSnapshot(HMaster.java:2528)
                at sun.reflect.GeneratedMethodAccessor59.invoke(Unknown Source)
                at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
                at java.lang.reflect.Method.invoke(Method.java:597)
                at org.apache.hadoop.hbase.ipc.ProtobufRpcEngine$Server.call(ProtobufRpcEngine.java:356)
                at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1816)
        Caused by: org.apache.hadoop.hbase.TableExistsException: pe-11-table
                at org.apache.hadoop.hbase.master.handler.CreateTableHandler.<init>(CreateTableHandler.java:96)
                at org.apache.hadoop.hbase.master.snapshot.CloneSnapshotHandler.<init>(CloneSnapshotHandler.java:65)
                at org.apache.hadoop.hbase.master.snapshot.manage.SnapshotManager.cloneSnapshot(SnapshotManager.java:551)
                ... 7 more
        

        The dir seems to be present for table pe-11, but a large number of files seem to be missing.

        In this particular case, the snapshot has 16 regions, while the failed attempt to restore has 12 regions moved into real table position. This suggests that something failed internally but was allowed to continue to do the dir rename at some point.

        If one removes the bad data from HDFS, we still cannot clone to the same pe-11-table name because some in-memory state blocks this.

        (grammar fixes)

        yuzhihong@gmail.com Ted Yu added a comment -

        something failed internally but was allowed to continue to do the dir rename at some point.

        This is not desirable. Renaming the directory should only take place after all regions are copied.
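
As an illustration of that ordering, here is a minimal sketch on a local filesystem using java.nio.file. It is not the HBase implementation: the class name StagedClone, the methods cloneRegions and copyRegion, and the directory layout are all hypothetical. Regions are copied into a staging directory, the staged count is checked against the expected count, and only then is the staging directory renamed into the final table location.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Stream;

public class StagedClone {
    // Hypothetical helper: copies the entries of one region directory (one level deep).
    static void copyRegion(Path src, Path dst) throws IOException {
        Files.createDirectory(dst);
        try (Stream<Path> files = Files.list(src)) {
            for (Path f : (Iterable<Path>) files::iterator) {
                Files.copy(f, dst.resolve(f.getFileName()));
            }
        }
    }

    // Stages every region under tmpDir; renames to tableDir only if all of them arrived.
    static void cloneRegions(List<Path> regionDirs, Path tmpDir, Path tableDir)
            throws IOException {
        Files.createDirectories(tmpDir);
        for (Path region : regionDirs) {
            // Any failure here propagates, so the rename below is never reached.
            copyRegion(region, tmpDir.resolve(region.getFileName()));
        }
        long staged;
        try (Stream<Path> s = Files.list(tmpDir)) {
            staged = s.count();
        }
        if (staged != regionDirs.size()) {
            throw new IOException("Staged " + staged + " of " + regionDirs.size()
                + " regions; not renaming");
        }
        // The table directory becomes visible only after every region is in place.
        Files.move(tmpDir, tableDir, StandardCopyOption.ATOMIC_MOVE);
    }

    public static void main(String[] args) throws IOException {
        Path base = Files.createTempDirectory("staged-clone");
        Path r1 = Files.createDirectory(base.resolve("region-1"));
        Files.write(r1.resolve("hfile"), new byte[] {1});
        Path table = base.resolve("pe-11-table");
        cloneRegions(Arrays.asList(r1), base.resolve(".tmp"), table);
        System.out.println(Files.exists(table.resolve("region-1").resolve("hfile")));
    }
}
```

With this ordering, a copy that fails partway leaves only the staging directory behind, so a retry against the same table name is not blocked by a half-populated table directory.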

        jmhsieh Jonathan Hsieh added a comment -

        I agree. Hence I have filed this bug.

        mbertozzi Matteo Bertozzi added a comment -

        I've missed the exit condition on retries, so if the table fails during the enable you'll keep spinning in the loop.

        The .tmp stuff is here: HBASE-7389, HBASE-7365

        jmhsieh Jonathan Hsieh added a comment - - edited

        Thanks for the links.

        Shouldn't we throw an exception of some sort if we fail after the max number of tries?

        mbertozzi Matteo Bertozzi added a comment -

        What about extracting the wait loop from enableTable() and using that also for the clone?

        throw new IOException("Unable to enable table " +
          Bytes.toString(tableName));
        

        The enable table code, after N retries, says "Unable to enable table", but this looks like the wrong message: we don't know if the enable has failed or is just slow.

        Change the message to "table not yet enabled", and change the exception type to RetriesExhaustedException?
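
A minimal sketch of what such a shared wait loop could look like, assuming a bounded number of tries and a fixed pause. This is not the code from the attached patches: the class WaitForEnabled, the method waitUntilEnabled, and the nested RetriesExhausted exception (a stand-in for HBase's RetriesExhaustedException) are hypothetical names, and the enabled check is passed in as a plain BooleanSupplier.

```java
import java.io.IOException;
import java.util.function.BooleanSupplier;

public class WaitForEnabled {
    // Hypothetical stand-in for RetriesExhaustedException.
    static class RetriesExhausted extends IOException {
        RetriesExhausted(String msg) {
            super(msg);
        }
    }

    /**
     * Polls isEnabled up to maxTries times, sleeping pauseMs between checks.
     * Returns the number of checks used, or throws once the tries run out.
     */
    static int waitUntilEnabled(String tableName, BooleanSupplier isEnabled,
                                int maxTries, long pauseMs)
            throws IOException, InterruptedException {
        for (int tries = 1; tries <= maxTries; tries++) {
            if (isEnabled.getAsBoolean()) {
                return tries;
            }
            if (tries < maxTries) {
                Thread.sleep(pauseMs);
            }
        }
        // Exit condition: the loop is bounded, so the caller never spins forever.
        // The message avoids claiming the enable failed; it may just be slow.
        throw new RetriesExhausted("Table '" + tableName + "' not yet enabled after "
            + maxTries + " tries");
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        int used = waitUntilEnabled("pe-11-table", () -> ++calls[0] >= 3, 5, 10L);
        System.out.println("enabled after " + used + " checks");
        try {
            waitUntilEnabled("pe-11-table", () -> false, 3, 10L);
        } catch (RetriesExhausted e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Both enableTable() and cloneSnapshot() could call one helper of this shape, so the exit condition and the exception message live in a single place.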

        jmhsieh Jonathan Hsieh added a comment -

        I like it.

        Looking at v1, yeah I agree with the message in the exception being inaccurate. Maybe something closer to what is in the comment to make it more accurate?

        Nit: missing "for"

        +  /**
        +   * Wait the table to be enabled.
        +   
        
        hadoopqa Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12562676/HBASE-7352-v2.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        -1 patch. The patch command could not apply the patch.

        Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/3757//console

        This message is automatically generated.

        yuzhihong@gmail.com Ted Yu added a comment -

        Another nit:

        +   * If the table exceeded the retry period, an exception is thrown.
        

        'table' is not an action. You can say 'enabling table exceeds'

        Looks like you should rebase your patch.

        jmhsieh Jonathan Hsieh added a comment -

        I've made Ted's suggested fix. v3 is what I will commit after the tests run.

        hadoopqa Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12562683/hbase-7352.v3.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        -1 patch. The patch command could not apply the patch.

        Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/3759//console

        This message is automatically generated.

        jmhsieh Jonathan Hsieh added a comment - - edited

        Passed with failures on known flaky tests. Thanks Matteo and Ted. Committing to online and offline snapshots branches.

        stack stack added a comment -

        Marking closed.


          People

          • Assignee:
            mbertozzi Matteo Bertozzi
            Reporter:
            jmhsieh Jonathan Hsieh
          • Votes:
            0
            Watchers:
            5
