HBASE-9041

TestFlushSnapshotFromClient.testConcurrentSnapshottingAttempts fails

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.95.2
    • Component/s: snapshots, test
    • Labels:
      None

      Description

      Assigning Matteo to take a look (give back to me if you don't have time boss).

      Failed here: https://builds.apache.org/job/HBase-TRUNK/4293/testReport/org.apache.hadoop.hbase.snapshot/TestFlushSnapshotFromClient/testConcurrentSnapshottingAttempts/

      Yesterday, it failed in a different place and for a different reason: https://builds.apache.org/view/H-L/view/HBase/job/hbase-0.95/352/testReport/junit/org.apache.hadoop.hbase.snapshot/TestFlushSnapshotFromClient/testFlushTableSnapshot/

      The latter test failure was noted on the tail of HBASE-8984. There I speculate that it's the 'load' of 400. I don't think the load reporting is correct. Will dig in on that.

      1. HBASE-9041-v0.patch
        3 kB
        Matteo Bertozzi
      2. uppingrows.txt
        0.7 kB
        stack
      3. lessrows.txt
        3 kB
        stack

        Activity

        Matteo Bertozzi added a comment -

        My guess is there's too much data inserted / too many regions for a busy machine... and you get these timeout / regions-not-yet-online exceptions.

        stack added a comment -

        Any suggestions on making it carry less load? Else we should just disable the test Matteo Bertozzi?

        Matteo Bertozzi added a comment -

        Changing the rowCount to something lower is the first idea: loadData(table, 10000, TEST_FAM);
        but I need some time to look at it, unless you want to try reducing it now.

        stack added a comment -

        Is it supposed to do this in the test Matteo Bertozzi?

        2013-07-25 06:54:13,314 WARN  [Thread-3129] snapshot.TestFlushSnapshotFromClient(504): Got an exception when checking for snapshot ss3
        org.apache.hadoop.hbase.exceptions.UnknownSnapshotException: org.apache.hadoop.hbase.exceptions.UnknownSnapshotException: Snapshot { ss=ss3 table=test2 type=FLUSH } is not currently running or one of the known completed snapshots.
        	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
        	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
        	at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
        	at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:95)
        	at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:79)
        	at org.apache.hadoop.hbase.protobuf.ProtobufUtil.getRemoteException(ProtobufUtil.java:232)
        	at org.apache.hadoop.hbase.client.HBaseAdmin.executeCallable(HBaseAdmin.java:2701)
        	at org.apache.hadoop.hbase.client.HBaseAdmin.execute(HBaseAdmin.java:2670)
        	at org.apache.hadoop.hbase.client.HBaseAdmin.isSnapshotFinished(HBaseAdmin.java:2364)
        	at org.apache.hadoop.hbase.snapshot.TestFlushSnapshotFromClient.testConcurrentSnapshottingAttempts(TestFlushSnapshotFromClient.java:500)
        
        

        And on yesterday's fail, you think it is for the same reason – overloaded?

        This test came in w/ HBASE-7321. I could bother Jonathan Hsieh?

        stack added a comment -

        Matteo Bertozzi Let me try reducing it. Thanks.

        Matteo Bertozzi added a comment -

        The snapshot-related failures are related to the RS (region server) not being ready...

        stack added a comment -

        Write way fewer rows. Passes locally. Let me commit and see how it does.

        stack added a comment -

        Applied to trunk and 0.95. Let's see how it does, Matteo Bertozzi. Thanks boss.

        stack added a comment -

        A bunch failed here: https://builds.apache.org/job/hbase-0.95-on-hadoop2/197/ Different type of fail.

        Matteo Bertozzi added a comment -

        There's only one test failing here, testFlushCreateListDestroy(); the others are a consequence of snapshots not being cleaned up, see HBASE-9058.

        The problem in this case is the previously applied patch, which reduces the data written to the regions, leaving one region without data.
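
        To illustrate how a smaller row count can leave a region without data, here is a minimal sketch. It is not the test's loadData code: the class name, region start keys, and row keys are all made up, and it compares keys as Java strings rather than as the byte arrays HBase uses. It assigns each row to the region with the greatest start key not exceeding the row key and reports regions that got nothing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeMap;

// Hypothetical helper, not part of HBase or of TestFlushSnapshotFromClient.
public class EmptyRegionCheck {

    /** Returns the start keys of regions that received no row. */
    public static List<String> emptyRegions(List<String> regionStartKeys,
                                            List<String> rowKeys) {
        // Map each region start key to the number of rows that fall in it.
        TreeMap<String, Integer> rowsPerRegion = new TreeMap<>();
        for (String start : regionStartKeys) {
            rowsPerRegion.put(start, 0);
        }
        for (String row : rowKeys) {
            // The owning region is the one with the greatest start key <= row.
            String region = rowsPerRegion.floorKey(row);
            if (region != null) {
                rowsPerRegion.merge(region, 1, Integer::sum);
            }
        }
        List<String> empty = new ArrayList<>();
        for (String start : rowsPerRegion.keySet()) {
            if (rowsPerRegion.get(start) == 0) {
                empty.add(start);
            }
        }
        return empty;
    }

    public static void main(String[] args) {
        // Assumed layout: four regions; "" stands for the first region's empty start key.
        List<String> regions = Arrays.asList("", "g", "n", "t");
        // Too few rows: nothing lands in the region starting at "n".
        List<String> rows = Arrays.asList("apple", "banana", "grape", "zebra");
        System.out.println("Regions without data: " + emptyRegions(regions, rows));
    }
}
```

        A check of this kind, run after loading, would flag the situation described above regardless of which row count is chosen.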

        stack added a comment -

        Matteo Bertozzi So I need to do more than 10 rows? It was 10k rows.

        stack added a comment -

        Writing 1000 rows instead of 100.

        stack added a comment -

        Committed to 0.95 and trunk. Let's see how this does.

        stack added a comment -

        Here is another fail: http://54.241.6.143/job/HBase-0.95-Hadoop-2/org.apache.hbase$hbase-server/703/testReport/junit/org.apache.hadoop.hbase.client/TestSnapshotFromClient/testSnapshotDeletionWithRegex/

        Matteo Bertozzi added a comment -

        It's not just an add-a-random-number game;
        there's a bug in the first rows added, let me fix it.

        stack added a comment -

        Matteo Bertozzi Thanks. I am being lazy. I want to work on something else (smile). Just commit your attempted fix. We'll soon know if it works or not.

        Matteo Bertozzi added a comment -

        going to commit v0, that fixes the "ensure 1 row per region" code

        stack added a comment -

        Is this failure related, Matteo Bertozzi? https://builds.apache.org/job/HBase-TRUNK/4306/testReport/junit/org.apache.hadoop.hbase.snapshot/TestFlushSnapshotFromClient/testConcurrentSnapshottingAttempts/

        Matteo Bertozzi added a comment -

        No, that looks like it is just slow; testConcurrentSnapshottingAttempts() has a fixed stop at 60 sec.
        A couple of snapshots are done, the others are not. But they have not failed, so the test keeps waiting for completion until the 60 sec timeout marks the test as failed.
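
        The wait-until-done-or-deadline behaviour described here can be sketched as below. This is a generic illustration, not the test's actual code: the class name, waitFor, and the BooleanSupplier standing in for a "snapshot finished?" check are all hypothetical. A condition that is merely slow looks the same to the caller as one that never completes; both end in a timeout.

```java
import java.util.function.BooleanSupplier;

// Hypothetical sketch; names are illustrative and not taken from HBase.
public class SnapshotWaitSketch {

    /**
     * Polls isDone until it returns true or timeoutMs elapses.
     * Returns true if the condition was met before the deadline,
     * false on timeout or if the waiting thread is interrupted.
     */
    public static boolean waitFor(BooleanSupplier isDone, long timeoutMs, long pollMs) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (true) {
            if (isDone.getAsBoolean()) {
                return true;
            }
            long remainingMs = (deadline - System.nanoTime()) / 1_000_000L;
            if (remainingMs <= 0) {
                return false; // deadline reached: the caller treats this as a failure
            }
            try {
                // Never sleep past the deadline.
                Thread.sleep(Math.min(pollMs, remainingMs));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        // Stand-in for a snapshot that completes after roughly 50 ms.
        boolean finished = waitFor(() -> System.currentTimeMillis() - start >= 50, 1000, 10);
        // Stand-in for a snapshot that is still running when the deadline hits.
        boolean stuck = waitFor(() -> false, 100, 10);
        System.out.println("finished=" + finished + " stuck=" + stuck);
    }
}
```

        With a fixed deadline like this, raising timeoutMs only helps when the work eventually completes; it does not distinguish a slow machine from a genuinely stuck snapshot.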

        stack added a comment -

        Matteo Bertozzi Sounds like I should up the 60s timeouts to 5 minutes all over the code base?

        stack added a comment -

        Matteo Bertozzi Opinion on http://54.241.6.143/job/HBase-0.95-Hadoop-2/org.apache.hbase$hbase-server/720/testReport/org.apache.hadoop.hbase.client/TestSnapshotFromClient/testSnapshotDeletionWithRegex/ ?

        stack added a comment -

        Matteo Bertozzi nvm. let me up the retry from 1 (Pardon my asking you about these tests; they are about snapshotting so thought you'd be interested)

        stack added a comment -

        Doesn't fail anymore after matteo jujitsu. Will open new issue if this comes up again.


          People

          • Assignee:
            Matteo Bertozzi
            Reporter:
            stack
          • Votes:
            0
            Watchers:
            2
