HBase / HBASE-4841

If I call split fast enough, while inserting, rows disappear.

    Details

    • Type: Bug
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      I'll attach a unit test for this. Basically, if you call split while inserting data, you can reach a point where the cluster becomes unstable or rows disappear. The unit test lets you adjust:

      • How many rows
      • How wide the rows are
      • The frequency of the split.

      The default settings crash the unit tests or cause them to fail on my laptop. On my MacBook Air, I could actually turn down the total number of rows and the frequency of the splits, which is surprising. I think this is because the MacBook Air has much better IO than my backup Acer.
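      To make the knobs above concrete, here is a minimal sketch of how the row count and split frequency translate into the number of split requests. It assumes the split condition from the loop quoted later in this thread (`i % NUM_ROWS_BEFORE_SPLIT == 0 && i != 0`). The class name, method names, and the values in `main` are hypothetical; the attached test's real defaults are not shown in this issue.

```java
// Hypothetical helper, not part of the attached test.
final class SplitFrequencySketch {

    /**
     * Counts how many iterations of a loop over rows 0..numberOfRows-1
     * satisfy (i % rowsBeforeSplit == 0 && i != 0), i.e. how many
     * split requests such a loop would issue.
     */
    static int splitCalls(int numberOfRows, int rowsBeforeSplit) {
        int calls = 0;
        for (int i = 0; i < numberOfRows; i++) {
            if (i % rowsBeforeSplit == 0 && i != 0) {
                calls++;
            }
        }
        return calls;
    }

    /** Total cells written when every row has the same number of columns. */
    static long cellsWritten(int numberOfRows, int numberOfCols) {
        return (long) numberOfRows * numberOfCols;
    }

    public static void main(String[] args) {
        // Illustrative values only (assumptions, not the test's defaults).
        System.out.println("split requests: " + splitCalls(10_000, 1_000));
        System.out.println("cells written:  " + cellsWritten(10_000, 10));
    }
}
```

      Lowering the row count or raising the rows-per-split value reduces the number of split requests, which is the adjustment described above.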

      1. 1
        4 kB
        Alex Newman
      2. log2
        510 kB
        Alex Newman
      3. log
        639 kB
        Alex Newman

        Activity

        Alex Newman added a comment -

        Agreed, this test passes, although we might want to have a test like this somewhere. On the other hand, it is pretty high level.

        stack added a comment -

        So we can close this as fixed by 4853?

        Jean-Daniel Cryans added a comment -

        Well, the first log is the region offline issue I described. The second one does have data loss, but as it was from Nov. 21, and provided that it was a recent checkout, then it would be 4853, yeah.

        Alex Newman added a comment -

        @JD, if you look at the logs posted above, you will see that it was failing with

        junit.framework.AssertionFailedError: We are missing some rows

        I'm guessing it was HBASE-4853.

        Jean-Daniel Cryans added a comment -

        Sorry if I'm not being clear.

        So the claim is that this test shows we can lose rows while splitting too fast. I tried the test myself on both 0.92 and trunk multiple times.

        The only times I got errors (not failures, which would mean it's failing an assertion) were when we call admin.split every so often in the test:

            for (int i = 0; i != NUMBER_OF_ROWS; i++) {
              byte[] rowName = Bytes.toBytes(i);
              Put put = new Put(rowName);
              // Build one row with NUMBER_OF_COLS columns.
              for (int j = 0; j != NUMBER_OF_COLS; j++) {
                put.add(CF, String.valueOf(j).getBytes(), Bytes.toBytes(j * i));
              }

              // Request a split every NUM_ROWS_BEFORE_SPLIT rows (never on row 0).
              if (i % NUM_ROWS_BEFORE_SPLIT == 0 && i != 0) {
                admin.split(TABLE_NAME);
                LOG.info("Splitting");
              }

              htable.put(put);

              if (i % NUM_ROWS_BEFORE_OUTPUT == 0 && i != 0) {
                LOG.info("Inserted Row:" + i);
              }
            }

        The problem is that if you call split on a table that has a region offline, the exception bubbles all the way up and, in this case, kills the test. That's why I needed to catch and move forward.

        After this change, the test passes 100% of the time for both 0.92 and trunk.
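        A minimal sketch of the catch-and-move-forward change described above. This is not the actual patch to the attached test. `Splitter` is a hypothetical stand-in for the admin handle, the list stands in for `htable.put(put)`, and the sketch assumes the region offline exception is an `IOException` subtype, since the thread does not name the exact class.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

final class SplitCatchSketch {

    /** Hypothetical stand-in for the admin handle used by the test. */
    interface Splitter {
        void split(String tableName) throws IOException;
    }

    /**
     * Runs the insert loop and requests a split every rowsBeforeSplit rows.
     * A split request that throws is caught so the inserts keep going.
     * Returns the number of split requests that were skipped this way.
     */
    static int insertWithSplits(int numberOfRows, int rowsBeforeSplit,
                                Splitter admin, List<Integer> written) {
        int skipped = 0;
        for (int i = 0; i < numberOfRows; i++) {
            if (i % rowsBeforeSplit == 0 && i != 0) {
                try {
                    admin.split("testTable"); // table name is an assumption
                } catch (IOException e) {
                    // Assumed failure mode: a region is offline. Note it and move on
                    // instead of letting the exception end the whole test.
                    skipped++;
                }
            }
            written.add(i); // stands in for htable.put(put)
        }
        return skipped;
    }

    public static void main(String[] args) {
        List<Integer> written = new ArrayList<>();
        int[] calls = {0};
        // Simulated admin: every second split request fails.
        Splitter flaky = table -> {
            if (++calls[0] % 2 == 0) {
                throw new IOException("region offline (simulated)");
            }
        };
        int skipped = insertWithSplits(10, 2, flaky, written);
        System.out.println("rows written: " + written.size()
            + ", split requests skipped: " + skipped);
    }
}
```

        The point of the sketch is only that a failed split request no longer aborts the insert loop, so every row is still written and the final row-count assertion gets a chance to run.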

        Now what I'm wondering is if the test in Alex's case was failing or "erroring". If the former, it's either an unknown bug or HBASE-4853. If the latter, then it's the issue I saw and there's no data loss.
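        The failing versus "erroring" distinction can be sketched as follows. This is an illustration, not JUnit's own code: it uses plain `java.lang.AssertionError` as a stand-in for an assertion failure, and `Outcome`, `TestBody`, and `classify` are hypothetical names.

```java
import java.io.IOException;

final class OutcomeSketch {

    enum Outcome { PASS, FAILURE, ERROR }

    /** Hypothetical test body that may throw anything. */
    interface TestBody {
        void run() throws Exception;
    }

    /**
     * FAILURE: an assertion did not hold (for example, rows are missing).
     * ERROR:   any other exception escaped the test (for example, a split
     *          request hit an offline region), so the assertions never ran.
     */
    static Outcome classify(TestBody body) {
        try {
            body.run();
            return Outcome.PASS;
        } catch (AssertionError e) {
            return Outcome.FAILURE;
        } catch (Throwable t) {
            return Outcome.ERROR;
        }
    }

    public static void main(String[] args) {
        // Data loss would show up as an assertion failure.
        System.out.println(classify(() -> {
            throw new AssertionError("We are missing some rows");
        })); // prints FAILURE

        // An exception bubbling up from a split shows up as an error.
        System.out.println(classify(() -> {
            throw new IOException("region offline (simulated)");
        })); // prints ERROR
    }
}
```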

        stack added a comment -

        How does that make it so we don't miss rows, J-D? (I can see how it would make splits work, but I'm not sure how it keeps the scanner finding all rows.)

        Jean-Daniel Cryans added a comment -

        The test passes for me 100% of the time on both trunk and 0.92 when I wrap the admin.split in order to catch the region offline exception that it gets sometimes.

        stack added a comment -

        Yeah, that's being fixed over in HBASE-4853.

        ramkrishna.s.vasudevan added a comment -

            LOG.warn("Why is there a raw encodedRegionName in lastSeqWritten? name=" +
                Bytes.toString(encodedRegionName) + ", seqid=" + l);

        is getting repeated in the logs.

        ramkrishna.s.vasudevan made changes -
        Assignee ramkrishna.s.vasudevan [ ram_krish ]
        stack made changes -
        Priority Major [ 3 ] Critical [ 2 ]
        stack added a comment -

        Upped priority

        Alex Newman made changes -
        Attachment log [ 12504665 ]
        Alex Newman added a comment -

        Here is a log of this script taking the HBase server out.

        Alex Newman made changes -
        Attachment log2 [ 12504664 ]
        Alex Newman added a comment -

        Here is a log of the wrong number of rows being returned.

        Alex Newman made changes -
        Description
          Original Value: I'll attach a unit test for this. Basically, if you call split while inserting data, you can reach a point where the cluster becomes unstable or rows disappear.
          New Value: I'll attach a unit test for this. Basically, if you call split while inserting data, you can reach a point where the cluster becomes unstable or rows disappear. The unit test lets you adjust:

          - How many rows
          - How wide the rows are
          - The frequency of the split.

          The default settings crash the unit tests or cause them to fail on my laptop. On my MacBook Air, I could actually turn down the total number of rows and the frequency of the splits, which is surprising. I think this is because the MacBook Air has much better IO than my backup Acer.
        Alex Newman added a comment -

        I realized it may be easier if I post the log for the unit test, rather than requiring you to run it. It's on the way.

        Alex Newman added a comment -

        Since this can cause data loss, it may make sense to increase the priority.

        Alex Newman made changes -
        Field Original Value New Value
        Attachment 1 [ 12504579 ]
        Alex Newman created issue -

          People

          • Assignee: ramkrishna.s.vasudevan
          • Reporter: Alex Newman
          • Votes: 0
          • Watchers: 1
