HBase
HBASE-2256

Delete of a row, followed quickly by a put to the same row, will sometimes fail.

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.20.3
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      Doing a Delete of a whole row, followed immediately by a put to that row, will sometimes miss a cell. Attached is a test that provokes the issue.

          Activity

          stack added a comment -

          He Liangliang I like this notion (it is related a bit to HBASE-8927). The call to currentTimeMillis is done frequently. I think we'd combine this change with an attempt to remove as many calls to currentTimeMillis as possible. We might replace the currentTimeMillis calls that are for timing with nanoTime calls instead, and use this new class for the Cell version.

          He Liangliang added a comment -

          I also encountered a similar problem. What about this solution?

          public class IncrementingWallTimeEnvironmentEdge implements EnvironmentEdge {
            // Logical clock: wall time shifted left so that many distinct
            // values fit inside one real millisecond.
            private long clock = -1;

            @Override
            public long currentTimeMillis() {
              // Scale wall time by 2^10 (~microsecond resolution); the shift
              // amount is an arbitrary scaling factor.
              long wallTime = System.currentTimeMillis() << 10;

              synchronized (this) {
                if (clock < wallTime) {
                  clock = wallTime;
                }
                return clock++;
              }
            }
          }
          

          This would solve the problem and keep the timestamp aligned with the wall clock in milliseconds, as long as we set the scaling factor to a large enough number (i.e. make sure the logical clock advances more slowly than System.currentTimeMillis() does at that scale). A shift factor of 10-20 (1M-1G qps) is a proper value for current server configurations, and it does not introduce a wrap-around concern (584M down to ~0.57M years before overflow).
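The monotonicity claim above can be checked in isolation. The sketch below is an editor's illustration, not part of any patch; the class and method names are invented. It reimplements the shifted-wall-time logical clock and verifies that back-to-back readings are strictly increasing:

```java
// Editor's illustration of the shifted-wall-time logical clock proposed
// above (class/method names invented here, not from any HBase patch).
public class MonotonicClockDemo {
    private long clock = -1;

    // Same idea as the proposed currentTimeMillis(): scale wall time by
    // 2^10 so up to ~1024 distinct timestamps fit in one real millisecond,
    // then hand out strictly increasing values from that base.
    public synchronized long next() {
        long wallTime = System.currentTimeMillis() << 10;
        if (clock < wallTime) {
            clock = wallTime;
        }
        return clock++;
    }

    // True iff n back-to-back readings are strictly increasing.
    public static boolean strictlyIncreasing(int n) {
        MonotonicClockDemo c = new MonotonicClockDemo();
        long prev = c.next();
        for (int i = 1; i < n; i++) {
            long now = c.next();
            if (now <= prev) {
                return false;
            }
            prev = now;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(strictlyIncreasing(1_000_000)); // true
    }
}
```

Because the counter is only reset upward (when wall time overtakes it), the sequence stays strictly increasing even when calls outpace the scaled wall clock.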

          Liang Xie added a comment -

          Please refer to https://issues.apache.org/jira/browse/HBASE-8721 for our fix.
          Liang Xie added a comment -

          We (XiaoMi) fixed this issue by introducing a ScanDeleteTrackerWithMVCC.
          My workmate Honghua Feng will upload a patch soon.

          Jonathan Gray added a comment -

          I think this would be a hacky non-solution, regardless of whether it's epoch nanos or not.

          Ted Yu added a comment -
          long l = System.nanoTime();
          long l2 = System.currentTimeMillis();

          Looking at the values of l (1302209826865074000) and l2 (1302209826865), nanoTime is aligned with the time in millis.
          Assuming nano and milli timestamps correlate, we can devise a (correction) mechanism in the master and region servers such that the (corrected) nano timestamp reflects the actual millisecond timestamp.
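One way to read the "correction mechanism" idea above is to anchor nanoTime's arbitrary origin to wall-clock time once at startup. This is an editor's sketch with invented names; it also ignores drift between the two clock sources:

```java
// Editor's sketch: anchor System.nanoTime() (arbitrary origin) to the wall
// clock once, so later readings yield an absolute nanosecond timestamp
// whose millisecond part tracks System.currentTimeMillis().
public class CorrectedNanoClock {
    // Offset captured once: wall time in ns minus the raw nano reading.
    private final long offsetNs =
        System.currentTimeMillis() * 1_000_000L - System.nanoTime();

    // Absolute (epoch-anchored) nanosecond timestamp.
    public long correctedNanos() {
        return System.nanoTime() + offsetNs;
    }

    public static void main(String[] args) {
        CorrectedNanoClock clock = new CorrectedNanoClock();
        long driftMs = clock.correctedNanos() / 1_000_000L
                       - System.currentTimeMillis();
        System.out.println(Math.abs(driftMs) < 50); // stays near wall time
    }
}
```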

          Todd Lipcon added a comment -

          More importantly, nanoTime is elapsed time since some arbitrary system-local reference point, and has no meaning on an absolute scale.

          Ted Yu added a comment -

          From http://download.oracle.com/javase/1.5.0/docs/api/java/lang/System.html#nanoTime%28%29 :
          Differences in successive calls that span greater than approximately 292 years (2^63 nanoseconds) will not accurately compute elapsed time due to numerical overflow.

          ryan rawson added a comment -

          Probably can't use nanoTime; it wraps around too frequently.

          Ted Yu added a comment -

          Currently the timestamp for Put and Delete is in milliseconds.
          If we could use System.nanoTime(), the chance of this issue happening would be very low.

          stack added a comment -

          @Nathaniel This should be fixed as a by-product of HBASE-2856.

          Nathaniel Cook added a comment -

          We ran into this problem recently in our production code. A single HBase client needed to first clean up several columns by deleting them, and then put a subset of those columns back with new values. Frequently the delete and the put would happen in the same millisecond, thus masking the put. For now we have implemented a fix on our side, but it would be nice to see a real fix for this, where the region servers handle it more gracefully.

          Maybe we could log some warnings when this occurs for possible easier debugging? This was an extremely difficult problem to find.

          Or maybe someone has a clever solution?

          Kevin Peterson added a comment -

          What if I could do something like this:

          Put put1 = ...
          HTable.put(put1);
          Delete delete = new Delete(...).guaranteeAfter(put1);
          HTable.delete(delete);
          Put put2 = new Put(...).guaranteeAfter(delete);
          HTable.put(put2);

          It seems like the distributed case isn't a problem since it's so unlikely, but the delete-then-put case seems more plausible. We could set the timestamp to 1 ms after the delete if needed. The occasional write would get a timestamp a few ms in the future, which doesn't seem that bad. I think this satisfies Clint's requirement of seeing a correct view without explicitly messing with timestamps.
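guaranteeAfter(...) is not an existing HBase client API; as a rough illustration of the ordering rule it proposes, the timestamp arithmetic can be modeled on its own (editor's sketch, hypothetical helper names):

```java
// Editor's sketch of the guaranteeAfter(...) idea from the comment above.
// Each mutation is modeled as just its timestamp: a mutation that must be
// ordered after another takes max(now, previous + 1) as its timestamp, so
// it is strictly later even when both fall in the same real millisecond.
public class GuaranteeAfterDemo {
    // Pick a timestamp strictly after the given one, preferring wall time.
    public static long timestampAfter(long previousTs) {
        return Math.max(System.currentTimeMillis(), previousTs + 1);
    }

    public static void main(String[] args) {
        long put1 = System.currentTimeMillis();
        long delete = timestampAfter(put1);  // delete ordered after put1
        long put2 = timestampAfter(delete);  // put2 ordered after delete
        System.out.println(put1 < delete && delete < put2); // true
    }
}
```

As the comment notes, when operations outpace the wall clock the assigned timestamp drifts at most a few ms into the future, then snaps back to wall time.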

          ryan rawson added a comment -

          There is a millisecond resolution, and it might be difficult to do better without changing the storage format so we can get nanos in there.

          But still, for most people, doing a put - delete - put all within 1 millisecond is not common. Maybe it might be possible to change something so we don't have to run up against this issue?

          Clint Morgan added a comment -

          I have only noticed this in unit tests and a non-distributed setup. However, this Delete/Put happens in ITHbase's IndexRegion, which means that even in a distributed setup the client for the put/delete and the regionserver handling them could be in the same JVM.

          For the put, put case, it seems to me this could be a real issue. Even in a distributed setup, sequential puts could happen in the same ms, no? However, I did a similar test for Put after Put and it seems to always work. If it did not, I'm sure users would have complained loudly by now.

          From my point of view, it would be nice to have this behavior for Put after Delete as well.

          I'm not saying I need finer granularity than ms. Just that when I'm never explicitly messing with timestamps, I always see a "correct" view that reflects my last operation. I just skimmed over the Bigtable paper and could not find an explanation of what they do in this case.

          Jean-Daniel Cryans added a comment -

          > Yes, I'm sure that the millisecond nature of timestamps comes into play here. However, I'm not setting any timestamps, and was under the impression that hbase would always reflect the state of the last operation done. Is this not a valid assumption?

          Do you see this delete problem even on a fully distributed setup? In my experience, they only happen in unit tests where all components are in the same JVM, whereas when the network is involved some milliseconds will separate two consecutive operations.

          > A related question. If I do two puts (w/latest timestamp), am I guaranteed to see the last one? I'm sure many users operate under this assumption.

          If they have the same timestamp, there's no guarantee.

          > So, for a given row, I'm doing a delete of an entire row, then a put of two cells in different families. Then I do a get.

          See my first comment: it's OK when not done in unit tests. As the Bigtable paper says, if you need finer granularity than a millisecond you may need to redefine the timestamps (using something like microseconds).

          Clint Morgan added a comment -

          Yes, I'm sure that the millisecond nature of timestamps comes into play here. However, I'm not setting any timestamps, and was under the impression that hbase would always reflect the state of the last operation done. Is this not a valid assumption?

          A related question. If I do two puts (w/latest timestamp), am I guaranteed to see the last one? I'm sure many users operate under this assumption.

          So, for a given row, I'm doing a delete of an entire row, then a put of two cells in different families. Then I do a get.

          Most times I'll see all of the latest put. Sometimes I see nothing for that row. Sometimes I see just one family/cell from the previous put. It seems that with 0.20.2 I would either get all of the row, or none. However, with 0.20.3 I will see all or just one family. Returning just one family seems even more wrong.

          So suppose I wish to wipe the existing cells of a row, then write some new cells. Is the only way to do this reliably to put a 1 ms pause between the delete and the put? That would hurt my throughput...

          ryan rawson added a comment -

          Isn't this due to the millisecond-timestamp nature of our puts and delete markers? If the next put has the same millisecond TS as an existing delete record, it will be masked by the previous delete.
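The masking rule described above can be modeled in a few lines. This is an editor's toy model, not HBase's actual ScanDeleteTracker: a put is visible only if its timestamp is strictly greater than every covering delete marker, which is exactly why a same-millisecond put disappears:

```java
import java.util.ArrayList;
import java.util.List;

// Editor's toy model of delete-marker masking for one cell: a delete
// marker at timestamp T masks every put with timestamp <= T.
public class DeleteMaskDemo {
    private final List<Long> deleteMarkers = new ArrayList<>();

    void delete(long ts) {
        deleteMarkers.add(ts);
    }

    // A put is visible only if it is strictly newer than all delete markers.
    boolean isVisible(long putTs) {
        for (long d : deleteMarkers) {
            if (putTs <= d) {
                return false; // masked by the delete marker
            }
        }
        return true;
    }

    public static void main(String[] args) {
        DeleteMaskDemo cell = new DeleteMaskDemo();
        long t = 1000L;
        cell.delete(t);
        System.out.println(cell.isVisible(t));     // false: same-ms put is masked
        System.out.println(cell.isVisible(t + 1)); // true: later put survives
    }
}
```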


            People

            • Assignee: Unassigned
            • Reporter: Clint Morgan
            • Votes: 1
            • Watchers: 12
