HBase / HBASE-2406

Define semantics of cell timestamps/versions

    Details

    • Type: Task
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.90.0
    • Component/s: documentation
    • Labels:
      None

      Description

      There is a lot of general confusion over the semantics of the cell timestamp. In particular, a couple questions that often come up:

      • If multiple writes to a cell have the same timestamp, are all versions maintained or just the last?
      • Is it OK to write cells in a non-increasing timestamp order?

      Let's discuss, figure out what semantics make sense, and then move towards (a) documentation, (b) unit tests that prove we have those semantics.

      1. 2406.txt
        22 kB
        stack
      2. versions.html
        12 kB
        stack

        Issue Links

          Activity

          Hudson added a comment -

          Integrated in HBase-TRUNK-on-Hadoop-2.0.0 #235 (See https://builds.apache.org/job/HBase-TRUNK-on-Hadoop-2.0.0/235/)
          Remove a stale section citing issues w/ versions fixed by HBASE-2406 – thanks to Sho Shimauchi for turning this one up (Revision 1401970)

          Result = FAILURE
          stack :
          Files :

          • /hbase/trunk/src/docbkx/book.xml
          Hudson added a comment -

          Integrated in HBase-TRUNK #3484 (See https://builds.apache.org/job/HBase-TRUNK/3484/)
          Remove a stale section citing issues w/ versions fixed by HBASE-2406 – thanks to Sho Shimauchi for turning this one up (Revision 1401970)

          Result = FAILURE
          stack :
          Files :

          • /hbase/trunk/src/docbkx/book.xml
          Jonathan Gray added a comment -

          I read through about half of the doc so far. Great work Stack! Thanks for doing this.

          stack added a comment -

          I'm resolving this issue. We have unit tests that demo what we have, and now I've committed doc that describes how HBase currently does versions, with a listing of limitations. Other issues have been opened to address these limitations (linked off this issue), which I believe cover the concerns raised in here. Let's open new issues to address holes in the doc I'm adding, or to add unit tests if there is not already coverage of how we think versioning is working in HBase.

          stack added a comment -

          Here is a section on hbase versions based on Bruno's article w/ some updates and a bit of preamble on what hbase versions are about.

          As per Bruno's article, it explains how timestamping works for get, scan, put, and delete. It has a section on limitations. It also explicitly answers the questions that Todd raises at the head of this issue.

          While I was at it, I added some docbook config. up in the pom and also removed the sample illustrative article I'd added to show how we could do articles only in docbook.

          I also added a 'quick start' section at head of the book as well as other clean up.

          stack added a comment -

          Leave it in for now... we should be able to put something basic in the book

          Jonathan Gray added a comment -

          Anyone want to work on this in the next week or so? If not, let's punt to 0.92

          Jonathan Gray added a comment -

          And on the unit test front, there are now several tests in TestFromClientSide that show the version update behavior. I think we have out-of-order insertion tests as well. Not sure if we have tests that show the delete case does not work.

          Jonathan Gray added a comment -

          Following commit of HBASE-1485, I think cell version semantics are basically defined. You can update existing versions and you can write versions out of order.

          What is left is dealing with deletes. I think there is some consensus that whether or not we eventually "fix" this issue (inserting behind deletes), it's well beyond the scope of 0.90.

          Since this JIRA deals with documentation, I suppose we should add a section to the book about versions and then we can close this. JIRAs remain to deal with the delete issue specifically.

          Any volunteers?

          Pranav Khaitan added a comment -

          Patch for this issue posted at https://review.cloudera.org/r/780/

          stack added a comment -

          Here is one from this morning:

          08:49 < paul> i put a value on row x, cf y, col z, ts 0 with value 1 then I put row x, cf y, col z, ts 1 with value 2. if i scan or get {VERSIONS=>5} (from shell) 
                        I get value 2, but if I scan {TIMESTAMP=>0) I still get value 1 back even though I should not exist...
          08:49 < paul> *it
          ...
          09:12 < St^Ack> paul: which version of hbase?
          09:13 < paul>  latest stable i think (shell reports 0.20.5 for VERSION)
          09:17 < St^Ack> paul: why would value 1 not exist?
          09:17 < St^Ack> you expect that because  you set versions == 1, that it should suppress return of value 1
          09:17 < St^Ack> ?
          09:17 < paul> i assumed (possibly naievely) that because it only keeps one version and as value 2 has a newer timestamp that yes, i thought it would?
          09:18 < paul> is that entirely wrong?
          09:18 < St^Ack> paul: its not how it works
          09:18 < St^Ack> as to whether you are 'wrong', I'd say you are not -- your expectation makes sense
          09:18 < paul> oh right thanks :), so what happens to older versions? do they go eventually or just waste space?
          09:19 < St^Ack> they go eventually
          09:19 < tlipcon> it will work after a major compaction
          
          stack added a comment -

          @Bruno What's your opinion on HBASE-2847? Do you really need it? Or could you work around it? (See the issue for comments on the radical change we'd need to fix it.)

          Bruno Dumon added a comment -

          "I commented on this on the blog post. This is not the case, we do support this by setting max to be the timestamp+1"

          My problem there was not the 'less than or equal' but rather getting the most recent version at some past point-in-time. I now (finally) understand this can be achieved by setting the range from 0 to the desired timestamp and max versions to 1. I'll update the blog to reflect this.

          Concerning resurfacing puts: I do not see this as a problem, just as an interesting effect of how things work.

          Jonathan Gray added a comment -

          "What are you suggesting? That you do the Get with ts+1? Or you are suggesting that you get more than one version and then the app figures out what to do? Does that work? What if there is an entry at ts+1? App has to check for this?"

          In existing TimeRange, the max is exclusive, not inclusive. What Bruno was asking about is getting the most recent version of something according to a max timestamp (but inclusive, not exclusive). Since the timestamp is a long there aren't any precision issues... you just say max=TS+1 (exclusive) which effectively means, max=TS inclusive
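The max=TS+1 point can be sketched with a toy model. This is not HBase client code: the list-of-tuples cell and the names `in_time_range` and `latest_as_of` are hypothetical, used only to illustrate a half-open range [min, max) over integer timestamps.

```python
from typing import List, Optional, Tuple

# Toy model: a cell is a list of (timestamp, value) versions.
# Timestamps are integers, mirroring the long timestamps discussed above.
Version = Tuple[int, str]


def in_time_range(ts: int, min_ts: int, max_ts: int) -> bool:
    """Half-open check: min_ts is inclusive, max_ts is exclusive."""
    return min_ts <= ts < max_ts


def latest_as_of(versions: List[Version], ts: int) -> Optional[Version]:
    """Newest version with timestamp <= ts, found via the range [0, ts + 1)."""
    candidates = [v for v in versions if in_time_range(v[0], 0, ts + 1)]
    # Equivalent to asking for max versions = 1 within the range.
    return max(candidates, key=lambda v: v[0], default=None)


cell = [(5, "a"), (10, "b"), (11, "c")]
as_of_10 = latest_as_of(cell, 10)  # the entry at ts+1 (11) is excluded
as_of_9 = latest_as_of(cell, 9)
```

In this model an entry at ts+1 never leaks into the result, because the exclusive upper bound ts+1 excludes it; the application does not need to check for it.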

          "rather than let whatever the current state of implementation shape our behavior"

          Yeah, we should figure what we want. I'm not saying we should decide based on how we do it currently, I'm saying it's still an open question what we should do on minor compactions (and was recently changed following removal of gets) and that a determination there could impact this stuff. Or it could not. Just saying the two are somewhat related.

          But yeah, let's figure out what we want and not be overly concerned with current implementations.

          stack added a comment -

          "...by setting max to be the timestamp+1"

          What are you suggesting? That you do the Get with ts+1? Or you are suggesting that you get more than one version and then the app figures out what to do? Does that work? What if there is an entry at ts+1? App has to check for this?

          On re-surfacing Puts, you say "Seems like there's an argument on both sides of this." I agree. Let's just decide and document what we decided (I suggest current behavior is what we go with).

          Regarding not nailing down behavior until we figure out stuff like what to do in minor compactions, I'd think rather that we should just say how we want it to be and then implement toward this goal, rather than let whatever the current state of implementation shape our behavior.

          Jonathan Gray added a comment -

          Gets lack "...the ability to retrieve the latest version less than or equal to a given timestamp, thus giving the 'latest' state of the record at a certain point in time."

          I commented on this on the blog post. This is not the case, we do support this by setting max to be the timestamp+1

          Major compactions are not invisible to the user

          This is hard to fix and it's not clear what "expected" behavior should be. Do you ever re-surface a Put once it's been hidden? Seems like there's an argument on both sides of this. If I want to keep the latest two versions, I might have accidentally inserted a bad version, so I want to delete it and resurface an older one. But maybe someone else has an argument that they never want something to be able to re-appear after being shadowed?

          I think the most important fix is to handle duplicate versions (ordered by insertion time, using memstoreTS and storefile stamps).

          Other stuff is less clear what the "right" answer should be. I also don't think we can attempt to completely nail-down this stuff until we make a strong determination about what should/should not be processed during minor compactions. I did some preliminary benchmarking work on minor compactions a couple months back, hoping to have an intern pick that work back up so we can make a decision here.

          stack added a comment -

          From Bruno's article, the issues mentioned are:

          + Gets lack "...the ability to retrieve the latest version less than or equal to a given timestamp, thus giving the 'latest' state of the record at a certain point in time."
          + Major compactions are not invisible to the user: "...create three cell versions at t1, t2 and t3, with a maximum-versions setting of 2. So when getting all versions, only the values at t2 and t3 will be returned. But if you delete the version at t2 or t3, the one at t1 will appear again. Obviously, once a major compaction has run, such behavior will not be the case anymore..."
          + HBASE-1485, where if two cells have the same coordinates, we do not return the latest added
          + A put after a tombstone was put in will be overshadowed by the tombstone if timestamp is older than the tombstone, though it went in after the tombstone was added (I made an issue for this, HBASE-2847).
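The compaction and tombstone items in the list above can be illustrated with a toy model. It is a sketch under stated assumptions, not the HBase implementation: `ToyCell` and all of its methods are hypothetical names, max-versions is applied at read time, and a major compaction is assumed to drop both excess versions and delete markers.

```python
class ToyCell:
    """Toy model of one cell's versions; not the HBase implementation."""

    def __init__(self, max_versions):
        self.max_versions = max_versions
        self.puts = {}                # timestamp -> value
        self.version_deletes = set()  # markers masking exactly one timestamp
        self.column_delete_ts = None  # marker masking every timestamp <= this

    def put(self, ts, value):
        self.puts[ts] = value

    def delete_version(self, ts):
        self.version_deletes.add(ts)

    def delete_column(self, ts):
        if self.column_delete_ts is None or ts > self.column_delete_ts:
            self.column_delete_ts = ts

    def _masked(self, ts):
        if ts in self.version_deletes:
            return True
        return self.column_delete_ts is not None and ts <= self.column_delete_ts

    def get_versions(self):
        """Unmasked versions, newest first, capped at max_versions."""
        live = sorted(
            ((ts, v) for ts, v in self.puts.items() if not self._masked(ts)),
            reverse=True,
        )
        return live[: self.max_versions]

    def major_compact(self):
        """Physically drop masked and excess versions, then drop the markers."""
        self.puts = dict(self.get_versions())
        self.version_deletes.clear()
        self.column_delete_ts = None


# Three versions, max-versions of 2: deleting t3 lets t1 resurface.
cell = ToyCell(max_versions=2)
for ts, value in [(1, "a"), (2, "b"), (3, "c")]:
    cell.put(ts, value)
before_delete = cell.get_versions()
cell.delete_version(3)
after_delete = cell.get_versions()

# Same writes, but a major compaction runs before the delete: t1 is gone.
compacted = ToyCell(max_versions=2)
for ts, value in [(1, "a"), (2, "b"), (3, "c")]:
    compacted.put(ts, value)
compacted.major_compact()
compacted.delete_version(3)
after_compaction = compacted.get_versions()

# A put that arrives after a tombstone but carries an older timestamp is masked.
shadowed = ToyCell(max_versions=2)
shadowed.delete_column(10)
shadowed.put(5, "late")
shadowed_result = shadowed.get_versions()
```

The two runs differ only in whether the compaction happened before the delete, which is the sense in which major compactions are visible to the user.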

          stack added a comment -

          Bruno at outerthought describes how it currently works, warts n'all: http://outerthought.org/blog/blog/417-OTC.html

          Kevin Peterson added a comment -

          I lean towards there being no notion of ordering other than timestamps.

          If multiple writes to a cell have the same timestamp, one of those versions will be maintained, and it is undefined which version will be maintained.

          If the user writes to cells with out of order timestamps, and the writes would make the cell exceed the number of versions the column family stores, the cell will contain those versions with the highest timestamp. More formally:

          A column family retains N versions.
          Given a cell C storing a possibly empty set of versions and timestamps S = { (v1, ts1), (v2, ts2), ... (vn, tsn) }, n <= N.
          The user makes m writes to C: W = { (v1', ts1'), (v2', ts2'), ... (vm', tsm') }.
          If m + n <= N, C will retain all writes as versions.
          If m + n > N, C will contain those N (v, ts) from S union W with the highest ts.
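The proposed retention rule can be written out as a small function. This is a sketch of the proposal only, not of what HBase does; `retain` is a hypothetical name. For writes that share a timestamp the proposal leaves the survivor undefined, so this sketch arbitrarily keeps the one seen last.

```python
def retain(existing, writes, max_versions):
    """Return the (value, ts) pairs kept after applying writes, newest first.

    existing     -- the set S of (value, ts) pairs already in the cell
    writes       -- the set W of (value, ts) pairs being written
    max_versions -- N, the number of versions the column family retains
    """
    by_ts = {}
    for value, ts in list(existing) + list(writes):
        # Same timestamp: exactly one version survives. Which one is
        # undefined in the proposal; keeping the last seen is a choice
        # made here for illustration only.
        by_ts[ts] = value
    newest_first = sorted(by_ts.items(), key=lambda item: item[0], reverse=True)
    return [(value, ts) for ts, value in newest_first[:max_versions]]


# An out-of-order write (ts 2 arrives after ts 3) still ranks by timestamp.
kept = retain([("a", 1), ("b", 3)], [("c", 2)], max_versions=2)
```

Note that arrival order plays no part in which versions survive; only the timestamps do, which matches "no notion of ordering other than timestamps".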

          If the user writes to cells with a timestamp before the current time minus the column family's TTL, the write will be discarded.

          I wonder if there are any uses of timestamps that we can recommend without forcing people to understand all the details. Here's what I can think of:
          1. If your data model has a preexisting timestamp, and this timestamp never changes, manually set timestamps will be more convenient than serializing a timestamp in your own format.
          2. If your data model has a preexisting timestamp, and this timestamp changes in a way compatible with the behavior of HBase, and you understand the details of timestamps and versioning, manually set timestamps will be more convenient than serializing a timestamp in your own format.
          3. If you want a consistent version of some data that spans multiple tables (e.g. a secondary index), you may want to use the same timestamp to insert into both tables so that you can use the exact timestamp as part of a get() after reading it out of one table.
          4. If your data includes a meaningful timestamp, especially if that timestamp can change, you may find it more straightforward to store that timestamp in your own format rather than relying on HBase timestamps.

          A likely point of confusion: if the user ignores the details of versioning and always queries for the most recent timestamp, the user could have a mental model like this:

          • A cell stores a value and a timestamp
          • I can supply the timestamp when I write a value
          • I can read the timestamp
          • *I can update the timestamp (incorrect)

          The user can write the value with a higher timestamp which appears to update the timestamp. When the user tries the same thing with a lower timestamp, this doesn't work.

          stack added a comment -

          Made this critical and we should have it for 0.21 if not for 0.20.5.

          stack added a comment -

          We have our work cut out for us: "@kdpeterson HBase manually set timestamp explained by XKCD (from http://xkcd.com/657/) http://twitpic.com/1hd437" http://twitter.com/kdpeterson/status/12598874822 and "@kdpeterson Manually setting the time stamp in HBase is like time travel. It stops making any sense and intuitive is wrong." http://twitter.com/kdpeterson/status/12598514371

          stack added a comment -

          @Todd Thanks for filing this one. I like the bit about unit tests to validate they work according to the yet-to-be written doc.

          ryan rawson added a comment -

          In a post-HBASE-2248 world, it will be safe to put and delete in any order.


            People

            • Assignee:
              Pranav Khaitan
              Reporter:
              Todd Lipcon
            • Votes:
              1
              Watchers:
              9
