HBase / HBASE-8927

Use nano time instead of milli time everywhere

    Details

    • Type: Bug
    • Status: Patch Available
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:

      Description

      Fewer collisions, and we are paying the price of a long anyways, so might as well fill it.

      Attachments

      1. 8927.txt (388 kB, stack)


          Activity

          stack added a comment -

          Search and replace.

          It will take more work. A bunch of tests fail because timings are not what they expected.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12591798/8927.txt
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 402 new or modified tests.

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/6307//console

          This message is automatically generated.

          Jieshan Bean added a comment -

          I think System.nanoTime() cannot be used as a timestamp. It can't ensure accuracy.

          Lars Hofhansl added a comment -

          nanoTime cannot be used as an absolute timestamp; it can only be used to compare times over a relatively small interval.

          Enis Soztutar added a comment -

          Agreed with what Lars said.

          Lars Hofhansl added a comment -

          Maybe we can use millis and fill the lower bits from nano time.

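          For illustration, a minimal sketch of the millis-plus-nano-bits idea (the 16-bit split, class, and method names are assumptions here, not taken from the attached patch):

            public final class HybridMillisNano {
              private static final int SUB_MS_BITS = 16;   // room for 65536 values per millisecond

              // Millis in the high bits; the low bits of nanoTime() fill the rest to cut collisions.
              public static long next() {
                long millis = System.currentTimeMillis();
                long filler = System.nanoTime() & ((1L << SUB_MS_BITS) - 1);
                return (millis << SUB_MS_BITS) | filler;
              }
            }
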
          stack added a comment -

          Yeah, that's what the javadoc says, but the src is quoted in this thread: http://stackoverflow.com/questions/510462/is-system-nanotime-completely-useless

          jlong os::javaTimeMillis() {
            timeval time;
            int status = gettimeofday(&time, NULL);
            assert(status != -1, "linux error");
            return jlong(time.tv_sec) * 1000  +  jlong(time.tv_usec / 1000);
          }
          
          
          jlong os::javaTimeNanos() {
            if (Linux::supports_monotonic_clock()) {
              struct timespec tp;
              int status = Linux::clock_gettime(CLOCK_MONOTONIC, &tp);
              assert(status == 0, "gettime error");
              jlong result = jlong(tp.tv_sec) * (1000 * 1000 * 1000) + jlong(tp.tv_nsec);
              return result;
            } else {
              timeval time;
              int status = gettimeofday(&time, NULL);
              assert(status != -1, "linux error");
              jlong usecs = jlong(time.tv_sec) * (1000 * 1000) + jlong(time.tv_usec);
              return 1000 * usecs;
            }
          }
          

          The above looks pretty good to me.

          Enis Soztutar added a comment -

          Maybe we can use millies and fill the lower bits from nano time.

          I did exactly that in HBASE-6833. The problem is that each CPU thread will observe its own nanosecond increments. Thus, concurrent updates coming in at the same time won't have any guarantee of monotonically increasing ts's because they are handled by different hardware threads. We can add some synchronization there (see my patch there), but it is not clear whether using ns will gain us anything if we did that.

          stack added a comment -

          We could left-shift millis in the long and then keep incrementing a sequence number within the millisecond? (More expensive than a call to nano.)

          On CLOCK_MONOTONIC from http://linux.die.net/man/3/clock_gettime, it says "Clock that cannot be set and represents monotonic time since some unspecified starting point.", which is off-putting, but when I print it out it is the same as millis.
          Did you see it going backward or out of order on SMP, Enis (in spite of CLOCK_MONOTONIC)? Were you on Windows?

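          A rough sketch of that left-shift-plus-counter idea (the synchronized method and the 16-bit counter width are assumptions, not code from this issue):

            public final class ShiftedMillisClock {
              private static final int COUNTER_BITS = 16;  // two low bytes for the per-millisecond counter
              private long lastMillis = -1L;
              private int counter = 0;

              // Millis shifted left, with a counter that increments for edits inside the same millisecond.
              public synchronized long next() {
                long now = System.currentTimeMillis();
                if (now == lastMillis) {
                  counter++;            // same millisecond: hand out the next slot
                } else {
                  lastMillis = now;
                  counter = 0;          // new millisecond: reset the counter
                }
                return (now << COUNTER_BITS) | (counter & 0xFFFF);
              }
            }
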
          Sergey Shelukhin added a comment -

          What problem are we trying to solve? The fix with synchronized increments/etc. is not cheap, and the only place we /really/ need it is timestamps, it seems. Just checking.

          stack added a comment -

          + Fewer coordinate collisions (where coordinates are row+cf+qualifier+type+ts). As slow as we are, we can do a bunch of ops inside a ms. With nanotime, we can establish some order regarding events that arrive inside the same ms.
          + We have a 64-bit ts already but we don't use all the bytes; that seems a little silly.

          Enis Soztutar added a comment -

          Did you see it going backward or out of order on SMP Enis

          That was on windows, but I believe it should be the same in linux as well. The hardware clocks themselves do not go back, but if two updates come with seq_num X and seq_num X+1, then X+1 might get a smaller ns because the hardware thread might observe a different clock than the other thread.

          Elliott Clark added a comment -

          The hardware clocks themselves do not go back, but if two updates come with seq_num X and seq_num X+1, then X+1 might get a smaller ns because the hardware thread might observe a different clock than the other thread.

          That can happen now if the mutations are not on the same row:

          • [Thread 1] Mutation A comes in to thread
          • [Thread 1] Wal edit A gets seq id.
          • [Thread 1] Thread Context Switches
          • [Thread 2] Mutation B comes in
          • [Thread 2] Wal edit B gets seq id.
          • [Thread 2] Mutation B gets timestamp.
          • [Thread 1] Wakes up
          • [Thread 1] Mutation A gets time stamp.

          A.id < B.id && A.ts > B.ts

          Locks ensure this isn't an issue on the same row.

          Enis Soztutar added a comment -

          That can happen now if the mutations are not on the same row:

          That is true, but I've never seen that happen in reality. Versus in the ns case, it will happen very frequently.

          Sergey Shelukhin added a comment -

          stack agree on the first, but for general use (e.g. measuring TTL expiration, how long things took for normal ops, and such) having a complex ts generation seems sillier. I'd do nano on a case-by-case basis for non-timestamps.

          Enis Soztutar hardware clocks do go back on faulty motherboards... also Windows does adjust the clock backwards from the time server if it drifts forwards. Linux ntpd can also do that unless explicitly disabled.

          Elliott Clark but does it matter across rows?

          stack added a comment -

          ...a complex ts generation seems sillier

          It is no more complex than what we currently have? We'd do System.nanoTime instead of System.currentTimeMillis. That is all?

          Elliott Clark added a comment -

          Elliott Clark but does it matter across rows?

          We make ACID guarantees around rows, not around different mutations on different rows. So yes. Showing that the clock always goes forward (minus errors) on a row means that we are only discussing changing the behavior of timestamps that we gave no promises for.

          Sergey Shelukhin added a comment (edited) -

          stack nanoTime is not the time... it cannot be used between machines (i.e. for ttl and such) or anywhere where wall clock time is required, only for differences on the same machine


          This method can only be used to measure elapsed time and is not related to any other notion of system or wall-clock time. The value returned represents nanoseconds since some fixed but arbitrary time (perhaps in the future, so values may be negative). This method provides nanosecond precision, but not necessarily nanosecond accuracy. No guarantees are made about how frequently values change. Differences in successive calls that span greater than approximately 292 years (2^63 nanoseconds) will not accurately compute elapsed time due to numerical overflow.

          stack added a comment -

          Yeah. I've read that whiney-pants CYA bit of years-old javadoc. The implementation though looks like it should give us the behavior we want (and x-machines). I'd say next up is trying this on a few SMP machines.

          Sergey Shelukhin added a comment -

          Depending on the implementation against an explicit interface guarantee (or lack thereof) is not a good design decision imho.

          Jean-Marc Spaggiari added a comment -

          I'm not getting the points against this change.

          We will not be the first one to move from Milli to Nano, no?

          https://issues.apache.org/jira/browse/CASSANDRA-733

          The issues we might have with nano, we already have them with milli?

          Ted Yu added a comment -

          I noticed that the size of patch for CASSANDRA-733 is relatively small.

          It contains changes in the following form:

          -        long startTime = System.currentTimeMillis();
          +        long startTime = System.nanoTime();
          ...
          -            writeStats.add(System.currentTimeMillis() - startTime);
          +            writeStats.addNano(System.nanoTime() - startTime);
          
          Jean-Marc Spaggiari added a comment -

          Exactly. They simply changed currentTimeMillis() for nanoTime(). And I think that's the point of St.Ack's patch too.

          Jean-Marc Spaggiari added a comment -

          Just to add:

          jmspaggiari@t:~/cassandra/cassandra$ grep -R "System.nanoTime()" * | wc
          112 643 13913

          stack added a comment -

          Jean-Marc Spaggiari That patch is about something else, tracking latencies in nanos, not kv timestamping.

          For this issue to progress, I need to see what happens in a Linux multithreaded app on SMP to see if CLOCK_MONOTONIC means what I think it means.

          Jean-Marc Spaggiari added a comment -

          This was just to provide an example. The goal was to show that other applications are already using nano instead of milli, so we might be able to do that too.

          Regarding Monotonic, the documentation says:
          CLOCK_MONOTONIC
          Clock that cannot be set and represents monotonic time since some unspecified starting point.

          "cannot be set"... Seems that even NTPs can't modify that.

          Enis Soztutar added a comment -

          This was just to provide an example. The goal was to show that other applications are already using nano instead of milli,

          Cassandra's issue is about changing to nano for tracking latencies, vs. in this case, this is about using nano as a globally pseudo-consistent wall clock.

          Jean-Marc Spaggiari added a comment -

          Hi Enis Soztutar, as I just replied to Stack, you're right. "This was just to provide an example". They have nanoTime() all over the place, not just on this patch. But they also still have some currentMs() calls... Might be interesting to ask them

          Sergey Shelukhin added a comment -

          As I said, can we please look at reasons to do this, regardless of what this clock does... There's not a single good reason to use nano in most places, with the exceptions of column timestamp precision, for which a more complex and safer scheme can be made as suggested by some comments above, and measuring the time taken for short (normally expected to take single seconds or less) operations.
          Now, we could do it everywhere anyway just because why not, if it were free of other consequences, but in this case we want to ignore explicitly missing guarantees on the interface and rely on internals of particular JVM(s) and OS(es). If some JVM, OS, or future changes to JVM code change the implementation (still staying within the interface guarantees), there will be a fail of epic proportions.

          In my mind there isn't even a reason to look at internals and ponder, simply because we don't stand to gain anything from nano in most places.

          stack added a comment -

          ...with the exception of column timestamp precision, for which a more complex and safer scheme can be made as suggested by some comments above, and measuring time taken for short (normally expected to take single seconds or less) operations.

          Seems like good enough reasons to me for doing nano time.

          I took a look at jdk8 and the story is no different there in its description of System.nanoTime. It adds a java.time package with classes like Instant w/ its nanosecond resolution http://download.java.net/jdk8/docs/api/java/time/Instant.html but it doesn't look too amenable given its long of seconds and then nanos inside a second (apart from it being jdk8).

          Could do something like Lars suggests above w/ left shift filling in bottom few bytes w/ an incrementing number; it'd be a bit of a pain to implement but would be nice avoiding clashes.

          Andrew Purtell added a comment -

          Could do something like Lars suggests above w/ left shift filling in bottom few bytes w/ an incrementing number; it'd be a bit of a pain to implement but would be nice avoiding clashes.

          That is the middle ground as I read the comments above.

          stack added a comment -

          At a minimum, I think we should just left shift millis so we open up the lower bytes for external transactions, etc. to use, if we do nothing else on this issue.

          Sergey Shelukhin added a comment -

          Seems like good enough reasons to me for doing nano time.

          Yes, for these particular pieces of code, not everywhere...

          stack added a comment -

          So, left-shifting the timestamp opens up the two least significant bytes. We could make these two bytes a counter inside the millisecond. We could give each edit that comes in during a given millisecond its own unique counter value. We should never again have clashes (hard to do 64k ops inside a millisecond).

          Non-clashing edits would help the distributed log replay case.

          Here is max long: http://en.wikipedia.org/wiki/9223372036854775807 which in hex is 0x7FFFFFFFFFFFFFFF. If we left-shift currentTimeMillis, we can run it up to "Tue Oct 16 19:45:55 PDT 6429" before it'd roll over.

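          The rollover figure can be sanity-checked in a couple of lines, assuming the bottom two bytes go to the counter:

            public class RolloverCheck {
              public static void main(String[] args) {
                long maxShiftedMillis = Long.MAX_VALUE >>> 16;              // 2^47 - 1 milliseconds left for the clock
                System.out.println(new java.util.Date(maxShiftedMillis));   // prints a date in the year 6429
              }
            }
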
          stack added a comment -

          Just read through this issue again. Sergey Shelukhin's stuff above is good pushback.

          We are careful now allocating sequenceids, assigning edits an order. Should we at the same time allocate the Cell ts, or at least modify the LSBs on the ts to add in some derivative of the sequenceid? (Would only work if we've not put the edit into memstore yet).

          cuijianwei Didn't you suggest a few lines of code assigning a more granular ts (IIRC)? Inside a synchronized block you assigned the ts, left-shifted, and filled in the lower bits with something? I can't find it just now...

          stack added a comment -

          Sorry cuijianwei. It's He Liangliang over in https://issues.apache.org/jira/browse/HBASE-2256?focusedCommentId=13825164&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13825164

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12591798/8927.txt
          against trunk revision .
          ATTACHMENT ID: 12591798

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 402 new or modified tests.

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/10507//console

          This message is automatically generated.

          He Liangliang added a comment -

          stack Yes, another benefit of left shifting + a synchronized counter is that it does not suffer from system time drift.

          Nicolas Liochon added a comment -

          So I'm adding here the comment I wrote in a duplicate jira: Am I the only one who would appreciate some extra bits, to have a timestamp + a counter?

          stack added a comment -

          So I'm adding here the comment I wrote in a duplicate jira: Am I the only one who would appreciate some extra bits, to have a timestamp + a counter?

          Say more. How would the counter be used? It'd be part of the ts or a different facility altogether?

          Nicolas Liochon added a comment -

          It's part of the ts. Typically used when the user application sets the ts. With a composition of ts + uniqueClientId + counter, all the operations are ordered. In the client application it's not very difficult to have a unique id per client process and then to maintain a counter. I expect that playing with transactions leads to this kind of need.

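          One hypothetical way to pack such a composition into the existing 64-bit ts; the 47/10/6 bit split and the method name are assumptions for illustration only:

            // millis in the high 47 bits, a 10-bit unique client id, then a 6-bit per-client counter
            static long clientTimestamp(long millis, int clientId, int counter) {
              return (millis << 16) | ((clientId & 0x3FFL) << 6) | (counter & 0x3FL);
            }
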
          Gary Helmling added a comment -

          Regarding precision, FWIW in the case of Tephra, we multiply the current timestamp in milliseconds by 1000000 and add a counter (reset every millisecond). 1 billion operations per second is probably more than we need, so if we need to support more than 292 years' worth of timestamped values, we could probably adjust to a lower counter range.

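          A simplified sketch of that scheme (the class name and the synchronization are assumptions; Tephra's actual code may differ):

            public final class MultipliedMillisClock {
              private long lastMillis = -1L;
              private int counter = 0;

              // Millis times one million plus a counter that resets every millisecond, as described above.
              public synchronized long next() {
                long now = System.currentTimeMillis();
                counter = (now == lastMillis) ? counter + 1 : 0;
                lastMillis = now;
                return now * 1_000_000L + counter;   // Long.MAX_VALUE / 1_000_000 ms is roughly 292 years
              }
            }
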
          stack added a comment -

          Gary Helmling Not a left shift but a multiply? Above we speculate about left-shifting two bytes. You'd shift more, Gary?

          Lars Hofhansl added a comment -

          Should we pick this up again?

          At a minimum, I think we should just left shift millis so we open up the lower bytes for external transactions, etc. to use, if we do nothing else on this issue.

          Agreed. We'd probably need to add a "timestamp multiplier" or "timestamp shift" option to table or column family. That way we can grandfather in old tables and in all cases scale the TTL value accordingly.
          I think this is the only concern; the timestamps never actually have to be changed in the existing/old KVs (even when stored in the WAL). Care would have to be taken when the multiplier is changed to a higher value and there's already a TTL on the table; that would probably not work easily.

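          For illustration, how the TTL math might have to account for a shifted timestamp (the method and the per-table shift parameter are assumptions, not from any patch here):

            // With shifted cell timestamps, "now" and the TTL must be scaled by the same shift before comparing.
            static boolean isExpired(long cellTs, long ttlMillis, int tsShift) {
              long nowShifted = System.currentTimeMillis() << tsShift;
              return nowShifted - cellTs > (ttlMillis << tsShift);
            }
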
          stack added a comment -

          Should we pick this up again?

          I like this issue. Hard part is ensuring we don't break TTLs going between old and new ts types (as you pointed out above). Was thinking we'd have all timestamping go via the environmentedge thingy... would add a compare on it. It would do the ttl math cognizant of the ts typing (my guess is that left shift would be less expensive doing this compare than multiply but would have to measure).

          We'd probably need to add a "timestamp multiplier" or "timestamp shift" option to table or column family. That way we can grandfather in old tables and in all cases scale the TTL value accordingly.

          All or nothing, I'd say. Just say NO to more options (smile).


            People

            • Assignee: Unassigned
            • Reporter: stack
            • Votes: 0
            • Watchers: 21
