Details

    • Type: Brainstorming
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      This is to start a discussion about timestamp promises declared per table or CF.
      For example, if a client promises only monotonically increasing timestamps (or no custom-set timestamps) and VERSIONS=1, we can aggressively and easily remove old versions of the same row/fam/col from the memstore before we flush, just by supplying a comparator that ignores the timestamp (i.e. two KVs differing only by TS would be considered equal).
      That would increase the performance of counters significantly.
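      The comparator trick described above can be sketched without any HBase dependencies. The `Kv` record and `NO_TS` comparator below are illustrative stand-ins (not HBase's `KeyValue` or `KVComparator`): because the comparator ignores the timestamp, a sorted memstore-like map collapses two versions of the same row/fam/col into a single slot, with the later write winning.

```java
import java.util.Comparator;
import java.util.TreeMap;

// Hypothetical minimal cell: row / family / qualifier / timestamp / value.
record Kv(String row, String family, String qualifier, long ts, String value) {}

public class TsIgnoringMemstore {
    // Comparator that deliberately ignores the timestamp, so two cells
    // differing only by TS are considered equal by the sorted structure.
    static final Comparator<Kv> NO_TS = Comparator
            .comparing(Kv::row)
            .thenComparing(Kv::family)
            .thenComparing(Kv::qualifier);

    public static void main(String[] args) {
        TreeMap<Kv, Kv> memstore = new TreeMap<>(NO_TS);
        Kv v1 = new Kv("r1", "f", "c", 100L, "old");
        Kv v2 = new Kv("r1", "f", "c", 200L, "new");
        memstore.put(v1, v1);
        memstore.put(v2, v2); // replaces v1: only one version retained
        System.out.println(memstore.size());                          // 1
        System.out.println(memstore.firstEntry().getValue().value()); // new
    }
}
```

      With VERSIONS=1 and the monotonic-TS promise, this replacement is safe because the later arrival is guaranteed to be the newer cell; without that promise, an out-of-order write could silently overwrite a newer version.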

        Issue Links

          Activity

          Sergey Shelukhin added a comment -

          I think there was already a JIRA about that but I cannot find it.
          The discussion was also about introducing modes - e.g. "client cannot supply timestamp" (so we can put seqId in timestamp and it will be increasing as expected), or "client must supply all timestamps"

          Jonathan Hsieh added a comment -

          I started looking into this – the part of the implementation I'm less familiar with is how to discern the distributed log replays and replication replays (which legitimately will write timestamps) from fresh writes.

          I'd think the different replay writes would be tagged or marked so that we can make a simple distinction in one place.

          Ideally this would be a table-scoped parameter that works with alter table, something like MOD_TS_OK or INTRINSIC_TS_ONLY.

          Lars Hofhansl added a comment -

          Yep. Log replay is another issue where this will help. W.r.t. cheap upserts into the memstore, we would still need to be careful about when to update in place, because of the SLAB storage (that is why upsert is not using the SLAB, for example).

          Maybe we could first agree upon how a user indicates these promises. Would it be per CF? Or per table? Per table makes more sense (IMHO - why would a user want to supply timestamps for some columns but not for others in the same row?), but per CF would fit better with how we handle other config options.

          Something like CLIENT_TIMESTAMPS, SERVER_TIMESTAMPS? These are easy to verify. Might also want CLIENT_MONOTONIC_TIMESTAMPS (thinking of something like Phoenix here, which does its own TS management, or transaction libraries that use the TS to implement SI), which would be hard/expensive to validate, so we'd have to trust the client.
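          The mode names above come straight from the comment; the sketch below shows how the server-side checks might look. The `LATEST_TIMESTAMP` sentinel (meaning "client did not set a TS") and the exact validation rules are assumptions for illustration, not the actual HBase implementation.

```java
// Sketch of server-side checks for the proposed per-table timestamp promises.
// Mode names follow the discussion; sentinel and rules are assumptions.
public class TimestampPolicy {
    static final long LATEST_TIMESTAMP = Long.MAX_VALUE; // "TS not set by client"

    enum Mode { CLIENT_TIMESTAMPS, SERVER_TIMESTAMPS, CLIENT_MONOTONIC_TIMESTAMPS }

    static boolean isAllowed(Mode mode, long clientTs, long lastSeenTs) {
        switch (mode) {
            case SERVER_TIMESTAMPS:
                return clientTs == LATEST_TIMESTAMP;  // client must NOT set a TS
            case CLIENT_TIMESTAMPS:
                return clientTs != LATEST_TIMESTAMP;  // client must set every TS
            case CLIENT_MONOTONIC_TIMESTAMPS:
                // Only a cheap local check is possible; the global monotonic
                // guarantee has to be trusted, as noted in the comment.
                return clientTs != LATEST_TIMESTAMP && clientTs >= lastSeenTs;
            default:
                return true;
        }
    }
}
```

          The first two modes are indeed trivially verifiable per mutation; the monotonic check can only catch local violations (per region, per store), which is why the client ultimately has to be trusted.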

          Lars Hofhansl added a comment -

          Jonathan Hsieh, missed your earlier comment. So table scoped... I agree. MOD_TS_OK would continue to be the default, I assume. Might want MOD_TS_ONLY, and MOD_TS_MONOTONIC (or something).

          Jonathan Hsieh added a comment -

          My hope actually would be to have MOD_TS_OK=false by default on all newly created tables, but on by default for any previously existing tables. This way we don't surprise any existing users unless they create new tables. Systems like Phoenix would create with MOD_TS_OK=true.

          If it was on, I think we could legitimately turn on distributed log replay by default. It also would effectively eliminate a class of problems users can encounter if they get fancy without knowing what really is going on.

          Lars Hofhansl added a comment -

          This would also help with:

          • scanning: if we scan files in creation order (starting with the memstore) we can stop when we've seen a key (we know there cannot be a newer one in a later file)
          • compactions: we can remove delete markers if we compact a tail (again in time order) of the HFiles
          • probably many more optimizations that will pop up
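          The scanning optimization in the first bullet can be sketched with plain maps standing in for HFiles (no HBase types involved). Under the monotonic-TS promise, the first hit while walking newest-first is guaranteed to be the answer, so older files are never touched:

```java
import java.util.List;
import java.util.Map;

public class NewestFirstLookup {
    // filesOldestFirst: store files ordered oldest -> newest, with the
    // memstore conceptually last. With the monotonic-TS promise, the first
    // hit scanning newest-first is the newest version, so we stop early
    // instead of seeking into every file.
    static String get(List<Map<String, String>> filesOldestFirst, String key) {
        for (int i = filesOldestFirst.size() - 1; i >= 0; i--) {
            String v = filesOldestFirst.get(i).get(key);
            if (v != null) return v; // newest wins; older files are skipped
        }
        return null; // not present in any file
    }
}
```

          Without the promise, a client-supplied older timestamp could live in a newer file, so every file would still have to be consulted - which is exactly the per-scan seek cost the promise eliminates.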
          Enis Soztutar added a comment -

          HBASE-9905 contains some discussions related to this. The modes proposed there are similar because they were borrowed from Lars's ideas in the first place.

          Vladimir Rodionov added a comment -

          I think there is an easy, quick patch / workaround which would allow speeding up reads by going only to the MemStore or block cache:

          a hint on Get/Append, something like READ_FASTEST:

          Get get = new Get(row);
          get.setAttribute(OperationWithAttributes.READ_FASTEST);
          
          Append append = ...;
          append.setAttribute(OperationWithAttributes.READ_FASTEST);
          

          Unfortunately, Increment does not implement OperationWithAttributes. Why?

          Andrew Purtell added a comment -

          Unfortunately, Increment does not implement OperationWithAttributes

          Maybe in 0.94? In more recent versions of HBase, Increment extends Mutation which extends OperationWithAttributes.

          Vladimir Rodionov added a comment -

          You are right. I checked the 0.94 version.

          Lars Hofhansl added a comment -

          Also note that Increment already does this (and hence does not work when Puts and Increments are mixed).

          It's also bigger than just the memstore. When HFile history corresponds to the history of its contents, we can start scanning HFiles in order of age (not even seeking later HFiles unless we didn't find what we are looking for in newer HFiles).

          Vladimir Rodionov added a comment -

          It's also bigger than just the memstore. When HFile history corresponds to the history of its contents, we can start scanning HFiles in order of age (not even seeking later HFiles unless we didn't find what we are looking for in newer HFiles).

          I think Lars Hofhansl is referring to custom compaction into a series of non-overlapping, date-range-sorted (of course) HFiles. We have https://issues.apache.org/jira/browse/PHOENIX-914 for that already.
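          The payoff of non-overlapping, date-range-sorted files can be sketched as simple time-range pruning. The `FileRange` record and per-file min/max timestamps below are hypothetical bookkeeping (HBase tracks something comparable in HFile metadata, but this is not its API): a scan restricted to a TS range only needs the files whose ranges intersect it.

```java
import java.util.ArrayList;
import java.util.List;

public class TimeRangePruning {
    // Hypothetical per-file time range. With date-sorted, non-overlapping
    // files (the PHOENIX-914 idea), a scan's TS range selects a contiguous
    // subset of files and everything else is skipped without any seeks.
    record FileRange(String name, long minTs, long maxTs) {}

    static List<String> filesToRead(List<FileRange> files, long scanMin, long scanMax) {
        List<String> out = new ArrayList<>();
        for (FileRange f : files) {
            // Keep the file only if [minTs, maxTs] intersects [scanMin, scanMax].
            if (f.maxTs() >= scanMin && f.minTs() <= scanMax) out.add(f.name());
        }
        return out;
    }
}
```

          The same interval test works even with overlapping files, but only the non-overlapping layout guarantees the selected subset is contiguous and small.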

          Lars Hofhansl added a comment -

          That is one use case. I meant an even more general scenario. Currently, for any scan we need to seek into all involved HFiles and the memstore to find the next element.
          For many scans (for example when the set of columns is known ahead of time and we're looking for the latest N versions only) we can seek on demand in order from newest to oldest; as soon as we find what we're looking for we can stop and avoid seeking into the older HFiles.


            People

            • Assignee:
              Unassigned
            • Reporter:
              Lars Hofhansl
            • Votes:
              0
            • Watchers:
              17

              Dates

              • Created:
                Updated:
