Pig / PIG-1832

Support timestamp in HBaseStorage when storing

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      When storing data into HBase using org.apache.pig.backend.hadoop.hbase.HBaseStorage, the HBase timestamp field is set to the insertion time of the MapReduce job. It would be nice to have a way to populate the timestamp from user data.

          Activity

          Eric Yang added a comment -

          Hi Guido, I think -timestamp=<epoch_utc> makes sense for a high-throughput system. We should probably revisit per-cell timestamp writing later. This is not a high-priority item for me to work on. If anyone would like to tackle this issue, feel free to take it.

          Bill Graham added a comment -

          I don't think there is a ticket to support returning multiple cell versions with timestamps, but we did discuss ideas for an approach here:

          https://issues.apache.org/jira/browse/PIG-1782?focusedCommentId=12988192&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12988192

          Basically the idea is to create a new class to support this, since it would be fundamentally very different than what we currently support with HBaseStorage. That work might be better handled after we tackle PIG-3067 (HBaseStorage should be split up to become more manageable).

          Guido Serra aka Zeph added a comment -

          s/imaging/imagine

          Guido Serra aka Zeph added a comment -

          P.S. Bill Graham, I can't find a ticket that addresses outputting the timestamp... I mean, imaging I'd like to see multiple versions, given a time range... (OK, I guess I need to create a feature ticket for that)

          Bill Graham added a comment -

          Yes, read via time ranges is done. Work on PIG-2114 seems stalled though and there's a lot going on in that patch. I propose this JIRA just add write support for -timestamp=<millis_since_the_epoch_utc> for consistency with the current read API. That's a quick change that would be useful and would give full read/write support for timestamps. That would also help reduce the somewhat broad scope of PIG-2114.
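
          A minimal sketch of how a caller might build the value for the option proposed above, assuming the STORE side would accept the same "-timestamp=<millis>" form as LOAD. That write support is only a proposal in this thread, not an existing flag, and the helper names to_epoch_millis and timestamp_option are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Reference point for "milliseconds since the epoch, UTC".
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_millis(moment: datetime) -> int:
    """Convert a timezone-aware datetime to whole milliseconds since the Unix epoch."""
    # Floor-dividing by a timedelta keeps the arithmetic exact (no float rounding).
    return (moment - EPOCH) // timedelta(milliseconds=1)


def timestamp_option(moment: datetime) -> str:
    """Format the option string; STORE-side support for it is assumed, not confirmed."""
    return "-timestamp=%d" % to_epoch_millis(moment)
```

          For example, timestamp_option(datetime(2013, 1, 1, tzinfo=timezone.utc)) gives "-timestamp=1356998400000". A naive datetime (without tzinfo) raises a TypeError here, which avoids silently using local time.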

          Guido Serra aka Zeph added a comment -

          Even better, they just updated the documentation (PIG-2341):

          http://pig.apache.org/docs/r0.11.0/func.html#HBaseStorage

          I'd say that just supporting "-timestamp=" at both LOAD and STORE is all we need.

          Right now (as of version 0.11), this option is taken into consideration only at LOAD time.

          P.S. There is a scenario, though, which I'm covering with a custom Python/Jython script, that puzzles me: what if only one cell (a row/column intersection) changes? HBase by design stores a new entry at the given timestamp for all the family:columns provided, even if they are identical. Shall we compute the difference within HBaseStorage, or shall the user handle it?
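
          For the "user handles it" option in the question above, a minimal sketch of diffing a row before storing, assuming each row is held as a plain dict keyed by "family:qualifier". The function name changed_cells and that data layout are hypothetical and not part of HBaseStorage.

```python
# Sentinel so that a column absent from `previous` differs from any stored value,
# including None.
_MISSING = object()


def changed_cells(previous: dict, current: dict) -> dict:
    """Return only the cells of `current` that are new or differ from `previous`."""
    return {
        column: value
        for column, value in current.items()
        if previous.get(column, _MISSING) != value
    }
```

          Writing only the returned cells would avoid creating a new version for columns whose value did not change. Columns that disappear from the new row are ignored in this sketch.
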
          Guido Serra aka Zeph added a comment -

          OK, PIG-2886 covers only reading; this ticket is attempting to cover writing, so let's keep it open.

          It seems to be partially addressed in PIG-2114, though. Eric Yang, any progress from your side?

          Guido Serra aka Zeph added a comment -

          Eric Yang, as far as I can tell this is covered by PIG-2886; have a look at it.

          Eric Yang added a comment -

          For loading HBase data with timestamp, the API could look like this:

          a = load 'hbase://table1' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf1:*', 
            '-loadKey -gt $START -caster Utf8StorageConverter -timeRange $startTs,$endTs');
          

          For storing, I am inclined to suggest passing a new user-defined callback function to HBaseStorage as a parameter. This would make it possible to extract the timestamp from the row key and set it at the cell level. For example:

          STORE table2 INTO 'table2' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf1:s1 cf2:s2', 
            '-cb org.apache.pig.backend.hadoop.hbase.TimestampExtractor("\\w+-\\d+-\\w+")');
          

          It could also be used to set a single timestamp on bulk-loaded data:

          STORE table2 INTO 'table2' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf1:s1 cf2:s2', 
            '-cb org.apache.pig.backend.hadoop.hbase.TimestampSetter($ts)');
          

          Any thoughts?
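
          A rough sketch of what a TimestampExtractor-style callback might do, written in Python rather than as a Pig/Java class. It assumes row keys of the form <source>-<epoch_seconds>-<metric>; that layout, the capture group added to the regex, and the function name extract_timestamp_ms are all hypothetical.

```python
import re
from typing import Optional

# Same shape as the pattern in the proposal, with a capture group added around
# the numeric part so the timestamp can be pulled out (an assumption of this sketch).
ROW_KEY_PATTERN = re.compile(r"^\w+-(\d+)-\w+$")


def extract_timestamp_ms(row_key: str) -> Optional[int]:
    """Return the timestamp embedded in the row key, in milliseconds, or None."""
    match = ROW_KEY_PATTERN.match(row_key)
    if match is None:
        # Key does not follow the assumed layout; the caller decides the fallback.
        return None
    # The key is assumed to carry epoch seconds; convert to milliseconds.
    return int(match.group(1)) * 1000
```

          Returning None for keys that do not match leaves the choice of fallback (for example, the insertion time) to the caller.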

          Andrew Clegg added a comment -

          This would be really handy, e.g. for replaying log files into HBase after a failure, so the cells could be dated with the actual time of the event.

          Bill Graham added a comment -

          @Vincent, timestamp filtering at read time is being implemented as part of PIG-2114 FYI.

          Vincent BARAT added a comment -

          It would definitely be nice if a timestamp could also be specified when loading data: in the same way the -lt and -gt options work for row keys, it would be nice to be able to specify a timestamp threshold.
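
          A small sketch of the threshold idea, assuming each cell version is available as a (timestamp_ms, value) pair. The half-open [start, end) convention mirrors HBase's TimeRange; the function name versions_in_range and the data layout are hypothetical.

```python
from typing import Any, Iterable, List, Optional, Tuple


def versions_in_range(
    versions: Iterable[Tuple[int, Any]],
    start_ms: Optional[int] = None,
    end_ms: Optional[int] = None,
) -> List[Tuple[int, Any]]:
    """Keep versions with start_ms <= timestamp < end_ms; None means unbounded."""
    return [
        (ts, value)
        for ts, value in versions
        if (start_ms is None or ts >= start_ms) and (end_ms is None or ts < end_ms)
    ]
```

          Passing only start_ms acts like a lower threshold, analogous to how -gt bounds row keys from below.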


            People

            • Assignee: Unassigned
            • Reporter: Eric Yang
            • Votes: 6
            • Watchers: 10
