Pig / PIG-1782

Add ability to load data by column family in HBaseStorage

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.9.0
    • Component/s: None
    • Labels:
      None
    • Environment:

      Java 6, Mac OS X 10.6

    • Release Note:
      Enhanced HBaseStorage functionality to support loading dynamically named columns by column family or by column name prefixes.

      Javadoc:


      /**
       * An HBase implementation of LoadFunc and StoreFunc.
       * <P>
       * Below is an example showing how to load data from HBase:
       * <pre>{@code
       * raw = LOAD 'hbase://SampleTable'
       * USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
       * 'info:first_name info:last_name friends:* info:*', '-loadKey true -limit 5')
       * AS (id:bytearray, first_name:chararray, last_name:chararray, friends_map:map[], info_map:map[]);
       * }</pre>
       * This example loads data redundantly from the info column family just to
       * illustrate usage. Note that the row key is inserted first in the result schema.
       * To load only column names that start with a given prefix, specify the column
       * name with a trailing '*'. For example passing <code>friends:bob_*</code> to
       * the constructor in the above example would cause only columns that start with
       * <i>bob_</i> to be loaded.
       * <P>
       * Below is an example showing how to store data into HBase:
       * <pre>{@code
       * copy = STORE raw INTO 'hbase://SampleTableCopy'
       * USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
       * 'info:first_name info:last_name friends:* info:*')
       * AS (info:first_name info:last_name buddies:* info:*);
       * }</pre>
       * Note that STORE will expect the first value in the tuple to be the row key.
       * Scalar values need to map to an explicit column descriptor and maps need to
       * map to a column family name. In the above examples, the <code>friends</code>
       * column family data from <code>SampleTable</code> will be written to a
       * <code>buddies</code> column family in the <code>SampleTableCopy</code> table.
       *
       */
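
      For example, given the load above, individual dynamically named columns can be
      projected out of the loaded maps with Pig's map dereference operator. A minimal
      sketch (the column name bob_jones is hypothetical):

      bobs = FOREACH raw GENERATE id, friends_map#'bob_jones';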

      Description

      It would be nice to load all columns in a column family by using shorthand syntax like:

      CpuMetrics = load 'hbase://SystemMetrics' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cpu:','-loadKey');
      

      Assuming there are columns cpu:sys.0, cpu:sys.1, cpu:user.0, and cpu:user.1 in the cpu column family,

      CpuMetrics would contain something like:

      (rowKey, cpu:sys.0, cpu:sys.1, cpu:user.0, cpu:user.1)
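
      As shipped (see the release note above), the family loads as a single map field
      rather than a flat tuple. A minimal sketch of the equivalent load, with
      hypothetical aliases:

      CpuMetrics = LOAD 'hbase://SystemMetrics'
          USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cpu:*', '-loadKey true')
          AS (rowKey:bytearray, cpu:map[]);
      -- e.g. (rowKey, [sys.0#v0, sys.1#v1, user.0#v2, user.1#v3])
      SysMetrics = FOREACH CpuMetrics GENERATE rowKey, cpu#'sys.0';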
      
      Attachments

      1. PIG-1782_4.patch
        33 kB
        Dmitriy V. Ryaboy
      2. PIG-1782_1.patch
        23 kB
        Bill Graham
      3. PIG_1782_3.patch
        30 kB
        Bill Graham
      4. PIG_1782_2.patch
        35 kB
        Bill Graham
      5. apply-PIG-1782-patch.sh
        2 kB
        Bill Graham


          Activity

          Dmitriy V. Ryaboy added a comment -

          Committed to 0.9 trunk.

          Bill Graham added a comment -

          Verified that the patch applies cleanly to trunk, unit tests pass, and a sanity-test job against a cluster using a map of CF name/values runs as expected.

          Dmitriy V. Ryaboy added a comment -

          Attached patch should apply cleanly to the current trunk. Please review.

          Dmitriy V. Ryaboy added a comment -

          Bill, I will definitely look at this by the end of the weekend.

          Bill Graham added a comment -

          Ping. Can anyone please review this patch, and possibly even commit it? I'd like to get this into Pig 0.9.0 if possible. We've been using it for a while without issue.

          Bill Graham added a comment -

          Here's a new patch #3 with the projection unit tests removed. Dmitriy and I synced up off-line and decided to tackle the issue with projections in a separate JIRA. I'll open one and add the relevant unit tests.

          This patch also requires PIG-1680, and it's built from my pig_1782 git repo, FYI:

          https://github.com/billonahill/pig/tree/pig_1782

          Bill Graham added a comment -

          Sorry for the back and forth on this one, but while doing additional testing of patch 3 I've discovered another bug with projections.

          Bill Graham added a comment -

          Attached is a second patch. This one is built to be applied on top of the PIG_1680.3.patch.

          From the Javadocs:

          An HBase implementation of LoadFunc and StoreFunc.

          Below is an example showing how to load data from HBase:

          raw = LOAD 'hbase://SampleTable'
                USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
          'info:first_name info:last_name friends:* info:*', '-loadKey true -limit 5')
                 AS (id:bytearray, first_name:chararray, last_name:chararray, friends_map:map[], info_map:map[]);
          

          This example loads data redundantly from the info column family just to illustrate usage. Note that the row key is inserted first in the result schema. To load only column names that start with a given prefix, specify the column prefix with a trailing *. For example passing friends:bob_* to the constructor in the above example would cause only columns that start with bob_ to be loaded.

          Below is an example showing how to store data into HBase:

           copy = STORE raw INTO 'hbase://SampleTableCopy'
                 USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
                 'info:first_name info:last_name friends:* info:*')
                 AS (info:first_name info:last_name buddies:* info:*);
          

          Note that STORE will expect the first value in the tuple to be the row key. Scalar values need to map to an explicit column descriptor and maps need to map to a column family name. In the above examples, the friends column family data from SampleTable will be written to a buddies column family in the SampleTableCopy table.

          Bill Graham added a comment -

          @Dmitriy, I branched your git clone and incorporated my changes for this patch. There's one bug when using projections, for which I've added a failing unit test. See https://github.com/billonahill/pig/tree/bills_pig_1680

          I'm not yet sure if this is a bug in the base PIG-1680 functionality, or only when using maps per PIG-1782. I'll look into it more tomorrow.

          Dmitriy V. Ryaboy added a comment -

          Michael, that will be addressed in PIG-1832 after we get done with this. The scope of this ticket is creeping already, as it's getting mixed in with 1680.

          Michael Lugassy added a comment -

          Thanks guys, will any of these allow looking at multiple versions / retrieving timestamps of HBase cells?

          Dmitriy V. Ryaboy added a comment -

          Bill and I diverged a bit since I posted a different patch to 1680; I'll post a merged patch soonish.

          Bill Graham added a comment -

          Attached are two files, a patch and a script to apply it. A few things to note about this patch:

          • It relies on HBase 0.89.0 or greater and it effectively replaces PIG-1680.
          • I've updated HBaseStorage for now. If we want to deprecate that class and create a new one instead, I can do that.
          • I added support for a columnPrefix option to filter down the columns returned. Proper column-prefix functionality, though, requires HBASE-3550.
          • I had to do some hackery in setStoreLocation and getOutputFormat with the conf objects to keep NPEs from being thrown from HBase (see comments in code). A review of what I'm doing with the conf objects in that part of the code would be good.
          • There are still no unit tests for this code, since it's a tricky thing to test. I have a few simple HBase and Pig scripts that I've been using that I could provide.
          Bill Graham added a comment -

          Dmitriy, yes, of course you're right, we'd still need shims. Let's see what comes back from your question to the list. Maybe we can just move forward requiring >= 0.89.

          I've got a working patch that I should be able to attach next week, FYI (I'm on vacation this week).

          Michael Lugassy added a comment -

          Can't we just pass (extended = 'true') for the load function?

          Dmitriy V. Ryaboy added a comment -

          Bill,
          I am not sure how we can pull in both versions of HBase (one for the current HBaseStorage we would deprecate, and one for the new HBaseStorage) and not run into compilation nightmares. Seems like we need shims either way, no?

          Bill Graham added a comment -

          @Dmitriy I think the deprecation idea has its merits. The patch I'm working on is actually against HBase 0.90.0. It basically includes the PIG-1680 patch. What if we deprecated the existing HBase classes and created new ones in a new location that required HBase >= 0.90? That way we can clean up the package structure and put off having to shim a little longer.

          Michael Lugassy added a comment -

          +1 for including version timestamps in the response. This would help both processing multiple versions and easily parsing timestamps, which are "free" inside HBase cells.

          Dmitriy V. Ryaboy added a comment -

          That seems reasonable to me.

          The only reason I suggest deprecating the current HBaseStorage is that it's awkwardly placed in backend.hadoop.hbase, which is not where anyone really expects to find it. But I guess we can do that in a different ticket.

          Bill Graham added a comment -

          I agree. Dmitriy, I like where you're going with new classes and deprecation, but maybe we could do this with just an enhanced (and backward compatible) HBaseStorage and a new AdvancedHBaseStorage.

          • HBaseStorage
            • If you specify discrete columns, you get a tuple of values like the current behavior.
            • If you specify one or more CFs (or possibly a CF with a wildcard column expression), you get back a tuple of maps.
            • If you specify a mix, you get a tuple with values and maps. For example 'cf2:foo c1: cf2:bar' would produce ( value, { col => value }, value ) (see the sketch after this list).
            • This is backwards compatible and seems easiest to grok from a user's perspective.
          • AdvancedHBaseStorage
            • Somehow support multiple timestamps with a more complex data structure.
            • One possibility is to use the data structure I suggested in my previous comment, where everything is a map.
            • Another is to return something like the proposed HBaseStorage data structure, where each 'value' is replaced with ( (value, ts), ... ).
            • We could hash out the specifics of AdvancedHBaseStorage in another JIRA if we decide to go this route.
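
          A minimal Pig Latin sketch of the mixed case above (purely illustrative of the
          proposal, not committed behavior; the table, families, and columns are
          hypothetical):

          mixed = LOAD 'hbase://SampleTable'
              USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf2:foo c1: cf2:bar');
          -- proposed result shape: (value, map, value)
          vals = FOREACH mixed GENERATE $0 AS foo, $1#'some_col' AS c1_some_col, $2 AS bar;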
          Dmitriy V. Ryaboy added a comment -

          That's certainly possible; I just don't think it's a good design from a usability standpoint.

          Eric Yang added a comment -

          @Bill, agree. I filed a separate JIRA for supporting timestamps.
          @Dmitriy, would it be possible to add a parameter to switch between the return types?

          Suggested flags:

          • -returnMap (default)
          • -returnTuple

          Example for Map:

          CpuMetrics = load 'hbase://SystemMetrics' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cpu:','-loadKey');
          

          Example for Tuple:

          CpuMetrics = load 'hbase://SystemMetrics' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cpu:','-loadKey -returnTuple');
          
          Dmitriy V. Ryaboy added a comment -

          Bill, I think what you are suggesting is the "correct" way, but I'd prefer not to break people's existing scripts, which is what would happen under your proposal if we changed what we return when a schema like 'cf2:foo cf2:bar' is specified.

          There are also usability benefits to having the flat return schema you get from HBaseStorage now – it looks exactly like loading from PigStorage, so no surprises. You ask for 2 columns and get 2 values in a tuple; it's sort of what you'd expect.

          Perhaps we take your suggestion, put that into builtins.AdvancedHBaseStorage, deprecate the current HBaseStorage, and move the current code to builtins.SimpleHBaseStorage?

          Bill Graham added a comment -

          I was also thinking about a map, but I thought we might want to preserve the ordering of the fields specified when explicit fields are requested, as well as CFs, like Dmitriy's example. We'd get the CF fields in the natural ordering that HBase stores them in, too. The more I think about it, though, I don't think this is that useful, and a map approach seems the way to go.

          @Eric: Yes, Pig doesn't have any ts control upon writes currently (and that should be improved), but that shouldn't rule out the ability to read them. I can see many use cases where some non-Pig process is populating HBase, but Pig is used for queries.

          @Dmitriy: I prototyped that exact use case using tuples of tuples, but ran into the downsides you point out. Also, each row read would have a variable number of tuples, which would seem really difficult to work with.

          I like this approach when reading all columns in a family:

          ( rowKey, { col1 => ((val1, ts), ..), col2 => ((val2, ts), ..) } ) 
          

          For Dmitriy's use case, having the same schema returned (always a map) regardless of how the column families are specified (i.e., 'cf1: cf2:foo' vs 'cf1:' vs 'cf2:foo cf2:bar') is one option. Another is to return a map for CFs and a ((val1, ts), ..) for explicit columns. I'm not sure which approach would make life easier on the script writer.

          Dmitriy V. Ryaboy added a comment -

          To Eric's point, we should add timestamp controls straight into Storage.

          Returning tuples of the form ( optionalRowKey, { col1 => val1, col2 => val2 } ) makes sense to me.

          I don't like the tuple of tuples option because it makes it hard to pull out specific columns in that structure, which is likely what one wants to do.

          We should give some thought to someone loading using HbaseStorage( 'cf1:, cf2:some_col' , '-loadKey')

          Eric Yang added a comment -

          There is no control of the HBase timestamp in Pig. Hence, the timestamp returned is the actual insertion time from when the Pig store function was called. I am not sure how useful this could be. To be more explicit, it will look like:

          ( rowKey,
            (  column_name, ( (  value, ts  ), ...  )  ), ...
          )
          

          It is concise but not user-friendly.

          I am leaning toward returning a map.

          Dmitriy V. Ryaboy added a comment -

          Return a map?

          Bill Graham added a comment -

          Assigning this to myself, since I've got a working patch, but the design needs to be vetted out further with this approach.

          One issue is that the number of columns per family per row is not constant, so with a sparse table you'd have no idea which column names go with each value of the tuple returned. Another issue is that in HBase the column name is oftentimes actually dynamic, descriptive data, and there can be multiple timestamped values for a cell.

          • Option A:
            Instead of returning a tuple of values, the load can return a tuple of tuples. Each inner tuple is a two-tuple containing the column descriptor and the most recent value. This data structure would be returned if a 'cf:'-style column exists in the column list, while the default behavior is kept for explicit column names. This is the simplest approach.
          • Option B:
            Build out an even more rich (and complex) data structure that also takes into account multiple values and their timestamps. A tuple of tuple of tuple of tuples to capture the entire HBase KeyValue data structure. Something like this:
          (
           ( column name, ( (value, ts), ... ) ), ...
          )
          

          Either way, the variable-length tuples returned for each row, themselves containing additional variable-length tuples, would probably require a number of custom UDFs to do anything useful with variably named columns and multiple timestamped values.

          I guess I lean towards option B so we can support more use cases down the road with this refactor. Other opinions?


            People

             • Assignee:
               Bill Graham
             • Reporter:
               Eric Yang
             • Votes:
               1
             • Watchers:
               8
