Pig / PIG-1782

Add ability to load data by column family in HBaseStorage

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.9.0
    • Component/s: None
    • Labels:
      None
    • Environment:

      Java 6, Mac OS X 10.6

    • Release Note:
      Enhanced HBaseStorage functionality to support loading dynamically named columns by column family or by column name prefixes.

      Javadoc:


      /**
       * A HBase implementation of LoadFunc and StoreFunc.
       * <P>
       * Below is an example showing how to load data from HBase:
       * <pre>{@code
       * raw = LOAD 'hbase://SampleTable'
       * USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
       * 'info:first_name info:last_name friends:* info:*', '-loadKey true -limit 5')
       * AS (id:bytearray, first_name:chararray, last_name:chararray, friends_map:map[], info_map:map[]);
       * }</pre>
       * This example loads data redundantly from the info column family just to
       * illustrate usage. Note that the row key is inserted first in the result schema.
       * To load only column names that start with a given prefix, specify the column
       * name with a trailing '*'. For example, passing <code>friends:bob_*</code> to
       * the constructor in the above example would cause only columns that start with
       * <i>bob_</i> to be loaded.
       * <P>
       * Below is an example showing how to store data into HBase:
       * <pre>{@code
       * copy = STORE raw INTO 'hbase://SampleTableCopy'
       * USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
       * 'info:first_name info:last_name buddies:* info:*');
       * }</pre>
       * Note that STORE will expect the first value in the tuple to be the row key.
       * Scalar values need to map to an explicit column descriptor and maps need to
       * map to a column family name. In the above examples, the <code>friends</code>
       * column family data from <code>SampleTable</code> will be written to a
       * <code>buddies</code> column family in the <code>SampleTableCopy</code> table.
       *
       */

      Description

      It would be nice to load all columns in a column family using shorthand syntax like:

      CpuMetrics = load 'hbase://SystemMetrics' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cpu:','-loadKey');
      

      Assuming the cpu column family contains the columns cpu:sys.0, cpu:sys.1, cpu:user.0, and cpu:user.1, CpuMetrics would contain something like:

      (rowKey, cpu:sys.0, cpu:sys.1, cpu:user.0, cpu:user.1)
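
The selection rule requested here, together with the prefix form described in the release note, can be sketched in plain Java. This is only an illustration of the matching semantics, not the HBaseStorage implementation: the class ColumnSpecSketch and its methods are hypothetical names, and treating the proposed 'cpu:' shorthand the same as 'cpu:*' is an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration only; this class is not part of Pig or HBase. */
final class ColumnSpecSketch {
    private final String family;
    private final String qualifier; // exact column name, or a prefix when wildcard is true
    private final boolean wildcard;

    /** Parses a descriptor such as "cpu:user.0", "cpu:sys.*", or "cpu:*". */
    ColumnSpecSketch(String descriptor) {
        int colon = descriptor.indexOf(':');
        if (colon < 0) {
            throw new IllegalArgumentException("expected family:qualifier, got " + descriptor);
        }
        this.family = descriptor.substring(0, colon);
        String rest = descriptor.substring(colon + 1);
        // Assumption: an empty qualifier ("cpu:") selects the whole family, like "cpu:*".
        this.wildcard = rest.isEmpty() || rest.endsWith("*");
        this.qualifier = rest.endsWith("*") ? rest.substring(0, rest.length() - 1) : rest;
    }

    /** Returns true if the given column falls under this descriptor. */
    boolean matches(String columnFamily, String columnName) {
        if (!family.equals(columnFamily)) {
            return false;
        }
        return wildcard ? columnName.startsWith(qualifier) : columnName.equals(qualifier);
    }

    /** Filters the column names of one family down to those the descriptor selects. */
    static List<String> select(String descriptor, String columnFamily, List<String> columnNames) {
        ColumnSpecSketch spec = new ColumnSpecSketch(descriptor);
        List<String> selected = new ArrayList<>();
        for (String name : columnNames) {
            if (spec.matches(columnFamily, name)) {
                selected.add(name);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<String> cpuColumns = Arrays.asList("sys.0", "sys.1", "user.0", "user.1");
        System.out.println(select("cpu:*", "cpu", cpuColumns));      // whole family
        System.out.println(select("cpu:sys.*", "cpu", cpuColumns));  // prefix match
        System.out.println(select("cpu:user.0", "cpu", cpuColumns)); // exact column
    }
}
```

With the four example columns above, 'cpu:*' selects all of them, 'cpu:sys.*' selects sys.0 and sys.1, and 'cpu:user.0' selects only that column. In the released feature, a wildcard descriptor is loaded into a single map field rather than one field per column, as the release note's LOAD example shows.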
      
      Attachments

      1. apply-PIG-1782-patch.sh (2 kB, Bill Graham)
      2. PIG_1782_2.patch (35 kB, Bill Graham)
      3. PIG_1782_3.patch (30 kB, Bill Graham)
      4. PIG-1782_1.patch (23 kB, Bill Graham)
      5. PIG-1782_4.patch (33 kB, Dmitriy V. Ryaboy)


            People

            • Assignee: Bill Graham
            • Reporter: Eric Yang
            • Votes: 1
            • Watchers: 8
