  1. Pig
  2. PIG-966 Proposed rework for LoadFunc, StoreFunc, and Slice/r interfaces
  3. PIG-1205

Enhance HBaseStorage-- Make it support loading row key and implement StoreFunc


Details

    • Sub-task
    • Status: Closed
    • Major
    • Resolution: Fixed
    • 0.7.0
    • 0.8.0
    • None
    • None
    • Release Note:
      HBaseStorage has been significantly reworked with this release.

      Usage:
      {code}
      my_data = LOAD 'hbase://table_name' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('colfamily:col1 colfamily:col2', '-caching 100') as (col1:int, col2:chararray);

      STORE my_data INTO 'hbase://other_table' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('colfamily:col1 colfamily:col2');
      {code}

      HBaseStorage can now write data into HBase as well as read it. The first argument is a space-delimited list of columns to be loaded (or stored). Columns are specified as columnfamily:column_name. The second argument is an optional set of key-value pairs used to control HBaseStorage behavior. Available arguments are:

      * {{-loadKey}} Used to load the row key; false by default. If true, the first field in the returned tuple will be the value of the row key.
      * {{-gt}}, {{-gte}}, {{-lt}}, and {{-lte}} Used to specify bounds on row keys to be scanned. The keys are specified as binary data, using the hex representation. Any slashes have to be double-escaped (two slashes per single "real" slash) to be parsed correctly.
      * {{-caching}} Used to specify the number of rows to be cached per HBase RPC call. See http://hbase.apache.org/docs/current/api/org/apache/hadoop/hbase/client/HTable.html#setScannerCaching%28int%29 for more information about this HBase feature.
      * {{-limit}} Used to control how many rows *per scanned region* will be retrieved. This can speed up processing when only a few rows are needed. The total number of rows returned is at most (number of regions) * limit. The limit is applied after any -gt, -lt, etc. filters. Pig's LIMIT operator can be used in conjunction with this argument.
      * {{-caster}} Used to specify a LoadCaster (or a LoadStoreCaster, for storage) that converts the data stored in HBase into Pig data. By default, the Utf8StorageConverter is used, which stores all data as its string representation. The string "HBaseBinaryConverter" can be used to specify that data is stored in HBase's native binary format. Note that HBaseBinaryConverter does not work with complex data types such as maps, tuples, and bags. You can also specify a fully qualified class name, such as org.apache.pig.backend.hadoop.hbase.HBaseBinaryConverter, to use your own caster. The default caster can be changed by setting the pig.hbase.caster property in pig.properties.
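
      As a sketch of how these options combine (the table name, column names, and schema are hypothetical, and the exact option syntax may vary between Pig versions), the following load returns the row key as the first field and fetches at most 10 rows per scanned region:

      {code}
      -- Hypothetical table and columns; '-loadKey true' places the row key in the first field.
      keyed = LOAD 'hbase://table_name' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('colfamily:col1 colfamily:col2', '-loadKey true -caching 100 -limit 10') AS (rowkey:chararray, col1:int, col2:chararray);
      {code}

      With this setting, a table split across 4 regions would yield at most 4 * 10 = 40 rows.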

      HBaseStorage matches column arguments to tuple fields based on their ordinal position. When storing, the first field is expected to be the key value.
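
      For example, assuming a hypothetical relation whose first field holds the row key, the two remaining fields map in order to the two listed columns:

      {code}
      -- Hypothetical input file and schema; the first field (id) becomes the HBase row key.
      rows = LOAD 'input.tsv' AS (id:chararray, col1:int, col2:chararray);
      -- By position, col1 is written to colfamily:col1 and col2 to colfamily:col2.
      STORE rows INTO 'hbase://other_table' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('colfamily:col1 colfamily:col2');
      {code}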
    • hbase

    Attachments

      1. PIG_1205.patch
        13 kB
        Jeff Zhang
      2. PIG_1205_2.patch
        13 kB
        Jeff Zhang
      3. PIG_1205_3.patch
        14 kB
        Jeff Zhang
      4. PIG_1205_4.patch
        15 kB
        Jeff Zhang
      5. PIG_1205_5.path
        32 kB
        Dmitriy V. Ryaboy
      6. PIG_1205_6.patch
        35 kB
        Dmitriy V. Ryaboy
      7. PIG_1205_7.patch
        52 kB
        Dmitriy V. Ryaboy
      8. PIG_1205_8.patch
        66 kB
        Jeff Zhang
      9. hbase-0.20.6.jar
        1.50 MB
        Dmitriy V. Ryaboy
      10. hbase-0.20.6-test.jar
        1.94 MB
        Dmitriy V. Ryaboy
      11. PIG_1205_9.patch
        75 kB
        Dmitriy V. Ryaboy

        People

          dvryaboy Dmitriy V. Ryaboy
          zjffdu Jeff Zhang
          Votes: 1
          Watchers: 8
