HBASE-18161: Incremental Load support for Multiple-Table HFileOutputFormat


Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.0.0-alpha-2, 2.0.0
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Incompatible change, Reviewed
    • Release Note:
      In order to use this feature, a user must:

      1. Register their tables when configuring their job.
      2. Create a composite key of the table name and the original rowkey to send as the mapper output key.

      To register their tables (and configure their job for incremental load into multiple tables), a user must call the static MultiHFileOutputFormat.configureIncrementalLoad function with the HBase tables that will be ingested into.

      To create the composite key, the helper function MultiHFileOutputFormat2.createCompositeKey should be called with the destination table name and rowkey as arguments, and the result should be emitted as the mapper output key.

      Before this JIRA, the storage policy for HFileOutputFormat2 was configured per column family and was set manually by the user. This is unchanged when using HFileOutputFormat2. When using MultiHFileOutputFormat2, however, the user must now set the prefix manually as a composite of the table name and the column family. The composite value can be created by calling MultiHFileOutputFormat2.createCompositeKey with the table name and column family as arguments.

      Changes added through this JIRA are backwards compatible with existing HFileOutputFormat2 APIs and functionality.

      The configuration parameter "hbase.mapreduce.hfileoutputformat.table.name" is now a REQUIRED parameter, though it is normally set automatically when the configureIncrementalLoad method is called on HFileOutputFormat2.
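      The two uses of the composite key described above (table name plus rowkey for the mapper output key, table name plus column family for per-family settings) can be sketched in plain Java. This is a minimal illustration and not the HBase implementation: the class CompositeKeySketch, the helpers tableNameOf and suffixOf, and the ';' separator byte are all assumptions made for the example. It also assumes the table name never contains the separator.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical stand-in for the composite-key helper; not the HBase class itself. */
final class CompositeKeySketch {
    // Assumed separator between table name and suffix; the release note does not specify one.
    static final byte SEPARATOR = (byte) ';';

    /** Joins tableName, the separator, and suffix (a rowkey or a column family name). */
    static byte[] createCompositeKey(byte[] tableName, byte[] suffix) {
        byte[] out = new byte[tableName.length + 1 + suffix.length];
        System.arraycopy(tableName, 0, out, 0, tableName.length);
        out[tableName.length] = SEPARATOR;
        System.arraycopy(suffix, 0, out, tableName.length + 1, suffix.length);
        return out;
    }

    /** Returns the table-name part: everything before the first separator. */
    static byte[] tableNameOf(byte[] compositeKey) {
        return Arrays.copyOfRange(compositeKey, 0, indexOfSeparator(compositeKey));
    }

    /** Returns the suffix part: everything after the first separator. */
    static byte[] suffixOf(byte[] compositeKey) {
        return Arrays.copyOfRange(compositeKey, indexOfSeparator(compositeKey) + 1, compositeKey.length);
    }

    private static int indexOfSeparator(byte[] key) {
        for (int i = 0; i < key.length; i++) {
            if (key[i] == SEPARATOR) {
                return i;
            }
        }
        throw new IllegalArgumentException("not a composite key: no separator found");
    }

    public static void main(String[] args) {
        byte[] table = "table1".getBytes(StandardCharsets.UTF_8);
        // Composite of table name and rowkey: what a mapper would emit as its output key.
        byte[] mapperKey = createCompositeKey(table, "row-42".getBytes(StandardCharsets.UTF_8));
        // Composite of table name and column family: the prefix for per-family settings.
        byte[] familyKey = createCompositeKey(table, "family1".getBytes(StandardCharsets.UTF_8));
        System.out.println(new String(mapperKey, StandardCharsets.UTF_8));
        System.out.println(new String(familyKey, StandardCharsets.UTF_8));
    }
}
```

      In a real job, the mapper would wrap the composite bytes in its output key type, and the record writer would split them back into table name and rowkey before choosing which table's HFiles to write.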

    Description

      Introduction

      MapReduce currently supports writing HBase records in bulk to HFiles for a single table. The file(s) can then be uploaded to the relevant RegionServers with reasonable latency. This feature is useful for making a large set of data available for queries all at once, and it provides a way to efficiently load very large input into HBase without affecting query latencies.

      There is, however, no support to write variations of the same record key to HFiles belonging to multiple HBase tables from within the same MapReduce job.

      Goal

      The goal of this JIRA is to extend HFileOutputFormat2 to support writing to HFiles for different tables within the same MapReduce job while keeping single-table HFile features backwards-compatible.

      For our use case, we needed to write a record key to a smaller HBase table for quicker access, and the same record key with a date appended to a larger table for longer term storage with chronological access. Each of these tables would have different TTL and other settings to support their respective access patterns. We also needed to be able to bulk write records to multiple tables with different subsets of very large input as efficiently as possible. Rather than run the MapReduce job multiple times (one for each table or record structure), it would be useful to be able to parse the input a single time and write to multiple tables simultaneously.

      Additionally, we'd like to maintain backwards compatibility with the existing, heavily used HFileOutputFormat2 interface so that benefits such as locality sensitivity (which was introduced long after we implemented support for multiple tables) apply to both single-table and multi-table HFile writes.

      Proposal

      • Backwards compatibility for existing single-table support in HFileOutputFormat2 will be maintained; in this case, mappers will emit the table rowkey as before. A new class, MultiHFileOutputFormat, will provide a helper function that generates the mapper output key by prefixing the desired table name to the existing rowkey, and will provide configureIncrementalLoad support for multiple tables.
      • HFileOutputFormat2 will be updated in the following way:
        • configureIncrementalLoad will now accept multiple table descriptor and region locator pairs, analogous to the single pair currently accepted by HFileOutputFormat2.
        • The per-column-family Compression, Block Size, Bloom Type and data block encoding settings stored in the Configuration object are now indexed and retrieved by table name AND column family.
        • getRegionStartKeys will now support multiple region locators and will calculate split points, and therefore partitions, collectively for all tables. As a result, the number of Reducers will equal the total number of partitions across all tables.
        • The RecordWriter class will be able to process rowkeys either with or without the table name prepended, depending on whether configureIncrementalLoad was called on MultiHFileOutputFormat or HFileOutputFormat2.
      • MultiHFileOutputFormat will write HFiles in the same format as HFileOutputFormat2. The default single-table case keeps the existing directory structure, with one directory per column family containing that family's HFiles. With MultiHFileOutputFormat, HFiles are written under the output directory with the following relative paths:
             --table1 
               --family1 
                 --HFiles 
             --table2 
               --family1 
               --family2 
                 --HFiles
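
      The collective split-point calculation in the proposal can be illustrated with a short sketch: prefix each table's region start keys with the table name, merge them into one sorted list, and treat each entry as one partition (and so one reducer). This is an illustration of the idea only, not code from the patch. The class MultiTableSplitSketch, its methods, and the ';' separator are assumptions, and Strings are used in place of byte arrays for brevity, so ordering here follows String comparison rather than HBase's byte-wise comparison.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

/** Hypothetical illustration only; names and the separator are assumptions, not the patch's code. */
final class MultiTableSplitSketch {
    // Assumed separator between table name and region start key.
    static final String SEPARATOR = ";";

    /** Prefixes every region start key with its table name and merges all tables into one sorted list. */
    static List<String> combinedStartKeys(Map<String, List<String>> startKeysByTable) {
        TreeSet<String> sorted = new TreeSet<>();
        for (Map.Entry<String, List<String>> entry : startKeysByTable.entrySet()) {
            for (String startKey : entry.getValue()) {
                sorted.add(entry.getKey() + SEPARATOR + startKey);
            }
        }
        return new ArrayList<>(sorted);
    }

    /** Returns the index of the last combined start key that is less than or equal to compositeKey. */
    static int partitionFor(String compositeKey, List<String> combinedStartKeys) {
        int idx = Collections.binarySearch(combinedStartKeys, compositeKey);
        if (idx >= 0) {
            return idx; // the key is itself a region start key
        }
        int insertionPoint = -(idx + 1);
        return Math.max(0, insertionPoint - 1);
    }

    public static void main(String[] args) {
        Map<String, List<String>> startKeys = new LinkedHashMap<>();
        startKeys.put("table1", Arrays.asList("", "m"));      // two regions
        startKeys.put("table2", Arrays.asList("", "h", "p")); // three regions
        List<String> combined = combinedStartKeys(startKeys);
        System.out.println("partitions (one reducer each): " + combined.size());
        System.out.println("partition for table2;k: " + partitionFor("table2;k", combined));
    }
}
```

      Because every composite key starts with its table name, all keys for one table fall into a contiguous run of partitions, so each reducer only ever writes HFiles for a single table.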
        

      This aims to be a comprehensive solution to the original tickets - HBASE-3727 and HBASE-16261. Thanks to clayb for his support. This is a contribution from Bloomberg developers.

      The patch will be attached shortly.

      Attachments

        1. MultiHFileOutputFormatSupport_HBASE_18161_v11.patch
          123 kB
          Densel Santhmayor
        2. MultiHFileOutputFormatSupport_HBASE_18161_v10.patch
          126 kB
          Densel Santhmayor
        3. MultiHFileOutputFormatSupport_HBASE_18161_v9.patch
          125 kB
          Densel Santhmayor
        4. MultiHFileOutputFormatSupport_HBASE_18161_v8.patch
          125 kB
          Densel Santhmayor
        5. MultiHFileOutputFormatSupport_HBASE_18161_v7.patch
          122 kB
          Densel Santhmayor
        6. MultiHFileOutputFormatSupport_HBASE_18161_v6.patch
          122 kB
          Densel Santhmayor
        7. MultiHFileOutputFormatSupport_HBASE_18161_v5.patch
          122 kB
          Densel Santhmayor
        8. MultiHFileOutputFormatSupport_HBASE_18161_v4.patch
          122 kB
          Densel Santhmayor
        9. MultiHFileOutputFormatSupport_HBASE_18161_v3.patch
          123 kB
          Densel Santhmayor
        10. MultiHFileOutputFormatSupport_HBASE_18161_v2.patch
          125 kB
          Densel Santhmayor
        11. MultiHFileOutputFormatSupport_HBASE_18161.patch
          117 kB
          Densel Santhmayor

            People

              Assignee: Densel Santhmayor (denselm)
              Reporter: Densel Santhmayor (denselm)
              Votes: 0
              Watchers: 8
