  1. HBase
  2. HBASE-13153

Bulk Loaded HFile Replication

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.3.0, 2.0.0
    • Component/s: Replication
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      This enhances HBase replication to support replication of bulk loaded data. The feature is configurable and disabled by default, so bulk loaded data is not replicated to the peer cluster(s) unless it is turned on. To enable it, set "hbase.replication.bulkload.enabled" to true.

      The following additional configurations were added for this enhancement:
       a. hbase.replication.cluster.id - Mandatory in any cluster where replication of bulk loaded data is enabled. The sink cluster uses this id to uniquely identify the source cluster. Configure it in the source cluster configuration file on all RS.
       b. hbase.replication.conf.dir - The directory on the peer cluster that holds the file system client configurations of all active (source) clusters, each in a subfolder named after that cluster's replication cluster id. Configure it in the peer cluster configuration file on all RS. Default is HBASE_CONF_DIR.
       c. hbase.replication.source.fs.conf.provider - The class that provides the source cluster's file system client configuration to the peer cluster. Configure it in the peer cluster configuration file on all RS. Default is org.apache.hadoop.hbase.replication.regionserver.DefaultSourceFSConfigurationProvider.

       For example: if the source cluster FS client configurations are copied to the peer cluster under the directory /home/user/dc1/, then hbase.replication.cluster.id should be configured as dc1 and hbase.replication.conf.dir as /home/user.

      Note:
       a. With the default hbase.replication.source.fs.conf.provider, any modification to the source cluster FS client configuration files in the peer cluster's replication configuration directory requires a restart of all RS in the peer cluster(s).
       b. Only 'xml' type files are loaded by the default hbase.replication.source.fs.conf.provider.
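
      The directory layout described above can be illustrated with a small standalone sketch. It is not the code of DefaultSourceFSConfigurationProvider; the class name SourceFsConfLocator and both method names are hypothetical, and the sketch only assumes what the note states: the source cluster's files live in <hbase.replication.conf.dir>/<hbase.replication.cluster.id>/ and only 'xml' files are picked up.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper for illustration only; not part of HBase.
public class SourceFsConfLocator {

    // Resolves <confDir>/<clusterId>, the subfolder expected to hold one
    // source cluster's file system client configuration files.
    public static Path clusterConfDir(String confDir, String clusterId) {
        return Paths.get(confDir).resolve(clusterId);
    }

    // Lists the regular files ending in ".xml" in that subfolder, sorted by path.
    // Returns an empty list when the subfolder does not exist.
    public static List<Path> xmlConfFiles(String confDir, String clusterId) throws IOException {
        Path dir = clusterConfDir(confDir, clusterId);
        List<Path> files = new ArrayList<>();
        if (!Files.isDirectory(dir)) {
            return files;
        }
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, "*.xml")) {
            for (Path p : stream) {
                if (Files.isRegularFile(p)) {
                    files.add(p);
                }
            }
        }
        Collections.sort(files);
        return files;
    }

    public static void main(String[] args) throws IOException {
        // Defaults mirror the example in the release note.
        String confDir = args.length > 0 ? args[0] : "/home/user";
        String clusterId = args.length > 1 ? args[1] : "dc1";
        System.out.println("Looking in " + clusterConfDir(confDir, clusterId));
        for (Path p : xmlConfFiles(confDir, clusterId)) {
            System.out.println("  " + p.getFileName());
        }
    }
}
```

      Running it on a peer cluster RS host with the values of hbase.replication.conf.dir and hbase.replication.cluster.id as arguments gives a quick check that the expected files are in place before enabling the feature.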

      As part of this change, the following modifications were made to the LoadIncrementalHFiles class, which is marked as a Public and Stable class:
       a. Raised the visibility of the LoadQueueItem class from package private to public.
       b. Added a new method, loadHFileQueue, which loads the queue of LoadQueueItem into the table according to the region keys provided.

      Description

      Currently we plan to use the HBase Replication feature to deal with a disaster tolerance scenario. However, we use bulkload very frequently, and because bulkload bypasses the write path and does not generate WAL entries, the data will not be replicated to the backup cluster. It is inappropriate to bulkload twice, once on the active cluster and once on the backup cluster. So I suggest modifying the bulkload feature so that a bulkload is applied to both the active cluster and the backup cluster.

        Attachments

        1. HBASE-13153.patch
          180 kB
          Ashish Singhi
        2. HBASE-13153-branch-1-v20.patch
          244 kB
          Ashish Singhi
        3. HBASE-13153-branch-1-v21.patch
          244 kB
          Ashish Singhi
        4. HBASE-13153-v1.patch
          187 kB
          Ashish Singhi
        5. HBASE-13153-v10.patch
          206 kB
          Ashish Singhi
        6. HBASE-13153-v11.patch
          208 kB
          Ashish Singhi
        7. HBASE-13153-v12.patch
          229 kB
          Ashish Singhi
        8. HBASE-13153-v13.patch
          230 kB
          Ashish Singhi
        9. HBASE-13153-v14.patch
          239 kB
          Ashish Singhi
        10. HBASE-13153-v15.patch
          240 kB
          Ashish Singhi
        11. HBASE-13153-v16.patch
          240 kB
          Ashish Singhi
        12. HBASE-13153-v17.patch
          241 kB
          Ashish Singhi
        13. HBASE-13153-v18.patch
          241 kB
          Ashish Singhi
        14. HBASE-13153-v19.patch
          241 kB
          Ashish Singhi
        15. HBASE-13153-v2.patch
          186 kB
          Ashish Singhi
        16. HBASE-13153-v20.patch
          241 kB
          Ashish Singhi
        17. HBASE-13153-v21.patch
          240 kB
          Ashish Singhi
        18. HBASE-13153-v3.patch
          187 kB
          Ashish Singhi
        19. HBASE-13153-v4.patch
          177 kB
          Ashish Singhi
        20. HBASE-13153-v5.patch
          177 kB
          Ashish Singhi
        21. HBASE-13153-v6.patch
          180 kB
          Ashish Singhi
        22. HBASE-13153-v7.patch
          193 kB
          Ashish Singhi
        23. HBASE-13153-v8.patch
          193 kB
          Ashish Singhi
        24. HBASE-13153-v9.patch
          205 kB
          Ashish Singhi
        25. HBase Bulk Load Replication.pdf
          344 kB
          Ashish Singhi
        26. HBase Bulk Load Replication-v1-1.pdf
          310 kB
          Ashish Singhi
        27. HBase Bulk Load Replication-v2.pdf
          349 kB
          Ashish Singhi
        28. HBase Bulk Load Replication-v3.pdf
          322 kB
          Ashish Singhi
        29. HDFS_HA_Solution.PNG
          14 kB
          Ashish Singhi

              People

              • Assignee:
                Ashish Singhi
                Reporter:
                haitao-tony sunhaitao
              • Votes:
                0
                Watchers:
                41

                Dates

                • Created:
                  Updated:
                  Resolved: