HBase
  1. HBase
  2. HBASE-1684

Backup (Export/Import) contrib tool for 0.20

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.20.0
    • Fix Version/s: 0.20.1, 0.90.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      Add a new Result/KeyValue based Export MapReduce job to contrib for 0.20.

      Make it use the hadoop 0.20 MR API and the hbase 0.20 API (Result/Put).

      1. RestoreTable.java
        12 kB
        Lars George
      2. HBASE-1684-v1.patch
        7 kB
        Jonathan Gray
      3. HBASE-1684-trunk.patch
        16 kB
        Jonathan Gray
      4. exportimport.patch
        17 kB
        stack
      5. ExportImport.java
        5 kB
        stack
      6. BackupTable.java
        10 kB
        Lars George

        Activity

        Jonathan Gray added a comment -

        Untested, just a quick code dump. There is not much to this; it utilizes TableInputFormat (TIF) for Export and TableOutputFormat (TOF) for Import.

        What's the best way to distribute this MR job as contrib? Need to make an individual build.xml and add it to main build?

        Once I have that, will add class comments / package-info about how to use it. Could use some help on how to properly build it though.

        stack added a comment -

        What should the contrib be called? 'backupmr'? How about dump? ExportMR and ImportMR should implement Tool. See HStoreFileToStoreFile for a sample. Then they can take command line arguments. Why have a map in ExportMR? It does nothing. Same for the reducer? Just use an identity map? For ImportMR, reduce does nothing either. Just don't have it in there and set reducers to zero in the job setup. Rather than a contrib, let's just add this to the mapreduce package?

        Jonathan Gray added a comment -

        Punting to 0.20.1

        Will commit when it's ready, whenever that is. Will probably move it to mapreduce package instead of contrib.

        Lars George added a comment -

        Attached RestoreTable and BackupTable, which we use internally to back up a table into gzipped text files and then restore it on a different (or, of course, the same) cluster. The only thing missing is to also base64-encode the row key, but otherwise it works as-is.

        If you like I can pretty this up and add next to RowCounter as a hbase.jar tool. What do you think?

        Lars George added a comment -

        Stack, about your comments re: reducer/mapper needed. For RestoreTable I am using both: the mapper reads from the backup files and then randomizes the rows using a random intermediate key. This is along the lines of what Ryan did with his pure randomizer MR class. That way all the RegionServers are hit equally.

        For BackupTable I am using an IdentityTableMapper and encode the data in the reducer to have it written out by the TextOutputFormat. As we discussed a while ago with you and Jon, it should also be possible to use only a Mapper, do the work there, and set the number of Reducers to 0, which hands the Mapper records straight to the TextOutputFormat.

        Lastly, implementing Tool seems deprecated. The new mapreduce WordCount sample that comes with Hadoop 0.20 abandons it too. That is also why I changed RowCounter not to use it when I cleaned up the hbase.mapreduce package. The parsing of the generic options is done using the GenericOptionsParser directly inside main(), and the remaining arguments are used for the specific MR job. I have done the same in the attached two classes.
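The randomizing trick described above can be sketched with plain Java stand-ins (class and method names here are illustrative, not taken from the attached RestoreTable):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Sketch of the "random intermediate key" idea: instead of emitting
// restored rows under their original (sorted) keys, the map phase emits
// each row under a random key, so the shuffle spreads the subsequent
// puts evenly across all RegionServers instead of walking them in
// row-key order. All names here are illustrative.
public class RandomKeySketch {
    private static final Random RNG = new Random(42); // fixed seed for repeatability

    // Produce a zero-padded random partition key in [0, numPartitions).
    static String randomIntermediateKey(int numPartitions) {
        return String.format("%08d", RNG.nextInt(numPartitions));
    }

    public static void main(String[] args) {
        List<String> sortedRows = List.of("row-a", "row-b", "row-c", "row-d");
        List<String> emitted = new ArrayList<>();
        for (String row : sortedRows) {
            // The MR job would emit (randomKey, row); the reducer unwraps
            // the pair and issues the actual Put.
            emitted.add(randomIntermediateKey(1000) + "/" + row);
        }
        System.out.println(emitted); // each row now carries a random shuffle key
    }
}
```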

        stack added a comment -

        Classes look good, but why toString binary data? Why not write out KeyValues to SequenceFiles? And yes, you are right, I noticed that Tool seems deprecated now.

        Lars George added a comment -

        This is from before everything was binary KVs. The advantage, I guess, is that you can still somewhat read the base64-encoded backup files. Of course they are larger than the plain KV-dumped-into-SequenceFile version. The code above uses TextOutputFormat and does so with minimal extra effort. But that is no strict requirement, of course, and could be changed.
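The readability-vs-size tradeoff discussed here is easy to see with plain JDK Base64 (used below as a stand-in for whatever encoder the attached classes use; the class is illustrative):

```java
import java.util.Arrays;
import java.util.Base64;

// Base64 keeps binary cell data printable in a text backup file, at the
// cost of a ~4/3 size blow-up versus dumping the raw KeyValue bytes into
// a SequenceFile.
public class Base64Tradeoff {
    static String encode(byte[] raw) {
        return Base64.getEncoder().encodeToString(raw);
    }

    static byte[] decode(String text) {
        return Base64.getDecoder().decode(text);
    }

    public static void main(String[] args) {
        // A row key containing non-printable bytes.
        byte[] rowKey = {0x00, (byte) 0xFF, 'r', 'o', 'w', 0x01};
        String printable = encode(rowKey);
        System.out.println(printable + " (" + rowKey.length + " raw bytes -> "
                + printable.length() + " text chars)");
        // Round-trips losslessly, so a restore recovers the exact key.
        assert Arrays.equals(decode(printable), rowKey);
    }
}
```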

        stack added a comment -

        I think the toString+base64'ing is messy. I think a tool that did a binary dump would be more performant and more generally useful. Regarding readability, it's a non-issue since you added that fancy usage and options to hfile. Folks can look at their binary data in hfiles easily enough.

        stack added a comment -

        This is based on Jon's code – some cleanup, using the inner-class idiom – but it's not right yet. It puts the import/export together in one class. I think it'd just be cleaner doing them as separate classes, as Jon had it.

        stack added a comment -

        Patch that adds import and export jobs to the hbase mapreduce MR driver.

        Export does like Jon's, writing out to sequence files.

        Import reads back in from said sequence files.

        These classes are more in line w/ the 0.20.0 MR idiom (no unnecessary reduce, etc.).

        Also fixed RowCounter so it no longer needs an output dir.

        Fixed TableInputFormat so you no longer need to specify columns (if no columns are given, all columns are scanned).

        Changed some of the util so it can take null classes; this is sometimes needed (e.g. the above imports/exports).

        Please review.
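Assuming the classes land in the hbase mapreduce package as described, invocation would look roughly like this (class names, table name, and paths are illustrative; check the committed usage text for the exact arguments):

```shell
# Dump a table to SequenceFiles of Result/KeyValue records.
bin/hbase org.apache.hadoop.hbase.mapreduce.Export mytable /backup/mytable

# Later, load the dump back into an existing table with the same schema.
bin/hbase org.apache.hadoop.hbase.mapreduce.Import mytable /backup/mytable
```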

        stack added a comment -

        I tested this by loading a table, exporting it, dropping it, recreating it, then reimporting, then running rowcounter to confirm.

        Jonathan Gray added a comment -

        Version of stack's patch that applies cleanly to trunk.

        Jonathan Gray added a comment -

        Patch looks good to me. +1 for commit

        stack added a comment -

        Committed branch and trunk.

        Schubert Zhang added a comment -

        The Export(backup)/Import tools in this issue seem to just get and insert data via the normal API.

        For a bulk backup/export tool: why not just copy out the HFiles?
        For a bulk import tool: that should be HBASE-48.

        Jonathan Gray added a comment -

        @Schubert What do you mean, just use the normal API? These are MapReduce jobs. Are you referring to wanting to do imports/exports at the HDFS level instead? (That is not this issue.)

        Bulk importing can be done with HBASE-48, but it is not yet fully featured (it only works on single families, doesn't work into existing tables, etc.).

        Bulk exporting at the HFile level is going to require some form of freezing/snapshotting so it's 100% safe and consistent to do (though it's still possible now, and we're doing it in production here). There is an old issue for this as well, HBASE-50.

        stack added a comment -

        @Schubert Yeah, these use the normal API. Fellas might use them if they want to back up a table and then reload it. Using these tools, there is no need to mess with .META. (presuming the table already exists). Regarding bulk exporting by copying hfiles, do you want to make an issue? It shouldn't be hard to do. Just copy the table directory in hdfs. On restore, you would probably need to wipe the table from .META. and then add entries per region in the backup (using the content of the .regioninfo file that is underneath the region directory). Shouldn't be hard. If you need it, let's work on it together.
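The hfile-copy alternative described here can be sketched at the shell level (cluster names and paths are purely illustrative; this copies the data only and does not handle the .META. re-registration described above):

```shell
# Copy the whole table directory out of the source cluster's HBase root.
# distcp runs the copy as a MapReduce job; hostnames and paths are illustrative.
hadoop distcp hdfs://src-nn:8020/hbase/mytable \
              hdfs://backup-nn:8020/backups/mytable

# On restore, the region directories (each containing a .regioninfo file)
# would still need to be re-registered in .META.
```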


          People

          • Assignee:
            stack
            Reporter:
            Jonathan Gray
          • Votes:
            0
            Watchers:
            4
