|
What should the contrib be called? 'backupmr'? How about dump? ExportMR and ImportMR should implement Tool. See HStoreFileToStoreFile for a sample. Then they can take command-line arguments. Why have a map in ExportMR? It does nothing. Same for the reducer? Just use the identity map? For ImportMR, reduce does nothing either. Just leave it out and set the number of reducers to zero in the job setup. Rather than a contrib, let's just add this to the mapreduce package?
Punting to 0.20.1
Will commit when it's ready, whenever that is. Will probably move it to the mapreduce package instead of contrib. Attached RestoreTable and BackupTable, which we use internally to back up a table into gzipped text files and then restore it on a different (or the same, of course) cluster. The only thing missing is to also base64-encode the row key, but otherwise it works as-is.
If you like, I can pretty this up and add it next to RowCounter as an hbase.jar tool. What do you think? Stack, about your comments on whether the reducer/mapper is needed: for RestoreTable I am using both. The mapper reads from the backup files and then randomizes the rows using a random intermediate key. This is along the lines of what Ryan did with his pure randomizer MR class. That way all the RegionServers are hit equally.
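To illustrate the random intermediate key idea, here is a minimal plain-Java sketch. It is not the attached RestoreTable code and uses no Hadoop classes: the class name `RandomKeyShuffle` is hypothetical, and a `TreeMap` stands in for the framework's sort on the intermediate key. The point is only that rows read in sorted order come out in random-key order, so consecutive writes are spread across the key space rather than landing on one region at a time.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.TreeMap;

// Hypothetical stand-in for the restore job's map and shuffle phases (assumption, not the real job).
class RandomKeyShuffle {

    // "Map" step: tag each row with a random intermediate key.
    // The TreeMap simulates the sort the MR framework performs on that key.
    static List<String> shuffle(List<String> sortedRows, long seed) {
        Random random = new Random(seed);
        TreeMap<Long, List<String>> byRandomKey = new TreeMap<>();
        for (String row : sortedRows) {
            // Rows that happen to draw the same key are kept together in one bucket.
            byRandomKey.computeIfAbsent(random.nextLong(), k -> new ArrayList<>()).add(row);
        }
        // "Reduce" step: emit rows in random-key order, dropping the key.
        List<String> out = new ArrayList<>();
        for (List<String> rows : byRandomKey.values()) {
            out.addAll(rows);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> rows = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            rows.add(String.format("row-%03d", i));
        }
        // Same rows as the input, but no longer in row-key order.
        System.out.println(shuffle(rows, 42L));
    }
}
```

In a real job the reducer would issue the puts. The fixed seed is only there to make the sketch repeatable.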
For BackupTable I am using an IdentityTableMapper and encoding the data in the reducer so it can be written out by TextOutputFormat. As we discussed a while ago with you and Jon, it should also be possible to use only a Mapper, do the work there, and set the number of reducers to 0, which then hands the Mapper records straight to TextOutputFormat. Lastly, implementing Tool seems deprecated. The new mapreduce WordCount sample that comes with Hadoop 0.20 abandons it too. That is also why I changed RowCounter not to use it when I cleaned up the hbase.mapreduce package. The generic options are parsed using GenericOptionsParser directly inside main(), and the remaining arguments are used for the specific MR job. I have done the same in the two attached classes. This is from before the all-binary KVs. The advantage, I guess, is that you can still somewhat read the base64-encoded backup files. Of course they are larger than the plain KV-dumped-into-SequenceFile version. The code above uses TextOutputFormat and does so with minimal extra effort. But that is no strict requirement, of course, and could be changed.
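A rough sketch of what a base64-encoded text record could look like, for illustration only. The line layout (row, column, value, tab-separated), the class name `Base64LineCodec` and the decision to encode all three fields are assumptions, not the format the attached BackupTable writes. Base64 output contains no tabs or newlines, so arbitrary binary cells fit on one TextOutputFormat-style line.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical line format (assumption): base64(row) TAB base64(column) TAB base64(value).
class Base64LineCodec {

    // Encode one cell as a single text line that is safe to write line by line.
    static String encode(byte[] row, byte[] column, byte[] value) {
        Base64.Encoder enc = Base64.getEncoder();
        return enc.encodeToString(row) + "\t"
            + enc.encodeToString(column) + "\t"
            + enc.encodeToString(value);
    }

    // Decode a line back into {row, column, value}.
    static byte[][] decode(String line) {
        // A limit of -1 keeps trailing empty fields, so an empty value still round-trips.
        String[] fields = line.split("\t", -1);
        if (fields.length != 3) {
            throw new IllegalArgumentException("expected 3 tab-separated fields, got " + fields.length);
        }
        Base64.Decoder dec = Base64.getDecoder();
        return new byte[][] {
            dec.decode(fields[0]), dec.decode(fields[1]), dec.decode(fields[2])
        };
    }

    public static void main(String[] args) {
        byte[] row = "row-1".getBytes(StandardCharsets.UTF_8);
        byte[] column = "info:name".getBytes(StandardCharsets.UTF_8);
        byte[] value = new byte[] {0, 9, 10, (byte) 0xFF}; // binary, includes tab and newline bytes
        String line = encode(row, column, value);
        System.out.println(line);
        System.out.println(new String(decode(line)[0], StandardCharsets.UTF_8));
    }
}
```

The trade-off is the one mentioned above: base64 grows the data by roughly a third compared with writing the raw bytes.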
I think the toString+base64'ing is messy. I think a tool that does a binary dump will be more performant and more generally useful. Regarding readability, it's a non-issue since you added that fancy usage and options to hfile. Folks can look at their binary data in hfiles easily enough.
Patch that adds import and export jobs to the hbase mapreduce MR driver.
Export works like Jon's, writing out to sequence files. Import reads in from said sequence files. These classes are more in line with the 0.20.0 MR idiom (no unnecessary reduce, etc.). Also fixed RowCounter so it no longer needs an output dir. Fixed TableInputFormat so you no longer need to specify columns (if no columns, then all columns). Changed some of the util so it can take null classes; this is sometimes needed (e.g. for the imports/exports above). Please review. Version of stack's patch that applies cleanly to trunk.
Patch looks good to me. +1 for commit
The Export (backup)/Import tools in this issue seem to just get and insert data through the normal API.
For the bulk backup/export tool: Why not just copy out the HFiles? @Schubert What do you mean just use the normal API? These are MapReduce jobs. You are referring to wanting to do imports/exports at the HDFS level instead? (that is not this issue)
Bulk importing can be done with Bulk exporting at HFile level is going to require some form of freezing/snapshotting so it's 100% safe and consistent to do that (though it's still possible now, and we're doing it in production here). There is an old issue for this as well, HBASE-50. @Schubert Yeah, these use the normal API. Fellas might use them if they want to back up a table and then reload it. Using these tools, there is no need to mess with .META. (presuming the table already exists). Regarding bulk exporting by copying hfiles, do you want to make an issue? It shouldn't be hard to do. Just copy the table directory in HDFS. On restore, you would probably need to wipe the table from .META. and then add entries per region in the backup (using the content of the .regioninfo file that is underneath the region directory). If you need it, let's work on it together.
|
What's the best way to distribute this MR job as a contrib? Do I need to make an individual build.xml and add it to the main build?
Once I have that, will add class comments / package-info about how to use it. Could use some help on how to properly build it though.