MapReduce files for importing and exporting data from HBase tables.
Extra scripts used to create the jars and execute the jobs are attached for reference. Removed the dependency on HBaseRef since it is not needed.
Erik:
Would suggest that when you are successful with Yair, you get him to +1 this issue.

On the ExportMR job, I wonder if it's possible to set maps == 0 so you don't have to supply a map task?

Should we commit these classes to hbase? Into a contrib or under examples? I like the way they serialize RR. Good idea. If we're going to commit, they need apache licenses and the style fixed up (don't ask jgray; he'll only tell you the wrong thing... smile).

For the below, check Writables in hbase util. I think there are methods there to help you do the below:

    ByteArrayInputStream bis = new ByteArrayInputStream(((BytesWritable) val).get()); // baos.toByteArray());
    DataInputStream dis = new DataInputStream(bis);
    RowResult rowRes = new RowResult();
    rowRes.readFields(dis);

If you use HbaseMapWritable instead of MW, you could do without Text and toString'ing the table name (I think).

In 880, I believe RowResult and BatchUpdate have the same ancestor. Would be sweet if they could be used interchangeably so you wouldn't need to do the conversion in rowResultToBatchUpdate.

Do you think it makes sense to create the new HTable in the reduce each time it's invoked rather than in its configure step?

Hey Stack!
I made some changes to the files so that they look better according to the standards and use util.Writables instead. I haven't tested the functionality yet, so I will post the new code when I have, and you can comment on it again.

Not really sure where it fits best, but I would say examples for now. I don't want to spend too much time fixing the code though, since I think we will have more efficient ways of doing the backup in a little while, when we start messing with Cascading for example. We will see what Yair says, and after that I will post the updated code.

At first I didn't really understand why I created a new HTable in every reducer, but today it struck me that we had it set up in another way. We have a kind of pool of tables that you check in and out, but it has dependencies, so that is why I removed it. Of course it doesn't make any sense to have it the way it is now; it just slows things down a lot.

Very interesting idea of directly serializing RR. However, in the importer reducer, as you said, you could create the new HTable in configure(), but you don't even have to do that. You can just let the output collect (key, batchUpdate) directly, and TableReduce will take care of committing. Plus, TableReduce sets autoflush off, which significantly boosts importing performance.
See original example in

TableReduce has had some performance issues in the past. I think it's pretty good now though.
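To illustrate why turning autoflush off helps an import, here is a minimal sketch in plain Java. It does not use the HBase API: `CountingSink` and `BufferedRowWriter` are hypothetical stand-ins for a remote table and a client-side write buffer, and the row-count threshold is an assumption made for simplicity (a real client may buffer by size in bytes instead). With autoflush on, every commit is its own round trip; with it off, rows are sent in batches.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a remote table: it only counts round trips and rows.
final class CountingSink {
    int roundTrips = 0;
    int rowsWritten = 0;

    void send(List<String> rows) {
        roundTrips++;
        rowsWritten += rows.size();
    }
}

// Hypothetical writer: with autoFlush each row is sent immediately,
// otherwise rows are buffered and sent once bufferSize rows have accumulated.
final class BufferedRowWriter {
    private final CountingSink sink;
    private final boolean autoFlush;
    private final int bufferSize;
    private final List<String> buffer = new ArrayList<>();

    BufferedRowWriter(CountingSink sink, boolean autoFlush, int bufferSize) {
        this.sink = sink;
        this.autoFlush = autoFlush;
        this.bufferSize = bufferSize;
    }

    void commit(String row) {
        buffer.add(row);
        if (autoFlush || buffer.size() >= bufferSize) {
            flush();
        }
    }

    // Sends whatever is buffered; must be called once at the end of the job
    // so that a partially filled buffer is not lost.
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        sink.send(new ArrayList<>(buffer));
        buffer.clear();
    }
}
```

The trade-off in this sketch is that buffered rows are not visible to the sink until the next flush, so the final `flush()` call matters.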
Yes, you'll definitely want to turn off autoflush. Honestly, I don't think that option existed when these jobs were written.

I will close these issues once I open an 0.20 issue. This issue contains tools to perform this on 0.18 and 0.19. No plans to commit any of this into branches.
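For reference, the RowResult round trip discussed near the top of the thread (write an object to bytes on export, call readFields on import) can be sketched in self-contained Java. Everything here is an assumption for illustration: `SimpleWritable` mimics the write/readFields contract of a Hadoop Writable, `DemoRow` is a hypothetical stand-in for RowResult, and `WritableBytes` is a hypothetical helper, not the hbase util Writables class mentioned above.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical interface mirroring the write/readFields contract.
interface SimpleWritable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

// Hypothetical row type: a row key plus a single cell value.
final class DemoRow implements SimpleWritable {
    String rowKey = "";
    byte[] value = new byte[0];

    DemoRow() {}

    DemoRow(String rowKey, byte[] value) {
        this.rowKey = rowKey;
        this.value = value;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(rowKey);
        out.writeInt(value.length);
        out.write(value);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        rowKey = in.readUTF();
        value = new byte[in.readInt()];
        in.readFully(value);
    }
}

// Hypothetical helper that hides the stream plumbing shown in the thread.
final class WritableBytes {
    static byte[] toBytes(SimpleWritable w) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (DataOutputStream dos = new DataOutputStream(bos)) {
            w.write(dos);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    // Fills the supplied (empty) target from the bytes and returns it.
    static <T extends SimpleWritable> T fromBytes(byte[] bytes, T target) {
        try (DataInputStream dis = new DataInputStream(new ByteArrayInputStream(bytes))) {
            target.readFields(dis);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return target;
    }
}
```

Wrapping the stream handling in one helper keeps the reducer code to a single call, which is the point of the suggestion to use a shared utility rather than repeating the stream setup in each job.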
Closing issue as Won't Fix. 0.20 backup now being worked on in
However, looking forward to seeing this effort.