Affects Version/s: 0.20.6, 0.89.20100924
Fix Version/s: None
Mozilla is currently trying to migrate our HBase cluster to a new datacenter.
We have our existing 25 node cluster in our SJC datacenter. It is serving production traffic 24/7. While we can take downtime, it is very costly, and anything longer than a few hours in the evening is difficult.
We have two new 30 node clusters in our PHX datacenter. We want to cut production over to one of them this week.
The old cluster is running 0.20.6. The new clusters are running CDH3b3 with HBase 0.89.
We have tried running a pull distcp using hftp URLs. If HBase is running, this causes SAX XML parsing exceptions when a directory is removed during the scan.
If HBase is stopped, it takes hours for the directory compare to finish before it even begins copying data.
We have tried a custom backup MR job. This job uses the map phase to evaluate and copy changed files. It can run while HBase is live, but that results in a dirty copy of the data.
We have also tried running the custom backup job while HBase is shut down. Even then, the second of two back-to-back runs still copies over some data, so the result does not appear to be an entirely clean copy.
When we had what we thought was an entire copy on the new cluster, we ran add_table on it, but the resulting HBase table had holes. Investigating the holes revealed that some directories had not been transferred.
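One way to find untransferred directories before running add_table is to diff recursive listings of the two clusters. The following is only a minimal sketch, not our backup job: it assumes each listing was captured with `hadoop fs -lsr` and that every line has the eight columns noted in the comment. The function names are made up for illustration.

```python
def parse_lsr(lines):
    """Map each path in `hadoop fs -lsr` style output to (is_dir, size)."""
    entries = {}
    for line in lines:
        # Assumed columns: perms, replication, owner, group, size, date, time, path
        fields = line.strip().split(None, 7)
        if len(fields) < 8:
            continue  # skip blank or unrecognized lines
        perms, size, path = fields[0], fields[4], fields[7]
        entries[path] = (perms.startswith("d"), int(size))
    return entries


def diff_listings(src_lines, dst_lines):
    """Return (paths missing on the destination, files whose sizes differ)."""
    src = parse_lsr(src_lines)
    dst = parse_lsr(dst_lines)
    missing = sorted(p for p in src if p not in dst)
    mismatched = sorted(
        p for p in src
        if p in dst and not src[p][0] and src[p][1] != dst[p][1]
    )
    return missing, mismatched
```

This only compares paths and sizes, so it would catch missing directories and truncated files but not files that changed without changing length.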
We had a meeting to brainstorm ideas, and three further suggestions came up:
1. Build a list of files to transfer on the SJC side, transfer that list to PHX, and then run distcp on it.
2. Try a full copy instead of an incremental one, skipping the expensive file compare step.
3. Evaluate copying from SJC to S3 then from S3 to PHX.
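Suggestion 1 could be sketched roughly as follows. This is an illustration under assumptions, not something we have run: it assumes the SJC listing comes from `hadoop fs -lsr` with the eight columns noted in the comment, and the hftp prefix and function name are hypothetical placeholders.

```python
def build_uri_list(lsr_lines, src_prefix):
    """Turn `hadoop fs -lsr` style lines into source URIs, files only."""
    uris = []
    for line in lsr_lines:
        # Assumed columns: perms, replication, owner, group, size, date, time, path
        fields = line.strip().split(None, 7)
        if len(fields) < 8 or fields[0].startswith("d"):
            continue  # skip directories and unrecognized lines
        uris.append(src_prefix.rstrip("/") + fields[7])
    return uris


if __name__ == "__main__":
    sample = [
        "drwxr-xr-x   - hbase supergroup          0 2010-11-01 12:00 /hbase/t1/r1",
        "-rw-r--r--   3 hbase supergroup       1024 2010-11-01 12:00 /hbase/t1/r1/f1",
    ]
    # "hftp://sjc-namenode:50070" is a placeholder, not our real namenode address
    for uri in build_uri_list(sample, "hftp://sjc-namenode:50070"):
        print(uri)
```

The resulting list could be written to a file and handed to distcp through its `-f` source-list option. How distcp maps individually listed files onto destination paths should be checked on a small sample first, since the directory layout may not be preserved; listing per directory may turn out to be the safer choice.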