Description
I work at a big-data company that serves over 1,000 customer companies, and we adopted Kudu as a main or auxiliary storage engine. Some of our customers are small startups: they have a lot of data, but running many nodes is too expensive for them.
So some of our cases involve few nodes, a lot of data, and data that may not be well compacted.
In our scenarios, there are some migration cases:
- from a standalone tserver to another standalone tserver
- from a 3-node tserver cluster to another 3-node tserver cluster
In the past, we had to do something like this:

# First, download the tablet data via `kudu local_replica copy_from_remote`
# then rewrite the Raft config for each tablet
echo ${tablet_id_list} | xargs -i kudu local_replica cmeta rewrite_raft_config {} PEER_INFO -fs_data_dirs=xxx -fs_wal_dir=yyy
Downloading the data via copy_from_remote is blazingly fast.
However, rewriting the Raft info of all the tablets takes a lot of time: 30-60 seconds per tablet as I witnessed, and even longer when the data is not fully compacted. So sometimes it takes us 2 hours to download the tablet data but 6 hours to rewrite the metadata.
I noticed this code fragment in the RewriteRaftConfig function:

FsManager fs_manager(env, FsManagerOpts());
RETURN_NOT_OK(fs_manager.Open());
This means the fs_data_dirs and fs_wal_dir are opened 100 times if I want to rewrite the Raft config of 100 tablets.
To save this per-operation overhead, we could simply skip opening the block manager for rewrite_raft_config, because all of its operations touch only the consensus metadata files.
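A minimal sketch of the idea, assuming FsManagerOpts grew an opt-out flag (the name skip_block_man below is purely illustrative, not a confirmed field in the Kudu API):

```cpp
// Hypothetical sketch only: skip_block_man is an illustrative option name.
// The intent is that FsManager::Open() would set up only the WAL and
// metadata directories and never initialize the block manager, since
// rewrite_raft_config reads and writes cmeta files exclusively.
FsManagerOpts opts;
opts.skip_block_man = true;  // assumed flag: don't open the block manager
FsManager fs_manager(env, std::move(opts));
RETURN_NOT_OK(fs_manager.Open());  // now only the cheap metadata open remains
```

With something like this, each rewrite_raft_config invocation would no longer pay the block-manager startup cost, which presumably grows with poorly compacted data and is what makes rewriting 100 tablets so slow today.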