This concerns the Spark use case. We sometimes use Spark to write data to Kudu, for example to import the data of a Hive table into a Kudu table.
The current implementation has two problems:
- It uses FlushMode.AUTO_FLUSH_BACKGROUND, which is inefficient for error handling. When an error such as a timeout occurs, the task still flushes all of its data and only then fails, so retries happen at the task level.
- For writes, Spark uses its default hash partitioning to split the data into partitions, and that hash does not always match the tablet distribution. For example, a large Hive table of 500 GB may produce 2000 tasks while we have only 20 tserver machines, so up to 2000 machines may write to 20 tservers at the same time. This hurts performance in two ways. First, primary key locking: tservers use row locks, so there is a lot of lock waiting, and in the worst case write operations always time out. Second, many machines write to the tservers at the same time, and nothing in the code limits that concurrency.
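The mismatch described in the second problem can be illustrated with a small toy model. This is only a sketch under stated assumptions: `spark_partition` and `kudu_tablet` are hypothetical stand-ins (a modulo and a SHA-256 based hash), not the real Spark or Kudu partitioning functions, and it assumes one tablet per tserver. It counts how many tablets each task writes to, and how many tasks write to each tablet, when the two hashes are unrelated.

```python
# Toy model only: none of this is the real Spark or Kudu API.
import hashlib
from collections import Counter

NUM_TASKS = 2000    # Spark tasks, taken from the example above
NUM_TABLETS = 20    # assumption: one tablet per tserver
NUM_ROWS = 200_000  # assumption: arbitrary sample size for the simulation


def spark_partition(key: int) -> int:
    """Hypothetical stand-in for Spark's default hash partitioning."""
    return key % NUM_TASKS


def kudu_tablet(key: int) -> int:
    """Hypothetical stand-in for Kudu's hash partitioning (an unrelated hash)."""
    digest = hashlib.sha256(str(key).encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_TABLETS


def tablets_touched_per_task(keys):
    """Map each task to the set of tablets that its rows land on."""
    touched = {}
    for key in keys:
        touched.setdefault(spark_partition(key), set()).add(kudu_tablet(key))
    return touched


touched = tablets_touched_per_task(range(NUM_ROWS))
avg_fanout = sum(len(t) for t in touched.values()) / len(touched)
# For every tablet, count how many distinct tasks write to it.
writers_per_tablet = Counter(t for tablets in touched.values() for t in tablets)

print(f"tasks: {len(touched)}")
print(f"average tablets written per task: {avg_fanout:.1f}")
print(f"max tasks writing to a single tablet: {max(writers_per_tablet.values())}")
```

In this model most tasks end up writing to most tablets, so every tablet sees writes from a large share of the tasks at once, which is the contention pattern described above.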
So we suggest two changes:
- Change the flush mode to FlushMode.MANUAL_FLUSH and handle errors at the row level first, falling back to the task level only as a last resort.
- Add an optional repartition step in Spark that repartitions the data according to the tablet distribution. Then only one machine writes to each tserver, and the lock waiting goes away.
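The two suggestions above can be sketched together as a toy model. Everything here is an assumption made for illustration: `FakeSession` only imitates a manually flushed write session and is not the Kudu client API, `tablet_of` is a hypothetical stand-in for Kudu's partitioning, failures are modelled as transient, and the batch size and retry count are arbitrary.

```python
# Toy model only: not the Kudu client API and not the kudu-spark implementation.
import hashlib
from collections import defaultdict

NUM_TABLETS = 20  # assumption: one tablet per tserver


def tablet_of(key: int) -> int:
    """Hypothetical stand-in for the tablet a row belongs to."""
    digest = hashlib.sha256(str(key).encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_TABLETS


def repartition_by_tablet(keys):
    """Group rows so that each write task holds rows for exactly one tablet."""
    tasks = defaultdict(list)
    for key in keys:
        tasks[tablet_of(key)].append(key)
    return tasks


class FakeSession:
    """Buffers rows until flush(); flush() returns the rows that failed."""

    def __init__(self, failing_keys):
        self.failing_keys = set(failing_keys)  # rows that fail once, then succeed
        self.buffer = []
        self.written = []

    def apply(self, key):
        self.buffer.append(key)

    def flush(self):
        failed = [k for k in self.buffer if k in self.failing_keys]
        self.written += [k for k in self.buffer if k not in self.failing_keys]
        self.failing_keys -= set(failed)  # transient error: next attempt succeeds
        self.buffer = []
        return failed


def write_task(session, rows, batch_size=1000, max_row_retries=3):
    """Flush in batches, retry only the failed rows, fail the task as a last resort."""
    for start in range(0, len(rows), batch_size):
        pending = rows[start:start + batch_size]
        for _ in range(1 + max_row_retries):
            for key in pending:
                session.apply(key)
            pending = session.flush()  # only the rows that failed remain pending
            if not pending:
                break
        if pending:
            raise RuntimeError(f"{len(pending)} rows still failing; failing the task")


# Usage: one task per tablet, with two rows that fail on their first attempt.
tasks = repartition_by_tablet(range(10_000))
session = FakeSession(failing_keys=[7, 42])
for tablet, rows in tasks.items():
    write_task(session, rows)
print(f"tasks: {len(tasks)}, rows written: {len(session.written)}")
```

Retrying only the rows returned by `flush()` avoids rewriting a whole task because of a few failed rows, and grouping rows by tablet first means each task targets a single tablet in this model.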
We have been using this feature for some time, and it has solved some of the problems we had when writing large tables to Kudu from Spark. I hope it will be useful for community members who use Spark with Kudu heavily.