We have a job that dumps newly added HDFS data into a Kudu table for good point-query performance. Each day we create a new range partition in Kudu for that day's data. As we add more and more Kudu range partitions, we observe performance degradation in this job.
The root cause is that the INSERT statement for Kudu does not leverage the partition predicates on the Kudu range partition keys, which causes skew across the insert nodes.
How to reproduce:
Step 1: Launch impala cluster with 3 nodes.
Step 2: Create an HDFS table with more than 3 underlying files, so it will have more than 3 scan ranges.
Upload the three attached tsv files into its directory and refresh the table in Impala.
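A minimal sketch of such a table (the table name and column schema are hypothetical; the attached tsv files are assumed to match):

```sql
-- Hypothetical text-backed HDFS table; each uploaded tsv file
-- becomes at least one scan range.
CREATE TABLE hdfs_tbl (id BIGINT, ts TIMESTAMP, val STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- After uploading the tsv files into the table's directory:
REFRESH hdfs_tbl;
```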
Step 3: Create a Kudu table with mixed partitioning: 3 hash partitions and 3 range partitions.
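For example (hypothetical names and date bounds; 3 hash buckets crossed with 3 ranges gives 9 tablets):

```sql
-- Hypothetical Kudu table mixing hash and range partitioning.
CREATE TABLE kudu_tbl (
  id BIGINT,
  ts TIMESTAMP,
  val STRING,
  PRIMARY KEY (id, ts)
)
PARTITION BY HASH (id) PARTITIONS 3,
RANGE (ts) (
  PARTITION '2023-01-01' <= VALUES < '2023-01-02',
  PARTITION '2023-01-02' <= VALUES < '2023-01-03',
  PARTITION '2023-01-03' <= VALUES < '2023-01-04'
)
STORED AS KUDU;
```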
Step 4: Insert the rows of the HDFS table into the Kudu table, supplying partition predicates.
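With the hypothetical tables above, the daily job would look like this; note that the WHERE clause restricts all rows to a single Kudu range partition:

```sql
-- Insert one day's data. The predicate confines every row to one
-- range partition, yet the planner does not use it to prune or
-- redistribute work among the insert fragment instances.
INSERT INTO kudu_tbl
SELECT id, ts, val
FROM hdfs_tbl
WHERE ts >= '2023-01-02' AND ts < '2023-01-03';
```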
Step 5: Looking at the profile, there are three fragment instances containing a KuduTableSink, but only one of them receives and produces data.
Thus, only one fragment instance of F01 is sorting and ingesting data into Kudu.
Generally, if there are N range partitions and all the inserted rows belong to one range (as implied by the partition predicates in the WHERE clause), only 1/N of the insert fragment instances produce data.