We ran some experiments loading lineitem at scale factor 15k. The 10-node cluster (1 master, 9 TS) is equipped with Intel P3700 SSDs, one per TS, dedicated to the WALs. The table is hash-partitioned and configured with 10 tablets per TS.
Our findings:
- Reading the bloom filters puts a lot of contention on the block cache. This isn't new (see KUDU-613), but it now shows up on the write path because the SSDs are just really fast.
- Kudu performs best when data is inserted in order, but with hash partitioning we end up with multiple clients writing simultaneously in different key ranges within each tablet. This becomes a worst-case scenario: we have to compact (optimize) the row sets over and over again to put them in order. Even if we delayed this until the end of the bulk load, we'd still take a hit because we have to check more and more bloom filters to determine whether a row already exists.
- In the case of an initial bulk load, we know we're not trying to overwrite or update rows, so all of those checks are unnecessary.
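To make the second finding concrete, here is a toy model (hypothetical, not Kudu code; the ranges and counts are made up) of why interleaved writers inflate the per-insert duplicate-key cost: a key must be probed against the bloom filter of every row set whose key range covers it, so fully overlapping row sets mean one probe per row set, while an ordered load keeps it to one probe total.

```python
# Toy model (not Kudu code): count bloom-filter probes needed for a
# duplicate-key check as flushed row sets accumulate during a load.

def bloom_probes(rowsets, key):
    """A key only needs probing against row sets whose [min, max]
    key range covers it."""
    return sum(1 for lo, hi in rowsets if lo <= key <= hi)

# Interleaved writers: every flush spans nearly the whole key space,
# so all 50 row sets must be probed for each new row.
interleaved = [(0, 1000)] * 50
# Ordered load: each flush covers a disjoint slice of the key space.
ordered = [(i * 20, i * 20 + 19) for i in range(50)]

print(bloom_probes(interleaved, 500))  # 50 probes
print(bloom_probes(ordered, 500))      # 1 probe
```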
Some ideas for improvement:
- Obviously, we need a better block cache.
- When flushing, we could detect those disjoint sets of rows and make sure they map to row sets that don't cover the gaps. For example, if the MRS contains a,b,c,x,y,z, then flushing would give us two row sets, e.g. a,b,c and x,y,z, instead of one. The danger here is generating too many row sets.
- When reading, make the row set interval tree smart enough to not send readers into the row set gaps. With the same example, say we're looking for "m": normally we'd see a row set spanning a-z and would have to check its bloom filter, but if we could detect that it's actually a-c then x-z, we'd save a check.
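The last two ideas can be sketched together (a hypothetical illustration, not Kudu code; the gap threshold and integer keys are assumptions, with a=1, c=3, m=13, x=24, z=26 standing in for the letters above): split the sorted MRS at large key gaps on flush, then let reads rule out keys that fall into a gap without touching any bloom filter.

```python
# Sketch (hypothetical): split a sorted MRS at large key gaps on flush,
# then skip bloom checks entirely for keys that land in a gap.

def split_at_gaps(sorted_keys, gap):
    """Return (min, max) ranges, starting a new row set whenever two
    consecutive keys are more than `gap` apart."""
    runs, start, prev = [], sorted_keys[0], sorted_keys[0]
    for k in sorted_keys[1:]:
        if k - prev > gap:
            runs.append((start, prev))
            start = k
        prev = k
    runs.append((start, prev))
    return runs

def needs_bloom_check(ranges, key):
    """Only probe a bloom filter if the key lands inside some sub-range;
    a key in a gap is provably absent."""
    return any(lo <= key <= hi for lo, hi in ranges)

mrs = [1, 2, 3, 24, 25, 26]           # "a,b,c,x,y,z" as integers
ranges = split_at_gaps(mrs, gap=5)
print(ranges)                          # [(1, 3), (24, 26)]
print(needs_bloom_check(ranges, 13))   # False: "m" falls in the gap
print(needs_bloom_check(ranges, 2))    # True
```

The trade-off from the flushing bullet shows up here too: a smaller `gap` threshold produces more, tighter row sets (fewer wasted bloom checks) at the cost of more row sets to manage.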