There are really three goals.
The first goal is to fix bulk-loaded files. Right now their ordering gets broken after a compaction happens, which leads to some weird compactions where the smallest files are being compacted with the largest. This is possible because the compaction policy currently approves the candidate list as soon as one file is less than or equal to the files after it, and the bulk-loaded files are always on the left. The new large file created by the compaction does not have the bulk load flag (that's lost) and it will have a seqId of 0.
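A minimal sketch of the ordering problem, assuming store files are ordered by sequence id. The field names and sort helper here are illustrative stand-ins, not the actual HBase StoreFile API:

```python
def sort_store_files(files):
    """Store files are kept ordered by sequence id (oldest first).
    Bulk-loaded files carry a seqId of 0, so they always sort to the
    far left; a compaction output that also ends up with seqId 0
    lands right next to them regardless of its size."""
    return sorted(files, key=lambda f: f["seq_id"])

bulk_load = {"name": "bulk-a", "size": 1, "seq_id": 0}
flushes = [
    {"name": "flush-1", "size": 5, "seq_id": 10},
    {"name": "flush-2", "size": 5, "seq_id": 20},
]

# Before compaction the ordering is sane: the bulk load sits on the
# left, the flushed files follow by seqId.
before = sort_store_files([bulk_load] + flushes)

# Compacting the two flushes produces one large file that has lost the
# bulk load flag and (per the bug above) gets a seqId of 0:
compacted = {"name": "compacted", "size": 10, "seq_id": 0}
after = sort_store_files([bulk_load, compacted])
# Both remaining files now sort left with seqId 0, so the next
# selection pass sees the smallest and largest files side by side.
```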
The second goal is to only compact files that are all inside a ratio. All candidate files are selected if there is one file to the left that satisfies the ratio SizeFile(j) <= SumFileSize(j-1, 0). Workloads with large fluctuations can select weird groups of files.
Suppose there's a write workload that's heavily sinusoidal.
[ 1 1 50 150 180 150 50 1 1 1 1 ]
Currently we'd pick 1 1 50 as the files to compact.
The files 1 1 1 are the most like each other.
The files 150 180 150 are also more similar to each other and would logically be better matches than the ones currently picked.
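The current left-to-right selection can be sketched as follows. The ratio of 1.2 and the cap of three files per compaction are assumed illustrative defaults, not values taken from the source:

```python
def select_candidates(sizes, ratio=1.2, max_files=3):
    """Sketch of the current ratio-based selection. `sizes` is ordered
    oldest (left) to newest (right); `ratio` and `max_files` are
    hypothetical defaults for illustration."""
    start = 0
    # Skip files from the left while one is too large relative to the
    # files after it.
    while start < len(sizes) - 1 and sizes[start] > ratio * sum(sizes[start + 1:]):
        start += 1
    # As soon as one file passes the ratio test, the whole remaining
    # list is approved and simply truncated to max_files.
    return sizes[start:start + max_files]

select_candidates([1, 1, 50, 150, 180, 150, 50, 1, 1, 1, 1])
# -> [1, 1, 50]: the very first file passes, so the leftmost three
# files are taken even though they are nothing alike.
```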
Just because files are picked doesn't mean they are the best choice; right now our compaction algorithm is pretty naive. This is a first cut at choosing files based on more than one heuristic: the ratio, the number of files removed, and the IO required.
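One way that multi-heuristic idea could look, sketched under assumptions (the 1.2 ratio and the window-size bounds are hypothetical, and a real policy would weigh more factors): enumerate every contiguous window whose files are all inside the ratio of each other, prefer the window that removes the most files, and break ties by the least IO.

```python
def fits_ratio(window, ratio=1.2):
    # Every file must be <= ratio * (sum of the other files in the
    # window), i.e. all files are "inside the ratio" together.
    total = sum(window)
    return all(f <= ratio * (total - f) for f in window)

def explore(sizes, ratio=1.2, min_files=2, max_files=3):
    """Score all contiguous windows on more than one heuristic:
    ratio fit, number of files removed, and IO required."""
    best = []
    for i in range(len(sizes)):
        for j in range(i + min_files, min(len(sizes), i + max_files) + 1):
            window = sizes[i:j]
            if not fits_ratio(window, ratio):
                continue
            # Removing more files wins; ties go to less IO (smaller sum).
            if len(window) > len(best) or (
                len(window) == len(best) and sum(window) < sum(best)
            ):
                best = window
    return best

explore([1, 1, 50, 150, 180, 150, 50, 1, 1, 1, 1])
# -> [1, 1, 1]: a window of genuinely similar files, rather than the
# [1, 1, 50] group the current policy picks.
```

On the sinusoidal example this rejects mixed windows like [1, 1, 50] outright (50 is far outside the ratio of its neighbors), and among the windows that do fit, the tie-break on IO favors compacting the small similar files first.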