[FLINK-4867] Investigate code generation for improving sort performance - ASF JIRA

Details

Type: Sub-task
Status: Resolved
Priority: Minor
Resolution: Done
Affects Version/s: None
Fix Version/s: None
Component/s: Runtime / Task
Labels:
- performance

Description

This issue is for investigating whether code generation could speed up sorting. We should make some performance measurements on hand-written code that is similar to what we could generate, to see whether investing more time into this is worth it. If we find that it is worth it, we can open a second Jira for the actual implementation of the code generation.

I think we could generate one class at places where we currently instantiate QuickSort. This generated class would include the functionality of QuickSort, NormalizedKeySorter or FixedLengthRecordSorter, MemorySegment.compare, and MemorySegment.swap.

Btw. I'm planning to give this as a student project at a TU Berlin course in the next few months.

Some concrete ideas about how a generated sorter could be faster than the current sorting code:

MemorySegment.compare could be specialized for
- Length: for small records, the loop could be unrolled
- Endianness (currently it is optimized for big endian; in the little-endian case (e.g. x86) it does a Long.reverseBytes for each long read)
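
To make the length specialization concrete, here is a minimal sketch, not Flink code. It contrasts a generic byte-wise compare with a variant that a generator might emit once it knows the key is exactly 16 bytes long. The class and method names are hypothetical, and `java.nio.ByteBuffer` (big-endian by default) stands in for `MemorySegment`, so the byte-order handling is hidden inside `getLong`. In a real generated class, the read variant matching the platform's byte order would presumably be chosen once, at generation time.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch; ByteBuffer stands in for Flink's MemorySegment.
final class SpecializedCompareSketch {

    // Generic path: the length is only known at run time, so we loop over the bytes
    // and compare them as unsigned values (lexicographic order).
    static int compareGeneric(ByteBuffer a, int offA, ByteBuffer b, int offB, int len) {
        for (int i = 0; i < len; i++) {
            int cmp = (a.get(offA + i) & 0xff) - (b.get(offB + i) & 0xff);
            if (cmp != 0) {
                return cmp;
            }
        }
        return 0;
    }

    // Specialized path for 16-byte keys: the loop is fully unrolled into two long reads.
    // Big-endian reads compared as unsigned longs give the same order as the byte loop.
    static int compare16(ByteBuffer a, int offA, ByteBuffer b, int offB) {
        long a0 = a.getLong(offA);
        long b0 = b.getLong(offB);
        if (a0 != b0) {
            return Long.compareUnsigned(a0, b0);
        }
        return Long.compareUnsigned(a.getLong(offA + 8), b.getLong(offB + 8));
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocate(32);
        buf.putLong(0, 1L).putLong(8, -1L);   // key stored at offset 0
        buf.putLong(16, 1L).putLong(24, 2L);  // key stored at offset 16
        // Both variants should agree on the sign of the result.
        System.out.println(Integer.signum(compareGeneric(buf, 0, buf, 16, 16)));
        System.out.println(Integer.signum(compare16(buf, 0, buf, 16)));
    }
}
```

Whether the unrolled variant is actually faster is exactly what the proposed measurements would have to show.
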
MemorySegment.swapBytes
- For small records, three UNSAFE.copyMemory calls are probably not as efficient as a specialized swap, because
  - We could use total loop unrolling in generated code (because we know the exact record size)
  - UNSAFE.copyMemory checks for alignment first [6,9]
  - We would need only 2/3 of the memory bandwidth, because the temporary storage could be a register if we swap one byte (or one long) at a time
- Several checks might be eliminated
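
The swap idea can be sketched in the same spirit. This is only an illustration under assumptions: the names are hypothetical, `System.arraycopy` plays the role of the three UNSAFE.copyMemory calls, and the records are assumed not to overlap. The generic version moves every byte through a scratch buffer (three reads and three writes per byte), while the variant specialized for 16-byte records keeps the temporaries in local variables (two reads and two writes per byte), which is where the 2/3 figure comes from.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical sketch; plain arrays and ByteBuffer stand in for Flink's MemorySegment.
final class SpecializedSwapSketch {

    // Generic path: three bulk copies through a scratch buffer.
    // Assumes the two records do not overlap and scratch.length >= len.
    static void swapGeneric(byte[] mem, int offA, int offB, int len, byte[] scratch) {
        System.arraycopy(mem, offA, scratch, 0, len);
        System.arraycopy(mem, offB, mem, offA, len);
        System.arraycopy(scratch, 0, mem, offB, len);
    }

    // Specialized path for 16-byte records: fully unrolled, one long at a time,
    // with the temporaries held in locals that the JIT can keep in registers.
    static void swap16(ByteBuffer mem, int offA, int offB) {
        long a0 = mem.getLong(offA);
        long a1 = mem.getLong(offA + 8);
        mem.putLong(offA, mem.getLong(offB));
        mem.putLong(offA + 8, mem.getLong(offB + 8));
        mem.putLong(offB, a0);
        mem.putLong(offB + 8, a1);
    }

    public static void main(String[] args) {
        byte[] data = new byte[32];
        for (int i = 0; i < data.length; i++) {
            data[i] = (byte) i;
        }
        byte[] expected = data.clone();
        swapGeneric(expected, 0, 16, 16, new byte[16]);
        swap16(ByteBuffer.wrap(data), 0, 16);
        // Both variants should leave the memory in the same state.
        System.out.println(Arrays.equals(expected, data));
    }
}
```
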
Better inlining behaviour could be achieved
- Virtual function calls to the methods of InMemorySorter could be eliminated. (Note that these are hard for the JVM to devirtualize if different derived classes are used in a single Flink job (see [8,7]).)
- MemorySegment.swapBytes is probably not inlined currently, because the excessive checks make it too large
- MemorySegment.compare is probably also not inlined currently, because those two while loops are too large
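
The inlining argument can be illustrated with the following sketch. Everything in it is an assumption for illustration: `SortableSketch` is a simplified, hypothetical stand-in for the interface the sorter calls into, keys are plain `long` values in an array, and insertion sort replaces QuickSort purely to keep the example short. The point is only the call structure: the generic sorter goes through interface calls for every compare and swap, while the fused class calls small private static methods that the JIT can bind statically.

```java
import java.util.Arrays;

// Hypothetical, simplified stand-in for the abstraction the generic sorter calls into.
interface SortableSketch {
    int compare(int i, int j);
    void swap(int i, int j);
    int size();
}

// One possible implementation; a job using several such classes makes the
// call sites in GenericSortSketch harder for the JVM to devirtualize.
final class LongArraySortable implements SortableSketch {
    private final long[] keys;
    LongArraySortable(long[] keys) { this.keys = keys; }
    public int compare(int i, int j) { return Long.compare(keys[i], keys[j]); }
    public void swap(int i, int j) { long t = keys[i]; keys[i] = keys[j]; keys[j] = t; }
    public int size() { return keys.length; }
}

final class GenericSortSketch {
    // Every compare and swap is an interface call.
    // (Insertion sort is used here for brevity, not QuickSort.)
    static void sort(SortableSketch s) {
        for (int i = 1; i < s.size(); i++) {
            for (int j = i; j > 0 && s.compare(j - 1, j) > 0; j--) {
                s.swap(j - 1, j);
            }
        }
    }
}

// Roughly what a generated class could look like: the sort loop, compare and swap
// live in one final class, so there is no virtual dispatch left to resolve.
final class FusedSortSketch {
    static void sort(long[] keys) {
        for (int i = 1; i < keys.length; i++) {
            for (int j = i; j > 0 && compare(keys, j - 1, j) > 0; j--) {
                swap(keys, j - 1, j);
            }
        }
    }
    private static int compare(long[] k, int i, int j) { return Long.compare(k[i], k[j]); }
    private static void swap(long[] k, int i, int j) { long t = k[i]; k[i] = k[j]; k[j] = t; }

    public static void main(String[] args) {
        long[] a = {5, -3, 9, 0, 2};
        long[] b = a.clone();
        GenericSortSketch.sort(new LongArraySortable(a));
        FusedSortSketch.sort(b);
        // Both sorters should produce the same ordering.
        System.out.println(Arrays.equals(a, b));
    }
}
```
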