When benchmarking generated aggregates with grouping keys, profiling shows that lookups in BytesToBytesMap take about 90% of the CPU time, so we should optimize them.
After profiling with jvisualvm, here are the things that take most of the time:
1. decoding the address from a Long into baseObject and offset
2. calculating the hash code
3. comparing the bytes (equality check)
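
To make the three costs concrete, here is a minimal sketch of what one lookup probe has to do. It is not the actual BytesToBytesMap code: the class LookupSketch and all its methods are hypothetical, the 13-bit page / 51-bit offset address layout is an assumption, on-heap byte[] pages stand in for baseObject, and FNV-1a stands in for whatever hash function the map really uses.

```java
// Hypothetical sketch of a single lookup probe; not the real BytesToBytesMap.
final class LookupSketch {
    // Assumed layout: upper 13 bits hold the page number, lower 51 bits the offset.
    static final int OFFSET_BITS = 51;
    static final long OFFSET_MASK = (1L << OFFSET_BITS) - 1;

    static long encode(int page, long offset) {
        return ((long) page << OFFSET_BITS) | (offset & OFFSET_MASK);
    }

    // Step 1: decode the packed Long into a page (base object) and an offset.
    static int decodePage(long address) {
        return (int) (address >>> OFFSET_BITS);
    }

    static long decodeOffset(long address) {
        return address & OFFSET_MASK;
    }

    // Step 2: hash the key bytes (FNV-1a, used here only as a simple stand-in).
    static int hash(byte[] base, int offset, int length) {
        int h = 0x811C9DC5;
        for (int i = 0; i < length; i++) {
            h ^= base[offset + i] & 0xFF;
            h *= 0x01000193;
        }
        return h;
    }

    // Step 3: byte-by-byte equality check between the stored key and the probe key.
    static boolean bytesEqual(byte[] a, int aOff, byte[] b, int bOff, int length) {
        for (int i = 0; i < length; i++) {
            if (a[aOff + i] != b[bOff + i]) {
                return false;
            }
        }
        return true;
    }

    // One probe: compare the cheap fields first, and only decode the address
    // and compare bytes when the stored hash and length already match.
    static boolean matches(byte[][] pages, long storedAddress, int storedLength,
                           int storedHash, byte[] key) {
        if (storedLength != key.length || storedHash != hash(key, 0, key.length)) {
            return false;
        }
        byte[] base = pages[decodePage(storedAddress)];
        int offset = (int) decodeOffset(storedAddress);
        return bytesEqual(base, offset, key, 0, key.length);
    }
}
```

In this sketch every probe that reaches the byte comparison pays for all three steps, which is why each of them shows up in the profile.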