Currently, Hash-shuffle keeps intermediate file appender and tuple list in memory and the required memory will be in proportion to the input size
If input size is 10GB, the hash-join key partition count will be 78125 (10TB / 128MB) and the required memory is 10GB (78125 * 128KB).
We should improve the hash-shuffle file writer as following :
- Separate the buffer from the file writer
- Keep the tuples in off-heap buffer and reuse the buffer
- Flush the buffers, if total buffer capacity is required more than maxBufferSize
- Write the partition files asynchronously