We are planning to use some batch algorithms (sorting & bytes hash table) to improve the performance of streaming SQL operators, especially for the mini-batch operators introduced by FLIP-145.
Currently, we have to buffer input records and accumulators on the heap (e.g. in a Java HashMap), which is inefficient and carries the risk of full GCs and OOM errors. With managed memory, we could buffer more data without worrying about OOM and improve performance considerably. However, managed memory is currently not allowed to be used in streaming operators.
As discussed on the mailing list, we have reached a consensus to extend the configuration `taskmanager.memory.managed.consumer-weights` with two more options, `OPERATOR` and `STATE_BACKEND`. The available consumer options will be:
- `OPERATOR` for both streaming and batch operators
- `STATE_BACKEND` for state backends
- `PYTHON` for python processes
- `DATAPROC` as a legacy key for state backends or batch operators if `STATE_BACKEND` or `OPERATOR` is not specified
The previous default value is `DATAPROC:70,PYTHON:30`; the new default value will be `OPERATOR:70,STATE_BACKEND:70,PYTHON:30`.
The weights for `OPERATOR` and `STATE_BACKEND` will be set to the same value to align with the previous behavior, in which both kinds of consumers were covered by the single `DATAPROC` weight.
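To illustrate the intended fallback semantics, here is a minimal sketch of how a configured weight string might be resolved per consumer. It is not Flink's actual implementation: the helper names `parse_consumer_weights` and `resolve_weight` are hypothetical, and returning 0 for a consumer that is not configured at all is an assumption made only for this example.

```python
from typing import Dict

# Hypothetical names for illustration; not part of Flink's API.
LEGACY_KEY = "DATAPROC"
# Consumers that fall back to the legacy key when not specified explicitly.
LEGACY_FALLBACK_CONSUMERS = {"OPERATOR", "STATE_BACKEND"}


def parse_consumer_weights(value: str) -> Dict[str, int]:
    """Parse a 'KEY:weight,KEY:weight' string into a dict."""
    weights: Dict[str, int] = {}
    for entry in value.split(","):
        key, _, weight = entry.strip().partition(":")
        weights[key.strip()] = int(weight)
    return weights


def resolve_weight(weights: Dict[str, int], consumer: str) -> int:
    """Return the weight for a consumer, using DATAPROC as a legacy fallback."""
    if consumer in weights:
        return weights[consumer]
    if consumer in LEGACY_FALLBACK_CONSUMERS and LEGACY_KEY in weights:
        return weights[LEGACY_KEY]
    return 0  # assumption: a consumer that is not configured gets no share


# The previous and the new default values from the text above.
old = parse_consumer_weights("DATAPROC:70,PYTHON:30")
new = parse_consumer_weights("OPERATOR:70,STATE_BACKEND:70,PYTHON:30")
```

Under these assumptions, `resolve_weight(old, "OPERATOR")` and `resolve_weight(new, "OPERATOR")` both yield 70, which is the sense in which the new default keeps the previous behavior for existing configurations.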