Temporary memory for SPOOF CUDA rowwise operators is currently allocated in native code. The allocation requires rows * columns * num_intermediates * sizeof(usedDataType) bytes so that all thread blocks can work concurrently.
This improvement should:
- move the allocation to the Java side of SystemDS so that it integrates better with the GPU memory manager
- reduce the memory footprint, either by having fewer thread blocks process more rows each or by queueing fewer blocks for execution at once
- update the estimate of intermediate memory at the HOP level
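
To make the size of the problem concrete, here is a minimal Java sketch of the two estimates: the current per-row footprint from the formula above, and a capped footprint under the assumption that each thread block reuses one set of intermediates across all rows it processes. The class name RowwiseTempMemory, its method names, and the maxBlocks parameter are hypothetical and not part of SystemDS; the capped formula is an assumption about how the reduced allocation might be sized, not the actual implementation.

```java
/** Hypothetical helper, not part of SystemDS: estimates temporary memory for rowwise operators. */
public final class RowwiseTempMemory {
    private RowwiseTempMemory() {}

    /** Current scheme from the description: one set of intermediates per row. */
    public static long fullBytes(long rows, long cols, long numIntermediates, long elemSize) {
        // multiplyExact throws ArithmeticException instead of silently overflowing
        return Math.multiplyExact(Math.multiplyExact(rows, cols),
                                  Math.multiplyExact(numIntermediates, elemSize));
    }

    /**
     * Assumed reduced scheme: at most maxBlocks thread blocks are launched and
     * each block reuses its own set of intermediates for every row it handles.
     */
    public static long cappedBytes(long rows, long cols, long numIntermediates,
                                   long elemSize, long maxBlocks) {
        if (maxBlocks < 1)
            throw new IllegalArgumentException("maxBlocks must be >= 1");
        long blocks = Math.min(rows, maxBlocks);
        return fullBytes(blocks, cols, numIntermediates, elemSize);
    }

    /** Number of rows each block has to process under the capped scheme (rounded up). */
    public static long rowsPerBlock(long rows, long maxBlocks) {
        if (maxBlocks < 1)
            throw new IllegalArgumentException("maxBlocks must be >= 1");
        if (rows <= 0)
            return 0;
        long blocks = Math.min(rows, maxBlocks);
        return (rows + blocks - 1) / blocks;
    }
}
```

For example, with 1,000,000 rows, 1,000 columns, 2 intermediates and 8-byte doubles, fullBytes yields 16,000,000,000 bytes, while cappedBytes with a cap of 1,024 blocks yields 16,384,000 bytes, with each block processing up to 977 rows.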