The DataStreamSender allocates row batches in whatever thread handles the TransmitData() RPC, but then deallocates them in the fragment instance thread.
That is an anti-pattern for tcmalloc, whose per-thread caches work best when memory is freed on the same thread that allocated it. Instead, we should recycle the row batches where possible.
We could try to 'pin' row batches to service threads and give each thread a thread-local ability to reallocate row batch data. The key is ensuring that the deallocations happen on the same thread as the allocations, so we can't just give each sender a list of row batches, because that sender may be handled by different service pool threads.
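To make the pinning idea concrete, here is a minimal sketch, not taken from the codebase. ThreadBufferPool, Buffer, Acquire() and Return() are hypothetical names, and Buffer is only a stand-in for the memory backing a row batch. It assumes one pool per service thread: the fragment instance thread hands finished buffers back to the pool that created them instead of freeing them, so buffer memory is only ever allocated, resized or freed on the owning thread.

```cpp
#include <cstddef>
#include <memory>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical stand-in for the memory backing a row batch.
struct Buffer {
  explicit Buffer(std::size_t n) : data(n) {}
  std::vector<char> data;
};

// Hypothetical per-service-thread pool. Any thread may hand a buffer back,
// but only the owning thread allocates, resizes or frees buffer memory.
class ThreadBufferPool {
 public:
  // Call on the owning (service) thread only.
  std::unique_ptr<Buffer> Acquire(std::size_t n) {
    std::unique_ptr<Buffer> b;
    {
      std::lock_guard<std::mutex> l(lock_);
      if (!returned_.empty()) {
        b = std::move(returned_.back());
        returned_.pop_back();
        ++reused_;
      }
    }
    if (b != nullptr) {
      // Any reallocation caused by the resize happens on the owning thread.
      b->data.resize(n);
      return b;
    }
    ++allocated_;
    return std::make_unique<Buffer>(n);
  }

  // Call on any thread (e.g. the fragment instance thread). The buffer is
  // queued for reuse rather than freed here. Growth of the queue's own
  // storage is ignored in this sketch.
  void Return(std::unique_ptr<Buffer> b) {
    std::lock_guard<std::mutex> l(lock_);
    returned_.push_back(std::move(b));
  }

  // Counters for illustration; read them only when no other thread is active.
  int allocated() const { return allocated_; }
  int reused() const { return reused_; }

 private:
  std::mutex lock_;
  std::vector<std::unique_ptr<Buffer>> returned_;
  int allocated_ = 0;  // Touched only by the owning thread.
  int reused_ = 0;     // Updated under lock_.
};
```

A service thread would hold one such pool (for example as a thread_local) and call Acquire() while handling TransmitData(); the consumer would call Return() on the originating pool when it is done with the batch. The sketch leaves open how a batch remembers which pool it came from, how the pool is kept alive while buffers are outstanding, and how the pool is bounded, all of which a real implementation would need to settle.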
Alternatively we can try to cut down on the number of allocations, but that's hard to do with cross-thread coordination.