Description
Using a memory profiler, I investigated a driver OOM that occurred while building a large broadcast table and found that a huge amount of memory is used during the build. This is because BroadcastExchangeExec uses executeCollect. In executeCollect, all of the partitions are fetched as compressed blocks, then each block is decompressed (with a stream), and each row is copied to a new byte buffer and added to an ArrayBuffer, which is then copied to an Array. This results in a huge amount of allocation: one buffer for each row in the broadcast. Those rows are only used to be copied into a BytesToBytesMap that will be broadcast, so there is no need to keep them all in memory.
Replacing the ArrayBuffer step with an iterator reduces the amount of memory held while creating the map, because the rows no longer all need to be in memory at once. It also avoids allocating a large Array for the rows. In practice, a 16MB broadcast table used 100MB less memory with this approach, but the reduction depends on the size of rows and compression (the 16MB figure is the size in Parquet format).
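The difference between the two build strategies can be illustrated with a minimal, self-contained Java sketch. This is not Spark code: the class BroadcastBuildSketch and all of its methods are hypothetical, a HashMap stands in for BytesToBytesMap, and lists of byte arrays stand in for decompressed blocks. It only counts how many intermediate row copies are held at once under each approach, assuming the map copies each row into its own storage.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.NoSuchElementException;

class BroadcastBuildSketch {

    // Hypothetical stand-in for decompressed partition blocks: each block is a list of rows.
    static List<List<byte[]>> sampleBlocks(int blocks, int rowsPerBlock) {
        List<List<byte[]>> out = new ArrayList<>();
        for (int b = 0; b < blocks; b++) {
            List<byte[]> rows = new ArrayList<>();
            for (int r = 0; r < rowsPerBlock; r++) {
                int key = b * rowsPerBlock + r;
                rows.add(new byte[] {(byte) (key >> 8), (byte) key});
            }
            out.add(rows);
        }
        return out;
    }

    // Collect-style build: copy every row into a buffer, copy the buffer into an
    // array, and only then fill the map. Returns the number of intermediate row
    // copies held at once, which here is every row.
    static int buildByCollecting(List<List<byte[]>> blocks, Map<Integer, byte[]> map) {
        List<byte[]> buffer = new ArrayList<>();
        for (List<byte[]> block : blocks) {
            for (byte[] row : block) {
                buffer.add(Arrays.copyOf(row, row.length)); // one buffer per row
            }
        }
        byte[][] rows = buffer.toArray(new byte[0][]); // extra array holding all rows
        for (int i = 0; i < rows.length; i++) {
            map.put(i, Arrays.copyOf(rows[i], rows[i].length)); // map keeps its own copy
        }
        return rows.length;
    }

    // Lazily walks the rows of each block in order without materializing them.
    static Iterator<byte[]> rowIterator(List<List<byte[]>> blocks) {
        return new Iterator<byte[]>() {
            private final Iterator<List<byte[]>> blockIt = blocks.iterator();
            private Iterator<byte[]> rowIt = Collections.emptyIterator();

            @Override
            public boolean hasNext() {
                while (!rowIt.hasNext() && blockIt.hasNext()) {
                    rowIt = blockIt.next().iterator();
                }
                return rowIt.hasNext();
            }

            @Override
            public byte[] next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                return rowIt.next();
            }
        };
    }

    // Iterator-style build: each row copy goes straight into the map, so at most
    // one intermediate row copy is live at any time. Returns that peak (0 or 1).
    static int buildByIterating(List<List<byte[]>> blocks, Map<Integer, byte[]> map) {
        Iterator<byte[]> it = rowIterator(blocks);
        int peak = 0;
        int i = 0;
        while (it.hasNext()) {
            byte[] src = it.next();
            byte[] row = Arrays.copyOf(src, src.length); // short-lived intermediate copy
            peak = Math.max(peak, 1);
            map.put(i++, Arrays.copyOf(row, row.length)); // map keeps its own copy
        }
        return peak;
    }

    public static void main(String[] args) {
        List<List<byte[]>> blocks = sampleBlocks(3, 4);
        Map<Integer, byte[]> collected = new HashMap<>();
        Map<Integer, byte[]> streamed = new HashMap<>();
        System.out.println("collect-style intermediate rows held: "
            + buildByCollecting(blocks, collected));
        System.out.println("iterator-style intermediate rows held: "
            + buildByIterating(blocks, streamed));
    }
}
```

Both methods produce the same map contents; only the number of row copies alive before the map is filled differs. The sketch says nothing about actual byte counts, which depend on row size and compression as noted above.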