Affects Version/s: 1.2.0, 1.3.0
Fix Version/s: None
Consider the following query:
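(The original snippet is not preserved here; the following is an illustrative query of the shape being described — the table name `events` and field name `var1` are assumptions.)

{code:sql}
-- Hypothetical example: aggregate on a field nested inside `variables`
SELECT id, first(variables.var1) AS var1
FROM events
GROUP BY id
{code}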
Here `variables` is a Struct containing many fields; it could equally be a Map with many key-value pairs.
During execution, SparkSQL shuffles the whole Map or Struct column instead of extracting the needed value first, so performance is very poor.
The optimized version could use a subquery:
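(Again illustrative, using the same assumed table and field names: the inner query projects only the needed field, so the shuffle carries a single column rather than the whole Struct.)

{code:sql}
-- Hypothetical rewrite: extract the nested field before the shuffle
SELECT id, first(var1) AS var1
FROM (
  SELECT id, variables.var1 AS var1
  FROM events
) t
GROUP BY id
{code}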
Here the field/key-value is extracted on the map side, so the data being shuffled is small.
A benchmark on a table with 600 variables shows a drastic improvement in runtime:
||Parquet (using Map)||Parquet (using Struct)||
Parquet already supports reading a single field/key-value at the storage level, but SparkSQL currently has no optimization that exploits this. Such an optimization would be very useful for tables whose Map or Struct columns contain many fields.