Details
- Type: Improvement
- Status: Resolved
- Priority: Major
- Resolution: Duplicate
- Affects Version/s: 1.2.0, 1.3.0
- Fix Version/s: None
- Component/s: None
Description
Consider the following query:
select stddev_pop(variables.var1) stddev from model group by model_name
where variables is a Struct containing many fields; it could equally be a Map with many key-value pairs.
During execution, Spark SQL shuffles the entire Map or Struct column instead of first extracting only the value that is needed, so performance is very poor.
The optimized version could use a subquery:
select stddev_pop(var) stddev from (select variables.var1 as var, model_name from model) m group by model_name
Here the field/key-value is extracted on the mapper side only, so the data being shuffled is small.
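For readers using the DataFrame API rather than SQL, a minimal sketch of the same rewrite is below. It assumes a later Spark release (stddev_pop is in org.apache.spark.sql.functions from 1.6 onwards); the read path is a placeholder, while the table, column, and field names are taken from the report.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, stddev_pop}

val spark = SparkSession.builder().appName("nested-field-workaround").getOrCreate()

// Placeholder path; the layout (variables.var1, model_name) mirrors the report.
val model = spark.read.parquet("/path/to/model")

// Unoptimized form: the whole 'variables' Struct/Map column flows into the shuffle.
val unoptimized = model
  .groupBy("model_name")
  .agg(stddev_pop(col("variables.var1")).as("stddev"))

// Optimized form: project the single field before aggregating, so only
// (var, model_name) is shuffled.
val optimized = model
  .select(col("variables.var1").as("var"), col("model_name"))
  .groupBy("model_name")
  .agg(stddev_pop(col("var")).as("stddev"))

optimized.show()
```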
A benchmark on a table with 600 variables shows a drastic improvement in runtime:
| Query | Parquet (using Map) | Parquet (using Struct) |
|---|---|---|
| Stddev (unoptimized) | 12890s | 583s |
| Stddev (optimized) | 84s | 61s |
Parquet already supports reading a single field/key-value at the storage level, but Spark SQL currently has no optimization that takes advantage of it. This would be a very useful optimization for tables that have a Map or Struct with many fields.
Jianshi
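As a hedged aside on the resolution: the linked SPARK-4502 work later added nested schema pruning for Parquet. A one-line sketch of enabling it, assuming a Spark 2.4+ session named spark and that the config name below is unchanged in the release at hand:

```scala
// Assumed config name for nested schema pruning (Spark 2.4+); verify availability
// and the default value for your version.
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", "true")

// With pruning enabled, the Parquet scan should read only variables.var1
// instead of the whole 'variables' column.
```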
Issue Links
- duplicates SPARK-4502: Spark SQL reads unneccesary nested fields from Parquet (Resolved)