Details
- Type: Improvement
- Status: Resolved
- Priority: Major
- Resolution: Duplicate
- Affects Version/s: 1.2.0, 1.3.0
- Fix Version/s: None
- Component/s: None
Description
Consider the following query:
select stddev_pop(variables.var1) stddev from model group by model_name
where variables is a Struct containing many fields; it could equally be a Map with many key-value pairs.
During execution, Spark SQL shuffles the entire Map or Struct column instead of first extracting only the value that is needed, so performance is very poor.
The optimized version could use a subquery:
select stddev_pop(var) stddev from (select variables.var1 as var, model_name from model) m group by model_name
Here the field/key-value is extracted on the mapper side only, so the data being shuffled is small.
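For readers using the DataFrame API rather than SQL, a minimal sketch of the same rewrite is below. It assumes a later Spark release (stddev_pop is in org.apache.spark.sql.functions from 1.6 onwards); the read path is a placeholder, while the table, column, and field names are taken from the report.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, stddev_pop}

val spark = SparkSession.builder().appName("nested-field-workaround").getOrCreate()

// Placeholder path; the layout (variables.var1, model_name) mirrors the report.
val model = spark.read.parquet("/path/to/model")

// Unoptimized form: the whole 'variables' Struct/Map column flows into the shuffle.
val unoptimized = model
  .groupBy("model_name")
  .agg(stddev_pop(col("variables.var1")).as("stddev"))

// Optimized form: project the single field before aggregating, so only
// (var, model_name) is shuffled.
val optimized = model
  .select(col("variables.var1").as("var"), col("model_name"))
  .groupBy("model_name")
  .agg(stddev_pop(col("var")).as("stddev"))

optimized.show()
```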
A benchmark on a table with 600 variables shows a drastic improvement in runtime:
| Query | Parquet (using Map) | Parquet (using Struct) |
|---|---|---|
| Stddev (unoptimized) | 12890s | 583s |
| Stddev (optimized) | 84s | 61s |
Parquet already supports reading a single field/key-value at the storage level, but Spark SQL currently has no optimization that takes advantage of it. This would be a very useful optimization for tables that have a Map or Struct with many fields.
Jianshi
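As a hedged aside on the resolution: the linked SPARK-4502 work later added nested schema pruning for Parquet. A one-line sketch of enabling it, assuming a Spark 2.4+ session named spark and that the config name below is unchanged in the release at hand:

```scala
// Assumed config name for nested schema pruning (Spark 2.4+); verify availability
// and the default value for your version.
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", "true")

// With pruning enabled, the Parquet scan should read only variables.var1
// instead of the whole 'variables' column.
```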
Issue Links
- duplicates SPARK-4502: Spark SQL reads unneccesary nested fields from Parquet (Resolved)