Description
HQL "distribute by <column_name>" partitions data by the values of the specified column. We can pass this partitioning information to the in-memory cache for further performance improvements. For example, in joins, an extra partitioning step can be skipped when the cached data is already partitioned on the join key.
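To illustrate why known partitioning saves a step in joins, here is a minimal plain-Python sketch. It is not Spark's implementation: `distribute_by` and `partitionwise_join` are hypothetical helpers, and the sample rows are made up. It only shows that once both sides are hash-partitioned on the join key with the same partition count, matching rows always land in the same partition index, so each partition pair can be joined locally with no further repartitioning.

```python
from collections import defaultdict


def distribute_by(rows, key, num_partitions):
    """Hash-partition a list of dict rows on the value of `key` (hypothetical helper)."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        partitions[hash(row[key]) % num_partitions].append(row)
    return partitions


def partitionwise_join(left_parts, right_parts, key):
    """Inner-join two datasets that are already co-partitioned on `key`.

    Because both sides used the same hash function and partition count,
    partition i of the left side only needs partition i of the right side.
    """
    assert len(left_parts) == len(right_parts), "sides must be co-partitioned"
    joined = []
    for left, right in zip(left_parts, right_parts):
        # Build a local lookup for this partition only.
        index = defaultdict(list)
        for r in right:
            index[r[key]].append(r)
        for l in left:
            for r in index.get(l[key], []):
                joined.append({**l, **r})
    return joined


# Made-up sample data for illustration.
orders = [
    {"user_id": 1, "item": "book"},
    {"user_id": 2, "item": "pen"},
    {"user_id": 1, "item": "lamp"},
]
users = [
    {"user_id": 1, "name": "Ann"},
    {"user_id": 2, "name": "Bob"},
]

orders_parts = distribute_by(orders, "user_id", 4)
users_parts = distribute_by(users, "user_id", 4)
result = partitionwise_join(orders_parts, users_parts, "user_id")
```

If the cache did not record how `orders_parts` was partitioned, a planner would have to assume arbitrary placement and repartition both sides before joining. Retaining that information is what lets the step be skipped.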
Issue Links
- duplicates: SPARK-5354 Set InMemoryColumnarTableScan's outputPartitioning and outputOrdering (Resolved)
- relates to: SPARK-11410 Add a DataFrame API that provides functionality similar to HiveQL's DISTRIBUTE BY (Resolved)