[SPARK-14098] Generate Java code to build CachedColumnarBatch and get values from CachedColumnarBatch when DataFrame.cache() is called - ASF JIRA

Details

Type: Umbrella
Status: Resolved
Priority: Major
Resolution: Done
Affects Version/s: None
Fix Version/s: None
Component/s: SQL
Labels:
- release-notes

Description

Here is a design document for this change (**TODO: Update the document**).

This JIRA implements a new in-memory cache feature used by DataFrame.cache and Dataset.cache. The following is the basic design, based on discussions with Sameer, Weichen, Xiao, Herman, and Nong.

- Use ColumnarBatch with ColumnVector, which are the common data representations for columnar storage
- Use multiple compression schemes (such as RLE, intdelta, and so on) for each ColumnVector in ColumnarBatch, depending on its data type
- Generate code that is simple and specialized for each in-memory cache to build that cache
- Generate code that directly reads data from ColumnVector for the in-memory cache by whole-stage codegen
- Enhance ColumnVector to keep UnsafeArrayData
- Use a primitive-type array for uncompressed data of primitive types in ColumnVector
- Use byte[] for UnsafeArrayData and compressed data
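
The storage split in the design points above (a primitive array for uncompressed data, a byte[] for compressed data) can be illustrated with a minimal sketch. This is not Spark code: `IntColumnSketch` and all of its methods are hypothetical names invented for illustration, and the run-length layout used here (pairs of 4-byte value and 4-byte run length) is an assumption, not the encoding Spark uses.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical stand-in for one int column of a cached batch (not a Spark class).
final class IntColumnSketch {
    private final int[] plain;   // uncompressed storage; null when the column is compressed
    private final byte[] rle;    // packed (value, runLength) pairs; null when uncompressed
    private final int numRows;

    private IntColumnSketch(int[] plain, byte[] rle, int numRows) {
        this.plain = plain;
        this.rle = rle;
        this.numRows = numRows;
    }

    // Keep the values as-is in a primitive int[].
    static IntColumnSketch uncompressed(int[] values) {
        return new IntColumnSketch(values.clone(), null, values.length);
    }

    // Run-length encode the values into a byte[] (assumed layout: int value, int run).
    static IntColumnSketch runLengthEncoded(int[] values) {
        // Worst case is one run per value, i.e. 8 bytes per row.
        ByteBuffer buf = ByteBuffer.allocate(values.length * 8);
        int i = 0;
        while (i < values.length) {
            int value = values[i];
            int run = 1;
            while (i + run < values.length && values[i + run] == value) {
                run++;
            }
            buf.putInt(value).putInt(run);
            i += run;
        }
        byte[] packed = Arrays.copyOf(buf.array(), buf.position());
        return new IntColumnSketch(null, packed, values.length);
    }

    // Materialize the column, decompressing if needed.
    int[] toArray() {
        if (plain != null) {
            return plain.clone();
        }
        int[] out = new int[numRows];
        ByteBuffer buf = ByteBuffer.wrap(rle);
        int pos = 0;
        while (buf.hasRemaining()) {
            int value = buf.getInt();
            int run = buf.getInt();
            Arrays.fill(out, pos, pos + run, value);
            pos += run;
        }
        return out;
    }

    // Number of bytes held by the backing storage.
    int storedBytes() {
        return plain != null ? plain.length * 4 : rle.length;
    }
}
```

For the input `{7, 7, 7, 7, 3, 3, 9}` this sketch stores three runs (24 bytes) instead of seven ints (28 bytes). Choosing the scheme per column based on its data type and contents is the part the design leaves to each ColumnVector.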

Based on this design, this JIRA generates two kinds of Java code for DataFrame.cache()/Dataset.cache():

1. Generate Java code to build CachedColumnarBatch, which keeps data in ColumnarBatch
2. Generate Java code to get a value of each column from ColumnarBatch
- a. Get a value directly from ColumnarBatch in code generated by whole-stage codegen (primary path)
- b. Get a value through an iterator if whole-stage codegen is disabled (e.g. when the number of columns is more than 100; backup path)
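
The difference between the two read paths can be sketched as follows. This is a hand-written illustration, not the code Spark generates: `BatchReadSketch`, `Batch`, and the two-column schema (`id`, `price`) are hypothetical. The primary path reads column arrays directly inside the processing loop, while the backup path pulls one row at a time through an iterator.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

final class BatchReadSketch {
    // Hypothetical cached batch: one primitive array per column.
    static final class Batch {
        final int[] id;
        final double[] price;
        final int numRows;

        Batch(int[] id, double[] price) {
            if (id.length != price.length) {
                throw new IllegalArgumentException("columns must have the same length");
            }
            this.id = id;
            this.price = price;
            this.numRows = id.length;
        }
    }

    // Primary path (sketch): the loop reads each column value directly,
    // with no per-row object in between.
    static double sumDirect(Batch b, int minId) {
        double sum = 0.0;
        for (int row = 0; row < b.numRows; row++) {
            if (b.id[row] >= minId) {
                sum += b.price[row];
            }
        }
        return sum;
    }

    // Backup path (sketch): an iterator materializes one row at a time.
    static Iterator<Object[]> rows(Batch b) {
        return new Iterator<Object[]>() {
            private int row = 0;

            @Override
            public boolean hasNext() {
                return row < b.numRows;
            }

            @Override
            public Object[] next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                Object[] r = { b.id[row], b.price[row] };
                row++;
                return r;
            }
        };
    }

    // Same computation as sumDirect, but consuming rows from the iterator.
    static double sumViaIterator(Batch b, int minId) {
        double sum = 0.0;
        Iterator<Object[]> it = rows(b);
        while (it.hasNext()) {
            Object[] r = it.next();
            if ((Integer) r[0] >= minId) {
                sum += (Double) r[1];
            }
        }
        return sum;
    }
}
```

Both methods compute the same result. The iterator version allocates and boxes per row, which is the kind of overhead the direct path is meant to avoid.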

Issue Links

links to

[Github] Pull Request #11956 (kiszk)

[Github] Pull Request #15219 (kiszk)

Sub-Tasks

1.	Improve ColumnStats	Resolved	Kazuaki Ishizaki
2.	Enhance ColumnVector to support compressed representation	Resolved	Kazuaki Ishizaki
3.	Add compression/decompression of data to ColumnVector	Resolved	Unassigned
4.	Enhance ColumnVector to keep UnsafeArrayData for other types	Resolved	Unassigned
5.	Add compression/decompression of data to ColumnVector for other compression schemes	Resolved	Unassigned
6.	Add compression/decompression of data to ColumnVector for other data types	Resolved	Unassigned
7.	Generate code to get value from CachedBatchColumnVector in ColumnarBatch	Resolved	Kazuaki Ishizaki
8.	Generate code to build table cache using ColumnarBatch and to get value from ColumnVector for other types	Resolved	Unassigned
9.	Generate code to get value from table cache with wider column in ColumnarBatch	Resolved	Unassigned
10.	Generate code to get value from table cache with wider column in ColumnarBatch for other data types	Resolved	Unassigned
11.	Support compression/decompression of ColumnVector in generated code	Resolved	Unassigned

People

Assignee: Unassigned

Reporter: Kazuaki Ishizaki

Votes: 0

Watchers: 20

Dates

Created: 23/Mar/16 18:34

Updated: 09/Sep/19 16:32

Resolved: 09/Sep/19 16:31