Details
- Type: Umbrella
- Status: Resolved
- Priority: Major
- Resolution: Done
Description
Here is a design document for this change (**TODO: Update the document**).
This JIRA implements a new in-memory cache feature used by DataFrame.cache and Dataset.cache. The following is the basic design, based on discussions with Sameer, Weichen, Xiao, Herman, and Nong.
- Use ColumnarBatch with ColumnVector, which are the common data representations for columnar storage
- Use multiple compression schemes (such as RLE, intdelta, and so on) for each ColumnVector in ColumnarBatch, depending on its data type
- Generate simple code, specialized for each in-memory cache, to build that cache
- Generate code that reads data directly from ColumnVector for the in-memory cache, using whole-stage codegen
- Enhance ColumnVector to keep UnsafeArrayData
- Use a primitive-type array for uncompressed primitive data types in ColumnVector
- Use byte[] for UnsafeArrayData and compressed data
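The storage choices in the list above can be illustrated with a minimal sketch: uncompressed int data stays in an int[], while run-length-encoded data is packed into a byte[]. The class and method names (IntColumnSketch, storeUncompressed, rleEncode, rleDecode) and the (value, runLength) byte layout are assumptions made for illustration only; they are not Spark's ColumnVector API or its actual encoding.

```java
import java.nio.ByteBuffer;

// Hypothetical illustration only; not Spark's ColumnVector implementation.
final class IntColumnSketch {
    // Uncompressed primitive data is kept in a primitive-type array.
    static int[] storeUncompressed(int[] values) {
        return values.clone();
    }

    // Run-length encode ints as (value, runLength) pairs packed into a byte[].
    // This layout is an assumption chosen for simplicity.
    static byte[] rleEncode(int[] values) {
        // Worst case: every value is its own run, i.e. 8 bytes per value.
        ByteBuffer buf = ByteBuffer.allocate(values.length * 8);
        int i = 0;
        while (i < values.length) {
            int v = values[i];
            int run = 1;
            while (i + run < values.length && values[i + run] == v) {
                run++;
            }
            buf.putInt(v).putInt(run);
            i += run;
        }
        byte[] out = new byte[buf.position()];
        buf.flip();
        buf.get(out);
        return out;
    }

    // Decode the byte[] back into `count` ints.
    static int[] rleDecode(byte[] encoded, int count) {
        ByteBuffer buf = ByteBuffer.wrap(encoded);
        int[] out = new int[count];
        int pos = 0;
        while (buf.hasRemaining()) {
            int v = buf.getInt();
            int run = buf.getInt();
            for (int k = 0; k < run; k++) {
                out[pos++] = v;
            }
        }
        return out;
    }
}
```

In this sketch the six values {7, 7, 7, 3, 3, 9} form three runs, so the encoded form takes 24 bytes, the same as the uncompressed int[]. Data with longer runs is where a scheme like this would save space, which is one reason to choose the scheme per column.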
Based on this design, this JIRA generates two kinds of Java code for DataFrame.cache()/Dataset.cache():
- Generate Java code to build CachedColumnarBatch, which keeps data in ColumnarBatch
- Generate Java code to get a value of each column from ColumnarBatch
- a. Get a value directly from ColumnarBatch in code generated by whole-stage codegen (primary path)
- b. Get a value through an iterator if whole-stage codegen is disabled (e.g., when the number of columns exceeds 100; backup path)
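The difference between the two read paths can be sketched roughly as follows. This is a hand-written illustration under stated assumptions, not the generated code itself: Batch is a hypothetical stand-in for a cached batch holding a single int column, and sumDirect, rowIterator, and sumViaIterator are made-up names.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical illustration only; not the code Spark generates.
final class ReadPathSketch {
    // Stand-in for a cached batch with one uncompressed int column.
    static final class Batch {
        final int[] col0;

        Batch(int[] col0) {
            this.col0 = col0;
        }

        int numRows() {
            return col0.length;
        }
    }

    // (a) Primary path: the consuming loop indexes the column storage directly.
    static long sumDirect(Batch b) {
        long sum = 0;
        for (int row = 0; row < b.numRows(); row++) {
            sum += b.col0[row];
        }
        return sum;
    }

    // (b) Backup path: values are handed out one at a time through an iterator.
    static Iterator<Integer> rowIterator(Batch b) {
        return new Iterator<Integer>() {
            private int row = 0;

            @Override
            public boolean hasNext() {
                return row < b.numRows();
            }

            @Override
            public Integer next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                return b.col0[row++];
            }
        };
    }

    static long sumViaIterator(Batch b) {
        long sum = 0;
        Iterator<Integer> it = rowIterator(b);
        while (it.hasNext()) {
            sum += it.next();
        }
        return sum;
    }
}
```

Both paths return the same values. The direct path avoids the per-row iterator calls and boxing that the iterator path incurs in this sketch.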