Details
Type: Bug
Status: Closed
Priority: Major
Resolution: Invalid
Description
Per the discussions in the following threads, and as the spec (http://arrow.apache.org/docs/format/IPC.html#streaming-format) describes, dictionary batches and record batches can be interleaved in a stream, as long as a record batch does not reference a dictionary that has not been sent yet.
https://github.com/apache/arrow/pull/4960
https://github.com/apache/arrow/pull/5146
Reading a stream in which dictionaries and record batches are interleaved is already supported via ARROW-6040, but it is currently impossible to write data in this format.
The cases below should be supported:
i. A record batch with one dictionary-encoded column S
- Schema
- RecordBatch: S = [null, null, null, null]
- DictionaryBatch: ['abc', 'efg']
- RecordBatch: S = [0, 1, 0, 1]
ii. A record batch with two dictionary-encoded columns S1, S2
- Schema
- DictionaryBatch S1: ['ab', 'cd']
- RecordBatch: S1 = [0, 1, 0, 1], S2 = [null, null, null, null]
- DictionaryBatch S2: ['cc', 'dd']
- RecordBatch: S1 = [0, 1, 0, 1], S2 = [0, 1, 0, 1]
This issue tracks the problem; the work should be done after a discussion on the mailing list.
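To make the ordering rule behind cases i and ii concrete, here is a minimal sketch in plain Python. It is a toy model of the message sequence, not the Arrow API: the `Schema`, `DictionaryBatch`, and `RecordBatch` classes, the `decode_stream` function, and the dictionary ids 0 and 1 are all hypothetical names introduced for illustration. The assumption it encodes is that a record batch is valid as long as every non-null index refers to a dictionary that has already appeared in the stream.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Union


@dataclass
class Schema:
    # Hypothetical model: dictionary-encoded column name -> dictionary id.
    dict_columns: Dict[str, int]


@dataclass
class DictionaryBatch:
    dict_id: int
    values: List[str]


@dataclass
class RecordBatch:
    # Column name -> dictionary indices; None stands for a null slot.
    columns: Dict[str, List[Optional[int]]]


Message = Union[Schema, DictionaryBatch, RecordBatch]


def decode_stream(messages: List[Message]) -> List[Dict[str, List[Optional[str]]]]:
    """Decode record batches in stream order.

    Raises ValueError if a non-null index is used before the
    dictionary it refers to has been seen in the stream.
    """
    schema: Optional[Schema] = None
    dictionaries: Dict[int, List[str]] = {}
    decoded = []
    for msg in messages:
        if isinstance(msg, Schema):
            schema = msg
        elif isinstance(msg, DictionaryBatch):
            dictionaries[msg.dict_id] = msg.values
        else:
            if schema is None:
                raise ValueError("record batch received before schema")
            batch = {}
            for name, indices in msg.columns.items():
                values = dictionaries.get(schema.dict_columns[name])
                column = []
                for index in indices:
                    if index is None:
                        # Nulls need no dictionary, so they are always allowed.
                        column.append(None)
                    elif values is None:
                        raise ValueError(f"column {name!r} uses an undefined dictionary")
                    else:
                        column.append(values[index])
                batch[name] = column
            decoded.append(batch)
    return decoded


# Case i: one dictionary-encoded column S.
case_i = [
    Schema({"S": 0}),
    RecordBatch({"S": [None, None, None, None]}),
    DictionaryBatch(0, ["abc", "efg"]),
    RecordBatch({"S": [0, 1, 0, 1]}),
]

# Case ii: two dictionary-encoded columns S1 and S2.
case_ii = [
    Schema({"S1": 0, "S2": 1}),
    DictionaryBatch(0, ["ab", "cd"]),
    RecordBatch({"S1": [0, 1, 0, 1], "S2": [None, None, None, None]}),
    DictionaryBatch(1, ["cc", "dd"]),
    RecordBatch({"S1": [0, 1, 0, 1], "S2": [0, 1, 0, 1]}),
]

result_i = decode_stream(case_i)
result_ii = decode_stream(case_ii)
```

In this model both cases decode without error, while a stream that places `RecordBatch: S = [0, 1, 0, 1]` before its dictionary batch is rejected. A writer supporting the requested format would have to emit messages in an order that satisfies the same check.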