Details
- Type: Bug
- Status: Resolved
- Priority: Blocker
- Resolution: Fixed
Description
I'm currently working on making pyarrow.serialization data available from the Java side. One problem I ran into is that the Java implementation apparently cannot read UnionArrays generated from C++. To make this easy to reproduce, I created a clean Python implementation for creating UnionArrays: https://github.com/apache/arrow/pull/1216
The data is generated with the following script:
import pyarrow as pa

binary = pa.array([b'a', b'b', b'c', b'd'], type='binary')
int64 = pa.array([1, 2, 3], type='int64')
types = pa.array([0, 1, 0, 0, 1, 1, 0], type='int8')
value_offsets = pa.array([0, 0, 2, 1, 1, 2, 3], type='int32')
result = pa.UnionArray.from_arrays([binary, int64], types, value_offsets)
batch = pa.RecordBatch.from_arrays([result], ["test"])

sink = pa.BufferOutputStream()
writer = pa.RecordBatchStreamWriter(sink, batch.schema)
writer.write_batch(batch)
sink.close()
b = sink.get_result()
with open("union_array.arrow", "wb") as f:
    f.write(b)

# Sanity check: Read the batch in again
with open("union_array.arrow", "rb") as f:
    b = f.read()
reader = pa.RecordBatchStreamReader(pa.BufferReader(b))
batch = reader.read_next_batch()
print("union array is", batch.column(0))
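For reference, the dense-union semantics of the script above can be checked by hand: each entry in `types` selects a child array, and the matching entry in `value_offsets` indexes into that child. A plain-Python sketch of that resolution (just an illustration, no pyarrow involved):

```python
# Resolve the dense union by hand: types[i] selects the child array,
# value_offsets[i] indexes into that child's compact values.
binary = [b'a', b'b', b'c', b'd']   # child 0
int64 = [1, 2, 3]                   # child 1
children = {0: binary, 1: int64}
types = [0, 1, 0, 0, 1, 1, 0]
value_offsets = [0, 0, 2, 1, 1, 2, 3]
resolved = [children[t][o] for t, o in zip(types, value_offsets)]
print(resolved)  # [b'a', 1, b'c', b'b', 2, 3, b'd']
```

So the union in the generated file should round-trip to seven elements interleaving both children.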
I attached the file generated by that script. Then when I run the following code in Java:
RootAllocator allocator = new RootAllocator(1000000000);
ByteArrayInputStream in = new ByteArrayInputStream(Files.readAllBytes(Paths.get("union_array.arrow")));
ArrowStreamReader reader = new ArrowStreamReader(in, allocator);
reader.loadNextBatch()
I get the following error:
|  java.lang.IllegalArgumentException thrown: Could not load buffers for field test: Union(Sparse, [22, 5])<0: Binary, 1: Int(64, true)>. error message: can not truncate buffer to a larger size 7: 0
|        at VectorLoader.loadBuffers (VectorLoader.java:83)
|        at VectorLoader.load (VectorLoader.java:62)
|        at ArrowReader$1.visit (ArrowReader.java:125)
|        at ArrowReader$1.visit (ArrowReader.java:111)
|        at ArrowRecordBatch.accepts (ArrowRecordBatch.java:128)
|        at ArrowReader.loadNextBatch (ArrowReader.java:137)
|        at (#7:1)
It seems like Java is not picking up that the UnionArray is Dense instead of Sparse. After changing the default in java/vector/src/main/codegen/templates/UnionVector.java from Sparse to Dense, I get this:
jshell> reader.getVectorSchemaRoot().getSchema()
$9 ==> Schema<list: Union(Dense, [0])<: Struct<list: List<item: Union(Dense, [0])<: Int(64, true)>>>>>
but then reading doesn't work:
jshell> reader.loadNextBatch()
|  java.lang.IllegalArgumentException thrown: Could not load buffers for field list: Union(Dense, [1])<: Struct<list: List<$data$: Union(Dense, [5])<: Int(64, true)>>>>. error message: can not truncate buffer to a larger size 1: 0
|        at VectorLoader.loadBuffers (VectorLoader.java:83)
|        at VectorLoader.load (VectorLoader.java:62)
|        at ArrowReader$1.visit (ArrowReader.java:125)
|        at ArrowReader$1.visit (ArrowReader.java:111)
|        at ArrowRecordBatch.accepts (ArrowRecordBatch.java:128)
|        at ArrowReader.loadNextBatch (ArrowReader.java:137)
|        at (#8:1)
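For context (my understanding, not authoritative): in a sparse union every child array spans the full length of the union and the types buffer alone selects among them, while in a dense union the children are compact and an extra value_offsets buffer indexes into the selected child. A reader assuming the sparse layout would therefore expect child buffers longer than a dense union actually provides, which would explain the "can not truncate buffer to a larger size" errors above. A plain-Python sketch of the two layouts:

```python
# Sparse union: each child spans the full union length; `types` alone selects.
types = [0, 1, 0]
sparse_children = {0: [b'a', None, b'c'], 1: [None, 7, None]}
sparse = [sparse_children[t][i] for i, t in enumerate(types)]

# Dense union: children hold only their own values; `value_offsets`
# indexes into the selected (shorter) child.
dense_children = {0: [b'a', b'c'], 1: [7]}
value_offsets = [0, 0, 1]
dense = [dense_children[t][o] for t, o in zip(types, value_offsets)]

assert sparse == dense == [b'a', 7, b'c']
```

Both layouts encode the same logical values; only the buffer shapes differ, so a mismatch in the assumed mode shows up as a buffer-size error rather than wrong data.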
Any help with this is appreciated!
Attachments
Issue Links
- is blocked by: ARROW-590 [Integration] Add integration tests for Union types (Resolved)
- is duplicated by: ARROW-9284 [Java] getMinorTypeForArrowType returns sparse minor type for dense union types (Closed)