[ARROW-1692] [Python, Java] UnionArray round trip not working - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Blocker
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.0.0
Component/s: Integration, Java, Python
Labels:
- columnar-format-1.0
- pull-request-available

External issue URL:
https://github.com/apache/arrow/issues/17700

Description

I'm currently working on making pyarrow.serialization data available from the Java side. One problem I ran into is that the Java implementation does not seem to be able to read UnionArrays generated from C++. To make this easy to reproduce, I created a clean Python implementation for creating UnionArrays: https://github.com/apache/arrow/pull/1216

The data is generated with the following script:

import pyarrow as pa

binary = pa.array([b'a', b'b', b'c', b'd'], type='binary')
int64 = pa.array([1, 2, 3], type='int64')
types = pa.array([0, 1, 0, 0, 1, 1, 0], type='int8')
value_offsets = pa.array([0, 0, 2, 1, 1, 2, 3], type='int32')

result = pa.UnionArray.from_arrays([binary, int64], types, value_offsets)

batch = pa.RecordBatch.from_arrays([result], ["test"])

sink = pa.BufferOutputStream()
writer = pa.RecordBatchStreamWriter(sink, batch.schema)

writer.write_batch(batch)

sink.close()

b = sink.get_result()

with open("union_array.arrow", "wb") as f:
    f.write(b)

# Sanity check: Read the batch in again

with open("union_array.arrow", "rb") as f:
    b = f.read()
    reader = pa.RecordBatchStreamReader(pa.BufferReader(b))

batch = reader.read_next_batch()

print("union array is", batch.column(0))

I attached the file generated by that script. Then when I run the following code in Java:

RootAllocator allocator = new RootAllocator(1000000000);

ByteArrayInputStream in = new ByteArrayInputStream(Files.readAllBytes(Paths.get("union_array.arrow")));

ArrowStreamReader reader = new ArrowStreamReader(in, allocator);

reader.loadNextBatch()

I get the following error:

|  java.lang.IllegalArgumentException thrown: Could not load buffers for field test: Union(Sparse, [22, 5])<0: Binary, 1: Int(64, true)>. error message: can not truncate buffer to a larger size 7: 0
|        at VectorLoader.loadBuffers (VectorLoader.java:83)
|        at VectorLoader.load (VectorLoader.java:62)
|        at ArrowReader$1.visit (ArrowReader.java:125)
|        at ArrowReader$1.visit (ArrowReader.java:111)
|        at ArrowRecordBatch.accepts (ArrowRecordBatch.java:128)
|        at ArrowReader.loadNextBatch (ArrowReader.java:137)
|        at (#7:1)

It seems like Java is not picking up that the UnionArray is Dense instead of Sparse. After changing the default in java/vector/src/main/codegen/templates/UnionVector.java from Sparse to Dense, I get this:

jshell> reader.getVectorSchemaRoot().getSchema()
$9 ==> Schema<list: Union(Dense, [0])<: Struct<list: List<item: Union(Dense, [0])<: Int(64, true)>>>>>

but then reading doesn't work:

jshell> reader.loadNextBatch()
|  java.lang.IllegalArgumentException thrown: Could not load buffers for field list: Union(Dense, [1])<: Struct<list: List<$data$: Union(Dense, [5])<: Int(64, true)>>>>. error message: can not truncate buffer to a larger size 1: 0
|        at VectorLoader.loadBuffers (VectorLoader.java:83)
|        at VectorLoader.load (VectorLoader.java:62)
|        at ArrowReader$1.visit (ArrowReader.java:125)
|        at ArrowReader$1.visit (ArrowReader.java:111)
|        at ArrowRecordBatch.accepts (ArrowRecordBatch.java:128)
|        at ArrowReader.loadNextBatch (ArrowReader.java:137)
|        at (#8:1)
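
For readers unfamiliar with the two union layouts, here is a minimal sketch of the difference, using plain Python lists rather than Arrow buffers. The helper names `read_dense_union` and `read_sparse_union` are hypothetical and are not part of pyarrow or the Java library. The sketch only shows why data written as a dense union cannot be interpreted as sparse; it does not claim to reproduce the exact check that fails inside VectorLoader.

```python
# Illustrative model of Arrow union layouts using plain Python lists.
# This is NOT the pyarrow or Java API; the helper names are hypothetical.

def read_dense_union(types, value_offsets, children):
    """Dense union: slot i reads children[types[i]][value_offsets[i]]."""
    if len(types) != len(value_offsets):
        raise ValueError("types and value_offsets must have equal length")
    return [children[t][o] for t, o in zip(types, value_offsets)]


def read_sparse_union(types, children):
    """Sparse union: every child must be as long as the union itself;
    slot i reads children[types[i]][i]."""
    n = len(types)
    for idx, child in enumerate(children):
        if len(child) != n:
            raise ValueError(
                f"child {idx} has length {len(child)}, expected {n}")
    return [children[t][i] for i, t in enumerate(types)]


# Same values as in the Python reproduction script.
binary = [b'a', b'b', b'c', b'd']
int64 = [1, 2, 3]
types = [0, 1, 0, 0, 1, 1, 0]
value_offsets = [0, 0, 2, 1, 1, 2, 3]

# Read as dense, every slot resolves through its offset.
dense_values = read_dense_union(types, value_offsets, [binary, int64])
print(dense_values)  # [b'a', 1, b'c', b'b', 2, 3, b'd']

# Read as sparse, the children are too short for a 7-slot union.
try:
    read_sparse_union(types, [binary, int64])
    sparse_error = None
except ValueError as exc:
    sparse_error = str(exc)
print(sparse_error)  # child 0 has length 4, expected 7
```

A reader that assumes the sparse layout expects each child to hold 7 values, while the dense data only carries 4 and 3, so the two sides disagree about buffer sizes before any value is read.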

Any help with this is appreciated!

Attachments

union_array.arrow
20/Oct/17 01:39
0.8 kB
Philipp Moritz

Issue Links

is blocked by

ARROW-590 [Integration] Add integration tests for Union types

Resolved

is duplicated by

ARROW-9284 [Java] getMinorTypeForArrowType returns sparse minor type for dense union types

Closed

links to

GitHub Pull Request #7290

People

Assignee: Ryan Murray

Reporter: Philipp Moritz

Votes: 1

Watchers: 6

Dates

Created: 20/Oct/17 03:00

Updated: 11/Jan/23 07:16

Resolved: 11/Jul/20 22:48

Time Tracking

Estimated: Not Specified

Remaining:

Logged: 8h 40m