  1. Apache Arrow
  2. ARROW-1692

[Python, Java] UnionArray round trip not working

Details

    Description

      I'm currently working on making pyarrow.serialization data available from the Java side. One problem I ran into is that the Java implementation does not seem to be able to read UnionArrays generated from C++. To make this easy to reproduce, I created a clean Python implementation for creating UnionArrays: https://github.com/apache/arrow/pull/1216

      The data is generated with the following script:

      import pyarrow as pa
      
      binary = pa.array([b'a', b'b', b'c', b'd'], type='binary')
      int64 = pa.array([1, 2, 3], type='int64')
      types = pa.array([0, 1, 0, 0, 1, 1, 0], type='int8')
      value_offsets = pa.array([0, 0, 2, 1, 1, 2, 3], type='int32')
      
      result = pa.UnionArray.from_arrays([binary, int64], types, value_offsets)
      
      batch = pa.RecordBatch.from_arrays([result], ["test"])
      
      sink = pa.BufferOutputStream()
      writer = pa.RecordBatchStreamWriter(sink, batch.schema)
      
      writer.write_batch(batch)
      
      sink.close()
      
      b = sink.get_result()
      
      with open("union_array.arrow", "wb") as f:
          f.write(b)
      
      # Sanity check: Read the batch in again
      
      with open("union_array.arrow", "rb") as f:
          b = f.read()
          reader = pa.RecordBatchStreamReader(pa.BufferReader(b))
      
      batch = reader.read_next_batch()
      
      print("union array is", batch.column(0))
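
For readers less familiar with the dense union layout, here is a minimal pure-Python sketch of how the `types` and `value_offsets` arrays from the script select values from the two children. It does not use pyarrow; the helper name `resolve_dense_union` is made up for illustration and is not part of any Arrow API.

```python
# Pure-Python sketch of dense union resolution (no pyarrow needed).
# The data mirrors the arrays in the script above.
binary = [b"a", b"b", b"c", b"d"]      # child 0
int64 = [1, 2, 3]                      # child 1
types = [0, 1, 0, 0, 1, 1, 0]          # which child each slot reads from
value_offsets = [0, 0, 2, 1, 1, 2, 3]  # index into the selected child

children = [binary, int64]


def resolve_dense_union(children, types, value_offsets):
    """Hypothetical helper: return the logical values of a dense union."""
    return [children[t][o] for t, o in zip(types, value_offsets)]


logical = resolve_dense_union(children, types, value_offsets)
print(logical)
```

Each slot of the union carries a type id and an offset, so the children can be shorter than the union itself (4 and 3 values for a union of length 7).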
      

      I attached the file generated by that script. Then when I run the following code in Java:

      RootAllocator allocator = new RootAllocator(1000000000);
      
      ByteArrayInputStream in = new ByteArrayInputStream(Files.readAllBytes(Paths.get("union_array.arrow")));
      
      ArrowStreamReader reader = new ArrowStreamReader(in, allocator);
      
      reader.loadNextBatch()
      

      I get the following error:

      |  java.lang.IllegalArgumentException thrown: Could not load buffers for field test: Union(Sparse, [22, 5])<0: Binary, 1: Int(64, true)>. error message: can not truncate buffer to a larger size 7: 0
      |        at VectorLoader.loadBuffers (VectorLoader.java:83)
      |        at VectorLoader.load (VectorLoader.java:62)
      |        at ArrowReader$1.visit (ArrowReader.java:125)
      |        at ArrowReader$1.visit (ArrowReader.java:111)
      |        at ArrowRecordBatch.accepts (ArrowRecordBatch.java:128)
      |        at ArrowReader.loadNextBatch (ArrowReader.java:137)
      |        at (#7:1)
      

      It seems like Java is not picking up that the UnionArray is Dense instead of Sparse. After changing the default in java/vector/src/main/codegen/templates/UnionVector.java from Sparse to Dense, I get this:

      jshell> reader.getVectorSchemaRoot().getSchema()
      $9 ==> Schema<list: Union(Dense, [0])<: Struct<list: List<item: Union(Dense, [0])<: Int(64, true)>>>>>
      

      but then reading doesn't work:

      jshell> reader.loadNextBatch()
      |  java.lang.IllegalArgumentException thrown: Could not load buffers for field list: Union(Dense, [1])<: Struct<list: List<$data$: Union(Dense, [5])<: Int(64, true)>>>>. error message: can not truncate buffer to a larger size 1: 0
      |        at VectorLoader.loadBuffers (VectorLoader.java:83)
      |        at VectorLoader.load (VectorLoader.java:62)
      |        at ArrowReader$1.visit (ArrowReader.java:125)
      |        at ArrowReader$1.visit (ArrowReader.java:111)
      |        at ArrowRecordBatch.accepts (ArrowRecordBatch.java:128)
      |        at ArrowReader.loadNextBatch (ArrowReader.java:137)
      |        at (#8:1)
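
As background on the Dense versus Sparse distinction discussed above, the following sketch computes the child lengths that each union mode implies for the `types` array from the Python script. It is only an illustration of the Arrow layout rules; the function name `expected_child_lengths` is hypothetical. The 7 in the first error message coincides with the union length a sparse reader would expect for every child, but whether that mismatch is the actual cause of the exception is an assumption, not something confirmed here.

```python
from collections import Counter

types = [0, 1, 0, 0, 1, 1, 0]  # type ids from the Python script
type_ids = [0, 1]              # 0: binary child, 1: int64 child


def expected_child_lengths(types, type_ids, mode):
    """Hypothetical helper: child lengths implied by a union mode."""
    if mode == "sparse":
        # Sparse: every child has one slot per union slot.
        return {t: len(types) for t in type_ids}
    if mode == "dense":
        # Dense: each child stores only the slots that select it.
        counts = Counter(types)
        return {t: counts[t] for t in type_ids}
    raise ValueError("mode must be 'sparse' or 'dense'")


sparse_lengths = expected_child_lengths(types, type_ids, "sparse")
dense_lengths = expected_child_lengths(types, type_ids, "dense")
print("sparse:", sparse_lengths)
print("dense:", dense_lengths)
```

A dense union also carries an extra offsets buffer that a sparse union does not have, so a reader that assumes the wrong mode would disagree with the writer about both the number of buffers and the child lengths.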
      

      Any help with this is appreciated!

      Attachments

        1. union_array.arrow
          0.8 kB
          Philipp Moritz

            People

              rymurr Ryan Murray
              pcmoritz Philipp Moritz

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 8h 40m