  1. Hive
  2. HIVE-23034

Arrow serializer should not keep the reference of arrow offset and validity buffers

    Details

      Description

      Currently, part of the writeList() method in the Arrow serializer is implemented as follows:

      final ArrowBuf offsetBuffer = arrowVector.getOffsetBuffer();
      int nextOffset = 0;

      for (int rowIndex = 0; rowIndex < size; rowIndex++) {
        int selectedIndex = rowIndex;
        if (vectorizedRowBatch.selectedInUse) {
          selectedIndex = vectorizedRowBatch.selected[rowIndex];
        }
        if (hiveVector.isNull[selectedIndex]) {
          offsetBuffer.setInt(rowIndex * OFFSET_WIDTH, nextOffset);
        } else {
          offsetBuffer.setInt(rowIndex * OFFSET_WIDTH, nextOffset);
          nextOffset += (int) hiveVector.lengths[selectedIndex];
          arrowVector.setNotNull(rowIndex);
        }
      }
      offsetBuffer.setInt(size * OFFSET_WIDTH, nextOffset);
      

      1) Here we obtain a reference once, with final ArrowBuf offsetBuffer = arrowVector.getOffsetBuffer();, and then keep updating both the Arrow vector and the offset buffer inside the loop.

      Problem -

      arrowVector.setNotNull(rowIndex) checks the index on every call. When a threshold is crossed, it reallocates the offset and validity buffers, updates the vector's internal references, and releases the old buffers (which decrements their reference count). The reference obtained in 1) then points to a released buffer. If we try to read or write through that old buffer, we see:

      Caused by: io.netty.util.IllegalReferenceCountException: refCnt: 0
      	at io.netty.buffer.AbstractByteBuf.ensureAccessible(AbstractByteBuf.java:1413)
      	at io.netty.buffer.ArrowBuf.checkIndexD(ArrowBuf.java:131)
      	at io.netty.buffer.ArrowBuf.chk(ArrowBuf.java:162)
      	at io.netty.buffer.ArrowBuf.setInt(ArrowBuf.java:656)
      	at org.apache.hadoop.hive.ql.io.arrow.Serializer.writeList(Serializer.java:432)
      	at org.apache.hadoop.hive.ql.io.arrow.Serializer.write(Serializer.java:285)
      	at org.apache.hadoop.hive.ql.io.arrow.Serializer.writeStruct(Serializer.java:352)
      	at org.apache.hadoop.hive.ql.io.arrow.Serializer.write(Serializer.java:288)
      	at org.apache.hadoop.hive.ql.io.arrow.Serializer.writeList(Serializer.java:419)
      	at org.apache.hadoop.hive.ql.io.arrow.Serializer.write(Serializer.java:285)
      	at org.apache.hadoop.hive.ql.io.arrow.Serializer.serializeBatch(Serializer.java:205)
      

      Solution -
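      This can be fixed by fetching the buffers from the vector (for example, arrowVector.getOffsetBuffer()) each time we want to update them, instead of holding one reference across calls that may reallocate them.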

      In our internal tests, this failure is seen very frequently on Arrow 0.8.0 but not on 0.10.0. It should still be handled the same way for 0.10.0, since that version does the same thing.
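
      The difference between caching the buffer and re-fetching it can be sketched without Arrow on the classpath. The example below is only a simulation under stated assumptions: RefCountedBuf and GrowableListVector are hypothetical stand-ins written for this illustration (they are not Arrow or Netty classes), the initial capacity of 2 and the doubling policy are made up, null rows and the selected[] indirection are left out, and IllegalStateException stands in for the IllegalReferenceCountException in the trace above. It is not the content of the attached patch.

```java
import java.util.Arrays;

public class StaleBufferDemo {

    /** Hypothetical reference-counted int buffer; a stand-in, not Arrow's ArrowBuf. */
    static final class RefCountedBuf {
        private final int[] data;
        private int refCnt = 1;

        RefCountedBuf(int capacity) { data = new int[capacity]; }

        int capacity() { return data.length; }

        void setInt(int index, int value) { ensureAccessible(); data[index] = value; }

        int getInt(int index) { ensureAccessible(); return data[index]; }

        void release() { refCnt--; }

        private void ensureAccessible() {
            if (refCnt <= 0) {
                throw new IllegalStateException("refCnt: " + refCnt);
            }
        }
    }

    /** Hypothetical vector that may replace its offset buffer inside setNotNull(). */
    static final class GrowableListVector {
        private RefCountedBuf offsetBuffer = new RefCountedBuf(2); // assumed tiny capacity

        RefCountedBuf getOffsetBuffer() { return offsetBuffer; }

        void setNotNull(int index) {
            // Make room for the offsets at index and index + 1.
            while (index + 1 >= offsetBuffer.capacity()) {
                RefCountedBuf bigger = new RefCountedBuf(offsetBuffer.capacity() * 2);
                for (int i = 0; i < offsetBuffer.capacity(); i++) {
                    bigger.setInt(i, offsetBuffer.getInt(i));
                }
                offsetBuffer.release(); // any reference held by a caller is now stale
                offsetBuffer = bigger;
            }
        }
    }

    /** Mirrors the reported pattern: the buffer reference is captured once, before the loop. */
    public static int[] writeWithCachedReference(long[] lengths) {
        GrowableListVector vector = new GrowableListVector();
        final RefCountedBuf offsetBuffer = vector.getOffsetBuffer();
        int nextOffset = 0;
        for (int row = 0; row < lengths.length; row++) {
            offsetBuffer.setInt(row, nextOffset); // fails after the first reallocation
            nextOffset += (int) lengths[row];
            vector.setNotNull(row);
        }
        offsetBuffer.setInt(lengths.length, nextOffset);
        return snapshot(vector, lengths.length + 1);
    }

    /** Proposed pattern: ask the vector for its current buffer on every write. */
    public static int[] writeWithRefetch(long[] lengths) {
        GrowableListVector vector = new GrowableListVector();
        int nextOffset = 0;
        for (int row = 0; row < lengths.length; row++) {
            vector.getOffsetBuffer().setInt(row, nextOffset);
            nextOffset += (int) lengths[row];
            vector.setNotNull(row);
        }
        vector.getOffsetBuffer().setInt(lengths.length, nextOffset);
        return snapshot(vector, lengths.length + 1);
    }

    /** Copies the first count offsets out of the vector's current buffer. */
    private static int[] snapshot(GrowableListVector vector, int count) {
        int[] out = new int[count];
        for (int i = 0; i < count; i++) {
            out[i] = vector.getOffsetBuffer().getInt(i);
        }
        return out;
    }

    public static void main(String[] args) {
        long[] lengths = {3, 1, 2, 4};
        try {
            writeWithCachedReference(lengths);
        } catch (IllegalStateException e) {
            System.out.println("cached reference failed: " + e.getMessage());
        }
        System.out.println("re-fetched buffer: " + Arrays.toString(writeWithRefetch(lengths)));
    }
}
```

      In this simulation the cached variant fails on the first write after the vector swaps its buffer, while the re-fetching variant completes. Whether a real run fails depends on when the vector actually reallocates, which may be why the problem shows up more often on one Arrow version than another.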

        Attachments

        1. HIVE-23034.01.patch
          6 kB
          Shubham Chaurasia

              People

              • Assignee:
                ShubhamChaurasia Shubham Chaurasia
                Reporter:
                ShubhamChaurasia Shubham Chaurasia

                  Time Tracking

                  Original Estimate: Not Specified
                  Remaining Estimate: 0h
                  Time Spent: 0.5h