Apache Arrow / ARROW-15687

[Format] Clarify that 8 byte padding must not be applied to compressed buffers

Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Format
    • Labels: None

    Description

      I could not find where this is discussed, but I believe the format specification does not mention that 8-byte padding must not be applied to a compressed buffer, because padding would make the size of the compressed data unrecoverable.

      For example:

      ```
      import pyarrow
      import pyarrow.ipc

      data = [
          pyarrow.array([1, 2, 3, 4, 5], type="int32"),
      ]

      batch = pyarrow.record_batch(data, names=['f0'])

      # Write a single zstd-compressed record batch to an Arrow IPC file.
      with pyarrow.OSFile('test1.arrow', 'wb') as sink:
          with pyarrow.ipc.new_file(sink, batch.schema, options=pyarrow.ipc.IpcWriteOptions(compression="zstd")) as writer:
              writer.write(batch)
      ```

      This writes a single data buffer containing the bytes

      ```
      [20, 0, 0, 0, 0, 0, 0, 0, 40, 181, 47, 253, 32, 20, 161, 0, 0, 1, 0, 0, 0, 2, 0, 0, 0, 3, 0, 0, 0, 4, 0, 0, 0, 5, 0, 0, 0]
      ```
      which is 37 bytes long (padding to a multiple of 8 would require 40 bytes).
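
To make the byte count concrete, the following sketch splits the dump above into its parts. It assumes the IPC layout for compressed buffers, in which the first 8 bytes hold the uncompressed length as a little-endian int64 and the remainder is the compressed body. The helper name `split_compressed_buffer` is hypothetical and is not part of pyarrow.

```python
import struct

# Bytes of the data buffer from the dump above.
BUFFER = bytes([
    20, 0, 0, 0, 0, 0, 0, 0,
    40, 181, 47, 253, 32, 20, 161, 0, 0,
    1, 0, 0, 0, 2, 0, 0, 0, 3, 0, 0, 0, 4, 0, 0, 0, 5, 0, 0, 0,
])

# Zstandard frame magic number 0xFD2FB528, stored little-endian.
ZSTD_MAGIC = bytes([40, 181, 47, 253])


def split_compressed_buffer(buf):
    """Hypothetical helper: return (uncompressed_length, compressed_body).

    Assumes an 8-byte little-endian int64 length prefix.
    """
    (uncompressed_length,) = struct.unpack_from("<q", buf, 0)
    return uncompressed_length, buf[8:]


uncompressed_length, body = split_compressed_buffer(BUFFER)
# Length the buffer would have if rounded up to a multiple of 8.
padded_length = (len(BUFFER) + 7) // 8 * 8

print(len(BUFFER))                  # 37
print(uncompressed_length)          # 20 (five int32 values)
print(len(body))                    # 29
print(body.startswith(ZSTD_MAGIC))  # True
print(padded_length)                # 40
```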

      My understanding is that we do not pad because doing so would make it impossible to recover the original size of the compressed data, and padding offers no advantage since users cannot mmap compressed data anyway.
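
The ambiguity described above can be illustrated with plain arithmetic. This is only a sketch, not Arrow code: `pad_to_8` is a hypothetical helper, and the 8-byte length prefix is an assumption about the IPC layout for compressed buffers. It lists every compressed-body size that would produce the same 40-byte padded buffer.

```python
PREFIX_BYTES = 8  # assumed: int64 uncompressed-length prefix


def pad_to_8(n):
    """Hypothetical helper: round n up to the next multiple of 8."""
    return (n + 7) // 8 * 8


# Compressed-body sizes whose padded buffer length would be 40 bytes.
candidates = [
    body_size for body_size in range(1, 64)
    if pad_to_8(PREFIX_BYTES + body_size) == 40
]
print(candidates)  # [25, 26, 27, 28, 29, 30, 31, 32]
```

A reader given only the padded length of 40 cannot tell which of these sizes is the real one (29 in the example), whereas the unpadded length of 37 determines it exactly.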

          People

            Assignee: Unassigned
            Reporter: Jorge Leitão (jorgecarleitao)