ARROW-3667: [JS] Incorrectly reads record batches with an all-null column

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: JS-0.3.1
    • Fix Version/s: JS-0.4.1
    • Component/s: JavaScript
    • Labels: None

      Description

      The JS library seems to incorrectly read any columns that come after an all-null column in IPC buffers produced by pyarrow.

      Here's a Python script that generates two Arrow buffers: one with an all-null column followed by a UTF-8 column, and a second with those two columns reversed:

      import pandas as pd
      import pyarrow as pa


      def serialize_to_arrow(df, fd):
          """Write df to fd as a single-batch Arrow IPC file."""
          batch = pa.RecordBatch.from_pandas(df)
          writer = pa.RecordBatchFileWriter(fd, batch.schema)
          writer.write_batch(batch)
          writer.close()


      if __name__ == "__main__":
          # 'bad.arrow': the all-null column comes first.
          df = pd.DataFrame(
              data={'nulls': [None, None, None], 'not nulls': ['abc', 'def', 'ghi']},
              columns=['nulls', 'not nulls'],
          )
          with open('bad.arrow', 'wb') as fd:
              serialize_to_arrow(df, fd)

          # 'good.arrow': same data with the column order reversed.
          df = pd.DataFrame(df, columns=['not nulls', 'nulls'])
          with open('good.arrow', 'wb') as fd:
              serialize_to_arrow(df, fd)
      

      JS incorrectly interprets the [null, not null] case:

      > var arrow = require('apache-arrow')
      undefined
      > var fs = require('fs')
      undefined
      > arrow.Table.from(fs.readFileSync('good.arrow')).getColumn('not nulls').get(0)
      'abc'
      > arrow.Table.from(fs.readFileSync('bad.arrow')).getColumn('not nulls').get(0)
      '\u0000\u0000\u0000\u0000\u0003\u0000\u0000\u0000\u0006\u0000\u0000\u0000\t\u0000\u0000\u0000'
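
      The garbled value does not look random. As a quick check (a sketch, assuming the 16 characters stand for 16 raw bytes that hold little-endian 32-bit integers), decoding it in Python gives the string offsets one would expect for 'abc', 'def', 'ghi':

```python
import struct

# Value returned by the JS reader for 'bad.arrow', copied from the session above.
garbled = '\u0000\u0000\u0000\u0000\u0003\u0000\u0000\u0000\u0006\u0000\u0000\u0000\t\u0000\u0000\u0000'

# Every code point is below 256, so latin-1 maps each character back to one byte.
raw = garbled.encode('latin-1')

# Assumption: the bytes are four little-endian int32 values.
values = struct.unpack('<4i', raw)
print(values)  # (0, 3, 6, 9)
```

      Under that assumption, the reader appears to have returned the bytes of the offsets buffer where it expected the string data.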
      

      Presumably this is because pyarrow is omitting some (or all) of the buffers associated with the all-null column, but the JS IPC reader is still looking for them, causing the buffer count to get out of sync.
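
      To illustrate how a mismatch in per-column buffer counts would shift every later column, here is a toy model in Python. It is not the Arrow JS reader or the pyarrow writer: `read_columns`, the buffer counts, and the placeholder buffers are all hypothetical, chosen only to show the effect.

```python
def read_columns(buffers, layout):
    """Split a flat buffer list into per-column lists.

    `layout` is a list of (column name, number of buffers to consume),
    read strictly in order, like a sequential IPC body reader would.
    """
    columns, pos = {}, 0
    for name, count in layout:
        columns[name] = buffers[pos:pos + count]
        pos += count
    return columns


offsets = b'\x00\x00\x00\x00\x03\x00\x00\x00\x06\x00\x00\x00\t\x00\x00\x00'
data = b'abcdefghi'

# Hypothetical writer output: one empty placeholder buffer for the all-null
# column, then validity, offsets and data for the string column.
written = [b'', b'', offsets, data]

# Reader and writer agree on one buffer for the all-null column.
aligned = read_columns(written, [('nulls', 1), ('not nulls', 3)])

# Reader and writer disagree by one buffer (an assumed mismatch).
shifted = read_columns(written, [('nulls', 0), ('not nulls', 3)])

print(aligned['not nulls'][2])  # the string data
print(shifted['not nulls'][2])  # the offsets bytes end up in the data slot
```

      In this toy setup, a disagreement of a single buffer is enough to put the offsets bytes in the data slot of the following column. The counts are illustrative and do not establish which side is at fault in the real libraries.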

              People

              • Assignee: Paul Taylor (paul.e.taylor)
              • Reporter: Brian Hulette (bhulette)