Apache Arrow / ARROW-2571

[C++] Lz4Codec doesn't properly handle empty data

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.10.0
    • Component/s: C++

      Description

      For example, the following closure (round-trip) test fails:

      import pyarrow as pa
      import pyarrow.parquet as pq
      
      data = [pa.array([None] * 10)]
      batch = pa.RecordBatch.from_arrays(data, ['x'])
      table = pa.Table.from_batches([batch])
      pq.write_table(table, "test.parquet", compression='LZ4')
      table = pq.read_table("test.parquet")
      

      with the following error:

      Traceback (most recent call last):
        File "test.py", line 8, in <module>
          table = pq.read_table("test.parquet")
        File "python3.6/site-packages/pyarrow/parquet.py", line 987, in read_table
          use_pandas_metadata=use_pandas_metadata)
        File "python3.6/site-packages/pyarrow/parquet.py", line 149, in read
          nthreads=nthreads)
        File "_parquet.pyx", line 736, in pyarrow._parquet.ParquetReader.read_all
        File "error.pxi", line 83, in pyarrow.lib.check_status
      pyarrow.lib.ArrowIOError: Arrow error: IOError: Corrupt Lz4 compressed data.
      

      Writing a file with LZ4 from Python requires the patch for ARROW-2570, but the issue can also be reproduced by creating an input file with parquet-cpp. The file must be compressed with LZ4 and contain a column with only null (missing) values.
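
      The edge case in the title can be illustrated without Arrow at all. The sketch below is not the Lz4Codec code: it uses the standard-library zlib module as a stand-in codec, and decompress_strict, decompress_checked, roundtrip_ok and CorruptDataError are hypothetical names. It assumes one plausible failure mode, which this report does not confirm: a decompressor that treats a zero-byte result as corruption, so an empty payload fails to round-trip even though the compressed bytes are valid.

```python
import zlib


class CorruptDataError(IOError):
    """Hypothetical error type standing in for Arrow's IOError."""


def decompress_strict(data: bytes, expected_size: int) -> bytes:
    # Assumed failure mode (not taken from the Arrow sources):
    # a zero-byte result is mistaken for a decoder failure.
    out = zlib.decompress(data)
    if len(out) < 1:
        raise CorruptDataError("Corrupt compressed data.")
    return out


def decompress_checked(data: bytes, expected_size: int) -> bytes:
    # Validate against the size the caller expects, so that an
    # empty payload (expected_size == 0) is accepted as valid.
    out = zlib.decompress(data)
    if len(out) != expected_size:
        raise CorruptDataError("Corrupt compressed data.")
    return out


def roundtrip_ok(decompress, payload: bytes) -> bool:
    # Compress, decompress, and compare with the original payload.
    compressed = zlib.compress(payload)
    try:
        return decompress(compressed, len(payload)) == payload
    except CorruptDataError:
        return False


if __name__ == "__main__":
    for payload in (b"", b"abc"):
        print(len(payload),
              roundtrip_ok(decompress_strict, payload),
              roundtrip_ok(decompress_checked, payload))
```

      A regression test for a real codec would likely follow the same shape: include a zero-length payload next to the non-empty ones and require that every payload round-trips.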

              People

              • Assignee: Dmitry Kalinkin (veprbl)
              • Reporter: Dmitry Kalinkin (veprbl)
              • Votes: 0
              • Watchers: 2

                  Time Tracking

                  • Estimated: Not Specified
                  • Remaining: 0h
                  • Logged: 1h 10m