PARQUET-1878: [C++] lz4 codec is not compatible with Hadoop Lz4Codec


    Description

      As described in HADOOP-12990, the Hadoop Lz4Codec uses the lz4 block format and prepends 8 extra bytes, two big-endian 32-bit integers giving the uncompressed and compressed sizes, before the compressed data. I believe the lz4 implementation in parquet-cpp also uses the lz4 block format, but it does not prepend these 8 bytes.
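      For illustration, a minimal Python sketch of the two framings (assuming the third-party python-lz4 package; neither implementation actually uses it):

      import struct

      import lz4.block

      def hadoop_lz4_compress(data):
          # Raw lz4 block; store_size=False suppresses python-lz4's own
          # little-endian size header so we control the framing ourselves.
          compressed = lz4.block.compress(data, store_size=False)
          # Hadoop prepends two big-endian uint32s: the uncompressed size,
          # then the compressed size.
          return struct.pack(">II", len(data), len(compressed)) + compressed

      def parquet_cpp_lz4_compress(data):
          # parquet-cpp writes the bare block, with no size prefix.
          return lz4.block.compress(data, store_size=False)

      A reader expecting the 8-byte prefix misinterprets the first bytes of a bare block as sizes (and vice versa), which surfaces as the corruption error shown below.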


      Using Java parquet-mr, I wrote a Parquet file with lz4 compression:

      $ parquet-tools meta /tmp/f4a1c7f57cb1c98c2b9da3b25b16d027df5d2f1cf55adb79374c154fbd79011f
      file:        file:/tmp/f4a1c7f57cb1c98c2b9da3b25b16d027df5d2f1cf55adb79374c154fbd79011f
      creator:     parquet-mr version 1.10.1 (build a89df8f9932b6ef6633d06069e50c9b7970bebd1)

      file schema:
      --------------------------------------------------------------------------------
      c1:          REQUIRED INT64 R:0 D:0
      c0:          REQUIRED BINARY R:0 D:0
      v0:          REQUIRED INT64 R:0 D:0

      row group 1: RC:5007 TS:28028 OFFSET:4
      --------------------------------------------------------------------------------
      c1:           INT64 LZ4 DO:0 FPO:4 SZ:24797/25694/1.04 VC:5007 ENC:DELTA_BINARY_PACKED ST:[min: 1566330126476659000, max: 1571211622650188000, num_nulls: 0]
      c0:           BINARY LZ4 DO:0 FPO:24801 SZ:279/260/0.93 VC:5007 ENC:PLAIN,RLE_DICTIONARY ST:[min: 0x7471732F62656566616C6F2F746F6D6163636F2D66782D6D6B74646174612D6C69766573747265616D, max: 0x7471732F62656566616C6F2F746F6D6163636F2D66782D6D6B74646174612D6C69766573747265616D, num_nulls: 0]
      v0:           INT64 LZ4 DO:0 FPO:25080 SZ:1348/2074/1.54 VC:5007 ENC:PLAIN,RLE_DICTIONARY ST:[min: 0, max: 9, num_nulls: 0]

      When I attempted to read this file with parquet-cpp, I got the following error:

      >>> import pyarrow.parquet as pq
      >>> pq.read_table('/tmp/f4a1c7f57cb1c98c2b9da3b25b16d027df5d2f1cf55adb79374c154fbd79011f')
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
        File "/home/skim/miniconda3/envs/arrow/lib/python3.8/site-packages/pyarrow/parquet.py", line 1536, in read_table
          return pf.read(columns=columns, use_threads=use_threads,
        File "/home/skim/miniconda3/envs/arrow/lib/python3.8/site-packages/pyarrow/parquet.py", line 1260, in read
          table = piece.read(columns=columns, use_threads=use_threads,
        File "/home/skim/miniconda3/envs/arrow/lib/python3.8/site-packages/pyarrow/parquet.py", line 707, in read
          table = reader.read(**options)
        File "/home/skim/miniconda3/envs/arrow/lib/python3.8/site-packages/pyarrow/parquet.py", line 336, in read
          return self.reader.read_all(column_indices=column_indices,
        File "pyarrow/_parquet.pyx", line 1130, in pyarrow._parquet.ParquetReader.read_all
        File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
      OSError: IOError: Corrupt Lz4 compressed data. 


      https://github.com/apache/arrow/issues/3491 reported an incompatibility in the other direction: Spark (which uses the Hadoop lz4 codec) failing to read a Parquet file that was written with parquet-cpp.
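      The reverse reproduction is a plain parquet-cpp/pyarrow write; a sketch (the table contents are illustrative):

      import pyarrow as pa
      import pyarrow.parquet as pq

      # In pyarrow releases from the time of this report, 'lz4' selected the
      # bare-block framing, so Spark's Hadoop codec cannot read the result.
      table = pa.table({"c1": [1, 2, 3]})
      pq.write_table(table, "/tmp/lz4_from_cpp.parquet", compression="lz4")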


      Given that the Hadoop lz4 codec has long been in use, and that users have accumulated Parquet files written with it, I propose changing parquet-cpp to match the Hadoop implementation.
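      One way to do this while keeping existing parquet-cpp files readable (a sketch of the general idea, not a specific patch): write the Hadoop framing, and on read probe for the 8-byte prefix, falling back to a bare block. The expected_size would come from the Parquet page header. Note that Hadoop can split a large block into several length-prefixed chunks, which this sketch ignores.

      import struct

      import lz4.block

      def lenient_lz4_decompress(buf, expected_size):
          # Probe for a Hadoop-style prefix: two big-endian uint32s holding
          # the uncompressed and compressed sizes of the data that follows.
          if len(buf) >= 8:
              raw_size, comp_size = struct.unpack(">II", buf[:8])
              if raw_size == expected_size and comp_size == len(buf) - 8:
                  return lz4.block.decompress(buf[8:], uncompressed_size=expected_size)
          # Otherwise treat the whole buffer as a bare lz4 block.
          return lz4.block.decompress(buf, uncompressed_size=expected_size)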




People

  Assignee: Patrick Pai (ppai)
  Reporter: Steve M. Kim (chairmank)
