Apache Arrow / ARROW-3238

[Python] Can't read pyarrow string columns in fastparquet


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Python

    Description

      Writing very long strings with pyarrow produces a file that raises an exception when read with fastparquet.

      Traceback (most recent call last):
        File "/Users/twalker/repos/cloud-atlas/diag/right.py", line 47, in <module>
          read_fastparquet()
        File "/Users/twalker/repos/cloud-atlas/diag/right.py", line 41, in read_fastparquet
          dff = pf.to_pandas(['A'])
        File "/Users/twalker/anaconda/lib/python2.7/site-packages/fastparquet/api.py", line 426, in to_pandas
          index=index, assign=parts)
        File "/Users/twalker/anaconda/lib/python2.7/site-packages/fastparquet/api.py", line 258, in read_row_group
          scheme=self.file_scheme)
        File "/Users/twalker/anaconda/lib/python2.7/site-packages/fastparquet/core.py", line 344, in read_row_group
          cats, selfmade, assign=assign)
        File "/Users/twalker/anaconda/lib/python2.7/site-packages/fastparquet/core.py", line 321, in read_row_group_arrays
          catdef=out.get(name+'-catdef', None))
        File "/Users/twalker/anaconda/lib/python2.7/site-packages/fastparquet/core.py", line 235, in read_col
          skip_nulls, selfmade=selfmade)
        File "/Users/twalker/anaconda/lib/python2.7/site-packages/fastparquet/core.py", line 99, in read_data_page
          raw_bytes = _read_page(f, header, metadata)
        File "/Users/twalker/anaconda/lib/python2.7/site-packages/fastparquet/core.py", line 31, in _read_page
          page_header.uncompressed_page_size)
      AssertionError: found 175532 raw bytes (expected 200026)
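
      The failing check, paraphrased as a sketch (names and wording are illustrative, reconstructed from the error message above rather than quoted from fastparquet's source): after a data page is read and decompressed, its length must equal the uncompressed size advertised in the page header.

      # Hedged reconstruction of the check that fails in _read_page; the
      # function name and signature here are illustrative, not fastparquet's.
      def check_page_size(raw_bytes, uncompressed_page_size):
          assert len(raw_bytes) == uncompressed_page_size, \
              'found %d raw bytes (expected %d)' % (
                  len(raw_bytes), uncompressed_page_size)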

      If the file is written with compression, fastparquet raises decompression errors on read instead:

      SNAPPY: snappy.UncompressError: Error while decompressing: invalid input
      
      GZIP: zlib.error: Error -3 while decompressing data: incorrect header check
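
      The compressed variants can be reproduced by changing only the compression argument of the write step in the repro script below; a minimal sketch, assuming the same arrow_table and constants as in write_arrow():

      # Same write as in write_arrow() below, with a codec swapped in.
      # compression='SNAPPY' yields the snappy.UncompressError on read;
      # compression='GZIP' yields the zlib.error.
      arrow_pq.write_table(arrow_table,
                           FILENAME,
                           use_dictionary=False,
                           compression='SNAPPY',  # or 'GZIP'
                           row_group_size=ROW_GROUP_SIZE)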

      Minimal code to reproduce:

      import os
      import pandas as pd
      import pyarrow
      import pyarrow.parquet as arrow_pq
      from fastparquet import ParquetFile
      
      # data to generate
      ROW_LENGTH = 40000  # decreasing below ~32750 eliminates the exception
      N_ROWS = 10
      
      # file write params
      ROW_GROUP_SIZE = 5  # lower values eliminate the exception, but strange data (e.g. Nones) is read back
      FILENAME = 'test.parquet'
      
      def write_arrow():
          df = pd.DataFrame({'A': ['A'*ROW_LENGTH for _ in range(N_ROWS)]})
          if os.path.isfile(FILENAME):
              os.remove(FILENAME)
          arrow_table = pyarrow.Table.from_pandas(df)
          arrow_pq.write_table(arrow_table,
                               FILENAME,
                               use_dictionary=False,
                               compression='NONE',
                               row_group_size=ROW_GROUP_SIZE)
      
      def read_arrow():
          print("arrow:")
          table2 = arrow_pq.read_table(FILENAME)
          print(table2.to_pandas().head())
      
      
      def read_fastparquet():
          print("fastparquet:")
          pf = ParquetFile(FILENAME)
          dff = pf.to_pandas(['A'])
          print(dff.head())
      
      
      write_arrow()
      read_arrow()
      read_fastparquet()
      
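      As a diagnostic, the row-group layout pyarrow actually wrote can be inspected and compared against the byte counts in the AssertionError; a minimal sketch (not part of the original report), assuming a pyarrow build that exposes the parquet metadata accessors:

      import pyarrow.parquet as arrow_pq

      meta = arrow_pq.ParquetFile(FILENAME).metadata
      print(meta)  # num_row_groups, num_rows, created_by, ...
      for i in range(meta.num_row_groups):
          # column chunk metadata for column 'A'
          col = meta.row_group(i).column(0)
          print(i, col.num_values,
                col.total_compressed_size,
                col.total_uncompressed_size)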

      Versions:

      fastparquet==0.1.6
      pyarrow==0.10.0
      pandas==0.22.0
      sys.version '2.7.15 |Anaconda custom (64-bit)| (default, May 1 2018, 18:37:05) \n[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]'

      Also opened an issue here: https://github.com/dask/fastparquet/issues/375

          People

            Assignee: Unassigned
            Reporter: Theo Walker (naroom)
            Votes: 1
            Watchers: 5
