Apache Arrow / ARROW-434

Segfaults and encoding issues in Python Parquet reads


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.2.0
    • Component/s: Python
    • Environment: Ubuntu, Python 3.5, installed pyarrow from conda-forge

    Description

      I've installed pyarrow with conda and am trying to read data from the parquet-compatibility project. I haven't explicitly built parquet-cpp and may or may not have old versions lying around, so please take this issue with a grain of salt:

      In [1]: import pyarrow.parquet
      
      In [2]: t = pyarrow.parquet.read_table('nation.plain.parquet')
      ---------------------------------------------------------------------------
      ArrowException                            Traceback (most recent call last)
      <ipython-input-2-5d966681a384> in <module>()
      ----> 1 t = pyarrow.parquet.read_table('nation.plain.parquet')
      
      /home/mrocklin/Software/anaconda/lib/python3.5/site-packages/pyarrow/parquet.pyx in pyarrow.parquet.read_table (/feedstock_root/build_artefacts/work/arrow-79344b335849c2eb43954b0751018051814019d6/python/build/temp.linux-x86_64-3.5/parquet.cxx:2783)()
      
      /home/mrocklin/Software/anaconda/lib/python3.5/site-packages/pyarrow/parquet.pyx in pyarrow.parquet.ParquetReader.read_all (/feedstock_root/build_artefacts/work/arrow-79344b335849c2eb43954b0751018051814019d6/python/build/temp.linux-x86_64-3.5/parquet.cxx:2200)()
      
      /home/mrocklin/Software/anaconda/lib/python3.5/site-packages/pyarrow/error.pyx in pyarrow.error.check_status (/feedstock_root/build_artefacts/work/arrow-79344b335849c2eb43954b0751018051814019d6/python/build/temp.linux-x86_64-3.5/error.cxx:1185)()
      
      ArrowException: NotImplemented: list<: uint8>
      

      Additionally, I tried to read data from a Python file-like object pointing to data on S3. Let me know if you'd prefer a separate issue for this.

      In [1]: import s3fs
      
      In [2]: fs = s3fs.S3FileSystem()
      
      In [3]: f = fs.open('dask-data/nyc-taxi/2015/parquet/part.0.parquet')
      
      In [4]: f.read(100)
      Out[4]: b'PAR1\x15\x00\x15\x90\xc4\xa2\x12\x15\x90\xc4\xa2\x12,\x15\xc2\xa8\xa4\x02\x15\x00\x15\x06\x15\x08\x00\x00\x00\x80\xbf\xe7\x8b\x0b\x05\x00\x00\x80\xbf\xe7\x8b\x0b\x05\x00\x00\x80\xbf\xe7\x8b\x0b\x05\x00@\xc2\xce\xe7\x8b\x0b\x05\x00\xc0F\xed\xe7\x8b\x0b\x05\x00\xc0F\xed\xe7\x8b\x0b\x05\x00\x00\x89\xfc\xe7\x8b\x0b\x05\x00@\xcb\x0b\xe8\x8b\x0b\x05\x00\x80\r\x1b\xe8\x8b\x0b'
      
      In [5]: import pyarrow.parquet
      
      In [6]: t = pyarrow.parquet.read_table(f)
      Segmentation fault (core dumped)
      

      Here is a more easily reproducible version of the segfault, using an in-memory file:

      In [1]: with open('nation.plain.parquet', 'rb') as f:
         ...:     data = f.read()
         ...:     
      
      In [2]: from io import BytesIO
      
      In [3]: f = BytesIO(data)
      
      In [4]: f.seek(0)
      Out[4]: 0
      
      In [5]: import pyarrow.parquet
      
      In [6]: t = pyarrow.parquet.read_table(f)
      Segmentation fault (core dumped)
      

      I was, however, pleased with the round-trip functionality within this project, which worked well.
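
      For context, the round trip referred to above looks roughly like the sketch below. This is only an illustration against the pyarrow.parquet API (Table.from_pandas, write_table, read_table); the example DataFrame is made up, and module layout may differ slightly in the conda-forge build used here.

      import pandas as pd
      import pyarrow as pa
      import pyarrow.parquet as pq

      # Made-up example data; any small DataFrame works.
      df = pd.DataFrame({'n_nationkey': [0, 1], 'n_name': ['ALGERIA', 'ARGENTINA']})

      table = pa.Table.from_pandas(df)            # pandas -> Arrow table
      pq.write_table(table, 'roundtrip.parquet')  # Arrow table -> Parquet file on disk
      df2 = pq.read_table('roundtrip.parquet').to_pandas()  # Parquet -> Arrow -> pandas
      assert df.equals(df2)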


People

    Assignee: Wes McKinney (wesm)
    Reporter: Matthew Rocklin (mrocklin)
