Parquet / PARQUET-1100

[C++] Reading repeated types should decode number of records rather than number of values

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: cpp-1.2.0
    • Fix Version/s: cpp-1.3.0
    • Component/s: parquet-cpp
    • Labels: None

    Description

      Reading the attached Parquet file into a pandas DataFrame and then using the DataFrame causes a segmentation fault.

      Python 3.5.3 |Continuum Analytics, Inc.| (default, Mar  6 2017, 11:58:13) 
      [GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
      Type "help", "copyright", "credits" or "license" for more information.
      >>> 
      >>> import pyarrow
      >>> import pyarrow.parquet as pq
      >>> pyarrow.__version__
      '0.6.0'
      >>> import pandas as pd
      >>> pd.__version__
      '0.19.0'
      >>> df = pq.read_table('part-00000-6570e34b-b42c-4a39-8adf-21d3a97fb87d.snappy.parquet') \
      ...        .to_pandas()
      >>> len(df)
      69
      >>> df.info()
      <class 'pandas.core.frame.DataFrame'>
      RangeIndex: 69 entries, 0 to 68
      Data columns (total 6 columns):
      label               69 non-null int32
      account_meta        69 non-null object
      features_type       69 non-null int32
      features_size       69 non-null int32
      features_indices    1 non-null object
      features_values     1 non-null object
      dtypes: int32(3), object(3)
      memory usage: 2.5+ KB
      >>> 
      >>> pd.concat([df, df])
      Segmentation fault (core dumped)
      

      Actually, just print(df) is enough to trigger the segfault.
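
To illustrate the distinction named in the issue title, here is a minimal sketch of how the number of records in a repeated column can differ from the number of decoded entries. It assumes the standard Parquet (Dremel-style) rule that a repetition level of 0 starts a new record. The helper `count_records` and the sample levels are hypothetical and are not taken from the attached file or from the parquet-cpp fix.

```python
def count_records(rep_levels):
    """Count records in a column chunk from its repetition levels.

    In Parquet's Dremel-style encoding, a repetition level of 0 marks the
    first entry of a new record; any higher level continues the current one.
    """
    return sum(1 for level in rep_levels if level == 0)


# Hypothetical list<int32> column holding three records: [1, 2, 3], [], [4, 5].
# One level is stored per entry; the empty list still occupies one entry.
rep_levels = [0, 1, 1, 0, 0, 1]

num_levels = len(rep_levels)             # entries decoded from the page
num_records = count_records(rep_levels)  # rows the reader should return

print(num_levels, num_records)
```

A reader that stops after a fixed number of entries rather than a fixed number of records can return fewer rows for a repeated column than for the flat columns beside it, which is the kind of mismatch the title describes.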

            People

              Assignee: Wes McKinney (wesm)
              Reporter: Jarno Seppanen (jseppanen)