Apache Arrow / ARROW-11480

[Python] Segmentation fault reading parquet with date filter with INT96 column


Details

    Description

      If I read a Parquet file (see attachment) with timestamps generated in Spark and apply a filter on a date column, I get a segmentation fault:
       

      import datetime
      import pyarrow.parquet as pq
      now = datetime.datetime.now()
      table = pq.read_table("timestamp.parquet", filters=[("date", "<=", now)])
      

       

      The attached Parquet file was generated with this code in Spark:

      import datetime
      import pandas as pd
      from pyspark.sql.types import StructType

      # Assumes an active SparkSession bound to the name `spark`.
      now = datetime.datetime.now()
      data = {"date": [now - datetime.timedelta(days=i) for i in range(100)]}
      schema = {"type": "struct", "fields": [{"name": "date", "type": "timestamp", "nullable": True, "metadata": {}}]}
      spf = spark.createDataFrame(pd.DataFrame(data), schema=StructType.fromJson(schema))
      spf.write.format("parquet").mode("overwrite").save("timestamp.parquet")
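
      For context on the INT96 column named in the title: Spark writes timestamp columns as legacy INT96 values by default. Each value is 12 bytes: 8 little-endian bytes holding the nanoseconds elapsed within the day, followed by 4 little-endian bytes holding the Julian day number. The sketch below illustrates that layout using only the standard library. The helper names are hypothetical and this is not how pyarrow implements the conversion; it only shows what the stored bytes mean.

```python
import datetime
import struct

# Julian day number corresponding to 1970-01-01 (the Unix epoch).
JULIAN_DAY_OF_UNIX_EPOCH = 2440588
UNIX_EPOCH = datetime.datetime(1970, 1, 1)


def decode_int96_timestamp(raw: bytes) -> datetime.datetime:
    """Decode a 12-byte INT96 timestamp into a naive datetime (hypothetical helper)."""
    if len(raw) != 12:
        raise ValueError("an INT96 timestamp is exactly 12 bytes")
    # Little-endian: int64 nanoseconds within the day, then int32 Julian day.
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    days = julian_day - JULIAN_DAY_OF_UNIX_EPOCH
    # datetime only has microsecond resolution, so sub-microsecond digits are dropped.
    return UNIX_EPOCH + datetime.timedelta(days=days, microseconds=nanos_of_day // 1000)


def encode_int96_timestamp(ts: datetime.datetime) -> bytes:
    """Encode a naive datetime as a 12-byte INT96 timestamp (hypothetical helper)."""
    delta = ts - UNIX_EPOCH
    nanos_of_day = (delta.seconds * 1_000_000 + delta.microseconds) * 1000
    return struct.pack("<qi", nanos_of_day, delta.days + JULIAN_DAY_OF_UNIX_EPOCH)
```

      Round-tripping a value through `encode_int96_timestamp` and `decode_int96_timestamp` returns the original datetime at microsecond precision.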
      

      If I downgrade pyarrow to 2.0.0, it works fine.

      Python version: 3.7.7

      pyarrow version: 3.0.0

      Attachments

        1. timestamp.parquet
          0.9 kB
          Henrik Anker Rasmussen


            People

              Assignee: Ben Kietzman (bkietz)
              Reporter: Henrik Anker Rasmussen (henrikrasmussen)
              Votes: 0
              Watchers: 5


                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 50m