Apache Arrow / ARROW-17058

Timezone aware parquet read with schema and filters

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 8.0.0
    • Fix Version/s: None
    • Component/s: Parquet, Python
    • Labels: None

    Description

      The parquet.read_table() function in pyarrow 8.0.0 added a `schema` parameter, which is useful for handling timestamps: they are correctly converted from UTC to the timezone specified in the schema.

      However, when `schema` is used together with `filters`, the read fails with the error "Cannot compare timestamp with timezone to timestamp without timezone". This was tested on two files created with different versions of Spark. The test code, the files, and the output are attached.
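
      The error message describes a mismatch between a timezone-aware and a timezone-naive timestamp. As a rough illustration of that kind of mismatch, the standard-library sketch below shows that plain Python refuses the same comparison, and how a filter value could be made timezone-aware before being passed in the list-of-tuples form that `filters` accepts. It does not call pyarrow and does not reproduce the attached script. The helper `as_aware` and the column name `ts` are hypothetical, and this report does not confirm that an aware filter value avoids the failure.

```python
from datetime import datetime, timezone


def as_aware(value: datetime, tz: timezone = timezone.utc) -> datetime:
    """Return `value` unchanged if it already carries a UTC offset, else attach `tz`.

    Hypothetical helper for illustration only; it is not part of pyarrow.
    """
    if value.tzinfo is not None and value.tzinfo.utcoffset(value) is not None:
        return value
    return value.replace(tzinfo=tz)


naive = datetime(2022, 1, 1, 12, 0)                       # no timezone attached
aware = datetime(2022, 1, 1, 12, 0, tzinfo=timezone.utc)  # explicit UTC

# Plain Python rejects ordering comparisons between naive and aware datetimes.
try:
    naive < aware
    mismatch_detected = False
except TypeError:
    mismatch_detected = True

# Filter in list-of-tuples form; "ts" is an assumed column name.
filters = [("ts", ">=", as_aware(naive))]
```

      When reproducing the issue, it may be worth noting whether the filter value was naive or aware, since the report does not state which was used.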

      Attachments

        1. output.txt
          2 kB
          Blaž Zupančič
        2. pyarrow_bug.py
          1 kB
          Blaž Zupančič
        3. spark_parquet.py
          0.8 kB
          Blaž Zupančič
        4. spark-3.1.parquet
          0.5 kB
          Blaž Zupančič
        5. spark-3.2.parquet
          0.5 kB
          Blaž Zupančič

          People

            Assignee: Unassigned
            Reporter: Blaž Zupančič (bzupancic)
