ARROW-7678 (Apache Arrow)

[C++][Parquet] setting TZ= in environment on Linux causes broken parquet

Details

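    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Invalid
    • Affects Version/s: 0.15.1
    • Fix Version/s: None
    • Component/s: C++
    • Labels: None
    • Environment: Linux, Ubuntu 18.04, arrow/parquet 0.15.1 installed per the instructions at https://arrow.apache.org/install/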

    Description

      When I set TZ=CST-8 (or any other time zone) in the environment on Linux, the time columns in my resulting Parquet file are corrupted.

       

      Below are the calls I use to define my schema:

       

      // Timestamp column: is_adjusted_to_utc = true, microsecond precision
      PrimitiveNode::Make( columnName, Repetition::REQUIRED,
       LogicalType::Timestamp( true, LogicalType::TimeUnit::MICROS, false, false ),
       ::parquet::Type::INT64 );
      // Time-of-day column: is_adjusted_to_utc = true, microsecond precision
      PrimitiveNode::Make( columnName,
       repetition,
       LogicalType::Time( true, LogicalType::TimeUnit::MICROS ),
       ::parquet::Type::INT64 );
      

      I use an Int64Writer for both types. When reading the file (in this case with pandas via pyarrow, but the same happens in C++), I get the following exception:

       File "pyarrow/_parquet.pyx", line 1136, in pyarrow._parquet.ParquetReader.read_all
       File "pyarrow/error.pxi", line 80, in pyarrow.lib.check_status
      pyarrow.lib.ArrowIOError: Couldn't deserialize thrift: TProtocolException: Invalid data
      Deserializing page header failed.

      It seems as if the column header must be defining a timestamp with a time zone, even though I set is_adjusted_to_utc manually.

       

       

          People

            Assignee: Unassigned
            Reporter: Joshua Pedrick (jpedrick)