  1. Apache Drill
  2. DRILL-4203

Parquet File : Date is stored wrongly

Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 1.4.0
    • Fix Version/s: 1.9.0
    • None

    Description

      Hello,

      I have a problem reading Parquet files produced by Drill with Spark: all dates are corrupted.

      I think the problem comes from Drill:

      cat /tmp/date_parquet.csv 
      Epoch,1970-01-01
      
      0: jdbc:drill:zk=local> select columns[0] as name, cast(columns[1] as date) as epoch_date from dfs.tmp.`date_parquet.csv`;
      +--------+-------------+
      |  name  | epoch_date  |
      +--------+-------------+
      | Epoch  | 1970-01-01  |
      +--------+-------------+
      
      0: jdbc:drill:zk=local> create table dfs.tmp.`buggy_parquet` as select columns[0] as name, cast(columns[1] as date) as epoch_date from dfs.tmp.`date_parquet.csv`;
      +-----------+----------------------------+
      | Fragment  | Number of records written  |
      +-----------+----------------------------+
      | 0_0       | 1                          |
      +-----------+----------------------------+
      

      When I read the file with parquet-tools, I found:

      java -jar parquet-tools-1.8.1.jar head /tmp/buggy_parquet/
      name = Epoch
      epoch_date = 4881176
      

      According to https://github.com/Parquet/parquet-format/blob/master/LogicalTypes.md#date, a DATE is stored as the number of days since the Unix epoch (1970-01-01), so epoch_date should be equal to 0.
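      The arithmetic behind the mismatch can be checked with a minimal Python sketch. It treats the stored INT32 as days since 1970-01-01, as the Parquet DATE type specifies. The observation that 4881176 equals exactly twice 2440588 (the Julian Day Number of 1970-01-01) is not stated in this report; it is only a hint at where the offset may come from. The helper name parquet_date_to_calendar is hypothetical.

```python
from datetime import date, timedelta

UNIX_EPOCH = date(1970, 1, 1)
# Julian Day Number of 1970-01-01 (a well-known constant, not taken from this report).
JULIAN_DAY_OF_UNIX_EPOCH = 2440588


def parquet_date_to_calendar(days_since_epoch: int) -> date:
    """Interpret a Parquet DATE value (INT32 days since the Unix epoch)."""
    return UNIX_EPOCH + timedelta(days=days_since_epoch)


expected = 0        # what the spec implies for 1970-01-01
stored = 4881176    # what parquet-tools printed for epoch_date

print(parquet_date_to_calendar(expected))        # the intended date
print(stored == 2 * JULIAN_DAY_OF_UNIX_EPOCH)    # offset is twice the Julian day number

try:
    parquet_date_to_calendar(stored)
except OverflowError:
    # The stored value lies far beyond year 9999, the largest year
    # Python's date type can represent.
    print("stored value is out of range for a calendar date")
```

      A reader that follows the spec therefore sees a date thousands of years in the future (or fails outright) instead of 1970-01-01.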

      Metadata:

      java -jar parquet-tools-1.8.1.jar meta /tmp/buggy_parquet/
      file:        file:/tmp/buggy_parquet/0_0_0.parquet 
      creator:     parquet-mr version 1.8.1-drill-r0 (build 6b605a4ea05b66e1a6bf843353abcb4834a4ced8) 
      extra:       drill.version = 1.4.0 
      
      file schema: root 
      --------------------------------------------------------------------------------
      name:        OPTIONAL BINARY O:UTF8 R:0 D:1
      epoch_date:  OPTIONAL INT32 O:DATE R:0 D:1
      
      row group 1: RC:1 TS:93 OFFSET:4 
      --------------------------------------------------------------------------------
      name:         BINARY SNAPPY DO:0 FPO:4 SZ:52/50/0,96 VC:1 ENC:RLE,BIT_PACKED,PLAIN
      epoch_date:   INT32 SNAPPY DO:0 FPO:56 SZ:45/43/0,96 VC:1 ENC:RLE,BIT_PACKED,PLAIN
      

      Implementation:

      After the fix, Drill automatically detects corrupted dates in Parquet files
      and converts them to the correct values.
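      As an illustration only, threshold-based correction could look like the Python sketch below. It is not Drill's actual code. The shift of 4881176 days is the value observed for 1970-01-01 in this report; the year-5000 cutoff is an assumption based on the 5,000-year figure in this description; the names correct_date, CORRUPT_SHIFT_DAYS and THRESHOLD_DAYS are hypothetical.

```python
from datetime import date

UNIX_EPOCH = date(1970, 1, 1)

# Offset observed in this report: 1970-01-01 was written as 4881176 instead of 0.
CORRUPT_SHIFT_DAYS = 4881176

# Assumed cutoff: any value later than 5000-01-01 is treated as corrupted.
THRESHOLD_DAYS = (date(5000, 1, 1) - UNIX_EPOCH).days


def correct_date(days: int, auto_correct: bool = True) -> int:
    """Return days since the Unix epoch, undoing the shift when it is detected."""
    if auto_correct and days > THRESHOLD_DAYS:
        return days - CORRUPT_SHIFT_DAYS
    return days


print(correct_date(4881176))                      # shifted value is corrected
print(correct_date(16000))                        # ordinary value is left alone
print(correct_date(4881176, auto_correct=False))  # correction switched off
```

      With a cutoff like this, a file that genuinely contains dates past the cutoff would be shifted by mistake, which is why a switch to disable the correction is useful.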

      Because a user may need to work with dates beyond 5,000 years,
      an option is included to turn off the auto-correction.
      Use of this option is expected to be extremely rare, but it is included for
      completeness.
      To disable auto-correction, set autoCorrectCorruptDates to false in the parquet format config of the storage plugin settings, for example:

        "formats": {
          "parquet": {
            "type": "parquet",
            "autoCorrectCorruptDates": false
          }
        }
      

      Alternatively, disable it for a single query by passing the option through a table function:

      select l_shipdate, l_commitdate from table(dfs.`/drill/testdata/parquet_date/dates_nodrillversion/drillgen2_lineitem` 
      (type => 'parquet', autoCorrectCorruptDates => false)) limit 1;
      

            People

              vitalii Vitalii Diravka
              stephanet Stéphane Trou
              Rahul Kumar Challapalli Rahul Kumar Challapalli