Details
- Type: Bug
- Status: Closed
- Priority: Critical
- Resolution: Fixed
- Affects Version: 1.4.0
- Fix Version: None
Description
Hello,
I have some problems when I try to read Parquet files produced by Drill with Spark: all dates are corrupted.
I think the problem comes from Drill.
cat /tmp/date_parquet.csv
Epoch,1970-01-01
0: jdbc:drill:zk=local> select columns[0] as name, cast(columns[1] as date) as epoch_date from dfs.tmp.`date_parquet.csv`;
+--------+-------------+
| name | epoch_date |
+--------+-------------+
| Epoch | 1970-01-01 |
+--------+-------------+
0: jdbc:drill:zk=local> create table dfs.tmp.`buggy_parquet` as select columns[0] as name, cast(columns[1] as date) as epoch_date from dfs.tmp.`date_parquet.csv`;
+-----------+----------------------------+
| Fragment  | Number of records written  |
+-----------+----------------------------+
| 0_0       | 1                          |
+-----------+----------------------------+
When I read the file with parquet-tools, I find:
java -jar parquet-tools-1.8.1.jar head /tmp/buggy_parquet/
name = Epoch
epoch_date = 4881176
According to https://github.com/Parquet/parquet-format/blob/master/LogicalTypes.md#date, epoch_date should be equal to 0.
Meta:
java -jar parquet-tools-1.8.1.jar meta /tmp/buggy_parquet/
file:         file:/tmp/buggy_parquet/0_0_0.parquet
creator:      parquet-mr version 1.8.1-drill-r0 (build 6b605a4ea05b66e1a6bf843353abcb4834a4ced8)
extra:        drill.version = 1.4.0

file schema:  root
--------------------------------------------------------------------------------
name:         OPTIONAL BINARY O:UTF8 R:0 D:1
epoch_date:   OPTIONAL INT32 O:DATE R:0 D:1

row group 1:  RC:1 TS:93 OFFSET:4
--------------------------------------------------------------------------------
name:         BINARY SNAPPY DO:0 FPO:4 SZ:52/50/0,96 VC:1 ENC:RLE,BIT_PACKED,PLAIN
epoch_date:   INT32 SNAPPY DO:0 FPO:56 SZ:45/43/0,96 VC:1 ENC:RLE,BIT_PACKED,PLAIN
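(Side note: 4,881,176 is exactly 2 x 2,440,588, and 2,440,588 is the Julian day number of the Unix epoch, 1970-01-01. This is consistent with the corrupt values being written with a constant shift of twice the Julian day of the epoch rather than as plain days since 1970-01-01, which is what the fix below corrects.)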
Implementation:
After the fix, Drill can automatically detect date corruption in Parquet files
and convert the values to the correct dates.
Because the detection relies on corrupt values decoding to dates far in the
future, an option is included to turn off the auto-correction for users who
genuinely work with dates beyond the year 5000.
Use of this option is expected to be extremely rare, but it is included for
completeness.
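For illustration only, here is a minimal sketch of such a correction (not Drill's actual code), assuming the corruption is a constant shift of 2 x 2,440,588 days and that any stored value decoding to a date after the year 5000 is treated as corrupt:

import java.time.LocalDate;

public class CorruptDateSketch {
    // The corrupt values observed above are shifted by exactly twice the
    // Julian day number of the Unix epoch (2,440,588 for 1970-01-01).
    private static final int CORRUPTION_SHIFT = 2 * 2_440_588;   // 4,881,176 days

    // Heuristic: any raw value decoding past 5000-01-01 is assumed corrupt.
    private static final int THRESHOLD_DAYS = (int) LocalDate.of(5000, 1, 1).toEpochDay();

    // Returns corrected days-since-epoch for a raw INT32 DATE value.
    static int correctIfCorrupt(int rawDays) {
        return rawDays > THRESHOLD_DAYS ? rawDays - CORRUPTION_SHIFT : rawDays;
    }

    public static void main(String[] args) {
        int raw = 4_881_176;   // value shown by parquet-tools above
        System.out.println(LocalDate.ofEpochDay(correctIfCorrupt(raw)));  // prints 1970-01-01
    }
}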
To disable auto-correction, use the parquet config in the storage plugin settings, something like this:
"formats": { "parquet": { "type": "parquet", "autoCorrectCorruptDates": false }
Or you can set the option per query, using the table function syntax, like this:
select l_shipdate, l_commitdate
from table(dfs.`/drill/testdata/parquet_date/dates_nodrillversion/drillgen2_lineitem`
           (type => 'parquet', autoCorrectCorruptDates => false))
limit 1;
Attachments
Issue Links
- is duplicated by
  - DRILL-4342 Drill fails to read a date column from hive generated parquet (Closed)
  - DRILL-4763 Parquet file with DATE logical type produces wrong results for simple SELECT (Closed)
- relates to
  - DRILL-4996 Parquet Date auto-correction is not working in auto-partitioned parquet files generated by drill-1.6 (Closed)
  - DRILL-4980 Upgrading of the approach of parquet date correctness status detection (Closed)