Details
- Type: Bug
- Status: Closed
- Priority: Critical
- Resolution: Fixed
- Affects Version: 1.4.0
- Fix Version: None
Description
Hello,
I have some problems when I try to read Parquet files produced by Drill with Spark: all dates are corrupted.
I think the problem comes from Drill.
cat /tmp/date_parquet.csv
Epoch,1970-01-01
0: jdbc:drill:zk=local> select columns[0] as name, cast(columns[1] as date) as epoch_date from dfs.tmp.`date_parquet.csv`;
+--------+-------------+
| name | epoch_date |
+--------+-------------+
| Epoch | 1970-01-01 |
+--------+-------------+
0: jdbc:drill:zk=local> create table dfs.tmp.`buggy_parquet` as select columns[0] as name, cast(columns[1] as date) as epoch_date from dfs.tmp.`date_parquet.csv`;
+-----------+----------------------------+
| Fragment  | Number of records written  |
+-----------+----------------------------+
| 0_0       | 1                          |
+-----------+----------------------------+
When I read the file with parquet-tools, I find:
java -jar parquet-tools-1.8.1.jar head /tmp/buggy_parquet/
name = Epoch
epoch_date = 4881176
According to https://github.com/Parquet/parquet-format/blob/master/LogicalTypes.md#date, epoch_date should be equal to 0.
Meta:
java -jar parquet-tools-1.8.1.jar meta /tmp/buggy_parquet/
file:         file:/tmp/buggy_parquet/0_0_0.parquet
creator:      parquet-mr version 1.8.1-drill-r0 (build 6b605a4ea05b66e1a6bf843353abcb4834a4ced8)
extra:        drill.version = 1.4.0

file schema:  root
--------------------------------------------------------------------------------
name:         OPTIONAL BINARY O:UTF8 R:0 D:1
epoch_date:   OPTIONAL INT32 O:DATE R:0 D:1

row group 1:  RC:1 TS:93 OFFSET:4
--------------------------------------------------------------------------------
name:         BINARY SNAPPY DO:0 FPO:4 SZ:52/50/0,96 VC:1 ENC:RLE,BIT_PACKED,PLAIN
epoch_date:   INT32 SNAPPY DO:0 FPO:56 SZ:45/43/0,96 VC:1 ENC:RLE,BIT_PACKED,PLAIN
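(Side note: 4,881,176 is exactly 2 x 2,440,588, and 2,440,588 is the Julian day number of the Unix epoch, 1970-01-01. This is consistent with the corrupt values being written with a constant shift of twice the Julian day of the epoch rather than as plain days since 1970-01-01, which is what the fix below corrects.)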
Implementation:
After the fix, Drill can automatically detect date corruption in Parquet files
and convert the values to the correct dates.
Because the detection relies on corrupt values decoding to dates far in the
future, an option is included to turn off the auto-correction for users who
genuinely work with dates beyond the year 5000.
Use of this option is expected to be extremely rare, but it is included for
completeness.
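For illustration only, here is a minimal sketch of such a correction (not Drill's actual code), assuming the corruption is a constant shift of 2 x 2,440,588 days and that any stored value decoding to a date after the year 5000 is treated as corrupt:

import java.time.LocalDate;

public class CorruptDateSketch {
    // The corrupt values observed above are shifted by exactly twice the
    // Julian day number of the Unix epoch (2,440,588 for 1970-01-01).
    private static final int CORRUPTION_SHIFT = 2 * 2_440_588;   // 4,881,176 days

    // Heuristic: any raw value decoding past 5000-01-01 is assumed corrupt.
    private static final int THRESHOLD_DAYS = (int) LocalDate.of(5000, 1, 1).toEpochDay();

    // Returns corrected days-since-epoch for a raw INT32 DATE value.
    static int correctIfCorrupt(int rawDays) {
        return rawDays > THRESHOLD_DAYS ? rawDays - CORRUPTION_SHIFT : rawDays;
    }

    public static void main(String[] args) {
        int raw = 4_881_176;   // value shown by parquet-tools above
        System.out.println(LocalDate.ofEpochDay(correctIfCorrupt(raw)));  // prints 1970-01-01
    }
}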
To disable auto-correction, use the parquet config in the storage plugin settings, something like this:
"formats": { "parquet": { "type": "parquet", "autoCorrectCorruptDates": false }
Or you can set the option per query, using the table function syntax, like this:
select l_shipdate, l_commitdate
from table(dfs.`/drill/testdata/parquet_date/dates_nodrillversion/drillgen2_lineitem`
           (type => 'parquet', autoCorrectCorruptDates => false))
limit 1;
Attachments
Issue Links
- is duplicated by
  - DRILL-4342 Drill fails to read a date column from hive generated parquet (Closed)
  - DRILL-4763 Parquet file with DATE logical type produces wrong results for simple SELECT (Closed)
- relates to
  - DRILL-4996 Parquet Date auto-correction is not working in auto-partitioned parquet files generated by drill-1.6 (Closed)
  - DRILL-4980 Upgrading of the approach of parquet date correctness status detection (Closed)