Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Affects Version: 0.5.1
Description
We found a data-ordering issue in Hudi when importing data via the API (using spark.read.json("filename") to read into a DataFrame, then writing to Hudi). The original record is rowkey:1, dt:2, time:3.
However, querying the data through Presto returns unexpected values (rowkey:2, dt:1, time:2), while Hive returns them correctly.
After analysis: when dt is used as the partition column, it is also written into the parquet file as a data column (dt = xxx), and the partition column's value is supposed to come from the Hudi partition path. Presto, however, matches the query's columns to the parquet file's columns one-to-one by position; it does not resolve them by name, so the values shift.
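The positional-matching behaviour described above can be illustrated with a small standalone sketch. This is plain Python, not Hudi or Presto code; the column names follow the report, and the physical parquet column order is an assumption for illustration:

```python
# Column order as declared in the query/metastore schema.
query_schema = ["rowkey", "dt", "time"]

# Assumed physical column order in the parquet file, where the
# partition column dt ended up in a different position.
parquet_columns = ["dt", "rowkey", "time"]
parquet_row = {"rowkey": "1", "dt": "2", "time": "3"}

def read_by_name(schema, row):
    # Hive-style resolution: each requested column is looked up by name,
    # so the physical column order does not matter.
    return {c: row[c] for c in schema}

def read_by_position(schema, columns, row):
    # Presto-style positional matching (as described in the report):
    # the i-th requested column receives the i-th physical column's
    # value, regardless of the column names.
    return {schema[i]: row[columns[i]] for i in range(len(schema))}

print(read_by_name(query_schema, parquet_row))
print(read_by_position(query_schema, parquet_columns, parquet_row))
```

With name-based resolution the record comes back as rowkey:1, dt:2, time:3; with positional matching the rowkey and dt values swap, which is the class of mismatch the report describes.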
Possible workarounds and suggestions:
1. Could the InputFormat class be made to skip reading the partition column dt from the parquet file and take its value from the partition path instead?
2. Could Hive data be synchronized without dt as the partition column? Consider adding a derived column such as repl_dt as the partition column and keeping dt as an ordinary field.
3. Do not write the dt column into the parquet file at all.
4. Write dt into the parquet file, but as the last column.
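Suggestion 2 could be sketched with the Hudi datasource write/sync options below. This is a hypothetical configuration fragment, assuming repl_dt has already been derived from dt in the DataFrame before writing; the option keys are standard Hudi datasource options, the field values follow the report:

```properties
# Record key stays on the original key column.
hoodie.datasource.write.recordkey.field=rowkey
# Partition on the derived column instead of dt,
# leaving dt as an ordinary data column in parquet.
hoodie.datasource.write.partitionpath.field=repl_dt
# Sync the same derived column as the Hive partition field.
hoodie.datasource.hive_sync.partition_fields=repl_dt
```

With this layout, dt is no longer a partition column, so Presto's positional column matching no longer shifts its value.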
Attachments
Issue Links
- is depended upon by HUDI-901 Bug Bash 0.6.0 Tracking Ticket (Resolved)