[HUDI-733] presto query data error - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 0.5.1
Fix Version/s: 0.7.0
Component/s: trino-presto
Labels:
- sev:critical
- user-support-issues

Description

We found a column-ordering issue in Hudi when importing data through the API (reading a file into a DataFrame with spark.read.json("filename") and then writing it to Hudi). The original data is rowkey:1 dt:2 time:3.

Querying the data through Presto returns unexpected values (rowkey:2 dt:1 time:2), while Hive returns the correct values.

After analysis: when dt is used as the partition column, its data is also written into the parquet file. For a partition dt = xxx, the value of the partition column should come from the partition path of the Hudi table. However, I found that Presto maps the queried columns one-to-one, by position, onto the columns in the parquet file. It does not resolve them by column name.
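The difference between positional and name-based column resolution described above can be sketched in plain Python. This is only an illustration, not Presto or Hudi code: the column orders, the sample row, and the helper names `read_by_position` and `read_by_name` are assumptions made for the example.

```python
# Hypothetical illustration only: this models column resolution in plain
# Python. It is not Presto or Hudi code.

# Order in which the columns were physically written to the file (assumed).
file_columns = ["dt", "rowkey", "time"]
file_row = ["2", "1", "3"]

# Order in which the table schema declares the columns (assumed).
table_columns = ["rowkey", "dt", "time"]


def read_by_position(table_cols, row):
    """Pair the i-th table column with the i-th value stored in the file."""
    return dict(zip(table_cols, row))


def read_by_name(table_cols, file_cols, row):
    """Look each table column up by name in the file's own schema."""
    by_name = dict(zip(file_cols, row))
    return {col: by_name[col] for col in table_cols}


positional = read_by_position(table_columns, file_row)
named = read_by_name(table_columns, file_columns, file_row)

print("by position:", positional)  # rowkey and dt end up swapped
print("by name:    ", named)       # values match the original record
```

When the file order and the table order disagree, the positional reader silently returns values under the wrong column names, which matches the kind of symptom reported here.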

Possible fixes and suggestions:

1. Can the input format class skip reading the value of the partition column dt from the parquet file?
2. Can the table be synced to Hive without dt as the partition column? Consider adding a column such as repl_dt as the partition column and keeping dt as an ordinary field.
3. Do not write the dt column to the parquet file.
4. Write dt to the parquet file, but as the last column.
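The second and last suggestions both come down to small changes in column layout before the write. A minimal sketch in plain Python, assuming the write order shown in the comments; `move_column_last` and `add_partition_copy` are hypothetical helpers, not part of Hudi, Spark, or Presto.

```python
# Hypothetical helpers, not part of Hudi, Spark, or Presto.

def move_column_last(columns, name):
    """Return the column names with `name` moved to the end."""
    if name not in columns:
        raise ValueError(f"unknown column: {name}")
    return [c for c in columns if c != name] + [name]


def add_partition_copy(row, source="dt", target="repl_dt"):
    """Return a copy of `row` with `source` duplicated under `target`."""
    out = dict(row)
    out[target] = row[source]
    return out


columns = ["dt", "rowkey", "time"]  # assumed write order
ordered = move_column_last(columns, "dt")  # partition column goes last

row = {"rowkey": "1", "dt": "2", "time": "3"}
with_copy = add_partition_copy(row)  # repl_dt would serve as partition column
```

With a Spark DataFrame `df`, the reordered list could be applied with `df.select(*ordered)` and the copy added with `df.withColumn("repl_dt", df["dt"])`. Whether either change actually resolves the Presto behavior is not confirmed by this report.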

bhasudha

Attachments

- hive_table.png (458 kB, uploaded by jing, 25/Mar/20 06:13)
- parquet_context.png (426 kB, uploaded by jing, 25/Mar/20 06:13)
- parquet_schema.png (521 kB, uploaded by jing, 25/Mar/20 06:13)
- presto_query_data.png (282 kB, uploaded by jing, 25/Mar/20 06:13)

Issue Links

is depended upon by: HUDI-901 Bug Bash 0.6.0 Tracking Ticket (Resolved)

People

Assignee: Bhavani Sudha

Reporter: jing

Votes: 0

Watchers: 3

Dates

Created: 25/Mar/20 06:17

Updated: 14/May/21 15:30

Resolved: 14/May/21 15:30