[HIVE-25827] Parquet file footer is read multiple times, when multiple splits are created in same file - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 4.0.0-alpha-2
Component/s: Iceberg integration
Labels:
- performance
- pull-request-available

Description

With large files, multiple splits can be created for the same file. In the current codebase, "ParquetRecordReaderBase" reads the file footer once for every split, so the same footer is fetched and parsed repeatedly.

This can be optimized so that the footer is read only once per file and reused by all of that file's splits.

https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/vector/VectorizedParquetRecordReader.java#L160

https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/ParquetRecordReaderBase.java#L91
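
A minimal sketch of the idea described above, not the change that was merged for this issue: memoize the parsed footer by file path so that every split of the same file reuses it. `FooterCache`, `footerFor`, and `footerReader` are hypothetical names used only for illustration; in Hive the loader function would wrap whatever call actually parses the Parquet footer.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/**
 * Hypothetical helper (not part of Hive): keeps one parsed footer per file
 * path so that record readers for different splits of the same file share it.
 *
 * @param <F> the parsed footer type (for Parquet, some footer metadata object)
 */
final class FooterCache<F> {

    // Thread-safe map from file path to the footer that was already read.
    private final Map<String, F> footersByPath = new ConcurrentHashMap<>();

    // Assumed loader: performs the expensive footer read for a given path.
    private final Function<String, F> footerReader;

    FooterCache(Function<String, F> footerReader) {
        this.footerReader = footerReader;
    }

    /** Returns the footer for filePath, invoking the loader only on first use. */
    F footerFor(String filePath) {
        return footersByPath.computeIfAbsent(filePath, footerReader);
    }
}
```

With this shape, each split's reader would call `footerFor(path)` instead of reading the footer itself. A production version would additionally need eviction and some form of invalidation (for example, keying on file length or modification time) in case a file is rewritten.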

Attachments

image-2021-12-21-03-19-38-577.png, 901 kB, 20/Dec/21 21:49, Rajesh Balamohan

Issue Links

relates to

HADOOP-18028 High performance S3A input stream with prefetching & caching (Open)

links to

GitHub Pull Request #3368

People

Assignee: Ádám Szita

Reporter: Rajesh Balamohan

Votes: 0

Watchers: 4

Dates

Created: 20/Dec/21 21:50

Updated: 16/Nov/22 13:50

Resolved: 22/Jun/22 09:08

Time Tracking

Estimated: Not Specified

Remaining:

Logged: 1.5h