[BEAM-11913] Add support for Hadoop configuration on ParquetIO - ASF JIRA

Details

Type: Improvement
Status: Triage Needed
Priority: P2
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 2.29.0
Component/s: io-java-parquet
Labels: None

Description

This is a common request from users. We did not do it in the past because we tried to avoid Hadoop objects in ParquetIO's public API. However, there are valid reasons to do it:

1. Many Parquet features are configured through public helper methods in Parquet that store settings inside Hadoop's Configuration object, e.g. column projection via `AvroReadSupport.setRequestedProjection(conf, projectionSchema);` or predicate filters via `ParquetInputFormat.setFilterPredicate(sc.hadoopConfiguration(), filterPredicate);`. Giving access to the Configuration would let power users enable these advanced features without any extra maintenance on the IO side.

2. The main reason to avoid the Hadoop Configuration object was to align with future Parquet APIs that do not require Hadoop (see PARQUET-1126 for details), but this does not seem likely to happen soon.
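
The two helper calls named in point 1 can be sketched as follows. This is a minimal illustration, not code from the pull request: the class name `ParquetReadConfigSketch`, the `User` record, and its `name` and `age` fields are hypothetical, and the snippet assumes `hadoop-common`, `avro`, `parquet-avro`, and `parquet-hadoop` are on the classpath.

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.hadoop.conf.Configuration;
import org.apache.parquet.avro.AvroReadSupport;
import org.apache.parquet.filter2.predicate.FilterApi;
import org.apache.parquet.filter2.predicate.FilterPredicate;
import org.apache.parquet.hadoop.ParquetInputFormat;

public class ParquetReadConfigSketch {

  // Builds a Hadoop Configuration carrying a column projection and a predicate filter.
  static Configuration buildReadConfiguration() {
    // 'false' skips loading core-default.xml / core-site.xml, so only our settings are present.
    Configuration conf = new Configuration(false);

    // Hypothetical projection: read only the "name" and "age" columns of a "User" record.
    Schema projectionSchema =
        SchemaBuilder.record("User")
            .fields()
            .requiredString("name")
            .requiredInt("age")
            .endRecord();
    AvroReadSupport.setRequestedProjection(conf, projectionSchema);

    // Hypothetical predicate: keep only rows where age > 18.
    FilterPredicate filterPredicate = FilterApi.gt(FilterApi.intColumn("age"), 18);
    ParquetInputFormat.setFilterPredicate(conf, filterPredicate);

    return conf;
  }

  public static void main(String[] args) {
    Configuration conf = buildReadConfiguration();
    // Both helpers store their settings as plain entries in the Configuration.
    System.out.println(conf.get(AvroReadSupport.AVRO_REQUESTED_PROJECTION));
    System.out.println(conf.get(ParquetInputFormat.FILTER_PREDICATE) != null);
  }
}
```

A Configuration prepared this way would then be handed to ParquetIO through the hook this issue asks for. The issue text does not name that method, so it is left out of the sketch.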

Issue Links

is related to

BEAM-10284 Allow to pass config to Parquet sink (Resolved)

BEAM-11527 Support user configurable Hadoop Configuration flags for ParquetIO (Triage Needed)

links to

GitHub Pull Request #14171

People

Assignee: Ismaël Mejía

Reporter: Ismaël Mejía

Votes: 0

Watchers: 2

Dates

Created: 02/Mar/21 15:46

Updated: 13/Apr/23 11:03

Resolved: 09/Mar/21 15:45

Time Tracking

Estimated: Not Specified

Remaining:

Logged: 1h 10m