  1. Beam
  2. BEAM-11913

Add support for Hadoop configuration on ParquetIO

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: P2
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.29.0
    • Component/s: io-java-parquet
    • Labels:
      None

      Description

      Users commonly request this. We did not do it in the past because we tried to keep Hadoop objects out of ParquetIO's public API. However, there are valid reasons to do it:

      1. Many Parquet features are configured through public helper methods in Parquet that write settings into Hadoop's Configuration object, e.g. column projection via `AvroReadSupport.setRequestedProjection(conf, projectionSchema);` or predicate filters via `ParquetInputFormat.setFilterPredicate(sc.hadoopConfiguration(), filterPredicate);`. Exposing the Configuration would let power users apply these advanced features without any extra maintenance on the IO side.

      2. The main reason to avoid the Hadoop Configuration object was to align with future Parquet APIs that do not require Hadoop (see PARQUET-1126 for details), but this does not seem likely to happen soon.
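
      To illustrate point 1, here is a minimal sketch of the mechanism, using only the JDK. It is an assumption-laden stand-in, not Parquet or Beam code: a plain `Map<String, String>` plays the role of Hadoop's Configuration, and `ConfigurationSketch`, `setRequestedProjection`, `requestedColumns`, and the key `example.read.projection` are hypothetical names invented for this example. The point is that a helper only writes a setting into the configuration and the reader later picks it up, so an IO that forwards the configuration untouched does not need to know about each feature.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a Map stands in for Hadoop's Configuration object.
class ConfigurationSketch {
    // Hypothetical key, invented for this example.
    static final String PROJECTION_KEY = "example.read.projection";

    // Stand-in for a helper in the style of
    // AvroReadSupport.setRequestedProjection(conf, projectionSchema):
    // it stores the requested columns as a plain string in the configuration.
    public static void setRequestedProjection(Map<String, String> conf, List<String> columns) {
        conf.put(PROJECTION_KEY, String.join(",", columns));
    }

    // Stand-in for the reader side: it reads the setting back and falls
    // back to all columns when no projection was configured.
    public static List<String> requestedColumns(Map<String, String> conf, List<String> allColumns) {
        String value = conf.get(PROJECTION_KEY);
        return value == null ? allColumns : Arrays.asList(value.split(","));
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        List<String> allColumns = Arrays.asList("id", "name", "payload");

        // Without a projection the reader sees every column.
        System.out.println(requestedColumns(conf, allColumns));

        // The user configures the projection; the IO only passes conf along.
        setRequestedProjection(conf, Arrays.asList("id", "name"));
        System.out.println(requestedColumns(conf, allColumns));
    }
}
```

      In real code the helpers named in point 1 would be called on an actual Hadoop Configuration before handing it to the IO; how ParquetIO accepts that configuration is not shown here.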

              People

              • Assignee:
                iemejia Ismaël Mejía
                Reporter:
                iemejia Ismaël Mejía
              • Votes:
                0
                Watchers:
                2

                  Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 1h 10m