Beam / BEAM-8953

Extend ParquetIO.Read/ReadFiles.Builder to support Avro GenericData model

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.19.0
    • Component/s: io-java-parquet
    • Labels:
      None

      Description

      When using ParquetIO to read data that ends up in Scala case classes, we'd like to use a downstream converter that takes GenericRecords and converts them to instances of our case classes, rather than relying on ParquetIO to deserialize directly into the case class, which requires reflection and an implementation of the IndexedRecord interface.

      The ParquetIO.Read and ParquetIO.ReadFiles builders currently accept a filepattern plus a schema, and a schema alone, respectively. With only these arguments, the underlying AvroParquetReader created in ParquetIO.ReadFiles.ReadFn defaults to an AvroReadSupport instance whose GenericData model is set to SpecificData. We'd like the underlying AvroReadSupport to use the GenericData model instead, but the existing ParquetIO Read / ReadFiles builders offer no way to force this.

      I'd like to extend the ParquetIO Read / ReadFiles builders with a new method that lets users specify a GenericData model, which is then passed to the AvroParquetReader builder. I've tested and validated that this approach lets ParquetIO produce GenericRecord instances without requiring that users' classes can be reflectively instantiated and initialized via the IndexedRecord interface.
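
      The shape of the proposed change can be sketched without any Beam or Avro dependency. The example below is only an illustration under stated assumptions: `DataModel`, `Reader`, `Read`, `withDataModel`, `openReader`, and the `/data/*.parquet` path are hypothetical stand-ins, not the actual Beam, Parquet, or Avro classes. It shows an optional builder setting that is forwarded to the underlying reader builder only when the user supplies it, so the existing SpecificData default is preserved for everyone else.

```java
public class ReadBuilderSketch {

    /** Hypothetical stand-in for the Avro data models named in the issue. */
    enum DataModel { GENERIC, SPECIFIC }

    /** Hypothetical stand-in for the underlying Parquet reader. */
    static final class Reader {
        final String path;
        final DataModel model;

        private Reader(String path, DataModel model) {
            this.path = path;
            this.model = model;
        }

        static Builder builder(String path) {
            return new Builder(path);
        }

        static final class Builder {
            private final String path;
            // Mirrors the default described in the issue: SpecificData unless overridden.
            private DataModel model = DataModel.SPECIFIC;

            Builder(String path) {
                this.path = path;
            }

            Builder withDataModel(DataModel model) {
                this.model = model;
                return this;
            }

            Reader build() {
                return new Reader(path, model);
            }
        }
    }

    /** Hypothetical stand-in for the user-facing, immutable read configuration. */
    static final class Read {
        private final String filepattern;
        private final DataModel model; // null means "not set by the user"

        private Read(String filepattern, DataModel model) {
            this.filepattern = filepattern;
            this.model = model;
        }

        static Read from(String filepattern) {
            return new Read(filepattern, null);
        }

        /** The proposed addition: lets the caller choose the data model. */
        Read withDataModel(DataModel model) {
            return new Read(filepattern, model);
        }

        /** Forwards the model to the reader builder only when one was supplied. */
        Reader openReader() {
            Reader.Builder builder = Reader.builder(filepattern);
            if (model != null) {
                builder.withDataModel(model);
            }
            return builder.build();
        }
    }

    public static void main(String[] args) {
        Reader byDefault = Read.from("/data/*.parquet").openReader();
        Reader generic = Read.from("/data/*.parquet")
                .withDataModel(DataModel.GENERIC)
                .openReader();
        System.out.println(byDefault.model); // prints SPECIFIC
        System.out.println(generic.model);   // prints GENERIC
    }
}
```

      Keeping the new setting optional means existing pipelines that never call it behave exactly as before, which is presumably why an additive builder method suits this request better than changing the default.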

              People

              • Assignee:
                Ryan Berti
                Reporter:
                Ryan Berti
              • Votes:
                0
                Watchers:
                4

                Dates

                • Created:
                  Updated:
                  Resolved:

                  Time Tracking

                  Estimated:
                  Not Specified
                  Remaining:
                  0h
                  Logged:
                  6h 20m