  Beam / BEAM-11460

Support reading Parquet files with unknown schema

Details

    • Type: New Feature
    • Status: Triage Needed
    • Priority: P1
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.28.0
    • Component/s: io-java-parquet
    • Labels: Important

    Description

      Data engineers often do not know the schema of a Parquet file at the time of writing the pipeline, or different files may carry different schemas. Reading Parquet files with ParquetIO currently requires providing an (equivalent) Avro schema, which in these cases is often not possible.

      AvroIO, on the other hand, supports reading files with an unknown schema by accepting a parse function: #parseGenericRecords(SerializableFunction<GenericRecord,T>)

      Supporting this functionality in ParquetIO is simple and requires minimal changes to the ParquetIO surface.

       
      Pipeline p = ...;

      PCollection<String> filepatterns = p.apply(...);

      PCollection<Foo> records =
          filepatterns
              .apply(FileIO.matchAll())
              .apply(FileIO.readMatches())
              .apply(ParquetIO.parseGenericRecords(new SerializableFunction<GenericRecord, Foo>() {
                  @Override
                  public Foo apply(GenericRecord record) {
                    // If needed, access the schema of the record using record.getSchema()
                    return ...;
                  }
              }));
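
      The idea behind the parse function can be sketched without Beam or Avro on the classpath. The example below is only an illustration under stated assumptions: SchemaCarryingRecord is a hypothetical stand-in for Avro's GenericRecord (a record that carries its own field names), and java.util.function.Function stands in for Beam's SerializableFunction. It shows a parse function that discovers the fields at run time instead of relying on a schema known when the pipeline is written.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical stand-in for Avro's GenericRecord: each record carries its own schema.
final class SchemaCarryingRecord {
    private final String schemaName;
    private final Map<String, Object> fields; // insertion order is the field order

    SchemaCarryingRecord(String schemaName, Map<String, Object> fields) {
        this.schemaName = schemaName;
        this.fields = new LinkedHashMap<>(fields);
    }

    String schemaName() { return schemaName; }

    List<String> fieldNames() { return new ArrayList<>(fields.keySet()); }

    Object get(String fieldName) { return fields.get(fieldName); }
}

final class UnknownSchemaParseSketch {
    // The parse function: it inspects the record's own schema at run time,
    // so no schema has to be supplied when the pipeline is written.
    static final Function<SchemaCarryingRecord, String> TO_LINE = record ->
        record.schemaName() + "{"
            + record.fieldNames().stream()
                .map(name -> name + "=" + record.get(name))
                .collect(Collectors.joining(", "))
            + "}";

    public static void main(String[] args) {
        // Two records with different schemas, as if read from different files.
        Map<String, Object> user = new LinkedHashMap<>();
        user.put("id", 7);
        user.put("name", "a");

        Map<String, Object> order = new LinkedHashMap<>();
        order.put("orderId", 42L);
        order.put("total", 9.5);
        order.put("paid", true);

        // The same function handles both without knowing either schema up front.
        System.out.println(TO_LINE.apply(new SchemaCarryingRecord("User", user)));
        System.out.println(TO_LINE.apply(new SchemaCarryingRecord("Order", order)));
    }
}
```

      In an actual pipeline the function body would read field names from record.getSchema(), as the comment in the snippet above suggests; the stand-in types here exist only to keep the sketch runnable on its own.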
      

          People

            Assignee: Anant Damle (anantdamle)
            Reporter: Anant Damle (anantdamle)
            Votes: 0
            Watchers: 2

              Time Tracking

                Estimated: 336h
                Remaining: 0h
                Logged: 2h 10m