Beam / BEAM-11460

Support reading Parquet files with unknown schema

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: P1
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.28.0
    • Component/s: io-java-parquet
    • Labels:
    • Flags: Important

      Description

      Data engineers often face cases where the schema of a Parquet file is unknown when the pipeline is written, or where different files carry different schemas. Reading Parquet files with ParquetIO requires providing an Avro (or equivalent) schema, but it is often not possible to know the schema of the Parquet files in advance.

      AvroIO, on the other hand, supports reading files with an unknown schema through a parse function: AvroIO#parseGenericRecords(SerializableFunction<GenericRecord, T>)

      Supporting this functionality in ParquetIO is simple and requires minimal changes to the ParquetIO surface.

       
      Pipeline p = ...;

      PCollection<String> filepatterns = p.apply(...);

      PCollection<Foo> records =
          filepatterns
              .apply(FileIO.matchAll())
              .apply(FileIO.readMatches())
              .apply(ParquetIO.parseGenericRecords(new SerializableFunction<GenericRecord, Foo>() {
                  @Override
                  public Foo apply(GenericRecord record) {
                    // If needed, access the schema of the record using record.getSchema()
                    return ...;
                  }
              }));
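
      The idea behind such a parse function can be illustrated without Beam or Avro on the classpath: the function receives a self-describing record and discovers its fields at runtime, so no schema has to be supplied up front. The sketch below is only an illustration under that assumption. Row and TO_KEY_VALUE_LINE are hypothetical stand-ins, not part of ParquetIO; with the real API the input type would be GenericRecord and the field list would come from record.getSchema().

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Illustrative sketch only: Row is a hypothetical stand-in for Avro's GenericRecord.
class SchemaAgnosticParseSketch {

  /** A self-describing record: it carries its own field names, in schema order. */
  static final class Row {
    private final Map<String, Object> fields;

    Row(Map<String, Object> fields) {
      // Copy into a LinkedHashMap so the iteration order of the input is preserved.
      this.fields = new LinkedHashMap<>(fields);
    }

    Iterable<String> fieldNames() {
      return fields.keySet();
    }

    Object get(String name) {
      return fields.get(name);
    }
  }

  /** Parse function that reads whatever fields the record declares at runtime. */
  static final Function<Row, String> TO_KEY_VALUE_LINE = row -> {
    StringBuilder sb = new StringBuilder();
    for (String name : row.fieldNames()) {
      if (sb.length() > 0) {
        sb.append(',');
      }
      sb.append(name).append('=').append(row.get(name));
    }
    return sb.toString();
  };

  public static void main(String[] args) {
    // Two records with different "schemas" pass through the same function.
    Map<String, Object> user = new LinkedHashMap<>();
    user.put("id", 1);
    user.put("name", "x");

    Map<String, Object> event = new LinkedHashMap<>();
    event.put("ts", 42L);

    System.out.println(TO_KEY_VALUE_LINE.apply(new Row(user)));  // id=1,name=x
    System.out.println(TO_KEY_VALUE_LINE.apply(new Row(event))); // ts=42
  }
}
```

      Because the function inspects each record individually, files with different schemas can flow through the same transform, which is the situation described above.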
      

            People

            • Assignee: Anant Damle (anantdamle)
            • Reporter: Anant Damle (anantdamle)
            • Votes: 0
            • Watchers: 2

                Time Tracking

                Estimated: 336h
                Remaining: 0h
                Logged: 2h 10m