Details
Type: New Feature
Status: Triage Needed
Priority: P1
Resolution: Fixed
None
Important
Description
Data engineers sometimes do not know the schema of a Parquet file when writing a pipeline, or different files may have different schemas. Reading Parquet files with ParquetIO requires providing an Avro (or equivalent) schema, but it is often not possible to know the schema of the Parquet files in advance.
AvroIO, on the other hand, supports reading files with an unknown schema by accepting a parse function: #parseGenericRecords(SerializableFunction<GenericRecord,T>)
Supporting this functionality in ParquetIO is simple and requires minimal changes to the ParquetIO surface.
Pipeline p = ...;
PCollection<String> filepatterns = p.apply(...);
PCollection<Foo> records =
    filepatterns
        .apply(FileIO.matchAll())
        .apply(FileIO.readMatches())
        .apply(ParquetIO.parseGenericRecords(new SerializableFunction<GenericRecord, Foo>() {
          public Foo apply(GenericRecord record) {
            // If needed, access the schema of the record using record.getSchema()
            return ...;
          }
        }));