Apache Hudi / HUDI-5323

Decouple virtual keys from writing bloom filters to parquet files

Details

    • Type: Improvement
    • Status: Patch Available
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: index, writer-core

    Description

      When the virtual key feature is enabled by setting hoodie.populate.meta.fields to false, bloom filters are not written to the parquet base files during write transactions. The cause is that the populateMetaFields flag is also passed as the enableBloomFilter argument. The relevant logic is in the HoodieFileWriterFactory class:

      private static <T extends HoodieRecordPayload, R extends IndexedRecord> HoodieFileWriter<R> newParquetFileWriter(
          String instantTime, Path path, HoodieWriteConfig config, Schema schema, HoodieTable hoodieTable,
          TaskContextSupplier taskContextSupplier, boolean populateMetaFields) throws IOException {
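        // populateMetaFields is passed twice: the second occurrence becomes enableBloomFilter,
        // so disabling meta fields (virtual keys) also disables the bloom filter.
        return newParquetFileWriter(instantTime, path, config, schema, hoodieTable.getHadoopConf(),
            taskContextSupplier, populateMetaFields, populateMetaFields);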
      }
      
      private static <T extends HoodieRecordPayload, R extends IndexedRecord> HoodieFileWriter<R> newParquetFileWriter(
          String instantTime, Path path, HoodieWriteConfig config, Schema schema, Configuration conf,
          TaskContextSupplier taskContextSupplier, boolean populateMetaFields, boolean enableBloomFilter) throws IOException {
        Option<BloomFilter> filter = enableBloomFilter ? Option.of(createBloomFilter(config)) : Option.empty();
        HoodieAvroWriteSupport writeSupport = new HoodieAvroWriteSupport(new AvroSchemaConverter(conf).convert(schema), schema, filter);
      
        HoodieParquetConfig<HoodieAvroWriteSupport> parquetConfig = new HoodieParquetConfig<>(writeSupport, config.getParquetCompressionCodec(),
            config.getParquetBlockSize(), config.getParquetPageSize(), config.getParquetMaxFileSize(),
            conf, config.getParquetCompressionRatio(), config.parquetDictionaryEnabled());
      
        return new HoodieAvroParquetWriter<>(path, parquetConfig, instantTime, taskContextSupplier, populateMetaFields);
      } 

      Because the bloom filters are absent, a writer that uses the Bloom Index on the same table encounters a NullPointerException (HUDI-5319).
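
The combination described above can be sketched as a set of writer properties. This is only an illustration: `hoodie.populate.meta.fields` comes from the description, `hoodie.index.type=BLOOM` is assumed here to be how the Bloom Index is selected, and the class and method names are hypothetical.

```java
import java.util.Properties;

// Hypothetical helper, not part of Hudi: collects the two settings that
// together lead to the situation described in this issue.
public class VirtualKeyBloomIndexProps {

    public static Properties build() {
        Properties props = new Properties();
        // Enables the virtual key feature (meta fields are not populated).
        props.setProperty("hoodie.populate.meta.fields", "false");
        // Assumption: selects the Bloom Index for the same table.
        props.setProperty("hoodie.index.type", "BLOOM");
        return props;
    }

    public static void main(String[] args) {
        // Print the properties; iteration order is not guaranteed.
        build().forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

With these settings the base files are written without bloom filters, while the index later expects to read them.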

      We should decouple the virtual key feature from bloom filter writing and always write the bloom filters to the parquet base files.
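
The difference between the current and the proposed behavior can be sketched in isolation. This is a minimal model under stated assumptions, not the actual patch: `StubBloomFilter` is a hypothetical stand-in for Hudi's BloomFilter, `java.util.Optional` replaces Hudi's `Option`, and both method names are invented for illustration.

```java
import java.util.Optional;

// Self-contained model of the flag handling; does not depend on Hudi.
public class BloomFilterDecouplingSketch {

    /** Hypothetical stand-in for Hudi's BloomFilter. */
    public static final class StubBloomFilter {}

    /** Current behavior: enableBloomFilter is tied to populateMetaFields. */
    public static Optional<StubBloomFilter> coupledFilter(boolean populateMetaFields) {
        boolean enableBloomFilter = populateMetaFields;
        return enableBloomFilter ? Optional.of(new StubBloomFilter()) : Optional.empty();
    }

    /** Proposed behavior: the filter is always created, whatever populateMetaFields is. */
    public static Optional<StubBloomFilter> decoupledFilter(boolean populateMetaFields) {
        return Optional.of(new StubBloomFilter());
    }

    public static void main(String[] args) {
        // Virtual keys enabled means populateMetaFields == false.
        System.out.println("coupled:   " + coupledFilter(false).isPresent());
        System.out.println("decoupled: " + decoupledFilter(false).isPresent());
    }
}
```

In the coupled variant, enabling virtual keys yields an empty filter; in the decoupled variant a filter is present in both cases, which is what the Bloom Index needs.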

            People

              guoyihua Ethan Guo
              guoyihua Ethan Guo
              Alexey Kudinkin