  Apache Avro / AVRO-1339

AvroSequenceFile is always uncompressed

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: java
    • Labels: None

    Description

      It appears that AvroSequenceFile does not pass the compression type or codec down to SequenceFile.Writer. AvroSequenceFile.Writer calls SequenceFile.Writer's public constructor directly instead of using one of the SequenceFile.createWriter factory methods, so the compression settings on the options are never applied and the file is always written uncompressed.

      https://github.com/apache/avro/blob/trunk/lang/java/mapred/src/main/java/org/apache/avro/hadoop/io/AvroSequenceFile.java#L532
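
      The difference between the two call paths can be illustrated with a toy model. This is only a sketch: the class names mirror those nested in Hadoop's SequenceFile, but every type here is a hypothetical stand-in and no Hadoop or Avro code is involved. It assumes, as the description above implies, that the factory method is what selects a compressing writer.

```java
/** Toy model (not Hadoop code) of picking a writer via a factory versus a direct constructor call. */
final class WriterSelectionModel {

    enum CompressionType { NONE, RECORD, BLOCK }

    /** Stand-in for the base writer, which writes uncompressed records. */
    static class Writer {
        String mode() { return "uncompressed"; }
    }

    /** Stand-in for a writer that compresses each record. */
    static class RecordCompressWriter extends Writer {
        @Override String mode() { return "record-compressed"; }
    }

    /** Stand-in for a writer that compresses blocks of records. */
    static class BlockCompressWriter extends Writer {
        @Override String mode() { return "block-compressed"; }
    }

    /** The factory is the only place where the requested compression type is consulted. */
    static Writer createWriter(CompressionType type) {
        switch (type) {
            case RECORD: return new RecordCompressWriter();
            case BLOCK:  return new BlockCompressWriter();
            default:     return new Writer();
        }
    }

    public static void main(String[] args) {
        CompressionType requested = CompressionType.BLOCK;
        // Calling the base constructor directly ignores the requested type.
        System.out.println("direct constructor: " + new Writer().mode());
        // Going through the factory honours it.
        System.out.println("factory method:     " + createWriter(requested).mode());
    }
}
```

      In this model, any caller that constructs Writer directly gets an uncompressed writer regardless of what was requested, which matches the behaviour reported here.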

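      Here is the workaround code I came up with. It builds the Options as usual, then bypasses AvroSequenceFile.Writer and calls SequenceFile.createWriter directly: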

      // Imports assumed by this fragment:
      //   org.apache.avro.hadoop.io.AvroSequenceFile
      //   org.apache.hadoop.io.SequenceFile
      //   org.apache.hadoop.io.SequenceFile.Metadata
      //   org.apache.hadoop.io.Text
      //   org.apache.hadoop.util.Progressable
      // hdfsInfo and configuration are defined elsewhere by the caller.
      AvroSequenceFile.Writer.Options options = new AvroSequenceFile.Writer.Options()
        .withConfiguration(hdfsInfo.getConf())
        .withFileSystem(hdfsInfo.getFileSystem())
        .withOutputPath(hdfsInfo.getPath())
        .withCompressionType(configuration.getCompressionType())
        .withCompressionCodec(configuration.getCompressionCodec().getCodec())
        .withProgressable(new Progressable() {
            @Override
            public void progress() {
                // Intentionally a no-op.
            }
        })
        .withKeySchema(configuration.getKeySchema())
        .withValueSchema(configuration.getValueSchema());

      // The schemas have to be copied into the metadata here because the code
      // that normally does this is hidden in a private method :(
      Metadata metadata = options.getMetadata();
      if (null != configuration.getKeySchema()) {
        metadata.set(AvroSequenceFile.METADATA_FIELD_KEY_SCHEMA, new Text(configuration.getKeySchema().toString()));
      }
      if (null != configuration.getValueSchema()) {
        metadata.set(AvroSequenceFile.METADATA_FIELD_VALUE_SCHEMA, new Text(configuration.getValueSchema().toString()));
      }

      // Use the factory method so that the compression type and codec are honoured.
      return SequenceFile.createWriter(
          options.getFileSystem(),
          options.getConfigurationWithAvroSerialization(),
          options.getOutputPath(),
          options.getKeyClass(),
          options.getValueClass(),
          options.getBufferSizeBytes(),
          options.getReplicationFactor(),
          options.getBlockSizeBytes(),
          options.getCompressionType(),
          options.getCompressionCodec(),
          options.getProgressable(),
          metadata);

      I used this code to write a BZip2 block-compressed sequence file, and was able to read it back with the Avro mapreduce classes without any problems.
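
      One way to confirm whether a written file is actually compressed is to inspect its header. The following is a minimal sketch that uses only the JDK. It assumes the SequenceFile header layout used by format versions 5 and 6 (magic bytes "SEQ", a version byte, the key and value class names, two compression flags, and the codec class name when compressed), and it expects a local copy of the file. SequenceFileHeaderPeek and its members are hypothetical names introduced for illustration.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

/** Reads just enough of a SequenceFile header to report its compression settings. */
final class SequenceFileHeaderPeek {

    static final class Header {
        final int version;
        final String keyClass;
        final String valueClass;
        final boolean compressed;
        final boolean blockCompressed;
        final String codecClass; // null when the file is not compressed

        Header(int version, String keyClass, String valueClass,
               boolean compressed, boolean blockCompressed, String codecClass) {
            this.version = version;
            this.keyClass = keyClass;
            this.valueClass = valueClass;
            this.compressed = compressed;
            this.blockCompressed = blockCompressed;
            this.codecClass = codecClass;
        }
    }

    static Header read(InputStream raw) throws IOException {
        DataInputStream in = new DataInputStream(raw);
        byte[] magic = new byte[3];
        in.readFully(magic);
        if (magic[0] != 'S' || magic[1] != 'E' || magic[2] != 'Q') {
            throw new IOException("Not a SequenceFile: bad magic bytes");
        }
        int version = in.readByte();
        if (version < 5) {
            // Older versions lay out the header differently; this sketch does not handle them.
            throw new IOException("Unsupported SequenceFile version: " + version);
        }
        String keyClass = readText(in);
        String valueClass = readText(in);
        boolean compressed = in.readBoolean();
        boolean blockCompressed = in.readBoolean();
        String codecClass = compressed ? readText(in) : null;
        return new Header(version, keyClass, valueClass, compressed, blockCompressed, codecClass);
    }

    /** Reads a string stored as a variable-length integer byte count followed by UTF-8 bytes. */
    static String readText(DataInputStream in) throws IOException {
        long length = readVLong(in);
        if (length < 0 || length > Integer.MAX_VALUE) {
            throw new IOException("Invalid string length: " + length);
        }
        byte[] bytes = new byte[(int) length];
        in.readFully(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    /** Decodes Hadoop's variable-length integer encoding. */
    static long readVLong(DataInputStream in) throws IOException {
        byte first = in.readByte();
        if (first >= -112) {
            return first; // small values are stored in a single byte
        }
        boolean negative = first < -120;
        int extraBytes = negative ? -120 - first : -112 - first;
        long value = 0;
        for (int i = 0; i < extraBytes; i++) {
            value = (value << 8) | (in.readByte() & 0xFF);
        }
        return negative ? ~value : value;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: SequenceFileHeaderPeek <local-sequence-file>");
            return;
        }
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            Header h = read(in);
            System.out.println("compressed=" + h.compressed
                + " blockCompressed=" + h.blockCompressed
                + " codec=" + h.codecClass);
        }
    }
}
```

      If the layout assumption holds, a file written through AvroSequenceFile.Writer would be expected to report compressed=false, while a file written with the workaround above would report the requested codec.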

          People

            Assignee: Unassigned
            Reporter: David Arthur (mumrah)
            Votes: 0
            Watchers: 1