Apache Gobblin / GOBBLIN-1749

Add dependency for handling xz-compressed Avro file


Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: gobblin-core
    • Labels: None

    Description

      After upgrading Avro to 1.9.2 on master, xz-compressed Avro files are neither readable nor writable by default.
      For example, given the following Avro file, which is compressed with the xz codec,

      $ java -jar avro-tools-1.11.1.jar getmeta /tmp/avro/weather.avro
      avro.schema	{"type":"record","name":"Weather","namespace":"test","doc":"A weather reading.","fields":[{"name":"station","type":"string"},{"name":"time","type":"long"},{"name":"temp","type":"int"}]}
      avro.codec	xz
      

      reading that file fails on master as follows.

      $ git status 
      On branch master
      Your branch is ahead of 'origin/master' by 285 commits.
        (use "git push" to publish your local commits)
      
      nothing to commit, working tree clean
      $ vi gobblin-distribution/gobblin-flavor-standard.gradle  # Remove the gobblin-elasticsearch and gobblin-example submodules. They can conflict with other modules on Jackson and Avro (via transitive dependency) respectively.
      $ ./gradlew assemble
      $ tar xf build/gobblin-distribution/distributions/apache-gobblin-incubating-bin-0.17.0.tar.gz -C /tmp
      $ cd /tmp/gobblin-dist
      $ cat /tmp/sample.job 
      source.class=org.apache.gobblin.source.extractor.hadoop.AvroFileSource
      source.filebased.data.directory=/tmp/avro
      extract.table.type=SNAPSHOT_ONLY
      writer.builder.class=org.apache.gobblin.writer.ConsoleWriterBuilder
      data.publisher.type=org.apache.gobblin.publisher.NoopPublisher
      $ bin/gobblin cli run -jobName sample -jobFile /tmp/sample.job 
      
      ...
      
      2022-11-26 20:15:52 JST ERROR [TaskExecutor-0] org.apache.gobblin.runtime.Task  - Task task_EmbeddedGobblin_1669461352066_0 failed
      java.lang.NoClassDefFoundError: org/tukaani/xz/XZInputStream
      	at org.apache.avro.file.XZCodec.decompress(XZCodec.java:74)
      
      ...
      

      This issue does not occur in past releases. For example, the same job succeeds on 0.16.0:

      $ curl -sLO https://downloads.apache.org/gobblin/apache-gobblin-0.16.0/apache-gobblin-incubating-sources-0.16.0.tgz
      $ tar xf apache-gobblin-incubating-sources-0.16.0.tgz 
      $ cd apache-gobblin-incubating-sources-0.16.0
      $ vi gobblin-distribution/gobblin-flavor-standard.gradle  # Remove the gobblin-elasticsearch and gobblin-example submodules
      $ curl -sL https://github.com/apache/gobblin/raw/master/gradle/wrapper/gradle-wrapper.jar -o gradle/wrapper/gradle-wrapper.jar
      $ ./gradlew assemble
      $ rm -rf /tmp/gobblin-dist
      $ tar xf build/gobblin-distribution/distributions/apache-gobblin-incubating-bin-0.16.0.tar.gz -C /tmp
      $ cd /tmp/gobblin-dist
      $ bin/gobblin cli run -jobName sample -jobFile /tmp/sample.job 
      
      ...
      
      {"station": "011990-99999", "time": -619524000000, "temp": 0}
      2022-11-26 20:21:33 JST INFO  [ForkExecutor-0] org.apache.gobblin.writer.ConsoleWriter  - {"station": "011990-99999", "time": -619524000000, "temp": 0}
      {"station": "011990-99999", "time": -619506000000, "temp": 22}
      2022-11-26 20:21:33 JST INFO  [ForkExecutor-0] org.apache.gobblin.writer.ConsoleWriter  - {"station": "011990-99999", "time": -619506000000, "temp": 22}
      {"station": "011990-99999", "time": -619484400000, "temp": -11}
      2022-11-26 20:21:33 JST INFO  [ForkExecutor-0] org.apache.gobblin.writer.ConsoleWriter  - {"station": "011990-99999", "time": -619484400000, "temp": -11}
      {"station": "012650-99999", "time": -655531200000, "temp": 111}
      2022-11-26 20:21:33 JST INFO  [ForkExecutor-0] org.apache.gobblin.writer.ConsoleWriter  - {"station": "012650-99999", "time": -655531200000, "temp": 111}
      {"station": "012650-99999", "time": -655509600000, "temp": 78}
      2022-11-26 20:21:33 JST INFO  [ForkExecutor-0] org.apache.gobblin.writer.ConsoleWriter  - {"station": "012650-99999", "time": -655509600000, "temp": 78}
      

      This is because Avro 1.9.2 declares its xz dependency with "provided" scope, so the xz library is not pulled in transitively. This was fixed in the next Avro release, but as long as Gobblin uses Avro 1.9.2, it would be helpful to users if Gobblin declared the xz dependency itself.
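
      A quick way to confirm whether a given distribution is affected is to probe the classpath for the class named in the stack trace. The following is only a minimal sketch: the class name AvroCodecClasspathCheck and its helper isOnClasspath are hypothetical names made up for this illustration and are not part of Gobblin or Avro. It assumes the program is run with the same classpath as the Gobblin job (for example, the jars under the distribution's lib directory).

```java
final class AvroCodecClasspathCheck {

    /**
     * Returns true if the named class can be located by this class's
     * class loader. The class is not initialized, only looked up.
     * (Hypothetical helper for illustration; not a Gobblin or Avro API.)
     */
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, AvroCodecClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Class taken from the NoClassDefFoundError reported in this issue.
        String xzClass = "org.tukaani.xz.XZInputStream";
        String state = isOnClasspath(xzClass) ? "present" : "MISSING";
        System.out.println(xzClass + " -> " + state);
    }
}
```

      If the class is reported as missing, adding the xz library to the job's classpath should let Avro's XZCodec load; which version to add is left open here.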

      In addition, upgrading Avro to 1.9.2 makes zstd compression available. This should be documented, since it is beneficial for users.

          People

            Assignee: abti Abhishek Tiwari
            Reporter: sekikn Kengo Seki

              Time Tracking

                Estimated: Not Specified
                Remaining: 0h
                Logged: 40m