Details
Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Description
After upgrading Avro to 1.9.2 on master, xz-compressed Avro files are neither readable nor writable by default.
For example, given the following Avro file, which is compressed with the xz codec,
$ java -jar avro-tools-1.11.1.jar getmeta /tmp/avro/weather.avro
avro.schema {"type":"record","name":"Weather","namespace":"test","doc":"A weather reading.","fields":[{"name":"station","type":"string"},{"name":"time","type":"long"},{"name":"temp","type":"int"}]}
avro.codec xz
reading that file fails on master as follows.
$ git status
On branch master
Your branch is ahead of 'origin/master' by 285 commits.
  (use "git push" to publish your local commits)
nothing to commit, working tree clean
$ vi gobblin-distribution/gobblin-flavor-standard.gradle # Remove the gobblin-elasticsearch and gobblin-example submodules. They can conflict with other modules on Jackson and Avro (via transitive dependency) respectively.
$ ./gradlew assemble
$ tar xf build/gobblin-distribution/distributions/apache-gobblin-incubating-bin-0.17.0.tar.gz -C /tmp
$ cd /tmp/gobblin-dist
$ cat /tmp/sample.job
source.class=org.apache.gobblin.source.extractor.hadoop.AvroFileSource
source.filebased.data.directory=/tmp/avro
extract.table.type=SNAPSHOT_ONLY
writer.builder.class=org.apache.gobblin.writer.ConsoleWriterBuilder
data.publisher.type=org.apache.gobblin.publisher.NoopPublisher
$ bin/gobblin cli run -jobName sample -jobFile /tmp/sample.job
...
2022-11-26 20:15:52 JST ERROR [TaskExecutor-0] org.apache.gobblin.runtime.Task - Task task_EmbeddedGobblin_1669461352066_0 failed
java.lang.NoClassDefFoundError: org/tukaani/xz/XZInputStream
        at org.apache.avro.file.XZCodec.decompress(XZCodec.java:74)
...
This issue doesn't occur in past releases.
$ curl -sLO https://downloads.apache.org/gobblin/apache-gobblin-0.16.0/apache-gobblin-incubating-sources-0.16.0.tgz
$ tar xf apache-gobblin-incubating-sources-0.16.0.tgz
$ cd apache-gobblin-incubating-sources-0.16.0
$ vi gobblin-distribution/gobblin-flavor-standard.gradle # Remove the gobblin-elasticsearch and gobblin-example submodules
$ curl -sL https://github.com/apache/gobblin/raw/master/gradle/wrapper/gradle-wrapper.jar -o gradle/wrapper/gradle-wrapper.jar
$ ./gradlew assemble
$ rm -rf /tmp/gobblin-dist
$ tar xf build/gobblin-distribution/distributions/apache-gobblin-incubating-bin-0.16.0.tar.gz -C /tmp
$ cd /tmp/gobblin-dist
$ bin/gobblin cli run -jobName sample -jobFile /tmp/sample.job
...
{"station": "011990-99999", "time": -619524000000, "temp": 0}
2022-11-26 20:21:33 JST INFO [ForkExecutor-0] org.apache.gobblin.writer.ConsoleWriter - {"station": "011990-99999", "time": -619524000000, "temp": 0}
{"station": "011990-99999", "time": -619506000000, "temp": 22}
2022-11-26 20:21:33 JST INFO [ForkExecutor-0] org.apache.gobblin.writer.ConsoleWriter - {"station": "011990-99999", "time": -619506000000, "temp": 22}
{"station": "011990-99999", "time": -619484400000, "temp": -11}
2022-11-26 20:21:33 JST INFO [ForkExecutor-0] org.apache.gobblin.writer.ConsoleWriter - {"station": "011990-99999", "time": -619484400000, "temp": -11}
{"station": "012650-99999", "time": -655531200000, "temp": 111}
2022-11-26 20:21:33 JST INFO [ForkExecutor-0] org.apache.gobblin.writer.ConsoleWriter - {"station": "012650-99999", "time": -655531200000, "temp": 111}
{"station": "012650-99999", "time": -655509600000, "temp": 78}
2022-11-26 20:21:33 JST INFO [ForkExecutor-0] org.apache.gobblin.writer.ConsoleWriter - {"station": "012650-99999", "time": -655509600000, "temp": 78}
This is because Avro 1.9.2 declares its xz dependency with "provided" scope, so the library is not pulled in transitively. This was fixed in the next Avro release, but as long as Gobblin uses Avro 1.9.2, it would be helpful for users if Gobblin included the xz dependency itself.
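A quick way to tell whether a given build is affected is to look for the xz jar among the jars bundled in the distribution. The following is only a sketch: the helper name has_jar is hypothetical, and it assumes that the unpacked distribution keeps its jars under /tmp/gobblin-dist/lib and that the xz artifact follows the usual xz-<version>.jar naming.

```shell
# Hypothetical helper: succeeds if the directory given as $1 contains
# a jar whose file name starts with the prefix given as $2.
has_jar() {
  for f in "$1"/"$2"*.jar; do
    # An unmatched glob stays literal, so test that the file really exists.
    [ -e "$f" ] && return 0
  done
  return 1
}

# Assumption: the distribution's jars live under /tmp/gobblin-dist/lib.
if has_jar /tmp/gobblin-dist/lib xz-; then
  echo "xz jar found"
else
  echo "xz jar missing"
fi
```

If the jar is reported missing, reading or writing xz-compressed Avro files with that build can be expected to fail with the NoClassDefFoundError shown above.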
In addition, upgrading Avro to 1.9.2 makes zstd compression available. This should be documented, since it is beneficial for users.