[BEAM-2373] AvroSource: Premature end-of-stream exception in SnappyCompressorInputStream

Details

    • Type: Bug
    • Status: Resolved
    • Priority: P1
    • Resolution: Fixed
    • Affects Version/s: 2.0.0
    • Fix Version/s: 2.1.0
    • Component/s: sdk-java-core

    Description

      During processing, we encountered the following exception on some of our snappy-encoded Avro input files:

      Exception in thread "main" java.lang.RuntimeException: java.io.IOException: Premature end of stream
       at org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:330)
       at org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:292)
       at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:200)
       at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:63)
       at org.apache.beam.sdk.Pipeline.run(Pipeline.java:295)
       at org.apache.beam.sdk.Pipeline.run(Pipeline.java:281)
      Caused by: java.io.IOException: Premature end of stream
       at org.apache.beam.sdk.repackaged.org.apache.commons.compress.compressors.snappy.SnappyCompressorInputStream.expandLiteral(SnappyCompressorInputStream.java:310)
       at org.apache.beam.sdk.repackaged.org.apache.commons.compress.compressors.snappy.SnappyCompressorInputStream.fill(SnappyCompressorInputStream.java:169)
       at org.apache.beam.sdk.repackaged.org.apache.commons.compress.compressors.snappy.SnappyCompressorInputStream.read(SnappyCompressorInputStream.java:134)
       at org.apache.avro.io.BinaryDecoder$InputStreamByteSource.tryReadRaw(BinaryDecoder.java:839)
       at org.apache.avro.io.BinaryDecoder$ByteSource.compactAndFill(BinaryDecoder.java:692)
       at org.apache.avro.io.BinaryDecoder.ensureBounds(BinaryDecoder.java:471)
       at org.apache.avro.io.BinaryDecoder.readInt(BinaryDecoder.java:128)
       at org.apache.avro.io.BinaryDecoder.readIndex(BinaryDecoder.java:423)
       at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:290)
       at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
       at org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:267)
       at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:178)
       at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:152)
       at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:240)
       at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:230)
       at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:174)
       at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:152)
       at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:144)
       at org.apache.beam.sdk.io.AvroSource$AvroBlock.readNextRecord(AvroSource.java:579)
       at org.apache.beam.sdk.io.BlockBasedSource$BlockBasedReader.readNextRecord(BlockBasedSource.java:198)
       at org.apache.beam.sdk.io.FileBasedSource$FileBasedReader.advanceImpl(FileBasedSource.java:479)
       at org.apache.beam.sdk.io.OffsetBasedSource$OffsetBasedReader.advance(OffsetBasedSource.java:277)
       at org.apache.beam.runners.direct.BoundedReadEvaluatorFactory$BoundedReadEvaluator.processElement(BoundedReadEvaluatorFactory.java:148)
       at org.apache.beam.runners.direct.TransformExecutor.processElements(TransformExecutor.java:146)
       at org.apache.beam.runners.direct.TransformExecutor.run(TransformExecutor.java:110)
       at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
       at java.util.concurrent.FutureTask.run(FutureTask.java:266)
       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
       at java.lang.Thread.run(Thread.java:745)
      

      This seems to be caused by a bug in commons-compress 1.9 (repackaged into the Beam SDK, as the stack trace shows), which was fixed upstream by this commit:
      https://github.com/apache/commons-compress/commit/9ae37525134089dd0c9ee1cf8738192b70e0fc07
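
      As a quick sanity check (not part of the original report, just one way to isolate the cause): Avro's own snappy codec is decoded via snappy-java rather than commons-compress, so plain DataFileReader should read the same files cleanly if they are intact. A minimal sketch:

      import java.io.File;

      import org.apache.avro.file.DataFileReader;
      import org.apache.avro.generic.GenericDatumReader;
      import org.apache.avro.generic.GenericRecord;

      // Sanity check: plain Avro decompresses snappy-coded blocks with
      // snappy-java, not commons-compress, so a clean read here points
      // the failure at Beam's SnappyCompressorInputStream path.
      public class VerifyAvroFile {
        public static void main(String[] args) throws Exception {
          File input = new File(args[0]); // path to one of the failing files
          try (DataFileReader<GenericRecord> reader =
              new DataFileReader<>(input, new GenericDatumReader<GenericRecord>())) {
            long count = 0;
            while (reader.hasNext()) {
              reader.next();
              count++;
            }
            System.out.println("Read " + count + " records without error");
          }
        }
      }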

      We ran a pipeline using AvroIO on both the Spark and direct runners, reading from both HDFS and the local file system.
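
      For reference, a minimal sketch of the pipeline shape that triggered the exception (the schema and input path are placeholders, not our production values):

      import org.apache.avro.Schema;
      import org.apache.beam.sdk.Pipeline;
      import org.apache.beam.sdk.io.AvroIO;
      import org.apache.beam.sdk.options.PipelineOptionsFactory;

      public class SnappyAvroRepro {
        public static void main(String[] args) {
          // Placeholder record schema; any schema matching the input works.
          Schema schema = new Schema.Parser().parse(
              "{\"type\":\"record\",\"name\":\"Rec\","
                  + "\"fields\":[{\"name\":\"f\",\"type\":\"string\"}]}");

          Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

          // Reading snappy-coded Avro goes through AvroSource, which
          // decompresses blocks with the repackaged commons-compress
          // SnappyCompressorInputStream seen in the stack trace.
          p.apply(AvroIO.readGenericRecords(schema)
              .from("/path/to/snappy-coded/*.avro"));

          p.run().waitUntilFinish();
        }
      }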

      In our short tests, the pipeline ran without exceptions after either:

      • upgrading to commons-compress 1.14, or
      • applying the linked patch to the 1.9 sources of SnappyCompressorInputStream.

      Impacts on other components were, of course, not tested.


People

    Assignee: michel (Michael Luckey)
    Reporter: michel (Michael Luckey)
