BEAM-167: TextIO can't read concatenated gzip files

    Details

      Description

      $ cat <<END > header.csv
      a,b,c
      END
      $ cat <<END > body.csv
      1,2,3
      4,5,6
      7,8,9
      END
      $ gzip -c header.csv > file.gz
      $ gzip -c body.csv >> file.gz

      The file is well-formed:
      $ gzip -dc file.gz
      a,b,c
      1,2,3
      4,5,6
      7,8,9

      However, TextIO.Read.from("/path/to/file.gz") reads only the first line, "a,b,c". This is reproducible even when the file is on local disk and the pipeline runs with the DirectPipelineRunner.

      The bug is in CompressedSource. It uses GzipCompressorInputStream, which by default reads only the first gzip stream in the file but has an option to read all of them. Previously (in Dataflow SDK 1.4.0) we used GZIPInputStream, which reads all streams.
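
      For illustration, here is a minimal sketch of the GZIPInputStream behavior described above, using only the JDK. It builds the same two-stream file in memory and reads it back. The class and method names (ConcatenatedGzipDemo, gzip, concat, readAll) are hypothetical and not part of Beam. The comment about GzipCompressorInputStream assumes the option mentioned above is the two-argument constructor of the Apache Commons Compress class; this sketch does not exercise it.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo class; not part of Beam or the Dataflow SDK.
public class ConcatenatedGzipDemo {

    /** Compresses text into one complete gzip stream, like `gzip -c`. */
    static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bytes)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return bytes.toByteArray();
    }

    /** Appends one gzip stream to another, like `>> file.gz`. */
    static byte[] concat(byte[] first, byte[] second) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        bytes.write(first, 0, first.length);
        bytes.write(second, 0, second.length);
        return bytes.toByteArray();
    }

    /**
     * Decompresses until end of input with java.util.zip.GZIPInputStream,
     * which continues into the next gzip stream when more bytes follow.
     * Assumption: with Commons Compress, the comparable call would be
     * new GzipCompressorInputStream(in, true).
     */
    static String readAll(byte[] gzipped) throws IOException {
        ByteArrayOutputStream text = new ByteArrayOutputStream();
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(gzipped))) {
            byte[] buffer = new byte[4096];
            int n;
            while ((n = in.read(buffer)) != -1) {
                text.write(buffer, 0, n);
            }
        }
        return new String(text.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Same contents as header.csv and body.csv in the report.
        byte[] file = concat(gzip("a,b,c\n"), gzip("1,2,3\n4,5,6\n7,8,9\n"));
        System.out.print(readAll(file));
    }
}
```

      A regression test for CompressedSource could follow the same shape: write two gzip streams back to back, read them through the source, and check that the lines from the second stream are present.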

            People

            • Assignee: Luke Cwik (lcwik)
            • Reporter: Eugene Kirpichov (jkff)