  1. Beam
  2. BEAM-8168

Python GCSFileSystem failing with gzip content encoding

Details

    • Type: Bug
    • Status: Open
    • Priority: P3
    • Resolution: Unresolved
    • Affects Version/s: 2.15.0
    • Fix Version/s: None
    • Component/s: io-py-gcp
    • Labels: None

    Description

      Google Cloud Storage supports gzip content encoding.

      Apache Beam (Python) correctly handles .gz files that have no content encoding set. However, it fails to handle .gz files that do have gzip content encoding applied.

      For example (the following can be run in a Jupyter notebook):

      file_url_1 = 'gs://some-bucket/test1.gz'
      file_url_2 = 'gs://some-bucket/test2.gz'
      
      !echo 'my content' > /tmp/test
      
      # file 1 without content encoding
      !cat /tmp/test | gzip | gsutil cp - "{file_url_1}"
      
      # file 2 with content encoding
      !gsutil cp -Z /tmp/test "{file_url_2}"
      
      !gsutil cat "{file_url_1}" | zcat -
      # output: my content
      
      !gsutil cat "{file_url_2}" | zcat -
      # output: my content
      
      import apache_beam as beam
      from apache_beam.io.filesystem import CompressionTypes
      from apache_beam.io.filesystems import FileSystems
      
      print(beam.__version__)
      # output: 2.15.0
      
      with FileSystems.open(file_url_1, compression_type=CompressionTypes.UNCOMPRESSED) as fp:
          print(fp.read(10))
      # output: b'\x1f\x8b\x08\x00\x10\xd6r]\x00\x03'
      
      with FileSystems.open(file_url_1) as fp:
          print(fp.read(10))
      # output: b'my content'
      
      with FileSystems.open(file_url_2, compression_type=CompressionTypes.UNCOMPRESSED) as fp:
          print(fp.read(10))
      # output: b'my content'
      # (here I would expect the raw gzipped bytes)
      
      with FileSystems.open(file_url_2) as fp:
          print(fp.read(10))
      # exception: FailedToDecompressContent: Content purported to be compressed with gzip but failed to decompress.
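
A possible workaround, sketched under assumptions: going by the outputs above, opening with CompressionTypes.UNCOMPRESSED succeeds for both files, returning gzipped bytes for file 1 and already-decoded bytes for file 2. A caller could therefore read the raw bytes and decompress only when the gzip magic header is present. The helper name maybe_gunzip is hypothetical and is not part of Beam.

```python
import gzip

# Every gzip stream starts with these two bytes (RFC 1952).
GZIP_MAGIC = b"\x1f\x8b"


def maybe_gunzip(data: bytes) -> bytes:
    """Return decompressed bytes if `data` looks gzipped, else `data` unchanged.

    Hypothetical helper for illustration; not a Beam API.
    """
    if data[:2] == GZIP_MAGIC:
        return gzip.decompress(data)
    return data
```

With Beam, `data` would come from `fp.read()` on a file opened via `FileSystems.open(url, compression_type=CompressionTypes.UNCOMPRESSED)`. This has not been verified against GCS. It also reads the whole object into memory, and an uncompressed payload that happens to start with 0x1f 0x8b would be misdetected as gzip.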
      

       


          People

            Assignee: Unassigned
            Reporter: Daniel Ecer