[BEAM-8168] Python GCSFileSystem failing with gzip content encoding - ASF JIRA

Details

Type: Bug
Status: Open
Priority: P3
Resolution: Unresolved
Affects Version/s: 2.15.0
Fix Version/s: None
Component/s: io-py-gcp
Labels: None

Description

Google Cloud Storage supports gzip content encoding (objects stored with Content-Encoding: gzip metadata).

Apache Beam (Python) works correctly with .gz files that have no content encoding set. However, it fails to handle .gz files that have content encoding applied.

For example (the following can be run in a Jupyter notebook):

file_url_1 = 'gs://some-bucket/test1.gz'
file_url_2 = 'gs://some-bucket/test2.gz'

!echo 'my content' > /tmp/test

# file 1 without content encoding
!cat /tmp/test | gzip | gsutil cp - "{file_url_1}"

# file 2 with content encoding
!gsutil cp -Z /tmp/test "{file_url_2}"

!gsutil cat "{file_url_1}" | zcat -
# output: my content

!gsutil cat "{file_url_2}" | zcat -
# output: my content

import apache_beam as beam
from apache_beam.io.filesystem import CompressionTypes
from apache_beam.io.filesystems import FileSystems

print(beam.__version__)
# output: 2.15.0

with FileSystems.open(file_url_1, compression_type=CompressionTypes.UNCOMPRESSED) as fp:
    print(fp.read(10))
# output: b'\x1f\x8b\x08\x00\x10\xd6r]\x00\x03'

with FileSystems.open(file_url_1) as fp:
    print(fp.read(10))
# output: b'my content'

with FileSystems.open(file_url_2, compression_type=CompressionTypes.UNCOMPRESSED) as fp:
    print(fp.read(10))
# output: b'my content'
# (here I would expect the raw gzipped bytes, as with file 1)

with FileSystems.open(file_url_2) as fp:
    print(fp.read(10))
# exception: FailedToDecompressContent: Content purported to be compressed with gzip but failed to decompress.
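
The outputs above differ in one easily checked way: the raw read of file 1 starts with the gzip magic number (\x1f\x8b), while the raw read of file 2 does not. A minimal sketch of a possible workaround follows, assuming the bytes have been read with compression_type=CompressionTypes.UNCOMPRESSED as in the report. The helper names looks_gzipped and read_text_payload are hypothetical and not part of Apache Beam, and the two byte strings are local stand-ins for the objects, so nothing here has been verified against GCS.

```python
import gzip

# First two bytes of every gzip stream.
GZIP_MAGIC = b"\x1f\x8b"


def looks_gzipped(head: bytes) -> bool:
    """Return True if the given bytes start with the gzip magic number."""
    return head[:2] == GZIP_MAGIC


def read_text_payload(raw: bytes) -> bytes:
    """Decompress only when the payload really is gzip; otherwise return it unchanged."""
    if looks_gzipped(raw):
        return gzip.decompress(raw)
    return raw


# Local stand-ins for the two objects in the report (no GCS access involved).
stored = gzip.compress(b"my content\n")
file_1_bytes = stored                   # raw read returns the gzip stream itself
file_2_bytes = gzip.decompress(stored)  # assumption: raw read arrives already decompressed

print(read_text_payload(file_1_bytes))
print(read_text_payload(file_2_bytes))
```

With a check like this, both objects would yield the same content regardless of whether content encoding was applied at upload time. It only sidesteps the symptom and does not address the underlying behavior in GCSFileSystem.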

People

Assignee: Unassigned

Reporter: Daniel Ecer

Votes: 0

Watchers: 3

Dates

Created: 06/Sep/19 22:30

Updated: 16/Jun/20 17:25