Details
-
Bug
-
Status: Triage Needed
-
P2
-
Resolution: Fixed
-
Not applicable
-
None
Description
The Python apache_beam.io.filesystem module has a bug in its handling of concatenated compressed files.
The PR I will create has two commits:
- a new unit test that shows the problem
- a fix to the problem.
The unit test is added to the apache_beam.io.filesystem_test module. It was added there because the existing test, apache_beam.io.textio_test.test_read_gzip_concat, does not encounter the problem in the Beam 2.11 and earlier code base: its test data is smaller than read_size, so it follows a code path that avoids the bug. The new test therefore uses a smaller read_size and larger test data in order to trigger the problem. It would be difficult to test this in the textio_test module, because the default read_size is 16MiB, so very large test data would be needed, and the ReadFromText interface does not allow you to modify read_size.
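To illustrate the general situation the test exercises, here is a minimal sketch using only the Python standard library (gzip and zlib). It is not Beam's code and is not the fix in the PR. The function names make_concatenated_gzip and read_concatenated_gzip are hypothetical, and read_size here is just a local parameter standing in for the module's read_size. The sketch shows that when a concatenated gzip file is read in chunks smaller than the data, a chunk can straddle a member boundary, so the reader has to carry the leftover bytes over to a fresh decompressor.

```python
import gzip
import zlib

# Assumption: 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
GZIP_WBITS = 16 + zlib.MAX_WBITS


def make_concatenated_gzip(parts):
    """Compress each part as its own gzip member and concatenate them."""
    return b"".join(gzip.compress(part) for part in parts)


def read_concatenated_gzip(data, read_size):
    """Decompress every member of `data`, reading `read_size` bytes at a time.

    Hypothetical helper for illustration only; not Beam's implementation.
    """
    output = []
    decompressor = zlib.decompressobj(GZIP_WBITS)
    position = 0
    pending = b""  # bytes left over after the end of a gzip member
    while True:
        if not pending:
            pending = data[position:position + read_size]
            position += len(pending)
            if not pending:
                break  # no more input
        output.append(decompressor.decompress(pending))
        if decompressor.eof:
            # The chunk crossed a member boundary: keep the leftover bytes
            # and start a new decompressor for the next member.
            pending = decompressor.unused_data
            decompressor = zlib.decompressobj(GZIP_WBITS)
        else:
            pending = b""
    return b"".join(output)


parts = [b"first member\n" * 500, b"second member\n" * 500]
blob = make_concatenated_gzip(parts)

# A single decompressor stops at the end of the first member.
single = zlib.decompressobj(GZIP_WBITS)
first_only = single.decompress(blob)

# Reading with a small chunk size forces chunks to straddle the boundary.
everything = read_concatenated_gzip(blob, read_size=16)
```

With a read_size larger than the whole file, the boundary handling is exercised in a single pass over one chunk, which is consistent with why small test data under a 16MiB read_size does not reach the chunked path described above.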