  Beam / BEAM-6952

Concatenated compressed files bug with Python SDK


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: Not applicable
    • Fix Version/s: 2.14.0
    • Component/s: sdk-py-core
    • Labels:
      None

      Description

      The Python apache_beam.io.filesystem module has a bug handling concatenated compressed files.

      The PR I will create has two commits:

      1. a new unit test that shows the problem
      2. a fix for the problem.

      The unit test is added to the apache_beam.io.filesystem_test module rather than to textio_test. The existing test, apache_beam.io.textio_test.test_read_gzip_concat, does not trigger the problem in Beam 2.11 and earlier because its test data is smaller than read_size, so it goes through a code path that avoids the bug. The new test therefore uses a smaller read_size and larger test data in order to hit the problem. Testing this in the textio_test module would be difficult: the default read_size is 16 MiB, so the test data would have to be very large, and the ReadFromText interface does not allow read_size to be changed.
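
      For readers unfamiliar with the terms, the sketch below shows what a "concatenated compressed file" is and why a small read size matters when testing one. It uses only the Python standard library and is not Beam's code, nor a reproduction of the specific bug in apache_beam.io.filesystem. The function names and the read_size parameter here are hypothetical and chosen for illustration only. A reader that uses a single decompressor stops at the end of the first gzip member, while a reader that restarts on the leftover bytes recovers every member, even when member boundaries fall in the middle of a chunk.

```python
import gzip
import zlib

GZIP_WBITS = 16 + zlib.MAX_WBITS  # tell zlib to expect a gzip header and trailer


def make_concatenated_gzip(parts):
    """Compress each part as its own gzip member and concatenate the members."""
    return b"".join(gzip.compress(p) for p in parts)


def decompress_first_member_only(data, read_size):
    """Naive chunked reader: one decompressor, stops after the first member."""
    decompressor = zlib.decompressobj(GZIP_WBITS)
    out = []
    for start in range(0, len(data), read_size):
        out.append(decompressor.decompress(data[start:start + read_size]))
        if decompressor.eof:
            break  # bytes belonging to later members are never decoded
    return b"".join(out)


def decompress_all_members(data, read_size):
    """Chunked reader that restarts a fresh decompressor at each member end."""
    decompressor = zlib.decompressobj(GZIP_WBITS)
    out = []
    pos = 0
    pending = b""  # bytes already read that belong to the next member
    while pos < len(data) or pending:
        if pending:
            chunk, pending = pending, b""
        else:
            chunk = data[pos:pos + read_size]
            pos += read_size
        out.append(decompressor.decompress(chunk))
        if decompressor.eof:
            # The chunk may have run past the member boundary; keep the rest.
            pending = decompressor.unused_data
            decompressor = zlib.decompressobj(GZIP_WBITS)
    return b"".join(out)


parts = [b"a" * 1000, b"b" * 1000, b"c" * 1000]
data = make_concatenated_gzip(parts)

# A tiny read size forces member boundaries to land inside chunks.
first_only = decompress_first_member_only(data, read_size=7)
everything = decompress_all_members(data, read_size=7)
```

      Choosing a read size much smaller than the compressed data, as the new unit test does, is what makes the multi-chunk and mid-chunk-boundary paths run at all.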

            People

            • Assignee:
              Unassigned
              Reporter:
              danl Daniel Lescohier
            • Votes:
              0
              Watchers:
              3


                Time Tracking

                Estimated: Not Specified
                Remaining: 0h
                Logged: 3h 20m