Beam / BEAM-2490

ReadFromText function is not taking all data with glob operator (*)

Details

    • Type: Bug
    • Status: Resolved
    • Priority: P2
    • Resolution: Fixed
    • Affects Version: 2.0.0
    • Fix Version: 2.1.0
    • Component: sdk-py-core
    • Labels: None
    • Environment: Usage with Google Cloud Platform: Dataflow runner

    Description

      I run a very simple pipeline:

      • Read my files from Google Cloud Storage
      • Split on the '\n' character
      • Write the result to Google Cloud Storage

      I have 8 files that match the pattern:

      • my_files_2016090116_20160902_060051_xxxxxxxxxx.csv.gz (229.25 MB)
      • my_files_2016090117_20160902_060051_xxxxxxxxxx.csv.gz (184.1 MB)
      • my_files_2016090118_20160902_060051_xxxxxxxxxx.csv.gz (171.73 MB)
      • my_files_2016090119_20160902_060051_xxxxxxxxxx.csv.gz (151.34 MB)
      • my_files_2016090120_20160902_060051_xxxxxxxxxx.csv.gz (129.69 MB)
      • my_files_2016090121_20160902_060051_xxxxxxxxxx.csv.gz (151.7 MB)
      • my_files_2016090122_20160902_060051_xxxxxxxxxx.csv.gz (346.46 MB)
      • my_files_2016090122_20160902_060051_xxxxxxxxxx.csv.gz (222.57 MB)

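      As a sanity check on the pattern itself, the matching can be reproduced locally with Python's `fnmatch` (note: this only checks shell-style wildcard semantics; Beam's GCS glob expansion may differ in edge cases, so this is an illustration, not a reproduction of the bug):

      ```python
      from fnmatch import fnmatch

      # The file names listed above (the "xxxxxxxxxx" suffix is kept as-is).
      files = [
          "my_files_2016090116_20160902_060051_xxxxxxxxxx.csv.gz",
          "my_files_2016090117_20160902_060051_xxxxxxxxxx.csv.gz",
          "my_files_2016090118_20160902_060051_xxxxxxxxxx.csv.gz",
          "my_files_2016090119_20160902_060051_xxxxxxxxxx.csv.gz",
          "my_files_2016090120_20160902_060051_xxxxxxxxxx.csv.gz",
          "my_files_2016090121_20160902_060051_xxxxxxxxxx.csv.gz",
          "my_files_2016090122_20160902_060051_xxxxxxxxxx.csv.gz",
      ]
      pattern = "my_files_20160901*.csv.gz"

      # Every listed file name matches the pattern, so the glob itself
      # is not the reason files would be skipped.
      assert all(fnmatch(f, pattern) for f in files)
      ```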
      This code should read them all:

      beam.io.ReadFromText(
          "gs://XXXX_folder1/my_files_20160901*.csv.gz",
          skip_header_lines=1,
          compression_type=beam.io.filesystem.CompressionTypes.GZIP)
      

      It runs well, but the pipeline outputs only a single 288.62 MB file (instead of roughly 1.5 GB).

      The whole pipeline code:

      data = (p
              | 'ReadMyFiles' >> beam.io.ReadFromText(
                  "gs://XXXX_folder1/my_files_20160901*.csv.gz",
                  skip_header_lines=1,
                  compression_type=beam.io.filesystem.CompressionTypes.GZIP)
              | 'SplitLines' >> beam.FlatMap(lambda x: x.split('\n')))

      output = (data
                | "Write" >> beam.io.WriteToText('gs://XXX_folder2/test.csv',
                                                 num_shards=1))
      
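      (An aside, unrelated to the missing data: ReadFromText already emits one element per line with the trailing newline stripped, so the 'SplitLines' step is effectively a pass-through. A minimal illustration of why:)

      ```python
      # An element as emitted by ReadFromText: one line, newline already stripped.
      line = "col1,col2,col3"

      # Splitting on '\n' returns a one-element list containing the same line,
      # so FlatMap(lambda x: x.split('\n')) re-emits each line unchanged.
      assert line.split('\n') == ["col1,col2,col3"]
      ```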

      Dataflow indicates that the estimated output size after the ReadFromText step is only 602.29 MB, which corresponds neither to any single input file's size nor to the total size of the files matching the pattern.
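      For reference, summing the file sizes listed above confirms the ~1.5 GB expectation, well above both the 602.29 MB Dataflow estimate and the 288.62 MB actually written:

      ```python
      # Sizes (MB) of the eight input files listed in the description.
      sizes_mb = [229.25, 184.1, 171.73, 151.34, 129.69, 151.7, 346.46, 222.57]

      # Total expected input: about 1586.84 MB, i.e. roughly 1.5 GB.
      total_mb = sum(sizes_mb)
      ```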

    Attachments

    Activity

    People

      Assignee: chamikara Chamikara Madhusanka Jayalath
      Reporter: oliviernguyenquoc Olivier NGUYEN QUOC
      Votes: 0
      Watchers: 3

    Dates

      Created:
      Updated:
      Resolved: