Beam / BEAM-8576

Regression when reading many files

Details

    • Type: Bug
    • Status: Resolved
    • Priority: P3
    • Resolution: Won't Fix
    • Affects Version/s: 2.14.0, 2.15.0, 2.16.0
    • Fix Version/s: Missing
    • Component/s: runner-spark
    • Labels: None

    Description

      When reading many files, I used to get many tasks (Beam 2.12).

      After upgrading to Beam 2.14, the same code executes differently: all files are read by a single task.

      This happens when the read uses the DoFn-based path (via 'withHintMatchesManyFiles') rather than the Source.

      final PCollection<GenericRecord> records = pipeline.apply(AvroIO.readGenericRecords(mySchema)
          .from(options.getInputPath() + "/*/*/*/data/file.avro").withHintMatchesManyFiles());
      records.apply(Count.globally());
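
To reason about how many tasks to expect, it can help to check how many files a pattern such as the one above matches. The following is a minimal sketch that uses only the JDK glob matcher, not Beam's own file matching, so the two may differ in edge cases. The class name `FilePatternSketch`, the helper `matching`, the `/input` root, and the directory layout are all hypothetical and chosen for illustration. It assumes a Unix-like default file system.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class FilePatternSketch {

    // Returns the candidate paths matched by the glob pattern.
    // In the JDK glob syntax, '*' does not cross a directory boundary,
    // so "/*/*/*/" requires exactly three intermediate directories.
    static List<String> matching(String globPattern, List<String> candidates) {
        PathMatcher matcher =
            FileSystems.getDefault().getPathMatcher("glob:" + globPattern);
        return candidates.stream()
            .filter(c -> matcher.matches(Paths.get(c)))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Hypothetical input root; stands in for options.getInputPath().
        String pattern = "/input" + "/*/*/*/data/file.avro";

        // Hypothetical layout: one file.avro per year/month/day directory.
        List<String> candidates = Arrays.asList(
            "/input/2019/11/01/data/file.avro",
            "/input/2019/11/02/data/file.avro",
            "/input/2019/11/data/file.avro",      // only two levels: no match
            "/input/2019/11/03/data/other.avro"); // different file name: no match

        List<String> matched = matching(pattern, candidates);
        System.out.println(matched.size() + " files matched: " + matched);
    }
}
```

Each matched file is a unit over which a read could in principle be spread, so a large match count combined with a single read task (as in the 2.14 stage view) is the symptom described here.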

      Attachments

        1. Beam_2.12_Dag.png (79 kB, Stefan De Smit)
        2. Beam_2.12_Stages.png (103 kB, Stefan De Smit)
        3. Beam_2.14_Dag.png (61 kB, Stefan De Smit)
        4. Beam_2.14_Stages.png (88 kB, Stefan De Smit)

          People

            Assignee: Unassigned
            Reporter: Stefan De Smit (desmit)
            Votes: 0
            Watchers: 4
