BEAM-8576

Regression when reading many files

    Details

    • Type: Bug
    • Status: Triage Needed
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.14.0, 2.15.0, 2.16.0
    • Fix Version/s: None
    • Component/s: runner-spark
    • Labels:
      None

      Description

      With Beam 2.12, reading many files produced many tasks.

      After upgrading to Beam 2.14, the same code executes differently: all files are read by a single task.

      This happens when the read goes through the DoFn-based implementation (enabled via 'withHintMatchesManyFiles') rather than the Source.

      final PCollection<GenericRecord> records = pipeline.apply(AvroIO.readGenericRecords(mySchema)
          .from(options.getInputPath() + "/*/*/*/data/file.avro").withHintMatchesManyFiles());
      records.apply(Count.globally());
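
      For comparison, the DoFn-based read can be written out step by step, which makes it possible to put an explicit redistribution step between matching and reading. The following is a minimal sketch, not a verified fix for this regression: the class name ExplicitManyFilesRead and the method read are hypothetical, the import of AvroIO from org.apache.beam.sdk.io assumes a Beam release from the affected range (later releases may place it elsewhere), and whether the Spark runner then spreads the files over several tasks has not been tested here.

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.AvroIO;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.transforms.Reshuffle;
import org.apache.beam.sdk.values.PCollection;

/** Hypothetical helper: spells out match -> reshuffle -> read as separate steps. */
public class ExplicitManyFilesRead {

  static PCollection<GenericRecord> read(Pipeline pipeline, Schema schema, String filePattern) {
    return pipeline
        // Expand the glob into one metadata element per matched file.
        .apply("MatchFiles", FileIO.match().filepattern(filePattern))
        // Assumption: redistributing the matched files lets the runner
        // assign them to more than one task.
        .apply("RedistributeMatches", Reshuffle.viaRandomKey())
        // Turn each match into a readable file handle.
        .apply("OpenFiles", FileIO.readMatches())
        // Read the Avro records of each file with the given schema.
        .apply("ReadRecords", AvroIO.readFilesGenericRecords(schema));
  }
}
```

      With the names from the snippet above, this would be called as ExplicitManyFilesRead.read(pipeline, mySchema, options.getInputPath() + "/*/*/*/data/file.avro"), and the result could be passed to Count.globally() as before.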

        Attachments

        1. Beam_2.12_Dag.png
          79 kB
          Stefan De Smit
        2. Beam_2.12_Stages.png
          103 kB
          Stefan De Smit
        3. Beam_2.14_Dag.png
          61 kB
          Stefan De Smit
        4. Beam_2.14_Stages.png
          88 kB
          Stefan De Smit

            People

            • Assignee: Unassigned
            • Reporter: Stefan De Smit (desmit)
            • Votes: 0
            • Watchers: 1
