Beam / BEAM-2643

Add TextIO and AvroIO read transforms that can read a PCollection of files

Details

    • Type: New Feature
    • Status: Resolved
    • Priority: P2
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.2.0
    • Component/s: sdk-py-core
    • Labels: None

    Description

      The Java SDK now has a TextIO.readAll() API that allows reading a massive number of files. It does so by moving from the BoundedSource API (which may perform expensive source operations on the control plane) to ParDo operations.

      https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextIO.java#L170

      This API should be added to the Python SDK as well.
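
      The idea of reading a collection of files through per-element steps, rather than through one source object per file, can be sketched in plain Python. This is only an illustration under stated assumptions: it uses the standard library, not Beam, and the names expand_pattern, read_lines, read_all and demo are hypothetical. In an actual pipeline each step would presumably run as a ParDo over a PCollection of file patterns, distributed across workers.

```python
import glob
import os
import tempfile
from typing import Iterable, Iterator, List


def expand_pattern(pattern: str) -> Iterator[str]:
    """First per-element step: turn one file pattern into concrete paths."""
    # Sorting only makes the demo output deterministic.
    yield from sorted(glob.glob(pattern))


def read_lines(path: str) -> Iterator[str]:
    """Second per-element step: read one file and emit its lines."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            yield line.rstrip("\n")


def read_all(patterns: Iterable[str]) -> Iterator[str]:
    """Chain both steps, the way two ParDo-like stages would be chained."""
    for pattern in patterns:
        for path in expand_pattern(pattern):
            yield from read_lines(path)


def demo() -> List[str]:
    """Write two small files to a temp directory and read them back."""
    with tempfile.TemporaryDirectory() as tmp:
        for name, body in [("a.txt", "1\n2\n"), ("b.txt", "3\n")]:
            with open(os.path.join(tmp, name), "w", encoding="utf-8") as out:
                out.write(body)
        # Materialize the generator before the directory is deleted.
        return list(read_all([os.path.join(tmp, "*.txt")]))
```

      Because the input is just a collection of patterns, nothing has to be enumerated or sized up front; the work happens where each element is processed.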

      This form of reading files does not support dynamic work rebalancing for now, but that should not matter much when reading a massive number of relatively small files. In the future, this API can support dynamic work rebalancing through Splittable DoFn.

      cc: jkff

            People

              Assignee: Chamikara Madhusanka Jayalath
              Reporter: Chamikara Madhusanka Jayalath
              Votes: 0
              Watchers: 3
