Beam / BEAM-73

IO design pattern: Decouple Parsers and Coders

Details

    Description

      Many Sources can be thought of as providing a byte[] payload, e.g. the bytes between newlines for TextIO, or individual messages for PubSubIO. Therefore, we originally suggested a Coder as the thing to use to decode these byte[] into T (what I'll call Parsing).

      Consider the case of a text file of integers.

      123\n
      456\n
      ...

      We want a PCollection<Integer> out, so we can use TextualIntegerCoder with TextIO.Read. However, that Coder then becomes the default coder for that PCollection and may be used by downstream DoFns. This seems bad because, once the data is parsed, we probably want VarIntCoder or another Coder that is more CPU- and space-efficient.
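
      To make the space argument concrete, here is a minimal sketch in plain Java that compares the size of a textual encoding with a variable-length binary encoding of the same integer. The class and method names are hypothetical, and encodeVarInt is a generic 7-bits-per-byte varint written for illustration; it is a stand-in for, not a copy of, what TextualIntegerCoder and VarIntCoder do.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not part of the Beam SDK.
public class EncodingSizeSketch {

    // Textual encoding: the decimal digits of the value as UTF-8 bytes.
    static byte[] encodeTextual(int value) {
        return Integer.toString(value).getBytes(StandardCharsets.UTF_8);
    }

    // Generic varint: 7 payload bits per byte, low-order group first,
    // high bit set on every byte except the last.
    static byte[] encodeVarInt(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int v = value;
        while ((v & ~0x7F) != 0) {
            out.write((v & 0x7F) | 0x80);
            v >>>= 7; // unsigned shift, so negative inputs also terminate
        }
        out.write(v);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        int value = 123456789;
        System.out.println("textual bytes: " + encodeTextual(value).length);
        System.out.println("varint bytes:  " + encodeVarInt(value).length);
    }
}
```

      For 123456789 the textual form needs one byte per digit (9 bytes), while this varint needs 4, and every downstream stage that re-encodes the element pays that difference again.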

      Another design pattern is
      TextIO.Read() -> MapElements<String, Integer> (lambda s : Integer.parseInt(s))

      This gives the parsed PCollection a better default Coder, but now we go from byte[] to String to Integer rather than directly from byte[] to Integer, which creates an intermediate String for every element.

      The solution seems to be to explicitly add Parser and Coder abstractions.
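
      One possible shape for such a Parser, sketched in plain Java under stated assumptions: the Parser interface, its parse method, and ASCII_INT are hypothetical names invented for this illustration and are not an existing Beam API. The point is only that a Parser can go from byte[] to Integer in one step, while the choice of Coder for the resulting PCollection stays a separate decision.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not part of the Beam SDK.
public class ParserSketch {

    /** Turns one raw source payload into a value of type T. */
    interface Parser<T> {
        T parse(byte[] payload);
    }

    /** Parses ASCII decimal digits straight from bytes, with no intermediate String. */
    static final Parser<Integer> ASCII_INT = payload -> {
        if (payload.length == 0) {
            throw new NumberFormatException("empty payload");
        }
        boolean negative = payload[0] == '-';
        int i = negative ? 1 : 0;
        if (i == payload.length) {
            throw new NumberFormatException("no digits");
        }
        // Accumulate as a negative number so Integer.MIN_VALUE is representable;
        // the exact-arithmetic helpers throw ArithmeticException on overflow.
        int result = 0;
        for (; i < payload.length; i++) {
            int digit = payload[i] - '0';
            if (digit < 0 || digit > 9) {
                throw new NumberFormatException("not a digit at index " + i);
            }
            result = Math.subtractExact(Math.multiplyExact(result, 10), digit);
        }
        return negative ? result : Math.negateExact(result);
    };

    public static void main(String[] args) {
        byte[] line = "123".getBytes(StandardCharsets.US_ASCII);
        System.out.println(ASCII_INT.parse(line));
    }
}
```

      With this split, a source would take a Parser to produce elements, and the output PCollection<Integer> would be free to use VarIntCoder or whatever Coder is inferred for Integer.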

            People

              Assignee: Unassigned
              Reporter: Dan Halperin (dhalperi)