Details
Type: Bug
Status: Open
Priority: P3
Resolution: Unresolved
Affects Version/s: 2.4.0
Fix Version/s: None
Description
TLDR: if you have a multi-output DoFn, the non-main PCollections will incorrectly have their element types set to None. This affects type checking for pipelines involving these PCollections.
Minimal example:
import apache_beam as beam


class TripleDoFn(beam.DoFn):
    def process(self, elem):
        yield elem
        if elem % 2 == 0:
            yield beam.pvalue.TaggedOutput('ten_times', elem * 10)
        if elem % 3 == 0:
            yield beam.pvalue.TaggedOutput('hundred_times', elem * 100)


@beam.typehints.with_input_types(int)
@beam.typehints.with_output_types(int)
class MultiplyBy(beam.DoFn):
    def __init__(self, multiplier):
        self._multiplier = multiplier

    def process(self, elem):
        return elem * self._multiplier


def main():
    with beam.Pipeline() as p:
        x, a, b = (
            p
            | 'Create' >> beam.Create([1, 2, 3])
            | 'TripleDo' >> beam.ParDo(TripleDoFn()).with_outputs(
                'ten_times', 'hundred_times', main='main_output'))
        _ = a | 'MultiplyBy2' >> beam.ParDo(MultiplyBy(2))


if __name__ == '__main__':
    main()
Running this yields the following error:
apache_beam.typehints.decorators.TypeCheckError: Type hint violation for 'MultiplyBy2': requires <type 'int'> but got None for elem
Replacing a with b yields the same error. Replacing a with x instead yields the following error:
apache_beam.typehints.decorators.TypeCheckError: Type hint violation for 'MultiplyBy2': requires <type 'int'> but got Union[TaggedOutput, int] for elem
I would expect Beam to correctly infer that a and b have element types of int rather than None, and I would also expect Beam to correctly figure out that the element types of x are compatible with int.
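To make that expectation concrete, here is a rough sketch that does not use Beam at all. The TaggedOutput class below is a hypothetical stand-in that carries only a tag and a value, and split_outputs is an illustrative helper, not anything from the Beam API. It routes the values yielded by the same process logic to their outputs and shows that every output, including the main one, ends up holding only ints. It says nothing about how Beam actually infers element types.

```python
from collections import defaultdict


class TaggedOutput:
    """Hypothetical stand-in for beam.pvalue.TaggedOutput (tag and value only)."""

    def __init__(self, tag, value):
        self.tag = tag
        self.value = value


def triple_process(elem):
    """Mirrors TripleDoFn.process from the report, without Beam."""
    yield elem
    if elem % 2 == 0:
        yield TaggedOutput('ten_times', elem * 10)
    if elem % 3 == 0:
        yield TaggedOutput('hundred_times', elem * 100)


def split_outputs(elements, main='main_output'):
    """Illustrative helper: route each yielded item to its tag.

    Untagged items go to the main output; tagged items are unwrapped first,
    so the wrapper itself never lands in any output.
    """
    outputs = defaultdict(list)
    for elem in elements:
        for item in triple_process(elem):
            if isinstance(item, TaggedOutput):
                outputs[item.tag].append(item.value)
            else:
                outputs[main].append(item)
    return dict(outputs)


outputs = split_outputs([1, 2, 3])
for tag, values in sorted(outputs.items()):
    # Print each output with the set of Python types it contains.
    print(tag, values, {type(v).__name__ for v in values})
```

In this sketch the raw yields are a mix of ints and wrappers, which matches the Union[TaggedOutput, int] in the second error, but after routing each output contains ints only.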