  1. Beam
  2. BEAM-12434

Implement num_shards side input support in WriteToTFRecord

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: P2
    • Resolution: Implemented
    • Affects Version/s: 3.0.0, 2.29.0, 2.30.0, 2.31.0, 2.32.0
    • Fix Version/s: None
    • Component/s: io-py-tfrecord
    • Labels:
      None

      Description

      As concisely explained on Stack Overflow (https://stackoverflow.com/questions/49156159/can-i-pass-side-inputs-to-apache-beam-ptransforms), the following pipeline, which passes a side input as num_shards:

          import math
          import apache_beam as beam

          # tfexample_strs (a PCollection) and OUTPUT_DIR are defined elsewhere.
          EXAMPLES_PER_SHARD = 5.0
          num_tfexamples = tfexample_strs | "count tf examples" >> beam.combiners.Count.Globally()
          num_shards = num_tfexamples | ("compute number of shards" >>
              beam.Map(lambda num_examples: int(math.ceil(num_examples / EXAMPLES_PER_SHARD))))
          _ = tfexample_strs | ("output to tfrecords" >>
              beam.io.WriteToTFRecord(OUTPUT_DIR, num_shards=beam.pvalue.AsSingleton(num_shards)))

      fails with:

          File "/usr/local/lib/python3.7/dist-packages/apache_beam/io/iobase.py", line 1011, in start_bundle
            self.counter = random.randint(0, self.count - 1)
          TypeError: unsupported operand type(s) for -: 'AsSingleton' and 'int' [while running 'output VALIDATION to tfrecords/Write/WriteImpl/ParDo(_RoundRobinKeyFn)']

      The WriteToTFRecord transform in the Apache Beam Python SDK does not currently support passing a side input as num_shards.

      This can be solved fairly easily by implementing _RoundRobinKeyFn a bit differently: the shard count would be passed to the ParDo as a side input (and so arrive as an argument to process) instead of being stored as a constructor argument.
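
      The proposed change can be sketched roughly as follows. This is a minimal illustration of the idea, not the code in apache_beam/io/iobase.py: the class name RoundRobinKeyFn and the parameter name count are assumptions made for this example, and the guarded import exists only so that the key-assignment logic can be run without Beam installed.

```python
import random

try:
    import apache_beam as beam
    DoFnBase = beam.DoFn
except ImportError:  # lets the logic below run without Beam installed
    DoFnBase = object


class RoundRobinKeyFn(DoFnBase):
    """Hypothetical round-robin keying DoFn that takes the shard count per call."""

    def start_bundle(self):
        # The count is not known at construction time, so the random
        # starting offset is deferred until the first element arrives.
        self.counter = None

    def process(self, element, count):
        # `count` is expected to be supplied as a side input by the ParDo.
        if self.counter is None:
            self.counter = random.randrange(count)
        self.counter = (self.counter + 1) % count
        yield self.counter, element
```

      Under these assumptions, the transform would be applied along the lines of beam.ParDo(RoundRobinKeyFn(), count=beam.pvalue.AsSingleton(num_shards)), so that the runner resolves the side input to a plain int before process is called.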

              People

              • Assignee: hoshimura (Johan Sternby)
              • Reporter: hoshimura (Johan Sternby)

                Dates

                • Created:
                  Updated:
                  Resolved:

                  Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 4h 40m