Beam / BEAM-1787

Python DirectRunner silently blocks until the full query result is read from Google Datastore

Details

    Description

      When I run a query (even one with many splits) against the production Datastore (as in the datastore_wordcount demo), the DirectRunner operates as follows:

      1. split the query into a bunch of split queries
      2. run each split query, collecting the results
      3. then pass the results to the following stage / ParDo

      However, with the DirectRunner, step 2 runs to completion before step 3 starts, so a large dataset must be fully downloaded before any of the following stages run.
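
The ordering described above can be illustrated with a small plain-Python simulation. This is only a sketch of the observed behavior, not the DirectRunner's actual code: `run_split_query`, `eager_read`, `streamed_read`, and `downstream` are hypothetical names, and no Beam or Datastore API is called. It contrasts reading every split to completion before the next stage runs with a generator-based variant that hands each entity to the next stage as soon as it is read.

```python
from typing import Iterable, Iterator, List, Tuple

Event = Tuple[str, str]


def run_split_query(split_id: int, rows_per_split: int) -> Iterator[str]:
    """Hypothetical stand-in for running one split query against Datastore."""
    for i in range(rows_per_split):
        yield f"split{split_id}-row{i}"


def eager_read(num_splits: int, rows_per_split: int, events: List[Event]) -> List[str]:
    """Mimics the reported behavior: all splits are fully read before returning."""
    results = []
    for split_id in range(num_splits):
        for row in run_split_query(split_id, rows_per_split):
            events.append(("read", row))
            results.append(row)
    return results


def streamed_read(num_splits: int, rows_per_split: int, events: List[Event]) -> Iterator[str]:
    """Pipelined alternative: each row is handed downstream as soon as it is read."""
    for split_id in range(num_splits):
        for row in run_split_query(split_id, rows_per_split):
            events.append(("read", row))
            yield row


def downstream(rows: Iterable[str], events: List[Event]) -> int:
    """Stand-in for the following stage / ParDo; returns the number of rows seen."""
    count = 0
    for row in rows:
        events.append(("process", row))
        count += 1
    return count


# Record the order of reads and downstream processing for both strategies.
eager_events: List[Event] = []
downstream(eager_read(2, 2, eager_events), eager_events)

streamed_events: List[Event] = []
downstream(streamed_read(2, 2, streamed_events), streamed_events)
```

In the eager case every "read" event precedes the first "process" event, which matches the apparent hang on a large dataset. In the streamed case the two kinds of event alternate.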

      This may make sense, and local parallelism or pipelining may be impossible, but there is no output and there are no status messages. Working out why my code appeared to hang before processing any results took a long time: I had to dig through the Beam code and add logging instrumentation to figure out what was going on.
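
As a rough illustration of the kind of status output that is missing, the sketch below wraps any iterable of results and logs a progress message every N items. It is an assumption about what helpful output could look like, not a proposed patch to Beam: `log_progress` and the logger name `datastore_read_progress` are hypothetical, and only the Python standard library is used.

```python
import logging
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")

# Hypothetical logger name, chosen for this example only.
logger = logging.getLogger("datastore_read_progress")


def log_progress(items: Iterable[T], label: str, every: int = 1000) -> Iterator[T]:
    """Yield items unchanged, logging a status line every `every` items."""
    count = 0
    for item in items:
        count += 1
        if count % every == 0:
            logger.info("%s: %d entities read so far", label, count)
        yield item
    # Final summary once the underlying iterable is exhausted.
    logger.info("%s: finished, %d entities read", label, count)
```

Wrapping the results of each split query in something like `log_progress(results, "split 3")` would make a long blocking read visible in the logs, provided the logger is configured at INFO level.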

      See https://github.com/GoogleCloudPlatform/DataflowPythonSDK/issues/36 for more details

      This happens at GitHub HEAD, 0.7.0-dev (the "version" field above had no tag for it).

          People

            Assignee: Unassigned
            Reporter: Mike Lambert (mlambert)