Beam / BEAM-3067

BigQueryIO.Write fails on empty PCollection with DirectRunner (batch job)

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: P2
    • Resolution: Fixed
    • Affects Version/s: 2.1.0
    • Fix Version/s: 2.2.0
    • Component/s: io-java-gcp, runner-direct
    • Labels:
      None
    • Environment:
      Arch Linux, Java 1.8.0_144

      Description

      I'm using the side output feature to filter malformed events (errors) out of a stream of valid events. I then save the valid events to one BigQuery table and the errors to a separate, dedicated table.
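For readers unfamiliar with the pattern, the split can be modeled in plain Java as a rough sketch. This is not Beam API code and not the reporter's pipeline: the class EventSplitSketch, the Split holder, and the "key=value" validity rule are all hypothetical names and assumptions chosen for illustration. It only shows how one input is routed into a main output and an error output, and that the error output is empty when every event is valid.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java model of a main/side output split; NOT Beam API code.
class EventSplitSketch {

    /** Holds the two outputs: valid events (main) and malformed events (side). */
    static final class Split {
        final List<String> valid = new ArrayList<>();
        final List<String> invalid = new ArrayList<>();
    }

    // Assumed validity rule, for illustration only: an event looks like "key=value"
    // with a non-empty key and a non-empty value.
    static boolean isValid(String event) {
        int eq = event.indexOf('=');
        return eq > 0 && eq < event.length() - 1;
    }

    // Routes each event to exactly one of the two outputs.
    static Split split(List<String> events) {
        Split out = new Split();
        for (String event : events) {
            if (isValid(event)) {
                out.valid.add(event);
            } else {
                out.invalid.add(event);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // All events are valid, so the error output stays empty.
        Split s = split(Arrays.asList("user=alice", "user=bob"));
        System.out.println("valid=" + s.valid.size() + ", invalid=" + s.invalid.size());
        // prints "valid=2, invalid=0"
    }
}
```

In the actual pipeline the two outputs would be PCollections rather than lists, and the error output is the invalidEventRows collection written below.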
      Here is the code for outputting error rows:

      invalidEventRows.apply("WriteErrors", BigQueryIO.writeTableRows()
              .to(errorTableRef)
              .withSchema(ProcessEvents.getErrorSchema())
              .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
              .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
      

      The problem occurs when the pipeline runs on DirectRunner in batch mode (reading input from a file) and the invalidEventRows PCollection ends up empty because all events are valid. In that case I get the following error:

      [ERROR]   "status" : {
      [ERROR]     "errorResult" : {
      [ERROR]       "message" : "No schema specified on job or table.",
      [ERROR]       "reason" : "invalid"
      [ERROR]     },
      [ERROR]     "errors" : [ {
      [ERROR]       "message" : "No schema specified on job or table.",
      [ERROR]       "reason" : "invalid"
      [ERROR]     } ],
      [ERROR]     "state" : "DONE"
      [ERROR]   },
      

      The same code runs without errors when the invalidEventRows PCollection is not empty: the BigQuery table is created and the data is inserted correctly.
      Everything also seems to work fine in streaming mode (reading from Pub/Sub) on both DirectRunner and DataflowRunner.

      Is this a bug in Beam, or should I open an issue in the GoogleCloudPlatform/DataflowJavaSDK GitHub project instead?

            People

            • Assignee:
              reuvenlax Reuven Lax
              Reporter:
              bigunyak Dmitry Bigunyak