  1. Beam
  2. BEAM-14407

Jenkins worker sometimes crashes while running Python Flink pipeline

Details

    • Type: Bug
    • Status: Open
    • Priority: P2
    • Resolution: Unresolved
    • test-failures

    Description

      Example failure from https://ci-beam.apache.org/job/beam_PostCommit_Python37/5184/

      >>> RUNNING integration tests with pipeline options: --runner=FlinkRunner --project=apache-beam-testing --environment_type=LOOPBACK --temp_location=gs://temp-storage-for-end-to-end-tests/temp-it --flink_job_server_jar=/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python37/src/runners/flink/1.14/job-server/build/libs/beam-runners-flink-1.14-job-server-2.39.0-SNAPSHOT.jar
      >>>   pytest options: apache_beam/io/gcp/bigquery_read_it_test.py apache_beam/io/external/xlang_jdbcio_it_test.py apache_beam/io/external/xlang_kafkaio_it_test.py apache_beam/io/external/xlang_kinesisio_it_test.py apache_beam/io/external/xlang_debeziumio_it_test.py --log-cli-level=INFO
      
      ...
      
      15:27:18 INFO     apache_beam.utils.subprocess_server:subprocess_server.py:116 Starting service with ['java' '-jar' '/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python37/src/runners/flink/1.14/job-server/build/libs/beam-runners-flink-1.14-job-server-2.39.0-SNAPSHOT.jar' '--flink-master' '[auto]' '--artifacts-dir' '/tmp/beam-temp34uahjm8/artifactsfzc4uc4c' '--job-port' '56343' '--artifact-port' '0' '--expansion-port' '0']
      15:27:18 INFO     apache_beam.utils.subprocess_server:subprocess_server.py:125 b'May 03, 2022 1:27:20 PM software.amazon.awssdk.regions.internal.util.EC2MetadataUtils getItems'
      15:27:20 INFO     apache_beam.utils.subprocess_server:subprocess_server.py:125 b'WARNING: Unable to retrieve the requested metadata.'
      15:27:20 INFO     apache_beam.utils.subprocess_server:subprocess_server.py:125 b'May 03, 2022 1:27:20 PM org.apache.beam.sdk.io.aws2.s3.DefaultS3ClientBuilderFactory createBuilder'
      15:27:20 INFO     apache_beam.utils.subprocess_server:subprocess_server.py:125 b"INFO: The AWS S3 Beam extension was included in this build, but the awsRegion flag was not specified. If you don't plan to use S3, then ignore this message."
      15:27:20 INFO     apache_beam.utils.subprocess_server:subprocess_server.py:125 b'May 03, 2022 1:27:21 PM org.apache.beam.runners.jobsubmission.JobServerDriver createArtifactStagingService'
      15:27:21 INFO     apache_beam.utils.subprocess_server:subprocess_server.py:125 b'INFO: ArtifactStagingService started on localhost:36631'
      15:27:21 INFO     apache_beam.utils.subprocess_server:subprocess_server.py:125 b'May 03, 2022 1:27:21 PM org.apache.beam.runners.jobsubmission.JobServerDriver createExpansionService'
      15:27:21 INFO     apache_beam.utils.subprocess_server:subprocess_server.py:125 b'INFO: Java ExpansionService started on localhost:35729'
      15:27:21 INFO     apache_beam.utils.subprocess_server:subprocess_server.py:125 b'May 03, 2022 1:27:21 PM org.apache.beam.runners.jobsubmission.JobServerDriver createJobServer'
      15:27:21 INFO     apache_beam.utils.subprocess_server:subprocess_server.py:125 b'INFO: JobService started on localhost:56343'
      15:27:21 INFO     apache_beam.utils.subprocess_server:subprocess_server.py:125 b'May 03, 2022 1:27:21 PM org.apache.beam.runners.jobsubmission.JobServerDriver run'
      15:27:21 INFO     apache_beam.utils.subprocess_server:subprocess_server.py:125 b'INFO: Job server now running, terminate with Ctrl+C'
      15:27:21 FATAL: command execution failed
      15:27:21 java.io.IOException: Backing channel 'apache-beam-jenkins-10' is disconnected.
      15:27:21     at hudson.remoting.RemoteInvocationHandler.channelOrFail(RemoteInvocationHandler.java:216)
      
      ...
      
      FATAL: command execution failed
      java.io.IOException: Backing channel 'apache-beam-jenkins-10' is disconnected.
        at hudson.remoting.RemoteInvocationHandler.channelOrFail(RemoteInvocationHandler.java:216)
        at hudson.remoting.RemoteInvocationHandler.invoke(RemoteInvocationHandler.java:286)
      
       

      Perhaps this was a random crash, or the worker got overloaded. Other suites running on the same worker at the same time:

      beam_BiqQueryIO_Streaming_Performance_Test_Java #3729
      beam_LoadTests_Java_CoGBK_Dataflow_V2_Streaming_Java17 #134
      beam_LoadTests_Python_GBK_Dataflow_Batch #1060

      also crashed, but at that moment those tests had already launched their Dataflow jobs and were only streaming log output. Only the beam_PostCommit_Python37 suite appeared to be running something intensive on the worker.

      Filing to see how frequently this happens.
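
      To help track frequency, here is a minimal sketch that scans saved Jenkins console logs for the disconnect signature quoted above and counts affected builds per worker. The function names and the build labels used as dictionary keys are hypothetical; how the console text is obtained is left out and assumed to be handled separately.

```python
import re
from collections import Counter

# Failure signature as it appears in the console log quoted in this issue.
DISCONNECT_RE = re.compile(
    r"java\.io\.IOException: Backing channel '([^']+)' is disconnected\."
)


def disconnected_workers(console_text):
    """Return the set of worker names reported as disconnected in one log."""
    return set(DISCONNECT_RE.findall(console_text))


def count_crashes_by_worker(console_logs):
    """Count, per worker, how many builds hit the disconnect at least once.

    console_logs maps a build label (hypothetical, chosen by the caller)
    to that build's console text.
    """
    counts = Counter()
    for text in console_logs.values():
        # A set is used so repeated stack traces in one build count once.
        counts.update(disconnected_workers(text))
    return counts


if __name__ == "__main__":
    excerpt = (
        "15:27:21 FATAL: command execution failed\n"
        "15:27:21 java.io.IOException: Backing channel "
        "'apache-beam-jenkins-10' is disconnected.\n"
    )
    print(count_crashes_by_worker({"beam_PostCommit_Python37 #5184": excerpt}))
```

      Counting each build once per worker keeps the repeated stack trace within a single console log from inflating the totals.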

          People

            Assignee: Unassigned
            Reporter: Valentyn Tymofieiev (tvalentyn)
            Votes: 0
            Watchers: 2
