FLINK-25426

UnalignedCheckpointRescaleITCase.shouldRescaleUnalignedCheckpoint fails on AZP because it cannot allocate enough network buffers


Details

    Description

      The test UnalignedCheckpointRescaleITCase.shouldRescaleUnalignedCheckpoint fails on AZP with the following error:

      2021-12-23T02:54:46.2862342Z Dec 23 02:54:46 [ERROR] UnalignedCheckpointRescaleITCase.shouldRescaleUnalignedCheckpoint  Time elapsed: 2.992 s  <<< ERROR!
      2021-12-23T02:54:46.2865774Z Dec 23 02:54:46 java.lang.OutOfMemoryError: Could not allocate enough memory segments for NetworkBufferPool (required (Mb): 64, allocated (Mb): 14, missing (Mb): 50). Cause: Direct buffer memory. The direct out-of-memory error has occurred. This can mean two things: either job(s) require(s) a larger size of JVM direct memory or there is a direct memory leak. The direct memory can be allocated by user code or some of its dependencies. In this case 'taskmanager.memory.task.off-heap.size' configuration option should be increased. Flink framework and its dependencies also consume the direct memory, mostly for network communication. The most of network memory is managed by Flink and should not result in out-of-memory error. In certain special cases, in particular for jobs with high parallelism, the framework may require more direct memory which is not managed by Flink. In this case 'taskmanager.memory.framework.off-heap.size' configuration option should be increased. If the error persists then there is probably a direct memory leak in user code or some of its dependencies which has to be investigated and fixed. The task executor has to be shutdown...
      2021-12-23T02:54:46.2868239Z Dec 23 02:54:46 	at org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.<init>(NetworkBufferPool.java:138)
      2021-12-23T02:54:46.2868975Z Dec 23 02:54:46 	at org.apache.flink.runtime.io.network.NettyShuffleServiceFactory.createNettyShuffleEnvironment(NettyShuffleServiceFactory.java:140)
      2021-12-23T02:54:46.2869771Z Dec 23 02:54:46 	at org.apache.flink.runtime.io.network.NettyShuffleServiceFactory.createNettyShuffleEnvironment(NettyShuffleServiceFactory.java:94)
      2021-12-23T02:54:46.2870550Z Dec 23 02:54:46 	at org.apache.flink.runtime.io.network.NettyShuffleServiceFactory.createShuffleEnvironment(NettyShuffleServiceFactory.java:79)
      2021-12-23T02:54:46.2871312Z Dec 23 02:54:46 	at org.apache.flink.runtime.io.network.NettyShuffleServiceFactory.createShuffleEnvironment(NettyShuffleServiceFactory.java:58)
      2021-12-23T02:54:46.2872062Z Dec 23 02:54:46 	at org.apache.flink.runtime.taskexecutor.TaskManagerServices.createShuffleEnvironment(TaskManagerServices.java:414)
      2021-12-23T02:54:46.2872767Z Dec 23 02:54:46 	at org.apache.flink.runtime.taskexecutor.TaskManagerServices.fromConfiguration(TaskManagerServices.java:282)
      2021-12-23T02:54:46.2873436Z Dec 23 02:54:46 	at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.startTaskManager(TaskManagerRunner.java:523)
      2021-12-23T02:54:46.2877615Z Dec 23 02:54:46 	at org.apache.flink.runtime.minicluster.MiniCluster.startTaskManager(MiniCluster.java:645)
      2021-12-23T02:54:46.2878247Z Dec 23 02:54:46 	at org.apache.flink.runtime.minicluster.MiniCluster.startTaskManagers(MiniCluster.java:626)
      2021-12-23T02:54:46.2878856Z Dec 23 02:54:46 	at org.apache.flink.runtime.minicluster.MiniCluster.start(MiniCluster.java:379)
      2021-12-23T02:54:46.2879487Z Dec 23 02:54:46 	at org.apache.flink.runtime.testutils.MiniClusterResource.startMiniCluster(MiniClusterResource.java:209)
      2021-12-23T02:54:46.2880152Z Dec 23 02:54:46 	at org.apache.flink.runtime.testutils.MiniClusterResource.before(MiniClusterResource.java:95)
      2021-12-23T02:54:46.2880821Z Dec 23 02:54:46 	at org.apache.flink.test.util.MiniClusterWithClientResource.before(MiniClusterWithClientResource.java:64)
      2021-12-23T02:54:46.2881519Z Dec 23 02:54:46 	at org.apache.flink.test.checkpointing.UnalignedCheckpointTestBase.execute(UnalignedCheckpointTestBase.java:151)
      2021-12-23T02:54:46.2882310Z Dec 23 02:54:46 	at org.apache.flink.test.checkpointing.UnalignedCheckpointRescaleITCase.shouldRescaleUnalignedCheckpoint(UnalignedCheckpointRescaleITCase.java:534)
      2021-12-23T02:54:46.2882978Z Dec 23 02:54:46 	at jdk.internal.reflect.GeneratedMethodAccessor123.invoke(Unknown Source)
      2021-12-23T02:54:46.2883574Z Dec 23 02:54:46 	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      2021-12-23T02:54:46.2884171Z Dec 23 02:54:46 	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
      2021-12-23T02:54:46.2884732Z Dec 23 02:54:46 	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
      2021-12-23T02:54:46.2885527Z Dec 23 02:54:46 	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
      2021-12-23T02:54:46.2886135Z Dec 23 02:54:46 	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
      2021-12-23T02:54:46.2886755Z Dec 23 02:54:46 	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
      2021-12-23T02:54:46.2887387Z Dec 23 02:54:46 	at org.junit.rules.Verifier$1.evaluate(Verifier.java:35)
      2021-12-23T02:54:46.2887892Z Dec 23 02:54:46 	at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:61)
      2021-12-23T02:54:46.2888435Z Dec 23 02:54:46 	at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:54)
      2021-12-23T02:54:46.2889007Z Dec 23 02:54:46 	at org.apache.flink.util.TestNameProvider$1.evaluate(TestNameProvider.java:45)
      2021-12-23T02:54:46.2889568Z Dec 23 02:54:46 	at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:61)
      2021-12-23T02:54:46.2890104Z Dec 23 02:54:46 	at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
      2021-12-23T02:54:46.2890686Z Dec 23 02:54:46 	at org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
      2021-12-23T02:54:46.2891259Z Dec 23 02:54:46 	at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366)
      2021-12-23T02:54:46.2891819Z Dec 23 02:54:46 	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
      2021-12-23T02:54:46.2892421Z Dec 23 02:54:46 	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
      2021-12-23T02:54:46.2892978Z Dec 23 02:54:46 	at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
      2021-12-23T02:54:46.2893508Z Dec 23 02:54:46 	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
      2021-12-23T02:54:46.2894049Z Dec 23 02:54:46 	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
      2021-12-23T02:54:46.2894588Z Dec 23 02:54:46 	at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
      2021-12-23T02:54:46.2895203Z Dec 23 02:54:46 	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
      2021-12-23T02:54:46.2895721Z Dec 23 02:54:46 	at org.junit.runners.ParentRunner.run(ParentRunner.java:413)
      2021-12-23T02:54:46.2896304Z Dec 23 02:54:46 	at org.junit.runners.Suite.runChild(Suite.java:128)
      2021-12-23T02:54:46.2896781Z Dec 23 02:54:46 	at org.junit.runners.Suite.runChild(Suite.java:27)
      2021-12-23T02:54:46.2897359Z Dec 23 02:54:46 	at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
      2021-12-23T02:54:46.2897892Z Dec 23 02:54:46 	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
      2021-12-23T02:54:46.2898429Z Dec 23 02:54:46 	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
      2021-12-23T02:54:46.2898968Z Dec 23 02:54:46 	at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
      2021-12-23T02:54:46.2899487Z Dec 23 02:54:46 	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
      2021-12-23T02:54:46.2900025Z Dec 23 02:54:46 	at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
      2021-12-23T02:54:46.2900542Z Dec 23 02:54:46 	at org.junit.runners.ParentRunner.run(ParentRunner.java:413)
      2021-12-23T02:54:46.2901044Z Dec 23 02:54:46 	at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
      2021-12-23T02:54:46.2901540Z Dec 23 02:54:46 	at org.junit.runner.JUnitCore.run(JUnitCore.java:115)
      2021-12-23T02:54:46.2902086Z Dec 23 02:54:46 	at org.junit.vintage.engine.execution.RunnerExecutor.execute(RunnerExecutor.java:42)
      2021-12-23T02:54:46.2902702Z Dec 23 02:54:46 	at org.junit.vintage.engine.VintageTestEngine.executeAllChildren(VintageTestEngine.java:80)
      2021-12-23T02:54:46.2903297Z Dec 23 02:54:46 	at org.junit.vintage.engine.VintageTestEngine.execute(VintageTestEngine.java:72)
      2021-12-23T02:54:46.2903944Z Dec 23 02:54:46 	at org.junit.platform.launcher.core.EngineExecutionOrchestrator.execute(EngineExecutionOrchestrator.java:107)
      2021-12-23T02:54:46.2904712Z Dec 23 02:54:46 	at org.junit.platform.launcher.core.EngineExecutionOrchestrator.execute(EngineExecutionOrchestrator.java:88)
      2021-12-23T02:54:46.2905493Z Dec 23 02:54:46 	at org.junit.platform.launcher.core.EngineExecutionOrchestrator.lambda$execute$0(EngineExecutionOrchestrator.java:54)
      2021-12-23T02:54:46.2906245Z Dec 23 02:54:46 	at org.junit.platform.launcher.core.EngineExecutionOrchestrator.withInterceptedStreams(EngineExecutionOrchestrator.java:67)
      2021-12-23T02:54:46.2906968Z Dec 23 02:54:46 	at org.junit.platform.launcher.core.EngineExecutionOrchestrator.execute(EngineExecutionOrchestrator.java:52)
      2021-12-23T02:54:46.2907692Z Dec 23 02:54:46 	at org.junit.platform.launcher.core.DefaultLauncher.execute(DefaultLauncher.java:114)
      2021-12-23T02:54:46.2908303Z Dec 23 02:54:46 	at org.junit.platform.launcher.core.DefaultLauncher.execute(DefaultLauncher.java:86)
      2021-12-23T02:54:46.2908971Z Dec 23 02:54:46 	at org.junit.platform.launcher.core.DefaultLauncherSession$DelegatingLauncher.execute(DefaultLauncherSession.java:86)
      2021-12-23T02:54:46.2909664Z Dec 23 02:54:46 	at org.junit.platform.launcher.core.SessionPerRequestLauncher.execute(SessionPerRequestLauncher.java:53)
      2021-12-23T02:54:46.2910347Z Dec 23 02:54:46 	at org.apache.maven.surefire.junitplatform.JUnitPlatformProvider.execute(JUnitPlatformProvider.java:188)
      2021-12-23T02:54:46.2911042Z Dec 23 02:54:46 	at org.apache.maven.surefire.junitplatform.JUnitPlatformProvider.invokeAllTests(JUnitPlatformProvider.java:154)
      2021-12-23T02:54:46.2911743Z Dec 23 02:54:46 	at org.apache.maven.surefire.junitplatform.JUnitPlatformProvider.invoke(JUnitPlatformProvider.java:124)
      2021-12-23T02:54:46.2912399Z Dec 23 02:54:46 	at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:428)
      2021-12-23T02:54:46.2913009Z Dec 23 02:54:46 	at org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:162)
      2021-12-23T02:54:46.2913589Z Dec 23 02:54:46 	at org.apache.maven.surefire.booter.ForkedBooter.run(ForkedBooter.java:562)
      2021-12-23T02:54:46.2914162Z Dec 23 02:54:46 	at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:548)
      

      https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=28502&view=logs&j=2c3cbe13-dee0-5837-cf47-3053da9a8a78&t=b78d9d30-509a-5cea-1fef-db7abaa325ae&l=14634

      The instability may be caused by running too many tests concurrently on the CI machines, so that together they exceed the available memory.
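
      One way to examine this hypothesis outside of Flink is to look at how much direct memory a JVM can actually hand out. The following is a minimal sketch, not Flink code: the class DirectMemoryProbe and its helpers are hypothetical, and the 32 KiB segment size is an assumption about the configured memory segment size. It allocates direct buffers segment by segment, stops on the first OutOfMemoryError, and reports the result in the same required/allocated/missing form as the error message above.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical diagnostic helper; it is not part of Flink.
final class DirectMemoryProbe {

    // Number of whole segments needed to cover totalBytes.
    static long segmentsFor(long totalBytes, int segmentSize) {
        return totalBytes / segmentSize;
    }

    // Converts bytes to whole megabytes (rounding down).
    static long toMb(long bytes) {
        return bytes >> 20;
    }

    // Allocates direct buffers one segment at a time and stops at the first
    // OutOfMemoryError. The returned list keeps the buffers reachable.
    static List<ByteBuffer> allocate(long requiredBytes, int segmentSize) {
        long wanted = segmentsFor(requiredBytes, segmentSize);
        List<ByteBuffer> segments = new ArrayList<>();
        try {
            for (long i = 0; i < wanted; i++) {
                segments.add(ByteBuffer.allocateDirect(segmentSize));
            }
        } catch (OutOfMemoryError e) {
            // Direct memory limit reached; report what was obtained so far.
            System.out.println("Allocation stopped: " + e.getMessage());
        }
        return segments;
    }

    // Formats the numbers in the same style as the error message in the log.
    static String report(long requiredBytes, long allocatedBytes) {
        return String.format(
                Locale.ROOT,
                "required (Mb): %d, allocated (Mb): %d, missing (Mb): %d",
                toMb(requiredBytes),
                toMb(allocatedBytes),
                toMb(requiredBytes - allocatedBytes));
    }

    public static void main(String[] args) {
        // Show how much direct and mapped buffer memory this JVM already uses.
        for (BufferPoolMXBean pool :
                ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            System.out.printf(
                    Locale.ROOT,
                    "%s: used=%d bytes, capacity=%d bytes%n",
                    pool.getName(), pool.getMemoryUsed(), pool.getTotalCapacity());
        }

        long required = 64L << 20;   // 64 MB, as in the failing run
        int segmentSize = 32 << 10;  // 32 KiB, assumed segment size
        List<ByteBuffer> segments = allocate(required, segmentSize);
        System.out.println(report(required, (long) segments.size() * segmentSize));
    }
}
```

      Running the sketch with a small direct memory limit, for example with the JVM option -XX:MaxDirectMemorySize=16m, should make the allocation stop early. That would only mimic the limit inside a single JVM; it does not by itself show that concurrent test forks on the CI machine were the cause.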

            People

              akalashnikov Anton Kalashnikov
              trohrmann Till Rohrmann