James Server / JAMES-3676

S3 blobStore leaks HTTP connections when reading streams partially (even when closing them)


Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.6.0
    • Fix Version/s: 3.7.0, 3.6.1
    • Component/s: Blob

    Description

      Symptoms

      Recently I found these delightful logs on one of my production environments:

      {
        "timestamp": "2021-11-25T04:19:56.897Z",
        "level": "ERROR",
        "thread": "imapserver-executor-560",
        "mdc": {
          "fetchedData": "FetchData{flags=false, uid=true, internalDate=false, size=true, envelope=false, body=false, bodyStructure=false, setSeen=false, bodyElements=[BodyFetchElement{firstOctet=null, numberOfOctets=null, name=BODY[], sectionType=CONTENT, fieldNames=Optional.empty}], modSeq=false, changedSince=-1, vanished=false}",
          "protocol": "IMAP",
          "useUid": "true",
          "action": "FETCH",
          "sessionId": "SID-cdkbuyectzcc",
          "user": "xxx@linagora.com"
        },
        "logger": "org.apache.james.imap.processor.AbstractMailboxProcessor",
        "message": "Unexpected error during IMAP processing",
        "context": "default",
        "exception": "software.amazon.awssdk.core.exception.SdkClientException: Unable to execute HTTP request: Acquire operation took longer than the configured maximum time. This indicates that a request cannot get a connection from the pool within the specified maximum time. This can be due to high request rate.\nConsider taking any of the following actions to mitigate the issue: increase max connections, increase acquire timeout, or slowing the request rate.\nIncreasing the max connections can increase client throughput (unless the network interface is already fully utilized), but can eventually start to hit operation system limitations on the number of file descriptors used by the process. If you already are fully utilizing your network interface or cannot further increase your connection count, increasing the acquire timeout gives extra time for requests to acquire a connection before timing out. If the connections doesn't free up, the subsequent requests will still timeout.\nIf the above mechanisms are not able to fix the issue, try smoothing out your requests so that large traffic bursts cannot overload the client, being more efficient with the number of times you need to call AWS, or by increasing the number of hosts sending requests.\n\tat software.amazon.awssdk.core.exception.SdkClientException$BuilderImpl.build(SdkClientException.java:98)\n\tat software.amazon.awssdk.core.exception.SdkClientException.create(SdkClientException.java:43)\n\tat software.amazon.awssdk.core.internal.http.pipeline.stages.utils.RetryableStageHelper.setLastException(RetryableStageHelper.java:204)\n\tat software.amazon.awssdk.core.internal.http.pipeline.stages.utils.RetryableStageHelper.setLastException(RetryableStageHelper.java:200)\n\tat software.amazon.awssdk.core.internal.http.pipeline.stages.AsyncRetryableStage$RetryingExecutor.maybeRetryExecute(AsyncRetryableStage.java:179)\n\tat software.amazon.awssdk.core.internal.http.pipeline.stages.AsyncRetryableStage$RetryingExecutor.lambda$attemptExecute$1(AsyncRetryableStage.java:159)\n\tat java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(Unknown Source)\n\tat java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(Unknown Source)\n\tat java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown Source)\n\tat java.base/java.util.concurrent.CompletableFuture.completeExceptionally(Unknown Source)\n\tat software.amazon.awssdk.utils.CompletableFutureUtils.lambda$forwardExceptionTo$0(CompletableFutureUtils.java:76)\n\tat java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(Unknown Source)\n\tat java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(Unknown Source)\n\tat java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown Source)\n\tat java.base/java.util.concurrent.CompletableFuture.completeExceptionally(Unknown Source)\n\tat software.amazon.awssdk.core.internal.http.pipeline.stages.MakeAsyncHttpRequestStage.lambda$null$0(MakeAsyncHttpRequestStage.java:104)\n\tat java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(Unknown Source)\n\tat java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(Unknown Source)\n\tat java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown Source)\n\tat java.base/java.util.concurrent.CompletableFuture.completeExceptionally(Unknown Source)\n\tat 
software.amazon.awssdk.core.internal.http.pipeline.stages.MakeAsyncHttpRequestStage$WrappedErrorForwardingResponseHandler.onError(MakeAsyncHttpRequestStage.java:158)\n\tat software.amazon.awssdk.http.nio.netty.internal.NettyRequestExecutor.handleFailure(NettyRequestExecutor.java:300)\n\tat software.amazon.awssdk.http.nio.netty.internal.NettyRequestExecutor.makeRequestListener(NettyRequestExecutor.java:172)\n\tat io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:578)\n\tat io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:552)\n\tat io.netty.util.concurrent.DefaultPromise.access$200(DefaultPromise.java:35)\n\tat io.netty.util.concurrent.DefaultPromise$1.run(DefaultPromise.java:502)\n\tat io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)\n\tat io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)\n\tat io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)\n\tat io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)\n\tat io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)\n\tat java.base/java.lang.Thread.run(Unknown Source)\nCaused by: java.lang.Throwable: Acquire operation took longer than the configured maximum time. This indicates that a request cannot get a connection from the pool within the specified maximum time. This can be due to high request rate.\nConsider taking any of the following actions to mitigate the issue: increase max connections, increase acquire timeout, or slowing the request rate.\nIncreasing the max connections can increase client throughput (unless the network interface is already fully utilized), but can eventually start to hit operation system limitations on the number of file descriptors used by the process. If you already are fully utilizing your network interface or cannot further increase your connection count, increasing the acquire timeout gives extra time for requests to acquire a connection before timing out. If the connections doesn't free up, the subsequent requests will still timeout.\nIf the above mechanisms are not able to fix the issue, try smoothing out your requests so that large traffic bursts cannot overload the client, being more efficient with the number of times you need to call AWS, or by increasing the number of hosts sending requests.\n\tat software.amazon.awssdk.http.nio.netty.internal.NettyRequestExecutor.decorateException(NettyRequestExecutor.java:306)\n\tat software.amazon.awssdk.http.nio.netty.internal.NettyRequestExecutor.handleFailure(NettyRequestExecutor.java:299)\n\t... 11 common frames omitted\nCaused by: java.util.concurrent.TimeoutException: Acquire operation took longer than 10000 milliseconds.\n\tat software.amazon.awssdk.http.nio.netty.internal.HealthCheckedChannelPool.timeoutAcquire(HealthCheckedChannelPool.java:77)\n\tat software.amazon.awssdk.http.nio.netty.internal.HealthCheckedChannelPool.lambda$acquire$0(HealthCheckedChannelPool.java:67)\n\tat org.glowroot.agent.plugin.executor.RunnableWrapper.run(RunnableWrapper.java:35)\n\tat io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)\n\tat io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:170)\n\t... 6 common frames omitted\n"
      }
      

      TL;DR: the S3 driver cannot get an HTTP connection from the pool and fails randomly, which is very inconvenient.

      I attempted a reboot, which temporarily solved the issue, but we eventually ran out of HTTP connections again after some time...
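
      For reference, the first two mitigations the SDK error message suggests (a bigger pool, a longer acquire timeout) map to the AWS SDK v2 Netty client knobs below. This is only a hedged workaround sketch with illustrative values (James wires its S3 client from its own blob store configuration, and region/credentials here fall back to the default provider chain), not the fix, since the root cause turned out to be a leak:

        import java.time.Duration;

        import software.amazon.awssdk.http.async.SdkAsyncHttpClient;
        import software.amazon.awssdk.http.nio.netty.NettyNioAsyncHttpClient;
        import software.amazon.awssdk.services.s3.S3AsyncClient;

        class S3ClientTuningSketch {
            // Larger pool and longer acquire timeout: the two knobs named in the error message.
            static S3AsyncClient buildTunedClient() {
                SdkAsyncHttpClient httpClient = NettyNioAsyncHttpClient.builder()
                    .maxConcurrency(256)                                   // pool size (the SDK default is 50)
                    .connectionAcquisitionTimeout(Duration.ofSeconds(30))  // the log above shows the 10s default expiring
                    .build();

                // Region and credentials are resolved from the default provider chain here.
                return S3AsyncClient.builder()
                    .httpClient(httpClient)
                    .build();
            }
        }

      Of course, bumping the pool size only delays the failure when connections leak, which is what happened here.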

      Diagnosis

      This sounds like a leak to me!

      I quickly wrote a unit test reproducing the exact same issue by partially reading many blobs. Closing the resulting streams or not made no difference.
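
      Conceptually, the reproduction boils down to this (a minimal sketch rather than the actual test; it assumes roughly the BlobStore#read(bucketName, blobId) InputStream API backed by the S3 implementation, with blobStore, bucket and blobId set up beforehand):

        import java.io.InputStream;

        import org.apache.james.blob.api.BlobId;
        import org.apache.james.blob.api.BlobStore;
        import org.apache.james.blob.api.BucketName;

        class PartialReadLeakReproSketch {
            // Read only the beginning of the same blob many times, closing each stream.
            void readManyBlobsPartially(BlobStore blobStore, BucketName bucket, BlobId blobId) throws Exception {
                for (int i = 0; i < 1000; i++) {
                    try (InputStream in = blobStore.read(bucket, blobId)) {
                        in.read(new byte[128]); // partial read: most of the blob is left unread
                    }
                    // Before the fix, each iteration left one HTTP connection checked out of the
                    // S3 client's pool; once the pool is exhausted, reads fail with the
                    // "Acquire operation took longer than ..." error shown above.
                }
            }
        }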

      This pointed me to a defect within ReactorUtils.toInputStream: it was not cancelling the related publisher when the stream was closed; only full consumption released the subscription! Once again, an easy unit test properly demonstrates the issue.
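
      That unit test amounts to the following idea (a sketch only; the exact toInputStream signature, assumed here to take a Flux<ByteBuffer>, should be checked against org.apache.james.util.ReactorUtils):

        import java.io.InputStream;
        import java.nio.ByteBuffer;
        import java.util.concurrent.atomic.AtomicBoolean;

        import org.apache.james.util.ReactorUtils;
        import reactor.core.publisher.Flux;

        class ToInputStreamCancellationSketch {
            void closeShouldCancelTheUpstreamPublisher() throws Exception {
                AtomicBoolean cancelled = new AtomicBoolean(false);

                Flux<ByteBuffer> source = Flux.range(0, 10_000)
                    .map(i -> ByteBuffer.wrap(new byte[] {i.byteValue()}))
                    .doOnCancel(() -> cancelled.set(true)); // records upstream cancellation

                InputStream in = ReactorUtils.toInputStream(source);
                in.read();  // consume only a tiny prefix of the data
                in.close(); // before the fix, this did NOT cancel the subscription

                // Expected (and true after the fix): closing the stream cancels the publisher.
                assert cancelled.get();
            }
        }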

      Propagating the "close" from the SequenceInputStream back to the subscribed publisher cannot be done with an iterator as the intermediate representation, but adopting a Java 8 Stream instead allows propagating the closure and cancelling the subscription if need be. This approach solves both failing tests.
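
      Sketch of that idea (illustrative class, not the actual patch in ReactorUtils): Flux#toStream() hands back a java.util.stream.Stream whose close() cancels the subscription, so the InputStream only has to forward its own close() to the Stream:

        import java.io.IOException;
        import java.io.InputStream;
        import java.nio.ByteBuffer;
        import java.util.Iterator;
        import java.util.stream.Stream;

        import reactor.core.publisher.Flux;

        class StreamBackedInputStream extends InputStream {
            private final Stream<ByteBuffer> chunks;
            private final Iterator<ByteBuffer> iterator;
            private ByteBuffer current;

            StreamBackedInputStream(Flux<ByteBuffer> source) {
                this.chunks = source.toStream();   // closing this Stream cancels the Flux
                this.iterator = chunks.iterator();
                this.current = null;
            }

            @Override
            public int read() {
                while (current == null || !current.hasRemaining()) {
                    if (!iterator.hasNext()) {
                        return -1;                 // fully consumed
                    }
                    current = iterator.next();     // pull the next chunk lazily
                }
                return current.get() & 0xFF;
            }

            @Override
            public void close() throws IOException {
                chunks.close();                    // propagate close: cancels the publisher,
                                                   // which releases the pooled S3 connection
            }
        }

      Compared to Flux#toIterable(), whose Iterator exposes no close hook, the Stream's close() reaches back to the subscription.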

      Possible causes that "revealed" this issue include:

      • S3 driver update <3
      • Additional usage of the BlobStore stream API for JMAP blob uploads?

      Complementary work

      We might benefit from investing in tools for automatic leak detection.

      I found some Java agents for this: https://file-leak-detector.kohsuke.org/ could do the job!


          People

            Unassigned Unassigned
            btellier Benoit Tellier
            Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue


              Time Tracking

                Original Estimate: Not Specified
                Remaining Estimate: 0h
                Time Spent: 50m