IMPALA-11234

impalad keeps reporting ShortCircuitCache slot release failures under heavy workloads


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: Impala 4.1.0
    • Fix Version/s: Impala 4.2.0
    • Component/s: Backend
    • Labels: None

    Description

      I keep seeing this error during a local perf test on my desktop machine:

      E0410 07:04:10.691095   430 ShortCircuitCache.java:232] ShortCircuitCache(0x6e76c6a7): failed to release short-circuit shared memory slot Slot(slotIdx=0, shm=DfsClientShm(1effcf56a590fbc371938a368987f4e9)) by sending ReleaseShortCircuitAccessRequestProto to /var/lib/hadoop-hdfs/socket.31001.  Closing shared memory segment.
      Java exception follows:
      java.io.IOException: ERROR_INVALID: there is no shared memory segment registered with shmId 1effcf56a590fbc371938a368987f4e9
              at org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache$SlotReleaser.run(ShortCircuitCache.java:214)
              at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
              at java.util.concurrent.FutureTask.run(FutureTask.java:266)
              at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
              at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
              at java.lang.Thread.run(Thread.java:748)
       

      I can also find it in our Jenkins jobs, but only during the data-loading phase, so I suspect it only happens under heavy workloads.

      HDFS-14701 mentions that this error occurs when the DataNode is stopped or restarted. But I didn't restart my HDFS cluster, and I still see this error in the logs.
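
      For intuition, here is a hypothetical toy model (illustrative names only, not Hadoop's actual classes) of the DataNode-side bookkeeping that produces ERROR_INVALID: the DataNode tracks shared memory segments by shmId, and a release request naming a segment that is no longer registered is rejected.

      import java.util.Map;
      import java.util.concurrent.ConcurrentHashMap;

      // Toy model only: mimics why a slot release can fail with ERROR_INVALID.
      class ShmRegistryModel {
          private final Map<String, Object> segments = new ConcurrentHashMap<>();

          void register(String shmId) {
              segments.put(shmId, new Object());
          }

          // E.g. a DataNode restart (or an eviction) drops all registered segments.
          void dropAll() {
              segments.clear();
          }

          // Releasing a slot of an unknown shmId fails the way the log above does.
          void releaseSlot(String shmId, int slotIdx) {
              if (!segments.containsKey(shmId)) {
                  throw new IllegalStateException(
                      "ERROR_INVALID: there is no shared memory segment registered with shmId " + shmId);
              }
              // ... free slot slotIdx within the segment ...
          }
      }

      If the client's slot release races with whatever unregisters the segment on the DataNode side, the release would arrive after the segment is gone and fail like this even without a restart.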

      It's worth investigating whether we are doing something wrong in the short-circuit read code paths.
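
      For reference, a minimal sketch of the client-side configuration that enables short-circuit reads (assumptions: standard HDFS client settings; the socket path is taken from the log above, and the cache size is illustrative):

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;

      public class ShortCircuitReadSketch {
          public static void main(String[] args) throws Exception {
              Configuration conf = new Configuration();
              // Enable short-circuit (local) reads over a UNIX domain socket.
              conf.setBoolean("dfs.client.read.shortcircuit", true);
              conf.set("dfs.domain.socket.path", "/var/lib/hadoop-hdfs/socket.31001");
              // Client-side short-circuit replica cache; under heavy workloads many
              // cached slots get released concurrently. The size here is illustrative.
              conf.setInt("dfs.client.read.shortcircuit.streams.cache.size", 256);

              try (FileSystem fs = FileSystem.get(conf)) {
                  fs.open(new Path("/tmp/example")).close();
              }
          }
      }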


          People

            Assignee: Quanlong Huang (stigahuang)
            Reporter: Quanlong Huang (stigahuang)
            Votes: 0
            Watchers: 4

