  Kafka / KAFKA-4614

Long GC pauses harming broker performance, caused by mmap objects created for OffsetIndex

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.10.0.1
    • Fix Version/s: 0.10.2.0
    • Component/s: core

      Description

      First, let me describe our system environment, as I think it is important for understanding this issue:

      OS: CentOS6
      Kernel version: 2.6.32-XX
      Filesystem used for data volume: XFS
      Java version: 1.8.0_66
      GC option: Kafka default(G1GC)
      Kafka version: 0.10.0.1

      Phenomenon

      In our Kafka cluster, the usual response time for a Produce request ranges from about 1ms at the 50th percentile to 10ms at the 99th percentile. All topics are configured with 3 replicas and all producers are configured with acks=all, so this time includes replication latency.
      Sometimes we observe the 99th percentile latency increase to 100ms ~ 500ms, but in most cases the time-consuming part is "Remote", meaning it is caused by slow replication, which is known to happen for various reasons (also something we are trying to improve, but outside the scope of this issue).
      However, we found a different pattern which happens rarely but consistently, 3 ~ 5 times a day on each server. The interesting part is that "RequestQueue" increases as well as "Total" and "Remote".
      At the same time, we observed that the disk read metrics (both read bytes and read time) spike at exactly the same moment. Currently we have only caught-up consumers, so this metric normally stays at zero because all requests are served from the page cache.

      In order to investigate what Kafka is "read"ing, I employed SystemTap and wrote the following script. It traces all disk reads (only actual reads hitting the physical device) made against the data volume by the broker process.

      global target_pid = KAFKA_PID
      global target_dev = DATA_VOLUME
      
      probe ioblock.request {
        if (rw == BIO_READ && pid() == target_pid && devname == target_dev) {
           t_ms = gettimeofday_ms() + 9 * 3600 * 1000 // timezone adjustment
           printf("%s,%03d:  tid = %d, device = %s, inode = %d, size = %d\n", ctime(t_ms / 1000), t_ms % 1000, tid(), devname, ino, size)
           print_backtrace()
           print_ubacktrace()
        }
      }
      

      As a result, we observed many log entries like the one below:

      Thu Dec 22 17:21:39 2016,209:  tid = 126123, device = sdb1, inode = -1, size = 4096
       0xffffffff81275050 : generic_make_request+0x0/0x5a0 [kernel]
       0xffffffff81275660 : submit_bio+0x70/0x120 [kernel]
       0xffffffffa036bcaa : _xfs_buf_ioapply+0x16a/0x200 [xfs]
       0xffffffffa036d95f : xfs_buf_iorequest+0x4f/0xe0 [xfs]
       0xffffffffa036db46 : _xfs_buf_read+0x36/0x60 [xfs]
       0xffffffffa036dc1b : xfs_buf_read+0xab/0x100 [xfs]
       0xffffffffa0363477 : xfs_trans_read_buf+0x1f7/0x410 [xfs]
       0xffffffffa033014e : xfs_btree_read_buf_block+0x5e/0xd0 [xfs]
       0xffffffffa0330854 : xfs_btree_lookup_get_block+0x84/0xf0 [xfs]
       0xffffffffa0330edf : xfs_btree_lookup+0xbf/0x470 [xfs]
       0xffffffffa032456f : xfs_bmbt_lookup_eq+0x1f/0x30 [xfs]
       0xffffffffa032628b : xfs_bmap_del_extent+0x12b/0xac0 [xfs]
       0xffffffffa0326f34 : xfs_bunmapi+0x314/0x850 [xfs]
       0xffffffffa034ad79 : xfs_itruncate_extents+0xe9/0x280 [xfs]
       0xffffffffa0366de5 : xfs_inactive+0x2f5/0x450 [xfs]
       0xffffffffa0374620 : xfs_fs_clear_inode+0xa0/0xd0 [xfs]
       0xffffffff811affbc : clear_inode+0xac/0x140 [kernel]
       0xffffffff811b0776 : generic_delete_inode+0x196/0x1d0 [kernel]
       0xffffffff811b0815 : generic_drop_inode+0x65/0x80 [kernel]
       0xffffffff811af662 : iput+0x62/0x70 [kernel]
       0x37ff2e5347 : munmap+0x7/0x30 [/lib64/libc-2.12.so]
       0x7ff169ba5d47 : Java_sun_nio_ch_FileChannelImpl_unmap0+0x17/0x50 [/usr/jdk1.8.0_66/jre/lib/amd64/libnio.so]
       0x7ff269a1307e
      

      We took a jstack dump of the broker process and found that tid = 126123 corresponds to the thread which performs GC-related cleanup (hex(tid) == nid, the Native Thread ID):

      $ grep 0x1ecab /tmp/jstack.dump
      "Reference Handler" #2 daemon prio=10 os_prio=0 tid=0x00007ff278d0c800 nid=0x1ecab in Object.wait() [0x00007ff17da11000]
      
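      (As a side note, the tid reported by SystemTap is decimal while jstack prints the nid in hex; a tiny sketch of the conversion, with the tid above hard-coded purely for illustration:)

      public class TidToNid {
          public static void main(String[] args) {
              long tid = 126123;                                // tid reported by SystemTap (decimal)
              System.out.println("0x" + Long.toHexString(tid)); // prints 0x1ecab, the nid shown by jstack
          }
      }
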

      To confirm this, we enabled the PrintGCApplicationStoppedTime switch and verified that, in total, the time the broker is paused is longer than usual while a broker is experiencing the issue.

      $ grep --text 'Total time for which application threads were stopped' kafkaServer-gc.log  | grep --text '11T01:4' | perl -ne '/^(.*T\d{2}:\d{2}).*stopped: ([0-9\.]+)/; $h{$1} += $2 * 1000; END { print "$_ = $h{$_}\n" for sort keys %h }'
      2017-01-11T01:40 = 332.0601
      2017-01-11T01:41 = 315.1805
      2017-01-11T01:42 = 318.4317
      2017-01-11T01:43 = 317.8821
      2017-01-11T01:44 = 302.1132
      2017-01-11T01:45 = 950.5807 # << here
      2017-01-11T01:46 = 344.9449
      2017-01-11T01:47 = 328.936
      2017-01-11T01:48 = 296.3204
      2017-01-11T01:49 = 299.2834
      
      $ grep --text 'Total time for which application threads were stopped' kafkaServer-gc.log  | grep --text '11T01:45' | perl -ne '/^(.*T\d{2}:\d{2}:\d{2}).*stopped: ([0-9\.]+)/; $h{$1} += $2 * 1000; END { print "$_ = $h{$_}\n" for sort keys %h }' | head -20
      2017-01-11T01:45:00 = 31.1591
      2017-01-11T01:45:01 = 28.7814
      2017-01-11T01:45:02 = 30.0265
      2017-01-11T01:45:03 = 66.7318
      2017-01-11T01:45:04 = 20.6993
      2017-01-11T01:45:05 = 52.8268
      2017-01-11T01:45:06 = 29.6571
      2017-01-11T01:45:07 = 25.6357
      2017-01-11T01:45:08 = 58.6863
      2017-01-11T01:45:09 = 33.022
      2017-01-11T01:45:10 = 3.2497
      2017-01-11T01:45:11 = 9.916
      2017-01-11T01:45:12 = 19.7215
      2017-01-11T01:45:13 = 61.652
      2017-01-11T01:45:15 = 8.9058
      2017-01-11T01:45:16 = 16.7779
      2017-01-11T01:45:17 = 11.3215
      2017-01-11T01:45:18 = 82.2986
      2017-01-11T01:45:19 = 32.2718
      2017-01-11T01:45:20 = 1.1403
      

      Hypothesis

      For a Kafka broker running on Linux, the mmap object of an offset index is cleaned up as follows:

      • OffsetIndex.delete() is called and the corresponding file is marked as deleted at the VFS level (not physically deleted at this moment, because the mmap object still holds an open reference to the same file)
      • The OffsetIndex object is also cut off from any reference and becomes collectable garbage
      • Some time later, GC collects the mmap object (MappedByteBuffer) and runs its cleanup routine, which munmaps the mapping (see the sketch below)
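
      A minimal, self-contained sketch of this deferred-deletion behaviour (the path and class name are hypothetical, not Kafka code):

      import java.io.File;
      import java.io.RandomAccessFile;
      import java.nio.MappedByteBuffer;
      import java.nio.channels.FileChannel.MapMode;

      public class DeferredDeleteSketch {
          public static void main(String[] args) throws Exception {
              File f = new File("/tmp/deferred-delete-example");   // hypothetical path
              RandomAccessFile raf = new RandomAccessFile(f, "rw");
              raf.setLength(4096);
              MappedByteBuffer mmap = raf.getChannel().map(MapMode.READ_WRITE, 0, 4096);
              raf.close();

              f.delete();             // unlink: the name disappears from the directory...
              mmap.put(0, (byte) 1);  // ...but the mapping (and the inode) stay alive

              // Only when the MappedByteBuffer is collected by GC (or unmapped explicitly)
              // does the kernel physically free the inode, possibly after reading its
              // metadata from disk if it has been evicted from the cache.
              mmap = null;
          }
      }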

      Hence, munmap being called from the GC thread itself is normal.
      The real question is why releasing the mmap inside the GC thread can cause such a long STW pause.
      After a tough investigation, we found that the following scenario can explain it:

      1. Since the OffsetIndex object stays alive for a long time (depending on retention.ms?), it has likely already been promoted to the old generation.
      2. OffsetIndex.delete() is called so the file is marked as deleted. At the same time, the operation warms up the inode cache.
      3. However, since G1GC preferentially collects regions that contain a lot of garbage, a region containing the mmap object may be left uncollected for a long time if it still has many live objects.
      4. In the meantime, the inode cache entry for the index file is evicted from the XFS-layer cache.
      5. Eventually the mmap object is collected and munmap is called, which triggers the kernel to physically delete the inode associated with the file. But at this moment the metadata for the index file has already been evicted from the cache, so the XFS driver has to read it from disk. Since the data volume disks holding Kafka logs are very busy with the write workload, this disk read is a very time-consuming operation. During this time, application threads are blocked, and after they resume, all pending requests show slow response times.

      Experiment

      To confirm this scenario, I tried to reproduce a STW pause caused by GC freeing an mmap that corresponds to a file whose metadata is no longer in the cache.
      I wrote the Java code below and ran it several times while observing the behaviour.

      import java.io.File;
      import java.io.FileInputStream;
      import java.io.FileOutputStream;
      import java.io.IOException;
      import java.io.RandomAccessFile;
      import java.nio.MappedByteBuffer;
      import java.nio.channels.FileChannel.MapMode;
      import java.text.SimpleDateFormat;
      import java.util.Date;
      import java.util.concurrent.TimeUnit;
      
      import sun.nio.ch.DirectBuffer;
      
      /**
       * Test scenario:
       *
       * - Create N(given as an argument) number of files under the directory and corresponding N mmaps.
       * - Remove all files created.
       * - Sleep for a while, expecting the periodic XFS cache cleaner routine to run.
       * - Drop the VFS dentry cache
       * - Cut references to mmaps
       * - Keep creating objects to pressure the heap and lead GC to collect the mmap garbage.
       */
      public final class MmapGcTest {
          private static String filename(int num) {
              return "/path/to/data_volume/mmaps/mmap-" + num;
          }
      
          public static void main(String[] args) throws IOException, InterruptedException {
              int numFiles = Integer.parseInt(args[0]);
              boolean cleanMmapsExplicitly = Boolean.parseBoolean(args[1]);
      
              Thread.sleep(20000);
      
              MappedByteBuffer[] mmaps = new MappedByteBuffer[numFiles];
              for (int i = 0; i < numFiles; i++) {
                  RandomAccessFile file = new RandomAccessFile(filename(i), "rw");
                  file.setLength(128);
                  MappedByteBuffer mmap = file.getChannel().map(MapMode.READ_WRITE, 0, file.length());
                  file.close();
                  mmaps[i] = mmap;
              }
      
              System.err.println("removing files");
              for (int i = 0; i < numFiles; i++) {
                  boolean deleted = new File(filename(i)).delete();
                  if (!deleted) {
                      throw new RuntimeException("failed to delete file");
                  }
              }
      
              // Wait until XFS expires inode buffer cache
              final long timeToSleep;
              try (FileInputStream fin = new FileInputStream("/proc/sys/fs/xfs/xfssyncd_centisecs")) {
                  byte[] buf = new byte[256];
                  fin.read(buf);
                  timeToSleep = Long.parseLong(new String(buf).trim()) * 10 /* centisecs to millisecs */ + 1000 /* and little bit more */;
              }
              System.err.println("Sleeping " + timeToSleep + " ms for waiting XFS inode buffer expiration");
              Thread.sleep(timeToSleep);
      
              System.err.println("drop caches");
              try (FileOutputStream out = new FileOutputStream("/proc/sys/vm/drop_caches")) {
                  out.write("2\n".getBytes());
              }
      
              // Cut references to mmaps and let GC collect them
              if (cleanMmapsExplicitly) {
                  for (MappedByteBuffer mmap : mmaps) {
                      ((DirectBuffer) mmap).cleaner().clean();
                  }
                  System.err.println("mmaps cleaned up");
              }
              mmaps = null;
              System.err.println("cut reference to mmaps");
      
              SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS");
      
              // Keep creating relatively large objects and see if there is a significantly large
              // interval, which would suggest a GC pause.
              byte[][] chunks = new byte[32768][];
              for (int i = 0; i < Integer.MAX_VALUE; i++) {
                  long t0 = System.nanoTime();
                  byte[] chunk = new byte[4096];
                  long elapsedMillis = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - t0);
                  if (elapsedMillis > 100) {
                      System.err.printf("%s likely long GC pause: %d ms\n", sdf.format(new Date()), elapsedMillis);
                  }
                  chunks[i % chunks.length] = chunk;
              }
          }
      }
      

      While the above program was executing, I ran the following two commands in separate sessions:

      • The SystemTap script pasted above: to confirm the same disk access pattern (code path)
      • dd if=/dev/zero of=$DATA_VOLUME/huge-zero.out bs=1024 count=$((100 * 1024 * 1024)): to simulate the disk pressure seen on broker servers.

      Here are the logs from the execution:

      $ sudo java -Xmx1024m -server -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -XX:+DisableExplicitGC -Xloggc:gc.log -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps -XX:+PrintReferenceGC -XX:+PrintGCApplicationStoppedTime MmapGcTest 10000 false
      removing files
      Sleeping 31000 ms for waiting XFS inode buffer expiration
      drop caches
      cut reference to mmaps
      2017-01-06 20:52:58.195 likely long GC pause: 3372 ms
      2017-01-06 20:53:02.802 likely long GC pause: 2311 ms
      2017-01-06 20:53:04.275 likely long GC pause: 1467 ms
      2017-01-06 20:53:06.572 likely long GC pause: 2268 ms
      ...
      $ grep -A 30 20:53:02 stap.log | head -30
      Fri Jan  6 20:53:02 2017,024:  tid = 95825, device = XXX, inode = -1, size = 8192
       0xffffffff812746d0 : generic_make_request+0x0/0x5a0 [kernel]
       0xffffffff81274ce0 : submit_bio+0x70/0x120 [kernel]
       0xffffffffa042acba : _xfs_buf_ioapply+0x16a/0x200 [xfs]
       0xffffffffa042c96f : xfs_buf_iorequest+0x4f/0xe0 [xfs]
       0xffffffffa042cb56 : _xfs_buf_read+0x36/0x60 [xfs]
       0xffffffffa042cc2b : xfs_buf_read+0xab/0x100 [xfs]
       0xffffffffa0422487 : xfs_trans_read_buf+0x1f7/0x410 [xfs]
       0xffffffffa040a654 : xfs_imap_to_bp+0x54/0x130 [xfs]
       0xffffffffa040abdc : xfs_inotobp+0x8c/0xc0 [xfs]
       0xffffffffa040acf1 : xfs_iunlink_remove+0xe1/0x320 [xfs]
       0xffffffffa040af57 : xfs_ifree+0x27/0x110 [xfs]
       0xffffffffa0425c98 : xfs_inactive+0x198/0x450 [xfs]
       0xffffffffa0433630 : xfs_fs_clear_inode+0xa0/0xd0 [xfs]
       0xffffffff811af6fc : clear_inode+0xac/0x140 [kernel]
       0xffffffff811afeb6 : generic_delete_inode+0x196/0x1d0 [kernel]
       0xffffffff811aff55 : generic_drop_inode+0x65/0x80 [kernel]
       0xffffffff811aeda2 : iput+0x62/0x70 [kernel]
       0xffffffff811ab7a0 : dentry_iput+0x90/0x100 [kernel]
       0xffffffff811ab901 : d_kill+0x31/0x60 [kernel]
       0xffffffff811ad53d : dput+0xcd/0x160 [kernel]
       0x391d6e5347 : munmap+0x7/0x30 [/lib64/libc-2.12.so]
       0x7ff8d873dd47 : Java_sun_nio_ch_FileChannelImpl_unmap0+0x17/0x50 [/usr/jdk1.8.0_92/jre/lib/amd64/libnio.so]
       0x7ff9893376fe
      ...
      
      $ grep 20:53:02 gc.log | grep -E 'Total time for which application threads were stopped: [^0]\.'
      2017-01-06T20:53:02.802+0900: 60.898: Total time for which application threads were stopped: 2.3118335 seconds, Stopping threads took: 2.3054883 seconds
      
      $ grep 0x17651 jstack-case1
      "Reference Handler" #2 daemon prio=10 os_prio=0 tid=0x00007ff9982b6000 nid=0x17651 in Object.wait() [0x00007ff8da12f000]
      

      As you can see, GC blocks the application for an unusually long time, and at the same moment the same code path shows XFS reading metadata from the physical disk.
      Next, I switched the second argument of the program to true and enforced mmap cleanup within the main thread:

      $ sudo java -Xmx1024m -server -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -XX:+DisableExplicitGC -Xloggc:gc.log -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps -XX:+PrintReferenceGC -XX:+PrintGCApplicationStoppedTime MmapGcTest 10000 true
      removing files
      Sleeping 31000 ms for waiting XFS inode buffer expiration
      drop caches
      mmaps cleaned up
      cut reference to mmaps
      
      $ tail -24 stap.log
      Fri Jan  6 21:06:53 2017,166:  tid = 118226, device = XXX, inode = -1, size = 4096
       0xffffffff812746d0 : generic_make_request+0x0/0x5a0 [kernel]
       0xffffffff81274ce0 : submit_bio+0x70/0x120 [kernel]
       0xffffffffa042acba : _xfs_buf_ioapply+0x16a/0x200 [xfs]
       0xffffffffa042c96f : xfs_buf_iorequest+0x4f/0xe0 [xfs]
       0xffffffffa042cb56 : _xfs_buf_read+0x36/0x60 [xfs]
       0xffffffffa042cc2b : xfs_buf_read+0xab/0x100 [xfs]
       0xffffffffa0422487 : xfs_trans_read_buf+0x1f7/0x410 [xfs]
       0xffffffffa03ef15e : xfs_btree_read_buf_block+0x5e/0xd0 [xfs]
       0xffffffffa03ef864 : xfs_btree_lookup_get_block+0x84/0xf0 [xfs]
       0xffffffffa03efeef : xfs_btree_lookup+0xbf/0x470 [xfs]
       0xffffffffa03d6fd9 : xfs_alloc_lookup_eq+0x19/0x20 [xfs]
       0xffffffffa03d7dd4 : xfs_free_ag_extent+0x1c4/0x750 [xfs]
       0xffffffffa03da111 : xfs_free_extent+0x101/0x130 [xfs]
       0xffffffffa03e3fcd : xfs_bmap_finish+0x15d/0x1a0 [xfs]
       0xffffffffa0425e97 : xfs_inactive+0x397/0x450 [xfs]
       0xffffffffa0433630 : xfs_fs_clear_inode+0xa0/0xd0 [xfs]
       0xffffffff811af6fc : clear_inode+0xac/0x140 [kernel]
       0xffffffff811afeb6 : generic_delete_inode+0x196/0x1d0 [kernel]
       0xffffffff811aff55 : generic_drop_inode+0x65/0x80 [kernel]
       0xffffffff811aeda2 : iput+0x62/0x70 [kernel]
       0x391d6e5347 : munmap+0x7/0x30 [/lib64/libc-2.12.so]
       0x7fc11ca7ad47 : Java_sun_nio_ch_FileChannelImpl_unmap0+0x17/0x50 [/usr/jdk1.8.0_92/jre/lib/amd64/libnio.so]
       0x7fc1e90174f4
      
      $ grep 0x1cdd2 jstack-case2
      "main" #1 prio=5 os_prio=0 tid=0x00007fc1f8008800 nid=0x1cdd2 runnable [0x00007fc1fe8dd000]
      
      $ grep -E 'Total time for which application threads were stopped: [^0]\.' gc.log | wc -l
      0
      

      Of course, the same code path where XFS reads from the physical disk appeared again, but this time it came from the main thread.
      This time, no GC occurrence blocked the application threads for more than 100ms, which can be confirmed from both the program's log and the GC log.

      Suggestion

      To solve this problem, we can call DirectBuffer.cleaner().clean() explicitly within the log cleaner thread, so that GC time is no longer affected by transparent disk reads for metadata.
      The only downside I can imagine is that this relies on a hidden API which is not recommended for direct use, but we already import it in order to support Windows, so this should not be a concern.
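
      (A minimal Java 8 sketch of this kind of explicit unmapping, assuming sun.misc.Cleaner is available; the class and method names are illustrative, not Kafka's actual code:)

      import java.nio.MappedByteBuffer;
      import sun.nio.ch.DirectBuffer;

      final class MmapUtil {
          /**
           * Releases the memory mapping backing the buffer on the calling thread, so that
           * munmap (and any resulting physical inode deletion) does not happen inside a GC
           * pause. The buffer must never be accessed again afterwards, or the JVM may crash.
           */
          static void forceUnmap(MappedByteBuffer buffer) {
              if (buffer instanceof DirectBuffer) {
                  sun.misc.Cleaner cleaner = ((DirectBuffer) buffer).cleaner();
                  if (cleaner != null) {
                      cleaner.clean();
                  }
              }
          }
      }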


          Activity

          ijuma Ismael Juma added a comment -

          Thanks Uwe Schindler. We are aware of the issue and we are tracking it in KAFKA-4501. You will notice that we have a link to the Lucene fix there already.

          junrao Jun Rao added a comment -

          Uwe Schindler, thanks for the info. Much appreciated!

          thetaphi Uwe Schindler added a comment -

          Just to inform you: Cleaner is no longer available in Java 9, so you need another hack. In cooperation with Apache Lucene, Oracle added a workaround that works differently for Java 9, as a temporary measure until unmapping is supported in Java 10. I am traveling to FOSDEM at the moment, where I will talk about this and meet Mark Reinhold to discuss the Java 10 solution.

          The Lucene issue with a nonreflective MethodHandle approach for unmapping in Java 7, 8 and 9 can be found here: LUCENE-6989
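
          (For reference, a hedged sketch of one JDK 9+ alternative: sun.misc.Unsafe.invokeCleaner(ByteBuffer), invoked reflectively here. This is illustrative only and is not the Lucene MethodHandle approach referenced above.)

          import java.lang.reflect.Field;
          import java.lang.reflect.Method;
          import java.nio.ByteBuffer;

          final class Java9Unmapper {
              // Unmaps a direct/mapped buffer on JDK 9+, where sun.misc.Cleaner no longer exists.
              static void unmap(ByteBuffer buffer) throws Exception {
                  Field f = Class.forName("sun.misc.Unsafe").getDeclaredField("theUnsafe");
                  f.setAccessible(true);
                  Object unsafe = f.get(null);
                  Method invokeCleaner = unsafe.getClass().getMethod("invokeCleaner", ByteBuffer.class);
                  invokeCleaner.invoke(unsafe, buffer);
              }
          }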

          guozhang Guozhang Wang added a comment -

          Yuto Kawamura Thanks for sharing!!

          kawamuray Yuto Kawamura added a comment - - edited

          Just as a follow-up, the picture below shows the before and after of the 99th percentile Produce request response time in our production cluster, after applying the patch to all brokers.

          Thank you all

          githubbot ASF GitHub Bot added a comment -

          Github user asfgit closed the pull request at:

          https://github.com/apache/kafka/pull/2352

          junrao Jun Rao added a comment -

          Issue resolved by pull request 2352
          https://github.com/apache/kafka/pull/2352

          ijuma Ismael Juma added a comment -

          Ewen Cheslack-Postava, sure. I assigned Jun Rao as the reviewer since the fix itself is trivial, but it would be good for him to take a look. I'll do a pass through all unfixed Core JIRAs still assigned to 0.10.2.0 towards the end of the week to ensure that we either make progress or drop them from the release.

          ewencp Ewen Cheslack-Postava added a comment -

          Ismael Juma Seems like you started tracking work on this – can you assign a reviewer (whether yourself or someone else, even non-committer) that can help shepherd this for 0.10.2.0 since it sounds like folks are looking to land a fix before the code freeze?

          guozhang Guozhang Wang added a comment -

          Thanks for this report and thorough analysis Yuto Kawamura!

          kawamuray Yuto Kawamura added a comment -

          I applied the patch to one of our production brokers, deployed it, and observed it for over 24 hours.
          As a result, this issue disappeared completely. As I wrote in the description, it was happening at least 3 ~ 4 times a day, so I think we can say the patch solved the problem.

          Ismael Juma I filed a PR. Please review if possible.

          huxihx
          Thanks for your comments.

          > Kafka already offers a method named 'forceUnmap' that actually does this kind of thing
          this is exactly what I did

          > If others use MappedByteBuffer later, it will crash.
          Right. I think it is a somewhat dangerous solution, but I still think it's safe because access to the mmap object is limited to AbstractIndex and its subclasses. Indeed, we need to be careful when changing things around there, though.

          > Portability risk, although sun.misc.Cleaner is present in most of major JVM vendors.
          Right, but this class is already imported to support Windows, so I don't think this is a problem.

          > System.gc()
          I think this doesn't work at all. AFAIK:

          • System.gc() does not guarantee (the JVM only makes a "best effort") that all collectable garbage is actually collected.
          • Whether GC is triggered manually or not, it blocks application threads for the same reason.

          Michael Noll Thanks. The investigation was hard but was a great opportunity to learn many new things

          githubbot ASF GitHub Bot added a comment -

          GitHub user kawamuray opened a pull request:

          https://github.com/apache/kafka/pull/2352

          KAFKA-4614 Forcefully unmap mmap of OffsetIndex to prevent long GC pause

          Issue: https://issues.apache.org/jira/browse/KAFKA-4614

          Fixes the problem of broker threads suffering from long GC pauses.
          When the GC thread collects mmap objects which were created for index files, it unmaps the memory mapping, so the kernel proceeds to delete the file physically. This work may transparently read the file's metadata from the physical disk if it is not available in the cache.
          This seems to happen typically when using G1GC, due to its strategy of leaving garbage uncollected for a long time if other objects in the same region are still alive.
          See the link for the details.

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/kawamuray/kafka KAFKA-4614-force-munmap-for-index

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/kafka/pull/2352.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #2352


          commit 8c3a4c9f5a6188c641a45cb3e111094f545f7e66
          Author: Yuto Kawamura <kawamuray.dadada@gmail.com>
          Date: 2017-01-11T08:10:05Z

          KAFKA-4614 Forcefully unmap mmap of OffsetIndex to prevent long GC pause


          miguno Michael Noll added a comment -

          Yuto Kawamura Sorry to hear that you ran into this. Thanks a lot for the thorough investigation!

          huxi_2b huxihx added a comment - - edited

          Great catch on the whole thing.

          As for the unmap issue, it seems there is no perfect solution at this moment. Java does not offer a corresponding unmap that invokes the underlying munmap system call for the mapped file. See the JDK bug.

          Right now there exist two "not-that-perfect" solutions:
          1. Explicitly unmap the mapped file as Yuto Kawamura said, via invoking 'DirectBuffer.cleaner().clean()'. By the way, Kafka already offers a method named 'forceUnmap' that actually does this kind of thing. We could simply reuse this method. However, this solution has two drawbacks: 1. If others use the MappedByteBuffer later, it will crash. 2. Portability risk, although sun.misc.Cleaner is present in most major JVM vendors.
          2. Explicitly invoke 'System.gc()', which is also a not-good-enough idea, especially when -XX:+DisableExplicitGC is set.

          ijuma Ismael Juma added a comment -

          Yuto Kawamura, thanks for reporting this. Looking forward to your results.

          kawamuray Yuto Kawamura added a comment - - edited

          I'm trying to fix this and am now testing a patched broker in our environment.
          I will submit a PR once it is confirmed to fix this problem.


            People

            • Assignee:
              kawamuray Yuto Kawamura
              Reporter:
              kawamuray Yuto Kawamura
              Reviewer:
              Jun Rao
            • Votes: 0
              Watchers: 19
