  Kafka / KAFKA-1646

Improve consumer read performance for Windows

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.8.1.1
    • Fix Version/s: 0.9.0.0
    • Component/s: log
    • Labels:
    • Environment:
      Windows

      Description

      This patch is for the Windows platform only. On Windows, when more than one replica is writing to disk, the segment log files become fragmented on disk and consumer read performance drops sharply. This fix allocates more disk space when rolling a new segment, which improves consumer read performance on the NTFS file system.
      This patch doesn't affect file allocation on other filesystems, since it only adds statements like 'if(Os.isWindows)' or methods used only on Windows.
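      The idea can be sketched as follows (a minimal illustration only; the class name, method name, and the explicit boolean flag are illustrative, not the actual patch code): when rolling a new segment, grow the file to its full target size up front so NTFS allocates the blocks eagerly and contiguously, instead of linking whatever free blocks are available as writes trickle in.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

/**
 * Sketch of preallocation on segment roll (hypothetical names, not the
 * actual Kafka code). With preallocate=true, the file is grown to the
 * full segment size immediately, which on NTFS reserves the blocks
 * up front and keeps later sequential reads fast.
 */
class PreallocateSketch {
    static RandomAccessFile openSegment(File f, long initFileSize, boolean preallocate)
            throws IOException {
        RandomAccessFile raf = new RandomAccessFile(f, "rw");
        if (preallocate) {
            raf.setLength(initFileSize); // reserve the whole segment eagerly
        }
        return raf;
    }
}
```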

      1. Improve consumer read performance for Windows.patch
        11 kB
        xueqiang wang
      2. KAFKA-1646_20141216_163008.patch
        10 kB
        Qianlin Xia
      3. KAFKA-1646_20150306_005526.patch
        12 kB
        Honghai Chen
      4. KAFKA-1646_20150511_AddTestcases.patch
        25 kB
        Honghai Chen
      5. KAFKA-1646_20150609_MergeToLatestTrunk.patch
        25 kB
        Honghai Chen
      6. KAFKA-1646_20150616_FixFormat.patch
        24 kB
        Honghai Chen
      7. KAFKA-1646_20150618_235231.patch
        24 kB
        Honghai Chen
      8. KAFKA-1646-truncate-off-trailing-zeros-on-broker-restart-if-bro.patch
        1 kB
        xueqiang wang

        Issue Links

          Activity

          xueqiang xueqiang wang added a comment -

          This patch doesn't affect file allocation on other filesystems, since it only adds statements like 'if(Os.isWindows)' or methods used only on Windows.

          jkreps Jay Kreps added a comment -

          Can you describe the problem in a little more detail?

          xueqiang xueqiang wang added a comment -

          This issue is caused by the discrete blocks of a segment log file on the Windows NTFS file system. Unlike Linux, Windows doesn't allocate a large extent on disk when creating a new file; it just finds free blocks as new data comes in and links them together. After a lot of segment log deleting and creating, the log blocks may be spread all over the disk, so when a consumer reads data from the disk, performance drops.

          jkreps Jay Kreps added a comment - - edited

          Ah, you are saying Windows does a worse job of preallocation? How much does this help? Did you do any benchmarking on the performance improvement?

          xueqiang xueqiang wang added a comment -

          Yes, with heavy deleting and creating, Windows can't preallocate well. We have run a cluster for more than a month and found that when a consumer reads historical logs (for example, from the earliest offset), performance drops to only 40% of what it was a month ago. With the fix, performance stays stable.

          jkreps Jay Kreps added a comment -

          Interesting. What is the behavior of RandomAccessFile.setLength(size) when setting a size larger than the current file size on Windows? Does it fully preallocate all the blocks up to the full length? If so does that cause any kind of pause if you are allocating like 1GB? On Linux it does "sparse" allocation which doesn't actually assign any blocks it just fills it in with fake zeros--so that happens very quickly but doesn't help with getting linear reads. Presumably Windows is also treating the unwritten part of the file as all zeros.

          How do you handle file close? If the broker is running and is hard killed we run recovery and would truncate off any trailing zeros in a log segment. However if the broker is stopped gracefully I don't see how the trailing zeros are truncated off so on restart don't you end up with a bunch of zeros in the log?

          xueqiang xueqiang wang added a comment -

          Yes, Windows fully preallocates all the blocks, and it doesn't zero-fill the preallocated blocks either.
          If the broker is gracefully stopped, the active segment should still be truncated on broker restart. So that is a bug, and the new patch fixes it. Thanks a lot!

          xueqiang xueqiang wang added a comment -

          bug fix: truncate off trailing zeros on broker restart if broker is gracefully stopped.

          jkreps Jay Kreps added a comment -

          Hey xueqiang wang what is the latency impact for that preallocation for a large file? I.e. if your segment size is 1GB will there be a long pause when the new log segment is rolled? How long does that take?

          xueqiang xueqiang wang added a comment - - edited

          Hey, Jay, sorry for the late reply. I have run a test and found there is no pause when rolling a new log segment. The time to create a 1 GB file and a 1 KB file is almost identical: about 1 ms each.
          Here is the test code, which creates ten 1 GB files:

              public static void main(String[] args) throws Exception {
                  String filePre = "d:\\temp\\file";
                  long initFileSize = 1024000000L; // ~1 GB per file
                  long startTime = System.currentTimeMillis();
                  for (int i = 0; i < 10; i++) {
                      RandomAccessFile randomAccessFile = new RandomAccessFile(filePre + i, "rw");
                      randomAccessFile.setLength(initFileSize); // preallocate the full size up front
                      randomAccessFile.close();
                  }
                  long elapsedTime = System.currentTimeMillis() - startTime;
                  System.out.format("elapsedTime: %2d ms%n", elapsedTime);
              }
          

          The result is: elapsedTime: 14 ms

          waldenchen Honghai Chen added a comment -

          Hi Jay,
          I'm on the same team as Xueqiang. Do you have any concerns about the change? What additional information do you need us to provide?

          junrao Jun Rao added a comment -

          The bug fix patch forces log recovery on clean shutdown. Could that be avoided since it will slow down startup?

          qixia Qianlin Xia added a comment - - edited

          Hi, Jun.
          I'm also on the same team as Xueqiang and Honghai.
          As Jay mentioned before, recovering the log is still necessary when the broker is stopped gracefully. Since we only recover the activeSegment (a single LogSegment), it should not cost much time.
          I also wrote some code to test the performance of LogSegment.recover():

          public class TestLogSegment {
              public static void main(String[] args) throws Exception {
                  String dirTemplate = args.length > 0 ? args[0] : "D:\\C\\scp2\\tachyon\\logs\\testbroker\\mvlogs-";
                  int logFileNums = args.length > 1 ? Integer.parseInt(args[1]) : 10;

                  long startOffset = 0L;
                  int indexIntervalBytes = 4096;
                  int maxIndexSize = 10485760;
                  long initFileSize = 536870912L;
                  int maxMessageSize = 1000000;

                  long totalTime = 0L;
                  for (int i = 0; i < logFileNums; i++) {
                      LogSegment segment = new LogSegment(new File(dirTemplate + i), startOffset,
                              indexIntervalBytes, maxIndexSize, initFileSize, new SystemTime());
                      long start = System.currentTimeMillis();
                      segment.recover(maxMessageSize);
                      totalTime += System.currentTimeMillis() - start;
                  }

                  System.out.println("Average recover time: " + totalTime / logFileNums + " ms");
              }
          }
          

          I used scripts to create 1000 copies of a real partition of one topic (mvlogs-0, mvlogs-1, ..., mvlogs-999), and used this code to measure the average cost of LogSegment.recover() across the 1000 log partition directories. The average recover time is on the order of a dozen milliseconds (11 ms on my server).
          So I think the recovery operation is acceptable.

          sriharsha Sriharsha Chintalapani added a comment -

          Jay Kreps Jun Rao can you please take a look at the updated comments regarding the patch.
          Please give us any feedback on this patch. Thanks.

          qixia Qianlin Xia added a comment -

          Updated reviewboard https://reviews.apache.org/r/29091/diff/
          against branch origin/0.8.1

          sriharsha Sriharsha Chintalapani added a comment -

          Thanks Qianlin Xia for the reviewboard patch. Jay Kreps Jun Rao could you please give your feedback on updated patch. Thanks.

          sriharsha Sriharsha Chintalapani added a comment -

          Jay Kreps Jun Rao Can you please take a look at the latest patch. Thanks.

          sriharsha Sriharsha Chintalapani added a comment -

          Jay Kreps Jun Rao Neha Narkhede pinging for a review. Thanks.

          jghoman Jakob Homan added a comment -

          This is a big performance hit on Windows, and it looks like the concerns about the fix's impact on normal performance have been addressed. Any further concerns? This is looking ready to go in.

          jghoman Jakob Homan added a comment -

          To make it easier to test future patches that affect Windows, I've opened KAFKA-1924 to have CI for Windows builds. However, we shouldn't block this patch on that issue.

          jkreps Jay Kreps added a comment -

          Hey guys if this forces full recovery the impact on startup time will be considerable if you have a large number of partitions.

          Say you have 2000 partitions per machine and a 1GB log segment file size. On average these files will have about 500MB per partition when a restart occurs. The result is running recovery on 2000 * 500MB = 1TB of data. This will take about 5.5 hours at 50MB/sec.

          Qianlin Xia not sure how the above reasoning compares to your test?

          I think this would be a blocker issue, no?

          waldenchen Honghai Chen added a comment -

          How about adding one more column to recovery-point-offset-checkpoint to record the last position of the active segment? Then on recovery, do NOT call activeSegment.recover(config.maxMessageSize); instead, construct the new LogSegment with its end set to the position read from recovery-point-offset-checkpoint.
          This way we would only need to recover the missed messages at the tail of the last file, not the whole file. Make sense? Jay Kreps Or do you have another recommendation for the fix?

          jkreps Jay Kreps added a comment -

          Hey Honghai Chen yeah I think what you are saying is that the offset in the recovery point checkpoint should always be the last sync'd offset irrespective of segment file boundaries and we should optimize recovery to just recovery from that offset rather than always recovering the full last segment. That would work, and actually makes more sense than the current approach, but is a fairly involved and correctness critical change. One of the challenges is the offset index needs to be reconstructed from that point on as well, but there is a bit of a chicken and egg problem, because how do you search into the log at all if the index itself is corrupt?

          Another approach could be to truncate off the preallocated file extent on clean shutdown. This is actually effectively what we do for the offset indexes anyway. This would also avoid windows specific code since we could do this in all cases.

          waldenchen Honghai Chen added a comment - - edited

          Hey, Jay Kreps, just to clarify: the 50MB/sec you mentioned before is the checksum calculation on the machine, not copying replica data from another machine, right?

          If that's true, it seems we need three changes:
          1. When logManager.shutdown is called and the OS is Windows, truncate the active segment.
          2. On startup, if the OS is Windows, add one new segment.
          3. Remove the change "KAFKA-1646-truncate-off-trailing-zeros-on-broker-restart-if-bro.patch" made previously, since it's now unnecessary.
          Make sense?
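          Step 1 above could be sketched like this (a hypothetical helper, not the actual patch; the caller would know validBytes as the number of bytes actually written to the active segment): on clean shutdown, shrink the segment file back to its valid length so the preallocated zero tail never survives a graceful restart.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

/**
 * Sketch of truncating the preallocated tail on clean shutdown
 * (illustrative names, not the actual Kafka code). setLength with a
 * smaller value truncates the file in place.
 */
class TruncateOnShutdownSketch {
    static void truncateToValidBytes(RandomAccessFile segment, long validBytes)
            throws IOException {
        if (segment.length() > validBytes) {
            segment.setLength(validBytes); // drop the preallocated zero tail
        }
    }
}
```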

          waldenchen Honghai Chen added a comment - - edited

          Or do you prefer the second option?
          Add one more column to the file "recovery-point-offset-checkpoint". Currently it only records the offset, like below:
          0
          2
          mvlogs 1 100
          mvlogs 0 200
          Change it to the following by adding a "recoverposition" column:
          0
          2
          mvlogs 1 100 8000
          mvlogs 0 200 16000

          Here 8000 is the start position in the data file of the message with offset 100, and 16000 is the start position of the message with offset 200.
          With this option, what we need to do is:
          1. Keep offset and position consistent and regularly write them to the file "recovery-point-offset-checkpoint".
          2. On clean shutdown, truncate the file to the "recoverposition".
          3. On startup, find the log segment related to the recovery point and truncate the file to the "recoverposition".
          4. On startup, if the OS is Windows, add one new segment.
          But this change is big, since many places use the variable recoveryPoint.

          Which one do you recommend? We'd really appreciate your guidance. Jay Kreps Neha Narkhede Jun Rao
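          For concreteness, parsing one entry line of the proposed extended format might look like the sketch below (the fourth column is the proposal above, not the current checkpoint format; the class and method names are illustrative):

```java
/**
 * Sketch of parsing a line of the proposed extended checkpoint format
 * "topic partition offset recoverposition" (hypothetical; the real
 * checkpoint reader lives in kafka.server.OffsetCheckpoint).
 */
class RecoveryPointLine {
    final String topic;
    final int partition;
    final long offset;
    final long position;

    RecoveryPointLine(String topic, int partition, long offset, long position) {
        this.topic = topic;
        this.partition = partition;
        this.offset = offset;
        this.position = position;
    }

    static RecoveryPointLine parse(String line) {
        String[] parts = line.trim().split("\\s+");
        return new RecoveryPointLine(parts[0], Integer.parseInt(parts[1]),
                Long.parseLong(parts[2]), Long.parseLong(parts[3]));
    }
}
```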

          waldenchen Honghai Chen added a comment - - edited

          Updated reviewboard against branch origin/0.8.1.
          Hi, Jay Kreps Jun Rao Jakob Homan, please check the review at https://reviews.apache.org/r/29091/diff/7/ . Much appreciated.

          waldenchen Honghai Chen added a comment - - edited

          Hey, Jay Kreps, would you help check the review at https://reviews.apache.org/r/29091/diff/7/ ? Really appreciated, thanks.

          jkreps Jay Kreps added a comment -

          Hey Honghai Chen this patch is adding a TON of windows-specific if/else statements. I don't think that is sustainable. I think if we are going to do this we need to try to make it the same strategy across OS's just for maintainability.

          That said, are you sure NTFS can't just be tuned to accomplish the same thing?

          waldenchen Honghai Chen added a comment -

          Hey Jay Kreps We have asked the NTFS owners, and it seems this is the only way to optimize the performance on NTFS. Jakob Homan is also on the email thread. Below is what they said. Currently there are 7 "if OS.IsWindows" checks; we can reduce them to 3 and will update the code review soon. And if you agree that we can also preallocate files at full size on Linux, we can reduce the "if OS.IsWindows" checks to 0. Do you agree?

          ==============The answer from NTFS owners==========
          "There's no external way to influence NTFS's allocation algorithms.

          As suggested in the blog post, pre-allocating space in a file helps a lot in cases where an app does small extending writes to a file over a period of time (a text-based log file being one canonical example), especially on a busy file system where extending writes to other files are also happening.

          All file systems require defragmenting. However, Linux file systems (ext2/3/4 etc.) implement the defragmenting inside the file system, while Windows file systems implement the defragmenting using an external program. There are trade-offs to both approaches. The external program (defrag.exe) is in a scheduled task so it shouldn't normally be necessary to ever run it by hand."
          ===============================================================

          http://www.howtogeek.com/115229/htg-explains-why-linux-doesnt-need-defragmenting/
          https://stackoverflow.com/questions/29007319/how-to-make-windows-ntfs-does-not-have-fragment-or-reduce-fragmen

          jkreps Jay Kreps added a comment -

          Yeah I was proposing doing this across the board rather than just for windows. I don't think it will help much on the better linux filesystems which do delayed allocation anyway but it should help on the older filesystems. We'll need to think this through carefully though as it changes what you would see if you ls the files.

          waldenchen Honghai Chen added a comment -

          Updated reviewboard https://reviews.apache.org/r/29091/diff/
          against branch origin/0.8.1

          waldenchen Honghai Chen added a comment -

          Reduce the check of "if OS.IsWindows" from 7 to 3.
          https://reviews.apache.org/r/29091/diff/8/

          waldenchen Honghai Chen added a comment -

          Hey Jay Kreps Is it OK to add a configuration like "log.preallocatefile" and change the three "if Os.IsWindows" checks to consult that configuration instead?

          sriharsha Sriharsha Chintalapani added a comment -

          Honghai Chen It looks like the patch is against the 0.8.1.1 branch; can you send us a patch against trunk?

          Show
          sriharsha Sriharsha Chintalapani added a comment - Honghai Chen Its looks like the patch is against 0.8.1.1 branch can you send us a patch against trunk.
          sriharsha Sriharsha Chintalapani added a comment -

          Jay Kreps Any guidance on the latest patch?

          jghoman Jakob Homan added a comment -

          Jay Kreps - still interested in reviewing and finishing off this patch? If not, I'll take a look at it. This is a big performance hit on an important platform, and also a good opportunity to expand the Kafka dev community.

          jkreps Jay Kreps added a comment -

          This is pretty critical code, and there have been a few issues with these patches so far, which is what is making me a little skittish. Honghai Chen can you walk me through what you guys are doing in the way of validation and performance testing?

          Also, I agree with your suggestion of making this a configuration rather than an OS-specific check. Theoretically this could help with Linux filesystems, and conversely Windows may later get smarter about preallocation. How about log.preallocate={true/false}, defaulting to false?

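          The log.preallocate={true/false} shape Jay proposes implies strict boolean parsing of the config value. An illustrative sketch of that behavior; parseBoolean and this ConfigException are stand-ins for the effect of Kafka's ConfigDef, not its actual API:

```java
import java.util.Locale;

public class LogPreallocateConfig {
    static class ConfigException extends RuntimeException {
        ConfigException(String msg) { super(msg); }
    }

    // Accept only "true" or "false" (case-insensitive); anything else is rejected,
    // producing the kind of error seen later in this thread's unit-test failures.
    static boolean parseBoolean(String name, String value) {
        String v = value.trim().toLowerCase(Locale.ROOT);
        if (v.equals("true")) return true;
        if (v.equals("false")) return false;
        throw new ConfigException("Invalid value " + value + " for configuration "
                + name + ": Expected value to be either true or false");
    }

    public static void main(String[] args) {
        System.out.println(parseBoolean("log.preallocate", "false")); // the proposed default
        try {
            parseBoolean("log.preallocate", "619129959"); // a random non-boolean value
        } catch (ConfigException e) {
            System.out.println(e.getMessage());
        }
    }
}
```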
          waldenchen Honghai Chen added a comment - - edited

          Jay Kreps All issues have been addressed; please check the updated review board https://reviews.apache.org/r/29091/diff/10/ against branch origin/0.8.1.
          Many thanks for your guidance, really appreciated.

          jkreps Jay Kreps added a comment -

          Oh yes, also, we'll need a patch against trunk to be able to commit it.

          waldenchen Honghai Chen added a comment - - edited

          Updated the code so it does not flush on trim, only on close. Updated the review against 0.8.1 (https://reviews.apache.org/r/29091/diff/11/); will add a new review/patch against trunk soon.
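          The flush-only-on-close behavior described here pairs with trimming the preallocated tail when the segment is closed, so the file on disk shrinks back to the bytes actually written. A rough sketch of that trim step with hypothetical sizes, not the patch's actual code:

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

public class TrimOnClose {
    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("segment", ".log");
        RandomAccessFile raf = new RandomAccessFile(f, "rw");
        raf.setLength(1 << 20);            // segment preallocated to 1 MiB
        raf.write(new byte[4096]);         // only 4 KiB of messages actually written
        long written = raf.getFilePointer();
        raf.setLength(written);            // trim off the unused preallocated tail
        raf.close();                       // flush happens here, not at trim time
        System.out.println(f.length());
        f.delete();
    }
}
```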

          waldenchen Honghai Chen added a comment - - edited

          Jay Kreps The patch against trunk is attached.
          And the review board against trunk is here: https://reviews.apache.org/r/33204/diff/

          jkreps Jay Kreps added a comment -

          Cool, this looks good to me. Since this introduces a new config, we should do a quick KIP discussion on the mailing list. I think the KIP can be pretty minimal, since it really is a fairly small change in the end.

          https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals

          waldenchen Honghai Chen added a comment -

          Added the KIP here and sent out the email; will wait for 2 days to see if there is any issue.
          https://cwiki.apache.org/confluence/display/KAFKA/KIP-20+-+Enable+log+preallocate+to+improve+consume+performance+under+windows+and+some+old+Linux+file+system
          http://mail-archives.apache.org/mod_mbox/kafka-dev/201504.mbox/%3Cb1f1db2cb51645f6b459c252abae6641%40HKNPR30MB018.064d.mgd.msft.net%3E
          waldenchen Honghai Chen added a comment - - edited

          New code review board: https://reviews.apache.org/r/33204/diff/2/
          The patch against trunk is also attached.

          jkreps Jay Kreps added a comment -

          So Honghai Chen, next steps for this patch:
          1. The KIP is under discussion.
          2. This patch still includes no testing at all, right? We definitely need to remedy that.
          3. It would be good to get perf results for Windows/NTFS + Linux/ext4.

          Also, are you guys using this patch in production? If so, what has the experience been?

          waldenchen Honghai Chen added a comment - - edited

          When trying to add test cases for Log, I got failures in existing test cases on Windows:
          https://issues.apache.org/jira/browse/KAFKA-2170

          waldenchen Honghai Chen added a comment -

          Added test cases: https://reviews.apache.org/r/33204/diff/3/
          waldenchen Honghai Chen added a comment -

          Created reviewboard against branch origin/trunk

          jkreps Jay Kreps added a comment -

          This looks good to me, but it would be good to get a second reviewer on this since it is fairly critical code. Jun Rao any chance you could take a look too?

          waldenchen Honghai Chen added a comment - - edited

          Merged to latest trunk; patch attached.

          Code review unchanged: https://reviews.apache.org/r/33204/diff/4/

          waldenchen Honghai Chen added a comment - - edited

          Fixed all issues mentioned by Jun Rao:
          https://reviews.apache.org/r/33204/diff/5/
          The latest patch is also attached; can we ship it now?

          junrao Jun Rao added a comment -

          Honghai Chen, thanks for the latest patch. A couple of other issues.

          1. There are two unit test failures.

          unit.kafka.log.LogConfigTest > testFromPropsToProps FAILED
          org.apache.kafka.common.config.ConfigException: Invalid value 1352153985 for configuration preallocate: Expected value to be either true or false
          at org.apache.kafka.common.config.ConfigDef.parseType(ConfigDef.java:210)
          at org.apache.kafka.common.config.ConfigDef.parse(ConfigDef.java:172)
          at kafka.log.LogConfig$.fromProps(LogConfig.scala:208)
          at unit.kafka.log.LogConfigTest.testFromPropsToProps(LogConfigTest.scala:66)

          unit.kafka.server.KafkaConfigConfigDefTest > testFromPropsToProps FAILED
          org.apache.kafka.common.config.ConfigException: Invalid value 619129959 for configuration log.preallocate: Expected value to be either true or false
          at org.apache.kafka.common.config.ConfigDef.parseType(ConfigDef.java:210)
          at org.apache.kafka.common.config.ConfigDef.parse(ConfigDef.java:172)
          at kafka.server.KafkaConfig$.fromProps(KafkaConfig.scala:531)
          at unit.kafka.server.KafkaConfigConfigDefTest.testFromPropsToProps(KafkaConfigConfigDefTest.scala:256)

          2. We also create new FileMessageSet in LogCleaner.cleanSegments(). Should we enable preallocation there as well?

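          Jun Rao's second point is that every code path creating a new segment file (a normal segment roll, or the cleaner building its replacement file) should honor the same preallocate setting. One way to sketch that idea is to centralize the decision in a single factory; SegmentFileFactory here is a hypothetical illustration, not Kafka's actual design:

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

public class SegmentFileFactory {
    private final boolean preallocate; // from the log.preallocate config
    private final long initFileSize;   // e.g. the configured segment size

    SegmentFileFactory(boolean preallocate, long initFileSize) {
        this.preallocate = preallocate;
        this.initFileSize = initFileSize;
    }

    // Used by both segment rolls and cleaner-created files,
    // so neither path can forget to apply the preallocate flag.
    RandomAccessFile create(File f) throws IOException {
        RandomAccessFile raf = new RandomAccessFile(f, "rw");
        if (preallocate)
            raf.setLength(initFileSize);
        return raf;
    }

    public static void main(String[] args) throws IOException {
        SegmentFileFactory factory = new SegmentFileFactory(true, 1 << 20);
        File rolled = File.createTempFile("rolled", ".log");
        File cleaned = File.createTempFile("cleaned", ".swap");
        factory.create(rolled).close();
        factory.create(cleaned).close();
        System.out.println(rolled.length() + " " + cleaned.length());
        rolled.delete();
        cleaned.delete();
    }
}
```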
          waldenchen Honghai Chen added a comment -

          Updated reviewboard https://reviews.apache.org/r/33204/diff/
          against branch origin/trunk

          waldenchen Honghai Chen added a comment - - edited

          Updated reviewboard https://reviews.apache.org/r/33204/diff/
          against branch origin/trunk

          The latest code review: https://reviews.apache.org/r/33204/diff/8/
          1. Fixed the 2 test cases.
          2. Fixed LogCleaner: whenever a new file is created, the preallocate parameter is now set.
          3. Merged to latest trunk.

          Is it ok to go? Jun Rao

          Trying to push code into open source is even harder than playing Shanghai A-shares, hehe.

          junrao Jun Rao added a comment -

          Thanks for the patch. Sorry for the delay. +1. Committed to trunk after fixing a compilation error.

          :core:compileTestScala
          /Users/junrao/intellij/kafka/core/src/test/scala/unit/kafka/log/LogSegmentTest.scala:233: type mismatch;
           found   : Long
           required: Int
          val seg = new LogSegment(tempDir, offset, 10, 1000, 0, SystemTime, fileAlreadyExists = fileAlreadyExists, initFileSize = initFileSize, preallocate = preallocate)
          ^
          one error found
          FAILED

          waldenchen Honghai Chen added a comment -

          See the commit https://git-wip-us.apache.org/repos/asf?p=kafka.git;a=commit;h=ca758252c5a524fe6135a585282dd4bf747afef2
          Many thanks to everyone for your help in making this happen.


            People

            • Assignee:
              waldenchen Honghai Chen
              Reporter:
              xueqiang xueqiang wang
              Reviewer:
              Jun Rao
            • Votes:
              1
              Watchers:
              7
