Log4j 2 / LOG4J2-531

Rolled log files overwritten by RollingFile appender with composite time and size based policies

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0-beta9
    • Fix Version/s: 2.0-rc1
    • Component/s: Appenders
    • Labels: None
    • Environment:

      Ubuntu 12.04 and 13.04, java version "1.7.0_51"

      Description

      We have a system which generates high volume logs which are required to be preserved for audit purposes, and have been having problems with files being unexpectedly overwritten.

      We are using a RollingFile appender with day granularity, time based and size based triggering policies, and a rollover strategy with a suitably large max value.

      I have created a simple test case with minute granularity to quickly illustrate the problem; it is very similar to the example given in the documentation:

      import org.apache.logging.log4j.LogManager;
      import org.apache.logging.log4j.Logger;
      
      public class LogTest
      {
          private static final Logger logger = LogManager.getLogger("TestLogger");
      
          public static void main(String[] args) throws Exception
          {
              for (long i=0; ; i+=1) {
                  logger.debug("Sequence: " + i);
                  Thread.sleep(250);
              }
          }
      }
      

      ... with a config of:

      <?xml version="1.0" encoding="UTF-8"?>
      <Configuration>
          <Appenders>
              <RollingFile name="Test" fileName="logs/test.log" filePattern="logs/test/$${date:yyyyMMddHHmm}/TEST-%d{yyyyMMddHHmm}-%i.log.gz">
                  <PatternLayout pattern="%d %p (%t) [%c] - %m%n"/>
                  <Policies>
                      <TimeBasedTriggeringPolicy />
                      <SizeBasedTriggeringPolicy size="1 KB"/>
                  </Policies>
                  <DefaultRolloverStrategy max="999999"/>
              </RollingFile>
          </Appenders>
          <Loggers>
              <Root level="debug">
                  <AppenderRef ref="Test"/>
              </Root>
          </Loggers>
      </Configuration>
      

      If this is run as-is, many of the rolled-over log files are overwritten by other files and lost, as can clearly be seen from the gaps in the remaining sequence numbers and from the order in which the sequence numbers appear in the resulting files.

      If the time based policy is removed from the config and the test is re-run, then all sequence numbers are correctly stored and in the expected order. Without the time based trigger some lines are carried over into the folder for the next period, which is not ideal, though it is what we are using at present to avoid data loss.

          Activity

          Matt Sicker added a comment -

           Can confirm this bug using trunk. The results I get when I run the provided test as-is indicate a loss of the initial 64 log messages, then a loss of 50 messages in between each file rollover. This test is comparable to rolling3, but with the addition of a default rollover strategy.

          Matt Sicker added a comment -

          My mistake. If you look at the RollingAppenderTimeAndSizeTest along with its generated files, you'll see the bug described in the report. It looks like that test will have to be updated once this problem is fixed.

          Remko Popma added a comment -

          Geoff, just a preliminary update (still investigating):
          because DefaultRolloverStrategy is configured with max="999999", the rollover logic will allow for up to 1 million files with the same timestamp (but a different %i index if the rollover was triggered by the size threshold). Within the same time window, the rollover logic will then ensure that no more than max files are created, and if the max has been reached, then the logic will remove the oldest file and rename all other indexed files to make room for the new one.
          (See DefaultRolloverStrategy.)

          Why is this relevant? The logic above is implemented as a loop that will count down from 999999 to 1 (almost 1 million times), and call if (new File(format(pattern, i)).exists()) ... in order to check if such a file exists and needs to be renamed.
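           A minimal, self-contained sketch of that kind of countdown scan (the class and file pattern below are hypothetical; the actual logic lives in DefaultRolloverStrategy):

           import java.io.File;

           // Illustrative sketch only: shows why a large max index is expensive.
           public class CountdownScanSketch
           {
               // Counting down from maxIndex means up to maxIndex File.exists()
               // calls (disk accesses) per rollover, even when only a few files exist.
               static int findHighestExistingIndex(String prefix, int maxIndex) {
                   for (int i = maxIndex; i >= 1; i--) {
                       if (new File(prefix + "-" + i + ".log.zip").exists()) {
                           return i; // files 1..i may need renaming to make room
                       }
                   }
                   return 0; // no archive for this timestamp yet
               }

               public static void main(String[] args) {
                   long start = System.nanoTime();
                   findHighestExistingIndex("logs/test/201402091442/TEST-201402091442", 999999);
                   System.out.printf("scan took %.3f s%n", (System.nanoTime() - start) / 1e9);
               }
           }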

          I suspect that accessing the disk so often will take significant time, and that the first rollover action is not complete yet when other threads have logged enough to trigger the next rollover, and that these overlapping rollover actions cause the problem you are seeing.

           If I am correct, the problem can be worked around by choosing a smaller time window and configuring a larger size for SizeBasedTriggeringPolicy, so that you can configure a smaller value for max in DefaultRolloverStrategy.

          I will continue investigating and update later.

          Remko Popma added a comment -

          Update: on my home laptop a rollover takes 5 minutes: that is how long it takes to call File.exists() 1 million times (and that machine has a fairly fast SSD drive).

          However, this does not explain the loss of the initial zip file. That seems to be caused by a bug in the archive renaming logic. Here is what I've found so far.

          2014-02-09 14:42:29,365 DEBUG (main) [log4j2_531.LogTest] - Sequence: 0
          2014-02-09 14:42:29,617 DEBUG (main) [log4j2_531.LogTest] - Sequence: 1
          2014-02-09 14:42:29,868 DEBUG (main) [log4j2_531.LogTest] - Sequence: 2
          2014-02-09 14:42:30,119 DEBUG (main) [log4j2_531.LogTest] - Sequence: 3
          2014-02-09 14:42:30,370 DEBUG (main) [log4j2_531.LogTest] - Sequence: 4
          2014-02-09 14:42:30,620 DEBUG (main) [log4j2_531.LogTest] - Sequence: 5
          2014-02-09 14:42:30,871 DEBUG (main) [log4j2_531.LogTest] - Sequence: 6
          2014-02-09 14:42:31,122 DEBUG (main) [log4j2_531.LogTest] - Sequence: 7
          2014-02-09 14:42:31,372 DEBUG (main) [log4j2_531.LogTest] - Sequence: 8
          2014-02-09 14:42:31,623 DEBUG (main) [log4j2_531.LogTest] - Sequence: 9
          2014-02-09 14:42:31,874 DEBUG (main) [log4j2_531.LogTest] - Sequence: 10
          2014-02-09 14:42:32,125 DEBUG (main) [log4j2_531.LogTest] - Sequence: 11
          2014-02-09 14:42:32,376 DEBUG (main) [log4j2_531.LogTest] - Sequence: 12
          2014-02-09 14:42:32,627 DEBUG (main) [log4j2_531.LogTest] - Sequence: 13
          2014-02-09 14:47:24,372 TRACE DefaultRolloverStrategy.purge() took 291.484612529 seconds
          2014-02-09 14:42:32,879 DEBUG (main) [log4j2_531.LogTest] - Sequence: 14
          2014-02-09 14:52:16,386 TRACE DefaultRolloverStrategy.purge() took 291.746054955 seconds
          2014-02-09 14:47:24,630 DEBUG (main) [log4j2_531.LogTest] - Sequence: 15
          2014-02-09 14:57:09,710 TRACE DefaultRolloverStrategy.purge() took 293.05872818 seconds
          2014-02-09 14:52:16,642 DEBUG (main) [log4j2_531.LogTest] - Sequence: 16
          2014-02-09 15:02:03,915 TRACE DefaultRolloverStrategy.purge() took 293.942685691 seconds
          2014-02-09 14:57:09,964 DEBUG (main) [log4j2_531.LogTest] - Sequence: 17
          2014-02-09 15:06:58,657 TRACE DefaultRolloverStrategy.purge() took 294.479377825 seconds
          2014-02-09 15:02:04,170 DEBUG (main) [log4j2_531.LogTest] - Sequence: 18
          2014-02-09 15:11:55,732 TRACE DefaultRolloverStrategy.purge() took 296.808908234 seconds
          2014-02-09 15:06:58,914 DEBUG (main) [log4j2_531.LogTest] - Sequence: 19
          
          1. 14:47 rollover creates TEST-201402091447-1.log.zip - contains 14 lines with Sequence 0-13
          2. 14:52 rollover creates TEST-201402091442-1.log.zip - contains 1 line with Sequence 14
          3. 14:57 rollover overwrites TEST-201402091447-1.log.zip - contains 1 line with Sequence 15
          4. 15:02 rollover creates TEST-201402091452-1.log.zip - contains 1 line with Sequence 16
          5. 15:07 rollover creates TEST-201402091457-1.log.zip - contains 1 line with Sequence 17
          6. 15:11 rollover creates TEST-201402091502-1.log.zip - contains 1 line with Sequence 18

          (to be continued)

          Geoff Ballinger added a comment -

           Thanks for the updates Remko, in particular the comment on the effect of the max value on the rollover time, which we will definitely take into account going forward.

          For background we originally observed the issue on a system running with a vanilla DefaultRolloverStrategy (i.e. max=7) but with a size of 1GB. This would have been much quicker to find the next available filename but I guess much slower to do the rollover due to gzip'ing 1GB of data.

          Also we never observed the issue on the first day the system ran, only on the second and subsequent days, though we haven't repeated it enough times to be absolutely confident that wasn't just coincidental.

           wrt your observation of a 5 minute rollover time: I had noticed that rollovers were slow, but they are nothing like that slow on my dev system - more like 2-3 seconds. I am using a very vanilla office desktop - i5 w/ a single HDD - running Ubuntu 13.04. Maybe an OS difference?

          Geoff.

          Remko Popma added a comment -

          Well, this is a nasty one... Sorry if what follows is a long story, but it has taken me all day to figure this out.

          At rollover, the name of the archive file is based on a date format and an index (assuming both TimeBased and SizeBased triggering).

          • The index is determined based on a scan of the target directory.
          • The date-time is not the current date-time, but the date-time of the previous rollover.
            This is by design: for example, if you roll over once a day, you want the archive containing Wednesday's log files to be called "Wednesday.log.zip", even if the rollover happens on Thursday at midnight.
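
           As a concrete illustration of that naming rule (the helper below is hypothetical, not the real PatternProcessor):

           import java.text.SimpleDateFormat;
           import java.util.Date;

           public class ArchiveNameSketch
           {
               // The %d{...} part of the archive name is formatted from the time of
               // the *previous* rollover, not from the current time.
               static String archiveName(Date prevRolloverTime, int index) {
                   SimpleDateFormat fmt = new SimpleDateFormat("yyyyMMddHHmm");
                   return "TEST-" + fmt.format(prevRolloverTime) + "-" + index + ".log.zip";
               }

               public static void main(String[] args) throws Exception {
                   SimpleDateFormat p = new SimpleDateFormat("yyyyMMddHHmm");
                   // A rollover occurring at 14:43 archives the 14:42 window's log
                   // lines under the 14:42 timestamp:
                   Date prev = p.parse("201402091442");
                   System.out.println(archiveName(prev, 1)); // TEST-201402091442-1.log.zip
               }
           }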

          This behaviour is implemented in PatternProcessor (the class responsible for producing the post-rollover file name) by remembering the previous rollover time in an instance variable.

          The current implementation assumes that rollover will not take longer than the rollover period specified in the date format. Since the DefaultRolloverStrategy is configured with max=999999, and a directory scan takes 5 minutes, this assumption no longer holds: we are supposed to roll over every minute but the rollover process takes five minutes.

          This highlights three problems in the current implementation:

          1. Initialization is incorrect

           • At startup, PatternProcessor.prevFileTime (the field storing the previous rollover time) is not initialized (its value is copied from nextFileTime, which is still zero).
           • At the same time, PatternProcessor.nextFileTime is initialized to the log file's lastModified time (for example 10:00).
           • For the first rollover, the directory is scanned. This completes at 10:05.
           • The archive file is created based on PatternProcessor.prevFileTime. Since this is not initialized yet, the logic falls back to the current time (10:05). Rollover creates archive <...>1005.1.log.zip.
           • The next log event is processed. Event time is 10:00:01, triggering another rollover. Also, PatternProcessor.nextFileTime is copied to prevFileTime and re-initialized to the current time. So now prevFileTime=10:00 and nextFileTime=10:05.
           • For the second rollover, the directory is scanned again. This completes at 10:10.
           • The archive file is created based on PatternProcessor.prevFileTime, which is 10:00. Rollover creates archive <...>1000.1.log.zip. (And we just went back in time...)
           • The next log event is processed. Event time is 10:00:02, triggering another rollover. Also, PatternProcessor.nextFileTime is copied to prevFileTime and re-initialized to the current time. So now prevFileTime=10:05 and nextFileTime=10:10.
           • For the third rollover, the directory is scanned again. This completes at 10:15.
           • The archive file is created based on PatternProcessor.prevFileTime, which is 10:05. Rollover overwrites archive <...>1005.1.log.zip. (Overwriting an existing archive file... but that is problem #2, see below.)
           • The next log event is processed. Event time is 10:00:03, triggering another rollover. Also, PatternProcessor.nextFileTime is copied to prevFileTime and re-initialized to the current time. So now prevFileTime=10:10 and nextFileTime=10:15.
           • For the fourth rollover, the directory is scanned again. This completes at 10:20.
           • The archive file is created based on PatternProcessor.prevFileTime, which is 10:10. Rollover creates archive <...>1010.1.log.zip. From here on rollover should work correctly.

          In my opinion, the PatternProcessor.prevFileTime should be initialized to the log file's last modified time at startup.
          That would prevent the "going back in time" phenomenon we're seeing above.
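
           A minimal sketch of that proposed initialization (the field names mirror the description above, but this is not the actual PatternProcessor code):

           import java.io.File;

           public class InitSketch
           {
               private long prevFileTime; // previous rollover time (currently left at zero)
               private long nextFileTime; // end of the current time window

               void initialize(File logFile) {
                   // Proposed fix: seed prevFileTime from the log file's lastModified
                   // time so the first archive name does not fall back to "now".
                   long lastModified = logFile.lastModified(); // e.g. 10:00
                   prevFileTime = lastModified;
                   nextFileTime = lastModified;
               }

               public static void main(String[] args) {
                   InitSketch s = new InitSketch();
                   s.initialize(new File("logs/test.log"));
                   System.out.println("prev=" + s.prevFileTime + " next=" + s.nextFileTime);
               }
           }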

          2. Rollover overwrites existing archives
          If an archive <datetime>-1.zip already exists, I would expect it to be renamed to <datetime>-2.zip before the new archive file is created. This is not happening...

           After much digging, I found the reason. The check for existing files is done with a different method than the one used to generate the new archive name...

          In our config, filePattern="logs/test/$${date:yyyyMMddHHmm}/TEST-%d{yyyyMMddHHmm}-%i.log.zip". Now here is the fun bit: the directory pattern $${date:yyyyMMddHHmm} has two dollar chars and is evaluated dynamically.

          So if our scan starts at 10:10 and we work down from maxIndex, then by the time we reach index 1, five minutes have passed and the logic checks if file "logs/test/201402091015/TEST-201402091010-1.log.zip" exists.

          After the scan the log file is zipped to "logs/test/201402091010/TEST-201402091010-1.log.zip". This file did exist but was not found during the scan.

          The fix for this is that the directory scan must use the same pattern evaluation logic as used to generate the rolled-over archive file name.
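
           To make the mismatch concrete, here is a small stand-alone illustration (the helper below is hypothetical; it only mimics how the two pattern parts are evaluated):

           import java.text.SimpleDateFormat;
           import java.util.Date;

           public class PatternMismatchSketch
           {
               // $${date:...} picks the directory; %d{...} picks the file timestamp.
               static String archivePath(Date dirTime, Date fileTime, int index) {
                   SimpleDateFormat fmt = new SimpleDateFormat("yyyyMMddHHmm");
                   return "logs/test/" + fmt.format(dirTime)
                           + "/TEST-" + fmt.format(fileTime) + "-" + index + ".log.zip";
               }

               public static void main(String[] args) throws Exception {
                   SimpleDateFormat p = new SimpleDateFormat("yyyyMMddHHmm");
                   Date rolloverTime = p.parse("201402091010"); // previous rollover time
                   Date scanEndTime  = p.parse("201402091015"); // "now" when index 1 is checked

                   // What the slow scan checks ($${date:...} evaluated at check time):
                   System.out.println(archivePath(scanEndTime, rolloverTime, 1));
                   // Where the archive was actually written:
                   System.out.println(archivePath(rolloverTime, rolloverTime, 1));
               }
           }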

          3. Directory scan time linear with DefaultRolloverStrategy.maxIndex
          Ideally I'd like a faster implementation that does not call File.exists() maxIndex times.
          One alternative implementation would be to build a list of all files in all subdirectories of the top-most directory in the file pattern, and do string-matching on that list.
          This is possible, but not trivial either. I may take a stab at this later.
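
           A rough sketch of that alternative, assuming the archive names follow a prefix-index-suffix shape (the names and regex here are assumptions, not a worked-out design):

           import java.io.File;
           import java.util.regex.Matcher;
           import java.util.regex.Pattern;

           public class DirectoryListingSketch
           {
               // One directory read instead of up to maxIndex File.exists() calls.
               static int highestIndex(File dir, String prefix, String suffix) {
                   Pattern p = Pattern.compile(Pattern.quote(prefix) + "(\\d+)" + Pattern.quote(suffix));
                   String[] names = dir.list();
                   if (names == null) {
                       return 0; // directory does not exist yet
                   }
                   int max = 0;
                   for (String name : names) {
                       Matcher m = p.matcher(name);
                       if (m.matches()) {
                           max = Math.max(max, Integer.parseInt(m.group(1)));
                       }
                   }
                   return max;
               }

               public static void main(String[] args) {
                   File dir = new File("logs/test/201402091442");
                   System.out.println("highest existing index: "
                           + highestIndex(dir, "TEST-201402091442-", ".log.zip"));
               }
           }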

          I'll work on solutions for the first two issues.

          Remko Popma added a comment -

          Fixed in revision 1566302.
          Please verify and close.


          Changes:

          • TimeBasedTriggeringPolicy.initialize now calls PatternProcessor.getNextTime twice to force initialization of both prevFileTime and nextFileTime.
          • RollingFileManager.createManager now calls File.lastModified() after creating the file, so field initialTime no longer has a value of zero.
          • DefaultRolloverStrategy.purgeAscending and purgeDescending now use the same PatternProcessor.formatFileName method as the method used to create the rollover file.
          • DefaultRolloverStrategy.purgeAscending and purgeDescending now emit debug-level status log messages when a file is deleted or renamed.
           • PatternProcessor.getNextTime now emits trace-level status log messages when its prevFileTime and nextFileTime fields are modified before a rollover.

          Gary Gregory added a comment -

          Good job digging through all of this! Do we have unit tests now for all of this? Without tests I would have zero confidence touching any of this delicate logic.

          Remko Popma added a comment -

          I will add JUnit tests for this ASAP. Probably one day from now. Sorry but I won't be able to do it sooner.

          Gary Gregory added a comment -

           That's quite all right, Remko. I just need tests so that when I go in there and hack away (see my proposal on the ML), I can do so with confidence that the tests will back me up.

          Remko Popma added a comment -

           Understood. I was also thinking of refactoring the DefaultRolloverStrategy class when (or before) working on LOG4J2-524. The current implementation makes unit testing a bit difficult. I probably won't have time to work on this in the next week or so, so if you want to go first, no problem.

          Remko Popma added a comment -

          I just re-read your ML message & realized your proposal would also address LOG4J2-524. Cool!

          Remko Popma added a comment - edited

          Geoff, the fix just made it into the 2.0-rc1 release. It will take a few days for the team to verify the release and the artifacts to get published to the mirrors. Meanwhile, you can verify the fix with the artifacts staged here: https://repository.apache.org/content/repositories/orgapachelogging-1002/org/apache/logging/log4j/
           (or you can build from trunk, of course).

          Geoff Ballinger added a comment - edited

           I can confirm that using your staged artefacts, log files / lines are no longer being lost, which is the main thing!

           The distribution of the log files into the folders is a bit odd however, and it feels like there might still be a flaw in the timestamp logic in there somewhere.

          I kicked off the test run at 14:29:11 using my example setup from the first post above more or less as is - but with a more sensible max=99.

           The initial folder created was 201402101429 and contained 27 numbered files prefixed TEST-201402101429-. Numbers 1-12 contained 16 log lines each with timestamps 14:29:11,667 to 14:29:59,649, number 13 contained a single log line with timestamp 14:29:59,899, and numbers 14-27 again contained 16 lines each with timestamps 14:30:00,164 to 14:30:56,155.

          The second folder created was 201402101430 and contained 15 numbered files. The first file contained 15 log lines with timestamps 14:30:56,405 to 14:30:59,919. The subsequent files contained 16 log lines each with timestamps 14:31:00,170 to 14:31:56,123.

          The third folder created was 201402101431. The first file contained 15 log lines with timestamps 14:31:56,373 to 14:31:59,886. The subsequent files contained 16 log lines each with timestamps from 14:32:00,136 onwards.

           So in summary, after the first minute the majority of log lines from minute X end up in the folder for minute X-1. The exception is the last file's worth of lines, which ends up in the correct folder.

          Geoff.

          Remko Popma added a comment -

          Geoff, thanks first of all for confirming that the main problem is solved.

          Also thanks for pointing out that there may still be issues. I'll take another look with the max=99 setting. I suspect the problem you pointed out has to do with the time taken by the rollover, but would need to check.
          (It is certainly an interesting scenario: rolling over that many times within the same minute. I'm curious if it will lead us to other hidden flaws in the rollover implementation.)

          Geoff Ballinger added a comment -

           Fair comment wrt the unrealistic settings, so I adjusted it to roll every hour and every 64 KB (i.e. expecting a similar number of files per folder):

          <?xml version="1.0" encoding="UTF-8"?>
          <Configuration>
              <Appenders>
                  <RollingFile name="Test" fileName="logs/test.log" filePattern="logs/test/$${date:yyyyMMddHH}/TEST-%d{yyyyMMddHH}-%i.log.gz">
                      <PatternLayout pattern="%d %p (%t) [%c] - %m%n"/>
                      <Policies>
                          <TimeBasedTriggeringPolicy />
                          <SizeBasedTriggeringPolicy size="64 KB"/>
                      </Policies>
                      <DefaultRolloverStrategy max="99"/>
                  </RollingFile>
              </Appenders>
              <Loggers>
                  <Root level="debug">
                      <AppenderRef ref="Test"/>
                  </Root>
              </Loggers>
          </Configuration>
          

          Running this overnight gave a similar pattern, e.g. the folder for 2014021100 contains one file with timestamps 00:59:42,836 to 00:59:59,851, and then 15 others covering timestamps 01:00:00,101 to 01:59:24,196. Only the last file's worth for each hour (the one rotated based on time) is in the correct folder, and the rest (rotated based on size) are in the previous hour's folder.

          Geoff.

          Remko Popma added a comment -

           Geoff, if you don't mind, I would like to close this issue (seeing that the title/description issue of data loss has been resolved). If you disagree, please feel free to re-open it.

          I've started a new ticket (LOG4J2-535) for the new issue you found. I've tentatively called it "Rolled log files end up in the wrong directory" and used your last comment as the issue description, but please feel free to edit the title, description and anything else as you see fit.

          Best regards, -Remko


            People

            • Assignee: Remko Popma
            • Reporter: Geoff Ballinger
            • Votes: 0
            • Watchers: 5
