Details

• Type: Improvement
    • Status: Resolved
• Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: v1.5.0
    • Component/s: None
• Labels: None

Description

File Channel has scaled so well that people now run channels holding hundreds of millions of events. It turns out that replay can be extremely slow at this scale, even between checkpoints, because the remove() method in FlumeEventQueue moves every pointer that follows the one being removed (one remove causes 99 million+ moves for a channel of 100 million!). There are several ways of improving this. One is to move pointers only at the end of replay - essentially a compaction. Another is to exploit the fact that all removes happen from the top of the queue: move the first "k" events out to a hash set and remove from there - we can find k using the write ID of the last checkpoint and the current one.
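To make the cost concrete, here is a minimal sketch of the O(N) remove described above (a hypothetical array-based layout, not the actual FlumeEventQueue code):

    // Removing one pointer shifts every pointer that follows it, so a
    // single remove near the head of a 100-million-event queue moves
    // roughly 100 million entries.
    long remove(long[] queue, int size, long pointer) {
      for (int i = 0; i < size; i++) {
        if (queue[i] == pointer) {
          System.arraycopy(queue, i + 1, queue, i, size - i - 1);
          return pointer;
        }
      }
      return -1; // not found: the O(N) scan was wasted work
    }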

Attachments

1. FLUME-2155.5.patch
        37 kB
        Brock Noland
      2. FLUME-2155.4.patch
        35 kB
        Brock Noland
      3. FLUME-2155.2.patch
        34 kB
        Brock Noland
      4. FLUME-FC-SLOW-REPLAY-FIX-1.patch
        4 kB
        Brock Noland
      5. FLUME-FC-SLOW-REPLAY-1.patch
        11 kB
        Brock Noland
      6. FLUME-2155.patch
        12 kB
        Hari Shreedharan
      7. FLUME-2155-initial.patch
        2 kB
        Hari Shreedharan
      8. SmartReplay1.1.pdf
        86 kB
        Hari Shreedharan
      9. SmartReplay.pdf
        72 kB
        Hari Shreedharan
      10. fc-test.patch
        12 kB
        Hari Shreedharan
      11. 100000-110000
        5.70 MB
        Hari Shreedharan
      12. 10000-20000
        5.69 MB
        Hari Shreedharan
      13. 300000-310000
        5.70 MB
        Hari Shreedharan
      14. 700000-710000
        5.70 MB
        Hari Shreedharan

          Activity

          Hudson added a comment -

          SUCCESS: Integrated in flume-trunk #527 (See https://builds.apache.org/job/flume-trunk/527/)
          FLUME-2155. Index the Flume Event Queue during replay to improve replay time. (hshreedharan: http://git-wip-us.apache.org/repos/asf/flume/repo?p=flume.git&a=commit&h=6373032a620bdc687b6d03b12726713d08c71a10)

          • flume-ng-channels/flume-file-channel/src/main/java/org/apache/flume/channel/file/LogFile.java
          • flume-ng-channels/flume-file-channel/src/test/java/org/apache/flume/channel/file/TestEventQueueBackingStoreFactory.java
          • flume-ng-channels/flume-file-channel/pom.xml
          • flume-ng-channels/flume-file-channel/src/main/java/org/apache/flume/channel/file/Serialization.java
          • flume-ng-channels/flume-file-channel/src/test/java/org/apache/flume/channel/file/TestCheckpoint.java
          • flume-ng-channels/flume-file-channel/src/test/java/org/apache/flume/channel/file/TestCheckpointRebuilder.java
          • flume-ng-channels/flume-file-channel/src/main/java/org/apache/flume/channel/file/ReplayHandler.java
          • flume-ng-channels/flume-file-channel/src/main/java/org/apache/flume/channel/file/FlumeEventQueue.java
          • flume-ng-channels/flume-file-channel/src/test/java/org/apache/flume/channel/file/TestFlumeEventQueue.java
          • flume-ng-channels/flume-file-channel/src/main/java/org/apache/flume/channel/file/FileChannel.java
          • flume-ng-channels/flume-file-channel/src/main/java/org/apache/flume/channel/file/Log.java
          • flume-ng-channels/flume-file-channel/src/main/java/org/apache/flume/channel/file/CheckpointRebuilder.java
          • pom.xml
          • flume-ng-channels/flume-file-channel/src/main/java/org/apache/flume/channel/file/EventQueueBackingStoreFile.java
          Hari Shreedharan added a comment -

          Committed, rev: 6373032a620bdc687b6d03b12726713d08c71a10. Thanks Brock!

          Hari Shreedharan added a comment -

          +1. The latest patch looks good. I am running tests and committing it.

          Hari Shreedharan added a comment -

          Adding "queueset" to Log.EXCLUDES and changing EventQueueBackingStoreFile:169 to:

    if (Log.EXCLUDES.contains(origFile.getName())) {

          fixes this issue.
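For context, a sketch of the kind of check involved; the surrounding loop is assumed for illustration, not copied from EventQueueBackingStoreFile:

    // Skip internal bookkeeping files (now including the "queueset"
    // files) instead of treating them as checkpoint state.
    for (File origFile : checkpointDir.listFiles()) {
      if (Log.EXCLUDES.contains(origFile.getName())) {
        continue; // excluded file, not checkpoint data
      }
      // ... process the checkpoint file ...
    }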

Brock Noland added a comment -

RB item: https://reviews.apache.org/r/16107/
          Hari Shreedharan added a comment -

          Assigning to Brock.

          Brock Noland added a comment -

Sounds good. I am updating it a bit, as I think we only want to add items to the queueSet during replay.

          Hari Shreedharan added a comment -

Brock Noland - I think your patch looks good. If you add a couple of tests to that patch (to make sure it works as expected, maybe using the total search counts), I will commit it.

Brock Noland added a comment - edited

Yes, we could use it to store indexes. However, as I noted, I was unable to find a case where something was actually removed and wasn't at index 0, so I think most real removes are very fast. Additionally, I think we should limit MapDB's use for now; once we are more comfortable with it we can expand its use. For example, the overwrite map would be a good place.

          Hari Shreedharan added a comment -

Nice work! That makes sense - and nice find too. I was indeed looking at an off-heap data structure as a possible solution, but didn't know of one or have any experience using one. I don't have a channel where I can see the copy being terribly slow, but I think we can take it one step at a time. If we fix the issue you found - where searching for takes without puts takes too long - then we can actually tell whether copy is problematic (once this patch is in, we'd still see slowness in replay if copy is an issue).

I am also wondering if MapDB can be used to make searches faster too. The patch you provided fixes the specific issue of takes with missing puts (possibly because the old files got deleted), but if we use MapDB as an index, we can find the index of the FEP in the queue and remove it from the queue too (and then copy or compact at the end, either way is fine). It looks like MapDB can give us a map instead of a set, which could be used here to make searches faster.
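A minimal sketch of that map-as-index idea, with hypothetical names (MapDB would supply the map off-heap; a plain HashMap is shown only to illustrate the logic):

    static final long REMOVED_MARKER = -1L;  // assumed sentinel
    long[] queue;                            // the event pointer queue
    Map<Long, Integer> index = new HashMap<Long, Integer>();

    void replayTake(long pointer) {
      Integer pos = index.remove(pointer);   // O(1) instead of an O(N) scan
      if (pos != null) {
        queue[pos] = REMOVED_MARKER;         // compact once at end of replay
      }
    }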

          Brock Noland added a comment -

          Hi Hari,

          In my testing this patch does look to improve the performance of copy! Nice work!

However, I was unable to find a scenario where copy took a significant amount of time during replay.
I think we should find a case where copy takes a large amount of time during a file channel replay
before considering the change to the copy code.

However, in testing this patch, I believe I have found the scenario I have most often seen
cause long replays. In addition, this is the scenario I believe played out in FLUME-2118. In the
thread dump attached to that ticket you see the code in FEQ.remove():

          "lifecycleSupervisor-1-0" prio=10 tid=0x00007fea505f7000 nid=0x279e runnable [0x00007fe84240d000]
             java.lang.Thread.State: RUNNABLE
            at org.apache.flume.channel.file.FlumeEventQueue.remove(FlumeEventQueue.java:195)
            - locked <0x00007fe84d0007b8> (a org.apache.flume.channel.file.FlumeEventQueue)
            at org.apache.flume.channel.file.ReplayHandler.processCommit(ReplayHandler.java:404)
            at org.apache.flume.channel.file.ReplayHandler.replayLog(ReplayHandler.java:327)
            at org.apache.flume.channel.file.Log.doReplay(Log.java:503)
            at org.apache.flume.channel.file.Log.replay(Log.java:430)
            at org.apache.flume.channel.file.FileChannel.start(FileChannel.java:301)
          

In the attached patch FLUME-FC-SLOW-REPLAY-1.patch, which is for demonstration purposes, there is a
new file, TestFileChannelReplayPerformance.java. The test in that file demonstrates the issue.

The issue is that when there are many takes in files that don't have associated puts, we search
the entire queue, which is O(N) for each take in the file.

As you noted, a fast-lookup data structure would solve this. However, if it were a memory-based
data structure it would also consume large amounts of memory where it did not previously.
Specifically, your example of 100 million capacity would result in roughly a 1.5 GB data structure
(100 million * 16 bytes - I use 16 bytes because we have to store Long objects, assuming a 64-bit JVM).

I think we need a fast off-heap data structure to perform these lookups. There is a project
called MapDB (formerly JDBM) that I have used in the past which provides such a data structure.

In the attached patch, FLUME-FC-SLOW-REPLAY-FIX-1.patch, I have used it to provide an off-heap Set
which mirrors the FEQ. Without FLUME-FC-SLOW-REPLAY-FIX-1.patch, TestFileChannelReplayPerformance
takes 50 minutes to replay; with the fix it takes only 6.5 minutes.

          Without fix:

          FlumeEventQueue.logTimings Search Count = 669000.0, Search Time = 3044957.0, Copy Count = 321014.0, Copy Time = 1103.0
TestFileChannelReplayPerformance.testReplayPerformance Total Replay Time = 3500624
          

          With fix:

           FlumeEventQueue.logTimings Search Count = 274449.0, Search Time = 1080.0, Copy Count = 274449.0, Copy Time = 1012.0
           TestFileChannelReplayPerformance.testReplayPerformance Total Replay Time = 396338
          

          NOTE: FLUME-FC-SLOW-REPLAY-1.patch and FLUME-FC-SLOW-REPLAY-FIX-1.patch are not for commit.
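A minimal sketch of the mirror-set logic, with hypothetical method names (the actual patch uses a MapDB off-heap Set; java.util.HashSet is shown only to illustrate the idea):

    Set<Long> queueSet = new HashSet<Long>(); // mirrors the FEQ contents

    void replayPut(long pointer) {
      queue.addTail(pointer);
      queueSet.add(pointer);
    }

    boolean replayTake(long pointer) {
      if (!queueSet.contains(pointer)) {
        return false; // take with no matching put: skip the O(N) scan
      }
      queueSet.remove(pointer);
      return queue.remove(pointer); // pointer is known to be present
    }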

          Roshan Naik added a comment -

Actually, it is not clear to me why this patch will help improve perf. My thinking: since the takes always happen from the head, there should not be a need to shift n/2 items during the replay of every take.

          Roshan Naik added a comment -

A couple of questions:

• Does 'use-fast-replay' need to be disabled for this to kick in? Are any other config changes required? (See the config sketch below.)
• Are there any measurements of how much improvement this provides?
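For reference, a file channel configuration excerpt showing the property in question (agent and channel names are made up):

    agent.channels = c1
    agent.channels.c1.type = file
    agent.channels.c1.checkpointDir = /var/flume/checkpoint
    agent.channels.c1.dataDirs = /var/flume/data
    # fast replay (replay without the checkpoint) is off by default
    agent.channels.c1.use-fast-replay = false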
          Hari Shreedharan added a comment -

          Yes, it is.

          Roshan Naik added a comment -

Curious - is this patch complete?

          Hari Shreedharan added a comment -

Patch that addresses this. Removed the old replay; only compact mode is now available (it is no longer a configurable option, since keeping two code paths would make this riskier). All current tests pass with this patch.

          Hari Shreedharan added a comment -

Looks like my last patch was incomplete (I had created a branch locally and probably took the diff against something other than trunk, sorry!). I will add more tests and update the patch.

          Hari Shreedharan added a comment -

Initial working patch. There are several fixes I want to implement (like integrating this into fast replay and ensuring it does not run when it is a full replay).

I also need to add several tests to make sure things work as intended. With the current patch, all current tests pass, and the batch puts and removes are what get invoked, since that is the default config (verified by running it through a debugger).

          Hari Shreedharan added a comment -

Interestingly, I think we may be able to get rid of CheckpointRebuilder etc. In the case of a full replay, we can optimize by not going to the queue to do removes at all, thus not requiring a compaction: we simply remove from the putBuffer, and if we don't find it there, buffer it (such takes are buffered in normal replay too, as pendingTakes). This would perform at least as well as fast replay but would require less memory, because we are reading data in the order it was written. It also requires less memory than normal replay (surprisingly!), because the puts are buffered inside the queue anyway and takes not in the queue are buffered as pendingTakes. In the normal case it requires more memory than normal replay (since we buffer the takes), but this is likely to be acceptable.

Brock Noland - does that make sense? Remove CheckpointRebuilder and make this the fast replay. I think this can also be made the default, because the memory usage is not substantially higher.
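A minimal sketch of that single-scan replay, with hypothetical names (putBuffer and pendingTakes as plain sets; the real structures may differ):

    Set<Long> putBuffer = new HashSet<Long>();    // puts seen so far
    Set<Long> pendingTakes = new HashSet<Long>(); // takes with no put yet

    void replayPut(long pointer) {
      putBuffer.add(pointer);
    }

    void replayTake(long pointer) {
      if (!putBuffer.remove(pointer)) {
        pendingTakes.add(pointer); // its put may be in a deleted file
      }
    }
    // After the single scan, putBuffer holds the surviving events -
    // no queue search and no compaction is needed.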

          Hari Shreedharan added a comment -

          Updated doc, with remove even when singleScanReplay is not set.

          Hari Shreedharan added a comment -

Document updated to terminate the linear scan once all takes have been marked as removed.

Hari Shreedharan added a comment - edited

          Relevant numbers from each run:
          10000-20000:

          Time in searches: 7489
          Time in moves: 14021
          Total # of moves: 99999999
          

          100000-110000:

          Time in searches: 68949
          Time in moves: 134158
          Total # of moves: 1000089999
          

          300000-310000:

          Time in searches: 206698
          Time in moves: 433623
          Total # of moves: 3000289999
          

          700000-710000:

          Time in searches: 461028
          Time in moves: 425380
          Total # of moves: 2950295000
          

As you can see, searches and moves are both expensive (moves are really still traversals doing some extra work). The reason the searches are more expensive than the moves in the 700K-710K run is that searches traverse more of the queue each time, while the moves are smarter because we just pull events up from the bottom. The files I attached have details of how much time each search and move takes.

The new algorithm reduces the traversals to two (one to mark the events and one to compact) and the number of moves to at most the total number of events in the channel (in practice fewer - really the highest index of a removed element).

          Hari Shreedharan added a comment -

Each of these queues had 1 million FlumeEventPointers. Each file name shows the lower and upper limits of the removes called; for example, 700000-710000 is the log for remove(700000) through remove(710000).

The patch shows the code change in the test and the logging added to FEQ to produce these logs.

          Hari Shreedharan added a comment -

I did some testing earlier today to total up the time required for scans and moves. I will post it soon.

          Hari Shreedharan added a comment -

By the way, the number of scans is also reduced to one (by buffering the takes right up to the end and then marking them in one shot).

          Hari Shreedharan added a comment -

Ah, ok. Yes, I think we should be able to do that. The only issue is that removes from the top of the queue are inexpensive; it is only when we remove from somewhere in the middle that we hit an issue. Let me create a unit test which does this.

          Brock Noland added a comment -

          Hari,

Thanks for the numbers! I agree that the new algorithm could significantly reduce copy time. I am sorry - I am afraid I was not clear enough earlier. The document makes the assumption that by improving the copy cost we'll significantly improve replay times. What I'd like to see is empirical evidence of that. I think this could be achieved by placing timers around the remove method and separating out search time from copy time.

          Does this make sense?

          Hari Shreedharan added a comment -

Here are some numbers for just the current remove algorithm:
Assume a queue with 1 million elements, where elements 400,000 to 450,000 are removed.
Each remove causes every element above it to get moved down. So removing element #400,000 causes 399,999 moves, removing 400,001 causes another 399,999 (since there are only 399,999 elements before it now), removing 400,002 causes another 399,999, and so on.
Total number of copy operations = 399,999 * 50,000 = 19,999,950,000 - this will probably take forever to complete.

With the new algorithm, all 50,000 elements are simply marked as removed. We start from the 1 millionth element and move up, finally hitting a REMOVED_MARKER at 450,000 (the first point where writePos stops advancing with readPos) - so the first copy is from 399,999 to 450,000, and the total number of moves is 399,999. I noticed that there is a bug in the compact pseudo-code; I will fix and update.
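A minimal sketch of that mark-and-compact pass, with hypothetical names (indices follow the example above, with the head of the queue at index 0):

    static final long REMOVED_MARKER = -1L; // assumed sentinel value

    // One bottom-up traversal: each surviving pointer is copied down
    // over the marked slots, leaving the free space at the head.
    void compact(long[] queue, int size) {
      int writePos = size - 1;
      for (int readPos = size - 1; readPos >= 0; readPos--) {
        if (queue[readPos] != REMOVED_MARKER) {
          if (writePos != readPos) {
            queue[writePos] = queue[readPos]; // a real move
          }
          writePos--;
        }
      }
      // slots 0..writePos are now free; the number of real moves is
      // bounded by the count of live elements above the deepest
      // removed slot (399,999 in the example above)
    }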

          Hari Shreedharan added a comment -

Also, I meant non-trivial: depending on the number of takes, it can end up holding several million elements, which can be several tens to hundreds of MB in the worst case.

          Hari Shreedharan added a comment -

          Sure. I will add some worst-case complexity numbers.

          Brock Noland added a comment -

          Hari,

Thank you for taking this on! Before we hack on this, can we put a few counters in to show the cost of the linear search versus the move?

          "it is likely to be non-trivial"

          I assume you mean trivial?

          Hari Shreedharan added a comment -

          Minor clarification of the new algorithms

          Hari Shreedharan added a comment -

Updated version of the design document, now with pseudo-code for the algorithms.

          Hari Shreedharan added a comment -

In fact, since all of the takes would be at the top of the queue, it might be smarter to move events down rather than up. Another idea would be to start from the middle of the queue and pull the first half of the events down and the second half up (this would complicate things, and I am not entirely sure it would yield a significant advantage over moving things down - if there are takes in the second half of the queue, the queue is likely small enough that the first approach would not cause any real latency).

          Hari Shreedharan added a comment -

          Initial design doc for compaction in replay


People

• Assignee: Brock Noland
• Reporter: Hari Shreedharan
• Votes: 0
• Watchers: 7
