Commons JCS
  1. Commons JCS
  2. JCS-3

File Optimization expensive in memory and disk space

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: jcs-1.2.7.9
    • Component/s: Indexed Disk Cache
    • Labels:
      None
    • Environment:
      Any

      Description

      The file optimization for the IndexedDiskCache creates a duplicate of the cache file. In deployments with large volumes of data this doubling of disk usage can be problematic. If this could be done inside the single file, it would be easier on the disk.

      Additionally, OutOfMemoryErrors are common due to the deserializing and reserializing of the objects in the cache. Given the the size in bytes of each object in the file is known (it's part of the record), the objects could be moved in the file without recreating an instance of the object in memory. The use of a single buffer for reading and then writing the data (to either the same file or a new one) would be much more efficient for handling this optimization.

      1. StandardSerializer.java
        2 kB
        Peter Schwarz
      2. IndexedDiskElementDescriptor.java
        2 kB
        Peter Schwarz
      3. IndexedDiskDumper.java
        2 kB
        Peter Schwarz
      4. IndexedDiskCache.java
        42 kB
        Peter Schwarz
      5. indexed-disk.patch
        73 kB
        Peter Schwarz
      6. indexed-disk.patch
        73 kB
        Peter Schwarz
      7. IndexedDisk.java
        8 kB
        Peter Schwarz

        Issue Links

          Activity

          Hide
          Aaron Smuts added a comment -

          This enhancement has been applied for some time now. Closing issue.

          Show
          Aaron Smuts added a comment - This enhancement has been applied for some time now. Closing issue.
          Hide
          Aaron Smuts added a comment -

          The patch has been applied and undergone sustained testing. I fixed several bugs and added a few unit tests.

          The optimization routine is now far superior.

          Show
          Aaron Smuts added a comment - The patch has been applied and undergone sustained testing. I fixed several bugs and added a few unit tests. The optimization routine is now far superior.
          Hide
          Aaron Smuts added a comment -

          The optimization patch broke the recyle bin. I fixed it and added two new unit tests to verify that it is being used properly.

          I added a bytes free counter and unit tests to verify that is works properly. I added the counter value to the stats.
          Eventually the optimization should be driven by this number.

          I added a new load test for sustained testing of the disk cache.

          I added a new config property called OptimizeOnShutdown which allows you to turn off the shutdown optimization.

          I added two new simple unit tests for the sorted pref array.

          I made some minor formatting changes.

          Show
          Aaron Smuts added a comment - The optimization patch broke the recyle bin. I fixed it and added two new unit tests to verify that it is being used properly. I added a bytes free counter and unit tests to verify that is works properly. I added the counter value to the stats. Eventually the optimization should be driven by this number. I added a new load test for sustained testing of the disk cache. I added a new config property called OptimizeOnShutdown which allows you to turn off the shutdown optimization. I added two new simple unit tests for the sorted pref array. I made some minor formatting changes.
          Hide
          Aaron Smuts added a comment -

          I made a few changes to the overall patch.

          I added more info logging, since all the major parts of JCS log major events at info.

          I added a bit to some of the javadocs.

          I added more unit tests and will add more to methods such as detect overlap.

          I don't call defrag for the put list if it is empty.

          I'll check in my initial changes that bring in the bulk of the patch. The new optimization methods look nice.

          Show
          Aaron Smuts added a comment - I made a few changes to the overall patch. I added more info logging, since all the major parts of JCS log major events at info. I added a bit to some of the javadocs. I added more unit tests and will add more to methods such as detect overlap. I don't call defrag for the put list if it is empty. I'll check in my initial changes that bring in the bulk of the patch. The new optimization methods look nice.
          Hide
          Aaron Smuts added a comment -

          I'm adding additional unit tests and more info logging about the major events.

          The JCS logging is set such that you must run at info or hihger in prod.

          Show
          Aaron Smuts added a comment - I'm adding additional unit tests and more info logging about the major events. The JCS logging is set such that you must run at info or hihger in prod.
          Hide
          Aaron Smuts added a comment -

          The patch is a bit hard to work with. It uses trace logging which we don't use in JCS. All the other classes use debug as the lowest level. There are many missing or malformed javadocs. There are very few tests for the new methods. And it's very big! . . .

          I'm working on getting making it a bit easier to compare. I should have something soon.

          Show
          Aaron Smuts added a comment - The patch is a bit hard to work with. It uses trace logging which we don't use in JCS. All the other classes use debug as the lowest level. There are many missing or malformed javadocs. There are very few tests for the new methods. And it's very big! . . . I'm working on getting making it a bit easier to compare. I should have something soon.
          Hide
          Peter Schwarz added a comment -

          This issue depends on JCS-1 being fixed. The patch contains a fix for it, as well as the patch attached to JCS-1.

          Show
          Peter Schwarz added a comment - This issue depends on JCS-1 being fixed. The patch contains a fix for it, as well as the patch attached to JCS-1 .
          Hide
          Peter Schwarz added a comment -

          Also, there are a few additional bugs we noticed today:

          1. The createPositionSortedDescriptorList() method iterates over LRUMapEntry objects, and should iterate over Map.Entry objects intead (class cast exception, if we're not using the LRUMap). That one's my fault.

          2. The IndexedDisk.writeObject() method blows the roof off of the memory when writing the keys for a large (i.e. 5 million) set of keys. This could be fixed using some buffers, streams and etc to wrap the bytes, instead of the ByteArrayOutputStream.

          Show
          Peter Schwarz added a comment - Also, there are a few additional bugs we noticed today: 1. The createPositionSortedDescriptorList() method iterates over LRUMapEntry objects, and should iterate over Map.Entry objects intead (class cast exception, if we're not using the LRUMap). That one's my fault. 2. The IndexedDisk.writeObject() method blows the roof off of the memory when writing the keys for a large (i.e. 5 million) set of keys. This could be fixed using some buffers, streams and etc to wrap the bytes, instead of the ByteArrayOutputStream.
          Hide
          Peter Schwarz added a comment -

          To respond to your observations, in a back-stepping fashion:

          4. we'll will probably use internally (shouldn't have made it in with the diff), until a complete fix is made.

          My reasoning for using that flag, though, was that it is always set to true, and not accessible to users outside the JCS interface to the cache. It didn't seem terribly public.

          Granted, that was making an architectural change, outside the proper place. The flag should probably be put on the object that wraps the key-value pair (CacheElement).

          3. I'll send the test files, the defrag test we're using. It's not a direct unit test, since all of the issues we encountered while rewriting the defrag was through threading. Pretty hard to write it that way. But we probably should write a unit test around

          I have a whole philosophical argument against writing unit tests against private methods by making them protected involving software api contracts, but I won't get into that...

          2. Might have missed a few places for that. Ideally, it's really only required for messages that append things (saves the appending) since log ignores are pretty fast in themselves. I'm not against the checks being put in there. It's not really the performance bottleneck here.

          1. Custom serialization is already in the Java language spec. The users (i.e. the developers writing objects to be put in the cache) should be use Externalizable if they want something different (more compact, different initialization, etc).

          Although, I suppose the argument could be made for an XML Serialization, but I'm not really sure there's a value in using it in the context of object caching. That should again be accomplished through Exteralizable though.

          Either way. The pattern for using the Serializers wasn't clear, since was half implemented. I was just replacing the private methods of IndexedDisk with the serializer. It's a step towards using the Serializers, and can be more easily refactored in the future, when the custom serializer feature is fully implemented.

          Show
          Peter Schwarz added a comment - To respond to your observations, in a back-stepping fashion: 4. we'll will probably use internally (shouldn't have made it in with the diff), until a complete fix is made. My reasoning for using that flag, though, was that it is always set to true, and not accessible to users outside the JCS interface to the cache. It didn't seem terribly public. Granted, that was making an architectural change, outside the proper place. The flag should probably be put on the object that wraps the key-value pair (CacheElement). 3. I'll send the test files, the defrag test we're using. It's not a direct unit test, since all of the issues we encountered while rewriting the defrag was through threading. Pretty hard to write it that way. But we probably should write a unit test around I have a whole philosophical argument against writing unit tests against private methods by making them protected involving software api contracts, but I won't get into that... 2. Might have missed a few places for that. Ideally, it's really only required for messages that append things (saves the appending) since log ignores are pretty fast in themselves. I'm not against the checks being put in there. It's not really the performance bottleneck here. 1. Custom serialization is already in the Java language spec. The users (i.e. the developers writing objects to be put in the cache) should be use Externalizable if they want something different (more compact, different initialization, etc). Although, I suppose the argument could be made for an XML Serialization, but I'm not really sure there's a value in using it in the context of object caching. That should again be accomplished through Exteralizable though. Either way. The pattern for using the Serializers wasn't clear, since was half implemented. I was just replacing the private methods of IndexedDisk with the serializer. It's a step towards using the Serializers, and can be more easily refactored in the future, when the custom serializer feature is fully implemented.
          Hide
          Aaron Smuts added a comment -

          I'm just now going over the some of the changes. They look very interesting and should be radical improvements to the disk cache. Thanks.

          A few initial comments. I'll post more as I get deeper into the code. I'll run some tests tomorrow or Friday.

          1. The general plan for the cache is that you should be able to define a custom serializer for any auxiliary. If possible I don't want them to be static, as this is in the revised IndexedDisk file:

          private static final StandardSerializer SERIALIZER = new StandardSerializer();

          2. All log statements below warn must be wrapped in code guards, even if they don't concatenate a String, since someone might change them in the future. . .

          3. Can you send me the test file too. I'm interested in the tests you are using for the optimization process, etc. I may want to make some of the methods, such as defrag, protected so they can be tested directly with more precision. It looks like you added a few tests, but not many and not enough for such major changes.

          4. I really don't like the change you made to keep the cache from thrashing when the memory is full. The change is pasted below. I don't want to change an element attribute underneath the user, since they may use these attributes to clone addition instances. Attribute changes like this should be explicitly user defined. The general idea is kind of interesting. Perhaps an additional flag, indicating that the item is on disk and does not need to be spooled would be ideal. We are also going to add the ability to treat the disk cache like any other auxiliary, that is, we will not jsut use it as a swap--items will be put to disk and memory simultaneously. In that scenario it is important to keep them from spooling on to disk as well. I'm not sure what the best solution is, but I'll come up with something soon.

          ===================================================================
          — C:/ide/workspace/jcs/src/java/org/apache/jcs/engine/control/CompositeCache.java (revision 426131)
          +++ C:/ide/workspace/jcs/src/java/org/apache/jcs/engine/control/CompositeCache.java (working copy)
          @@ -539,6 +539,10 @@
          // into purgatory
          if ( memCache.getCacheAttributes().getMaxObjects() > 0 )

          { + // We'll set this to false to prevent the element from spooling back to disk. + // If the user changes it, they should do another put anyway, which + // will reset this back to the default value for the cache. + element.getElementAttributes().setIsSpool(false); memCache.update( element ); }

          else
          Index: C:/ide/workspace/jcs/src/java/org/apache/jcs/auxiliary/disk/indexed/IndexedDiskCache.java

          Show
          Aaron Smuts added a comment - I'm just now going over the some of the changes. They look very interesting and should be radical improvements to the disk cache. Thanks. A few initial comments. I'll post more as I get deeper into the code. I'll run some tests tomorrow or Friday. 1. The general plan for the cache is that you should be able to define a custom serializer for any auxiliary. If possible I don't want them to be static, as this is in the revised IndexedDisk file: private static final StandardSerializer SERIALIZER = new StandardSerializer(); 2. All log statements below warn must be wrapped in code guards, even if they don't concatenate a String, since someone might change them in the future. . . 3. Can you send me the test file too. I'm interested in the tests you are using for the optimization process, etc. I may want to make some of the methods, such as defrag, protected so they can be tested directly with more precision. It looks like you added a few tests, but not many and not enough for such major changes. 4. I really don't like the change you made to keep the cache from thrashing when the memory is full. The change is pasted below. I don't want to change an element attribute underneath the user, since they may use these attributes to clone addition instances. Attribute changes like this should be explicitly user defined. The general idea is kind of interesting. Perhaps an additional flag, indicating that the item is on disk and does not need to be spooled would be ideal. We are also going to add the ability to treat the disk cache like any other auxiliary, that is, we will not jsut use it as a swap--items will be put to disk and memory simultaneously. In that scenario it is important to keep them from spooling on to disk as well. I'm not sure what the best solution is, but I'll come up with something soon. =================================================================== — C:/ide/workspace/jcs/src/java/org/apache/jcs/engine/control/CompositeCache.java (revision 426131) +++ C:/ide/workspace/jcs/src/java/org/apache/jcs/engine/control/CompositeCache.java (working copy) @@ -539,6 +539,10 @@ // into purgatory if ( memCache.getCacheAttributes().getMaxObjects() > 0 ) { + // We'll set this to false to prevent the element from spooling back to disk. + // If the user changes it, they should do another put anyway, which + // will reset this back to the default value for the cache. + element.getElementAttributes().setIsSpool(false); memCache.update( element ); } else Index: C:/ide/workspace/jcs/src/java/org/apache/jcs/auxiliary/disk/indexed/IndexedDiskCache.java
          Hide
          Peter Schwarz added a comment -

          Added the java files that surround just the IndexedDiskCache changes (as there were a couple of minor changes that were not related).

          Show
          Peter Schwarz added a comment - Added the java files that surround just the IndexedDiskCache changes (as there were a couple of minor changes that were not related).
          Hide
          Peter Schwarz added a comment -

          Use the second file. Caught one minor bug with the statistics incrementing.

          Show
          Peter Schwarz added a comment - Use the second file. Caught one minor bug with the statistics incrementing.
          Hide
          Peter Schwarz added a comment -

          On more thing: Since we're dealing with large volumes of data we changed the logging around a little bit. Trace is a more accurate usage for logging each crud op, debug is appropriate for all of the timing statements,and info really should only be used for notification that we've started/stopped.

          Show
          Peter Schwarz added a comment - On more thing: Since we're dealing with large volumes of data we changed the logging around a little bit. Trace is a more accurate usage for logging each crud op, debug is appropriate for all of the timing statements,and info really should only be used for notification that we've started/stopped.
          Hide
          Peter Schwarz added a comment -

          A bit more on the solution provided:

          We chose to shift the elements on disk, instead of copying them to a new file. This is accomplished through compressing out the dead space between records. The initial optimization operates on a snapshot of the current element descriptors in memory. The locking is broken up in to three segments: lock while we take the snapshot and switch into optimization mode, lock during a move of an individual record, and lock while we optimize the queued puts (using the same compression technique) and switch out of optimization mode.

          All changes for the descriptors are now done on the object themselves. What I mean by this is that they are no longer treated as if they were immutable. It's key, for this routine, that the position on the descriptor change as the record is moved, this means that updates need to only alter the length on the descriptor, not replace it.

          Show
          Peter Schwarz added a comment - A bit more on the solution provided: We chose to shift the elements on disk, instead of copying them to a new file. This is accomplished through compressing out the dead space between records. The initial optimization operates on a snapshot of the current element descriptors in memory. The locking is broken up in to three segments: lock while we take the snapshot and switch into optimization mode, lock during a move of an individual record, and lock while we optimize the queued puts (using the same compression technique) and switch out of optimization mode. All changes for the descriptors are now done on the object themselves. What I mean by this is that they are no longer treated as if they were immutable. It's key, for this routine, that the position on the descriptor change as the record is moved, this means that updates need to only alter the length on the descriptor, not replace it.
          Hide
          Peter Schwarz added a comment -

          The attached patch includes a new version of the optimization routine, which involved extensive rewrites of the IndexedDiskCache class, and major changes to IndexedDisk.

          This patch's optimization routine works as follows:

          1. Shutdown recycling and turn on queuing of puts.
          2. Take a snapshot of the current descriptors. If there are any removes, ignore them, as they will be compacted during the next optimization.
          3. Optimize the snapshot. For each descriptor:
          1. Obtain the write-lock.
          2. Shift the element on the disk, in order to compact out the free space.
          3. Release the write-lock. This allows elements to still be accessible
          during optimization.
          4. All queued puts are made at the end of the file. Optimize these under a single write-lock.
          5. Restore system to standard operation.

          Additionally, this contians a fix to the problem from JCS-1

          Show
          Peter Schwarz added a comment - The attached patch includes a new version of the optimization routine, which involved extensive rewrites of the IndexedDiskCache class, and major changes to IndexedDisk. This patch's optimization routine works as follows: 1. Shutdown recycling and turn on queuing of puts. 2. Take a snapshot of the current descriptors. If there are any removes, ignore them, as they will be compacted during the next optimization. 3. Optimize the snapshot. For each descriptor: 1. Obtain the write-lock. 2. Shift the element on the disk, in order to compact out the free space. 3. Release the write-lock. This allows elements to still be accessible during optimization. 4. All queued puts are made at the end of the file. Optimize these under a single write-lock. 5. Restore system to standard operation. Additionally, this contians a fix to the problem from JCS-1

            People

            • Assignee:
              Aaron Smuts
              Reporter:
              Peter Schwarz
            • Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development