UIMA-2385

Improve XmiCasDeserializer performance

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.4.0SDK
    • Fix Version/s: 2.4.1SDK
    • Component/s: Core Java Framework
    • Labels:
      None

      Description

      I profiled an expensive CAS deserialization and found that 46% of the time was in CASImpl.ll_getFSForRef (the method that creates a FeatureStructure Java object for a CAS FS). All those calls were coming from deserializing arrays (of which this particular CAS has many).

      It is unnecessary to create FeatureStructure Java objects here. For non-array FSs, XmiCasDeserializer uses low-level CAS APIs in order to avoid this overhead. But for arrays, it currently does not use the low-level APIs.

      Attachments

      1. UIMA-2385.bps.patch (0.6 kB) - Benjamin Segal


          Activity

          Adam Lally created issue
          Adam Lally made changes -
          Assignee: Adam Lally [ alally ]
          Adam Lally made changes -
          Status: Open [ 1 ] → In Progress [ 3 ]
          Adam Lally added a comment -

          Well, I may have gotten some bad data from the profiler (YourKit), since deserialization time appears to be the same after my "fix".

          Adam Lally made changes -
          Status: In Progress [ 3 ] → Open [ 1 ]
          Adam Lally made changes -
          Summary: Improve XmiCasDeserializer performance by using low-level CAS APIs to create arrays → Improve XmiCasDeserializer performance
          Benjamin Segal added a comment -

          Upon further analysis, the poor performance is due to (slowly) expanding the CAS heaps. If the XMI CAS contains hundreds of thousands to millions of 8-, 16-, or 64-bit feature values which are being added roughly one (or a few) at a time (through the XMI deserialization), then there is a high CAS expansion cost with the current implementation.

          The existing byte/short/long heap expansion algorithm is defined in CommonAuxHeap. The idea is that the heap will grow exponentially until a threshold is reached (DEFAULT_HEAP_MULT_LIMIT), after which point exponential growth is replaced with linear growth. All that sounds reasonable, but DEFAULT_HEAP_MULT_LIMIT is only defined to be 1024. Once the array grows to 1024 entries, the algorithm only expands by 1024 entries at a time.
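          The policy described above can be sketched as follows. This is a simplified illustration, not the actual CommonAuxHeap code; the class and method names, and the exact doubling factor, are assumptions for the sake of the example.

          ```java
          // Simplified sketch of the exponential-then-linear heap growth
          // policy described above. Names and the doubling factor are
          // assumptions, not the actual CommonAuxHeap implementation.
          public class AuxHeapGrowth {
              static final int DEFAULT_HEAP_MULT_LIMIT = 1024; // current default per this report

              // Next capacity: double while below the limit, then grow
              // linearly by DEFAULT_HEAP_MULT_LIMIT entries per expansion.
              static int nextCapacity(int current) {
                  if (current < DEFAULT_HEAP_MULT_LIMIT) {
                      return current * 2;                    // exponential phase
                  }
                  return current + DEFAULT_HEAP_MULT_LIMIT;  // linear phase
              }

              public static void main(String[] args) {
                  System.out.println(nextCapacity(512));   // 1024 (still doubling)
                  System.out.println(nextCapacity(4096));  // 5120 (linear, +1024)
              }
          }
          ```

          With a limit of 1024, every expansion past that point copies the entire heap to gain only 1024 new entries, which is what makes large arrays so expensive to build incrementally.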

          In our case, the XMI deserialization is only adding a few feature values at a time, slowly expanding out to 1.5 million entries over 18 seconds (for 5 million, it's over 2 minutes).

          Furthermore, the byte/short/long heap seeding values (dealing with CAS expansion) are not currently configurable or exposed through the CAS (CASImpl) or CAS creation utility.

          I think, minimally, we should consider increasing DEFAULT_HEAP_MULT_LIMIT to something a bit larger to allow for quicker expansion. The regular (32-bit) Heap uses a default size of 500,000. For what it's worth, I changed DEFAULT_HEAP_MULT_LIMIT to 512K and cut the deserialization time by two-thirds or more (from 18 seconds down to 6 seconds, and from 140 seconds down to 20 seconds). From a memory footprint perspective, that would allow exponential expansion up to 4 MB (512K * 8 bytes), and then linear expansion from there, which seems reasonable.
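          To see why the limit matters so much, here is a hypothetical step counter (the helper name and the initial capacity of 16 are assumptions, not UIMA code) comparing how many grow-and-copy operations are needed to reach 1.5 million entries under each limit:

          ```java
          // Hypothetical illustration: count grow-and-copy operations needed
          // to reach a target capacity under the exponential-then-linear
          // policy. The initial capacity of 16 is an assumed starting point.
          public class ExpansionSteps {
              static int stepsToReach(long initial, long target, long multLimit) {
                  long cap = initial;
                  int steps = 0;
                  while (cap < target) {
                      // Double while below the limit, then add multLimit per step.
                      cap = (cap < multLimit) ? cap * 2 : cap + multLimit;
                      steps++;
                  }
                  return steps;
              }

              public static void main(String[] args) {
                  // Old limit of 1024: 1470 grow-and-copy operations.
                  System.out.println(stepsToReach(16, 1_500_000, 1024));
                  // Proposed 512K limit: only 17.
                  System.out.println(stepsToReach(16, 1_500_000, 512 * 1024));
              }
          }
          ```

          Since each expansion copies the whole heap so far, the old limit does roughly quadratic total copying, which lines up with the 18-second observation above.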

          We could also consider exposing DEFAULT_HEAP_MULT_LIMIT as a "settable" property (something analogous to CAS_INITIAL_HEAP_SIZE), and allow the CAS creation utility to honor the requested limit.

          It should also be noted that binary CAS deserialization does not suffer this same fate: the total heap sizes are self-described within the binary blob, so they are known prior to allocating the (heap) arrays.

          And it's also worth noting that this overhead may or may not be problematic, depending on the various types of use cases.

          Adam Lally added a comment -

          Excellent. I completely agree with increasing the DEFAULT_HEAP_MULT_LIMIT to 512K. Please submit a patch. I don't have any particularly strong feeling that it needs to be configurable, but if you want to do that I wouldn't object.

          deprecated (use "rec") added a comment -

          I think it could be useful as a tunable parameter in scenarios where many CASes are in memory at the same time. Is there any particular reason that it should not be configurable?

          Adam Lally added a comment -

          I consider this analogous to how an ArrayList's capacity grows in Java. They provided some reasonable implementation that seems to work for everyone, and they don't expose ways to tune it.

          With the mult limit at 512K we are talking about at most 4 MB of unused space, and I don't consider that excessive. As I said though, I wouldn't object if someone made this parameterizable.

          Benjamin Segal made changes -
          Attachment: UIMA-2385.bps.patch [ 12527501 ]
          Benjamin Segal added a comment -

          I attached a patch that changes DEFAULT_HEAP_MULT_LIMIT to 512K. I did not include functionality to make this parameterizable (though that could be done in the future, if necessary).

          Marshall Schor added a comment -

          Applied patch.

          Marshall Schor made changes -
          Status: Open [ 1 ] → Closed [ 6 ]
          Assignee: Adam Lally [ alally ] → Marshall Schor [ schor ]
          Fix Version/s: 2.4.1SDK [ 12319201 ]
          Resolution: Fixed [ 1 ]
          Marshall Schor made changes -
          Link: This issue relates to UIMA-4279 [ UIMA-4279 ]
          Transition            Time In Source Status   Execution Times   Last Executer    Last Execution Date
          Open → In Progress    2d 16h 54m              1                 Adam Lally       09/Apr/12 15:49
          In Progress → Open    4d 4h 22m               1                 Adam Lally       13/Apr/12 20:11
          Open → Closed         46d 19h 56m             1                 Marshall Schor   30/May/12 16:08

            People

            • Assignee: Marshall Schor
            • Reporter: Adam Lally
            • Votes: 0
            • Watchers: 3
