UIMA-1067

Remove char heap/ref heap in StringHeap of the CAS

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.2.2
    • Fix Version/s: 2.3
    • Component/s: Core Java Framework
    • Labels:
      None

      Description

      The StringHeap class provides two ways to store strings: either as Java strings, or by copying characters onto a character heap. The second option is only used for deserialization from a binary CAS. However, even when it is not used, this capability carries a very significant memory overhead.

      To demonstrate this, I ran the following experiment. As the analysis engine, I used our sandbox POS tagger, which sets just one string feature on each token. As text, I used a 2.4MB input file (2x moby.txt). To run this in IBM Java 1.5.0_7 (which happens to be the JVM I'm interested in), you need to specify -Xmx135M; I checked heap sizes in 5MB increments. Then I patched the StringHeap implementation to work without the additional bookkeeping overhead and ran the experiment again. I was then able to run with -Xmx115M.

      This is a very significant gain, particularly given that I ran so little analysis (only tokens and sentences are produced, and only a single string-valued feature is set). The new code also ran a tiny bit faster, but not by much. One might see more improvement for analysis that is not as compute-intensive as the tagger.

      The challenge is to make sure that the serialization code still works after this change.
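
      To make the idea concrete, here is a minimal sketch of a string store that keeps only Java strings and hands out integer codes, with no character heap and no reference heap. It is an illustration under stated assumptions, not the actual StringHeap code or the patch: the class name SimpleStringHeap, its method names, and the convention that code 0 stands for the null string are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, NOT the UIMA StringHeap: strings are kept as ordinary
// Java String objects and addressed by an int code. There is no char heap
// and no separate reference heap to maintain.
final class SimpleStringHeap {
    // Assumption for this sketch: code 0 is reserved for the null string,
    // so slot 0 of the list holds null.
    private final List<String> strings = new ArrayList<>();

    SimpleStringHeap() {
        strings.add(null);
    }

    // Stores s and returns its int code; null always maps to code 0.
    int addString(String s) {
        if (s == null) {
            return 0;
        }
        strings.add(s);
        return strings.size() - 1;
    }

    // Returns the string stored under the given code (null for code 0).
    String getString(int code) {
        if (code < 0 || code >= strings.size()) {
            throw new IndexOutOfBoundsException("Invalid string code: " + code);
        }
        return strings.get(code);
    }

    // Number of stored strings, not counting the reserved null slot.
    int size() {
        return strings.size() - 1;
    }
}
```

      In a design like this, the only per-string bookkeeping is one list slot; any character-array layout needed by a binary format would have to be produced at serialization time rather than kept in memory throughout.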

        Activity

        Thilo Goetz created issue -
        Thilo Goetz added a comment -

        Fixed, all unit tests pass. Please test this change if you use (binary) serialization. It should work the same as before; I haven't changed the serialization format in any way.

        Thilo Goetz made changes -
        Field Original Value New Value
        Status Open [ 1 ] Closed [ 6 ]
        Resolution Fixed [ 1 ]
        Thilo Goetz added a comment -

        Fix in 2.2.2 hotfix 1.

        Thilo Goetz made changes -
        Resolution Fixed [ 1 ]
        Status Closed [ 6 ] Reopened [ 4 ]
        Thilo Goetz added a comment -

        Backported to 2.2.2-01.

        Thilo Goetz made changes -
        Resolution Fixed [ 1 ]
        Status Reopened [ 4 ] Closed [ 6 ]
        Transition            Time In Source Status   Execution Times   Last Executer   Last Execution Date
        Open -> Closed        4h                      1                 Thilo Goetz     06/Jun/08 15:21
        Closed -> Reopened    17d 21h 26m             1                 Thilo Goetz     24/Jun/08 12:48
        Reopened -> Closed    1d 23h 5m               1                 Thilo Goetz     26/Jun/08 11:54

          People

          • Assignee: Thilo Goetz
          • Reporter: Thilo Goetz
          • Votes: 0
          • Watchers: 0
