UIMA / UIMA-483

JCas method like getSofaDataString that doesn't copy the chars from the StringHeap

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.1, 2.2
    • Fix Version/s: 2.3
    • Component/s: Core Java Framework
    • Labels:
      None

      Description

      I process large documents: the String I pass to JCas.setSofaDataString may be as large as 100 MB (50,000,000 chars). This causes the JVM to run out of memory when we have many concurrent AnalysisEngines running.

      I traced JCas.getSofaDataString(), and it eventually calls StringHeap.getStringForCode(), which does a "new String" from its private char[] (which makes a copy).

      This would happen for each annotator. We have five, so the 100 MB has become 600 MB (the original plus five copies). Multiply by 10 concurrent AnalysisEngines, and that's 6,000 MB.

      Perhaps there could be a variation on getSofaDataString that returns one of the other classes (besides String) that implement CharSequence. A CharBuffer perhaps, or even a new class that implements the CharSequence interface but is read-only (just four methods). Or even just return a char[], or a char[] plus begin/end offsets into the StringHeap.

      If nothing else, perhaps the document text should be treated specially from all the little strings in the StringHeap, and be stored separately, so calls to getSofaDataString() simply return a reference to an existing String object, without copying.

      I'm open to possibilities; I just need the copying to end.
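
The read-only CharSequence idea suggested above can be sketched with standard java.nio classes. This is only an illustration under assumptions: the class name `ReadOnlyTextView`, the `heap` array, and the offsets are hypothetical stand-ins for the StringHeap's private storage, which UIMA does not expose in this form.

```java
import java.nio.CharBuffer;

class ReadOnlyTextView {
    // Hypothetical helper: returns a read-only CharSequence over
    // chars[begin, end) that shares the array instead of copying it.
    static CharSequence view(char[] chars, int begin, int end) {
        return CharBuffer.wrap(chars, begin, end - begin).asReadOnlyBuffer();
    }

    public static void main(String[] args) {
        // Stand-in for a large character heap holding the document text.
        char[] heap = "xxHello, UIMAyy".toCharArray();
        CharSequence text = view(heap, 2, 13);

        System.out.println(text.length());
        System.out.println(text);

        // A change to the backing array is visible through the view,
        // which shows that no copy of the characters was made.
        heap[2] = 'J';
        System.out.println(text.charAt(0));
    }
}
```

Note that calling toString() on such a view would still copy the characters, so annotators would have to work against the CharSequence interface directly to benefit.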

        Activity

        Thilo Goetz added a comment -

        That's not really what should happen. There are two ways strings are kept in the CAS: either as String objects, or as character data. The regular APIs all use the version where the CAS simply keeps a reference to the original String object, and that's what the sofa data APIs also do (or at least I think so from eyeballing them). So there should be no copying going on, the relevant piece of code being this from StringHeap.getStringForCode():

        if (internalStringCode != NULL) {
          return (String) this.stringList.get(internalStringCode);
        }

        If you have traced your code in the debugger and found that this is not used, and instead the String constructor is called as you describe, it would be helpful if you could provide a test case.

        The character data method of keeping string data in the CAS is obsolete. I'll see if there are any real dependencies on it, or if we can completely remove that code.

        --Thilo

        Marshall Schor added a comment -

        Eddie remarked that the JNI interface to C++ analytics and the blob serialization method both use the string heap for transfer. In those cases the CAS is delivered [or delivered back] to a Java CAS with strings in the old character array string heap. Also, there is a low-level CAS API that still creates string features in the old string heap.

        One possible improvement: the Java implementation could change the code where it finds "internalStringCode == null" so that it not only creates a new Java string, but also stores it in the string list, updating the heap so that future references would find internalStringCode != null. Serialization that uses the character array format would need to be updated so that it does not add these strings to the output twice.
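
A minimal sketch of this "create once, then reuse" idea follows. It is not the UIMA StringHeap: the class `CachingStringHeap` and its fields are hypothetical, and a plain array plays the role of the string list.

```java
class CachingStringHeap {
    private final char[] charData;   // legacy character-array storage
    private final int[] start;       // start offset of each string code
    private final int[] length;      // length of each string code
    private final String[] strings;  // String objects, null until first requested

    CachingStringHeap(char[] charData, int[] start, int[] length) {
        this.charData = charData;
        this.start = start;
        this.length = length;
        this.strings = new String[start.length];
    }

    // The first call for a code copies the characters into a new String;
    // later calls return the same String object without copying.
    String getStringForCode(int code) {
        String s = strings[code];
        if (s == null) {
            s = new String(charData, start[code], length[code]);
            strings[code] = s;
        }
        return s;
    }

    public static void main(String[] args) {
        char[] data = "document textlabel".toCharArray();
        CachingStringHeap heap =
            new CachingStringHeap(data, new int[] {0, 13}, new int[] {13, 5});
        String first = heap.getStringForCode(0);
        String second = heap.getStringForCode(0);
        System.out.println(first == second);
    }
}
```

With this shape, five annotators asking for the same code would share one String rather than each triggering a copy.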

        Another improvement we could do that might significantly reduce storage in many common cases:

        Add an "identity" hash map: key = strings being added to the string heap from Java, value = <stringCode>. This would allow sharing of things when the strings are ==, and this sharing would be preserved across serialization. An Identity hashmap would only need to hash 4 (or 8) byte "addresses", not the whole string.

        Does anyone see any issues with this?
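
The identity-map proposal might look roughly like the sketch below. The class `IdentitySharingStringList` and its method names are hypothetical; only java.util.IdentityHashMap, which compares keys with == rather than equals(), is a real library class.

```java
import java.util.ArrayList;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

class IdentitySharingStringList {
    private final List<String> strings = new ArrayList<>();
    // Keys are compared by reference (==), so string contents are never hashed.
    private final Map<String, Integer> codes = new IdentityHashMap<>();

    // Returns the existing code if this exact String object was added before;
    // otherwise stores the reference and assigns a new code.
    int addString(String s) {
        Integer existing = codes.get(s);
        if (existing != null) {
            return existing;
        }
        int code = strings.size();
        strings.add(s);
        codes.put(s, code);
        return code;
    }

    String getStringForCode(int code) {
        return strings.get(code);
    }

    public static void main(String[] args) {
        IdentitySharingStringList list = new IdentitySharingStringList();
        String a = new String("doc");
        String b = new String("doc"); // equal contents, different object
        System.out.println(list.addString(a) == list.addString(a));
        System.out.println(list.addString(a) == list.addString(b));
    }
}
```

Only strings that are the same object share a code; equal but distinct strings still get separate entries.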

        Thilo Goetz added a comment -

        I really think completely removing the character heap would be the better solution, particularly when we'd need to adapt the serialization anyway. The low-level API we could deprecate, and internally make it use regular Strings. I don't think anybody's using it, it's not even documented. You'd need to look at the source code to know what it does.

        I'm not too keen on the hash map either. We would make everybody pay for a case that doesn't concern everybody. You can use String.intern() as a programmer, if you think that's useful in your case. One might consider intern()ing strings on deserialization, but even there, I'm not sure everybody wants that (I sure hope nobody relies on Strings being equals(), but not ==, but you never know). And wrt space requirements: we would need to create an Integer object for every string code in that hash map, as primitive values can't be stored in Maps.
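
For reference, the difference between equals() and == that String.intern() removes can be shown in a few lines. The class name `InternDemo` is just for illustration; String.intern() itself is standard Java.

```java
class InternDemo {
    public static void main(String[] args) {
        // Two distinct String objects with the same contents.
        String a = new String("annotation");
        String b = new String("annotation");

        System.out.println(a.equals(b));              // same contents
        System.out.println(a == b);                   // different objects
        System.out.println(a.intern() == b.intern()); // one canonical object
    }
}
```

Interning is an opt-in choice by the caller, which matches the point above that not every user should have to pay for the sharing.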

        Marshall Schor added a comment -

        Changed the affected version to 2.2; based on the comments, this is not something we are likely to fix for the 2.2 release. In addition to Thilo's comments, Eddie suggests that removing the special String Heap support for storing data as one big char array (something done originally to support the C++ side of the story) might make the Java <-> C++ connection (done via JNI) slower in some cases, but that there are better ways to ameliorate this, including figuring out and supporting some kind of "delta CAS" approach that sends only the changes. This might also allow us to move away from the JNI approach to C++ interoperability in favor of a more robust one: using sockets + serialization to run the C++ code in a separate address space, isolated from Java.

        Marshall Schor added a comment -

        Here's Eddie's comment from the email thread:

        7/9/2007 12:03 PM

        I tend to agree with this. The overall design will get simpler. The
        downside will be having to create all the string objects at
        deserialization, but this will still leave binary serialization much
        faster than XML serialization.

        Thilo Goetz added a comment -

        This issue has been fixed as a side effect of the 2.2.2 memory hotfix.


          People

          • Assignee:
            Unassigned
            Reporter:
            Greg Holmberg