Uploaded image for project: 'UIMA'
  1. UIMA
  2. UIMA-483

JCas method like getSofaDataString that doesn't copy the chars from the StringHeap

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Closed
    • Major
    • Resolution: Fixed
    • 2.1, 2.2
    • 2.3
    • Core Java Framework
    • None

    Description

      I process large documents--the String I pass to JCas.setSofaDataString may be as large 100 MBs (50,000,000 chars). This is causing the JVM to run out of memory when we have many concurrent AnalysisEngines running.

      I traced JCas.getSofaDataString(), and it eventually calls StringHeap.getStringForCode(), which does a "new String" from it's private char[] (which does a copy).

      This would happen for each annotator. We have five, so now the 100 MBs has become 600 MBs. Multiply by 10 concurrent AnalysisEngines, and that's 6,000 MBs.

      Perhaps there could be a variation on getSofaDataString that returns one of the other classes (besides String) that implements CharSequence. A CharBuffer perhaps, or even a new class the implements the CharSequence interface but is read-only (just four methods). Or even just return a char[] or char[] and begin/end offset into the StringHeap.

      If nothing else, perhaps the document text should be treated specially from all the little strings in the StringHeap, and be stored separately, so calls to getSofaDataString() simply return a reference to an existing String object, without copying.

      I'm open to possibilities, I just need the copying to end.

      Attachments

        Activity

          People

            Unassigned Unassigned
            holmberg Adam Holmberg
            Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: