I process large documents; the String I pass to JCas.setSofaDataString may be as large as 100 MB (50,000,000 chars). This causes the JVM to run out of memory when we have many AnalysisEngines running concurrently.
I traced JCas.getSofaDataString(), and it eventually calls StringHeap.getStringForCode(), which does a "new String" from its private char[] (which makes a copy).
This happens for each annotator. We have five, so the original 100 MB plus five copies comes to 600 MB. Multiply by 10 concurrent AnalysisEngines, and that's 6,000 MB.
Perhaps there could be a variation on getSofaDataString that returns one of the other classes (besides String) that implement CharSequence. A CharBuffer perhaps, or even a new class that implements the CharSequence interface but is read-only (just four methods). Or even just return a char[], or a char[] plus begin/end offsets into the StringHeap.
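To make the read-only CharSequence idea concrete, here is a rough sketch of what I have in mind. The class name CharArrayView and its fields are hypothetical and not part of UIMA, and I am assuming the heap's char[] could be shared with callers without being modified underneath them.

```java
// Hypothetical read-only view over a shared char[]; NOT part of the UIMA API.
// It exposes a [begin, end) window of the array without copying it.
final class CharArrayView implements CharSequence {
    private final char[] chars; // shared backing array, never copied here
    private final int begin;    // inclusive offset into chars
    private final int end;      // exclusive offset into chars

    CharArrayView(char[] chars, int begin, int end) {
        if (begin < 0 || end > chars.length || begin > end) {
            throw new IndexOutOfBoundsException("begin=" + begin + ", end=" + end);
        }
        this.chars = chars;
        this.begin = begin;
        this.end = end;
    }

    @Override
    public int length() {
        return end - begin;
    }

    @Override
    public char charAt(int index) {
        if (index < 0 || index >= length()) {
            throw new IndexOutOfBoundsException("index=" + index);
        }
        return chars[begin + index];
    }

    @Override
    public CharSequence subSequence(int start, int stop) {
        if (start < 0 || stop > length() || start > stop) {
            throw new IndexOutOfBoundsException("start=" + start + ", stop=" + stop);
        }
        // Another view over the same array; still no copy.
        return new CharArrayView(chars, begin + start, begin + stop);
    }

    @Override
    public String toString() {
        // The only method that copies; callers that need a String pay for it explicitly.
        return new String(chars, begin, end - begin);
    }
}
```

The standard library gets close to this already: CharBuffer.wrap(chars, begin, end - begin).asReadOnlyBuffer() also yields a CharSequence over the same array without copying. Either way, an annotator that only needs charAt() and subSequence() would never trigger a copy of the document text.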
If nothing else, perhaps the document text should be treated differently from all the little strings in the StringHeap and stored separately, so that calls to getSofaDataString() simply return a reference to an existing String object, without copying.
I'm open to possibilities; I just need the copying to end.