[UIMA-483] JCas method like getSofaDataString that doesn't copy the chars from the StringHeap - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 2.1, 2.2
Fix Version/s: 2.3
Component/s: Core Java Framework
Labels: None

Description

I process large documents: the String I pass to JCas.setSofaDataString may be as large as 100 MB (50,000,000 chars). This causes the JVM to run out of memory when we have many concurrent AnalysisEngines running.

I traced JCas.getSofaDataString(), and it eventually calls StringHeap.getStringForCode(), which does a "new String" from its private char[] (which makes a copy).

This happens for each annotator. We have five, so the original 100 MB plus five copies comes to 600 MB. Multiply by 10 concurrent AnalysisEngines, and that's 6,000 MB.
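To make the arithmetic above explicit, here is a rough sketch. SofaMemoryEstimate is a hypothetical helper, not part of UIMA, and it assumes 2 bytes per char and one full copy of the text per annotator, as described in this report.

```java
// Hypothetical helper for illustration only; not a UIMA class.
final class SofaMemoryEstimate {

    /** Bytes held by one copy of the text, assuming 2 bytes per Java char. */
    static long bytesPerCopy(long charCount) {
        return charCount * Character.BYTES;
    }

    /**
     * Original text plus one copy per annotator, multiplied by the number
     * of concurrently running engines.
     */
    static long totalBytes(long charCount, int annotators, int engines) {
        return bytesPerCopy(charCount) * (1 + annotators) * engines;
    }
}
```

With 50,000,000 chars, 5 annotators, and 10 engines, totalBytes returns 6,000,000,000 bytes, which matches the 6,000 MB figure.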

Perhaps there could be a variation on getSofaDataString that returns one of the other classes (besides String) that implement CharSequence. A CharBuffer perhaps, or even a new class that implements the CharSequence interface but is read-only (just four methods). Or even just return the char[] itself, or the char[] plus begin/end offsets into the StringHeap.
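A minimal sketch of the kind of read-only CharSequence I have in mind. CharArrayView is a hypothetical name, not an existing UIMA class, and I am assuming the framework could hand out its char[] with begin/end offsets. The view shares the array instead of copying it; only toString() makes a copy.

```java
// Hypothetical read-only view over a slice of a char[]; not a UIMA class.
final class CharArrayView implements CharSequence {
    private final char[] chars; // shared with the owner, never copied here
    private final int begin;    // inclusive offset into chars
    private final int end;      // exclusive offset into chars

    CharArrayView(char[] chars, int begin, int end) {
        if (begin < 0 || end > chars.length || begin > end) {
            throw new IndexOutOfBoundsException("begin=" + begin + ", end=" + end);
        }
        this.chars = chars;
        this.begin = begin;
        this.end = end;
    }

    @Override
    public int length() {
        return end - begin;
    }

    @Override
    public char charAt(int index) {
        if (index < 0 || index >= length()) {
            throw new IndexOutOfBoundsException("index=" + index);
        }
        return chars[begin + index];
    }

    @Override
    public CharSequence subSequence(int start, int stop) {
        if (start < 0 || stop > length() || start > stop) {
            throw new IndexOutOfBoundsException("start=" + start + ", stop=" + stop);
        }
        // Another view over the same array; still no copy.
        return new CharArrayView(chars, begin + start, begin + stop);
    }

    @Override
    public String toString() {
        // The only method that copies: a String needs its own storage.
        return new String(chars, begin, end - begin);
    }
}
```

The view exposes no mutators, but it is only as stable as the underlying array: if the owner changes the array, the view sees the change. The standard library alternative would be CharBuffer.wrap(chars, offset, length).asReadOnlyBuffer(), which also avoids copying.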

If nothing else, perhaps the document text should be treated specially from all the little strings in the StringHeap, and be stored separately, so calls to getSofaDataString() simply return a reference to an existing String object, without copying.

I'm open to possibilities; I just need the copying to end.

People

Assignee: Unassigned

Reporter: Adam Holmberg

Votes: 0

Watchers: 0

Dates

Created: 05/Jul/07 23:37

Updated: 13/Aug/08 08:39

Resolved: 13/Aug/08 08:39