Details
- Type: Improvement
- Status: Open
- Priority: Major
- Resolution: Unresolved
Description
These tokenizers map codepoints to character classes with the following data structure (loaded in the static initializer):

    private static char[] zzUnpackCMap(String packed) {
      char[] map = new char[0x110000];
This requires 2 MB of RAM per tokenizer class (in trunk 6 MB if all 3 classes are loaded, in branch_5x 10 MB since there are 2 additional backwards-compatibility classes).
On the other hand, none of our tokenizers actually uses a large number of character classes, so char is overkill: this map can safely be a byte[] and we can save half the memory. It might make these tokenizers faster too.
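As a rough sketch of the idea (class and method names are only illustrative, not the JFlex-generated code): the packed map is a run-length encoding of (count, class) pairs, and as long as no tokenizer uses more than 256 character classes, the unpacked table can be a byte[] of the same 0x110000 length, at half the footprint:

    // Sketch only: byte[]-backed codepoint -> character-class map,
    // assuming at most 256 character classes per tokenizer.
    public class ByteCMapSketch {
      // Packed format: pairs of chars, (run length, class value).
      static byte[] zzUnpackCMap(String packed) {
        byte[] map = new byte[0x110000];     // one entry per Unicode codepoint
        int i = 0;                           // index into packed string
        int j = 0;                           // index into unpacked map
        while (i < packed.length()) {
          int count = packed.charAt(i++);    // run length
          byte value = (byte) packed.charAt(i++); // class fits in a byte
          do { map[j++] = value; } while (--count > 0);
        }
        return map;
      }

      public static void main(String[] args) {
        // 3 codepoints of class 1, then 2 codepoints of class 0
        byte[] map = zzUnpackCMap("\u0003\u0001\u0002\u0000");
        System.out.println(map[0] + " " + map[2] + " " + map[3]);
        // char[0x110000] costs 2 bytes * 1,114,112 entries, about 2 MB:
        System.out.println(2L * 0x110000 + " bytes as char[], "
            + 0x110000 + " bytes as byte[]");
      }
    }

The savings come purely from the element width: the table length and lookup pattern (`map[codepoint]`) are unchanged, so switching the array type should be transparent to the generated scanner loop.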