The attached class is a working copy!
This is a caching version of the SimpleNaiveBayes classifier. The cache is a hash map: when a term is needed, we search the index for it across all classes and store the results in the map. The next time the term is needed, we read it from the cache instead of searching the index again.
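The lookup described above can be sketched roughly as follows. This is only an illustration and not the attached class: the class name `TermClassCountCache`, the `indexLookup` function (a stand-in for the real index search), and the `indexSearches` counter are all hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.ToIntBiFunction;

/** Hypothetical lazy cache: term -> (class name -> occurrences of the term in that class). */
class TermClassCountCache {
  private final Map<String, Map<String, Integer>> cache = new HashMap<>();
  private final List<String> classNames;
  // Assumption: stands in for the index search; arguments are (term, className).
  private final ToIntBiFunction<String, String> indexLookup;
  // Counts how often the stand-in index search was called (for illustration only).
  private int indexSearches = 0;

  TermClassCountCache(List<String> classNames, ToIntBiFunction<String, String> indexLookup) {
    this.classNames = classNames;
    this.indexLookup = indexLookup;
  }

  int count(String term, String className) {
    Map<String, Integer> perClass = cache.get(term);
    if (perClass == null) {
      // Cache miss: search once per class and remember all of the results.
      perClass = new HashMap<>();
      for (String c : classNames) {
        perClass.put(c, indexLookup.applyAsInt(term, c));
        indexSearches++;
      }
      cache.put(term, perClass);
    }
    // Cache hit (or freshly filled entry): no further index search is needed.
    return perClass.getOrDefault(className, 0);
  }

  int indexSearches() {
    return indexSearches;
  }
}
```

With two classes, the first request for a term triggers two index searches; later requests for the same term, for either class, trigger none.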
(Re)initializing the cache recalculates docsWithClassSize, clears the hash maps, and prepares new ones. Two maps and a list are needed: the first map holds, for each term, the termInClassOccurrence for every class (this is the cache), the list holds the class names, and the second map holds the avgUniqueTermNumber for each class. The list and the second map are fully preloaded; the first map is built up dynamically during searches.
With many terms and/or classes the cache needs a lot of memory, so there is a built-in option to limit its size. We expect terms that are really rare to also appear rarely in other documents, so they are left out of the cache. There is also an option to leave them out of the classification calculation entirely.
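One way such a cutoff could look is sketched below. It is an assumption-laden illustration, not the attached class: the class `RareTermPolicy`, the threshold `minOccurrences`, and the switch `onlyCachedTerms` are hypothetical names, and summing the per-class counts is just one possible definition of "rare".

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical policy that keeps rare terms out of the cache and, optionally, out of scoring. */
class RareTermPolicy {
  private final int minOccurrences;      // assumed threshold: minimum total count to be cached
  private final boolean onlyCachedTerms; // assumed switch: ignore uncached terms when scoring
  private final Map<String, Map<String, Integer>> cache = new HashMap<>();

  RareTermPolicy(int minOccurrences, boolean onlyCachedTerms) {
    this.minOccurrences = minOccurrences;
    this.onlyCachedTerms = onlyCachedTerms;
  }

  /** Caches the per-class counts only if the term is frequent enough; returns whether it was cached. */
  boolean offer(String term, Map<String, Integer> perClassCounts) {
    int total = 0;
    for (int n : perClassCounts.values()) {
      total += n;
    }
    if (total < minOccurrences) {
      return false; // rare term: not worth the memory
    }
    cache.put(term, new HashMap<>(perClassCounts));
    return true;
  }

  /** Returns the terms of a document that take part in the classification calculation. */
  List<String> termsToScore(List<String> docTerms) {
    if (!onlyCachedTerms) {
      return docTerms;
    }
    List<String> kept = new ArrayList<>();
    for (String t : docTerms) {
      if (cache.containsKey(t)) {
        kept.add(t);
      }
    }
    return kept;
  }
}
```

With `onlyCachedTerms` set to false, rare terms are still scored but have to be looked up in the index every time; with it set to true, they are dropped from the calculation altogether.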
This issue was moved to GitHub issue: #6798.