[LUCENE-5736] Separate the classifiers to online and caching where possible

Details

Type: Sub-task
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 5.0
Component/s: modules/classification
Labels:
- gsoc2014

Lucene Fields: New

Description

The Lucene classifier implementations are now nearly online if they are given a near-real-time reader. This is good for users who have a continuously changing dataset, but slow for datasets that do not change.

The idea: implement a cache to speed up the results where possible.

Attachments

- CachingNaiveBayesClassifier.java (08/Jun/14 09:36, 14 kB, Gergő Törcsvári)
- 0810-caching.patch (10/Aug/14 08:46, 12 kB, Gergő Törcsvári)
- 0803-caching.patch (03/Aug/14 19:42, 23 kB, Gergő Törcsvári)

Activity

Tomoko Uchida added a comment - 28/Aug/22 14:09

This issue was moved to GitHub issue: #6798.

Anshum Gupta added a comment - 23/Feb/15 05:01

Bulk close after 5.0 release.

ASF subversion and git services added a comment - 12/Nov/14 08:56

Commit 1638724 from teofili in branch 'dev/trunk'
[ https://svn.apache.org/r1638724 ]

LUCENE-5736 - fixed test javadoc

ASF subversion and git services added a comment - 12/Nov/14 08:38

Commit 1638717 from teofili in branch 'dev/trunk'
[ https://svn.apache.org/r1638717 ]

LUCENE-5736 - adding test for caching nb classifier

ASF subversion and git services added a comment - 22/Aug/14 08:04

Commit 1619700 from teofili in branch 'dev/trunk'
[ https://svn.apache.org/r1619700 ]

LUCENE-5736 - added caching version of NB classifier

Tommaso Teofili added a comment - 22/Aug/14 07:59

ok thanks, I'll remove it and commit this.

Gergő Törcsvári added a comment - 21/Aug/14 15:11

Yes, I remember now. It was used for iterating through the frequencyMap, but I started to refactor that for loop to iterate over Map.Entry instead, and I mistakenly left the termList in.
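
For readers unfamiliar with the refactoring mentioned here, the following is a minimal sketch of iterating a map through its entries, which makes a separate key list such as termList unnecessary. The class name, method name, and the `Map<String, Long>` type are assumptions for illustration; the actual types used in the patch are not shown in this issue.

```java
import java.util.Map;

/** Hypothetical helper, not taken from the patch. */
class FrequencyMapIteration {
    /** Sums all counts by walking the entries directly, so no separate key list is needed. */
    static long totalFrequency(Map<String, Long> frequencyMap) {
        long total = 0;
        for (Map.Entry<String, Long> entry : frequencyMap.entrySet()) {
            // entry.getKey() is the term, entry.getValue() its count
            total += entry.getValue();
        }
        return total;
    }
}
```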

Tommaso Teofili added a comment - 21/Aug/14 14:57

I have a doubt about the CachingNaiveBayesClassifier#reInitCache method: the termList List seems to be populated but never used. Either it is useless and can be removed, or it is ignored by mistake and has to be properly used. Which is it? (The first seems most likely to me, as there is already the frequencyMap object.)

Tommaso Teofili added a comment - 08/Aug/14 08:57

The second patch looks better. The only thing I would change is to extend SimpleNaiveBayesClassifier and avoid rewriting the methods that do not change in the caching version.

Gergő Törcsvári added a comment - 08/Jun/14 09:39

The online modification of the SimpleNaiveBayesClassifier is in the 5699 attachment and is mentioned in the comment there too.
The KNN classifier was online out of the box if the user commits properly or uses a near-real-time writer.

Gergő Törcsvári added a comment - 08/Jun/14 09:36

The attached class is a working copy!

This is a version of the SimpleNaiveBayes classifier with a cache included. The cache is a hash map: when a word is needed, we search for it in all classes and put the results into the map. Next time, we pull it from the cache instead of searching the index again.

The cache (re)initialization recalculates docsWithClassSize, clears the hash maps, and prepares new ones. Two maps and a list are needed: the first map contains term-classes-termInClassOccurrence (this is the cache), the list contains the class names, and the second map contains class-avgUniqueTermNumber. The last two are fully preloaded; the first is built dynamically during searches.

If there are many terms and/or classes, this needs a lot of memory, so there is a built-in option for limiting the cache size. If there are terms that are really rare, we expect that they will rarely come up in other documents too, so they are left out of the cache. There is also an option to leave them out of the classification calculation entirely.
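
The caching scheme described above can be sketched roughly as follows. This is a simplified illustration, not the attached class: `TermClassCache`, `indexLookup`, `minTotalToCache`, and the `lookups` counter are all hypothetical names, and the index search is replaced by a plain function so the example runs without Lucene.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.ToIntBiFunction;

/** Hypothetical sketch of a lazily filled term -> (class -> occurrence count) cache. */
class TermClassCache {
    private final List<String> classNames;                      // preloaded list of class names
    private final ToIntBiFunction<String, String> indexLookup;  // (term, class) -> count; stands in for an index search
    private final int minTotalToCache;                          // assumed threshold: rarer terms are not cached
    private final Map<String, Map<String, Integer>> cache = new HashMap<>();
    int lookups = 0;                                            // number of simulated index searches, for illustration

    TermClassCache(List<String> classNames,
                   ToIntBiFunction<String, String> indexLookup,
                   int minTotalToCache) {
        this.classNames = new ArrayList<>(classNames);
        this.indexLookup = indexLookup;
        this.minTotalToCache = minTotalToCache;
    }

    /** Returns per-class counts for a term, searching the "index" only on a cache miss. */
    Map<String, Integer> countsFor(String term) {
        Map<String, Integer> counts = cache.get(term);
        if (counts != null) {
            return counts;                                      // cache hit: no index search
        }
        counts = new HashMap<>();
        int total = 0;
        for (String className : classNames) {                   // search the term for all classes
            int count = indexLookup.applyAsInt(term, className);
            lookups++;
            counts.put(className, count);
            total += count;
        }
        if (total >= minTotalToCache) {                         // rare terms are left out of the cache
            cache.put(term, counts);
        }
        return counts;
    }

    /** Clears the cache, e.g. after the underlying index has changed. */
    void reInit() {
        cache.clear();
    }
}
```

In this sketch a repeated call to `countsFor` for a cached term performs no further lookups, while a term below the threshold is searched again on every call; that is the memory-versus-speed trade-off the comment describes.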

People

Assignee: Tommaso Teofili
Reporter: Gergő Törcsvári
Votes: 0
Watchers: 4

Dates

Created: 05/Jun/14 08:44
Updated: 28/Aug/22 14:09
Resolved: 03/Nov/14 08:05
