Details

    • New

    Description

      The Lucene classifier implementations are now near-online if they are given a near-real-time reader. This is good for users who have a continuously changing dataset, but slow for datasets that do not change.

      The idea: implement a cache to speed up results where possible.

      Attachments

        1. CachingNaiveBayesClassifier.java
          14 kB
          Gergő Törcsvári
        2. 0810-caching.patch
          12 kB
          Gergő Törcsvári
        3. 0803-caching.patch
          23 kB
          Gergő Törcsvári

        Activity

          torcsvarig Gergő Törcsvári added a comment -

          The attached class is a working copy!

          This is a version of the SimpleNaiveBayes classifier with a cache included. The cache is a hash map: when a word is needed, we search the index for it in every class and store the results in the map. The next time, we read the values from the cache instead of searching the index again.

          Cache (re)initialization recalculates docsWithClassSize, clears the hash maps, and prepares new ones. Two maps and a list are needed: the first map holds, for each term, the termInClassOccurrence of every class (this is the cache); the list holds the class names; and the second map holds the avgUniqueTermNumber of each class. The list and the second map are fully preloaded; the first map is built dynamically during searches.

          With many terms and/or classes the cache needs a lot of memory, so there is a built-in option for limiting the cache size. Terms that are really rare are expected to appear rarely in other documents too, so they are left out of the cache. There is also an option to leave them out of the classification calculation entirely.
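
          The lazily filled cache described above could be sketched roughly as follows. This is not the attached CachingNaiveBayesClassifier: the class name TermClassCountCache, the indexLookup function (a stand-in for a real index search), the indexSearches counter, and the minTotalCount cut-off for rare terms are all hypothetical names chosen for illustration.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.ToIntBiFunction;

// Hypothetical sketch; names and structure are illustrative, not taken from the patch.
class TermClassCountCache {
  private final List<String> classes;
  // Stand-in for an index search: (term, class) -> occurrences of the term in that class.
  private final ToIntBiFunction<String, String> indexLookup;
  // Assumed cut-off: terms with fewer total occurrences are not cached.
  private final int minTotalCount;
  // term -> (class -> occurrences); filled on demand.
  private final Map<String, Map<String, Integer>> cache = new HashMap<>();
  // Counts calls to indexLookup so the effect of the cache can be observed.
  int indexSearches = 0;

  TermClassCountCache(List<String> classes,
                      ToIntBiFunction<String, String> indexLookup,
                      int minTotalCount) {
    this.classes = classes;
    this.indexLookup = indexLookup;
    this.minTotalCount = minTotalCount;
  }

  // Returns the per-class counts for a term, searching the index only on a cache miss.
  Map<String, Integer> countsFor(String term) {
    Map<String, Integer> cached = cache.get(term);
    if (cached != null) {
      return cached;
    }
    Map<String, Integer> counts = new HashMap<>();
    int total = 0;
    for (String cls : classes) {
      int n = indexLookup.applyAsInt(term, cls);
      indexSearches++;
      counts.put(cls, n);
      total += n;
    }
    // Rare terms are answered but not stored, which bounds memory use.
    if (total >= minTotalCount) {
      cache.put(term, counts);
    }
    return counts;
  }

  // Drops all cached counts, e.g. after the underlying index has changed.
  void reInit() {
    cache.clear();
  }
}
```

          In this sketch a repeated lookup of the same term costs no further index searches until reInit() is called, while a term below the cut-off is searched again each time.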

          torcsvarig Gergő Törcsvári added a comment -

          The online modification of the SimpleNaiveBayesClassifier is in the 5699 attachment and is mentioned in the comment too.
          The KNN classifier was online out of the box, provided the user commits properly or uses a near-real-time writer.

          teofili Tommaso Teofili added a comment -

          The second patch looks better. The only thing I would change is to extend SimpleNaiveBayesClassifier and avoid rewriting the methods that do not change in the caching version.

          teofili Tommaso Teofili added a comment -

          I have a doubt about the CachingNaiveBayesClassifier#reInitCache method: the termList List seems to be populated but never used. Either it is useless and can be removed, or it was ignored by mistake and has to be used properly. Which is it? (To me the first seems most likely, as there is already the frequencyMap object.)

          torcsvarig Gergő Törcsvári added a comment -

          Yes, I remember now. It was used for iterating through the frequencyMap, but I started to refactor that for loop to iterate over Map.Entry instead, and I mistakenly left the termList in.

          teofili Tommaso Teofili added a comment -

          OK, thanks. I'll remove it and commit this.

          jira-bot ASF subversion and git services added a comment -

          Commit 1619700 from teofili in branch 'dev/trunk'
          [ https://svn.apache.org/r1619700 ]

          LUCENE-5736 - added caching version of NB classifier

          jira-bot ASF subversion and git services added a comment -

          Commit 1638717 from teofili in branch 'dev/trunk'
          [ https://svn.apache.org/r1638717 ]

          LUCENE-5736 - adding test for caching nb classifier

          jira-bot ASF subversion and git services added a comment -

          Commit 1638724 from teofili in branch 'dev/trunk'
          [ https://svn.apache.org/r1638724 ]

          LUCENE-5736 - fixed test javadoc

          anshum Anshum Gupta added a comment -

          Bulk close after 5.0 release.

          tomoko Tomoko Uchida added a comment -

          This issue was moved to GitHub issue: #6798.


          People

            teofili Tommaso Teofili
            torcsvarig Gergő Törcsvári
            Votes: 0
            Watchers: 4
