Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 3.0.0
    • Fix Version/s: 2.3.0
    • Labels:
      None
    • Environment:
      James from svn-trunk 2005-08-01.
      MySQL 4.0

Description

      Got this exception for every incoming mail:

      02/08/05 00:39:25 INFO James.Mailet: BayesianAnalysis: Exception: java.lang.Integer
      java.lang.ClassCastException: java.lang.Integer
      at org.apache.james.util.BayesianAnalyzer.getTokenProbabilityStrengths(BayesianAnalyzer.java:591)
      at org.apache.james.util.BayesianAnalyzer.computeSpamProbability(BayesianAnalyzer.java:340)
      at org.apache.james.transport.mailets.BayesianAnalysis.service(BayesianAnalysis.java:289)
      at org.apache.james.transport.LinearProcessor.service(LinearProcessor.java:407)
      at org.apache.james.transport.JamesSpoolManager.process(JamesSpoolManager.java:460)
      at org.apache.james.transport.JamesSpoolManager.run(JamesSpoolManager.java:369)
      at java.lang.Thread.run(Unknown Source)

      If I clean my spam/ham db the exceptions disappear, but they start again when the spam/ham db becomes large.
      My bayesiananalysis_spam table contains 200000 rows.

      The following are the spam tokens with the highest "occurrences".
      token                         occurrences
      -----------------------------------------

      3D 82151
      a 59953
      the 45295
      FONT 42771
      Content-Type 39058
      to 36626
      com 32902
      http 32886
      of 32504
      font 31803
      and 31577
      Content-Transfer-Encoding 31576
      p 29746
      text 29482
      in 29418
      it 28498
      br 28037
      DIV 27431

Activity

        vincenzo Vincenzo Gianferrari Pini added a comment -

        I took a careful look at the code and couldn't find anything wrong. I have a spam table with more than 258000 rows and everything works fine for me.

        IMHO a possible explanation of Stefano's exceptions is the following:

        The ham/spam corpus hashmaps may take a lot of memory. Accordingly, I gave a lot of -Xmx memory to the JVM.
        I remember, some time ago in a Java (non-James) application, unpredictable JVM behaviour (strange exceptions being thrown) when the available heap was only just about the needed heap: decreasing the -Xmx size a little I got OutOfMemoryError, and increasing it a little everything was fine.
        Stefano, can you try with more memory?

        bago Stefano Bagnara added a comment -

        I increased the total memory for my "personal" James to 800MB, and it only handles my own mail (around 1000 messages per day), but it still stops checking my messages.

        Here is the exception:

        20/11/05 00:14:57 INFO James.Mailet: BayesianAnalysis: Exception: java.lang.Integer
        java.lang.ClassCastException: java.lang.Integer
        at org.apache.james.util.BayesianAnalyzer.getTokenProbabilityStrengths(BayesianAnalyzer.java:591)
        at org.apache.james.util.BayesianAnalyzer.computeSpamProbability(BayesianAnalyzer.java:340)
        at org.apache.james.transport.mailets.BayesianAnalysis.service(BayesianAnalysis.java:289)
        at org.apache.james.transport.LinearProcessor.service(LinearProcessor.java:407)
        at org.apache.james.transport.JamesSpoolManager.process(JamesSpoolManager.java:460)
        at org.apache.james.transport.JamesSpoolManager.run(JamesSpoolManager.java:369)
        at java.lang.Thread.run(Unknown Source)

        If I restart James it works for almost a day and then it breaks again.

        My bayesiananalysis_spam table counts 853685 rows, while the ham table counts 21253.

        I configured James to automatically feed spam and ham for messages I recognize, so the Bayesian filter can be improved. Maybe my continuous feeding is not good for the Bayesian mailet.

        Any ideas?
        Who is using these matchers/mailets? What are your spam/ham sizes? How often do you feed ham/spam? How much memory have you reserved for James? How many messages go through the Bayesian mailets?

        bago Stefano Bagnara added a comment -

        PS: James never logged an OutOfMemoryError, and the exception is always identical, so I don't think it is a memory problem.

        brainlounge Bernd Fondermann added a comment -

        I looked at the mailet code and found that in buildCorpus() the instance variable "corpus" is filled with all ham and spam tokens, which appear to be maps of (String, Integer) pairs. Afterwards the map is iterated and all values are replaced by Doubles, but while this is running (and it takes longer every time) there can still be a fair number of Integer-typed values.
        If another thread steps into line 591 while this is still in progress, the error could very well occur, because "corpus" is read there.
        Are new mails fed in a separate thread?

        The class cast in line 591 could be changed to "Number" as a very simple solution. Maybe it would also be appropriate to refactor buildCorpus() to work on a local map until it is done re-filling it with Doubles.

        I hope this analysis makes some sense and I haven't completely misread this whole case...
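        The race Bernd describes can be sketched as follows. This is an illustrative reconstruction, not the actual BayesianAnalyzer code: the class, method names and the probability formula are hypothetical, but the failure mode matches the stack trace: a reader casts a corpus value to Double while the in-place Integer-to-Double conversion is still underway.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of the race; all names are hypothetical,
// not the real org.apache.james.util.BayesianAnalyzer code.
public class CorpusRaceSketch {

    // Shared corpus, as in the mailet: one map reused for both phases.
    static final Map<String, Object> corpus = new HashMap<>();

    // Phase 1: buildCorpus() fills the map with Integer token counts.
    static void fillCounts() {
        corpus.put("FONT", Integer.valueOf(42771));
    }

    // Phase 2: the same map is walked again and every Integer count is
    // replaced in place by a Double probability (formula is made up).
    static void convertToProbabilities() {
        for (Map.Entry<String, Object> e : corpus.entrySet()) {
            int count = ((Integer) e.getValue()).intValue();
            e.setValue(Double.valueOf(count / 100000.0));
        }
    }

    // A reader thread arriving between the two phases finds an Integer
    // where it expects a Double: the reported ClassCastException.
    static double readProbability(String token) {
        return ((Double) corpus.get(token)).doubleValue();
    }

    public static void main(String[] args) {
        fillCounts();
        try {
            readProbability("FONT"); // reader wins the race
        } catch (ClassCastException cce) {
            System.out.println("reader saw: " + cce);
        }
        convertToProbabilities();
        System.out.println("after conversion: " + readProbability("FONT"));
    }
}
```

        Changing the cast at line 591 to Number, as suggested, would only mask the symptom; the structural fix is to stop mutating the map that readers are using.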

        vincenzo Vincenzo Gianferrari Pini added a comment -

        Bernd is right: buildCorpus() is in a synchronized block to avoid messing things up when new mails are fed (in a separate thread), but I forgot to handle synchronization between buildCorpus() and getTokenProbabilityStrengths().
        I will refactor buildCorpus() to avoid this dirty double use of corpus.
        Moreover, corpus, hamTokenCounts and spamTokenCounts do not seem to be cleared when loading/building an updated corpus from the database.

        vincenzo Vincenzo Gianferrari Pini added a comment -

        The corpus reload activity could conflict with any ongoing analysis of messages, and the corpus could get corrupted.
        Now the reload is done on a new hashmap which, at the end of the reload, becomes the actual corpus. In the meantime any analysis is done on the old corpus, so no conflict occurs. The old corpus will eventually be garbage collected.
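        The copy-and-swap fix described above can be sketched like this. The names are hypothetical and the probability computation is a placeholder; the point is that the live map is never mutated, so readers only ever see a fully built corpus.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative copy-and-swap sketch of the fix; names are hypothetical.
public class CorpusSwapSketch {

    // volatile: the single reference assignment below is the publication
    // point, and analyzer threads must see the swap promptly.
    static volatile Map<String, Double> corpus = new HashMap<>();

    // Reload builds a private map; the live corpus is never touched.
    static void reloadCorpus(Map<String, Integer> countsFromDb) {
        Map<String, Double> fresh = new HashMap<>();
        for (Map.Entry<String, Integer> e : countsFromDb.entrySet()) {
            // placeholder for the real probability computation
            fresh.put(e.getKey(), e.getValue().doubleValue());
        }
        corpus = fresh; // atomic swap; the old map becomes garbage
    }

    // Readers always see either the old corpus or the new one,
    // never a half-converted mixture of Integers and Doubles.
    static Double probability(String token) {
        return corpus.get(token);
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("FONT", 42771);
        reloadCorpus(counts);
        System.out.println(probability("FONT"));
    }
}
```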

        danny@apache.org Danny Angus added a comment -

        Closing issue: fixed in released version.


People

    • Assignee: vincenzo Vincenzo Gianferrari Pini
    • Reporter: bago Stefano Bagnara
    • Votes: 0
    • Watchers: 0

Dates

    • Created:
    • Updated:
    • Resolved:

Development