Lucene - Core / LUCENE-1039

Bayesian classifiers using Lucene as data store

    Details

    • Type: New Feature
    • Status: Reopened
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: core/store
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      Bayesian classifiers using Lucene as data store. Based on the Naive Bayes and Fisher method algorithms as described by Toby Segaran in "Programming Collective Intelligence", ISBN 978-0-596-52932-1.
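For readers unfamiliar with the method, the following is a minimal, self-contained sketch of a Naive Bayes text classifier built on per-class document counts. It is not the code from the attached patch: the class name `NaiveBayesSketch`, the whitespace tokenization, and the smoothing constants (an assumed prior of 0.5 with weight 1.0) are all illustrative assumptions, and plain in-memory maps stand in for the Lucene index.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical illustration only; not the classifier from the patch. */
final class NaiveBayesSketch {
    // class -> number of training documents in that class
    private final Map<String, Integer> classDocCount = new HashMap<>();
    // class -> feature -> number of documents of that class containing the feature
    private final Map<String, Map<String, Integer>> featureDocCount = new HashMap<>();
    private int totalDocs = 0;

    /** Assumption: features are lower-cased whitespace tokens, counted once per document. */
    static Set<String> features(String text) {
        Set<String> out = new HashSet<>();
        for (String token : text.toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                out.add(token);
            }
        }
        return out;
    }

    void train(String text, String cls) {
        totalDocs++;
        classDocCount.merge(cls, 1, Integer::sum);
        Map<String, Integer> counts = featureDocCount.computeIfAbsent(cls, k -> new HashMap<>());
        for (String feature : features(text)) {
            counts.merge(feature, 1, Integer::sum);
        }
    }

    /** Pr(feature | class), pulled toward an assumed prior of 0.5 so unseen features never yield zero. */
    double weightedProbability(String feature, String cls) {
        Map<String, Integer> none = Collections.emptyMap();
        int inClass = featureDocCount.getOrDefault(cls, none).getOrDefault(feature, 0);
        double basic = (double) inClass / classDocCount.get(cls);
        int seenAnywhere = 0;
        for (Map<String, Integer> counts : featureDocCount.values()) {
            seenAnywhere += counts.getOrDefault(feature, 0);
        }
        double assumedPrior = 0.5;
        double weight = 1.0;
        return (weight * assumedPrior + seenAnywhere * basic) / (weight + seenAnywhere);
    }

    /** Returns the class with the highest log(Pr(class)) + sum of log(Pr(feature | class)). */
    String classify(String text) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String cls : classDocCount.keySet()) {
            double score = Math.log((double) classDocCount.get(cls) / totalDocs);
            for (String feature : features(text)) {
                score += Math.log(weightedProbability(feature, cls));
            }
            if (score > bestScore) {
                bestScore = score;
                best = cls;
            }
        }
        return best;
    }
}
```

Summing logarithms instead of multiplying probabilities avoids floating-point underflow when a text has many features. Usage follows the same pattern as the test case: call `train(text, class)` for every instance, then `classify(text)`.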

      Have fun.

      Poor java docs, but the TestCase shows how to use it:

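      package org.apache.lucene.classifier;

      // Imports assume the Lucene 2.x core and contrib analyzers plus JUnit 3.
      import java.io.IOException;
      import java.io.Reader;
      import java.util.Map;

      import junit.framework.TestCase;

      import org.apache.lucene.analysis.Analyzer;
      import org.apache.lucene.analysis.LowerCaseFilter;
      import org.apache.lucene.analysis.TokenStream;
      import org.apache.lucene.analysis.ngram.EdgeNGramTokenFilter;
      import org.apache.lucene.analysis.ngram.NGramTokenFilter;
      import org.apache.lucene.analysis.standard.StandardTokenizer;
      import org.apache.lucene.document.Document;
      import org.apache.lucene.document.Field;
      import org.apache.lucene.index.IndexWriter;
      import org.apache.lucene.index.Term;
      import org.apache.lucene.store.Directory;
      import org.apache.lucene.store.RAMDirectory;

      public class TestClassifier extends TestCase {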
      
        public void test() throws Exception {
      
          InstanceFactory instanceFactory = new InstanceFactory() {
      
            public Document factory(String text, String _class) {
              Document doc = new Document();
              doc.add(new Field("class", _class, Field.Store.YES, Field.Index.NO_NORMS));
      
              doc.add(new Field("text", text, Field.Store.YES, Field.Index.NO, Field.TermVector.NO));
      
              doc.add(new Field("text/ngrams/start", text, Field.Store.NO, Field.Index.TOKENIZED, Field.TermVector.YES));
              doc.add(new Field("text/ngrams/inner", text, Field.Store.NO, Field.Index.TOKENIZED, Field.TermVector.YES));
              doc.add(new Field("text/ngrams/end", text, Field.Store.NO, Field.Index.TOKENIZED, Field.TermVector.YES));
              return doc;
            }
      
            Analyzer analyzer = new Analyzer() {
              private int minGram = 2;
              private int maxGram = 3;
      
              public TokenStream tokenStream(String fieldName, Reader reader) {
                TokenStream ts = new StandardTokenizer(reader);
                ts = new LowerCaseFilter(ts);
                if (fieldName.endsWith("/ngrams/start")) {
                  ts = new EdgeNGramTokenFilter(ts, EdgeNGramTokenFilter.Side.FRONT, minGram, maxGram);
                } else if (fieldName.endsWith("/ngrams/inner")) {
                  ts = new NGramTokenFilter(ts, minGram, maxGram);
                } else if (fieldName.endsWith("/ngrams/end")) {
                  ts = new EdgeNGramTokenFilter(ts, EdgeNGramTokenFilter.Side.BACK, minGram, maxGram);
                }
                return ts;
              }
            };
      
            public Analyzer getAnalyzer() {
              return analyzer;
            }
          };
      
          Directory dir = new RAMDirectory();
          new IndexWriter(dir, null, true).close();
      
          Instances instances = new Instances(dir, instanceFactory, "class");
      
          instances.addInstance("hello world", "en");
          instances.addInstance("hallå världen", "sv");
      
          instances.addInstance("this is london calling", "en");
          instances.addInstance("detta är london som ringer", "sv");
      
          instances.addInstance("john has a long mustache", "en");
          instances.addInstance("john har en lång mustache", "sv");
      
          instances.addInstance("all work and no play makes jack a dull boy", "en");
          instances.addInstance("att bara arbeta och aldrig leka gör jack en trist gosse", "sv");
      
          instances.addInstance("shrimp sandwich", "en");
          instances.addInstance("räksmörgås", "sv");
      
          instances.addInstance("it's now or never", "en");
          instances.addInstance("det är nu eller aldrig", "sv");
      
          instances.addInstance("to tie up at a landing-stage", "en");
          instances.addInstance("att angöra en brygga", "sv");
      
          instances.addInstance("it's now time for the children's television shows", "en");
          instances.addInstance("nu är det dags för barnprogram", "sv");
      
          instances.flush();
      
          testClassifier(instances, new NaiveBayesClassifier());
          testClassifier(instances, new FishersMethodClassifier());
      
          instances.close();
        }
      
        private void testClassifier(Instances instances, BayesianClassifier classifier) throws IOException {
      
          assertEquals("sv", classifier.classify(instances, "detta blir ett test")[0].getClassification());
          assertEquals("en", classifier.classify(instances, "this will be a test")[0].getClassification());
      
          // test training data instances. all ought to match!
          for (int documentNumber = 0; documentNumber < instances.getIndexReader().maxDoc(); documentNumber++) {
            if (!instances.getIndexReader().isDeleted(documentNumber)) {
              Map<Term, Double> features = instances.extractFeatures(instances.getIndexReader(), documentNumber, classifier.isNormalized());
              Document document = instances.getIndexReader().document(documentNumber);
              assertEquals(document.get("class"), classifier.classify(instances, features)[0].getClassification());
            }
          }
        }
      }
      
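The analyzer in the test case indexes each text into three n-gram fields with a gram size of 2 to 3. As a rough illustration of what those fields contain, here is a plain-Java approximation. The class `NGramSketch` and its method names are hypothetical, and the real Lucene filters may emit tokens in a different order and with offsets and positions that this sketch ignores.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical plain-Java approximation of the three n-gram fields in the test case. */
final class NGramSketch {

    /** Prefixes of the token, like the "/ngrams/start" field (front edge n-grams). */
    static List<String> front(String token, int minGram, int maxGram) {
        List<String> out = new ArrayList<>();
        for (int n = minGram; n <= maxGram && n <= token.length(); n++) {
            out.add(token.substring(0, n));
        }
        return out;
    }

    /** Suffixes of the token, like the "/ngrams/end" field (back edge n-grams). */
    static List<String> back(String token, int minGram, int maxGram) {
        List<String> out = new ArrayList<>();
        for (int n = minGram; n <= maxGram && n <= token.length(); n++) {
            out.add(token.substring(token.length() - n));
        }
        return out;
    }

    /** Every n-gram at every offset, like the "/ngrams/inner" field. */
    static List<String> all(String token, int minGram, int maxGram) {
        List<String> out = new ArrayList<>();
        for (int n = minGram; n <= maxGram; n++) {
            for (int start = 0; start + n <= token.length(); start++) {
                out.add(token.substring(start, start + n));
            }
        }
        return out;
    }
}
```

Character n-grams like these are a common choice for language identification because word beginnings and endings carry language-specific patterns, which is presumably why the test keeps start, inner, and end grams in separate fields.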
      1. LUCENE-1039.txt
        27 kB
        Karl Wettin

        Activity

        Otis Gospodnetic added a comment -

        Skimmed this very quickly - looks nice and clean to me!
        Why is this not in contrib yet? I didn't spot any dependencies....are there any?

        Karl Wettin added a comment -

        Otis Gospodnetic - 03/Dec/07 11:22 PM
        > Skimmed this very quickly - looks nice and clean to me!
        > Why is this not in contrib yet? I didn't spot any dependencies....are there any?

        No dependencies, although I get a 5x-10x faster classifier using LUCENE-550 while trained with 15,000 small instances (documents).

        One reason that this is not in contrib might be that it is based on an O'Reilly book. That book contains an example implementation in Python, but my code does not have much in common with it, except for the Greek kung fu found by a British priest 250 years ago.

        IANAL, but according to what I've read in the preface there are no problems releasing this with ASL.

        Talk to permissions@oreilly.com if you really want to make sure. I can supply you with the Python code example if you want to compare. The book is, however, worth the $40 if you want to understand what's going on in there.

        Paul Elschot added a comment -

        Did you consider using Lucene's termvectors?
        Some of the feature extractions would be easier to do with termvectors, especially when the index contains many more docs than the ones on which the classifier is built.
        Classifying a document from its termvector is also quite natural.

        Karl Wettin added a comment -

        > Did you consider using Lucene's termvectors?
        > Some of the feature extractions would be easier to do with termvectors,

        Not sure what you mean, they are already used when extracting features? Or do you speak of using the term vectors as training instance data when classifying? Bayesian classification can rely on class feature frequency alone.

        > especially when the index contains many more docs than the ones on which the classifier is built.

        The more documents in the index that are not used for classification, the more skewed the classification results will be, as Pr(feature|class) is based on docFreq and numDocs in this implementation.

        Paul Elschot added a comment -

        I'll have a more thorough look at the code, but do I understand correctly that it is using a lucene index per class?

        I'm just now building a Bayesian classifier using a single index with a field for the features (text terms) and a field for the classes.
        The feature field also has termvectors, and these make the implementation for training and classifying quite straightforward, after using some queries on the class field to get the doc ids for each class.
        Also, termvectors allow both a boolean and a strength implementation for the features. The strength is based on the frequency information in the term vectors, which record each term's frequency within a doc.
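The difference between boolean and strength features can be sketched as follows. This is an illustration under stated assumptions, not code from either implementation: the class `TermVectorFeatures` is hypothetical, a plain `Map<String, Integer>` of term frequencies stands in for a Lucene term vector, and normalizing the strength by the document's total term count is one possible choice among several.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: turning per-document term frequencies into feature weights. */
final class TermVectorFeatures {

    /** Boolean features: every term that occurs in the document gets weight 1.0. */
    static Map<String, Double> booleanFeatures(Map<String, Integer> termFreqs) {
        Map<String, Double> out = new TreeMap<>();
        for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
            if (e.getValue() > 0) {
                out.put(e.getKey(), 1.0);
            }
        }
        return out;
    }

    /** Strength features: weight is the term's share of all term occurrences (assumed normalization). */
    static Map<String, Double> strengthFeatures(Map<String, Integer> termFreqs) {
        int total = 0;
        for (int freq : termFreqs.values()) {
            total += freq;
        }
        Map<String, Double> out = new TreeMap<>();
        if (total == 0) {
            return out; // empty document, no features
        }
        for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
            if (e.getValue() > 0) {
                out.put(e.getKey(), e.getValue() / (double) total);
            }
        }
        return out;
    }
}
```

Boolean features treat a term that occurs once the same as one that occurs ten times, while strength features let repeated terms count for more.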

        Karl Wettin added a comment -

        > do I understand correctly that it is using a lucene index per class?

        One index per classifier. Each classifier can contain multiple classes. In the test case the field "class" is used to keep track of classes. Each document must only contain one token in the class field. Features can be stored in any number of fields.

        Cuong Hoang added a comment -

        >>Each document must only contain one token in the class field

        Does that mean each document in the training set can only belong to one class?

        I try to run the test case but get NullPointerException:

        TestClassifier
        org.apache.lucene.classifier.TestClassifier
        test(org.apache.lucene.classifier.TestClassifier)
        java.lang.NullPointerException
        at org.apache.lucene.index.MultiTermDocs.doc(MultiReader.java:356)
        at org.apache.lucene.classifier.BayesianClassifier.classFeatureFrequency(BayesianClassifier.java:92)
        at org.apache.lucene.classifier.BayesianClassifier.weightedFeatureClassProbability(BayesianClassifier.java:137)
        at org.apache.lucene.classifier.NaiveBayesClassifier.featuresClassProbability(NaiveBayesClassifier.java:54)
        at org.apache.lucene.classifier.NaiveBayesClassifier.classify(NaiveBayesClassifier.java:72)
        at org.apache.lucene.classifier.BayesianClassifier.classify(BayesianClassifier.java:70)
        at org.apache.lucene.classifier.BayesianClassifier.classify(BayesianClassifier.java:62)
        at org.apache.lucene.classifier.TestClassifier.testClassifier(TestClassifier.java:110)
        at org.apache.lucene.classifier.TestClassifier.test(TestClassifier.java:101)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at junit.framework.TestCase.runTest(TestCase.java:154)
        at junit.framework.TestCase.runBare(TestCase.java:127)
        at junit.framework.TestResult$1.protect(TestResult.java:106)
        at junit.framework.TestResult.runProtected(TestResult.java:124)
        at junit.framework.TestResult.run(TestResult.java:109)
        at junit.framework.TestCase.run(TestCase.java:118)
        at junit.framework.TestSuite.runTest(TestSuite.java:208)
        at junit.framework.TestSuite.run(TestSuite.java:203)
        at org.eclipse.jdt.internal.junit.runner.junit3.JUnit3TestReference.run(JUnit3TestReference.java:130)
        at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
        at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:460)
        at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:673)
        at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:386)
        at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:196)

        Karl Wettin added a comment -

        Cuong Hoang - 03/Apr/08 06:28 PM
        >>Each document must only contain one token in the class field
        >Does that mean each document in the training set can only belong to one class?

        You can have multiple class fields, but you can only classify an instance to one class at a time. Currently the class field and the classes buffer are set in Instances. I think it should be possible to move that code to NaiveBayesClassifier to allow classification on multiple classes on the same Instances.

        Instances.java:

          private String classField;
          private String[] classes;
        

        >I try to run the test case but get NullPointerException:

        > at org.apache.lucene.classifier.BayesianClassifier.classFeatureFrequency(BayesianClassifier.java:92)

        The tests pass here. Did you perhaps alter the content in some way?

        In BayesianClassifier.java, add the following on row 92:

            classDocs.seek(new Term(instances.getClassField(), _class));
        +    classDocs.next();
            while (featureDocs.next()) {
        

        Does that help?
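The one-line fix above positions `classDocs` on its first document before `doc()` is read, which matches the stack trace: the exception is thrown inside `doc()`. The sketch below illustrates the same pattern with a hypothetical `Cursor` class standing in for a Lucene `TermDocs`. It is an assumption about the general shape of such a loop, not the actual body of `classFeatureFrequency`.

```java
/** Hypothetical stand-in for walking two sorted posting lists; not the patch's code. */
final class PostingsSketch {

    /** Minimal cursor over sorted doc ids; unpositioned until next() is called. */
    static final class Cursor {
        private final int[] docs;
        private int pos = -1;

        Cursor(int... docs) {
            this.docs = docs;
        }

        /** Advances to the next doc id; returns false when exhausted. */
        boolean next() {
            return ++pos < docs.length;
        }

        /** Current doc id; only valid after next() has returned true. */
        int doc() {
            if (pos < 0 || pos >= docs.length) {
                throw new IllegalStateException("call next() before doc()");
            }
            return docs[pos];
        }
    }

    /** Counts doc ids present in both cursors with a merge walk. */
    static int countBoth(Cursor classDocs, Cursor featureDocs) {
        int count = 0;
        boolean hasClass = classDocs.next();     // position before the first doc() call
        boolean hasFeature = featureDocs.next();
        while (hasClass && hasFeature) {
            int c = classDocs.doc();
            int f = featureDocs.doc();
            if (c == f) {
                count++;
                hasClass = classDocs.next();
                hasFeature = featureDocs.next();
            } else if (c < f) {
                hasClass = classDocs.next();
            } else {
                hasFeature = featureDocs.next();
            }
        }
        return count;
    }
}
```

Here the stand-in cursor throws an `IllegalStateException` when read too early; a cursor that instead dereferences an unset internal field would fail with a `NullPointerException`, as seen in the report.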

        Karl Wettin added a comment -

        I am closing this issue due to uncertainty about intellectual property rights, pending an answer from Toby. I've tried to contact him several times via numerous media without response : (

        Toby Segaran added a comment -

        I'm the author of "Programming Collective Intelligence". I see no issue with property rights, the algorithm itself is widely known and my book just explains it. The code Karl wrote is completely original.

        Karl Wettin added a comment -

        What do you people think, should I commit this to Lucene or Mahout?

        Vaijanath N. Rao added a comment -

        Hi Karl,

        Can you tell me how to use this with FSDirectory() rather than RAMDirectory()? I am getting the following error:

        Exception in thread "main" java.lang.NullPointerException
        at org.apache.lucene.index.MultiSegmentReader$MultiTermDocs.doc(MultiSegmentReader.java:552)
        at org.apache.lucene.classifier.BayesianClassifier.classFeatureFrequency(BayesianClassifier.java:94)
        at org.apache.lucene.classifier.BayesianClassifier.weightedFeatureClassProbability(BayesianClassifier.java:139)
        at org.apache.lucene.classifier.NaiveBayesClassifier.featuresClassProbability(NaiveBayesClassifier.java:54)
        at org.apache.lucene.classifier.NaiveBayesClassifier.classify(NaiveBayesClassifier.java:71)
        at org.apache.lucene.classifier.BayesianClassifier.classify(BayesianClassifier.java:72)
        at org.apache.lucene.classifier.BayesianClassifier.classify(BayesianClassifier.java:64)

        This happens when I try to use FSDirectory(). I created the instance index as per the test sample and closed it. Now, while doing a classification, I get the above error.

        The way I create the directory is:

        FSDirectory dir = FSDirectory.getDirectory(new File(indexPath));
        IndexWriter iw = new IndexWriter(dir,instanceFactory.getAnalyzer(),create, MaxFieldLength.LIMITED);
        iw.close();

        The code for adding the instance is:
        instances.addInstance(record.getText(), record.getClass());

        instances.flush() and instances.close() both go fine.

        While doing classification I open the directory again (with just create set to false) and the rest of the calls remain the same.

        Instances instances = new Instances(dir, indexCreator.instanceFactory, "class");
        classifier = new NaiveBayesClassifier();
        return classifier.classify(instances, text)[0].getClassification();

        Can you help me by pointing out what I am doing wrong?

        --Thanks and Regards
        Vaijanath N. Rao

        Karl Wettin added a comment -

        Vaijanath,

        can you please post a small test case that demonstrates the problem?


          People

          • Assignee:
            Karl Wettin
            Reporter:
            Karl Wettin
          • Votes:
            1
            Watchers:
            5
