Property changes on: .
___________________________________________________________________
Modified: svn:mergeinfo
Merged /lucene/dev/trunk/lucene:r1384219-1384220,1384225,1384252-1384253,1384293,1384657,1401338,1401343,1401692,1402461,1403798-1403799,1414176,1415060,1415063,1415079,1415136,1415166,1419258,1428411,1430725
Index: module-build.xml
===================================================================
--- module-build.xml (revision 1447617)
+++ module-build.xml (working copy)
@@ -176,6 +176,28 @@
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Property changes on: module-build.xml
___________________________________________________________________
Modified: svn:mergeinfo
Merged /lucene/dev/trunk/lucene/module-build.xml:r1384219-1384220,1384225,1384252-1384253,1384293,1384657,1401338,1401343,1401692,1402461,1403798-1403799,1414176,1415060,1415063,1415079,1415136,1415166,1419258,1428411,1430725
Property changes on: queryparser
___________________________________________________________________
Modified: svn:mergeinfo
Merged /lucene/dev/trunk/lucene/queryparser:r1384219-1384220,1384225,1384252-1384253,1384293,1384657,1401338,1401343,1401692,1402461,1403798-1403799,1414176,1415060,1415063,1415079,1415136,1415166,1419258,1428411,1430725
Property changes on: facet
___________________________________________________________________
Modified: svn:mergeinfo
Merged /lucene/dev/trunk/lucene/facet:r1384219-1384220,1384225,1384252-1384253,1384293,1384657,1401338,1401343,1401692,1402461,1403798-1403799,1414176,1415060,1415063,1415079,1415136,1415166,1419258,1428411,1430725
Property changes on: common-build.xml
___________________________________________________________________
Modified: svn:mergeinfo
Merged /lucene/dev/trunk/lucene/common-build.xml:r1384219-1384220,1384225,1384252-1384253,1384293,1384657,1401338,1401343,1401692,1402461,1403798-1403799,1414176,1415060,1415063,1415079,1415136,1415166,1419258,1428411,1430725
Property changes on: demo
___________________________________________________________________
Modified: svn:mergeinfo
Merged /lucene/dev/trunk/lucene/demo:r1384219-1384220,1384225,1384252-1384253,1384293,1384657,1401338,1401343,1401692,1402461,1403798-1403799,1414176,1415060,1415063,1415079,1415136,1415166,1419258,1428411,1430725
Property changes on: core
___________________________________________________________________
Modified: svn:mergeinfo
Merged /lucene/dev/trunk/lucene/core:r1384219-1384220,1384225,1384252-1384253,1384293,1384657,1401338,1401343,1401692,1402461,1403798-1403799,1414176,1415060,1415063,1415079,1415136,1415166,1419258,1428411,1430725
Property changes on: core/src/test/org/apache/lucene/search/TestSort.java
___________________________________________________________________
Modified: svn:mergeinfo
Merged /lucene/dev/trunk/lucene/core/src/test/org/apache/lucene/search/TestSort.java:r1384219-1384220,1384225,1384252-1384253,1384293,1384657,1401338,1401343,1401692,1402461,1403798-1403799,1414176,1415060,1415063,1415079,1415136,1415166,1419258,1428411,1430725
Property changes on: core/src/test/org/apache/lucene/search/TestTopFieldCollector.java
___________________________________________________________________
Modified: svn:mergeinfo
Merged /lucene/dev/trunk/lucene/core/src/test/org/apache/lucene/search/TestTopFieldCollector.java:r1384219-1384220,1384225,1384252-1384253,1384293,1384657,1401338,1401343,1401692,1402461,1403798-1403799,1414176,1415060,1415063,1415079,1415136,1415166,1419258,1428411,1430725
Property changes on: core/src/test/org/apache/lucene/search/TestTotalHitCountCollector.java
___________________________________________________________________
Modified: svn:mergeinfo
Merged /lucene/dev/trunk/lucene/core/src/test/org/apache/lucene/search/TestTotalHitCountCollector.java:r1384219-1384220,1384225,1384252-1384253,1384293,1384657,1401338,1401343,1401692,1402461,1403798-1403799,1414176,1415060,1415063,1415079,1415136,1415166,1419258,1428411,1430725
Property changes on: core/src/test/org/apache/lucene/search/TestSortDocValues.java
___________________________________________________________________
Modified: svn:mergeinfo
Merged /lucene/dev/trunk/lucene/core/src/test/org/apache/lucene/search/TestSortDocValues.java:r1384219-1384220,1384225,1384252-1384253,1384293,1384657,1401338,1401343,1401692,1402461,1403798-1403799,1414176,1415060,1415063,1415079,1415136,1415166,1419258,1428411,1430725
Property changes on: core/src/test/org/apache/lucene/search/TestSortRandom.java
___________________________________________________________________
Modified: svn:mergeinfo
Merged /lucene/dev/trunk/lucene/core/src/test/org/apache/lucene/search/TestSortRandom.java:r1384219-1384220,1384225,1384252-1384253,1384293,1384657,1401338,1401343,1401692,1402461,1403798-1403799,1414176,1415060,1415063,1415079,1415136,1415166,1419258,1428411,1430725
Property changes on: core/src/test/org/apache/lucene/index/index.40.optimized.nocfs.zip
___________________________________________________________________
Modified: svn:mergeinfo
Merged /lucene/dev/trunk/lucene/core/src/test/org/apache/lucene/index/index.40.optimized.nocfs.zip:r1384219-1384220,1384225,1384252-1384253,1384293,1384657,1401338,1401343,1401692,1402461,1403798-1403799,1414176,1415060,1415063,1415079,1415136,1415166,1419258,1428411,1430725
Property changes on: core/src/test/org/apache/lucene/index/index.40.optimized.cfs.zip
___________________________________________________________________
Modified: svn:mergeinfo
Merged /lucene/dev/trunk/lucene/core/src/test/org/apache/lucene/index/index.40.optimized.cfs.zip:r1384219-1384220,1384225,1384252-1384253,1384293,1384657,1401338,1401343,1401692,1402461,1403798-1403799,1414176,1415060,1415063,1415079,1415136,1415166,1419258,1428411,1430725
Property changes on: core/src/test/org/apache/lucene/index/TestBackwardsCompatibility.java
___________________________________________________________________
Modified: svn:mergeinfo
Merged /lucene/dev/trunk/lucene/core/src/test/org/apache/lucene/index/TestBackwardsCompatibility.java:r1384219-1384220,1384225,1384252-1384253,1384293,1384657,1401338,1401343,1401692,1402461,1403798-1403799,1414176,1415060,1415063,1415079,1415136,1415166,1419258,1428411,1430725
Property changes on: core/src/test/org/apache/lucene/index/index.40.nocfs.zip
___________________________________________________________________
Modified: svn:mergeinfo
Merged /lucene/dev/trunk/lucene/core/src/test/org/apache/lucene/index/index.40.nocfs.zip:r1384219-1384220,1384225,1384252-1384253,1384293,1384657,1401338,1401343,1401692,1402461,1403798-1403799,1414176,1415060,1415063,1415079,1415136,1415166,1419258,1428411,1430725
Property changes on: core/src/test/org/apache/lucene/index/index.40.cfs.zip
___________________________________________________________________
Modified: svn:mergeinfo
Merged /lucene/dev/trunk/lucene/core/src/test/org/apache/lucene/index/index.40.cfs.zip:r1384219-1384220,1384225,1384252-1384253,1384293,1384657,1401338,1401343,1401692,1402461,1403798-1403799,1414176,1415060,1415063,1415079,1415136,1415166,1419258,1428411,1430725
Property changes on: benchmark
___________________________________________________________________
Modified: svn:mergeinfo
Merged /lucene/dev/trunk/lucene/benchmark:r1384219-1384220,1384225,1384252-1384253,1384293,1384657,1401338,1401343,1401692,1402461,1403798-1403799,1414176,1415060,1415063,1415079,1415136,1415166,1419258,1428411,1430725
Property changes on: spatial
___________________________________________________________________
Modified: svn:mergeinfo
Merged /lucene/dev/trunk/lucene/spatial:r1384219-1384220,1384225,1384252-1384253,1384293,1384657,1401338,1401343,1401692,1402461,1403798-1403799,1414176,1415060,1415063,1415079,1415136,1415166,1419258,1428411,1430725
Index: build.xml
===================================================================
--- build.xml (revision 1447617)
+++ build.xml (working copy)
@@ -245,54 +245,48 @@
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
-
-
-
-
-
-
+
+
+
+
+
+
+
Property changes on: build.xml
___________________________________________________________________
Modified: svn:mergeinfo
Merged /lucene/dev/trunk/lucene/build.xml:r1384219-1384220,1384225,1384252-1384253,1384293,1384657,1401338,1401343,1401692,1402461,1403798-1403799,1414176,1415060,1415063,1415079,1415136,1415166,1419258,1428411,1430725
Property changes on: join
___________________________________________________________________
Modified: svn:mergeinfo
Merged /lucene/dev/trunk/lucene/join:r1384219-1384220,1384225,1384252-1384253,1384293,1384657,1401338,1401343,1401692,1402461,1403798-1403799,1414176,1415060,1415063,1415079,1415136,1415166,1419258,1428411,1430725
Property changes on: tools
___________________________________________________________________
Modified: svn:mergeinfo
Merged /lucene/dev/trunk/lucene/tools:r1384219-1384220,1384225,1384252-1384253,1384293,1384657,1401338,1401343,1401692,1402461,1403798-1403799,1414176,1415060,1415063,1415079,1415136,1415166,1419258,1428411,1430725
Property changes on: backwards
___________________________________________________________________
Modified: svn:mergeinfo
Merged /lucene/dev/trunk/lucene/backwards:r1384219-1384220,1384225,1384252-1384253,1384293,1384657,1401338,1401343,1401692,1402461,1403798-1403799,1414176,1415060,1415063,1415079,1415136,1415166,1419258,1428411,1430725
Property changes on: site
___________________________________________________________________
Modified: svn:mergeinfo
Merged /lucene/dev/trunk/lucene/site:r1384219-1384220,1384225,1384252-1384253,1384293,1384657,1401338,1401343,1401692,1402461,1403798-1403799,1414176,1415060,1415063,1415079,1415136,1415166,1419258,1428411,1430725
Property changes on: licenses
___________________________________________________________________
Modified: svn:mergeinfo
Merged /lucene/dev/trunk/lucene/licenses:r1384219-1384220,1384225,1384252-1384253,1384293,1384657,1401338,1401343,1401692,1402461,1403798-1403799,1414176,1415060,1415063,1415079,1415136,1415166,1419258,1428411,1430725
Property changes on: memory
___________________________________________________________________
Modified: svn:mergeinfo
Merged /lucene/dev/trunk/lucene/memory:r1384219-1384220,1384225,1384252-1384253,1384293,1384657,1401338,1401343,1401692,1402461,1403798-1403799,1414176,1415060,1415063,1415079,1415136,1415166,1419258,1428411,1430725
Property changes on: JRE_VERSION_MIGRATION.txt
___________________________________________________________________
Modified: svn:mergeinfo
Merged /lucene/dev/trunk/lucene/JRE_VERSION_MIGRATION.txt:r1384219-1384220,1384225,1384252-1384253,1384293,1384657,1401338,1401343,1401692,1402461,1403798-1403799,1414176,1415060,1415063,1415079,1415136,1415166,1419258,1428411,1430725
Property changes on: BUILD.txt
___________________________________________________________________
Modified: svn:mergeinfo
Merged /lucene/dev/trunk/lucene/BUILD.txt:r1384219-1384220,1384225,1384252-1384253,1384293,1384657,1401338,1401343,1401692,1402461,1403798-1403799,1414176,1415060,1415063,1415079,1415136,1415166,1419258,1428411,1430725
Property changes on: suggest
___________________________________________________________________
Modified: svn:mergeinfo
Merged /lucene/dev/trunk/lucene/suggest:r1384219-1384220,1384225,1384252-1384253,1384293,1384657,1401338,1401343,1401692,1402461,1403798-1403799,1414176,1415060,1415063,1415079,1415136,1415166,1419258,1428411,1430725
Property changes on: analysis
___________________________________________________________________
Modified: svn:mergeinfo
Merged /lucene/dev/trunk/lucene/analysis:r1384219-1384220,1384225,1384252-1384253,1384293,1384657,1401338,1401343,1401692,1402461,1403798-1403799,1414176,1415060,1415063,1415079,1415136,1415166,1419258,1428411,1430725
Property changes on: analysis/icu/src/java/org/apache/lucene/collation/ICUCollationKeyFilterFactory.java
___________________________________________________________________
Modified: svn:mergeinfo
Merged /lucene/dev/trunk/lucene/analysis/icu/src/java/org/apache/lucene/collation/ICUCollationKeyFilterFactory.java:r1384219-1384220,1384225,1384252-1384253,1384293,1384657,1401338,1401343,1401692,1402461,1403798-1403799,1414176,1415060,1415063,1415079,1415136,1415166,1419258,1428411,1430725
Property changes on: CHANGES.txt
___________________________________________________________________
Modified: svn:mergeinfo
Merged /lucene/dev/trunk/lucene/CHANGES.txt:r1384219-1384220,1384225,1384252-1384253,1384293,1384657,1401338,1401343,1401692,1402461,1403798-1403799,1414176,1415060,1415063,1415079,1415136,1415166,1419258,1428411,1430725
Property changes on: grouping
___________________________________________________________________
Modified: svn:mergeinfo
Merged /lucene/dev/trunk/lucene/grouping:r1384219-1384220,1384225,1384252-1384253,1384293,1384657,1401338,1401343,1401692,1402461,1403798-1403799,1414176,1415060,1415063,1415079,1415136,1415166,1419258,1428411,1430725
Property changes on: misc
___________________________________________________________________
Modified: svn:mergeinfo
Merged /lucene/dev/trunk/lucene/misc:r1384219-1384220,1384225,1384252-1384253,1384293,1384657,1401338,1401343,1401692,1402461,1403798-1403799,1414176,1415060,1415063,1415079,1415136,1415166,1419258,1428411,1430725
Property changes on: classification
___________________________________________________________________
Added: svn:mergeinfo
Merged /lucene/dev/trunk/lucene/classification:r1384219*,1384220,1384225,1384252-1384253,1384293,1384657,1401338,1401343,1401692,1402461,1403798-1403799,1414176,1415060,1415063,1415079,1415136,1415166,1419258,1428411,1430725
Property changes on: classification/ivy.xml
___________________________________________________________________
Added: svn:eol-style
+ native
Index: classification/src/test/org/apache/lucene/classification/KNearestNeighborClassifierTest.java
===================================================================
--- classification/src/test/org/apache/lucene/classification/KNearestNeighborClassifierTest.java (revision 1384219)
+++ classification/src/test/org/apache/lucene/classification/KNearestNeighborClassifierTest.java (working copy)
@@ -1,5 +1,3 @@
-package org.apache.lucene.classification;
-
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
@@ -16,6 +14,7 @@
* See the License for the specific language governing permissions and
* limitations under the License.
*/
+package org.apache.lucene.classification;
import org.apache.lucene.analysis.MockAnalyzer;
import org.junit.Test;
@@ -27,7 +26,7 @@
@Test
public void testBasicUsage() throws Exception {
- checkCorrectClassification(new KNearestNeighborClassifier(1), new MockAnalyzer(random()));
+ checkCorrectClassification(new KNearestNeighborClassifier(1), new MockAnalyzer(random()));
}
}
Index: classification/src/test/org/apache/lucene/classification/utils/DataSplitterTest.java
===================================================================
--- classification/src/test/org/apache/lucene/classification/utils/DataSplitterTest.java (revision 1415060)
+++ classification/src/test/org/apache/lucene/classification/utils/DataSplitterTest.java (working copy)
@@ -55,7 +55,7 @@
public void setUp() throws Exception {
super.setUp();
dir = newDirectory();
- indexWriter = new RandomIndexWriter(random(), dir);
+ indexWriter = new RandomIndexWriter(random(), dir, new MockAnalyzer(random()));
FieldType ft = new FieldType(TextField.TYPE_STORED);
ft.setStoreTermVectors(true);
@@ -91,7 +91,7 @@
@Test
public void testSplitOnAllFields() throws Exception {
- assertSplit(originalIndex, 0.1, 0.1, null);
+ assertSplit(originalIndex, 0.1, 0.1);
}
Index: classification/src/test/org/apache/lucene/classification/SimpleNaiveBayesClassifierTest.java
===================================================================
--- classification/src/test/org/apache/lucene/classification/SimpleNaiveBayesClassifierTest.java (revision 1384219)
+++ classification/src/test/org/apache/lucene/classification/SimpleNaiveBayesClassifierTest.java (working copy)
@@ -1,5 +1,3 @@
-package org.apache.lucene.classification;
-
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
@@ -16,115 +14,36 @@
* See the License for the specific language governing permissions and
* limitations under the License.
*/
+package org.apache.lucene.classification;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.MockAnalyzer;
-import org.apache.lucene.document.Document;
-import org.apache.lucene.document.Field;
-import org.apache.lucene.document.TextField;
-import org.apache.lucene.index.RandomIndexWriter;
-import org.apache.lucene.index.SlowCompositeReaderWrapper;
-import org.apache.lucene.store.Directory;
-import org.apache.lucene.util.LuceneTestCase;
-import org.junit.After;
-import org.junit.Before;
+import org.apache.lucene.analysis.ngram.EdgeNGramTokenizer;
import org.junit.Test;
+import java.io.Reader;
+
/**
* Testcase for {@link SimpleNaiveBayesClassifier}
*/
-public class SimpleNaiveBayesClassifierTest extends LuceneTestCase {
+public class SimpleNaiveBayesClassifierTest extends ClassificationTestBase {
- private RandomIndexWriter indexWriter;
- private String textFieldName;
- private String classFieldName;
- private Analyzer analyzer;
- private Directory dir;
-
- @Before
- public void setUp() throws Exception {
- super.setUp();
- analyzer = new MockAnalyzer(random());
- dir = newDirectory();
- indexWriter = new RandomIndexWriter(random(), dir);
- textFieldName = "text";
- classFieldName = "cat";
+ @Test
+ public void testBasicUsage() throws Exception {
+ checkCorrectClassification(new SimpleNaiveBayesClassifier(), new MockAnalyzer(random()));
}
- @After
- public void tearDown() throws Exception {
- super.tearDown();
- indexWriter.close();
- dir.close();
+ @Test
+ public void testNGramUsage() throws Exception {
+ checkCorrectClassification(new SimpleNaiveBayesClassifier(), new NGramAnalyzer());
}
- @Test
- public void testBasicUsage() throws Exception {
- SlowCompositeReaderWrapper compositeReaderWrapper = null;
- try {
- populateIndex();
- SimpleNaiveBayesClassifier simpleNaiveBayesClassifier = new SimpleNaiveBayesClassifier();
- compositeReaderWrapper = new SlowCompositeReaderWrapper(indexWriter.getReader());
- simpleNaiveBayesClassifier.train(compositeReaderWrapper, textFieldName, classFieldName, analyzer);
- String newText = "Much is made of what the likes of Facebook, Google and Apple know about users. Truth is, Amazon may know more. ";
- assertEquals("technology", simpleNaiveBayesClassifier.assignClass(newText));
- } finally {
- if (compositeReaderWrapper != null)
- compositeReaderWrapper.close();
+ private class NGramAnalyzer extends Analyzer {
+ @Override
+ protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
+ return new TokenStreamComponents(new EdgeNGramTokenizer(reader, EdgeNGramTokenizer.Side.BACK,
+ 10, 20));
}
}
- private void populateIndex() throws Exception {
-
- Document doc = new Document();
- doc.add(new TextField(textFieldName, "The traveling press secretary for Mitt Romney lost his cool and cursed at reporters " +
- "who attempted to ask questions of the Republican presidential candidate in a public plaza near the Tomb of " +
- "the Unknown Soldier in Warsaw Tuesday.", Field.Store.YES));
- doc.add(new TextField(classFieldName, "politics", Field.Store.YES));
-
- indexWriter.addDocument(doc, analyzer);
-
- doc = new Document();
- doc.add(new TextField(textFieldName, "Mitt Romney seeks to assure Israel and Iran, as well as Jewish voters in the United" +
- " States, that he will be tougher against Iran's nuclear ambitions than President Barack Obama.", Field.Store.YES));
- doc.add(new TextField(classFieldName, "politics", Field.Store.YES));
- indexWriter.addDocument(doc, analyzer);
-
- doc = new Document();
- doc.add(new TextField(textFieldName, "And there's a threshold question that he has to answer for the American people and " +
- "that's whether he is prepared to be commander-in-chief,\" she continued. \"As we look to the past events, we " +
- "know that this raises some questions about his preparedness and we'll see how the rest of his trip goes.\"", Field.Store.YES));
- doc.add(new TextField(classFieldName, "politics", Field.Store.YES));
- indexWriter.addDocument(doc, analyzer);
-
- doc = new Document();
- doc.add(new TextField(textFieldName, "Still, when it comes to gun policy, many congressional Democrats have \"decided to " +
- "keep quiet and not go there,\" said Alan Lizotte, dean and professor at the State University of New York at " +
- "Albany's School of Criminal Justice.", Field.Store.YES));
- doc.add(new TextField(classFieldName, "politics", Field.Store.YES));
- indexWriter.addDocument(doc, analyzer);
-
- doc = new Document();
- doc.add(new TextField(textFieldName, "Standing amongst the thousands of people at the state Capitol, Jorstad, director of " +
- "technology at the University of Wisconsin-La Crosse, documented the historic moment and shared it with the " +
- "world through the Internet.", Field.Store.YES));
- doc.add(new TextField(classFieldName, "technology", Field.Store.YES));
- indexWriter.addDocument(doc, analyzer);
-
- doc = new Document();
- doc.add(new TextField(textFieldName, "So, about all those experts and analysts who've spent the past year or so saying " +
- "Facebook was going to make a phone. A new expert has stepped forward to say it's not going to happen.", Field.Store.YES));
- doc.add(new TextField(classFieldName, "technology", Field.Store.YES));
- indexWriter.addDocument(doc, analyzer);
-
- doc = new Document();
- doc.add(new TextField(textFieldName, "More than 400 million people trust Google with their e-mail, and 50 million store files" +
- " in the cloud using the Dropbox service. People manage their bank accounts, pay bills, trade stocks and " +
- "generally transfer or store huge volumes of personal data online.", Field.Store.YES));
- doc.add(new TextField(classFieldName, "technology", Field.Store.YES));
- indexWriter.addDocument(doc, analyzer);
-
- indexWriter.commit();
- }
-
}
Property changes on: classification/src/test/org/apache/lucene/classification/SimpleNaiveBayesClassifierTest.java
___________________________________________________________________
Added: svn:eol-style
+ native
Index: classification/src/test/org/apache/lucene/classification/ClassificationTestBase.java
===================================================================
--- classification/src/test/org/apache/lucene/classification/ClassificationTestBase.java (revision 1384219)
+++ classification/src/test/org/apache/lucene/classification/ClassificationTestBase.java (working copy)
@@ -1,5 +1,3 @@
-package org.apache.lucene.classification;
-
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
@@ -16,14 +14,16 @@
* See the License for the specific language governing permissions and
* limitations under the License.
*/
+package org.apache.lucene.classification;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
-import org.apache.lucene.document.Field;
+import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.RandomIndexWriter;
import org.apache.lucene.index.SlowCompositeReaderWrapper;
import org.apache.lucene.store.Directory;
+import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.LuceneTestCase;
import org.junit.After;
import org.junit.Before;
@@ -54,15 +54,17 @@
dir.close();
}
- protected void checkCorrectClassification(Classifier classifier, Analyzer analyzer) throws Exception {
+
+ protected void checkCorrectClassification(Classifier<BytesRef> classifier, Analyzer analyzer) throws Exception {
SlowCompositeReaderWrapper compositeReaderWrapper = null;
try {
populateIndex(analyzer);
compositeReaderWrapper = new SlowCompositeReaderWrapper(indexWriter.getReader());
classifier.train(compositeReaderWrapper, textFieldName, classFieldName, analyzer);
String newText = "Much is made of what the likes of Facebook, Google and Apple know about users. Truth is, Amazon may know more.";
- ClassificationResult classificationResult = classifier.assignClass(newText);
- assertEquals("technology", classificationResult.getAssignedClass());
+ ClassificationResult<BytesRef> classificationResult = classifier.assignClass(newText);
+ assertNotNull(classificationResult.getAssignedClass());
+ assertEquals(new BytesRef("technology").utf8ToString(), classificationResult.getAssignedClass().utf8ToString());
assertTrue(classificationResult.getScore() > 0);
} finally {
if (compositeReaderWrapper != null)
@@ -72,52 +74,58 @@
private void populateIndex(Analyzer analyzer) throws Exception {
+ FieldType ft = new FieldType(TextField.TYPE_STORED);
+ ft.setIndexed(true);
+ ft.setStoreTermVectors(true);
+ ft.setStoreTermVectorOffsets(true);
+ ft.setStoreTermVectorPositions(true);
+
Document doc = new Document();
- doc.add(new TextField(textFieldName, "The traveling press secretary for Mitt Romney lost his cool and cursed at reporters " +
+ doc.add(newField(textFieldName, "The traveling press secretary for Mitt Romney lost his cool and cursed at reporters " +
"who attempted to ask questions of the Republican presidential candidate in a public plaza near the Tomb of " +
- "the Unknown Soldier in Warsaw Tuesday.", Field.Store.YES));
- doc.add(new TextField(classFieldName, "politics", Field.Store.YES));
+ "the Unknown Soldier in Warsaw Tuesday.", ft));
+ doc.add(newField(classFieldName, "politics", ft));
indexWriter.addDocument(doc, analyzer);
doc = new Document();
- doc.add(new TextField(textFieldName, "Mitt Romney seeks to assure Israel and Iran, as well as Jewish voters in the United" +
- " States, that he will be tougher against Iran's nuclear ambitions than President Barack Obama.", Field.Store.YES));
- doc.add(new TextField(classFieldName, "politics", Field.Store.YES));
+ doc.add(newField(textFieldName, "Mitt Romney seeks to assure Israel and Iran, as well as Jewish voters in the United" +
+ " States, that he will be tougher against Iran's nuclear ambitions than President Barack Obama.", ft));
+ doc.add(newField(classFieldName, "politics", ft));
indexWriter.addDocument(doc, analyzer);
doc = new Document();
- doc.add(new TextField(textFieldName, "And there's a threshold question that he has to answer for the American people and " +
+ doc.add(newField(textFieldName, "And there's a threshold question that he has to answer for the American people and " +
"that's whether he is prepared to be commander-in-chief,\" she continued. \"As we look to the past events, we " +
- "know that this raises some questions about his preparedness and we'll see how the rest of his trip goes.\"", Field.Store.YES));
- doc.add(new TextField(classFieldName, "politics", Field.Store.YES));
+ "know that this raises some questions about his preparedness and we'll see how the rest of his trip goes.\"", ft));
+ doc.add(newField(classFieldName, "politics", ft));
indexWriter.addDocument(doc, analyzer);
doc = new Document();
- doc.add(new TextField(textFieldName, "Still, when it comes to gun policy, many congressional Democrats have \"decided to " +
+ doc.add(newField(textFieldName, "Still, when it comes to gun policy, many congressional Democrats have \"decided to " +
"keep quiet and not go there,\" said Alan Lizotte, dean and professor at the State University of New York at " +
- "Albany's School of Criminal Justice.", Field.Store.YES));
- doc.add(new TextField(classFieldName, "politics", Field.Store.YES));
+ "Albany's School of Criminal Justice.", ft));
+ doc.add(newField(classFieldName, "politics", ft));
indexWriter.addDocument(doc, analyzer);
doc = new Document();
- doc.add(new TextField(textFieldName, "Standing amongst the thousands of people at the state Capitol, Jorstad, director of " +
+ doc.add(newField(textFieldName, "Standing amongst the thousands of people at the state Capitol, Jorstad, director of " +
"technology at the University of Wisconsin-La Crosse, documented the historic moment and shared it with the " +
- "world through the Internet.", Field.Store.YES));
- doc.add(new TextField(classFieldName, "technology", Field.Store.YES));
+ "world through the Internet.", ft));
+ doc.add(newField(classFieldName, "technology", ft));
indexWriter.addDocument(doc, analyzer);
doc = new Document();
- doc.add(new TextField(textFieldName, "So, about all those experts and analysts who've spent the past year or so saying " +
- "Facebook was going to make a phone. A new expert has stepped forward to say it's not going to happen.", Field.Store.YES));
- doc.add(new TextField(classFieldName, "technology", Field.Store.YES));
+ doc.add(newField(textFieldName, "So, about all those experts and analysts who've spent the past year or so saying " +
+ "Facebook was going to make a phone. A new expert has stepped forward to say it's not going to happen.", ft));
+ doc.add(newField(classFieldName, "technology", ft));
indexWriter.addDocument(doc, analyzer);
doc = new Document();
- doc.add(new TextField(textFieldName, "More than 400 million people trust Google with their e-mail, and 50 million store files" +
+ doc.add(newField(textFieldName, "More than 400 million people trust Google with their e-mail, and 50 million store files" +
" in the cloud using the Dropbox service. People manage their bank accounts, pay bills, trade stocks and " +
- "generally transfer or store huge volumes of personal data online.", Field.Store.YES));
- doc.add(new TextField(classFieldName, "technology", Field.Store.YES));
+ "generally transfer or store huge volumes of personal data online.", ft));
+ doc.add(newField(classFieldName, "technology", ft));
indexWriter.addDocument(doc, analyzer);
indexWriter.commit();
Index: classification/src/java/org/apache/lucene/classification/KNearestNeighborClassifier.java
===================================================================
--- classification/src/java/org/apache/lucene/classification/KNearestNeighborClassifier.java (revision 1384219)
+++ classification/src/java/org/apache/lucene/classification/KNearestNeighborClassifier.java (working copy)
@@ -1,5 +1,3 @@
-package org.apache.lucene.classification;
-
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
@@ -16,6 +14,7 @@
* See the License for the specific language governing permissions and
* limitations under the License.
*/
+package org.apache.lucene.classification;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.index.AtomicReader;
@@ -24,6 +23,7 @@
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
+import org.apache.lucene.util.BytesRef;
import java.io.IOException;
import java.io.StringReader;
@@ -33,8 +33,10 @@
/**
* A k-Nearest Neighbor classifier (see http://en.wikipedia.org/wiki/K-nearest_neighbors) based
* on {@link MoreLikeThis}
+ *
+ * @lucene.experimental
*/
-public class KNearestNeighborClassifier implements Classifier {
+public class KNearestNeighborClassifier implements Classifier<BytesRef> {
private MoreLikeThis mlt;
private String textFieldName;
@@ -42,40 +44,55 @@
private IndexSearcher indexSearcher;
private int k;
+ /**
+ * Create a {@link Classifier} using the kNN algorithm
+ *
+ * @param k the number of neighbors to analyze as an int
+ */
public KNearestNeighborClassifier(int k) {
this.k = k;
}
+ /**
+ * {@inheritDoc}
+ */
@Override
- public ClassificationResult assignClass(String text) throws IOException {
+ public ClassificationResult<BytesRef> assignClass(String text) throws IOException {
Query q = mlt.like(new StringReader(text), textFieldName);
- TopDocs docs = indexSearcher.search(q, k);
+ TopDocs topDocs = indexSearcher.search(q, k);
+ return selectClassFromNeighbors(topDocs);
+ }
+ private ClassificationResult<BytesRef> selectClassFromNeighbors(TopDocs topDocs) throws IOException {
// TODO : improve the nearest neighbor selection
- Map<String, Integer> classCounts = new HashMap<String, Integer>();
- for (ScoreDoc scoreDoc : docs.scoreDocs) {
- String cl = indexSearcher.doc(scoreDoc.doc).getField(classFieldName).stringValue();
- Integer count = classCounts.get(cl);
- if (count != null) {
- classCounts.put(cl, count + 1);
+ Map<BytesRef, Integer> classCounts = new HashMap<BytesRef, Integer>();
+ for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
+ BytesRef cl = new BytesRef(indexSearcher.doc(scoreDoc.doc).getField(classFieldName).stringValue());
+ if (cl != null) {
+ Integer count = classCounts.get(cl);
+ if (count != null) {
+ classCounts.put(cl, count + 1);
+ } else {
+ classCounts.put(cl, 1);
+ }
}
- else {
- classCounts.put(cl, 1);
- }
}
- int max = 0;
- String assignedClass = null;
- for (String cl : classCounts.keySet()) {
+ double max = 0;
+ BytesRef assignedClass = new BytesRef();
+ for (BytesRef cl : classCounts.keySet()) {
Integer count = classCounts.get(cl);
if (count > max) {
max = count;
- assignedClass = cl;
+ assignedClass = cl.clone();
}
}
- double score = 1; // TODO : derive score from query
- return new ClassificationResult(assignedClass, score);
+ double score = max / (double) k;
+ return new ClassificationResult<BytesRef>(assignedClass, score);
}
+ /**
+ * {@inheritDoc}
+ */
@Override
public void train(AtomicReader atomicReader, String textFieldName, String classFieldName, Analyzer analyzer) throws IOException {
this.textFieldName = textFieldName;
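Note on the voting step above: selectClassFromNeighbors counts how often each class occurs among the top k hits, picks the most frequent one, and scores it as votes / k. The following is a minimal, Lucene-free sketch of that idea. The class and method names (KnnVoteSketch, majorityClass, score) are hypothetical, and plain Strings stand in for the BytesRef labels used in the patch.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: names are hypothetical, and String labels
// replace the BytesRef values read from the class field in the patch.
final class KnnVoteSketch {

  /** Returns the most frequent label among the neighbors, or null if the list is empty. */
  static String majorityClass(List<String> neighborLabels) {
    Map<String, Integer> classCounts = new HashMap<String, Integer>();
    for (String label : neighborLabels) {
      Integer count = classCounts.get(label);
      classCounts.put(label, count == null ? 1 : count + 1);
    }
    int max = 0;
    String assigned = null;
    for (Map.Entry<String, Integer> entry : classCounts.entrySet()) {
      if (entry.getValue() > max) {
        max = entry.getValue();
        assigned = entry.getKey();
      }
    }
    return assigned;
  }

  /** Score in the style of the patch: votes for the winning class divided by k. */
  static double score(List<String> neighborLabels, int k) {
    String winner = majorityClass(neighborLabels);
    if (winner == null) {
      return 0d;
    }
    int votes = 0;
    for (String label : neighborLabels) {
      if (label.equals(winner)) {
        votes++;
      }
    }
    return votes / (double) k;
  }
}
```

Because the divisor is k rather than the number of hits actually returned, a search that yields fewer than k neighbors can never reach a score of 1.0. When two classes tie, the winner depends on map iteration order.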
Index: classification/src/java/org/apache/lucene/classification/utils/DatasetSplitter.java
===================================================================
--- classification/src/java/org/apache/lucene/classification/utils/DatasetSplitter.java (revision 1415060)
+++ classification/src/java/org/apache/lucene/classification/utils/DatasetSplitter.java (working copy)
@@ -25,7 +25,7 @@
import org.apache.lucene.index.AtomicReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
-import org.apache.lucene.index.StorableField;
+import org.apache.lucene.index.IndexableField;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.ScoreDoc;
@@ -43,20 +43,37 @@
private double crossValidationRatio;
private double testRatio;
+ /**
+ * Create a {@link DatasetSplitter} by giving test and cross validation IDXs sizes
+ *
+ * @param testRatio the ratio of the original index to be used for the test IDX as a double between 0.0 and 1.0
+ * @param crossValidationRatio the ratio of the original index to be used for the c.v. IDX as a double between 0.0 and 1.0
+ */
public DatasetSplitter(double testRatio, double crossValidationRatio) {
this.crossValidationRatio = crossValidationRatio;
this.testRatio = testRatio;
}
+ /**
+ * Split a given index into 3 indexes for training, test and cross validation tasks respectively
+ *
+ * @param originalIndex an {@link AtomicReader} on the source index
+ * @param trainingIndex a {@link Directory} used to write the training index
+ * @param testIndex a {@link Directory} used to write the test index
+ * @param crossValidationIndex a {@link Directory} used to write the cross validation index
+ * @param analyzer {@link Analyzer} used to create the new docs
+ * @param fieldNames names of fields that need to be put in the new indexes or null if all should be used
+ * @throws IOException if any writing operation fails on any of the indexes
+ */
public void split(AtomicReader originalIndex, Directory trainingIndex, Directory testIndex, Directory crossValidationIndex,
Analyzer analyzer, String... fieldNames) throws IOException {
// TODO : check that the passed fields are stored in the original index
// create IWs for train / test / cv IDXs
- IndexWriter testWriter = new IndexWriter(testIndex, new IndexWriterConfig(Version.LUCENE_50, analyzer));
- IndexWriter cvWriter = new IndexWriter(crossValidationIndex, new IndexWriterConfig(Version.LUCENE_50, analyzer));
- IndexWriter trainingWriter = new IndexWriter(trainingIndex, new IndexWriterConfig(Version.LUCENE_50, analyzer));
+ IndexWriter testWriter = new IndexWriter(testIndex, new IndexWriterConfig(Version.LUCENE_42, analyzer));
+ IndexWriter cvWriter = new IndexWriter(crossValidationIndex, new IndexWriterConfig(Version.LUCENE_42, analyzer));
+ IndexWriter trainingWriter = new IndexWriter(trainingIndex, new IndexWriterConfig(Version.LUCENE_42, analyzer));
try {
int size = originalIndex.maxDoc();
@@ -82,17 +99,14 @@
doc.add(new Field(fieldName, originalIndex.document(scoreDoc.doc).getField(fieldName).stringValue(), ft));
}
} else {
- for (StorableField storableField : originalIndex.document(scoreDoc.doc).getFields()) {
- if (storableField.readerValue()!= null){
+ for (IndexableField storableField : originalIndex.document(scoreDoc.doc).getFields()) {
+ if (storableField.readerValue() != null) {
doc.add(new Field(storableField.name(), storableField.readerValue(), ft));
- }
- else if (storableField.binaryValue()!= null){
+ } else if (storableField.binaryValue() != null) {
doc.add(new Field(storableField.name(), storableField.binaryValue(), ft));
- }
- else if (storableField.stringValue()!= null){
+ } else if (storableField.stringValue() != null) {
doc.add(new Field(storableField.name(), storableField.stringValue(), ft));
- }
- else if (storableField.numericValue()!= null){
+ } else if (storableField.numericValue() != null) {
doc.add(new Field(storableField.name(), storableField.numericValue().toString(), ft));
}
}
@@ -101,19 +115,19 @@
// add it to one of the IDXs
if (b % 2 == 0 && testWriter.maxDoc() < size * testRatio) {
testWriter.addDocument(doc);
- testWriter.commit();
} else if (cvWriter.maxDoc() < size * crossValidationRatio) {
cvWriter.addDocument(doc);
- cvWriter.commit();
} else {
trainingWriter.addDocument(doc);
- trainingWriter.commit();
}
b++;
}
} catch (Exception e) {
throw new IOException(e);
} finally {
+ testWriter.commit();
+ cvWriter.commit();
+ trainingWriter.commit();
// close IWs
testWriter.close();
cvWriter.close();
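Note on the assignment rule above: every even-numbered document goes to the test index until that index holds size * testRatio documents, the cross validation index then takes documents until it holds size * crossValidationRatio, and everything else goes to the training index. The sketch below reproduces only that counting logic, under the assumption that plain counters can stand in for the IndexWriter.maxDoc() calls. The class and method names are hypothetical.

```java
// Illustrative sketch only: plain counters replace the IndexWriter.maxDoc()
// checks from the patch, and no documents are actually written.
final class SplitAssignmentSketch {

  /** Returns the number of documents assigned to {test, crossValidation, training}. */
  static int[] assign(int size, double testRatio, double crossValidationRatio) {
    int test = 0;
    int crossValidation = 0;
    int training = 0;
    for (int b = 0; b < size; b++) {
      if (b % 2 == 0 && test < size * testRatio) {
        test++;              // even-numbered docs fill the test split first
      } else if (crossValidation < size * crossValidationRatio) {
        crossValidation++;   // then the cross validation split is filled
      } else {
        training++;          // everything else is training data
      }
    }
    return new int[] {test, crossValidation, training};
  }
}
```

With 100 documents and both ratios set to 0.25, this yields 25 test, 25 cross validation and 50 training documents. Note that the training split only starts receiving documents once the other two are full, so the splits are not interleaved uniformly over the original index.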
Index: classification/src/java/org/apache/lucene/classification/Classifier.java
===================================================================
--- classification/src/java/org/apache/lucene/classification/Classifier.java (revision 1384219)
+++ classification/src/java/org/apache/lucene/classification/Classifier.java (working copy)
@@ -1,5 +1,3 @@
-package org.apache.lucene.classification;
-
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
@@ -16,6 +14,7 @@
* See the License for the specific language governing permissions and
* limitations under the License.
*/
+package org.apache.lucene.classification;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.index.AtomicReader;
@@ -23,17 +22,19 @@
import java.io.IOException;
/**
- * A classifier, see http://en.wikipedia.org/wiki/Classifier_(mathematics)
+ * A classifier, see http://en.wikipedia.org/wiki/Classifier_(mathematics), which assigns classes of type
+ * <code>T</code>
+ * @lucene.experimental
*/
-public interface Classifier {
+public interface Classifier<T> {
/**
- * Assign a class to the given text String
+ * Assign a class (with score) to the given text String
* @param text a String containing text to be classified
- * @return a String representing a class
+ * @return a {@link ClassificationResult} holding assigned class of type T and score
* @throws IOException
*/
- public String assignClass(String text) throws IOException;
+ public ClassificationResult<T> assignClass(String text) throws IOException;
/**
* Train the classifier using the underlying Lucene index
Property changes on: classification/src/java/org/apache/lucene/classification/Classifier.java
___________________________________________________________________
Added: svn:eol-style
+ native
Index: classification/src/java/org/apache/lucene/classification/SimpleNaiveBayesClassifier.java
===================================================================
--- classification/src/java/org/apache/lucene/classification/SimpleNaiveBayesClassifier.java (revision 1384219)
+++ classification/src/java/org/apache/lucene/classification/SimpleNaiveBayesClassifier.java (working copy)
@@ -1,5 +1,3 @@
-package org.apache.lucene.classification;
-
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
@@ -16,6 +14,7 @@
* See the License for the specific language governing permissions and
* limitations under the License.
*/
+package org.apache.lucene.classification;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
@@ -29,6 +28,7 @@
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
+import org.apache.lucene.search.TotalHitCountCollector;
import org.apache.lucene.util.BytesRef;
import java.io.IOException;
@@ -38,8 +38,10 @@
/**
* A simplistic Lucene based NaiveBayes classifier, see http://en.wikipedia.org/wiki/Naive_Bayes_classifier
+ *
+ * @lucene.experimental
*/
-public class SimpleNaiveBayesClassifier implements Classifier {
+public class SimpleNaiveBayesClassifier implements Classifier<BytesRef> {
private AtomicReader atomicReader;
private String textFieldName;
@@ -48,6 +50,18 @@
private Analyzer analyzer;
private IndexSearcher indexSearcher;
+ /**
+ * Creates a new NaiveBayes classifier.
+ * Note that you must call {@link #train(AtomicReader, String, String, Analyzer) train()} before you can
+ * classify any documents.
+ */
+ public SimpleNaiveBayesClassifier() {
+ }
+
+ /**
+ * {@inheritDoc}
+ */
+ @Override
public void train(AtomicReader atomicReader, String textFieldName, String classFieldName, Analyzer analyzer)
throws IOException {
this.atomicReader = atomicReader;
@@ -71,34 +85,37 @@
return result.toArray(new String[result.size()]);
}
- public String assignClass(String inputDocument) throws IOException {
+ /**
+ * {@inheritDoc}
+ */
+ @Override
+ public ClassificationResult<BytesRef> assignClass(String inputDocument) throws IOException {
if (atomicReader == null) {
throw new RuntimeException("need to train the classifier first");
}
- Double max = 0d;
- String foundClass = null;
+ double max = 0d;
+ BytesRef foundClass = new BytesRef();
Terms terms = MultiFields.getTerms(atomicReader, classFieldName);
TermsEnum termsEnum = terms.iterator(null);
- BytesRef t = termsEnum.next();
- while (t != null) {
- String classValue = t.utf8ToString();
+ BytesRef next;
+ String[] tokenizedDoc = tokenizeDoc(inputDocument);
+ while ((next = termsEnum.next()) != null) {
// TODO : turn it to be in log scale
- Double clVal = calculatePrior(classValue) * calculateLikelihood(inputDocument, classValue);
+ double clVal = calculatePrior(next) * calculateLikelihood(tokenizedDoc, next);
if (clVal > max) {
max = clVal;
- foundClass = classValue;
+ foundClass = next.clone();
}
- t = termsEnum.next();
}
- return foundClass;
+ return new ClassificationResult<BytesRef>(foundClass, max);
}
- private Double calculateLikelihood(String document, String c) throws IOException {
+ private double calculateLikelihood(String[] tokenizedDoc, BytesRef c) throws IOException {
// for each word
- Double result = 1d;
- for (String word : tokenizeDoc(document)) {
+ double result = 1d;
+ for (String word : tokenizedDoc) {
// search with text:word AND class:c
int hits = getWordFreqForClass(word, c);
@@ -117,26 +134,28 @@
return result;
}
- private double getTextTermFreqForClass(String c) throws IOException {
+ private double getTextTermFreqForClass(BytesRef c) throws IOException {
Terms terms = MultiFields.getTerms(atomicReader, textFieldName);
long numPostings = terms.getSumDocFreq(); // number of term/doc pairs
double avgNumberOfUniqueTerms = numPostings / (double) terms.getDocCount(); // avg # of unique terms per doc
- int docsWithC = atomicReader.docFreq(classFieldName, new BytesRef(c));
+ int docsWithC = atomicReader.docFreq(new Term(classFieldName, c));
return avgNumberOfUniqueTerms * docsWithC; // avg # of unique terms in text field per doc * # docs with c
}
- private int getWordFreqForClass(String word, String c) throws IOException {
+ private int getWordFreqForClass(String word, BytesRef c) throws IOException {
BooleanQuery booleanQuery = new BooleanQuery();
booleanQuery.add(new BooleanClause(new TermQuery(new Term(textFieldName, word)), BooleanClause.Occur.MUST));
booleanQuery.add(new BooleanClause(new TermQuery(new Term(classFieldName, c)), BooleanClause.Occur.MUST));
- return indexSearcher.search(booleanQuery, 1).totalHits;
+ TotalHitCountCollector totalHitCountCollector = new TotalHitCountCollector();
+ indexSearcher.search(booleanQuery, totalHitCountCollector);
+ return totalHitCountCollector.getTotalHits();
}
- private Double calculatePrior(String currentClass) throws IOException {
+ private double calculatePrior(BytesRef currentClass) throws IOException {
return (double) docCount(currentClass) / docsWithClassSize;
}
- private int docCount(String countedClass) throws IOException {
+ private int docCount(BytesRef countedClass) throws IOException {
return atomicReader.docFreq(new Term(classFieldName, countedClass));
}
}
Property changes on: classification/src/java/org/apache/lucene/classification/SimpleNaiveBayesClassifier.java
___________________________________________________________________
Added: svn:eol-style
+ native
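Note on the "TODO : turn it to be in log scale" comment in assignClass: multiplying the prior by one likelihood per token can underflow to 0.0 for long documents, at which point no class beats the initial max of 0. The sketch below is one possible way to address the TODO and is not part of this patch. It sums logarithms instead of multiplying probabilities. The class and method names are hypothetical.

```java
// Illustrative sketch only: shows why a log-scale score is more robust than
// the plain product; it is an assumption about how the TODO could be resolved.
final class LogScaleNaiveBayesSketch {

  /** Plain product, in the style of the current implementation: prior * product of likelihoods. */
  static double productScore(double prior, double[] wordLikelihoods) {
    double result = prior;
    for (double likelihood : wordLikelihoods) {
      result *= likelihood;
    }
    return result;
  }

  /** Log-scale variant: log(prior) + sum of log(likelihood). */
  static double logScore(double prior, double[] wordLikelihoods) {
    double result = Math.log(prior);
    for (double likelihood : wordLikelihoods) {
      result += Math.log(likelihood);
    }
    return result;
  }
}
```

Two details matter if this approach is adopted. Log scores are negative, so the running maximum would have to start at Double.NEGATIVE_INFINITY rather than 0. A likelihood of exactly 0 yields negative infinity, so the likelihood estimate needs to stay strictly positive.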
Index: classification/src/java/org/apache/lucene/classification/ClassificationResult.java
===================================================================
--- classification/src/java/org/apache/lucene/classification/ClassificationResult.java (revision 1384219)
+++ classification/src/java/org/apache/lucene/classification/ClassificationResult.java (working copy)
@@ -1,5 +1,3 @@
-package org.apache.lucene.classification;
-
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
@@ -16,24 +14,39 @@
* See the License for the specific language governing permissions and
* limitations under the License.
*/
+package org.apache.lucene.classification;
/**
- * The result of a call to {@link Classifier#assignClass(String)} holding an assigned class and a score.
+ * The result of a call to {@link Classifier#assignClass(String)} holding an assigned class of type T and a score.
+ * @lucene.experimental
*/
-public class ClassificationResult {
+public class ClassificationResult<T> {
- private String assignedClass;
+ private T assignedClass;
private double score;
- public ClassificationResult(String assignedClass, double score) {
+ /**
+ * Constructor
+ * @param assignedClass the class T assigned by a {@link Classifier}
+ * @param score the score for the assignedClass as a double
+ */
+ public ClassificationResult(T assignedClass, double score) {
this.assignedClass = assignedClass;
this.score = score;
}
- public String getAssignedClass() {
+ /**
+ * retrieve the result class
+ * @return a T representing an assigned class
+ */
+ public T getAssignedClass() {
return assignedClass;
}
+ /**
+ * retrieve the result score
+ * @return a double representing a result score
+ */
public double getScore() {
return score;
}
Index: classification/src/java/org/apache/lucene/classification/package.html
===================================================================
--- classification/src/java/org/apache/lucene/classification/package.html (revision 1384219)
+++ classification/src/java/org/apache/lucene/classification/package.html (working copy)
@@ -18,6 +18,6 @@
Uses already seen data (the indexed documents) to classify new documents.
Currently only contains a (simplistic) Lucene based Naive Bayes classifier
-but more implementations will be added in the future.
+and a k-Nearest Neighbor classifier