MAHOUT-1364

Upgrade Mahout codebase to Lucene 4.6

    Details

      Description

      Parallel randomized tests (using the Carrot RandomizedRunner) fail on Mac OS for code that invokes the Lucene API; see the discussion in MAHOUT-1345. The fix is to upgrade to a Lucene version later than 4.3.1 (the Lucene version currently in Mahout trunk).

      1. LuceneIterableTest.diff
        3 kB
        Suneel Marthi
      2. MAHOUT-1364.patch
        33 kB
        Frank Scholten

        Activity

        Suneel Marthi added a comment -

        Mahout 0.9 has been released supporting Lucene 4.6.1.

        Suneel Marthi added a comment -

        Patch committed to trunk; tests have passed the Hudson build.

        Hudson added a comment -

        SUCCESS: Integrated in Mahout-Quality #2374 (See https://builds.apache.org/job/Mahout-Quality/2374/)
        MAHOUT-1364: Upgrade Mahout codebase to Lucene 4.6, adding CHANGELOG entry for this. (smarthi: rev 1551945)

        • /mahout/trunk/CHANGELOG

        MAHOUT-1364: Upgrade Mahout codebase to Lucene 4.6. (frankscholten: rev 1551935)

        • /mahout/trunk/core/src/main/java/org/apache/mahout/common/lucene/AnalyzerUtils.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/common/lucene/TokenStreamIterator.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/vectorizer/document/SequenceFileTokenizerMapper.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/vectorizer/encoders/LuceneTextValueEncoder.java
        • /mahout/trunk/core/src/test/java/org/apache/mahout/vectorizer/DictionaryVectorizerTest.java
        • /mahout/trunk/core/src/test/java/org/apache/mahout/vectorizer/DocumentProcessorTest.java
        • /mahout/trunk/core/src/test/java/org/apache/mahout/vectorizer/HighDFWordsPrunerTest.java
        • /mahout/trunk/core/src/test/java/org/apache/mahout/vectorizer/SparseVectorsFromSequenceFilesTest.java
        • /mahout/trunk/core/src/test/java/org/apache/mahout/vectorizer/encoders/TextValueEncoderTest.java
        • /mahout/trunk/examples/src/main/java/org/apache/mahout/classifier/NewsgroupHelper.java
        • /mahout/trunk/integration/src/main/java/org/apache/mahout/text/LuceneSegmentInputFormat.java
        • /mahout/trunk/integration/src/main/java/org/apache/mahout/text/LuceneSegmentInputSplit.java
        • /mahout/trunk/integration/src/main/java/org/apache/mahout/text/LuceneSegmentRecordReader.java
        • /mahout/trunk/integration/src/main/java/org/apache/mahout/text/LuceneStorageConfiguration.java
        • /mahout/trunk/integration/src/main/java/org/apache/mahout/text/MailArchivesClusteringAnalyzer.java
        • /mahout/trunk/integration/src/main/java/org/apache/mahout/text/ReadOnlyFileSystemDirectory.java
        • /mahout/trunk/integration/src/main/java/org/apache/mahout/text/SequenceFilesFromLuceneStorageDriver.java
        • /mahout/trunk/integration/src/main/java/org/apache/mahout/text/SequenceFilesFromLuceneStorageMapper.java
        • /mahout/trunk/integration/src/main/java/org/apache/mahout/text/wikipedia/WikipediaAnalyzer.java
        • /mahout/trunk/integration/src/main/java/org/apache/mahout/utils/regex/AnalyzerTransformer.java
        • /mahout/trunk/integration/src/test/java/org/apache/mahout/clustering/TestClusterDumper.java
        • /mahout/trunk/integration/src/test/java/org/apache/mahout/text/AbstractLuceneStorageTest.java
        • /mahout/trunk/integration/src/test/java/org/apache/mahout/text/LuceneSegmentInputSplitTest.java
        • /mahout/trunk/integration/src/test/java/org/apache/mahout/text/LuceneSegmentRecordReaderTest.java
        • /mahout/trunk/integration/src/test/java/org/apache/mahout/text/TestSequenceFilesFromDirectory.java
        • /mahout/trunk/integration/src/test/java/org/apache/mahout/utils/nlp/collocations/llr/BloomTokenFilterTest.java
        • /mahout/trunk/integration/src/test/java/org/apache/mahout/utils/vectors/lucene/CachedTermInfoTest.java
        • /mahout/trunk/integration/src/test/java/org/apache/mahout/utils/vectors/lucene/DriverTest.java
        • /mahout/trunk/integration/src/test/java/org/apache/mahout/utils/vectors/lucene/LuceneIterableTest.java
        • /mahout/trunk/pom.xml
        Frank Scholten added a comment -

        Patch committed to trunk.

        Suneel Marthi added a comment -

        Assigning this to Frank, the patch is good to be committed to trunk.

        Frank Scholten added a comment -

        Saw you committed the LuceneIterableTest cleanup already. I don't know how ReadOnlyFileSystemDirectory can be improved. Also, I think the MR version of lucene2seq is not used as much as the sequential version so I suggest we create a separate ticket for that particular issue for after 0.9 and commit the current patch.

        Suneel Marthi added a comment -

        Frank, attaching an updated version of LuceneIterableTest.java; please include it in your patch for MAHOUT-1364.

        Suneel Marthi added a comment -

        Frank, the patch looks good and all tests pass.

        1. LuceneIterableTest.java definitely can do with some cleanup.
        2. ReadOnlyFileSystemDirectory.java - Is there something in Lucene 4.x that can replace this?

        Frank Scholten added a comment -

        This patch updates to Lucene 4.6.0 and adds end() and close() calls on TokenStream in several places in the code. It also adds some @ThreadLeakScope annotations.

        I would like someone to review this change. Grant Ingersoll, maybe you can have a look?
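
For readers following the thread, the consumer call order that the end() and close() additions refer to is reset(), then incrementToken() until it returns false, then end(), then close(). The sketch below illustrates that order in plain Java. It is an assumption-laden illustration, not code from the patch: `StrictTokenStream` is a hypothetical stand-in that only mimics call-order checking and is not Lucene's TokenStream, and `tokenize` is an invented helper name.

```java
import java.util.ArrayList;
import java.util.List;

final class TokenStreamContract {

  /** Hypothetical stand-in for a token stream; it only models call-order checks. */
  static final class StrictTokenStream {
    private enum State { NEW, RESET, EXHAUSTED, ENDED, CLOSED }

    private final String[] words;
    private State state = State.NEW;
    private int next;
    private String term;

    StrictTokenStream(String text) {
      String trimmed = text.trim();
      words = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    /** Must be called before the first incrementToken(). */
    void reset() {
      check(state == State.NEW || state == State.CLOSED, "reset()");
      next = 0;
      state = State.RESET;
    }

    /** Advances to the next token; returns false once the input is exhausted. */
    boolean incrementToken() {
      check(state == State.RESET, "incrementToken()");
      if (next < words.length) {
        term = words[next++];
        return true;
      }
      state = State.EXHAUSTED;
      return false;
    }

    String term() { return term; }

    /** Called once after incrementToken() has returned false. */
    void end() {
      check(state == State.EXHAUSTED, "end()");
      state = State.ENDED;
    }

    /** Releases the stream; allowed from any state. */
    void close() { state = State.CLOSED; }

    boolean isClosed() { return state == State.CLOSED; }

    private void check(boolean ok, String call) {
      if (!ok) {
        throw new IllegalStateException(call + " called in state " + state);
      }
    }
  }

  /** Consumes the stream in the order reset, incrementToken*, end, close. */
  static List<String> tokenize(StrictTokenStream stream) {
    List<String> tokens = new ArrayList<String>();
    try {
      stream.reset();
      while (stream.incrementToken()) {
        tokens.add(stream.term());
      }
      stream.end();
    } finally {
      stream.close(); // runs even if a step above throws
    }
    return tokens;
  }

  public static void main(String[] args) {
    System.out.println(tokenize(new StrictTokenStream("upgrade mahout to lucene")));
  }
}
```

Putting close() in a finally block means the stream is released even when a step fails. In this model, skipping reset() raises an IllegalStateException on the first incrementToken() call, which is the kind of failure a strict workflow surfaces in callers that previously skipped steps.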

        Suneel Marthi added a comment - edited

        My initial attempt at this broke all of the FeatureVectorEncoders, due to the strict TokenStream workflow in Lucene 4.6. This may be more involved than initially anticipated. I will still target this for 0.9, but if we can't make it, it may have to be deferred to release 1.0, with the 0.9 release upgrading to Lucene 4.5.1 instead.


          People

          • Assignee: Frank Scholten
          • Reporter: Suneel Marthi