MAHOUT-1364

Upgrade Mahout codebase to Lucene 4.6

    Details

      Description

      Parallel randomized tests (using the Carrot RandomizedRunner) fail on Mac OS for code that invokes the Lucene API; see the discussion in M-1345. The fix is to upgrade to a Lucene version newer than 4.3.1 (the Lucene version currently in Mahout trunk).

      1. LuceneIterableTest.diff
        3 kB
        Suneel Marthi
      2. MAHOUT-1364.patch
        33 kB
        Frank Scholten

        Activity

        Suneel Marthi created issue -
        Suneel Marthi added a comment - - edited

        My initial attempt at this broke all of the FeatureVectorEncoders, due to the strict TokenStream workflow in Lucene 4.6. This may be more involved than initially anticipated. I will still target this for 0.9, but if we can't make it, the move to Lucene 4.6 may have to be deferred to Release 1.0, with an upgrade to Lucene 4.5.1 for the 0.9 release instead.

        Suneel Marthi made changes -
        Field Original Value New Value
        Component/s Classification [ 12312152 ]
        Component/s CLI [ 12316624 ]
        Component/s Clustering [ 12312151 ]
        Frank Scholten made changes -
        Attachment MAHOUT-1364.patch [ 12617552 ]
        Frank Scholten added a comment -

        This patch updates to Lucene 4.6.0 and adds end() and close() calls on TokenStream in several places in the code. It also adds some @ThreadLeakScope annotations.

        I would like to have someone review this change. Grant Ingersoll, maybe you can have a look?
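
For readers unfamiliar with the workflow discussed in this ticket: a consumer is expected to call reset(), then incrementToken() until it returns false, then end(), and finally close(). The sketch below illustrates that call order in plain Java. StrictStream, its term() method, and TokenStreamWorkflowSketch are hypothetical stand-ins written for this illustration; they are not Lucene classes and not the code from the patch, and the exact checks Lucene performs may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: a plain-Java stand-in that mimics a strict
// reset() -> incrementToken()* -> end() -> close() call order.
// None of these types come from Lucene or from the MAHOUT-1364 patch.
final class TokenStreamWorkflowSketch {

  /** Hypothetical whitespace tokenizer that rejects out-of-order calls. */
  static final class StrictStream implements AutoCloseable {
    private enum State { NEW, RESET, EXHAUSTED, ENDED, CLOSED }

    private final String[] words;
    private State state = State.NEW;
    private int next;
    private String term;

    StrictStream(String text) {
      String trimmed = text.trim();
      this.words = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    void reset() {
      require(state == State.NEW, "reset() must be called once, before incrementToken()");
      state = State.RESET;
    }

    boolean incrementToken() {
      require(state == State.RESET || state == State.EXHAUSTED,
          "incrementToken() called without a preceding reset()");
      if (next < words.length) {
        term = words[next++];
        return true;
      }
      state = State.EXHAUSTED;
      return false;
    }

    /** Current token text; valid after incrementToken() returned true. */
    String term() {
      return term;
    }

    void end() {
      require(state == State.EXHAUSTED, "end() must follow the last incrementToken()");
      state = State.ENDED;
    }

    @Override
    public void close() {
      state = State.CLOSED;
    }

    private static void require(boolean ok, String message) {
      if (!ok) {
        throw new IllegalStateException(message);
      }
    }
  }

  /** Consumes a stream in the expected order and always closes it. */
  static List<String> consume(StrictStream stream) {
    List<String> terms = new ArrayList<>();
    try {
      stream.reset();
      while (stream.incrementToken()) {
        terms.add(stream.term());
      }
      stream.end();
    } finally {
      stream.close();
    }
    return terms;
  }

  public static void main(String[] args) {
    System.out.println(consume(new StrictStream("upgrade mahout to lucene")));
    try {
      new StrictStream("oops").incrementToken(); // reset() was skipped
    } catch (IllegalStateException e) {
      System.out.println("contract violation: " + e.getMessage());
    }
  }
}
```

Putting close() in a finally block means the stream is released even if tokenization fails midway, which is the kind of cleanup that missing end() and close() calls would otherwise skip.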

        Frank Scholten made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Suneel Marthi added a comment -

        Frank, the patch looks good and all tests pass.

        1. LuceneIterableTest.java could definitely do with some cleanup.
        2. ReadOnlyFileSystemDirectory.java - Is there something in Lucene 4.x that can replace this?

        Suneel Marthi added a comment -

        Frank, I am attaching an updated version of LuceneIterableTest.java; please include this in your patch for MAHOUT-1364.

        Suneel Marthi made changes -
        Attachment LuceneIterableTest.diff [ 12618981 ]
        Frank Scholten added a comment -

        I saw you committed the LuceneIterableTest cleanup already. I don't know how ReadOnlyFileSystemDirectory can be improved. Also, I think the MR version of lucene2seq is not used as much as the sequential version, so I suggest we commit the current patch and create a separate ticket for that particular issue, to be handled after 0.9.

        Suneel Marthi added a comment -

        Assigning this to Frank, the patch is good to be committed to trunk.

        Suneel Marthi made changes -
        Assignee Suneel Marthi [ smarthi ] Frank Scholten [ frankscholten ]
        Frank Scholten added a comment -

        Patch committed to trunk.

        Hudson added a comment -

        SUCCESS: Integrated in Mahout-Quality #2374 (See https://builds.apache.org/job/Mahout-Quality/2374/)
        MAHOUT-1364: Upgrade Mahout codebase to Lucene 4.6, adding CHANGELOG entry for this. (smarthi: rev 1551945)

        • /mahout/trunk/CHANGELOG

        MAHOUT-1364: Upgrade Mahout codebase to Lucene 4.6. (frankscholten: rev 1551935)

        • /mahout/trunk/core/src/main/java/org/apache/mahout/common/lucene/AnalyzerUtils.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/common/lucene/TokenStreamIterator.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/vectorizer/document/SequenceFileTokenizerMapper.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/vectorizer/encoders/LuceneTextValueEncoder.java
        • /mahout/trunk/core/src/test/java/org/apache/mahout/vectorizer/DictionaryVectorizerTest.java
        • /mahout/trunk/core/src/test/java/org/apache/mahout/vectorizer/DocumentProcessorTest.java
        • /mahout/trunk/core/src/test/java/org/apache/mahout/vectorizer/HighDFWordsPrunerTest.java
        • /mahout/trunk/core/src/test/java/org/apache/mahout/vectorizer/SparseVectorsFromSequenceFilesTest.java
        • /mahout/trunk/core/src/test/java/org/apache/mahout/vectorizer/encoders/TextValueEncoderTest.java
        • /mahout/trunk/examples/src/main/java/org/apache/mahout/classifier/NewsgroupHelper.java
        • /mahout/trunk/integration/src/main/java/org/apache/mahout/text/LuceneSegmentInputFormat.java
        • /mahout/trunk/integration/src/main/java/org/apache/mahout/text/LuceneSegmentInputSplit.java
        • /mahout/trunk/integration/src/main/java/org/apache/mahout/text/LuceneSegmentRecordReader.java
        • /mahout/trunk/integration/src/main/java/org/apache/mahout/text/LuceneStorageConfiguration.java
        • /mahout/trunk/integration/src/main/java/org/apache/mahout/text/MailArchivesClusteringAnalyzer.java
        • /mahout/trunk/integration/src/main/java/org/apache/mahout/text/ReadOnlyFileSystemDirectory.java
        • /mahout/trunk/integration/src/main/java/org/apache/mahout/text/SequenceFilesFromLuceneStorageDriver.java
        • /mahout/trunk/integration/src/main/java/org/apache/mahout/text/SequenceFilesFromLuceneStorageMapper.java
        • /mahout/trunk/integration/src/main/java/org/apache/mahout/text/wikipedia/WikipediaAnalyzer.java
        • /mahout/trunk/integration/src/main/java/org/apache/mahout/utils/regex/AnalyzerTransformer.java
        • /mahout/trunk/integration/src/test/java/org/apache/mahout/clustering/TestClusterDumper.java
        • /mahout/trunk/integration/src/test/java/org/apache/mahout/text/AbstractLuceneStorageTest.java
        • /mahout/trunk/integration/src/test/java/org/apache/mahout/text/LuceneSegmentInputSplitTest.java
        • /mahout/trunk/integration/src/test/java/org/apache/mahout/text/LuceneSegmentRecordReaderTest.java
        • /mahout/trunk/integration/src/test/java/org/apache/mahout/text/TestSequenceFilesFromDirectory.java
        • /mahout/trunk/integration/src/test/java/org/apache/mahout/utils/nlp/collocations/llr/BloomTokenFilterTest.java
        • /mahout/trunk/integration/src/test/java/org/apache/mahout/utils/vectors/lucene/CachedTermInfoTest.java
        • /mahout/trunk/integration/src/test/java/org/apache/mahout/utils/vectors/lucene/DriverTest.java
        • /mahout/trunk/integration/src/test/java/org/apache/mahout/utils/vectors/lucene/LuceneIterableTest.java
        • /mahout/trunk/pom.xml
        Suneel Marthi added a comment -

        Patch committed to trunk; the tests have passed the Hudson build.

        Suneel Marthi made changes -
        Status Patch Available [ 10002 ] Resolved [ 5 ]
        Resolution Fixed [ 1 ]
        Suneel Marthi added a comment -

        Mahout 0.9 has been released supporting Lucene 4.6.1.

        Suneel Marthi made changes -
        Status Resolved [ 5 ] Closed [ 6 ]
        Transition | Time In Source Status | Execution Times | Last Executer | Last Execution Date
        Open → Patch Available | 11d 13h 41m | 1 | Frank Scholten | 07/Dec/13 11:46
        Patch Available → Resolved | 11d 6h 14m | 1 | Suneel Marthi | 18/Dec/13 18:00
        Resolved → Closed | 46d 13h 47m | 1 | Suneel Marthi | 03/Feb/14 07:47

          People

          • Assignee:
            Frank Scholten
            Reporter:
            Suneel Marthi
          • Votes:
            0
            Watchers:
            3
