
MAHOUT-944: LuceneIndexToSequenceFiles (lucene2seq) utility

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 0.5
    • Fix Version/s: 0.8
    • Component/s: Integration
    • Labels: None

      Description

      Here is a lucene2seq tool I used in a project. It creates sequence files based on the stored fields of a Lucene index.

      The output from this tool can then be fed into seq2sparse, and from there you can do text clustering.

      Comes with Java bean configuration.

      Let me know what you think. Some CLI code can be added later on. I used this for a small-scale project of about 100,000 docs. Is an MR version useful or is that overkill?

      See https://github.com/frankscholten/mahout/tree/lucene2seq for commits and review comments from Simon Willnauer (Thanks Simon!)

      or the attached patch.
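
      For readers who want the gist without reading the patch, a minimal sketch of the core idea, assuming the Lucene 4.x and Hadoop 1.x APIs the code eventually targeted and two hypothetical stored field names; the actual tool reads its field names from configuration:

          import java.io.File;

          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.fs.FileSystem;
          import org.apache.hadoop.fs.Path;
          import org.apache.hadoop.io.SequenceFile;
          import org.apache.hadoop.io.Text;
          import org.apache.lucene.document.Document;
          import org.apache.lucene.index.DirectoryReader;
          import org.apache.lucene.store.FSDirectory;

          // Minimal sketch, not the patch: copy stored fields from a Lucene index
          // into a SequenceFile that seq2sparse can consume. Assumes every document
          // stores the hypothetical "id" and "body" fields; deletions are ignored.
          public class StoredFieldsToSeq {
            public static void main(String[] args) throws Exception {
              DirectoryReader reader = DirectoryReader.open(FSDirectory.open(new File(args[0])));
              Configuration conf = new Configuration();
              SequenceFile.Writer writer = SequenceFile.createWriter(
                  FileSystem.get(conf), conf, new Path(args[1]), Text.class, Text.class);
              try {
                for (int docId = 0; docId < reader.maxDoc(); docId++) {
                  Document doc = reader.document(docId); // loads stored fields only
                  writer.append(new Text(doc.get("id")), new Text(doc.get("body")));
                }
              } finally {
                writer.close();
                reader.close();
              }
            }
          }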

      Attachments

      1. MAHOUT-944-minor.patch
        69 kB
        Grant Ingersoll
      2. MAHOUT-944.patch
        91 kB
        Grant Ingersoll
      3. MAHOUT-944.patch
        86 kB
        Grant Ingersoll
      4. MAHOUT-944.patch
        81 kB
        Grant Ingersoll
      5. MAHOUT-944.patch
        81 kB
        Grant Ingersoll
      6. MAHOUT-944.patch
        82 kB
        Grant Ingersoll
      7. MAHOUT-944.patch
        85 kB
        Grant Ingersoll
      8. MAHOUT-944.patch
        377 kB
        Frank Scholten
      9. MAHOUT-944.patch
        86 kB
        Frank Scholten
      10. MAHOUT-944.patch
        39 kB
        Frank Scholten
      11. MAHOUT-944.patch
        39 kB
        Frank Scholten
      12. MAHOUT-944.patch
        53 kB
        Frank Scholten
      13. MAHOUT-944.patch
        20 kB
        Frank Scholten

        Activity

        Frank Scholten added a comment -

        Started working on CLI code. Still have to support Lucene queries as a parameter. I think it would be cool to add field separators between the contents of the field and the extra fields. That way this tool can also be used as an entry point into seq2encoded.

        See https://github.com/frankscholten/mahout/commit/25584aac9dc0727ebc86ae245768f592161d4813

        Frank Scholten added a comment -

        Ah, seq2encoded currently supports text only. I was under the impression that seq2encoded could be configured to encode several data types simultaneously, such as body and lines from the 20 News example. No need for field separators for lucene2seq, then.

        Frank Scholten added a comment -

        CLI now supports all options.

        Frank Scholten added a comment -

        Added git patch. The previous patch, created by IntelliJ, contained headers and weird formatting.

        Frank Scholten added a comment -

        New patch, this time generated with 'git diff --no-prefix'.

        Run 'git config --global diff.noprefix true'
        to have git always use the --no-prefix option.

        Lance Norskog added a comment -

        A map-reduce version:

        1. Lets you handle much bigger indexes. There are a lot of huge ones. I can see clustering Wikipedia articles with this.
        2. It is possible to sort by score. This makes it easy to grab a thousand interesting documents and ignore the rest. Our doc-prep facilities could make good use of this.
        Frank Scholten added a comment -

        1. Ok. This involves using the FileSystemDirectory from Hadoop contrib, writing a custom InputFormat and RecordReader which splits the document result across the mappers. Correct?

        2. I guess the sort by score would mostly be useful for the sequential version?

        Lance Norskog added a comment - edited

        This is a Lucene query. It's already sorted! So, the sequential algorithm should already do this. It would be helpful if the sequential version could split the output across multiple files. This allows the subsequent m/r jobs to run more efficiently.

        Text search applications (Solr, Elasticsearch, Indextank, Katta) support splitting large indexes into "shards" across multiple computers. If this is a map/reduce job, it can handle index shards from multiple computers, and set target disk file sizes.

        I guess those are the classes you need.

        Frank Scholten added a comment -

        Added initial MR version which works on my local machine based on a logical split of the document result set. Each Mapper fetches its own documents from the index. Will test tomorrow on a cluster.

        See https://github.com/frankscholten/mahout/commit/e26a8c6c0869b451a80f9aced30895a64981d80c

        If I understand correctly I can improve data locality by making it so each Mapper is assigned its own shard, a physical split. For this I have to create InputSplit and RecordReader implementations that know about the different shards.
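
        A rough sketch of what such a per-shard split could look like under the new (org.apache.hadoop.mapreduce) API; all names here are illustrative rather than taken from the patch:

            import java.io.DataInput;
            import java.io.DataOutput;
            import java.io.IOException;

            import org.apache.hadoop.io.Text;
            import org.apache.hadoop.io.Writable;
            import org.apache.hadoop.mapreduce.InputSplit;

            // Hypothetical shape of a per-shard/per-segment split. The committed
            // class ended up as LuceneSegmentInputSplit; this is only a sketch.
            public class SegmentSplit extends InputSplit implements Writable {
              private Text indexPath = new Text();   // directory holding the index
              private Text segmentName = new Text(); // e.g. "_0", "_1", ...
              private long length;                   // segment size, used for scheduling

              public SegmentSplit() {} // required for Writable deserialization

              public SegmentSplit(String indexPath, String segmentName, long length) {
                this.indexPath.set(indexPath);
                this.segmentName.set(segmentName);
                this.length = length;
              }

              @Override public long getLength() { return length; }

              @Override public String[] getLocations() { return new String[0]; } // no locality hints

              @Override public void write(DataOutput out) throws IOException {
                indexPath.write(out);
                segmentName.write(out);
                out.writeLong(length);
              }

              @Override public void readFields(DataInput in) throws IOException {
                indexPath.readFields(in);
                segmentName.readFields(in);
                length = in.readLong();
              }
            }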

        Grant Ingersoll added a comment -

        Looks reasonable at first blush, with a few comments:

        1. Why the need to get the scorer, etc.? I wonder if it would be more efficient to just have a simple Collector that did the work and we skipped scoring, etc. (see the Collector sketch after this comment)
        2. What's the benefit of the Field/Extra Fields thing? Would it make sense to just have List<String> fields? If there is more than one, let's concat, otherwise...
        3. LuceneIndexToSequenceFilesConfiguration -> LISFConfig? Let's shorten that sucker up as the verbosity doesn't really get us anything
        4. In the Driver, please switch the input args processing to the AbstractJob model. See KMeansDriver as an example
        5. Even if we don't have a M/R job, it would be nice if we could take in, via the driver, multiple indexes. You could imagine piling all of your shards together and then converting them all.
        6. Have you tested this with numeric (trie) fields?
        7. The integration pom.xml inherits from the parent, which has Lucene defined in it, so no need to mod the integration one, I think. We should upgrade the parent one to 3.5.0.

        A better name for all of this is probably LuceneStorageTo... as it implies that the fields must have storage. I could see us having another implementation that works on the posting list itself
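
        Regarding point 1, a minimal sketch of such a score-free Collector, written against the Lucene 4.x API the ticket later moved to (names are hypothetical, not the committed code):

            import java.io.IOException;

            import org.apache.lucene.document.Document;
            import org.apache.lucene.index.AtomicReader;
            import org.apache.lucene.index.AtomicReaderContext;
            import org.apache.lucene.search.Collector;
            import org.apache.lucene.search.Scorer;

            // Hypothetical score-free Collector: visits each hit and reads its
            // stored fields without ever asking the Scorer for a score.
            public class NoScoreCollector extends Collector {
              private AtomicReader currentReader;

              @Override
              public void setScorer(Scorer scorer) {
                // intentionally ignored: we never call scorer.score()
              }

              @Override
              public void collect(int doc) throws IOException {
                Document d = currentReader.document(doc); // doc is segment-local here
                // write d's stored fields to the SequenceFile
              }

              @Override
              public void setNextReader(AtomicReaderContext context) {
                currentReader = context.reader();
              }

              @Override
              public boolean acceptsDocsOutOfOrder() {
                return true; // hit order doesn't matter when only dumping stored fields
              }
            }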

        Grant Ingersoll added a comment -

        I'll take care of the pom -> 3.5 issue.

        Lance Norskog added a comment -

        Why the need to get the scorer, etc.? I wonder if it would be more efficient to just have a simple Collector that did the work and we skipped scoring, etc.

        This allows subsampling by document relevance. Mahout is woefully deficient in sampling tools. This mode should be an option.

        Jake Mannix added a comment -

        A better name for all of this is probably LuceneStorageTo... as it implies that the fields must have storage. I could see us having another implementation that works on the posting list itself

        Let's keep the name the same, and at some point I'll get around to scratching that particular itch - I've long wanted a nice map-reduce job which "uninverted" the index into bag-of-words vectors. Everyone writes "let's build an inverted index with map-reduce". Nobody writes the uninversion step!
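
        For the record, a bare-bones sketch of that uninversion step against the Lucene 4.x API and Mahout's math vectors (field name, term indexing, and raw-frequency weighting are simplifying assumptions):

            import java.io.IOException;

            import org.apache.lucene.index.DocsEnum;
            import org.apache.lucene.index.IndexReader;
            import org.apache.lucene.index.MultiFields;
            import org.apache.lucene.index.TermsEnum;
            import org.apache.lucene.search.DocIdSetIterator;
            import org.apache.lucene.util.BytesRef;
            import org.apache.mahout.math.RandomAccessSparseVector;
            import org.apache.mahout.math.Vector;

            // Walk the postings of one field and rebuild a raw term-frequency
            // vector per document. A real job would also need a stable term
            // dictionary shared across mappers.
            public class Uninverter {
              public static Vector[] uninvert(IndexReader reader, String field, int numTerms)
                  throws IOException {
                Vector[] docVectors = new Vector[reader.maxDoc()];
                for (int i = 0; i < docVectors.length; i++) {
                  docVectors[i] = new RandomAccessSparseVector(numTerms);
                }
                TermsEnum termsEnum = MultiFields.getTerms(reader, field).iterator(null);
                int termIndex = 0;
                for (BytesRef term = termsEnum.next(); term != null; term = termsEnum.next()) {
                  DocsEnum docs = termsEnum.docs(MultiFields.getLiveDocs(reader), null);
                  for (int doc = docs.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = docs.nextDoc()) {
                    docVectors[doc].setQuick(termIndex, docs.freq());
                  }
                  termIndex++;
                }
                return docVectors;
              }
            }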

        Grant Ingersoll added a comment -

        I've got that need soon, too, Jake. So, it will likely hit at some point.

        Frank Scholten added a comment -

        Good feedback guys.

        My current priority is a MapReduce implementation that works on a single FileSystemDirectory (from Hadoop contrib/index).

        Just added new code for this: https://github.com/frankscholten/mahout/commit/595484c0661ad7e373bbf24519f8061b9051d58b

        My previous commit had a bug: all Mappers worked on the entire input because I still used an IndexReader instead of a SegmentReader. Added unit tests for this and it works. However, once I made the fix I had trouble starting a Hadoop / Mahout cluster with Whirr, so I didn't run it on an actual cluster. Will try again soon and report back.

        When this all works I will fix the field / extraFields things, change the options parsing, and address the other things you mentioned.

        Then I can look at multiple indexes or shards.

        Frank Scholten added a comment -

        Whirr Hadoop cluster works again, see WHIRR-518

        Now the index is split at the segment level. Each mapper processes one segment. The downside is that input splits have different sizes and the number of map tasks equals the number of segments.

        I think this is a problem, but maybe not in a situation with many shards? If it is a problem, do you have any suggestions? Perhaps a split should be part of a segment. How should I implement this, by combining it with my earlier implementation?
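
        One possible answer to the split-within-a-segment question, as a hypothetical helper rather than anything from the patch: plan fixed-size docid ranges inside each segment so map tasks stay evenly sized regardless of segment count.

            import java.util.ArrayList;
            import java.util.List;

            // Hypothetical helper: plan [start, end) docid ranges of roughly equal
            // size inside one segment, so each range can back its own input split
            // however big the segment is.
            public class SegmentRangePlanner {
              public static List<int[]> planRanges(int segmentMaxDoc, int docsPerSplit) {
                List<int[]> ranges = new ArrayList<int[]>();
                for (int start = 0; start < segmentMaxDoc; start += docsPerSplit) {
                  ranges.add(new int[] { start, Math.min(start + docsPerSplit, segmentMaxDoc) });
                }
                return ranges;
              }
            }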

        Frank Scholten added a comment -

        Made some more changes: https://github.com/frankscholten/mahout/commit/855bc3d47a938bfe3c4cd0ca573b8f50189314fd

        1. Why the need to get the scorer, etc.? I wonder if it would be more efficient to just have a simple Collector that did the work and we skipped scoring, etc.
        2. What's the benefit of the Field/Extra Fields thing? Would it make sense to just have List<String> fields? If there is more than one, let's concat, otherwise...
        3. LuceneIndexToSequenceFilesConfiguration -> LISFConfig? Let's shorten that sucker up as the verbosity doesn't really get us anything
        4. In the Driver, please switch the input args processing to the AbstractJob model. See KMeansDriver as an example
        5. Even if we don't have a M/R job, it would be nice if we could take in, via the driver, multiple indexes. You could imagine piling all of your shards together and then converting them all.
        6. Have you tested this with numeric (trie) fields?
        7. The integration pom.xml inherits from the parent, which has Lucene defined in it, so no need to mod the integration one, I think. We should upgrade the parent one to 3.5.0.
        Frank Scholten added a comment -

        Added tests for numeric fields and multiple indices:

        https://github.com/frankscholten/mahout/commit/f0eb3a08ab763131c55bbfa8faf73e772bfac4bd
        https://github.com/frankscholten/mahout/commit/6825c57e1b000b74da69f4e345ef8f2bcdcb5918

        Should I refactor the configuration bean to LuceneStorageConfiguration?

        Lance Norskog added a comment -

        Can the configuration object also store information about saving to Lucene indexes? It would be nice to have that info in one place.

        Frank Scholten added a comment -

        Saving to Lucene indexes is a different use case. I suggest making a separate ticket for that when this one is done. Later on we can probably refactor the configuration so it can be used both ways.

        Frank Scholten added a comment -

        Renamed config to LuceneStorageConfig and simplified serialization. Added AbstractLuceneStorageTest with helper methods for indexing documents.

        https://github.com/frankscholten/mahout/commit/41ee459d075c3aa2e10eab4eb5580cbf505fcbf6

        Does anyone know of a large index I can use for testing? Wikipedia is not that big; the sequential lucene2seq version takes only 3.5 minutes on my machine to convert it into a sequence file.

        Grant Ingersoll added a comment -

        Frank, can you put up a patch, please? That way we know it's donated, etc.

        Frank Scholten added a comment -

        Added latest patch in sync with trunk

        Frank Scholten added a comment -

        Added bugfix for when using a full directory name as index path.

        Frank Scholten added a comment -

        Added version to lucene-queries dependency.

        Lance Norskog added a comment -

        Would the bugfix also apply over HDFS or S3?

        Frank Scholten added a comment -

        This bugfix is for the sequential version.

        Frank Scholten added a comment -

        Patch including recent bugfixes

        Frank Scholten added a comment -

        Grant: do you have some time to review this patch?

        Grant Ingersoll added a comment -

        I'll try to get to this patch this week.

        Grant Ingersoll added a comment -

        Frank, any reason this patch touches files like MeanShiftCanopy, etc.?

        Grant Ingersoll added a comment -

        Looks like they are all formatting issues. Fixing.

        Grant Ingersoll added a comment -

        Removes all the re-formatting issues. More coming shortly

        Grant Ingersoll added a comment -

        This needs to be brought up to Lucene 4. (We should also update to Lucene 4.3)

        Grant Ingersoll added a comment -

        Progress on bringing up to Lucene 4.3. Still needs work since dealing with Segments has changed.

        Grant Ingersoll added a comment -

        The main code almost compiles; waiting for an answer on LUCENE-4055 about how to handle the name filtering stuff.

        Haven't looked at tests yet.

        Also, haven't looked at whether this is the right thing to do semantically in the M/R code just yet. Segment per mapper is interesting, but wondering about the implications of that.

        Grant Ingersoll added a comment -

        Michael McCandless, Robert Muir, Uwe Schindler – Would love it if one of you core Lucene guys could give this a review as I'm upgrading Frank's 3.x Lucene code to 4.x and am unsure on whether this is the best approach for dealing w/ a Lucene index as an input to Hadoop for then converting to Mahout vectors. The current approach uses a Segment per mapper.

        David Arthur You should also take a look at this based on creating an index directly from the term dictionary, etc.

        Grant Ingersoll added a comment -

        fixed a few more compile issues

        Grant Ingersoll added a comment -

        Reworked some of the collector stuff for the sequential case. Tests pass, but haven't reviewed the thoroughness of the tests yet. Still needs another run through and review of the M/R code, as I haven't looked at that in depth yet.

        All that being said, this is getting really close.

        Grant Ingersoll added a comment -

        I think this is ready to go. Some other eyeballs would be appreciated.

        Changes from last patch:

        1. Changelog addition
        2. Cleaned up and standardized a lot of the tests
        3. Added tests for multiple commit points and multiple directories
        4. Cleaned up and simplified a number of areas
        5. Added license headers where missing
        6. The sequential and M/R version are now consistent in their handling of empty id fields and values
        7. Added some counters to the M/R job
        Grant Ingersoll added a comment -

        Went ahead and committed, as I believe it is functional. Extra eyeballs to review would be good.

        Hudson added a comment -

        Integrated in Mahout-Quality #2043 (See https://builds.apache.org/job/Mahout-Quality/2043/)
        MAHOUT-944: progress up to main compiling except for the file name filter. haven't run tests (Revision 1490329)

        Result = FAILURE
        gsingers :
        Files :

        • /mahout/trunk/integration/pom.xml
        • /mahout/trunk/integration/src/main/java/org/apache/mahout/text/LuceneIndexFileNameFilter.java
        • /mahout/trunk/integration/src/main/java/org/apache/mahout/text/LuceneSegmentInputFormat.java
        • /mahout/trunk/integration/src/main/java/org/apache/mahout/text/LuceneSegmentInputSplit.java
        • /mahout/trunk/integration/src/main/java/org/apache/mahout/text/LuceneSegmentRecordReader.java
        • /mahout/trunk/integration/src/main/java/org/apache/mahout/text/LuceneStorageConfiguration.java
        • /mahout/trunk/integration/src/main/java/org/apache/mahout/text/ReadOnlyFileSystemDirectory.java
        • /mahout/trunk/integration/src/main/java/org/apache/mahout/text/SequenceFilesFromLuceneStorage.java
        • /mahout/trunk/integration/src/main/java/org/apache/mahout/text/SequenceFilesFromLuceneStorageDriver.java
        • /mahout/trunk/integration/src/main/java/org/apache/mahout/text/SequenceFilesFromLuceneStorageMRJob.java
        • /mahout/trunk/integration/src/main/java/org/apache/mahout/text/SequenceFilesFromLuceneStorageMapper.java
        • /mahout/trunk/integration/src/test/java/org/apache/mahout/text/AbstractLuceneStorageTest.java
        • /mahout/trunk/integration/src/test/java/org/apache/mahout/text/LuceneSegmentInputFormatTest.java
        • /mahout/trunk/integration/src/test/java/org/apache/mahout/text/LuceneSegmentInputSplitTest.java
        • /mahout/trunk/integration/src/test/java/org/apache/mahout/text/LuceneSegmentRecordReaderTest.java
        • /mahout/trunk/integration/src/test/java/org/apache/mahout/text/LuceneStorageConfigurationTest.java
        • /mahout/trunk/integration/src/test/java/org/apache/mahout/text/SequenceFilesFromLuceneStorageDriverTest.java
        • /mahout/trunk/integration/src/test/java/org/apache/mahout/text/SequenceFilesFromLuceneStorageMRJobTest.java
        • /mahout/trunk/integration/src/test/java/org/apache/mahout/text/SequenceFilesFromLuceneStorageTest.java
        • /mahout/trunk/pom.xml
        • /mahout/trunk/src/conf/driver.classes.default.props
        Suneel Marthi added a comment - edited

        Grant, the code that's been committed has references to Lucene_35 version. Please change to Lucene_42, the trunk's presently at Lucene 4.2.1.

        Skimming through the files that have been checked in for this JIRA:

        a) Use of old Lucene 3.x APIs that are no longer supported in Lucene 4.x.
        b) Unused imports
        c) missing License headers - LuceneSegmentRecordReaderTest.java, SequenceFilesFromLuceneStorageMapper.java
        d) also seeing wildcard imports (import ...*) in LuceneSegmentRecordReaderTest.java

        Grant Ingersoll added a comment -

        uh oh. Should have been 4.3. Must have messed up Git. WTF. The whole thing is messed up.

        Suneel Marthi added a comment -

        You did update pom.xml to Lucene 4.3, but there are references to Version.LUCENE_42 in other files, all of which now show up as deprecated.

        A future enhancement would be to make the Lucene version configurable and avoid these frequent Version updates in code.
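
        A sketch of what that could look like; the configuration key below is invented for illustration, and Version.parseLeniently accepts both "4.3" and "LUCENE_43" style strings:

            import org.apache.hadoop.conf.Configuration;
            import org.apache.lucene.util.Version;

            // Sketch of a configurable Lucene Version; "mahout.lucene.version" is
            // an invented key, not an existing Mahout option.
            public class LuceneVersionConfig {
              public static Version luceneVersion(Configuration conf) {
                return Version.parseLeniently(conf.get("mahout.lucene.version", "4.3"));
              }
            }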

        Grant Ingersoll added a comment -

        Hmm, I wonder if I should have squashed my local commits:

        Committed r1490329
        W: 0a28b0f322ffe888553b9e2adf0b6f098b679f16 and refs/remotes/origin/trunk differ, using rebase:
        :040000 040000 779e2a48da78d2f59f994c83eb1cb91a42b04d41 6e8221954eecd7ee27788976dc7b2665985cd7e6 M integration
        :100644 100644 492aa3aacbee4e33fb70a2e361d772a9d881ae04 09c5ae712a035af3eef2c3c56db708b8fa75e1b3 M pom.xml
        :040000 040000 39350289431946a74a7bd15fbf72947261055536 c7274b40f5de032b1668ed9d6f2d1fa24ff0a124 M src
        Current branch MAHOUT-944 is up to date.

        # of revisions changed
          before:
          d668ddf606dbb0d046f0fe8e3eb97e06fcd4c406
          9eafd07120a1810d778dfeb4502ba36b5b3eacfe
          253a58c30d0a22150234975f782720248b51a8cb

        after:
        0a28b0f322ffe888553b9e2adf0b6f098b679f16
        d668ddf606dbb0d046f0fe8e3eb97e06fcd4c406
        9eafd07120a1810d778dfeb4502ba36b5b3eacfe
        253a58c30d0a22150234975f782720248b51a8cb
        If you are attempting to commit merges, try running:
        git rebase --interactive --preserve-merges refs/remotes/origin/trunk
        Before dcommitting

        Grant Ingersoll added a comment -

        Here's the diff to trunk at the moment compared with what I have committed on my local branch. Either dcommit hasn't finished applying all the commits or it broke.

        Grant Ingersoll added a comment -

        That patch should apply from trunk, but I'm curious now to know what happened, so I want to give it a bit.

        Suneel Marthi added a comment -

        Grant, the latest commit to trunk is much better, but we are still missing LuceneSeqFileHelper.java.

        Also, now that we have upgraded to Lucene 4.3 there are a bunch of places still referring to Version.LUCENE_42 that now show up as deprecated; those would need to be modified too. I can open a separate JIRA for that and commit a fix after we get past the issues with this one.

        Hudson added a comment -

        Integrated in Mahout-Quality #2044 (See https://builds.apache.org/job/Mahout-Quality/2044/)
        MAHOUT-944: fix the things that should have been committed the first time (Revision 1490457)
        MAHOUT-944: progress up to main compiling except for the file name filter. haven't run tests - removed duplicate Lucene 4.3 detection, wondering if its even required here given that trunk/pom.xml already has it. (Revision 1490453)

        Result = FAILURE
        gsingers :
        Files :

        • /mahout/trunk/CHANGELOG
        • /mahout/trunk/integration/src/main/java/org/apache/mahout/text/LuceneIndexFileNameFilter.java
        • /mahout/trunk/integration/src/main/java/org/apache/mahout/text/LuceneSegmentInputFormat.java
        • /mahout/trunk/integration/src/main/java/org/apache/mahout/text/LuceneSegmentRecordReader.java
        • /mahout/trunk/integration/src/main/java/org/apache/mahout/text/LuceneStorageConfiguration.java
        • /mahout/trunk/integration/src/main/java/org/apache/mahout/text/ReadOnlyFileSystemDirectory.java
        • /mahout/trunk/integration/src/main/java/org/apache/mahout/text/SequenceFilesFromLuceneStorage.java
        • /mahout/trunk/integration/src/main/java/org/apache/mahout/text/SequenceFilesFromLuceneStorageDriver.java
        • /mahout/trunk/integration/src/main/java/org/apache/mahout/text/SequenceFilesFromLuceneStorageMRJob.java
        • /mahout/trunk/integration/src/main/java/org/apache/mahout/text/SequenceFilesFromLuceneStorageMapper.java
        • /mahout/trunk/integration/src/test/java/org/apache/mahout/text/AbstractLuceneStorageTest.java
        • /mahout/trunk/integration/src/test/java/org/apache/mahout/text/LuceneSegmentInputFormatTest.java
        • /mahout/trunk/integration/src/test/java/org/apache/mahout/text/LuceneSegmentInputSplitTest.java
        • /mahout/trunk/integration/src/test/java/org/apache/mahout/text/LuceneSegmentRecordReaderTest.java
        • /mahout/trunk/integration/src/test/java/org/apache/mahout/text/SequenceFilesFromLuceneStorageDriverTest.java
        • /mahout/trunk/integration/src/test/java/org/apache/mahout/text/SequenceFilesFromLuceneStorageMRJobTest.java
        • /mahout/trunk/integration/src/test/java/org/apache/mahout/text/SequenceFilesFromLuceneStorageTest.java

        smarthi :
        Files :

        • /mahout/trunk/integration/pom.xml
        Grant Ingersoll added a comment -

        Added LuceneSeqFileHelper. Need to switch back to a pure SVN workflow, I guess, as I seem to be getting the git one wrong.

        As for the Version thing, I will try to get to it today.

        Suneel Marthi added a comment -

        I'll take care of the Version thing, have a JIRA M-1244 open for that.

        Suneel Marthi added a comment - edited

        Grant, we seem to be missing a LuceneIndexToSequenceFilesDriver.java

        
        ./bin/mahout lucene2seq
        
        WARNING: Unable to add class: org.apache.mahout.text.LuceneIndexToSequenceFilesDriver
        java.lang.ClassNotFoundException: org.apache.mahout.text.LuceneIndexToSequenceFilesDriver
        	at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
        	at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
        	at java.security.AccessController.doPrivileged(Native Method)
        	at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
        	at java.lang.ClassLoader.loadClass(ClassLoader.java:423)
        	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
        	at java.lang.ClassLoader.loadClass(ClassLoader.java:356)
        	at java.lang.Class.forName0(Native Method)
        	at java.lang.Class.forName(Class.java:188)
        	at org.apache.mahout.driver.MahoutDriver.addClass(MahoutDriver.java:237)
        	at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:119)
        
        

        or should this actually be a call to org.apache.mahout.text.SequenceFilesFromLuceneStorageDriver?
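
        For context, the alias-to-class mapping lives in src/conf/driver.classes.default.props, so the fix is presumably a one-line entry along these lines (the description text is illustrative):

            # hypothetical entry for src/conf/driver.classes.default.props
            org.apache.mahout.text.SequenceFilesFromLuceneStorageDriver = lucene2seq : Generate sequence files from a Lucene index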

        Grant Ingersoll added a comment -

        Saw that. Fixing. Not a show stopper, but needs to be fixed.

        Hudson added a comment -

        Integrated in Mahout-Quality #2054 (See https://builds.apache.org/job/Mahout-Quality/2054/)
        MAHOUT-944: fix test (Revision 1490794)
        MAHOUT-958: fix use with globs, MAHOUT-944: minor tweak to driver.classes (Revision 1490793)

        Result = FAILURE
        gsingers :
        Files :

        • /mahout/trunk/integration/src/main/java/org/apache/mahout/text/LuceneSegmentRecordReader.java
        • /mahout/trunk/integration/src/test/java/org/apache/mahout/text/SequenceFilesFromLuceneStorageDriverTest.java

        gsingers :
        Files :

        • /mahout/trunk/CHANGELOG
        • /mahout/trunk/integration/src/main/java/org/apache/mahout/clustering/evaluation/RepresentativePointsDriver.java
        • /mahout/trunk/src/conf/driver.classes.default.props
        Suneel Marthi added a comment - edited

        See this error when running SequenceFilesFromLuceneStorageMRJobTest (from Mahout-Quality build-2076):

        see https://builds.apache.org/job/Mahout-Quality/2076

        
        java.lang.ClassCastException: org.apache.mahout.text.LuceneSegmentInputSplit cannot be cast to org.apache.hadoop.mapred.InputSplit
        	at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:412)
        	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
        	at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:214)
        
        

        Seems like the issue is that the old MR InputSplit is being referenced somewhere in the code; I have not looked deeply into it yet.

        Hudson added a comment -

        Integrated in Mahout-Quality #2077 (See https://builds.apache.org/job/Mahout-Quality/2077/)
        MAHOUT-944: lucene2seq - code cleanup (Revision 1492450)

        Result = SUCCESS
        smarthi :
        Files :

        • /mahout/trunk/integration/src/test/java/org/apache/mahout/text/SequenceFilesFromLuceneStorageMRJobTest.java
        Grant Ingersoll added a comment -

        Suneel, weird. I didn't see that before. We are using the new APIs, AFAICT, so not sure what is going on. So tired of the stupidity of the dual Map/Reduce APIs in Hadoop.

        Grant Ingersoll added a comment -

        Suneel Marthi, the error only seems to happen when running all the tests and it seems to be intermittent. It almost looks like some type of classpath issue.

        Suneel Marthi added a comment -

        Yes, it is very intermittent; the very next build was successful. Still wondering how the cast to the old M/R API could happen.

        Suneel Marthi added a comment -

        This error was seen consistently today in successive Jenkins builds.

        INFO:  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@dac21
        Jun 23, 2013 11:10:04 PM org.apache.hadoop.mapred.LocalJobRunner$Job run
        WARNING: job_local_0001
        java.lang.ClassCastException: org.apache.mahout.text.LuceneSegmentInputSplit cannot be cast to org.apache.hadoop.mapred.InputSplit
        	at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:412)
        	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
        	at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:214)
        

        Not sure where this is coming from; the code uses the new M/R APIs AFAIK. How/who invokes the old mapper path in a MapReduce job?
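
        For anyone chasing this later: runOldMapper() is only reached when a job is driven through the old org.apache.hadoop.mapred path, so a split written against the new API cannot be cast there. A minimal sketch of the new-API wiring, assuming Hadoop 1.x and the committed class names:

            import org.apache.hadoop.conf.Configuration;
            import org.apache.hadoop.mapreduce.Job;
            import org.apache.mahout.text.LuceneSegmentInputFormat;

            // LuceneSegmentInputSplit extends mapreduce.InputSplit, so it only hits
            // runOldMapper()'s cast to mapred.InputSplit if something submits the
            // job through the old mapred.JobClient path instead of a job like this.
            public class NewApiJobSetup {
              public static Job configure(Configuration conf) throws Exception {
                Job job = new Job(conf, "lucene2seq"); // new (mapreduce) API job
                job.setInputFormatClass(LuceneSegmentInputFormat.class);
                return job;
              }
            }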

        Hai To added a comment - edited

        Is it intended that wildcard queries are not supported?

        13/10/23 15:05:46 INFO mapred.JobClient: Task Id : attempt_201310210841_18260_m_000004_2, Status : FAILED
        java.lang.UnsupportedOperationException: Query lang:de* does not implement createWeight
        	at org.apache.lucene.search.Query.createWeight(Query.java:80)
        	at org.apache.mahout.text.LuceneSegmentRecordReader.initialize(LuceneSegmentRecordReader.java:60)
        	at org.apache.mahout.text.LuceneSegmentInputFormat.createRecordReader(LuceneSegmentInputFormat.java:76)
        	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:644)
        	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
        	at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
        	at java.security.AccessController.doPrivileged(Native Method)
        	at javax.security.auth.Subject.doAs(Subject.java:396)
        	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
        	at org.apache.hadoop.mapred.Child.main(Child.java:262)
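
        Probably not intended: multi-term queries (wildcard, prefix, fuzzy) throw exactly this until they are rewritten against a reader. A minimal sketch of the rewrite step, assuming it would be applied before createWeight():

            import java.io.IOException;

            import org.apache.lucene.index.IndexReader;
            import org.apache.lucene.search.Query;

            // Multi-term queries don't implement createWeight() until rewritten;
            // e.g. lang:de* rewrites to a primitive (constant-score) query.
            public class QueryRewrite {
              public static Query rewriteForWeight(Query query, IndexReader reader) throws IOException {
                return query.rewrite(reader);
              }
            }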
        

          People

          • Assignee: Grant Ingersoll
          • Reporter: Frank Scholten
          • Votes: 2
          • Watchers: 7
