Mahout / MAHOUT-459

Reading an Index from Lucene/Solr 4.0-dev

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Later
    • Affects Version/s: 0.4
    • Fix Version/s: None
    • Component/s: Integration
    • Labels:
      None
    • Environment:

      Windows Server 2008 R2 Standard, Cygwin, Solr-trunk, Mahout-trunk

      Description

      Indexes created by Lucene/Solr 4.0-dev (the trunk development version) cannot be read with the Lucene libraries bundled with Mahout-dev. Swapping in the Lucene/Solr 4.0-dev libraries does not help on its own, because their API changes prevent Mahout from compiling.

      By adapting mahout-utils to fit Lucene/Solr 4.0-dev's API changes, it is possible to read its index.

        Activity

        Sean Owen made changes -
        Status Resolved [ 5 ] Closed [ 6 ]
        Sean Owen made changes -
        Resolution Later [ 7 ]
        Status Open [ 1 ] Resolved [ 5 ]
        Sean Owen added a comment -

        Looks like this has timed out.

        Sean Owen made changes -
        Assignee Ted Dunning [ tdunning ]
        Fix Version/s 0.5 [ 12315255 ]
        Sean Owen added a comment -

        (I'm beginning to look towards an 0.5 release.)
        Am I right that this is still probably not happening or relevant within the next 1-2 months?

        Ted Dunning made changes -
        Fix Version/s 0.5 [ 12315255 ]
        Fix Version/s 0.4 [ 12314396 ]
        Ted Dunning added a comment -

        Totally agree. Another thing that makes me interested in Lucene 4.0 is that Grant mentioned that many of the tokenizers will be byte-oriented by then. That is really interesting because in my tests, using a byte-oriented state machine for parsing CSV data can be nearly an order of magnitude faster than using strings. This result is a combination of avoiding string conversions, avoiding computing string hashes, avoiding allocations and generally moving less data. Also, I can do more by reference on a single line and can build special-purpose, bespoke numerical converters. These changes all have synergistic effects, which make them work even better.

        Overall, I will be very interested in seeing what Lucene 4.0 brings. But that will be post Mahout 0.4, it seems.
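
To illustrate the byte-oriented idea described in this comment, here is a minimal sketch. It is not the code used in the tests mentioned above, and the class and method names (`ByteCsvSketch`, `parseLongs`) are hypothetical. It assumes ASCII input whose fields are plain decimal integers, and it reads them straight from a byte array into a caller-supplied `long[]`, so no `String` objects are created, hashed, or copied along the way.

```java
import java.nio.charset.StandardCharsets;

/** Illustrative only: a two-state, byte-oriented parser for integer CSV fields. */
final class ByteCsvSketch {
    private static final int FIELD_START = 0; // nothing consumed yet for this field
    private static final int IN_NUMBER = 1;   // a sign or at least one digit consumed

    /**
     * Parses comma-separated decimal integers from buf[start, end) into out and
     * returns the number of fields. Stops at the first '\n' or '\r'. An empty
     * field (or a lone '-') is stored as 0. The caller must size out for the line.
     */
    static int parseLongs(byte[] buf, int start, int end, long[] out) {
        int state = FIELD_START;
        int count = 0;
        long value = 0;
        boolean negative = false;
        for (int i = start; i < end; i++) {
            byte b = buf[i];
            if (b >= '0' && b <= '9') {
                value = value * 10 + (b - '0'); // bespoke numeric conversion, no String
                state = IN_NUMBER;
            } else if (b == '-' && state == FIELD_START) {
                negative = true;
                state = IN_NUMBER;
            } else if (b == ',') {
                out[count++] = negative ? -value : value;
                value = 0;
                negative = false;
                state = FIELD_START;
            } else if (b == '\n' || b == '\r') {
                break; // end of line
            } else {
                throw new IllegalArgumentException("unexpected byte at offset " + i);
            }
        }
        out[count++] = negative ? -value : value; // last field has no trailing comma
        return count;
    }

    public static void main(String[] args) {
        byte[] line = "12,-7,,300\n".getBytes(StandardCharsets.US_ASCII);
        long[] fields = new long[8]; // reused across lines in a real loop
        int n = parseLongs(line, 0, line.length, fields);
        for (int i = 0; i < n; i++) {
            System.out.println(fields[i]);
        }
    }
}
```

Reusing the same output array for every line is what keeps allocations out of the inner loop. Quoted fields, decimals, and overflow checks are deliberately left out of this sketch.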

        Drew Farris added a comment -

        I suspect Lucene 4.x releases are a bit far off at this point, but would appreciate being corrected. I'm also curious whether Lucene-4.0-dev jars could be used to read indexes created with Lucene 3.x.

        My gut would be to punt on this for 0.4, but the patch may be handy when we decide to integrate with the 4.0 API.

        Ted Dunning added a comment -

        Stephen, you there?

        What is the status of this?

        Ted Dunning added a comment -

        What about iterating by document number? Would that give you all documents?

        Secondly, I doubt that we can reasonably add a dependency on Lucene 4.0 before it is released. Does anybody know the release schedule?
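
As a rough illustration of the "iterate by document number" suggestion in this comment, the sketch below walks every document number from 0 up to the reader's maximum and skips deleted slots. The `DocSource` interface is a hypothetical stand-in, not a Lucene type. Lucene 3.x's `IndexReader` offers `maxDoc()`, `isDeleted(int)` and `document(int)` methods that play the same roles, while the 4.0-dev API handles deletions differently, so treat the names here as assumptions.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: collect every live document by scanning document numbers. */
final class DocNumberScanSketch {

    /** Hypothetical stand-in for an index reader; NOT part of Lucene. */
    interface DocSource {
        int maxDoc();                 // one greater than the largest document number
        boolean isDeleted(int docId); // true if the slot holds a deleted document
        String document(int docId);   // stored content for a live document
    }

    /** Returns all non-deleted documents, in document-number order. */
    static List<String> allLiveDocuments(DocSource source) {
        List<String> docs = new ArrayList<>();
        for (int docId = 0; docId < source.maxDoc(); docId++) {
            if (!source.isDeleted(docId)) {
                docs.add(source.document(docId));
            }
        }
        return docs;
    }
}
```

Unlike a term query, this scan does not depend on any particular word appearing in a document, so nothing is filtered out except deletions.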

        Stephen McGill added a comment -

        One important fix that is needed right now is the ability to grab all documents from Lucene. Line 171 of the diff reads:

        + String a = new String("press");

        which grabs all documents with the word "press" - not the intended goal. I have some commented code that might fix this, but I am unable to try it today.

        Also, I do not think this is included in this posted diff, but the DefaultAnalyzer class should be deleted.

        Stephen McGill made changes -
        Field Original Value New Value
        Attachment Mahout-Importing-Vectors-Lucene-Solr-4-dev.diff [ 12451462 ]
        Stephen McGill added a comment -

        I have attached my preliminary patch for this issue. The Solr/Lucene-4-dev jars are not available from the Maven repository, however, so I added them to my local .m2 directory.

        Stephen McGill created issue -

          People

          • Assignee:
            Ted Dunning
            Reporter:
            Stephen McGill
          • Votes:
            0
            Watchers:
            0
