MAHOUT-459: Reading an Index from Lucene/Solr 4.0-dev

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Later
    • Affects Version/s: 0.4
    • Fix Version/s: None
    • Component/s: Integration
    • Labels:
      None
    • Environment:

      Windows Server 2008 R2 Standard, Cygwin, Solr-trunk, Mahout-trunk

      Description

      Indexes created by Lucene/Solr 4.0-dev (the trunk development version) cannot be read with the Lucene libraries that are included with Mahout-dev. Swapping in the Lucene/Solr 4.0-dev libraries does not help by itself, because their API changes prevent Mahout from compiling.

      By adapting mahout-utils to fit Lucene/Solr 4.0-dev's API changes, it is possible to read such an index.

        Activity

        Stephen McGill added a comment -

        I have attached my preliminary patch for this issue. The Solr/Lucene-4-dev jars are not available from the Maven repository, however, so I added them to my local .m2 directory.

        Stephen McGill added a comment -

        One important fix that is needed right now is the ability to grab all documents from Lucene. Line 171 of the diff reads:

        + String a = new String("press");

        which grabs all documents with the word "press" - not the intended goal. I have some commented code that might fix this, but I am unable to try it today.

        Also, I do not think this is included in this posted diff, but the DefaultAnalyzer class should be deleted.

        Ted Dunning added a comment -

        What about iterating by document number? Would that give you all documents?

        Secondly, I doubt that we can reasonably add a dependency on Lucene 4.0 before it is released. Does anybody know the release schedule?

        Ted Dunning added a comment -

        Stephen, you there?

        What is the status of this?

        Drew Farris added a comment -

        I suspect Lucene 4.x releases are a bit far off at this point, but would appreciate being corrected. I'm also curious whether Lucene-4.0-dev jars could be used to read indexes created with Lucene 3.x.

        My gut would be to punt on this for 0.4, but the patch may be handy when we decide to integrate with the 4.0 API.

        Ted Dunning added a comment -

        Totally agree. Another thing that makes me interested in Lucene 4.0 is that Grant mentioned that many of the tokenizers will be byte-oriented by then. That is really interesting because in my tests, using a byte-oriented state machine for parsing CSV data can be nearly an order of magnitude faster than using strings. This result is a combination of avoiding string conversions, avoiding computing string hashes, avoiding allocations and generally moving less data. Also, I can do more by reference on a single line and can build special-purpose bespoke numerical converters. These changes all have synergistic effects, which make them work even better.

        Overall, I will be very interested in seeing what Lucene 4.0 brings. But that will be post Mahout 0.4, it seems.
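
A minimal sketch of the byte-oriented approach described in the comment above. It is not the code used in those tests, and it does not use any Lucene or Mahout API. The class name `ByteCsvSketch` and the method `parseLongs` are hypothetical, and the input is assumed to be a single line of ASCII decimal integers separated by commas, with no quoting or whitespace. A three-state machine walks the bytes once and converts each number in place, so no `String` is created per field.

```java
// Hypothetical illustration only: names are not taken from Mahout or Lucene.
final class ByteCsvSketch {

    // States of the per-line state machine.
    private static final int FIELD_START = 0; // nothing read yet for the current field
    private static final int AFTER_SIGN = 1;  // a leading '-' was read, digits must follow
    private static final int IN_DIGITS = 2;   // at least one digit was read

    private ByteCsvSketch() {
    }

    /**
     * Parses comma-separated decimal integers from line[0..length) into out.
     * Returns the number of fields written. Overflow of long is not checked,
     * and out must be large enough to hold every field.
     */
    static int parseLongs(byte[] line, int length, long[] out) {
        if (length == 0) {
            return 0; // an empty line has no fields
        }
        int state = FIELD_START;
        int fields = 0;
        long value = 0;
        boolean negative = false;

        for (int i = 0; i < length; i++) {
            byte b = line[i];
            if (b >= '0' && b <= '9') {
                // Accumulate the digit directly; no substring or Long.parseLong call.
                value = value * 10 + (b - '0');
                state = IN_DIGITS;
            } else if (b == '-' && state == FIELD_START) {
                negative = true;
                state = AFTER_SIGN;
            } else if (b == ',' && state == IN_DIGITS) {
                // End of a field: store it and reset for the next one.
                out[fields++] = negative ? -value : value;
                value = 0;
                negative = false;
                state = FIELD_START;
            } else {
                throw new IllegalArgumentException(
                    "unexpected byte " + b + " at offset " + i);
            }
        }
        if (state != IN_DIGITS) {
            throw new IllegalArgumentException("line ends inside an empty field");
        }
        out[fields++] = negative ? -value : value;
        return fields;
    }
}
```

A caller would reuse one `long[]` buffer across lines, for example `int n = ByteCsvSketch.parseLongs(bytes, bytes.length, buffer);`, which keeps the per-line work free of allocations. Whether this yields the speedup quoted above depends on the data and the JVM; the sketch only shows the shape of the technique.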

        Sean Owen added a comment -

        (I'm beginning to look towards an 0.5 release.)
        Am I right that this is still probably not happening or relevant within the next 1-2 months?

        Sean Owen added a comment -

        Looks like this has timed out.


          People

          • Assignee:
            Ted Dunning
            Reporter:
            Stephen McGill
          • Votes:
            0
            Watchers:
            0